Friday, June 30, 2006

Semantic Web in a nutshell

A lot of people have been asking me about the Semantic Web recently. The first question is usually something like 'is the Semantic Web going to cause a security nightmare as information that is currently inaccessible becomes visible?' The next question is 'what is the Semantic Web anyway?'

Descriptions of the Semantic Web tend either to say so little that the point is lost ('the Web of Data') or to dive quickly into the details, as Tim's own roadmap does.

We have a Web of data today. The problem is that we cannot use most of it in interesting ways. The Semantic Web is about building a Web of knowledge. Knowledge is information we can use.

Most data that is on the Web today is designed to be read by humans. That is fine if the user already knows what they are trying to find. It works much less well when you are trying to find information through a search engine.

Consider the problem of finding a specific part for a refrigerator. If you type in "refrigerator parts" as a search term you will find a lot of places that sell refrigerators but far more pages that you don't want: refrigerators appearing in films, people blogging about buying them, cleaning them, repairing them. What you really want is a way to restrict the search so that it only returns sites that actually have refrigerator parts for sale. You probably want to go a stage further and restrict the search to distributors of parts, rather than shops that mostly sell refrigerators but will happily charge you $66.50 plus sales tax for a valve that costs $1 to make and should cost no more than $5 to buy (I do not exaggerate).

Yahoo started as an attempt to create a taxonomy for the Web. The problem is that the Yahoo directory works the same way as the yellow pages, with hundreds of people working on classifying Web pages. This worked in the early days when the Web had a few million users. The results are no longer very useful or current now that the Web has a billion users.

The answer is for people to describe the information that they put on the Web in ways that allow search engines, and other tools that have not yet been invented, to find it.
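
To make the refrigerator example concrete, here is a minimal sketch in Python of the difference between matching words and matching descriptions. Everything in it is an assumption made for illustration: the `Page` record, the `kind` and `sells` fields, and the example URLs are hypothetical and do not correspond to any real Semantic Web vocabulary or search engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    text: str
    # Hypothetical machine-readable description supplied by the publisher.
    kind: str = "undescribed"
    sells: tuple = ()


PAGES = [
    Page("https://films.example/",
         "A film where the refrigerator plays one of the best parts"),
    Page("https://blog.example/",
         "Repairing my refrigerator: finding parts was the hard bit"),
    Page("https://shop.example/",
         "We sell refrigerators and some refrigerator parts",
         kind="retailer", sells=("refrigerator", "refrigerator part")),
    Page("https://parts.example/",
         "Distributor of refrigerator parts and valves",
         kind="parts-distributor", sells=("refrigerator part",)),
]


def keyword_search(pages, query):
    """Return every page whose text contains all the words in the query."""
    words = query.lower().split()
    return [p.url for p in pages if all(w in p.text.lower() for w in words)]


def described_search(pages, product, kind):
    """Return only pages that declare they sell the product and are of the given kind."""
    return [p.url for p in pages if p.kind == kind and product in p.sells]


if __name__ == "__main__":
    print(keyword_search(PAGES, "refrigerator parts"))
    print(described_search(PAGES, "refrigerator part", "parts-distributor"))
```

The keyword search returns the film page and the blog alongside the sellers, because it can only see words. The described search can exclude them, and the same descriptions would remain usable by a tool written later for some other purpose.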

The last piece is the important part. A long time ago Yuri Rabinski and I got into an argument with Alan Kay, which eventually turned into an argument between Tim Berners-Lee and Alan Kay. The day before, Alan had criticized the declarative nature of HTML, asserting that the procedural model of PostScript was superior because the data describes the way that it should be used.

The example is a good one because it proves exactly the point each side was making. PostScript is a superior language to HTML if all you want to do is produce a printed version of the document. HTML supports a much larger range of uses. You can print it, show it on a graphics display or on a character cell terminal, send it to a PDA or convert it into speech. You can take the text, edit it and produce something new.

Which of these is 'better' depends on your point of view. If you are trying to produce the best technology to meet the needs you anticipate, the PostScript approach is best. If you want to prevent dangerous unauthorized uses of your technology, then the PostScript approach is a lot better. If you want to allow the information to be used in the widest possible variety of ways, then the HTML approach is best.
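
The HTML side of that argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DOCUMENT` structure and both renderers are invented for the example and are far simpler than real HTML. The point is that a declarative document says what each piece is, so different consumers can each decide what to do with it.

```python
import textwrap

# A declarative document: each item says what it is, not how to draw it.
DOCUMENT = [
    ("heading", "Refrigerator parts"),
    ("paragraph", "We stock valves, seals and shelves for most models."),
]


def render_terminal(document, width=40):
    """Render for a character cell terminal: underlined headings, wrapped text."""
    lines = []
    for kind, text in document:
        if kind == "heading":
            lines.append(text.upper())
            lines.append("=" * len(text))
        else:
            lines.extend(textwrap.wrap(text, width))
    return "\n".join(lines)


def render_speech(document):
    """Render as a single string to hand to a text-to-speech tool."""
    parts = []
    for kind, text in document:
        prefix = "Heading: " if kind == "heading" else ""
        parts.append(prefix + text)
    return " ".join(parts)


if __name__ == "__main__":
    print(render_terminal(DOCUMENT))
    print(render_speech(DOCUMENT))
```

A procedural description of the same page would be a list of drawing instructions for one output device. It would print well, but neither of these two renderers could be written against it without first guessing which marks on the page were the heading.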

The Semantic Web is about presenting data in ways that allow machines to do the type of work people do today. A lot of Web sites have addresses on them for hotels, businesses and so on. Some of the more thoughtful businesses add links to an online mapping site so it is easy to get directions. In the Semantic Web approach the address is tagged so that the Web browser can recognize that this sequence of data is an address and bring up the mapping tool of the user's choice.

The point here is not just a question of whether you want to choose between using Google Maps or Mapquest to find your directions. Both will do a fine job. Bringing up the tool of your own choice means that it can do things that the creator of the Web site could never anticipate. For example, it could take those coordinates and send them to the GPS mapping unit in your car so that the destination is already programmed in when you start driving.
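
Here is a minimal sketch of that idea in Python. It assumes the page marks up its address with class names in the style of the adr microformat (`street-address`, `locality`, `region`, `postal-code`). That choice of markup is an assumption of this example, not something the Semantic Web requires, and the two handler functions are hypothetical stand-ins for whatever tools the user actually prefers.

```python
from html.parser import HTMLParser

# Class names assumed for this sketch (adr-microformat style).
ADDRESS_FIELDS = ("street-address", "locality", "region", "postal-code")


class AddressExtractor(HTMLParser):
    """Collect the text of elements whose class names mark address fields."""

    def __init__(self):
        super().__init__()
        self.address = {}
        self._field = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in ADDRESS_FIELDS:
                self._field = name

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        if self._field and data.strip():
            self.address[self._field] = data.strip()


def extract_address(html):
    parser = AddressExtractor()
    parser.feed(html)
    parser.close()
    return parser.address


def as_map_query(address):
    """Hypothetical handler: build a query string for a mapping site."""
    return ", ".join(address[f] for f in ADDRESS_FIELDS if f in address)


def as_gps_destination(address):
    """Hypothetical handler: build a record to send to an in-car unit."""
    return {"type": "destination", "label": as_map_query(address)}


def open_address(html, handler):
    """The page supplies the address; the user supplies the handler."""
    return handler(extract_address(html))


SAMPLE_HTML = (
    '<p>Visit us at <span class="adr">'
    '<span class="street-address">1 Example Street</span>, '
    '<span class="locality">Springfield</span>, '
    '<span class="region">MA</span> '
    '<span class="postal-code">01101</span></span>.</p>'
)

if __name__ == "__main__":
    print(open_address(SAMPLE_HTML, as_map_query))
    print(open_address(SAMPLE_HTML, as_gps_destination))
```

The page author wrote the markup once and never had to know about either handler. Adding a third tool means writing one more function on the user's side, with no change to the Web site.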

This naturally leads to the question we began with. Will the ability to make sense of information be used by Internet criminals?

As with all powerful technologies, the Semantic Web can be used for good or for evil. In this case, though, the balance is firmly on the side of good.

The Semantic Web is not really making more information available to the bad guys; it is merely making the information more visible. The Semantic Web will not cause people to put more credit card numbers on the Web. It may, however, make it easier to answer questions such as 'what was your grandmother's maiden name?'

This would be a real problem if security systems that depend on security through obscurity were working before the Semantic Web. The fact is, however, that they are already collapsing. We have to get rid of static passwords, static credit card numbers and the rest.

The potential benefits of the Semantic Web are much greater. It's not just the bad guys who can search for potentially compromising information. We can do that too. Information leakages will become more visible. There will be much greater incentives to avoid them.

The tools being built to support the Semantic Web look remarkably like the tools we use to track down Internet criminals and for fraud detection. The difference between the Semantic Web and what we do today is that the Semantic Web makes it possible to share that information with other people in ways their systems can understand.

Today the bad guys adopt a divide-and-conquer strategy. They design their attacks knowing where information can be readily exchanged and where it cannot. The Semantic Web gives us the tools to link our systems and unite them.
