Why the Semantic Web Has Failed
I know, up front, that I'm going to get a lot of flak for this article. Bear with me and read to the end.
I work fairly heavily in the semantic space. There are all kinds of very interesting applications for semantics, because so much of the information within enterprises tends to be far more relational and interconnected than is commonly realized.
Yet there's one thing that, in more than a decade of working with RDF, I have yet to do - find a good excuse for the semantic web.
This may sound like heresy, but my personal belief is that the semantic web has failed. Not in "just give it a few more years and it'll catch on" or "it's just a matter of tooling and editors". No, I'd argue that, as admirable as the whole goal of the semantic web is, it's just not working in reality.
There are several very good reasons for this.
- Semantics is hard to understand.
- Semantics is invisible to most people.
- Semantics is difficult to open up and read.
- Semantics does not fit into the dominant Algol/C++/Java OOP paradigm.
- Semantics is hard to add manually to content.
In commercial projects, such limitations can be overcome, because semantics does provide clear tangible benefits as well:
- Semantics can connect things within an organization.
- Semantics can make it possible to query across multiple entities.
- Semantics can significantly enhance search and NLP.
- Semantics can use subclassing to simplify notation.
- Semantics can be added to documents and data stores dynamically.
- Semantics provide powerful constraint queries.
All of these are useful, but they are a lot less useful when extended to the web itself, because of the fairly diffuse nature of information on the web. With Google or Bing or whatever similar engine you're using, I can get a pretty fair handle on which documents I want without semantics - and neither Google nor Bing relies upon user-supplied semantics to any meaningful extent. That doesn't mean they don't use semantic technology on the back end - they certainly do - but it does mean that this information generally isn't available to the web developer or application writer. In other words, with a few exceptions, semantics are invisible.
Now, I love CURIEs. I love the ability to declare a namespace once, then reduce everything down to rdf:this and skos:that, without having to type in http://www.w3.org/1999/02/22-rdf-syntax-ns#this or http://www.w3.org/2004/02/skos/core#that every time I need to access these namespaces. I'd love it even more if Google and Mozilla and Microsoft would simply CACHE namespace prefix declarations, then warn me whenever I have a conflict. That would make it a lot easier for me to actually add RDFa to content manually. They don't, of course. The WHATWG group is congenitally allergic to namespaces. It smacks too much of XML, and XML is the work of the devil.
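To make the appeal concrete, here is a minimal Turtle sketch of what those prefix declarations buy you. The rdf: and skos: prefixes are the standard ones named above; the ex: namespace and the ex:javascript term are hypothetical, invented purely for illustration:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.com/terms#> .   # hypothetical namespace

# Declared once, the prefixes let every subsequent statement stay short:
ex:javascript rdf:type skos:Concept ;
              skos:prefLabel "JavaScript" .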
However, even given that, writing good RDFa is hard - almost harder than writing good RDF. You have to understand the idea of semantic concepts, and the distinction between the document as URL, the document as conceptual entity, and the document as concrete entity. And that assumes the URL is static. Today, most of the web isn't static - it's generated dynamically, and the URL to a document can vary dramatically based upon query string parameters (which often then get subsumed into REST interfaces by people who neither understand nor care to understand what REST actually is).
I find that most taxonomists despair of getting web content writers to even put keywords into their documents. Forget about trying to develop a deep graph of information about an article that you're getting $30 to write. Yes, it would be useful to the publisher, but it's unlikely to make enough of a difference that a writer will spend another hour on their own dime encoding it.
Certain places can encode RDF or RDFa handily. Wikis are getting better about this, although most specialized wikis also have well-known and clearly defined ontologies to draw upon to encode information - and, let's face it, with the exception of Wikipedia itself, most wikis are specialized. Even when this happens, though, it is seldom the author doing the encoding, nor even an editor - it is a process, assiduously collecting links, comparing them to existing ontologies, and then converting them into assertions.
That, in fact, is how most semantic wikis work - a link becomes an assertion with a label and a relationship, and it then becomes incumbent upon the processor, not a human being, to figure out what that relationship is. This is a problem with semantics in general, but on the wild and woolly web, it's almost intractable beyond the simple assertion that "this is a link to that."
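To illustrate (with entirely made-up identifiers - wiki:, linksTo, and ont:capital are not any standard vocabulary), the raw material a wiki link provides, versus what the processor is expected to derive, might look like this in Turtle:

# The raw form of a wiki link: we know France's page links to Paris's page,
# and what the link text was, but not why the link exists.
<wiki:France> <linksTo> <wiki:Paris> .
<wiki:Paris> <rdfs:label> "Paris" .

# The assertion a processor would ideally derive from that link:
<wiki:France> <ont:capital> <wiki:Paris> .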
Now, again, there's a surprisingly simple fix for this, but one that isn't used much at all. The fix is to add a field to the link pop-up in web editors that captures one piece of link information: "rel", for relationship. What this means in practice is that when I am editing something in TinyMCE or a similar web editor and click the link button, I get something like the following:
The "Keywords" section is the important thing here, and it is where you establish these relationships, which are then stored in the rel field. Note that rel: is used to handle certain type of server information as well, such as rel="nofollow" being used to tell a search engine not to follow a given link, so the specific encoding for keywords may need to be formalized (perhaps such as <a href="target" rel="keywords:'javascript','semantics','resume' nofollow">Title</a>).
What this would do is define a semantic structure of the form:
[] <link> _:b1 .
_:b1 <title> "Online Portfolio" ;
     <keyword> "javascript", "semantics", "resume" ;
     <url> "http://portfolio.semanticalllc.com" .
A processor could then take this and resolve it, using a SPARQL lookup to create something like:
<resource:article1>
    <alink> <alink:alink125> ;
    <url> "http://myarticle.com/article1" .
<alink:alink125>
    <linkLabel> "Online Portfolio" ;
    <link> <resource:article53> ;
    <linkTopic> <ont:javascript>, <ont:semantics>, <ont:resume> .
<resource:article53>
    <rdfs:label> "Kurt Cagle's Online Portfolio" ;
    <url> "http://portfolio.semanticalllc.com" .
That's all that's needed. The rels establish relationships to conceptual terms, and a simple SPARQL query can then build out how two resources are related:
CONSTRUCT { ?source ?rel ?target }
WHERE {
    ?source <alink> ?alink .
    ?alink <linkTopic> ?rel .
    ?alink <link> ?target .
}
This would in turn create three links, plus another derived from a little extra analysis of the selected title text:
<resource:article1> <ont:javascript> <resource:article53>.
<resource:article1> <ont:semantics> <resource:article53>.
<resource:article1> <ont:resume> <resource:article53>.
<resource:article1> <ont:portfolio> <resource:article53>.
That's all it would take to make semantics much easier on the web. It would require some post-processing to go from a string to a semantic representation (typically by querying against an established taxonomy). This might add a bit more semantic information to the dataset (such as which ontology was used to resolve this and how to access it if such a lexicon is public), but the idea is that you can do a lot of semantic work simply by focusing on links and not worrying about abstractions.
This, however, runs counter to how most people think about the semantic web: as a way for resources to describe themselves. That unfortunately makes little sense, because it puts the onus of providing an abstract of the information on the producer of that information, and abstraction is hard. So instead, we put the onus of abstraction on the person creating the link: the question becomes why this link is relevant to the link creator, not to the document creator.
One upshot of this approach is that it lets the document be described by others. For instance, from the perspective of article 53, I now have three critical pieces of information - who links to it, from where, and why:
SELECT ?term ?url WHERE {
    ?src ?rel <resource:article53> .
    ?rel <rdfs:label> ?term .
    ?src <url> ?url .
} ORDER BY ?term
| term         | url                              |
|--------------|----------------------------------|
| "Code"       | "http://myarticle.com/article17" |
| "Javascript" | "http://myarticle.com/article1"  |
| "Portfolio"  | "http://myarticle.com/article1"  |
| "Resume"     | "http://myarticle.com/article1"  |
| "Resume"     | "http://myarticle.com/article17" |
| "Resume"     | "http://myarticle.com/article42" |
| "Semantics"  | "http://myarticle.com/article1"  |
Or, put another way, the referenced article 53 has been determined to be an example of code, javascript and semantics, and appears also to contain a portfolio and a resume ... all without having done anything with RDFa or similar inline content.
This is a comparatively simple solution that crowdsources classification by having each linker say why the resource is relevant to them. It's fully as "semantic" as RDFa, and more to the point, it doesn't require that the creator of the links know anything about RDFa.
In a way, this is the fundamental problem that the Semantic Web has. It places too much of the onus of classification on the creator, and then compounds this by building a system that expects users of the technology to understand how and why it works. Creating a generally accessible concept taxonomy (or even keeping taxonomic terms as literals and then linking after the fact to a core concept with a sameAs type relationship) is not hard, though it does put the onus of processing such content on the content publishers. However, this is perhaps more appropriate anyway, because the content publishers in turn get their datasets classified for free.
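As a rough illustration of that literal-then-link approach, here is a minimal Turtle sketch. The term: identifier is hypothetical, and owl:sameAs stands in for whatever "sameAs type" predicate a publisher actually chooses:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

# The link is first captured with its keyword as a plain literal ...
<resource:article1> <keyword> "javascript" .

# ... and a later pass ties that literal term to a core concept.
<term:javascript> <rdfs:label> "javascript" ;
                  owl:sameAs <ont:javascript> .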
With this shift from author classification of content to linker classification of that content, what happens?
- Content becomes classified more objectively, because it is based upon the utility of the content to the linker.
- Namespaces disappear from consideration within web content, which makes web developers happier. All you do is add a rel field to your links.
- The decision of which ontology to use to classify terms moves from the creator of the content to the host of the content, which has a bigger vested interest in maintaining a comprehensive taxonomy.
- The producer can expose (or not) information without having to give away the farm. If you've just invested $50K in a comprehensive custom ontology, you're not necessarily inclined to share it.
- Because the rel content is within the HTML, application developers can write their own semantifiers for the same content without having to store the content directly.
- Services such as Google could expose their back-chain links (they don't now, as far as I know) - those documents that link to a given document.
- Even without this, any spidering function will do the same thing, but with a smaller corpus of documents.
Browsers (or a browser app) can also maintain their own semantics databases that contain autoclassification links - you type in a set of keywords, and the app will identify those pages that match those keywords (at the simplest) or even perform more complex queries as appropriate, without necessarily needing the full semantic web infrastructure.
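As a minimal sketch of what such a local lookup might look like (again using the toy predicates from the earlier examples; the keyword list is purely illustrative), the app could run something like:

# Find pages whose inbound links were tagged with any of the given keywords.
SELECT DISTINCT ?url WHERE {
    ?src ?rel ?page .
    ?rel <rdfs:label> ?term .
    ?page <url> ?url .
    FILTER ( lcase(str(?term)) IN ("javascript", "semantics") )
}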
There are places where the linked data web has succeeded, but this is not the same as the Semantic Web. LD for the most part consists of making specific data repositories available in a queryable form. This is laudable, and slowly happening, but is still a very small subsection of the total web. By rethinking how and where classification occurs, there's definitely an opportunity to open up far more of the web to semantic practices, at very low effective cost.
Kurt Cagle is the founder of Semantical LLC, and writes frequently on semantics and web-related issues.
Such a spot-on post! I've been a huge fan of RDF, but the barriers to adoption, even among educated engineers, have prevented acceptance, particularly when basic indexing/search can produce similar (not equivalent) results in a much more dev-friendly way. I'm still enamored with the idea of the semantic web, regardless.
I've said from day one: invented by PhDs, for PhDs. Why do spreadsheets work? Because secretaries can use them. Can I do semantics in a spreadsheet? I'd sure love to.
I think it's illustrative to see where it's failed and where it hasn't. A Bing or Google search for 'apple pie' brings a structured recipe to the top. There's clear promotional, and therefore indirect monetary, value to web site owners in putting that kind of semantic structure around recipes. Yes, it's (just a little) hard to tag; but they see business value and so they do it. So far with the web, too many other elements of schema.org don't YET have a direct and obvious value. For instance, I'm really interested in learning resources and LRMI. It sounds good as a semantic framework, but what business value does that tagging effort have? Google doesn't do anything with it today. Right now the only answer for the monetization/value of this effort is the lowest common denominator, Google. To me it seems inevitable that as the web matures, advertising-driven information seeking like Google's will be replaced by something that is more in service of the information seeker. At what point will I, and all other consumers, be willing to pay for a service that gives me the clean, well-lighted place that caters to my needs, pushes information to me that I've asked for, and doesn't have advertising-driven and spam-driven corruption of my information-seeking intent - free of untrustworthy sources? Probably pretty soon.
Good article. In your 'good reasons for this' list at the top you have 'Semantics is hard to add manually to content.' This can be restated as: _the_ reason the Sem Web has failed (insofar as it has) is that semantics can be found and are knowable in _modelled_ content - because they are there by design. Most of what is on the web is not modelled semantically, only in terms of UI and/or workflow tagging. Information systems that start with business / domain models (and even better, ontologies and terminologies) can expose information for which reliable RDF can be written, and reliable inferencing can be done. For most of what's on the web, figuring out the semantics is mostly reverse-engineering, and it's completely unreliable. Coupled with no backing terminologies / ontologies, there was never any hope. Also, XML _is_ the work of the devil, and should be banned from polite society :)