Microformats and RDFa deployment across the Web
I have presented some information about microformat and RDFa deployment on the Web on previous occasions (at SemTech 2009, SemTech 2010, and later at FIA Ghent 2010, see the slides for the latter; also at ISWC 2009). As such information is hard to come by, it has generated some interest from the audience. Unfortunately, Q&A time after presentations is too short to get into details, so this post gives some additional background on how we obtained the data and what it means for the Web. This level of detail also matters when comparing these numbers with information from other sources, where things might be measured differently.
The chart below shows the deployment of certain microformats and RDFa markup on the Web, as a percentage of all web pages, based on an analysis of 12 billion web pages indexed by Yahoo! Search. The same analysis was done at three different points in time, so the chart also shows the evolution of deployment.
The data is given below in a tabular format.
There are a couple of comments to make:
- There are many microformats (see microformats.org) and I only include data for the ones that are most common on the Web. To my knowledge at least, all other microformats are less common than the ones listed above.
- eRDF was a predecessor of RDFa and has been made obsolete by it. RDFa is more fully featured than eRDF, and has been adopted as a standard by the W3C.
- The data for the tag, adr and geo formats is missing from the first measurement.
- The numbers cannot be aggregated to get a total percentage of URLs with metadata. The reason is that a webpage may contain multiple microformats and/or RDFa markup. In fact, this is almost always the case with the adr and geo microformats, which are typically used as part of hcard. The hcard microformat itself can be part of hatom markup etc.
- Not all data is equally useful, depending on what you are trying to do. The tag microformat, for example, is nothing more than a set of keywords attached to a webpage. RDFa itself covers data using many different ontologies.
- The data doesn’t include “trivial” RDFa usage, i.e. documents that only contain triples from the xhtml namespace. Such triples are often generated by RDFa parsers even when the page author did not intend to use RDFa.
- This data includes all valid RDFa, and not just namespaces or vocabularies supported by Yahoo! or any other company.
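Two of the points above, that per-format percentages cannot simply be added up and that "trivial" RDFa is excluded, can be illustrated with a small sketch. This is not the code used for the analysis: the function names, the four-page toy sample and the exact namespace string used for the filter are assumptions made for illustration only.

```python
from collections import Counter

# Assumed namespace of the XHTML vocabulary. RDFa parsers expand link
# relations such as rel="stylesheet" into predicates in this namespace.
XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"


def has_nontrivial_rdfa(predicates):
    """True if at least one predicate lies outside the XHTML vocabulary."""
    return any(not p.startswith(XHTML_VOCAB) for p in predicates)


def deployment_stats(pages):
    """pages: list of (microformats, rdfa_predicates) pairs, one per page.

    Returns (share per format, share of pages with any markup) as fractions.
    """
    per_format = Counter()
    with_any = 0
    for microformats, rdfa_predicates in pages:
        formats = set(microformats)
        if has_nontrivial_rdfa(rdfa_predicates):
            formats.add("rdfa")
        per_format.update(formats)  # each format counted once per page
        if formats:
            with_any += 1
    total = len(pages)
    shares = {name: count / total for name, count in per_format.items()}
    return shares, with_any / total


# Hypothetical toy sample of four pages (not real measurements).
pages = [
    ({"hcard", "adr", "geo"}, []),                        # hcard with nested adr and geo
    ({"hatom", "hcard"}, [XHTML_VOCAB + "stylesheet"]),   # trivial RDFa only
    (set(), ["http://ogp.me/ns#title"]),                  # non-trivial RDFa
    (set(), []),                                          # no markup at all
]

shares, any_share = deployment_stats(pages)
print(sum(shares.values()))  # 1.5  (sum of the per-format shares)
print(any_share)             # 0.75 (pages with at least one format)
```

In this toy sample the per-format shares add up to 150% even though only 75% of the pages carry any markup, because nested formats such as adr and geo are counted on the same page as hcard. The second page is not counted as RDFa because its only triple comes from the XHTML vocabulary.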
The data shows that the usage of RDFa increased 510% between March 2009 and October 2010, from 0.6% of webpages to 3.6% of webpages (or 430 million webpages in our sample of 12 billion). This is largely thanks to the efforts of the folks at Yahoo! (SearchMonkey), Google (Rich Snippets) and Facebook (Open Graph), all of which recommend the usage of RDFa. The deployment of microformats has not advanced significantly in the same period, except for the hatom microformat.
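As a quick sanity check, the arithmetic behind these figures can be reproduced from the rounded percentages. The helper name below is hypothetical. Note that the rounded shares give an increase of about 500% rather than 510%; presumably the quoted figure was computed from the unrounded measurements, which is an assumption on my part.

```python
def percent_increase(old, new):
    """Relative growth from old to new, expressed in percent."""
    return (new - old) / old * 100


SAMPLE_SIZE = 12_000_000_000  # web pages in the sample
share_2009 = 0.6              # % of pages with RDFa, March 2009 (rounded)
share_2010 = 3.6              # % of pages with RDFa, October 2010 (rounded)

growth = percent_increase(share_2009, share_2010)
pages_2010 = share_2010 / 100 * SAMPLE_SIZE

print(round(growth))            # 500
print(round(pages_2010 / 1e6))  # 432 (million pages)
```

The 432 million pages obtained this way are consistent with the roughly 430 million quoted in the text.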
These results make me optimistic that the Semantic Web is already here in a big way. I don’t expect that 100% of webpages will ever adopt microformats or RDFa markup, simply because not all web pages contain structured data. As this seems interesting to watch, I will try to publish updates to the data and include the updated chart here or in future presentations.