Welcome to schema.org

Bing, Google, and Yahoo! announced schema.org yesterday, a collaboration between the three search providers in the area of vocabularies for structured data. As the ‘schema guy’ at Yahoo!, I have been part of the very small core team that developed technical content for schema.org. It’s been an interesting process: if you doubt that achieving an agreement in the search domain is hard, consider that the last time such an agreement happened was apparently sitemaps.org in 2006.

However, over the years, the lack of agreement on schemas has become such a major pain point for publishers new to the Semantic Web that eventually cooperation became the only sensible thing to do for the future of the Semantic Web project. Consider that until yesterday any publisher that wanted to provide structured data for Bing, Google, and Yahoo! needed to navigate three sets of documentation and, worse, choose between three different schemas and multiple formats (microdata, RDFa, microformats), or mark up their pages using all of them.

So how did we get here? In my personal view, one of the key problems of the W3C’s Semantic Web design has been that it considered only technical issues, and not the need for a social process that would bootstrap the system with data and schemas. We are now doing better on the data front thanks to large community efforts such as Linked Data. With regard to schemas (we used to call them ontologies, until we found it scared people away), the expectation was that they would be developed in a distributed manner, and that either machines would do the hard job of schema matching or agreements would somehow emerge. However, schema matching is a hard problem to automate, and agreements were slow to come due to the lack of a space for schema development and discussion. We have tried a number of things in this respect; for example, some of you might know that I was one of the instigators of the VoCamp movement, which peaked around 2009. The W3C itself accepted some RDF-based schemas as member submissions, but it didn’t see itself as the organization that should deal with schemas, and there has been no process for dealing with these submissions either. (As an example, we learned when we started with SearchMonkey that there were actually two versions of VCard in RDF, submitted by two different members of the W3C. This problem has since been resolved.) Other schemas just appeared on websites and were later abandoned by their owners. Finding stable and mature schemas with sufficient adoption eventually became a major pain point. In the search domain, the situation improved somewhat when search providers preselected some schemas for publishers to use, and started providing specific documentation, with examples and a way to validate webpages. However, as illustrated above, these efforts were still too fragmented until yesterday.

Given the above history, I’m extremely glad that cooperation prevailed in the end and hopefully schema.org will become a central point for vocabularies for the Semantic Web for a long time to come. Note that it will almost certainly not be the only one. schema.org covers the core interests of search providers, i.e. the stuff that people search for the most (hence the somewhat awkward term ‘search vocabularies’). As the simple needs are the most common in search logs, this includes things like addresses of businesses, reviews and recipes. schema.org will hopefully evolve with extensions over time but it may never cover complex domains such as biotechnology, e-government or others where people have been using Semantic Web technology with success. Nor do I think that schema.org is ‘perfect’. Personally, I would have liked to see RDFa used as the syntax for the basic examples, because I consider it more mature, and a superior standard to microdata in many ways. You will notice that RDF(a) in particular would have offered a standard way to extend schema.org schemas and map them to other schemas on the Web. Currently, there is an example of using the schemas in RDFa, but the support for this version of the markup will depend on its adoption.
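For readers who have not yet seen the microdata syntax mentioned above, here is a minimal sketch of what such markup looks like and how a consumer might read it. The page fragment is made up for illustration, and the small parser is only a toy built on Python's standard html.parser module; it ignores nested items and is not how any search provider actually processes pages.

```python
from html.parser import HTMLParser

# Hypothetical page fragment marked up with microdata and schema.org terms.
SNIPPET = """
<div itemscope itemtype="http://schema.org/LocalBusiness">
  <span itemprop="name">Example Cafe</span>
  <span itemprop="telephone">555-0100</span>
</div>
"""


class ItemPropCollector(HTMLParser):
    """Collects itemtype values and the text of elements carrying itemprop."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        if "itemprop" in attrs:
            # Remember the property name until its text content arrives.
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._current_prop and text:
            self.properties[self._current_prop] = text
            self._current_prop = None


collector = ItemPropCollector()
collector.feed(SNIPPET)
print(collector.item_types)
print(collector.properties)
```

The same facts could be expressed in RDFa with different attributes; the sketch only shows the general shape of the data a publisher provides.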

Please take a look at schema.org, and if you have comments please consider using the schema.org feedback mechanisms (we have a feedback form as well as a discussion group).

Semantic Search Challenge sponsored by Yahoo! Labs

Together with my co-chairs Marko Grobelnik, Thanh Tran Duc and Haofen Wang, I again have the opportunity to organize the Semantic Search Workshop, now in its 4th edition, the premier event for research on retrieving information from structured data collections or text collections annotated with metadata. Like last year, the Workshop will take place at the WWW conference, and will be held March 29, 2011, in Hyderabad, India. If you wish to submit a paper, there are still a few days left: the deadline is Feb 26, 2011. We welcome both short and long submissions.

In conjunction with the workshop, and with a number of co-organizers helping us, we are also launching a Semantic Search Challenge (sponsored by Yahoo! Labs), which is hosted at semsearch.yahoo.com. The competition will feature two tracks. The first track (entity retrieval) is the same task we evaluated last year: retrieving resources that match a keyword query, where the query contains the name of an entity, with possibly some context (such as “starbucks barcelona”). This year we are adding a new task (list retrieval), which represents the next level of difficulty: finding resources that belong to a particular set of entities, such as “countries in africa”. These queries are more complex to answer since they don’t name a particular entity. Unlike in other similar competitions, the task is to retrieve the answers from a real (messy…) dataset crawled from the Semantic Web. There is a small prize ($500) to win in each track.

The entry period will start March 1, and run through March 15. Please consider participating in either of these tracks: it’s early days in Semantic Search, and there is so much to discover.

Microformats and RDFa deployment across the Web

I have presented on previous occasions (at SemTech 2009, SemTech 2010, and later at FIA Ghent 2010, see slides for the latter; also at ISWC 2009) some information about microformat and RDFa deployment on the Web. As such information is hard to come by, this has generated some interest from the audience. Unfortunately, Q&A time after presentations is too short to get into details, so here is some additional background on how we obtained this data and what it means for the Web. This level of detail is also important when comparing this data with information from other sources, where things might be measured differently.

The chart below shows the deployment of certain microformats and RDFa markup on the Web, as a percentage of all web pages, based on an analysis of 12 billion web pages indexed by Yahoo! Search. The same analysis has been done at three different points in time, and therefore the chart also shows the evolution of deployment.

Microformats and RDFa deployment on the Web (% of all web pages)

The data is given below in a tabular format.

Date     RDFa   eRDF   tag    hcard  adr    hatom  xfn    geo    hreview
09-2008  0.238  0.093  N/A    1.649  N/A    0.476  0.363  N/A    0.051
03-2009  0.588  0.069  2.657  2.005  0.872  0.790  0.466  0.228  0.069
10-2010  3.591  0.000  2.289  1.058  0.237  1.177  0.339  0.137  0.159

There are a couple of comments to make:

  • There are many microformats (see microformats.org) and I only include data for the ones that are most common on the Web. To my knowledge at least, all other microformats are less common than the ones listed above.
  • eRDF was a predecessor to RDFa, and has been made obsolete by it. RDFa is more fully featured than eRDF, and has been adopted as a standard by the W3C.
  • The data for the tag, adr and geo formats is missing from the first measurement.
  • The numbers cannot be aggregated to get a total percentage of URLs with metadata. The reason is that a webpage may contain multiple microformats and/or RDFa markup. In fact, this is almost always the case with the adr and geo microformats, which are typically used as part of hcard. The hcard microformat itself can be part of hatom markup etc.
  • Not all data is equally useful, depending on what you are trying to do. The tag microformat, for example, is nothing more than a set of keywords attached to a webpage. RDFa itself covers data using many different ontologies.
  • The data doesn’t include “trivial” RDFa usage, i.e. documents that only contain triples from the xhtml namespace. Such triples are often generated by RDFa parsers even when the page author did not intend to use RDFa.
  • This data includes all valid RDFa, and not just namespaces or vocabularies supported by Yahoo! or any other company.
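To illustrate why the per-format percentages cannot simply be added up, here is a small sketch with made-up URLs. It assumes we know, for each format, the set of pages that contain it; the real analysis works on the search index, not on sets like these.

```python
# Hypothetical toy data: which pages contain which format.
pages_with = {
    "hcard": {"a.example/1", "a.example/2", "b.example/1"},
    "adr": {"a.example/1", "a.example/2"},  # adr typically appears inside hcard
    "geo": {"a.example/1"},
}

# Adding up the per-format counts counts some pages more than once.
naive_total = sum(len(urls) for urls in pages_with.values())

# The number of distinct pages with any markup is the size of the union.
distinct_total = len(set().union(*pages_with.values()))

print(naive_total, distinct_total)
```

In this toy case the naive sum is twice the number of distinct pages, because every page with adr or geo markup also carries an hcard.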

The data shows that the usage of RDFa has increased 510% between March, 2009 and October, 2010, from 0.6% of webpages to 3.6% of webpages (or 430 million webpages in our sample of 12 billion). This is largely thanks to the efforts of the folks at Yahoo! (SearchMonkey), Google (Rich Snippets) and Facebook (Open Graph), all of whom recommend the usage of RDFa. The deployment of microformats has not advanced significantly in the same period, except for the hatom microformat.
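The figures quoted above follow directly from the table; the sketch below redoes the arithmetic using the unrounded table values and the sample size of 12 billion pages. The variable names are my own.

```python
SAMPLE_SIZE = 12_000_000_000  # web pages analyzed

rdfa_march_2009 = 0.588    # % of pages with RDFa, from the table
rdfa_october_2010 = 3.591  # % of pages with RDFa, from the table

# Relative growth between the two measurements, in percent.
relative_increase = (rdfa_october_2010 - rdfa_march_2009) / rdfa_march_2009 * 100

# Absolute number of pages with RDFa in the October 2010 sample.
pages_october_2010 = SAMPLE_SIZE * rdfa_october_2010 / 100

print(f"increase: {relative_increase:.1f}%")
print(f"pages: {pages_october_2010 / 1e6:.0f} million")
```

This gives an increase of roughly 510% and a little over 430 million pages, in line with the numbers in the text.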

These results make me optimistic that the Semantic Web is here already in large ways. I don’t expect that 100% of webpages will ever adopt microformats or RDFa markup, simply because not all web pages contain structured data. As this seems interesting to watch, I will try to publish updates to the data and include the updated chart here or in future presentations.

Semantic Search Workshop 2010… and a bit of competition

Together with my co-organizers Haofen Wang, Thanh Tran and Marko Grobelnik, we have again been given the fantastic opportunity to organize the next edition of our Semantic Search workshop at WWW 2010. The papers are due soon (March 6), so it’s a bit too late to advertise this part of the event, but I definitely wanted to mention what will hopefully become another key part of the event: the Evaluation of Entity Search track, a competition we are organizing in conjunction with the workshop.

We already noticed last year that our growing field lacks firm ground for evaluating our results, so right after WWW 2009 we decided to put serious effort into thinking about the evaluation of Semantic Search. One of our talented interns at Yahoo Research Barcelona, Jeff Pound, spent his time last summer formalizing possible evaluation tasks in semantic search, and we have a paper at WWW 2010 on this topic. Just as importantly, however, we were joined by other helpful folks (Harry Halpin, Daniel Herzig, Henry Thompson) to actually organize a public competition.

So what does this competition look like? Participants will be given queries sampled from a web search query log provided by the Yahoo Webscope program, and have to try to answer those queries using the Billion Triples Challenge corpus from 2009. The queries that are selected are all entity queries in that they are looking to find information about a single entity. (The queries may contain context information, e.g. peter mika barcelona). The results have to be URIs of resources from the dataset. In short, a competition with keyword queries over structured data (RDF), which also sets it apart from other competitions focused on entity retrieval.
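To make the shape of the task concrete, here is a toy sketch of keyword search over RDF-like data that returns resource URIs. Everything in it is made up for illustration: the triples, the URIs, and the scoring (a plain count of matching query terms). It is not a baseline of the competition, nor the way submissions are evaluated.

```python
from collections import defaultdict

# Hypothetical triples: (subject URI, predicate, literal value).
TRIPLES = [
    ("http://example.org/person/1", "name", "Peter Mika"),
    ("http://example.org/person/1", "location", "Barcelona"),
    ("http://example.org/person/2", "name", "Peter Smith"),
    ("http://example.org/place/1", "name", "Barcelona"),
]


def rank_entities(query, triples):
    """Return subject URIs ordered by how many query terms their literals contain."""
    terms = set(query.lower().split())

    # Gather all literal tokens attached to each resource.
    tokens_by_uri = defaultdict(set)
    for subject, _predicate, literal in triples:
        tokens_by_uri[subject].update(literal.lower().split())

    scored = [(len(terms & tokens), uri) for uri, tokens in tokens_by_uri.items()]
    # Highest score first; ties are broken alphabetically by URI.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [uri for score, uri in scored if score > 0]


results = rank_entities("peter mika barcelona", TRIPLES)
print(results)
```

The resource that matches both the name and the context term comes out on top, while resources that match only part of the query follow. Real systems have to do this over a billion messy triples, which is where the challenge lies.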

Everyone interested will have until April 10 to submit results as explained on the website of the workshop. As we don’t yet know what to expect, there will be no prizes this time… other than fame and glory!

Welcome the new Yahoo homepage!

And in case you still doubted whether this is a company run by nerds, watch this video.

Common Tag semantic tagging format released today

The Common Tag format for semantic tagging has finally been released today after almost a year of intense work on it by a group of Web companies active in the semantic technologies area, among them Yahoo. It’s been great fun working on this and I’m proud to have been involved: while there have been vocabularies before for representing tags in RDF, this effort is different in at least two respects.

First, a significant amount of time has been spent on making sure the specification meets the needs of all partners involved. The support of these companies for the specification will ensure that developers in the future can rely on a single format for annotating content with semantic tags and interchanging tag data. The website already lists a number of applications, but I’m pretty sure that a common tagging format will open entirely new possibilities in searching, navigating and aggregating web content.

Second, the format has been developed with publishers in mind, in particular by making it as easy as possible to embed semantic tags in HTML using RDFa, a syntax embraced by all those involved. The choice of RDF also means that, unlike in the case of the rel-tag microformat, Common Tags can be applied to any object, not just documents.

So, it’s time for a new era in tagging!

VoCamp Sunnyvale: June 18-19, 2009

VoCamps provide the missing social interaction needed for vocabulary creation and management on the Semantic Web: a space where members of the community can discuss the current issues related to vocabularies and semantic interoperability. Unlike Semantic Web meetups which typically take just a few hours and where the discussion focuses on a single presentation, VoCamps are two-day events that allow in-depth discussions and working in small groups.

Following the success of VoCamp Ibiza, we are organizing another similar event at Yahoo, but this time in the US, where VoCamps are now also taking hold. (VoCampDC will be organized at the end of May and has already reached its full capacity!) This VoCamp will take place in Sunnyvale, directly after the SemTech 2009 conference.

If you would like to join this next edition of the VoCamp series, please sign up on the VoCampSunnyvale2009 wiki page! The space is limited, but we will try to expand if necessary. Hope to see many of you in San Jose and Sunnyvale!

Upcoming events: VoCampIbiza, FoWS and SemSearch

It’s rare that I’m involved in organizing three events at the same time, especially when those events take place within a period of two weeks. Nevertheless, I’m equally excited about each of them for different reasons.

The first one will be VoCampIbiza. I have a great feeling about VoCamp, because in the year since we first discussed the idea with Tom Heath, VoCamps have grown into a movement, even jumping across the Atlantic. Feeling at least partly responsible, I’ve felt the need to organize at least one such event somewhere nearby. (VoCamps are organized at different times and places, unlike regular, local Semantic Web meetups, which are also on the rise; check out the Semantic Web meetup alliance.) So it’s going to happen on Ibiza during April 15-16, in the week before WWW, and if you want to come, you just have to sign up! There is no registration fee and Ibiza is cheap to reach from many places in Europe.

The second event, the Future of Web Search Workshop, is a regular yearly get-together organized by Ricardo (Baeza-Yates), but this year it will have a special focus on semantic search. The format is again fairly flexible: no papers, only interesting presentations. (The list of presentations is already fixed and posted on the website, but registration is still open if you want to join!) It will take place right after VoCamp, April 17-18, so that’s again the week before WWW.

Lastly, there will be the second Semantic Search Workshop, taking place at WWW in Madrid on April 21. Together with my co-organizers (Thanh Tran Duc from AIFB in Karlsruhe, Haofen Wang from the Apex Lab and Marko Grobelnik from JSI) we already had the feeling that WWW is probably the best place to take this workshop, as the conference naturally brings together researchers from IR and the Semantic Web. The number and quality of papers, as well as the number of participants who have registered so far, are certainly very promising indicators!

SearchMonkey simplified

I’m proud to say that today we have released ‘SearchMonkey Objects’, which might seem like a small step in the evolution of the Monkey (huh!) but we are hoping that it will radically simplify the ramp-up for site owners when it comes to enabling rich, structured results for their websites.

So what’s happening here? Well, until now if you wanted to have an enhanced result based on metadata, you needed to mark up your site, and then create an application to transform the metadata into a search result presentation. These applications were simple (all they did was map fields in the data to parts of a presentation template) but they still required developers to write PHP code.
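To give a feel for how little such an application does, here is a sketch of the field-to-template mapping idea. It is written in Python rather than PHP and does not use the actual SearchMonkey API; the field names, the template slots and the sample metadata are all hypothetical.

```python
from string import Template

# Hypothetical presentation template with three slots.
RESULT_TEMPLATE = Template("$title\n$summary\n$link")

# Hypothetical mapping from template slots to metadata fields.
FIELD_MAP = {
    "title": "dc:title",
    "summary": "dc:description",
    "link": "media:url",
}


def render_result(metadata):
    """Fill the template slots from the metadata; missing fields become empty."""
    values = {slot: metadata.get(field, "") for slot, field in FIELD_MAP.items()}
    return RESULT_TEMPLATE.substitute(values)


# Made-up metadata as it might be extracted from a marked-up page.
sample = {
    "dc:title": "Example video",
    "dc:description": "A short clip.",
    "media:url": "http://example.org/video/1",
}
print(render_result(sample))
```

Once the vocabulary is known in advance, a mapping like this can be fixed once for everybody, which is what makes a standard presentation possible.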

We realized that this could be simpler! In particular, from now on if you provide Yahoo structured data using vocabularies (formats) that we understand, we can create a rich result for you without you having to write a single line of code! Obviously, if you want to customize the presentation to your particular site, you can still do that by writing a presentation application, but if you are happy with the standard treatment, you don’t have to.

SearchMonkey Objects is a simple website that not only shows you how to mark up your page for certain types of objects, but also lets you validate immediately if your markup is correct. Again, this is something that many have asked for in the past. The first objects that we support are Video, Games and Documents, but more are on the way.

I believe this is an exciting step because it will no doubt lead to greater adoption of RDFa and other forms of semantic markup, bringing us even closer to structuring the Web. And as always, tell us what you think!

Yahoo makes the World’s metadata available through BOSS

I’m thrilled to spread the news further that we have just made available all public metadata (microformats, eRDF and RDFa data) that we crawl through the BOSS API. See the official announcement and commentaries throughout the Web.

This essentially means opening up all public data available to SearchMonkey applications, and thus making it possible for anyone to experiment with various forms of semantic search that go well beyond changing the way abstracts look. Consider for example the microsearch prototype I blogged about a few months ago, which showed rich abstracts, plus temporal and geographic visualizations based on metadata. You can now build something like microsearch in a matter of hours, and in a highly scalable fashion. Last, but not least, the terms of the BOSS API allow you to monetize your search engine in any way you want.

So if you think you can build a better search engine through semantics, this is a great time to start! All we ask is that you give us feedback on what you do, minimally by tagging your experiment with the tag ‘bossmashup’.