Year of the Monkey: Lessons from the First Year of SearchMonkey

Year of the Monkey

Peter Mika

Researcher and Data Architect

Yahoo!

- 2 -

Acme.com’s database

Index

RDF/Microformat Markup

1. Site owners/publishers share structured data with Yahoo!.

2. Site owners & third-party developers build SearchMonkey apps.

3. Consumers customize their search experience with Enhanced Results or Infobars.

DataRSS feed

Web Services

Page Extraction

Acme.com’s Web Pages

SearchMonkey

- 3 -

March, 2008: The news is out!

“Yahoo’s support for semantic web standards like RDF and microformats is exactly the incentive websites need to adopt them. Instead of semantic silos scattered across the Web (think Twine), Yahoo will be pulling all the semantic information together when available, as a search engine should. Until now, there were few applications that demanded properly structured data from third parties. That changes today.” – TechCrunch

- 4 -

May, 2008: Time to party!

- 5 -

What happened since the launch?

• It’s working in and out

– We are contributing significantly to the growth of the Semantic Web

• Users are delighted → Publishers are willing to invest → More structured data → More applications → More users are delighted

– Increasing excitement all around

• A year later, even Google gets excited

• This presentation is about the things we learned along the way

1. How far are we?

2. Lessons about technology

3. Lessons about our communities

4. Moving ahead

How far are we?

- 7 -

The Web is getting more structured

- 8 -

But just how far are we: lots of data or very little?

Percentage of URLs with embedded metadata in various formats

Sep, 2008 Mar, 2009

>400% increase in RDFa data

- 9 -

The Semantic Gap

• The real question is whether this data serves a purpose

• Our purpose: fulfilling the information needs of our users

– It’s not about the size! Consider Wikipedia.

Demand = needs

Supply = information

- 10 -

Analysis through query logs

• Research questions:

– How much of this data would ever be encountered by a user through search?

– What categories of queries can be answered?

– What’s the role of large sites?

• Method

– Imitating the average search behavior of users through web search query log analysis

– Reproducible experiments (given query log data)

• BOSS web search API

– Returns metadata for search result URLs in RDF/XML or DataRSS
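The counting step of this method can be sketched as follows. This is a minimal illustration, not the code used in the study: the input structure (a dict from query to a list of per-result format sets) and the function name `metadata_histogram` are assumptions, and the sample data is made up.

```python
from collections import Counter


def metadata_histogram(results_per_query, fmt=None, top_k=10):
    """Count queries by how many of their top-k results carry metadata.

    results_per_query: hypothetical structure mapping a query string to a
        list of sets, one set of format names per search result.
    fmt: a format name such as "hcard", or None to accept any format.
    Queries with zero matching results are left out of the histogram.
    """
    histogram = Counter()
    for results in results_per_query.values():
        hits = 0
        for formats in results[:top_k]:
            matched = bool(formats) if fmt is None else fmt in formats
            if matched:
                hits += 1
        if hits > 0:
            histogram[hits] += 1
    return histogram


# Made-up sample: three queries with the formats found in their results.
sample = {
    "pizza palo alto": [{"hcard", "adr"}, set(), {"hcard"}],
    "semantic web": [{"rel-tag"}, set(), set()],
    "weather": [set(), set()],
}

any_hist = metadata_histogram(sample)
hcard_hist = metadata_histogram(sample, fmt="hcard")
print(dict(any_hist), dict(hcard_hist))
```

Running the same function once per format, and once with `fmt=None`, yields one histogram row per format plus an "ANY" row.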

- 11 -

Data

• Microformats, eRDF, RDFa data

• Query log data

– US query log

– Random sample of 7k queries

– Recent query log covering over a month period

• Query classification data

– US query log

– 1000 queries classified into various categories

- 12 -

Caveats

• For us, and for the time being, search = document search

– For this experiment, we assume current bag-of-words document retrieval is a reasonable approximation of semantic search

• For us, search = web search

– We are dealing with the average search user

– There are many queries the users have learned not to ask

• Volume is a rough approximation of value

– There are rare information needs with high pay-offs, e.g. patent search, financial data, biomedical data…

- 13 -

Number of queries with a given number of results with particular formats (N=7081)

Columns 1 to 10: number of results carrying the format. Avg./query: average impressions per query.

Format                   1     2     3     4     5     6     7     8     9    10  Impressions  Avg./query
ANY                   2127  1164   492   244    85    24    10     5     3     1         7623        1.08
hcard                 1457   370    93    11     3     0     0     0     0     0         2535        0.36
rel-tag               1317   350    95    44    14     8     6     3     1     1         2681        0.38
adr                    456    77    21     6     1     0     0     0     0     0          702        0.10
hatom                  450    52     8     1     0     0     0     0     0     0          582        0.08
license                359    21     1     1     0     0     0     0     0     0          408        0.06
xfn                    339    26     1     1     0     0     0     1     0     0          406        0.06

Notes:

– Queries with 0 results with metadata not shown

– You cannot add numbers in columns: a query may return documents with different formats

– Assume queries return more than 10 results

On average, a query has at least one result with metadata.

Are tags as useful as hCard?

That’s only 1 in every 16 queries.
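The Impressions and average columns follow from the per-query counts: a query with k matching results contributes k impressions. A minimal sketch that recomputes them from the figures on this slide (the helper names are illustrative, not from the original analysis):

```python
N_QUERIES = 7081  # size of the query sample, as stated on the slide

# Number of queries having exactly 1, 2, ..., 10 results with each format.
rows = {
    "ANY":     [2127, 1164, 492, 244, 85, 24, 10, 5, 3, 1],
    "hcard":   [1457, 370, 93, 11, 3, 0, 0, 0, 0, 0],
    "rel-tag": [1317, 350, 95, 44, 14, 8, 6, 3, 1, 1],
    "adr":     [456, 77, 21, 6, 1, 0, 0, 0, 0, 0],
    "hatom":   [450, 52, 8, 1, 0, 0, 0, 0, 0, 0],
    "license": [359, 21, 1, 1, 0, 0, 0, 0, 0, 0],
    "xfn":     [339, 26, 1, 1, 0, 0, 0, 1, 0, 0],
}


def impressions(counts):
    """Total results with metadata: k results counted k times per query."""
    return sum(k * n for k, n in enumerate(counts, start=1))


def average_impressions(counts, n_queries=N_QUERIES):
    """Impressions averaged over all sampled queries, including misses."""
    return impressions(counts) / n_queries


def queries_with_metadata(counts):
    """Number of queries with at least one matching result."""
    return sum(counts)


for name, counts in rows.items():
    print(f"{name:8s} {impressions(counts):5d} {average_impressions(counts):.2f}")
```

Note that the average is taken over all 7081 queries, so it can reach 1 even when many queries have no result with metadata; `queries_with_metadata` gives the count of queries that have at least one.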

- 14 -

The influence of head sites (N=7081)

Columns 1 to 10: number of results carrying the format or coming from the site. Avg./query: average impressions per query.

Format / site            1     2     3     4     5     6     7     8     9    10  Impressions  Avg./query
ANY                   2127  1164   492   244    85    24    10     5     3     1         7623        1.08
hcard                 1457   370    93    11     3     0     0     0     0     0         2535        0.36
rel-tag               1317   350    95    44    14     8     6     3     1     1         2681        0.38
wikipedia.org         1676     1     0     0     0     0     0     0     1     0         1687        0.24
adr                    456    77    21     6     1     0     0     0     0     0          702        0.10
hatom                  450    52     8     1     0     0     0     0     0     0          582        0.08
youtube.com            475     1     0     0     0     0     0     2     0     0          493        0.07
license                359    21     1     1     0     0     0     0     0     0          408        0.06
xfn                    339    26     1     1     0     0     0     1     0     0          406        0.06
amazon.com             345     3     0     0     0     0     1     0     0     0          358        0.05

If YouTube came up with a microformat, it would be the fifth most important.

- 15 -

Restricted by category: local queries (N=129)

Columns 1 to 10: number of results carrying the format or coming from the site. Avg./query: average impressions per query.

Format / site            1     2     3     4     5     6     7     8     9    10  Impressions  Avg./query
ANY                     36    16    10     0     4     1     0     0     0     0          124        0.96
hcard                   31     7     5     1     0     0     0     0     0     0           64        0.50
adr                     15     8     2     1     0     0     0     0     0     0           41        0.32
local.yahoo.com         24     0     0     0     0     0     0     0     0     0           24        0.19
en.wikipedia.org        24     0     0     0     0     0     0     0     0     0           24        0.19
rel-tag                 19     2     0     0     0     0     0     0     0     0           23        0.18
geo                     16     5     0     0     0     0     0     0     0     0           26        0.20
www.yelp.com            16     0     0     0     0     0     0     0     0     0           16        0.12
www.yellowpages.com     14     0     0     0     0     0     0     0     0     0           14        0.11


The query category largely determines which sites are important.

- 16 -

Summary

• It’s not how much, it’s how useful

– For us, it’s a matter of who is looking for it

• We would need to break down usefulness

– Usefulness for improving presentation

– Usefulness for ranking

– Usefulness for disambiguation

– …

– Usefulness by how much someone is willing to pay for it

• Linked Data will need to be studied separately

Lessons about technology

- 18 -

Publisher’s dilemma: choosing the syntax I.

• microformats

– Cover the most common types of objects

• microformats.org, but also ‘unofficial’ such as Facebook Share

– Great for the publisher

• As long as it exactly fits: microformats are not extensible

• Various degrees of adoption and readiness

– Not so great from a parsing perspective

• Combinations of microformats are a problem

– Low end of semantics

• Very little structure

- 19 -

Publisher’s dilemma: choosing the syntax II.

• eRDF

– A dying breed, rapidly being replaced by RDFa

• RDFa

– When microformats are not enough

• Full RDF expressiveness

• Bring your own vocabulary

– Syntax is difficult with many caveats

• e.g. the notion of unfinished or ‘dangling’ triples

• e.g. combination of the rel and typeof attributes

– Still, RDFa is on the rise and now the default choice

- 20 -

Publisher’s dilemma: choosing the syntax III.

• DataRSS feeds

– Good for private data and not having to modify the page

• XSLT

– Often a good initial step in experimentation

– Writing XSLTs is hard, but visual tools such as MashMaker help

– No support for GRDDL yet

• Linked Data using RDF/XML

– Relationship of Linked Data and HTML pages is unclear

• Linked Data is often published for someone else, e.g. DBpedia

– Not supported by our infrastructure yet

- 21 -

The language tradeoff: complexity vs. simplicity

• Complexity means more mistakes and less adoption

– Resource or literal?

• <meta property="vcard:url">http://www.example.org</meta>

• <meta property="vcard:tel">0034691792522</meta>

– Webpage or resource?

• Should we allow a resource to have the same URI as an existing webpage?

• This is the default in eRDF/RDFa!

– Types vs. datatypes

• <meta datatype="mytype:email">[email protected]</meta>

– Extensibility

• rdfs:movies

• But simplicity results in loss of semantics

– <span class="open"> <b>Mon-Wed</b> 10am-5pm </span>

• Not the last word on the subject, see HTML5
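The "resource or literal?" pitfall can be illustrated with a small check. This is a sketch under stated assumptions, not a SearchMonkey or RDFa tool: the helper names are hypothetical, and the rule used (an http/https value is marked up as a resource via rel/href, anything else as a literal via property/content) is a simplification of RDFa.

```python
from html import escape
from urllib.parse import urlparse


def looks_like_uri(value):
    """Heuristic: treat absolute http(s) URLs as resources, not literals."""
    parsed = urlparse(value.strip())
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def suggest_markup(prop, value):
    """Suggest RDFa markup for a property value (illustrative helper).

    Resources go into rel/href; plain values go into property/content.
    """
    safe_value = escape(value.strip(), quote=True)
    if looks_like_uri(value):
        return f'<link rel="{prop}" href="{safe_value}" />'
    return f'<meta property="{prop}" content="{safe_value}" />'


print(suggest_markup("vcard:url", "http://www.example.org"))
print(suggest_markup("vcard:tel", "0034691792522"))
```

With the two values from the slide, the URL ends up as a resource and the phone number stays a literal, which is the distinction publishers tend to get wrong.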

- 22 -

Where are the vocabularies?

• No vocabularies in many domains

– Books, movies, stuff people care about…

• Too many competing proposals in other domains

– Often versions of the same proposal

– Example: vocabularies for microformats

• Not maintained

– I cannot maintain your vocabulary for you

• Limited tool support

– Too many expert tools until now

• Many vocabularies are not designed for annotation

• Missing meeting point and social process

– An ontology is a shared, formal representation of a domain

- 23 -

How do we build communities? www.vocamp.org

Learning about our communities

- 25 -

Publishers, developers, users and marketers

• Most applications are developed by the site owner

– Exposing only the data that is required for the application

– How to encourage other types of applications and more data?

• Typical developer is a front-end engineer

– Mostly new to semantic technologies, but motivated

– However, ramp-up is steep. Learning RDF plus SearchMonkey.

– How to simplify development?

• Users have little interest in customizing their search experience

– They attach less value to customization than we thought

– How to give our users more value without the need for customization?

- 26 -

Helping our publishers

• What if we could remove the need for programming?

• SearchMonkey objects

– Generate enhanced results based on markup in common formats

– Copy-paste code

– Validator

- 27 -

LATE BREAKING NEWS

Five new objects: Product, Local, News, Event, Discussion

- 28 -

Opening up new ways of accessing structured data

• BOSS (Build your Own Search Service) API

– Full service web search API

– Access metadata with search results

• view=searchmonkey_rdf&format=xml

– Use magic words to restrict search to results with certain kinds of metadata

• e.g. searchmonkey:com.yahoo.page.uf.hcard

• YQL

– Query web services as virtual relational tables

– Create mashups by joining tables

– The microformats ‘table’ allows similar access as with BOSS
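A request combining the parameters and the magic word above might be built as follows. This is only a sketch: the base URL and the `appid` parameter are assumptions not stated on the slide, the service may no longer be available, and the code builds the URL without sending it.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint for the BOSS web search API; not given on the slide.
BOSS_BASE = "http://boss.yahooapis.com/ysearch/web/v1/"


def boss_search_url(query, app_id, metadata_filter=None):
    """Build a BOSS search URL that asks for SearchMonkey metadata.

    metadata_filter: optional magic word, for example
        "searchmonkey:com.yahoo.page.uf.hcard", appended to the query to
        restrict results to pages carrying that kind of metadata.
    """
    terms = query if metadata_filter is None else f"{query} {metadata_filter}"
    params = urlencode(
        {"appid": app_id, "view": "searchmonkey_rdf", "format": "xml"}
    )
    return f"{BOSS_BASE}{quote(terms, safe='')}?{params}"


url = boss_search_url(
    "pizza palo alto",
    app_id="YOUR_APP_ID",  # placeholder, hypothetical
    metadata_filter="searchmonkey:com.yahoo.page.uf.hcard",
)
print(url)
```

The query terms are percent-encoded into the path, while `view` and `format` travel as ordinary query-string parameters.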

- 29 -

New default on applications for our users

Moving ahead

- 31 -

CELEBRATING 1 YEAR ANNIVERSARY OF SEARCHMONKEY

In 23 markets around the world

70 million enhanced results viewed daily (US)

>15% increase in click-through rates

200 people a day enter the dev tool to start creating an app

>15,000 developers registered to build apps

>400 applications in gallery

Amount of RDFa structured data increased by 413%

Summary

- 32 -

What we’ve done

• Opened up SearchMonkey via BOSS and YQL

• Significantly simplified the work of publishers and developers

• Improved user experience through a number of new applications

• Working with the community to establish standards and best practices

– Vocabulary developers (microformats, FOAF, SIOC, GoodRelations etc.)

– Standard organizations (e.g. providing feedback for HTML5)

– Semantic data consumers and producers (e.g. Common Tag)

– The Semantic Web community (e.g. VoCamps)

- 33 -

The Future

• Supporting the community with new tools

– We need your feedback!

• Creating entirely new search experiences based on structured data

– Intent-driven

– Task/goal oriented

– Stateful

– Delightful

- 34 -

Contact

• Peter Mika

– [email protected]

• SearchMonkey

– developer.yahoo.com/searchmonkey/

– mailing lists

• [email protected]

• [email protected]

– forums

• http://suggestions.yahoo.com/searchmonkey

- 35 -

Hear more about what we do

• Wednesday 8:30AM: Executive Round Table on Semantic Search with Andrew Tomkins, Chief Scientist, Yahoo Search

• Wednesday 11:45AM: SearchMonkey and the Semantic Web with Kevin Haas, Senior Engineering Manager, Yahoo! Search

• Wednesday 2:30PM: Year of the Monkey: Lessons from the first year of SearchMonkey with Peter Mika, Yahoo! Research

• Thursday 11AM: The Semantic Web Gang looks back at SemTech 2009 with Peter Mika, Yahoo! Research
