Year of the Monkey: Lessons from the First Year of SearchMonkey

Peter Mika, Researcher and Data Architect, Yahoo!

DESCRIPTION

Presentation at SemTech 2009

TRANSCRIPT

Page 1: Year of the Monkey: Lessons from the first year of SearchMonkey

Year of the Monkey

Peter Mika

Researcher and Data Architect

Yahoo!

Page 2: Year of the Monkey: Lessons from the first year of SearchMonkey

[Diagram labels: Acme.com’s web pages, Acme.com’s database, index, RDF/microformat markup, DataRSS feed, web services, page extraction, SearchMonkey]

1. Site owners/publishers share structured data with Yahoo!.

2. Site owners and third-party developers build SearchMonkey apps.

3. Consumers customize their search experience with Enhanced Results or Infobars.

Page 3: Year of the Monkey: Lessons from the first year of SearchMonkey

March, 2008: The news is out!

“Yahoo’s support for semantic web standards like RDF and microformats is exactly the incentive websites need to adopt them. Instead of semantic silos scattered across the Web (think Twine), Yahoo will be pulling all the semantic information together when available, as a search engine should. Until now, there were few applications that demanded properly structured data from third parties. That changes today.” – TechCrunch

Page 4: Year of the Monkey: Lessons from the first year of SearchMonkey

May, 2008: Time to party!

Page 5: Year of the Monkey: Lessons from the first year of SearchMonkey

What happened since the launch?

• It’s working in and out

– We are contributing significantly to the growth of the Semantic Web

• Users are delighted → Publishers are willing to invest → More structured data → More applications → More users are delighted

– Increasing excitement all around

• A year later, even Google gets excited

• This presentation is about the things we learned along the way

1. How far are we?

2. Lessons about technology

3. Lessons about our communities

4. Moving ahead

Page 6: Year of the Monkey: Lessons from the first year of SearchMonkey

How far are we?

Page 7: Year of the Monkey: Lessons from the first year of SearchMonkey

The Web is getting more structured

Page 8: Year of the Monkey: Lessons from the first year of SearchMonkey

But just how far are we: lots of data or very little?

[Chart: percentage of URLs with embedded metadata in various formats, Sep 2008 vs. Mar 2009]

>400% increase in RDFa data

Page 9: Year of the Monkey: Lessons from the first year of SearchMonkey

The Semantic Gap

• The real question is whether this data serves a purpose

• Our purpose: fulfilling the information needs of our users

– It’s not about the size! Consider Wikipedia.

Demand = needs

Supply = information

Page 10: Year of the Monkey: Lessons from the first year of SearchMonkey

Analysis through query logs

• Research questions:

– How much of this data would ever be encountered by a user through search?

– What categories of queries can be answered?

– What’s the role of large sites?

• Method

– Imitating the average search behavior of users through web search query log analysis

– Reproducible experiments (given query log data)

• BOSS web search API

– Returns metadata for search result URLs in RDF/XML or DataRSS
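The method above can be sketched in a few lines: for every sampled query, count how many of its results carry each metadata format, then tally how many queries ended up with 1, 2, 3, ... such results. This is only an illustration under stated assumptions: the function name, the input shape (a set of detected formats per result URL), and the sample queries are made up here and are not taken from the talk.

```python
from collections import Counter

def histogram_by_format(results_per_query):
    """Tally, per format, how many queries have k results carrying that format.

    results_per_query maps a query string to a list with one entry per
    search result; each entry is the set of metadata formats found on
    that result's URL (an empty set means no metadata).
    """
    histogram = {}
    for results in results_per_query.values():
        per_format = Counter()
        for formats in results:
            per_format.update(formats)   # one count per format on this result
            if formats:
                per_format["ANY"] += 1   # result has at least one format
        for fmt, k in per_format.items():
            histogram.setdefault(fmt, Counter())[k] += 1
    return histogram

# Hypothetical sample, standing in for query-log queries and API results.
sample = {
    "pizza palo alto": [{"hcard", "adr"}, set(), {"hcard"}],
    "semantic web": [{"rel-tag"}, set()],
    "weather": [set(), set()],
}
hist = histogram_by_format(sample)
print(hist["ANY"])    # queries with 1 and with 2 results carrying metadata
print(hist["hcard"])  # one query with 2 hCard results
```

Queries with no metadata in any result (like "weather" here) never enter the histogram, which matches the way the result tables leave out the zero column.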

Page 11: Year of the Monkey: Lessons from the first year of SearchMonkey

Data

• Microformats, eRDF, RDFa data

• Query log data

– US query log

– Random sample of 7k queries

– Recent query log covering a period of over a month

• Query classification data

– US query log

– 1000 queries classified into various categories

Page 12: Year of the Monkey: Lessons from the first year of SearchMonkey

Caveats

• For us, and for the time being, search = document search

– For this experiment, we assume current bag-of-words document retrieval is a reasonable approximation of semantic search

• For us, search = web search

– We are dealing with the average search user

– There are many queries the users have learned not to ask

• Volume is a rough approximation of value

– There are rare information needs with high pay-offs, e.g. patent search, financial data, biomedical data…

Page 13: Year of the Monkey: Lessons from the first year of SearchMonkey

Number of queries with a given number of results with particular formats (N=7081)

Columns 1 to 10: number of results with the given metadata; cells: number of queries. Avg = average impressions per query.

Format        1    2    3    4    5    6    7    8    9   10  Impressions   Avg
ANY        2127 1164  492  244   85   24   10    5    3    1         7623  1.08
hcard      1457  370   93   11    3    0    0    0    0    0         2535  0.36
rel-tag    1317  350   95   44   14    8    6    3    1    1         2681  0.38
adr         456   77   21    6    1    0    0    0    0    0          702  0.10
hatom       450   52    8    1    0    0    0    0    0    0          582  0.08
license     359   21    1    1    0    0    0    0    0    0          408  0.06
xfn         339   26    1    1    0    0    0    1    0    0          406  0.06

Notes:

- Queries with 0 results with metadata are not shown

- You cannot add numbers in columns: a query may return documents with different formats

- Assume queries return more than 10 results

On average, a query has at least one result with metadata.

Are tags as useful as hCard?

That’s only 1 in every 16 queries.
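The Impressions and average columns follow from the distribution itself: a query with k matching results contributes k impressions, and the average divides the total by N. A minimal sketch that recomputes both columns from the rows above (the function and variable names are illustrative, not from the talk):

```python
N_QUERIES = 7081  # sample size stated in the slide title

# Number of queries having 1, 2, ..., 10 results with the given format.
DISTRIBUTION = {
    "ANY":     [2127, 1164, 492, 244, 85, 24, 10, 5, 3, 1],
    "hcard":   [1457, 370, 93, 11, 3, 0, 0, 0, 0, 0],
    "rel-tag": [1317, 350, 95, 44, 14, 8, 6, 3, 1, 1],
    "adr":     [456, 77, 21, 6, 1, 0, 0, 0, 0, 0],
    "hatom":   [450, 52, 8, 1, 0, 0, 0, 0, 0, 0],
    "license": [359, 21, 1, 1, 0, 0, 0, 0, 0, 0],
    "xfn":     [339, 26, 1, 1, 0, 0, 0, 1, 0, 0],
}

def impressions(counts):
    """Total impressions: each query with k matching results counts k times."""
    return sum(k * n for k, n in enumerate(counts, start=1))

def average_impressions(counts, n_queries=N_QUERIES):
    """Average number of matching results per query in the sample."""
    return impressions(counts) / n_queries

for fmt, counts in DISTRIBUTION.items():
    print(f"{fmt:8s} {impressions(counts):5d} {average_impressions(counts):.2f}")
```

For the ANY row this gives 7623 impressions and an average of 1.08, matching the table.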

Page 14: Year of the Monkey: Lessons from the first year of SearchMonkey

The influence of head sites (N=7081)

Columns 1 to 10: number of results from the given source; cells: number of queries. Avg = average impressions per query.

Source             1    2    3    4    5    6    7    8    9   10  Impressions   Avg
ANY             2127 1164  492  244   85   24   10    5    3    1         7623  1.08
hcard           1457  370   93   11    3    0    0    0    0    0         2535  0.36
rel-tag         1317  350   95   44   14    8    6    3    1    1         2681  0.38
wikipedia.org   1676    1    0    0    0    0    0    0    1    0         1687  0.24
adr              456   77   21    6    1    0    0    0    0    0          702  0.10
hatom            450   52    8    1    0    0    0    0    0    0          582  0.08
youtube.com      475    1    0    0    0    0    0    2    0    0          493  0.07
license          359   21    1    1    0    0    0    0    0    0          408  0.06
xfn              339   26    1    1    0    0    0    1    0    0          406  0.06
amazon.com       345    3    0    0    0    0    1    0    0    0          358  0.05

If YouTube came up with a microformat, it would be the fifth most important.
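The YouTube remark can be checked by ranking the formats together with a single site by impressions, as if that site had its own format. A small sketch using the impression totals from the table above; the helper name is made up for this illustration:

```python
# Impression totals per format, copied from the table.
FORMAT_IMPRESSIONS = {
    "rel-tag": 2681, "hcard": 2535, "adr": 702,
    "hatom": 582, "license": 408, "xfn": 406,
}
# Impression totals per head site, copied from the table.
SITE_IMPRESSIONS = {"wikipedia.org": 1687, "youtube.com": 493, "amazon.com": 358}

def rank_if_site_were_a_format(site):
    """Rank (1 = most impressions) of a site placed among the formats."""
    combined = dict(FORMAT_IMPRESSIONS)
    combined[site] = SITE_IMPRESSIONS[site]
    ordered = sorted(combined, key=combined.get, reverse=True)
    return ordered.index(site) + 1

print(rank_if_site_were_a_format("youtube.com"))  # 5
```

By the same measure a Wikipedia-specific format would rank third, behind only rel-tag and hCard.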

Page 15: Year of the Monkey: Lessons from the first year of SearchMonkey

Restricted by category: local queries (N=129)

Columns 1 to 10: number of results from the given source; cells: number of queries. Avg = average impressions per query.

Source                   1    2    3    4    5    6    7    8    9   10  Impressions   Avg
ANY                     36   16   10    0    4    1    0    0    0    0          124  0.96
hcard                   31    7    5    1    0    0    0    0    0    0           64  0.50
adr                     15    8    2    1    0    0    0    0    0    0           41  0.32
local.yahoo.com         24    0    0    0    0    0    0    0    0    0           24  0.19
en.wikipedia.org        24    0    0    0    0    0    0    0    0    0           24  0.19
rel-tag                 19    2    0    0    0    0    0    0    0    0           23  0.18
geo                     16    5    0    0    0    0    0    0    0    0           26  0.20
www.yelp.com            16    0    0    0    0    0    0    0    0    0           16  0.12
www.yellowpages.com     14    0    0    0    0    0    0    0    0    0           14  0.11

The query category largely determines which sites are important.

Page 16: Year of the Monkey: Lessons from the first year of SearchMonkey

Summary

• It’s not how much, it’s how useful

– For us, it’s a matter of who is looking for it

• We would need to break down usefulness

– Usefulness for improving presentation

– Usefulness for ranking

– Usefulness for disambiguation

– …

– Usefulness by how much someone is willing to pay for it

• Linked Data will need to be studied separately

Page 17: Year of the Monkey: Lessons from the first year of SearchMonkey

Lessons about technology

Page 18: Year of the Monkey: Lessons from the first year of SearchMonkey

Publisher’s dilemma: choosing the syntax I.

• microformats

– Cover the most common types of objects

• microformats.org, but also ‘unofficial’ such as Facebook Share

– Great for the publisher

• As long as it exactly fits: microformats are not extensible

• Various degrees of adoption and readiness

– Not so great from a parsing perspective

• Combinations of microformats are a problem

– Low end of semantics

• Very little structure

Page 19: Year of the Monkey: Lessons from the first year of SearchMonkey

Publisher’s dilemma: choosing the syntax II.

• eRDF

– A dying breed, rapidly being replaced by RDFa

• RDFa

– When microformats are not enough

• Full RDF expressiveness

• Bring your own vocabulary

– Syntax is difficult with many caveats

• e.g. the notion of unfinished or ‘dangling’ triples

• e.g. combination of the rel and typeof attributes

– Still, RDFa is on the rise and now the default choice

Page 20: Year of the Monkey: Lessons from the first year of SearchMonkey

Publisher’s dilemma: choosing the syntax III.

• DataRSS feeds

– Good for private data and not having to modify the page

• XSLT

– Often a good initial step in experimentation

– Writing XSLTs is hard, but visual tools such as MashMaker help

– No support for GRDDL yet

• Linked Data using RDF/XML

– Relationship of Linked Data and HTML pages is unclear

• Linked Data is often published for someone else, e.g. DBpedia

– Not supported by our infrastructure yet

Page 21: Year of the Monkey: Lessons from the first year of SearchMonkey

The language tradeoff: complexity vs. simplicity

• Complexity means more mistakes and less adoption

– Resource or literal?

• <meta property="vcard:url">http://www.example.org</meta>

• <meta property="vcard:tel">0034691792522</meta>

– Webpage or resource?

• Should we allow a resource to have the same URI as an existing webpage?

• This is the default in eRDF/RDFa!

– Types vs. datatypes

• <meta datatype="mytype:email">[email protected]</meta>

– Extensibility

• rdfs:movies

• But simplicity results in loss of semantics

– <span class="open"> <b>Mon-Wed</b> 10am-5pm </span>

• Not the last word on the subject, see HTML5
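The "resource or literal?" question can be made concrete from the consumer's side. In both vcard examples the value is written as element text, so a parser has to guess whether it names a resource or is just a string. A minimal sketch of one naive guess, assuming a simple scheme check; this heuristic and the function name are illustrative only and are not something the talk or SearchMonkey prescribes:

```python
from urllib.parse import urlparse

def guess_object_kind(value):
    """Guess whether a property value denotes a resource or a literal.

    Naive rule (an assumption for this sketch): values with an http(s)
    scheme and a host are resources, everything else is a literal.
    """
    parsed = urlparse(value.strip())
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "resource"
    return "literal"

examples = {
    "vcard:url": "http://www.example.org",
    "vcard:tel": "0034691792522",
}
for prop, value in examples.items():
    print(prop, "->", guess_object_kind(value))
```

The guess works for these two values but cannot recover what the publisher meant, which is the point of the slide: the markup itself should say whether a value is a resource.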

Page 22: Year of the Monkey: Lessons from the first year of SearchMonkey

Where are the vocabularies?

• No vocabularies in many domains

– Books, movies, stuff people care about…

• Too many competing proposals in other domains

– Often versions of the same proposal

– Example: vocabularies for microformats

• Not maintained

– I cannot maintain your vocabulary for you

• Limited tool support

– Too many expert tools until now

• Many vocabularies are not designed for annotation

• Missing meeting point and social process

– An ontology is a shared, formal representation of a domain

Page 23: Year of the Monkey: Lessons from the first year of SearchMonkey

How do we build communities? www.vocamp.org

Page 24: Year of the Monkey: Lessons from the first year of SearchMonkey

Learning about our communities

Page 25: Year of the Monkey: Lessons from the first year of SearchMonkey

Publishers, developers, users and marketers

• Most applications are developed by the site owner

– Exposing only the data that is required for the application

– How to encourage other types of applications and more data?

• Typical developer is a front-end engineer

– Mostly new to semantic technologies, but motivated

– However, ramp-up is steep. Learning RDF plus SearchMonkey.

– How to simplify development?

• Users have little interest in customizing their search experience

– They attach less value to customization than we thought

– How to give our users more value without the need for customization?

Page 26: Year of the Monkey: Lessons from the first year of SearchMonkey

Helping our publishers

• What if we could remove the need for programming?

• SearchMonkey objects

– Generate enhanced results based on markup in common formats

– Copy-paste code

– Validator

Page 27: Year of the Monkey: Lessons from the first year of SearchMonkey

LATE BREAKING NEWS

Five new objects: Product, Local, News, Event, Discussion

Page 28: Year of the Monkey: Lessons from the first year of SearchMonkey

Opening up new ways of accessing structured data

• BOSS (Build your Own Search Service) API

– Full service web search API

– Access metadata with search results

• view=searchmonkey_rdf&format=xml

– Use magic words to restrict search to results with certain kinds of metadata

• e.g. searchmonkey:com.yahoo.page.uf.hcard

• YQL

– Query web services as virtual relational tables

– Create mashups by joining tables

– The microformats ‘table’ allows access similar to that offered by BOSS
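The BOSS request parameters and the magic word mentioned above can be combined into a request URL along these lines. This is a rough sketch only: the base URL is a placeholder rather than the real BOSS endpoint, the function name is made up, and appending the magic word to the query terms is an assumption about how the restriction is expressed. Only the `view`, `format`, and `searchmonkey:com.yahoo.page.uf.hcard` strings come from the slide.

```python
from urllib.parse import quote, urlencode

# Placeholder endpoint (assumption): substitute the actual BOSS base URL.
BASE_URL = "https://boss.example.invalid/web"

def build_metadata_search_url(terms, magic_word=None):
    """Build a search URL asking for SearchMonkey metadata as RDF/XML."""
    # Assumption: the magic word is passed as an extra query term.
    query = terms if magic_word is None else f"{terms} {magic_word}"
    params = urlencode({"view": "searchmonkey_rdf", "format": "xml"})
    return f"{BASE_URL}/{quote(query, safe='')}?{params}"

url = build_metadata_search_url(
    "pizza palo alto",
    magic_word="searchmonkey:com.yahoo.page.uf.hcard",
)
print(url)
```

The sketch only builds the URL and performs no network request, so it can be run without an API key.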

Page 29: Year of the Monkey: Lessons from the first year of SearchMonkey

New default on applications for our users

Page 30: Year of the Monkey: Lessons from the first year of SearchMonkey

Moving ahead

Page 31: Year of the Monkey: Lessons from the first year of SearchMonkey

CELEBRATING 1 YEAR ANNIVERSARY OF SEARCHMONKEY

In 23 markets around the world

70 million enhanced results viewed daily (US)

>15% increase in click-through rates

200 people a day enter the dev tool to start creating an app

>15,000 developers registered to build apps

>400 applications in gallery

Amount of RDFa structured data increased by 413%

Summary

Page 32: Year of the Monkey: Lessons from the first year of SearchMonkey

What we’ve done

• Opened up SearchMonkey via BOSS and YQL

• Significantly simplified the work of publishers and developers

• Improved user experience through a number of new applications

• Working with the community to establish standards and best practices

– Vocabulary developers (microformats, FOAF, SIOC, GoodRelations etc.)

– Standards organizations (e.g. providing feedback for HTML5)

– Semantic data consumers and producers (e.g. Common Tag)

– The Semantic Web community (e.g. VoCamps)

Page 33: Year of the Monkey: Lessons from the first year of SearchMonkey

The Future

• Supporting the community with new tools

– We need your feedback!

• Creating entirely new search experiences based on structured data

– Intent-driven

– Task/goal oriented

– Stateful

– Delightful

Page 34: Year of the Monkey: Lessons from the first year of SearchMonkey

Contact

• Peter Mika

[email protected]

• SearchMonkey

– developer.yahoo.com/searchmonkey/

– mailing lists

[email protected]

[email protected]

– forums

• http://suggestions.yahoo.com/searchmonkey

Page 35: Year of the Monkey: Lessons from the first year of SearchMonkey

Hear more about what we do

• Wednesday 8:30AM: Executive Round Table on Semantic Search with Andrew Tomkins, Chief Scientist, Yahoo Search

• Wednesday 11:45AM: SearchMonkey and the Semantic Web with Kevin Haas, Senior Engineering Manager, Yahoo! Search

• Wednesday 2:30PM: Year of the Monkey: Lessons from the first year of SearchMonkey with Peter Mika, Yahoo! Research

• Thursday 11AM: The Semantic Web Gang looks back at SemTech 2009 with Peter Mika, Yahoo! Research