Semantic Search at Yahoo
Invited talk at the Industry Day of ECIR 2014
Presented by Peter Mika, Sr. Research Scientist, Yahoo Labs, April 16, 2014
The Semantic Web (2001-)
Part of Tim Berners-Lee’s original proposal for the Web
Beginning of a research community
› Ontology engineering
› Logical inference
› Agents, web services
Rough start in deployment
› Misplaced expectations
› Lack of adoption
The Semantic Web, May 2001
“At the doctor's office, Lucy instructed her Semantic Web agent through her handheld Web browser. The agent promptly retrieved information about Mom's prescribed treatment from the doctor's agent, looked up several lists of providers, and checked for the ones in-plan for Mom's insurance within a 20-mile radius of her home and with a rating of excellent or very good on trusted rating services. It then began trying to find a match between available appointment times (supplied by the agents of individual providers through their Web sites) and Pete's and Lucy's busy schedules.”
(The emphasized keywords indicate terms whose semantics, or meaning, were defined for the agent through the Semantic Web.)
Misplaced expectations?
Lack of adoption
Standardization ahead of adoption
› URI, RDF, RDF/XML, RDFa, JSON-LD, OWL, RIF, SPARQL, OWL-S, POWDER…
Chicken-and-egg problem
› No users/use cases, hence no data
› No data, because no users/use cases
By 2007, some modest progress
› Metadata in HTML: microformats
› Linked Data: simplifying the stack
Web search by 2007
Large classes of queries are solved to perfection
Improvements in web search are harder and harder to come by
› Relevance models, hyperlink structure and interaction data
› Combination of features using machine learning
› Heavy investment in computational power
• real-time indexing, instant search, datacenters and edge services
Language issues
› Multiple interpretations
• jaguar
• paris hilton
› Secondary meaning
• george bush (and I mean the beer brewer in Arizona)
› Subjectivity
• reliable digital camera
• paris hilton sexy
› Imprecise or overly precise searches
• jim hendler
Complex needs
› Missing information
• brad pitt zombie
• florida man with 115 guns
• 35 year old computer scientist living in barcelona
› Category queries
• countries in africa
• barcelona nightlife
› Transactional or computational queries
• 120 dollars in euros
• digital camera under 300 dollars
• world temperature in 2020
Poorly solved information needs remain
Many of these queries would not be asked by users, who have learned over time what search technology can and cannot do.
Web search by 2007
Are there even any true keyword queries?
› Lyrics, quotes and bugs… anything else?
Remaining challenges are not computational, but in modeling user cognition
› Need a deeper understanding of the query, the content and/or the world at large
Microsearch internal prototype (2007)
Personal and private homepage of the same person (clear from the snippet, but it could also be automatically de-duplicated)
Conferences he plans to attend and his vacations, from the homepage plus bio events from LinkedIn
Geolocation
Enhanced Results
Computing abstracts is hard
› Summarization of HTML
• Template detection
• Selecting relevant snippets
• Composing readable text
› Efficiency constraints
Structured data to replace or complement the text summary
› Key/value pairs
› Deep links
› Image or video
Yahoo SearchMonkey (2008)
1. Extract structured data
› Semantic Web markup
• Example:
<span property="vcard:city">Santa Clara</span>
<span property="vcard:region">CA</span>
› Information Extraction
2. Presentation
› Fixed presentation templates
• One template per object type
› Applications
• Third-party modules to display data (SearchMonkey)
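SearchMonkey's first step, pulling structured markup such as the vcard properties above out of HTML, can be illustrated with a minimal sketch using Python's standard `html.parser`. This is an illustration only, not SearchMonkey's actual extractor, and the class name is invented:

```python
from html.parser import HTMLParser

# Minimal sketch (not SearchMonkey's actual extractor): collect RDFa-style
# property/value pairs such as vcard:city from HTML markup.
class RDFaPropertyExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []        # extracted (property, text) pairs
        self._current = None   # property attribute of the open element

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self._current = prop

    def handle_data(self, data):
        if self._current and data.strip():
            self.pairs.append((self._current, data.strip()))
            self._current = None

html = ('<span property="vcard:city">Santa Clara</span>'
        '<span property="vcard:region">CA</span>')
parser = RDFaPropertyExtractor()
parser.feed(html)
print(parser.pairs)  # [('vcard:city', 'Santa Clara'), ('vcard:region', 'CA')]
```

A real deployment would also resolve the `vcard:` prefix against the declared vocabulary and handle nested elements.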
Effectiveness of enhanced results
Explicit user feedback
› Side-by-side editorial evaluation (A/B testing)
• Editors are shown a traditional search result and enhanced result for the same page
• Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)
Implicit user feedback
› Click-through rate analysis
• Long dwell time limit of 100s (Ciemiewicz et al. 2010)
• 15% increase in ‘good’ clicks
› User interaction model
• Enhanced results lead users to relevant documents (IV) even though they are less likely to be clicked than textual results (III)
• Enhanced results effectively reduce bad clicks!
See
› Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734
Adoption among search providers
Google announces Rich Snippets (June 2009)
› Faceted search for recipes (Feb 2011)
Bing tiles (Feb 2011)
Facebook's Like button and the Open Graph Protocol (2010)
› Shows up in profiles and news feed
› Site owners can later reach users who have liked an object
schema.org
Agreement on a shared set of schemas for common types of web content
› Bing, Google, and Yahoo! as initial founders (June 2011)
• Yandex joins schema.org in Nov, 2011
› Similar in intent to sitemaps.org
• Use a single format to communicate the same information to all three search engines
schema.org covers areas of interest to all search engines
› Business listings (local), creative works (video), recipes, reviews and more
› Microdata, RDFa, JSON-LD syntax
Collaborative effort
› Growing number of 3rd-party contributions
› schema.org discussions at [email protected]
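To make the three syntaxes concrete, here is a sketch of a schema.org business listing in the JSON-LD syntax; the business name and values are invented for illustration:

```python
import json

# Hypothetical local-business listing in schema.org JSON-LD, one of the
# three syntaxes (Microdata, RDFa, JSON-LD) search engines accept.
listing = {
    "@context": "http://schema.org",
    "@type": "LocalBusiness",
    "name": "Example Cafe",  # illustrative values, not real data
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Santa Clara",
        "addressRegion": "CA",
    },
    "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.5"},
}

# A publisher would embed this in the page inside
# <script type="application/ld+json"> … </script>
markup = json.dumps(listing, indent=2)
print(markup)
```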
Adoption among publishers
R.V. Guha: Light at the end of the tunnel (ISWC 2013 keynote)
› Over 15% of all pages now have schema.org markup
› Over 5 million sites, over 25 billion entity references
› In other words
• Same order of magnitude as the web
See also
› P. Mika, T. Potter: Metadata Statistics for a Large Web Corpus. LDOW 2012
• Based on Bing US corpus
• 31% of webpages, 5% of domains contain some metadata
› WebDataCommons
• Based on CommonCrawl Nov 2013
• 26% of webpages, 14% of domains contain some metadata
Semantic Search
Active research field at the intersection of IR, NLP, DB and SemWeb
› ESAIR at SIGIR, SemSearch at ESWC/WWW, EOS and JIWES at SIGIR, Semantic Search at VLDB
Exploiting semantic understanding in the retrieval process
› User intent and resources are represented using semantic models
• Not just symbolic representations
› Semantic models are exploited in the matching and ranking of resources
Tasks
› information extraction
› information reconciliation/tracking
› query understanding
› retrieving/ranking entities/attributes/relations
› result presentation
Information extraction and reconciliation
Ontology management
› Editorially maintained OWL ontology with 300+ classes
› Covering the domains of interest of Yahoo
Information extraction
› Automated information extraction
• e.g. wrapper induction
› Metadata from HTML pages
• Focused crawler
› Public datasets (e.g. DBpedia)
› Proprietary data
Data fusion
› Manual mapping from the source schemas to the ontology
› Supervised entity reconciliation
• Kedar Bellare, Carlo Curino, Ashwin Machanavajjhala, Peter Mika, Mandar Rahurkar, Aamod Sane: WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013
• Michael J. Welch, Aamod Sane, Chris Drome: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012
Curation and quality assessment
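The entity reconciliation step can be caricatured as matching incoming records against a knowledge base by name similarity. The sketch below uses a plain string-similarity threshold; the threshold, data and function names are invented, and the cited WOO and CIKM systems are far more sophisticated:

```python
from difflib import SequenceMatcher

# Toy sketch of entity reconciliation: match an incoming record against a
# knowledge base by name similarity, accepting only confident matches.
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def reconcile(record, kb, threshold=0.85):
    """Return the id of the best-matching KB entity, or None."""
    best_id, best_score = None, 0.0
    for entity_id, name in kb.items():
        score = similarity(record["name"], name)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None

kb = {"e1": "Barack Obama", "e2": "Chicago Cubs"}
print(reconcile({"name": "barack obama"}, kb))   # exact match -> 'e1'
print(reconcile({"name": "Chicago Bulls"}, kb))  # no confident match -> None
```

A production system would add blocking to avoid comparing every record against every entity, and a supervised model in place of the single similarity score.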
Yahoo’s Knowledge Graph
[Figure: excerpt of Yahoo's Knowledge Graph. One subgraph links Carlos Zambrano ("plays for") to the Chicago Cubs, the Cubs ("plays in") to Chicago, and Barack Obama ("lives in") to Chicago, with a commercial annotation ("10% off tickets for"). Another links Brad Pitt ("partner") to Angelina Jolie and ("casts in") to Ocean's Twelve and Fight Club; Steven Soderbergh ("directs") and George Clooney ("casts in") connect to Ocean's Twelve, which "takes place in" a city; Fight Club has "music by" the Dust Brothers.]
Nicolas Torzec: Making knowledge reusable at Yahoo!: a Look at the Yahoo! Knowledge Base (SemTech 2013)
Query understanding
~70% of queries contain a named entity
Entity linking in queries and query sessions
› Online as input to ranking
› Semantic log mining
• Laura Hollink, Peter Mika, Roi Blanco: Web usage mining with semantic analysis. WWW 2013: 561-570
See tutorial on Entity Linking and Retrieval by Edgar Meij, Krisztián Balog and Daan Odijk
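A toy illustration of entity linking in queries: greedy longest-match lookup against a small entity dictionary. The dictionary and queries are illustrative only; the systems cited above use much richer statistical models:

```python
# Toy dictionary-based entity linker for queries (illustrative only).
# Maps surface mentions to hypothetical entity identifiers.
ENTITY_DICT = {
    "brad pitt": "Brad_Pitt",
    "barcelona": "Barcelona",
    "paris hilton": "Paris_Hilton",
}

def link_entities(query):
    """Greedily match the longest known entity mentions in a query."""
    tokens = query.lower().split()
    links, i = [], 0
    while i < len(tokens):
        # try the longest span starting at position i first
        for j in range(len(tokens), i, -1):
            mention = " ".join(tokens[i:j])
            if mention in ENTITY_DICT:
                links.append((mention, ENTITY_DICT[mention]))
                i = j
                break
        else:
            i += 1
    return links

print(link_entities("brad pitt zombie movie"))  # [('brad pitt', 'Brad_Pitt')]
```

Real query linkers must also handle ambiguity (which "paris"?) and score candidate entities against the query context.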
Common retrieval tasks in Semantic Search
› entity search (SemSearch 2010/11)
› list search and list completion (SemSearch 2011, TREC ELC task)
› related entity finding (TREC REF-LOD task)
› question answering (QALD 2012/13/14)
Entity Retrieval evaluation
SemSearch challenge (2010/2011)
› Queries
• 50 entity-mention queries selected from the Search Query Tiny Sample v1.0 dataset, provided by the Yahoo! Webscope program
› Data
• Billion Triples Challenge 2009 data set
• Combination of crawls of multiple semantic search engines
› Evaluation
• Mechanical Turk
See report:
› Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran: Repeatable and reliable semantic search evaluation. J. Web Sem. 21: 14-29 (2013)
Glimmer: open-source retrieval engine over RDF data
› Extension of MG4J from the University of Milano
› Indexing
• MapReduce-based
• Horizontal indexing (subject/predicate/object fields)
• Vertical indexing (one field per predicate)
› Retrieval
• BM25F with machine-learned weights for properties and domains
• 52% improvement over the best system in SemSearch 2010
› Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data. International Semantic Web Conference (1) 2011: 83-97
› https://github.com/yahoo/Glimmer/
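Glimmer's retrieval model, BM25F over fielded entity documents, can be sketched as follows. The field weights, documents and parameter values below are invented for illustration, not the machine-learned weights the slide refers to:

```python
# Minimal BM25F sketch over fielded entity documents (illustrative only;
# field weights here are made up, not Glimmer's learned weights).
def bm25f_score(query_terms, doc_fields, weights, avg_len,
                k1=1.2, b=0.75, idf=None):
    """doc_fields: {field: [tokens]}; weights: {field: float}."""
    idf = idf or {}
    score = 0.0
    for term in query_terms:
        # field-weighted, length-normalized term frequency
        wtf = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            norm = 1.0 + b * (len(tokens) / avg_len[field] - 1.0)
            wtf += weights[field] * tf / norm
        # BM25 saturation, scaled by the term's inverse document frequency
        score += (wtf / (k1 + wtf)) * idf.get(term, 1.0)
    return score

doc = {"subject": ["barack", "obama"], "object": ["president", "obama"]}
weights = {"subject": 2.0, "object": 1.0}
avg_len = {"subject": 2.0, "object": 2.0}
print(bm25f_score(["obama"], doc, weights, avg_len))
```

The key idea is that term frequencies are combined across fields with per-field weights before the BM25 saturation is applied, so a hit in a heavily weighted field (e.g. the subject/name) counts more than one in a long property value.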
Entity-seeking queries make up 40-50% of the query volume
› Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010: 771-780
› Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects: actions for entity-centric search. WWW 2012: 589-598
Show a summary of the most likely information needs
› Including related entities for navigation
› Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013
Application: entity displays in web search
Application: personalization in online news
Entity linking
Entity ranking according to relevance to the document
New applications
Mobile search on the rise
Information access on-the-go requires hands-free operation
› Driving, walking, gym, etc.
• Americans spend 540 hours a year in their cars [1] vs. 348 hours browsing the Web [2]
~50% of queries are coming from mobile devices (and growing)
› Changing habits, e.g. iPad usage peaks before bedtime
› Limitations in input/output
[1] http://answers.google.com/answers/threadview?id=392456
[2] http://articles.latimes.com/2012/jun/22/business/la-fi-tn-top-us-brands-news-web-sites-20120622
Mobile search challenges and opportunities
Interaction
› Question-answering
› Support for interactive retrieval
› Spoken-language access
› Task completion
Contextualization
› Personalization
› Geo
› Context (work/home/travel)
• Try getaviate.com
Interactive, conversational voice search
Parlance EU project
› Complex dialogs within a domain
• Requires complete semantic understanding
Complete system (mixed license)
› Automated Speech Recognition (ASR)
› Spoken Language Understanding (SLU)
› Interaction Management
› Knowledge Base
› Natural Language Generation (NLG)
› Text-to-Speech (TTS)
Video
Task completion
We would like to help our users in task completion
› But we have trained our users to talk in nouns
• Retrieval performance decreases by adding verbs to queries
› We need to understand what the available actions are
Ongoing work in schema.org on modeling actions
› Understand what actions can be taken on a page
› Help users in mapping their query to potential actions
› Applications in web search, email, etc.
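A hedged sketch of what such action markup might look like in schema.org JSON-LD: a page declaring that a watch action can be taken on it. The movie name, URL and property values are invented for illustration:

```python
import json

# Hypothetical schema.org Action markup (illustrative values): a page
# declares that a WatchAction can be taken on the entity it describes.
action_markup = {
    "@context": "http://schema.org",
    "@type": "Movie",
    "name": "Example Movie",  # invented example, not real data
    "potentialAction": {
        "@type": "WatchAction",
        "target": "http://example.com/watch",
    },
}
print(json.dumps(action_markup, indent=2))
```

Given such markup, a search engine or mail client can map a query or message directly to the declared action instead of a plain document link.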
Applications
Email (Gmail)
SERP (Yandex)
Q&A
Many thanks to members of the Semantic Search team at Yahoo Labs Barcelona and to Yahoos around the world
Contact me
› [email protected]
› @pmika
› www.slideshare.net/pmika
› Ask about our internships and other opportunities