joint international workshop on entity-oriented and semantic search (jiwes) 2012 @ sigir conference

63
Simple Models, Lots of Data: Mining semantics about entities using Web-Scale Data Kaushik Chakrabarti Joint International Workshop on Entity- oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Upload: colby-winward

Post on 31-Mar-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Simple Models, Lots of Data: Mining semantics about entities using Web-Scale Data

Kaushik Chakrabarti

Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Page 2: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Computers Doing Complex Tasks

Today, computers are doing complex, “human” tasksMachine translationSpeech recognitionComputer VisionSpelling correctionWeb RankingSpam FilteringEntity Semantics Mining…

No neat, concise algorithms Cannot be explained by neat formulas like F=ma Or elegant algorithm like Dijkstra’s shortest path algorithm

Page 3: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Initial attempts: Using algorithms, rules and heuristics

Machine translation:Hand-coded rules that capture underlying structure of language

Speed recognition:Rules, heuristics and signal processing tricks

Named entity recognitionHand-crafted grammars by experienced computation linguists

Page 4: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Initial attempts: Using algorithms, rules and heuristics

Not driven by dataWhy?

Belief: human intelligence can be encapsulated in algorithms and hand-crafted rulesData was not available, computers did not have capacity (kilobytes of memory, megabytes of disk)

Page 5: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Emergence of Data-driven systems

Machine translation:Large training set of input-output pairs availableInitially, data-driven systems were worse (till early 2000s)Rapid progress in quality: getting better with more training data, already exceeds best rule-based systemsCan expand to new domains rapidly

Speech recognition:Large training sets availableInitially, data-driven methods (HMMs) were worse (till late 1980s)Rapid progress in quality with more training data; have become the norm

Page 6: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Outline of talk

Entity semantics MiningEarlier effortsData-driven approaches

Case Study 1: Entity synonymsCase Study 2: Entity attribute discoveryCase Study 3: Entity augmentationCase Study 4: Entity linkingCase Study 5: Entity tagging

Outlook and challenges

Page 7: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Why mine semantics about entities?Entities are everywhere

Data:Most data on the web and inside the enterprise are about entitiesProduct catalogs in e-tailersTables on the web Tables within an enterpriseKnowledgebases (Wikipedia, Freebase)…

Tasks:Most consumer and enterprise tasks are about entitiesSearching for products, movies, places, hotels, restaurants, ….Business Data AnalysisSocial Media Analytics…

RedmondBellevueAuburnKirklandBothellEverettSeattle

#stores0100001

2011 sales0

43560000

8176

Population Income Population under 10

Page 8: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Examples of Entity Semantics MiningGiven an entity (or set of entities), what are other ways it is referred to?

Entity SynonymsGiven a set of entities, what are the interesting attributes of these entities?

Entity Attribute DiscoveryGiven a set of entities and an attribute, what are values of the entities on that attribute?

Entity AugmentationGiven a set of entities and a text corpus, what are the semantic mentions of the entity in the corpus?

Entity LinkingGiven a set of entities, find phrases (“tags”) describing those entities?

Entity Tagging

Page 9: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Outline of talk

Entity semantics MiningEarlier effortsData-driven approaches

Case Study 1: Entity synonymsCase Study 2: Entity attribute discoveryCase Study 3: Entity augmentationCase Study 4: Entity linkingCase Study 5: Entity tagging

Outlook and challenges

Page 10: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Earlier Efforts Towards Entity Semantics

Entity recognizersHand-crafted rules by experience computational linguists

Entity synonymsManually discovered by domain-experts

Entity information gathering (attribute discovery, augmentation)

Writing wrappers for specific web sitesNot data driven: used rules, heuristics rather than data

Page 11: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Outline of talk

Entity semantics MiningEarlier effortsData-driven approaches

Case Study 1: Entity synonymsCase Study 2: Entity attribute discoveryCase Study 3: Entity augmentationCase Study 4: Entity linkingCase Study 5: Entity tagging

Outlook and challenges

Page 12: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Why entity synonyms?

Users search for entities:E-tailers like Walmart:

allow users to search for products

Information providers like YellowPages:

allow users to search for local businesses

Online music stores like iTunes and Zune Marketplace:

allow users to search for songs/albums/artists

Page 13: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Why entity synonyms?

Powered by enterprise search engines like FAST and EndecaUsers refer to the same entity in many different ways

Without this knowledge, search fails to return relevant resultsBad user experience, loss of web site conversions and sales, customer attrition

Cannot treat them as strings, need to capture semantic relationship

Entity Alternative ways to refer to that entity

canon 400d canon rebel xti, cannon 400d

harry potter and the half blood prince

harry potter 6, half blood prince, harry potter and the half blooded prince

washington state department of licensing

wa dol, dol washington, wa dmv, wash state dept of licensing

university of washington uw, uofw, univ of wa, univ of washington

Page 14: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Discovering SynonymsPrevious approach: domain experts provide synonyms manually

Tedious, costly: e-tailers have thousands, even millions of entitiesNew entities keep coming, old entities get obsolete, hard to keep up

Main Insight:Search engine’s query click log captures this => if two different search queries refer to same entity, significant overlap among the links clicked for the two queries. Can we leverage it to find synonyms automatically?Requirements:

Domain independent, high quality, scalable algorithm

Page 15: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Step 1: Generation of Candidate Synonyms

microsoft excel

Entity

Clicked UrlsCandidate Synonyms

http://blogs.office.com/b/microsoft-excel

http://office.microsoft.com/en-us/excel/

http://en.wikipedia.org/wiki/Microsoft_Excelmicrosoft excel

Query

microsoft word

excel spreadsheet

ms excel tutorial

ms office

Page 16: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Step 2: Filtering of Candidate SynonymsLook for strong semantic relationship in both

directionsFilter out candidates that are weakly related

“microsoft word”

Filter out candidates that are not exclusively related “ms office”

http://blogs.office.com/b/microsoft-excel

http://office.microsoft.com/en-us/excel/

http://en.wikipedia.org/wiki/Microsoft_Excelmicrosoft excel

Query

microsoft word

excel spreadsheet

ms excel tutorial

ms office

Page 17: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Not good enough: Click log sparsity

𝑑1

𝑑2

microsoft spreadsheet

ms excel

ms spreadsheet

microsoft excel

Pseudo-document similarity (“A Framework for Robust Discovery of Entity Synonyms”, KDD 2012)

Page 18: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Not good enough: Not of same semantic type

“microsoft excel” and “ms excel tutorial” are very strongly related but not of the same “type”

Software vs. tutorial

A synonym (of an entity) should beStrongly related in both directions (measured by pseudo document similarity)Of the same type

Page 19: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Context Similarity

IntuitionIf X and Y are entities of the same type, X and Y are exchangeable under various contexts

Where to find contexts?Query logs

Examplesmicrosoft excel help ms spreadsheet help powerpoint help microsoft excel download ms spreadsheet download powerpoint download

Page 20: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Construct Context Vectors

microsoft excel

ms spreadsheet

ms excel tutorial

Entities and candidates

Other Queries

microsoft excel download

microsoft excel help

microsoft excel

ms spreadsheet download

ms spreadsheet help

ms excel tutorial class

ms excel tutorial ppt

Entities/Candidates

Contexts

microsoft excel

download

microsoft excel help

microsoft excel review

ms spreadsheet

download

ms spreadsheet

help

ms excel tutorial

class

ms excel tutorial

ppt

Page 21: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Same type => High Context Similarity

class, ppt, guide, …

download, help, review, install, …

download, help, install, …

Context from query log

ms spreadsheetTrue Synonym

ms spreadsheetms excel tutorial

microsoft excel

Page 22: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Not good enough: Long entity names

Canon EOS 350D - digital camera, 8MP, 3x Optical Zoom

Entity

Query

Page 23: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

SummaryLots of data available anyway

Similar to machine translation, speech recognitionThe main insight is how to connect the problem with that data

Insight is important, algorithms are relatively simpleA classifier with less than 10 featuresAchieves ~90% precision, ~3 synonyms per entity for products

Taking care of details important for high accuracyClick log sparsity, context similarity, long entity names

Need to scale to large query logsMapReduce implementation

Further details: “A Framework for Robust Discovery of Entity Synonyms”, KDD 2012

Page 24: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Outline of talk

Entity semantics MiningEarlier effortsData-driven approaches

Case Study 1: Entity synonymsCase Study 2: Entity attribute discoveryCase Study 3: Entity augmentationCase Study 4: Entity linkingCase Study 5: Entity tagging

Outlook and challenges

Page 25: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Information gathering

Need to gather information for entities to make a decision

Consumer space:Research for products (say, cameras, cars, appliances)Research for stocks….

Enterprise space:Information workers, business data analyticsApplication developers, mashup creators

Mostly manual today, very labor intensiveCan we automate it?Semantic task: need to understand entities, attributes and values, not just strings

Page 26: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

3 APIs for Information Gathering

Input Query Table

Output Table

Augmentation By Attribute Name (ABA)

Augmentation By Example (ABE)

Attribute Discovery (AD)

Model Brand

S80

A10

GX-1S

T1460

Model Brand

S80 Nikon

A10 Canon

GX-1S Samsung

T1460 Benq

S80 Nikon

A10 Canon

GX-1S

T1460

S80 NikonA10 Canon

GX-1S SamsungT1460 Benq

S80A10

GX-1ST1460

brand | make | manufacturer | mfr

resolution | mp | megapixel | res

price | retail price | offerzoom | optical zoom

Page 27: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Main InsightTables on the web have a lot of structured information about entities and their attributesCan we leverage it?This data is available anyway

Similar to machine translation, speech recognition, entity synonymsThe main insight is to connect the problem with that data

Page 28: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Initial focus

Focus on relational tablesDetails in paper:

“InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables”, SIGMOD 2012

Page 29: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80

A10

GX-1S

T1460

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Page 30: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80

A10

GX-1S

T1460

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Page 31: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80 Benq

A10

GX-1S

T1460

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Page 32: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80

A10

GX-1S

T1460

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Page 33: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80

A10 Innostream

GX-1S

T1460

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Page 34: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80

A10

GX-1S

T1460

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Page 35: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80

A10

GX-1S Samsung

T1460

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Page 36: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80 Benq

A10 Innostream

GX-1S Samsung

T1460

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Direct MatchingApproach (DMA)

Page 37: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80 Benq

A10 Innostream

GX-1S Samsung

T1460

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Part No Mfg

DSC W570 Sony

T1460 Benq

Optio E60 Pentax

S8100 Nikon

0.5

0.5

0.25

0.2

Indirect matchingtable

Page 38: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80 Benq

A10 Innostream

GX-1S Samsung

T1460 Benq

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Part No Mfg

DSC W570 Sony

T1460 Benq

Optio E60 Pentax

S8100 Nikon

0.5

0.5

0.25

0.2

Page 39: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80 Benq

A10 Innostream

GX-1S Samsung

T1460 Benq

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

T1

T2

T3

Part No Mfg

DSC W570 Sony

T1460 Benq

Optio E60 Pentax

S8100 Nikon

0.5

0.5

0.25

0.2

Page 40: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Model Brand

S80 Benq Nikon

A10 InostreamCanon

GX-1S Samsung

T1460 Benq

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Part No Mfg

DSC W570 Sony

T1460 Benq

Optio E60 Pentax

S8100 Nikon

0.5

0.5

0.25

0.2

Page 41: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

How to model holistic match?

Topic Sensitive Pagerank

Page 42: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Data Model

Entity-Attribute Binary (EAB) relationsThe Query table is an EAB relationWeb tables are split into EAB relations

URL, title, context retained

Model Brand

S80

A10

GX-1S

T1460

Query Table

Model Brand Res Optical Zoom

S80 Nikon 14.1mp 5x

Easyshare CD44

Kodak 12mp 3x

DSC W570 Sony 16.1 5x

Optio E60 Pentax 10.1 3x

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Res

S80 14.1mp

Easyshare CD44 12mp

DSC W570 16.1

Optio E60 10.1

Model Optical Zoom

S80 5x

Easyshare CD44 3x

DSC W570 5x

Optio E60 3x

Web Table

Page 43: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Augmentation Framework2 steps:

Identify Matching TablesPredict Values Using Matching Tables

Model Brand

S80 Nikon

A10 Canon

GX-1S Samsung

T1460 Benq

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Part No Mfg

DSC W570 Sony

T1460 Benq

Optio E60 Pentax

S8100 Nikon

0.5

0.5

0.25

0.2

Page 44: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Problem to solve: Holistic matching of EAB Tables

Model Brand

S80 Nikon

A10 Canon

GX-1S Samsung

T1460 Benq

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

Part No Mfg

DSC W570 Sony

T1460 Benq

Optio E60 Pentax

S8100 Nikon

0.5

0.5

0.25

0.2

Query Table

Directly matching EAB tables = Topic

nodesDMA Scores =

Preference Scores (β)

Nodes = EAB tables

Edges = Semantic Similarity

0.25

0.5

0.5

Page 45: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Query Table

Model Brand

S80 Nikon

Easyshare CD44 Kodak

DSC W570 Sony

Optio E60 Pentax

Part No Mfg

DSC W570 Sony

T1460 Benq

Optio E60 Pentax

S8100 Nikon

Model Brand

S80 Benq

A10 InnostreamA460 Samsung

S710 HTC

0.25

0.5

Model Brand

SP 600UZ Olympus

DSC W570 Sony

GX-1S Samsung

A10 Canon

0.5

0.5

0.5

T1

T2

T3

T4

0.2

Model Brand

S80

A10

GX-1S

T1460

Page 46: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

SummaryHeavy use of preprocessing:

Compute full personalized pagerank (FPPR) of semantic similarity graph upfrontCompute TSP scores on the fly

Scalability is a huge challenge:How to build the SS graph?

> 1billion nodes

How to compute FPPR?Use algorithm from Bahmani et. al. (SIGMOD 2011)

Leveraging huge amount of data available anywayAs in MT, Speech Recognition

Insight is important, algorithms are relatively simpleDetails in paper: “InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables”, SIGMOD 2012

Page 47: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Outline of talk

Entity semantics MiningEarlier effortsData-driven approaches

Case Study 1: Entity synonymsCase Study 2: Entity attribute discoveryCase Study 3: Entity augmentationCase Study 4: Entity linkingCase Study 5: Entity tagging

Outlook and challenges

Page 48: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Entity Linking ProblemGiven a “catalog” of entities and a set of documents, find mentions of those entities in those documents

Many applicationsMain problem is identify true mentions among candidate mentions (refer it to as “targeted disambiguation”)

Entity Id Entity Name

e1 Microsoft

e2 Apple

e3 HP

Microsoft and Apple are the developers of three of the most popular operating systemsApple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet

strategy Audi is offering a racing version of its hottest TT

model: a 380 HP, front-wheel …

Target entities

d1

d2

d3

d4

d5

Candidate Mentions

Page 49: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Existing solutions

Solutions exist when entity catalog is “rich” with description about each entity

WikipediaWhat if these entities are enterprise entities?

Not in Wikipedia, referred to as “ad hoc entities”Documents can be enterprise documents or web documents

Again, leverage the data!Assume the entities in the catalog are homogeneous (of the same “type”)

Page 50: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (1)

Microsoft and Apple are the developers of three of the most popular operating systems

Apple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet strategy

Audi is offering a racing version of its hottest TT model: a 380 HP, front-wheel …

Similar

Context Similarity

Page 51: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (1)

Microsoft and Apple are the developers of three of the most popular operating systems

Apple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet strategy

Audi is offering a racing version of its hottest TT model: a 380 HP, front-wheel …

Similar

Context Similarity

Page 52: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (1)

Microsoft and Apple are the developers of three of the most popular operating systemsApple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for

the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet

strategy Audi is offering a racing version of its hottest TT

model: a 380 HP, front-wheel …

Dissimilar

Context Similarity

Page 53: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (1)

Microsoft and Apple are the developers of three of the most popular operating systemsApple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for

the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet

strategy Audi is offering a racing version of its hottest TT

model: a 380 HP, front-wheel …

Dissimilar

Context Similarity

Page 54: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (1)Hypothesis: • context between two

true mentions (of any entities) is more similar than between two false mentions (of distinct entities) as well as between a true mention and a false mention

• Detail: the context of false mentions can be similar among themselves within an entity

Page 55: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (2)

Microsoft and Apple are the developers of three of the most popular operating systemsApple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet

strategy Audi is offering a racing version of its hottest TT

model: a 380 HP, front-wheel …

High confidence

Co-mention

Page 56: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (3)

Microsoft and Apple are the developers of three of the most popular operating systemsApple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet

strategy Audi is offering a racing version of its hottest TT

model: a 380 HP, front-wheel …

True

Interdependency

Page 57: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (3)

Microsoft and Apple are the developers of three of the most popular operating systemsApple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for

the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet

strategy Audi is offering a racing version of its hottest TT

model: a 380 HP, front-wheel …

True

True

Interdependency

Page 58: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (3)

Microsoft and Apple are the developers of three of the most popular operating systemsApple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for

the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet

strategy Audi is offering a racing version of its hottest TT

model: a 380 HP, front-wheel …

True

True

True

Interdependency

Page 59: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Insight – leverage homogeneity (3)

Microsoft and Apple are the developers of three of the most popular operating systemsApple trees take four to five years to produce their first fruit…

Microsoft’s new operating system, Windows 8, is a PC operating system for

the tablet age …

CEO Meg Whitman said that HP is focusing on Windows 8 for its tablet

strategy Audi is offering a racing version of its hottest TT

model: a 380 HP, front-wheel …

True

True

True

False

False

Interdependency

Page 60: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Modeling using Graph (MentionRank)

Prior score of each candidate mention: co-mentionEdge weight: context similarityInterdependency modeling using PageRank-style propagation

Page 61: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Summary

Using content in the documents to solve hard problemInsights are important, algorithms are simpleScalability is a huge challengeFurther details: “Targeted Disambiguation of Ad-hoc, Homogeneous Sets of Named Entities”, WWW 2012

Page 62: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Outlook and ChallengesMining semantics of entities is critical

Both on the web and within the enterpriseE.g., entity synonyms, entity attribute discovery, entity augmentation, entity linkingMany, many more problems like this…

Lots of data on the web, within the enterpriseLeverage the data!

Main challenges:Not complex algorithmsBut connecting the desired semantics with the right data (“Statistical Semantics”)And scalability (terabytes of query log, billions of web pages, 100s of millions of web tables, etc.)

Page 63: Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012 @ SIGIR Conference

Thank you!

Vivek Narasayya

Surajit Chaudhuri

Zhimin Chen

Kris Ganjam

Kaushik Chakrabarti

Tao Cheng