Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids
Marek Lipczak, Arash Koushkestani
Evangelos Milios
Problem definition
The goal of Entity Recognition and Disambiguation (ERD)
□ Identify mentions of entities
□ Link the mentions to a relevant entry in an external knowledge base
□ The knowledge base is typically a large subset of Wikipedia articles
Example:
The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip.
Recognition and Disambiguation
The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip.
Recognition
□ Is this a valid mention of an entity present in the knowledge base?
Disambiguation
□ Which of the potential entities (senses) is correct?
Default sense: the entity with the largest number of wiki-links that use the mention as anchor text
□ Tulip focuses on default sense entities
□ Main goal is to recognize whether the default sense is consistent with the document
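The default-sense definition above can be sketched as a simple count over (anchor text, target entity) pairs. This is only an illustration: the function name `build_default_senses` and the sample link data are hypothetical, not taken from Tulip.

```python
from collections import Counter, defaultdict

def build_default_senses(wiki_links):
    """Map each anchor text to the entity it links to most often."""
    counts = defaultdict(Counter)
    for anchor, entity in wiki_links:
        counts[anchor][entity] += 1
    # most_common(1) returns the single (entity, count) pair with the highest count
    return {anchor: c.most_common(1)[0][0] for anchor, c in counts.items()}

# Made-up (anchor text, linked article) pairs, for illustration only.
links = [
    ("Intel", "Intel"), ("Intel", "Intel"), ("Intel", "Intelligence_assessment"),
    ("Gold", "Gold"), ("Gold", "Gold_(color)"), ("Gold", "Gold"),
]
default_senses = build_default_senses(links)
```

With real data the pairs would come from a Wikipedia link dump; ties are broken arbitrarily in this sketch.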
Our background
Visual Text Analytics Lab
□ Some experience with using ERD systems
□ No experience implementing ERD systems
Key issue with state-of-the-art systems: obvious false positive mistakes
□ Visualize Prof. Smith's research interests: Data Mining, Machine Learning, 50 Cent
Our goal: minimize the number of false positives
Tulip – system overview
Spotter
□ Find all mentions of entities in the text (Solr Text Tagger)
□ Special handling for personal names
Recognizer
□ Retrieve profiles of spotted entities (from Sunflower)
□ Generate a topic centroid representing the document
□ Select entities consistent with the document
Solr Text Tagger
Solr (Lucene) is a text search engine
□ Indexes textual documents
□ Retrieves documents for keyword-based queries
Solr Text Tagger
□ Indexes entity surface forms stored in a lexicon, e.g., Baltimore Ravens, Ravens, Baltimore (…)
□ Uses full text documents as queries
□ Finds all entity mentions in the document
□ Retrieves the mentioned entities (candidate selection)
□ Implemented based on Solr's Finite State Transducers, by David Smiley and Rupert Westenthaler (thanks!)
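As a rough illustration of what tagging against a lexicon means, the sketch below finds every word sequence in a text that matches a known surface form and returns its candidate entities. It uses a plain dictionary lookup rather than finite state transducers, it is not the Solr Text Tagger API, and the lexicon contents and the `tag` function are assumptions made for the example.

```python
import re

# Hypothetical lexicon: surface form (tuple of lower-cased words) -> candidate entities.
LEXICON = {
    ("baltimore", "ravens"): ["Baltimore_Ravens"],
    ("ravens",): ["Baltimore_Ravens", "Raven"],
    ("baltimore",): ["Baltimore", "Baltimore_Ravens"],
}
MAX_LEN = max(len(key) for key in LEXICON)  # longest surface form, in words

def tag(text):
    """Return (start_word, end_word, surface, candidates) for every lexicon match."""
    words = re.findall(r"\w+", text.lower())
    mentions = []
    for start in range(len(words)):
        # Try every window of up to MAX_LEN words starting at this position.
        for end in range(start + 1, min(start + MAX_LEN, len(words)) + 1):
            key = tuple(words[start:end])
            if key in LEXICON:
                mentions.append((start, end, " ".join(key), LEXICON[key]))
    return mentions

mentions = tag("The Baltimore Ravens won in Baltimore.")
```

Lower-casing the text is a simplification here; a real spotter would need the original casing for later filtering.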
Building the lexicon
Three sources of entity surface forms (external datasets)
□ Entity names (from Freebase)
□ Wiki-links anchor text (from Wikipedia)
□ Web anchor text (from Google's Wikilinks corpus)
Special handling of personal names
□ “Jack” and “London” are not allowed as surface forms for Jack London
□ Instead they are indexed as “generic” personal names and will be matched only if Jack London is mentioned by his full name
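One possible reading of the personal-name rule, sketched under assumptions: a lone name part such as "London" is linked to a person only when that person's full name also occurs in the document. The tables `FULL_NAMES` and `NAME_PARTS` and the function `resolve_name_part` are hypothetical names for this illustration, and the substring check is a deliberate simplification.

```python
# Hypothetical table of full personal names and their entities.
FULL_NAMES = {"jack london": "Jack_London", "michael kors": "Michael_Kors"}

# Generic personal-name parts, each pointing at the full names that contain it.
NAME_PARTS = {}
for full_name in FULL_NAMES:
    for part in full_name.split():
        NAME_PARTS.setdefault(part, set()).add(full_name)

def resolve_name_part(part, document):
    """Link a lone name part only when its full name occurs in the document."""
    text = document.lower()
    candidates = sorted(NAME_PARTS.get(part.lower(), ()))
    matches = [FULL_NAMES[name] for name in candidates if name in text]
    # Link only when exactly one full name supports the partial mention.
    return matches[0] if len(matches) == 1 else None
```

For example, `resolve_name_part("London", "Jack London wrote novels. London died in 1916.")` resolves to the person, while the same call on "She moved to London." returns `None`.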
Flagging suspicious surface forms (e.g., “It” - Stephen King's novel)
□ stop-word filter marks all stop-words or phrases composed of stop-words (e.g., This is)
□ Wiktionary filter marks all common nouns, verbs, adjectives, etc. found in Wiktionary
□ lower-case filter marks all lower-case words or phrases
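The three filters can be sketched as a function that returns the set of flags raised for a surface form. The word lists below are tiny stand-ins (the real system would use a full stop-word list and Wiktionary), and `suspicious_flags` is a hypothetical name.

```python
STOP_WORDS = {"it", "this", "is", "the", "a", "of"}   # tiny illustrative list
COMMON_WORDS = {"gold", "oil", "fall", "selling"}      # stand-in for Wiktionary entries

def suspicious_flags(surface_form):
    """Return the names of the filters that mark this surface form as suspicious."""
    flags = set()
    words = surface_form.split()
    # Stop-word filter: every word of the phrase is a stop-word.
    if words and all(word.lower() in STOP_WORDS for word in words):
        flags.add("stop-word")
    # Wiktionary filter: the form is a common dictionary word.
    if surface_form.lower() in COMMON_WORDS:
        flags.add("wiktionary")
    # Lower-case filter: the form contains no capital letters.
    if surface_form == surface_form.lower():
        flags.add("lower-case")
    return flags
```

A form such as "Cisco Systems" raises no flags, while "oil" is marked by both the Wiktionary and lower-case filters.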
Spotter – example
The [1] (...) [97] selling offsets decent earnings from Cisco Systems [1] and Home Depot [1]. Techs fall (1) (...) [7], led by Microsoft [1] (...) [13] and Intel [1] (...) [9]. Michael Kors [1] rises. Gold (1) (...) [31] and oil slip.
Default sense for all mentions (Freebase only)
Default sense for all mentions (Freebase + Wikipedia)
Suspicious mentions removed
How can we remove Michael Kors and bring back Home Depot?
□ Relatedness of entities to the document
Relatedness score
The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip.
Our solution
□ Retrieve a profile of every entity mentioned in the text
□ Agglomerate the profiles in a centroid representing the document
□ Check which entities are coherent with the topics (relatedness score)
How strongly is each spotted entity related to the document?
□ How do we create the entity profiles?
Relatedness – Sunflower
A concept graph built from a unified category graph across 120 Wikipedia language versions
□ Each language version acts as a witness for the importance of a stored relation
Compact and accurate category profiles for all Wikipedia articles
□ Removal of unimportant categories
□ Inference of more general categories
Sunflower – from graph to term profile
Sunflower graph is:
□ Directed
□ Weighted (importance score)
□ Sparse (only k most important links per node)
Category-based profile is a sparse, weighted term vector
□ All categories at distance < d
□ Term weights based on edge weights
□ E.g., k = 3, d = 2
□ Path weight is the product of edge weights: w(Intel → Comp. of US → Ec. of US) = 0.42 * 0.27 = 0.11
□ Category weight is the sum of path weights: w(Ec. of US) = 0.11 + 0.19 = 0.3
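The path and category weights can be reproduced with a short graph walk. This is a sketch under assumptions: the graph is taken to be pruned to k links per node already, paths of up to d edges are followed (the worked example uses a two-edge path with d = 2), and the second intermediate category and its edge weights are invented so that the second path weighs 0.19 as on the slide.

```python
from collections import defaultdict

# Toy fragment of a Sunflower-like graph. Only the edges 0.42 and 0.27 come from
# the slide; "Semiconductor companies" and its weights are made up for illustration.
GRAPH = {
    "Intel": {"Companies of the US": 0.42, "Semiconductor companies": 0.38},
    "Companies of the US": {"Economy of the US": 0.27},
    "Semiconductor companies": {"Economy of the US": 0.50},
}

def category_profile(graph, entity, d=2):
    """Sum, per category, the products of edge weights over all paths of up to d edges."""
    profile = defaultdict(float)
    frontier = [(entity, 1.0)]
    for _ in range(d):
        next_frontier = []
        for node, path_weight in frontier:
            for category, edge_weight in graph.get(node, {}).items():
                weight = path_weight * edge_weight   # path weight: product of edge weights
                profile[category] += weight          # category weight: sum of path weights
                next_frontier.append((category, weight))
        frontier = next_frontier
    return dict(profile)

profile = category_profile(GRAPH, "Intel", d=2)
```

Here `profile["Economy of the US"]` is 0.42 * 0.27 + 0.38 * 0.50 = 0.3034, which the slide rounds to 0.3.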
Topic centroids in Tulip
Retrieve category-based profiles for all default senses (example next slide)
Topic Centroid Generation
□ Centroid is a linear combination of entity profiles
□ Default senses of non-suspicious mentions only (entity core)
Topic Centroid Refinement
□ Entities far from the centroid are removed from the core
□ Cosine similarity with predefined threshold tcoh = 0.2
Entity Scoring
□ Relatedness score assigned to each default sense entity (including suspicious mentions)
System output
□ Entities with score > tcoh
□ Entity with best relatedness score for each mention
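The generation, refinement, and scoring steps can be sketched on toy data as follows. Several choices here are assumptions rather than Tulip's actual implementation: the centroid is an unweighted sum of profiles, refinement is a single pass, the profiles and the set of suspicious entities are invented, and the per-mention selection of the best-scoring entity is left out. The threshold tcoh appears as `t_coh`.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(profiles):
    """Unweighted sum of profiles (one possible linear combination)."""
    total = defaultdict(float)
    for profile in profiles:
        for term, weight in profile.items():
            total[term] += weight
    return dict(total)

def recognize(profiles, suspicious, t_coh=0.2):
    # Generation: build the centroid from non-suspicious entities (the entity core).
    core = [e for e in profiles if e not in suspicious]
    c = centroid(profiles[e] for e in core)
    # Refinement: drop core entities far from the centroid, then rebuild it.
    core = [e for e in core if cosine(profiles[e], c) >= t_coh]
    c = centroid(profiles[e] for e in core)
    # Scoring: score every entity, including suspicious ones, against the centroid.
    scores = {e: cosine(p, c) for e, p in profiles.items()}
    return {e: s for e, s in scores.items() if s > t_coh}

# Invented category profiles for the running example.
profiles = {
    "Cisco_Systems": {"Technology companies": 0.6, "Economy of the US": 0.3},
    "Microsoft": {"Technology companies": 0.7, "Economy of the US": 0.2},
    "Intel": {"Technology companies": 0.5, "Economy of the US": 0.3},
    "Michael_Kors": {"Fashion designers": 0.3},
    "The_Home_Depot": {"Retail companies": 0.5, "Economy of the US": 0.4},
}
# Pretend the spotter flagged Home Depot as suspicious.
result = recognize(profiles, suspicious={"The_Home_Depot"})
```

In this toy run the fashion designer falls out of the core during refinement, while the flagged retailer is brought back because it shares the economy category with the centroid.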
Challenge results
Tulip got second place in the long track
□ Category-based topic centroids – promising solution for relatedness
□ Top recall among all submitted systems (?!)
□ Lowest latency among all submitted systems
Lightweight ERD
Entity Recognition and Disambiguation is typically just a single step in a more complex document processing system
To be practical, an ERD system has to be lightweight:
□ Fast – lowest latency among all competing systems, throughput of over 200 documents per minute
□ Adaptable – both Solr Text Tagger and Sunflower can be easily adapted to changing data
□ Compact – the full system requires less than 4 GB of operational memory and uses no external data repositories
The importance of default sense
Analysis on 50 documents with ground-truth data (1166 entities)
85% of mentions that can be disambiguated should be disambiguated with the default sense
□ Another 5% are explicitly disambiguated by another mention in the document (e.g., E72 and Nokia E72)
By focusing on the default sense, Tulip missed < 5% of entities
Conclusions
Wikipedia-based category profiles can be used to determine the relatedness of an entity to the topics of a document
The small size of category profiles allows the system to represent the aggregated topics of the document in the form of a centroid, which simplifies the recognition process
The pruning of suspicious mentions and the focus on default sense entities help Tulip build precise document centroids that can be further used to clean or expand the set of returned entities
The accuracy of extracted entities depends more on the successful recognition of correct entity mentions than on their disambiguation
Project website: http://www.cs.dal.ca/~lipczak/erd/
Solr Text Tagger
Two-level Finite State Transducer approach
□ Word to index (each edge is a letter)
□ Surface form to list of entities (each edge is a word)
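The two-level idea can be imitated with nested dictionaries standing in for the transducers: a first structure walks letters to map a word to an integer index, and a second walks word indices to map a surface form to its entities. This is a simplified sketch, not how Lucene's FSTs are implemented, and all function names are hypothetical.

```python
def trie_insert(trie, keys, value):
    """Walk or create one edge per key, then store the value at the final node."""
    node = trie
    for key in keys:
        node = node.setdefault(key, {})
    node.setdefault(None, []).append(value)   # None marks an accepting state

def trie_lookup(trie, keys):
    """Follow one edge per key; return the stored values or an empty list."""
    node = trie
    for key in keys:
        node = node.get(key)
        if node is None:
            return []
    return node.get(None, [])

word_trie, phrase_trie, word_ids = {}, {}, {}

def add_surface_form(surface, entity):
    ids = []
    for word in surface.lower().split():
        if word not in word_ids:
            word_ids[word] = len(word_ids)
            trie_insert(word_trie, word, word_ids[word])   # level 1: each edge is a letter
        ids.append(word_ids[word])
    trie_insert(phrase_trie, ids, entity)                   # level 2: each edge is a word

def lookup(surface):
    ids = []
    for word in surface.lower().split():
        found = trie_lookup(word_trie, word)
        if not found:
            return []          # unknown word, so no surface form can match
        ids.append(found[0])
    return trie_lookup(phrase_trie, ids)

add_surface_form("Baltimore Ravens", "Baltimore_Ravens")
add_surface_form("Ravens", "Baltimore_Ravens")
add_surface_form("Ravens", "Raven")
```

Storing word indices rather than strings in the second level keeps the surface-form structure compact, which is presumably part of the appeal of the two-level design.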