PhD Day: Entity Linking using Ontology Modularization
DESCRIPTION
Presentation given at the 6th NLP PhD Day at the National University of Ireland, Galway (Insight) on 29/04/2014.
TRANSCRIPT
PhD Day – 04/2014 Bianca Pereira
The PhD Route
Outline
Literature Review
Define the PhD topic
DEFINING THE TOPIC
Entity Linking is…
“Grounding entity mentions in documents to
Knowledge Base entries”
- TAC-KBP 2009
Entity Resolution
http://en.wikipedia.org/wiki/The_Guardian http://en.wikipedia.org/wiki/National_Security_Agency
http://en.wikipedia.org/wiki/British_people http://en.wikipedia.org/wiki/Edward_snowden
PROBLEM SEEKING
Types of Entity
Domains of Knowledge
Methods
Accuracy
Time
Types of Entity
Named Entities
Unnamed Entities
Topics
Classes
Natural Language Processing
Statistics
Entity Linking
Domains of Knowledge
Methods
EVERYTHING!
Natural Language Processing
Statistics
Entity Linking
PROBLEM DEFINITION
Types of Entity
Named Entities
Given by Class
Given by Knowledge Base
Others
Domains of Knowledge
Cross-domain Knowledge Base
Methods
“(…) Collective Inference over a set of entities can lead
to better performance.”
- Stoyanov et al. 2012
Named Entity Recognition
Disambiguation
http://en.wikipedia.org/wiki/Michael_Jackson
http://en.wikipedia.org/wiki/Popular_music
http://en.wikipedia.org/wiki/Beat_It
http://en.wikipedia.org/wiki/Billie_Jean
http://en.wikipedia.org/wiki/Thriller_(song)
Collective inference algorithms are used for disambiguation.
[Figure: a mention "Co" with its candidate entities URI1–URI10]
A local context is used to compute the mention-candidate score.
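The local score can be sketched as a similarity between the words surrounding a mention and the words describing a candidate entity. A minimal sketch follows; the context, candidate names, and descriptions are invented for illustration, not taken from the slides.

```python
# Minimal local mention-candidate scoring: rank candidates by cosine
# similarity between the mention's surrounding words and the words of
# each candidate's description. All texts here are toy examples.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def local_score(context: str, description: str) -> float:
    return cosine(Counter(context.lower().split()),
                  Counter(description.lower().split()))

context = "thriller was the best selling album by the king of pop"
candidates = {
    "Michael_Jackson_(singer)": "american singer known as the king of pop",
    "Michael_Jackson_(writer)": "english writer on beer and whisky",
}
best = max(candidates, key=lambda uri: local_score(context, candidates[uri]))
```

Real systems use richer features than bag-of-words overlap, but the shape is the same: one score per mention-candidate pair, computed from local context only.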
There is coherence between entities in the same document.
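Combining local scores with pairwise coherence between the chosen entities gives the collective-inference objective: pick one candidate per mention so that the total is maximal. The exhaustive search below (over invented toy scores) shows the joint assignment being optimized, and also why the exact problem explodes combinatorially.

```python
# Collective objective sketch: sum of local scores for the chosen
# candidates plus pairwise coherence between them, maximized by
# brute force over all joint assignments. All numbers are made up.
from itertools import product

local = {                        # mention -> {candidate: local score}
    "Jackson": {"URI1": 0.6, "URI2": 0.5},
    "Thriller": {"URI3": 0.4, "URI4": 0.5},
}
coherence = {                    # symmetric pairwise relatedness
    ("URI1", "URI3"): 0.9, ("URI1", "URI4"): 0.1,
    ("URI2", "URI3"): 0.0, ("URI2", "URI4"): 0.2,
}

def objective(assignment):
    mentions = list(local)
    total = sum(local[m][c] for m, c in zip(mentions, assignment))
    for i in range(len(assignment)):
        for j in range(i + 1, len(assignment)):
            pair = tuple(sorted((assignment[i], assignment[j])))
            total += coherence.get(pair, 0.0)
    return total

best = max(product(*[local[m] for m in local]), key=objective)
```

With n mentions and k candidates each, the joint space has k^n assignments, which is what the following slides are about.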
[Figure: coherence edges among candidate entities URI1–URI10, shown over several animation steps]
Disambiguation using collective inference is an NP-hard problem.
[Figure: the candidate space is pruned from URI1–URI10 down to URI1 and URI4–URI8]
230 candidates → 24 candidates
“The number of contexts [entities] is overwhelming and had to be reduced to a manageable size.” - Cucerzan 2007
“Much speed is gained by imposing a threshold below which all senses [candidates] are discarded.” - Milne and Witten 2008
“Inference is NP Hard” - Kulkarni et al. 2009
“(…) exact algorithms on large input graphs are infeasible.” - Hoffart et al. 2011
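The pruning described in these quotes can be sketched in a few lines: candidates whose local score falls below a threshold are discarded before collective inference runs. The scores below are invented; note the kept set mirrors the pruned figure (URI1 and URI4–URI8).

```python
# Threshold pruning of the candidate space: drop every candidate whose
# local score is below the threshold. Scores here are illustrative.
def prune(candidates: dict, threshold: float) -> dict:
    return {uri: s for uri, s in candidates.items() if s >= threshold}

scores = {"URI1": 0.8, "URI2": 0.05, "URI3": 0.02, "URI4": 0.6,
          "URI5": 0.5, "URI6": 0.4, "URI7": 0.35, "URI8": 0.3,
          "URI9": 0.01, "URI10": 0.04}
kept = prune(scores, threshold=0.3)
```

The threshold trades accuracy for speed: a correct candidate with a weak local score is lost forever once pruned.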
Collective Inference - Accuracy
Collective Inference - Time
Using approximation algorithms, the runtime becomes suitable for the task.
Methods
Recalling
Given by Knowledge Base
Cross-domain Knowledge Base
~ 5 MILLION entities
~ 10 MILLION entities
~ 43 MILLION entities
Problem Statement
The time spent on disambiguation for Entity Linking increases with the size of the Knowledge Base. This makes disambiguation with large Knowledge Bases infeasible.
RELATED WORK
Two solutions for the problem..
1. Approximation Algorithms
2. Dimensionality Reduction
Approximation Algorithms
Kulkarni et al. 2009, Hoffart et al. 2011
Dimensionality Reduction
[Figure: dimensionality reduction shrinks the candidate graph from URI1–URI10 (230 candidates) to URI1 and URI4–URI8 (24 candidates)]
Cucerzan 2007, Milne and Witten 2008, Hoffart et al. 2011
Dimensionality Reduction (candidate space)
Algorithm
Knowledge Base
Related Work
RESEARCH QUESTIONS
R1. Is it possible to delimit a feasible maximum amount of time for disambiguation regardless of the size of the Knowledge Base?
R2. Is it possible to reduce the dimensionality directly in the Knowledge Base?
R3. Is it feasible to use exact algorithms for disambiguation using large Knowledge Bases?
HYPOTHESES
R1. Is it possible to delimit a feasible maximum amount of time for disambiguation regardless of the size of the Knowledge Base?
H1. There is a maximum size of candidate set that allows disambiguation in a feasible time.
H2. If the Knowledge Base can be divided into subsets of constant ambiguity then the candidate space is constant.
Subset of constant ambiguity → constant candidate space
Candidate space = maximum allowed size → feasible time
R2. Is it possible to reduce the dimensionality directly in the Knowledge Base?
H3. The relatedness between entities is a sufficient condition to reduce the dimensionality without loss of accuracy.
R3. Is it feasible to use exact algorithms for disambiguation using large Knowledge Bases?
H4. Decreasing the ambiguity in the Knowledge Base is less time-consuming than performing it at disambiguation time.
H5. Exact algorithms can be used in a feasible time up to a maximum size of candidate space.
PROPOSED SOLUTION
Ontology Modularization for Disambiguation in Entity Linking
Ontology Modularization
Ontology Modularization
How to Generate the Modules?
Semantic-Driven Strategies: depend on the application.
Structure-Driven Strategies: graph decomposition based on inter-relations.
Machine Learning Strategies: data mining and clustering.
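A structure-driven strategy can be sketched as plain graph decomposition: treat the Knowledge Base as a graph of entities and inter-relations and take its connected components as modules. Real modularization strategies are more elaborate; the entity graph below is a made-up toy example.

```python
# Structure-driven modularization sketch: connected components of the
# Knowledge Base's entity-relation graph become the modules.
def connected_components(graph: dict) -> list:
    seen, modules = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # iterative depth-first search
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph.get(node, ()))
        seen |= comp
        modules.append(comp)
    return modules

# Toy Knowledge Base: two groups of inter-related entities.
kb = {"URI1": ["URI2"], "URI2": ["URI1", "URI3"], "URI3": ["URI2"],
      "URI4": ["URI5"], "URI5": ["URI4"]}
modules = connected_components(kb)
```

Each module can then be used as a smaller, self-contained candidate space at disambiguation time.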
EVALUATION
H1. There is a maximum size of candidate set that allows disambiguation in a feasible time.
Perform an experiment using different collective inference approaches to discover how the time increases with the size of the candidate set.
H2. If the Knowledge Base can be divided into subsets of constant ambiguity then the candidate space is constant.
Perform Ontology Modularization aiming at a maximum ambiguity in each module.
H3. The relatedness between entities is a sufficient condition to reduce the dimensionality without loss of accuracy.
Generate the module based on the same relatedness measure used by the original method and verify the accuracy.
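One widely used relatedness measure of this kind is the Wikipedia Link-based Measure of Milne and Witten (2008), which compares the sets of articles linking to two entities. A sketch follows; the inlink sets and corpus size are toy values, and the clamped 1-minus-distance form is one common way the measure is used as a similarity.

```python
# Wikipedia Link-based Measure (Milne & Witten 2008) sketch:
# relatedness from the overlap of two entities' inlink sets,
# normalized by corpus size. Inputs here are toy values.
from math import log

def wlm_relatedness(inlinks_a: set, inlinks_b: set, n_articles: int) -> float:
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    dist = (log(max(len(inlinks_a), len(inlinks_b))) - log(len(common))) / \
           (log(n_articles) - log(min(len(inlinks_a), len(inlinks_b))))
    return max(0.0, 1.0 - dist)          # clamp distance into a similarity

a = {"p1", "p2", "p3", "p4"}             # articles linking to entity a
b = {"p3", "p4", "p5"}                   # articles linking to entity b
r = wlm_relatedness(a, b, n_articles=1000)
```

Using the same measure for modularization and for disambiguation is what makes the accuracy comparison in this evaluation meaningful.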
H4. Decreasing the ambiguity in the Knowledge Base is less time-consuming than performing it at disambiguation time.
Measure the time for disambiguation, reducing the dimensionality at disambiguation time versus using the Modularization approach.
H5. Exact algorithms can be used in a feasible time up to a maximum size of candidate space.
Select a set of exact algorithms and measure the time for different sizes of candidate space.
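The shape of that experiment can be sketched as follows: time an exact (exhaustive) disambiguation over growing candidate spaces and record where the runtime stops being feasible. The objective here is a trivial stand-in and all sizes are toy values; a real run would use a full coherence objective over Knowledge Base data.

```python
# H5 experiment sketch: measure exhaustive-search disambiguation time
# as a function of candidates per mention. Objective is a dummy.
from itertools import product
from time import perf_counter

def exhaustive_best(candidate_sets):
    # Enumerates every joint assignment (k^n for n mentions, k candidates).
    return max(product(*candidate_sets),
               key=lambda a: sum(hash(c) % 100 for c in a))

timings = {}
for k in (2, 4, 6):                                   # candidates per mention
    sets = [[f"m{m}_c{c}" for c in range(k)] for m in range(6)]  # 6 mentions
    t0 = perf_counter()
    exhaustive_best(sets)
    timings[k] = perf_counter() - t0
```

Plotting such timings against the candidate-space size is what would locate the feasibility boundary that H5 posits.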
Next Steps
Doctoral Consortium
TAC-KBP
First Experiments
Use Cases
Thank you!
Bianca Pereira [email protected]