PhD Day: Entity Linking using Ontology Modularization
DESCRIPTION
Presentation given at the 6th NLP PhD Day at the National University of Ireland, Galway (Insight) on 29/04/2014.
TRANSCRIPT
PhD Day – 04/2014 Bianca Pereira
The PhD Route
Outline
Literature Review
Define the PhD topic
DEFINING THE TOPIC
Entity Linking is…
“Grounding entity mentions in documents to
Knowledge Base entries”
- TAC-KBP 2009
Entity Resolution
http://en.wikipedia.org/wiki/The_Guardian http://en.wikipedia.org/wiki/National_Security_Agency
http://en.wikipedia.org/wiki/British_people http://en.wikipedia.org/wiki/Edward_snowden
PROBLEM SEEKING
Types of Entity
Domains of Knowledge
Methods
Accuracy
Time
Types of Entity
Named Entities
Unnamed Entities
Topics
Classes
Natural Language Processing
Statistics
Entity Linking
Domains of Knowledge
Methods
EVERYTHING!
Natural Language Processing
Statistics
Entity Linking
PROBLEM DEFINITION
Types of Entity
Named Entities
Given by Class
Given by Knowledge Base
Others
Domains of Knowledge
Cross-domain Knowledge Base
Methods
“(…) Collective Inference over a set of entities can lead
to better performance.”
- Stoyanov et al. 2012
Named Entity Recognition
Disambiguation
http://en.wikipedia.org/wiki/Michael_Jackson
http://en.wikipedia.org/wiki/Popular_music
http://en.wikipedia.org/wiki/Beat_It
http://en.wikipedia.org/wiki/Billie_Jean
http://en.wikipedia.org/wiki/Thriller_(song)
Collective inference algorithms are used for disambiguation.
[Figure: a mention "Co" with its candidate entities URI1–URI10]
A local context is used to compute the mention-candidate score.
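The local score can be sketched as a similarity between the words surrounding a mention and the words describing a candidate entity. A minimal sketch follows; the context, candidate names, and descriptions are invented for illustration, not taken from the slides.

```python
# Minimal local mention-candidate scoring: rank candidates by cosine
# similarity between the mention's surrounding words and the words of
# each candidate's description. All texts here are toy examples.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def local_score(context: str, description: str) -> float:
    return cosine(Counter(context.lower().split()),
                  Counter(description.lower().split()))

context = "thriller was the best selling album by the king of pop"
candidates = {
    "Michael_Jackson_(singer)": "american singer known as the king of pop",
    "Michael_Jackson_(writer)": "english writer on beer and whisky",
}
best = max(candidates, key=lambda uri: local_score(context, candidates[uri]))
```

Real systems use richer features than bag-of-words overlap, but the shape is the same: one score per mention-candidate pair, computed from local context only.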
There is coherence between entities in the same document.
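Combining local scores with pairwise coherence between the chosen entities gives the collective-inference objective: pick one candidate per mention so that the total is maximal. The exhaustive search below (over invented toy scores) shows the joint assignment being optimized, and also why the exact problem explodes combinatorially.

```python
# Collective objective sketch: sum of local scores for the chosen
# candidates plus pairwise coherence between them, maximized by
# brute force over all joint assignments. All numbers are made up.
from itertools import product

local = {                        # mention -> {candidate: local score}
    "Jackson": {"URI1": 0.6, "URI2": 0.5},
    "Thriller": {"URI3": 0.4, "URI4": 0.5},
}
coherence = {                    # symmetric pairwise relatedness
    ("URI1", "URI3"): 0.9, ("URI1", "URI4"): 0.1,
    ("URI2", "URI3"): 0.0, ("URI2", "URI4"): 0.2,
}

def objective(assignment):
    mentions = list(local)
    total = sum(local[m][c] for m, c in zip(mentions, assignment))
    for i in range(len(assignment)):
        for j in range(i + 1, len(assignment)):
            pair = tuple(sorted((assignment[i], assignment[j])))
            total += coherence.get(pair, 0.0)
    return total

best = max(product(*[local[m] for m in local]), key=objective)
```

With n mentions and k candidates each, the joint space has k^n assignments, which is what the following slides are about.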
[Figure: coherence edges among candidate entities URI1–URI10, shown over several animation steps]
Disambiguation using collective inference is an NP-hard problem.
[Figure: the candidate space is pruned from URI1–URI10 down to URI1 and URI4–URI8]
230 candidates → 24 candidates
“The number of contexts [entities] is overwhelming and had to be reduced to a manageable size.” - Cucerzan 2007
“Much speed is gained by imposing a threshold below which all senses [candidates] are discarded.” - Milne and Witten 2008
“Inference is NP Hard” - Kulkarni et al. 2009
“(…) exact algorithms on large input graphs are infeasible.” - Hoffart et al. 2011
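The pruning described in these quotes can be sketched in a few lines: candidates whose local score falls below a threshold are discarded before collective inference runs. The scores below are invented; note the kept set mirrors the pruned figure (URI1 and URI4–URI8).

```python
# Threshold pruning of the candidate space: drop every candidate whose
# local score is below the threshold. Scores here are illustrative.
def prune(candidates: dict, threshold: float) -> dict:
    return {uri: s for uri, s in candidates.items() if s >= threshold}

scores = {"URI1": 0.8, "URI2": 0.05, "URI3": 0.02, "URI4": 0.6,
          "URI5": 0.5, "URI6": 0.4, "URI7": 0.35, "URI8": 0.3,
          "URI9": 0.01, "URI10": 0.04}
kept = prune(scores, threshold=0.3)
```

The threshold trades accuracy for speed: a correct candidate with a weak local score is lost forever once pruned.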
Collective Inference - Accuracy
Collective Inference - Time
Using approximation algorithms, the runtime becomes suitable for the task.
Methods
Recalling
Given by Knowledge Base
Cross-domain Knowledge Base
~ 5 MILLION entities
~ 10 MILLION entities
~ 43 MILLION entities
Problem Statement
The time spent on disambiguation for Entity Linking increases with the size of the Knowledge Base. This makes disambiguation with large Knowledge Bases infeasible.
RELATED WORK
Two solutions for the problem..
1. Approximation Algorithms
2. Dimensionality Reduction
Approximation Algorithms
Kulkarni et al. 2009, Hoffart et al. 2011
Dimensionality Reduction
[Figure: dimensionality reduction shrinks the candidate graph from URI1–URI10 (230 candidates) to URI1 and URI4–URI8 (24 candidates)]
Cucerzan 2007, Milne and Witten 2008, Hoffart et al. 2011
Dimensionality Reduction (candidate space)
Algorithm
Knowledge Base
Related Work
RESEARCH QUESTIONS
R1. Is it possible to delimit a feasible maximum amount of time for disambiguation regardless of the size of the Knowledge Base?
R2. Is it possible to reduce the dimensionality directly in the Knowledge Base?
R3. Is it feasible to use exact algorithms for disambiguation using large Knowledge Bases?
HYPOTHESES
R1. Is it possible to delimit a feasible maximum amount of time for disambiguation regardless of the size of the Knowledge Base?
H1. There is a maximum size of candidate set that allows disambiguation in a feasible time.
H2. If the Knowledge Base can be divided into subsets of constant ambiguity then the candidate space is constant.
Subset of constant ambiguity → constant candidate space
Candidate space = maximum allowed size → feasible time
R2. Is it possible to reduce the dimensionality directly in the Knowledge Base?
H3. The relatedness between entities is a sufficient condition to reduce the dimensionality without loss of accuracy.
R3. Is it feasible to use exact algorithms for disambiguation using large Knowledge Bases?
H4. Decreasing the ambiguity in the Knowledge Base is less time-consuming than performing it at disambiguation time.
H5. Exact algorithms can be used in a feasible time up to a maximum size of candidate space.
PROPOSED SOLUTION
Ontology Modularization for Disambiguation in Entity Linking
Ontology Modularization
Ontology Modularization
How to Generate the Modules?
Semantic-Driven Strategies: depend on the application.
Structure-Driven Strategies: graph decomposition based on inter-relations.
Machine Learning Strategies: data mining and clustering.
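A structure-driven strategy can be sketched as plain graph decomposition: treat the Knowledge Base as a graph of entities and inter-relations and take its connected components as modules. Real modularization strategies are more elaborate; the entity graph below is a made-up toy example.

```python
# Structure-driven modularization sketch: connected components of the
# Knowledge Base's entity-relation graph become the modules.
def connected_components(graph: dict) -> list:
    seen, modules = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # iterative depth-first search
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph.get(node, ()))
        seen |= comp
        modules.append(comp)
    return modules

# Toy Knowledge Base: two groups of inter-related entities.
kb = {"URI1": ["URI2"], "URI2": ["URI1", "URI3"], "URI3": ["URI2"],
      "URI4": ["URI5"], "URI5": ["URI4"]}
modules = connected_components(kb)
```

Each module can then be used as a smaller, self-contained candidate space at disambiguation time.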
EVALUATION
H1. There is a maximum size of candidate set that allows disambiguation in a feasible time.
Perform an experiment using different collective inference approaches to discover how the time increases with the size of the candidate set.
H2. If the Knowledge Base can be divided into subsets of constant ambiguity then the candidate space is constant.
Perform Ontology Modularization aiming at a maximum ambiguity in each module.
H3. The relatedness between entities is a sufficient condition to reduce the dimensionality without loss of accuracy.
Generate the module based on the same relatedness measure used by the original method and verify the accuracy.
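One widely used relatedness measure of this kind is the Wikipedia Link-based Measure of Milne and Witten (2008), which compares the sets of articles linking to two entities. A sketch follows; the inlink sets and corpus size are toy values, and the clamped 1-minus-distance form is one common way the measure is used as a similarity.

```python
# Wikipedia Link-based Measure (Milne & Witten 2008) sketch:
# relatedness from the overlap of two entities' inlink sets,
# normalized by corpus size. Inputs here are toy values.
from math import log

def wlm_relatedness(inlinks_a: set, inlinks_b: set, n_articles: int) -> float:
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    dist = (log(max(len(inlinks_a), len(inlinks_b))) - log(len(common))) / \
           (log(n_articles) - log(min(len(inlinks_a), len(inlinks_b))))
    return max(0.0, 1.0 - dist)          # clamp distance into a similarity

a = {"p1", "p2", "p3", "p4"}             # articles linking to entity a
b = {"p3", "p4", "p5"}                   # articles linking to entity b
r = wlm_relatedness(a, b, n_articles=1000)
```

Using the same measure for modularization and for disambiguation is what makes the accuracy comparison in this evaluation meaningful.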
H4. Decreasing the ambiguity in the Knowledge Base is less time-consuming than performing it at disambiguation time.
Measure the time for disambiguation, reducing the dimensionality at disambiguation time versus using the Modularization approach.
H5. Exact algorithms can be used in a feasible time up to a maximum size of candidate space.
Select a set of exact algorithms and measure the time for different sizes of candidate space.
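The shape of that experiment can be sketched as follows: time an exact (exhaustive) disambiguation over growing candidate spaces and record where the runtime stops being feasible. The objective here is a trivial stand-in and all sizes are toy values; a real run would use a full coherence objective over Knowledge Base data.

```python
# H5 experiment sketch: measure exhaustive-search disambiguation time
# as a function of candidates per mention. Objective is a dummy.
from itertools import product
from time import perf_counter

def exhaustive_best(candidate_sets):
    # Enumerates every joint assignment (k^n for n mentions, k candidates).
    return max(product(*candidate_sets),
               key=lambda a: sum(hash(c) % 100 for c in a))

timings = {}
for k in (2, 4, 6):                                   # candidates per mention
    sets = [[f"m{m}_c{c}" for c in range(k)] for m in range(6)]  # 6 mentions
    t0 = perf_counter()
    exhaustive_best(sets)
    timings[k] = perf_counter() - t0
```

Plotting such timings against the candidate-space size is what would locate the feasibility boundary that H5 posits.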
Next Steps
Doctoral Consortium
TAC-KBP
First Experiments
Use Cases
Thank you!
Bianca Pereira [email protected]