Comprehensive Evaluation PRIYA RADHAKRISHNAN, 201050035

Post on 29-Sep-2020

INTRODUCTION Motivation

Entity Linking &

Knowledgebase Enhancement

India successfully sends 'MOM' to Mars

COMPREHENSIVE EVALUATION : ENTITY LINKING & KNOWLEDGEBASE ENHANCEMENT

INTRODUCTION Problem Statement

India successfully sends 'MOM' to Mars

Mention Detection, Disambiguation and Linking

Entity Categorisation and KB Enhancement

Outline

Definition and Background Information

Literature Survey

Some Applications of EL

Future Work

Q & A

Entity Linking and KB Enhancement

Mention Detection

Knowledgebase Construction

Entity Categorization

Entity Linking

HYPERLINKED TEXT

INPUT TEXT

Disambiguation

Literature Survey

1. Entity Linking

2. Measuring Semantic Relatedness

3. Entity Linking in documents

4. Entity Linking in short texts

5. Entity Linking Evaluation

6. Knowledgebase Creation

7. Knowledgebase Enhancement

Literature Survey : Entity Linking

P0. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In EACL (2006)

P1. Mihalcea, R., Csomai, A.: Wikify!: Linking documents to encyclopedic knowledge. In CIKM (2007)

P0 : Using encyclopedic knowledge for named entity disambiguation

Author : Bunescu, R., Pasca, M.

Aim : Detect named entities in text and disambiguate them to their named entity denotations in Wikipedia.

Approach : Disambiguation uses an SVM kernel trained on Wikipedia features drawn from context and categories.

Results : Using a dataset of 1,783K ambiguous NEs in Wikipedia, they report accuracy of 68% (on 21K queries) and 84% (on 31K queries).

Contribution : This work of Bunescu and Pasca is widely accepted as the first work in this area.

P0

1. Mention Detection - Detects whether a proper name refers to a named entity included in the dictionary.

2. Disambiguation = Context-article similarity + Word-category similarity

\[ \mathrm{score}(q, e_n) = \cos(q.T, e_n.T) = \frac{q.T}{\lVert q.T \rVert} \cdot \frac{e_n.T}{\lVert e_n.T \rVert}, \qquad f(q, e_n) \in \{0, 1\} \]

\[ \text{Disambiguation score} = \arg\max_{n} \, \mathrm{score}(q, e_n) \]

3. Detect unlinkable or out-of-Wikipedia entities. Disambiguation produces a confidence score, and linking succeeds if the disambiguation score is above a threshold. When it is below the threshold, the entity is described as out of Wikipedia.
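A minimal sketch of the context-article cosine ranking with a threshold for out-of-Wikipedia entities. The bag-of-words tokenisation, the function names and the threshold value of 0.2 are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link(query_context, candidates, threshold=0.2):
    """Return (best_title, score); best_title is None when the score is below threshold.

    candidates maps a candidate article title to its article text.
    The threshold value is an assumption chosen only for this example.
    """
    q = Counter(query_context.lower().split())
    best_title, best_score = None, 0.0
    for title, article_text in candidates.items():
        s = cosine(q, Counter(article_text.lower().split()))
        if s > best_score:
            best_title, best_score = title, s
    if best_score < threshold:
        return None, best_score  # treated as out of Wikipedia
    return best_title, best_score
```

Calling `link("india sends spacecraft to mars", candidates)` with a small candidate dictionary picks the article whose text overlaps most with the query context.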

P1 : Wikify!: Linking documents to encyclopedic knowledge

Author : Mihalcea, R., Csomai, A.

Aim : Use of Wikipedia as a resource for automatic keyword extraction and word sense disambiguation.

Approach : Assess the keyphraseness of each entity mention, then disambiguate it using context.

Results : It shows how Wikipedia can be used to achieve state-of-the-art results on both these tasks. The paper also shows how the two methods can be combined into a system able to automatically enrich a text with links to Wikipedia. Given an input document, the system identifies the important concepts in the text and automatically links these concepts to the corresponding Wikipedia pages.

Contribution : Link any entity mention appearing in text, not just the named entities, to their named entity denotations in Wikipedia.

In 2008, Medelyan et al. used Wikipedia as a hyperlinked encyclopaedia. They defined Commonness as the ratio of the number of links with a specific target and anchor text to the total number of links with that anchor text.
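The Commonness ratio can be sketched as a count over (anchor text, target) link pairs. The input format and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def commonness_table(links):
    """Map each anchor text to {target: commonness}.

    links is an iterable of (anchor_text, target_article) pairs,
    a hypothetical flattened view of Wikipedia's hyperlinks.
    """
    counts = defaultdict(Counter)
    for anchor, target in links:
        counts[anchor][target] += 1
    table = {}
    for anchor, targets in counts.items():
        total = sum(targets.values())
        table[anchor] = {t: c / total for t, c in targets.items()}
    return table
```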

P1

1. Keyword extraction (mention detection): They defined a measure called keyphraseness.

\[ \text{Keyphraseness} = \frac{\text{number of documents where the term was already selected as a keyword}}{\text{total number of documents where the term appeared}} \]

2. Link generation (link ambiguity resolved using context).

In disambiguation, two approaches were tried. One was a knowledge-based approach, where the overlap of contexts between q and e was used.

The second was a data-driven algorithm: a naïve Bayes classifier trained on local features (3 words to the left and right, POS of neighbours) and global features (five keyphrases occurring at least 3 times in the contexts defining this word sense) to predict linkability.
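A minimal sketch of the keyphraseness ratio defined above. The document representation (text plus the set of terms already selected as keywords) and the plain substring match are simplifying assumptions.

```python
def keyphraseness(term, documents):
    """Fraction of documents containing `term` in which it was selected as a keyword.

    documents is an iterable of (text, selected_keywords) pairs; substring
    matching stands in for proper tokenisation in this sketch.
    """
    appeared = 0
    selected = 0
    for text, keywords in documents:
        if term in text:
            appeared += 1
            if term in keywords:
                selected += 1
    return selected / appeared if appeared else 0.0
```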

Literature Survey : Measuring Semantic Relatedness

P11. Strube, M., Ponzetto, S.P.: Wikirelate! computing semantic relatedness using Wikipedia. In AAAI (2006)

P2. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In IJCAI (2007)

P3. Milne, D., Witten, I.H.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In AAAI (2008)

Semantic Relatedness

Semantic relatedness: how much words/texts are correlated in meaning to each other.

word 1 /text 1 ←→ word 2 /text 2

cricket ←→ sport

Domesticated Animals ↔ Pet Mammals

Producers ↔ Actors ↔ Directors

P11 : Wikirelate! computing semantic relatedness using Wikipedia.

Author : Strube, M., Ponzetto, S.P.

Aim : Use Wikipedia for computing semantic relatedness and compare it to WordNet on various benchmarking datasets.

Approach : Apply well established semantic relatedness measures originally developed for WordNet to the open domain encyclopaedia Wikipedia.

WordNet measures include Leacock & Chodorow (1998), Wu & Palmer (1994), Resnik (1995), Lesk, and Banerjee & Pedersen (2003).

\[ \text{Google based SR measure} = \frac{\mathrm{Hits}(i \text{ AND } j)}{\mathrm{Hits}(i) + \mathrm{Hits}(j) - \mathrm{Hits}(i \text{ AND } j)} \]
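The Google-based baseline is a Jaccard-style ratio over hit counts. A small sketch, assuming the three hit counts have already been obtained (the function name and the zero-denominator fallback are illustrative choices):

```python
def hits_jaccard(hits_i, hits_j, hits_both):
    """Hits(i AND j) / (Hits(i) + Hits(j) - Hits(i AND j))."""
    denom = hits_i + hits_j - hits_both
    return hits_both / denom if denom > 0 else 0.0
```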

P11

Existing relatedness measures perform better using Wikipedia than a baseline given by Google counts. It also shows that Wikipedia outperforms WordNet when applied to the largest available dataset designed for that purpose. The best results on this dataset are obtained by integrating Google, WordNet and Wikipedia based measures.

Results : This work established that existing relatedness measures perform better using Wikipedia than a baseline given by Google counts.

Contribution : Computing SR requires a semantic resource. WordNet was the de facto semantic resource for calculating Semantic Relatedness until 2006, when the concept of using Wikipedia as a semantic source surfaced in this work of Strube and Ponzetto.

P2 : Computing semantic relatedness using Wikipedia-based explicit semantic analysis.

Author : Gabrilovich, E., Markovitch, S.

Aim : Create a semantic interpretation of words occurring as Wikipedia titles.

Approach : ESA represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia (i.e., Wikipedia titles). Using machine learning techniques, it explicitly represents the meaning of any text as a weighted vector of Wikipedia-based concepts.

Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts.

P2

They build an inverted index, which maps each word into a list of concepts in which it appears. Given a text fragment, the semantic interpreter ranks all the Wikipedia concepts by their relevance to the fragment as a vector.

Semantic Relatedness (SR) of a pair of text fragments is the cosine metric between their vectors.
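The inverted index and the cosine comparison can be sketched on a toy concept collection. The TF-IDF weighting with `log(1 + n/df)` and all names are assumptions of this sketch, not the weighting scheme reported in the paper.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(concepts):
    """concepts: {title: article_text} -> {word: {title: weight}}."""
    n = len(concepts)
    tf = {title: Counter(text.lower().split()) for title, text in concepts.items()}
    df = Counter(w for counts in tf.values() for w in counts)
    index = defaultdict(dict)
    for title, counts in tf.items():
        for w, c in counts.items():
            # Assumed weighting: term frequency times a smoothed IDF.
            index[w][title] = c * math.log(1 + n / df[w])
    return index


def interpret(text, index):
    """Semantic interpreter: weighted vector of concepts for a text fragment."""
    vec = Counter()
    for w in text.lower().split():
        for title, weight in index.get(w, {}).items():
            vec[title] += weight
    return vec


def esa_relatedness(a, b, index):
    """Cosine between the concept vectors of two text fragments."""
    va, vb = interpret(a, index), interpret(b, index)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

On a real Wikipedia dump the index would be far larger, which is the processing cost the slide refers to.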

Results : The proposed SR measure, ESA, achieved the highest correlation with human judgments (0.75).

Contribution : The SR measure achieved the highest correlation with human judgments (0.75) thus far. However, the method requires processing the whole Wikipedia text.

P3 : An effective, low-cost measure of semantic relatedness obtained from Wikipedia links.

Author : Milne, D., Witten, I.H.

Aim : Propose the Wikipedia Link based Measure (WLM) for computing Semantic Relatedness.

Approach : Uses only the hyperlink structure of Wikipedia.

Results : WLM achieved a correlation of 0.68 with human judgments.

Contribution : The approach uses the hyperlink structure of Wikipedia rather than its category hierarchy (as in P11) or textual content (as in P2). Evaluation with manually defined measures of semantic relatedness reveals this approach to be an effective compromise between the ease of computation of the former (P11) and the accuracy of the latter (P2).

In their subsequent work this was expanded to measure relatedness between entities and used in Entity Linking.

P3

WLM is cheaper and effective: cheaper, because Wikipedia's extensive textual content can largely be ignored, and effective, because it is more closely tied to the manually defined semantics of the resource.

Candidate articles of a term are identified using anchors: the terms or phrases in Wikipedia article text to which links are attached.

Two SR measures

\[ w(a \rightarrow b) = \log \frac{|W|}{|T|} \]

\[ sr(a, b) = \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))} \]

Here W is the set of all Wikipedia articles, T is the set of articles that link to b, and A and B are the sets of articles that link to a and b respectively.
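A small sketch of the second, link-overlap measure, computed exactly as the expression is written on the slide. The function name, the `None` return for articles with no shared in-links and the handling of a zero denominator are assumptions of this sketch. Because the expression has the form of a normalised distance, some callers may prefer to use one minus this value; that choice is left open here.

```python
import math


def wlm_sr(links_to_a, links_to_b, total_articles):
    """Evaluate the slide's sr(a, b) expression from sets of in-linking articles.

    Returns None when the two articles share no in-links, since log(0)
    is undefined (an assumption of this sketch).
    """
    common = links_to_a & links_to_b
    if not common:
        return None
    big = max(len(links_to_a), len(links_to_b))
    small = min(len(links_to_a), len(links_to_b))
    denom = math.log(total_articles) - math.log(small)
    if denom == 0:
        return 0.0
    return (math.log(big) - math.log(len(common))) / denom
```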

Literature Survey : Entity Linking in documents

P4. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In CIKM (2008)

P5. Cucerzan, S.: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In EMNLP-CoNLL(2007)

P6. Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of Wikipedia entities in web text. In SIGKDD (2009)

P4 : Learning to link with Wikipedia.

Author : Milne, D., Witten, I.H.

Aim : Automatically cross-reference documents with Wikipedia.

Approach : Learn a disambiguator using WLM and Commonness as features. This disambiguator is used to detect mentions and link them.

Results : Disambiguator (F = 97.1), Link Detection (F = 75), Accuracy of the detected links = 76.4%.

Contribution : The key difference in approach was that it uses disambiguation to inform detection, whereas the conventional approach was to do detection first and then do disambiguation.

P4

Disambiguation followed by mention detection!

Mention Detection : Document -> n-grams -> remove infrequent n-grams and stopwords

Disambiguation : Disambiguate the cleaned n-grams using two features: relatedness (WLM) and link probability (P1). A classifier (C4.5 algorithm) combines them, weighting WLM preferentially when good context is available and link probability (i.e., choosing the most common sense) when less context is available.

Linking : The features WLM, link probability, disambiguation score, generality (minimum depth at which the topic is located in the Wikipedia category tree), and location and spread of topics (in the Wikipedia page) are used to train a naïve Bayes classifier that decides whether to link or not.
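The mention detection step above (document to n-grams, then filtering) can be sketched as follows. The whitespace tokenisation, the rule of dropping n-grams that start or end with a stopword, the `link_probability` lookup table and the 0.05 cut-off are all assumptions made for illustration.

```python
def candidate_ngrams(text, stopwords, max_n=3):
    """Enumerate word n-grams, skipping those that start or end with a stopword."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            grams.append(" ".join(gram))
    return grams


def filter_by_link_probability(ngrams, link_probability, min_prob=0.05):
    """Keep n-grams whose (hypothetical, precomputed) link probability is high enough."""
    return [g for g in ngrams if link_probability.get(g, 0.0) >= min_prob]
```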

P5 : Large-Scale Named Entity Disambiguation Based on Wikipedia Data.

Author : Cucerzan, S.

Aim : A large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from Wikipedia data and Web search results.

Approach : Maximize the agreement between the contextual information extracted from Wikipedia and the context of a document, as well as the agreement among the category tags associated with the candidate entities.

\[ \text{Disambiguation score} = \arg\max \left( \sum_{n=1}^{N} \langle b_n|_{C}, d \rangle + \sum_{n=1}^{N} \sum_{m=1}^{N} \langle b_n|_{T}, b_m|_{T} \rangle \right) \]

Result : The implemented system shows high disambiguation accuracy on both news stories and Wikipedia articles.

Contribution : The first wikification system to map all named entities in a text simultaneously, capturing the coherence among the entities when disambiguating each detected entity mention. Evidence from the context of the mention is combined with that from the category tag of the mention to do disambiguation.

P5

Contextual information extracted from Wikipedia includes

1. the known entities (most articles in Wikipedia are associated to an entity/concept),

2. their entity class when available (Person, Location, Organization, and Miscellaneous),

3. their known surface forms (terms that are used to mention the entities in text),

4. contextual evidence (words or other entities that describe or co-occur with an entity), and

5. category tags (which describe topics to which an entity belongs).

Mention Detection – Sentence and Entity boundary identification and type(PER, ORG, LOC, OTH)

Disambiguation - a vectorial representation of the processed document is compared with the vectorial representations of the Wikipedia entities.

P5

In mention detection, NEs are identified and the system retrieves all possible entity disambiguations of each NE.

Wikipedia contexts that occur in the document and their category tags are aggregated into a document vector, which is subsequently compared with the Wikipedia entity vector (of categories and contexts) of each possible entity disambiguation.

Choose the assignment of entities to surface forms that maximizes the similarity between the document vector and the entity vectors.

\[ \text{Disambiguation score} = \arg\max \left( \sum_{n=1}^{N} \langle b_n|_{C}, d \rangle + \sum_{n=1}^{N} \sum_{m=1}^{N} \langle b_n|_{T}, b_m|_{T} \rangle \right) \]
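A brute-force sketch of choosing the joint assignment that maximises context agreement plus category agreement. It enumerates every combination, so it only suits tiny candidate sets. The sparse-dictionary vectors, the names, and the choice to skip the m = n terms in the category sum are assumptions of this sketch rather than details confirmed by the slide.

```python
from itertools import product


def dot(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(w * v.get(k, 0.0) for k, w in u.items())


def best_assignment(doc_vector, candidates):
    """Pick one entity per surface form to maximise the joint score.

    candidates: one list per surface form, each holding
    (entity, context_vector, category_vector) tuples.
    """
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        # Agreement between each entity's context vector and the document.
        score = sum(dot(ctx, doc_vector) for _, ctx, _ in combo)
        # Pairwise agreement between category vectors (m == n skipped by assumption).
        for i, (_, _, cat_i) in enumerate(combo):
            for j, (_, _, cat_j) in enumerate(combo):
                if i != j:
                    score += dot(cat_i, cat_j)
        if score > best_score:
            best, best_score = [entity for entity, _, _ in combo], score
    return best, best_score
```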

P6 : Collective annotation of Wikipedia entities in web text.

Author : Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.

Aim : Link entity mentions on Web pages to entities in Wikipedia.

Approach : This paper proposes a general collective disambiguation approach. On the premise that coherent documents refer to entities from one or a few related topics or domains, the authors propose formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities. The proposed solution is based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters.

Result: In experiments involving over a hundred manually-annotated Web pages and tens of thousands of entity mentions, the approach significantly outperforms other existing algorithms.

Contribution : They built a manually curated dataset for evaluating EL. Achieved F1 = 69.69. Both P4 and P5 avoid direct joint optimization of all spot labels, which is done here. This system achieved higher disambiguation accuracy though at a higher computational cost.

P6

Wikipedia is preprocessed so that each page corresponding to an entity γ is represented by four fields.

• Text from the first descriptive paragraph of γ.

• Text from the whole page for γ.

• Anchor text within Wikipedia for γ.

• Anchor text and five tokens around it.

Each field is turned into a bag (multiset) of words. Three text match scores are computed between a field of γ and the spot s:

• Dot-product between word count vectors.

• Cosine similarity in TFIDF vector space.

• Jaccard similarity between word sets.

So in all, we get 4 × 3 = 12 features.
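The 4 × 3 feature layout can be sketched as below. The field names, the whitespace tokenisation, the use of a text window around the spot, and the `idf` lookup with a default of 1.0 are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical names for the four preprocessed fields of an entity page.
FIELDS = ("first_paragraph", "full_text", "anchor_text", "anchor_context")


def dot_counts(a, b):
    """Dot product between word count vectors."""
    return sum(a[w] * b[w] for w in a.keys() & b.keys())


def tfidf_cosine(a, b, idf):
    """Cosine similarity after scaling counts by IDF (default weight 1.0)."""
    wa = {w: c * idf.get(w, 1.0) for w, c in a.items()}
    wb = {w: c * idf.get(w, 1.0) for w, c in b.items()}
    num = sum(wa[w] * wb[w] for w in wa.keys() & wb.keys())
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return num / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Jaccard similarity between the word sets of two bags."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def match_features(entity_fields, spot_context, idf):
    """Return the 12 text match scores: 3 per field, in FIELDS order."""
    s = Counter(spot_context.lower().split())
    feats = []
    for name in FIELDS:
        f = Counter(entity_fields[name].lower().split())
        feats += [dot_counts(f, s), tfidf_cosine(f, s, idf), jaccard(f, s)]
    return feats
```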

P6

\[ \text{Coherence Score} = \frac{1}{\binom{|S|}{2}} \sum_{s \neq s' \in S} r(y_s, y_{s'}) + \frac{1}{|S|} \sum_{s \in S} w^{\top} f_s(y_s) \]

This optimization is first converted into a 0/1 integer linear program by considering NA. It is then relaxed into an LP and solved using rounding.
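A minimal hill-climbing sketch for an objective of this shape: average pairwise relatedness between the chosen entities plus average local spot-to-entity compatibility. The callables `compat` and `relatedness`, the starting point and the single-spot move set are assumptions of this sketch; the NA label and the LP relaxation are not modelled.

```python
from itertools import combinations


def objective(assignment, compat, relatedness):
    """assignment: {spot: entity}. Returns coherence term plus local term."""
    spots = list(assignment)
    if not spots:
        return 0.0
    local = sum(compat(s, assignment[s]) for s in spots) / len(spots)
    pairs = list(combinations(spots, 2))
    if not pairs:
        return local
    coherence = sum(relatedness(assignment[s], assignment[t]) for s, t in pairs) / len(pairs)
    return coherence + local


def hill_climb(candidates, compat, relatedness):
    """candidates: {spot: [entity, ...]}. Greedy single-spot label changes."""
    # Start from the locally most compatible entity for every spot.
    assignment = {s: max(c, key=lambda e: compat(s, e)) for s, c in candidates.items()}
    current = objective(assignment, compat, relatedness)
    improved = True
    while improved:
        improved = False
        for s, cands in candidates.items():
            for e in cands:
                if e == assignment[s]:
                    continue
                trial = dict(assignment)
                trial[s] = e
                score = objective(trial, compat, relatedness)
                if score > current:
                    assignment, current, improved = trial, score, True
    return assignment, current
```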

Literature Survey : Entity Linking in short texts

P7. Ferragina, P., Scaiella, U.: TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities). In CIKM (2010)

P8. Meij, E., Weerkamp, W., de Rijke, M.: Adding Semantics to Microblog Posts. WSDM (2012)

P7 : TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities).

Author : Ferragina, P., Scaiella, U.

Aim : Uses Wikipedia's anchor text to page mapping to address the problem of cross-referencing text fragments with Wikipedia pages. This way synonymy and polysemy issues are resolved accurately and efficiently.

Approach : Uses Keyphraseness (P1) for mention detection and WLM (P4) for disambiguation.

Result : Good results on both long documents (F = 78.2) and short text fragments (F = 77.9), i.e., web snippets and micro-blog posts (namely, tweets).

Contribution : On short texts, context is sparse. This is counted as state-of-the-art in Wikification systems.

P7

1. Relatedness between pages

2. Disambiguation for a mention a from candidate sense Pa, rel(Pa).

3. Linking :

P8 : Adding Semantics to Microblog Posts.

Author : Meij, E., Weerkamp, W., de Rijke, M.

Aim : Determine the concept of a microblog post (tweet) through semantic linking.

Approach : Combine a concept ranking method (which gives high recall) that generates a ranked list of candidate concepts with a supervised machine learning method (which gives high precision) to predict the concept of a tweet.

Result : Achieved an MRR of 0.708 on the published dataset.

Contribution : A reusable dataset.

P8

Approach :

1. Mention detection and link generation : Obtain a ranked list of candidate concepts for each n-gram in a tweet.

2. Disambiguation : Determine which of the candidate concepts to keep, using a supervised learner. The paper compares the effectiveness of methods for the initial concept ranking step, namely lexical matching and language modelling.

Features :

N-gram features : IDF(q), WIG(q), SNIL(q), SNCL, Link Probability, Keyphraseness

Concept features : Inlinks, Outlinks, Redirect, WikiCat

Tweet features : TWCT, TWCQ, URL, TAGDEF

Dataset : A manually annotated dataset of tweet-to-concept mappings was created.

Literature Survey : Entity Linking Evaluation

P9. Cornolti, M., Ferragina, P., Ciaramita, M.: A framework for benchmarking entity-annotation systems. In WWW (2013)

P10. Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.: Evaluating entity linking with Wikipedia. Artif. Intell. (2013)

P9 : A framework for benchmarking entity-annotation systems.

Author : Cornolti, M., Ferragina, P., Ciaramita, M.

Aim : Presents a benchmarking framework for fair and exhaustive comparison of entity-annotation systems.

Approach : Definition of a set of problems related to the entity-annotation task, a set of measures to evaluate systems performance, and a systematic comparative evaluation involving all publicly available data-sets, containing texts of various types such as news, tweets and Web pages. Problems fall into D2W, A2W, Sa2W, C2W, Sc2W or Rc2W

Result: Comparison of publicly available entity annotation systems namely : AIDA, Illinois Wikifier, Tagme, Wikipedia-Miner and Dbpedia-Spotlight

Contribution : Classification of entity linking systems into D2W, A2W and the evaluation measures defined here became well accepted as standards.

P10 : Evaluating entity linking with Wikipedia.

Author : Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.

Aim : This paper re-implements three seminal Named Entity Linking (NEL) systems and presents a detailed evaluation of mention detection strategies. The results are systematically compared on standard data sets. The results establish that co-reference and acronym handling lead to substantial improvement, and that mention detection strategies account for much of the variation between systems.

Approach : Compare the Bunescu & Pasca (P0), Cucerzan (P5) and Varma (IIIT Hyderabad at TAC 2009) systems.

Result: First direct comparison of three systems.

Contribution : Mention detection strategies account for much of the variation between systems compared to disambiguation methods.

P10

Review of Named Entity Disambiguation Tasks and Data Sets

NEL system = Extractor + Searcher + Disambiguator

Extractors – alias source

Searcher - effect of coreference, acronym handling and query length

Disambiguator –cosine similarity outdid scalar product and SVM rank

RECAP : Entity Linking and KB Enhancement

Mention Detection

Knowledgebase Construction

Entity Categorization

Entity Linking

HYPERLINKED TEXT

INPUT TEXT

Disambiguation

Literature Survey : Knowledgebase Creation

P12. Zesch, T., Gurevych, I.: Analysis of the Wikipedia category graph for NLP applications. In TextGraphs-2 Workshop of NAACL-HLT (2007)

P13. Suchanek, F.M., Kasneci, G.,Weikum, G.: Yago: A core of semantic knowledge. In WWW (2007)

RECAP : Semantic Relatedness

Semantic relatedness: how much words/texts are correlated in meaning to each other.

word 1 /text 1 ←→ word 2 /text 2

cricket ←→ sport

Domesticated Animals ↔ Pet Mammals

Producers ↔ Actors ↔ Directors

RECAP : P11 : Wikirelate! computing semantic relatedness using Wikipedia.

Author : Strube, M., Ponzetto, S.P.

Aim : Use Wikipedia for computing semantic relatedness and compare it to WordNet on various benchmarking datasets.

Approach : Existing relatedness measures perform better using Wikipedia than a baseline given by Google counts. It also shows that Wikipedia outperforms WordNet when applied to the largest available dataset designed for that purpose. The best results on this dataset are obtained by integrating Google, WordNet and Wikipedia based measures.

Results : This work established that existing relatedness measures perform better using Wikipedia than a baseline given by Google counts.

Contribution : Computing SR requires a semantic resource. WordNet was the de facto semantic resource for calculating Semantic Relatedness until 2006, when the concept of using Wikipedia as a semantic source surfaced in this work of Strube and Ponzetto.

P12 : Analysis of the Wikipedia category graph for NLP applications.

Author : Zesch, T., Gurevych, I.

Aim : Use Wikipedia to compute semantic relatedness and compare it with WordNet on various benchmark datasets.

Approach : Compare the two graphs in Wikipedia: (i) the article graph, and (ii) the category graph. Using graph-theoretic analysis of the category graph, the authors show that the Wikipedia Category Graph is a scale-free, small-world graph like other well-known lexical semantic networks.

Results : WordNet-based SR measures are adapted to the Wikipedia Category Graph. German WordNet (a.k.a. GermaNet) gives the best correlation with human judgement on SR datasets.

Contribution : First published non-English (German) SR dataset.



P12

WordNet measures include:

Path Length, Leacock & Chodorow (1998), Wu & Palmer (1994), Resnik (1995), Lin (1998), IIC (Intrinsic Information Content)
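To make the path-based measures in this list concrete, here is a minimal sketch on a hypothetical toy Is-A taxonomy. The taxonomy, the function names and the depth convention (root has depth 1) are assumptions made for illustration; they are not taken from WordNet, from the Wikipedia Category Graph, or from the paper. Resnik, Lin and IIC additionally need information content values and are not covered here.

```python
import math

# Hypothetical toy Is-A taxonomy (child -> parent); illustration only.
PARENT = {
    "entity": None,
    "animal": "entity",
    "mammal": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "bird": "animal",
}

def ancestors(node):
    """Return the chain from node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(ancestors(node))

def lowest_common_subsumer(a, b):
    """First ancestor of a (walking upwards) that is also an ancestor of b."""
    seen = set(ancestors(b))
    for node in ancestors(a):
        if node in seen:
            return node
    raise ValueError("nodes share no ancestor")

def path_length(a, b):
    """Number of edges on the path between a and b through their common subsumer."""
    lcs = lowest_common_subsumer(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))

def wu_palmer(a, b):
    """Wu & Palmer style score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

def leacock_chodorow(a, b):
    """Leacock & Chodorow style score: -log(path_in_nodes / (2 * max_depth))."""
    max_depth = max(depth(n) for n in PARENT)
    path_nodes = path_length(a, b) + 1
    return -math.log(path_nodes / (2.0 * max_depth))
```

In this toy taxonomy, "dog" and "cat" share the parent "mammal", so every measure scores them as more related than "dog" and "bird", whose only common subsumer is "animal".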


P13 : Yago: A core of semantic knowledge.

Author : Suchanek, F.M., Kasneci, G., Weikum, G.

Aim : YAGO is a light-weight and extensible ontology with high coverage and quality. YAGO contains more than 1 million entities and 5 million facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as HASWONPRIZE).

Approach : The facts were automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods.

Results : Empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS.

Contribution : First ontology using Wikipedia + WordNet
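As a rough illustration of what an ontology of facts with an Is-A hierarchy and non-taxonomic relations can look like as a data structure, the sketch below stores facts as (subject, relation, object) triples. The relation names TYPE, SUBCLASSOF and HASWONPRIZE, the example facts and the helper functions are assumptions chosen for illustration; this is not YAGO's actual storage format or API.

```python
from collections import defaultdict

# Illustrative facts only, written as (subject, relation, object) triples.
FACTS = [
    ("Albert_Einstein", "TYPE", "physicist"),          # instance -> class
    ("physicist", "SUBCLASSOF", "scientist"),          # class -> superclass
    ("scientist", "SUBCLASSOF", "person"),
    ("Albert_Einstein", "HASWONPRIZE", "Nobel_Prize_in_Physics"),  # non-taxonomic
]

def build_index(facts):
    """Index objects by (subject, relation) for direct lookup."""
    index = defaultdict(set)
    for subject, relation, obj in facts:
        index[(subject, relation)].add(obj)
    return index

def classes_of(entity, index):
    """All classes of an entity: its TYPE classes plus their SUBCLASSOF ancestors."""
    result = set()
    frontier = list(index[(entity, "TYPE")])
    while frontier:
        cls = frontier.pop()
        if cls not in result:
            result.add(cls)
            frontier.extend(index[(cls, "SUBCLASSOF")])
    return result

INDEX = build_index(FACTS)
```

Here `classes_of("Albert_Einstein", INDEX)` walks the Is-A hierarchy upwards, while `INDEX[("Albert_Einstein", "HASWONPRIZE")]` looks up a non-taxonomic fact directly.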


Literature Survey : Knowledgebase Enhancement

Entity Attribute Extraction

P14. Ghani, R., Probst, K., Liu, Y., Krema, M., Fano, A.: Text mining for product attribute extraction. SIGKDD 2006

Structured Information Extraction

Wu, F., Weld, D.S.: Autonomously semantifying Wikipedia. In CIKM (2007)


Autonomously semantifying Wikipedia.

Author : Wu, F., Weld, D.S.

Aim : Automatically enhance Wikipedia's structured content, such as link structure, taxonomic data and infoboxes, with a system called KYLIN.

Approach: Uses a self-supervised, machine learning system. KYLIN looks for classes of pages with similar infoboxes, determines common attributes, creates training examples, learns CRF extractors, and runs them on each page — creating new infoboxes and completing others.

KYLIN also automatically identifies missing links for proper nouns on each page, resolving each to a unique identifier.

Results : Experiments show that the performance of KYLIN is roughly comparable with manual labelling in terms of precision and recall.

Contributions : System or API is not publicly available.
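The "creates training examples" step of the approach can be sketched as follows: sentences in an article are labelled with an infobox attribute when they mention that attribute's value, so no manual annotation is needed. This is a simplified sketch under stated assumptions. The substring matching rule, the sentence splitter, the function names and the example county are all hypothetical, and the CRF extractors that KYLIN learns from such data are not shown.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def make_training_examples(infobox, article_text):
    """Label a sentence with an attribute if exactly one attribute value occurs in it.

    Sentences that match no attribute (or several) get the label None and can
    serve as negative examples.
    """
    examples = []
    for sentence in split_sentences(article_text):
        lowered = sentence.lower()
        matched = [attr for attr, value in infobox.items()
                   if value.lower() in lowered]
        label = matched[0] if len(matched) == 1 else None
        examples.append((sentence, label))
    return examples

# Hypothetical infobox and article text, for illustration only.
INFOBOX = {"founded": "1804", "area": "1,154 sq mi"}
ARTICLE = ("Example County was founded in 1804. "
           "It covers an area of 1,154 sq mi. "
           "The county is known for its forests.")

EXAMPLES = make_training_examples(INFOBOX, ARTICLE)
```

Running the sketch pairs the first sentence with "founded" and the second with "area", and leaves the third unlabelled.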


KYLIN: On one domain, KYLIN performs even better than manual labelling in terms of precision and recall.


P14 : Text mining for product attribute extraction

Author : Ghani, R., Probst, K., Liu, Y., Krema, M., Fano, A.

Aim : Extracting attribute and value pairs from textual product descriptions. The goal is to augment databases of products by representing each product as a set of attribute-value pairs.

Approach : The problem is formulated as a classification problem and solved using semi-supervised learning algorithms.

Results : Results are reported on apparel and sporting goods product descriptions.

Contribution : Representing a product as a set of attribute-value (A-V) pairs benefits tasks where that view is more useful than treating the product as an atomic entity. It is used in applications like demand forecasting, assortment optimization, product recommendations, and assortment comparison across retailers and manufacturers.


P14

The first system extracts implicit (semantic) attributes, which are not stated explicitly in the descriptions.

Semantic Attribute Extraction:

1. Dataset: crawled from apparel retail websites

2. Define a set of semantic attributes that would be useful to extract for each product

3. A small subset (600 products) was given to a group of fashion-aware people to label

4. Create one text classifier for each semantic attribute (Naïve Bayes)

5. Use the Expectation-Maximization algorithm to combine labeled and unlabeled data.
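Steps 4 and 5 can be sketched as a multinomial Naive Bayes classifier for a single semantic attribute, trained with EM so that unlabeled descriptions contribute soft labels. This is a minimal sketch under stated assumptions, not the authors' implementation. The attribute ("formality" with classes formal and casual), the tiny product descriptions, the Laplace smoothing and all function names are hypothetical.

```python
import math

def train_nb(docs, soft_labels, classes, vocab):
    """Fit multinomial Naive Bayes; soft_labels[i][c] is P(class c | doc i)."""
    prior = {c: 1.0 for c in classes}                     # Laplace-smoothed
    counts = {c: {t: 1.0 for t in vocab} for c in classes}
    for tokens, soft in zip(docs, soft_labels):
        for c in classes:
            prior[c] += soft[c]
            for t in tokens:
                counts[c][t] += soft[c]
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_word = {}
    for c in classes:
        denom = sum(counts[c].values())
        log_word[c] = {t: math.log(n / denom) for t, n in counts[c].items()}
    return log_prior, log_word

def posterior(tokens, model, classes):
    """Return P(class | tokens); tokens outside the vocabulary are ignored."""
    log_prior, log_word = model
    scores = {c: log_prior[c] + sum(log_word[c][t] for t in tokens
                                    if t in log_word[c])
              for c in classes}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def nb_em(labeled, unlabeled, classes, iterations=10):
    """Semi-supervised Naive Bayes: labeled docs keep hard labels, EM relabels the rest."""
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    lab_docs = [tokens for tokens, _ in labeled]
    lab_soft = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = train_nb(lab_docs, lab_soft, classes, vocab)
    for _ in range(iterations):
        # E-step: estimate class probabilities for the unlabeled descriptions.
        soft = [posterior(tokens, model, classes) for tokens in unlabeled]
        # M-step: refit on labeled (hard) plus unlabeled (soft) examples.
        model = train_nb(lab_docs + unlabeled, lab_soft + soft, classes, vocab)
    return model

# Hypothetical data for one semantic attribute.
CLASSES = ["formal", "casual"]
LABELED = [("silk evening gown".split(), "formal"),
           ("cotton beach shorts".split(), "casual")]
UNLABELED = ["elegant silk gown".split(), "elegant evening dress".split(),
             "relaxed cotton shorts".split(), "relaxed beach tee".split()]
MODEL = nb_em(LABELED, UNLABELED, CLASSES)
```

The words "elegant" and "dress" never occur in the labeled data, so a classifier trained on the two labeled products alone cannot separate the classes for the description "elegant dress". After EM, the unlabeled descriptions tie those words to "silk", "evening" and "gown", and `posterior(["elegant", "dress"], MODEL, CLASSES)` leans towards "formal".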


P14

The second system extracts explicit attributes from product descriptions. These are explicit physical attributes of products, such as size and colour, and the attribute-value pairs are explicitly mentioned in the data. The output of both systems populates a knowledge base with these products and attributes.

Explicit Attribute Extraction:

1. Data Collection from an internal database or from the web using web crawlers and wrappers, as done for the first system.

2. Seed Generation either by generating them automatically or by obtaining human-labeled training data.

3. Attribute-Value Entity Extraction using a semi-supervised co-EM algorithm, because it can exploit the vast amounts of unlabelled data that can be collected cheaply.

4. Attribute-Value Pair Relationship Extraction by associating extracted attributes with corresponding extracted values. They use a dependency parser to establish links between attributes and values as well as correlation scores between words.

5. User Interaction to correct the results as well as to provide training data for the system to learn from using active learning techniques.


RECAP : Literature Survey

1. Entity Linking

2. Measuring Semantic Relatedness

3. Entity Linking in documents

4. Entity Linking in short texts

5. Entity Linking Evaluation

6. Knowledgebase Creation

7. Knowledgebase Enhancement


Applications

1. Information Extraction

2. Information Retrieval

3. Content Analysis

4. Question Answering

5. Knowledge Base Population


Our Attempts and Future Directions

1. Entity Linking

• Documents – TAC KBP Task 2014

• Tweets – NEEL Challenge @ WWW’14

• Search queries – ERD Challenge @ SIGIR’14


Our Attempts and Future Directions

2. Semantic Relatedness – Using Wikipedia Category Network.

"Extracting Semantic Knowledge from Wikipedia Category Names" in Proceedings of the 3rd Workshop on Knowledge Extraction (AKBC 2013) at CIKM 2013


Our Attempts and Future Directions

3. Entity Attribute Extraction – From Product Title

"Modeling Evolution of Product Entities" in Proceedings of the ACM SIGIR 2014 Conference


SUMMARY

India successfully sends 'MOM' to Mars

[Diagram labels: INPUT TEXT, Mention Detection, Disambiguation, Entity Linking, HYPERLINKED TEXT, Knowledgebase Construction, Entity Categorization]

1. Entity Linking

2. Measuring Semantic Relatedness

3. Entity Linking in documents

4. Entity Linking in short texts

5. Entity Linking Evaluation

6. Knowledgebase Creation

7. Knowledgebase Enhancement


Q&A

THANK YOU