Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web

Giuseppe Rizzo, Marieke van Erp, Raphaël Troncy
@giusepperizzo @merpeltje @rtroncy


DESCRIPTION

"Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web" talk given at LREC'14, Reykjavik, Iceland

TRANSCRIPT

Page 1: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web

Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web

Giuseppe Rizzo, Marieke van Erp, Raphaël Troncy

@giusepperizzo @merpeltje @rtroncy

Page 2: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Benchmarking NER & NED

➢ NER
  ➢ [newswire] CoNLL, ACE, MUC
  ➢ [microposts] Microposts Concept Extraction

➢ NED
  ➢ [newswire] TAC KBP
  ➢ [microposts] Microposts NEEL

➢ Numerous academic and commercial NER and NED tools

➢ To name a few: AlchemyAPI, DBpedia Spotlight, GATE, OpeNER, Stanford

Page 3: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


This Work

➢ Evaluation and comparison of 11 NER and NED tools through the NERD API

➢ Combination of the 11 NER tools in NERD-ML

➢ Experiments on two types of corpora: newswire and microposts

Page 4: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


➢ http://nerd.eurecom.fr

➢ Ontology, REST API & Web Application

➢ Uniform access to 11 NER/NED external tools

  ➢ commercial: AlchemyAPI, dataTXT, OpenCalais, Saplo, TextRazor, Wikimeta, Yahoo!, Zemanta

  ➢ academic: DBpedia Spotlight, Lupedia, THD
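As an illustration of the uniform-access idea, here is a minimal sketch of querying a single REST extraction endpoint from Python. The endpoint path, parameter names, and response fields are assumptions for illustration only, not the documented NERD API; consult http://nerd.eurecom.fr for the real interface.

```python
import requests

# Hypothetical endpoint and parameters: the real NERD API may differ.
NERD_ENDPOINT = "http://nerd.eurecom.fr/api/extraction"  # assumed path

def extract_entities(text, extractor="combined", api_key="YOUR_KEY"):
    """Send text to one REST endpoint and get back a uniform list of
    entities, regardless of which underlying tool produced them."""
    response = requests.post(
        NERD_ENDPOINT,
        data={"text": text, "extractor": extractor, "key": api_key},
    )
    response.raise_for_status()
    # Assumed uniform schema: one dict per entity with a surface form,
    # a type, and a disambiguation URI.
    return response.json()

entities = extract_entities("Reykjavik hosted LREC in 2014.")
for e in entities:
    print(e.get("label"), e.get("nerdType"), e.get("uri"))
```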

Page 5: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Theoretical limit

➢ Each of these systems has its own strengths in entity typing

➢ An ideal combination would pick, for each entity, the best type prediction among all tools

➢ Estimate the upper bound where each type is

t target=select te=tGS(te1

, te2, ... , t en

)
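A concrete reading of this upper bound: an oracle combiner counts an entity as correctly typed if at least one tool predicted its gold-standard type. A minimal sketch with toy data:

```python
def oracle_upper_bound(predictions, gold):
    """predictions: per-entity lists of types, one entry per tool.
    gold: the gold-standard type of each entity.
    Returns the typing accuracy an ideal combiner could reach."""
    correct = sum(
        1 for tool_types, t_gs in zip(predictions, gold)
        if t_gs in tool_types  # at least one tool got the type right
    )
    return correct / len(gold)

# Toy example: 3 entities, 3 tools each.
preds = [["PER", "ORG", "PER"], ["LOC", "LOC", "ORG"], ["MISC", "ORG", "ORG"]]
gold = ["PER", "LOC", "LOC"]
print(oracle_upper_bound(preds, gold))  # 2/3: no tool typed entity 3 as LOC
```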

Page 6: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


NERD-ML

➢ Try to perform better than each individual NER tool

➢ Learning from:
  ➢ NERD tool predictions
  ➢ Stanford CRF predictions
  ➢ Linguistic features

➢ Classifiers: Naive Bayes (NB), k-nearest neighbors (k-NN), Support Vector Machines (SVM, RBF kernel) (sketch below)
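For intuition, a minimal sketch of the learning step with scikit-learn, covering the three classifier families named above; the feature encoding is simplified (tool predictions and linguistic features already mapped to numbers) and the data is made up:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Each row is one token: encoded tool predictions plus linguistic features.
X_train = [[1, 0, 3, 1.0], [2, 2, 0, 0.0], [0, 1, 1, 0.5], [3, 3, 3, 1.0]]
y_train = ["PER", "LOC", "O", "ORG"]

classifiers = {
    "NB": GaussianNB(),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(kernel="rbf"),  # RBF kernel, as on the slide
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.predict([[1, 0, 3, 1.0]]))
```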

Page 7: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Feature Vector

[Diagram: construction of the training vector]

training vector = [ extractor1 type, extractor2 type, ..., extractorN type, linguistic vector, GS type ]
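A sketch of assembling one such training row per token; the helper name is hypothetical, and the token itself is included since the experimental settings later list it as the first feature:

```python
def build_training_vector(token, extractor_types, linguistic_features, gs_type):
    """One training row per token: the token, each extractor's predicted
    type, the linguistic feature vector, and the gold-standard type."""
    return [token] + extractor_types + linguistic_features + [gs_type]

row = build_training_vector(
    "Reykjavik",
    extractor_types=["LOC", "LOC", "ORG"],           # one prediction per extractor
    linguistic_features=["NNP", True, False, 0.11],  # POS, caps flags, cap ratio
    gs_type="LOC",
)
print(row)
```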

Page 8: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Linguistic Features

[Diagram: composition of the linguistic vector]

linguistic vector = [ POS, initial cap (*), all caps (*), capitalized ratio (**), prefix, suffix, begin or end (*) ], computed per token (extraction sketch below)

* Boolean value
** Double value
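A minimal sketch of computing these features for a single token; the exact feature definitions (for example, 3-character affixes) are assumptions:

```python
def linguistic_features(token, pos_tag, position, sentence_length):
    """Compute the linguistic feature vector for one token.
    Feature names follow the slide; precise definitions are assumed."""
    return {
        "pos": pos_tag,
        "initial_cap": token[:1].isupper(),   # Boolean
        "all_caps": token.isupper(),          # Boolean
        "capitalized_ratio": sum(c.isupper() for c in token) / len(token),  # Double
        "prefix": token[:3],
        "suffix": token[-3:],
        "begin_or_end": position == 0 or position == sentence_length - 1,   # Boolean
    }

print(linguistic_features("LREC", "NNP", position=1, sentence_length=5))
```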

Page 9: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Experiments - NER

➢ CoNLL2003 English, testb set [newswire]
  ➢ 231 Articles
  ➢ 46,435 Tokens
  ➢ 5,648 NEs

➢ MSM2013, test set [microposts]
  ➢ 1,450 Posts
  ➢ 29,085 Tokens
  ➢ 1,538 NEs
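For reproducing such counts, a minimal sketch of reading CoNLL-style token-per-line data; the column layout (token first, NE tag last, blank lines between sentences) is the usual CoNLL convention:

```python
def read_conll(path):
    """Yield sentences as lists of (token, ne_tag) pairs from a
    token-per-line file; blank lines separate sentences."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # sentence boundary
                if sentence:
                    yield sentence
                sentence = []
                continue
            cols = line.split()
            if cols[0] == "-DOCSTART-":  # CoNLL document marker
                continue
            sentence.append((cols[0], cols[-1]))  # token and NE tag
    if sentence:
        yield sentence

sentences = list(read_conll("eng.testb"))
print(len(sentences), sum(len(s) for s in sentences))  # sentences, tokens
```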

Page 10: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Results on CoNLL2003

Page 11: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Results on MSM2013

Page 12: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


NERD-ML Incremental Learning (1/2): CoNLL2003

Experimental settings:
➢ Feature Vector: token, AlchemyAPI, DBpedia Spotlight, Cicero, Lupedia, OpenCalais, Saplo, Yahoo!, TextRazor, Wikimeta, Stanford, GS type
➢ Classifier = NB

Page 13: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


NERD-ML Incremental Learning (2/2): MSM2013

Experimental settings:
➢ Feature Vector: token, pos, initialcaps, allcaps, prefix, suffix, capitalfreq, start, AlchemyAPI, DBpedia Spotlight, Cicero, Lupedia, OpenCalais, TextRazor, Ritter, Stanford, GS type
➢ Classifier = SVM (evaluation sketch below)
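Incremental learning here means growing the feature vector one block at a time and re-training after each addition. A schematic sketch with scikit-learn; the data structures and helper names are hypothetical:

```python
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def incremental_evaluation(block_order, train_blocks, y_train,
                           test_blocks, y_test):
    """Re-train after adding each extractor's feature block and report
    macro F1. *_blocks: dict mapping block name -> per-sample feature lists."""
    used = []
    for name in block_order:
        used.append(name)
        # Concatenate the feature lists of all blocks used so far.
        X_tr = [sum((train_blocks[b][i] for b in used), [])
                for i in range(len(y_train))]
        X_te = [sum((test_blocks[b][i] for b in used), [])
                for i in range(len(y_test))]
        clf = SVC(kernel="rbf").fit(X_tr, y_train)
        score = f1_score(y_test, clf.predict(X_te), average="macro")
        print("+".join(used), round(score, 3))
```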

Page 14: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Experiments - NED

➢ AIDA CoNLL-YAGO links to Wikipedia, testb set [newswire]
  ➢ 231 Articles
  ➢ 46,435 Tokens
  ➢ 4,485 Links

➢ Microposts2014 links to DBpedia, test set [microposts]
  ➢ 1,165 Posts
  ➢ 23,815 Tokens
  ➢ 1,330 Links

Page 15: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Results on AIDA CoNLL-YAGO

Wikipedia is the reference Knowledge Base

Page 16: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Results on Microposts2014

DBpedia v3.9 is the reference Knowledge Base

Page 17: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Discussion NER

➢ Newswire
  ➢ Robust performance on recognizing common types
  ➢ But the MISC class is hard to detect (and perhaps always will be)

➢ Microposts
  ➢ Fairly robust for PER
  ➢ Weak in recognizing LOC and ORG
  ➢ MISC F1 is around 30% (per-type scoring sketch below)
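Per-type figures like these come from class-wise precision, recall, and F1. A minimal sketch with scikit-learn on toy labels:

```python
from sklearn.metrics import classification_report

gold = ["PER", "LOC", "MISC", "ORG", "MISC", "PER"]
pred = ["PER", "ORG", "O",    "ORG", "MISC", "PER"]
# Per-class precision, recall, and F1, as in the NER discussion above.
print(classification_report(gold, pred,
                            labels=["PER", "LOC", "ORG", "MISC"],
                            zero_division=0))
```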

Page 18: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Discussion NED

➢ Newswire
  ➢ Unreliable performance on linking, with a peak F1 of 50.41% for TextRazor
  ➢ Linkers use different reference knowledge bases; link normalization is a source of bias (sketch below)

➢ Microposts
  ➢ Linking shows a big drop in performance
  ➢ TextRazor has the best score, with an F1 of 32.65%
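To illustrate the normalization issue: before scoring, links returned against different knowledge bases must be mapped into one space. A minimal sketch that maps English Wikipedia article URLs onto DBpedia resource URIs via the shared page title; redirect resolution, which a real evaluation also needs, is omitted:

```python
from urllib.parse import unquote

def normalize_link(url):
    """Map a Wikipedia article URL or DBpedia resource URI to a
    canonical DBpedia URI via the shared page title."""
    for prefix in ("http://en.wikipedia.org/wiki/",
                   "https://en.wikipedia.org/wiki/",
                   "http://dbpedia.org/resource/"):
        if url.startswith(prefix):
            title = unquote(url[len(prefix):])
            return "http://dbpedia.org/resource/" + title.replace(" ", "_")
    return url  # unknown KB: left as-is (a real source of scoring bias)

print(normalize_link("http://en.wikipedia.org/wiki/Reykjav%C3%ADk"))
# http://dbpedia.org/resource/Reykjavík
```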

Page 19: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Future Work

➢ NER
  ➢ Improving the taxonomy alignment

➢ NED
  ➢ Better harmonization of the linking stage

➢ NERD-ML
  ➢ Getting closer to the theoretical limit in NER
  ➢ Use of gazetteers for MISC types
  ➢ Combining the outputs of the NEL tools to predict the links

Page 20: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Acknowledgments

The research leading to this paper was partially supported by the European Union’s 7th Framework Programme via the projects LinkedTV (GA 287911) and NewsReader (ICT-316404).

Page 21: Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web


Thank You For Listening

http://www.slideshare.net/giusepperizzo

https://github.com/giusepperizzo/nerdml