NERD: an open source platform for extracting and disambiguating named entities in very diverse documents
Raphaël Troncy, Giuseppe Rizzo

Upload: raphael-troncy

Post on 10-May-2015


DESCRIPTION

"NERD: an open source platform for extracting and disambiguating named entities in very diverse documents" - Keynote Talk given at the NLP&DBpedia International Workshop (NLP&DBpedia), 22 October 2013

TRANSCRIPT

Page 1: NERD: an open source platform for extracting and disambiguating named entities in very diverse documents NLP-DBpedia 2013

NERD: an open source platform for extracting and disambiguating named entities in very diverse documents

Raphaël Troncy, Giuseppe Rizzo

Page 2:

What is a Named Entity Recognition task?

A task that aims to locate and classify, in a textual document, the names of persons, organizations, locations, brands and products, as well as numeric expressions such as times, dates, monetary amounts and percentages

22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 2

Page 3:

Example

“I want to book a room in an hotel located in the heart of Paris, just a stone’s throw from the Eiffel Tower”


Eric Charton, “Named Entity Detection and Entity Linking in the Context of Semantic Web: Exploring the ambiguity question”

Page 4:

Part of Speech

I      PRP
want   VBP
to     TO
book   VB
a      DT
room   NN
in     IN
…      …
Paris  NNP

NER: What is Paris?
NEL: Which Paris are we talking about?

Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding”

Page 5:

What is Paris? Type Ambiguity


dbpedia-owl:Asteroid schema:City schema:Movie dbpedia-owl:Film

Page 6:

Named Entity Recognition (NER)

I      PRP  O
want   VBP  O
to     TO   O
book   VB   O
a      DT   O
room   NN   O
in     IN   O
…      …    …
Paris  NNP  LOC

Page 7:

What is Paris? Name Ambiguity


Paris, Kentucky Paris, Maine Paris, Tennessee

Paris, France Paris, Idaho Paris, Ontario

Page 8:

Named Entity Linking (NEL)

I      PRP  O    O
want   VBP  O    O
to     TO   O    O
book   VB   O    O
a      DT   O    O
room   NN   O    O
in     IN   O    O
…      …    …    …
Paris  NNP  LOC  http://dbpedia.org/resource/Paris
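
The layered annotation on this slide (token, part-of-speech tag, entity type, link) can be sketched as a small data structure. This is only an illustration: the `Token` class and the `entities` helper are hypothetical names and are not part of NERD.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str
    pos: str                    # part-of-speech tag, e.g. "NNP"
    ner: str = "O"              # entity type, "O" when outside any entity
    uri: Optional[str] = None   # link target, None when the token is not linked


def entities(tokens: List[Token]) -> List[Token]:
    """Return the tokens that carry an entity type (NER label other than "O")."""
    return [t for t in tokens if t.ner != "O"]


sentence = [
    Token("I", "PRP"),
    Token("want", "VBP"),
    Token("to", "TO"),
    Token("book", "VB"),
    Token("a", "DT"),
    Token("room", "NN"),
    Token("in", "IN"),
    Token("Paris", "NNP", ner="LOC", uri="http://dbpedia.org/resource/Paris"),
]

for token in entities(sentence):
    print(token.text, token.ner, token.uri)
```

NER fills the `ner` column and NEL fills the `uri` column, so the two tasks can be run and evaluated separately on the same token sequence.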

Page 9:

NER Tools and Web APIs

Standalone software: GATE, Stanford CoreNLP, Temis

Web APIs

http://nerd.eurecom.fr/

Page 10:

NERD: Named Entity Recognition and Disambiguation

Compare performances of NER and NEL tools
Understand strengths and weaknesses of different Web APIs
Adapt NER processing to different contexts
(Learn how to) combine NER (/ NEL) tools
Participate in various benchmarks

Page 11:

What is NERD?

ontology: http://nerd.eurecom.fr/ontology
REST API: http://nerd.eurecom.fr/api/application.wadl
UI: http://nerd.eurecom.fr

Page 12:

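Factual comparison of 10 Web NER tools

Alchemy API: language EN, FR, GR, IT, PT, RU, SP, SW; granularity OEN; entity position N/A; classification schema Alchemy; 324 classes; response format JSON, Microformat, XML, RDF; quota 30000 calls/day
DBpedia Spotlight: language EN, GR*, PT*, SP*; granularity OEN; entity position char offset; classification schema DBpedia, FreeBase, Schema.org; 320 classes; response format HTML, JSON, RDF, XML; quota unlimited
Evri: language EN, IT; granularity OED; entity position N/A; classification schema Evri; 5 classes; response format HTML, JSON, RDF; quota 3000 calls/day
Extractiv: language EN; granularity OEN; entity position word offset; classification schema DBpedia; 34 classes; response format HTML, JSON, RDF, XML; quota 3000 calls/day
Lupedia: language EN, FR, IT; granularity OEN; entity position range of chars; classification schema DBpedia, LinkedMDB; 319 classes; response format HTML, JSON, RDFa, XML; quota unlimited
Open Calais: language EN, FR, SP; granularity OEN; entity position char offset; classification schema Open Calais; 95 classes; response format JSON, Microformat; quota 50000 calls/day
Saplo: language EN, SW; granularity OED; entity position N/A; classification schema N/A; 5 classes; response format JSON; quota 1333 calls/day
Wikimeta: language EN, FR, SP; granularity OEN; entity position POS offset; classification schema ESTER; 7 classes; response format JSON, XML; quota unlimited
Yahoo!: language EN; granularity OEN; entity position range of chars; classification schema Yahoo; 13 classes; response format JSON, XML; quota 5000 calls/day
Zemanta: language EN; granularity OED; entity position N/A; classification schema FreeBase; 81 classes; response format XML, JSON, RDF; quota 10000 calls/day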

Page 13:

NERD Ontology

Aligned the taxonomies used by the extractors

Page 14:

Building the NERD Ontology

NERD type      Occurrence
Person         10
Organization   10
Country        6
Company        6
Location       6
Continent      5
City           5
RadioStation   5
Album          5
Product        5
...            ...

Page 15:

NERD REST API

GET, POST, PUT, DELETE

/document /user /annotation/{extractor} /extraction /evaluation ...

JSON

"entities": [{
    "entity": "Tim Berners-Lee",
    "type": "Person",
    "uri": "http://dbpedia.org/resource/Tim_berners_lee",
    "nerdType": "http://nerd.eurecom.fr/ontology#Person",
    "startChar": 30,
    "endChar": 45,
    "confidence": 1,
    "relevance": 0.5
}]
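
A client consuming this response could recover the mention from the character offsets. The sketch below is only an illustration under stated assumptions: the sample sentence is made up (chosen so that the name starts at character 30), the fragment is wrapped in braces to make it valid JSON, `endChar` is assumed to be exclusive (which matches the 15-character name on the slide), and `surface_form` is a hypothetical helper.

```python
import json

# Assumed sample text; chosen so that "Tim Berners-Lee" starts at character 30.
text = "The World Wide Web creator is Tim Berners-Lee."

# Response fragment from the slide, wrapped in braces to make it valid JSON.
response = json.loads("""
{
  "entities": [{
    "entity": "Tim Berners-Lee",
    "type": "Person",
    "uri": "http://dbpedia.org/resource/Tim_berners_lee",
    "nerdType": "http://nerd.eurecom.fr/ontology#Person",
    "startChar": 30,
    "endChar": 45,
    "confidence": 1,
    "relevance": 0.5
  }]
}
""")


def surface_form(document: str, entity: dict) -> str:
    """Slice the mention out of the document, assuming endChar is exclusive."""
    return document[entity["startChar"]:entity["endChar"]]


for entity in response["entities"]:
    mention = surface_form(text, entity)
    print(mention, "->", entity["nerdType"], entity["uri"])
```

Comparing the sliced mention with the `entity` field is a cheap sanity check that the offsets refer to the same text that was submitted.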

Rizzo G., Troncy R. (2012), NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Web Extraction Tools. In: European chapter of the Association for Computational Linguistics (EACL'12), Avignon, France.

RDF

Page 16:

NERD meets NIF

Model documents through a set of strings dereferenceable on the Web

:offset_23107_23110 a str:String ;
    str:referenceContext :offset_0_26546 .

Map string to entity

:offset_23107_23110 sso:oen dbpedia:W3C .

Classification

dbpedia:W3C rdf:type nerd:Organization .
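
The three statements on this slide follow a regular pattern, so they can be generated from a mention's offsets. The sketch below only formats strings; `nif_triples` is a hypothetical helper, and the prefixes (`:`, `str:`, `sso:`, `dbpedia:`, `nerd:`, `rdf:`) are assumed to be declared elsewhere in the Turtle document.

```python
def nif_triples(start: int, end: int, doc_length: int, entity: str, nerd_type: str) -> str:
    """Build the three Turtle statements from the slide for one mention.

    Prefix declarations are assumed to exist elsewhere; this only formats text.
    """
    mention = f":offset_{start}_{end}"       # offset-based string identifier
    context = f":offset_0_{doc_length}"      # the whole document as a string
    return "\n".join([
        f"{mention} a str:String ; str:referenceContext {context} .",
        f"{mention} sso:oen dbpedia:{entity} .",
        f"dbpedia:{entity} rdf:type nerd:{nerd_type} .",
    ])


print(nif_triples(23107, 23110, 26546, "W3C", "Organization"))
```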

Rizzo G., Troncy R., Hellmann S. and Bruemmer M. (2012), NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. In: Linked Data on the Web (LDOW'12), WWW'12, Lyon, France.

Page 17:

NERD User Dashboard

Page 18:

NERD User Interface

Page 19:

History of NER benchmarks

CoNLL 2003 and CoNLL 2005
    schema (4 types): person, organization, location and miscellaneous

ACE 2004, ACE 2005 and ACE 2007
    schema (7 types): person, organization, location, facility, weapon, vehicle and geo-political entity
    entity recognition, co-reference, find relationships among entities extracted

TAC 2009 (Knowledge Base Track)
    schema (3 types): person, organization and location
    create a knowledge base from the named entities extracted

ETAPE 2012 (Named Entity Task)
    schema: Quaero (7 main types, 32 sub-types)

MSM 2013: tweet corpus!
    schema (4 types): person, organization, location, miscellaneous

Page 20:

ETAPE 2012 challenge

genre          train     dev      test     sources
TV news        7h 40m    1h 40m   1h 40m   BFM Story, Top Questions (LCP)
TV debates     10h 30m   5h 10m   5h 10m   Pile et Face, Ca vous regarde, Entre les lignes (LCP)
TV amusements  -         1h 05m   1h 05m   La place du village (TV8)

                      Train    Dev      Eval
Item length           26h      10h 55m  10h 55m
Nb files              44       15       15
Nb words              290517   91656    115511
Nb Named Entities     46763    14398    13055
Nb unique categories  33       33       33

Page 21:

NERD @ ETAPE (naïve combined strategy)

extraction
    (eA1,tA1,URIA1,siA1,eiA1) (eA2,tA2,URIA2,siA2,eiA2) (eA3,tA3,URIA3,siA3,eiA3) ...
    ...
    (eN1,tN1,URIN1,siN1,eiN1) (eN2,tN2,URIN2,siN2,eiN2)

cleaning

fusion: when at least 2 extractors classify the same entity with different types, we apply a preferred selection order (empirically defined): Wikimeta, AlchemyAPI, OpenCalais, Lupedia
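
The priority-based fusion described on this slide can be sketched as follows. This is a minimal illustration, not the NERD implementation: it assumes that "the same entity" means an identical character span, the `Annotation` type and `fuse` function are hypothetical names, and the sample annotations are made up.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    extractor: str
    entity: str
    entity_type: str
    uri: str
    start: int
    end: int


# Preferred selection order from the slide (most preferred first).
PRIORITY = ["Wikimeta", "AlchemyAPI", "OpenCalais", "Lupedia"]


def rank(annotation: Annotation) -> int:
    """Lower rank wins; extractors missing from PRIORITY are ranked last."""
    if annotation.extractor in PRIORITY:
        return PRIORITY.index(annotation.extractor)
    return len(PRIORITY)


def fuse(annotations: List[Annotation]) -> List[Annotation]:
    """Keep one annotation per (start, end) span, resolving conflicts by priority."""
    by_span: Dict[Tuple[int, int], List[Annotation]] = defaultdict(list)
    for annotation in annotations:
        by_span[(annotation.start, annotation.end)].append(annotation)
    return [min(by_span[span], key=rank) for span in sorted(by_span)]


# Made-up extractor outputs for illustration only.
annotations = [
    Annotation("Lupedia", "Paris", "Person", "http://dbpedia.org/resource/Paris_Hilton", 0, 5),
    Annotation("AlchemyAPI", "Paris", "City", "http://dbpedia.org/resource/Paris", 0, 5),
    Annotation("OpenCalais", "Eiffel Tower", "Facility", "", 20, 32),
]

for annotation in fuse(annotations):
    print(annotation.extractor, annotation.entity, annotation.entity_type)
```

In this toy input the two annotations for "Paris" disagree on the type, so the one from the extractor that comes first in the selection order is kept; the uncontested annotation passes through unchanged.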

Page 22:

Participation at ETAPE (combined+ strategy)

(eA1,tA1,URIA1,siA1,eiA1) (eA2,tA2,URIA2,siA2,eiA2) ...
(eN1,tN1,URIN1,siN1,eiN1) (eN2,tN2,URIN2,siN2,eiN2)

ETAPE Train & Dev
Learned model
Created static rules
POS tagger
Apply rules

fusion: conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia

(e1,t1,URI1,si1,ei1)

Page 23:

NERD Global results

            SLR      Precision  Recall  F-measure  %correct
combined    86.85%   35.31%     17.69%  23.44%     17.69%
combined+   188.81%  15.13%     28.40%  19.45%     28.40%

Combined+: the Eval corpus differs substantially from the Train and Dev corpora. The static rules do not fit the Eval corpus well and introduce classification noise.

Page 24:

Per-extractor results

                   SLR      Precision  Recall  F-measure  %correct
alchemyapi         37.71%   47.95%     5.45%   9.68%      5.45%
lupedia            39.49%   22.87%     1.56%   2.91%      1.56%
opencalais         37.47%   41.69%     3.53%   6.49%      3.53%
wikimeta           36.67%   19.40%     4.25%   6.95%      4.25%
combined (nerd)    86.85%   35.31%     17.69%  23.44%     17.69%
combined+ (nerd+)  188.81%  15.13%     28.40%  19.45%     28.40%

Page 25:

Page 26:

Learning How to Combine NER Extractors

Page 27:

NERD on CoNLL 2003 (NER task)

Page 28:

NERD on MSM 2013 (NER task)

Page 29:

NERD on MSM 2013 (NEL task)

Page 30:

Media Fragment Enricher: http://mfe.synote.org/mfe/

Page 31:

Linking pieces of knowledge

Page 32:

Linking pieces of knowledge

Page 33:

Named Entities for Video Classification

Page 34:

Workflow

Components:
Media Fragment Enricher Services
Media Fragment Enricher UI
Metadata & timed-text
NERD Client
RDFizator
Triple Store
Categorization
Video and metadata preview
Video replay with subtitles and aligned NEs

Steps:
1: Video URL
2: Metadata
3: Metadata
4: NERDify
5: Timed Text
6: NEs with time alignment (json)
7: RDFize (ttl)
8: Generate Category
9: SPARQL query
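
Step 6 attaches a time span to each named entity found in the timed text. The slides do not show how this is done, so the following is only one possible sketch: it assumes subtitle cues are joined into a single document before extraction and that each entity comes back with a character offset into that document. The `Cue` alias, `build_document`, `time_of` and the sample cues are all hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, subtitle text)


def build_document(cues: List[Cue]) -> Tuple[str, List[int]]:
    """Join cue texts with single spaces and record each cue's start offset."""
    offsets, parts, position = [], [], 0
    for _, _, text in cues:
        offsets.append(position)
        parts.append(text)
        position += len(text) + 1  # +1 for the joining space
    return " ".join(parts), offsets


def time_of(start_char: int, cues: List[Cue], offsets: List[int]) -> Tuple[float, float]:
    """Return the (start, end) time of the cue that contains the character offset."""
    index = bisect_right(offsets, start_char) - 1
    start, end, _ = cues[index]
    return start, end


# Made-up cues for illustration only.
cues = [
    (840.0, 870.0, "Welcome to the cafe."),
    (870.0, 900.0, "Casablanca is hot today."),
]
document, offsets = build_document(cues)
print(time_of(document.index("Casablanca"), cues, offsets))
```

Recording the cue offsets once and using a binary search keeps the lookup cheap even when a long transcript yields many entities.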

Page 35:

Channel signature based on NE distribution

Page 36:

Page 37:

LinkedTV: automatic annotations ...

Page 38:

... and enrichment for hypervideos

[Figure labels: CONCEPT IN PLAYER; FACETS / PROPERTIES OF CONCEPT; CONTENT ENRICHMENT; Cubism, Expressionism, Fauvism]

Page 39:

Media Fragments and Annotations

nerd:Location Cafe Rick

nerd:Person H. Bogart

nerd:Person I. Bergman

nerd:Location Casablanca

Media Fragment URI 1.0 Chapters Scenes Shots etc…

http://data.linkedtv.eu/media/e2899e7f#t=840,900
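
The `#t=840,900` suffix of this URI is a temporal media fragment: it addresses the part of the video between second 840 and second 900. A minimal sketch of reading it with the Python standard library follows; `temporal_fragment` is a hypothetical helper that only handles the plain `t=start,end` form in seconds, not open-ended ranges or clock-time notations.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def temporal_fragment(uri: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds from a '#t=start,end' fragment, or None."""
    fragment = urlparse(uri).fragment
    values = parse_qs(fragment).get("t")
    if not values:
        return None
    start, _, end = values[0].partition(",")
    if not start or not end:
        return None  # open-ended forms such as t=840 or t=,900 are not handled here
    return float(start), float(end)


print(temporal_fragment("http://data.linkedtv.eu/media/e2899e7f#t=840,900"))
```

Annotations such as `nerd:Location Casablanca` can then be attached to the fragment URI rather than to the whole video.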

Page 40:

Enrichment and Hypervideos

nerd:Location Cafe Rick

nerd:Person H. Bogart

nerd:Person I. Bergman

nerd:Location Casablanca

nerd:Person E. Tierney

nerd:Location China

Page 41:

Media Fragment + Open Annotation + NERD

[Diagram labels: Locator, MediaResource, MediaFragment, Annotation, Entity, URL (hyperlink), Type, OffsetBasedString]

Page 42:

Towards a Linked Media Layer

Enriching media with media from a closed collection (e.g. BBC archive)
    The MediaEval scenario (~1697 hours of archived BBC video)
    http://www.multimediaeval.org/mediaeval2013/hyper2013/

Enriching media with content from the open web
    LinkedTV scenarios: white-listed web sites for each program
    Media Collector for Social Media

Page 43:

Seed video enriched with web content rbbaktuell_20120809

nerd:Location Brandenburg

oa

Page 44:

Enrichments are Annotations too

Page 45:

Media Finder (named entities clustering)

Page 46:

Media Finder (zooming in a cluster)

Page 47:

Media Finder: http://mediafinder.eurecom.fr/

Live Topic Generation from Event Streams, WWW 2013 Demo Session: http://www.youtube.com/watch?v=8iRiwz7cDYY

Page 48:

Credits

Giuseppe Rizzo, Vuk Milicic, José Luis Redondo Garcia (EURECOM)

Thomas Steiner (Google Inc.)

Marieke van Erp (Free University of Amsterdam)

Yunjia Li (University of Southampton)

… and many other students

Page 49:

http://www.slideshare.net/troncy
