Learning with the Web. Structuring data to ease machine understanding
DESCRIPTION
Talk given at the Università di Torino, Turin, Italy - July 11, 2013
TRANSCRIPT
Learning with the Web
Structuring data to ease machine understanding
http://twitter.com/giusepperizzo
July 11th, 2013 Università di Torino, Italy 2/44
Google Knowledge Graph Viewer
Google Knowledge Graph
The Google Knowledge Graph bulk: encyclopedic sources
The Web community has highlighted the road, but ...
Vast wealth of unstructured data
“80% of data on the Web and on internal corporate intranets is unstructured”
“Semantic Web and Information Extraction Workshop”, SWAIE at RANLP2013
The entire digital universe is going to be part of the Web
“unstructured data will account for 90 percent of all data created in the next decade”
IDC IVIEW, “Extracting Value from Chaos”, June 2011
Structured means making those resources available to be easily processed by machines
A Web of Linked Entities
http://wole2013.eurecom.fr
http://wole2012.eurecom.fr
➢ GGG (Giant Global Graph) http://goo.gl/fH3h
➢ Nodes are Web entities
➢ Entities provide disambiguation pointers
➢ Entities can be unambiguously referred to (disambiguated)
➢ Entities as centroids for topic generation and understanding
Chapter 1: Named Entity Recognition (NER) and Named Entity Linking (NEL)
I want to book a room in an hotel located in the heart of Paris, just a stone’s throw from the Eiffel Tower
Eric Charton, “Named Entity Detection and Entity Linking in the Context of Semantic Web: Exploring the ambiguity question”
Part of Speech
Token   I    want  to  book  a   room  in  ..  Paris
POS     PRP  VBP   TO  VB    DT  NN    IN  ..  NNP
NER: What is Paris?
NEL: Which Paris are we talking about?
What is Paris? Type ambiguity
asteroid, location/city, film
Entity recognition
Token   I    want  to  book  a   room  in  ..  Paris
POS     PRP  VBP   TO  VB    DT  NN    IN  ..  NNP
NER     O    O     O   O     O   O     O   ..  LOC
NER: State of the art
➢ CRFs (Conditional Random Fields)
➢ FSM (Finite-State Machine)
➢ HMM (Hidden Markov Model)
➢ Gazetteers
  ➢ Wikipedia/DBpedia
  ➢ In-house dictionaries
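As a minimal sketch of the gazetteer approach listed above: look up token spans in a small dictionary and emit one label per token, with O for tokens outside any entity, as on the Entity recognition slide. The gazetteer entries and the function name `tag_with_gazetteer` are illustrative assumptions, not part of the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical toy gazetteer: lowercased surface form (as a token tuple) -> type.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("paris",): "LOC",
    ("eiffel", "tower"): "LOC",
}


def tag_with_gazetteer(tokens: List[str],
                       gazetteer: Dict[Tuple[str, ...], str] = GAZETTEER) -> List[str]:
    """Label each token with an entity type, or "O" when no entry matches.

    Scans left to right; at each position the longest matching entry wins.
    """
    max_len = max((len(key) for key in gazetteer), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in gazetteer:
                labels[i:i + n] = [gazetteer[span]] * n
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


tokens = "I want to book a room in Paris".split()
labels = tag_with_gazetteer(tokens)
```

A plain lookup like this cannot tell the city from the film or the asteroid, which is presumably why statistical models such as CRFs and HMMs appear on the same list.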
Which Paris? Name ambiguity
➢ Paris, Kentucky
➢ Paris, Maine
➢ Paris, Tennessee
➢ Paris, France
➢ Paris, Ontario
➢ Paris, Idaho
Entity linking
Token   I    want  to  book  a   room  in  ..  Paris
POS     PRP  VBP   TO  VB    DT  NN    IN  ..  NNP
NER     O    O     O   O     O   O     O   ..  LOC
NEL     O    O     O   O     O   O     O   ..  http://en.wikipedia.org/wiki/Paris
Ambiguity resolution: linking to an external knowledge base
➢ Wikipedia/DBpedia
➢ Gigaword Corpus
➢ In-house dataset
➢ LOD dataset
  ➢ DBLP
  ➢ ACM
  ➢ BBC
  ➢ ...
NEL: State of the art
➢ Clustering
➢ Vector Space Model (Cosine similarity or Maximum Entropy)
  – it requires a priori knowledge of the spotted entities
➢ Conditional probability
  – it requires a priori knowledge of the spotted entities
➢ Dictionaries
  ➢ Wikipedia/DBpedia
  ➢ In-house dataset
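The vector space model with cosine similarity can be sketched as follows: the context of the mention and a textual description of each candidate are turned into bag-of-words vectors, and the candidate closest to the context wins. The candidate descriptions below are toy text written for this illustration, and the helper names (`bag_of_words`, `cosine`, `link`) are hypothetical. The need for a candidate list up front is the "a priori knowledge of the spotted entities" noted above.

```python
import math
from collections import Counter
from typing import Dict


def bag_of_words(text: str) -> Counter:
    """Lowercased term-frequency vector; punctuation handling is deliberately naive."""
    return Counter(text.lower().replace(",", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def link(context: str, candidates: Dict[str, str]) -> str:
    """Return the candidate whose description is most similar to the context."""
    ctx = bag_of_words(context)
    return max(candidates, key=lambda name: cosine(ctx, bag_of_words(candidates[name])))


# Toy candidate descriptions (assumed for illustration only).
CANDIDATES = {
    "Paris, France": "capital city of France on the Seine, home of the Eiffel Tower",
    "Paris, Kentucky": "city in Bourbon County, Kentucky, United States",
}

context = "a hotel located in the heart of Paris, just a stone's throw from the Eiffel Tower"
best = link(context, CANDIDATES)
```

In practice the descriptions would come from a knowledge base such as Wikipedia/DBpedia, and raw term counts would usually be replaced by a weighting scheme such as TF-IDF.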
Processing natural language texts
➢ Several attempts from the Web community to structure the large wealth of data available
➢ Numerous off-the-shelf systems (commercial and academic) that perform the NER+NEL chain
  ➢ AlchemyAPI
  ➢ DBpedia Spotlight
  ➢ Wikimeta
  ➢ TextRazor
  ➢ Stanford CRF
  ➢ ...
The NERD initiative
http://nerd.eurecom.fr
Combination of off-the-shelf systems and properly trained CRFs
The strength of this approach lies in the fact that the supported off-the-shelf systems have access to large knowledge bases of entities such as DBpedia and Freebase, while CRFs are domain-specific
Diversity
Extractor           Classification schema            Number of classes
AlchemyAPI          Alchemy                          324
DBpedia Spotlight   DBpedia, FreeBase, Schema.org    320
Extractiv           Extractiv                        34
Lupedia             DBpedia, LinkedMDB               319
OpenCalais          OpenCalais                       95
Saplo               Saplo                            5
SemiTags            CoNLL-3                          4
Wikimeta            ESTER                            7
Yahoo!              Yahoo                            13
Zemanta             FreeBase                         81
NERD Ontology
NERD type       Occurrence
Person          10
Organization    10
Country         6
Company         6
Location        6
Continent       5
City            5
RadioStation    5
Album           5
Product         5
...             ...
The NERD ontology has been integrated into the NIF project, an EU FP7 effort in the context of LOD2: Creating Knowledge out of Interlinked Data
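A shared ontology is useful because each extractor reports types in its own schema. A minimal sketch of the idea is a lookup table from (extractor, native type) to a NERD type. The alignments and the fallback type below are hypothetical rows chosen to show the shape of such a table; they are not the mappings defined by the NERD ontology itself.

```python
from typing import Dict, Tuple

# Hypothetical alignments: (extractor, native type) -> NERD type.
# The real NERD ontology defines its own mappings; these rows only show the shape.
ALIGNMENTS: Dict[Tuple[str, str], str] = {
    ("alchemyapi", "City"): "City",
    ("dbpedia spotlight", "DBpedia:Person"): "Person",
    ("opencalais", "Company"): "Company",
    ("semitags", "LOC"): "Location",
}

FALLBACK_TYPE = "Thing"  # assumed catch-all for types with no alignment


def to_nerd_type(extractor: str, native_type: str) -> str:
    """Map an extractor-specific type to a NERD type, or the fallback if unknown."""
    return ALIGNMENTS.get((extractor.lower(), native_type), FALLBACK_TYPE)


nerd_type = to_nerd_type("SemiTags", "LOC")
```

Once every output is expressed in the same type vocabulary, annotations from different extractors can be compared and combined.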
Learning with the Web
➢ FSM-core based
  ➢ combination of the NERD supported off-the-shelf systems
➢ ML-core based
  ➢ combination of the NERD supported off-the-shelf systems
    – and a CRF, properly trained with the given corpus
Challenges and benchmark
ETAPE 2012 - Entity Extraction Challenge
➢ French transcripts of radio and video programs
➢ Challenge objective: entity typing
➢ Submitted system:
  ➢ FSM-core based
  ➢ Annotation priority given to the systems that have fine-grained classification schemes
➢ Ranked 7th/7
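One possible reading of the priority rule above, sketched under assumptions: when several extractors annotate the same span, keep the annotation from the extractor whose schema has the most classes. The class counts are read positionally from the Diversity slide; the `Annotation` type and `merge_by_granularity` function are hypothetical names, and the submitted system may have resolved conflicts differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Number of classes per extractor, read from the Diversity slide (subset).
SCHEMA_SIZE: Dict[str, int] = {
    "AlchemyAPI": 324,
    "DBpedia Spotlight": 320,
    "OpenCalais": 95,
    "Wikimeta": 7,
}


@dataclass(frozen=True)
class Annotation:
    start: int      # character offset where the mention starts
    end: int        # character offset just past the mention
    label: str      # type assigned by the extractor
    extractor: str  # name of the system that produced the annotation


def merge_by_granularity(annotations: List[Annotation]) -> List[Annotation]:
    """Keep one annotation per exact span, preferring the finest-grained schema."""
    best: Dict[Tuple[int, int], Annotation] = {}
    for ann in annotations:
        key = (ann.start, ann.end)
        current = best.get(key)
        # Unknown extractors count as size 0; ties keep the first annotation seen.
        if current is None or (SCHEMA_SIZE.get(ann.extractor, 0)
                               > SCHEMA_SIZE.get(current.extractor, 0)):
            best[key] = ann
    return sorted(best.values(), key=lambda a: a.start)


anns = [
    Annotation(25, 30, "LOC", "Wikimeta"),
    Annotation(25, 30, "City", "AlchemyAPI"),
]
merged = merge_by_granularity(anns)
```

The sketch only resolves conflicts between identical spans; partially overlapping mentions would need an additional rule.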
#MSM'13 - Concept Extraction Challenge
➢ English Twitter microposts
➢ Challenge objective: entity typing
➢ Submitted system:
  ➢ ML-core based: SVM
  ➢ Features = linguistic features (among them capitalization, 3-character prefix and suffix, POS), output of a CRF properly trained with the challenge training dataset, outputs of the off-the-shelf systems
➢ Ranked 2nd/22
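The feature set described above can be sketched as a per-token feature dictionary. The POS tag, the CRF label and the off-the-shelf outputs are assumed to be computed elsewhere and passed in; the function name `token_features` and the feature keys are hypothetical. Vectorizing the dictionaries and training the SVM are left out.

```python
from typing import Dict


def token_features(token: str, pos: str, crf_label: str,
                   extractor_labels: Dict[str, str]) -> Dict[str, str]:
    """Build the feature dictionary for one token."""
    features = {
        "is_capitalized": str(token[:1].isupper()),
        "prefix3": token[:3],    # first 3 characters
        "suffix3": token[-3:],   # last 3 characters
        "pos": pos,              # part-of-speech tag from an external tagger
        "crf": crf_label,        # label predicted by the separately trained CRF
    }
    # One feature per off-the-shelf system, holding the label it assigned.
    for name, label in extractor_labels.items():
        features[f"extractor={name}"] = label
    return features


feats = token_features("Paris", "NNP", "LOC",
                       {"AlchemyAPI": "City", "Wikimeta": "LOC"})
```

Feeding the extractors' own labels in as features lets the classifier learn which system to trust for which kind of token.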
CoNLL-2003
➢ English newswire corpus
➢ Benchmark objective: entity typing
➢ System:
  ➢ ML-core based: SVM and NB
  ➢ Features = linguistic features (among them capitalization, 3-character prefix, 3-character suffix, POS), output of a CRF properly trained with the challenge training dataset, outputs of the off-the-shelf systems
➢ Results: significantly outperformed all the off-the-shelf systems used as inputs, as well as the Stanford CRF properly trained with the CoNLL-2003 training corpus
TAC KBP 2011
➢ English newswire corpus
➢ Benchmark objective: entity linking
➢ System:
  ➢ FSM-core based
  ➢ Features: outputs of the off-the-shelf systems, harmonized with the Gigaword corpus
ongoing
NERD in action
http://nerd.eurecom.fr/annotation/247957
Chapter 2: Annotating streams of heterogeneous data coming from social platforms for topic generation
The Social Web is growing fast and is becoming crucially important for research and companies
Social Web = Big Data
Gartner “3V” definition: Volume, Velocity, Variety of microposts
Microposts
➢ Short (~140 characters) and informal text
  ➢ Grammar-free text
  ➢ Slang
➢ Media items
  ➢ Picture
  ➢ Video
Can we make sense out of the massive and rapidly changing amount of information shared in the Social Web?
Live topic generation
http://youtu.be/8iRiwz7cDYY
http://mediafinder.eurecom.fr
Tracking and analyzing an event
➢ 1-week period
➢ We collected microposts with attached pictures
➢ We followed the 2013 Italian election
➢ We compared the results with the articles published during those days in major newspapers
http://youtu.be/jIMdnwMoWnk
http://mediafinder.eurecom.fr/story/elezioni2013
Outlook: an entity graph from the open and Social Web
Thanks for your time and attention
http://www.slideshare.net/giusepperizzo
Do you have any questions?