Fine-Grained Geographical Relation Extraction from Wikipedia Andre Blessing and Hinrich Schütze
1/20 IMS Universität Stuttgart
Fine-Grained Geographical Relation Extraction from Wikipedia
André Blessing
Hinrich Schütze
University of Stuttgart
Institute for Natural Language Processing (IMS)
Overview
• motivation
• why are fine-grained relations important?
• self-annotation
• automatic annotation using structured data
• use this annotation to train a classifier
• extraction framework
• evaluation and conclusion
Geographical data provider
• GeoNames
• gazetteer
• names, type, coordinates
• 8 million entries
• 2.6 million populated places
• community-based
• Creative Commons Attribution 3.0 License
• Free to share
GeoNames
GeoNames – hierarchical types
Name                    German name + sample                         Description
ADM1                    Bundesland (Rheinland-Pfalz)                 State in the United States, a primary administrative division of a country
ADM2                    Regierungsbezirk                             a subdivision of a first-order administrative division
ADM3                    Landkreis (Bad Kreuznach)                    County, a subdivision of a second-order administrative division
ADM4                    Gemeinde (Gebroth)                           Municipality, a subdivision of a third-order administrative division
PPL (populated place)   Stadt-, Ortsteil (Stuttgart Bad Cannstatt)   Suburb, a subdivision of a fourth-order administrative division
GeoNames – missing hierarchical relations
Task Definition
• relation definition
• R1_2
• ADM3-ADM4
• Landkreis (county) - Gemeinde (municipality)
• R0_1
• ADM4-PPL
• Gemeinde (municipality) - Ortsteil (suburb)
• task
• classify all possible binary relations between the named entities in one sentence
Example - binary relations between all NEs
• Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).
• Gebroth is a municipality in the county Bad Kreuznach in Rheinland-Pfalz (Germany).
• binary relations between NEs
• (Gebroth, Bad Kreuznach) element of R1_2
• (Gebroth, Rheinland-Pfalz)
• (Gebroth, Deutschland)
• (Bad Kreuznach, Rheinland-Pfalz)
• (Bad Kreuznach, Deutschland)
• (Rheinland-Pfalz, Deutschland)
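The candidate generation in this example can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `candidate_pairs`, the `gold` dictionary, and the label `"none"` for pairs outside R1_2 are assumptions made here.

```python
from itertools import combinations

def candidate_pairs(entities):
    """Return every unordered pair of named entities, keeping sentence order."""
    return list(combinations(entities, 2))

# Named entities of the example sentence, in order of appearance.
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]
pairs = candidate_pairs(entities)

# Hypothetical gold annotation: only one pair is marked as R1_2 on the slide.
gold = {("Gebroth", "Bad Kreuznach"): "R1_2"}

# Every candidate pair becomes one classification instance.
labelled = [(pair, gold.get(pair, "none")) for pair in pairs]
for pair, label in labelled:
    print(pair, label)
```

With four named entities the classifier sees six instances, of which one is a positive example for R1_2.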
Requirements for extraction system
• fast to develop
• requested relation types can change
• avoid expensive manual annotation
• fine-grained relation types
• e.g. simple part-of relation is not sufficient
• the trained system needs no structured data
• several input sources (Wikipedia, blogs, twitter, news)
• German data
Wikipedia as resource
• structured data
• templates (e.g. infoboxes), links, categories, tables, lists
• unstructured data
• written text
• high quality
• many users
• WikiBots
• structured data can be used to annotate unstructured data → self-annotation
Self-Annotation - example
article: Gebroth
structured data: Landkreis Bad Kreuznach (county)
unstructured data: Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).
resulting annotation: R1_2(Gebroth, Bad Kreuznach)
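A minimal sketch of the self-annotation idea, assuming the infobox value has already been reduced to the bare name "Bad Kreuznach". The function `self_annotate` and its return format are hypothetical. It uses exact string matching only, which the slides themselves describe as insufficient in practice (orthographic variants, morphology, multi-word expressions).

```python
import re

def self_annotate(article_title, infobox_value, sentence, relation="R1_2"):
    """Return a relation annotation if both names occur in the sentence, else None."""
    spans = {}
    for name in (article_title, infobox_value):
        # Exact, whole-word match; real data needs more tolerant matching.
        match = re.search(r"\b" + re.escape(name) + r"\b", sentence)
        if match is None:
            return None
        spans[name] = match.span()
    return {
        "relation": relation,
        "arg1": (article_title, spans[article_title]),
        "arg2": (infobox_value, spans[infobox_value]),
    }

sentence = ("Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach "
            "in Rheinland-Pfalz (Deutschland).")
annotation = self_annotate("Gebroth", "Bad Kreuznach", sentence)
print(annotation)
```

The character spans mark where the two arguments occur in the text, so the sentence can serve as a training example without manual labelling.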
Self-annotation - challenges
• infoboxes are not always filled in completely, correctly, or coherently
• matching with unstructured data
• pattern matching not sufficient
• orthographic variants
• morphology
• multi-word expressions
• matching needs some manual adjustment
• only one relation per article
Extraction framework
• UIMA (Unstructured Information Management Architecture)
• pipeline architecture
• easy exchange of components
• fast development
• extended components
• CollectionReader for Wikipedia
• linguistic annotation
• supervised classifier
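The pipeline idea (components that each add information to a shared document, and that can be swapped independently) can be illustrated with a small Python sketch. This is not the UIMA API, which is Java-based; all names below (`Document`, `run_pipeline`, the three components) are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]                 # stand-in for a shared analysis structure
Component = Callable[[Document], Document]

def whitespace_tokenizer(doc: Document) -> Document:
    doc["tokens"] = doc["text"].split()
    return doc

def lowercase_tokenizer(doc: Document) -> Document:
    doc["tokens"] = doc["text"].lower().split()
    return doc

def token_counter(doc: Document) -> Document:
    # Depends only on the "tokens" entry, not on which tokenizer produced it.
    doc["n_tokens"] = len(doc["tokens"])
    return doc

def run_pipeline(text: str, components: List[Component]) -> Document:
    doc: Document = {"text": text}
    for component in components:
        doc = component(doc)
    return doc

first = run_pipeline("Gebroth ist eine Ortsgemeinde",
                     [whitespace_tokenizer, token_counter])
# Exchanging one component leaves the rest of the pipeline untouched.
second = run_pipeline("Gebroth ist eine Ortsgemeinde",
                      [lowercase_tokenizer, token_counter])
print(first["tokens"], second["tokens"])
```

Because each component only reads and writes named entries of the document, replacing the tokenizer does not require changes to the counter.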
Extraction pipeline
[Figure: extraction pipeline. Labels shown: German Wikipedia, GeoNames, JWPL, CollectionReader (structured data, unstructured text), CollectionReader (text), UIMA Pipeline, Self-Annotation, FSPar-Annotator, FSPar-Engine, ClearTK, MaxEnt-Classifier, Consumer]
Linguistic processing
FSPar engine (Schiehlen 2003)
tokenizer
PoS-tagger (based on TreeTagger)
chunker
partial dependency parser
Token          PoS      Lemma
Gebroth        NE       Gebroth
ist            VAFIN    seinA
eine           ART      ein
Ortsgemeinde   NN       Orts#@gemeinde
im             APPART   in
Supervised classification
• extended ClearTK-Annotator
• feature sets
• F0: NE distance (baseline)
• F1: Window-based (pos, lemma, size=2)
• F2: chunks (parent chunks of NEs)
• F3: dependency parse (paths between NEs)
• MaxEntClassifier
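The two simplest feature sets can be sketched as follows. This is an illustration under stated assumptions, not the ClearTK-based extractor from the talk: the function names, the feature key format, the definition of distance as the number of intervening tokens, and the tags and lemmas for "Landkreis" and "Bad Kreuznach" are all made up here. The first five token rows follow the FSPar example from the linguistic-processing slide.

```python
def f0_distance(tokens, i, j):
    """F0: number of tokens between the two named entities at positions i < j."""
    return {"distance": j - i - 1}

def f1_window(tokens, i, j, size=2):
    """F1: PoS tags and lemmas in a window of `size` tokens around each NE."""
    feats = {}
    for name, pos in (("ne1", i), ("ne2", j)):
        for offset in range(-size, size + 1):
            k = pos + offset
            if offset == 0 or not 0 <= k < len(tokens):
                continue  # skip the NE itself and positions outside the sentence
            _, tag, lemma = tokens[k]
            feats[f"{name}_pos_{offset:+d}"] = tag
            feats[f"{name}_lemma_{offset:+d}"] = lemma
    return feats

# (token, PoS, lemma); the last two rows are hypothetical.
tokens = [
    ("Gebroth", "NE", "Gebroth"),
    ("ist", "VAFIN", "seinA"),
    ("eine", "ART", "ein"),
    ("Ortsgemeinde", "NN", "Orts#@gemeinde"),
    ("im", "APPART", "in"),
    ("Landkreis", "NN", "Landkreis"),
    ("Bad Kreuznach", "NE", "Bad Kreuznach"),
]

features = {**f0_distance(tokens, 0, 6), **f1_window(tokens, 0, 6)}
print(features)
```

Each candidate pair yields one such feature dictionary, which a maximum-entropy classifier can consume after the usual one-hot encoding of the string-valued features.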
Evaluation
• 9000 articles about German municipalities and suburbs
• 5300 articles for training
• 1800 articles for development
• 1800 articles for final evaluation
• R1_2 relation is also available from the Federal Statistical Office of Germany
• used to evaluate the self-annotation
• 99.9% correct (1 error in 1304 sentences)
Results
Classifier   Features       Precision   Recall   FP    FN
1            F0             79.0%       55.7%    279   833
2            F0+F1          92.4%       89.3%    138   202
3            F0+F2          90.2%       89.5%    182   198
4            F0+F3          97.7%       97.4%    43    48
5            F0+F1+F2+F3    98.8%       97.8%    23    41

Feature set   Linguistic effort   Description
F0            None                Distance + NE position
F1            PoS-Tagging         Window-based (size=2, PoS, lemma)
F2            Chunk-parse         Parent chunk
F3            Dependency-parse    Dependency paths between NEs
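As a consistency check, precision and recall can be recomputed from the FP and FN counts. The slides do not state the number of positive instances in the evaluation set. The value 1882 used below is an assumption inferred here, chosen because it reproduces all reported percentages; it should not be read as a figure from the talk.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Assumption (not stated on the slides): number of gold positive instances.
GOLD_POSITIVES = 1882

# Feature combination -> (FP, FN), copied from the results table.
errors = {
    "F0": (279, 833),
    "F0+F1": (138, 202),
    "F0+F2": (182, 198),
    "F0+F3": (43, 48),
    "all (F0 to F3)": (23, 41),
}

scores = {}
for name, (fp, fn) in errors.items():
    tp = GOLD_POSITIVES - fn          # true positives = positives that were not missed
    p, r = precision_recall(tp, fp, fn)
    scores[name] = (f"{p:.1%}", f"{r:.1%}")
    print(f"{name}: precision {scores[name][0]}, recall {scores[name][1]}")
```

Under this assumption the baseline F0 finds 1049 of the positive instances, while the full feature set finds 1841.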
Conclusion
• text is an important resource for context-aware systems
• self-annotation
• automatic annotation using structured data
• Wikipedia is a valuable resource
• structured and unstructured data
• containing fine-grained relations
• UIMA-based implementation
• fine-grained geographical relation extraction is possible