Fine-Grained Geographical Relation Extraction from Wikipedia Andre Blessing and Hinrich Schütze

1/20 IMS Universität Stuttgart

Fine-Grained Geographical Relation Extraction from Wikipedia

André Blessing

Hinrich Schütze

University of Stuttgart

Institute for Natural Language Processing (IMS)

Overview

• motivation

• why are fine-grained relations important?

• self-annotation

• automatic annotation using structured data

• use this annotation to train a classifier

• extraction framework

• evaluation and conclusion

Geographical data provider

• GeoNames

• gazetteer

• names, type, coordinates

• 8 million entries

• 2.6 million populated places

• community-based

• Creative Commons Attribution 3.0 License

• Free to share

GeoNames

GeoNames – hierarchical types

Name | German name (example) | Description
ADM1 | Bundesland (Rheinland-Pfalz) | State in the United States, a primary administrative division of a country
ADM2 | Regierungsbezirk | a subdivision of a first-order administrative division
ADM3 | Landkreis (Bad Kreuznach) | County, a subdivision of a second-order administrative division
ADM4 | Gemeinde (Gebroth) | Municipality, a subdivision of a third-order administrative division
PPL (populated place) | Stadtteil, Ortsteil (Stuttgart Bad Cannstatt) | Suburb, a subdivision of a fourth-order administrative division

GeoNames – missing hierarchical relations

Task Definition

• relation definition

• R1_2

• ADM3-ADM4

• Landkreis (county) - Gemeinde (municipality)

• R0_1

• ADM4-PPL

• Gemeinde (municipality) - Ortsteil (suburb)

• task

• classify all possible binary relations between the named entities in one sentence

Example - binary relations between all NEs

• Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).

• Gebroth is a municipality in the county Bad Kreuznach in Rheinland-Pfalz (Germany).

• binary relations between NEs

• (Gebroth, Bad Kreuznach) element of R1_2

• (Gebroth, Rheinland-Pfalz)

• (Gebroth, Deutschland)

• (Bad Kreuznach, Rheinland-Pfalz)

• (Bad Kreuznach, Deutschland)

• (Rheinland-Pfalz, Deutschland)
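
The candidate pairs listed above can be generated mechanically. The following is a minimal sketch, assuming the named entities of the sentence are already recognized and given in order of appearance; the function name `candidate_pairs` is illustrative and not taken from the slides.

```python
from itertools import combinations

# Named entities of the example sentence, in order of appearance.
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]


def candidate_pairs(nes):
    """Return every unordered pair of named entities from one sentence."""
    return list(combinations(nes, 2))


pairs = candidate_pairs(entities)
for pair in pairs:
    print(pair)
```

Four entities yield six pairs. Each pair is a classification instance, and only (Gebroth, Bad Kreuznach) belongs to R1_2.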

Requirements for extraction system

• fast to develop

• requested relation types can change

• avoid expensive manual annotation

• fine-grained relation types

• e.g. a simple part-of relation is not sufficient

• the trained system needs no structured data

• several input sources (Wikipedia, blogs, Twitter, news)

• German data

Wikipedia as resource

• structured data

• templates (e.g. infoboxes), links, categories, tables, lists

• unstructured data

• written text

• high quality

• many users

• WikiBots

• structured data can be used to annotate unstructured data → self-annotation

Self-Annotation - example

• structured data: Landkreis Bad Kreuznach (county)

• unstructured data: Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).

• resulting annotation for the article Gebroth: R1_2(Gebroth, Bad Kreuznach)
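
The idea of this example can be sketched in a few lines. This is only an illustration under simplifying assumptions: the structured data is represented as a plain dictionary, matching is exact substring matching, and the names `self_annotate`, `field` and `relation` are hypothetical rather than part of the presented system.

```python
def self_annotate(article_title, infobox, sentence,
                  field="Landkreis", relation="R1_2"):
    """Label a sentence with a relation if it mentions both the article
    entity and the value of the structured field; otherwise return None."""
    value = infobox.get(field)
    if value is None:
        return None  # structured data is missing, nothing to annotate
    if article_title in sentence and value in sentence:
        return (relation, article_title, value)
    return None


# Hypothetical dictionary view of the structured data for the article.
infobox = {"Landkreis": "Bad Kreuznach"}
sentence = ("Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach "
            "in Rheinland-Pfalz (Deutschland).")

annotation = self_annotate("Gebroth", infobox, sentence)
print(annotation)
```

Exact substring matching is used here only to keep the sketch short.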

Self-annotation - challenges

• infoboxes are not always completely, correctly, or coherently filled in

• matching with unstructured data

• pattern matching not sufficient

• orthographic variances

• morphology

• multi-word expressions

• matching needs some manual adjustment

• only one relation per article
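
To illustrate why plain pattern matching falls short, here is a hedged sketch of a more tolerant matcher. The normalization rules (umlaut transliteration, treating hyphens as spaces, an optional genitive "s") are assumptions chosen for illustration; the slides do not specify which adjustments were actually made.

```python
import re
import unicodedata


def normalize(name):
    """Reduce a name or sentence to a comparison key."""
    text = unicodedata.normalize("NFC", name).casefold()  # casefold maps ß to ss
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        text = text.replace(src, dst)
    # Treat hyphens, slashes, dots and whitespace runs as a single space.
    text = re.sub(r"[-/.\s]+", " ", text)
    return text.strip()


def mentions(name, sentence):
    """True if the sentence contains the name, tolerating spelling variants
    and an optional genitive 's' directly after the name."""
    key = re.escape(normalize(name))
    pattern = r"(?<!\w)" + key + r"s?(?!\w)"
    return re.search(pattern, normalize(sentence)) is not None


print(mentions("Bad Kreuznach", "Die Gemeinde liegt im Kreis Bad-Kreuznach."))
print(mentions("Gebroth", "Gebroths Nachbargemeinden sind klein."))
```

A rule set like this still needs manual adjustment for real data, which is the point made on this slide.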

Extraction framework

• UIMA (Unstructured Information Management Architecture)

• pipeline architecture

• easy exchange of components

• fast development

• extended components

• CollectionReader for Wikipedia

• linguistic annotation

• supervised classifier
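
The pipeline idea (components that read and extend a shared document, and can be swapped freely) can be shown with a plain-Python sketch. This is not the UIMA API; the `Document` class and the stage functions are hypothetical stand-ins for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Shared container that every pipeline stage reads and extends."""
    text: str
    annotations: dict = field(default_factory=dict)


def tokenize(doc):
    # Whitespace tokenization as a placeholder for real linguistic analysis.
    doc.annotations["tokens"] = doc.text.split()
    return doc


def count_tokens(doc):
    # A later stage can rely on annotations written by an earlier one.
    doc.annotations["n_tokens"] = len(doc.annotations["tokens"])
    return doc


def run_pipeline(doc, stages):
    """Apply the stages in order; any stage can be exchanged independently."""
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(Document("Gebroth ist eine Ortsgemeinde"),
                   [tokenize, count_tokens])
print(doc.annotations)
```

Because stages communicate only through the shared document, replacing one component does not require changes to the others.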

Extraction pipeline

[Figure: UIMA pipeline diagram. Inputs: German Wikipedia (accessed via JWPL) and GeoNames. Components: CollectionReader, FSPar-Annotator (FSPar-Engine), Self-Annotation, MaxEnt-Classifier (ClearTK), Consumer. Data labels: text, unstructured text, structured data.]

Linguistic processing

FSPar engine (Schiehlen 2003)

tokenizer

PoS-tagger (based on TreeTagger)

chunker

partial dependency parser

Token | PoS | Lemma
Gebroth | NE | Gebroth
ist | VAFIN | seinA
eine | ART | ein
Ortsgemeinde | NN | Orts#@gemeinde
im | APPRART | in

Supervised classification

• extended ClearTK-Annotator

• feature sets

• F0: NE distance (baseline)

• F1: window-based (PoS, lemma, size=2)

• F2: chunks (parent chunks of NEs)

• F3: dependency parse (paths between NEs)

• MaxEntClassifier
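
As a rough illustration of the two simplest feature sets, the sketch below computes F0 (token distance between two named entities) and F1 (PoS tags and lemmas in a window of two tokens around each entity). The feature names, the function `extract_features`, the tags and simplified lemmas for the tokens, and the treatment of "Bad Kreuznach" as a single token are all assumptions made for this example, not details from the actual ClearTK annotator.

```python
def extract_features(tokens, i, j, window=2):
    """tokens: list of (token, pos, lemma); i < j index the two named entities."""
    features = {"F0_distance": j - i}
    for label, idx in (("ne1", i), ("ne2", j)):
        for offset in range(-window, window + 1):
            if offset == 0:
                continue  # skip the entity itself
            k = idx + offset
            if 0 <= k < len(tokens):
                _, pos, lemma = tokens[k]
                features[f"F1_{label}_{offset:+d}_pos"] = pos
                features[f"F1_{label}_{offset:+d}_lemma"] = lemma
    return features


# Assumed analysis of the start of the example sentence (simplified lemmas).
tokens = [
    ("Gebroth", "NE", "Gebroth"),
    ("ist", "VAFIN", "sein"),
    ("eine", "ART", "ein"),
    ("Ortsgemeinde", "NN", "Ortsgemeinde"),
    ("im", "APPRART", "in"),
    ("Landkreis", "NN", "Landkreis"),
    ("Bad Kreuznach", "NE", "Bad Kreuznach"),
]

features = extract_features(tokens, 0, 6)
for name, value in sorted(features.items()):
    print(name, value)
```

A dictionary of named features like this is the kind of input a maximum entropy classifier would consume; F2 and F3 would add chunk and dependency-path features in the same way.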

Evaluation

• 9000 articles about German municipalities and suburbs

• 5300 articles for training

• 1800 articles for development

• 1800 articles for final evaluation

• R1_2 relation is also available from the Federal Statistical Office of Germany

• used to evaluate the self-annotation

• 99.9% correct (1 error in 1304 sentences)

Results

Classifier | Features | Precision | Recall | FP | FN
1 | F0 | 79.0% | 55.7% | 279 | 833
2 | F0+F1 | 92.4% | 89.3% | 138 | 202
3 | F0+F2 | 90.2% | 89.5% | 182 | 198
4 | F0+F3 | 97.7% | 97.4% | 43 | 48
5 | F0+F1+F2+F3 | 98.8% | 97.8% | 23 | 41

Features | Linguistic effort | Description
F0 | None | Distance + NE position
F1 | PoS-tagging | Window-based (size=2, PoS, lemma)
F2 | Chunk-parse | Parent chunk
F3 | Dependency-parse | Dependency paths between NEs
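
The slides report precision and recall only. For a single-number comparison, the F1 score (harmonic mean of precision and recall) can be derived from the reported values, as in the following sketch. These F1 values are computed here and are not stated in the original results.

```python
# Precision and recall in percent, as reported for the five classifiers.
results = {
    "F0": (79.0, 55.7),
    "F0+F1": (92.4, 89.3),
    "F0+F2": (90.2, 89.5),
    "F0+F3": (97.7, 97.4),
    "all four feature sets": (98.8, 97.8),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (p, r) in results.items():
    print(f"{name}: F1 = {f1_score(p, r):.1f}%")
```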

Conclusion

• text is an important resource for context-aware systems

• self-annotation

• automatic annotation using structured data

• Wikipedia is a valuable resource

• structured and unstructured data

• containing fine-grained relations

• UIMA-based implementation

• fine-grained geographical relation extraction is possible

Questions?

www.nexus.uni-stuttgart.de