Multilingual Corpora Workshop, 27 March 2003
Corpora and Evaluation Tools for Multilingual Named Entity Grammar Development
Christian Bering, Witold Drożdżyński, Gregor Erbach,
Clara Guasch, Petr Homola, Sabine Lehmann, Hong Li,
Hans-Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer,
Atsuko Shimada, Melanie Siegel, Feiyu Xu, Dorothee Ziegler-Eisele
DFKI GmbH, LT Lab, Saarbrücken
Saarland University Computational Linguistics Dept, Saarbrücken
Acrolinx GmbH, Berlin
Outline
Motivation
SProUT - shallow processing toolkit
Multilingual NE grammar development
  Shared output structures
  Shared token classes
  Shared grammars
Multilingual NE corpora
Evaluation tool
Motivation
Named Entity Recognition is fundamental to a number of information management applications (search engines, question answering, text mining …)
Many of these applications deal with different languages
Development of multilingual named entity grammars, supported by BMBF in the projects WHITEBOARD and COLLATE, and by the EU in the project AIRFORCE
Challenges in multilingual NER
Different alphabets, character sets and character encodings
Different tokenization conventions
Different time and currency formats
Different representations of proper names
  Identical (New York, George Bush, IBM)
  Different for some languages (London vs. Londres, Firenze vs. Florence vs. Florenz, NATO vs. OTAN, München vs. Munich vs. Monaco)
SProUT - Objectives
platform for the development of multilingual and domain adaptive shallow text processing and information extraction systems
trade-off between efficiency and expressiveness
modularity (fine-grained modeling of linguistic components into clear-cut modules)
portability and industrial standards
System Architecture
[Figure: SProUT system architecture, with two panels (grammar development environment, online processing). Component labels: finite-state toolkit, regular compiler, shallow grammar interpreter, JTFS, shallow grammar, extended optimized FST representation, lexical resources, linguistic processing resources, input data, stream of text items, structured output data.]
System Components
linguistic processing resources
tokenizer (easily adaptable for Indo-European languages)
gazetteer
morphology component (8 languages)
named entity recognition (6 languages)
core tools
JTFS
FSM toolkit
regular compiler
shallow grammar interpreter
tries for NLP processing
TFS and TFS-XML
TFS as data interchange format in SProUT
unification and subsumption check as basic operations for evaluation
compact XML encoding of typed feature structures (following TEI-SGML)
exchange format for linguistic resources:
grammars
feature structure tree banks
exchange format for visualization
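The two basic operations named above can be illustrated with a minimal sketch. It assumes untyped feature structures represented as nested Python dicts with string atoms; the type hierarchy and coreferences of real typed feature structures are deliberately left out, and the function names `unify` and `subsumes` are illustrative, not the JTFS API.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Return the unification of a and b, or None if they clash.

    Simplification (assumption): atoms unify only if they are equal;
    a typed system would instead compute the greatest lower bound of two types.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # clash below this feature
                result[feature] = merged
            else:
                result[feature] = value
        return result
    return a if a == b else None


def subsumes(general: Any, specific: Any) -> bool:
    """True if every piece of information in `general` is also in `specific`."""
    if isinstance(general, dict):
        return isinstance(specific, dict) and all(
            feature in specific and subsumes(value, specific[feature])
            for feature, value in general.items()
        )
    return general == specific


gold = {"SURNAME": "Gates"}
output = {"GIVEN_NAME": "Bill", "SURNAME": "Gates"}
print(subsumes(gold, output))               # the output carries at least the gold information
print(unify(output, {"SURNAME": "Jobs"}))   # clash on SURNAME
```

For evaluation, a subsumption check of this kind lets a grammar output count as correct when it contains at least the information in the corpus annotation, even if it carries additional features.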
TFS-XML: Example
<FS type="pred_argument">
  <F name="PRED"> <FS type="übernehmen"/> </F>
  <F name="AGENT">
    <FS coref="1" type="argument">
      <F name="NAME"> <FS type="Maria_Müller"/> </F>
    </FS>
  </F>
  <F name="THEME">
    <FS coref="2" type="argument">
      <F name="NOM"> <FS type="Vorsitz"/> </F>
    </FS>
  </F>
</FS>
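As a rough illustration of how such an encoding can be consumed, the sketch below reads the example into nested Python dicts using the standard-library XML parser. The dict layout and the reserved keys `_type` and `_coref` are assumptions made for this example, not part of the TFS-XML format or of SProUT.

```python
import xml.etree.ElementTree as ET

TFS_XML = """
<FS type="pred_argument">
  <F name="PRED"> <FS type="übernehmen"/> </F>
  <F name="AGENT">
    <FS coref="1" type="argument">
      <F name="NAME"> <FS type="Maria_Müller"/> </F>
    </FS>
  </F>
  <F name="THEME">
    <FS coref="2" type="argument">
      <F name="NOM"> <FS type="Vorsitz"/> </F>
    </FS>
  </F>
</FS>
"""


def fs_to_dict(fs: ET.Element) -> dict:
    """Convert an <FS> element into a nested dict (hypothetical layout)."""
    node = {"_type": fs.get("type")}
    if fs.get("coref") is not None:
        node["_coref"] = fs.get("coref")
    for feature in fs.findall("F"):
        value = feature.find("FS")
        if value is not None:
            node[feature.get("name")] = fs_to_dict(value)
    return node


tree = fs_to_dict(ET.fromstring(TFS_XML))
print(tree["PRED"]["_type"])
print(tree["AGENT"]["NAME"]["_type"])
```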
Morphological Resources
Indo-European language resources
English 200,000 entries (Mmorph (Multext))
German 830,000 entries (Mmorph (Multext))
French 225,000 entries (Mmorph (Multext))
Spanish 570,000 entries (Mmorph (Multext))
Italian 330,000 entries (Mmorph (Multext))
Czech 600,000 entries (Institute of Formal and Applied Linguistics in Prague)
Asian language resources
Chinese Shanxi-Tokenizer
Japanese ChaSen
Architecture
Mmorph full-form lexica are stored as tries
External modules (Asian and Czech) are integrated via client/server
[Figure: architecture diagram with the components Parser, Tokeniser, Mmorph, ChaSen, Czech, Shanxi]
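Storing a full-form lexicon as a trie can be sketched as follows. This is a plain character trie written for illustration; the class name, the analysis format, and the two German entries are invented here, and the actual SProUT data structures are not described on the slide.

```python
class Trie:
    """Character trie mapping full forms to lists of analyses."""

    def __init__(self):
        self.children = {}   # next character -> child node
        self.analyses = []   # analyses for the form ending at this node

    def insert(self, form, analysis):
        node = self
        for ch in form:
            node = node.children.setdefault(ch, Trie())
        node.analyses.append(analysis)

    def lookup(self, form):
        node = self
        for ch in form:
            node = node.children.get(ch)
            if node is None:
                return []    # no entry shares this prefix
        return node.analyses


lexicon = Trie()
# Hypothetical entries: (lemma, morphological tag)
lexicon.insert("Haus", ("Haus", "N.sg"))
lexicon.insert("Häuser", ("Haus", "N.pl"))
print(lexicon.lookup("Häuser"))
```

Forms with a common prefix share nodes, which keeps large full-form lexica compact and makes lookup time depend only on the length of the word.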
A SProUT Grammar Rule (XTDL)
Unification
[Figure: matched input structure and extended rule structure; after the match, the fully unified structure]
Multilingual Named Entity Grammars
Languages
English, French, German, Spanish
Chinese, Japanese
Grammar Style
MUC-7/MET-2 named entity classes with some variations
• ENAMEX: person, location, organisation
• TIMEX: time point, time span (instead of date, time)
• NUMEX: percentage, money
Named entity types with internal attribute-value structures, e.g.,
span := timex & [FROM point,
TO point ].
Multilingual Corpora Workshop, 27 March 2003
Multilingual NE grammar development
Our approach
Shared output structures
Shared token classes
Shared grammars
Shared Output Structures
The grammars for all six languages produce the same, semantically oriented output structures, defined in TDL
ne_type := sign & [DESCRIPTOR string].
enamex := ne_type.
ne-person := enamex & [TITLE list-of-strings,
                       GIVEN_NAME list-of-strings,
                       SURNAME list-of-strings,
                       P-POSITION list-of-strings,
                       NAME-SUFFIX string].
ne-location := enamex & [LOCTYPE loc-type,
                         LOCNAME string].
loc-type :< atom.
river := loc-type.
continent := loc-type.
country := loc-type.
province := loc-type.
city := loc-type.
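For readers less familiar with TDL, the same hierarchy can be approximated with Python dataclasses. This is only an analogy: the Python names are invented here, the `sign` supertype is omitted, and ordinary class inheritance stands in for TDL type inheritance.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class LocType(Enum):
    RIVER = "river"
    CONTINENT = "continent"
    COUNTRY = "country"
    PROVINCE = "province"
    CITY = "city"


@dataclass
class NeType:
    descriptor: str = ""


@dataclass
class Enamex(NeType):
    pass


@dataclass
class NePerson(Enamex):
    title: list[str] = field(default_factory=list)
    given_name: list[str] = field(default_factory=list)
    surname: list[str] = field(default_factory=list)
    p_position: list[str] = field(default_factory=list)
    name_suffix: str = ""


@dataclass
class NeLocation(Enamex):
    loctype: Optional[LocType] = None
    locname: str = ""


# The same structure is produced regardless of the input language.
person = NePerson(title=["CEO"], given_name=["Bill"], surname=["Gates"])
place = NeLocation(loctype=LocType.CITY, locname="Stuttgart")
print(person)
print(place)
```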
Shared Token Classes
A single set of token classes is used for the European languages
NATURAL_NUMBER 12344
FLOATING_POINT_NUMBER 123,43
NUMBER_PERCENT_COMPOUND 34,4%
NUMBER_DOT_COMPOUND 234.345.545.
NUMBER_WORD_COMPOUND 2,4-fachen
DIGIT_SLASH_COMPOUND 12/01/1998
DIGIT_DASH_COMPOUND 12-01-1998
DIGIT_COLON_COMPOUND 15:13
ALL_CAPS_WORD ABC
LOWERCASE_WORD tokenization
FIRST_CAPITAL_WORD Microsoft
MIXED_WORD_FIRST_CAPITAL GmbH
MIXED_WORD_FIRST_LOWER dKK
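A token classifier over these classes might look like the sketch below. The regular expressions are guesses reconstructed from the single example given for each class, not the definitions used by the SProUT tokenizer, and the fallback class `OTHER` is an addition made for this example.

```python
import re

# Numeric classes, tried in order. Patterns are assumptions inferred from the examples.
NUMERIC_PATTERNS = [
    ("NATURAL_NUMBER", re.compile(r"\d+")),
    ("FLOATING_POINT_NUMBER", re.compile(r"\d+,\d+")),
    ("NUMBER_PERCENT_COMPOUND", re.compile(r"\d+(,\d+)?%")),
    ("NUMBER_DOT_COMPOUND", re.compile(r"\d+(\.\d+)+\.?")),
    ("NUMBER_WORD_COMPOUND", re.compile(r"\d+(,\d+)?-[^\W\d_]+")),
    ("DIGIT_SLASH_COMPOUND", re.compile(r"\d+(/\d+)+")),
    ("DIGIT_DASH_COMPOUND", re.compile(r"\d+(-\d+)+")),
    ("DIGIT_COLON_COMPOUND", re.compile(r"\d+(:\d+)+")),
]


def classify(token: str) -> str:
    """Assign one of the shared token classes to a token."""
    for name, pattern in NUMERIC_PATTERNS:
        if pattern.fullmatch(token):
            return name
    if token.isalpha():
        if token.isupper():
            return "ALL_CAPS_WORD"
        if token.islower():
            return "LOWERCASE_WORD"
        if token[0].isupper():
            # Capital first letter: distinguish "Microsoft" from "GmbH".
            if token[1:].islower():
                return "FIRST_CAPITAL_WORD"
            return "MIXED_WORD_FIRST_CAPITAL"
        return "MIXED_WORD_FIRST_LOWER"
    return "OTHER"  # hypothetical fallback, not one of the listed classes


for tok in ["12344", "34,4%", "2,4-fachen", "GmbH", "dKK"]:
    print(tok, classify(tok))
```

Because the classes depend only on character shape, grammar rules written against them can be shared across the European languages.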
Shared Grammars
SProUT supports re-use and extension of grammars
This feature has been used for the development of multilingual parallel grammars for English, Spanish and French
Common parts of the grammar for different languages (e.g. date formats like "20.10.2003") are stored in one file, and combined with the language-specific parts of the grammars (for structures like "20 de octubre del 2003")
Common proper names such as "Amsterdam" are stored in a generic gazetteer, while language-specific names such as "Brussels", "Bruxelles", "Bruselas" are stored in language-specific lists
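The layering of shared and language-specific resources can be sketched in a few lines. The example uses Python dicts and regular expressions in place of XTDL grammar files and gazetteers; all names, the language codes, and the two date patterns are illustrative assumptions built from the examples on the slide.

```python
import re

# Shared across languages (one place to maintain).
SHARED_RULES = {"date_numeric": re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")}
GENERIC_GAZETTEER = {"Amsterdam": "city"}

# Language-specific additions.
LANGUAGE_RULES = {
    "es": {"date_long": re.compile(r"\d{1,2} de [a-z]+ del? \d{4}")},
}
LANGUAGE_GAZETTEERS = {
    "en": {"Brussels": "city"},
    "fr": {"Bruxelles": "city"},
    "es": {"Bruselas": "city"},
}


def rules_for(language: str) -> dict:
    """Shared rules plus the rules specific to one language."""
    combined = dict(SHARED_RULES)
    combined.update(LANGUAGE_RULES.get(language, {}))
    return combined


def gazetteer_for(language: str) -> dict:
    """Generic gazetteer plus the language-specific name list."""
    combined = dict(GENERIC_GAZETTEER)
    combined.update(LANGUAGE_GAZETTEERS.get(language, {}))
    return combined


spanish = rules_for("es")
print(bool(spanish["date_numeric"].fullmatch("20.10.2003")))
print(bool(spanish["date_long"].fullmatch("20 de octubre del 2003")))
print(sorted(gazetteer_for("fr")))
```

A fix to the shared date rule is picked up by every language at once, which is the consistency benefit of keeping shared structures in one place.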
Advantages of shared grammars
Grammars are more easily re-usable and extendible
Consistency is improved, as changes must only be made in one place for shared structures
Grammar development is more efficient, and less time-consuming and error-prone
The same methodology has been applied for combining general-language grammars with domain-specific grammars
Re-use of corpora
We use NE-annotated corpora for grammar development and evaluation of grammars
Special-purpose annotation of corpora is only feasible for large-scale evaluations such as MUC, but exceeds the resources of most application-oriented projects
Corpora from other projects are re-used in order to save labour and have larger evaluation resources
There may be mismatches between corpus annotation and grammar output
Multilingual NE corpora
English corpora from the MUC7 evaluations
Japanese and Chinese corpora annotated according to MUC7 conventions
German corpora annotated in the COLLATE project with a superset of MUC7 annotations
German, English, French and Spanish texts annotated with named entities, from the Joint Research Centre
Spanish data from the CoNLL-2002 Language-Independent NER task
English and French corpora from the business domain, annotated with named entities according to the MUC7 guidelines within our project
Issues with re-use of corpora
The corpora contain differences in
  Annotation format
  Types of named entities annotated
  Attributes used to describe each NE
Superficial differences in annotation format are handled by conversion to XML
Differences in the content of the annotation are not handled by modification of the corpora, but rather by making our evaluation tool more flexible
Structure of Annotated Articles
<Firmenmeldung Annotator="…" ID="…" Status="…">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <author>…</author>
      </titleStmt>
      <publicationStmt>
        <publisher>…</publisher>
        <pubPlace>…</pubPlace>
        <date>…</date>
      </publicationStmt>
      <sourceDesc>
        <bibl>
          <agency>…</agency>
          <page>…</page>
          <topic>…</topic>
          <domain>…</domain>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <sourceText>…</sourceText>
  …
  <text>…</text>
</Firmenmeldung>
Callouts: the elements between <sourceText> and <text> hold the semantic relations; <text> holds named entities + coreference
Annotation of Semantic Relations
acquisition, company, corporateStructure, dividends, newBusiness, offer, occupation, premiumIncome, profit, relocation, revenue, turnover
Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.
<Firma Branche="Kfz-Zulieferkonzern" Firma="Robert Bosch" Rechtsform="GmbH" Sitz="Stuttgart"/>
<Firma Firma="van Doorne's Transmissie" Land="NL" Rechtsform="b. v." Sitz="Tilburg"/>
<Beschaeftigung Firma="van Doorne's Transmissie" Mitarbeiter="220"/>
<Umsatz Betrag="45 Mill." Firma="van Doorne's Transmissie" Waehrung="DEM"/>
<Uebernahme Kaeufer="Robert Bosch" Objekt="van Doorne's Transmissie"/>
Annotation of Named Entities
function, location, money, number, ordinalNumber, organization, percent, personName, productName, scaleUnit, time
Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.
<NE Organisation="Robert Bosch GmbH">Robert Bosch GmbH</NE> ,<NE Ort="Stuttgart">Stuttgart</NE> : Der Kfz-Zulieferkonzern übernimmt zum<NE Zeit="01.01."> 1. Januar</NE> die <NE Organisation="van Doorne's Transmissie b. v.">van Doorne's Transmissie</NE> ,<NE Ort="Tilburg">Tilburg</NE> .
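Inline annotation of this kind can be turned into plain text plus stand-off entity spans with the standard-library XML parser. The sketch below is a minimal illustration: the wrapping `<text>` root, the whitespace-normalised excerpt, and the assumption that every `<NE>` element carries exactly one attribute (its class) are choices made for this example.

```python
import xml.etree.ElementTree as ET

ANNOTATED = (
    '<text><NE Organisation="Robert Bosch GmbH">Robert Bosch GmbH</NE>, '
    '<NE Ort="Stuttgart">Stuttgart</NE>: Der Kfz-Zulieferkonzern übernimmt zum '
    '<NE Zeit="01.01.">1. Januar</NE> die '
    '<NE Organisation="van Doorne\'s Transmissie b. v.">van Doorne\'s Transmissie</NE>, '
    '<NE Ort="Tilburg">Tilburg</NE>.</text>'
)


def strip_annotations(xml_text: str):
    """Return (plain text, [(class, value, start, end), ...]) with character offsets."""
    root = ET.fromstring(xml_text)
    pieces = [root.text or ""]
    offset = len(pieces[0])
    entities = []
    for ne in root:
        surface = "".join(ne.itertext())
        # Assumption: the single attribute name is the NE class, its value the normalised form.
        ne_class, value = next(iter(ne.attrib.items()))
        entities.append((ne_class, value, offset, offset + len(surface)))
        tail = ne.tail or ""
        pieces.extend([surface, tail])
        offset += len(surface) + len(tail)
    return "".join(pieces), entities


plain, entities = strip_annotations(ANNOTATED)
print(plain)
for ne_class, value, start, end in entities:
    print(ne_class, repr(plain[start:end]), "->", value)
```

The plain text is what a grammar would receive as input, and the spans are what its output would be compared against.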
Annotation of Coreference
3rd person personal pronouns
3rd person possessive pronouns and determiners
demonstrative pronouns and determiners
indefinite pronouns and determiners
anaphoric and cataphoric adverbs
elliptical nominal phrases
anaphoric and cataphoric nominal phrases
LM Ericsson AB, Stockholm: Der schwedische Elektronikkonzern hat …
<exp id="101">LM Ericsson AB</exp>, Stockholm: <exp id="102"><ptr src="101"/>Der schwedische Elektronikkonzern</exp> hat …
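The coreference links encoded by `<exp>` and `<ptr>` can be read out as pairs of expressions. The following sketch assumes the fragment is wrapped in a root element (here a hypothetical `<s>`) and that each `src` value refers to an `id` in the same fragment.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<s><exp id="101">LM Ericsson AB</exp>, Stockholm: '
    '<exp id="102"><ptr src="101"/>Der schwedische Elektronikkonzern</exp> hat …</s>'
)


def coreference_links(xml_text: str):
    """Return (referring expression, antecedent) pairs."""
    root = ET.fromstring(xml_text)
    expressions = list(root.iter("exp"))
    # Surface text of each expression, keyed by its id.
    surface = {e.get("id"): "".join(e.itertext()).strip() for e in expressions}
    links = []
    for e in expressions:
        for ptr in e.findall("ptr"):
            antecedent = surface.get(ptr.get("src"))
            if antecedent is not None:
                links.append((surface[e.get("id")], antecedent))
    return links


links = coreference_links(SAMPLE)
print(links)
```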
Cooperation: Annotation of FR
<REQUEST><SPKR> SPD </SPKR> <FEE> fordert </FEE> <ADD> Koalition </ADD> <MSG> zu Gespr"ach "uber Reform </MSG> <FEE> auf </FEE>. </REQUEST>
<CONVERSATION>SPD fordert <INTLC-1> Koalition </INTLC-1> zu <FEE> Gespr"ach </FEE> <TOPIC> "uber Reform </TOPIC> auf. </CONVERSATION>
LLX: FrameNet annotation
<s id="s37">
  <graph root="s37_503">
    <terminals>
      <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" />
      <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" />
      <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" />
      <t id="s37_4" word="verkörpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />
      <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />
      ...
    </terminals>
    <nonterminals>
      <nt id="s37_500" cat="MPN">
        <edge label="PNC" idref="s37_2"/>
        <edge label="PNC" idref="s37_3"/>
      </nt>
      <nt id="s37_501" cat="NP">
        <edge label="NK" idref="s37_6"/>
        <edge label="NK" idref="s37_7"/>
      </nt>
      ...
    </nonterminals>
  </graph>
</s>
TIGER: syntactic annotation
Cooperation: Multi-layer Annotation
<s id="s37">
  <graph root="s37_503">
    <terminals>
      <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" />
      <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" />
      <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" />
      <t id="s37_4" word="verkörpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />
      <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />
      ...
    </terminals>
    <nonterminals>
      <nt id="s37_500" cat="MPN">
        <edge label="PNC" idref="s37_2"/>
        <edge label="PNC" idref="s37_3"/>
      </nt>
      <nt id="s37_501" cat="NP">
        <edge label="NK" idref="s37_6"/>
        <edge label="NK" idref="s37_7"/>
      </nt>
      ...
    </nonterminals>
  </graph>
</s>
TIGER: syntactic annotation
<REQUEST><SPKR> SPD </SPKR> <FEE> fordert </FEE> <ADD> Koalition </ADD> <MSG> zu Gespr"ach "uber Reform </MSG> <FEE> auf </FEE>. </REQUEST>
<CONVERSATION>SPD fordert <INTLC-1> Koalition </INTLC-1> zu <FEE> Gespr"ach </FEE> <TOPIC> "uber Reform </TOPIC> auf. </CONVERSATION>
LLX: FrameNet annotation
<Firmenmeldung Annotator="keku" ID="SZ_401" Status="1">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <author/>
      </titleStmt>
      <publicationStmt>
        <publisher>SZ</publisher>
        <date>1995-03-31</date>
      </publicationStmt>
      <sourceDesc>
        <bibl>
          <agency>vwd</agency>
          <page>22</page>
          <topic>Wirtschaft</topic>
          <domain>Firmenmeldungen</domain>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <sourceText>Datev eG, Nürnberg: Der EDV-Dienstleister für Steuerberater hat 1994 den Umsatz laut vorläufigen Zahlen um 5% auf rund 980 Mill. DM gesteigert. Die Anzahl der Mitarbeiter ist auf 4605 (4474) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf 34246 (33551) an. Die Investitionen von 115 (93) Mill. DM haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.</sourceText>
  <Firma Branche1="EDV-Dienstleister für Steuerberater" Firma="Datev eG" Sitz1="Nürnberg" Rechtsform="eG"/>
  <Umsatz Firma="Datev eG" Differenz="5%" Trend="plus" Betrag1="980 Mill." Waehrung1="DEM" Beschreibung1="rund" Zeit="1994"/>
  <Beschaeftigung Firma="Datev eG" Trend="plus" Mitarbeiter1_alt="4474" Mitarbeiter1_neu="4605" Zeit="1994"/>
  <text><NE Organisation="Datev eG">Datev eG</NE>, <NE Ort="Nürnberg">Nürnberg</NE>: Der EDV-Dienstleister für Steuerberater hat <NE Zeit="1994">1994</NE> den Umsatz laut vorläufigen Zahlen um <NE Prozentzahl="5%">5%</NE> auf <NE Geld="rund 980 Mill. DEM">rund 980 Mill. DM</NE> gesteigert. Die Anzahl der Mitarbeiter ist auf <NE Zahl="4605">4605</NE> (<NE Zahl="4474">4474</NE>) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf <NE Zahl="34246">34246</NE> (<NE Zahl="33551">33551</NE>) an. Die Investitionen von <NE Geld="115 (93) Mill. DEM">115 (93) Mill. DM</NE> haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.</text>
</Firmenmeldung>
COLLATE: semantic annotation
=> multi-layer annotated language resource
Evaluation Tool: jTaCo
Evaluates grammars wrt. an annotated corpus
Removes annotations from corpus, and feeds unannotated text to grammar
Compares grammar output with original annotated texts
Produces detailed statistics, evaluation scores, and diagnostic output
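The comparison step can be sketched as a set comparison over entity spans. The slide does not say which scores jTaCo reports; precision, recall and F1 with exact matching are assumed here as the usual choice for NE evaluation, and the spans in the example are invented.

```python
def score(gold, predicted):
    """Compare predicted entities with gold annotations, both as (class, start, end) tuples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "missing": sorted(gold - predicted),    # diagnostic: annotated but not found
        "spurious": sorted(predicted - gold),   # diagnostic: found but not annotated
    }


# Hypothetical spans for one document.
gold = [("organization", 0, 17), ("location", 19, 28), ("time", 67, 76)]
predicted = [("organization", 0, 17), ("location", 19, 28), ("money", 90, 95)]
report = score(gold, predicted)
print(report)
```

The `missing` and `spurious` lists correspond to the kind of diagnostic output a grammar developer needs to locate gaps in the rules.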
Configuration of jTaCo
jTaCo can be configured to deal with various problems in evaluating grammars wrt. a corpus:
Use of different classes of NE, or different granularities (e.g. organization and subclasses company, university etc.)
  Declaration of class equivalence and subclass relationships.
Extent of NE may be different (CEO Bill Gates vs. Bill Gates)
  Left or right boundary may be mismatched. Size of allowable mismatch can be specified for each NE class.
Markup of corpus may be textually oriented (XML tags) while grammar output may be a different data structure (e.g. semantics encoded in feature structure)
  No general solution is possible. In case of SProUT, feature structures are linked with input tokens, so that a correspondence can be established (under development).
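The first two configuration options can be sketched as a relaxed match predicate. Everything below is an assumption for illustration: the table names, the class labels, and the choice to measure boundary mismatch in characters (the slide does not state the unit or the configuration syntax of jTaCo).

```python
# Hypothetical configuration tables.
EQUIVALENT = {"Organisation": "organization", "Ort": "location"}       # corpus label -> grammar label
SUBCLASS_OF = {"company": "organization", "university": "organization"}
MAX_BOUNDARY_MISMATCH = {"person": 4}                                   # characters per NE class, default 0


def normalise(ne_class: str) -> str:
    return EQUIVALENT.get(ne_class, ne_class)


def is_a(ne_class: str, target: str) -> bool:
    """True if ne_class equals target or is a (transitive) subclass of it."""
    current, target = normalise(ne_class), normalise(target)
    while current is not None:
        if current == target:
            return True
        current = SUBCLASS_OF.get(current)
    return False


def matches(gold, predicted) -> bool:
    """Relaxed comparison of two (class, start, end) entities."""
    g_class, g_start, g_end = gold
    p_class, p_start, p_end = predicted
    if not (is_a(p_class, g_class) or is_a(g_class, p_class)):
        return False
    tolerance = MAX_BOUNDARY_MISMATCH.get(normalise(g_class), 0)
    return abs(g_start - p_start) <= tolerance and abs(g_end - p_end) <= tolerance


# "CEO Bill Gates" (grammar, offsets 0-14) vs. "Bill Gates" (corpus, offsets 4-14)
print(matches(("person", 4, 14), ("person", 0, 14)))
# Grammar is more fine-grained than the corpus annotation.
print(matches(("Organisation", 0, 17), ("company", 0, 17)))
```

Keeping these relaxations in configuration rather than editing the corpora means the same annotated data can serve grammars with different class inventories.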
Architecture of jTaCo
Conclusion
We discussed a fundamental problem in re-using heterogeneously annotated corpora for multilingual grammar development
With increasing availability of annotated corpora, re-use becomes attractive and cost-effective
We described methods and tools for re-using annotated corpora for development and evaluation of NE grammars