Information Extraction Technologies & Applications


TRANSCRIPT

Page 1: Information Extraction Technologies & Applications

Source: G. Neumann, 1999

Information Extraction

Technologies & Applications

Günter Neumann

LT-lab, DFKI

Page 2: Information Extraction Technologies & Applications


The increased availability of electronic text data requires new technologies for extracting relevant information

INFORMATION EXTRACTION (IE)

The goal of IE research is to build systems that find and link relevant information from NL text while ignoring extraneous and irrelevant information

The core functionality of an IE system is quite simple:

Input:
1. A specification of the relevant information in the form of templates (feature structures), e.g., company information, product information, management succession, meetings of important people
2. A set of real-world text documents

Output: A set of instantiated templates filled with relevant text fragments (possibly normalized to some canonical form)
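As a purely illustrative sketch (not the original system's data structures), such a template and its instantiation could look as follows; the slot names mirror the turnover example on the next slide.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of a template specification: each field is a slot
# to be filled with a (possibly normalized) text fragment.
@dataclass
class TurnoverTemplate:
    type: str = "turnover"
    c_name: Optional[str] = None    # company name
    year: Optional[str] = None
    amount: Optional[str] = None    # normalized amount, e.g. "2.8e+9 DM"
    tendency: Optional[str] = None  # "+" or "-"
    diff: Optional[str] = None      # relative change, e.g. "+17%"

# Output of the IE system: an instantiated template.
filled = TurnoverTemplate(c_name="Possehl", year="1994",
                          amount="2.8e+9 DM", tendency="+", diff="+17%")
print(filled)
```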

Page 3: Information Extraction Technologies & Applications


Example Information Extraction 1

Lübeck (dpa) - Die Lübecker Possehl-Gruppe, ein im Produktions-, Handel- und Dienstleistungsbereich tätiger Mischkonzern, hat 1994 den Umsatz kräftig um 17 Prozent auf rund 2,8 Milliarden DM gesteigert. In das neue Geschäftsjahr sei man ebenfalls „mit Schwung“ gestartet. Im 1. Halbjahr 1995 hätten sich die Umsätze des Konzerns im Vergleich zur Vorjahresperiode um fast 23 Prozent auf rund 1,3 Milliarden erhöht.
(Lübeck (dpa) - The Lübeck-based Possehl group, a diversified conglomerate active in production, trade and services, increased its turnover strongly in 1994 by 17 percent to around 2.8 billion DM. The new business year has likewise started "with momentum". In the first half of 1995, the group's turnover rose by almost 23 percent compared with the same period of the previous year, to around 1.3 billion.)

type = turnover
c-name = Possehl1
year = 1994
amount = 2.8e+9 DM
tendency = +
diff = +17%

type = turnover
c-name = Possehl1
year = 1995/1
amount = 1.3e+9 DM
tendency = +
diff = +23%

Page 4: Information Extraction Technologies & Applications


Example Information Extraction 2

Parts from RWE's Annual Report (1998): Eine Schwerpunktregion im Rahmen der Internationalisierung im Energiebereich ist Osteuropa. Hier haben wir unser Engagement im abgelaufenen Geschäftsjahr weiter ausbauen können. Nach dem Kauf weiterer Anteile halten wir inzwischen jeweils knapp über 50% an den ungarischen Energieversorgungsunternehmen ELMÜ, ÉMÁSZ und MÁTRA. Im Falle von MÁTRA hat RWE Energie im April 1998 Anteile an Rheinbraun abgegeben. Die Präsenz in Polen wurde durch Kooperationsvereinbarungen mit den Regionalversorgern Zaklad Energetyczny Krakow S.A. (ZEK) und Stoleczny Zaklad Energetyczny S.A. (STOEN) .... im Frühjahr 1998 weiter ausgebaut.
(One key region of the internationalization in the energy sector is Eastern Europe. Here we were able to further expand our involvement in the past business year. After purchasing further shares, we now hold just over 50% each of the Hungarian energy supply companies ELMÜ, ÉMÁSZ and MÁTRA. In the case of MÁTRA, RWE Energie transferred shares to Rheinbraun in April 1998. The presence in Poland was further expanded in spring 1998 through cooperation agreements with the regional suppliers Zaklad Energetyczny Krakow S.A. (ZEK) and Stoleczny Zaklad Energetyczny S.A. (STOEN) ....)

Group/Subs.  YEAR  KIND  FROM  TO  POT  AMOUNT
RWE  1998  +  ELMÜ  >50%
RWE  1998  +  ÉMÁSZ  >50%
RWE  1998  +  MÁTRA  >50%
RWE Energie  1998  -  MÁTRA  Rheinbraun  4.1998

Page 5: Information Extraction Technologies & Applications


From the viewpoint of natural language processing (NLP), IE is attractive for many reasons, including

• Extraction tasks are well defined

• IE uses real-world texts

• IE poses difficult and interesting NLP problems

• IE needs systematic interface specification between NL and domain knowledge

• IE performance can be compared to human performance on the same task

IE systems are a key factor in encouraging NLP researchers to move from small-scale systems and artificial data to large-scale systems operating on human language (Cowie & Lehnert, 1996)

Page 6: Information Extraction Technologies & Applications


IE has a high application impact

• IE and information retrieval: construction of sensitive indices which are more closely linked to the actual meaning of a particular text

• IE and text classification: getting fine-grained decision rules

• IE and text mining: improve quality of extracted structured information

• IE and database systems: improve semi-structured DB approaches

• IE and knowledge-base systems: combine extracted information with KB

Page 7: Information Extraction Technologies & Applications


Advanced IE technologies improve intelligent indexing and retrieval

Improved indexing: text files → IE core system → marked text & templates → index construction (phrasal and concept indices)

Improved retrieval: query → IE core system → search engine
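As a minimal sketch of what a "concept index" built from IE output could look like (the data structure and names below are illustrative assumptions, not the system described here), filled template slots can be used as index keys alongside ordinary terms:

```python
from collections import defaultdict

# Hypothetical concept index: maps (slot, normalized value) pairs to document ids.
concept_index = defaultdict(set)

def index_document(doc_id, templates):
    """Add every filled slot of every instantiated template to the index."""
    for template in templates:
        for slot, value in template.items():
            if value is not None:
                concept_index[(slot, value)].add(doc_id)

# Usage: index the turnover example, then retrieve by concept rather than by keyword.
index_document("dpa-001", [{"type": "turnover", "c-name": "Possehl", "year": "1994"}])
print(concept_index[("c-name", "Possehl")])  # -> {'dpa-001'}
```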

Page 8: Information Extraction Technologies & Applications


Shallow text processing as a common pre-processing tool for TM & IE

Shallow text processing (STP: tokenization, lexical processing, chunk parsing) is applied to the text documents and delivers linguistic entities to both consumers:

– Information Extraction: the complex relations are known in advance; the output is a set of instantiated templates
– Data Mining: the complex relations are still to be discovered

Dimensions along which both tasks vary: structure, languages, domain

Page 9: Information Extraction Technologies & Applications


IE as a core component in incremental knowledge engineering systems

Cycle: IE core system → statistical evaluation & visualization → construction of ontology → ontology and domain lexicon (the domain lexicon is fed back into the IE core system)

The core idea:

Hand-crafted construction of a domain ontology on the basis of linguistic information extracted from texts.

Simultaneously, a domain lexicon is constructed which is used in the next acquisition cycle.

Page 10: Information Extraction Technologies & Applications


From a system development point of view, there exist two approaches to IE

• Language technology approach ("dem Ingeniör is nix zu schwör" – nothing is too hard for the engineer)

– linguistic knowledge specified manually by experts
– mapping between NL and domain knowledge hand-coded
– manual inspection of the corpus to find out how the specific domain knowledge is expressed in NL
– still the best approach for building reasonably complex systems
– development of tools for supporting application building

• Learning approach

– apply statistical methods where possible

– learn template-filler rules from annotated corpora using machine learning
– shows promising results for IE subtasks (proper name recognition, flat slot-filler rules)
– mapping between NL and domain knowledge automatically induced
– still needs a large amount of annotated corpora

Page 11: Information Extraction Technologies & Applications


Common to both approaches is the use of shallow, re-usable NL core components (of course, differing in granularity)

• tokenization, text scanning/information wrapping (e.g., analysis of tables, headlines)

– in most cases simple, but very important

• Morphological & lexical processing

– high-coverage and fast morphological analysis

– processing of unknowns & compounds

– part-of-speech tagging

• Recognition of named entities (proper names; expressions for dates, values, measurements); important issues here: new name creation, multilingual terms (see the sketch after this list)

• shallow parsing: very long sentences (> 30 words), relative clauses, coordination

– integration of specific subgrammars

– partial parsing

– very robust and efficient strategies needed (weighted finite state transducers)

• discourse analysis (NP-analysis, co-reference, relational links)
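As referenced above in the named-entity item, a few entity types such as dates, values and measurements can already be approximated with simple finite-state patterns. The patterns below are an illustrative sketch only, far less complete than a real name grammar.

```python
import re

# Illustrative finite-state-style patterns for a few entity types mentioned above.
PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\s?\d{1,2}\.\s?\d{2,4}\b"),               # e.g. 3.8.93
    "amount": re.compile(r"\b\d+(?:,\d+)?\s?(?:Milliarden|Millionen)?\s?DM\b"),
    "percent": re.compile(r"\b\d+(?:,\d+)?\s?Prozent\b|\b\d+\s?%\b"),
}

def find_entities(text):
    """Return (type, surface string, span) triples for every pattern match."""
    hits = []
    for etype, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((etype, m.group(), m.span()))
    return hits

print(find_entities("Der Umsatz stieg 1994 um 17 Prozent auf rund 2,8 Milliarden DM."))
```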

Page 12: Information Extraction Technologies & Applications


IE poses difficult and interesting NLP problems, which have partially led to a renaissance of "old" methods and their improvement

• A high degree of robustness and efficiency is required:

– finite state technology (weighted finite state transducers)

– text-skipping as a realistic approach for handling real-world text

• systematic specification of NL and domain knowledge

• short system application cycles because of "daily news"

• multilingual methods are required

• evaluation of the predictive power of IE systems

– recall R = #correct found answers / #total possible correct

– precision P = #correct found answers / #found answers

– f-measure F = (β² + 1) · P · R / (β² · P + R); for β = 1: F = 2 P R / (P + R)

– the 0.6 F-measure barrier
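These measures in a small, self-contained computation (the counts below are invented purely for illustration):

```python
def f_measure(correct_found, total_found, total_possible, beta=1.0):
    """Recall, precision and F-measure as defined above."""
    recall = correct_found / total_possible
    precision = correct_found / total_found
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return recall, precision, f

# Invented counts: 60 correct answers out of 80 returned, 100 possible in the key.
r, p, f = f_measure(60, 80, 100)
print(f"R={r:.2f} P={p:.2f} F={f:.2f}")  # R=0.60 P=0.75 F=0.67
```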

Page 13: Information Extraction Technologies & Applications


At DFKI's LT-lab we have developed powerful domain-independent shallow text processing components in order to support a fast IE system development cycle

Shallow Text Processor (input: text; output: a set of underspecified functional descriptions):

• Text tokenization

• Lexical processor: morphology, compounds, tagging
– Lexical DB: > 120,000 main stems; > 12,000 verb frames; special name lexica; tagging rules

• Chunk parser: sentence topology, phrase recognition, sentence structure, grammatical functions
– Grammars (FST): general (NPs, PPs, verb groups); special (lexicon-poor, time/date/names); general sentence patterns
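A minimal pipeline skeleton mirroring the three components above; the function bodies are placeholders for illustration, not the DFKI components themselves.

```python
# Skeleton of the shallow text processing chain described above.
# The component implementations are placeholders, not the actual system.

def tokenize(text):
    # a real tokenizer also handles numbers, dates, abbreviations, ...
    return text.split()

def lexical_processing(tokens):
    # stands in for morphology, compound analysis and tagging against the lexical DB
    return [{"form": t, "stem": t.lower(), "pos": "UNKNOWN"} for t in tokens]

def chunk_parse(lex_items):
    # stands in for FST-based topology, phrase and grammatical-function recognition
    return {"tokens": lex_items, "phrases": [], "topology": None}

def shallow_text_processor(text):
    """Text in, underspecified functional description (here: a plain dict) out."""
    return chunk_parse(lexical_processing(tokenize(text)))

print(shallow_text_processor("Die Possehl-Gruppe hat 1994 den Umsatz gesteigert."))
```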

Page 14: Information Extraction Technologies & Applications


We have identified the need for better chunk parsing strategies in order to improve robustness and coverage on the sentence level

Current chunk parser (bottom-up: first phrases, then sentence structure):

text (morphologically analysed) → phrase recognition → stream of phrases → clause recognition → stream of sentences → grammatical function recognition

Main problem: even the recognition of simple sentence structure depends on the performance of phrase recognition.

Examples:
– complex NPs (nominalization style)
– relative pronouns

[Die vom Bundesgerichtshof und den Wettbewerbern als Verstoss gegen das Kartellverbot gegeisselte zentrale TV-Vermarktung] ist gängige Praxis.
([Central TV marketing, castigated by the Federal High Court and by the competitors as a violation of the cartel ban,] is common practice.)

Page 15: Information Extraction Technologies & Applications


A new chunk parser has been developed that increases robustness and coverage on the sentence level

New chunk parser (top-down/bottom-up: first compute the topological structure of the sentence, then apply phrase recognition to the resulting fields):

text (morphologically analysed) → sentence topology → stream of sentence structures → phrase recognition → stream of sentences → grammatical function recognition

[coord [core Diese Angaben konnte der Bundesgrenzschutz aber nicht bestätigen], [core Kinkel sprach von Horrorzahlen, [relcl denen er keinen Glauben schenke]]].
(The Border Police could not confirm this information, however; Kinkel spoke of horror figures that he did not believe.)

Advantages:
– simple (topology recognition based on keywords and verb groups)
– resolution of critical ambiguities (coordination, pronouns vs. determiners)
– no restriction on the number of subclauses
– good coverage (first tests on 670 sentences: 85-90%, Urli)
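To make the top-down step concrete, here is a deliberately simplified sketch of topology recognition: subclause boundaries are guessed from commas and a small list of relative pronouns, loosely in the spirit of the keyword/verb-group based approach above. It is an illustrative toy, not the actual parser.

```python
import re

# A few German relative pronouns used as subclause keywords (illustrative list only).
REL_PRONOUNS = {"der", "die", "das", "dem", "den", "denen", "deren", "dessen"}

def rough_topology(sentence):
    """Split a sentence at commas and label each piece as 'core' or 'relcl'."""
    clauses = []
    for part in re.split(r"\s*,\s*", sentence.rstrip(".")):
        words = part.split()
        first_word = words[0].lower() if words else ""
        label = "relcl" if first_word in REL_PRONOUNS else "core"
        clauses.append((label, part))
    return clauses

example = ("Diese Angaben konnte der Bundesgrenzschutz aber nicht bestätigen, "
           "Kinkel sprach von Horrorzahlen, denen er keinen Glauben schenke.")
for label, clause in rough_topology(example):
    print(label, ":", clause)
```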

Page 16: Information Extraction Technologies & Applications


The Shallow Text Processor has several important characteristics

Modularity: each subcomponent can be used in isolation;

Declarativity: lexicon and grammar specification tools;

High coverage: more than 93% lexical coverage of unseen text;

high coverage of the subgrammars (70% for general NPs, > 90% for sentence topology);

Efficiency: finite state technology in all components;

specialized constrained solvers

(e.g. agreement checks & grammatical functions);

Run-time example: 1-3 seconds per text page in real time

Page 17: Information Extraction Technologies & Applications


We're using typed feature structures for modelling the domain knowledge and its relationship to NL expressions

Template (domain) hierarchy:
Templ[action, date]
– Move-T[from, to, unit]
– Loc-T[action]
– Fight-T[victim, attacker, attacked]
– Meeting[visitor, visitee]

Linguistic hierarchy:
Phrase: Np, LocNp, LocPp, DatePP, Pp
Fdesc[process, mods]: trans[subj, obj], intrans[subj]

Linked type (domain lexicon entry, DomainLex: kill = Fight-L):
Fight-L[subj=1, obj=2, templ=[attacker=1, attacked=2]]

Processing chain: STP → UFD → TypeMerge → FillTempl

Example of a filled structure:
[process=1, subj=2, obj=3,
dateMod = < ... =4 ... >, locMod = < ... =5 ... >,
templ = [action kill=1, attacker soldier=2, attacked general=3, date 3/8/93=4, Loc Mostar=5]]
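A much simplified sketch of the linked-type idea: a domain lexicon entry ties grammatical functions of a verb to slots of a domain template, so that template filling from an underspecified functional description becomes a mapping step. The plain dictionaries below are illustrative stand-ins for the typed feature structures on the slide.

```python
# Domain lexicon: links the verb "kill" to the Fight-T template and maps
# grammatical functions to template slots (cf. the linked type Fight-L above).
DOMAIN_LEXICON = {
    "kill": {"template": "Fight-T", "links": {"subj": "attacker", "obj": "attacked"}},
}

def fill_template(ufd):
    """Instantiate a domain template from an underspecified functional description."""
    entry = DOMAIN_LEXICON.get(ufd["process"])
    if entry is None:
        return None
    template = {"type": entry["template"]}
    for gram_fct, slot in entry["links"].items():
        template[slot] = ufd.get(gram_fct)
    template["date"] = ufd.get("dateMod")
    template["loc"] = ufd.get("locMod")
    return template

# UFD roughly corresponding to the example on the slide.
ufd = {"process": "kill", "subj": "soldier", "obj": "general",
       "dateMod": "3/8/93", "locMod": "Mostar"}
print(fill_template(ufd))
# -> {'type': 'Fight-T', 'attacker': 'soldier', 'attacked': 'general',
#     'date': '3/8/93', 'loc': 'Mostar'}
```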

Page 18: Information Extraction Technologies & Applications


The model has several advantages

• New applications are defined basically through

– the template hierarchy (independently of linguistic knowledge)

– the integration of domain and linguistic knowledge solely through linked types between abstract linguistic categories

• Support of faster adaptation to new applications

• The focus of the next research periods:

– systematic integration of lexical semantics (in particular nominal entities)

– template merging

– methods for automatic knowledge acquisition

– learning of domain lexicon

– learning of linked types

Page 19: Information Extraction Technologies & Applications


Recently, the problem of using machine learning methods to induce IE routines has received more attention

• The majority of current approaches are variants of inductive supervised learning

• Goal: given a set of annotated text documents, automatically induce template-filler rules by successively generalizing the initial instantiated rules computed from the tagged examples

• Usually, the documents are preprocessed by NL components:
– tokenization (Freitag, 98)
– POS tagging (Califf & Mooney, 98)
– phrase recognition (Huffman, 96)
– shallow sentence parsing (Riloff, 96a; Soderland, 97)

• Most approaches learn slot-filler rules; some newer ones learn relational structures (Califf & Mooney, 98)

• current trend is towards minimally supervised strategies (Riloff,96b)

Page 20: Information Extraction Technologies & Applications


The majority of current approaches are variants of inductive supervised learning

• Example

<PNG> Sue Smith </PNG>, 39, of Menlo Park, was appointed <TNG> president </TNG> of <CNG> Foo Inc. </CNG>

n_was_named_t_by_c:
noun-group(PNG, head(isa(person-name))),
noun-group(TNG, head(isa(title))),
noun-group(CNG, head(isa(company-name))),
verb-group(VG, type(passive), head(named or elected or appointed)),
prep(PREP, head(of or at or by)),
subject(PNG, VG), object(VG, TNG), post_nominal_prep(TNG, PREP), prep_obj(PREP, CNG)
management_appointment(M, person(PNG), title(TNG), company(CNG))
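To illustrate how such a learned rule could be applied (a generic sketch; this is not the rule representation of the cited systems), the constituents found by shallow parsing can be tested against the rule's constraints before a management_appointment structure is emitted. The hand-built parse dictionary below stands in for real parser output.

```python
# Generic sketch of applying a slot-filler rule to shallow-parse output.
# The parse is hand-built for the example sentence; a real system would
# produce it automatically and also check the syntactic relations.
parse = {
    "PNG": {"kind": "noun-group", "head_isa": "person-name", "text": "Sue Smith"},
    "TNG": {"kind": "noun-group", "head_isa": "title", "text": "president"},
    "CNG": {"kind": "noun-group", "head_isa": "company-name", "text": "Foo Inc."},
    "VG": {"kind": "verb-group", "type": "passive", "head": "appointed"},
    "PREP": {"kind": "prep", "head": "of"},
}

def n_was_named_t_by_c(p):
    """Return a filled management_appointment structure if the constraints hold."""
    if (p["PNG"]["head_isa"] == "person-name"
            and p["TNG"]["head_isa"] == "title"
            and p["CNG"]["head_isa"] == "company-name"
            and p["VG"]["type"] == "passive"
            and p["VG"]["head"] in {"named", "elected", "appointed"}
            and p["PREP"]["head"] in {"of", "at", "by"}):
        return {"relation": "management_appointment",
                "person": p["PNG"]["text"],
                "title": p["TNG"]["text"],
                "company": p["CNG"]["text"]}
    return None

print(n_was_named_t_by_c(parse))
```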

• Current approaches already show impressive results (for flat, sentence-based templates):

– Huffman, 96: management changes task => 85.2% F (89.4%)
– Califf & Mooney, 98: computer-related job postings => 87.1% P & 58.8% R

Page 21: Information Extraction Technologies & Applications


In the case of multilingual name recognition, learning approaches are very promising

• Nymble, a high performance learning name-finder (Bikel et al. 97, BBN)

– hidden Markov model

– word-features (e.g., allCaps, twoDigitNum); a sketch follows after this list

– Results for F-measure

• English: 93% (best MUC-6: 96%)

• Spanish: 90% (93%)

• Gallippi, 96:

– data-driven knowledge acquisition strategy based on decision trees (ID3)

– features: POS, Abbrev., list of known names, word-features

– Results for F-measure (av. across companies, persons, locations, date)

• English: 94%

• Spanish: 89.2%

• Japanese: 83.1%
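A sketch of the kind of word-features mentioned for both systems; the classes below are illustrative and not the exact feature sets of Nymble or of Gallippi's decision-tree learner.

```python
import re

def word_feature(token):
    """Map a token to a coarse feature class such as allCaps or twoDigitNum."""
    if re.fullmatch(r"\d{2}", token):
        return "twoDigitNum"
    if re.fullmatch(r"\d{4}", token):
        return "fourDigitNum"
    if re.fullmatch(r"[A-Z]+", token):
        return "allCaps"
    if re.fullmatch(r"[A-Z][a-z]+", token):
        return "initCap"
    if token and token[0].isdigit():
        return "containsDigit"
    return "other"

print([word_feature(t) for t in ["BBN", "Smith", "96", "1997", "appointed"]])
# -> ['allCaps', 'initCap', 'twoDigitNum', 'fourDigitNum', 'other']
```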

Page 22: Information Extraction Technologies & Applications


In summary: IE is a very attractive research area for building application systems

• IE is an interdisciplinary approach

– language technology

– statistical methods

– machine learning

– knowledge representation

– software engineering

– expert knowledge

• Which mix of methods is best depends on the complexity of the target information

– hand-crafted vs. automatic or mixed strategies

• Next system generation will

– be more intelligent in adapting itself to new domains

– learn from experience while being minimally supervised

– support shorter development cycle (dynamic environment)

– understand several languages