Named Entity Recognition for Danish in a CG framework
Eckhard Bick
Southern Denmark University
Topics
• DanGram system overview
• Distributed NER techniques:
  - pattern matching
  - lexical
  - CG-rules
• Evaluation and outlook
System Structure
The DanGram system in current numbers
Lexemes in morphological base lexicon: 146.342 (equals about 1.000.000 full forms), of these:
  proper names: 44.839 (experimental)
  polylexicals: 460 (+ names and certain number expressions)
Lexemes in the valency and semantic prototype lexicon: 95.308
Lexemes in the bilingual lexicon (Danish-Esperanto): 36.001
Danish CG-rules, in all: 7.400
  morphological CG disambiguation rules: 2.259 + 137 = 2.396
  syntactic mapping-rules: 2.250
  syntactic CG disambiguation rules: 2.205
  NER-module: 432
  Attachment-CG: 117
  (plus 429 bilingual rules in separate MT grammars)
Danish PSG-rules: 605 (for generating syntactic tree structures)
Performance: At full disambiguation (i.e., maximal precision), the system has an average correctness of 99% for word class (PoS), and about 95% for syntactic function tags (depending on how fine-grained an annotation scheme is used). From raw news text, 50%-70% of sentences produce well-formed syntactic trees.
Speed:
  full CG-parse: ca. 400 words/sec for larger texts (start-up time 3-6 sec)
  morphological analysis alone: ca. 1000 words/sec
Running CG-annotation
Da (When) [da] KS @SUB
den (the) [den] ART UTR S DEF @>N
gamle (old) [gammel] ADJ nG S DEF NOM @>N
sælger (salesman) [sælger] N UTR S IDF NOM @SUBJ>
kørte (drove) [køre] <mv> V IMPF AKT @FS-ADVL>
hjem (home) [hjem] N NEU P IDF NOM @<ACC
i (in) [i] PRP @<ADVL
sin (his) [sin] <poss> <refl> DET UTR S @>N
bil (car) [bil] N UTR S IDF NOM @P<
,
kunne (could) [se] <aux> V IMPF AKT @FAUX
han (he) [han] PERS UTR 3S NOM @<SUBJ
se (see) [se] <mv> V INF AKT @AUX<
mange (many) [mange] <quant> DET nG P NOM @>N
rådyr (deer) [rådyr] N NEU P IDF NOM &ACI-SUBJ @<ACC
på (in) [på] PRP @<OA
de (the) [den] ART nG P DEF @>N
våde (wet) [våd] ADJ nG P nD NOM @>N
marker (fields) [mark] N UTR P IDF NOM @P<
Constituent trees (high level)
NER target classes
Feature bundling in the major name categories (synopsis)
1. NE string recognition at raw text level, pattern based
Recognition of author and source names (scanning & corpus impurities):
  København N. Førerbevis til ældre
  LUXEMBORG I over 700 år har ...
  RADIO & TV Københavns Sommerunderholdning
  Bedste da. resultat er opnået af Kurt Nielsen, der kom i finalen i 1953 og 1955. KuNi
  INFORMATION. Det er ikke sandt ...
  Kontokort Af BENNY SELANDER Stadig flere ...
  HELGE ADAM MØLLER Medlem af folketinget
Headline separation problems:
  Glimrende rolle Det er en ...
  Forholdet til Walesa "Mit forhold til Walesa er, som det var for ti år siden," siger ...
Uppercase highlighting vs. abbreviations:
  MENS børnene venter, JOURNALIST Michael Larsen
  NATO, UNPROFOR, USA (organisations), VHS (brand), DNA (chemical)
number of letters? Author position? Key small words? lower-casing after triage
dancorp.avis, dancorp.pre, dan.pre
1.1. Name format
1.2. Name pattern recognition
Main principle: Fuse upper case strings
person names: Nyrup=Rasmussen
institutions: Odder=Lille=Friskole
organisations: ABN-AMRO=Asia=Equities
events: Australian=Open
Name chaining particles: prepositions, coordinators, …
personal names: Maria dos Santos, Paul la Cour, Peter the Great, Margarete den Anden, Ras al Kafji, Osama bin Laden, initially: da Vinci, van Gogh, de la Vega, ten Haaf, von Weizsäcker
organisation names: Dansk Selskab for Akupunktur, University of Michigan, Golf-Centeret for Strategiske Studier, Organisationen af Olieeksporterende Lande
place names: Place de la Concorde
brands: Muscat de Beaumes de Venise
media: le Nouvel Observateur
events: Slaget på Reden
Mid-sentence name chain initiators: Det ny Lademann Aps.
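The fusion principle in 1.2 can be illustrated with a minimal Python sketch. It is not the DanGram preprocessor: the particle list is a small hypothetical subset taken from the examples above, the function names are invented for illustration, and the choice to join particles with '=' is an assumption.

```python
# Hypothetical particle list, taken from the slide examples only.
PARTICLES = {"dos", "la", "the", "den", "al", "bin", "de", "for", "of", "af", "på"}

def is_upper(token: str) -> bool:
    """True if the token starts with an uppercase letter."""
    return token[:1].isupper()

def fuse_names(tokens: list[str]) -> list[str]:
    """Fuse runs of uppercase tokens with '=', allowing particles inside a run."""
    out: list[str] = []
    i = 0
    while i < len(tokens):
        if not is_upper(tokens[i]):
            out.append(tokens[i])
            i += 1
            continue
        chain = [tokens[i]]
        j = i + 1
        while j < len(tokens):
            if is_upper(tokens[j]):
                chain.append(tokens[j])
                j += 1
                continue
            # Skip over particles, but only accept them if an
            # uppercase token follows (so a name never ends in a particle).
            k = j
            while k < len(tokens) and tokens[k] in PARTICLES:
                k += 1
            if k > j and k < len(tokens) and is_upper(tokens[k]):
                chain.extend(tokens[j:k + 1])
                j = k + 1
            else:
                break
        out.append("=".join(chain))
        i = j
    return out

print(fuse_names("statsminister Nyrup Rasmussen talte".split()))
print(fuse_names("på Place de la Concorde".split()))
```

The sketch deliberately ignores sentence-initial capitalisation and coordination, which is why the later lexicon-supported and CG-based stages are needed to repair over-fused chains.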
1.3. Name-internal punctuation
as opposed to abbreviations and clause punctuation
initials: P.=Rostrup=Bøjesen, Carl=Th.=Pilipsen
web-addresses: http://www.corp.hum.sdu.dk
e-mails: [email protected]
personal name additions: Mr.=Bush, frk.=Nielsen, Bush=jr./sr., hr.=Jensen, ...
professional titles: Dr.=A.=Clarke, cand.=polit., mag.=art.
company names: Aps., D/S=Isbjørn
vehicles: H.M.S.=Polaris
geographicals: Nr.=Nissum, Kbh.=K, St.=Bernhard, Mt.=Everest
1.4. Name-internal numerals
yearly events: Landsstævne='98
card names: Ru=9, Sp=E
license plates: TF=34=322
town addresses: 8260=Viby=J
house addresses: 14a,=st.=tv.
kings: Christian IV
dated or versionized products: Windows 98
vehicle names: Honda Civic 1,4 GL sedan, Citroën ZX Aura 1.6i, Peugeot 206, 1.6 LX van, 1,9 TDI, DC10 fly, købe 50 V44 vindmøller
news channels: TV2, DR 1, Channel 4
bible quotes: Mt 28,1-10
1.5. In-name '&' and '/' or coordination?
• K/S Storkemarken
• Munster & Co., Møller & Baruah (fused company names?)
• Hartree & V. Booth: Safety in Biological Laboratories (separate authors?)
• NATO/FN
• 1560 kJ/420 kcal
1.6. In-name apostrophe or quote?
Quotes make title-recognition easier …
Med den melankolske 'Light Years' anslås ...
Filmen "Stay Cool" blev trukket ...
Han havde læst Den uendelige historie.
… but create their own problems:
genitive: 'The Artist's Album', Bush' korte besked, 'Big Momma's House'
company names: Kellogg's
elision: Pic du Midi d'Osseau, Côte d'Azur, Montfort l'Amaury
fixed naming systems: O'Connor, O'Neill
2. NE string recognition at raw text level, lexicon supported
dan.pre
• Left or right context: selskab/forening/institut …. For
• Name-internal 'og' vs. coordination:
  (a) pattern: Petersen & Co.
  (b) lexically governed: Told og Skat, Sol og Strand, Se og Hør.
• NOT name integrating numbers:
  (a) car companies:
      Peter købte en Peugeot=206.
      * Den gang købte Peter=206 pakkegaver.
  (b) unit words to the right:
      ... tjente Toyota 6,8 procent
      tallet for Europa er 60 procent og for USA 75 procent
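The two number constraints above (fuse only after a car company, never before a unit word) can be sketched as follows. The two word lists are tiny hypothetical stand-ins for the real lexicon, and the function name is invented for illustration.

```python
import re

# Hypothetical mini-lexicons; the real system consults its full lexicon.
CAR_COMPANIES = {"Peugeot", "Toyota", "Honda", "Citroën"}
UNIT_WORDS = {"procent", "km", "kr."}
NUMBER = re.compile(r"\d+([.,]\d+)?")

def fuse_model_numbers(tokens: list[str]) -> list[str]:
    """Fuse 'Company Number' into 'Company=Number' unless a unit word follows."""
    out: list[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        after = tokens[i + 2] if i + 2 < len(tokens) else ""
        if tok in CAR_COMPANIES and NUMBER.fullmatch(nxt) and after not in UNIT_WORDS:
            out.append(f"{tok}={nxt}")
            i += 2
        else:
            out.append(tok)
            i += 1
    return out

print(fuse_model_numbers("Peter købte en Peugeot 206".split()))
print(fuse_model_numbers("tjente Toyota 6,8 procent".split()))
```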
2.1. Low level lexicon: specialised word contexts
• fordi, derfor, siden, ifølge, har
  not: Fordi=Peter=Jensen ikke havde sendt …
  but: Fordi Peter=Jensen ikke havde sendt …
2.2. Sentence initial non-name small words (also used for sentence separation)
2.3. Words, prefixes and suffixes with <+name> valency:
adjunkt, ...chef, historiker, institut, kollega,
...erske, ...trice, virksomhed, ...ør, Hoved..., Vice....
• lektor i børsret ved Københavns Universitet # Per Schaumburg Müller
• Landsstyreformand # Jonathan Motzfeldt
• Styresystemet # Windows
• Stillehavsøen # Okinawa
• Vicelagmand # Oli Nilsson
2.4. High level lexicon: A lexicon based chunk splitter
• Old alternative: split [A-Z]…erne + uppercase, [A-Z]…ede + uppercase, with negative list of "forbidden words": Horsens, Jens, Vincens, Enschede
• New alternative: check all potential substrings of a polylexical name candidate, AND the whole string, against the full name lexicon (44.000 entries)
• Genitive splitting: allowed if first half is <hum>, <org>, <inst>, <civ> (= humanoids)
  Sonofons <org> # GSM 900-net
  Richard Strauss' <hum> # Zarathustra
  New Yorks <civ> # Manhattan
  Beach Boys' <org> # Brian Wilson <hum>
• Refuse split genitives with certain second half geographicals:
  Jensens Plads, Rådmans Boulevard
• Variable split points (more lexicon checks necessary than in genitives):
  Kommende ambassadør i Kairo # Christian Oldenburg
  Bagefter hentede Peter # Maria
  Så ansatte IBM # Kevin Mondale
  Derfor forlod Jensen (Peter Jensen) # (FC) København
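A minimal sketch of the "new alternative" for variable split points: the whole candidate and every left/right split are checked against a name lexicon. The five-entry lexicon and the rule "split only if both halves are known" are assumptions for illustration; the slides do not spell out the exact decision criteria.

```python
# Hypothetical stand-in for the full name lexicon (44.000 entries in the source).
NAME_LEXICON = {"IBM", "Kevin=Mondale", "Peter", "Maria", "Dansk=Røde=Kors"}

def split_chunk(chunk: str) -> list[str]:
    """Split a fused name candidate at the first point where both halves are known names."""
    if chunk in NAME_LEXICON:
        # The whole string is a known name: never split it.
        return [chunk]
    parts = chunk.split("=")
    for cut in range(1, len(parts)):
        left = "=".join(parts[:cut])
        right = "=".join(parts[cut:])
        if left in NAME_LEXICON and right in NAME_LEXICON:
            return [left, right]
    # No lexicon evidence for a split: keep the candidate intact.
    return [chunk]

print(split_chunk("IBM=Kevin=Mondale"))
print(split_chunk("Dansk=Røde=Kors"))
```

Checking the whole string first matters: a lexicalised polylexical name must survive even when its parts are also names in their own right.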
3. NE word recognition, morphological analyzer with lexical data and compositional rules
Full lexicon match ?
Dantag, danpost
Partial lexicon match ?
6 major & 20 minor categories:
<hum> person names
<org> organisations
<top> place names
<occ> events
<tit> semantic product names
<brand> brands, objects
e.g. known first, unknown second name
known company name with geographic extension
Morten Kaminski, Toshiba Denmark
Ambiguity
1. Name – other: Hans, Otte
2. Name NOM – Name GEN
3. Cross type: Lund (place/person), Audi (company/vehicle)
4. Systematic/underspecified: <media> = <org>/<tit>
3.1. Recognized names
3.2. Compositional analysis
No name recognized
Lower case run Full inflexional & derivational analysis for all word classes
Compositional analysis
ANC-kontor N, G8-mødet N
Talebanstyret N, Martinicocktails N, Marsåret N
AB'ernes [AB'er] N, AGF'ere N
EU-godkendt ADJ, Heisenberg'ske ADJ
Inflected names
1. PROP N: EMSen (N DEF), EMS'en (N DEF)
2. Frozen usage (lexicon): Sovjetunionen, Folketinget (PROP)
Heuristics
(a) Hyphen <hum>: Jean-Pierre Wallez, Blomster-Jensen
    but: hovedvej Pec-Prizren <top>, Lolland-Falster <top>
    Al-Qaida <org>, CO-Industri <org>, Jyllands-Posten <media>
(b) Iterations with exchanged and omitted '-' and '=': Al=Qaida
(c) Heuristic full-string name reading
4.1. NE semantic type prediction, semi-lexical compositional heuristics
Cg2adapt.dansk
Respects non-heuristic types from the analyzer/lexicon
Tries to verify/falsify semi-lexical type analyses
Uses patterns, suffixes, clue-word lists to predict types
To prevent interference between individual sections:
• ordering type predictions (e.g. <tit> early, <top> late)
• iterating certain classes (e.g. <hum>)
• NOT-conditions quoting partial or overlapping patterns that would indicate other semantic name classes.
Prepares cg-level: <non-hum> prediction
non-alphabetic characters, in-word capitals, coordinators (og, eller), certain English function words (of), non-human suffixes (-tion) etc.
4.2. NE semantic type predictor: Patterns
<tit> e.g. quotes, in-name function words (articles, pronouns etc.), "semantic things" (-loven, -brev, -song, -report, Circulære=, Redegørelse=, Dictionary= ...)
<media> e.g. -avis, -bladet, -tidende, Ugeskrift=, Kanal=, Channal=, Nyt= ...
<occ> e.g. Expedition, -freden, -krig, -krise, =Rundt, Projekt=, Konference=, Slaget=
<V> e.g. Boeing/Mercedes/Toyota=, =Combi, =Sedan, HMS=, USS=, M/S= ...
<brand> e.g. Macintosh/Phillips/Sanyo=[0-9], wine types: =Appelation, =Cru, =Sec, Edition, Yamaha/Siemens=, quality markers: =Extra, =de=Luxe, =Ultra ...
<hum> e.g. suffixes: -sen, -sson, -sky, -owa, infixes: ibn, van, ter, y, zu, di, abbreviated and part-of-name titles: frk., hr., Madame, Mlle, Morbror, jr., sr., Mc=, Al=, =Khan
<A><B> e.g. [A-Z][a-z]+(=[a-z]+([ae]ns|ea|is|um|us))+
<civ> e.g. =SSR/Republik, =Town/Ville, suffixes: -ager, -borough, -bølle, -dorf, -hausen, -løse, -ville, -polis (a number of these will receive both <civ> and <hum> tags for later disambiguation)
<top> e.g. =Bahnhof, =Bakker, =Kirke, =Manor, =Sund, =Prospekt, Islas=, Ciudad=, Gammel=, Lake=, Rio=, Sønder/Vester/Øster/Nørre=, suffixes: -fors, -kanten, -kvarteret, -marken, addresses: -stien, -strasse, -torv, -gade, -vej (the latter are also used by dantag)
<org> e.g. in-word capitals: [a-z][A-Z] (MediaSoft), "suffixes": Amba, GmbH, A/S, AG, Bros., & Co ..., type indicators: =Holding, =Organisation, =Society, =Network, Bank=of=, Banco=d[eiao], K/S, I/S, Klub=, Fonden=, morphological indicators: -con, -com, -ex, -rama, -tech, -soft
<inst> e.g. =Ambassade, =Airport?, =Børnehave, =Institut, =Universitet, =Bibliotek, =Hotel, Chez=, morphologicals: -eriet, -værk, -handel
<mat> e.g. -[cpt]am, -[cz]id, -lax, -vent, Retard=, uppercase + number (NO2, H2O)
<common> e.g. =Collection, =Samling, Ugens=, cards: Spa?=, Ru=
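An ordered pattern predictor of the kind described in 4.1 and 4.2 might look like the sketch below. The regular expressions are a small subset of the clues listed above, the ordering (<tit> early, <top> late) follows the slide, and everything else (names, data layout, first-match-wins policy) is a hypothetical simplification rather than the Cg2adapt.dansk implementation.

```python
import re
from typing import Optional

# (type, pattern) pairs in priority order; first match wins.
ORDERED_PATTERNS = [
    ("<tit>", re.compile(r"(loven|brev)$|^(Redegørelse|Dictionary)=")),
    ("<media>", re.compile(r"(avis|bladet|tidende)$|^(Ugeskrift|Kanal)=")),
    ("<occ>", re.compile(r"(freden|krig|krise)$|^(Projekt|Konference|Slaget)=")),
    ("<org>", re.compile(r"[a-z][A-Z]|=(Holding|Organisation|Society|Network)$|(A/S|GmbH)$")),
    ("<inst>", re.compile(r"=(Ambassade|Institut|Universitet|Bibliotek|Hotel)$")),
    ("<hum>", re.compile(r"(sen|sson|sky|owa)$")),
    ("<top>", re.compile(r"(stien|strasse|torv|gade|vej)$|^(Lake|Rio|Gammel)=")),
]

def predict_type(name: str, known_type: Optional[str] = None) -> Optional[str]:
    """Return a semantic type for a fused name, respecting non-heuristic types."""
    if known_type is not None:
        # Types from the analyzer/lexicon are never overridden here.
        return known_type
    for sem_type, pattern in ORDERED_PATTERNS:
        if pattern.search(name):
            return sem_type
    return None

for candidate in ["MediaSoft", "Københavns=Universitet", "Jensen", "Vestergade"]:
    print(candidate, predict_type(candidate))
```

Returning a single type is the main simplification: the source notes that some patterns assign several tags (for example both <civ> and <hum>) and leave the choice to later CG disambiguation.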
5. NE word class and case disambiguation, rule and context based
• <+name> valency of preceding noun: filmen Tornfuglene PROP <tit>
• semantic product class <sem> in preceding noun: Lynda La Plantes tv-serie "Mistænkt"
• topologicals rather than topological-derived nouns: Amagerbrogade PROP <top>
• establishment NOM (not hum-GEN), if no np-head to the right: vi spiste på Marion's i går
• GEN - GEN and NOM - NOM coordination matches: Peters NOM og Jensen kom kørende
• NOM name readings are discarded in favor of GEN names, if there is an IDF noun or NOM name to the right with only matching prenominals in between: Australiens mest kendte sangere.
• Sentence-initially, names are discarded in favor of verbs and function words, if followed by an np
• non-compound nouns are favoured over heuristic names
• heuristic names are favoured over compound names in a left lower case context
• non-heuristic names are favoured over compound or derived nouns sentence-initially or in left upper case context
dancg.morf (ca 2.400 rules)
Context based decisions are safer than pattern based predictions, and support each other
Full valency and semantic class context can be drawn upon
Iterated disambiguation creates safer context for more dangerous decisions
6. NE chaining, a repair mechanism for faulty NE string recognition at levels (1) and (2)
cleanmorf.dan
Performs chunking choices too hard or too ambiguous to make before CG:
• Fuses Hans=Jensen og Otte=Nielsen, but keeps Hans Porsche and Otte PC'er, using CG-recognition of Jensen PROP <hum>, Nielsen PROP <hum>, Porsche PROP <V> and PC'er N <cc-h>
• Fuses PROP and certain semantic N-types, if upper case and so far unrecognized:
  PROP + N <build> -> PROP <top>: Betty=Nansen Broen
  PROP + N <HH> -> PROP <org>: Betty=Nansen Foreningen
  PROP + N <sem> -> PROP <tit>: Betty=Nansen Prisen
• Repairs erroneous PROP splitting by the preprocessor, if later contextual typing asks for fusion:
  PROP <org, media> + PROP <top, civ>: Dansk=Røde=Kors Afrika
  PROP <civ> + PROP <org, inst>: Danmarks Monetære Institut
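The second chaining rule (PROP plus an uppercase semantic noun) can be sketched in Python as below. The Token class and function name are hypothetical, and the condition "so far unrecognized" is reduced to a simple uppercase check; this illustrates the mapping table, not cleanmorf.dan itself.

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str
    pos: str   # "PROP" or "N"
    sem: str   # semantic tag, e.g. "<hum>", "<build>", "<HH>", "<sem>"

# Noun type of the second element -> name type of the fused chain (from the slide).
FUSION_TYPES = {"<build>": "<top>", "<HH>": "<org>", "<sem>": "<tit>"}

def chain_names(tokens: list[Token]) -> list[Token]:
    """Fuse PROP + uppercase N <build>/<HH>/<sem> into a single typed PROP."""
    out: list[Token] = []
    for tok in tokens:
        prev = out[-1] if out else None
        if (prev is not None and prev.pos == "PROP" and tok.pos == "N"
                and tok.form[:1].isupper() and tok.sem in FUSION_TYPES):
            out[-1] = Token(f"{prev.form}={tok.form}", "PROP", FUSION_TYPES[tok.sem])
        else:
            out.append(tok)
    return out

print(chain_names([Token("Betty=Nansen", "PROP", "<hum>"), Token("Broen", "N", "<build>")]))
```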
7. NE function classes, mapped and disambiguated by context based rules
dancg.syn (ca. 4.400 rules)
Handles, among other things, the syntactic function and attachment of names. The following are examples of functions relevant to the subsequent type mapper:
(i) @N< (nominal dependents)
præsident Bush, filmen "The Matrix"
(ii) @APP (identifying appositions)
Forældrebestyrelsens formand, Kurt Chistensen, anklager borgmester ...
(iii) @N<PRED (predicating appositions)
John Andersen, distriktschef, Billund, 60 år
8. NE semantic types, mapped and disambiguated by context based rules
dancg.prop (428 rules)
(a) Type mapper (introduces ambiguity, instantiates earlier tags)
(b) Type disambiguator (reduces ambiguity)
• Uses the same 6 major and 20 subcategories used by the lexicon and pattern based name predictor
• Draws on syntactic relations, sentence context and lexical knowledge
• Can override previously assigned type readings
• Can disambiguate previously ambiguous readings
8.1 Cross-nominal prototype transfer
• Post-nominal attachment: i byen Rijnsburg
  MAP (<top>) TARGET (PROP @N<) (-1 (N NOM) LINK 0 N-TOP) ;
• Missing hyphen: Uppenskij katedralen
  MAP (<top>) TARGET (PROP) (1 @N<FUSE LINK 0 N-TOP) ;
• Subject complement inference: Moskva er en by i Rusland
  SELECT (<top>) (0 @SUBJ>) (*1 @MV LINK 0 <vk> LINK *1 @<SC LINK 0 N-TOP) ;
• Mines semantic N-types from relative clauses: Strongyle, som de gamle grækere kaldte øen
  SELECT (<top>) (0 NOM) (*1 (<rel> INDP @SUBJ>) BARRIER NON-KOMMA LINK *1 VFIN LINK 0 @FS-N< LINK -1 ALL LINK *1 @MV LINK 0 <vk> LINK *1 @<SC LINK 0 N-TOP) ;
  MAP (%top) TARGET (PROP NOM) (*1 ("som") BARRIER NON-KOMMA LINK 0 @OC> LINK *1 @MV LINK 0 ("kalde" AKT) LINK *1 N-TOP BARRIER NON-PRE-N/ADV LINK 0 @<ACC) ;
• "Som"-comparison: tv-programmer som "Robinson-Ekspeditionen" (here, <tit> overrides previous <occ>)
  MAP (%tit) TARGET (PROP NOM) (0 @P< OR @AS<) (-1 ("som") LINK 0 @N< OR @AS-N<) (-2 N-SEM) (NOT -2 N-HUM) ;
8.2 Coordination based type inference
1. Maps "close coordinators" (&KC-CLOSE):
   ADD (&KC-CLOSE) TARGET (KC) (*1 @SUBJ> BARRIER @NON->N) (-1 @SUBJ>) ;
2. Then uses these tags in disambiguation rules, e.g. Arafat @SUBJ> og hans Palæstinas=Selvstyre @SUBJ>
   REMOVE %non-h (0 %hum-all) (*-1 &KC-CLOSE BARRIER @NON->N LINK -1C %hum OR N-HUM LINK 0 NOM) ;
   SELECT (<top>) (1 &KC-CLOSE) (*2C <top> BARRIER @NON->N) ;
3. Danish has <hum>-only and <non-hum> pronouns:
   SELECT %hum (0 @SUBJ>) (1 KC) (2 ("han" GEN) OR ("hun" GEN)) (*3 @SUBJ> BARRIER @NON->N/KOMMA) ; # Hejberg og hans skole
   REMOVE %hum (0 @SUBJ>) (1 KC) (2 ("den" GEN) OR ("det" GEN)) (*3 @SUBJ> BARRIER @NON->N/KOMMA) ; # Anden Verdenskrig og dens mange slag
8.3 PP-contexts
• Word-specific narrow context
  MAP (<top>) TARGET (PROP) (-1 ("for" PRP)) (-2 ("syd") OR ("vest") OR ("nord") OR ("øst")) ;
• Np-level vs. clause level function
  ADD (<top>) TARGET (PROP @P<) (-1 ("i" PRP)) (NOT -1 @PIV) (NOT -2 <+i>) ; # safe, early rule
  REMOVE (<top>) (0 @P<) (-1 ("i" PRP)) (-2 (<+i>)) ; # heuristic, later rule
• Pp-attachment inference, class based: godt 40 km fra Madras
  REMOVE %non-top (-1 ("fra" PRP) OR ("til" PRP)) (-2 N-DIST) (-3 NUM) ;
• Pp-attachment inference, word list based
  MAP (%org) TARGET (PROP NOM @P<) (-1 ("i" PRP)) (-2 ("afdelingsleder") OR ("ansat") OR ("chef") OR ("direktør") OR ("forvaltningschef") OR ("koordinator") OR ("personalechef") OR ("souschef")) (NOT 0 <top> OR <civ>) ;
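For readers unfamiliar with CG notation, the first rule in this section (MAP <top> on a PROP preceded by "for" and a compass word, as in "syd for Odense") can be paraphrased procedurally. The dictionary-based token format and the function name are assumptions made for this sketch; a real CG compiler handles readings, cohorts and rule ordering in ways not modelled here.

```python
COMPASS = {"syd", "vest", "nord", "øst"}

def map_top_after_compass(tokens: list[dict]) -> list[dict]:
    """Add <top> to a PROP if position -1 is 'for' (PRP) and position -2 is a compass word."""
    for i, tok in enumerate(tokens):
        if tok["pos"] != "PROP" or i < 2:
            continue
        prev1, prev2 = tokens[i - 1], tokens[i - 2]
        if prev1["lemma"] == "for" and prev1["pos"] == "PRP" and prev2["lemma"] in COMPASS:
            tok["tags"].add("<top>")
    return tokens

sentence = [
    {"lemma": "syd", "pos": "ADV", "tags": set()},
    {"lemma": "for", "pos": "PRP", "tags": set()},
    {"lemma": "Odense", "pos": "PROP", "tags": set()},
]
print(map_top_after_compass(sentence)[2]["tags"])
```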
8.4 Genitive mapping
• MAP (%org) TARGET (GEN @>N) (*1 (N IDF) BARRIER @NON->N LINK 0 GEN-ORG) (NOT 0 <inst> OR <media> OR <party> OR <civ> OR <top>) ; # Microsofts generalforsamling/aktiekurs ("hard" GEN-ORG set)
• MAP (%org) TARGET (GEN @>N) (*1 (N IDF) BARRIER @NON->N LINK 0 GEN-ORG/HUM) (NOT 0 <inst> OR <media> OR <party> OR <civ> OR <top> OR <hum>) ; # Microsofts/ Bill=Gates advokat/hjemmeside ("soft" GEN-ORG set)
• REMOVE %non-h (0 GEN LINK 0 %h) (*1 N BARRIER @NON->N/KOMMA LINK 0 (<p>) OR (<pp>)) ; # owning thoughts and "thought products". %non-h respects also "humanoids", <org>, <civ> etc.
8.5 Prenominal context: Using adjective classes
Uses semantic adjective classes, e.g.
1. Type based, more general, less safe:
   LIST ADJ-HUM = <Dphys> <Dpsych> <Dsoc> <Drel> ;
2. Word based, more specific and safer:
   LIST ADJ-HUM& = <alder> "adfærdsvanskelig" "adspredt" "affektlabil" "afklaret" "afmægtig" "afslappet" "afstumpet" "afvisende" "agtbar" "agtpågivende" "agtsom" "alert" "alfaderlig" "alkærlig" "altopgivende" "altopofrende" "alvorsfuld" ....
MAP (%hum) TARGET (<heur> PROP NOM) (-1 AD LINK 0 ADJ-HUM&) (*-2 (ART S DEF) BARRIER @NON->N) ; # Den langlemmede Kanako=Yonekura
ADD (%hum) TARGET (<heur> PROP NOM) (-1 AD LINK 0 ADJ-HUM) (*-2 (ART S DEF) BARRIER @NON->N) ; # Den langlemmede Kanako=Yonekura
Evaluation
                          Recall   Precision   F-score
All word classes[1]       98.6     98.7        98.65
All syntactic functions   95.4     94.6        94.9
[1] Verbal subcategories (present PR, past IMPF, infinitive INF, present and past participle PCP1/2) and pronoun subcategories (inflecting DET, uninflecting INDP and personal PERS) were counted as different PoS.
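The slides do not define the F-score. Assuming it is the balanced harmonic mean of recall and precision (F1), the word class row can be reproduced with a short check; the function name is illustrative.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score (F1): harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

# Word class row from the table: recall 98.6, precision 98.7
print(round(f_score(98.6, 98.7), 2))
```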
CG-annotation for Danish news text
Performance statistics Korpus 90
Performance statistics Korpus 2000
Cross-class and class-internal name type errors
Comparisons
• LTG (Mikheev et al. 1998) achieved an overall F-measure of 93.39, using hybrid techniques involving both probabilistics/HMM, name/suffix lists and sgml-manipulating rules
• MENE (Borthwick et al. 1998), maximum entropy training:
  - in-domain/same-topic F-scores of up to 92.20
  - 97.12% for a hybrid system integrating other MUC-7-systems
  - cross-topic formal test, F-scores: 84.22 (pure MENE), 92 (hybrid MENE)
  - possible weakness of trained systems: heavy training data bias?
Korpus90/2000, which was used for the evaluation of the rule based system presented here, is a mixed-genre corpus, and even its news texts are highly cross-domain/cross-topic, since sentence order has been randomized for copyright reasons.
What is it being used for?
• Enhance ordinary grammatical analysis
  - noun-disambiguation
  - semantic selection restriction fillers
• Corpus research on names
• Enhance IR-systems: e.g. Question-answering
Outlook
• Future direct comparison might corroborate the intuition that a hand-crafted system is less likely to have a domain/topic bias than automated learning systems with limited training data.
• Balancing strengths and weaknesses, future work should also examine to which degree automated learning / probabilistic systems can interface with or supplement Constraint Grammar based NER systems
• For large chunks, Text/Discourse based memory should be used for name type disambiguation, so clear cases and a majority vote could determine the class of unknown names
• With a larger window of analysis, anaphora resolution across sentence boundaries might help NER ("human" pronouns, definite np's, …)
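The discourse memory idea from the outlook (clear cases and a majority vote decide the class of unknown names) is proposed as future work, not as an existing component. A minimal sketch of how such a vote could work, with an invented function name and data layout:

```python
from collections import Counter
from typing import Optional

Mention = tuple[str, Optional[str]]

def vote_types(mentions: list[Mention]) -> list[Mention]:
    """Fill in untyped mentions of a name with the majority type seen in the same text."""
    counts: dict[str, Counter] = {}
    for name, sem_type in mentions:
        if sem_type is not None:
            counts.setdefault(name, Counter())[sem_type] += 1
    majority = {name: c.most_common(1)[0][0] for name, c in counts.items()}
    return [(name, sem_type if sem_type is not None else majority.get(name))
            for name, sem_type in mentions]

text_mentions = [("Lund", "<top>"), ("Lund", "<hum>"), ("Lund", "<top>"), ("Lund", None)]
print(vote_types(text_mentions))
```

Mentions that already carry a type are left untouched here; whether the vote should also override minority readings is a design question the slides leave open.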
Where to reach us: http://beta.visl.sdu.dk - http://corp.hum.sdu.dk