Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings


  • Slide 1
  • Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings
  • Slide 2
  • Janine Toole, Simon Fraser University, Burnaby, BC, Canada. From ANLP-NAACL Proceedings, April 29-May 4, 2000 (pp. 173-179)
  • Slide 3
  • Goal: automatic categorization of unknown words
      ◦ Unknown word (UnkWrd): a word not contained in the lexicon of an NLP system
      ◦ "Unknown-ness" is a property relative to a particular NLP system
  • Slide 4
  • Motivation
      ◦ Degraded system performance in the presence of unknown words
      ◦ The effect can be disproportionate: Min (1996) found that only 0.6% of the words in 300 e-mails were misspelled, yet 12% of the sentences contained an error (discussed in Min and Wilson, 1998)
      ◦ Difficulties translating live closed captions (CC): 5 seconds to transcribe dialogue, with no post-editing
  • Slide 5
  • Reasons for unknown words
      ◦ Proper name
      ◦ Misspelling
      ◦ Abbreviation or number
      ◦ Morphological variant
  • Slide 6
  • And my favorite...
      ◦ Misspoken words
      ◦ Examples (courtesy H. K. Longmore): *I'll phall you on the cone (call, phone); *I did a lot of hiking by mysummer this self (myself this summer)
  • Slide 7
  • What to do?
      ◦ Identify the class of the unknown word
      ◦ Take action based on the goals of the system and the class of the word: correct the spelling, expand the abbreviation, convert the number format
  • Slide 8
  • Overall System Architecture
      ◦ Multiple components, one per category
      ◦ Each returns a confidence measure (Elworthy, 1998)
      ◦ Evaluate the results from each component to determine the category
      ◦ One reason for this approach: it takes advantage of existing research
  • Slide 9
  • Simplified Version: Names & Spelling Errors
      ◦ Decision tree architecture: combine multiple types of evidence about the word
      ◦ Results combined using a weighted voting procedure
      ◦ Evaluation: live CC data, replete with a wide variety of UnkWrds
  • Slide 10
  • Name Identifier
      ◦ Proper names ==> proper name bucket
      ◦ Others ==> discard
      ◦ PN: a person, place, or concept, typically requiring capitals in English
  • Slide 11
  • Problems
      ◦ CC is ALL CAPS!
      ◦ No confidence measure with existing PN recognizers
      ◦ Perhaps future PNRs will work?
  • Slide 12
  • Solution
      ◦ Build a custom PNR
  • Slide 13
  • Decision Trees
      ◦ Highly explainable: the features affecting an analysis are readily understood
      ◦ Well suited to combining a variety of information
      ◦ Don't grow the tree from seed: use IBM's Intelligent Miner suite
      ◦ The DT algorithm itself is ignored here; the point is the application of DTs
  • Slide 14
  • Proper Names - Features
      ◦ 10 features specified per UnkWrd: the POS and detailed POS of the UnkWrd and of the 2 words on either side (see the sketch below)
          ▪ A rule-based system supplies the detailed tags
          ▪ An in-house statistical parser supplies the POS
      ◦ Would include a feature indicating the presence of initial upper case if the data had it
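The slides name the feature set but not its encoding. A minimal sketch of assembling the 10-feature record (POS and detailed POS for the unknown word and the two words on either side) follows; the function names, the PAD value for out-of-range positions, and the tagger callables are illustrative assumptions, not the paper's code.

    def name_features(tokens, i, pos, detailed_pos):
        """Build the 10-feature record for the unknown word at index i.
        pos(word) and detailed_pos(word) stand in for the in-house taggers."""
        features = {}
        for offset in range(-2, 3):
            j = i + offset
            word = tokens[j] if 0 <= j < len(tokens) else None
            features["pos_%d" % offset] = pos(word) if word else "PAD"            # POS tag
            features["dpos_%d" % offset] = detailed_pos(word) if word else "PAD"  # detailed tag
        return features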
  • Slide 15
  • Misspellings
      ◦ An unintended, orthographically incorrect representation
      ◦ Relative to the NLP system
      ◦ 1 or more additions, deletions, substitutions, reversals, or punctuation errors
  • Slide 16
  • Orthography
      ◦ or·thog·ra·phy, n. 1a: the art of writing words with the proper letters according to standard usage; 1b: the representation of the sounds of a language by written or printed symbols; 2: a part of language study that deals with letters and spelling
  • Slide 17
  • Misspellings - Features
      ◦ Derived from prior research (including the author's own)
      ◦ Abridged list of features used: corpus frequency, word length, edit distance, ispell information, character sequence frequency, non-English characters
  • Slide 18
  • Misspellings - Features (cont.)
      ◦ Word length (Agirre et al., 1998)
      ◦ Predictions of the correct spelling are more accurate for words longer than 4 characters
  • Slide 19
  • Misspellings - Features (cont.)
      ◦ Edit distance (sketched below)
          ▪ 1 edit distance == 1 substitution, addition, deletion, or reversal
          ▪ 80% of errors are within 1 edit distance of the intended word
          ▪ 70% are within 1 edit distance of the intended word
      ◦ Unix spell checker ispell: the edit distance feature = the distance from the UnkWrd to the closest ispell suggestion, or 30 when ispell offers no suggestion
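A minimal sketch of this feature, assuming the Damerau-Levenshtein distance (substitution, insertion, deletion, and adjacent-character reversal each cost 1); the function names are ours, and in the paper the candidate corrections come from ispell rather than a caller-supplied list.

    def damerau_levenshtein(a, b):
        """Edit distance counting substitutions, insertions, deletions,
        and reversals of adjacent characters, one point each."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # reversal
        return d[len(a)][len(b)]

    def edit_distance_feature(unknown, suggestions):
        """Distance to the closest spelling suggestion, or 30 if none."""
        if not suggestions:
            return 30
        return min(damerau_levenshtein(unknown, s) for s in suggestions)

For example, edit_distance_feature("temt", ["tempt"]) returns 1, matching the temt/tempt pair cited later in the slides.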
  • Slide 20
  • Misspellings - Features (cont.)
      ◦ Character sequence frequency: "wful", "rql", etc. (a sketch follows)
      ◦ A composite of the frequencies of the individual character sequences
      ◦ Relevant to the choice of 1 tree vs. many
      ◦ Non-English characters: transmission noise in the CC case, or foreign names
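The slides do not specify the sequence length or the combining function; the sketch below assumes character trigrams whose corpus frequencies are averaged into a single composite score, so rare sequences such as "rql" pull the score down.

    from collections import Counter

    def trigram_counts(corpus_words):
        """Count character trigrams over a corpus, marking word boundaries with '#'."""
        counts = Counter()
        for word in corpus_words:
            padded = "#" + word + "#"
            for i in range(len(padded) - 2):
                counts[padded[i:i + 3]] += 1
        return counts

    def char_seq_score(word, counts):
        """Average trigram frequency of the word under the corpus counts."""
        padded = "#" + word + "#"
        grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
        return sum(counts[g] for g in grams) / len(grams)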
  • Slide 21
  • Decision Time
      ◦ Misspelling module says it is not a misspelling, PNR says it is a name -> name
      ◦ Both negative -> neither a misspelling nor a name
      ◦ What if both are positive? The one with the highest confidence measure wins
      ◦ Confidence measure: one per leaf, calculated from the training data as correct predictions / total # of predictions at that leaf
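A minimal sketch of this decision step, assuming each module returns a positive/negative verdict plus the confidence of the leaf that made the prediction; the function name and the "other" label are ours.

    def categorize(name_positive, name_conf, miss_positive, miss_conf):
        """Resolve the two module verdicts into a single category."""
        if name_positive and not miss_positive:
            return "name"
        if miss_positive and not name_positive:
            return "misspelling"
        if not name_positive and not miss_positive:
            return "other"
        # Both positive: the higher leaf confidence wins.
        return "name" if name_conf >= miss_conf else "misspelling"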
  • Slide 22
  • Evaluation - Dataset
      ◦ 7000 cases of UnkWrds
      ◦ 2.6 million word corpus
      ◦ Live business news captions
      ◦ 70.4% manually identified as names
      ◦ 21.3% as misspellings
      ◦ The rest: other types of UnkWrds
  • Slide 23
  • Dataset (cont.)
      ◦ 70% of the dataset randomly selected as the training corpus
      ◦ The remainder (2100 cases) for the test corpus
      ◦ Test data: 10 samples drawn by random selection with replacement (sketched below)
      ◦ A total of 10 test datasets
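A minimal sketch of this resampling, assuming each of the 10 test sets is the same size as the 2100-case held-out pool; the slides do not state the sample size.

    import random

    def make_test_sets(held_out, n_sets=10):
        """Draw n_sets bootstrap-style samples, with replacement."""
        return [[random.choice(held_out) for _ in held_out]
                for _ in range(n_sets)]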
  • Slide 24
  • Evaluation - Training
      ◦ Train a DT for the misspelling module on the misspelling features alone
      ◦ Train a DT for the misspelling module on the misspelling & name features
      ◦ Train a DT for the name module on the name features alone
      ◦ Train a DT for the name module on the name & misspelling features
  • Slide 25
  • Misspelling DT Results - Table 3
      ◦ Baseline: no recall
      ◦ 1st decision tree: 73.8% recall
      ◦ 2nd decision tree: an increase in precision, with a decrease in recall by a similar amount
      ◦ The name features are not predictive for identifying misspellings in this domain
      ◦ Not surprising: 8 of the 10 name features deal with information external to the word itself
  • Slide 26
  • Misspelling DT failures
      ◦ 2 classes of omissions
      ◦ Misidentifications: foreign words
  • Slide 27
  • Omission type 1
      ◦ Words with the typical characteristics of English words
      ◦ Differ from the intended word by the addition or deletion of a syllable: "creditability" for "credibility", "coordinatored" for "coordinated", "representives" for "representatives"
  • Slide 28
  • Omission type 2
      ◦ Words differing from the intended words by the deletion of a blank: "webpage", "crewmembers", "rainshower"
  • Slide 29
  • Fixes
      ◦ Fix for the 2nd type: a feature specifying whether the UnkWrd can be split into 2 known words (see the sketch below)
      ◦ Fix for the 1st type is more difficult: a homophonic relationship calls for a phonetic distance feature
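A minimal sketch of the proposed split feature; the lexicon is modeled as a set of known words, and the names are ours.

    def splits_into_known_words(word, lexicon):
        """True if the word divides into two words that are both known,
        e.g. 'webpage' -> 'web' + 'page'."""
        return any(word[:i] in lexicon and word[i:] in lexicon
                   for i in range(1, len(word)))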
  • Slide 30
  • Name DT Results - Table 4
      ◦ 1st tree: precision is a large improvement; recall is excellent
      ◦ 2nd tree: increased recall & precision, unlike the 2nd misspelling DT. Why?
  • Slide 31
  • Name DT failures
      ◦ Not identified as a name: names with determiners ("the steelers", "the pathfinder")
      ◦ Adept at individual people and places, but trouble with names having distributions similar to common nouns
  • Slide 32
  • Name DT failures (cont.)
      ◦ Incorrectly identified as names: unusual character sequences such as "sxetion", "fwlamg"
      ◦ The misspelling identifier correctly identifies these as misspellings
      ◦ The decision-making component needs to resolve these
  • Slide 33
  • Unknown Word Categorizer
      ◦ Precision = # of correct misspelling or name categorizations / total # of times a word was identified as a misspelling or name
      ◦ Recall = # of times the system correctly identified a misspelling or name / # of misspellings and names existing in the data
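A minimal sketch of these metrics, assuming gold and predicted labels drawn from {"name", "misspelling", "other"}; the guard against empty denominators is ours.

    def precision_recall(gold, pred):
        """Precision and recall over the name and misspelling categories."""
        targets = {"name", "misspelling"}
        correct = sum(1 for g, p in zip(gold, pred) if p in targets and g == p)
        predicted = sum(1 for p in pred if p in targets)
        actual = sum(1 for g in gold if g in targets)
        return (correct / predicted if predicted else 0.0,
                correct / actual if actual else 0.0)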
  • Slide 34
  • Confusion matrix of the tie-breaker
      ◦ Table 5: good results
      ◦ Only 5% of cases (both modules positive) needed the confidence measure
      ◦ In the majority of those cases the decision-maker rules in favor of the name prediction
  • Slide 35
  • Confusion matrix (cont.)
      ◦ The name DT has better results, so it is likely to have higher confidence measures
      ◦ UnkWrd categorized as a name when it is a misspelling (37 cases)
      ◦ Phonetic relation with the intended word: "temt" for "tempt"; "floyda" for "Florida"
  • Slide 36
  • Encouraging Results
      ◦ A productive approach
      ◦ Future focus:
          ▪ Improve the existing components: features sensitive to the distinction between names & misspellings
          ▪ Develop components to identify the remaining types: abbreviations, morphological variants, etc.
          ▪ An alternative decision-making process
  • Slide 37
  • Portability
      ◦ Few linguistic resources required:
          ▪ A corpus of the new domain (language)
          ▪ Spelling suggestions (ispell is available for many languages)
          ▪ A POS tagger
  • Slide 38
  • Possible portability problems
      ◦ Edit distance assumes words consist of alphabetic characters that have undergone substitutions/additions/deletions
          ▪ Less useful for Chinese or Japanese
      ◦ The general approach is still transferable: consider the means by which misspellings differ from the intended words, and identify features to capture those differences
  • Slide 39
  • Related Research
      ◦ Assume all UnkWrds are misspellings
      ◦ Rely on capitalization
      ◦ Expectations from scripts: rely on world knowledge of the situation, e.g. naval ship-to-shore messages
  • Slide 40
  • Related Research (cont.)
      ◦ (Baluja et al., 1999): a DT classifier to identify PNs in text
      ◦ 3 feature types: word level, dictionary level, POS information
      ◦ Highest F-score: 95.2%, slightly higher than the name module
  • Slide 41
  • But...
      ◦ Different tasks: identifying all words & phrases that are PNs vs. identifying only those words which are UnkWrds
      ◦ Different data: case information is available
      ◦ With the word-level (case) features excluded, the F-score is 79.7%
  • Slide 42
  • Conclusion
      ◦ An UnkWrd categorizer to identify misspellings & names
      ◦ Individual components, each specializing in identifying a particular class of UnkWrd
      ◦ The 2 existing components use DTs
      ◦ Encouraging results in a challenging domain (live CC transcripts)!