Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings
- Slide 1
- Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings
- Slide 2
- Janine Toole, Simon Fraser University, Burnaby, BC, Canada. From ANLP-NAACL Proceedings, April 29-May 4, 2000 (pp. 173-179).
- Slide 3
- Goal: automatic categorization of unknown words
  - Unknown words (UnkWrds): words not contained in the lexicon of the NLP system
  - "Unknown-ness" is a property relative to a particular NLP system (a minimal sketch of the lexicon test follows below)
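As a minimal sketch of that relativity, assuming a toy set-based lexicon and a whitespace tokenizer (both hypothetical stand-ins, not part of the paper):

```python
# "Unknown-ness" is relative to whatever lexicon the NLP system uses.
# LEXICON and tokenize are hypothetical stand-ins for illustration.
LEXICON = {"the", "cat", "sat", "on", "mat"}

def tokenize(text):
    return text.lower().split()

def unknown_words(text, lexicon=LEXICON):
    """Return the tokens not covered by the system's lexicon."""
    return [tok for tok in tokenize(text) if tok not in lexicon]

print(unknown_words("The cat sat on the matt"))  # ['matt']
```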
- Slide 4
- Motivation
  - Degraded system performance in the presence of unknown words
  - Disproportionate effect possible: Min (1996) found only 0.6% of the words in 300 e-mails were misspelled, yet 12% of the sentences contained an error (discussed in (Min and Wilson, 1998))
  - Difficulties translating live closed captions (CC): 5 seconds to transcribe the dialogue, with no post-editing
- Slide 5
- Reasons for unknown words
  - Proper name
  - Misspelling
  - Abbreviation or number
  - Morphological variant
- Slide 6
- And my favorite...
  - Misspoken words
  - Examples (courtesy H. K. Longmore): *I'll phall you on the cone (call, phone); *I did a lot of hiking by mysummer this self (myself this summer)
- Slide 7
- What to do?
  - Identify the class of the unknown word
  - Take action based on the goals of the system and the class of the word: correct the spelling, expand the abbreviation, convert the number format
- Slide 8
- Overall System Architecture
  - Multiple components, one per category
  - Each component returns a confidence measure (Elworthy, 1998)
  - Evaluate the results from each component to determine the category (see the sketch below)
  - One reason for this approach: take advantage of existing research
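The paper gives no code for this design; the following is a minimal sketch of the component protocol under the assumption that each category-specific component returns a (verdict, confidence) pair. All names here are illustrative:

```python
from typing import Callable, List, Tuple

# word -> (verdict, confidence); each component specializes in one category
Component = Callable[[str], Tuple[bool, float]]

def categorize(word: str, components: List[Tuple[str, Component]]) -> str:
    """Keep the most confident positive verdict; default to 'other'."""
    best_label, best_conf = "other", 0.0
    for label, component in components:
        verdict, confidence = component(word)
        if verdict and confidence > best_conf:
            best_label, best_conf = label, confidence
    return best_label

# Hypothetical toy components, for illustration only.
is_name = lambda w: (w.istitle(), 0.9)
is_misspelling = lambda w: (len(w) > 4, 0.6)
print(categorize("Floyda", [("name", is_name), ("misspelling", is_misspelling)]))
# -> 'name' (both fire; the name component is more confident)
```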
- Slide 9
- Simplified Version: Names & Spelling Errors
  - Decision tree architecture: combine multiple types of evidence about the word
  - Results combined using a weighted voting procedure
  - Evaluation: live CC data, replete with a wide variety of UnkWrds
- Slide 10
- Name Identifier
  - Proper names ==> proper name bucket
  - Others ==> discard
  - PN: a person, place, or concept, typically requiring capitals in English
- Slide 11
- Problems
  - CC is ALL CAPS!
  - No confidence measure with existing PN recognizers
  - Perhaps future PNRs will work?
- Slide 12
- Solution
  - Build a custom PNR
- Slide 13
- Decision Trees
  - Highly explainable: one can readily understand the features affecting the analysis
  - Well suited for combining a variety of information
  - Don't grow the tree from seed: use IBM's Intelligent Miner suite (a stand-in sketch follows below)
  - The DT algorithm itself is not the focus; the point is the application of DTs
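The slides name IBM's Intelligent Miner as the tool; as a stand-in (a swapped-in library, not the authors'), the same idea can be sketched with scikit-learn's DecisionTreeClassifier. The feature values below are invented for illustration:

```python
# Stand-in for IBM's Intelligent Miner: a decision tree over per-word
# features. Feature values here are illustrative, not from the paper.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row describes one unknown word: [corpus_freq, word_length, edit_dist]
X = [[0, 4, 1], [3, 9, 12], [0, 6, 1], [5, 11, 14]]
y = ["misspelling", "name", "misspelling", "name"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
# Decision trees are highly explainable: the learned rules can be printed.
print(export_text(tree, feature_names=["corpus_freq", "word_length", "edit_dist"]))
```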
- Slide 14
- Proper Names - Features
  - 10 features specified per UnkWrd
  - POS and detailed POS of the UnkWrd and of the 2 words on either side (see the sketch below)
  - Rule-based system for detailed tags; in-house statistical parser for POS
  - Would include a feature indicating the presence of initial upper case, if the data had it
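A sketch of how the POS-context features might be assembled, assuming a tagger has already run; the padding token and helper function are hypothetical:

```python
# Hypothetical sketch of the POS-context features: the POS of the unknown
# word and of the two words on either side. PAD marks positions that fall
# outside the sentence.
PAD = "<none>"

def pos_window_features(pos_tags, i, width=2):
    """Return the POS tags of tokens i-2 .. i+2 as a flat feature list."""
    padded = [PAD] * width + pos_tags + [PAD] * width
    return padded[i : i + 2 * width + 1]

tags = ["DT", "NN", "VBD", "UNK", "IN"]
print(pos_window_features(tags, 3))  # ['NN', 'VBD', 'UNK', 'IN', '<none>']
```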
- Slide 15
- Misspellings
  - An unintended, orthographically incorrect representation
  - Relative to the NLP system
  - 1 or more additions, deletions, substitutions, reversals, or punctuation changes
- Slide 16
- Orthography
  - Word: orthography \ȯr-ˈthäg-rə-fē\ n
  - 1a: the art of writing words with the proper letters according to standard usage
  - 1b: the representation of the sounds of a language by written or printed symbols
  - 2: a part of language study that deals with letters and spelling
- Slide 17
- Misspellings - Features
  - Derived from prior research (including the author's own)
  - Abridged list of features used: corpus frequency, word length, edit distance, ispell information, character sequence frequency, non-English characters
- Slide 18
- Misspellings Features (cont.)
  - Word length (Agirre et al., 1998)
  - Predictions for the correct spelling are more accurate if |w| > 4
- Slide 19
- Misspellings Features (cont.)
  - Edit distance: 1 edit == 1 substitution, addition, deletion, or reversal
  - 80% of errors within 1 edit distance of the intended word
  - 70% within 1 edit distance of the intended word
  - Unix spell checker: ispell
  - Edit distance feature = distance from the UnkWrd to the closest ispell suggestion, or 30 if ispell offers no suggestion (see the sketch below)
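A sketch of the feature, assuming "reversal" means an adjacent transposition (i.e. Damerau-Levenshtein distance) and that 30 is the default when ispell returns no suggestion, as the slide's "or 30" suggests:

```python
# Edit distance where one edit is a substitution, addition, deletion,
# or reversal (adjacent transposition).
def edit_distance(a: str, b: str) -> int:
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # addition
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # reversal
    return d[len(a)][len(b)]

def ispell_distance(word: str, suggestions) -> int:
    """Distance to the closest ispell suggestion, or 30 if there is none."""
    return min((edit_distance(word, s) for s in suggestions), default=30)

print(edit_distance("temt", "tempt"))  # 1 (one addition)
print(ispell_distance("fwlamg", []))   # 30 (no suggestions)
```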
- Slide 20
- Misspellings Features (cont.)
  - Character sequence frequency: wful, rql, etc. (a sketch follows below)
  - A composite of the individual character sequence frequencies
  - Relevance to 1 tree vs. many
  - Non-English characters: transmission noise in the CC case, or foreign names
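A sketch of one way to build the composite character-sequence feature; the trigram size, boundary markers, and mean-log composite are illustrative choices, not the paper's specification:

```python
# Score how plausible a word's character n-grams are, given counts from
# a reference corpus. Corpus and composite function are illustrative.
from collections import Counter
from math import log

def ngram_counts(corpus_words, n=3):
    counts = Counter()
    for w in corpus_words:
        padded = f"#{w}#"  # mark word boundaries
        counts.update(padded[i : i + n] for i in range(len(padded) - n + 1))
    return counts

def seq_freq_score(word, counts, n=3):
    """Composite (mean log count) of the word's character sequences."""
    padded = f"#{word}#"
    grams = [padded[i : i + n] for i in range(len(padded) - n + 1)]
    return sum(log(1 + counts[g]) for g in grams) / len(grams)

counts = ngram_counts(["awful", "lawful", "require", "sequence"])
print(seq_freq_score("wful", counts) > seq_freq_score("rql", counts))  # True
```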
- Slide 21
- Decision Time
  - Misspelling module says it is not a misspelling, PNR says it is a name -> name
  - Both negative -> neither a misspelling nor a name
  - What if both are positive? The one with the highest confidence measure wins
  - Confidence measure: per leaf, calculated from the training data as correct predictions / total # of predictions at that leaf (see the sketch below)
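A minimal sketch of the per-leaf confidence measure and the tie-breaker as defined on this slide; the counting interface is hypothetical:

```python
# Per-leaf confidence = correct predictions / total predictions at that
# leaf, tallied on the training data. Counts here are illustrative.
def leaf_confidence(correct: int, total: int) -> float:
    return correct / total if total else 0.0

def break_tie(name_conf: float, misspell_conf: float) -> str:
    """When both modules say yes, the higher-confidence prediction wins."""
    return "name" if name_conf >= misspell_conf else "misspelling"

print(break_tie(leaf_confidence(90, 100), leaf_confidence(60, 80)))  # name
```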
- Slide 22
- Evaluation - Dataset
  - 7000 cases of UnkWrds
  - 2.6 million word corpus
  - Live business news captions
  - 70.4% manually ID'd as names
  - 21.3% as misspellings
  - Rest: other types of UnkWrds
- Slide 23
- Dataset (cont.)
  - 70% of the dataset randomly selected as the training corpus
  - Remainder (2100 cases) for the test corpus
  - Test data: 10 samples, random selection with replacement (see the sketch below)
  - Total of 10 test datasets
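A sketch of the sampling scheme, assuming each of the 10 test sets is drawn from the 2100-case held-out corpus with replacement; making each sample the same size as the held-out corpus is an assumption, as the slides do not state the sample size:

```python
# Build 10 test datasets by sampling the held-out corpus with replacement.
import random

def bootstrap_samples(test_corpus, n_samples=10, seed=0):
    rng = random.Random(seed)
    size = len(test_corpus)  # assumed sample size; not given in the slides
    return [[rng.choice(test_corpus) for _ in range(size)]
            for _ in range(n_samples)]

held_out = [f"case_{i}" for i in range(2100)]  # hypothetical 30% test corpus
datasets = bootstrap_samples(held_out)
print(len(datasets), len(datasets[0]))  # 10 2100
```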
- Slide 24
- Evaluation - Training
  - Train a DT for the misspelling module on misspelling features only
  - Train a DT for the misspelling module on misspelling & name features
  - Train a DT for the name module on name features only
  - Train a DT for the name module on name & misspelling features
- Slide 25
- Misspelling DT Results - Table 3
  - Baseline: no recall
  - 1st decision tree: 73.8% recall
  - 2nd decision tree: an increase in precision, with a decrease in recall by a similar amount
  - Name features are not predictive for ID'ing misspellings in this domain
  - Not surprising: 8 of the 10 name features deal with information external to the word itself
- Slide 26
- Misspelling DT failures
  - 2 classes of omissions
  - Misidentifications: foreign words
- Slide 27
- Omission type 1
  - Words with the typical characteristics of English words
  - Differ from the intended word by the addition or deletion of a syllable: creditability for credibility, coordinatored for coordinated, representives for representatives
- Slide 28
- Omission type 2
  - Words differing from the intended word by the deletion of a blank: webpage, crewmembers, rainshower
- Slide 29
- Fixes
  - Fix for the 2nd type: a feature to specify whether the UnkWrd can be split into 2 known words (see the sketch below)
  - Fix for the 1st type is more difficult: a homophonic relationship; a phonetic distance feature
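A sketch of the proposed deletion-of-a-blank feature, assuming a set-based lexicon (a hypothetical stand-in):

```python
# Feature for type-2 omissions: can the unknown word be split into
# two known words? LEXICON is a hypothetical stand-in.
LEXICON = {"web", "page", "crew", "members", "rain", "shower"}

def splits_into_known_words(word, lexicon=LEXICON):
    """True if word = A + B for some known words A and B."""
    return any(word[:i] in lexicon and word[i:] in lexicon
               for i in range(1, len(word)))

print(splits_into_known_words("webpage"))        # True
print(splits_into_known_words("creditability"))  # False
```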
- Slide 30
- Name DT Results - Table 4
  - 1st tree: precision is a large improvement; recall is excellent
  - 2nd tree: increased recall & precision, unlike the 2nd misspelling DT - why?
- Slide 31
- Name DT failures
  - Not ID'd as names: names with determiners (the steelers, the pathfinder)
  - Adept at individual people and places; trouble with names having distributions similar to common nouns
- Slide 32
- Name DT failures (cont.)
  - Incorrectly ID'd as names: unusual character sequences (sxetion, fwlamg)
  - The misspelling identifier correctly IDs these as misspellings
  - The decision-making component needs to resolve these
- Slide 33
- Unknown Word Categorizer
  - Precision = # of correct misspelling or name categorizations / total number of times a word was identified as a misspelling or name
  - Recall = # of times the system correctly IDs a misspelling or name / # of misspellings and names existing in the data (see the sketch below)
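These definitions translate directly; a sketch with illustrative counts (the numbers below are not from the paper):

```python
# Precision and recall exactly as defined on the slide, over words the
# system labeled misspelling or name.
def precision(correct: int, predicted: int) -> float:
    """Correct misspelling/name categorizations / total such predictions."""
    return correct / predicted if predicted else 0.0

def recall(correct: int, actual: int) -> float:
    """Correct misspelling/name IDs / all such words in the data."""
    return correct / actual if actual else 0.0

# Illustrative counts: 1800 correct of 1900 predictions; 2000 true cases.
print(round(precision(1800, 1900), 3), round(recall(1800, 2000), 3))
```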
- Slide 34
- Confusion matrix of tie-breaker
  - Table 5: good results
  - 5% of cases needed the confidence measure
  - In the majority of cases the decision-maker rules in favor of the name prediction
- Slide 35
- Confusion matrix (cont.)
  - The name DT has better results, so it is likely to have higher confidence measures
  - UnkWrd ID'd as a name when it is a misspelling (37 cases)
  - Phonetic relation with the intended word: temt/tempt, floyda/Florida
- Slide 36
- Encouraging Results
  - A productive approach
  - Future focus:
    - Improve existing components: features sensitive to the distinction between names & misspellings
    - Develop components to ID the remaining types: abbreviations, morphological variants, etc.
    - An alternative decision-making process
- Slide 37
- Portability
  - Few linguistic resources required:
    - A corpus of the new domain (language)
    - Spelling suggestions (ispell is available for many languages)
    - A POS tagger
- Slide 38
- Possible portability problems
  - Edit distance assumes words consist of alphabetic characters that have undergone substitution/addition/deletion; less useful for Chinese and Japanese
  - The general approach is still transferable: consider the means by which misspellings differ from intended words, and identify features to capture those differences
- Slide 39
- Related Research
  - Assume all UnkWrds are misspellings
  - Rely on capitalization
  - Expectations from scripts: rely on world knowledge of the situation, e.g. naval ship-to-shore messages
- Slide 40
- Related Research (cont.)
  - (Baluja et al., 1999): a DT classifier to ID PNs in text
  - 3 features: word level, dictionary level, POS information
  - Highest F-score: 95.2%, slightly higher than the name module
- Slide 41
- But...
  - Different tasks: ID all words & phrases that are PNs vs. ID only those words which are UnkWrds
  - Different data: case information
  - If word-level features (case) are excluded: an F-score of 79.7%
- Slide 42
- Conclusion
  - An UnkWrd categorizer to ID misspellings & names
  - Individual components, each specializing in identifying a particular class of UnkWrd
  - 2 existing components use DTs
  - Encouraging results in a challenging domain (live CC transcripts)!