Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings

  • Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings

  • Janine Toole
    Simon Fraser University, Burnaby, BC, Canada
    From ANLP-NAACL Proceedings, April 29-May 4, 2000 (pp. 173-179)

  • Goal: automatic categorization of unknown words
    - Unknown Words (UknWrds): words not contained in the lexicon of an NLP system
    - "Unknown-ness" is a property relative to a particular NLP system

  • Motivation
    - Degraded system performance in the presence of unknown words
    - A disproportionate effect is possible: Min (1996) found that only 0.6% of the words in 300 e-mails were misspelled, yet 12% of the sentences contained an error (discussed in Min and Wilson, 1998)
    - Difficulties translating live closed captions (CC): 5 seconds to transcribe dialogue, no post-editing

  • Reasons for unknown words
    - Proper name
    - Misspelling
    - Abbreviation or number
    - Morphological variant

  • And my favorite... misspoken words
    - Examples (courtesy H. K. Longmore):
      *I'll phall you on the cone (call, phone)
      *I did a lot of hiking by mysummer this self (myself this summer)

  • What to do?
    - Identify the class of the unknown word
    - Take action based on the goals of the system and the class of the word: correct the spelling, expand the abbreviation, convert the number format, etc.

  • Overall System Architecture
    - Multiple components, one per category
    - Each component returns a confidence measure (Elworthy, 1998)
    - Evaluate the results from each component to determine the category
    - One reason for this approach: take advantage of existing research

  • Simplified Version: Names & Spelling Errors
    - Decision tree architecture combines multiple types of evidence about the word
    - Results combined using a weighted voting procedure
    - Evaluation: live CC data, replete with a wide variety of UknWrds

  • Name Identifier
    - Proper names ==> proper name bucket
    - Others ==> discard
    - PN: a person, place, or concept, typically requiring capitalization in English

  • Problems
    - CC is ALL CAPS!
    - No confidence measure with existing PN recognizers
    - Perhaps future PNRs will work?

  • Solution
    - Build a custom PNR

  • Decision Trees
    - Highly explainable: it is easy to understand which features affect the analysis
    - Well suited to combining a variety of information
    - Don't grow the tree from seed: use IBM's Intelligent Miner suite
    - Ignore the DT algorithm itself; the point is the application of DTs (see the sketch below)
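
As a rough, hypothetical sketch of what decision-tree classification of unknown words looks like (the paper used IBM's Intelligent Miner, not scikit-learn, and the feature values below are invented placeholders):

```python
# Illustrative only: the paper trained its trees with IBM's Intelligent Miner;
# scikit-learn is used here purely to show the shape of the task.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors for unknown words:
# [corpus frequency, word length, edit distance to closest spelling suggestion]
X_train = [
    [0, 4, 1],    # e.g. "temt": unseen, 4 chars, 1 edit from "tempt"
    [4, 8, 12],   # e.g. a proper name: seen a few times, far from any suggestion
]
y_train = ["misspelling", "name"]

tree = DecisionTreeClassifier().fit(X_train, y_train)
print(tree.predict([[0, 5, 1]]))   # -> ['misspelling']
```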

  • Proper Names - Features
    - 10 features specified per UknWrd: the POS and detailed POS of the UknWrd and of the 2 words on either side (see the sketch below)
    - Rule-based system for the detailed tags; in-house statistical parser for the POS tags
    - Would include a feature indicating the presence of initial upper case if the data had it
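
A minimal sketch of assembling those 10 contextual features, assuming the taggers are available as callables; pos_tag and detailed_tag are hypothetical stand-ins for the in-house statistical parser and the rule-based tagger:

```python
def name_features(tokens, i, pos_tag, detailed_tag, pad="NONE"):
    """POS and detailed POS for tokens[i-2 .. i+2]: 10 features in total.
    pos_tag/detailed_tag are placeholders for the taggers named in the paper."""
    feats = []
    for offset in range(-2, 3):
        j = i + offset
        if 0 <= j < len(tokens):
            feats.append(pos_tag(tokens[j]))
            feats.append(detailed_tag(tokens[j]))
        else:
            feats += [pad, pad]   # sentence-boundary padding (an assumption)
    return feats
```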

  • Misspellings
    - An unintended, orthographically incorrect representation (relative to the NLP system)
    - 1 or more additions, deletions, substitutions, reversals, or punctuation errors

  • Orthography
    - or.thog.ra.phy \o.r-'tha:g-r*-fe-\ n
      1a: the art of writing words with the proper letters according to standard usage
      1b: the representation of the sounds of a language by written or printed symbols
      2: a part of language study that deals with letters and spelling

  • Misspellings - Features
    - Derived from prior research (including the author's own)
    - Abridged list of features used: corpus frequency, word length, edit distance, ispell information, character sequence frequency, non-English characters

  • Misspellings Features (cont.)
    - Word length (Agirre et al., 1998): predictions for the correct spelling are more accurate if |w| > 4

  • Misspellings Features (cont.)
    - Edit distance: 1 edit = 1 substitution, addition, deletion, or reversal
    - 80% of errors are within 1 edit distance of the intended word
    - 70% within 1 edit distance of the intended word
    - Unix spell checker: ispell
    - Edit distance feature = distance from the UnkWrd to the closest ispell suggestion, or 30 (see the sketch below)
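
A sketch of the edit-distance feature as described on the slide: one edit is a substitution, addition, deletion, or reversal (i.e. Damerau-Levenshtein distance), and the feature value is the distance to the closest ispell suggestion, defaulting to 30 when there are no suggestions. The suggestion list itself is assumed to come from ispell:

```python
def edit_distance(a, b):
    """Edits are substitutions, additions, deletions, and reversals
    (adjacent transpositions), each counted as one."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # addition
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # reversal
    return d[len(a)][len(b)]


def edit_distance_feature(unknown_word, ispell_suggestions, default=30):
    """Distance from the UnkWrd to the closest ispell suggestion, or 30."""
    return min((edit_distance(unknown_word, s) for s in ispell_suggestions),
               default=default)


print(edit_distance("temt", "tempt"))      # 1 (one addition)
print(edit_distance("floyda", "florida"))  # 2
```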

  • Misspellings Features (cont.)
    - Character sequence frequency: wful, rql, etc.; a composite of the individual character sequences; relevance to 1 tree vs. many (see the sketch below)
    - Non-English characters: transmission noise in the CC case, or foreign names
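
One possible character-sequence-frequency composite, sketched under the assumption that it is built from corpus counts of character n-grams (the exact composite is not spelled out on the slide):

```python
from collections import Counter

def char_ngrams(word, n=3):
    padded = f"_{word}_"              # boundary markers (an assumption)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_char_counts(corpus_words, n=3):
    counts = Counter()
    for w in corpus_words:
        counts.update(char_ngrams(w.lower(), n))
    return counts

def char_seq_feature(word, counts, n=3):
    """Composite of the individual character-sequence frequencies; here the
    minimum n-gram count, so sequences like 'rql' drag the score to zero."""
    return min((counts[g] for g in char_ngrams(word.lower(), n)), default=0)
```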

  • Decision Time
    - Misspelling module says it's not a misspelling, PNR says it's a name -> name
    - Both negative -> neither a misspelling nor a name
    - What if both are positive? The one with the highest confidence measure wins
    - Confidence measure: per leaf, calculated from training data as correct predictions / total # of predictions at that leaf (see the sketch below)
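
A sketch of that decision step: agreement is taken at face value, and when both modules claim the word, the prediction whose leaf has the higher confidence (correct predictions / total predictions at that leaf, estimated on training data) wins:

```python
def categorize(is_misspelling, misspell_conf, is_name, name_conf):
    """Combine the misspelling module and the PNR.  Each confidence is the
    leaf-level ratio correct_predictions / total_predictions_at_leaf,
    computed from the training data."""
    if is_misspelling and not is_name:
        return "misspelling"
    if is_name and not is_misspelling:
        return "name"
    if not is_misspelling and not is_name:
        return "other"                 # neither a misspelling nor a name
    # Both modules are positive: the highest confidence wins.
    return "misspelling" if misspell_conf > name_conf else "name"
```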

  • Evaluation - Dataset
    - 7000 cases of UnkWrds from a 2.6 million word corpus of live business news captions
    - 70.4% manually identified as names, 21.3% as misspellings, the rest as other types of UnkWrds

  • Dataset (cont.)
    - 70% of the dataset randomly selected as the training corpus
    - Remainder (2100) for the test corpus
    - Test data: 10 samples drawn by random selection with replacement, for a total of 10 test datasets (see the sketch below)
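
A sketch of that evaluation split; the per-sample size of each bootstrap test set is an assumption, since the slides only say "10 samples, random selection with replacement":

```python
import random

def make_eval_splits(cases, train_frac=0.7, n_test_sets=10, seed=0):
    """70% random training selection; 10 test samples drawn from the
    remainder by random selection with replacement."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, held_out = shuffled[:cut], shuffled[cut:]
    # Per-sample size here (= held-out size) is an assumption.
    test_sets = [rng.choices(held_out, k=len(held_out))
                 for _ in range(n_test_sets)]
    return train, test_sets
```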

  • Evaluation - Training
    - Train a DT with the misspelling module's features
    - Train a DT with the misspelling & name modules' features
    - Train a DT with the name module's features
    - Train a DT with the name & misspelling modules' features

  • Misspelling DT Results - Table 3
    - Baseline: no recall
    - 1st decision tree: 73.8% recall
    - 2nd decision tree: increase in precision, decrease in recall by a similar amount
    - Name features are not predictive for identifying misspellings in this domain
    - Not surprising: 8 of the 10 name features deal with information external to the word itself

  • Misspelling DT failures
    - 2 classes of omissions
    - Misidentifications
    - Foreign words

  • Omission type 1
    - Words with the typical characteristics of English words
    - Differ from the intended word by the addition or deletion of a syllable:
      creditability for credibility
      coordinatored for coordinated
      representives for representatives

  • Omission type 2
    - Words differing from the intended word by the deletion of a blank:
      webpage, crewmembers, rainshower

  • Fixes
    - Fix for the 2nd type: a feature to specify whether the UnkWrd can be split into 2 known words (see the sketch below)
    - Fix for the 1st type is more difficult: homophonic relationship, phonetic distance feature
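
A sketch of the proposed split-into-two-known-words feature, assuming a simple lexicon lookup:

```python
def splits_into_known_words(word, lexicon):
    """True if the unknown word is two known words run together,
    e.g. 'rainshower' -> 'rain' + 'shower'."""
    return any(word[:i] in lexicon and word[i:] in lexicon
               for i in range(1, len(word)))

print(splits_into_known_words("rainshower", {"rain", "shower", "web", "page"}))  # True
```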

  • Name DT Results - Table 4
    - 1st tree: precision is a large improvement, recall is excellent
    - 2nd tree: increased recall & precision, unlike the 2nd misspelling DT - why?

  • Name DT failures
    - Not identified as names: names with determiners (the steelers, the pathfinder)
    - Adept at individual people and places; trouble with names having distributions similar to common nouns

  • Name DT failures (cont.)
    - Incorrectly identified as names: unusual character sequences such as sxetion, fwlamg
    - The misspelling identifier correctly identifies these as misspellings; the decision-making component needs to resolve them

  • Unknown Word Categorizer
    - Precision = # of correct misspelling or name categorizations / total # of times a word was identified as a misspelling or name
    - Recall = # of times the system correctly identifies a misspelling or name / # of misspellings and names existing in the data
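
The same definitions in a trivially small helper, just to pin the arithmetic down:

```python
def precision_recall(correct, labelled, existing):
    """correct  = correct misspelling/name categorizations
       labelled = times a word was identified as a misspelling or name
       existing = misspellings and names present in the data"""
    return correct / labelled, correct / existing
```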

  • Confusion matrix of tie-breaker
    - Table 5: good results
    - 5% of cases needed the confidence measure
    - In the majority of cases the decision-maker rules in favor of the name prediction

  • Confusion matrix (cont.)
    - The name DT has better results and is likely to have higher confidence measures
    - UknWrd labelled as a name when it is a misspelling (37 cases): phonetic relation with the intended word - temt, tempt; floyda, Florida

  • Encouraging Results
    - A productive approach
    - Future focus: improve the existing components (features sensitive to the distinction between names & misspellings), develop components to identify the remaining types (abbreviations, morphological variants, etc.), explore an alternative decision-making process

  • Portability
    - Few linguistic resources required: a corpus of the new domain (language), spelling suggestions (ispell is available for many languages), a POS tagger

  • Possible portability problems
    - Edit distance assumes words consist of alphabetic characters that have undergone substitutions/additions/deletions; less useful for Chinese or Japanese
    - The general approach is still transferable: consider the means by which misspellings differ from intended words and identify features to capture those differences

  • Related Research
    - Assume all UknWrds are misspellings
    - Rely on capitalization
    - Expectations from scripts
    - Rely on world knowledge of the situation, e.g. naval ship-to-shore messages

  • Related Research (cont.)
    - (Baluja et al., 1999): a DT classifier to identify PNs in text
    - 3 feature types: word level, dictionary level, POS information
    - Highest F-score: 95.2%, slightly higher than the name module

  • But...
    - Different tasks: identify all words & phrases that are PNs vs. identify only those words which are UknWrds
    - Different data: case information; with word-level features (case) excluded, the F-score drops to 79.7%

  • Conclusion
    - An UknWrd categorizer to identify misspellings & names
    - Individual components, each specializing in identifying a particular class of UknWrd
    - The 2 existing components use DTs
    - Encouraging results in a challenging domain (live CC transcripts)!