
Categorizing Unknown Words:
Using Decision Trees to Identify Names and Misspellings

Janine Toole
Simon Fraser University
Burnaby, BC, Canada

From ANLP-NAACL Proceedings, April 29-May 4, 2000 (pp. 173-179)

Goal: automatic categorization of unknown words

Unknown Words (UknWrds): word not contained in lexicon of NLP system
"unknown-ness" - property relative to NLP system

Motivation

Degraded system performance in presence of unknown words
Disproportionate effect possible
• Min (1996) - only 0.6% of words in 300 e-mails misspelled
• Result - 12% of the sentences contained an error (discussed in (Min and Wilson, 1998)).
Difficulties translating live closed captions (CC)
• 5 seconds to transcribe dialogue, no post-edit

Reasons for unknown words

Proper name
Misspelling
Abbreviation or number
Morphological variant

And my favorite...

Misspoken words
Examples (courtesy H. K. Longmore):
– *I'll phall you on the cone (call, phone)
– *I did a lot of hiking by mysummer this self (myself this summer)

What to do?

Identify class of unknown word
Take action based on goals of system and class of word
• Correct spelling
• Expand abbr.
• Convert number format

Overall System Architecture

Multiple components, one per category
Return confidence measure (Elworthy, 1998)
Evaluate results from each component to determine category
One reason for approach: take advantage of existing research

Simplified Version: Names & Spelling Errors

Decision tree architecture
• combine multiple types of evidence about word
Results combined using weighted voting procedure
Evaluation: Live CC data - replete with wide variety of UknWrds

Name Identifier

Proper names ==> proper name bucket
Others ==> discard
PN: person, place, concept, typically requiring Caps in English

Problems

CC is ALL CAPS!
No confidence measure with existing PN Recognizers
Perhaps future PNRs will work?

Solution

Build custom PNR

Decision Trees

Highly explainable - readily understand features affecting analysis
Well suited for combining a variety of info.
Don't grow tree from seed - use IBM's Intelligent Miner suite
Ignore DT algorithm - point is application of DT

Proper Names - Features

10 features specified per UknWrd
• POS and detailed POS of the UknWrd and of the 2 words on either side (5 positions x 2 tags = 10 features)
• Rule-based system for detailed tags
• in-house statistical parser for POS
Would include feature indicating presence of Initial Upper Case if data had it
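A minimal sketch of how the 10 context features might be assembled, assuming each token already carries a POS tag and a detailed tag. The tag values, the `NONE` padding symbol, and the function name `context_features` are illustrative assumptions, not taken from the original system.

```python
from typing import Dict, List, Tuple

# Each token is (word, pos, detailed_pos); the tag values used are hypothetical.
Token = Tuple[str, str, str]

PAD = "NONE"  # assumed placeholder for positions outside the sentence


def context_features(tokens: List[Token], index: int, window: int = 2) -> Dict[str, str]:
    """Collect POS and detailed POS for the target word and its +/- `window` neighbours."""
    features: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            _, pos, detailed = tokens[position]
        else:
            pos, detailed = PAD, PAD
        features[f"pos[{offset:+d}]"] = pos
        features[f"detailed_pos[{offset:+d}]"] = detailed
    return features


sentence = [
    ("SHARES", "N", "N-plural"),
    ("OF", "P", "P-of"),
    ("QUALCOMM", "UNK", "UNK"),
    ("ROSE", "V", "V-past"),
]
features = context_features(sentence, index=2)
```

With a window of 2, the target word plus four neighbours give five positions and two tags each, which matches the count of 10 features on the slide.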

Misspellings

Unintended, orthographically incorrect representation
Relative to NLP system
1 or more additions, deletions, substitutions, reversals, punctuation

Orthography

Word: orthography
or.thog.ra.phy \o.r-'tha:g-r*-fe-\ n 1a: the art of writing words with the proper letters according to standard usage 1b: the representation of the sounds of a language by written or printed symbols 2: a part of language study that deals with letters and spelling

Misspellings - Features

Derived from prior research (including own)
Abridged list of features used
• Corpus freq., word length, edit distance, Ispell info, char seq. freq., Non-Engl. chars

Misspellings Features (cont.)

Word length - (Agirre et al., 1998)
Predictions for correct spelling more accurate if |w| > 4

Edit distance
• 1 edit distance == 1 substitution, addition, deletion, reversal
• 80% of errors w/in 1 edit distance of intended word
• 70% w/in 1 edit distance of intended word

Unix spell checker: ispell
• edit distance = distance from UknWrd to closest ispell suggestion, or 30

Char. Seq. Freq.
• wful, rql, etc.
• composite of individual char. seq.
• relevance to 1 tree vs. many
• Non-English - Transmission noise in CC case, or Foreign names
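The edit distance feature can be sketched as below. This uses the optimal string alignment variant, which counts substitution, addition, deletion, and adjacent reversal as one edit each. Two assumptions: the suggestion list is passed in rather than obtained by calling ispell, and the value 30 is taken to be the fallback when no suggestion exists, which the slide implies but does not state.

```python
from typing import List

# Assumed reading of "or 30": the value used when there are no suggestions.
NO_SUGGESTION_DISTANCE = 30


def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # addition
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # reversal of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def edit_distance_feature(word: str, suggestions: List[str]) -> int:
    """Distance from the unknown word to its closest suggestion, or the fallback."""
    if not suggestions:
        return NO_SUGGESTION_DISTANCE
    return min(edit_distance(word, s) for s in suggestions)
```

For example, `edit_distance_feature("temt", ["tempt", "tent"])` gives 1, while a word with an empty suggestion list gets the fallback value.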

Decision Time

Misspelling module says not a misspelling, PNR says it's a name -> name
Both negative -> neither misspelling nor name
What if both are positive?
• One with highest confidence measure wins
• Confidence measure
– per leaf, calculated from training data
– correct predictions / total # of predictions at leaf
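The decision rules above can be sketched as follows. The `Prediction` type and the function names are hypothetical. The symmetric case (misspelling positive, name negative) and the handling of exactly equal confidences are not spelled out on the slide, so both are assumptions here.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    positive: bool     # did the component claim the word for its class?
    confidence: float  # confidence of the leaf that produced the prediction


def leaf_confidence(correct: int, total: int) -> float:
    """Correct predictions / total predictions at a leaf, measured on training data."""
    return correct / total if total else 0.0


def categorize(misspelling: Prediction, name: Prediction) -> str:
    """Combine the two component outputs into a single category."""
    if misspelling.positive and name.positive:
        # Tie-breaker: the higher confidence wins.
        # Assumption: on exactly equal confidence, fall back to "name".
        return "misspelling" if misspelling.confidence > name.confidence else "name"
    if misspelling.positive:
        return "misspelling"  # assumed mirror image of the name-only rule
    if name.positive:
        return "name"
    return "other"  # neither misspelling nor name
```

A leaf that made 45 correct predictions out of 50 on the training data would carry a confidence of 0.9, and every word routed to that leaf at test time would inherit it.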

Evaluation - Dataset

7000 cases of UknWrds
2.6 million word corpus
Live business news captions
70.4% manually ID'd as names
21.3% as misspellings
Rest - other types of UknWrds

Dataset (cont.)

70% of Dataset randomly selected as training corpus
Remainder (2100) for test corpus
Test data - 10 samples, random selection with replacement
Total of 10 test datasets
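A sketch of the split and resampling described above. The slide does not give the size of each resampled test set, so drawing as many cases as the test corpus contains is an assumption, as are the function names and the seeds.

```python
import random
from typing import List, Sequence, Tuple


def split_cases(cases: Sequence, train_fraction: float = 0.7, seed: int = 0) -> Tuple[List, List]:
    """Randomly split the cases into a training corpus and a test corpus."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def resample_test_sets(test: Sequence, n_sets: int = 10, seed: int = 1) -> List[List]:
    """Draw n_sets samples from the test corpus, each by selection with replacement."""
    rng = random.Random(seed)
    pool = list(test)
    # Assumption: each sample is the same size as the test corpus.
    return [rng.choices(pool, k=len(pool)) for _ in range(n_sets)]


cases = list(range(7000))  # stand-ins for the 7000 unknown-word cases
train, test = split_cases(cases)
test_sets = resample_test_sets(test)
```

Because selection is with replacement, a given case can appear several times in one test set and not at all in another.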

Evaluation - Training

Train a DT with misspelling module
Train a DT with misspelling & name module
Train a DT with name module
Train a DT with name & misspelling module

Misspelling DT Results - Table 3

baseline - no recall
1st decision tree - 73.8% recall
2nd decision tree - increase in precision, decrease in recall by similar amount
name features not predictive for ID'ing misspellings in this domain
not surprising - 8 of 10 features deal with information external to word itself

Misspelling DT failures

2 classes of omissions
Misidentifications
• Foreign words

Omission type 1

Words with typical characteristics of English words
Differ from intended word by addition or deletion of a syllable
• creditability for credibility
• coordinatored for coordinated
• representives for representatives

Omission type 2

Words differing from intended word by deletion of a blank
• webpage
• crewmembers
• rainshower

Fixes

Fix for 2nd type
• feature to specify whether UknWrd can be split into 2 known words
Fix for 1st type more difficult
• homophonic relationship
• phonetic distance feature
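The proposed fix for the second omission type could be sketched as a simple lexicon lookup. The function name, the tiny lexicon, and the minimum part length of 2 characters are illustrative assumptions.

```python
from typing import Iterable, Optional, Tuple


def split_into_known_words(
    word: str, lexicon: Iterable[str], min_part: int = 2
) -> Optional[Tuple[str, str]]:
    """Return the first split of `word` into two lexicon entries, or None."""
    known = set(lexicon)
    for cut in range(min_part, len(word) - min_part + 1):
        left, right = word[:cut], word[cut:]
        if left in known and right in known:
            return left, right
    return None


lexicon = {"web", "page", "crew", "members", "rain", "shower"}
can_split = split_into_known_words("webpage", lexicon) is not None
```

The boolean `can_split` is what would be handed to the decision tree as the new feature. The minimum part length keeps single letters such as "a" from producing spurious splits.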

Name DT Results - Table 4

1st tree
• precision is large improvement
• recall is excellent
2nd tree
• increased recall & precision
• unlike 2nd misspelling DT - why?

Name DT failures

Not ID'd as a name - Names with determiners
• the steelers, the pathfinder
Adept at individual people, places
• trouble with names having similar distributions to common nouns

Name DT failures (cont.)

Incorrectly ID'd as name
• Unusual character sequences: sxetion, fwlamg
Misspelling identifier correctly ID's as misspellings
Decision-making component needs to resolve these

Unknown Word Categorizer

Precision = # of correct misspelling or name categorizations / total number of times a word was identified as misspelling or name
Recall = # of times system correctly ID's misspelling or name / # of misspellings and names existing in data
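The two definitions above can be computed from (gold, predicted) label pairs as in this sketch. The label strings and the function name are assumptions for illustration; "other" stands for any unknown word that is neither a name nor a misspelling.

```python
from typing import List, Tuple

TARGETS = {"name", "misspelling"}


def precision_recall(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """pairs holds (gold label, predicted label) for each unknown word."""
    # Times a word was identified as misspelling or name.
    predicted = sum(1 for _, p in pairs if p in TARGETS)
    # Misspellings and names existing in the data.
    existing = sum(1 for g, _ in pairs if g in TARGETS)
    # Correct misspelling or name categorizations.
    correct = sum(1 for g, p in pairs if p in TARGETS and g == p)
    precision = correct / predicted if predicted else 0.0
    recall = correct / existing if existing else 0.0
    return precision, recall


pairs = [
    ("name", "name"),
    ("misspelling", "name"),
    ("misspelling", "misspelling"),
    ("other", "other"),
    ("name", "other"),
]
precision, recall = precision_recall(pairs)
```

In this toy input, 3 words are labelled as name or misspelling and 2 of those are right, while 4 names and misspellings exist in the data, so precision is 2/3 and recall is 2/4.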

Confusion matrix of tie-breaker

Table 5 - good results
5% of cases needed confidence measure
Majority of cases decision-maker rules in favor of name prediction

Confusion matrix (cont.)

Name DT has better results, likely to have higher confidence measures
UknWrd as Name when it is a misspelling (37 cases)
Phonetic relation with intended word - temt, tempt; floyda, Florida;

Encouraging Results

Productive approach
Future focus
• Improve existing components
– features sensitive to distinction between names & misspellings
• Develop components to ID remaining types
– abbr., morph variants, etc.
• Alternative decision-making process

Portability

Little required linguistic resources
• Corpus of new domain (language)
• Spelling suggestions
– ispell avail. for many languages
• POS tagger

Possible portability problems

Edit distance
• Words consist of alphabetic chars. having undergone subst/add/del
• Less useful for Chinese, Japanese
General approach still transferable
• consider means by which misspellings differ from intended words
• identify features to capture differences

Related Research

Assume all UknWrds are misspellings
Rely on capitalization
Expectations from scripts
• Rely on world knowledge of situation
– e.g. naval ship-to-shore messages

Related Research (cont.)

(Baluja et al., 1999)
DT classifier to ID PNs in text
3 features: word level, dictionary level, POS information
Highest F-score: 95.2%
• slightly higher than name module

But...

Different tasks
• ID all words & phrases that are PNs
• vs. ID only those words which are UknWrds
Different data - Case information
If word-level features (case) excluded
F-score of 79.7%
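For reference when comparing the F-scores quoted above: the slides do not say which weighting of the F-measure was used, so the balanced form, with equal weight on precision P and recall R, is assumed here.

```latex
F = \frac{2 \, P \, R}{P + R}
```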

Conclusion

UknWrd Categorizer to ID misspellings & names
Individual components, specializing in identifying a particular class of UknWrd
2 Existing components use DTs
Encouraging results in a challenging domain (live CC transcripts)!
