Bootstrapping a Classifier Using the Yarowsky Algorithm
Anoop Sarkar
Simon Fraser University, Vancouver, Canada
On sabbatical at U. of Edinburgh (Informatics 4.18b)
natlang.cs.sfu.ca
October 2, 2009
Acknowledgements
• This is joint work with my students Gholamreza Haffari (Ph.D.) and Max Whitney (B.Sc.) at SFU.
• Thanks to Michael Collins for providing the named-entity dataset and answering our questions.
• Thanks to Damianos Karakos and Jason Eisner for providing the word sense dataset and answering our questions.
Bootstrapping

Self-Training
1. A base model is trained with a small (or large) amount of labeled data.
2. The base model is then used to classify the unlabeled data.
3. Only the most confident unlabeled points, along with the predicted labels, are incorporated into the labeled training set (pseudo-labeled data).
4. The base model is re-trained, and the process is repeated.
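The four steps above can be sketched as a short loop. This is a minimal illustration, not the implementation used in the talk: the function names (`self_train`, `fit_centroids`, `predict_centroid`), the toy one-dimensional centroid learner, its confidence score, and the threshold value are all assumptions made for the example. The toy learner also assumes exactly two classes.

```python
def self_train(labeled, unlabeled, fit, predict, threshold=0.8, max_iter=10):
    """labeled: list of (x, y); unlabeled: list of x.
    fit(pairs) -> model; predict(model, x) -> (label, confidence)."""
    pseudo = {}                       # index into unlabeled -> pseudo-label
    model = fit(labeled)              # step 1: train the base model
    for _ in range(max_iter):
        new_pseudo = {}
        for i, x in enumerate(unlabeled):   # step 2: classify unlabeled data
            label, conf = predict(model, x)
            if conf >= threshold:           # step 3: keep confident points only
                new_pseudo[i] = label       # (otherwise the model abstains)
        if new_pseudo == pseudo:            # nothing changed: stop
            break
        pseudo = new_pseudo
        # step 4: re-train on labeled plus pseudo-labeled data
        model = fit(labeled + [(unlabeled[i], y) for i, y in pseudo.items()])
    return model, pseudo


def fit_centroids(pairs):
    """Toy base learner (hypothetical): one centroid per class on 1-D inputs."""
    sums, counts = {}, {}
    for x, y in pairs:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_centroid(model, x):
    """Nearest centroid; confidence is the far distance over the sum of both."""
    dists = sorted((abs(x - c), y) for y, c in model.items())
    (d1, y1), (d2, _) = dists[0], dists[1]
    conf = 1.0 if d1 + d2 == 0 else d2 / (d1 + d2)
    return y1, conf


model, pseudo = self_train([(0.0, "a"), (10.0, "b")], [1.0, 9.0, 5.0],
                           fit_centroids, predict_centroid)
```

In this toy run the points 1.0 and 9.0 receive pseudo-labels, while the ambiguous point 5.0 never exceeds the threshold and stays unlabeled.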
Self-Training
• It can be applied to any base learning algorithm: it only needs confidence weights for its predictions.
• Differences with EM:
  – Self-training only uses the mode of the prediction distribution.
  – Unlike hard-EM, it can abstain: "I do not know the label".
• Differences with Co-training:
  – In co-training there are two views, in each of which a model is learned.
  – The model in one view trains the model in the other view by providing pseudo-labeled examples.
Bootstrapping
• Start with a few seed rules (typically high precision, low recall). Build the initial classifier.
• Use the classifier to label unlabeled data.
• Extract new rules from the pseudo-labeled data and build the classifier for the next iteration.
• Exit if the labels for the unlabeled data are unchanged. Else, apply the classifier to the unlabeled data and continue.
Decision List (DL)
• A Decision List is an ordered set of rules.
  – Given an instance x, the first applicable rule determines the class label.
• Instead of ordering the rules, we can give weights to them.
  – Among all rules applicable to an instance x, apply the rule which has the highest weight.
• The parameters are the weights, which specify the ordering of the rules.
• Rules: if x has feature f → class k, with weight $\theta_{f,k}$ (the parameters)
DL for Word Sense Disambiguation (Yarowsky 1995)
• WSD: specify the most appropriate sense (meaning) of a word in a given sentence.
• Consider these two sentences:
  – "… company said the plant is still operating." → factory sense (+); features (company, operating)
  – "… and divide life into plant and animal kingdom." → living organism sense (−); features (life, animal)
• The decision list, sorted by confidence weight:
  – If company → +1, confidence weight .97
  – If life → −1, confidence weight .96
  – …
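A weighted decision list of this kind can be sketched in a few lines. The two rules and their weights are taken from the slide; the function name `dl_classify`, the dictionary layout, and the optional threshold argument are assumptions made for illustration.

```python
def dl_classify(rules, features, threshold=0.0):
    """rules: {(feature, label): weight}. Returns the label of the
    highest-weighted rule whose feature is present, or None (abstain)."""
    best = None
    for (f, k), w in rules.items():
        if f in features and w >= threshold and (best is None or w > best[0]):
            best = (w, k)
    return best[1] if best else None


# The two rules shown on the slide for the two senses of "plant".
rules = {("company", "+"): 0.97, ("life", "-"): 0.96}

sense_a = dl_classify(rules, {"company", "operating"})
sense_b = dl_classify(rules, {"life", "animal"})
sense_c = dl_classify(rules, {"tree"})        # no applicable rule
```

Returning `None` when no rule applies models the ability to abstain rather than forcing a label.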
Example: disambiguate 2 senses of sentence
• Seed rules:
  – If context contains served, label +1, conf = 1.0
  – If context contains reads, label −1, conf = 1.0
• The seed rules label 8 out of 303 unlabeled examples
• These 8 pseudo-labeled examples provide 6 rules above the 0.95 threshold (including the original seed rules), e.g.
  – If context contains read, label −1, conf = 0.953
• These 6 rules label 151 out of 303 unlabeled examples
• These 151 pseudo-labeled examples provide 60 rules above the threshold, e.g.
  – If context contains prison, label +1, conf = 0.989
  – If prev word is life, label +1, conf = 0.986
  – If prev word is his, label +1, conf = 0.983
  – If next word is from, label −1, conf = 0.982
  – If context contains relevant, label −1, conf = 0.953
  – If context contains page, label −1, conf = 0.953
• After 5 iterations, 297/303 unlabeled examples are permanently labeled (no changes possible)
• Building the final classifier gives 67% accuracy on a test set of 515 sentences. With some "tricks" we can get 76% accuracy.
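The rule-extraction step in this example can be sketched as follows. The slides do not state how the confidence of a rule is computed, so the smoothed precision estimate below (with smoothing constant `eps`) is an assumption, and it is not claimed to reproduce the confidence values listed above. The function name `extract_rules` and the toy data are hypothetical.

```python
from collections import Counter


def extract_rules(pseudo_labeled, labels, threshold=0.95, eps=0.1):
    """pseudo_labeled: list of (feature_set, label) pairs.
    Returns {(feature, label): score} for rules scoring above threshold.
    Score is a smoothed precision estimate (an assumed choice):
        (count(f, k) + eps) / (count(f) + K * eps)."""
    pair_counts, feat_counts = Counter(), Counter()
    for feats, y in pseudo_labeled:
        for f in feats:
            pair_counts[(f, y)] += 1
            feat_counts[f] += 1
    rules = {}
    for (f, y), c in pair_counts.items():
        score = (c + eps) / (feat_counts[f] + len(labels) * eps)
        if score > threshold:
            rules[(f, y)] = score
    return rules


# Toy data: "prison" co-occurs 30 times with label +1, "page" once with -1.
data = [({"prison"}, +1)] * 30 + [({"page"}, -1)]
new_rules = extract_rules(data, labels=[+1, -1])
```

With this scoring, a feature seen only once cannot pass a 0.95 threshold, which is one way a threshold keeps rare, unreliable features out of the decision list.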
Brief History of Bootstrapping
• (Yarowsky 1995) used it with a Decision List base classifier for the Word Sense Disambiguation (WSD) task.
  – It achieved the same performance level as the supervised algorithm, using only a few seed examples as labeled training data.
• (Collins & Singer 1999) used it for the Named Entity Classification task with a Decision List base classifier.
  – Using only 7 initial rules, it achieved 91% accuracy.
  – It achieved the same performance level as Co-training (no need for 2 views).
• (Abney, ACL 2002), in a paper about co-training, contrasts it with the Yarowsky algorithm. This initial analysis was abandoned later.
Brief History of Bootstrapping (continued)
• (Abney, CL 2004) provided a new analysis of the Yarowsky algorithm.
  – It could not mathematically analyze the original Yarowsky algorithm, but introduced new variants (we will see them later).
• (Haffari & Sarkar, UAI 2007) advanced Abney's analysis and gave a general framework that showed how the variants of the Yarowsky algorithm introduced by Abney are related to other semi-supervised learning (SSL) methods.
• (Eisner and Karakos 2005) examined the construction of seed rules for bootstrapping.
Analysis of the Yarowsky Algorithm

Original Yarowsky Algorithm (Yarowsky 1995)
• The Yarowsky algorithm is a bootstrapping algorithm with a Decision List base classifier.
• The predicted label is k* if the confidence of the applied rule is above some threshold η.
• An instance may become unlabeled in future iterations.
Modified Yarowsky Algorithm (Abney 2004)
• Instead of the feature with the max score, we use the sum of the scores of all features active for an example to be labeled.
• The predicted label is k* if the confidence of the applied rule is above the threshold 1/K.
  – K is the number of labels.
• An instance must stay labeled once it becomes labeled, but the label may change.
• These are the conditions in all the algorithms we will analyze in the rest of the talk.
  – Analyzing the original Yarowsky algorithm is still an open question.
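The labeling rule of the modified algorithm can be sketched as below. The slide speaks of the sum of the scores; dividing that sum by the number of active features, so that the result can be compared with 1/K, is an assumption of this sketch, as are the function name `label_by_sum` and the toy parameter values.

```python
def label_by_sum(theta, features, labels):
    """theta: {feature: {label: score}}, each theta[f] a distribution over labels.
    Combines the scores of all active features and returns the best label k*,
    or None if its combined score does not exceed 1/K."""
    active = [f for f in features if f in theta]
    if not active:
        return None
    # Sum of scores, normalized by the number of active features (assumed).
    avg = {k: sum(theta[f].get(k, 0.0) for f in active) / len(active)
           for k in labels}
    best = max(labels, key=lambda k: avg[k])
    return best if avg[best] > 1.0 / len(labels) else None


labels = ["+", "-"]
theta = {
    "company": {"+": 0.7, "-": 0.3},
    "life": {"+": 0.2, "-": 0.8},
    "the": {"+": 0.5, "-": 0.5},     # uninformative feature
}

k1 = label_by_sum(theta, {"company"}, labels)
k2 = label_by_sum(theta, {"company", "life"}, labels)
k3 = label_by_sum(theta, {"the"}, labels)
```

An example whose only active feature is uninformative gets a uniform combined score, which does not exceed 1/K, so it stays unlabeled.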
Bipartite Graph Representation (Corduneanu 2006, Haffari & Sarkar 2007)
• We propose to view bootstrapping as propagating the labels of initially labeled nodes to the rest of the graph nodes.
[Figure: bipartite graph linking feature nodes F (company, operating, life, animal, …) to instance nodes X: "company said the plant is still operating" (labeled +1), "divide life into plant and animal kingdom" (labeled −1), and further unlabeled instances.]
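Self-Training on the Graph (Haffari & Sarkar 2007)
[Figure: bipartite graph of feature nodes F and instance nodes X. Each feature node f carries a labeling distribution $\theta_f$ over {+, −}; each instance node x carries a labeling distribution $q_x$ and a distribution $\pi_x$. Example values shown over (+, −): (.7, .3) for $\theta_f$, (.6, .4), and a $q_x$ with all mass (1) on one label.]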
Goals of the Analysis
• To find reasonable objective functions for the self-training algorithms on the bipartite graph.
• The objective functions may shed light on the empirical success of different DL-based self-training algorithms.
• They can tell us which properties of the data are well exploited and captured by the algorithms.
• They are also useful in proving the convergence of the algorithms.
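Objective Function
• KL-divergence is a measure of distance between two probability distributions: $KL(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$
• Entropy H is a measure of randomness in a distribution: $H(p) = -\sum_i p_i \log p_i$
• The objective function:
[Figure: bipartite graph between features F and instances X]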
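The two quantities named on this slide can be computed directly from their standard definitions. The sketch below is only an illustration: the function names and example distributions are chosen here, and `kl` assumes that q is non-zero wherever p is non-zero.

```python
import math


def kl(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """H(p) = -sum_i p_i log p_i; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


p, q = (0.2, 0.8), (0.4, 0.6)
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# KL is zero only when the distributions agree; entropy is largest for the
# uniform distribution and zero for a distribution with all mass on one label.
values = (kl(p, q), kl(p, p), entropy((0.5, 0.5)), entropy((1.0, 0.0)))
```

A useful identity when reading objectives built from these terms is that KL(p || q) + H(p) equals the cross-entropy of q under p, which the last line of the sketch lets you check numerically.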
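The Bregman Distance
• Given a strictly convex function ψ, the Bregman distance $B_\psi$ between two probability distributions α and β is defined as: $B_\psi(\alpha, \beta) = \sum_i \left[ \psi(\alpha_i) - \psi(\beta_i) - \psi'(\beta_i)(\alpha_i - \beta_i) \right]$
• Examples:
  – If $\psi(t) = t \log t$, then $B_\psi(\alpha, \beta) = KL(\alpha, \beta)$
  – If $\psi(t) = t^2$, then $B_\psi(\alpha, \beta) = \sum_i (\alpha_i - \beta_i)^2$
[Figure: plot of ψ(t) against t, marking $\alpha_i$, $\beta_i$ and $\psi(\alpha_i)$]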
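The two examples can be checked numerically with the standard Bregman distance, $B_\psi(\alpha, \beta) = \sum_i [\psi(\alpha_i) - \psi(\beta_i) - \psi'(\beta_i)(\alpha_i - \beta_i)]$. In this sketch the function names and the two example distributions are chosen for illustration, and the derivative of ψ is passed in explicitly rather than computed.

```python
import math


def bregman(psi, dpsi, alpha, beta):
    """Bregman distance for a convex psi with derivative dpsi."""
    return sum(psi(a) - psi(b) - dpsi(b) * (a - b) for a, b in zip(alpha, beta))


def kl(alpha, beta):
    """KL divergence for strictly positive distributions."""
    return sum(a * math.log(a / b) for a, b in zip(alpha, beta))


alpha, beta = (0.2, 0.8), (0.4, 0.6)

# psi(t) = t log t, psi'(t) = log t + 1: recovers the KL divergence.
b_kl = bregman(lambda t: t * math.log(t), lambda t: math.log(t) + 1, alpha, beta)

# psi(t) = t^2, psi'(t) = 2t: recovers the squared Euclidean distance.
b_sq = bregman(lambda t: t * t, lambda t: 2 * t, alpha, beta)
squared_distance = sum((a - b) ** 2 for a, b in zip(alpha, beta))
```

The KL case works because both arguments sum to one, so the linear terms left over from the derivative cancel.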
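Generalizing the Objective Function
• The ψ-entropy $H_\psi$ is defined as: $H_\psi(\alpha) = -\sum_i \psi(\alpha_i)$
• The generalized objective function:
[Figure: bipartite graph between features F and instances X]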
Optimizing the Objective Functions
• In what follows, we mention some specific objective functions together with their optimization algorithms.
• These optimization algorithms correspond to some variants of the modified Yarowsky algorithm.
• It is not easy to come up with algorithms for directly optimizing the generalized objective functions.
Useful Operations
• Average: takes the average distribution of the neighbors, e.g. neighbors (.2, .8) and (.4, .6) give (.3, .7)
• Majority: takes the majority label of the neighbors, e.g. neighbors (.2, .8) and (.4, .6) give (0, 1)
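Both operations are easy to state in code. This is a small sketch using the neighbor distributions shown on the slide; the function names are chosen here, and the tie-breaking behavior of `majority` is arbitrary because the slide does not specify it.

```python
def average(dists):
    """Component-wise mean of the neighbors' label distributions."""
    n = len(dists)
    return tuple(sum(d[k] for d in dists) / n for k in range(len(dists[0])))


def majority(dists):
    """Each neighbor votes for its most probable label; the result puts all
    probability mass on the label with the most votes (ties: arbitrary)."""
    votes = [max(range(len(d)), key=d.__getitem__) for d in dists]
    winner = max(set(votes), key=votes.count)
    return tuple(1.0 if k == winner else 0.0 for k in range(len(dists[0])))


neighbors = [(0.2, 0.8), (0.4, 0.6)]
avg = average(neighbors)     # approximately (.3, .7), up to float rounding
maj = majority(neighbors)    # all mass on the second label
```

Average keeps the uncertainty of the neighbors, while Majority commits to a single label, which mirrors the difference between soft and hard label propagation.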
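Analyzing Self-Training
Theorem. The following objective functions are optimized by the corresponding label propagation algorithms on the bipartite graph:
[Figure: bipartite graph between features F and instances X, with the objective functions and their algorithms]
• Converges in polynomial time, $O(|F|^2 |X|^2)$
• Related to graph-based semi-supervised learning (Zhu et al. 2003)
• Abney's variant of the Yarowsky algorithm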
What about Log-Likelihood?
• Initially, the labeling distribution is uniform for unlabeled vertices and a δ-like distribution for labeled vertices.
• By learning the parameters, we would like to reduce the uncertainty in the labeling distribution while respecting the labeled data:
  – Negative log-likelihood of the old and newly labeled data
Lemma. If m is the number of features connected to an instance, then:
Connection between the two Analyses
$\sum_{x \in L} KL(q_x \,\|\, \pi_x) + \gamma \sum_{x \in U} H(\pi_x)$
Compare with Conditional Entropy Regularization (Grandvalet and Bengio 2005)!
Experiments

Named Entity Classification (Collins and Singer, 1999)
• 971,476 sentences from the NYT were parsed with the Collins parser
• The task is to identify three types of named entities:
  1. Location (LOC)
  2. Person (PER)
  3. Organization (ORG)
  −1. not a NE or "don't know"
Named Entity Classification (Collins and Singer, 1999)
• Noun phrases were extracted that met the following conditions:
  1. The NP contained only words tagged as proper nouns
  2. The NP appeared in one of the following two syntactic contexts:
     – Modified by an appositive whose head is a singular noun
     – In a prepositional phrase modifying an NP whose head is a singular noun
[Figure: parse tree of the NP "International Business Machines", tagged NNP NNP NNPS]
• …, says [NE Maury Cooper], a vice [CONTEXT president] at S.&P.
• …, fraud related to work on a federally funded sewage [CONTEXT plant in] [NE Georgia]
Named Entity Classification (Collins and Singer, 1999)
• The task: classify NPs into LOC, PER, ORG
• 89,305 training examples with 68,475 distinct feature types
  – 88,962 were used in the CS99 experiments
• 1000 test examples (including NPs that are not LOC, PER or ORG)
• Month names are easily identifiable as not named entities: removing them leaves 962 examples
• There are still 85 NPs that are not LOC, PER or ORG
• Clean accuracy is measured over 877 examples; noisy accuracy over 962
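The relation between the two test-set sizes is simple arithmetic: 962 − 85 = 877. The sketch below also converts a clean accuracy into a noisy one under the assumption that the 85 NPs outside LOC, PER and ORG always count as errors in the noisy measure. That assumption and the function name `noisy_from_clean` are not stated on the slide.

```python
NOISY_N = 962          # test examples left after removing month names
NON_ENTITIES = 85      # NPs that are not LOC, PER or ORG
CLEAN_N = NOISY_N - NON_ENTITIES   # 877 examples with a valid label


def noisy_from_clean(clean_acc):
    """Convert clean accuracy (%) to noisy accuracy (%), assuming every
    non-entity NP is counted as an error in the noisy measure."""
    return clean_acc * CLEAN_N / NOISY_N


noisy_91 = noisy_from_clean(91.0)     # about 83
noisy_831 = noisy_from_clean(83.1)    # about 75.8
```

Under this reading, a clean accuracy of 91% corresponds to a noisy accuracy of about 83%, which matches the pairing of clean and noisy figures in the results reported for this task.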
Yarowsky Variants (Abney 2004, Collins and Singer, 1999)
• A trick from the Co-training paper (Blum and Mitchell 1998) is to be cautious: don't add all rules above the 0.95 threshold
• Add only n rules per label (say 5) and increase this amount by n in each iteration
• This changes the dynamics of learning in the algorithm but not the objective function
• Two variants: Yarowsky (basic), Yarowsky (cautious)
• Without a threshold: Yarowsky (no threshold)
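The cautious selection step can be sketched as follows. The per-label limit of n rules growing by n each iteration follows the slide; the function name `cautious_select`, counting iterations from 1, and the toy rule scores are assumptions of this example.

```python
def cautious_select(scored_rules, iteration, n=5, threshold=0.95):
    """scored_rules: {(feature, label): score}. For each label, keep only the
    n * iteration best rules among those scoring above the threshold."""
    limit = n * iteration
    by_label = {}
    for (f, k), s in scored_rules.items():
        if s > threshold:
            by_label.setdefault(k, []).append((s, f))
    selected = {}
    for k, candidates in by_label.items():
        for s, f in sorted(candidates, reverse=True)[:limit]:
            selected[(f, k)] = s
    return selected


# Toy scores: 12 LOC rules above threshold, 2 PER rules above, 1 PER below.
scored = {(f"loc{i}", "LOC"): 0.96 + i * 0.001 for i in range(12)}
scored.update({("per0", "PER"): 0.99, ("per1", "PER"): 0.97, ("per2", "PER"): 0.5})

first = cautious_select(scored, iteration=1)    # at most 5 rules per label
second = cautious_select(scored, iteration=2)   # at most 10 rules per label
```

Applying the limit per label rather than globally keeps a label with many high-scoring rules from crowding out the others early on.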
Results
Learning Algorithm             Accuracy (Clean)   Accuracy (Noisy)
Baseline (all organization)    45.8               41.8
EM                             83.1               75.8
Yarowsky (basic)               80.7               73.5
Yarowsky (no threshold)        80.3               73.2
Yarowsky (cautious)            91                 83
DL-CoTrain                     91                 83
[Figure: Number of Rules (basic). Number of rules (axis 0 to 5000) against iteration (1 to 9).]
[Figure: Number of Rules (cautious). Number of rules (axis 0 to 7000) against iteration (0 to 450).]
[Figure: Coverage (basic). Coverage (axis 0 to 0.8) against iteration (1 to 9).]
[Figure: Coverage (cautious). Coverage (axis 0 to 0.8) against iteration (0 to 450).]
[Figure: Accuracy (basic). Accuracy (axis 0 to 0.8) against iteration (1 to 10).]
[Figure: Accuracy (cautious). Accuracy (axis 0 to 0.9) against iteration (0 to 450).]
[Figure: Precision-Recall (basic). Precision (axis 0.82 to 0.94) against recall (0.5 to 0.85).]
[Figure: Precision-Recall (cautious). Precision (axis 0.91 to 0.97) against recall (0.2 to 1).]
Seeds (Eisner and Karakos 2005, Zagibalov and Carroll 2008)
• Selecting seed rules: what is a good strategy?
  – Frequency: sort by frequency of feature occurrence
  – Contexts: sort by number of other features a feature was observed with
  – Weighted: sort by a weighted count of other features observed with the feature. $\mathrm{Weight}(f) = \mathrm{count}(f) / \sum_{f'} \mathrm{count}(f')$
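The first two orderings can be sketched over unlabeled examples represented as feature sets. The Weighted ordering is left out because the slide does not fully specify how the weights are combined. The function name `rank_seed_candidates`, the reading of "Contexts" as the number of distinct co-occurring features, and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter, defaultdict


def rank_seed_candidates(examples):
    """examples: list of feature sets. Returns two candidate orderings:
    by feature frequency, and by number of distinct co-occurring features."""
    freq = Counter()
    cooc = defaultdict(set)
    for feats in examples:
        for f in feats:
            freq[f] += 1
            cooc[f].update(g for g in feats if g != f)
    by_frequency = sorted(freq, key=lambda f: (-freq[f], f))
    by_contexts = sorted(freq, key=lambda f: (-len(cooc[f]), f))
    return by_frequency, by_contexts


examples = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"d", "e", "f", "g"}]
by_frequency, by_contexts = rank_seed_candidates(examples)
```

The two orderings can disagree: here the most frequent feature is not the one seen with the most distinct neighbors. Either list would still be inspected by hand and the chosen seeds assigned labels manually.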
Seeds
• In each case the frequencies were taken from the unlabeled training data
• Seeds were extracted from the sorted list of features by manual inspection and assigned a label (the entire example was used)
• Location (LOC) features appear infrequently in all three orderings
• It is possible that some good LOC seeds were missed
Seeds
Number of rules          Frequency        Contexts         Weighted
(n/3 rules per label)    Clean   Noisy    Clean   Noisy    Clean   Noisy
3                        84      77       84      77       88      80
9                        91      83       90      82       82      74
15                       91      83       91      83       85      77
7 (CS99)                 91      83
Word Sense Disambiguation
• Data from (Eisner and Karakos 2005)
• Disambiguate two senses each for drug, duty, land, language, position, sentence (Gale et al. 1992)
• Source of unlabeled data: 14M word Canadian Hansards (English only)
• Two seed rules for each disambiguation task from (Eisner and Karakos 2005)
Results
Learning Algorithm         drug               land             sentence
Seeds                      alcohol, medical   acres, courts    served, reads
Train / Test size          134 / 386          1604 / 1488      303 / 515
Yarowsky (basic)           53.3               79.3             67.7
Yarowsky (no threshold)    52                 79               64.8
Yarowsky (cautious)        55.9               79               76.1
DL-CoTrain                 53.1               77.7             75.9
(DL-CoTrain uses 2 views: long distance vs. immediate context)
Self-training for Machine Translation

Self-Training for SMT
[Figure: self-training loop for SMT. Train the model $M_{FE}$ on bilingual text (F, E); decode monolingual text F to obtain translated text (F, E); select high quality sentence pairs; re-train the log-linear SMT model.]
Selecting Sentence Pairs
• First give scores:
  – Use the decoder's normalized score
  – Confidence estimation method (Ueffing & Ney 2007)
• Then select based on the scores:
  – Importance sampling
  – Those whose score is above a threshold
  – Keep all sentence pairs
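Two of the selection strategies can be sketched as below. The slides do not give the sampling details, so drawing with replacement with probability proportional to the score is only one reading of "importance sampling". The function names, the fixed random seed, and the placeholder pairs are assumptions.

```python
import random


def select_threshold(pairs, scores, tau):
    """Keep the sentence pairs whose score is at least tau."""
    return [p for p, s in zip(pairs, scores) if s >= tau]


def select_importance(pairs, scores, k, rng=None):
    """Sample k pairs with replacement, with probability proportional to
    their score (an assumed reading of importance sampling)."""
    rng = rng or random.Random(0)
    return rng.choices(pairs, weights=scores, k=k)


# Placeholder (source, translation) pairs with hypothetical scores.
pairs = [("f1", "e1"), ("f2", "e2"), ("f3", "e3")]
scores = [0.2, 0.9, 0.5]

kept = select_threshold(pairs, scores, tau=0.5)
sampled = select_importance(pairs, scores, k=4)
```

Thresholding discards low-scoring pairs outright, whereas sampling still lets them through occasionally, which changes how much noisy data reaches re-training.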
Re-training the SMT Model
• Use the new sentence pairs to train an additional phrase table and use it as a new feature function in the SMT log-linear model
  – Phrase Table 1: trained on sentences for which we have the true translations
  – Phrase Table 2: trained on sentences with their generated translations
Chinese to English (Transductive)
Selection              Scoring        BLEU%
Baseline                              31.8 ± .7
Keep all                              33.1
Importance sampling    Norm. score    33.5
                       Confidence     33.2
Threshold              Norm. score    33.5
                       Confidence     33.5
Bold: best result; italic: significantly better
• NIST Eval-2004: train = 8.2M, test = 1788 (4 refs)
• Train: news, magazines, laws + UN
• Test: newswire, editorials, political speeches
• We use Portage from NRC as the underlying SMT system (Ueffing et al., 2007)
Chinese to English (Inductive)
System                        BLEU%
Baseline                      31.8 ± .7
Add Chinese data, Iter 1      32.8
Add Chinese data, Iter 4      32.6
Add Chinese data, Iter 10     32.5
Bold: best result; italic: significantly better
Using importance sampling:
Test genre    Before   After
editorials    30.7     31.3
newswire      30.0     31.1
speeches      36.1     37.3
Why does it work?
• Reinforces parts of the phrase translation model which are relevant for the test corpus
• Glue phrases from the test data are used to compose new phrases (most phrases still come from the original data)
Summary
• Should we ever use Co-training for Bootstrapping?
• Per-label cautiousness leads to effective bootstrapping.
  – Exploited in Yarowsky algo., DL-CoTrain, Co-Boosting
• These dynamics can/should be examined more closely.
  – Perhaps using tools from the analysis of feature induction.
• Bootstrapping and self-training may be more effective than you may have thought.