The Learning Problem
• Baum-Welch = Forward-Backward Algorithm (Baum 1972)
• It is a special case of the EM or Expectation-Maximization algorithm (Dempster, Laird, Rubin)
• The algorithm will let us train the transition probabilities A = {a_ij} and the emission probabilities B = {b_i(o_t)} of the HMM
Input to Baum-Welch
• O: unlabeled sequence of observations
• Q: vocabulary of hidden states
• For the ice-cream task:
  • O = {1, 3, 2, …}
  • Q = {H, C}
Starting out with Observable Markov Models
• How to train?
• Run the model on observation sequence O.
• Since it's not hidden, we know which states we went through, hence which transitions and observations were used.
• Given that information, training:
  • B = {b_k(o_t)}: since every state can only generate one observation symbol, the observation likelihoods B are all 1.0
  • A = {a_ij}:

$$a_{ij} = \frac{C(i \rightarrow j)}{\sum_{q \in Q} C(i \rightarrow q)}$$
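For intuition, here is a minimal sketch of this count-and-normalize estimate for a fully observable Markov model; the toy hot/cold state sequences are hypothetical, not from the lecture:

```python
from collections import defaultdict

def estimate_transitions(state_sequences):
    """MLE of a_ij = C(i -> j) / sum_q C(i -> q) from fully observed state sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in state_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    A = {}
    for i, row in counts.items():
        total = sum(row.values())          # C(i -> q) summed over all q
        A[i] = {j: c / total for j, c in row.items()}
    return A

# Hypothetical toy data: two observed state sequences over states H (hot) and C (cold)
print(estimate_transitions([["H", "H", "C", "H"], ["C", "C", "H"]]))
```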
Extending the Intuition to HMMs
• For an HMM, we cannot compute these counts directly from observed sequences
• Baum-Welch intuitions:
  • Iteratively estimate the counts.
  • Start with an estimate for a_ij and b_k, and iteratively improve the estimates
  • Get estimated probabilities by:
    • computing the forward probability for an observation
    • dividing that probability mass among all the different paths that contributed to this forward probability
The Backward algorithm
• We define the backward probability as follows:
• This is the probability of generating the partial observations O_{t+1}^T from time t+1 to the end, given that the HMM is in state i at time t (and, of course, given Φ).

$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \Phi)$$
The Backward algorithm
• We compute the backward probability by induction:
Inductive step of the backward algorithm
• Computation of β_t(i) as a weighted sum of all successive values β_{t+1}(j):

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
!"#$!"
%&$
%&'
%&(
%&)
*$+!"#$,
!!"#$%&"'&!!()"'$&%&-.&*-+!"#$,&
/$
/'
/)
/(
/$
/&
/'
/$
/'
!"0$
/)
/(
!!()"*$
!!()"+$
!!()",$
!!()")$
*'+!"#$,*'+!"#$,
*'+!"#$,
Intuition for re-estimation of a_ij
• We will estimate â_ij via this intuition:

$$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$$

• Numerator intuition:
  • Assume we had some estimate of the probability that a given transition i→j was taken at time t in the observation sequence.
  • If we knew this probability for each time t, we could sum over all t to get the expected value (count) for i→j.
Re-estimation of a_ij
• Let ξ_t be the probability of being in state i at time t and state j at time t+1, given O_{1..T} and model Φ:

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$$

• We can compute ξ from not-quite-ξ, which is:

$$\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \lambda)$$
Computing not-quite-ξ
• The four components of $P(q_t = i, q_{t+1} = j, O \mid \lambda)$ are α, β, a_ij, and b_j(o_{t+1}):

$$\text{not\_quite\_}\xi_t(i, j) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
From not-quite-ξ to ξ
• We want: $\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$
• We've got: $\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \lambda)$
• Which we compute as follows:
From not-quite-ξ to ξ
• We want: $\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$
• We've got: $\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \lambda)$
• Since $P(X \mid Y, Z) = P(X, Y \mid Z) / P(Y \mid Z)$,
• We need $P(O \mid \lambda)$:

$$\xi_t(i, j) = \frac{\text{not\_quite\_}\xi_t(i, j)}{P(O \mid \lambda)}$$
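A small NumPy sketch of this computation, assuming (T, N)-shaped forward and backward matrices like those in the sketches above, and taking P(O|λ) as the summed final forward probabilities (no explicit final state):

```python
import numpy as np

def compute_xi(alpha, beta, A, B):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda) for t = 0 .. T-2.
    alpha, beta: (T, N) forward/backward probabilities;
    A: (N, N) transitions; B: (N, T) emission likelihoods b_j(o_t)."""
    T, N = alpha.shape
    P_O = alpha[-1, :].sum()                 # P(O | lambda), no final state assumed
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # not-quite-xi: alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        xi[t] = alpha[t, :, None] * A * B[:, t + 1][None, :] * beta[t + 1][None, :]
    return xi / P_O                          # divide by P(O | lambda)
```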
From ξ to a_ij
• The expected number of transitions from state i to state j is the sum over all t of ξ
• The total expected number of transitions out of state i is the sum over all transitions out of state i
• Final formula for re-estimated a_ij:

$$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}$$
Re-estimating the observation likelihood b

$$\hat{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$$
• We'll need to know γ_t(j): the probability of being in state j at time t:

Computing γ

$$\gamma_t(j) = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}$$
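A one-line sketch in the same (T, N) conventions as the forward/backward sketches above:

```python
import numpy as np

def compute_gamma(alpha, beta):
    """gamma[t, j] = P(q_t = j | O, lambda): probability of being in state j at time t."""
    P_O = alpha[-1, :].sum()                 # total probability of the observations
    return alpha * beta / P_O
```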
Summary
• The ratio between the expected number of transitions from state i to j and the expected number of all transitions from state i
• The ratio between the expected number of times the observation data emitted from state j is v_k, and the expected number of times any observation is emitted from state j
The Forward-Backward Algorithm
Summary: Forward-Backward Algorithm
1. Initialize Φ = (A, B)
2. Compute α, β, ξ
3. Estimate new Φ' = (A, B)
4. Replace Φ with Φ'
5. If not converged, go to 2
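Putting the pieces together, here is a hedged sketch of one EM iteration for a discrete-output HMM. It reuses the backward(), compute_xi(), and compute_gamma() sketches above, adds a matching forward pass, and assumes an (N, V) emission matrix over a symbol vocabulary; it illustrates the update equations rather than being a production trainer:

```python
import numpy as np

def forward(A, B, pi):
    """Forward probabilities alpha[t, i] = P(o_1 .. o_t, q_t = i)."""
    N, T = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, 0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, t]
    return alpha

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch (EM) iteration.
    A: (N, N) transitions; B: (N, V) discrete emission matrix;
    pi: (N,) initial probabilities; obs: list of observation-symbol indices."""
    obs = np.asarray(obs)
    B_t = B[:, obs]                          # b_i(o_t) lookups, shape (N, T)
    alpha = forward(A, B_t, pi)
    beta = backward(A, B_t)                  # sketch defined earlier
    xi = compute_xi(alpha, beta, A, B_t)     # (T-1, N, N), defined earlier
    gamma = compute_gamma(alpha, beta)       # (T, N), defined earlier
    # a_ij = expected i->j transitions / expected transitions out of i
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # b_j(v_k) = expected times in j observing v_k / expected times in j
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[obs == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, gamma[0]            # gamma[0] re-estimates pi
```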
Applying FB to speech: Caveats
• The network structure of the HMM is always created by hand
  • No algorithm for double-induction of optimal structure and probabilities has been able to beat simple hand-built structures.
• Always a Bakis network: links go forward in time
  • Subcase of a Bakis net: the beads-on-a-string net
• Baum-Welch is only guaranteed to return a local maximum, rather than the global optimum
• At the end, we throw away A and only keep B
CS 224S / LINGUIST 285 Spoken Language Processing
Dan Jurafsky, Stanford University
Spring 2014
Lecture 4b: Advanced Decoding
Outline for Today
• Advanced Decoding
• How this fits into the ASR component of the course
  • April 8: HMMs, Forward, Viterbi Decoding
  • On your own: N-grams and Language Modeling
  • Apr 10: Training: Baum-Welch (Forward-Backward)
  • Apr 10: Advanced Decoding
  • Apr 15: Acoustic Modeling and GMMs
  • Apr 17: Feature Extraction, MFCCs
  • May 27: Deep Neural Net Acoustic Models
Advanced Search (= Decoding)
• How to weight the AM and LM
• Speeding things up: Viterbi beam decoding
• Multipass decoding
  • N-best lists
  • Lattices
  • Word graphs
  • Meshes / confusion networks
• Finite State Methods
What we are searching for
• Given an Acoustic Model (AM) and a Language Model (LM):

$$\hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} \underbrace{P(O \mid W)}_{\text{AM (likelihood)}}\;\underbrace{P(W)}_{\text{LM (prior)}} \qquad (1)$$
Combining Acoustic and Language Models
• We don't actually use equation (1)

$$(1)\quad \hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} P(O \mid W)\, P(W)$$

• The AM underestimates the acoustic probability
  • Why? Bad independence assumptions
  • Intuition: we compute (independent) AM probability estimates; but if we could look at context, we would assign a much higher probability. So we are underestimating.
  • We do this every 10 ms, but the LM only once per word.
  • Besides: the AM isn't a true probability
• The AM and LM have vastly different dynamic ranges
Language Model Scaling Factor
• Solution: add a language model weight (also called language weight LW, or language model scaling factor LMSF)
• Value determined empirically, and is positive (why?)
• Often in the range 10 ± 5.

$$(2)\quad \hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} P(O \mid W)\, P(W)^{\mathrm{LMSF}}$$
Language Model Scaling Factor
• As LMSF is increased:
  • More deletion errors (since we increase the penalty for transitioning between words)
  • Fewer insertion errors
  • Need a wider search beam (since path scores are larger)
  • Less influence of the acoustic model observation probabilities
Slide from Bryan Pellom
Word Insertion Penalty
• But the LM probability P(W) also functions as a penalty for inserting words
  • Intuition: when a uniform language model (every word has an equal probability) is used, the LM probability is a 1/V penalty multiplier taken for each word
  • Each sentence of N words has penalty (1/V)^N
• If the penalty is large (smaller LM prob), the decoder will prefer fewer, longer words
• If the penalty is small (larger LM prob), the decoder will prefer more, shorter words
• When tuning the LM weight to balance the AM, a side effect is that we modify this penalty
• So we add a separate word insertion penalty to offset it

$$(3)\quad \hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} P(O \mid W)\, P(W)^{\mathrm{LMSF}}\, \mathrm{WIP}^{N(W)}$$
Word Insertion Penalty
• Controls the trade-off between insertion and deletion errors
• As the penalty becomes larger (more negative):
  • More deletion errors
  • Fewer insertion errors
• Acts as a model of the effect of length on probability
  • But probably not a good model (the geometric assumption is probably bad for short sentences)
Log domain
• We do everything in the log domain
• So the final equation is:

$$(4)\quad \hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} \log P(O \mid W) + \mathrm{LMSF} \cdot \log P(W) + N \cdot \log \mathrm{WIP}$$
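A minimal sketch of scoring hypotheses with equation (4); the LMSF and WIP values and the hypothesis tuples below are illustrative placeholders, not recommended settings:

```python
import math

def combined_score(am_logprob, lm_logprob, n_words, lmsf=12.0, wip=0.5):
    """Log-domain hypothesis score, equation (4):
    log P(O|W) + LMSF * log P(W) + N * log WIP."""
    return am_logprob + lmsf * lm_logprob + n_words * math.log(wip)

# Hypothetical hypotheses: (word string, AM log prob, LM log prob, word count)
hyps = [("if i ever", -4120.3, -12.7, 3), ("if i never", -4118.9, -14.2, 3)]
best = max(hyps, key=lambda h: combined_score(h[1], h[2], h[3]))
print(best[0])
```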
Speeding things up
• Viterbi is O(N²T), where N is the total number of HMM states and T is the length of the utterance
• This is too large for real-time search
• A ton of work in ASR search is just to make search faster:
  • Beam search (pruning)
  • Fast match
  • Tree-based lexicons
Beam search
• Instead of retaining all candidates (cells) at every time frame,
• use a threshold T to keep a subset:
  • At each time t
  • Identify the state with the lowest cost D_min
  • Each state with cost > D_min + T is discarded ("pruned") before moving on to time t+1
• Unpruned states are called the active states
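A small sketch of this pruning step on one time frame, using made-up state names and costs (negative log probabilities, so lower is better):

```python
def prune(active_states, beam_width):
    """Keep only states whose cost is within beam_width of the best (lowest) cost."""
    d_min = min(cost for _, cost in active_states)
    return [(state, cost) for state, cost in active_states
            if cost <= d_min + beam_width]

# Toy frame: state s3 falls outside the beam and is pruned
print(prune([("s1", 10.2), ("s2", 11.0), ("s3", 25.7)], beam_width=8.0))
```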
Viterbi Beam Search
Slide from John-Paul Hosom
[Figure: Viterbi trellis over states A, B, C and times t = 0…4, showing initial probabilities π_A, π_B, π_C and per-frame emission likelihoods b_A(t), b_B(t), b_C(t)]
Viterbi Beam Search
• Most common search algorithm for LVCSR
• Time-synchronous
  • Comparing paths of equal length
  • Two different word sequences W1 and W2:
    • We are comparing P(W1 | O_0^t) and P(W2 | O_0^t)
    • Based on the same partial observation sequence O_0^t
    • So the denominator is the same and can be ignored
• Time-asynchronous search (A*) is harder
Viterbi Beam Search
• Empirically, a beam size of 5-10% of the search space is enough
• Thus 90-95% of HMM states don't have to be considered at each time t
• Vast savings in time.
On-line processing
• Problem with Viterbi search:
  • It doesn't return the best sequence until the final frame
  • This delay is unreasonable for many applications.
• On-line processing
  • usually gives a smaller delay in determining the answer
  • at the cost of an always-increased processing time.
On-line processing
• At every time interval I (e.g. 1000 msec or 100 frames):
  • At the current time t_curr, for each active state q_tcurr, find the best path P(q_tcurr) that goes from t_0 to t_curr (using the backtrace ψ)
  • Compare the set of best paths P and find the last time t_match at which all paths P have the same state value at that time
  • If t_match exists: { Output the result from t_0 to t_match; Reset/remove ψ values until t_match; Set t_0 to t_match + 1 }
• Efficiency depends on the interval I, the beam threshold, and how well the observations match the HMM.
Slide from John-Paul Hosom
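A toy sketch of the agreement test in this partial traceback; the paths below are the backtraced state sequences from the example that follows:

```python
def find_t_match(backtraces):
    """Return the last index (relative to t0) at which all backtraced
    best paths agree on the same state, or None if they never agree."""
    for t in range(len(backtraces[0]) - 1, -1, -1):
        if len({path[t] for path in backtraces}) == 1:
            return t
    return None

# Best paths ending in states A, B, C after 4 frames (from the example below)
paths = [list("BBAA"), list("BBBB"), list("BBBC")]
print(find_t_match(paths))   # -> 1, i.e. the second frame: all paths agree on 'B'
```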
On-line processing
• Example (Interval = 4 frames):
• At time 4, the best paths for all states A, B, and C have state B in common at time 2. So t_match = 2.
• Now output states BB for times 1 and 2, because no matter what happens in the future, this will not change. Set t_0 to 3.
Slide from John-Paul Hosom
[Figure: trellis of δ_t(A), δ_t(B), δ_t(C) for t = 1…4 (t0 = 1, tcurr = 4); the backtraced best paths ending in A, B, and C are BBAA, BBBB, and BBBC, so all paths share state B at times 1 and 2]
On-line processing
Slide from John-Paul Hosom
• Now t_match = 7, so output from t = 3 to t = 7: BBABB, then set t_0 to 8.
• If T = 8, then output the state with the best δ_8, for example C. The final result (obtained piece-by-piece) is then BBBBABBC.
[Figure: trellis of δ_t(A), δ_t(B), δ_t(C) for t = 3…8 (t0 = 3, tcurr = 8, Interval = 4); the backtraced best paths are BBABBA, BBABBB, and BBABBC, which agree through time 7]
Problems with Viterbi
• It's hard to integrate sophisticated knowledge sources
  • Trigram grammars
  • Parser-based LMs
    • long-distance dependencies that violate dynamic programming assumptions
  • Knowledge that isn't left-to-right
    • Following words can help predict preceding words
• Solutions
  • Return multiple hypotheses and use smart knowledge to rescore them
  • Use a different search algorithm, A* decoding (= stack decoding)
Multipass Search
Ways to represent multiple hypotheses
• N-best list
  • Instead of the single best sentence (word string), return an ordered list of N sentence hypotheses
• Word lattice
  • Compact representation of word hypotheses and their times and scores
• Word graph
  • FSA representation of the lattice in which times are represented by topology
Another Problem with Viterbi
• We want the forward probability of the observation given the word string, P(O|W)
• But the Viterbi algorithm makes the "Viterbi approximation"
  • It approximates P(O|W)
  • with P(O|best state sequence)
Solving the best-path-not-best-words problem
• Viterbi returns the best path (state sequence), not the best word sequence
  • The best path can be very different from the best word string if words have many possible pronunciations
• Two solutions
  • Modify Viterbi to sum over the different paths that share the same word string.
    • Do this as part of the N-best computation
    • Compute N-best word strings, not N-best phone paths
  • Use a different decoding algorithm (A*) that computes the true forward probability.
Sample N-best list
N-best lists
• Again, we don't want the N best paths
  • That would be trivial:
    • Store N values in each state cell of the Viterbi trellis instead of 1 value
  • But:
    • Most of the N best paths will have the same word string
    • Useless!!!
  • It turns out that a factor of N is too much to pay
Computing N-best lists
• In the worst case, an admissible algorithm for finding the N most likely hypotheses is exponential in the length of the utterance.
  • S. Young. 1984. "Generating Multiple Solutions from Connected Word DP Recognition Algorithms". Proc. of the Institute of Acoustics, 6:4, 351-354.
• For example, if the AM and LM scores were nearly identical for all word sequences, we would have to consider all permutations of word sequences for the whole sentence (all with the same scores).
• But of course if this were true, we couldn't do ASR at all!
Computing N-best lists
• Instead, various non-admissible algorithms:
  • (Viterbi) Exact N-best
  • (Viterbi) Word-Dependent N-best
• And one admissible algorithm:
  • A* N-best
Exact N-best for time-synchronous Viterbi
• Due to Schwartz and Chow; also called "sentence-dependent N-best"
• Idea: each state stores multiple paths
  • Maintain separate records for paths with distinct word histories
  • History: the whole word sequence up to the current time t and word w
• When 2 or more paths come to the same state at the same time, merge paths with the same history and sum their probabilities.
  • i.e. compute the forward probability within words
• Otherwise, retain only the N best paths for each state
Exact N-best for time-synchronous Viterbi
• Efficiency:
  • A typical HMM state has 2 or 3 predecessor states within the word HMM
  • So for each time frame and state, we need to compare/merge 2 or 3 sets of N paths into N new paths
• At the end of the search, the N paths in the final state of the trellis give the N best word sequences
• Complexity is O(N)
  • Still too slow for practical systems
  • N is 100 to 1000
• More efficient versions: word-dependent N-best
Word-dependent ('bigram') N-best
• Intuition:
  • Instead of each state merging all paths from the start of the sentence,
  • we merge all paths that share the same previous word
• Details:
  • This will require us to do a more complex traceback at the end of the sentence to generate the N-best list
Word-dependent ('bigram') N-best
• At each state, preserve the total probability for each of k << N previous words
  • k is 3 to 6; N is 100 to 1000
• At the end of each word, record the score for each previous-word hypothesis and the name of the previous word
  • So for each word ending we store "alternatives"
  • But, like normal Viterbi, pass on just the best hypothesis
• At the end of the sentence, do a traceback
  • Follow backpointers to get the 1-best
  • But as we follow the pointers, put the alternate words ending at the same point on a queue
  • On the next iteration, pop the next best
Word Lattice
• Each arc is annotated with AM and LM log probabilities
Word Graph
• Timing information removed
• Overlapping copies of words merged
• AM information removed
• The result is a WFST
• Natural extension to an N-gram language model
Converting a word lattice to a word graph
• A word lattice can have a range of possible end frames for a word
• Create an edge from (w_i, t_i) to (w_j, t_j) if t_j - 1 is one of the end times of w_i
Slide from Bryan Pellom
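A small sketch of that edge-creation rule, using a made-up lattice format of (word, start_time, set of end times) tuples:

```python
def word_graph_edges(lattice_words):
    """Connect (w_i, t_i) -> (w_j, t_j) whenever t_j - 1 is an end time of w_i.
    lattice_words: list of (word, start_time, set_of_end_times) tuples."""
    edges = []
    for wi, ti, ends_i in lattice_words:
        for wj, tj, _ in lattice_words:
            if tj - 1 in ends_i:
                edges.append(((wi, ti), (wj, tj)))
    return edges

# Toy lattice: "flights" starting at frame 10 can end at frame 29 or 31
lattice = [("flights", 10, {29, 31}), ("from", 30, {38}), ("to", 32, {37})]
print(word_graph_edges(lattice))
```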
Lattices
• Some researchers are careful to distinguish between word graphs and word lattices
• But we'll follow convention in using "lattice" to mean both word graphs and word lattices.
• Two facts about lattices:
  • Density: the number of word hypotheses or word arcs per uttered word
  • Lattice error rate (also called the "lower bound error rate"): the lowest word error rate for any word sequence in the lattice
    • The lattice error rate is the "oracle" error rate, the best possible error rate you could get from rescoring the lattice.
    • We can use this as an upper bound
Posterior lattices
• We don't actually compute posteriors: the argmax in equation (1) drops the shared denominator P(O)
• Why do we want posteriors?
  • Without a posterior, we can choose the best hypothesis, but we can't know how good it is!
• In order to compute a posterior, we need to
  • normalize over all the different word hypotheses at a time
  • align all the hypotheses, and sum over all paths passing through a word
Mesh = Sausage = pinched lattice
Summary: one-pass vs. multipass
• Potential problems with multipass
  • Can't use it for real-time (we need the end of the sentence)
    • (But we can keep successive passes really fast)
  • Each pass can introduce inadmissible pruning
    • (But one-pass does the same with beam pruning and fast match)
• Why multipass
  • Very expensive KSs (knowledge sources): NL parsing, higher-order n-grams, etc.
  • Spoken language understanding: N-best is a perfect interface
  • Research: N-best lists are very powerful offline tools for algorithm development
  • N-best lists are needed for discriminant training (MMIE, MCE) to get rival hypotheses
Weighted Finite State Transducers for ASR
• An alternative paradigm for ASR
• Used by Kaldi
• A weighted finite state automaton that transduces an input sequence to an output sequence
• Mohri, Mehryar, Fernando Pereira, and Michael Riley. "Speech recognition with weighted finite-state transducers." In Springer Handbook of Speech Processing, pp. 559-584. Springer Berlin Heidelberg, 2008.
  • http://www.cs.nyu.edu/~mohri/pub/hbka.pdf
Weighted Finite State Acceptors
Weighted Finite State Transducers
WFST Algorithms
• Composition: combine transducers at different levels. If G is a finite state grammar and P is a pronunciation dictionary, P ∘ G transduces a phone string to word strings allowed by the grammar.
• Determinization: ensures each state has no more than one output transition for a given input label.
• Minimization: transforms a transducer into an equivalent transducer with the fewest possible states and transitions.
slide from Steve Renals
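To make composition concrete, here is a deliberately naive sketch over a dictionary-based transducer representation (tropical semiring, no epsilon handling, no reachability pruning); real toolkits such as OpenFst/Kaldi implement this far more efficiently:

```python
def compose(t1, t2):
    """Naive WFST composition: pair up states and match t1's output labels
    with t2's input labels, adding weights (tropical semiring).
    Transducer format (a toy convention for this sketch):
      {'start': s, 'finals': {state: weight}, 'arcs': [(src, in_lab, out_lab, w, dst), ...]}"""
    arcs = []
    for (p1, i1, o1, w1, q1) in t1['arcs']:
        for (p2, i2, o2, w2, q2) in t2['arcs']:
            if o1 == i2:                      # middle label must match
                arcs.append(((p1, p2), i1, o2, w1 + w2, (q1, q2)))
    finals = {(f1, f2): w1 + w2
              for f1, w1 in t1['finals'].items()
              for f2, w2 in t2['finals'].items()}
    return {'start': (t1['start'], t2['start']), 'finals': finals, 'arcs': arcs}

# Tiny example: P maps a phone to a word, G accepts that word
P = {'start': 0, 'finals': {1: 0.0}, 'arcs': [(0, 'ey', 'a', 0.5, 1)]}
G = {'start': 0, 'finals': {1: 0.0}, 'arcs': [(0, 'a', 'a', 1.0, 1)]}
print(compose(P, G))   # a phone-to-word path 'ey' -> 'a' with weight 1.5
```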
WFST-based decoding
• Represent the following components as WFSTs:
  • Context-dependent acoustic models (C)
  • Pronunciation dictionary (D)
  • n-gram language model (L)
• The decoding network is defined by their composition: C ∘ D ∘ L
• Successively determinize and combine the component transducers, then minimize the final network
slide from Steve Renals
[Figure: example WFSTs G, L, their composition G ∘ L, and min(det(L ∘ G))]
Advanced Search (= Decoding)
• How to weight the AM and LM
• Speeding things up: Viterbi beam decoding
• Multipass decoding
  • N-best lists
  • Lattices
  • Word graphs
  • Meshes / confusion networks
• Finite State Methods