The Learning Problem
• Baum-Welch = Forward-Backward Algorithm (Baum 1972)
• It is a special case of the EM or Expectation-Maximization algorithm (Dempster, Laird, Rubin)
• The algorithm will let us train the transition probabilities A = {a_ij} and the emission probabilities B = {b_i(o_t)} of the HMM
Input to Baum-Welch
• O: unlabeled sequence of observations
• Q: vocabulary of hidden states
• For the ice-cream task:
  • O = {1, 3, 2, …}
  • Q = {H, C}
Starting out with Observable Markov Models
• How to train?
• Run the model on observation sequence O.
• Since it's not hidden, we know which states we went through, hence which transitions and observations were used.
• Given that information, training:
  • B = {b_k(o_t)}: since every state can only generate one observation symbol, the observation likelihoods B are all 1.0
  • A = {a_ij}:

$$a_{ij} = \frac{C(i \rightarrow j)}{\sum_{q \in Q} C(i \rightarrow q)}$$
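For intuition, here is a minimal sketch of this count-and-normalize estimate for a fully observable Markov model; the toy hot/cold state sequences are hypothetical, not from the lecture:

```python
from collections import defaultdict

def estimate_transitions(state_sequences):
    """MLE of a_ij = C(i -> j) / sum_q C(i -> q) from fully observed state sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in state_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    A = {}
    for i, row in counts.items():
        total = sum(row.values())          # C(i -> q) summed over all q
        A[i] = {j: c / total for j, c in row.items()}
    return A

# Hypothetical toy data: two observed state sequences over states H (hot) and C (cold)
print(estimate_transitions([["H", "H", "C", "H"], ["C", "C", "H"]]))
```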
Extending the Intuition to HMMs
• For an HMM, we cannot compute these counts directly from observed sequences
• Baum-Welch intuitions:
  • Iteratively estimate the counts.
  • Start with an estimate for a_ij and b_k, and iteratively improve the estimates
  • Get estimated probabilities by:
    • computing the forward probability for an observation
    • dividing that probability mass among all the different paths that contributed to this forward probability
The Backward algorithm
• We define the backward probability as follows:
• This is the probability of generating the partial observations O_{t+1}^T from time t+1 to the end, given that the HMM is in state i at time t (and, of course, given Φ).

$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \Phi)$$
The Backward algorithm
• We compute the backward probability by induction:
Inductive step of the backward algorithm
• Computation of β_t(i) as a weighted sum of all successive values β_{t+1}(j):

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
!"#$!"
%&$
%&'
%&(
%&)
*$+!"#$,
!!"#$%&"'&!!()"'$&%&-.&*-+!"#$,&
/$
/'
/)
/(
/$
/&
/'
/$
/'
!"0$
/)
/(
!!()"*$
!!()"+$
!!()",$
!!()")$
*'+!"#$,*'+!"#$,
*'+!"#$,
Intuition for re-estimation of a_ij
• We will estimate â_ij via this intuition:

$$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$$

• Numerator intuition:
  • Assume we had some estimate of the probability that a given transition i→j was taken at time t in the observation sequence.
  • If we knew this probability for each time t, we could sum over all t to get the expected value (count) for i→j.
Re-estimation of a_ij
• Let ξ_t be the probability of being in state i at time t and state j at time t+1, given O_{1..T} and model Φ:

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$$

• We can compute ξ from not-quite-ξ, which is:

$$\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \lambda)$$
Computing not-quite-ξ
• The four components of $P(q_t = i, q_{t+1} = j, O \mid \lambda)$ are α, β, a_ij, and b_j(o_{t+1}):

$$\text{not\_quite\_}\xi_t(i, j) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
From not-quite-ξ to ξ
• We want: $\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$
• We've got: $\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \lambda)$
• Which we compute as follows:
From not-quite-ξ to ξ
• We want: $\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$
• We've got: $\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \lambda)$
• Since $P(X \mid Y, Z) = P(X, Y \mid Z) / P(Y \mid Z)$,
• We need $P(O \mid \lambda)$:

$$\xi_t(i, j) = \frac{\text{not\_quite\_}\xi_t(i, j)}{P(O \mid \lambda)}$$
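A small NumPy sketch of this computation, assuming (T, N)-shaped forward and backward matrices like those in the sketches above, and taking P(O|λ) as the summed final forward probabilities (no explicit final state):

```python
import numpy as np

def compute_xi(alpha, beta, A, B):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda) for t = 0 .. T-2.
    alpha, beta: (T, N) forward/backward probabilities;
    A: (N, N) transitions; B: (N, T) emission likelihoods b_j(o_t)."""
    T, N = alpha.shape
    P_O = alpha[-1, :].sum()                 # P(O | lambda), no final state assumed
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # not-quite-xi: alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        xi[t] = alpha[t, :, None] * A * B[:, t + 1][None, :] * beta[t + 1][None, :]
    return xi / P_O                          # divide by P(O | lambda)
```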
From ξ to a_ij
• The expected number of transitions from state i to state j is the sum over all t of ξ
• The total expected number of transitions out of state i is the sum over all transitions out of state i
• Final formula for re-estimated a_ij:

$$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}$$
Re-estimating the observation likelihood b

$$\hat{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$$
• We'll need to know γ_t(j): the probability of being in state j at time t:

Computing γ

$$\gamma_t(j) = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}$$
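A one-line sketch in the same (T, N) conventions as the forward/backward sketches above:

```python
import numpy as np

def compute_gamma(alpha, beta):
    """gamma[t, j] = P(q_t = j | O, lambda): probability of being in state j at time t."""
    P_O = alpha[-1, :].sum()                 # total probability of the observations
    return alpha * beta / P_O
```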
Summary
• The ratio between the expected number of transitions from state i to j and the expected number of all transitions from state i
• The ratio between the expected number of times the observation data emitted from state j is v_k, and the expected number of times any observation is emitted from state j
The Forward-Backward Algorithm
Summary: Forward-Backward Algorithm
1. Initialize Φ = (A, B)
2. Compute α, β, ξ
3. Estimate new Φ' = (A, B)
4. Replace Φ with Φ'
5. If not converged, go to 2
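Putting the pieces together, here is a hedged sketch of one EM iteration for a discrete-output HMM. It reuses the backward(), compute_xi(), and compute_gamma() sketches above, adds a matching forward pass, and assumes an (N, V) emission matrix over a symbol vocabulary; it illustrates the update equations rather than being a production trainer:

```python
import numpy as np

def forward(A, B, pi):
    """Forward probabilities alpha[t, i] = P(o_1 .. o_t, q_t = i)."""
    N, T = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, 0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, t]
    return alpha

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch (EM) iteration.
    A: (N, N) transitions; B: (N, V) discrete emission matrix;
    pi: (N,) initial probabilities; obs: list of observation-symbol indices."""
    obs = np.asarray(obs)
    B_t = B[:, obs]                          # b_i(o_t) lookups, shape (N, T)
    alpha = forward(A, B_t, pi)
    beta = backward(A, B_t)                  # sketch defined earlier
    xi = compute_xi(alpha, beta, A, B_t)     # (T-1, N, N), defined earlier
    gamma = compute_gamma(alpha, beta)       # (T, N), defined earlier
    # a_ij = expected i->j transitions / expected transitions out of i
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # b_j(v_k) = expected times in j observing v_k / expected times in j
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[obs == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, gamma[0]            # gamma[0] re-estimates pi
```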
Applying FB to speech: Caveats
• The network structure of the HMM is always created by hand
  • No algorithm for double-induction of optimal structure and probabilities has been able to beat simple hand-built structures.
• Always a Bakis network: links go forward in time
  • Subcase of a Bakis net: the beads-on-a-string net
• Baum-Welch is only guaranteed to return a local maximum, rather than the global optimum
• At the end, we throw away A and only keep B
CS 224S / LINGUIST 285 Spoken Language Processing
Dan Jurafsky, Stanford University
Spring 2014
Lecture 4b: Advanced Decoding
Outline for Today
• Advanced Decoding
• How this fits into the ASR component of the course
  • April 8: HMMs, Forward, Viterbi Decoding
  • On your own: N-grams and Language Modeling
  • Apr 10: Training: Baum-Welch (Forward-Backward)
  • Apr 10: Advanced Decoding
  • Apr 15: Acoustic Modeling and GMMs
  • Apr 17: Feature Extraction, MFCCs
  • May 27: Deep Neural Net Acoustic Models
Advanced Search (= Decoding)
• How to weight the AM and LM
• Speeding things up: Viterbi beam decoding
• Multipass decoding
  • N-best lists
  • Lattices
  • Word graphs
  • Meshes / confusion networks
• Finite State Methods
What we are searching for
• Given an Acoustic Model (AM) and a Language Model (LM):

$$\hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} \underbrace{P(O \mid W)}_{\text{AM (likelihood)}}\;\underbrace{P(W)}_{\text{LM (prior)}} \qquad (1)$$
Combining Acoustic and Language Models
• We don't actually use equation (1)

$$(1)\quad \hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} P(O \mid W)\, P(W)$$

• The AM underestimates the acoustic probability
  • Why? Bad independence assumptions
  • Intuition: we compute (independent) AM probability estimates; but if we could look at context, we would assign a much higher probability. So we are underestimating.
  • We do this every 10 ms, but the LM only once per word.
  • Besides: the AM isn't a true probability
• The AM and LM have vastly different dynamic ranges
Language Model Scaling Factor
• Solution: add a language model weight (also called language weight LW, or language model scaling factor LMSF)
• Value determined empirically, and is positive (why?)
• Often in the range 10 ± 5.

$$(2)\quad \hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} P(O \mid W)\, P(W)^{\mathrm{LMSF}}$$
Language Model Scaling Factor
• As LMSF is increased:
  • More deletion errors (since we increase the penalty for transitioning between words)
  • Fewer insertion errors
  • Need a wider search beam (since path scores are larger)
  • Less influence of the acoustic model observation probabilities
Slide from Bryan Pellom
Word Insertion Penalty
• But the LM probability P(W) also functions as a penalty for inserting words
  • Intuition: when a uniform language model (every word has an equal probability) is used, the LM probability is a 1/V penalty multiplier taken for each word
  • Each sentence of N words has penalty (1/V)^N
• If the penalty is large (smaller LM prob), the decoder will prefer fewer, longer words
• If the penalty is small (larger LM prob), the decoder will prefer more, shorter words
• When tuning the LM weight to balance the AM, a side effect is that we modify this penalty
• So we add a separate word insertion penalty to offset it

$$(3)\quad \hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} P(O \mid W)\, P(W)^{\mathrm{LMSF}}\, \mathrm{WIP}^{N(W)}$$
Word Insertion Penalty
• Controls the trade-off between insertion and deletion errors
• As the penalty becomes larger (more negative):
  • More deletion errors
  • Fewer insertion errors
• Acts as a model of the effect of length on probability
  • But probably not a good model (the geometric assumption is probably bad for short sentences)
Log domain
• We do everything in the log domain
• So the final equation is:

$$(4)\quad \hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} \log P(O \mid W) + \mathrm{LMSF} \cdot \log P(W) + N \cdot \log \mathrm{WIP}$$
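A minimal sketch of scoring hypotheses with equation (4); the LMSF and WIP values and the hypothesis tuples below are illustrative placeholders, not recommended settings:

```python
import math

def combined_score(am_logprob, lm_logprob, n_words, lmsf=12.0, wip=0.5):
    """Log-domain hypothesis score, equation (4):
    log P(O|W) + LMSF * log P(W) + N * log WIP."""
    return am_logprob + lmsf * lm_logprob + n_words * math.log(wip)

# Hypothetical hypotheses: (word string, AM log prob, LM log prob, word count)
hyps = [("if i ever", -4120.3, -12.7, 3), ("if i never", -4118.9, -14.2, 3)]
best = max(hyps, key=lambda h: combined_score(h[1], h[2], h[3]))
print(best[0])
```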
Speeding things up
• Viterbi is O(N²T), where N is the total number of HMM states and T is the length of the utterance
• This is too large for real-time search
• A ton of work in ASR search is just to make search faster:
  • Beam search (pruning)
  • Fast match
  • Tree-based lexicons
Beam search
• Instead of retaining all candidates (cells) at every time frame,
• use a threshold T to keep a subset:
  • At each time t
  • Identify the state with the lowest cost D_min
  • Each state with cost > D_min + T is discarded ("pruned") before moving on to time t+1
• Unpruned states are called the active states
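A small sketch of this pruning step on one time frame, using made-up state names and costs (negative log probabilities, so lower is better):

```python
def prune(active_states, beam_width):
    """Keep only states whose cost is within beam_width of the best (lowest) cost."""
    d_min = min(cost for _, cost in active_states)
    return [(state, cost) for state, cost in active_states
            if cost <= d_min + beam_width]

# Toy frame: state s3 falls outside the beam and is pruned
print(prune([("s1", 10.2), ("s2", 11.0), ("s3", 25.7)], beam_width=8.0))
```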
Viterbi Beam Search
Slide from John-Paul Hosom
[Figure: Viterbi trellis over states A, B, C and times t = 0…4, showing initial probabilities π_A, π_B, π_C and per-frame emission likelihoods b_A(t), b_B(t), b_C(t)]
Viterbi Beam Search
• Most common search algorithm for LVCSR
• Time-synchronous
  • Comparing paths of equal length
  • Two different word sequences W1 and W2:
    • We are comparing P(W1 | O_0^t) and P(W2 | O_0^t)
    • Based on the same partial observation sequence O_0^t
    • So the denominator is the same and can be ignored
• Time-asynchronous search (A*) is harder
Viterbi Beam Search
• Empirically, a beam size of 5-10% of the search space is enough
• Thus 90-95% of HMM states don't have to be considered at each time t
• Vast savings in time.
On-line processing
• Problem with Viterbi search:
  • It doesn't return the best sequence until the final frame
  • This delay is unreasonable for many applications.
• On-line processing
  • usually gives a smaller delay in determining the answer
  • at the cost of an always-increased processing time.
On-line processing
• At every time interval I (e.g. 1000 msec or 100 frames):
  • At the current time t_curr, for each active state q_tcurr, find the best path P(q_tcurr) that goes from t_0 to t_curr (using the backtrace ψ)
  • Compare the set of best paths P and find the last time t_match at which all paths P have the same state value at that time
  • If t_match exists: { Output the result from t_0 to t_match; Reset/remove ψ values until t_match; Set t_0 to t_match + 1 }
• Efficiency depends on the interval I, the beam threshold, and how well the observations match the HMM.
Slide from John-Paul Hosom
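A toy sketch of the agreement test in this partial traceback; the paths below are the backtraced state sequences from the example that follows:

```python
def find_t_match(backtraces):
    """Return the last index (relative to t0) at which all backtraced
    best paths agree on the same state, or None if they never agree."""
    for t in range(len(backtraces[0]) - 1, -1, -1):
        if len({path[t] for path in backtraces}) == 1:
            return t
    return None

# Best paths ending in states A, B, C after 4 frames (from the example below)
paths = [list("BBAA"), list("BBBB"), list("BBBC")]
print(find_t_match(paths))   # -> 1, i.e. the second frame: all paths agree on 'B'
```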
On-line processing
• Example (Interval = 4 frames):
• At time 4, the best paths for all states A, B, and C have state B in common at time 2. So t_match = 2.
• Now output states BB for times 1 and 2, because no matter what happens in the future, this will not change. Set t_0 to 3.
Slide from John-Paul Hosom
[Figure: trellis of δ_t(A), δ_t(B), δ_t(C) for t = 1…4 (t0 = 1, tcurr = 4); the backtraced best paths ending in A, B, and C are BBAA, BBBB, and BBBC, so all paths share state B at times 1 and 2]
On-line processing
Slide from John-Paul Hosom
• Now t_match = 7, so output from t = 3 to t = 7: BBABB, then set t_0 to 8.
• If T = 8, then output the state with the best δ_8, for example C. The final result (obtained piece-by-piece) is then BBBBABBC.
[Figure: trellis of δ_t(A), δ_t(B), δ_t(C) for t = 3…8 (t0 = 3, tcurr = 8, Interval = 4); the backtraced best paths are BBABBA, BBABBB, and BBABBC, which agree through time 7]
Problems with Viterbi
• It's hard to integrate sophisticated knowledge sources
  • Trigram grammars
  • Parser-based LMs
    • long-distance dependencies that violate dynamic programming assumptions
  • Knowledge that isn't left-to-right
    • Following words can help predict preceding words
• Solutions
  • Return multiple hypotheses and use smart knowledge to rescore them
  • Use a different search algorithm, A* decoding (= stack decoding)
Multipass Search
Ways to represent multiple hypotheses
• N-best list
  • Instead of the single best sentence (word string), return an ordered list of N sentence hypotheses
• Word lattice
  • Compact representation of word hypotheses and their times and scores
• Word graph
  • FSA representation of the lattice in which times are represented by topology
Another Problem with Viterbi
• We want the forward probability of the observation given the word string, P(O|W)
• But the Viterbi algorithm makes the "Viterbi approximation"
  • It approximates P(O|W)
  • with P(O|best state sequence)
Solving the best-path-not-best-words problem
• Viterbi returns the best path (state sequence), not the best word sequence
  • The best path can be very different from the best word string if words have many possible pronunciations
• Two solutions
  • Modify Viterbi to sum over the different paths that share the same word string.
    • Do this as part of the N-best computation
    • Compute N-best word strings, not N-best phone paths
  • Use a different decoding algorithm (A*) that computes the true forward probability.
Sample N-best list
N-best lists
• Again, we don't want the N best paths
  • That would be trivial:
    • Store N values in each state cell of the Viterbi trellis instead of 1 value
  • But:
    • Most of the N best paths will have the same word string
    • Useless!!!
  • It turns out that a factor of N is too much to pay
Computing N-best lists
• In the worst case, an admissible algorithm for finding the N most likely hypotheses is exponential in the length of the utterance.
  • S. Young. 1984. "Generating Multiple Solutions from Connected Word DP Recognition Algorithms". Proc. of the Institute of Acoustics, 6:4, 351-354.
• For example, if the AM and LM scores were nearly identical for all word sequences, we would have to consider all permutations of word sequences for the whole sentence (all with the same scores).
• But of course if this were true, we couldn't do ASR at all!
Computing N-best lists
• Instead, various non-admissible algorithms:
  • (Viterbi) Exact N-best
  • (Viterbi) Word-Dependent N-best
• And one admissible algorithm:
  • A* N-best
Exact N-best for time-synchronous Viterbi
• Due to Schwartz and Chow; also called "sentence-dependent N-best"
• Idea: each state stores multiple paths
  • Maintain separate records for paths with distinct word histories
  • History: the whole word sequence up to the current time t and word w
• When 2 or more paths come to the same state at the same time, merge paths with the same history and sum their probabilities.
  • i.e. compute the forward probability within words
• Otherwise, retain only the N best paths for each state
Exact N-best for time-synchronous Viterbi
• Efficiency:
  • A typical HMM state has 2 or 3 predecessor states within the word HMM
  • So for each time frame and state, we need to compare/merge 2 or 3 sets of N paths into N new paths
• At the end of the search, the N paths in the final state of the trellis give the N best word sequences
• Complexity is O(N)
  • Still too slow for practical systems
  • N is 100 to 1000
• More efficient versions: word-dependent N-best
Word-dependent ('bigram') N-best
• Intuition:
  • Instead of each state merging all paths from the start of the sentence,
  • we merge all paths that share the same previous word
• Details:
  • This will require us to do a more complex traceback at the end of the sentence to generate the N-best list
Word-dependent ('bigram') N-best
• At each state, preserve the total probability for each of k << N previous words
  • k is 3 to 6; N is 100 to 1000
• At the end of each word, record the score for each previous-word hypothesis and the name of the previous word
  • So for each word ending we store "alternatives"
  • But, like normal Viterbi, pass on just the best hypothesis
• At the end of the sentence, do a traceback
  • Follow backpointers to get the 1-best
  • But as we follow the pointers, put the alternate words ending at the same point on a queue
  • On the next iteration, pop the next best
Word Lattice
• Each arc is annotated with AM and LM log probabilities
Word Graph
• Timing information removed
• Overlapping copies of words merged
• AM information removed
• The result is a WFST
• Natural extension to an N-gram language model
Converting a word lattice to a word graph
• A word lattice can have a range of possible end frames for a word
• Create an edge from (w_i, t_i) to (w_j, t_j) if t_j - 1 is one of the end times of w_i
Slide from Bryan Pellom
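A small sketch of that edge-creation rule, using a made-up lattice format of (word, start_time, set of end times) tuples:

```python
def word_graph_edges(lattice_words):
    """Connect (w_i, t_i) -> (w_j, t_j) whenever t_j - 1 is an end time of w_i.
    lattice_words: list of (word, start_time, set_of_end_times) tuples."""
    edges = []
    for wi, ti, ends_i in lattice_words:
        for wj, tj, _ in lattice_words:
            if tj - 1 in ends_i:
                edges.append(((wi, ti), (wj, tj)))
    return edges

# Toy lattice: "flights" starting at frame 10 can end at frame 29 or 31
lattice = [("flights", 10, {29, 31}), ("from", 30, {38}), ("to", 32, {37})]
print(word_graph_edges(lattice))
```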
Lattices
• Some researchers are careful to distinguish between word graphs and word lattices
• But we'll follow convention in using "lattice" to mean both word graphs and word lattices.
• Two facts about lattices:
  • Density: the number of word hypotheses or word arcs per uttered word
  • Lattice error rate (also called the "lower bound error rate"): the lowest word error rate for any word sequence in the lattice
    • The lattice error rate is the "oracle" error rate, the best possible error rate you could get from rescoring the lattice.
    • We can use this as an upper bound
Posterior lattices
• We don't actually compute posteriors: the argmax in equation (1) drops the shared denominator P(O)
• Why do we want posteriors?
  • Without a posterior, we can choose the best hypothesis, but we can't know how good it is!
• In order to compute a posterior, we need to
  • normalize over all the different word hypotheses at a time
  • align all the hypotheses, and sum over all paths passing through a word
Mesh = Sausage = pinched lattice
Summary: one-pass vs. multipass
• Potential problems with multipass
  • Can't use it for real-time (we need the end of the sentence)
    • (But we can keep successive passes really fast)
  • Each pass can introduce inadmissible pruning
    • (But one-pass does the same with beam pruning and fast match)
• Why multipass
  • Very expensive KSs (knowledge sources): NL parsing, higher-order n-grams, etc.
  • Spoken language understanding: N-best is a perfect interface
  • Research: N-best lists are very powerful offline tools for algorithm development
  • N-best lists are needed for discriminant training (MMIE, MCE) to get rival hypotheses
Weighted Finite State Transducers for ASR
• An alternative paradigm for ASR
• Used by Kaldi
• A weighted finite state automaton that transduces an input sequence to an output sequence
• Mohri, Mehryar, Fernando Pereira, and Michael Riley. "Speech recognition with weighted finite-state transducers." In Springer Handbook of Speech Processing, pp. 559-584. Springer Berlin Heidelberg, 2008.
  • http://www.cs.nyu.edu/~mohri/pub/hbka.pdf
Weighted Finite State Acceptors
Weighted Finite State Transducers
WFST Algorithms
• Composition: combine transducers at different levels. If G is a finite state grammar and P is a pronunciation dictionary, P ∘ G transduces a phone string to word strings allowed by the grammar.
• Determinization: ensures each state has no more than one output transition for a given input label.
• Minimization: transforms a transducer into an equivalent transducer with the fewest possible states and transitions.
slide from Steve Renals
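To make composition concrete, here is a deliberately naive sketch over a dictionary-based transducer representation (tropical semiring, no epsilon handling, no reachability pruning); real toolkits such as OpenFst/Kaldi implement this far more efficiently:

```python
def compose(t1, t2):
    """Naive WFST composition: pair up states and match t1's output labels
    with t2's input labels, adding weights (tropical semiring).
    Transducer format (a toy convention for this sketch):
      {'start': s, 'finals': {state: weight}, 'arcs': [(src, in_lab, out_lab, w, dst), ...]}"""
    arcs = []
    for (p1, i1, o1, w1, q1) in t1['arcs']:
        for (p2, i2, o2, w2, q2) in t2['arcs']:
            if o1 == i2:                      # middle label must match
                arcs.append(((p1, p2), i1, o2, w1 + w2, (q1, q2)))
    finals = {(f1, f2): w1 + w2
              for f1, w1 in t1['finals'].items()
              for f2, w2 in t2['finals'].items()}
    return {'start': (t1['start'], t2['start']), 'finals': finals, 'arcs': arcs}

# Tiny example: P maps a phone to a word, G accepts that word
P = {'start': 0, 'finals': {1: 0.0}, 'arcs': [(0, 'ey', 'a', 0.5, 1)]}
G = {'start': 0, 'finals': {1: 0.0}, 'arcs': [(0, 'a', 'a', 1.0, 1)]}
print(compose(P, G))   # a phone-to-word path 'ey' -> 'a' with weight 1.5
```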
WFST-based decoding
• Represent the following components as WFSTs:
  • Context-dependent acoustic models (C)
  • Pronunciation dictionary (D)
  • n-gram language model (L)
• The decoding network is defined by their composition: C ∘ D ∘ L
• Successively determinize and combine the component transducers, then minimize the final network
slide from Steve Renals
[Figure: example WFSTs G, L, their composition G ∘ L, and min(det(L ∘ G))]
Advanced Search (= Decoding)
• How to weight the AM and LM
• Speeding things up: Viterbi beam decoding
• Multipass decoding
  • N-best lists
  • Lattices
  • Word graphs
  • Meshes / confusion networks
• Finite State Methods