HMMs 1 SGN-24006 Hidden Markov Models use for speech recognition

Contents:
- Viterbi training
- Acoustic modeling aspects
- Isolated-word recognition
- Connected-word recognition
- Token passing algorithm
- Language models

HMMs 2 SGN-24006 Phoneme HMM

Each phoneme is represented by a left-to-right HMM with 3 states.

Word and sentence HMMs are constructed by concatenating the phoneme-level HMMs.

W AX N
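As a rough sketch of this concatenation (illustrative code, not from the slides), the snippet below chains 3-state left-to-right phoneme HMMs into a single word-level transition structure; the state labels, the self-loop probability and the use of W AX N as the example are placeholders.

```python
# Sketch: build a word HMM by concatenating left-to-right phoneme HMMs.
# Emission models are omitted; only states and transitions are shown.
import numpy as np

def phoneme_hmm(label, n_states=3, self_loop=0.6):
    """Left-to-right HMM: each state either stays or moves one step right."""
    A = np.zeros((n_states, n_states + 1))      # last column = exit of the unit
    for i in range(n_states):
        A[i, i] = self_loop                     # stay in the same state
        A[i, i + 1] = 1.0 - self_loop           # advance (or leave the unit)
    states = [f"{label}_{i}" for i in range(n_states)]
    return states, A

def word_hmm(phonemes):
    """Concatenate phoneme HMMs: the exit of one unit feeds the next unit."""
    states, blocks = [], []
    for ph in phonemes:
        s, A = phoneme_hmm(ph)
        states.extend(s)
        blocks.append(A)
    n = len(states)
    A_word = np.zeros((n, n + 1))
    offset = 0
    for A in blocks:
        k = A.shape[0]
        A_word[offset:offset + k, offset:offset + k + 1] += A
        offset += k
    return states, A_word

states, A = word_hmm(["W", "AX", "N"])          # word built from W AX N
print(states)   # ['W_0', 'W_1', 'W_2', 'AX_0', ..., 'N_2']
```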

HMMs 3 SGN-24006 Viterbi training

The forward-backward algorithm assigns a probability that a feature vector was emitted from an HMM state.
Viterbi training: we construct the composite HMM from the phoneme units and use the Viterbi algorithm to find the best state sequence, i.e. a hard assignment of feature vectors to states.

HMMs 4 SGN-24006 Viterbi training

For each training example, use the current HMM models to assign feature vectors to HMM states.

Using the Viterbi algorithm, find the most likely path through the composite HMM model. This is called Viterbi forced alignment.

Group the feature vectors assigned to each HMM state and estimate new parameters for each HMM (for example using the GMM update equations).

Repeat alignment and parameter reestimation
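A minimal sketch of this loop, with simplifying assumptions that are not in the slides: a single left-to-right state chain, one diagonal Gaussian per state instead of a GMM, and no transition reestimation. All names are illustrative.

```python
# Sketch of Viterbi training with simplifying assumptions (not the slides'
# full recipe): one left-to-right state chain, a single diagonal Gaussian
# per state instead of a GMM, and no transition reestimation.
import numpy as np

def log_gauss(x, mean, var):
    """Log-density of a diagonal Gaussian, summed over the feature dimensions."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def viterbi_align(feats, means, variances):
    """Forced alignment of feature frames to a left-to-right state sequence."""
    T, S = len(feats), len(means)
    logp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    logp[0, 0] = log_gauss(feats[0], means[0], variances[0])
    for t in range(1, T):
        for s in range(S):
            # left-to-right topology: stay in state s or come from state s-1
            prev = [(s, logp[t - 1, s])]
            if s > 0:
                prev.append((s - 1, logp[t - 1, s - 1]))
            best_s, best_p = max(prev, key=lambda kv: kv[1])
            logp[t, s] = best_p + log_gauss(feats[t], means[s], variances[s])
            back[t, s] = best_s
    path = [S - 1]                       # backtrace from the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return np.array(path[::-1])

def viterbi_training(utterances, n_states, n_iter=5, dim=13):
    means = np.zeros((n_states, dim))
    variances = np.ones((n_states, dim))
    for _ in range(n_iter):
        assigned = [[] for _ in range(n_states)]
        # 1) forced alignment of every training example with the current models
        for feats in utterances:
            for t, s in enumerate(viterbi_align(feats, means, variances)):
                assigned[s].append(feats[t])
        # 2) reestimate emission parameters from the frames assigned to each state
        for s, frames in enumerate(assigned):
            if frames:
                frames = np.asarray(frames)
                means[s] = frames.mean(axis=0)
                variances[s] = frames.var(axis=0) + 1e-3   # floor to avoid zeros
    return means, variances
```

In a full system the alignment is run over the composite sentence HMM built from each utterance's transcription, and the GMM update equations replace the single-Gaussian update used here.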

HMMs 5 SGN-24006 Acoustic models

An ideal acoustic model is:
- Accurate: it accounts for context dependency (phonetic context)
- Compact: it provides a compact representation, trainable from finite amounts of data
- General: it is a general representation that allows new words to be modeled, even if they were not seen in the training data

HMMs 6 SGN-24006 Whole-word HMMs

Each word is modeled as a whole: each word is assigned an HMM with a number of states.

Is it a good acoustic model?
- Accurate: Yes, if there is enough data and the system has a small vocabulary; No, if trying to model context changes between words.
- Compact: No. It needs many states as the vocabulary increases, and there might not be enough training data to model EVERY word.
- General: No. It cannot be used to build new words.

HMMs 7 SGN-24006 Phoneme HMMs

Each phoneme is modeled using an HMM with M states

Is it a good acoustic model?
- Accurate: No. It does not model coarticulation well.
- Compact: Yes. With M states per phoneme and N phonemes, the complete system has a total of M x N states, not so many parameters to be estimated.
- General: Yes. Any new word can be formed by concatenating the units.

HMMs 8 SGN-24006 Modeling phonetic context

Monophone: a single model is used to represent a phoneme in all contexts.

Biphone: one model represents a particular left or right context.
Notation: left-context biphone (a-b); right-context biphone (b+c)

Triphone: one model represents a particular left and right context.
Notation: (a-b+c)

HMMs 10 SGN-24006 Context-dependent model examples

Monophone: SPEECH = S P IY CH

Biphone:
Left context: SIL-S S-P P-IY IY-CH
Right context: S+P P+IY IY+CH CH+SIL

Triphone: SIL-S+P S-P+IY P-IY+CH IY-CH+SIL

Word-internal context-dependent triphones back off to left and right biphone models at the word boundary:
SPEECH RECOGNITION: SIL S-P S-P+IY P-IY+CH IY+CH R-EH R-EH+K EH-K+AH K-AH+G ..

Cross-word context-dependent triphones:
SIL-S+P S-P+IY P-IY+CH IY-CH+R CH-R+EH R-EH+K EH-K
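The expansions above can be generated mechanically from the phoneme string; the sketch below (illustrative, not from the slides) reproduces the biphone and cross-word triphone labels for SPEECH.

```python
# Sketch (not from the slides): expanding a phoneme string into biphone and
# triphone labels using the left-context "a-b", right-context "b+c" and
# triphone "a-b+c" notation; SIL marks the utterance boundary.
def left_biphones(phones, boundary="SIL"):
    padded = [boundary] + list(phones)
    return [f"{padded[i - 1]}-{padded[i]}" for i in range(1, len(padded))]

def right_biphones(phones, boundary="SIL"):
    padded = list(phones) + [boundary]
    return [f"{padded[i]}+{padded[i + 1]}" for i in range(len(phones))]

def triphones(phones, boundary="SIL"):
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

speech = ["S", "P", "IY", "CH"]
print(left_biphones(speech))    # ['SIL-S', 'S-P', 'P-IY', 'IY-CH']
print(right_biphones(speech))   # ['S+P', 'P+IY', 'IY+CH', 'CH+SIL']
print(triphones(speech))        # ['SIL-S+P', 'S-P+IY', 'P-IY+CH', 'IY-CH+SIL']
```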

HMMs 11 SGN-24006 Context-dependent triphone HMMs

Each phoneme unit within the immediate left and right context is modeled using an HMM with M states.

Is it a good acoustic model?
- Accurate: Yes. It takes coarticulation into account.
- Compact: Yes. Trainable: No. For N phonemes there are N x N x N triphone models, too many parameters to estimate!
- General: Yes. New words can be formed by concatenating units.

Training issues:
Many triphones occur infrequently, so there is not enough training data.
Solution: clustering of HMM states which have similar statistical distributions, to estimate HMM parameters using pooled data.

HMMs 12 SGN-24006 Isolated word recognition

Whole-word model:
- Collect many examples of each word spoken in isolation
- Assign a number of states to each word model based on word duration
- Estimate HMM model parameters

Subword-unit model:
- Collect a large corpus of speech and estimate phonetic unit HMMs
- Construct word-level HMMs from phoneme-level HMMs
- This is more general than the whole-word approach

HMMs 13 SGN-24006 Whole-word HMM

HMMs 14 SGN-24006 Viterbi algorithm through a model

HMMs 15 SGN-24006 Isolated word recognition system

P(O|W) is calculated using the Viterbi algorithm rather than the forward algorithm. Viterbi provides the probability of the path represented by the most likely state sequence.
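A minimal sketch of such an isolated-word recognizer: the log-domain Viterbi score of the observation sequence is computed for each word HMM and the best-scoring word is returned. The model container and the per-state emission function are assumptions made for the example.

```python
# Sketch of isolated-word decoding with the Viterbi algorithm (names and
# data layout are assumptions): each word has a left-to-right HMM given by
# a log transition matrix log_A and an emission log-probability function
# log_b(state, frame); the word with the best Viterbi path score wins.
import numpy as np

def viterbi_score(log_A, log_B):
    """Log-probability of the single best state path (not the forward sum)."""
    T, S = log_B.shape
    delta = np.full(S, -np.inf)
    delta[0] = log_B[0, 0]                 # left-to-right models start in state 0
    for t in range(1, T):
        # best predecessor for every destination state, then add the emission score
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
    return delta[-1]                       # the path must end in the final state

def recognize(frames, word_models):
    """word_models: dict mapping word -> (log_A, log_b) as described above."""
    best_word, best_score = None, -np.inf
    for word, (log_A, log_b) in word_models.items():
        S = log_A.shape[0]
        log_B = np.array([[log_b(s, x) for s in range(S)] for x in frames])
        score = viterbi_score(log_A, log_B)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```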

HMMs 16 SGN-24006 Connected-word recognition

The boundaries of the utterance are unknown.
The number of words spoken is unknown; the position of word boundaries is often unclear and difficult to determine.
Example: two-word network

HMMs 17 SGN-24006 Connected-words Viterbi search

HMMs 18 SGN-24006 Beam pruning

At each node we must compute the probability of the best state sequence up to that point, and keep the information about where it came from; this allows back-tracing to find the best state sequence. During back-tracing we will find the word boundaries.

Beam pruning:
- At each time point, determine the log-probability of the absolute best Viterbi path, delta_max(t) = max_j delta_j(t)
- Prune state j if delta_j(t) < delta_max(t) - B, where B is the beam width
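A minimal sketch of this pruning rule, assuming the log Viterbi scores of the active states at the current frame are stored in a NumPy array:

```python
# Sketch of the beam pruning rule: keep only states whose log Viterbi score
# is within the beam width B of the current best score; prune the rest.
import numpy as np

def beam_prune(delta, beam_width):
    """delta: log Viterbi scores of the active states at the current frame."""
    best = np.max(delta)
    keep = delta >= best - beam_width          # prune state j if delta[j] < best - B
    return np.where(keep, delta, -np.inf)
```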

HMMs 19 SGN-24006 Beam pruning illustration

HMMs 20 SGN-24006 Token passing approach

Assume each HMM state can hold multiple tokens.
A token is an object that can move from state to state in the HMM network.
Each token carries with it the log-scale Viterbi path score s.

At each time t we examine the tokens assigned to the nodes.
We propagate tokens to reachable positions at time t+1:
- Make a copy of the token
- Adjust the path score to account for the transition within the HMM network and the observation probability

Merge tokens according to the Viterbi algorithm:
- Select the token with the maximum score
- Discard all other competing tokens

HMMs 21 SGN-24006 Token passing algorithm

Initialization (t=0):
- Initialize each initial state to hold a token with score s = 0
- All other states are initialized with a token with score s = -infinity

Algorithm (t>0):
- Propagate tokens to all possible next states (all connecting states) and increment the path score by the transition and observation log-probabilities
- In each state, find the token with the largest s and discard the rest of the tokens in that state (Viterbi)

Termination (t=T):
- Examine the tokens in all possible final states and find the one with the largest Viterbi path score
- This is the probability of the most likely state sequence
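The sketch below puts these steps together for a generic HMM network; the arc-list network representation and the observation log-probability function log_b are illustrative assumptions, not the slides' notation.

```python
# Sketch of token passing over an HMM network (illustrative names). The
# network is a dict "arcs": every state (including final states) maps to a
# list of (next_state, log_transition_prob) pairs; log_b(state, frame) gives
# the log observation probability.
import math
from dataclasses import dataclass, field

@dataclass
class Token:
    score: float                                   # log-scale Viterbi path score s
    history: list = field(default_factory=list)    # visited states, for back-tracing

def token_passing(frames, arcs, log_b, initial_states, final_states):
    # Initialization (t = 0): s = 0 in initial states, -inf everywhere else
    tokens = {s: Token(0.0 if s in initial_states else -math.inf) for s in arcs}
    for x in frames:
        new_tokens = {s: Token(-math.inf) for s in arcs}
        for state, tok in tokens.items():
            if tok.score == -math.inf:
                continue
            for nxt, log_a in arcs[state]:
                # copy the token and adjust its path score for the transition
                # and the observation probability
                score = tok.score + log_a + log_b(nxt, x)
                # merge: keep only the best token in each state (Viterbi)
                if score > new_tokens[nxt].score:
                    new_tokens[nxt] = Token(score, tok.history + [nxt])
        tokens = new_tokens
    # Termination (t = T): best token over the final states
    best = max((tokens[s] for s in final_states), key=lambda t: t.score)
    return best.score, best.history
```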

HMMs 22 SGN-24006 Token propagation illustration

HMMs 23 SGN-24006 Token passing for connected-word recognition

Individual word models are connected into a composite model: tokens can transition from the final state of word m to the initial state of word n.
Path scores are maintained by the tokens.
The path sequence is also maintained by the tokens, allowing recovery of the best word sequence.

Tokens emitted from the last state of each word propagate to the initial state of each word.

The probability of entering the initial state of each word, P(W1), is the probability of that word given by the language model:

s = s + P(W1)

HMMs 24 SGN-24006 Bayes formulation revisited

Recall the Bayes rule applied to speech recognition:

W* = argmax_W P(W|O) = argmax_W P(O|W) P(W)

In practice, we use log-probabilities:

W* = argmax_W [ log P(O|W) + log P(W) ]

where P(W), the probability of the word sequence, is given by the language model.

HMMs 25 SGN-24006 Language models

Usually the language model is also scaled by a grammar scale factor s and a word transition penalty p:

W* = argmax_W [ log P(O|W) + s log P(W) + p N(W) ]

where N(W) is the number of words in the sequence W.

HMMs 26 SGN-24006 Language models

Assign probabilities to word sequences P(W)

The additional information helps to reduce the search space.

Language models resolve homonyms:

Write a letter to Mr. Wright right away.

Tradeoff between constraint and flexibility

HMMs 27 SGN-24006 Statistical language models

We want to estimate the probability of a word sequence W = w_1 w_2 ... w_K, i.e. P(W).

We can decompose this probability left-to-right:

P(W) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_K | w_1 ... w_{K-1})

HMMs 28 SGN-24006 How does this work?

P(W) = P(analysis of audio, speech and music signals)
     = P(analysis) P(of | analysis) P(audio | analysis of) ..

How can we model the entire word sequence? There is never enough training data!
Consider restricting the word history.

HMMs 29 SGN-24006 Practical training

Consider word histories ending in the same last N-1 words, and treat the word sequence as a Markov model:

N = 1:  P(w_i)
N = 2:  P(w_i | w_{i-1})
N = 3:  P(w_i | w_{i-2}, w_{i-1})

HMMs 30 SGN-24006 n-gram language models

Probability of a word based on the previous N-1 words:

P(w_i | w_{i-N+1}, ..., w_{i-1})

- N=1: unigram
- N=2: bigram
- N=3: trigram

Training: probabilities are estimated from a corpus of training data (a large amount of text).
Once the model is trained, it can be used to generate new sentences randomly.
Syntax is roughly encoded by the obtained model, but generated sentences are often ungrammatical and semantically strange.

HMMs 31 SGN-24006 Trigram example

P(states | the united) = ..

P(America | states of) = ..

HMMs 32 SGN-24006 Estimating the n-gram probabilities

Given a text corpus, define:
- C(w_n): count of occurrences of word w_n
- C(w_{n-1}, w_n): count of occurrences of word w_{n-1} followed by word w_n
- C(w_{n-2}, w_{n-1}, w_n): count of occurrences of word w_{n-2} followed by word w_{n-1} and word w_n

HMMs 33 SGN-24006 Estimating the n-gram probabilities

Based on the counts of occurrence of the word sequences, the maximum likelihood estimates of the word probabilities are calculated:

P(w_n) = C(w_n) / (total number of words in the corpus)
P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
P(w_n | w_{n-2}, w_{n-1}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})
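As a small illustration of these count-based estimates, an unsmoothed maximum-likelihood bigram estimator (real systems add smoothing; the toy sentences and start/end markers below are made up for the example):

```python
# Sketch of unsmoothed maximum-likelihood bigram estimation from text,
# following P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1}).
from collections import Counter

def bigram_lm(sentences):
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]   # sentence start/end markers
        unigram.update(words)
        bigram.update(zip(words[:-1], words[1:]))
    return {(w1, w2): c / unigram[w1] for (w1, w2), c in bigram.items()}

# Toy corpus, made up for the example
lm = bigram_lm(["analysis of audio speech and music signals",
                "analysis of speech signals"])
print(lm[("analysis", "of")])   # 1.0  (every "analysis" is followed by "of")
print(lm[("of", "audio")])      # 0.5
```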

HMMs 34 SGN-24006 n-grams in the decoding process

The goal of the search is to find the most likely string of symbols (phonemes, words, etc.) to account for the observed speech waveform.

Connected-word example:

HMMs 35 SGN-24006 Connected-word log-Viterbi search

At each node we must compute the log-domain Viterbi score

delta_j(t) = max_i [ delta_i(t-1) + log a_ij + s * l_ij + p ] + log b_j(o_t)

where l_ij is the log language model score attached to the transition from i to j (it and the penalty p apply only when the transition enters a new word), s is the grammar scale factor and p is the (log) word transition penalty.

HMMs 36 SGN-24006 Beam search revisited

HMMs 37 SGN-24006 Language model in the search

The language model scores are applied at the point where there is a transition INTO a word.
As the number of words increases, the number of states and interconnections increases too.
N-grams are easier to incorporate into the token passing algorithm:

s = s + g P(W1) + p

*Note: here g is the grammar scale factor, as s was used to denote the path score.

The language model score is added to the path score upon word entry, so the token keeps the combined acoustic and language model information.
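A small sketch of that word-entry update inside a token-passing decoder; the bigram table, the scale factor g and the penalty p are illustrative assumptions:

```python
# Sketch of the word-entry update s = s + g * log P(W1) + p in a token-passing
# decoder; bigram_lm, g and p are illustrative assumptions.
import math

def enter_word(token_score, prev_word, next_word, bigram_lm, g=10.0, p=-5.0):
    """Add the scaled language model score and the word transition penalty."""
    lm_prob = bigram_lm.get((prev_word, next_word), 1e-10)   # crude floor, no smoothing
    return token_score + g * math.log(lm_prob) + p
```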

HMMs 38 SGN-24006 Lyrics recognition from singing

Y EH S T ER D EY vs Y EH S . T AH D EY
M AY . M AY vs M AA M AH
AO L . DH AH . W EY vs AO L . AH W EY