
HMMs as Generative Models of Speech


Samudravijaya K
Tata Institute of Fundamental Research, Mumbai

Workshop on Text-to-Speech (TTS) Synthesis, 16-18 June 2014
Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat

Outline of the talk


● Statistical models for TTS

● Probability distributions

  – Normal (Gaussian) distribution
  – Gaussian Mixture Model (GMM)
  – Hidden Markov Model (HMM)

● Generation of speech from models

● Overview of HMM-based speech synthesis system (HTS)

● Training of HMMs

Text to Speech Systems

Waveform concatenation

– 'Cut-and-paste' approach
– Unit selection approach

Speech models

– Articulatory models: speech production model
– Formant synthesis: source-filter model (rules for formant trajectories)
– HTS: statistical models (machine learning)

Statistical models of speech


Why are statistical models appropriate in the context of TTS?

A lot of variability exists in the speech signal due to

– Phonetic context
– Supra-segmental variation: pitch, emphasis, mood

Models are mathematical expressions of a process or phenomenon in terms of a small number of parameters.

Statistics provides a succinct method of describing the aggregate behaviour of an ensemble.

Statistical models represent an ensemble: a collection of similar entities (e.g. phones).

Statistics: mean, variance, skewness, kurtosis

Univariate Gaussian Distribution

• Normal distribution: N(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))

• Parameters:
  – Mean (μ)
  – Variance (σ²)

Estimation of parameters

Probability vs. likelihood (conditional probability)

Maximum Likelihood Estimator

Given x[0], x[1], ..., x[N−1] and a pdf parameterised by θ = (θ1, θ2, ..., θm−1)ᵀ,

we form the likelihood function L(X; θ) = ∏_{i=0}^{N−1} p(xi; θ)

θMLE = arg max_θ L(X; θ)

For the height problem:

⇒ can show θMLE = (1/N) Σ_i xi

⇒ Estimate of the mean of the Gaussian = sample mean of the measured heights.
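As a concrete illustration (not from the slides), here is a minimal numpy sketch that computes the ML estimates of a univariate Gaussian from simulated "height" data; the true mean and standard deviation used to generate the data are arbitrary.

import numpy as np

rng = np.random.default_rng(42)
# Simulated "heights", drawn from a Gaussian whose parameters the estimator does not see
x = rng.normal(loc=170.0, scale=8.0, size=1000)

# ML estimates: the sample mean and the (biased) sample variance maximise the likelihood
mu_mle = x.mean()
var_mle = ((x - mu_mle) ** 2).mean()
print(f"mu_MLE = {mu_mle:.2f}, sigma^2_MLE = {var_mle:.2f}")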


Formant space of vowels


Multi-modal Distributions

• Distribution of a cepstral coefficient of a phone

• Extension to the multi-dimensional case

Training a GMM


• Live demonstration at: http://staff.aist.go.jp/s.akaho/MixtureEM.html

The parameters of a GMM can be trained using the Expectation-Maximization (EM) algorithm.

EM is an iterative algorithm consisting of two steps. It begins with an initial GMM whose parameters may even be random.

In the E-step, the expectation of the log-likelihood of the training (adaptation) data given the current GMM is computed.

In the M-step, the parameters of the GMM are re-estimated so as to maximise this expected log-likelihood.
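As a hedged illustration, the snippet below fits a 2-component GMM with scikit-learn's GaussianMixture, whose fit() method runs these E- and M-steps internally; the bimodal toy data stands in for, say, a cepstral coefficient of a phone.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Bimodal 1-D toy data: two Gaussian clusters
x = np.concatenate([rng.normal(-2.0, 0.5, 500),
                    rng.normal(1.5, 0.8, 500)]).reshape(-1, 1)

# EM training of a 2-component GMM (E-step / M-step iterations run inside fit())
gmm = GaussianMixture(n_components=2, covariance_type="full", max_iter=100, random_state=0)
gmm.fit(x)
print(gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel())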

Generation of speech from statistical models


Consider the vowel /i/:

       Mean (Hz)   Std. dev. (Hz)
  F1      300           100
  F2     2800           500

Such a normal distribution of formant frequencies of the vowel /i/ can generate a large number of formant values centred around the mean values.

Instead of formants, we can model cepstral coefficients. The corresponding normal distribution can then generate any number of MFCC vectors.

MFCC → log power spectrum → speech waveform
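A small sketch of this generative view, using the illustrative table above: draw (F1, F2) pairs for /i/ from independent Gaussians with those means and standard deviations.

import numpy as np

rng = np.random.default_rng(0)
mean = np.array([300.0, 2800.0])   # F1, F2 means from the table above (Hz)
std = np.array([100.0, 500.0])     # F1, F2 standard deviations (Hz)

# Each row is one plausible (F1, F2) realisation of the vowel /i/
samples = rng.normal(mean, std, size=(5, 2))
print(samples)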

Why are HMMs good models of sequences?


Modelling of a Phoneme

To enunciate /aa/ in a word, our articulators move from the configuration of the previous phoneme to that of /aa/, and then proceed to the configuration of the next phoneme.

We can think of 3 distinct time periods:

⇒ Transition from the previous phoneme
⇒ Steady state
⇒ Transition to the next phoneme

The features in these three time intervals are quite different
⇒ use a different density function to model each time interval
⇒ model them as paa,1(·; θaa,1), paa,2(·; θaa,2), paa,3(·; θaa,3)

We also need to model the durations of these time intervals – transition probabilities.

HMM Model of a Phoneme

• Use the term "state" for each of the three time periods.

• The probability of ot from the j-th state, i.e. paa,j(ot; θaa,j), is denoted bj(ot).

(Figure: a 3-state left-to-right HMM; state j emits observations o1, o2, o3, ..., oT with density p(·; θaa,j).)

• Which state density generated the observation ot?

  – Only the observations are seen; the state sequence is "hidden".
  – Recall: in a GMM, the mixture component is "hidden".

What is hidden in a hidden Markov model?


GMM and HMM

(Figure: multi-modal densities p(f) over frequency f (Hz), modelled as GMMs, attached to HMM states 1, 2, 3 with transition probabilities a11, a12, ....)

How to generate speech from an HMM?

Input:
  – A sentence (sequence of words)

Inventory:
  – Pronunciation dictionary
  – Trained HMM models for every phone

Output:
  – Speech waveform

Sentence + pronunciation dictionary
  → sequence of phones
  → sequence of HMM states
  → sequence of feature vectors (spectral + excitation parameters)
  → speech waveform (using the source-filter model)
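The toy sketch below walks through the same chain with made-up numbers: a tiny pronunciation dictionary, per-phone 3-state "HMMs" reduced to per-state Gaussians with fixed durations, and a generated feature trajectory. The final step (feature vectors → waveform) is omitted; a real system would use trained models and an MLSA-type source-filter vocoder.

import numpy as np

rng = np.random.default_rng(0)

# word -> phone sequence (tiny illustrative dictionary)
pron_dict = {"mera": ["m", "e", "r", "aa"]}

# phone -> list of states; each state is (mean vector, std vector, no. of frames).
# All values are made up for illustration.
phone_hmms = {
    "m":  [(np.array([0.1, 0.2]), np.array([0.05, 0.05]), 3)] * 3,
    "e":  [(np.array([0.4, 0.1]), np.array([0.05, 0.05]), 4)] * 3,
    "r":  [(np.array([0.2, 0.3]), np.array([0.05, 0.05]), 2)] * 3,
    "aa": [(np.array([0.6, 0.5]), np.array([0.05, 0.05]), 5)] * 3,
}

def synthesize_features(sentence):
    phones = [p for w in sentence.split() for p in pron_dict[w]]   # words -> phones
    frames = []
    for ph in phones:                                              # phones -> HMM states
        for mean, std, dur in phone_hmms[ph]:                      # states -> feature frames
            frames.append(rng.normal(mean, std, size=(dur, len(mean))))
    return np.vstack(frames)   # a vocoder would turn this trajectory into a waveform

print(synthesize_features("mera").shape)   # (total no. of frames, feature dimension)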

Speech Production Model

Source: Tomoki Toda; WiSSAP 2013

These speech parameters should be modeled by HMMs.

Source: Tomoki Toda; WiSSAP 2013

(Figures omitted. Source: T. Nagarajan, TTS workshop 2012)

Speech: A Dynamic Signal

Additional features: slope and curvature of the trajectories of formants/LSPs

Features modeled by HMMs for TTS systems:
  – Cepstral coefficients (MFCC / LPCC)
  – Delta and delta-delta coefficients

Models for:
  – Excitation source
  – Duration
  – Emotion


System overview of HTS


(Block diagram: in the training part, mel-cepstral analysis and F0 extraction are applied to the SPEECH DATABASE, and context-dependent HMMs and duration models are trained from the resulting mel-cepstral coefficients, F0 and labels. In the synthesis part, TEXT is converted to labels by text analysis; mel-cepstral coefficients and F0 are generated from the HMMs and drive excitation generation and the MLSA filter to produce the SYNTHESIZED SPEECH.)

Source: Zen et al., ICSLP 2004


Training part of HTS

(Block diagram: training data with context-dependent label sequences is used for phoneme alignment; context-independent HMMs are initialised and re-estimated, then copied to context-dependent HMMs; after embedded re-estimation, tree-based clustering is applied separately to spectra, F0 and duration, followed by further embedded re-estimation and duration model generation, yielding the context-dependent HMMs and duration models.)

Source: Zen et al., ICSLP 2004


Synthesis part of HTS

(Block diagram: TEXT is converted to a label sequence by text analysis; a sentence HMM is composed from the context-dependent HMMs and duration models; state durations d1, d2, ... are obtained from the state duration distributions; parameter generation from the HMM produces the mel-cepstrum sequence c1, c2, c3, ..., cT and the F0 sequence p1, p2, p3, ..., pT, which drive excitation generation and the MLSA filter to produce the SYNTHESIZED SPEECH.)

Source: Zen et al., ICSLP 2004


Basic Probability

Joint and conditional probability (definitions):

p(A, B) = p(A|B) p(B) = p(B|A) p(A)

Bayes' rule:

p(A|B) = p(B|A) p(A) / p(B)

If the Ai are mutually exclusive events,

p(B) = p(B|A1) p(A1) + p(B|A2) p(A2) + p(B|A3) p(A3) + ...
     = Σ_i p(B|Ai) p(Ai)

so that

p(A|B) = p(B|A) p(A) / Σ_i p(B|Ai) p(Ai)
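A quick numerical sanity check of the two identities above, with made-up numbers for two mutually exclusive causes A1, A2 and an observed event B:

# Tiny check of total probability and Bayes' rule (illustrative numbers only)
p_A = {"A1": 0.3, "A2": 0.7}           # priors p(Ai)
p_B_given_A = {"A1": 0.9, "A2": 0.2}   # likelihoods p(B|Ai)

p_B = sum(p_B_given_A[a] * p_A[a] for a in p_A)                # total probability
posterior = {a: p_B_given_A[a] * p_A[a] / p_B for a in p_A}    # Bayes' rule
print(p_B, posterior)                                          # posteriors sum to 1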

Chain rule

P(A1, A2, A3, ..., An)
  = P(An | A1, A2, A3, ..., An−1) P(A1, A2, A3, ..., An−1)
  = P(An | A1, A2, A3, ..., An−1) P(An−1 | A1, A2, A3, ..., An−2) P(A1, ..., An−2)
  = P(An | A1, A2, A3, ..., An−1) ... P(A2 | A1) P(A1)
  = product of the P(Ai) if the Ai are independent

HMM: definitions

Assumptions

First-order Markov assumption (finite history):
  P(qt = j | qt−1 = i, qt−2 = k, ...) = P(qt = j | qt−1 = i)

Stationarity (parameters do not change with time):
  P(qt = j | qt−1 = i) = P(qt+l = j | qt+l−1 = i)
  ⇒ exponential (geometric) state duration distribution

Elements of an HMM

N: number of hidden states
Q: set of states: Q = {q1, q2, q3, ..., qN}
B: observation probability distributions: B = {bj}, 1 ≤ j ≤ N
A: state transition probability matrix: A = {aij}, aij = P(qt+1 = j | qt = i), 1 ≤ i, j ≤ N
π: initial state distribution: πi = P(q1 = i), 1 ≤ i ≤ N
λ: the entire model: λ = (A, B, π)

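As a concrete (untrained, illustrative) example of λ = (A, B, π), here is a minimal container for a 3-state left-to-right HMM with univariate Gaussian state outputs; the transition values mirror the flat-start choice aii = ai,i+1 = 0.5 used later in the talk.

import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianHMM:
    pi: np.ndarray      # initial state distribution, shape (N,)
    A: np.ndarray       # transitions, A[i, j] = P(q_{t+1} = j | q_t = i), shape (N, N)
    means: np.ndarray   # per-state Gaussian means, shape (N,)
    stds: np.ndarray    # per-state Gaussian standard deviations, shape (N,)

    def b(self, j, o):
        """Observation likelihood b_j(o) for state j."""
        m, s = self.means[j], self.stds[j]
        return np.exp(-0.5 * ((o - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# A 3-state left-to-right model, as used for one phone (values are made up)
hmm = GaussianHMM(
    pi=np.array([1.0, 0.0, 0.0]),
    A=np.array([[0.5, 0.5, 0.0],
                [0.0, 0.5, 0.5],
                [0.0, 0.0, 1.0]]),
    means=np.array([0.0, 1.0, 2.0]),
    stds=np.array([1.0, 1.0, 1.0]),
)
print(hmm.b(0, 0.3))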



3 problems in HMM

1. Matching: Given an observation sequence O = o1, o2, o3, ..., oT and a trained model λ = (A, B, π), how do we efficiently compute P(O|λ), the likelihood of the model λ generating the observation sequence O?
   Solution: forward algorithm (uses recursion for computational efficiency).
   Use: given two models λ1 and λ2, choose λ1 if P(O|λ1) > P(O|λ2).

2. Optimal path: Given O and λ, how do we find the optimal state sequence Q = q1, q2, q3, ..., qT?
   Solution: Viterbi algorithm (similar to DTW).
   Use: derive the word/phone sequence.

3. Training: How do we estimate the parameters of the model, λ = (A, B, π), that maximise P(O|λ)?
   Solution: forward-backward (Baum-Welch) algorithm.

Training HMMs



Training subword HMMs

An iterative algorithm (Baum-Welch, also known as forward-backward) is used. The maximum-likelihood approach guarantees that the likelihood of the trained model matching the training data increases with each iteration. To begin with, an initial estimate of the HMM parameters (A, B, π) is required.

Q: How do we get an initial estimate of λ = {A, B, π}?
A: We can estimate the parameters if we know the boundaries of every subword HMM in the training utterances.

Practical solution: assume that the durations of all units (phones) are equal. If there are N phones in a training utterance, divide the feature vector sequence into N equal parts and assign each part to a phoneme in the phoneme sequence corresponding to the transcription of the utterance. Repeat for all training utterances.

Basic units of HMM (phone-like units)

(Table of phone-like units and their symbols: vowels a A i I u U e E o O and consonants such as k kh g gh ng, t th d dh n, p ph b bh m, y r l w sh S s h.)

Pronunciation dictionary

* A word is represented as a sequence of units of recognition
* Pronunciation rules can be used
* Manual verification is necessary
  (e.g. kalam vs. kamal; karnaa, pahale, Bhaartiya; pause)

Example entries:

  aage     aa g e
  aaja     aa j
  aba      a b
  abbaasa  a bb aa s
  aatxha   aa t'h


Initial estimation of HMM parameters: an illustration

Let the transcription of the first wave file be the following sequence of words: mera bhaarat mahaan

Let the relevant lines in the dictionary be as follows:
  bhaarata  bh aa r a t
  mahaana   m a h aa n
  mera      m e r aa

The phoneme/HMM sequence (of length 16) corresponding to this sentence is:
  sil m e r aa bh aa r a t m a h aa n sil

If the duration of the wave file is 1.0 s, there will be 98 feature vectors (frame shift = 10 ms, frame size = 25 ms).

Assign the first 6 feature vectors to the "sil" HMM; the next 6 (7 through 12) to "m"; the next 6 (13 through 18) to "e"; ...; the last 8 feature vectors to the final "sil". If each HMM has 3 states, assign 2 feature vectors to each state and compute their mean and SD.

Assume ai,j = 0.5 if j = i or j = i+1; otherwise assign 0.


Initial estimation of HMM parameters

Training data would consist of hundreds of sentences.

For each spoken sentence, repeat the above process of assigning feature vectors to the different phonemes of the sentence.

Thus, each phone would be assigned several sequences of feature vectors. "m" occurred twice in the previous example (mera bhaarat mahaan), so "m" was allocated 6 feature vectors twice from one speech file.

If a phone is modeled by a 3-state HMM, divide each feature vector sequence into 3 equal parts. Collect all feature vectors belonging to the first part of the phoneme and compute their mean and standard deviation: the parameters of the Gaussian distribution N(μ, σ) of the 1st state. Similarly, estimate the parameters of the 2nd and 3rd states of the HMM of phoneme "m".

Repeat the above for each phoneme of the language.

We have estimated B = {bj} (see the sketch below).
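A minimal sketch of this flat-start initialisation, assuming random 2-D features in place of real MFCCs and a single utterance: divide each utterance equally among its phones, split each phone's share equally among 3 states, and pool the vectors per state to obtain the initial Gaussian means and standard deviations.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

utterances = [
    # (feature matrix of shape (T, D), phone transcription) -- toy data only
    (rng.normal(size=(98, 2)), ["sil", "m", "e", "r", "aa", "bh", "aa", "r", "a",
                                "t", "m", "a", "h", "aa", "n", "sil"]),
]

segments = defaultdict(lambda: defaultdict(list))   # phone -> state -> list of chunks

for feats, phones in utterances:
    for ph, chunk in zip(phones, np.array_split(feats, len(phones))):   # equal phone durations
        for state, part in enumerate(np.array_split(chunk, 3)):         # 3 states per phone
            segments[ph][state].append(part)

# Initial B = {b_j}: one Gaussian (mean, std per dimension) for each state of each phone
b_init = {ph: {s: (np.concatenate(parts).mean(axis=0), np.concatenate(parts).std(axis=0))
               for s, parts in states.items()}
          for ph, states in segments.items()}
print(b_init["m"][0])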

Initial estimation of HMM parameters

We have estimated B = {bj}, the state likelihood functions.

Let us now estimate A = {aij}, the state transition probabilities:

  aij = 0.5 if j = i or j = i+1
      = 0.0 otherwise

Assign πi = 0.5 for i = 1 or 2.

Now we have an HMM λ = (A, B, π) for each phoneme, obtained by assuming that all phonemes have equal duration!

Better estimation of HMM parameters

Initial assumption: all phonemes have equal duration
  ⇒ the boundaries between phonemes are equidistant (100 vectors, 16 phones).

Adjust the boundaries for a better estimate of the HMM parameters:

  sil m e r aa bh aa r a t m a h aa n sil   (equal segmentation)
  sil m e r aa bh aa r a t m a h aa n sil   (adjusted boundaries)

Re-estimation of HMM parameters

Adjust the boundaries for a better estimate of the HMM parameters.

Search for the set of phoneme/state boundaries such that the HMM parameters estimated from the revised boundaries represent the training data better, i.e. such that the likelihood of the training data given the current model is the highest.

Then, use this boundary and likelihood information to update the parameters.

  sil m e r aa bh aa r a t m a h aa n sil

Re-estimation of HMM parameters

Search for the set of phoneme/state boundaries such that the likelihood of the training data given the revised parameters is the highest.

For this, we should be able to compute the likelihood of an utterance matching an HMM. In other words, given an utterance represented by a sequence of observations O = o1, o2, o3, o4, o5, o6, ..., oT and a trained HMM λ = (A, B, π), we should be able to compute the likelihood P(O | q, λ).

  sil m e r aa bh aa r a t m a h aa n sil

Match a feature vector sequence with an HMM

P(O, q | λ) = P(O | q, λ) P(q | λ)   because   P(A, B) = P(A|B) P(B)


Match observation (speech vector) sequence with a model

Goal: compute P(o1, o2, o3, ..., oT | λ)

Steps: there are many state sequences (paths). Consider one state sequence q = q1, q2, q3, ..., qT.

If we assume that the observations are independent,
  P(O | q, λ) = ∏_{t=1}^{T} P(ot | qt, λ) = bq1(o1) bq2(o2) ... bqT(oT)

The probability of a particular state sequence is
  P(q | λ) = πq1 aq1q2 aq2q3 ... aqT−1qT

Enumerate all paths and sum the probabilities:
  P(O | λ) = Σ_q P(O | q, λ) P(q | λ)

⇒ N^T state sequences and O(T) calculations per sequence
⇒ O(T · N^T) computational complexity: exponential in the length!

Forward Algorithm: Intuition

(Figure: a trellis of states 1, 2, 3, ..., i, ..., N−1, N against the observation sequence o1, o2, o3, ..., ot, ot+1, ..., oT−1, oT; every state i at time t feeds state j at time t+1 with transition probability aij.)

Let αt(i) = P(o1, o2, ..., ot, qt = i | λ). Then

  αt+1(j) = [ Σ_{i=1}^{N} αt(i) aij ] bj(ot+1)



Forward Algorithm

Define a forward variable αt(i) as
  αt(i) = P(o1, o2, ..., ot, qt = i | λ)

αt(i) is the probability of observing the partial sequence (o1, o2, ..., ot) with ot being generated by the i-th state (i.e., qt = i).

Induction:
  Initialization:  α1(i) = πi bi(o1)
  Recursion:       αt+1(j) = [ Σ_{i=1}^{N} αt(i) aij ] bj(ot+1)
  Termination:     P(O|λ) = Σ_{i=1}^{N} αT(i)

Computational complexity: O(N²T)

Use: match a test speech feature vector sequence with all models; choose λi if P(O|λi) > P(O|λj) ∀ j.
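A direct numpy transcription of the induction above; B[t, j] holds the precomputed state likelihoods bj(ot), and all numbers in the toy example are made up.

import numpy as np

def forward(pi, A, B):
    """Forward algorithm: returns P(O|lambda) and the alpha trellis.

    pi : (N,) initial state probabilities
    A  : (N, N) transition matrix, A[i, j] = P(q_{t+1} = j | q_t = i)
    B  : (T, N) observation likelihoods, B[t, j] = b_j(o_t)
    """
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                        # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]    # recursion, O(N^2) per frame
    return alpha[-1].sum(), alpha               # termination

# Toy example: 3 states, 4 frames
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.1, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.7]])
likelihood, alpha = forward(pi, A, B)
print(likelihood)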



Viterbi Algorithm: Intuition

Problem 2: Given O and λ, how do we find the optimal state sequence Q = q1, q2, q3, ..., qT (the optimal path)?

Define δt(i), the probability of the highest-probability path ending in state i at time t:
  δt(i) = max_{q1, q2, ..., qt−1} P(q1, q2, ..., qt = i, o1, o2, ..., ot | λ)

(Figure: the same trellis as for the forward algorithm; state j at time t+1 is reached from its single best predecessor state instead of summing over all predecessors.)

Viterbi recursion:
  δt+1(j) = max_i δt(i) aij bj(ot+1)

Contrast this with the recursion in the forward algorithm:
  αt+1(j) = Σ_{i=1}^{N} αt(i) aij bj(ot+1)

Viterbi Algorithm

Initialization:
  δ1(i) = πi bi(o1),   1 ≤ i ≤ N
  ψ1(i) = 0

Recursion (2 ≤ t ≤ T, 1 ≤ j ≤ N):
  δt(j) = max_{1≤i≤N} [ δt−1(i) aij ] bj(ot)
  ψt(j) = argmax_{1≤i≤N} [ δt−1(i) aij ]

Termination:
  P* = max_{1≤i≤N} δT(i)
  q*T = argmax_{1≤i≤N} δT(i)

Path (optimal state sequence) backtracking:
  q*t = ψt+1(q*t+1),   t = T−1, T−2, ..., 2, 1
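The same trellis with max and argmax in place of the sum, plus backtracking; it reuses the toy pi, A and B from the forward-algorithm sketch above.

import numpy as np

def viterbi(pi, A, B):
    """Viterbi algorithm: best state path and its probability.

    pi : (N,) initial probabilities; A : (N, N) transitions;
    B  : (T, N) observation likelihoods, B[t, j] = b_j(o_t).
    """
    T, N = B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[0]                           # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A         # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)             # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[t]       # recursion
    path = np.zeros(T, dtype=int)                  # termination and backtracking
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return delta[-1].max(), path

pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]])
print(viterbi(pi, A, B))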


Training

Problem 3: Given training data and its transcription, how do we estimate the parameters of the model, λ = (A, B, π), that maximise the likelihood of the training data under the model, P(O|λ)?

There is no analytic solution because of the problem's complexity, so we employ the Expectation-Maximisation (EM) algorithm, which is iterative:

1) Start with an initial (approximate) model, λ0.
2) E-step: using the current model λ0, compute the likelihood of the training data: P(O|λ) = Σ_{i=1}^{N} αT(i).
3) M-step: re-estimate the parameters λ = (A, B, π) so as to maximise P(O|λ).
4) Stop if the improvement in log-likelihood is insignificant: P(O|λ) − P(O|λ0) < Δ.
5) Else, set λ0 ← λ and go to step 2.

The EM algorithm as applied to ASR is known as the Baum-Welch algorithm; it is also known as the forward-backward algorithm.


Forward-Backward Algorithm: βt(i)

Define a backward variable βt(i) = P(ot+1, ..., oT | qt = i, λ).

Given that we are in state i at time t, βt(i) is the sum of the probabilities of all paths such that the partial sequence ot+1, ..., oT is observed.

Starting with the initial condition at the last speech vector (t = T),
  βT(i) = 1.0,   1 ≤ i ≤ N,

we can recursively compute βt(i) for every state i = 1, 2, ..., N backwards in time (t = T−1, T−2, ..., 2, 1) as follows:

  βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j)

where aij bj(ot+1) accounts for going from state i to each state j, and βt+1(j) is the probability of observing ot+2, ..., oT given that we are in state j at time t+1.

Joint event: state i at time t AND state j at time t+1

Define ξt(i, j) as the probability of the system being in state i at time t and in state j at time t+1:

  ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O|λ)

Source: http://www.shokhirev.com/nikolai/abc/alg/hmm/hmm.html

Re-estimation Formulae: πi and aij

The revised estimate of the initial probability πi is the expected frequency of being in state i at time t = 1:

  πi(new) = Σ_{j=1}^{N} ξ1(i, j)

Estimating Transition Probability

Transition probability from state i to j
  = (number of times a transition was made from i to j) / (total number of transitions made from i)

ξt(i, j) is the probability of being in state i at time t and in state j at time t+1. If we sum ξt(i, j) over all time instants, we get the expected number of times the system was in state i and made a transition to state j. So a revised estimate of the transition probability is

  aij(new) = Σ_{t=1}^{T−1} ξt(i, j)  /  Σ_{t=1}^{T−1} Σ_{j=1}^{N} ξt(i, j)

where the denominator counts all transitions out of state i.



Re-estimation Formulae: bj(ot)

Parameters of the state probability density function

Let us assume that the state output distribution is Gaussian. If there were just one state j, the maximum likelihood estimates of its parameters would be

  μj = (1/T) Σ_{t=1}^{T} ot

  Σj = (1/T) Σ_{t=1}^{T} (ot − μj)(ot − μj)′

* Difficulty: speech HMMs have many states.
* The speech vector ↔ state mapping is unknown, because the state sequence itself is unknown.
* Solution: assign each speech vector to every state in proportion to the likelihood of the system being in that state when the speech vector was observed.


Re-estimation Formulae: bj(ot)

Let Lj(t) denote the probability of being in state j at time t:

  Lj(t) = P(qt = j | O, λ)
        = P(qt = j, O | λ) / P(O|λ)
        = αt(j) βt(j) / Σ_i αT(i)

Revised estimates of the state pdf parameters are

  μj = Σ_{t=1}^{T} Lj(t) ot / Σ_{t=1}^{T} Lj(t)

  Σj = Σ_{t=1}^{T} Lj(t) (ot − μj)(ot − μj)′ / Σ_{t=1}^{T} Lj(t)

The expected values (estimates) are weighted averages, the weights being the probability of being in state j at time t.
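Putting the pieces together, here is a compact numpy sketch of one forward-backward re-estimation step for a single 1-D observation sequence with Gaussian state outputs (no likelihood scaling, so it is only suitable for short toy sequences). It computes α, β, the occupancies Lj(t), the joint events ξt(i, j), and then the updates for π, A and the Gaussian parameters; all numbers in the toy run are made up.

import numpy as np

def baum_welch_step(pi, A, means, stds, obs):
    """One Baum-Welch re-estimation step (single 1-D sequence, Gaussian outputs)."""
    T, N = len(obs), len(pi)
    # b[t, j] = b_j(o_t): Gaussian likelihood of o_t under state j
    b = np.exp(-0.5 * ((obs[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))

    # Forward and backward passes
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * b[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    p_obs = alpha[-1].sum()                                        # P(O | lambda)

    # Occupancies L_j(t) and joint events xi_t(i, j)
    L = alpha * beta / p_obs
    xi = alpha[:-1, :, None] * A[None] * (b[1:] * beta[1:])[:, None, :] / p_obs

    # Re-estimation of pi, A and the state Gaussians
    pi_new = L[0]
    A_new = xi.sum(axis=0) / L[:-1].sum(axis=0)[:, None]
    means_new = (L * obs[:, None]).sum(axis=0) / L.sum(axis=0)
    vars_new = (L * (obs[:, None] - means_new) ** 2).sum(axis=0) / L.sum(axis=0)
    return pi_new, A_new, means_new, np.sqrt(vars_new), p_obs

# Toy run: 2 states, 40 observations drawn from two regimes
rng = np.random.default_rng(0)
obs = np.concatenate([rng.normal(0.0, 1.0, 20), rng.normal(3.0, 1.0, 20)])
pi0 = np.array([0.5, 0.5])
A0 = np.array([[0.9, 0.1], [0.1, 0.9]])
print(baum_welch_step(pi0, A0, np.array([0.5, 2.5]), np.array([1.0, 1.0]), obs)[-1])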


Some remarks

Types of HMM
* Ergodic vs. left-to-right
* Semi-Markov (explicit state duration)
* Discriminative models

Implementational issues
* Number of states
* Initial parameters
* Scaling; addition of log-likelihoods
* Multiple observations (tokens/repetitions)
* Discrete vs. continuous probability functions (with GMMs)
* Concatenation of smaller HMMs → larger HMM


References

◮ Four online tutorials on HMMs are listed at http://speech.tifr.res.in/tutorials/index.html

◮ "Fundamentals of Speech Recognition", Lawrence R. Rabiner, B. H. Juang and B. Yegnanarayana, Pearson Education India, 2008; ISBN 9788177585605.

◮ "Spoken Language Processing: A Guide to Theory, Algorithm and System Development", Xuedong Huang, Alex Acero and Hsiao-Wuen Hon, Prentice Hall PTR, 2001; ISBN 0130226165.

◮ "Hidden Markov Models for Speech Recognition", X. D. Huang, Y. Ariki and M. A. Jack, Edinburgh University Press, 1990.

◮ "Statistical Methods for Speech Recognition", F. Jelinek, The MIT Press, Cambridge, MA, 1998.

◮ "HMM toolbox on MATLAB: discrete HMMs, training and recognition", Kevin Murphy, 2005; http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
