
HMMs as Generative Models of Speech


Samudravijaya K
Tata Institute of Fundamental Research, Mumbai

Workshop on Text-to-Speech (TTS) Synthesis, 16-18 June 2014
Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat

Outline of the talk


● Statistical models for TTS

● Probability distributions

  – Normal (Gaussian) distribution
  – Gaussian Mixture Model (GMM)
  – Hidden Markov Model (HMM)

● Generation of speech from models

● Overview of HMM-based speech synthesis system (HTS)

● Training of HMMs

Text to Speech Systems

Waveform concatenation

– 'Cut-and-paste' approach
– Unit selection approach

Speech models

– Articulatory models: speech production model
– Formant synthesis: source-filter model (rules for formant trajectories)
– HTS: statistical models (machine learning)

Statistical models of speech


Why are statistical models appropriate in the context of TTS?

A lot of variability exists in the speech signal due to

– Phonetic context
– Supra-segmental variation: pitch, emphasis, mood

Models are mathematical expressions of a process or phenomenon in terms of a small number of parameters.

Statistics provides a succinct method of describing the aggregate behaviour of an ensemble.

Statistical models represent an ensemble: a collection of similar entities (e.g. phones).

Statistics: mean, variance, skewness, kurtosis

Univariate Gaussian Distribution

• Normal distribution: N(x; μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))

• Parameters:
  – Mean (μ)
  – Variance (σ²)

Estimation of parameters

Probability vs. likelihood (conditional probability)

Maximum Likelihood Estimator

Given x[0], x[1], ..., x[N−1] and a pdf parameterised by θ = (θ1, θ2, ..., θm−1)ᵀ,

we form the likelihood function L(X; θ) = ∏_{i=0}^{N−1} p(xi; θ)

θMLE = arg max_θ L(X; θ)

For the height problem:

⇒ can show θMLE = (1/N) Σ_i xi

⇒ Estimate of the mean of the Gaussian = sample mean of the measured heights.
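As a concrete illustration (not from the slides), here is a minimal numpy sketch that computes the ML estimates of a univariate Gaussian from simulated "height" data; the true mean and standard deviation used to generate the data are arbitrary.

import numpy as np

rng = np.random.default_rng(42)
# Simulated "heights", drawn from a Gaussian whose parameters the estimator does not see
x = rng.normal(loc=170.0, scale=8.0, size=1000)

# ML estimates: the sample mean and the (biased) sample variance maximise the likelihood
mu_mle = x.mean()
var_mle = ((x - mu_mle) ** 2).mean()
print(f"mu_MLE = {mu_mle:.2f}, sigma^2_MLE = {var_mle:.2f}")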


Formant space of vowels


Multi-modal Distributions

• Distribution of a cepstral coefficient of a phone

• Extension to the multi-dimensional case

Training a GMM


• Live demonstration at: http://staff.aist.go.jp/s.akaho/MixtureEM.html

The parameters of a GMM can be trained using the Expectation-Maximization (EM) algorithm.

EM is an iterative algorithm consisting of two steps. It begins with an initial GMM whose parameters may even be random.

In the E-step, the expectation of the log-likelihood of the training (adaptation) data given the current GMM is computed.

In the M-step, the parameters of the GMM are re-estimated so as to maximise this expected log-likelihood.
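As a hedged illustration, the snippet below fits a 2-component GMM with scikit-learn's GaussianMixture, whose fit() method runs these E- and M-steps internally; the bimodal toy data stands in for, say, a cepstral coefficient of a phone.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Bimodal 1-D toy data: two Gaussian clusters
x = np.concatenate([rng.normal(-2.0, 0.5, 500),
                    rng.normal(1.5, 0.8, 500)]).reshape(-1, 1)

# EM training of a 2-component GMM (E-step / M-step iterations run inside fit())
gmm = GaussianMixture(n_components=2, covariance_type="full", max_iter=100, random_state=0)
gmm.fit(x)
print(gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel())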

Generation of speech from statistical models


Consider the vowel /i/:

       Mean (Hz)   Std. dev. (Hz)
  F1      300           100
  F2     2800           500

Such a normal distribution of formant frequencies of the vowel /i/ can generate a large number of formant values centred around the mean values.

Instead of formants, we can model cepstral coefficients. The corresponding normal distribution can then generate any number of MFCC vectors.

MFCC → log power spectrum → speech waveform
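A small sketch of this generative view, using the illustrative table above: draw (F1, F2) pairs for /i/ from independent Gaussians with those means and standard deviations.

import numpy as np

rng = np.random.default_rng(0)
mean = np.array([300.0, 2800.0])   # F1, F2 means from the table above (Hz)
std = np.array([100.0, 500.0])     # F1, F2 standard deviations (Hz)

# Each row is one plausible (F1, F2) realisation of the vowel /i/
samples = rng.normal(mean, std, size=(5, 2))
print(samples)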

Why are HMMs good models of sequences?


Modelling of a Phoneme

To enunciate /aa/ in a word, our articulators move from the configuration of the previous phoneme to that of /aa/, and then proceed to the configuration of the next phoneme.

We can think of 3 distinct time periods:

⇒ Transition from the previous phoneme
⇒ Steady state
⇒ Transition to the next phoneme

The features in these three time intervals are quite different
⇒ use a different density function to model each time interval
⇒ model them as paa,1(·; θaa,1), paa,2(·; θaa,2), paa,3(·; θaa,3)

We also need to model the durations of these time intervals – transition probabilities.

HMM Model of a Phoneme

• Use the term "state" for each of the three time periods.

• The probability of ot from the j-th state, i.e. paa,j(ot; θaa,j), is denoted bj(ot).

(Figure: a 3-state left-to-right HMM; state j emits observations o1, o2, o3, ..., oT with density p(·; θaa,j).)

• Which state density generated the observation ot?

  – Only the observations are seen; the state sequence is "hidden".
  – Recall: in a GMM, the mixture component is "hidden".

What is hidden in a hidden Markov model?


GMM and HMM

(Figure: multi-modal densities p(f) over frequency f (Hz), modelled as GMMs, attached to HMM states 1, 2, 3 with transition probabilities a11, a12, ....)

How to generate speech from an HMM?

Input:
  – A sentence (sequence of words)

Inventory:
  – Pronunciation dictionary
  – Trained HMM models for every phone

Output:
  – Speech waveform

Sentence + pronunciation dictionary
  → sequence of phones
  → sequence of HMM states
  → sequence of feature vectors (spectral + excitation parameters)
  → speech waveform (using the source-filter model)
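The toy sketch below walks through the same chain with made-up numbers: a tiny pronunciation dictionary, per-phone 3-state "HMMs" reduced to per-state Gaussians with fixed durations, and a generated feature trajectory. The final step (feature vectors → waveform) is omitted; a real system would use trained models and an MLSA-type source-filter vocoder.

import numpy as np

rng = np.random.default_rng(0)

# word -> phone sequence (tiny illustrative dictionary)
pron_dict = {"mera": ["m", "e", "r", "aa"]}

# phone -> list of states; each state is (mean vector, std vector, no. of frames).
# All values are made up for illustration.
phone_hmms = {
    "m":  [(np.array([0.1, 0.2]), np.array([0.05, 0.05]), 3)] * 3,
    "e":  [(np.array([0.4, 0.1]), np.array([0.05, 0.05]), 4)] * 3,
    "r":  [(np.array([0.2, 0.3]), np.array([0.05, 0.05]), 2)] * 3,
    "aa": [(np.array([0.6, 0.5]), np.array([0.05, 0.05]), 5)] * 3,
}

def synthesize_features(sentence):
    phones = [p for w in sentence.split() for p in pron_dict[w]]   # words -> phones
    frames = []
    for ph in phones:                                              # phones -> HMM states
        for mean, std, dur in phone_hmms[ph]:                      # states -> feature frames
            frames.append(rng.normal(mean, std, size=(dur, len(mean))))
    return np.vstack(frames)   # a vocoder would turn this trajectory into a waveform

print(synthesize_features("mera").shape)   # (total no. of frames, feature dimension)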

Speech Production Model

Source: Tomoki Toda; WiSSAP 2013

These speech parameters should be modeled by HMMs.

Source: Tomoki Toda; WiSSAP 2013

(Figures omitted. Source: T. Nagarajan, TTS workshop 2012)

Speech: A Dynamic Signal

Additional features: slope and curvature of the trajectories of formants/LSPs

Features modeled by HMMs for TTS systems:
  – Cepstral coefficients (MFCC / LPCC)
  – Delta and delta-delta coefficients

Models for:
  – Excitation source
  – Duration
  – Emotion


System overview of HTS


(Block diagram: in the training part, mel-cepstral analysis and F0 extraction are applied to the SPEECH DATABASE, and context-dependent HMMs and duration models are trained from the resulting mel-cepstral coefficients, F0 and labels. In the synthesis part, TEXT is converted to labels by text analysis; mel-cepstral coefficients and F0 are generated from the HMMs and drive excitation generation and the MLSA filter to produce the SYNTHESIZED SPEECH.)

Source: Zen et al., ICSLP 2004


Training part of HTS

(Block diagram: training data with context-dependent label sequences is used for phoneme alignment; context-independent HMMs are initialised and re-estimated, then copied to context-dependent HMMs; after embedded re-estimation, tree-based clustering is applied separately to spectra, F0 and duration, followed by further embedded re-estimation and duration model generation, yielding the context-dependent HMMs and duration models.)

Source: Zen et al., ICSLP 2004


Synthesis part of HTS

(Block diagram: TEXT is converted to a label sequence by text analysis; a sentence HMM is composed from the context-dependent HMMs and duration models; state durations d1, d2, ... are obtained from the state duration distributions; parameter generation from the HMM produces the mel-cepstrum sequence c1, c2, c3, ..., cT and the F0 sequence p1, p2, p3, ..., pT, which drive excitation generation and the MLSA filter to produce the SYNTHESIZED SPEECH.)

Source: Zen et al., ICSLP 2004


Basic Probability

Joint and conditional probability (definitions):

p(A, B) = p(A|B) p(B) = p(B|A) p(A)

Bayes' rule:

p(A|B) = p(B|A) p(A) / p(B)

If the Ai are mutually exclusive events,

p(B) = p(B|A1) p(A1) + p(B|A2) p(A2) + p(B|A3) p(A3) + ...
     = Σ_i p(B|Ai) p(Ai)

so that

p(A|B) = p(B|A) p(A) / Σ_i p(B|Ai) p(Ai)
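A quick numerical sanity check of the two identities above, with made-up numbers for two mutually exclusive causes A1, A2 and an observed event B:

# Tiny check of total probability and Bayes' rule (illustrative numbers only)
p_A = {"A1": 0.3, "A2": 0.7}           # priors p(Ai)
p_B_given_A = {"A1": 0.9, "A2": 0.2}   # likelihoods p(B|Ai)

p_B = sum(p_B_given_A[a] * p_A[a] for a in p_A)                # total probability
posterior = {a: p_B_given_A[a] * p_A[a] / p_B for a in p_A}    # Bayes' rule
print(p_B, posterior)                                          # posteriors sum to 1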

Chain rule

P(A1, A2, A3, ..., An)
  = P(An | A1, A2, A3, ..., An−1) P(A1, A2, A3, ..., An−1)
  = P(An | A1, A2, A3, ..., An−1) P(An−1 | A1, A2, A3, ..., An−2) P(A1, ..., An−2)
  = P(An | A1, A2, A3, ..., An−1) ... P(A2 | A1) P(A1)
  = product of the P(Ai) if the Ai are independent

HMM: definitions

Assumptions

First-order Markov assumption (finite history):
  P(qt = j | qt−1 = i, qt−2 = k, ...) = P(qt = j | qt−1 = i)

Stationarity (parameters do not change with time):
  P(qt = j | qt−1 = i) = P(qt+l = j | qt+l−1 = i)
  ⇒ exponential (geometric) state duration distribution

Elements of an HMM

N: number of hidden states
Q: set of states: Q = {q1, q2, q3, ..., qN}
B: observation probability distributions: B = {bj}, 1 ≤ j ≤ N
A: state transition probability matrix: A = {aij}, aij = P(qt+1 = j | qt = i), 1 ≤ i, j ≤ N
π: initial state distribution: πi = P(q1 = i), 1 ≤ i ≤ N
λ: the entire model: λ = (A, B, π)

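As a concrete (untrained, illustrative) example of λ = (A, B, π), here is a minimal container for a 3-state left-to-right HMM with univariate Gaussian state outputs; the transition values mirror the flat-start choice aii = ai,i+1 = 0.5 used later in the talk.

import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianHMM:
    pi: np.ndarray      # initial state distribution, shape (N,)
    A: np.ndarray       # transitions, A[i, j] = P(q_{t+1} = j | q_t = i), shape (N, N)
    means: np.ndarray   # per-state Gaussian means, shape (N,)
    stds: np.ndarray    # per-state Gaussian standard deviations, shape (N,)

    def b(self, j, o):
        """Observation likelihood b_j(o) for state j."""
        m, s = self.means[j], self.stds[j]
        return np.exp(-0.5 * ((o - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# A 3-state left-to-right model, as used for one phone (values are made up)
hmm = GaussianHMM(
    pi=np.array([1.0, 0.0, 0.0]),
    A=np.array([[0.5, 0.5, 0.0],
                [0.0, 0.5, 0.5],
                [0.0, 0.0, 1.0]]),
    means=np.array([0.0, 1.0, 2.0]),
    stds=np.array([1.0, 1.0, 1.0]),
)
print(hmm.b(0, 0.3))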



3 problems in HMM

1. Matching: Given an observation sequence O = o1, o2, o3, ..., oT and a trained model λ = (A, B, π), how do we efficiently compute P(O|λ), the likelihood of the model λ generating the observation sequence O?
   Solution: forward algorithm (uses recursion for computational efficiency).
   Use: given two models λ1 and λ2, choose λ1 if P(O|λ1) > P(O|λ2).

2. Optimal path: Given O and λ, how do we find the optimal state sequence Q = q1, q2, q3, ..., qT?
   Solution: Viterbi algorithm (similar to DTW).
   Use: derive the word/phone sequence.

3. Training: How do we estimate the parameters of the model, λ = (A, B, π), that maximise P(O|λ)?
   Solution: forward-backward (Baum-Welch) algorithm.

Training HMMs



Training subword HMMs

An iterative algorithm (Baum-Welch, also known as forward-backward) is used. The maximum-likelihood approach guarantees that the likelihood of the trained model matching the training data increases with each iteration. To begin with, an initial estimate of the HMM parameters (A, B, π) is required.

Q: How do we get an initial estimate of λ = {A, B, π}?
A: We can estimate the parameters if we know the boundaries of every subword HMM in the training utterances.

Practical solution: assume that the durations of all units (phones) are equal. If there are N phones in a training utterance, divide the feature vector sequence into N equal parts and assign each part to a phoneme in the phoneme sequence corresponding to the transcription of the utterance. Repeat for all training utterances.

Basic units of HMM (phone-like units)

(Table of phone-like units and their symbols: vowels a A i I u U e E o O and consonants such as k kh g gh ng, t th d dh n, p ph b bh m, y r l w sh S s h.)

Pronunciation dictionary

* A word is represented as a sequence of units of recognition
* Pronunciation rules can be used
* Manual verification is necessary
  (e.g. kalam vs. kamal; karnaa, pahale, Bhaartiya; pause)

Example entries:

  aage     aa g e
  aaja     aa j
  aba      a b
  abbaasa  a bb aa s
  aatxha   aa t'h


Initial estimation of HMM parameters: an illustration

Let the transcription of the first wave file be the following sequence of words: mera bhaarat mahaan

Let the relevant lines in the dictionary be as follows:
  bhaarata  bh aa r a t
  mahaana   m a h aa n
  mera      m e r aa

The phoneme/HMM sequence (of length 16) corresponding to this sentence is:
  sil m e r aa bh aa r a t m a h aa n sil

If the duration of the wave file is 1.0 s, there will be 98 feature vectors (frame shift = 10 ms, frame size = 25 ms).

Assign the first 6 feature vectors to the "sil" HMM; the next 6 (7 through 12) to "m"; the next 6 (13 through 18) to "e"; ...; the last 8 feature vectors to the final "sil". If each HMM has 3 states, assign 2 feature vectors to each state and compute their mean and SD.

Assume ai,j = 0.5 if j = i or j = i+1; otherwise assign 0.


Initial estimation of HMM parameters

Training data would consist of hundreds of sentences.

For each spoken sentence, repeat the above process of assigning feature vectors to the different phonemes of the sentence.

Thus, each phone would be assigned several sequences of feature vectors. "m" occurred twice in the previous example (mera bhaarat mahaan), so "m" was allocated 6 feature vectors twice from one speech file.

If a phone is modeled by a 3-state HMM, divide each feature vector sequence into 3 equal parts. Collect all feature vectors belonging to the first part of the phoneme and compute their mean and standard deviation: the parameters of the Gaussian distribution N(μ, σ) of the 1st state. Similarly, estimate the parameters of the 2nd and 3rd states of the HMM of phoneme "m".

Repeat the above for each phoneme of the language.

We have estimated B = {bj} (see the sketch below).
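A minimal sketch of this flat-start initialisation, assuming random 2-D features in place of real MFCCs and a single utterance: divide each utterance equally among its phones, split each phone's share equally among 3 states, and pool the vectors per state to obtain the initial Gaussian means and standard deviations.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

utterances = [
    # (feature matrix of shape (T, D), phone transcription) -- toy data only
    (rng.normal(size=(98, 2)), ["sil", "m", "e", "r", "aa", "bh", "aa", "r", "a",
                                "t", "m", "a", "h", "aa", "n", "sil"]),
]

segments = defaultdict(lambda: defaultdict(list))   # phone -> state -> list of chunks

for feats, phones in utterances:
    for ph, chunk in zip(phones, np.array_split(feats, len(phones))):   # equal phone durations
        for state, part in enumerate(np.array_split(chunk, 3)):         # 3 states per phone
            segments[ph][state].append(part)

# Initial B = {b_j}: one Gaussian (mean, std per dimension) for each state of each phone
b_init = {ph: {s: (np.concatenate(parts).mean(axis=0), np.concatenate(parts).std(axis=0))
               for s, parts in states.items()}
          for ph, states in segments.items()}
print(b_init["m"][0])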

Initial estimation of HMM parameters

We have estimated B = {bj}, the state likelihood functions.

Let us now estimate A = {aij}, the state transition probabilities:

  aij = 0.5 if j = i or j = i+1
      = 0.0 otherwise

Assign πi = 0.5 for i = 1 or 2.

Now we have an HMM λ = (A, B, π) for each phoneme, obtained by assuming that all phonemes have equal duration!

Better estimation of HMM parameters

Initial assumption: all phonemes have equal duration
  ⇒ the boundaries between phonemes are equidistant (100 vectors, 16 phones).

Adjust the boundaries for a better estimate of the HMM parameters:

  sil m e r aa bh aa r a t m a h aa n sil   (equal segmentation)
  sil m e r aa bh aa r a t m a h aa n sil   (adjusted boundaries)

Re-estimation of HMM parameters

Adjust the boundaries for a better estimate of the HMM parameters.

Search for the set of phoneme/state boundaries such that the HMM parameters estimated from the revised boundaries represent the training data better, i.e. such that the likelihood of the training data given the current model is the highest.

Then, use this boundary and likelihood information to update the parameters.

  sil m e r aa bh aa r a t m a h aa n sil

Re-estimation of HMM parameters

Search for the set of phoneme/state boundaries such that the likelihood of the training data given the revised parameters is the highest.

For this, we should be able to compute the likelihood of an utterance matching an HMM. In other words, given an utterance represented by a sequence of observations O = o1, o2, o3, o4, o5, o6, ..., oT and a trained HMM λ = (A, B, π), we should be able to compute the likelihood P(O | q, λ).

  sil m e r aa bh aa r a t m a h aa n sil

Match a feature vector sequence with an HMM

P(O, q | λ) = P(O | q, λ) P(q | λ)   because   P(A, B) = P(A|B) P(B)


Match observation (speech vector) sequence with a model

Goal: compute P(o1, o2, o3, ..., oT | λ)

Steps: there are many state sequences (paths). Consider one state sequence q = q1, q2, q3, ..., qT.

If we assume that the observations are independent,
  P(O | q, λ) = ∏_{t=1}^{T} P(ot | qt, λ) = bq1(o1) bq2(o2) ... bqT(oT)

The probability of a particular state sequence is
  P(q | λ) = πq1 aq1q2 aq2q3 ... aqT−1qT

Enumerate all paths and sum the probabilities:
  P(O | λ) = Σ_q P(O | q, λ) P(q | λ)

⇒ N^T state sequences and O(T) calculations per sequence
⇒ O(T · N^T) computational complexity: exponential in the length!

Forward Algorithm: Intuition

(Figure: a trellis of states 1, 2, 3, ..., i, ..., N−1, N against the observation sequence o1, o2, o3, ..., ot, ot+1, ..., oT−1, oT; every state i at time t feeds state j at time t+1 with transition probability aij.)

Let αt(i) = P(o1, o2, ..., ot, qt = i | λ). Then

  αt+1(j) = [ Σ_{i=1}^{N} αt(i) aij ] bj(ot+1)



Forward Algorithm

Define a forward variable αt(i) as
  αt(i) = P(o1, o2, ..., ot, qt = i | λ)

αt(i) is the probability of observing the partial sequence (o1, o2, ..., ot) with ot being generated by the i-th state (i.e., qt = i).

Induction:
  Initialization:  α1(i) = πi bi(o1)
  Recursion:       αt+1(j) = [ Σ_{i=1}^{N} αt(i) aij ] bj(ot+1)
  Termination:     P(O|λ) = Σ_{i=1}^{N} αT(i)

Computational complexity: O(N²T)

Use: match a test speech feature vector sequence with all models; choose λi if P(O|λi) > P(O|λj) ∀ j.
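A direct numpy transcription of the induction above; B[t, j] holds the precomputed state likelihoods bj(ot), and all numbers in the toy example are made up.

import numpy as np

def forward(pi, A, B):
    """Forward algorithm: returns P(O|lambda) and the alpha trellis.

    pi : (N,) initial state probabilities
    A  : (N, N) transition matrix, A[i, j] = P(q_{t+1} = j | q_t = i)
    B  : (T, N) observation likelihoods, B[t, j] = b_j(o_t)
    """
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                        # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]    # recursion, O(N^2) per frame
    return alpha[-1].sum(), alpha               # termination

# Toy example: 3 states, 4 frames
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.1, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.7]])
likelihood, alpha = forward(pi, A, B)
print(likelihood)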



Viterbi Algorithm: Intuition

Problem 2: Given O and λ, how do we find the optimal state sequence Q = q1, q2, q3, ..., qT (the optimal path)?

Define δt(i), the probability of the highest-probability path ending in state i at time t:
  δt(i) = max_{q1, q2, ..., qt−1} P(q1, q2, ..., qt = i, o1, o2, ..., ot | λ)

(Figure: the same trellis as for the forward algorithm; state j at time t+1 is reached from its single best predecessor state instead of summing over all predecessors.)

Viterbi recursion:
  δt+1(j) = max_i δt(i) aij bj(ot+1)

Contrast this with the recursion in the forward algorithm:
  αt+1(j) = Σ_{i=1}^{N} αt(i) aij bj(ot+1)

Viterbi Algorithm

Initialization:
  δ1(i) = πi bi(o1),   1 ≤ i ≤ N
  ψ1(i) = 0

Recursion (2 ≤ t ≤ T, 1 ≤ j ≤ N):
  δt(j) = max_{1≤i≤N} [ δt−1(i) aij ] bj(ot)
  ψt(j) = argmax_{1≤i≤N} [ δt−1(i) aij ]

Termination:
  P* = max_{1≤i≤N} δT(i)
  q*T = argmax_{1≤i≤N} δT(i)

Path (optimal state sequence) backtracking:
  q*t = ψt+1(q*t+1),   t = T−1, T−2, ..., 2, 1
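The same trellis with max and argmax in place of the sum, plus backtracking; it reuses the toy pi, A and B from the forward-algorithm sketch above.

import numpy as np

def viterbi(pi, A, B):
    """Viterbi algorithm: best state path and its probability.

    pi : (N,) initial probabilities; A : (N, N) transitions;
    B  : (T, N) observation likelihoods, B[t, j] = b_j(o_t).
    """
    T, N = B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[0]                           # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A         # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)             # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[t]       # recursion
    path = np.zeros(T, dtype=int)                  # termination and backtracking
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return delta[-1].max(), path

pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]])
print(viterbi(pi, A, B))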


Training

Problem 3: Given training data and its transcription, how do we estimate the parameters of the model, λ = (A, B, π), that maximise the likelihood of the training data under the model, P(O|λ)?

There is no analytic solution because of the problem's complexity, so we employ the Expectation-Maximisation (EM) algorithm, which is iterative:

1) Start with an initial (approximate) model, λ0.
2) E-step: using the current model λ0, compute the likelihood of the training data: P(O|λ) = Σ_{i=1}^{N} αT(i).
3) M-step: re-estimate the parameters λ = (A, B, π) so as to maximise P(O|λ).
4) Stop if the improvement in log-likelihood is insignificant: P(O|λ) − P(O|λ0) < Δ.
5) Else, set λ0 ← λ and go to step 2.

The EM algorithm as applied to ASR is known as the Baum-Welch algorithm; it is also known as the forward-backward algorithm.


Forward-Backward Algorithm: βt(i)

Define a backward variable βt(i) = P(ot+1, ..., oT | qt = i, λ).

Given that we are in state i at time t, βt(i) is the sum of the probabilities of all paths such that the partial sequence ot+1, ..., oT is observed.

Starting with the initial condition at the last speech vector (t = T),
  βT(i) = 1.0,   1 ≤ i ≤ N,

we can recursively compute βt(i) for every state i = 1, 2, ..., N backwards in time (t = T−1, T−2, ..., 2, 1) as follows:

  βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j)

where aij bj(ot+1) accounts for going from state i to each state j, and βt+1(j) is the probability of observing ot+2, ..., oT given that we are in state j at time t+1.

Joint event: state i at time t AND state j at time t+1

Define ξt(i, j) as the probability of the system being in state i at time t and in state j at time t+1:

  ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O|λ)

Source: http://www.shokhirev.com/nikolai/abc/alg/hmm/hmm.html

Re-estimation Formulae: πi and aij

The revised estimate of the initial probability πi is the expected frequency of being in state i at time t = 1:

  πi(new) = Σ_{j=1}^{N} ξ1(i, j)

Estimating Transition Probability

Transition probability from state i to j
  = (number of times a transition was made from i to j) / (total number of transitions made from i)

ξt(i, j) is the probability of being in state i at time t and in state j at time t+1. If we sum ξt(i, j) over all time instants, we get the expected number of times the system was in state i and made a transition to state j. So a revised estimate of the transition probability is

  aij(new) = Σ_{t=1}^{T−1} ξt(i, j)  /  Σ_{t=1}^{T−1} Σ_{j=1}^{N} ξt(i, j)

where the denominator counts all transitions out of state i.



Re-estimation Formulae: bj(ot)

Parameters of the state probability density function

Let us assume that the state output distribution is Gaussian. If there were just one state j, the maximum likelihood estimates of its parameters would be

  μj = (1/T) Σ_{t=1}^{T} ot

  Σj = (1/T) Σ_{t=1}^{T} (ot − μj)(ot − μj)′

* Difficulty: speech HMMs have many states.
* The speech vector ↔ state mapping is unknown, because the state sequence itself is unknown.
* Solution: assign each speech vector to every state in proportion to the likelihood of the system being in that state when the speech vector was observed.


Re-estimation Formulae: bj(ot)

Let Lj(t) denote the probability of being in state j at time t:

  Lj(t) = P(qt = j | O, λ)
        = P(qt = j, O | λ) / P(O|λ)
        = αt(j) βt(j) / Σ_i αT(i)

Revised estimates of the state pdf parameters are

  μj = Σ_{t=1}^{T} Lj(t) ot / Σ_{t=1}^{T} Lj(t)

  Σj = Σ_{t=1}^{T} Lj(t) (ot − μj)(ot − μj)′ / Σ_{t=1}^{T} Lj(t)

The expected values (estimates) are weighted averages, the weights being the probability of being in state j at time t.
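Putting the pieces together, here is a compact numpy sketch of one forward-backward re-estimation step for a single 1-D observation sequence with Gaussian state outputs (no likelihood scaling, so it is only suitable for short toy sequences). It computes α, β, the occupancies Lj(t), the joint events ξt(i, j), and then the updates for π, A and the Gaussian parameters; all numbers in the toy run are made up.

import numpy as np

def baum_welch_step(pi, A, means, stds, obs):
    """One Baum-Welch re-estimation step (single 1-D sequence, Gaussian outputs)."""
    T, N = len(obs), len(pi)
    # b[t, j] = b_j(o_t): Gaussian likelihood of o_t under state j
    b = np.exp(-0.5 * ((obs[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))

    # Forward and backward passes
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * b[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])
    p_obs = alpha[-1].sum()                                        # P(O | lambda)

    # Occupancies L_j(t) and joint events xi_t(i, j)
    L = alpha * beta / p_obs
    xi = alpha[:-1, :, None] * A[None] * (b[1:] * beta[1:])[:, None, :] / p_obs

    # Re-estimation of pi, A and the state Gaussians
    pi_new = L[0]
    A_new = xi.sum(axis=0) / L[:-1].sum(axis=0)[:, None]
    means_new = (L * obs[:, None]).sum(axis=0) / L.sum(axis=0)
    vars_new = (L * (obs[:, None] - means_new) ** 2).sum(axis=0) / L.sum(axis=0)
    return pi_new, A_new, means_new, np.sqrt(vars_new), p_obs

# Toy run: 2 states, 40 observations drawn from two regimes
rng = np.random.default_rng(0)
obs = np.concatenate([rng.normal(0.0, 1.0, 20), rng.normal(3.0, 1.0, 20)])
pi0 = np.array([0.5, 0.5])
A0 = np.array([[0.9, 0.1], [0.1, 0.9]])
print(baum_welch_step(pi0, A0, np.array([0.5, 2.5]), np.array([1.0, 1.0]), obs)[-1])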


Some remarks

Types of HMM
* Ergodic vs. left-to-right
* Semi-Markov (explicit state duration)
* Discriminative models

Implementational issues
* Number of states
* Initial parameters
* Scaling; addition of log-likelihoods
* Multiple observations (tokens/repetitions)
* Discrete vs. continuous probability functions (with GMMs)
* Concatenation of smaller HMMs → larger HMM


References

◮ Four online tutorials on HMMs are listed at http://speech.tifr.res.in/tutorials/index.html

◮ "Fundamentals of Speech Recognition", Lawrence R. Rabiner, B. H. Juang and B. Yegnanarayana, Pearson Education India, 2008; ISBN 9788177585605.

◮ "Spoken Language Processing: A Guide to Theory, Algorithm and System Development", Xuedong Huang, Alex Acero and Hsiao-Wuen Hon, Prentice Hall PTR, 2001; ISBN 0130226165.

◮ "Hidden Markov Models for Speech Recognition", X. D. Huang, Y. Ariki and M. A. Jack, Edinburgh University Press, 1990.

◮ "Statistical Methods for Speech Recognition", F. Jelinek, The MIT Press, Cambridge, MA, 1998.

◮ "HMM toolbox on MATLAB: discrete HMMs, training and recognition", Kevin Murphy, 2005; http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
