
Gaussian Mixture Model (GMM) and
Hidden Markov Model (HMM)

Samudravijaya K

The majority of the slides are taken from S. Umesh's tutorial on ASR (WiSSAP 2006).

Pattern Recognition

[Block diagram: an input signal is processed; during training a model is generated, and during testing the input pattern is matched against the model to produce the output.]

GMM: static patterns
HMM: sequential patterns

Basic Probability

Joint and conditional probability:

p(A, B) = p(A|B) p(B) = p(B|A) p(A)

Bayes' rule:

p(A|B) = p(B|A) p(A) / p(B)

If the events Ai are mutually exclusive and exhaustive,

p(B) = ∑i p(B|Ai) p(Ai)

so that

p(A|B) = p(B|A) p(A) / ∑i p(B|Ai) p(Ai)

Normal Distribution

Many phenomena are described by a Gaussian pdf:

p(x|θ) = (1 / √(2πσ²)) exp( −(x − µ)² / (2σ²) )    (1)

The pdf is parameterised by θ = [µ, σ²], where the mean is µ and the variance is σ².

It is a convenient pdf: second-order statistics are sufficient.

Example: heights of Pygmies ⇒ Gaussian pdf with µ = 4 ft and std-dev σ = 1 ft
OR: heights of Bushmen ⇒ Gaussian pdf with µ = 6 ft and std-dev σ = 1 ft

Question: if we arbitrarily pick a person from one population, what is the probability of the height being a particular value?

If I arbitrarily pick a Pygmy with height x, then

Pr(height of x = 4′1′′) = (1 / √(2π · 1)) exp( −(4′1′′ − 4′)² / (2 · 1) )    (2)

Note: here the mean and variance are fixed; only the observation x changes.

Also note: Pr(x = 4′1′′) ≫ Pr(x = 5′) and Pr(x = 4′1′′) ≫ Pr(x = 3′)
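A quick numerical check of equation (2) (a minimal sketch; the 4 ft mean and 1 ft standard deviation are the values assumed on the slide, with heights expressed in feet):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x; mu, sigma2)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Pygmy model from the slide: mu = 4 ft, sigma = 1 ft
print(gaussian_pdf(4 + 1 / 12, mu=4.0, sigma2=1.0))  # height 4'1" -> density about 0.40
print(gaussian_pdf(5.0, mu=4.0, sigma2=1.0))         # height 5'   -> about 0.24
print(gaussian_pdf(3.0, mu=4.0, sigma2=1.0))         # height 3'   -> about 0.24
```

Strictly, these are density values rather than probabilities, which is why they need not be small numbers.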

Conversely: given that a person's height is 4′1′′, the person is more likely to be a Pygmy than a Bushman.

If we observe the heights of many persons – say 3′6′′, 4′1′′, 3′8′′, 4′5′′, 4′7′′, 4′, 6′5′′ –

and all are from the same population (i.e., either Pygmies or Bushmen),

then we are more certain that the population is Pygmy.

The more observations we have, the better our decision will be.

Likelihood Function

x[0], x[1], . . . , x[N − 1] ⇒ a set of independent observations from a pdf parameterised by θ.

Previous example: x[0], x[1], . . . , x[N − 1] are the observed heights and θ is the mean of the density, which is unknown (σ² is assumed known).

L(X; θ) = p(x[0], . . . , x[N − 1]; θ) = ∏i p(x[i]; θ)

        = (1 / (2πσ²)^(N/2)) exp( −(1 / 2σ²) ∑i (x[i] − θ)² )    (3)

L(X; θ), viewed as a function of θ, is called the likelihood function.

Given x[0], . . . , x[N − 1], what can we say about the value of θ, i.e., what is the best estimate of θ?
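A minimal sketch of equation (3) in code (assuming, as in the text, i.i.d. scalar observations with known variance; the log-likelihood is computed to avoid numerical underflow, and the height values below are made up for illustration):

```python
import numpy as np

def log_likelihood(x, theta, sigma2):
    """log L(X; theta) for i.i.d. Gaussian observations with known variance sigma2."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - theta) ** 2) / (2 * sigma2)

heights = np.array([3.5, 4.08, 3.67, 4.42, 4.58, 4.0])   # observed heights in feet
for theta in (3.0, 4.0, 5.0):
    print(theta, log_likelihood(heights, theta, sigma2=1.0))
```

Evaluating it at several candidate values of θ already hints at the maximum-likelihood idea on the next slide.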

Maximum Likelihood Estimation

Example: we know the height of a person, x[0] = 4′4′′.

Which pdf is it most likely to have come from: θ = 3′, 4′6′′ or 6′?

The maximum of L(x[0]; θ = 3′), L(x[0]; θ = 4′6′′) and L(x[0]; θ = 6′) ⇒ choose θ = 4′6′′.

If θ is just a parameter (not restricted to a few candidates), we choose arg maxθ L(x[0]; θ).

Maximum Likelihood Estimator

Given x[0], x[1], . . . , x[N − 1] and a pdf parameterised by θ = [θ1, θ2, . . . , θm−1]ᵀ,

we form the likelihood function L(X; θ) = ∏i p(x[i]; θ)

θMLE = arg maxθ L(X; θ)

For the height problem, one can show

θMLE = (1/N) ∑i x[i]

⇒ the ML estimate of the mean of the Gaussian is the sample mean of the measured heights.
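A quick numerical check of this result (a sketch; the grid search over candidate means is only for illustration, and the heights are made-up values):

```python
import numpy as np

def log_likelihood(x, theta, sigma2):
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) - np.sum((x - theta) ** 2) / (2 * sigma2)

heights = np.array([3.5, 4.08, 3.67, 4.42, 4.58, 4.0])

# Maximise the likelihood over a fine grid of candidate means ...
grid = np.linspace(3.0, 6.0, 3001)
theta_mle = grid[np.argmax([log_likelihood(heights, th, sigma2=1.0) for th in grid])]

# ... and compare with the closed-form answer: the sample mean.
print(theta_mle, heights.mean())
```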

Bayesian Estimation

• MLE ⇒ θ is assumed unknown but deterministic.

• Bayesian approach: θ is assumed random with pdf p(θ) ⇒ prior knowledge.

p(θ|x) [posterior] = p(x|θ) p(θ) / p(x) ∝ p(x|θ) p(θ) [likelihood × prior]

• Height problem: the unknown mean is random, with a Gaussian pdf N(γ, ν²):

p(µ) = (1 / √(2πν²)) exp( −(µ − γ)² / (2ν²) )

Then, with n observations and sample mean x̄,

µBayesian = (σ²γ + nν²x̄) / (σ² + nν²)

⇒ a weighted average of the sample mean and the prior mean.
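A sketch of the posterior-mean formula above (assuming n observations with known variance σ² and a Gaussian prior N(γ, ν²) on the mean; the numbers are illustrative):

```python
import numpy as np

def bayesian_mean(x, gamma, nu2, sigma2):
    """Posterior mean of mu given data x, prior N(gamma, nu2) and known variance sigma2."""
    n = len(x)
    xbar = np.mean(x)
    return (sigma2 * gamma + n * nu2 * xbar) / (sigma2 + n * nu2)

x = np.array([4.1, 3.9, 4.3])
# With few observations, the estimate is pulled from the sample mean towards the prior mean gamma.
print(bayesian_mean(x, gamma=4.5, nu2=0.25, sigma2=1.0))
```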

Gaussian Mixture Model

p(x) = α p(x|N(µ1; σ1)) + (1 − α) p(x|N(µ2; σ2))

p(x) = ∑m wm p(x|N(µm; σm)),   with ∑m wm = 1

Characteristics of a GMM: just as ANNs are universal approximators of functions, GMMs are universal approximators of densities (provided a sufficient number of mixture components is used); this is true for diagonal-covariance GMMs as well.
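A minimal sketch of evaluating such a mixture density in one dimension (the weights, means and standard deviations below are illustrative, not taken from the slides):

```python
import numpy as np

def gmm_pdf(x, weights, means, sigmas):
    """p(x) = sum_m w_m N(x; mu_m, sigma_m), evaluated at an array of points x."""
    x = np.atleast_1d(x)[:, None]                                          # shape (len(x), 1)
    comp = np.exp(-(x - means) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return comp @ weights                                                  # weighted sum over components

w = np.array([0.3, 0.5, 0.2])          # mixture weights, summing to 1
mu = np.array([-2.0, 0.0, 3.0])
sd = np.array([0.5, 1.0, 0.8])
print(gmm_pdf([-2.0, 0.0, 3.0], w, mu, sd))
```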

General Assumption in GMM

• Assume that there are M components.

• Each component generates data from a Gaussian with mean µm and covariance matrix Σm.

GMM

Consider the probability density function shown in solid blue in the figure (not reproduced here).

It is useful to parameterise, or "model", this seemingly arbitrary "blue" pdf.

Gaussian Mixture Model (Contd.)

Actually, the pdf is a mixture of 3 Gaussians, i.e.,

p(x) = c1 N(x; µ1, σ1) + c2 N(x; µ2, σ2) + c3 N(x; µ3, σ3),   with ∑ ci = 1    (4)

pdf parameters: c1, c2, c3, µ1, µ2, µ3, σ1, σ2, σ3

Observation from GMM

Experiment: an urn contains balls of 3 different colours: red, blue or green. Behind a curtain, a person picks a ball from the urn.

If a red ball ⇒ generate x[i] from N(x; µ1, σ1)
If a blue ball ⇒ generate x[i] from N(x; µ2, σ2)
If a green ball ⇒ generate x[i] from N(x; µ3, σ3)

We have access only to the observations x[0], x[1], . . . , x[N − 1].

Therefore: p(x[i]; θ) = c1 N(x; µ1, σ1) + c2 N(x; µ2, σ2) + c3 N(x; µ3, σ3),

but we do not know which component x[i] comes from!

Can we estimate the parameters θ = [c1 c2 c3 µ1 µ2 µ3 σ1 σ2 σ3]ᵀ from the observations?

arg maxθ p(X; θ) = arg maxθ ∏i p(x[i]; θ)    (5)

Estimation of Parameters of GMM

Easier problem: we know the component for each observation.

Obs:   x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]
Comp.   1    2    2    1    3    1    3    3    2    2    3     3     3

X1 = {x[0], x[3], x[5]} belongs to p1(x; µ1, σ1)
X2 = {x[1], x[2], x[8], x[9]} belongs to p2(x; µ2, σ2)
X3 = {x[4], x[6], x[7], x[10], x[11], x[12]} belongs to p3(x; µ3, σ3)

From X1 = {x[0], x[3], x[5]}:

c1 = 3/13,   µ1 = (1/3) (x[0] + x[3] + x[5]),

σ1² = (1/3) { (x[0] − µ1)² + (x[3] − µ1)² + (x[5] − µ1)² }

In practice we do not know which observation comes from which pdf.

⇒ How do we then solve for arg maxθ p(X; θ)?
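With known labels, the estimates above are just counts, means and variances over each labelled subset. A small sketch (the observation values are made up; the labels follow the table above):

```python
import numpy as np

x = np.array([2.1, 5.0, 4.8, 1.9, 9.2, 2.3, 8.8, 9.0, 5.3, 5.1, 9.4, 8.7, 9.1])
labels = np.array([1, 2, 2, 1, 3, 1, 3, 3, 2, 2, 3, 3, 3])   # component of each x[i]

for k in (1, 2, 3):
    xk = x[labels == k]
    c = len(xk) / len(x)               # mixture weight c_k
    mu = xk.mean()                     # component mean
    var = ((xk - mu) ** 2).mean()      # component variance (ML estimate)
    print(f"component {k}: c = {c:.3f}, mu = {mu:.3f}, var = {var:.3f}")
```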

Incomplete & Complete Data

x[0], x[1], . . . , x[N − 1] ⇒ incomplete data.

Introduce another set of variables y[0], y[1], . . . , y[N − 1]

such that y[i] = 1 if x[i] ∈ p1, y[i] = 2 if x[i] ∈ p2 and y[i] = 3 if x[i] ∈ p3.

Obs:   x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]
Comp.   1    2    2    1    3    1    3    3    2    2    3     3     3
miss:  y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9] y[10] y[11] y[12]

y[i] is the missing (unobserved) data ⇒ information about the component.

z = (x, y) is the complete data ⇒ the observations and which density they come from.

p(z; θ) = p(x, y; θ)   ⇒   θMLE = arg maxθ p(z; θ)

Question: but how do we find which observation belongs to which density?

Given observation x[0] and a current guess θg of the parameters, what is the probability of x[0] coming from the first component? (The superscript g denotes the current guess.)

p(y[0] = 1 | x[0]; θg)

= p(y[0] = 1, x[0]; θg) / p(x[0]; θg)

= p(x[0] | y[0] = 1; µ1g, σ1g) · p(y[0] = 1) / ∑j=1..3 p(x[0] | y[0] = j; θg) p(y[0] = j)

= p(x[0] | y[0] = 1; µ1g, σ1g) · c1g / [ p(x[0] | y[0] = 1; µ1g, σ1g) c1g + p(x[0] | y[0] = 2; µ2g, σ2g) c2g + . . . ]

All parameters are known ⇒ we can calculate p(y[0] = 1 | x[0]; θg).

(Similarly, calculate p(y[0] = 2 | x[0]; θg) and p(y[0] = 3 | x[0]; θg).)

Which density? ⇒ y[0] = arg maxi p(y[0] = i | x[0]; θg) – hard allocation

Parameter Estimation for Hard Allocation

                        x[0]   x[1]   x[2]   x[3]   x[4]   x[5]   x[6]
p(y[j] = 1|x[j]; θg)    0.5    0.6    0.2    0.1    0.2    0.4    0.2
p(y[j] = 2|x[j]; θg)    0.25   0.3    0.75   0.3    0.7    0.5    0.6
p(y[j] = 3|x[j]; θg)    0.25   0.1    0.05   0.6    0.1    0.1    0.2
Hard assignment         y[0]=1 y[1]=1 y[2]=2 y[3]=3 y[4]=2 y[5]=2 y[6]=2

Updated parameters: c1 = 2/7, c2 = 4/7, c3 = 1/7 (different from the initial guess!)

Similarly (for a Gaussian), find µi and σi² for the ith pdf.

Parameter Estimation for Soft Assignment

                        x[0]   x[1]   x[2]   x[3]   x[4]   x[5]   x[6]
p(y[j] = 1|x[j]; θg)    0.5    0.6    0.2    0.1    0.2    0.4    0.2
p(y[j] = 2|x[j]; θg)    0.25   0.3    0.75   0.3    0.7    0.5    0.6
p(y[j] = 3|x[j]; θg)    0.25   0.1    0.05   0.6    0.1    0.1    0.2

Example: probability of each sample belonging to component 1:

p(y[0] = 1|x[0]; θg), p(y[1] = 1|x[1]; θg), p(y[2] = 1|x[2]; θg), . . .

The average probability that a sample belongs to component 1 is

c1new = (1/N) ∑i p(y[i] = 1 | x[i]; θg)

      = (0.5 + 0.6 + 0.2 + 0.1 + 0.2 + 0.4 + 0.2) / 7 = 2.2 / 7

Soft Assignment – Estimation of Means & Variances

Recall: the probability of sample j belonging to component i is p(y[j] = i | x[j]; θg).

Soft assignment: the parameters are estimated by taking weighted averages!

µ1new = ∑i x[i] p(y[i] = 1 | x[i]; θg) / ∑i p(y[i] = 1 | x[i]; θg)

(σ1²)new = ∑i (x[i] − µ1new)² p(y[i] = 1 | x[i]; θg) / ∑i p(y[i] = 1 | x[i]; θg)

These are the updated parameters obtained starting from the initial guess θg.

Maximum Likelihood Estimation of Parameters of GMM

1. Make an initial guess of the parameters: θg = { c1g, c2g, c3g, µ1g, µ2g, µ3g, σ1g, σ2g, σ3g }.

2. Using the parameters θg, find the probability of sample x[i] belonging to the jth component:

   p(y[i] = j | x[i]; θg)   for i = 1, 2, . . . , N (number of observations)
                            and j = 1, 2, . . . , M (number of components)

3. cjnew = (1/N) ∑i p(y[i] = j | x[i]; θg)

4. µjnew = ∑i x[i] p(y[i] = j | x[i]; θg) / ∑i p(y[i] = j | x[i]; θg)

5. (σj²)new = ∑i (x[i] − µjnew)² p(y[i] = j | x[i]; θg) / ∑i p(y[i] = j | x[i]; θg)

6. Go back to step (2) and repeat until convergence.
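The six steps above, written out as a compact sketch for a 1-D GMM (a bare-bones illustration: scalar data, a fixed number of iterations instead of a convergence test, and none of the safeguards against collapsing variances or the initialisation issues mentioned on the next slides):

```python
import numpy as np

def em_gmm_1d(x, n_components, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture: returns weights c, means mu, variances var."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Step 1: initial guess
    c = np.full(n_components, 1.0 / n_components)
    mu = rng.choice(x, n_components, replace=False)
    var = np.full(n_components, x.var())

    for _ in range(n_iter):
        # Step 2 (E-step): p(y[i] = j | x[i]; theta_g) for every sample and component
        comp = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = comp * c
        resp /= resp.sum(axis=1, keepdims=True)

        # Steps 3-5 (M-step): weighted re-estimation of c, mu, var
        nk = resp.sum(axis=0)
        c = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        # Step 6: repeat (here for a fixed number of iterations)
    return c, mu, var

# Example: data drawn from two Gaussians
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 0.5, 200)])
print(em_gmm_1d(data, n_components=2))
```

On this well-separated example the recovered means should come out close to 0 and 5, the values used to generate the data.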


Live demonstration: http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html

Practical Issues

• EM can get stuck in local optima (it need not find the global maximum of the likelihood).

• EM is very sensitive to initial conditions; a good initial guess helps. The k-means algorithm is often used to initialise the parameters before applying EM.

Size of a GMM

The Bayesian Information Criterion (BIC) value of a GMM can be defined as follows:

BIC(G | X) = log p(X | G) − (d/2) log N

where

G represents the GMM with the ML parameter configuration,
d represents the number of parameters in G,
N is the size of the dataset;

the first term is the log-likelihood term and the second term is the model-complexity penalty term.

BIC selects the best GMM, corresponding to the largest BIC value, by trading off these two terms.

Source: Boosting GMM and Its Two Applications, F. Wang, C. Zhang and N. Lu, in N.C. Oza et al. (Eds.), LNCS 3541, pp. 12-21, 2005.

The BIC criterion can discover the true GMM size effectively, as shown in the figure (not reproduced here).
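A sketch of the BIC comparison (assuming the maximised log-likelihood of each fitted GMM is already available; for a 1-D GMM with M components a common parameter count is d = 3M − 1: M − 1 free weights, M means and M variances; the log-likelihood values below are made up):

```python
import numpy as np

def bic_1d_gmm(log_likelihood, n_components, n_samples):
    """BIC(G | X) = log p(X | G) - (d / 2) log N for a 1-D GMM with n_components."""
    d = 3 * n_components - 1
    return log_likelihood - 0.5 * d * np.log(n_samples)

# Candidate model sizes with their (made-up) maximised log-likelihoods.
candidates = {1: -2150.0, 2: -1890.0, 3: -1885.0, 4: -1883.0}
n = 1000
best = max(candidates, key=lambda m: bic_1d_gmm(candidates[m], m, n))
print(best)   # the size with the largest BIC value (2 here)
```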

Maximum A Posteriori (MAP)

• Sometimes it is difficult to get a sufficient number of examples for robust estimation of parameters.

• However, one may have access to a large number of similar examples which can be utilized.

• Adapt the target distribution from such a distribution. For example, adapt a speaker-independent model to a new speaker using a small amount of adaptation data.

MAP Adaptation

ML estimation:

θMLE = arg maxθ p(X|θ)

MAP estimation:

θMAP = arg maxθ p(θ|X)
     = arg maxθ p(X|θ) p(θ) / p(X)
     = arg maxθ p(X|θ) p(θ)

p(θ) is the a priori distribution of the parameters θ.

A conjugate prior is chosen such that the corresponding posterior belongs to the same functional family as the prior.

Simple Implementation of MAP-GMMs

Source: Statistical Machine Learning from Data: GMM; Samy Bengio


Simple Implementation

Train a prior model p with a large amount of available data (say, from multiple speakers). Adapt its parameters to a new speaker using some adaptation data X.

Let α ∈ [0, 1] be a parameter that describes the faith in the prior model.

Adapted weight of the jth mixture:

wj = γ [ α wjp + (1 − α) ∑i p(j|xi) ]

Here wjp is the prior-model weight and γ is a normalization factor such that ∑j wj = 1.

Simple Implementation (contd.)

Means:

µj = α µjp + (1 − α) ∑i p(j|xi) xi / ∑i p(j|xi)

⇒ a weighted average of the sample mean and the prior mean.

Variances:

σj = α ( σjp + µjp µjpᵀ ) + (1 − α) ∑i p(j|xi) xi xiᵀ / ∑i p(j|xi) − µj µjᵀ
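A sketch of these adaptation equations for a 1-D GMM (assuming the component posteriors p(j|xi) under the prior model have already been computed; the 1/N factor in the weight update is an assumption made here so that γ reduces to a simple renormalisation):

```python
import numpy as np

def map_adapt_1d(x, resp, alpha, w_p, mu_p, var_p):
    """MAP-adapt the weights, means and variances of a 1-D GMM.
    x: (N,) adaptation data; resp: (N, M) posteriors p(j | x_i) under the prior model;
    alpha in [0, 1] is the faith in the prior parameters (w_p, mu_p, var_p)."""
    nk = resp.sum(axis=0)                                     # soft count for each component
    w = alpha * w_p + (1 - alpha) * nk / len(x)               # assumed 1/N convention
    w /= w.sum()                                              # gamma: ensure the weights sum to 1
    mu = alpha * mu_p + (1 - alpha) * (resp * x[:, None]).sum(axis=0) / nk
    second_moment = (resp * x[:, None] ** 2).sum(axis=0) / nk
    var = alpha * (var_p + mu_p ** 2) + (1 - alpha) * second_moment - mu ** 2
    return w, mu, var
```

With α close to 1 the adapted parameters stay near the prior (speaker-independent) model; with α close to 0 they follow the adaptation data.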

HMM

• The primary role of the speech signal is to carry a message; a sequence of sounds (phonemes) encodes a sequence of words.

• The acoustic manifestation of a phoneme is mostly determined by:

  – the configuration of the articulators (jaw, tongue, lips)
  – the physiology and emotional state of the speaker
  – the phonetic context

• An HMM models sequential patterns, and speech is a sequential pattern.

• Most text-dependent speaker recognition systems use HMMs.

• Text verification involves verification/recognition of phonemes.

Phoneme recognition

Consider two phoneme classes, /aa/ and /iy/.

Problem: determine to which class a given sound belongs.

Processing of the speech signal results in a sequence of feature (observation) vectors o1, . . . , oT (say, MFCC vectors).

We say the speech is /aa/ if p(aa|O) > p(iy|O).

Using Bayes' rule, we compare

p(O|aa) p(aa) / p(O)   vs.   p(O|iy) p(iy) / p(O)

where p(O|·) is the acoustic model and p(·) is the prior probability.

Given p(O|aa), p(aa), p(O|iy) and p(iy), which is more probable?

Parameter Estimation of Acoustic Model

How do we find the density functions paa(·) and piy(·)?

We assume a parametric model: paa(·) parameterised by θaa and piy(·) parameterised by θiy.

Training phase: collect many examples of /aa/ being said and compute the corresponding observations o1, . . . , oTaa.

Use the maximum likelihood principle:

θaa = arg maxθaa p(O; θaa)

Recall: if the pdf is modelled as a Gaussian mixture model, then we use the EM algorithm.

Modelling of Phoneme

To enunciate /aa/ in a word, our articulators move from the configuration of the previous phoneme to that of /aa/, and then proceed to the configuration of the next phoneme.

We can think of 3 distinct time periods:

⇒ transition from the previous phoneme
⇒ steady state
⇒ transition to the next phoneme

The features for the three time intervals are quite different

⇒ use a different density function to model each of the three time intervals

⇒ model them as paa1(·; θaa1), paa2(·; θaa2), paa3(·; θaa3)

We also need to model the durations of these time intervals – transition probabilities.

Stochastic Model (HMM)

[Figure: a 3-state left-to-right HMM; states 1, 2, 3 each have an output density p(f) over frequency f (Hz), with transition probabilities such as a11 and a12.]

HMM Model of Phoneme

• Use the term "state" for each of the three time periods.

• The probability of ot from the jth state, i.e. paaj(ot; θaaj), is denoted bj(ot).

[Figure: states 1, 2, 3 with densities p(·; θaa1), p(·; θaa2), p(·; θaa3) generating the observations o1, o2, o3, . . . , o10.]

• Which state density generated observation ot?

  – Only the observations are seen; the state sequence is "hidden".
  – Recall: in a GMM, the mixture component is "hidden".

Probability of Observation

Recall: to classify, we evaluate Pr(O|Λ), where Λ are the parameters of the models.

In the /aa/ vs. /iy/ calculation: Pr(O|Λaa) vs. Pr(O|Λiy).

Example: a 2-state HMM and 3 observations o1, o2, o3.

The model parameters are assumed known:

Transition probabilities ⇒ a11, a12, a21 and a22 – these model the time durations.

State densities ⇒ b1(ot) and b2(ot).

bj(ot) is usually modelled as a single Gaussian with parameters µj, σj², or by a GMM.

Probability of Observation through one Path

T = 3 observations and N = 2 nodes ⇒ 2³ = 8 paths through the 2 nodes for the 3 observations.

Example: path P1 through states 1, 1, 1.

Pr{O|P1, Λ} = b1(o1) · b1(o2) · b1(o3)

Probability of path P1 = Pr{P1|Λ} = a01 · a11 · a11

Pr{O, P1|Λ} = Pr{O|P1, Λ} · Pr{P1|Λ} = a01 b1(o1) · a11 b1(o2) · a11 b1(o3)

Probability of Observation

Path   o1  o2  o3   p(O, Pi|Λ)
P1      1   1   1   a01 b1(o1) · a11 b1(o2) · a11 b1(o3)
P2      1   1   2   a01 b1(o1) · a11 b1(o2) · a12 b2(o3)
P3      1   2   1   a01 b1(o1) · a12 b2(o2) · a21 b1(o3)
P4      1   2   2   a01 b1(o1) · a12 b2(o2) · a22 b2(o3)
P5      2   1   1   a02 b2(o1) · a21 b1(o2) · a11 b1(o3)
P6      2   1   2   a02 b2(o1) · a21 b1(o2) · a12 b2(o3)
P7      2   2   1   a02 b2(o1) · a22 b2(o2) · a21 b1(o3)
P8      2   2   2   a02 b2(o1) · a22 b2(o2) · a22 b2(o3)

p(O|Λ) = ∑Pi P{O, Pi|Λ} = ∑Pi P{O|Pi, Λ} · P{Pi|Λ}

Forward algorithm ⇒ avoid repeated calculations:

a01 b1(o1) a11 b1(o2) · a11 b1(o3) + a02 b2(o1) · a21 b1(o2) · a11 b1(o3)   (two multiplications by a11 b1(o3))

= [ a01 b1(o1) a11 b1(o2) + a02 b2(o1) a21 b1(o2) ] · a11 b1(o3)   (one multiplication)

Forward Algorithm – Recursion

Let α1(t = 1) = a01 b1(o1)
Let α2(t = 1) = a02 b2(o1)

Recursion: α1(t = 2) = [ a01 b1(o1) · a11 + a02 b2(o1) · a21 ] · b1(o2)
                     = [ α1(t = 1) · a11 + α2(t = 1) · a21 ] · b1(o2)

General Recursion in Forward Algorithm

αj(t) = [ ∑i αi(t − 1) aij ] · bj(ot)

      = P{o1, o2, . . . , ot, st = j | Λ}

Note: αj(t) is the sum of the probabilities of all paths ending at node j at time t with partial observation sequence o1, o2, . . . , ot.

The probability of the entire observation sequence (o1, o2, . . . , oT), therefore, is

p(O|Λ) = ∑j=1..N P{o1, o2, . . . , oT, sT = j | Λ} = ∑j=1..N αj(T)

where N is the number of nodes.
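A sketch of this recursion (unscaled for clarity, since real implementations scale the α's or work in log-probabilities; the initial probabilities a0j are collected into a vector pi, and the numbers in the example are made up):

```python
import numpy as np

def forward(pi, A, B):
    """Unscaled forward algorithm.
    pi: (N,) initial probabilities a_0j; A: (N, N) transition probabilities a_ij;
    B: (T, N) emission likelihoods b_j(o_t). Returns alpha (T, N) and p(O | Lambda)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                          # alpha_j(1) = a_0j b_j(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]      # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] b_j(o_t)
    return alpha, alpha[-1].sum()                 # p(O | Lambda) = sum_j alpha_j(T)

# Tiny 2-state example with made-up numbers
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.1],     # b_j(o_1)
              [0.4, 0.3],     # b_j(o_2)
              [0.2, 0.6]])    # b_j(o_3)
print(forward(pi, A, B)[1])
```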

Backward Algorithm

• Analogous to the forward algorithm, but starting from the last time instant T.

Example: a01 b1(o1) · a11 b1(o2) · a11 b1(o3) + a01 b1(o1) · a11 b1(o2) · a12 b2(o3) + . . .

= [ a01 b1(o1) · a11 b1(o2) ] · ( a11 b1(o3) + a12 b2(o3) )

β1(t = 2) = p{o3 | st=2 = 1, Λ}
          = p{o3, st=3 = 1 | st=2 = 1, Λ} + p{o3, st=3 = 2 | st=2 = 1, Λ}
          = p{o3 | st=3 = 1, st=2 = 1, Λ} · p{st=3 = 1 | st=2 = 1, Λ} + p{o3 | st=3 = 2, st=2 = 1, Λ} · p{st=3 = 2 | st=2 = 1, Λ}
          = b1(o3) · a11 + b2(o3) · a12

General Recursion in Backward Algorithm

βj(t): given that we are at node j at time t, the sum of the probabilities of all paths such that the partial sequence ot+1, . . . , oT is observed.

βi(t) = ∑j=1..N [ aij bj(ot+1) ] βj(t + 1)

      = p{ot+1, . . . , oT | st = i, Λ}

Here aij bj(ot+1) accounts for going from node i to each node j, and βj(t + 1) is the probability of the observations ot+2, . . . , oT given that we are in node j at time t + 1.
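The matching sketch for the backward recursion (same conventions and caveats as the forward sketch above):

```python
import numpy as np

def backward(A, B):
    """Unscaled backward algorithm.
    A: (N, N) transition probabilities a_ij; B: (T, N) emission likelihoods b_j(o_t).
    Returns beta with beta[t, i] = p(o_{t+1}, ..., o_T | s_t = i, Lambda)."""
    T, N = B.shape
    beta = np.ones((T, N))                           # beta_i(T) = 1 by convention
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])       # sum_j a_ij b_j(o_{t+1}) beta_j(t+1)
    return beta
```

Combined with the forward pass, ∑j αj(t) βj(t) gives p(O|Λ) for any t, which is a handy consistency check.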

Estimation of Parameters of HMM Model

• Given known model parameters Λ:

  – evaluate p(O|Λ) ⇒ useful for classification;
  – efficient implementation: use the forward or backward algorithm.

• Given a set of observation vectors ot, how do we estimate the parameters of the HMM?

  – We do not know which states the ot come from
    ∗ analogous to a GMM, where we do not know which component.
  – Use a special case of EM – the Baum-Welch algorithm.
  – Use the following relations from the forward/backward algorithms:

p(O|Λ) = αN(T) = β1(T) = ∑j=1..N αj(t) βj(t)

Parameter Estimation for Known State Sequence

Assume each state is modelled as a single Gaussian:

µj = sample mean of the observations assigned to state j
σj² = variance of the observations assigned to state j

and

transition probability from state i to j = (number of times a transition was made from i to j) / (total number of transitions made from i)

In practice we do not know which state generated each observation, so we make a probabilistic (soft) assignment.

Review of GMM Parameter Estimation

We do not know which component of the GMM generated an output observation.

Given initial model parameters Λg and the observation sequence x1, . . . , xT,

find the probability that xi comes from component j ⇒ soft assignment:

p[component = 1 | xi; Λg] = p[component = 1, xi | Λg] / p[xi | Λg]

So the re-estimation equations are:

cj = (1/T) ∑i p(comp = j | xi; Λg)

µjnew = ∑i xi p(comp = j | xi; Λg) / ∑i p(comp = j | xi; Λg)

σj² = ∑i (xi − µjnew)² p(comp = j | xi; Λg) / ∑i p(comp = j | xi; Λg)

A similar analogy holds for hidden Markov models.

Baum-Welch Algorithm

Here we do not know which observation ot comes from which state si.

Again, as with the GMM, we assume an initial guess of the parameters, Λg.

Then the probability of being in state i at time t and state j at time t + 1 is

τt(i, j) = p{qt = i, qt+1 = j | O, Λg} = p{qt = i, qt+1 = j, O | Λg} / p{O | Λg}

where p{O|Λg} = αN(T) = ∑i αi(T).

Baum-Welch Algorithm (contd.)

From the ideas of the forward-backward algorithm, the numerator of τt(i, j) is

p{qt = i, qt+1 = j, O | Λg} = αi(t) · aij bj(ot+1) · βj(t + 1)

so that

τt(i, j) = αi(t) · aij bj(ot+1) · βj(t + 1) / p{O | Λg}

Estimating Transition Probability

Transition probability from state i to j = (number of times a transition was made from i to j) / (total number of transitions made from i)

τt(i, j) ⇒ probability of being in state i at time t and state j at time t + 1.

If we sum τt(i, j) over all time instants, we get the expected number of times the system was in state i and made a transition to state j. So a revised estimate of the transition probability is

aijnew = ∑t=1..T−1 τt(i, j) / ∑t=1..T−1 ∑j=1..N τt(i, j)

where the denominator counts all transitions out of state i at each time t.

Estimating State-Density Parameters

Analogous to the GMM case, where we did not know which observation belonged to which component, the new estimates of the state-pdf parameters are (assuming a single Gaussian per state):

µi = ∑t γi(t) ot / ∑t γi(t)

Σi = ∑t γi(t) (ot − µi)(ot − µi)ᵀ / ∑t γi(t)

where γi(t) = ∑j τt(i, j) is the probability of being in state i at time t.

These are weighted averages, weighted by the probability of being in state i at time t.

– Given the observations, the HMM model parameters are estimated iteratively.
– p(O|Λ) is evaluated efficiently by the forward/backward algorithm.
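Putting the pieces together, one re-estimation pass over a single observation sequence could look like the sketch below (assumptions: scalar observations, one Gaussian per state, unscaled α/β and no convergence test; the forward and backward recursions are repeated so the snippet stands alone):

```python
import numpy as np

def baum_welch_step(obs, pi, A, mu, var):
    """One Baum-Welch re-estimation step for a 1-D, single-Gaussian-per-state HMM."""
    T, N = len(obs), len(mu)
    # Emission likelihoods b_j(o_t)
    B = np.exp(-(obs[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    # Forward and backward passes (unscaled)
    alpha = np.zeros((T, N)); alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    p_obs = alpha[-1].sum()                                   # p(O | Lambda_g)

    # tau_t(i, j): prob. of state i at t and state j at t+1; gamma_i(t): prob. of state i at t
    tau = alpha[:-1, :, None] * A * (B[1:] * beta[1:])[:, None, :] / p_obs
    gamma = alpha * beta / p_obs

    # Re-estimation of transition probabilities, state means/variances and initial probabilities
    A_new = tau.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    mu_new = (gamma * obs[:, None]).sum(axis=0) / gamma.sum(axis=0)
    var_new = (gamma * (obs[:, None] - mu_new) ** 2).sum(axis=0) / gamma.sum(axis=0)
    pi_new = gamma[0]
    return pi_new, A_new, mu_new, var_new

# Tiny illustration with made-up parameters and data
obs = np.array([0.2, 0.1, 1.9, 2.1, 2.0])
pi0 = np.array([0.5, 0.5])
A0 = np.array([[0.8, 0.2], [0.2, 0.8]])
print(baum_welch_step(obs, pi0, A0, mu=np.array([0.0, 2.0]), var=np.array([1.0, 1.0])))
```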

Viterbi Algorithm

Given the observation sequence,

• the goal is to find the corresponding state sequence that generated it;
• there are many possible combinations (N^T) of state sequences ⇒ many paths.

One possible criterion: choose the state sequence corresponding to the path with maximum probability,

maxi P{O, Pi | Λ}

Word: represented as a sequence of phones.
Phone: represented as a sequence of states.

Optimal state sequence ⇒ optimal phone sequence ⇒ word sequence.

Viterbi Algorithm and Forward Algorithm

Recall the forward algorithm: we found the probability of each path and summed over all possible paths,

∑i=1..N^T p{O, Pi | Λ}

Viterbi is just a special case of the forward algorithm: at each node, instead of taking the sum of the probabilities of all paths, we keep the path with the maximum probability.

In practice, p(O|Λ) is often approximated by the Viterbi score (instead of the full forward algorithm).
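A sketch of the Viterbi recursion in log-probabilities (the max-and-backtrack counterpart of the forward sketch above, with the same input conventions):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most probable state path.
    log_pi: (N,) initial log-probabilities; log_A: (N, N) transition log-probabilities;
    log_B: (T, N) emission log-likelihoods log b_j(o_t)."""
    T, N = log_B.shape
    delta = np.zeros((T, N))            # best log-prob of any path ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A     # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.zeros(T, dtype=int)                  # backtrack the best path
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()
```

Replacing the sum in the forward recursion with a max (plus back-pointers) is the only change.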


Decoding

• Recall: the desired transcription W is obtained by maximising

W = arg maxW p(W|O)

• Searching over all possible W is astronomically large!

• Viterbi search – find the most likely path through an HMM:

  – the sequence of phones (states) which is most probable;
  – mostly, the most probable sequence of phones corresponds to the most probable sequence of words.

Training of a Speech Recognition System

– HMM parameters are estimated using large databases (of the order of 100 hours of speech).

  ∗ The parameters are estimated using the maximum likelihood criterion.

Recognition

[Figure: block diagram of the recognition process, not reproduced here.]

Speaker Recognition

The spectra (formants) of a given sound are different for different speakers.

[Figure: spectra of 2 speakers for one "frame" of /iy/.]

Derive a speaker-dependent model of a new speaker by MAP adaptation of the speaker-independent (SI) model using a small amount of adaptation data; use it for speaker recognition.

References

• Pattern Classification, R.O. Duda, P.E. Hart and D.G. Stork, John Wiley, 2001.

• Introduction to Statistical Pattern Recognition, K. Fukunaga, Academic Press, 1990.

• The EM Algorithm and Extensions, Geoffrey J. McLachlan and Thriyambakam Krishnan, Wiley-Interscience, 2nd edition, 2008. ISBN-10: 0471201707.

• Fundamentals of Speech Recognition, Lawrence Rabiner and Biing-Hwang Juang, Englewood Cliffs, NJ: PTR Prentice Hall (Signal Processing Series), 1993. ISBN 0-13-015157-2.

• Hidden Markov Models for Speech Recognition, X.D. Huang, Y. Ariki and M.A. Jack, Edinburgh: Edinburgh University Press, 1990.

• Statistical Methods for Speech Recognition, F. Jelinek, The MIT Press, Cambridge, MA, 1998.

• Maximum Likelihood from Incomplete Data via the EM Algorithm, A.P. Dempster, N.M. Laird and D.B. Rubin, J. Royal Statistical Soc. B 39(1), pp. 1-38, 1977.

• Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, J.-L. Gauvain and C.-H. Lee, IEEE Trans. SAP, 2(2), pp. 291-298, 1994.

• A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, J.A. Bilmes, ICSI, TR-97-021.

• Boosting GMM and Its Two Applications, F. Wang, C. Zhang and N. Lu, in N.C. Oza et al. (Eds.), LNCS 3541, pp. 12-21, 2005.