1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM


TRANSCRIPT

Page 1: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

1

HMM - Part 2

Review of the last lecture The EM algorithm Continuous density HMM

Page 2: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

2

Three Basic Problems for HMMs

Given an observation sequence O = (o_1, o_2, ..., o_T) and an HMM λ = (A, B, π)

– Problem 1:

How to compute P(O | λ) efficiently?

The forward algorithm

– Problem 2:

How to choose an optimal state sequence Q = (q_1, q_2, ..., q_T) which best explains the observations?

The Viterbi algorithm

– Problem 3:

How to adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?

The Baum-Welch (forward-backward) algorithm

cf. The segmental K-means algorithm maximizes P(O, Q* | λ), where Q* = arg max_Q P(O, Q | λ)

cf. i* = arg max_i P(O | λ_i), e.g., P(up, up, up, up, up | λ_i)?

Page 3: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

3

The Forward Algorithm

The forward variable:

– α_t(i) = P(o_1, o_2, ..., o_t, q_t = i | λ): the probability of o_1, o_2, ..., o_t being observed and the state at time t being i, given model λ

The forward algorithm

1. Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N

2. Induction: α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N

3. Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i)
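To make the recursion concrete, below is a minimal NumPy sketch of the forward algorithm. The function name, the array layout (A[i, j] = a_ij, B[j, k] = b_j(v_k), pi[i] = π_i), and the toy numbers are assumptions made for illustration, not part of the slides.

import numpy as np

def forward(A, B, pi, obs):
    # A[i, j] = a_ij, B[j, k] = b_j(v_k), pi[i] = pi_i; obs is a list of symbol indices
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialization: alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):                         # induction over t
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                 # termination: P(O|lambda) = sum_i alpha_T(i)

# Toy 2-state, 3-symbol model (all numbers arbitrary)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
alpha, prob = forward(A, B, pi, [0, 1, 2])
print(prob)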

Page 4: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

4

The Viterbi Algorithm

1. Initialization: δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0, 1 ≤ i ≤ N

2. Induction: δ_{t+1}(j) = max_{1≤i≤N} [ δ_t(i) a_ij ] b_j(o_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
   ψ_{t+1}(j) = arg max_{1≤i≤N} [ δ_t(i) a_ij ], 1 ≤ t ≤ T-1, 1 ≤ j ≤ N

3. Termination: P* = max_{1≤i≤N} δ_T(i) = max_Q P(O, Q | λ), q_T* = arg max_{1≤i≤N} δ_T(i)

4. Backtracking: q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1
   Q* = (q_1*, q_2*, ..., q_T*) is the best state sequence

cf. the forward algorithm: α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}), P(O | λ) = Σ_{i=1}^{N} α_T(i)
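A minimal sketch of the Viterbi recursion in the log domain follows; the names and the toy model are illustrative assumptions, not taken from the slides.

import numpy as np

def viterbi(A, B, pi, obs):
    # Returns the best state sequence Q* (0-based indices) and log P* = log max_Q P(O, Q | lambda)
    N, T = A.shape[0], len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[:, obs[0]]                   # delta_1(i) = log pi_i + log b_i(o_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA            # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)                   # psi_t(j) = argmax_i [...]
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]  # delta_t(j) = max_i [...] + log b_j(o_t)
    path = [int(delta[-1].argmax())]                     # q*_T
    for t in range(T - 1, 0, -1):                        # backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 1, 2]))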

Page 5: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

5

The Segmental K-means Algorithm

Assume that we have a training set of observations and an initial estimate of the model parameters

– Step 1: Segment the training data
  The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm

– Step 2: Re-estimate the model parameters

  π̂_i = (Number of times q_1 = i) / (Number of training sequences)

  â_ij = (Number of transitions from state i to state j) / (Number of transitions from state i)

  b̂_j(k) = (Number of "v_k" in state j) / (Number of times in state j)

  with Σ_{i=1}^{N} π̂_i = 1, Σ_{j=1}^{N} â_ij = 1, and Σ_{k=1}^{M} b̂_j(k) = 1

– Step 3: Evaluate the model
  If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return
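A small counting sketch of Step 2 is given below; it assumes the sequences have already been segmented by Viterbi decoding, that every state actually occurs in the data, and all names and toy inputs are illustrative.

import numpy as np

def counts_to_model(state_seqs, obs_seqs, N, M):
    # Re-estimate (pi, A, B) of a discrete HMM from Viterbi-segmented training data.
    # state_seqs[l][t] is the decoded state and obs_seqs[l][t] the observed symbol index.
    pi = np.zeros(N)
    A = np.zeros((N, N))
    B = np.zeros((N, M))
    for states, obs in zip(state_seqs, obs_seqs):
        pi[states[0]] += 1                        # number of times q_1 = i
        for t in range(len(states) - 1):
            A[states[t], states[t + 1]] += 1      # transitions i -> j
        for s, o in zip(states, obs):
            B[s, o] += 1                          # symbol o emitted while in state s
    pi /= pi.sum()                                # / number of training sequences
    A /= A.sum(axis=1, keepdims=True)             # / transitions out of state i
    B /= B.sum(axis=1, keepdims=True)             # / times in state j
    return pi, A, B

# Two tiny segmented sequences (state and symbol indices are arbitrary)
print(counts_to_model([[0, 0, 1], [1, 1, 0]], [[0, 1, 2], [2, 1, 0]], N=2, M=3))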

Page 6: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

6

Segmental K-means vs. Baum-Welch

Number of " " in state ˆ Number of times in state j

k jb k

j

1Number of times ˆ

Number of training sequencesi

q i

Number of transitions from state to state ˆ

Number of transitions from state ij

i ja

i

1 number of times ˆ

Number of training sequencesi

q i

Expected

number of " " in state ˆ number of times in state j

k jb k

j

Expected

Expected

number of transitions from state to state ˆ

number of transitions from state ij

i ja

i

Expected

Expected

Page 7: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

7

The Backward Algorithm

The backward variable:

– β_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = i, λ): the probability of o_{t+1}, o_{t+2}, ..., o_T being observed, given the state at time t being i and model λ

The backward algorithm

1. Initialization: β_T(i) = 1, 1 ≤ i ≤ N

2. Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T-1, T-2, ..., 1, 1 ≤ i ≤ N

3. Termination: P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i)

cf. the forward algorithm: P(O | λ) = Σ_{i=1}^{N} α_T(i)
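A matching NumPy sketch of the backward recursion (same assumed array layout and toy model as in the forward sketch above):

import numpy as np

def backward(A, B, obs):
    # beta_T(i) = 1, then the induction of the slide above, run backwards in time
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # initialization
    for t in range(T - 2, -1, -1):                    # induction, t = T-1, ..., 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 2]
beta = backward(A, B, obs)
print((pi * B[:, obs[0]] * beta[0]).sum())            # termination: P(O|lambda), matches the forward result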

Page 8: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

8

The Forward-Backward Algorithm

Relation between the forward and backward variables:

α_t(i) = P(o_1, o_2, ..., o_t, q_t = i | λ) = [ Σ_{j=1}^{N} α_{t-1}(j) a_ji ] b_i(o_t)

β_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = i, λ) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j)

α_t(i) β_t(i) = P(O, q_t = i | λ)

P(O | λ) = Σ_{i=1}^{N} α_t(i) β_t(i)   (for any t)

(Huang et al., 2001)

Page 9: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

9

The Baum-Welch Algorithm (1/3)

Define two new variables:

γ_t(i) = P(q_t = i | O, λ) – probability of being in state i at time t, given O and λ

ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) – probability of being in state i at time t and state j at time t+1, given O and λ

γ_t(i) = P(q_t = i, O | λ) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)

ξ_t(i, j) = P(q_t = i, q_{t+1} = j, O | λ) / P(O | λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{m=1}^{N} Σ_{n=1}^{N} α_t(m) a_mn b_n(o_{t+1}) β_{t+1}(n)

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)
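The sketch below computes γ_t(i) and ξ_t(i, j) from the forward and backward variables for a self-contained toy model; the variable names and numbers are illustrative assumptions.

import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 2]
N, T = 2, len(obs)

# forward and backward variables, as in the earlier slides
alpha = np.zeros((T, N))
beta = np.ones((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
prob = alpha[-1].sum()                                   # P(O | lambda)

# gamma_t(i) = alpha_t(i) beta_t(i) / P(O | lambda)
gamma = alpha * beta / prob

# xi_t(i, j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / prob

print(np.allclose(gamma[:-1], xi.sum(axis=2)))           # checks gamma_t(i) = sum_j xi_t(i, j)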

Page 10: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

10

The Baum-Welch Algorithm (2/3)

γ_t(i) = P(q_t = i | O, λ) – probability of being in state i at time t, given O and λ

ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) – probability of being in state i at time t and state j at time t+1, given O and λ

Over L training sequences (the l-th of length T_l):

Σ_{l=1}^{L} γ_1^l(i) = expected number of times q_1 = i

Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} γ_t^l(i) = expected number of transitions from state i

Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} ξ_t^l(i, j) = expected number of transitions from state i to state j

Page 11: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

11

The Baum-Welch Algorithm (3/3)

Re-estimation formulae for π, A, and B are

π̂_i = (Expected number of times q_1 = i) / (Number of training sequences) = Σ_{l=1}^{L} γ_1^l(i) / L

â_ij = (Expected number of transitions from state i to state j) / (Expected number of transitions from state i) = Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} γ_t^l(i)

b̂_j(k) = (Expected number of "v_k" in state j) / (Expected number of times in state j) = Σ_{l=1}^{L} Σ_{t=1, s.t. o_t^l = v_k}^{T_l} γ_t^l(j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j)

How do you know that P(O | λ̂) ≥ P(O | λ)?
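One Baum-Welch re-estimation pass over several training sequences might look like the sketch below. It reuses the forward/backward and γ/ξ computations shown earlier; the function names, array layout, and toy data are assumptions for illustration only.

import numpy as np

def forward_backward(A, B, pi, obs):
    # Return gamma (T x N) and xi ((T-1) x N x N) for one observation sequence
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[-1].sum()
    gamma = alpha * beta / prob
    xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / prob
                   for t in range(T - 1)])
    return gamma, xi

def baum_welch_step(A, B, pi, sequences):
    # One re-estimation of (pi, A, B) from L training sequences of a discrete HMM
    N, M = B.shape
    pi_num = np.zeros(N)
    a_num, a_den = np.zeros((N, N)), np.zeros(N)
    b_num, b_den = np.zeros((N, M)), np.zeros(N)
    for obs in sequences:
        gamma, xi = forward_backward(A, B, pi, obs)
        pi_num += gamma[0]                       # expected times q_1 = i
        a_num += xi.sum(axis=0)                  # expected transitions i -> j
        a_den += gamma[:-1].sum(axis=0)          # expected transitions out of i
        for t, o in enumerate(obs):
            b_num[:, o] += gamma[t]              # expected times in j while observing v_k
        b_den += gamma.sum(axis=0)               # expected times in j
    return pi_num / len(sequences), a_num / a_den[:, None], b_num / b_den[:, None]

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(baum_welch_step(A, B, pi, [[0, 1, 2], [2, 2, 0, 1]]))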

Page 12: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

12

Maximum Likelihood Estimation for HMM

λ_ML = arg max_λ L(λ) = arg max_λ P(O | λ)
     = arg max_λ l(λ) = arg max_λ log P(O | λ)
     = arg max_λ log Σ_Q P(O, Q | λ)

However, we cannot find the solution directly.

An alternative way is to find a sequence λ^(0), λ^(1), ..., λ^(t), ... s.t. l(λ^(0)) ≤ l(λ^(1)) ≤ ... ≤ l(λ^(t)) ≤ ...

Page 13: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

13

l(λ) - l(λ^(t)) = log P(O | λ) - log P(O | λ^(t))
 = log Σ_Q P(O, Q | λ) - log P(O | λ^(t))
 = log Σ_Q [ P(Q | O, λ^(t)) · P(O, Q | λ) / P(Q | O, λ^(t)) ] - log P(O | λ^(t))
 = log Σ_Q [ P(Q | O, λ^(t)) · P(O, Q | λ) P(O | λ^(t)) / P(O, Q | λ^(t)) ] - log P(O | λ^(t))
 = log E_{P(Q|O,λ^(t))} [ P(O, Q | λ) P(O | λ^(t)) / P(O, Q | λ^(t)) ] - log P(O | λ^(t))
 ≥ E_{P(Q|O,λ^(t))} [ log ( P(O, Q | λ) P(O | λ^(t)) / P(O, Q | λ^(t)) ) ] - log P(O | λ^(t))   (Jensen's inequality)
 = E_{P(Q|O,λ^(t))} [ log ( P(O, Q | λ) / P(O, Q | λ^(t)) ) ]

Therefore, choose

λ^(t+1) = arg max_λ E_{P(Q|O,λ^(t))} [ log ( P(O, Q | λ) / P(O, Q | λ^(t)) ) ]   ... (1)
        = arg max_λ Σ_Q P(Q | O, λ^(t)) log ( P(O, Q | λ) / P(O, Q | λ^(t)) )
        = arg max_λ Σ_Q P(Q | O, λ^(t)) log P(O, Q | λ)
        = arg max_λ E_{P(Q|O,λ^(t))} [ log P(O, Q | λ) ]   ← the Q function

Jensen's inequality: if f is a concave function and X is a r.v., then E[f(X)] ≤ f(E[X]); with f = log, this gives log E[X] ≥ E[log X].

This is solvable, and it can be proved that l(λ^(t+1)) ≥ l(λ^(t)), i.e., P(O | λ^(t+1)) ≥ P(O | λ^(t)).

Page 14: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

14

The EM Algorithm

EM: Expectation Maximization

– Why EM?

• Simple optimization algorithms for likelihood functions rely on intermediate variables, called latent data. For HMM, the state sequence is the latent data

• Direct access to the data necessary to estimate the parameters is impossible or difficult. For HMM, it is almost impossible to estimate (A, B, π) without considering the state sequence

– Two Major Steps:

• E step: compute the expectation of the likelihood by including the latent variables as if they were observed, i.e., Σ_Q P(Q | O, λ) log P(O, Q | λ̂)

• M step: compute the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step

Page 15: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

15

Three Steps for EM

Step 1. Draw a lower bound
– Use Jensen's inequality

Step 2. Find the best lower bound (auxiliary function)
– Let the lower bound touch the objective function at the current guess

Step 3. Maximize the auxiliary function
– Obtain the new guess
– Go to Step 2 until convergence

[Minka 1998]

Page 16: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

16

Form an Initial Guess of λ = (A, B, π)

[Figure: the objective function F(λ), with the current guess marked and the target λ* = arg max_λ P(O | λ)]

Given the current guess λ, the goal is to find a new guess λ^NEW such that F(λ) ≤ F(λ^NEW)

Page 17: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

17

Step 1. Draw a Lower Bound

[Figure: a lower bound function g(λ) drawn below the objective function F(λ), i.e., g(λ) ≤ F(λ)]

Page 18: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

18

Step 2. Find the Best Lower Bound

[Figure: among the lower bound functions g(λ), the auxiliary function g(λ, λ') touches the objective function F(λ) at the current guess]

Page 19: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

19

Step 3. Maximize the Auxiliary Function

[Figure: λ^NEW maximizes the auxiliary function g(λ, λ'), and F(λ) ≤ F(λ^NEW)]

Page 20: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

20

Update the Model

[Figure: the objective function F(λ); λ^NEW becomes the current guess]

Page 21: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

21

Step 2. Find the Best Lower Bound

[Figure: a new auxiliary function g(λ, λ') is fitted at the updated current guess, touching the objective function F(λ)]

Page 22: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

22

Step 3. Maximize the Auxiliary Function

[Figure: the new maximizer λ^NEW again satisfies F(λ) ≤ F(λ^NEW); the procedure iterates]

Page 23: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

23

Step 1. Draw a Lower Bound (cont’d)

Apply Jensen's inequality to the objective function log P(O | λ):

log P(O | λ) = log Σ_Q P(O, Q | λ)
             = log Σ_Q p(Q) [ P(O, Q | λ) / p(Q) ]
             ≥ Σ_Q p(Q) log [ P(O, Q | λ) / p(Q) ]

where p(Q) is an arbitrary probability distribution. The right-hand side is a lower bound function of F(λ).

(If f is a concave function and X is a r.v., then E[f(X)] ≤ f(E[X]).)

Page 24: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

24

Step 2. Find the Best Lower Bound (cont’d)

– Find the p(Q) that makes the lower bound function touch the objective function at the current guess λ'

We want to maximize Σ_Q p(Q) log [ P(O, Q | λ) / p(Q) ] w.r.t. p(Q) at λ = λ'

The best p*(Q) = arg max_{p(Q)} Σ_Q p(Q) log [ P(O, Q | λ') / p(Q) ]

Page 25: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

25

Step 2. Find the Best Lower Bound (cont’d)

Since Σ_Q p(Q) = 1, we introduce a Lagrange multiplier h:

Λ = Σ_Q p(Q) log [ P(O, Q | λ') / p(Q) ] + h ( Σ_Q p(Q) - 1 )
  = Σ_Q p(Q) log P(O, Q | λ') - Σ_Q p(Q) log p(Q) + h ( Σ_Q p(Q) - 1 )

Take the derivative w.r.t. p(Q) and set it to zero:

log P(O, Q | λ') - log p(Q) - 1 + h = 0
⇒ p(Q) = P(O, Q | λ') e^(h-1)

Substituting into Σ_Q p(Q) = 1 gives e^(1-h) = Σ_Q P(O, Q | λ'), so the best distribution is

p*(Q) = P(O, Q | λ') / Σ_Q P(O, Q | λ') = P(O, Q | λ') / P(O | λ') = P(Q | O, λ')

Page 26: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

26

Step 2. Find the Best Lower Bound (cont’d)

Define the auxiliary function

g(λ, λ') = Σ_Q P(Q | O, λ') log P(O, Q | λ) - Σ_Q P(Q | O, λ') log P(Q | O, λ')

The first term, Σ_Q P(Q | O, λ') log P(O, Q | λ), is the Q function.

We can check that the bound touches the objective function at the current guess:

g(λ', λ') = Σ_Q P(Q | O, λ') log [ P(O, Q | λ') / P(Q | O, λ') ]
          = Σ_Q P(Q | O, λ') log P(O | λ')
          = log P(O | λ')

Page 27: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

27

EM for HMM Training

Basic idea

– Assume we have λ and the probability that each Q occurred in the generation of O, i.e., we have in fact observed a complete data pair (O, Q) with frequency proportional to the probability P(O, Q | λ)

– We then find a new λ̂ that maximizes the expectation Σ_Q P(Q | O, λ) log P(O, Q | λ̂)

– It can be guaranteed that P(O | λ̂) ≥ P(O | λ)

EM can discover parameters of model λ to maximize the log-likelihood of the incomplete data, log P(O | λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, Q | λ)

Page 28: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

28

Solution to Problem 3 - The EM Algorithm

The auxiliary function

Q(λ, λ̂) = Σ_Q [ P(O, Q | λ) / P(O | λ) ] log P(O, Q | λ̂)

where P(O, Q | λ) and log P(O, Q | λ̂) can be expressed as

P(O, Q | λ) = π_{q_1} Π_{t=1}^{T-1} a_{q_t q_{t+1}} Π_{t=1}^{T} b_{q_t}(o_t)

log P(O, Q | λ̂) = log π̂_{q_1} + Σ_{t=1}^{T-1} log â_{q_t q_{t+1}} + Σ_{t=1}^{T} log b̂_{q_t}(o_t)

Page 29: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

29

Solution to Problem 3 - The EM Algorithm (cont’d)

The auxiliary function can be rewritten as

Q(λ, λ̂) = Σ_Q [ P(O, Q | λ) / P(O | λ) ] [ log π̂_{q_1} + Σ_{t=1}^{T-1} log â_{q_t q_{t+1}} + Σ_{t=1}^{T} log b̂_{q_t}(o_t) ]
         = Q_π(λ, π̂) + Q_a(λ, â) + Q_b(λ, b̂)

where

Q_π(λ, π̂) = Σ_{i=1}^{N} [ P(O, q_1 = i | λ) / P(O | λ) ] log π̂_i

Q_a(λ, â) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, q_t = i, q_{t+1} = j | λ) / P(O | λ) ] log â_ij

Q_b(λ, b̂) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1, s.t. o_t = v_k}^{T} [ P(O, q_t = j | λ) / P(O | λ) ] log b̂_j(k)

Each of the three terms has the form Σ w log y. For example, P(O, q_1 = i | λ)/P(O | λ) = γ_1(i), P(O, q_t = i, q_{t+1} = j | λ)/P(O | λ) = ξ_t(i, j), and P(O, q_t = j | λ)/P(O | λ) = γ_t(j).

Page 30: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

30

Solution to Problem 3 - The EM Algorithm (cont’d)

The auxiliary function is separated into three independent terms, which respectively correspond to π_i, a_ij, and b_j(k)

– Maximization of Q(λ, λ̂) can be done by maximizing the individual terms separately, subject to the probability constraints Σ_{i=1}^{N} π̂_i = 1, Σ_{j=1}^{N} â_ij = 1 (1 ≤ i ≤ N), and Σ_{k=1}^{M} b̂_j(k) = 1 (1 ≤ j ≤ N)

– All these terms have the following form:

F(y_1, y_2, ..., y_N) = Σ_{j=1}^{N} w_j log y_j, where Σ_{j=1}^{N} y_j = 1, y_j ≥ 0, and w_j ≥ 0

F has a maximum value when y_j = w_j / Σ_{n=1}^{N} w_n
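A quick numerical sanity check of this claim is sketched below; the weights w and the number of random trial points are arbitrary choices for illustration.

import numpy as np

# F(y) = sum_j w_j log y_j subject to sum_j y_j = 1 should peak at y_j = w_j / sum_n w_n
rng = np.random.default_rng(0)
w = np.array([3.0, 1.0, 2.0])
y_star = w / w.sum()

def F(y):
    return float(np.sum(w * np.log(y)))

# Compare the claimed maximizer against random points on the probability simplex
best_random = max(F(rng.dirichlet(np.ones(3))) for _ in range(10000))
print(F(y_star), best_random, F(y_star) >= best_random)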

Page 31: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

31

Solution to Problem 3 - The EM Algorithm (cont’d)

Proof: Apply Lagrange Multiplier

By applying the Lagrange multiplier h:

Suppose F = Σ_{j=1}^{N} w_j log y_j + h ( Σ_{j=1}^{N} y_j - 1 )

Letting ∂F/∂y_j = w_j / y_j + h = 0 for all j

Then w_j = -h y_j, so Σ_{j=1}^{N} w_j = -h Σ_{j=1}^{N} y_j = -h   (using the constraint Σ_{j=1}^{N} y_j = 1)

⇒ y_j = w_j / Σ_{n=1}^{N} w_n

(Aside, used above: d(ln x)/dx = 1/x, since
 e = lim_{h→0} (1 + h)^(1/h) ≈ 2.71828...
 d(ln x)/dx = lim_{h→0} [ln(x + h) - ln(x)] / h = lim_{h→0} (1/h) ln((x + h)/x) = lim_{h→0} ln(1 + h/x)^(1/h) = ln e^(1/x) = 1/x)

Page 32: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

32

Solution to Problem 3 - The EM Algorithm (cont’d)

Q_π(λ, π̂) = Σ_{i=1}^{N} [ P(O, q_1 = i | λ) / P(O | λ) ] log π̂_i

has the form Σ_i w_i log y_i with w_i = P(O, q_1 = i | λ) / P(O | λ), so it is maximized by

π̂_i = w_i / Σ_{n=1}^{N} w_n = [ P(O, q_1 = i | λ) / P(O | λ) ] / Σ_{n=1}^{N} [ P(O, q_1 = n | λ) / P(O | λ) ] = P(q_1 = i | O, λ) = γ_1(i)

where γ_t(i) = P(q_t = i | O, λ)

Page 33: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

33

Solution to Problem 3 - The EM Algorithm (cont’d)

Q_a(λ, â) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, q_t = i, q_{t+1} = j | λ) / P(O | λ) ] log â_ij

For each i, the inner sum has the form Σ_j w_j log y_j with w_j = Σ_{t=1}^{T-1} P(O, q_t = i, q_{t+1} = j | λ) / P(O | λ), so it is maximized by

â_ij = w_j / Σ_{n=1}^{N} w_n = Σ_{t=1}^{T-1} P(q_t = i, q_{t+1} = j | O, λ) / Σ_{t=1}^{T-1} P(q_t = i | O, λ) = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

Page 34: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

34

Solution to Problem 3 - The EM Algorithm (cont’d)

Q_b(λ, b̂) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1, s.t. o_t = v_k}^{T} [ P(O, q_t = j | λ) / P(O | λ) ] log b̂_j(k)

For each j, the inner sum has the form Σ_k w_k log y_k with w_k = Σ_{t=1, s.t. o_t = v_k}^{T} P(O, q_t = j | λ) / P(O | λ), so it is maximized by

b̂_j(k) = w_k / Σ_{n=1}^{M} w_n = Σ_{t=1, s.t. o_t = v_k}^{T} P(q_t = j | O, λ) / Σ_{t=1}^{T} P(q_t = j | O, λ) = Σ_{t=1, s.t. o_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

Page 35: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

35

Solution to Problem 3 - The EM Algorithm (cont’d)

The new model parameter set can be expressed as:

λ̂ = (π̂, Â, B̂)

π̂_i = P(q_1 = i | O, λ) = γ_1(i)

â_ij = Σ_{t=1}^{T-1} P(q_t = i, q_{t+1} = j | O, λ) / Σ_{t=1}^{T-1} P(q_t = i | O, λ) = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

b̂_j(k) = Σ_{t=1, s.t. o_t = v_k}^{T} P(q_t = j | O, λ) / Σ_{t=1}^{T} P(q_t = j | O, λ) = Σ_{t=1, s.t. o_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

where γ_t(i) = P(q_t = i | O, λ) and ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)

Page 36: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

36

Discrete vs. Continuous Density HMMs

Two major types of HMMs according to the observations

– Discrete and finite observations:

• The observations that all distinct states generate are finite in number, i.e., V = {v_1, v_2, v_3, ..., v_M}, v_k ∈ R^L

• In this case, the observation probability distribution in state j, B = {b_j(k)}, is defined as b_j(k) = P(o_t = v_k | q_t = j), 1 ≤ k ≤ M, 1 ≤ j ≤ N, where o_t is the observation at time t and q_t is the state at time t. b_j(k) consists of only M probability values

– Continuous and infinite observations:

• The observations that all distinct states generate are infinite and continuous, i.e., V = {v | v ∈ R^L}

• In this case, the observation probability distribution in state j, B = {b_j(v)}, is defined as b_j(v) = f(o_t = v | q_t = j), 1 ≤ j ≤ N, where o_t is the observation at time t and q_t is the state at time t. b_j(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian (normal) distributions

Page 37: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

37

Gaussian Distribution

A continuous random variable X is said to have a Gaussian distribution with mean μ and variance σ^2 (σ > 0) if X has a continuous pdf of the following form:

f_X(x | μ, σ^2) = ( 1 / (2π σ^2)^(1/2) ) exp( -(x - μ)^2 / (2σ^2) )
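For reference, a tiny NumPy version of this density (the function name and the example values are arbitrary):

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # Univariate Gaussian density with mean mu and variance sigma2 (sigma2 > 0)
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

print(gaussian_pdf(0.5, mu=0.0, sigma2=1.0))   # approximately 0.352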

Page 38: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

38

Multivariate Gaussian Distribution

If X = (X_1, X_2, X_3, ..., X_d) is a d-dimensional random vector with a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ, then the pdf can be expressed as

f_X(x) = N(x; μ, Σ) = ( 1 / ((2π)^(d/2) |Σ|^(1/2)) ) exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) )

where μ = E[X], Σ = E[(X - μ)(X - μ)^T], σ_ij = E[(x_i - μ_i)(x_j - μ_j)], and |Σ| is the determinant of Σ

If X_1, X_2, X_3, ..., X_d are independent random variables, the covariance matrix is reduced to a diagonal matrix, i.e., σ_ij^2 = 0 for i ≠ j, and

f_X(x | μ, Σ) = Π_{i=1}^{d} ( 1 / (2π σ_ii^2)^(1/2) ) exp( -(x_i - μ_i)^2 / (2σ_ii^2) )

Page 39: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

39

Multivariate Mixture Gaussian Distribution

A d-dimensional random vector X = (X_1, X_2, X_3, ..., X_d) has a multivariate mixture Gaussian distribution if

f_X(x) = Σ_{k=1}^{M} w_k N(x; μ_k, Σ_k), with w_k ≥ 0 and Σ_{k=1}^{M} w_k = 1

In CDHMM, b_j(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian distributions:

b_j(v) = Σ_{k=1}^{M} c_jk ( 1 / ((2π)^(d/2) |Σ_jk|^(1/2)) ) exp( -(1/2) (v - μ_jk)^T Σ_jk^(-1) (v - μ_jk) ), with c_jk ≥ 0 and Σ_{k=1}^{M} c_jk = 1

where v is the observation vector, μ_jk is the mean vector of the k-th mixture of the j-th state, and Σ_jk is the covariance matrix of the k-th mixture of the j-th state
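A minimal NumPy sketch of such a state output density b_j(v) is given below; the helper names and the toy 2-dimensional, 2-mixture parameters are assumptions for illustration.

import numpy as np

def mvn_pdf(x, mu, cov):
    # Multivariate Gaussian density N(x; mu, cov)
    d = len(mu)
    diff = x - mu
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(expo) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def mixture_pdf(x, weights, means, covs):
    # b_j(v) as a mixture of multivariate Gaussians: sum_k c_jk N(v; mu_jk, Sigma_jk)
    return sum(c * mvn_pdf(x, mu, cov) for c, mu, cov in zip(weights, means, covs))

# Toy 2-dimensional state with 2 mixture components (all numbers arbitrary)
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([2.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(mixture_pdf(np.array([1.0, 1.0]), weights, means, covs))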

Page 40: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

40

Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM

Assume that we have a training set of observations and an initial estimate of the model parameters

– Step 1: Segment the training data
  The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm

– Step 2: Re-estimate the model parameters

  π̂_i = (Number of times q_1 = i) / (Number of training sequences)

  â_ij = (Number of transitions from state i to state j) / (Number of transitions from state i)

  By partitioning the observation vectors within each state j into M clusters:

  ĉ_jm = (number of vectors classified into cluster m of state j) / (number of vectors in state j)

  μ̂_jm = sample mean of the vectors classified into cluster m of state j

  Σ̂_jm = sample covariance matrix of the vectors classified into cluster m of state j

– Step 3: Evaluate the model
  If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return

Page 41: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

41

Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM

(cont'd) An example with 3 states and 4 Gaussian mixtures per state

[Figure: the observation sequence O_1, O_2, ..., O_t is aligned to states s1, s2, s3 on a state-time trellis by Viterbi segmentation; within each state, K-means splits the vectors (from the global mean into cluster 1 mean, cluster 2 mean, ...) to obtain the mixture parameters, e.g., {μ_11, Σ_11, c_11}, {μ_12, Σ_12, c_12}, {μ_13, Σ_13, c_13}, {μ_14, Σ_14, c_14} for state 1]

Page 42: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

42

Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM

Define a new variable γ_t(j, k)

– Probability of being in state j at time t with the k-th mixture component accounting for o_t, given O and λ

γ_t(j, k) = P(q_t = j, m_t = k | O, λ)
          = P(q_t = j | O, λ) P(m_t = k | q_t = j, O, λ)
          = γ_t(j) P(m_t = k | q_t = j, o_t, λ)
          = [ α_t(j) β_t(j) / Σ_{s=1}^{N} α_t(s) β_t(s) ] · [ c_jk N(o_t; μ_jk, Σ_jk) / Σ_{m=1}^{M} c_jm N(o_t; μ_jm, Σ_jm) ]

Observation-independent assumption:

P(m_t = k | q_t = j, O, λ) = P(m_t = k | q_t = j, o_1, ..., o_{t-1}, o_t, o_{t+1}, ..., o_T, λ) = P(m_t = k | q_t = j, o_t, λ)
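The sketch below computes γ_t(j, k) for one state j at one time t, given the state posterior γ_t(j) from forward-backward; the function names and the toy parameters are illustrative assumptions.

import numpy as np

def mvn_pdf(x, mu, cov):
    # Multivariate Gaussian density N(x; mu, cov)
    d = len(mu)
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def mixture_occupancy(o_t, gamma_t_j, c, means, covs):
    # gamma_t(j, k) = gamma_t(j) * c_jk N(o_t; mu_jk, Sigma_jk) / sum_m c_jm N(o_t; mu_jm, Sigma_jm)
    likes = np.array([c_k * mvn_pdf(o_t, mu, cov) for c_k, mu, cov in zip(c, means, covs)])
    return gamma_t_j * likes / likes.sum()

# Toy call: one state j with 2 mixtures and gamma_t(j) = 0.8 (all numbers arbitrary)
print(mixture_occupancy(np.array([1.0, 1.0]), 0.8,
                        [0.6, 0.4], [np.zeros(2), np.full(2, 2.0)], [np.eye(2), np.eye(2)]))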

Page 43: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

43

Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM (cont’d)

Re-estimation formulae for c_jk, μ_jk, and Σ_jk are

ĉ_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)
  (Expected number of times in state j and mixture k / Expected number of times in state j)

μ̂_jk = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)
  (Weighted average (mean) of the observations in state j and mixture k)

Σ̂_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̂_jk)(o_t - μ̂_jk)^T / Σ_{t=1}^{T} γ_t(j, k)
  (Weighted covariance of the observations in state j and mixture k)
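A compact sketch of these updates for a single state j, assuming the occupancies γ_t(j, k) have already been computed (array shapes, names, and toy numbers are illustrative):

import numpy as np

def reestimate_mixture(obs, gamma_jk):
    # Re-estimate (c_jk, mu_jk, Sigma_jk) for one state j.
    # obs: (T, d) observation vectors; gamma_jk: (T, M) occupancies gamma_t(j, k)
    T, d = obs.shape
    M = gamma_jk.shape[1]
    c = gamma_jk.sum(axis=0) / gamma_jk.sum()                 # expected times in (j, k) / in j
    mu = (gamma_jk.T @ obs) / gamma_jk.sum(axis=0)[:, None]   # weighted mean per mixture
    Sigma = np.zeros((M, d, d))
    for k in range(M):
        diff = obs - mu[k]
        Sigma[k] = (gamma_jk[:, k, None] * diff).T @ diff / gamma_jk[:, k].sum()
    return c, mu, Sigma

# Toy data: 4 two-dimensional observations and 2 mixtures (numbers arbitrary)
obs = np.array([[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [1.9, 2.0]])
gamma_jk = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
print(reestimate_mixture(obs, gamma_jk))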

Page 44: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

44

A Simple Example

[Figure: a two-state trellis (states s_1, s_2) over times 1, 2, ..., t with observations o_1, o_2, o_3, ...; each arc carries terms of the form α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)]

The Forward/Backward Procedure

γ_t(i) = P(q_t = i | O, λ) = P(q_t = i, O | λ) / P(O | λ) = α_t(i) β_t(i) / Σ_{j=1}^{N} α_t(j) β_t(j)

ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) = P(q_t = i, q_{t+1} = j, O | λ) / P(O | λ)
          = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)

Page 45: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

45

A Simple Example (cont'd)

[Figure: a two-state HMM (states 1 and 2, with a start node and transition probabilities a_11, a_12, a_21, a_22) generating the observation sequence (v_4, v_7, v_4); there are 8 paths in total]

For each state sequence q, p(O, q | λ) is:

1. q: 1 1 1   b_1(v_4) a_11 b_1(v_7) a_11 b_1(v_4)
2. q: 1 1 2   b_1(v_4) a_11 b_1(v_7) a_12 b_2(v_4)
3. q: 1 2 1   b_1(v_4) a_12 b_2(v_7) a_21 b_1(v_4)
4. q: 1 2 2   b_1(v_4) a_12 b_2(v_7) a_22 b_2(v_4)
5. q: 2 1 1   b_2(v_4) a_21 b_1(v_7) a_11 b_1(v_4)
6. q: 2 1 2   b_2(v_4) a_21 b_1(v_7) a_12 b_2(v_4)
7. q: 2 2 1   b_2(v_4) a_22 b_2(v_7) a_21 b_1(v_4)
8. q: 2 2 2   b_2(v_4) a_22 b_2(v_7) a_22 b_2(v_4)

and log p(O, q | λ) is the corresponding sum of log terms, e.g., for path 1: log b_1(v_4) + log a_11 + log b_1(v_7) + log a_11 + log b_1(v_4)

Page 46: 1 HMM - Part 2 Review of the last lecture The EM algorithm Continuous density HMM

46

A Simple Example (cont’d)

Let φ_i denote p(O, q | λ) for the i-th of the 8 paths, and Σ_all = φ_1 + φ_2 + ... + φ_8 = P(O | λ). Then

Q(λ, λ̂) = Σ_{all Q} [ P(O, Q | λ) / P(O | λ) ] [ log π̂_{q_1} + Σ_{t=1}^{T-1} log â_{q_t q_{t+1}} + Σ_{t=1}^{T} log b̂_{q_t}(o_t) ]

collects, for example,

[(φ_1 + φ_2 + φ_3 + φ_4) / Σ_all] log π̂_1 + [(φ_5 + φ_6 + φ_7 + φ_8) / Σ_all] log π̂_2
+ [(2φ_1 + φ_2 + φ_5) / Σ_all] log â_11 + [(φ_2 + φ_3 + φ_4 + φ_6) / Σ_all] log â_12
+ [(φ_3 + φ_5 + φ_6 + φ_7) / Σ_all] log â_21 + [(φ_4 + φ_7 + 2φ_8) / Σ_all] log â_22 + ...

Note that (φ_1 + φ_2 + φ_3 + φ_4) / Σ_all = P(q_1 = 1, O | λ) / P(O | λ) = γ_1(1), and likewise the coefficient of log π̂_2 is γ_1(2), so the π̂ terms are γ_1(1) log π̂_1 + γ_1(2) log π̂_2.

Similarly, (2φ_1 + φ_2 + φ_5) / Σ_all = [ P(q_1 = 1, q_2 = 1, O | λ) + P(q_2 = 1, q_3 = 1, O | λ) ] / P(O | λ) = ξ_1(1, 1) + ξ_2(1, 1) = Σ_t ξ_t(1, 1), so the coefficient of log â_11 is Σ_t ξ_t(1, 1), and in general the coefficient of log â_ij is Σ_t ξ_t(i, j).
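A brute-force sketch of this example is given below: it enumerates all 8 paths of a two-state HMM over 3 observations and checks that Σ_Q P(O, Q | λ) equals the forward-algorithm result, and that grouping the paths by initial state yields γ_1(i). The toy numbers are arbitrary, and unlike the slide (which leaves the initial probability implicit at the start node) this sketch includes π_{q_1} explicitly.

import numpy as np
from itertools import product

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 2]                        # stands in for (v_4, v_7, v_4)

def path_prob(q):
    # P(O, Q | lambda) for one state sequence q, including pi_{q_1}
    p = pi[q[0]] * B[q[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
    return p

paths = list(product([0, 1], repeat=len(obs)))            # the 2^3 = 8 paths
phi = np.array([path_prob(q) for q in paths])

# Forward algorithm for comparison
alpha = pi * B[:, obs[0]]
for t in range(1, len(obs)):
    alpha = (alpha @ A) * B[:, obs[t]]

print(np.isclose(phi.sum(), alpha.sum()))                 # sum_Q P(O,Q|lambda) = P(O|lambda)
gamma_1_state1 = sum(p for q, p in zip(paths, phi) if q[0] == 0) / phi.sum()
print(gamma_1_state1)                                     # coefficient of log pi_1_hat, i.e. gamma_1(1)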