Markov models

A presentation on Markov chains, Hidden Markov Models, and Markov Random Fields, with the needed algorithms and detailed explanations.

Page 1: Markov Models

PATTERN RECOGNITION

Markov models

Vu PHAM
[email protected]

Department of Computer Science

March 28th, 2011

Page 2: Markov Models

Contents

• Introduction
  – Introduction
  – Motivation
• Markov Chain
• Hidden Markov Models
• Markov Random Field

Page 3: Markov Models

Introduction

• Markov processes were first proposed by the Russian mathematician Andrei Markov.
  – He used these processes to investigate the alternation of vowels and consonants in Pushkin's verse.
• Nowadays, the Markov property and HMMs are widely used in many domains:
  – Natural Language Processing
  – Speech Recognition
  – Bioinformatics
  – Image/video processing
  – ...

Page 4: Markov Models

Motivation [0]

• As shown in his 1906 paper, Markov's original motivation was purely mathematical:
  – applying the Weak Law of Large Numbers to dependent random variables.
• However, we shall not follow this motivation...

Page 5: Markov Models

Motivation [1]

• From the viewpoint of classification:
  – Context-free classification: the Bayes classifier

$$p(\omega_i \mid \mathbf{x}) > p(\omega_j \mid \mathbf{x}) \quad \forall j \neq i$$

  – Classes are independent.
  – Feature vectors are independent.
• However, there are some applications where the various classes are closely related:
  – POS tagging, tracking, gene boundary recovery...
  – The data then come as a sequence: s1 s2 s3 ... sm

Page 8: Markov Models

Motivation [1]

• Context-dependent classification:
  – s1, s2, ..., sm: a sequence of m feature vectors
  – ω1, ω2, ..., ωm: the classes in which these vectors are classified, each ωi ∈ {1, ..., k}
• To apply the Bayes classifier:
  – X = s1 s2 ... sm: the extended feature vector
  – Ωi = ωi1, ωi2, ..., ωim: one possible classification; there are k^m possible classifications

$$p(\Omega_i \mid X) > p(\Omega_j \mid X) \quad \forall j \neq i$$
$$\Leftrightarrow \quad p(X \mid \Omega_i)\, p(\Omega_i) > p(X \mid \Omega_j)\, p(\Omega_j) \quad \forall j \neq i$$

Page 11: Markov Models

Motivation [2]

• From a more general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables:

Hôm nay mùng tám tháng ba
Chị em phụ nữ đi ra đi vào...

(a Vietnamese folk rhyme, roughly: "Today is the eighth of March / The women keep walking in and out...")

Model each word as a random variable: q1 = Hôm, q2 = nay, q3 = mùng, ..., qm = vào.

• What is p(Hôm nay ... vào) = p(q1=Hôm, q2=nay, ..., qm=vào)?
• Recall that conditionals relate to the joint distribution by

$$p(s_m \mid s_1 s_2 \dots s_{m-1}) = \frac{p(s_1 s_2 \dots s_{m-1} s_m)}{p(s_1 s_2 \dots s_{m-1})}$$

Page 15: Markov Models

Contents

• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field

Page 16: Markov Models

Markov Chain

• Has N states, called s1, s2, ..., sN.
• There are discrete timesteps, t = 0, t = 1, ...
• On the t'th timestep the system is in exactly one of the available states; call it the current state $q_t \in \{s_1, s_2, \dots, s_N\}$.
• Between each timestep, the next state is chosen randomly.
• The current state determines the probability distribution for the next state.
  – This is often notated with arcs between states.

Example with N = 3: at t = 0 the system starts in q0 = s3; at t = 1 it has moved to q1 = s2. The arcs out of each state carry the transition probabilities, e.g.:

$$p(q_{t+1}=s_1 \mid q_t=s_1) = 0 \qquad p(q_{t+1}=s_2 \mid q_t=s_1) = 0 \qquad p(q_{t+1}=s_3 \mid q_t=s_1) = 1$$
$$p(q_{t+1}=s_1 \mid q_t=s_2) = 1/2 \qquad p(q_{t+1}=s_2 \mid q_t=s_2) = 1/2 \qquad p(q_{t+1}=s_3 \mid q_t=s_2) = 0$$
$$p(q_{t+1}=s_1 \mid q_t=s_3) = 1/3 \qquad p(q_{t+1}=s_2 \mid q_t=s_3) = 2/3 \qquad p(q_{t+1}=s_3 \mid q_t=s_3) = 0$$

Page 20: Markov Models

Markov Property

• $q_{t+1}$ is conditionally independent of $q_{t-1}, q_{t-2}, \dots, q_0$ given $q_t$.
• In other words:

$$p(q_{t+1} \mid q_t, q_{t-1}, \dots, q_0) = p(q_{t+1} \mid q_t)$$

The state at timestep t+1 depends only on the state at timestep t.

• More generally, in a Markov chain of order m (m finite), the state at timestep t+1 depends on the past m states:

$$p(q_{t+1} \mid q_t, q_{t-1}, \dots, q_0) = p(q_{t+1} \mid q_t, q_{t-1}, \dots, q_{t-m+1})$$

• How do we represent the joint distribution of (q0, q1, q2, ...) using graphical models? As a simple chain of nodes:

q0 → q1 → q2 → q3 → ...

Page 25: Markov Models

Markov chain

• So, the chain of $q_t$ is called a Markov chain:

q0 → q1 → q2 → q3 → ...

• Each $q_t$ takes a value from the countable state-space {s1, s2, s3, ...}.
• Each $q_t$ is observed at a discrete timestep t.
• $q_t$ satisfies the Markov property:

$$p(q_{t+1} \mid q_t, q_{t-1}, \dots, q_0) = p(q_{t+1} \mid q_t)$$

• The transition from $q_t$ to $q_{t+1}$ is governed by the transition probability matrix:

        s1     s2     s3
  s1    0      0      1
  s2    1/2    1/2    0
  s3    1/3    2/3    0
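To make the transition matrix concrete, here is a minimal Python/NumPy sketch (our own illustration, not part of the original slides) that samples a trajectory from this 3-state chain by repeatedly drawing the next state from the row of the current state:

```python
import numpy as np

# Transition matrix from the slide: row i gives p(next state | current state s_i)
T = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0],
              [1/3, 2/3, 0.0]])

def sample_chain(T, q0, steps, seed=0):
    """Sample a trajectory q_0, q_1, ..., q_steps (states as 0-based indices)."""
    rng = np.random.default_rng(seed)
    path = [q0]
    for _ in range(steps):
        path.append(int(rng.choice(len(T), p=T[path[-1]])))
    return path

print(sample_chain(T, q0=2, steps=5))  # start in s3 (index 2), as in the example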

Page 29: Markov Models

Markov Chain – Important property

• In a Markov chain, the joint distribution factorizes as

$$p(q_0, q_1, \dots, q_m) = p(q_0) \prod_{j=1}^{m} p(q_j \mid q_{j-1})$$

• Why? By the chain rule,

$$p(q_0, q_1, \dots, q_m) = \prod_{j} p(q_j \mid \text{previous states}) = p(q_0) \prod_{j=1}^{m} p(q_j \mid q_{j-1})$$

where the last step is due to the Markov property.
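To make the factorization concrete, a short Python sketch (ours, not from the slides; the initial distribution concentrated on s3 is an assumption for the example) that evaluates the joint probability of a path under the example chain:

```python
import numpy as np

T = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0],
              [1/3, 2/3, 0.0]])
pi0 = np.array([0.0, 0.0, 1.0])   # assumed: the chain always starts in s3

def path_probability(path, T, pi0):
    """p(q_0, ..., q_m) = p(q_0) * prod_j p(q_j | q_{j-1})."""
    p = pi0[path[0]]
    for prev, cur in zip(path, path[1:]):
        p *= T[prev, cur]
    return p

# p(s3, s2, s1, s3) = 1 * 2/3 * 1/2 * 1 = 1/3
print(path_probability([2, 1, 0, 2], T, pi0))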

Page 31: Markov Models

Markov Chain: e.g.

• The state-space of weather: {rain, cloud, wind}

Transition probabilities:

          Rain    Cloud   Wind
  Rain    1/2     0       1/2
  Cloud   1/3     0       2/3
  Wind    0       1       0

• Markov assumption: the weather on the (t+1)'th day depends only on the t'th day.
• We have observed the weather for a week:

  Day:   0      1      2      3      4
         rain   wind   cloud  rain   wind

→ this observed sequence is modeled as a Markov chain.
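Under the Markov assumption, the probability of this observed week is just a product of transitions (a hand computation of ours, assuming the chain starts in "rain" with probability 1):

$$p(\text{rain, wind, cloud, rain, wind}) = p(\text{rain}) \cdot p(\text{wind} \mid \text{rain}) \cdot p(\text{cloud} \mid \text{wind}) \cdot p(\text{rain} \mid \text{cloud}) \cdot p(\text{wind} \mid \text{rain}) = 1 \cdot \tfrac{1}{2} \cdot 1 \cdot \tfrac{1}{3} \cdot \tfrac{1}{2} = \tfrac{1}{12}$$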

Page 36: Markov Models

Contents

• Introduction
• Markov Chain
• Hidden Markov Models
  – Independence assumptions
  – Formal definition
  – Forward algorithm
  – Viterbi algorithm
  – Baum-Welch algorithm
• Markov Random Field

Page 37: Markov Models

Modeling pairs of sequences

• In many applications, we have to model pairs of sequences.
• Examples:
  – POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb, ...)
  – Speech recognition (map acoustic sequences to sequences of words)
  – Computational biology (recover gene boundaries in DNA sequences)
  – Video tracking (estimate the underlying model states from the observation sequences)
  – And many others...

Page 38: Markov Models

Probabilistic models for sequence pairs

• We have two sequences of random variables:
  X1, X2, ..., Xm and S1, S2, ..., Sm
• Intuitively, in a practical system, each Xi corresponds to an observation and each Si corresponds to the state that generated the observation.
• Let each Si be in {1, 2, ..., k} and each Xi be in {1, 2, ..., o}.
• How do we model the joint distribution

$$p(X_1 = x_1, \dots, X_m = x_m, S_1 = s_1, \dots, S_m = s_m)\,?$$

Page 39: Markov Models

Hidden Markov Models (HMMs)

• In HMMs, we assume that

$$p(X_1 = x_1, \dots, X_m = x_m, S_1 = s_1, \dots, S_m = s_m) = p(S_1 = s_1) \prod_{j=2}^{m} p(S_j = s_j \mid S_{j-1} = s_{j-1}) \prod_{j=1}^{m} p(X_j = x_j \mid S_j = s_j)$$

• This factorization is often called the independence assumptions of HMMs.
• We are going to derive it in the next slides.

Page 40: Markov Models

Independence Assumptions in HMMs [1]

• By the chain rule, the following equality is exact:

$$p(X_1 = x_1, \dots, X_m = x_m, S_1 = s_1, \dots, S_m = s_m) = p(S_1 = s_1, \dots, S_m = s_m) \times p(X_1 = x_1, \dots, X_m = x_m \mid S_1 = s_1, \dots, S_m = s_m)$$

(compare: $p(ABC) = p(A \mid BC)\, p(BC) = p(A \mid BC)\, p(B \mid C)\, p(C)$)

• Assumption 1: the state sequence forms a Markov chain:

$$p(S_1 = s_1, \dots, S_m = s_m) = p(S_1 = s_1) \prod_{j=2}^{m} p(S_j = s_j \mid S_{j-1} = s_{j-1})$$

Page 41: Markov Models

Independence Assumptions in HMMs [2]

• By the chain rule, the following equality is exact:

$$p(X_1 = x_1, \dots, X_m = x_m \mid S_1 = s_1, \dots, S_m = s_m) = \prod_{j=1}^{m} p(X_j = x_j \mid S_1 = s_1, \dots, S_m = s_m, X_1 = x_1, \dots, X_{j-1} = x_{j-1})$$

• Assumption 2: each observation depends only on the underlying state:

$$p(X_j = x_j \mid S_1 = s_1, \dots, S_m = s_m, X_1 = x_1, \dots, X_{j-1} = x_{j-1}) = p(X_j = x_j \mid S_j = s_j)$$

• These two assumptions are often called the independence assumptions of HMMs.

Page 42: Markov Models

The Model form for HMMs

• The model takes the following form:

$$p(x_1, \dots, x_m, s_1, \dots, s_m; \theta) = \pi(s_1) \prod_{j=2}^{m} t(s_j \mid s_{j-1}) \prod_{j=1}^{m} e(x_j \mid s_j)$$

• Parameters in the model:
  – $\pi(s)$: initial probabilities, for $s \in \{1, 2, \dots, k\}$
  – $t(s' \mid s)$: transition probabilities, for $s, s' \in \{1, 2, \dots, k\}$
  – $e(x \mid s)$: emission probabilities, for $s \in \{1, 2, \dots, k\}$ and $x \in \{1, 2, \dots, o\}$

Page 43: Markov Models

6 components of HMMs

• Discrete timesteps: 1, 2, ...
• Finite state space: si (N states)
• Events xi (M events)
• Vector of initial probabilities: $\Pi = \{\pi_i\} = \{p(q_1 = s_i)\}$
• Matrix of transition probabilities: $T = \{T_{ij}\} = \{p(q_{t+1} = s_j \mid q_t = s_i)\}$
• Matrix of emission probabilities: $E = \{E_{ij}\} = \{p(o_t = x_j \mid q_t = s_i)\}$

The observations at consecutive timesteps form an observation sequence o1, o2, ..., ot, where each $o_i \in \{x_1, x_2, \dots, x_M\}$.

Constraints:

$$\sum_{i=1}^{N} \pi_i = 1 \qquad \sum_{j=1}^{N} T_{ij} = 1 \qquad \sum_{j=1}^{M} E_{ij} = 1$$

Page 45: Markov Models

6 components of HMMs

• Given a specific HMM and an observation sequence, the corresponding sequence of states is generally not deterministic.
• Example: given the observation sequence x1, x3, x3, x2, the underlying states can be any of the following sequences:
  s1, s2, s1, s2
  s1, s2, s3, s2
  s1, s1, s1, s2
  ...

Page 46: Markov Models

Here’s an HMM

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.7

0.1

0.9 0.8

28/03/2011 Markov models 46

T s1 s2 s3

s1 0.5 0.5 0

s2 0.4 0 0.6

s3 0.2 0.8 0

E x1 x2 x3

s1 0.3 0 0.7

s2 0 0.1 0.9

s3 0.2 0 0.8

ππππ s1 s2 s3

0.3 0.3 0.4

Page 47: Markov Models

Here’s a HMM

• Start randomly in state 1, 2

or 3.

• Choose a output at each

state in random.

• Let’s generate a sequence

of observations:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.7

0.1

0.9 0.8

28/03/2011 Markov models 47

T s1 s2 s3

s1 0.5 0.5 0

s2 0.4 0 0.6

s3 0.2 0.8 0

E x1 x2 x3

s1 0.3 0 0.7

s2 0 0.1 0.9

s3 0.2 0 0.8

ππππ s1 s2 s3

0.3 0.3 0.4

q1 o1

q2 o2

q3 o3

0.3 - 0.3 - 0.4

randomply choice

between S1, S2, S3

Page 48: Markov Models

Here’s a HMM

• Start randomly in state 1, 2

or 3.

• Choose a output at each

state in random.

• Let’s generate a sequence

of observations:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.7

0.1

0.9 0.8

28/03/2011 Markov models 48

T s1 s2 s3

s1 0.5 0.5 0

s2 0.4 0 0.6

s3 0.2 0.8 0

E x1 x2 x3

s1 0.3 0 0.7

s2 0 0.1 0.9

s3 0.2 0 0.8

ππππ s1 s2 s3

0.3 0.3 0.4

q1 S3 o1

q2 o2

q3 o3

0.2 - 0.8

choice between X1

and X3

Page 49: Markov Models

Here’s a HMM

• Start randomly in state 1, 2

or 3.

• Choose a output at each

state in random.

• Let’s generate a sequence

of observations:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.7

0.1

0.9 0.8

28/03/2011 Markov models 49

T s1 s2 s3

s1 0.5 0.5 0

s2 0.4 0 0.6

s3 0.2 0.8 0

E x1 x2 x3

s1 0.3 0 0.7

s2 0 0.1 0.9

s3 0.2 0 0.8

ππππ s1 s2 s3

0.3 0.3 0.4

q1 S3 o1 X3

q2 o2

q3 o3

Go to S2 with

probability 0.8 or

S1 with prob. 0.2

Page 50: Markov Models

Here’s a HMM

• Start randomly in state 1, 2

or 3.

• Choose a output at each

state in random.

• Let’s generate a sequence

of observations:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.7

0.1

0.9 0.8

28/03/2011 Markov models 50

T s1 s2 s3

s1 0.5 0.5 0

s2 0.4 0 0.6

s3 0.2 0.8 0

E x1 x2 x3

s1 0.3 0 0.7

s2 0 0.1 0.9

s3 0.2 0 0.8

ππππ s1 s2 s3

0.3 0.3 0.4

q1 S3 o1 X3

q2 S1 o2

q3 o3

0.3 - 0.7

choice between X1

and X3

Page 51: Markov Models

Here’s a HMM

• Start randomly in state 1, 2

or 3.

• Choose a output at each

state in random.

• Let’s generate a sequence

of observations:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.7

0.1

0.9 0.8

28/03/2011 Markov models 51

T s1 s2 s3

s1 0.5 0.5 0

s2 0.4 0 0.6

s3 0.2 0.8 0

E x1 x2 x3

s1 0.3 0 0.7

s2 0 0.1 0.9

s3 0.2 0 0.8

ππππ s1 s2 s3

0.3 0.3 0.4

q1 S3 o1 X3

q2 S1 o2 X1

q3 o3

Go to S2 with

probability 0.5 or

S1 with prob. 0.5

Page 52: Markov Models

Here’s a HMM

• Start randomly in state 1, 2

or 3.

• Choose a output at each

state in random.

• Let’s generate a sequence

of observations:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.7

0.1

0.9 0.8

28/03/2011 Markov models 52

T s1 s2 s3

s1 0.5 0.5 0

s2 0.4 0 0.6

s3 0.2 0.8 0

E x1 x2 x3

s1 0.3 0 0.7

s2 0 0.1 0.9

s3 0.2 0 0.8

ππππ s1 s2 s3

0.3 0.3 0.4

q1 S3 o1 X3

q2 S1 o2 X1

q3 S1 o3

0.3 - 0.7

choice between X1

and X3

Page 53: Markov Models

Here’s a HMM

• Start randomly in state 1, 2

or 3.

• Choose a output at each

state in random.

• Let’s generate a sequence

of observations:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.7

0.1

0.9 0.8

We got a sequence

28/03/2011 Markov models 53

T s1 s2 s3

s1 0.5 0.5 0

s2 0.4 0 0.6

s3 0.2 0.8 0

E x1 x2 x3

s1 0.3 0 0.7

s2 0 0.1 0.9

s3 0.2 0 0.8

ππππ s1 s2 s3

0.3 0.3 0.4

q1 S3 o1 X3

q2 S1 o2 X1

q3 S1 o3 X3

We got a sequence

of states and

corresponding

observations!
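A minimal Python/NumPy sketch of this generative procedure (our own illustration; function and variable names are not from the slides), using the T, E, π tables above:

```python
import numpy as np

pi = np.array([0.3, 0.3, 0.4])
T = np.array([[0.5, 0.5, 0.0],
              [0.4, 0.0, 0.6],
              [0.2, 0.8, 0.0]])
E = np.array([[0.3, 0.0, 0.7],
              [0.0, 0.1, 0.9],
              [0.2, 0.0, 0.8]])

def generate(pi, T, E, length, seed=0):
    """Sample (states, observations) from the HMM, as in the walkthrough above."""
    rng = np.random.default_rng(seed)
    states, obs = [], []
    q = int(rng.choice(len(pi), p=pi))               # initial state ~ pi
    for _ in range(length):
        states.append(q)
        obs.append(int(rng.choice(E.shape[1], p=E[q])))  # emission ~ E[q]
        q = int(rng.choice(len(pi), p=T[q]))         # next state ~ T[q]
    return states, obs

# 0-based indices: state 2 = S3, observation 2 = X3, etc.
print(generate(pi, T, E, length=3))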

Page 54: Markov Models

Three famous HMM tasks

• Given an HMM Φ = (T, E, π), the three famous HMM tasks are:

• Probability of an observation sequence (state estimation)
  – Given: Φ, observation O = o1, o2, ..., ot
  – Goal: p(O|Φ), or equivalently p(st = Si|O)
  – i.e., calculate the probability of observing the sequence O, summed over all possible state sequences.

• Most likely explanation (inference)
  – Given: Φ, observation O = o1, o2, ..., ot
  – Goal: Q* = argmaxQ p(Q|O)
  – i.e., calculate the best corresponding state sequence, given an observation sequence.

• Learning the HMM
  – Given: observation O = o1, o2, ..., ot (and possibly the corresponding state sequence)
  – Goal: estimate the parameters of the HMM Φ = (T, E, π)
  – i.e., given one (or a set of) observation sequence(s), estimate the Transition matrix, Emission matrix and initial probabilities of the HMM.

Page 58: Markov Models

Three famous HMM tasks

  Problem                                        Algorithm           Complexity
  State estimation:  p(O|Φ)                      Forward             O(TN²)
  Inference:         Q* = argmaxQ p(Q|O)         Viterbi decoding    O(TN²)
  Learning:          Φ* = argmaxΦ p(O|Φ)         Baum-Welch (EM)     O(TN²)

  T: number of timesteps; N: number of states

Page 59: Markov Models

State estimation problem

• Given: Φ = (T, E, π), observation O = o1, o2, ..., ot
• Goal: what is p(o1 o2 ... ot)?
• We can do this in a slow, stupid way
  – as shown in the next slide...

Page 60: Markov Models

Here’s a HMM

• What is p(O) = p(o1o2o3)

= p(o1=X3 ∧ o2=X1 ∧ o3=X3)?

• Slow, stupid way:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.70.1

0.9 0.8

( ) ( )

( ) ( )paths of length 3

paths of length 3

|

Q

Q

p O p OQ

p QO Q p

=

= ∑

• How to compute p(Q) for an

arbitrary path Q?

• How to compute p(O|Q) for an

arbitrary path Q?

28/03/2011 Markov models 60

paths of length 3Q∈

Page 61: Markov Models

Here’s a HMM

• What is p(O) = p(o1o2o3)

= p(o1=X3 ∧ o2=X1 ∧ o3=X3)?

• Slow, stupid way:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.70.1

0.9 0.8

( ) ( )

( ) ( )paths of length 3

paths of length 3

|

Q

Q

p O p OQ

p QO Q p

=

= ∑

∑ππππ s1 s2 s3

• How to compute p(Q) for an

arbitrary path Q?

• How to compute p(O|Q) for an

arbitrary path Q?

28/03/2011 Markov models 61

paths of length 3Q∈

p(Q) = p(q1q2q3)

= p(q1)p(q2|q1)p(q3|q2,q1) (chain)

= p(q1)p(q2|q1)p(q3|q2) (why?)

Example in the case Q=S3S1S1

P(Q) = 0.4 * 0.2 * 0.5 = 0.04

0.3 0.3 0.4

Page 62: Markov Models

Here’s a HMM

• What is p(O) = p(o1o2o3)

= p(o1=X3 ∧ o2=X1 ∧ o3=X3)?

• Slow, stupid way:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.70.1

0.9 0.8

( ) ( )

( ) ( )paths of length 3

paths of length 3

|

Q

Q

p O p OQ

p QO Q p

=

= ∑

∑ππππ s1 s2 s3

• How to compute p(Q) for an

arbitrary path Q?

• How to compute p(O|Q) for an

arbitrary path Q?

28/03/2011 Markov models 62

paths of length 3Q∈

p(O|Q) = p(o1o2o3|q1q2q3)

= p(o1|q1)p(o2|q1)p(o3|q3) (why?)

Example in the case Q=S3S1S1

P(O|Q) = p(X3|S3)p(X1|S1) p(X3|S1)

=0.8 * 0.3 * 0.7 = 0.168

0.3 0.3 0.4

Page 63: Markov Models

Here’s a HMM

• What is p(O) = p(o1o2o3)

= p(o1=X3 ∧ o2=X1 ∧ o3=X3)?

• Slow, stupid way:

s1 s2 s3

x1 x2 x3

0.5

0.4

0.5

0.2

0.6

0.8

0.30.2

0.70.1

0.9 0.8

( ) ( )

( ) ( )paths of length 3

paths of length 3

|

Q

Q

p O p OQ

p QO Q p

=

= ∑

∑ππππ s1 s2 s3

• How to compute p(Q) for an

arbitrary path Q?

• How to compute p(O|Q) for an

arbitrary path Q?

28/03/2011 Markov models 63

paths of length 3Q∈

p(O|Q) = p(o1o2o3|q1q2q3)

= p(o1|q1)p(o2|q1)p(o3|q3) (why?)

Example in the case Q=S3S1S1

P(O|Q) = p(X3|S3)p(X1|S1) p(X3|S1)

=0.8 * 0.3 * 0.7 = 0.168

0.3 0.3 0.4

p(O) needs 27 p(Q)

computations and 27

p(O|Q) computations.

What if the sequence has

20 observations?So let’s be smarter...

Page 64: Markov Models

The Forward algorithm

• Given observation o1 o2 ... oT.
• Forward probabilities:

$$\alpha_t(i) = p(o_1 o_2 \dots o_t \wedge q_t = s_i \mid \Phi), \quad 1 \le t \le T$$

• αt(i) is the probability that, in a random trial:
  – we'd have seen the first t observations, and
  – we'd have ended up in si as the t'th state visited.
• In our example, what is α2(3)?

Page 65: Markov Models

αt(i): easy to define recursively

Notation: $\Pi = \{\pi_i = p(q_1 = s_i)\}$, $T = \{T_{ij} = p(q_{t+1} = s_j \mid q_t = s_i)\}$, $E = \{E_{ij} = p(o_t = x_j \mid q_t = s_i)\}$.

Base case:

$$\alpha_1(i) = p(o_1 \wedge q_1 = s_i) = p(q_1 = s_i)\, p(o_1 \mid q_1 = s_i) = \pi_i\, E_i(o_1)$$

Recursive case:

$$\alpha_{t+1}(i) = p(o_1 o_2 \dots o_{t+1} \wedge q_{t+1} = s_i) = \sum_{j=1}^{N} p(o_1 o_2 \dots o_t \wedge q_t = s_j \wedge o_{t+1} \wedge q_{t+1} = s_i)$$
$$= \sum_{j=1}^{N} p(o_{t+1} \mid q_{t+1} = s_i)\, p(q_{t+1} = s_i \mid q_t = s_j)\, \alpha_t(j) = E_i(o_{t+1}) \sum_{j=1}^{N} T_{ji}\, \alpha_t(j)$$

Page 66: Markov Models

In our example

$$\alpha_t(i) = p(o_1 o_2 \dots o_t \wedge q_t = s_i \mid \Phi) \qquad \alpha_1(i) = \pi_i\, E_i(o_1) \qquad \alpha_{t+1}(i) = E_i(o_{t+1}) \sum_j T_{ji}\, \alpha_t(j)$$

We observed: x1 x2

  α1(1) = 0.3 * 0.3 = 0.09
  α1(2) = 0.3 * 0   = 0
  α1(3) = 0.4 * 0.2 = 0.08

  α2(1) = 0   * (0.09*0.5 + 0*0.4 + 0.08*0.2) = 0
  α2(2) = 0.1 * (0.09*0.5 + 0*0   + 0.08*0.8) = 0.0109
  α2(3) = 0   * (0.09*0   + 0*0.6 + 0.08*0)   = 0
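The same computation as a short Python/NumPy sketch (our own illustration, reusing the T, E, π tables from the example HMM):

```python
import numpy as np

pi = np.array([0.3, 0.3, 0.4])
T = np.array([[0.5, 0.5, 0.0],
              [0.4, 0.0, 0.6],
              [0.2, 0.8, 0.0]])
E = np.array([[0.3, 0.0, 0.7],
              [0.0, 0.1, 0.9],
              [0.2, 0.0, 0.8]])

def forward(obs, pi, T, E):
    """alpha[t, i] = p(o_1..o_t, q_t = s_i); obs holds 0-based observation indices."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * E[:, obs[0]]                      # base case: pi_i * E_i(o_1)
    for t in range(1, len(obs)):
        alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ T)  # E_i(o_{t+1}) * sum_j alpha_t(j) T_ji
    return alpha

alpha = forward([0, 1], pi, T, E)   # observations x1, x2
print(alpha)                        # [[0.09, 0, 0.08], [0, 0.0109, 0]]
print(alpha[-1].sum())              # p(o_1 o_2) = 0.0109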

Page 67: Markov Models

Forward probabilities - Trellis

The αt(i) values can be arranged in a trellis: the states s1, ..., sN on the vertical axis, the timesteps t = 1, ..., T on the horizontal axis; each node (t, i) holds αt(i).

Base case (the first column):

$$\alpha_1(i) = \pi_i\, E_i(o_1)$$

Recursive case (each column is computed from the previous one):

$$\alpha_{t+1}(i) = E_i(o_{t+1}) \sum_{j} T_{ji}\, \alpha_t(j)$$

Page 71: Markov Models

Forward probabilities

• So, we can cheaply compute

$$\alpha_t(i) = p(o_1 o_2 \dots o_t \wedge q_t = s_i)$$

• How can we cheaply compute the following? Look back at the trellis...

$$p(o_1 o_2 \dots o_t) = \sum_{i} \alpha_t(i)$$

$$p(q_t = s_i \mid o_1 o_2 \dots o_t) = \frac{\alpha_t(i)}{\sum_{j} \alpha_t(j)}$$

Page 73: Markov Models

State estimation problem

• State estimation is solved:

$$p(O \mid \Phi) = p(o_1 o_2 \dots o_t) = \sum_{i=1}^{N} \alpha_t(i)$$

• Can we utilize the elegant trellis to solve the inference problem?
  – Given an observation sequence O, find the best state sequence:

$$Q^* = \arg\max_{Q} p(Q \mid O)$$

Page 74: Markov Models

Inference problem

• Given: Φ = (T, E, π), observation O = o1, o2, ..., ot
• Goal: find

$$Q^* = \arg\max_{Q} p(Q \mid O) = \arg\max_{q_1 q_2 \dots q_t} p(q_1 q_2 \dots q_t \mid o_1 o_2 \dots o_t)$$

• Practical problems:
  – Speech recognition: given an utterance (sound), what is the best sentence (text) that matches the utterance?
  – Video tracking
  – POS tagging

Page 75: Markov Models

Inference problem

• We can do this in a slow, stupid way:

$$Q^* = \arg\max_{Q} p(Q \mid O) = \arg\max_{Q} \frac{p(O \mid Q)\, p(Q)}{p(O)} = \arg\max_{Q} p(O \mid Q)\, p(Q) = \arg\max_{Q} p(o_1 o_2 \dots o_t \mid Q)\, p(Q)$$

• But it's better if we can find another way to compute the most probable path (MPP)...

Page 76: Markov Models

Efficient MPP computation

• We are going to compute the following variables:

$$\delta_t(i) = \max_{q_1 q_2 \dots q_{t-1}} p(q_1 q_2 \dots q_{t-1} \wedge q_t = s_i \wedge o_1 o_2 \dots o_t)$$

• δt(i) is the probability of the best path of length t−1 which ends up in si and emits o1 ... ot.
• Define: mppt(i) = that path, so δt(i) = p(mppt(i)).

Page 77: Markov Models

Viterbi algorithm

$$\delta_t(i) = \max_{q_1 q_2 \dots q_{t-1}} p(q_1 q_2 \dots q_{t-1} \wedge q_t = s_i \wedge o_1 o_2 \dots o_t)$$
$$mpp_t(i) = \arg\max_{q_1 q_2 \dots q_{t-1}} p(q_1 q_2 \dots q_{t-1} \wedge q_t = s_i \wedge o_1 o_2 \dots o_t)$$

Base case (one choice):

$$\delta_1(i) = p(q_1 = s_i \wedge o_1) = \pi_i\, E_i(o_1) = \alpha_1(i)$$

(As with the forward algorithm, the δt(i) values live on a trellis of states × timesteps.)

Page 78: Markov Models

Viterbi algorithm

• The most probable path whose last two states are si sj is the most probable path to si, followed by the transition si → sj.
• The probability of that path will be:

$$\delta_t(i) \times p(s_i \to s_j \wedge o_{t+1}) = \delta_t(i)\, T_{ij}\, E_j(o_{t+1})$$

• So, the best previous state at time t is:

$$i^* = \arg\max_{i}\ \delta_t(i)\, T_{ij}\, E_j(o_{t+1})$$

Page 79: Markov Models

Viterbi algorithm

• Summary:

$$\delta_1(i) = \pi_i\, E_i(o_1) = \alpha_1(i)$$
$$\delta_{t+1}(j) = \max_{i}\ \delta_t(i)\, T_{ij}\, E_j(o_{t+1})$$
$$i^* = \arg\max_{i}\ \delta_t(i)\, T_{ij}\, E_j(o_{t+1})$$
$$mpp_{t+1}(j) = mpp_t(i^*)\ s_{i^*}$$
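A compact Python/NumPy sketch of Viterbi decoding (our own illustration, again with the example HMM's parameters; it returns the most probable state path and its joint probability with the observations):

```python
import numpy as np

pi = np.array([0.3, 0.3, 0.4])
T = np.array([[0.5, 0.5, 0.0],
              [0.4, 0.0, 0.6],
              [0.2, 0.8, 0.0]])
E = np.array([[0.3, 0.0, 0.7],
              [0.0, 0.1, 0.9],
              [0.2, 0.0, 0.8]])

def viterbi(obs, pi, T, E):
    """Most probable state path for obs (0-based indices) and its probability."""
    n, N = len(obs), len(pi)
    delta = np.zeros((n, N))
    back = np.zeros((n, N), dtype=int)           # backpointers to the best previous state
    delta[0] = pi * E[:, obs[0]]                 # delta_1(i) = pi_i * E_i(o_1)
    for t in range(1, n):
        scores = delta[t - 1][:, None] * T       # scores[i, j] = delta_t(i) * T_ij
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * E[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(n - 1, 0, -1):                # follow the backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta[-1].max())

print(viterbi([2, 0, 2], pi, T, E))  # observations X3, X1, X3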

Page 80: Markov Models

What’s Viterbi used for?

• Speech Recognition

28/03/2011 Markov models 80

Chong, Jike and Yi, Youngmin and Faria, Arlo and Satish, Nadathur Rajagopalan and Keutzer, Kurt, “Data-Parallel Large Vocabulary

Continuous Speech Recognition on Graphics Processors”, EECS Department, University of California, Berkeley, 2008.

Page 81: Markov Models

Training HMMs

• Given: a long sequence of observations o1 o2 ... oT and the number of states N.
• Goal: estimation of the parameters Φ = ⟨T, E, π⟩.
• That is, how to design an HMM.
• We will infer the model from a large amount of data o1 o2 ... oT with a big "T".

Page 82: Markov Models

Training HMMs

• Remember, we have just computed p(o1 o2 ... oT | Φ).
• Now, we have some observations and we want to infer Φ from them.
• So, we could use:
  – MAX LIKELIHOOD: $\Phi^* = \arg\max_{\Phi} p(o_1 \dots o_T \mid \Phi)$
  – BAYES: compute $p(\Phi \mid o_1 \dots o_T)$, then take $E[\Phi]$ or $\max_{\Phi} p(\Phi \mid o_1 \dots o_T)$

Page 83: Markov Models

Max likelihood for HMMs

• Forward probability: the probability of producing o1 ... ot while ending up in state si:

$$\alpha_t(i) = p(o_1 o_2 \dots o_t \wedge q_t = s_i) \qquad \alpha_1(i) = \pi_i\, E_i(o_1) \qquad \alpha_{t+1}(i) = E_i(o_{t+1}) \sum_{j} T_{ji}\, \alpha_t(j)$$

• Backward probability: the probability of producing ot+1 ... oT given that at time t we are in state si:

$$\beta_t(i) = p(o_{t+1} o_{t+2} \dots o_T \mid q_t = s_i)$$

Page 84: Markov Models

Max likelihood for HMMs - Backward

• Backward probability: easy to define recursively.

Base case:

$$\beta_T(i) = 1$$

Recursive case:

$$\beta_t(i) = p(o_{t+1} o_{t+2} \dots o_T \mid q_t = s_i) = \sum_{j=1}^{N} p(o_{t+1} o_{t+2} \dots o_T \wedge q_{t+1} = s_j \mid q_t = s_i)$$
$$= \sum_{j=1}^{N} p(o_{t+1} \mid q_{t+1} = s_j)\, p(q_{t+1} = s_j \mid q_t = s_i)\, p(o_{t+2} \dots o_T \mid q_{t+1} = s_j) = \sum_{j=1}^{N} T_{ij}\, E_j(o_{t+1})\, \beta_{t+1}(j)$$
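The mirror image of the forward sketch, again in Python/NumPy (our own illustration; the trellis is filled right to left):

```python
import numpy as np

T = np.array([[0.5, 0.5, 0.0],
              [0.4, 0.0, 0.6],
              [0.2, 0.8, 0.0]])
E = np.array([[0.3, 0.0, 0.7],
              [0.0, 0.1, 0.9],
              [0.2, 0.0, 0.8]])

def backward(obs, T, E):
    """beta[t, i] = p(o_{t+1} ... o_T | q_t = s_i); obs holds 0-based observation indices."""
    n, N = len(obs), T.shape[0]
    beta = np.ones((n, N))                              # base case: beta_T(i) = 1
    for t in range(n - 2, -1, -1):
        beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])  # sum_j T_ij E_j(o_{t+1}) beta_{t+1}(j)
    return beta

print(backward([2, 0, 2], T, E))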

Page 85: Markov Models

Max likelihood for HMMs

• The probability of traversing a certain arc at time t, given o1 o2 ... oT:

$$\varepsilon_{ij}(t) = p(q_t = s_i \wedge q_{t+1} = s_j \mid o_1 o_2 \dots o_T) = \frac{p(q_t = s_i \wedge q_{t+1} = s_j \wedge o_1 o_2 \dots o_T)}{p(o_1 o_2 \dots o_T)}$$

Expanding the numerator with the independence assumptions and the denominator over the states:

$$\varepsilon_{ij}(t) = \frac{\alpha_t(i)\, T_{ij}\, E_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}$$

Page 86: Markov Models

Max likelihood for HMMs

• The probability of being in state si at time t, given o1 o2 ... oT:

$$\gamma_i(t) = p(q_t = s_i \mid o_1 o_2 \dots o_T) = \sum_{j=1}^{N} p(q_t = s_i \wedge q_{t+1} = s_j \mid o_1 o_2 \dots o_T) = \sum_{j=1}^{N} \varepsilon_{ij}(t)$$

Page 87: Markov Models

Max likelihood for HMMs

• Sum over the time index:
  – Expected # of transitions from state i to j in o1 o2 ... oT:

$$\sum_{t=1}^{T-1} \varepsilon_{ij}(t)$$

  – Expected # of transitions from state i in o1 o2 ... oT:

$$\sum_{t=1}^{T-1} \gamma_i(t) = \sum_{t=1}^{T-1} \sum_{j=1}^{N} \varepsilon_{ij}(t)$$

Page 88: Markov Models

Update parameters

$$\hat{\pi}_i = \text{expected frequency in state } i \text{ at time } t = 1 = \gamma_i(1)$$

$$\hat{T}_{ij} = \frac{\text{expected \# of transitions from state } i \text{ to } j}{\text{expected \# of transitions from state } i} = \frac{\sum_{t=1}^{T-1} \varepsilon_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}$$

$$\hat{E}_{ik} = \frac{\text{expected \# of times in state } i \text{ with } x_k \text{ observed}}{\text{expected \# of times in state } i} = \frac{\sum_{t=1}^{T} \delta_{o_t, x_k}\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$$

(Recall the parameter definitions: $\pi_i = p(q_1 = s_i)$, $T_{ij} = p(q_{t+1} = s_j \mid q_t = s_i)$, $E_{ij} = p(o_t = x_j \mid q_t = s_i)$.)

Page 89: Markov Models

The inner loop of Forward-Backward

Given an input sequence:

1. Calculate forward probabilities:
   – Base case: $\alpha_1(i) = \pi_i\, E_i(o_1)$
   – Recursive case: $\alpha_{t+1}(i) = E_i(o_{t+1}) \sum_j T_{ji}\, \alpha_t(j)$
2. Calculate backward probabilities:
   – Base case: $\beta_T(i) = 1$
   – Recursive case: $\beta_t(i) = \sum_j T_{ij}\, E_j(o_{t+1})\, \beta_{t+1}(j)$
3. Calculate expected counts: $\varepsilon_{ij}(t) = \dfrac{\alpha_t(i)\, T_{ij}\, E_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_i \alpha_t(i)\, \beta_t(i)}$
4. Update parameters:

$$\hat{T}_{ij} = \frac{\sum_{t=1}^{T-1} \varepsilon_{ij}(t)}{\sum_{t=1}^{T-1} \sum_{j'} \varepsilon_{ij'}(t)} \qquad \hat{E}_{ik} = \frac{\sum_{t=1}^{T} \delta_{o_t, x_k}\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$$
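Putting steps 1-4 together, a single-iteration Baum-Welch sketch in Python/NumPy (our own illustration: one observation sequence only, no smoothing, and it assumes every state has a nonzero expected count so the divisions are safe):

```python
import numpy as np

def em_step(obs, pi, T, E):
    """One Baum-Welch (EM) iteration; obs holds 0-based observation indices."""
    obs = np.asarray(obs)
    n, N = len(obs), len(pi)
    # 1. forward
    alpha = np.zeros((n, N)); alpha[0] = pi * E[:, obs[0]]
    for t in range(1, n):
        alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ T)
    # 2. backward
    beta = np.ones((n, N))
    for t in range(n - 2, -1, -1):
        beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])
    # 3. expected counts
    p_obs = alpha[-1].sum()                       # p(o_1 ... o_T)
    gamma = alpha * beta / p_obs                  # gamma[t, i] = p(q_t = s_i | O)
    eps = (alpha[:-1, :, None] * T[None] *        # eps[t, i, j] = epsilon_ij(t)
           (E[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs
    # 4. update parameters
    new_pi = gamma[0]
    new_T = eps.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_E = np.zeros_like(E)
    for k in range(E.shape[1]):                   # sum gamma over {t : o_t = x_k}
        new_E[:, k] = gamma[obs == k].sum(axis=0)
    new_E /= gamma.sum(axis=0)[:, None]
    return new_pi, new_T, new_E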

Page 90: Markov Models

Forward-Backward: EM for HMM

• If we knew Φ, we could estimate expectations of quantities such as
  – expected number of times in state i
  – expected number of transitions i → j
• If we knew quantities such as
  – expected number of times in state i
  – expected number of transitions i → j
  we could compute the max likelihood estimate of Φ = ⟨T, E, Π⟩.
• Also known (for the HMM case) as the Baum-Welch algorithm.

Page 91: Markov Models

EM for HMM

• Each iteration provides values for all the parameters.
• The new model always improves the likelihood of the training data:

$$p(o_1 o_2 \dots o_T \mid \hat{\Phi}) \ge p(o_1 o_2 \dots o_T \mid \Phi)$$

• The algorithm does not guarantee reaching the global maximum.

Page 92: Markov Models

EM for HMM

• Bad news
  – There are lots of local maxima.
• Good news
  – The local maxima are usually adequate models of the data.
• Notice
  – EM does not estimate the number of states. That must be given (tradeoffs).
  – Often, HMMs are forced to have some links with zero probability. This is done by setting Tij = 0 in the initial estimate Φ(0).
  – Easy extension of everything seen today: HMMs with real-valued outputs.

Page 93: Markov Models

Contents

• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field (from the viewpoint of classification)

Page 94: Markov Models

Example: Image segmentation

• Observations: pixel values.
• Hidden variables: the class of each pixel.
• It's reasonable to think that there are some underlying relationships between neighbouring pixels... Can we use Markov models?
• Errr... the relationships are in 2D!

Page 95: Markov Models

MRF as a 2D generalization of MC

• Array of observations: $X = \{x_{ij}\},\ 0 \le i < N_x,\ 0 \le j < N_y$
• Classes/States: $S = \{s_{ij}\},\ s_{ij} \in \{1, \dots, M\}$
• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that

$$p(S \mid X) \propto p(X \mid S)\, p(S) \text{ is maximum.}$$

Page 96: Markov Models

2D context-dependent classification

• Assumptions:
  – The values of the elements in S are mutually dependent.
  – The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood Nij is defined so that:
  – sij ∉ Nij: the (i, j) element does not belong to its own set of neighbors.
  – sij ∈ Nkl ⇔ skl ∈ Nij: if sij is a neighbor of skl, then skl is also a neighbor of sij.

Page 97: Markov Models

2D context-dependent classification

• The Markov property for the 2D case:

$$p(s_{ij} \mid \bar{S}_{ij}) = p(s_{ij} \mid N_{ij})$$

where $\bar{S}_{ij}$ includes all the elements of S except the (i, j) one.

• The elegant dynamic programming is not applicable: the problem is much harder now!
• We are going to see an application of MRF to image segmentation and restoration.

Page 99: Markov Models

MRF for Image Segmentation

• Cliques: sets of pixels which are neighbors of each other (w.r.t. the type of neighborhood).

Page 100: Markov Models

MRF for Image Segmentation

• Dual lattice
• Line process
(figures illustrating the dual lattice and the line process between pixel sites)

Page 101: Markov Models

MRF for Image Segmentation

• Gibbs distribution:

$$\pi(s) = \frac{1}{Z} \exp\left(-\frac{U(s)}{T}\right)$$

  – Z: normalizing constant
  – T: a parameter (the "temperature")
  – U(s): the energy of configuration s
• It turns out that the Gibbs distribution implies an MRF ([Geman 84]).

Page 102: Markov Models

MRF for Image Segmentation

• A Gibbs conditional probability is of the form:

$$p(s_{ij} \mid N_{ij}) = \frac{1}{Z} \exp\left(-\frac{1}{T} \sum_{k} F_k\big(C_k(i, j)\big)\right)$$

  – Ck(i, j): the cliques of the pixel (i, j)
  – Fk: some functions of the clique values, e.g. a nearest-neighbor potential such as

$$\frac{1}{T}\, s_{ij} \big(\alpha_1 (s_{i-1,j} + s_{i+1,j}) + \alpha_2 (s_{i,j-1} + s_{i,j+1})\big)$$
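An Ising-like toy version of this conditional in Python/NumPy (our own illustration: the ±1 state coding, the sign convention and the α1, α2 weights are assumptions, not the slides'):

```python
import numpy as np

def gibbs_conditional(S, i, j, alpha1=1.0, alpha2=1.0, T=1.0):
    """p(s_ij = v | neighbors) for binary states v in {-1, +1} under a
    nearest-neighbor clique potential like the example above."""
    horiz = S[i - 1, j] + S[i + 1, j]
    vert = S[i, j - 1] + S[i, j + 1]
    energy = lambda v: -v * (alpha1 * horiz + alpha2 * vert)   # clique energy of value v
    w = np.array([np.exp(-energy(v) / T) for v in (-1.0, 1.0)])
    return w / w.sum()   # normalizing over the two candidate values plays the role of Z

S = np.sign(np.random.default_rng(0).standard_normal((5, 5)))  # a random ±1 field
print(gibbs_conditional(S, 2, 2))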

Page 103: Markov Models

MRF for Image Segmentation

• Then the joint probability for the Gibbs model is

$$p(S) = \frac{1}{Z} \exp\left(-\frac{1}{T} \sum_{(i,j)} \sum_{k} F_k\big(C_k(i, j)\big)\right)$$

  – The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X|S).
• Then p(X|S) p(S) can be maximized... [Geman 84]

Page 104: Markov Models

More on Markov models...

• MRF does not stop there... Here are some related models:
  – Conditional Random Fields (CRF)
  – Graphical models
  – ...
• Markov Chains and HMMs do not stop there either:
  – Markov chains of order m
  – Continuous-time Markov chains
  – Real-valued observations
  – ...

Page 105: Markov Models

What you should know

• Markov property, Markov Chain
• HMM:
  – Defining and computing αt(i)
  – Viterbi algorithm
  – Outline of the EM algorithm for HMM
• Markov Random Field
  – And an application in image segmentation
  – See [Geman 84] for more information.

Page 106: Markov Models

Q & A


Page 107: Markov Models

References

• L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Andrew W. Moore, "Hidden Markov Models", http://www.autonlab.org/tutorials/
• S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.