markov models
DESCRIPTION
A presentation on Markov chains, hidden Markov models, and Markov random fields, with the needed algorithms and detailed explanations.

TRANSCRIPT
PATTERN RECOGNITION
Markov models
Vu PHAM
Department of Computer Science
March 28th, 2011
28/03/2011 Markov models 1
Contents
• Introduction
– Introduction
– Motivation
• Markov Chain
• Hidden Markov Models
• Markov Random Field
28/03/2011 Markov models 2
Introduction
• Markov processes were first proposed by the
Russian mathematician Andrei Markov
– He used these processes to investigate
letter sequences in Pushkin's poetry.
• Nowadays, the Markov property and HMMs are
widely used in many domains:
– Natural Language Processing
– Speech Recognition
– Bioinformatics
– Image/video processing
– ...
28/03/2011 Markov models 3
Motivation [0]
• As shown in his paper in 1906, Markov's original
motivation was purely mathematical:
– Application of the Weak Law of Large Numbers to dependent
random variables.
• However, we shall not follow this motivation...
28/03/2011 Markov models 4
Motivation [1]
• From the viewpoint of classification:
– Context-free classification: Bayes classifier
      p(ωi | x) > p(ωj | x)   ∀ j ≠ i
• Classes are independent.
• Feature vectors are independent.
– However, there are some applications where various
classes are closely related:
• POS tagging, tracking, gene boundary recovery...
28/03/2011 Markov models 5
Motivation [1]
• Context-dependent classification:
– s1, s2, ..., sm: a sequence of m feature vectors
– ω1, ω2, ..., ωm: the classes into which these vectors are classified, each ωi ∈ {1, ..., k}
[Figure: the sequence s1 → s2 → s3 → ... → sm]
• To apply the Bayes classifier:
– X = s1s2...sm: extended feature vector
– Ωi = ωi1, ωi2, ..., ωim: one possible classification; there are k^m possible classifications
      p(Ωi | X) > p(Ωj | X)   ∀ j ≠ i
      p(X | Ωi) p(Ωi) > p(X | Ωj) p(Ωj)   ∀ j ≠ i
28/03/2011 Markov models 8
Motivation [2]
• From a general view, sometimes we want to evaluate the joint
distribution of a sequence of dependent random variables:
Hôm nay mùng tám tháng ba
Chị em phụ nữ đi ra đi vào...
("Today is the eighth of March / The women keep walking out and in...")
[Figure: the words assigned to variables q1 = Hôm, q2 = nay, q3 = mùng, ..., qm = vào]
• What is p(Hôm nay ... vào) = p(q1=Hôm, q2=nay, ..., qm=vào)?
      p(sm | s1 s2 ... sm-1) = p(s1 s2 ... sm-1 sm) / p(s1 s2 ... sm-1)
28/03/2011 Markov models 11
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field
28/03/2011 Markov models 15
Markov Chain
• Has N states, called s1, s2, ..., sN
• There are discrete timesteps, t=0,
t=1,...
• On the t'th timestep the system is in
exactly one of the available states.
Call it qt ∈ {s1, s2, ..., sN}
• Between each timestep, the next
state is chosen randomly.
• The current state determines the
probability distribution for the next state.
– Often notated with arcs between states
[Figure: three states s1, s2, s3 with transition arcs labelled 1, 1/2, 1/2, 1/3, 2/3; here N = 3, t = 1, and the current state is qt = q1 = s2]
• For this chain, the transition probabilities are:
      p(qt+1 = s1 | qt = s1) = 0      p(qt+1 = s1 | qt = s2) = 1/2    p(qt+1 = s1 | qt = s3) = 1/3
      p(qt+1 = s2 | qt = s1) = 0      p(qt+1 = s2 | qt = s2) = 1/2    p(qt+1 = s2 | qt = s3) = 2/3
      p(qt+1 = s3 | qt = s1) = 1      p(qt+1 = s3 | qt = s2) = 0      p(qt+1 = s3 | qt = s3) = 0
28/03/2011 Markov models 16
Markov Property
• qt+1 is conditionally independent of
qt-1, qt-2,..., q0 given qt.
• In other words:
      p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt)
The state at timestep t+1 depends
only on the state at timestep t
• A Markov chain of order m (m finite): the state at
timestep t+1 depends on the past m states:
      p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt, qt-1, ..., qt-m+1)
• How to represent the joint
distribution of (q0, q1, q2...) using
graphical models?
[Figure: the chain drawn as a directed graphical model q0 → q1 → q2 → q3]
28/03/2011 Markov models 20
Markov chain
• So, the chain of qt is called a Markov chain
• Each qt takes a value from the countable state-space {s1, s2, s3, ...}
• Each qt is observed at a discrete timestep t
• qt satisfies the Markov property:
      p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt)
• The transition from qt to qt+1 is drawn from the transition
probability matrix
[Figure: graphical model q0 → q1 → q2 → q3, and the three states s1, s2, s3 with arcs]
Transition probabilities:
          s1     s2     s3
    s1    0      0      1
    s2    1/2    1/2    0
    s3    1/3    2/3    0
28/03/2011 Markov models 25
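The sampling process described above (the next state is drawn from the row of the current state) can be sketched in a few lines. This is an illustrative Python snippet, not from the slides; states are 0-indexed (our convention) and the transition matrix is the one on this slide:

```python
import random

# Transition matrix from the slide: rows = current state, columns = next state.
T = [[0.0, 0.0, 1.0],   # s1 -> s3 with probability 1
     [0.5, 0.5, 0.0],   # s2 -> s1 or s2, each with probability 1/2
     [1/3, 2/3, 0.0]]   # s3 -> s1 (1/3) or s2 (2/3)

def sample_chain(T, q0, steps, rng=random.Random(0)):
    """Sample a state sequence q0, q1, ..., q_steps from the chain."""
    states = [q0]
    for _ in range(steps):
        current = states[-1]
        # The next state depends only on the current one (Markov property).
        nxt = rng.choices(range(len(T)), weights=T[current])[0]
        states.append(nxt)
    return states

path = sample_chain(T, q0=2, steps=10)  # start in s3 (0-indexed state 2)
```

Every transition in a sampled path necessarily has nonzero probability under T, since zero-weight successors are never drawn.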
Markov Chain – Important property
• In a Markov chain, the joint distribution is
      p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | qj-1)
• Why?
      p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | previous states)
                         = p(q0) ∏_{j=1}^{m} p(qj | qj-1)
Due to the Markov property
28/03/2011 Markov models 29
Markov Chain: e.g.
• The state-space of weather: {rain, cloud, wind}
[Figure: three states rain, cloud, wind with transition arcs labelled 1/2, 1/2, 1/3, 2/3, 1]
          Rain    Cloud    Wind
  Rain    1/2     0        1/2
  Cloud   1/3     0        2/3
  Wind    0       1        0
• Markov assumption: the weather on the (t+1)'th day
depends only on the t'th day.
• We have observed the weather in a week:
  Day:    0      1      2      3      4
          rain   wind   cloud  rain   wind
28/03/2011 Markov models 31
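The probability of an observed sequence like this week of weather factors into one-step transitions, as the important-property slide showed. A minimal Python sketch, under the assumption that the observed week reads rain, wind, cloud, rain, wind, and with our own encoding rain=0, cloud=1, wind=2 (the result is conditional on day 0, since the slide gives no initial distribution):

```python
# Weather chain from the slide; state order: rain=0, cloud=1, wind=2.
T = [[0.5, 0.0, 0.5],   # rain  -> rain (1/2) or wind (1/2)
     [1/3, 0.0, 2/3],   # cloud -> rain (1/3) or wind (2/3)
     [0.0, 1.0, 0.0]]   # wind  -> cloud (probability 1)

def chain_probability(T, states):
    """p(q1, ..., qm | q0): a product of one-step transition probabilities."""
    p = 1.0
    for a, b in zip(states, states[1:]):
        p *= T[a][b]
    return p

# The assumed observed week: rain, wind, cloud, rain, wind (days 0..4).
week = [0, 2, 1, 0, 2]
p = chain_probability(T, week)  # 0.5 * 1.0 * (1/3) * 0.5
```

Note that any week containing a rain-to-cloud step would get probability 0, since that arc is absent from the chain.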
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
– Independent assumptions
– Formal definition
– Forward algorithm
– Viterbi algorithm
– Baum-Welch algorithm
• Markov Random Field
28/03/2011 Markov models 36
Modeling pairs of sequences
• In many applications, we have to model pairs of sequences
• Examples:
– POS tagging in Natural Language Processing (assign each word in a
sentence to Noun, Adj, Verb...)
– Speech recognition (map acoustic sequences to sequences of words)
– Computational biology (recover gene boundaries in DNA sequences)
– Video tracking (estimate the underlying model states from the observation
sequences)
– And many others...
28/03/2011 Markov models 37
Probabilistic models for sequence pairs
• We have two sequences of random variables:
X1, X2, ..., Xm and S1, S2, ..., Sm
• Intuitively, in a practical system, each Xi corresponds to an observation
and each Si corresponds to a state that generated the observation.
• Let each Si be in {1, 2, ..., k} and each Xi be in {1, 2, ..., o}
• How do we model the joint distribution:
28/03/2011 Markov models 38
      p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
Hidden Markov Models (HMMs)
• In HMMs, we assume that
      p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
        = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1) ∏_{j=1}^{m} p(Xj = xj | Sj = sj)
• These are often called the independence assumptions in
HMMs
• We are going to prove them in the next slides
28/03/2011 Markov models 39
Independence Assumptions in HMMs [1]
• By the chain rule, the following equality is exact:
      p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
        = p(S1 = s1, ..., Sm = sm) × p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)
  (this is just p(A, B, C) = p(A | B, C) p(B, C), and in general p(A, B, C) = p(A | B, C) p(B | C) p(C))
• Assumption 1: the state sequence forms a Markov chain:
      p(S1 = s1, ..., Sm = sm) = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1)
28/03/2011 Markov models 40
Independence Assumptions in HMMs [2]
• By the chain rule, the following equality is exact:
      p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)
        = ∏_{j=1}^{m} p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj-1 = xj-1)
• Assumption 2: each observation depends only on the underlying
state:
      p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj-1 = xj-1) = p(Xj = xj | Sj = sj)
• These two assumptions are often called the independence
assumptions in HMMs
28/03/2011 Markov models 41
The Model form for HMMs
• The model takes the following form:
      p(x1, ..., xm, s1, ..., sm; θ) = π(s1) [∏_{j=2}^{m} t(sj | sj-1)] [∏_{j=1}^{m} e(xj | sj)]
• Parameters in the model:
– π(s): initial probabilities, for s ∈ {1, 2, ..., k}
– t(s′ | s): transition probabilities, for s, s′ ∈ {1, 2, ..., k}
– e(x | s): emission probabilities, for s ∈ {1, 2, ..., k} and x ∈ {1, 2, ..., o}
28/03/2011 Markov models 42
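The model form above can be evaluated directly. A small Python sketch; the parameter tables are the example HMM that appears on the following slides, and the 0-indexing of states and outputs is our own convention:

```python
# pi, t, e are the example HMM used later in these slides (0-indexed).
pi = [0.3, 0.3, 0.4]
t = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
e = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def joint(xs, ss):
    """p(x1..xm, s1..sm) = pi(s1) * prod_j t(sj|sj-1) * prod_j e(xj|sj)."""
    p = pi[ss[0]] * e[ss[0]][xs[0]]
    for j in range(1, len(ss)):
        p *= t[ss[j - 1]][ss[j]] * e[ss[j]][xs[j]]
    return p

p = joint([2, 0, 2], [2, 0, 0])   # X3, X1, X3 emitted from S3, S1, S1
```

For the path S3, S1, S1 emitting X3, X1, X3 this gives 0.00672, which matches the later slides' p(Q) · p(O|Q) = 0.04 × 0.168.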
6 components of HMMs
• Discrete timesteps: 1, 2, ...
• Finite state space: si (N states)
• Events xi (M events)
• Vector of initial probabilities πi:
      Π = {πi} = {p(q1 = si)}
• Matrix of transition probabilities:
      T = {Tij} = {p(qt+1 = sj | qt = si)}
• Matrix of emission probabilities:
      E = {Eij} = {p(ot = xj | qt = si)}
[Figure: states s1, s2, s3 with transition arcs tij, emission arcs eij to events x1, x2, x3, and a start node with initial probabilities π1, π2, π3]
The observations at discrete timesteps form an observation sequence
o1, o2, ..., ot, where oi ∈ {x1, x2, ..., xM}
Constraints:
      Σ_{i=1}^{N} πi = 1      Σ_{j=1}^{N} Tij = 1      Σ_{j=1}^{M} Eij = 1
28/03/2011 Markov models 43
6 components of HMMs
• Given a specific HMM and an
observation sequence, the
corresponding sequence of states
is generally not deterministic
• Example:
Given the observation sequence:
x1, x3, x3, x2
The corresponding states can be
any of the following sequences:
s1, s2, s1, s2
s1, s2, s3, s2
s1, s1, s1, s2
...
28/03/2011 Markov models 45
Here's an HMM
[Figure: states s1, s2, s3 with transition arcs and emission arcs to x1, x2, x3]
  T     s1     s2     s3
  s1    0.5    0.5    0
  s2    0.4    0      0.6
  s3    0.2    0.8    0

  E     x1     x2     x3
  s1    0.3    0      0.7
  s2    0      0.1    0.9
  s3    0.2    0      0.8

  π     s1     s2     s3
        0.3    0.3    0.4
28/03/2011 Markov models 46
Here's an HMM
• Start randomly in state 1, 2 or 3.
• Choose an output at each state at random.
• Let's generate a sequence of observations:
[Figure: the HMM with the T, E, π tables from the previous slide]
– q1: random choice between S1, S2, S3 with probabilities 0.3 - 0.3 - 0.4  →  q1 = S3
– o1: choice between X1 (0.2) and X3 (0.8)  →  o1 = X3
– q2: go to S2 with probability 0.8 or S1 with probability 0.2  →  q2 = S1
– o2: choice between X1 (0.3) and X3 (0.7)  →  o2 = X1
– q3: go to S2 with probability 0.5 or S1 with probability 0.5  →  q3 = S1
– o3: choice between X1 (0.3) and X3 (0.7)  →  o3 = X3
• We got a sequence of states and corresponding observations!
  q1 = S3, q2 = S1, q3 = S1   with   o1 = X3, o2 = X1, o3 = X3
28/03/2011 Markov models 47
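The generation procedure just walked through can be sketched as code. Python, with 0-indexed states and outputs (our convention); the fixed seed is only for reproducibility, and a given run will generally not reproduce the slide's particular draw of S3, S1, S1:

```python
import random

# The example HMM from the slide (0-indexed states and outputs).
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def generate(n, rng=random.Random(0)):
    """Generate n (state, observation) pairs exactly as the slide does:
    pick a start state from pi, emit from E, then step with T."""
    states, obs = [], []
    q = rng.choices(range(3), weights=pi)[0]
    for _ in range(n):
        states.append(q)
        obs.append(rng.choices(range(3), weights=E[q])[0])
        q = rng.choices(range(3), weights=T[q])[0]
    return states, obs

states, obs = generate(3)
```

Whatever the run produces, every emitted symbol and every transition necessarily has nonzero probability under E and T.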
Three famous HMM tasks
• Given an HMM Φ = (T, E, π), the three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
– Given: Φ, observation O = o1, o2,..., ot
– Goal: p(O|Φ), or equivalently p(st = Si|O)
– That is: calculating the probability of observing the sequence O over
all possible state sequences.
• Most likely explanation (inference)
– Given: Φ, the observation O = o1, o2,..., ot
– Goal: Q* = argmaxQ p(Q|O)
– That is: calculating the best corresponding state sequence,
given an observation sequence.
• Learning the HMM
– Given: observation O = o1, o2,..., ot and the corresponding state sequence
– Goal: estimate the parameters of the HMM Φ = (T, E, π)
– That is: given one (or a set of) observation sequence(s) and the
corresponding state sequence(s), estimate the transition matrix,
emission matrix and initial probabilities of the HMM.
28/03/2011 Markov models 54
Three famous HMM tasks
  Problem                                        Algorithm           Complexity
  State estimation: calculating p(O|Φ)           Forward             O(TN²)
  Inference: calculating Q* = argmaxQ p(Q|O)     Viterbi decoding    O(TN²)
  Learning: calculating Φ* = argmaxΦ p(O|Φ)      Baum-Welch (EM)     O(TN²)
T: number of timesteps
N: number of states
28/03/2011 Markov models 58
State estimation problem
• Given: Φ = (T, E, π), observation O = o1, o2,..., ot
• Goal: What is p(o1o2...ot) ?
• We can do this in a slow, stupid way
– As shown in the next slide...
28/03/2011 Markov models 59
Here's an HMM
• What is p(O) = p(o1o2o3)
= p(o1=X3 ∧ o2=X1 ∧ o3=X3)?
• Slow, stupid way:
      p(O) = Σ_{Q ∈ paths of length 3} p(O ∧ Q)
           = Σ_{Q ∈ paths of length 3} p(O | Q) p(Q)
• How to compute p(Q) for an arbitrary path Q?
      p(Q) = p(q1q2q3)
           = p(q1) p(q2|q1) p(q3|q2,q1)   (chain rule)
           = p(q1) p(q2|q1) p(q3|q2)      (Markov property)
  Example in the case Q = S3 S1 S1:
      p(Q) = 0.4 * 0.2 * 0.5 = 0.04
• How to compute p(O|Q) for an arbitrary path Q?
      p(O|Q) = p(o1o2o3 | q1q2q3)
             = p(o1|q1) p(o2|q2) p(o3|q3)   (independence assumption)
  Example in the case Q = S3 S1 S1:
      p(O|Q) = p(X3|S3) p(X1|S1) p(X3|S1) = 0.8 * 0.3 * 0.7 = 0.168
• p(O) needs 27 p(Q) computations and 27 p(O|Q) computations.
What if the sequence had 20 observations? So let's be smarter...
28/03/2011 Markov models 60
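The slow, stupid way is still worth writing down, since it serves as ground truth for the faster algorithms that follow. A Python sketch over the slide's parameters (0-indexed states and outputs, our convention):

```python
from itertools import product

pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def brute_force_p(O):
    """p(O) = sum over all state paths Q of p(Q) * p(O|Q): O(N^T) work."""
    total = 0.0
    for Q in product(range(3), repeat=len(O)):
        pQ = pi[Q[0]]
        for a, b in zip(Q, Q[1:]):
            pQ *= T[a][b]        # p(Q) via the Markov property
        pOQ = 1.0
        for q, o in zip(Q, O):
            pOQ *= E[q][o]       # p(O|Q) via the independence assumption
        total += pQ * pOQ
    return total

p = brute_force_p([2, 0, 2])     # O = X3, X1, X3
```

For O = X3, X1, X3 this enumerates all 3³ = 27 paths; the work grows as N^T, which is why twenty observations would already be hopeless.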
The Forward algorithm
• Given observation o1o2...oT
• Forward probabilities:
αt(i) = p(o1o2...ot ∧ qt = si | Φ) where 1 ≤ t ≤ T
• αt(i) = the probability that, in a random trial:
– We’d have seen the first t observations
– We’d have ended up in si as the t’th state visited.
• In our example, what is α2(3) ?
28/03/2011 Markov models 64
αt(i): easy to define recursively
      αt(i) = p(o1 o2 ... ot ∧ qt = si | Φ)
• Base case:
      α1(i) = p(o1 ∧ q1 = si)
            = p(q1 = si) p(o1 | q1 = si)
            = πi Ei(o1)
• Recursive case:
      αt+1(i) = p(o1 o2 ... ot ot+1 ∧ qt+1 = si)
              = Σ_{j=1}^{N} p(o1 o2 ... ot ∧ qt = sj ∧ ot+1 ∧ qt+1 = si)
              = Σ_{j=1}^{N} p(o1 o2 ... ot ∧ qt = sj) p(qt+1 = si | qt = sj) p(ot+1 | qt+1 = si)
              = Ei(ot+1) Σ_{j=1}^{N} Tji αt(j)
where
      Π = {πi} = {p(q1 = si)}
      T = {Tij} = {p(qt+1 = sj | qt = si)}
      E = {Eij} = {p(ot = xj | qt = si)}
28/03/2011 Markov models 65
In our example
[Figure: the example HMM with π = (0.3, 0.3, 0.4)]
      α1(i) = πi Ei(o1)
      αt+1(i) = Ei(ot+1) Σj Tji αt(j)
We observed: x1 x2
      α1(1) = 0.3 * 0.3 = 0.09
      α1(2) = 0.3 * 0 = 0
      α1(3) = 0.4 * 0.2 = 0.08
      α2(1) = 0 * (0.09*0.5 + 0*0.4 + 0.08*0.2) = 0
      α2(2) = 0.1 * (0.09*0.5 + 0*0 + 0.08*0.8) = 0.0109
      α2(3) = 0 * (0.09*0 + 0*0.6 + 0.08*0) = 0
28/03/2011 Markov models 66
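The recursion just traced by hand can be sketched as follows (Python, 0-indexed states and outputs, same example parameters):

```python
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N = 3

def forward(O):
    """Return the list of alpha vectors, one per timestep."""
    alpha = [[pi[i] * E[i][O[0]] for i in range(N)]]   # alpha_1(i) = pi_i E_i(o1)
    for o in O[1:]:
        prev = alpha[-1]
        # alpha_{t+1}(i) = E_i(o_{t+1}) * sum_j T_ji alpha_t(j)
        alpha.append([E[i][o] * sum(T[j][i] * prev[j] for j in range(N))
                      for i in range(N)])
    return alpha

alpha = forward([0, 1])   # we observed x1, x2 as on the slide
```

Here `alpha[0]` reproduces the slide's (0.09, 0, 0.08) and `alpha[1][1]` its α2(2) = 0.0109.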
Forward probabilities - Trellis
[Figure: a trellis with states s1, ..., sN on the vertical axis and timesteps 1, ..., T on the horizontal axis; the node at column t and row i holds αt(i)]
• The first column is filled with the base case:
      α1(i) = πi Ei(o1)
• Each later column is filled from the previous one:
      αt+1(i) = Ei(ot+1) Σj Tji αt(j)
28/03/2011 Markov models 67
Forward probabilities
• So, we can cheaply compute:
      αt(i) = p(o1 o2 ... ot ∧ qt = si)
• How can we cheaply compute:
      p(o1 o2 ... ot)?
  Answer: Σi αt(i)
• How can we cheaply compute:
      p(qt = si | o1 o2 ... ot)?
  Answer: αt(i) / Σj αt(j)
Look back at the trellis...
28/03/2011 Markov models 71
State estimation problem
• State estimation is solved:
      p(O | Φ) = p(o1 o2 ... ot) = Σ_{i=1}^{N} αt(i)
• Can we utilize the elegant trellis to solve the inference
problem?
– Given an observation sequence O, find the best state sequence:
      Q* = argmaxQ p(Q | O)
28/03/2011 Markov models 73
Inference problem
• Given: Φ = (T, E, π), observation O = o1, o2,..., ot
• Goal: Find
      Q* = argmaxQ p(Q | O) = argmax_{q1 q2 ... qt} p(q1 q2 ... qt | o1 o2 ... ot)
• Practical problems:
– Speech recognition: Given an utterance (sound), what is
the best sentence (text) that matches the utterance?
– Video tracking
– POS tagging
28/03/2011 Markov models 74
Inference problem
• We can do this in a slow, stupid way:
      Q* = argmaxQ p(Q | O)
         = argmaxQ p(O | Q) p(Q) / p(O)
         = argmaxQ p(O | Q) p(Q)
         = argmaxQ p(o1 o2 ... ot | Q) p(Q)
• But it's better if we can find another way to
compute the most probable path (MPP)...
28/03/2011 Markov models 75
Efficient MPP computation
• We are going to compute the following variables:
      δt(i) = max_{q1 q2 ... qt-1} p(q1 q2 ... qt-1 ∧ qt = si ∧ o1 o2 ... ot)
• δt(i) is the probability of the best path of length
t-1 which ends up in si and emits o1...ot.
• Define: mppt(i) = that path
so: δt(i) = p(mppt(i))
28/03/2011 Markov models 76
Viterbi algorithm
      δt(i) = max_{q1 q2 ... qt-1} p(q1 q2 ... qt-1 ∧ qt = si ∧ o1 o2 ... ot)
      mppt(i) = argmax_{q1 q2 ... qt-1} p(q1 q2 ... qt-1 ∧ qt = si ∧ o1 o2 ... ot)
• Base case (one choice):
      δ1(i) = p(q1 = si ∧ o1) = πi Ei(o1) = α1(i)
[Figure: trellis with first-column values δ1(1), ..., δ1(4) and δ2(3)]
28/03/2011 Markov models 77
Viterbi algorithm
• The most probable path with last two states
si sj is the most probable path to si, followed by
the transition si → sj.
[Figure: candidate paths ending in si at time t, then sj at time t + 1]
• The probability of that path will be:
      δt(i) × p(si → sj ∧ ot+1) = δt(i) Tij Ej(ot+1)
• So, the previous state at time t is:
      i* = argmaxi δt(i) Tij Ej(ot+1)
28/03/2011 Markov models 78
Viterbi algorithm
• Summary:
      δ1(i) = πi Ei(o1) = α1(i)
      δt+1(j) = maxi δt(i) Tij Ej(ot+1)
      i*t = argmaxi δt(i) Tij Ej(ot+1)
      mppt+1(j) = mppt(i*t) followed by sj
[Figure: trellis with δ1(1), ..., δ1(4) and δ2(3)]
28/03/2011 Markov models 79
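The summary above translates almost line for line into code. A Python sketch over the example HMM (0-indexed states and outputs, our convention):

```python
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N = 3

def viterbi(O):
    """Return (best state path as 0-indexed states, its probability)."""
    delta = [pi[i] * E[i][O[0]] for i in range(N)]   # delta_1(i) = pi_i E_i(o1)
    back = []                                        # backpointers i*_t
    for o in O[1:]:
        scores = [[delta[i] * T[i][j] * E[j][o] for i in range(N)]
                  for j in range(N)]
        back.append([max(range(N), key=lambda i: scores[j][i]) for j in range(N)])
        delta = [max(scores[j]) for j in range(N)]   # delta_{t+1}(j)
    # Backtrack from the best final state.
    q = max(range(N), key=lambda i: delta[i])
    path = [q]
    for bp in reversed(back):
        q = bp[q]
        path.append(q)
    return path[::-1], max(delta)

path, p = viterbi([2, 0, 2])   # O = X3, X1, X3
```

For O = X3, X1, X3 this returns the path S2, S3, S2 with probability 0.023328.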
What’s Viterbi used for?
• Speech Recognition
28/03/2011 Markov models 80
Chong, Jike and Yi, Youngmin and Faria, Arlo and Satish, Nadathur Rajagopalan and Keutzer, Kurt, “Data-Parallel Large Vocabulary
Continuous Speech Recognition on Graphics Processors”, EECS Department, University of California, Berkeley, 2008.
Training HMMs
• Given: large sequence of observation o1o2...oT
and number of states N.
• Goal: Estimation of parameters Φ = ⟨T, E, π⟩
• That is, how to design an HMM.
• We will infer the model from a large amount of
data o1o2...oT with a big “T”.
28/03/2011 Markov models 81
Training HMMs
• Remember, we have just computed
p(o1o2...oT | Φ)
• Now, we have some observations and we want to infer Φ
from them.
• So, we could use:
– MAX LIKELIHOOD:
      Φ* = argmaxΦ p(o1 ... oT | Φ)
– BAYES:
      Compute p(Φ | o1 ... oT), then take E[Φ] or argmaxΦ p(Φ | o1 ... oT)
28/03/2011 Markov models 82
Max likelihood for HMMs
• Forward probability: the probability of producing o1...ot while
ending up in state si:
      αt(i) = p(o1 o2 ... ot ∧ qt = si)
      α1(i) = πi Ei(o1)
      αt+1(i) = Ei(ot+1) Σj Tji αt(j)
• Backward probability: the probability of producing ot+1...oT given
that at time t, we are at state si:
      βt(i) = p(ot+1 ot+2 ... oT | qt = si)
28/03/2011 Markov models 83
Max likelihood for HMMs - Backward
• Backward probability: easy to define recursively:
βT(i) = 1
βt(i) = p(ot+1ot+2...oT | qt = si)
= Σj p(ot+1ot+2...oT ∧ qt+1 = sj | qt = si)
= Σj p(qt+1 = sj | qt = si) p(ot+1 | qt+1 = sj) p(ot+2...oT | qt+1 = sj)
= Σj Tij Ej(ot+1) βt+1(j)
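The backward recursion is just as short. A sketch under the same assumed array conventions; a handy sanity check is that Σi αt(i) βt(i) equals the sequence likelihood at every t:

```python
import numpy as np

def backward(T, E, obs):
    """beta[t, i] = p(o_{t+1} ... o_T | q_t = s_i)."""
    N, L = T.shape[0], len(obs)
    beta = np.ones((L, N))                 # base case: beta_T(i) = 1
    for t in range(L - 2, -1, -1):
        # beta_t(i) = sum_j T_ij E_j(o_{t+1}) beta_{t+1}(j)
        beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])
    return beta
```

The identity Σi αt(i) βt(i) = p(o1...oT) holds for any t because it marginalizes the state at time t out of the joint.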
Max likelihood for HMMs
• The probability of traversing a certain arc at time t, given o1o2...oT:
εij(t) = p(qt = si ∧ qt+1 = sj | o1o2...oT)
= p(qt = si ∧ qt+1 = sj ∧ o1o2...oT) / p(o1o2...oT)
= αt(i) Tij Ej(ot+1) βt+1(j) / Σi αt(i) βt(i)
Max likelihood for HMMs
• The probability of being in state si at time t, given o1o2...oT:
γi(t) = p(qt = si | o1o2...oT)
= Σj p(qt = si ∧ qt+1 = sj | o1o2...oT)
= Σj εij(t)
Max likelihood for HMMs
• Sum over the time index:
– Expected # of transitions from state i to j in o1o2...oT:
Σt=1..T−1 εij(t)
– Expected # of transitions from state i in o1o2...oT:
Σt=1..T−1 γi(t) = Σt=1..T−1 Σj=1..N εij(t)
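Given α and β, these expected counts are a couple of array expressions. A sketch (NumPy; the names eps and gamma are mine): eps[t, i, j] is εij(t) and gamma[t, i] is γi(t), with the denominator Σi αt(i) βt(i) taken at the final step, where it equals the sequence likelihood:

```python
import numpy as np

def expected_counts(alpha, beta, T, E, obs):
    """eps[t, i, j] = eps_ij(t);  gamma[t, i] = gamma_i(t)."""
    L, N = alpha.shape
    lik = alpha[-1].sum()                 # p(o_1 ... o_T)
    eps = np.zeros((L - 1, N, N))
    for t in range(L - 1):
        # eps_ij(t) = alpha_t(i) T_ij E_j(o_{t+1}) beta_{t+1}(j) / p(o_1...o_T)
        eps[t] = alpha[t][:, None] * T * E[:, obs[t + 1]][None, :] * beta[t + 1][None, :] / lik
    gamma = alpha * beta / lik            # gamma_i(t) = alpha_t(i) beta_t(i) / p(o_1...o_T)
    return eps, gamma
```

For every t < T, gamma[t] equals eps[t] summed over j, and both are proper probability distributions.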
Update parameters
• π̂i = expected frequency in state i at time t = 1 = γi(1)
• T̂ij = expected # of transitions from state i to j / expected # of transitions from state i
= Σt=1..T−1 εij(t) / Σt=1..T−1 γi(t)
• Êi(xk) = expected # of transitions from state i with xk observed / expected # of transitions from state i
= Σt=1..T−1 δ(ot, xk) γi(t) / Σt=1..T−1 γi(t)
• These estimate the parameters πi = p(q1 = si), Tij = p(qt+1 = sj | qt = si), Ei(x) = p(ot = x | qt = si).
The inner loop of Forward-Backward
Given an input sequence:
1. Calculate forward probability:
– Base case: α1(i) = πi Ei(o1)
– Recursive case: αt+1(i) = Ei(ot+1) Σj αt(j) Tji
2. Calculate backward probability:
– Base case: βT(i) = 1
– Recursive case: βt(i) = Σj Tij Ej(ot+1) βt+1(j)
3. Calculate expected counts:
εij(t) = αt(i) Tij Ej(ot+1) βt+1(j) / Σi αt(i) βt(i)
4. Update parameters:
T̂ij = Σt=1..T−1 εij(t) / Σj=1..N Σt=1..T−1 εij(t)
Êi(xk) = Σt=1..T−1 δ(ot, xk) Σj εij(t) / Σj=1..N Σt=1..T−1 εij(t)
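Putting the four steps together gives one EM (Baum-Welch) iteration. A self-contained sketch under the same assumed array conventions; by the EM guarantee, the likelihood of the data never decreases from one iteration to the next:

```python
import numpy as np

def baum_welch_step(pi, T, E, obs):
    """One Forward-Backward (Baum-Welch) update.

    Returns the re-estimated (pi, T, E) and the likelihood of obs
    under the *current* parameters.
    """
    N, L, K = len(pi), len(obs), E.shape[1]
    # Step 1: forward probabilities.
    alpha = np.zeros((L, N))
    alpha[0] = pi * E[:, obs[0]]
    for t in range(1, L):
        alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ T)
    # Step 2: backward probabilities.
    beta = np.ones((L, N))
    for t in range(L - 2, -1, -1):
        beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])
    lik = alpha[-1].sum()
    # Step 3: expected counts.
    gamma = alpha * beta / lik
    eps = np.array([alpha[t][:, None] * T * E[:, obs[t + 1]] * beta[t + 1] / lik
                    for t in range(L - 1)])
    # Step 4: re-estimate the parameters.
    new_pi = gamma[0]
    new_T = eps.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    obs_arr = np.array(obs)
    new_E = np.stack([gamma[obs_arr == k].sum(axis=0) for k in range(K)], axis=1)
    new_E /= gamma.sum(axis=0)[:, None]
    return new_pi, new_T, new_E, lik
```

Iterating this function until the returned likelihood stops improving is the whole training loop.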
Forward-Backward: EM for HMM
• If we knew Φ, we could estimate expectations of quantities such as
– Expected number of times in state i
– Expected number of transitions i → j
• If we knew the quantities such as
– Expected number of times in state i
– Expected number of transitions i → j
we could compute the max likelihood estimate of Φ = ⟨T, E, Π⟩
• Also known (for the HMM case) as the Baum-Welch algorithm.
EM for HMM
• Each iteration provides values for all the parameters.
• The new model always improves the likelihood of the training data:
p(o1o2...oT | Φ̂) ≥ p(o1o2...oT | Φ)
• The algorithm does not guarantee to reach the global maximum.
EM for HMM
• Bad News
– There are lots of local optima.
• Good News
– The local optima are usually adequate models of the data.
• Notice
– EM does not estimate the number of states. That must be given (a tradeoff).
– Often, HMMs are forced to have some links with zero probability. This is done by setting Tij = 0 in the initial estimate Φ(0).
– Easy extension of everything seen today: HMMs with real-valued outputs.
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field (from the viewpoint of classification)
Example: Image segmentation
• Observations: pixel values
• Hidden variable: class of each pixel
• It’s reasonable to think that there are some underlying relationships
between neighbouring pixels... Can we use Markov models?
• Errr.... the relationships are in 2D!
MRF as a 2D generalization of MC
• Array of observations: X = {xij}, 0 ≤ i < Nx, 0 ≤ j < Ny
• Classes/States: S = {sij}, sij = 1...M
• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that p(S | X) is maximum.
2D context-dependent classification
• Assumptions:
– The values of the elements in S are mutually dependent.
– The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood Nij is defined so that
– sij ∉ Nij: the (i, j) element does not belong to its own set of neighbors.
– sij ∈ Nkl ⇔ skl ∈ Nij: if sij is a neighbor of skl, then skl is also a neighbor of sij.
2D context-dependent classification
• The Markov property for the 2D case:
p(sij | S̄ij) = p(sij | Nij)
where S̄ij includes all the elements of S except the (i, j) one.
• The elegant dynamic programming is not applicable: the problem is much harder now!
• We are gonna see an application of MRF for Image Segmentation and Restoration.
MRF for Image Segmentation
• Cliques: a set of pixels that are neighbors of one another (w.r.t. the type of neighborhood)
MRF for Image Segmentation
• Dual lattice
• Line process
[Figure: dual lattice and line process illustration.]
MRF for Image Segmentation
• Gibbs distribution:
π(s) = (1/Z) exp(−U(s)/T)
– Z: normalizing constant
– T: parameter
• It turns out that the Gibbs distribution implies MRF ([Geman 84]).
MRF for Image Segmentation
• A Gibbs conditional probability is of the form:
p(sij | Nij) = (1/Z) exp(−(1/T) Σk Fk(Ck(i, j)))
– Ck(i, j): cliques of the pixel (i, j)
– Fk: some functions, e.g.
−(1/T) sij (α1(si−1,j + si+1,j) + α2(si,j−1 + si,j+1))
MRF for Image Segmentation
• Then, the joint probability for the Gibbs model is
p(S) = (1/Z) exp(−(1/T) Σ(i,j) Σk Fk(Ck(i, j)))
– The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X | S).
• Then p(X | S) p(S) can be maximized... [Geman 84]
More on Markov models...
• MRF does not stop there... Here are some related models:
– Conditional random field (CRF)
– Graphical models
– ...
• Markov Chain and HMM do not stop there either...
– Markov chain of order m
– Continuous-time Markov chains
– Real-valued observations
– ...
What you should know
• Markov property, Markov Chain
• HMM:
– Defining and computing αt(i)
– Viterbi algorithm
– Outline of the EM algorithm for HMM
• Markov Random Field
– And an application in Image Segmentation
– [Geman 84] for more information.
Q & A
References
• L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Andrew W. Moore, “Hidden Markov Models”, http://www.autonlab.org/tutorials/
• Geman S., Geman D., “Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.