markov models
DESCRIPTION
A presentation on Markov chains, hidden Markov models, and Markov random fields, with the needed algorithms and detailed explanations.

TRANSCRIPT
PATTERN RECOGNITION
Markov models
Vu PHAM
Department of Computer Science
March 28th, 2011
28/03/2011 Markov models 1
Contents
• Introduction
– Introduction
– Motivation
• Markov Chain
• Hidden Markov Models
• Markov Random Field
28/03/2011 Markov models 2
Introduction
• Markov processes were first proposed by the
Russian mathematician Andrei Markov
– He used these processes to investigate
letter sequences in Pushkin's poetry.
• Nowadays, the Markov property and HMMs are
widely used in many domains:
– Natural Language Processing
– Speech Recognition
– Bioinformatics
– Image/video processing
– ...
28/03/2011 Markov models 3
Motivation [0]
• As shown in his paper in 1906, Markov's original
motivation was purely mathematical:
– Application of the Weak Law of Large Numbers to dependent
random variables.
• However, we shall not follow this motivation...
28/03/2011 Markov models 4
Motivation [1]
• From the viewpoint of classification:
– Context-free classification: Bayes classifier
      p(ωi | x) > p(ωj | x)   ∀ j ≠ i
• Classes are independent.
• Feature vectors are independent.
– However, there are some applications where various
classes are closely related:
• POS tagging, tracking, gene boundary recovery...
28/03/2011 Markov models 5
Motivation [1]
• Context-dependent classification:
– s1, s2, ..., sm: a sequence of m feature vectors
– ω1, ω2, ..., ωm: the classes into which these vectors are classified, each ωi ∈ {1, ..., k}
[Figure: the sequence s1 → s2 → s3 → ... → sm]
• To apply the Bayes classifier:
– X = s1s2...sm: extended feature vector
– Ωi = ωi1, ωi2, ..., ωim: one possible classification; there are k^m possible classifications
      p(Ωi | X) > p(Ωj | X)   ∀ j ≠ i
      p(X | Ωi) p(Ωi) > p(X | Ωj) p(Ωj)   ∀ j ≠ i
28/03/2011 Markov models 8
Motivation [2]
• From a general view, sometimes we want to evaluate the joint
distribution of a sequence of dependent random variables:
Hôm nay mùng tám tháng ba
Chị em phụ nữ đi ra đi vào...
("Today is the eighth of March / The women keep walking out and in...")
[Figure: the words assigned to variables q1 = Hôm, q2 = nay, q3 = mùng, ..., qm = vào]
• What is p(Hôm nay ... vào) = p(q1=Hôm, q2=nay, ..., qm=vào)?
      p(sm | s1 s2 ... sm-1) = p(s1 s2 ... sm-1 sm) / p(s1 s2 ... sm-1)
28/03/2011 Markov models 11
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field
28/03/2011 Markov models 15
Markov Chain
• Has N states, called s1, s2, ..., sN
• There are discrete timesteps, t=0,
t=1,...
• On the t'th timestep the system is in
exactly one of the available states.
Call it qt ∈ {s1, s2, ..., sN}
• Between each timestep, the next
state is chosen randomly.
• The current state determines the
probability distribution for the next state.
– Often notated with arcs between states
[Figure: three states s1, s2, s3 with transition arcs labelled 1, 1/2, 1/2, 1/3, 2/3; here N = 3, t = 1, and the current state is qt = q1 = s2]
• For this chain, the transition probabilities are:
      p(qt+1 = s1 | qt = s1) = 0      p(qt+1 = s1 | qt = s2) = 1/2    p(qt+1 = s1 | qt = s3) = 1/3
      p(qt+1 = s2 | qt = s1) = 0      p(qt+1 = s2 | qt = s2) = 1/2    p(qt+1 = s2 | qt = s3) = 2/3
      p(qt+1 = s3 | qt = s1) = 1      p(qt+1 = s3 | qt = s2) = 0      p(qt+1 = s3 | qt = s3) = 0
28/03/2011 Markov models 16
Markov Property
• qt+1 is conditionally independent of
qt-1, qt-2,..., q0 given qt.
• In other words:
      p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt)
The state at timestep t+1 depends
only on the state at timestep t
• A Markov chain of order m (m finite): the state at
timestep t+1 depends on the past m states:
      p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt, qt-1, ..., qt-m+1)
• How to represent the joint
distribution of (q0, q1, q2...) using
graphical models?
[Figure: the chain drawn as a directed graphical model q0 → q1 → q2 → q3]
28/03/2011 Markov models 20
Markov chain
• So, the chain of qt is called a Markov chain
• Each qt takes a value from the countable state-space {s1, s2, s3, ...}
• Each qt is observed at a discrete timestep t
• qt satisfies the Markov property:
      p(qt+1 | qt, qt-1, ..., q0) = p(qt+1 | qt)
• The transition from qt to qt+1 is drawn from the transition
probability matrix
[Figure: graphical model q0 → q1 → q2 → q3, and the three states s1, s2, s3 with arcs]
Transition probabilities:
          s1     s2     s3
    s1    0      0      1
    s2    1/2    1/2    0
    s3    1/3    2/3    0
28/03/2011 Markov models 25
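The sampling process described above (the next state is drawn from the row of the current state) can be sketched in a few lines. This is an illustrative Python snippet, not from the slides; states are 0-indexed (our convention) and the transition matrix is the one on this slide:

```python
import random

# Transition matrix from the slide: rows = current state, columns = next state.
T = [[0.0, 0.0, 1.0],   # s1 -> s3 with probability 1
     [0.5, 0.5, 0.0],   # s2 -> s1 or s2, each with probability 1/2
     [1/3, 2/3, 0.0]]   # s3 -> s1 (1/3) or s2 (2/3)

def sample_chain(T, q0, steps, rng=random.Random(0)):
    """Sample a state sequence q0, q1, ..., q_steps from the chain."""
    states = [q0]
    for _ in range(steps):
        current = states[-1]
        # The next state depends only on the current one (Markov property).
        nxt = rng.choices(range(len(T)), weights=T[current])[0]
        states.append(nxt)
    return states

path = sample_chain(T, q0=2, steps=10)  # start in s3 (0-indexed state 2)
```

Every transition in a sampled path necessarily has nonzero probability under T, since zero-weight successors are never drawn.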
Markov Chain – Important property
• In a Markov chain, the joint distribution is
      p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | qj-1)
• Why?
      p(q0, q1, ..., qm) = p(q0) ∏_{j=1}^{m} p(qj | previous states)
                         = p(q0) ∏_{j=1}^{m} p(qj | qj-1)
Due to the Markov property
28/03/2011 Markov models 29
Markov Chain: e.g.
• The state-space of weather: {rain, cloud, wind}
[Figure: three states rain, cloud, wind with transition arcs labelled 1/2, 1/2, 1/3, 2/3, 1]
          Rain    Cloud    Wind
  Rain    1/2     0        1/2
  Cloud   1/3     0        2/3
  Wind    0       1        0
• Markov assumption: the weather on the (t+1)'th day
depends only on the t'th day.
• We have observed the weather in a week:
  Day:    0      1      2      3      4
          rain   wind   cloud  rain   wind
28/03/2011 Markov models 31
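The probability of an observed sequence like this week of weather factors into one-step transitions, as the important-property slide showed. A minimal Python sketch, under the assumption that the observed week reads rain, wind, cloud, rain, wind, and with our own encoding rain=0, cloud=1, wind=2 (the result is conditional on day 0, since the slide gives no initial distribution):

```python
# Weather chain from the slide; state order: rain=0, cloud=1, wind=2.
T = [[0.5, 0.0, 0.5],   # rain  -> rain (1/2) or wind (1/2)
     [1/3, 0.0, 2/3],   # cloud -> rain (1/3) or wind (2/3)
     [0.0, 1.0, 0.0]]   # wind  -> cloud (probability 1)

def chain_probability(T, states):
    """p(q1, ..., qm | q0): a product of one-step transition probabilities."""
    p = 1.0
    for a, b in zip(states, states[1:]):
        p *= T[a][b]
    return p

# The assumed observed week: rain, wind, cloud, rain, wind (days 0..4).
week = [0, 2, 1, 0, 2]
p = chain_probability(T, week)  # 0.5 * 1.0 * (1/3) * 0.5
```

Note that any week containing a rain-to-cloud step would get probability 0, since that arc is absent from the chain.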
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
– Independent assumptions
– Formal definition
– Forward algorithm
– Viterbi algorithm
– Baum-Welch algorithm
• Markov Random Field
28/03/2011 Markov models 36
Modeling pairs of sequences
• In many applications, we have to model pairs of sequences
• Examples:
– POS tagging in Natural Language Processing (assign each word in a
sentence to Noun, Adj, Verb...)
– Speech recognition (map acoustic sequences to sequences of words)
– Computational biology (recover gene boundaries in DNA sequences)
– Video tracking (estimate the underlying model states from the observation
sequences)
– And many others...
28/03/2011 Markov models 37
Probabilistic models for sequence pairs
• We have two sequences of random variables:
X1, X2, ..., Xm and S1, S2, ..., Sm
• Intuitively, in a practical system, each Xi corresponds to an observation
and each Si corresponds to a state that generated the observation.
• Let each Si be in {1, 2, ..., k} and each Xi be in {1, 2, ..., o}
• How do we model the joint distribution:
28/03/2011 Markov models 38
      p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
Hidden Markov Models (HMMs)
• In HMMs, we assume that
      p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
        = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1) ∏_{j=1}^{m} p(Xj = xj | Sj = sj)
• These are often called the independence assumptions in
HMMs
• We are going to prove them in the next slides
28/03/2011 Markov models 39
Independence Assumptions in HMMs [1]
• By the chain rule, the following equality is exact:
      p(X1 = x1, ..., Xm = xm, S1 = s1, ..., Sm = sm)
        = p(S1 = s1, ..., Sm = sm) × p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)
  (this is just p(A, B, C) = p(A | B, C) p(B, C), and in general p(A, B, C) = p(A | B, C) p(B | C) p(C))
• Assumption 1: the state sequence forms a Markov chain:
      p(S1 = s1, ..., Sm = sm) = p(S1 = s1) ∏_{j=2}^{m} p(Sj = sj | Sj-1 = sj-1)
28/03/2011 Markov models 40
Independence Assumptions in HMMs [2]
• By the chain rule, the following equality is exact:
      p(X1 = x1, ..., Xm = xm | S1 = s1, ..., Sm = sm)
        = ∏_{j=1}^{m} p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj-1 = xj-1)
• Assumption 2: each observation depends only on the underlying
state:
      p(Xj = xj | S1 = s1, ..., Sm = sm, X1 = x1, ..., Xj-1 = xj-1) = p(Xj = xj | Sj = sj)
• These two assumptions are often called the independence
assumptions in HMMs
28/03/2011 Markov models 41
The Model form for HMMs
• The model takes the following form:
      p(x1, ..., xm, s1, ..., sm; θ) = π(s1) [∏_{j=2}^{m} t(sj | sj-1)] [∏_{j=1}^{m} e(xj | sj)]
• Parameters in the model:
– π(s): initial probabilities, for s ∈ {1, 2, ..., k}
– t(s′ | s): transition probabilities, for s, s′ ∈ {1, 2, ..., k}
– e(x | s): emission probabilities, for s ∈ {1, 2, ..., k} and x ∈ {1, 2, ..., o}
28/03/2011 Markov models 42
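The model form above can be evaluated directly. A small Python sketch; the parameter tables are the example HMM that appears on the following slides, and the 0-indexing of states and outputs is our own convention:

```python
# pi, t, e are the example HMM used later in these slides (0-indexed).
pi = [0.3, 0.3, 0.4]
t = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
e = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def joint(xs, ss):
    """p(x1..xm, s1..sm) = pi(s1) * prod_j t(sj|sj-1) * prod_j e(xj|sj)."""
    p = pi[ss[0]] * e[ss[0]][xs[0]]
    for j in range(1, len(ss)):
        p *= t[ss[j - 1]][ss[j]] * e[ss[j]][xs[j]]
    return p

p = joint([2, 0, 2], [2, 0, 0])   # X3, X1, X3 emitted from S3, S1, S1
```

For the path S3, S1, S1 emitting X3, X1, X3 this gives 0.00672, which matches the later slides' p(Q) · p(O|Q) = 0.04 × 0.168.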
6 components of HMMs
• Discrete timesteps: 1, 2, ...
• Finite state space: si (N states)
• Events xi (M events)
• Vector of initial probabilities πi:
      Π = {πi} = {p(q1 = si)}
• Matrix of transition probabilities:
      T = {Tij} = {p(qt+1 = sj | qt = si)}
• Matrix of emission probabilities:
      E = {Eij} = {p(ot = xj | qt = si)}
[Figure: states s1, s2, s3 with transition arcs tij, emission arcs eij to events x1, x2, x3, and a start node with initial probabilities π1, π2, π3]
The observations at discrete timesteps form an observation sequence
o1, o2, ..., ot, where oi ∈ {x1, x2, ..., xM}
Constraints:
      Σ_{i=1}^{N} πi = 1      Σ_{j=1}^{N} Tij = 1      Σ_{j=1}^{M} Eij = 1
28/03/2011 Markov models 43
6 components of HMMs
• Given a specific HMM and an
observation sequence, the
corresponding sequence of states
is generally not deterministic
• Example:
Given the observation sequence:
x1, x3, x3, x2
The corresponding states can be
any of the following sequences:
s1, s2, s1, s2
s1, s2, s3, s2
s1, s1, s1, s2
...
28/03/2011 Markov models 45
Here's an HMM
[Figure: states s1, s2, s3 with transition arcs and emission arcs to x1, x2, x3]
  T     s1     s2     s3
  s1    0.5    0.5    0
  s2    0.4    0      0.6
  s3    0.2    0.8    0

  E     x1     x2     x3
  s1    0.3    0      0.7
  s2    0      0.1    0.9
  s3    0.2    0      0.8

  π     s1     s2     s3
        0.3    0.3    0.4
28/03/2011 Markov models 46
Here's an HMM
• Start randomly in state 1, 2 or 3.
• Choose an output at each state at random.
• Let's generate a sequence of observations:
[Figure: the HMM with the T, E, π tables from the previous slide]
– q1: random choice between S1, S2, S3 with probabilities 0.3 - 0.3 - 0.4  →  q1 = S3
– o1: choice between X1 (0.2) and X3 (0.8)  →  o1 = X3
– q2: go to S2 with probability 0.8 or S1 with probability 0.2  →  q2 = S1
– o2: choice between X1 (0.3) and X3 (0.7)  →  o2 = X1
– q3: go to S2 with probability 0.5 or S1 with probability 0.5  →  q3 = S1
– o3: choice between X1 (0.3) and X3 (0.7)  →  o3 = X3
• We got a sequence of states and corresponding observations!
  q1 = S3, q2 = S1, q3 = S1   with   o1 = X3, o2 = X1, o3 = X3
28/03/2011 Markov models 47
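The generation procedure just walked through can be sketched as code. Python, with 0-indexed states and outputs (our convention); the fixed seed is only for reproducibility, and a given run will generally not reproduce the slide's particular draw of S3, S1, S1:

```python
import random

# The example HMM from the slide (0-indexed states and outputs).
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def generate(n, rng=random.Random(0)):
    """Generate n (state, observation) pairs exactly as the slide does:
    pick a start state from pi, emit from E, then step with T."""
    states, obs = [], []
    q = rng.choices(range(3), weights=pi)[0]
    for _ in range(n):
        states.append(q)
        obs.append(rng.choices(range(3), weights=E[q])[0])
        q = rng.choices(range(3), weights=T[q])[0]
    return states, obs

states, obs = generate(3)
```

Whatever the run produces, every emitted symbol and every transition necessarily has nonzero probability under E and T.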
Three famous HMM tasks
• Given an HMM Φ = (T, E, π), the three famous HMM tasks are:
• Probability of an observation sequence (state estimation)
– Given: Φ, observation O = o1, o2,..., ot
– Goal: p(O|Φ), or equivalently p(st = Si|O)
– That is: calculating the probability of observing the sequence O over
all possible state sequences.
• Most likely explanation (inference)
– Given: Φ, the observation O = o1, o2,..., ot
– Goal: Q* = argmaxQ p(Q|O)
– That is: calculating the best corresponding state sequence,
given an observation sequence.
• Learning the HMM
– Given: observation O = o1, o2,..., ot and the corresponding state sequence
– Goal: estimate the parameters of the HMM Φ = (T, E, π)
– That is: given one (or a set of) observation sequence(s) and the
corresponding state sequence(s), estimate the transition matrix,
emission matrix and initial probabilities of the HMM.
28/03/2011 Markov models 54
Three famous HMM tasks
  Problem                                        Algorithm           Complexity
  State estimation: calculating p(O|Φ)           Forward             O(TN²)
  Inference: calculating Q* = argmaxQ p(Q|O)     Viterbi decoding    O(TN²)
  Learning: calculating Φ* = argmaxΦ p(O|Φ)      Baum-Welch (EM)     O(TN²)
T: number of timesteps
N: number of states
28/03/2011 Markov models 58
State estimation problem
• Given: Φ = (T, E, π), observation O = o1, o2,..., ot
• Goal: What is p(o1o2...ot) ?
• We can do this in a slow, stupid way
– As shown in the next slide...
28/03/2011 Markov models 59
Here's an HMM
• What is p(O) = p(o1o2o3)
= p(o1=X3 ∧ o2=X1 ∧ o3=X3)?
• Slow, stupid way:
      p(O) = Σ_{Q ∈ paths of length 3} p(O ∧ Q)
           = Σ_{Q ∈ paths of length 3} p(O | Q) p(Q)
• How to compute p(Q) for an arbitrary path Q?
      p(Q) = p(q1q2q3)
           = p(q1) p(q2|q1) p(q3|q2,q1)   (chain rule)
           = p(q1) p(q2|q1) p(q3|q2)      (Markov property)
  Example in the case Q = S3 S1 S1:
      p(Q) = 0.4 * 0.2 * 0.5 = 0.04
• How to compute p(O|Q) for an arbitrary path Q?
      p(O|Q) = p(o1o2o3 | q1q2q3)
             = p(o1|q1) p(o2|q2) p(o3|q3)   (independence assumption)
  Example in the case Q = S3 S1 S1:
      p(O|Q) = p(X3|S3) p(X1|S1) p(X3|S1) = 0.8 * 0.3 * 0.7 = 0.168
• p(O) needs 27 p(Q) computations and 27 p(O|Q) computations.
What if the sequence had 20 observations? So let's be smarter...
28/03/2011 Markov models 60
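The slow, stupid way is still worth writing down, since it serves as ground truth for the faster algorithms that follow. A Python sketch over the slide's parameters (0-indexed states and outputs, our convention):

```python
from itertools import product

pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]

def brute_force_p(O):
    """p(O) = sum over all state paths Q of p(Q) * p(O|Q): O(N^T) work."""
    total = 0.0
    for Q in product(range(3), repeat=len(O)):
        pQ = pi[Q[0]]
        for a, b in zip(Q, Q[1:]):
            pQ *= T[a][b]        # p(Q) via the Markov property
        pOQ = 1.0
        for q, o in zip(Q, O):
            pOQ *= E[q][o]       # p(O|Q) via the independence assumption
        total += pQ * pOQ
    return total

p = brute_force_p([2, 0, 2])     # O = X3, X1, X3
```

For O = X3, X1, X3 this enumerates all 3³ = 27 paths; the work grows as N^T, which is why twenty observations would already be hopeless.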
The Forward algorithm
• Given observation o1o2...oT
• Forward probabilities:
αt(i) = p(o1o2...ot ∧ qt = si | Φ) where 1 ≤ t ≤ T
• αt(i) = the probability that, in a random trial:
– We’d have seen the first t observations
– We’d have ended up in si as the t’th state visited.
• In our example, what is α2(3) ?
28/03/2011 Markov models 64
αt(i): easy to define recursively
      αt(i) = p(o1 o2 ... ot ∧ qt = si | Φ)
• Base case:
      α1(i) = p(o1 ∧ q1 = si)
            = p(q1 = si) p(o1 | q1 = si)
            = πi Ei(o1)
• Recursive case:
      αt+1(i) = p(o1 o2 ... ot ot+1 ∧ qt+1 = si)
              = Σ_{j=1}^{N} p(o1 o2 ... ot ∧ qt = sj ∧ ot+1 ∧ qt+1 = si)
              = Σ_{j=1}^{N} p(o1 o2 ... ot ∧ qt = sj) p(qt+1 = si | qt = sj) p(ot+1 | qt+1 = si)
              = Ei(ot+1) Σ_{j=1}^{N} Tji αt(j)
where
      Π = {πi} = {p(q1 = si)}
      T = {Tij} = {p(qt+1 = sj | qt = si)}
      E = {Eij} = {p(ot = xj | qt = si)}
28/03/2011 Markov models 65
In our example
[Figure: the example HMM with π = (0.3, 0.3, 0.4)]
      α1(i) = πi Ei(o1)
      αt+1(i) = Ei(ot+1) Σj Tji αt(j)
We observed: x1 x2
      α1(1) = 0.3 * 0.3 = 0.09
      α1(2) = 0.3 * 0 = 0
      α1(3) = 0.4 * 0.2 = 0.08
      α2(1) = 0 * (0.09*0.5 + 0*0.4 + 0.08*0.2) = 0
      α2(2) = 0.1 * (0.09*0.5 + 0*0 + 0.08*0.8) = 0.0109
      α2(3) = 0 * (0.09*0 + 0*0.6 + 0.08*0) = 0
28/03/2011 Markov models 66
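The recursion just traced by hand can be sketched as follows (Python, 0-indexed states and outputs, same example parameters):

```python
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N = 3

def forward(O):
    """Return the list of alpha vectors, one per timestep."""
    alpha = [[pi[i] * E[i][O[0]] for i in range(N)]]   # alpha_1(i) = pi_i E_i(o1)
    for o in O[1:]:
        prev = alpha[-1]
        # alpha_{t+1}(i) = E_i(o_{t+1}) * sum_j T_ji alpha_t(j)
        alpha.append([E[i][o] * sum(T[j][i] * prev[j] for j in range(N))
                      for i in range(N)])
    return alpha

alpha = forward([0, 1])   # we observed x1, x2 as on the slide
```

Here `alpha[0]` reproduces the slide's (0.09, 0, 0.08) and `alpha[1][1]` its α2(2) = 0.0109.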
Forward probabilities - Trellis
[Figure: a trellis with states s1, ..., sN on the vertical axis and timesteps 1, ..., T on the horizontal axis; the node at column t and row i holds αt(i)]
• The first column is filled with the base case:
      α1(i) = πi Ei(o1)
• Each later column is filled from the previous one:
      αt+1(i) = Ei(ot+1) Σj Tji αt(j)
28/03/2011 Markov models 67
Forward probabilities
• So, we can cheaply compute:
      αt(i) = p(o1 o2 ... ot ∧ qt = si)
• How can we cheaply compute:
      p(o1 o2 ... ot)?
  Answer: Σi αt(i)
• How can we cheaply compute:
      p(qt = si | o1 o2 ... ot)?
  Answer: αt(i) / Σj αt(j)
Look back at the trellis...
28/03/2011 Markov models 71
State estimation problem
• State estimation is solved:
      p(O | Φ) = p(o1 o2 ... ot) = Σ_{i=1}^{N} αt(i)
• Can we utilize the elegant trellis to solve the inference
problem?
– Given an observation sequence O, find the best state sequence:
      Q* = argmaxQ p(Q | O)
28/03/2011 Markov models 73
Inference problem
• Given: Φ = (T, E, π), observation O = o1, o2,..., ot
• Goal: Find
      Q* = argmaxQ p(Q | O) = argmax_{q1 q2 ... qt} p(q1 q2 ... qt | o1 o2 ... ot)
• Practical problems:
– Speech recognition: Given an utterance (sound), what is
the best sentence (text) that matches the utterance?
– Video tracking
– POS tagging
28/03/2011 Markov models 74
Inference problem
• We can do this in a slow, stupid way:
      Q* = argmaxQ p(Q | O)
         = argmaxQ p(O | Q) p(Q) / p(O)
         = argmaxQ p(O | Q) p(Q)
         = argmaxQ p(o1 o2 ... ot | Q) p(Q)
• But it's better if we can find another way to
compute the most probable path (MPP)...
28/03/2011 Markov models 75
Efficient MPP computation
• We are going to compute the following variables:
      δt(i) = max_{q1 q2 ... qt-1} p(q1 q2 ... qt-1 ∧ qt = si ∧ o1 o2 ... ot)
• δt(i) is the probability of the best path of length
t-1 which ends up in si and emits o1...ot.
• Define: mppt(i) = that path
so: δt(i) = p(mppt(i))
28/03/2011 Markov models 76
Viterbi algorithm
      δt(i) = max_{q1 q2 ... qt-1} p(q1 q2 ... qt-1 ∧ qt = si ∧ o1 o2 ... ot)
      mppt(i) = argmax_{q1 q2 ... qt-1} p(q1 q2 ... qt-1 ∧ qt = si ∧ o1 o2 ... ot)
• Base case (one choice):
      δ1(i) = p(q1 = si ∧ o1) = πi Ei(o1) = α1(i)
[Figure: trellis with first-column values δ1(1), ..., δ1(4) and δ2(3)]
28/03/2011 Markov models 77
Viterbi algorithm
• The most probable path with last two states
si sj is the most probable path to si, followed by
the transition si → sj.
[Figure: candidate paths ending in si at time t, then sj at time t + 1]
• The probability of that path will be:
      δt(i) × p(si → sj ∧ ot+1) = δt(i) Tij Ej(ot+1)
• So, the previous state at time t is:
      i* = argmaxi δt(i) Tij Ej(ot+1)
28/03/2011 Markov models 78
Viterbi algorithm
• Summary:
      δ1(i) = πi Ei(o1) = α1(i)
      δt+1(j) = maxi δt(i) Tij Ej(ot+1)
      i*t = argmaxi δt(i) Tij Ej(ot+1)
      mppt+1(j) = mppt(i*t) followed by sj
[Figure: trellis with δ1(1), ..., δ1(4) and δ2(3)]
28/03/2011 Markov models 79
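The summary above translates almost line for line into code. A Python sketch over the example HMM (0-indexed states and outputs, our convention):

```python
pi = [0.3, 0.3, 0.4]
T = [[0.5, 0.5, 0.0], [0.4, 0.0, 0.6], [0.2, 0.8, 0.0]]
E = [[0.3, 0.0, 0.7], [0.0, 0.1, 0.9], [0.2, 0.0, 0.8]]
N = 3

def viterbi(O):
    """Return (best state path as 0-indexed states, its probability)."""
    delta = [pi[i] * E[i][O[0]] for i in range(N)]   # delta_1(i) = pi_i E_i(o1)
    back = []                                        # backpointers i*_t
    for o in O[1:]:
        scores = [[delta[i] * T[i][j] * E[j][o] for i in range(N)]
                  for j in range(N)]
        back.append([max(range(N), key=lambda i: scores[j][i]) for j in range(N)])
        delta = [max(scores[j]) for j in range(N)]   # delta_{t+1}(j)
    # Backtrack from the best final state.
    q = max(range(N), key=lambda i: delta[i])
    path = [q]
    for bp in reversed(back):
        q = bp[q]
        path.append(q)
    return path[::-1], max(delta)

path, p = viterbi([2, 0, 2])   # O = X3, X1, X3
```

For O = X3, X1, X3 this returns the path S2, S3, S2 with probability 0.023328.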
What’s Viterbi used for?
• Speech Recognition
28/03/2011 Markov models 80
Chong, Jike and Yi, Youngmin and Faria, Arlo and Satish, Nadathur Rajagopalan and Keutzer, Kurt, “Data-Parallel Large Vocabulary
Continuous Speech Recognition on Graphics Processors”, EECS Department, University of California, Berkeley, 2008.
Training HMMs
• Given: large sequence of observation o1o2...oT
and number of states N.
• Goal: Estimation of parameters Φ = ⟨T, E, π⟩
• That is, how to design an HMM.
• We will infer the model from a large amount of
data o1o2...oT with a big “T”.
28/03/2011 Markov models 81
Training HMMs
• Remember, we have just computed
p(o1o2...oT | Φ)
• Now, we have some observations and we want to infer Φ
from them.
• So, we could use:
– MAX LIKELIHOOD:
      Φ* = argmaxΦ p(o1 ... oT | Φ)
– BAYES:
      Compute p(Φ | o1 ... oT), then take E[Φ] or argmaxΦ p(Φ | o1 ... oT)
28/03/2011 Markov models 82
Max likelihood for HMMs
• Forward probability: the probability of producing o1...ot while
ending up in state si:
      αt(i) = p(o1 o2 ... ot ∧ qt = si)
      α1(i) = πi Ei(o1)
      αt+1(i) = Ei(ot+1) Σj Tji αt(j)
• Backward probability: the probability of producing ot+1...oT given
that at time t, we are at state si:
      βt(i) = p(ot+1 ot+2 ... oT | qt = si)
28/03/2011 Markov models 83
Max likelihood for HMMs - Backward
• Backward probability: easy to define recursively:
βT(i) = 1
βt(i) = p(ot+1ot+2...oT | qt = si)
= Σj p(ot+1ot+2...oT ∧ qt+1 = sj | qt = si)
= Σj p(qt+1 = sj | qt = si) p(ot+1 | qt+1 = sj) p(ot+2...oT | qt+1 = sj)
= Σj Tij Ej(ot+1) βt+1(j)
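The backward recursion is just as short. A sketch under the same assumed array conventions; a handy sanity check is that Σi αt(i) βt(i) equals the sequence likelihood at every t:

```python
import numpy as np

def backward(T, E, obs):
    """beta[t, i] = p(o_{t+1} ... o_T | q_t = s_i)."""
    N, L = T.shape[0], len(obs)
    beta = np.ones((L, N))                 # base case: beta_T(i) = 1
    for t in range(L - 2, -1, -1):
        # beta_t(i) = sum_j T_ij E_j(o_{t+1}) beta_{t+1}(j)
        beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])
    return beta
```

The identity Σi αt(i) βt(i) = p(o1...oT) holds for any t because it marginalizes the state at time t out of the joint.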
Max likelihood for HMMs
• The probability of traversing a certain arc at time t, given o1o2...oT:
εij(t) = p(qt = si ∧ qt+1 = sj | o1o2...oT)
= p(qt = si ∧ qt+1 = sj ∧ o1o2...oT) / p(o1o2...oT)
= αt(i) Tij Ej(ot+1) βt+1(j) / Σi αt(i) βt(i)
Max likelihood for HMMs
• The probability of being in state si at time t, given o1o2...oT:
γi(t) = p(qt = si | o1o2...oT)
= Σj p(qt = si ∧ qt+1 = sj | o1o2...oT)
= Σj εij(t)
Max likelihood for HMMs
• Sum over the time index:
– Expected # of transitions from state i to j in o1o2...oT:
Σt=1..T−1 εij(t)
– Expected # of transitions from state i in o1o2...oT:
Σt=1..T−1 γi(t) = Σt=1..T−1 Σj=1..N εij(t)
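Given α and β, these expected counts are a couple of array expressions. A sketch (NumPy; the names eps and gamma are mine): eps[t, i, j] is εij(t) and gamma[t, i] is γi(t), with the denominator Σi αt(i) βt(i) taken at the final step, where it equals the sequence likelihood:

```python
import numpy as np

def expected_counts(alpha, beta, T, E, obs):
    """eps[t, i, j] = eps_ij(t);  gamma[t, i] = gamma_i(t)."""
    L, N = alpha.shape
    lik = alpha[-1].sum()                 # p(o_1 ... o_T)
    eps = np.zeros((L - 1, N, N))
    for t in range(L - 1):
        # eps_ij(t) = alpha_t(i) T_ij E_j(o_{t+1}) beta_{t+1}(j) / p(o_1...o_T)
        eps[t] = alpha[t][:, None] * T * E[:, obs[t + 1]][None, :] * beta[t + 1][None, :] / lik
    gamma = alpha * beta / lik            # gamma_i(t) = alpha_t(i) beta_t(i) / p(o_1...o_T)
    return eps, gamma
```

For every t < T, gamma[t] equals eps[t] summed over j, and both are proper probability distributions.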
Update parameters
• π̂i = expected frequency in state i at time t = 1 = γi(1)
• T̂ij = expected # of transitions from state i to j / expected # of transitions from state i
= Σt=1..T−1 εij(t) / Σt=1..T−1 γi(t)
• Êi(xk) = expected # of transitions from state i with xk observed / expected # of transitions from state i
= Σt=1..T−1 δ(ot, xk) γi(t) / Σt=1..T−1 γi(t)
• These estimate the parameters πi = p(q1 = si), Tij = p(qt+1 = sj | qt = si), Ei(x) = p(ot = x | qt = si).
The inner loop of Forward-Backward
Given an input sequence:
1. Calculate forward probability:
– Base case: α1(i) = πi Ei(o1)
– Recursive case: αt+1(i) = Ei(ot+1) Σj αt(j) Tji
2. Calculate backward probability:
– Base case: βT(i) = 1
– Recursive case: βt(i) = Σj Tij Ej(ot+1) βt+1(j)
3. Calculate expected counts:
εij(t) = αt(i) Tij Ej(ot+1) βt+1(j) / Σi αt(i) βt(i)
4. Update parameters:
T̂ij = Σt=1..T−1 εij(t) / Σj=1..N Σt=1..T−1 εij(t)
Êi(xk) = Σt=1..T−1 δ(ot, xk) Σj εij(t) / Σj=1..N Σt=1..T−1 εij(t)
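Putting the four steps together gives one EM (Baum-Welch) iteration. A self-contained sketch under the same assumed array conventions; by the EM guarantee, the likelihood of the data never decreases from one iteration to the next:

```python
import numpy as np

def baum_welch_step(pi, T, E, obs):
    """One Forward-Backward (Baum-Welch) update.

    Returns the re-estimated (pi, T, E) and the likelihood of obs
    under the *current* parameters.
    """
    N, L, K = len(pi), len(obs), E.shape[1]
    # Step 1: forward probabilities.
    alpha = np.zeros((L, N))
    alpha[0] = pi * E[:, obs[0]]
    for t in range(1, L):
        alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ T)
    # Step 2: backward probabilities.
    beta = np.ones((L, N))
    for t in range(L - 2, -1, -1):
        beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])
    lik = alpha[-1].sum()
    # Step 3: expected counts.
    gamma = alpha * beta / lik
    eps = np.array([alpha[t][:, None] * T * E[:, obs[t + 1]] * beta[t + 1] / lik
                    for t in range(L - 1)])
    # Step 4: re-estimate the parameters.
    new_pi = gamma[0]
    new_T = eps.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    obs_arr = np.array(obs)
    new_E = np.stack([gamma[obs_arr == k].sum(axis=0) for k in range(K)], axis=1)
    new_E /= gamma.sum(axis=0)[:, None]
    return new_pi, new_T, new_E, lik
```

Iterating this function until the returned likelihood stops improving is the whole training loop.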
Forward-Backward: EM for HMM
• If we knew Φ, we could estimate expectations of quantities such as
– Expected number of times in state i
– Expected number of transitions i → j
• If we knew the quantities such as
– Expected number of times in state i
– Expected number of transitions i → j
we could compute the max likelihood estimate of Φ = ⟨T, E, Π⟩
• Also known (for the HMM case) as the Baum-Welch algorithm.
EM for HMM
• Each iteration provides values for all the parameters.
• The new model always improves the likelihood of the training data:
p(o1o2...oT | Φ̂) ≥ p(o1o2...oT | Φ)
• The algorithm does not guarantee to reach the global maximum.
EM for HMM
• Bad News
– There are lots of local optima.
• Good News
– The local optima are usually adequate models of the data.
• Notice
– EM does not estimate the number of states. That must be given (a tradeoff).
– Often, HMMs are forced to have some links with zero probability. This is done by setting Tij = 0 in the initial estimate Φ(0).
– Easy extension of everything seen today: HMMs with real-valued outputs.
Contents
• Introduction
• Markov Chain
• Hidden Markov Models
• Markov Random Field (from the viewpoint of classification)
Example: Image segmentation
• Observations: pixel values
• Hidden variable: class of each pixel
• It’s reasonable to think that there are some underlying relationships
between neighbouring pixels... Can we use Markov models?
• Errr.... the relationships are in 2D!
MRF as a 2D generalization of MC
• Array of observations: X = {xij}, 0 ≤ i < Nx, 0 ≤ j < Ny
• Classes/States: S = {sij}, sij = 1...M
• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that p(S | X) is maximum.
2D context-dependent classification
• Assumptions:
– The values of the elements in S are mutually dependent.
– The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood Nij is defined so that
– sij ∉ Nij: the (i, j) element does not belong to its own set of neighbors.
– sij ∈ Nkl ⇔ skl ∈ Nij: if sij is a neighbor of skl, then skl is also a neighbor of sij.
2D context-dependent classification
• The Markov property for the 2D case:
p(sij | S̄ij) = p(sij | Nij)
where S̄ij includes all the elements of S except the (i, j) one.
• The elegant dynamic programming is not applicable: the problem is much harder now!
• We are gonna see an application of MRF for Image Segmentation and Restoration.
MRF for Image Segmentation
• Cliques: a set of pixels that are neighbors of one another (w.r.t. the type of neighborhood)
MRF for Image Segmentation
• Dual lattice
• Line process
[Figure: dual lattice and line process illustration.]
MRF for Image Segmentation
• Gibbs distribution:
π(s) = (1/Z) exp(−U(s)/T)
– Z: normalizing constant
– T: parameter
• It turns out that the Gibbs distribution implies MRF ([Geman 84]).
MRF for Image Segmentation
• A Gibbs conditional probability is of the form:
p(sij | Nij) = (1/Z) exp(−(1/T) Σk Fk(Ck(i, j)))
– Ck(i, j): cliques of the pixel (i, j)
– Fk: some functions, e.g.
−(1/T) sij (α1(si−1,j + si+1,j) + α2(si,j−1 + si,j+1))
MRF for Image Segmentation
• Then, the joint probability for the Gibbs model is
p(S) = (1/Z) exp(−(1/T) Σ(i,j) Σk Fk(Ck(i, j)))
– The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X | S).
• Then p(X | S) p(S) can be maximized... [Geman 84]
More on Markov models...
• MRF does not stop there... Here are some related models:
– Conditional random field (CRF)
– Graphical models
– ...
• Markov Chain and HMM do not stop there either...
– Markov chain of order m
– Continuous-time Markov chains
– Real-valued observations
– ...
What you should know
• Markov property, Markov Chain
• HMM:
– Defining and computing αt(i)
– Viterbi algorithm
– Outline of the EM algorithm for HMM
• Markov Random Field
– And an application in Image Segmentation
– [Geman 84] for more information.
Q & A
References
• L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
• Andrew W. Moore, “Hidden Markov Models”, http://www.autonlab.org/tutorials/
• Geman S., Geman D., “Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.