Hidden Markov Models
with slides from Lise Getoor, Sebastian Thrun, William Cohen, and Yair Weiss
Outline
- Markov Models
- Hidden Markov Models
- The Main Problems in HMM Context
- Implementation Issues
- Applications of HMMs
Weather: A Markov Model
[Figure: state-transition diagram over three weather states, Sunny, Rainy, and Snowy. Arrows: Sunny→Sunny 80%, Sunny→Rainy 15%, Sunny→Snowy 5%; Rainy→Rainy 60%, Rainy→Sunny 38%, Rainy→Snowy 2%; Snowy→Snowy 20%, Snowy→Sunny 75%, Snowy→Rainy 5%.]
Ingredients of a Markov Model
- States: $\{S_1, S_2, \ldots, S_N\}$
- State transition probabilities: $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$
- Initial state distribution: $\pi_i = P(q_1 = S_i)$
[Same state-transition diagram as above.]
Ingredients of Our Markov Model
- States: $\{S_{sunny}, S_{rainy}, S_{snowy}\}$
- State transition probabilities:
  $A = \begin{pmatrix} .8 & .15 & .05 \\ .38 & .60 & .02 \\ .75 & .05 & .20 \end{pmatrix}$ (rows and columns ordered sunny, rainy, snowy)
- Initial state distribution: $\pi = (.7, .25, .05)$
[Same state-transition diagram as on the previous slide.]
Probability of a Seq. of States
Given:
$\pi = (.7, .25, .05)$, $\quad A = \begin{pmatrix} .8 & .15 & .05 \\ .38 & .60 & .02 \\ .75 & .05 & .20 \end{pmatrix}$
What is the probability of the state sequence (sunny, rainy, rainy, rainy, snowy, snowy)?
$P(S_{sunny})\, P(S_{rainy} \mid S_{sunny})\, P(S_{rainy} \mid S_{rainy})\, P(S_{rainy} \mid S_{rainy})\, P(S_{snowy} \mid S_{rainy})\, P(S_{snowy} \mid S_{snowy})$
$= 0.7 \times 0.15 \times 0.6 \times 0.6 \times 0.02 \times 0.2 = 0.0001512$
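To make this calculation concrete, here is a minimal Python sketch of the same computation. The array layout (rows and columns ordered sunny, rainy, snowy) and all names are my own choices, not part of the slides.

```python
import numpy as np

# Transition matrix and initial distribution from the slides
# (rows/columns ordered: sunny, rainy, snowy).
states = ["sunny", "rainy", "snowy"]
A = np.array([[0.80, 0.15, 0.05],
              [0.38, 0.60, 0.02],
              [0.75, 0.05, 0.20]])
pi = np.array([0.70, 0.25, 0.05])

def sequence_probability(seq):
    """P(q_1, ..., q_T) = pi[q_1] * prod_t A[q_{t-1}, q_t]."""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]
    return p

# The sequence from the slide: sunny, rainy, rainy, rainy, snowy, snowy.
print(sequence_probability(["sunny", "rainy", "rainy", "rainy", "snowy", "snowy"]))
# -> 0.0001512
```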
Outline
- Markov Models
- Hidden Markov Models
- The Main Problems in HMM Context
- Implementation Issues
- Applications of HMMs
Hidden Markov Models
[Figure: the weather Markov chain as before, but the states Sunny, Rainy, Snowy are now NOT OBSERVABLE; each hidden state emits an observation (shorts, coat, or umbrella). Transition probabilities as before; emission probabilities: Sunny: shorts 60%, coat 30%, umbrella 10%; Rainy: shorts 5%, coat 30%, umbrella 65%; Snowy: shorts 0%, coat 50%, umbrella 50%.]
Ingredients of an HMM
- States: $\{S_1, S_2, \ldots, S_N\}$
- State transition probabilities: $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$ (the probability of moving from state $i$ to state $j$)
- Initial state distribution: $\pi_i = P(q_1 = S_i)$
- Observations: $\{O_1, O_2, \ldots, O_M\}$
- Observation probabilities: $b_j(k) = P(O_t = v_k \mid q_t = S_j)$ (emit output $k$ in state $j$)
Ingredients of Our HMM
- States: $\{S_{sunny}, S_{rainy}, S_{snowy}\}$
- Observations: $\{O_{shorts}, O_{coat}, O_{umbrella}\}$
- State transition probabilities: $A = \begin{pmatrix} .8 & .15 & .05 \\ .38 & .60 & .02 \\ .75 & .05 & .20 \end{pmatrix}$
- Initial state distribution: $\pi = (.7, .25, .05)$
- Observation probabilities: $B = \begin{pmatrix} .6 & .3 & .1 \\ .05 & .3 & .65 \\ 0 & .5 & .5 \end{pmatrix}$ (rows: sunny, rainy, snowy; columns: shorts, coat, umbrella)
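For use in the sketches later in these notes, the weather HMM can be written down directly as arrays. This is a minimal sketch; the variable names and the (sunny, rainy, snowy) / (shorts, coat, umbrella) orderings are assumptions I am making to match the matrices above.

```python
import numpy as np

# The weather HMM from the slides (my own variable names).
states = ["sunny", "rainy", "snowy"]
obs_symbols = ["shorts", "coat", "umbrella"]

A = np.array([[0.80, 0.15, 0.05],    # transitions out of sunny
              [0.38, 0.60, 0.02],    # transitions out of rainy
              [0.75, 0.05, 0.20]])   # transitions out of snowy
B = np.array([[0.60, 0.30, 0.10],    # emissions in sunny
              [0.05, 0.30, 0.65],    # emissions in rainy
              [0.00, 0.50, 0.50]])   # emissions in snowy
pi = np.array([0.70, 0.25, 0.05])

# Sanity check: every row of A and B, and pi itself, sums to 1.
assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)
```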
Three Basic Problems
- Evaluation (aka likelihood): compute P(O | HMM)
- Decoding (aka inference): given an observed output sequence O,
  - compute the most likely state at each time period, or
  - compute the most likely state sequence q* = argmax_q P(q | O, HMM)
- Training (aka learning): find HMM* = argmax_HMM P(O | HMM)
Probability of an Output Sequence
Given: the HMM above ($\pi$, $A$, $B$)
What is the probability of this output sequence?
$P(O) = P(O_{coat}, O_{coat}, O_{umbrella}, \ldots, O_{umbrella})$
$= \sum_{\text{all } Q = q_1, \ldots, q_7} P(O \mid Q)\, P(Q) = \sum_{q_1, \ldots, q_7} P(O_1, \ldots, O_7 \mid q_1, \ldots, q_7)\, P(q_1, \ldots, q_7)$
$= 0.7 \times 0.8 \times 0.3 \times 0.1 \times 0.6 \times \cdots + \cdots$
→ an exponential number of terms
The Forward Algorithm
[Figure: the HMM trellis — states $S_1, S_2, S_3$ unrolled over time, with observations $O_1, O_2, O_3, \ldots$ at each step.]
$\alpha_t(i) = P(O_1 \ldots O_t,\; q_t = S_i)$
$P(O) = \sum_{i=1}^{N} P(O,\; q_T = S_i) = \sum_{i=1}^{N} \alpha_T(i)$
The Forward Algorithm (cont.)
[Same trellis figure.]
$\alpha_t(i) = P(O_1 \ldots O_t,\; q_t = S_i)$
$\alpha_{t+1}(j) = P(O_1 \ldots O_{t+1},\; q_{t+1} = S_j)$
$\quad = \sum_{i=1}^{N} P(O_{t+1}, q_{t+1} = S_j \mid O_1 \ldots O_t, q_t = S_i)\; P(O_1 \ldots O_t, q_t = S_i)$
$\quad = \sum_{i=1}^{N} P(O_{t+1}, q_{t+1} = S_j \mid q_t = S_i)\; \alpha_t(i)$
$\quad = b_j(O_{t+1}) \sum_{i=1}^{N} a_{ij}\, \alpha_t(i)$
Initialization: $\alpha_1(i) = \pi_i\, b_i(O_1)$
(first get to state $i$, then move to state $j$, then emit output $O_{t+1}$)
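A minimal Python sketch of this recursion, using the weather HMM's arrays; the function name, the integer encoding of observations, and the example sequence are my own choices.

```python
import numpy as np

def forward(A, B, pi, obs):
    """alpha[t, i] = P(O_1..O_t, q_t = S_i); returns (alpha, P(O))."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        # alpha_{t+1}(j) = b_j(O_{t+1}) * sum_i alpha_t(i) * a_ij
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    return alpha, alpha[-1].sum()                # P(O) = sum_i alpha_T(i)

# Example with the weather HMM (0=sunny, 1=rainy, 2=snowy; 0=shorts, 1=coat, 2=umbrella).
A = np.array([[.8, .15, .05], [.38, .6, .02], [.75, .05, .2]])
B = np.array([[.6, .3, .1], [.05, .3, .65], [0, .5, .5]])
pi = np.array([.7, .25, .05])
alpha, p_obs = forward(A, B, pi, [1, 1, 2, 2])   # coat, coat, umbrella, umbrella
print(p_obs)
```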
Exercise
What is the probability of observing AB?
a. Initial state s1:
b. Initial state chosen at random:
[Figure: a two-state HMM with states s1 and s2; transition probabilities 0.4 and 0.6 out of one state, 0.7 and 0.3 out of the other; emission probabilities A: 0.7 / B: 0.3 in one state and A: 0.8 / B: 0.2 in the other.]
Answers:
a. 0.2 × (0.4 × 0.8 + 0.6 × 0.7) = 0.148
b. 0.5 × 0.148 + (0.5 × 0.3 × (0.3 × 0.7 + 0.7 × 0.8)) = 0.1895
The Backward Algorithm
[Same trellis figure.]
$\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T \mid q_t = S_i)$
$\beta_T(i) = 1$
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$
$P(O) = \sum_i \pi_i\, b_i(O_1)\, \beta_1(i)$
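A matching Python sketch of the backward recursion, again with names and conventions of my own choosing.

```python
import numpy as np

def backward(A, B, pi, obs):
    """beta[t, i] = P(O_{t+1}..O_T | q_t = S_i); also returns P(O) computed from beta."""
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = np.sum(pi * B[:, obs[0]] * beta[0])       # P(O) = sum_i pi_i b_i(O_1) beta_1(i)
    return beta, p_obs
```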
The Forward-Backward Algorithm
[Same trellis figure, annotated with $\alpha_t(i)$ and $\beta_t(i)$.]
$P(O) = \sum_i \alpha_t(i)\, \beta_t(i)$ for any $t$
=> you can derive the formulas for the forward and backward algorithms from this.
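The same identity gives the per-time-step state posteriors used later in learning. A short sketch, assuming `alpha` and `beta` are the arrays produced by the forward and backward sketches above:

```python
import numpy as np

def state_posteriors(alpha, beta):
    """gamma[t, i] = P(q_t = S_i | O) = alpha_t(i) * beta_t(i) / P(O)."""
    p_obs = (alpha * beta).sum(axis=1)     # the same value P(O) for every t
    return (alpha * beta) / p_obs[:, None]
```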
Finding the Best State Sequence
We would like to find the most likely path (and not just the most likely state at each time slice).
The Viterbi algorithm is an efficient method for finding the MPE (most probable explanation):
$Q^* = \arg\max_Q P(Q \mid O) = \arg\max_Q P(Q, O)$
$\delta_1(i) = \pi_i\, b_i(O_1)$
$\delta_{t+1}(j) = b_j(O_{t+1}) \max_i \delta_t(i)\, a_{ij}$, $\qquad \psi_{t+1}(j) = \arg\max_i \delta_t(i)\, a_{ij}$
$P(Q^*) = \max_i \delta_T(i)$, $\qquad Q^*_T = \arg\max_i \delta_T(i)$
and we then reconstruct the path: $Q^*_t = \psi_{t+1}(Q^*_{t+1})$
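A minimal Python sketch of the Viterbi recursion and back-tracking described above; function and variable names are my own.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state sequence q* = argmax_q P(q, O) via the Viterbi recursion."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))     # delta[t, j] = best score of any path ending in state j at time t
    psi = np.zeros((T, N), int)  # psi[t, j]   = back-pointer (best previous state)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = B[:, obs[t]] * scores.max(axis=0)
    # Back-track from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()
```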
Hidden Markov Models
[Figure repeated: the weather HMM with hidden states Sunny, Rainy, Snowy (NOT OBSERVABLE) and the same transition and emission probabilities as before.]
Learning the Model with EM
Problem: Find the HMM that makes the data most likely
E-Step: Compute $P(q_t = S_i \mid O, \lambda)$ for the given model $\lambda$
M-Step: Compute the new model under these expectations (this is now a Markov model)
E-Step
Calculate
$\gamma_t(i) = P(q_t = S_i \mid O, \lambda)$
$\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$
using the forward-backward algorithm, for the fixed model $\lambda$.
The M Step: generate $\lambda = (\pi, a, b)$
$\pi_i = \gamma_1(i)$ = expected number of times in state $S_i$ at time 1
$a_{ij} = \dfrac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
$b_i(k) = \dfrac{\text{expected number of times in state } S_i \text{ observing } v_k}{\text{expected number of times in state } S_i} = \dfrac{\sum_{t:\, O_t = v_k} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$
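The E-step and M-step formulas above can be folded into one update. Below is a minimal, unscaled sketch for a single observation sequence (names are my own; edge cases such as zero counts are ignored), not a production Baum-Welch implementation.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One EM iteration for a single observation sequence of integer symbols."""
    T, N = len(obs), len(pi)
    # E-step: forward-backward.
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    # gamma[t, i] = P(q_t = S_i | O);  xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O)
    gamma = alpha * beta / p_obs
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs
    # M-step: re-estimate pi, A, B from the expected counts.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new, pi_new
```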
Understanding the EM Algorithm
The best way to understand the EM algorithm:
- start with the M step, understand what quantities it needs
- then look at the E step, see how it computes those quantities with the help of the forward-backward algorithm
Summary (Learning)
- Given observation sequence O
- Guess an initial model
- Iterate:
  - Calculate expected times in state S_i at time t (and in S_j at time t+1) using the forward-backward algorithm
  - Find the new model by frequency counts
Implementing HMM Algorithms
- Quantities get very small for long sequences
- Taking the logarithm helps:
  - the Viterbi algorithm
  - computing the alphas and betas
  - but it is not helpful in computing the gammas
- The normalization method can help with these problems; see the note by ChengXiang Zhai (a scaled-forward sketch follows below)
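One common normalization (scaling) scheme is sketched below: the alphas are renormalized at every time step and the log-likelihood is accumulated from the scaling factors. This is an illustrative sketch, not necessarily the exact scheme in Zhai's note.

```python
import numpy as np

def forward_scaled(A, B, pi, obs):
    """Scaled forward pass: returns normalized alphas and log P(O) without underflow."""
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N))
    log_p = 0.0
    for t in range(T):
        v = pi * B[:, obs[0]] if t == 0 else B[:, obs[t]] * (alpha_hat[t - 1] @ A)
        c = v.sum()               # scaling factor c_t
        alpha_hat[t] = v / c      # normalized alphas sum to 1 at each t
        log_p += np.log(c)        # log P(O) = sum_t log c_t
    return alpha_hat, log_p
```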
Problems with HMMs
- Zero probabilities
  - Training sequence: AAABBBAAA
  - Test sequence: AAABBBCAAA
- Finding the "right" number of states, the right structure
- Numerical instabilities
Outline
- Markov Models
- Hidden Markov Models
- The Main Problems in HMM Context
- Implementation Issues
- Applications of HMMs
Three Problems
- What bird is this? → Time series classification
- How will the song continue? → Time series prediction
- Is this bird abnormal? → Outlier detection
Time Series Classification
Train one HMM $\lambda_l$ for each bird $l$. Given time series O, calculate
$P(\text{bird} = l \mid O) = \dfrac{P(O \mid \lambda_l)\, P(l)}{\sum_{l'} P(O \mid \lambda_{l'})\, P(l')}$
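In log space this is just an argmax over per-class log-likelihoods plus log priors. A small sketch, assuming the `forward_scaled` function from the implementation-notes sketch above is in scope; `models` and `log_prior` are hypothetical containers I am introducing for illustration.

```python
def classify(obs, models, log_prior):
    """Return argmax_l [ log P(O | lambda_l) + log P(l) ].
    `models` maps label -> (A, B, pi); log-likelihoods come from forward_scaled above."""
    scores = {l: forward_scaled(A, B, pi, obs)[1] + log_prior[l]
              for l, (A, B, pi) in models.items()}
    return max(scores, key=scores.get)
```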
Outlier Detection
Train an HMM $\lambda$. Given time series O, calculate the probability $P(O \mid \lambda)$.
If it is abnormally low, raise a flag.
Time Series Prediction
Train an HMM $\lambda$. Given time series O, calculate the distribution over the final state (via $P(q_T = S_i \mid O, \lambda)$), and 'hallucinate' new states and observations according to $a$, $b$.
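A minimal sketch of that 'hallucination' step: sample the current state from the final-state posterior, then repeatedly sample transitions and emissions. Names and the `steps` parameter are my own.

```python
import numpy as np

def hallucinate(A, B, final_state_dist, steps, rng=None):
    """Sample a continuation: draw q_T from P(q_T = . | O), then sample states ~ a and observations ~ b."""
    if rng is None:
        rng = np.random.default_rng()
    state = rng.choice(len(final_state_dist), p=final_state_dist)
    future = []
    for _ in range(steps):
        state = rng.choice(A.shape[1], p=A[state])          # next hidden state ~ a
        future.append(rng.choice(B.shape[1], p=B[state]))   # observation ~ b
    return future
```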
Typical HMM in Speech Recognition
20-dim frequency space, clustered using EM
Use Bayes rule + Viterbi for classification
Linear HMM representing one phoneme
[Rabiner 86] + everyone else
Typical HMM in Robotics
[Blake/Isard 98, Fox/Dellaert et al 99]
IE with Hidden Markov Models
Yesterday Pedro Domingos spoke this example sentence.
Person name: Pedro Domingos
Given a sequence of observations:
and a trained HMM:
Find the most likely state sequence (Viterbi): $s^* = \arg\max_s P(s, o)$
Any words said to be generated by the designated "person name" state are extracted as a person name.
(States: person name, location name, background.)
HMM for Segmentation
Simplest Model: One state per entity type
What is a “symbol” ???
Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
4601 => “4601”, “9999”, “9+”, “number”, … ?
[Figure: a symbol-abstraction hierarchy. All → Numbers, Chars, Multi-letter Words, Delimiters. Numbers → 3-digits (000..999), 5-digits (00000..99999), Others (0..99, 0000..9999, 000000..). Chars: A..z. Multi-letter Words: aa... Delimiters: . , / - + ? #]
Datamold: choose best abstraction level using holdout set
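To make the "what is a symbol" choice concrete, here is a toy token-abstraction function. The level names and the specific rules are my own illustration of the idea, not Datamold or any particular system.

```python
import re

def abstract_token(tok, level):
    """Map a raw token to a coarser symbol, loosely following the hierarchy above."""
    if level == "exact":
        return tok
    if tok.isdigit():
        return f"{len(tok)}-digits" if len(tok) in (3, 5) else "number"
    if re.fullmatch(r"[.,/\-+?#]", tok):
        return "delimiter"
    if level == "shape":                      # e.g. Cohen -> Xxxxx
        return re.sub(r"[a-z]", "x", re.sub(r"[A-Z]", "X", tok))
    return "word"

print(abstract_token("Cohen", "shape"))   # Xxxxx
print(abstract_token("4601", "shape"))    # "number" (4 digits, not 3 or 5)
```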
HMM Example: “Nymble”
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]
Task: Named Entity Extraction
Train on ~500k words of news wire text.
Results:
  Case   Language   F1
  Mixed  English    93%
  Upper  English    91%
  Mixed  Spanish    90%
[Bikel, et al 1998], [BBN “IdentiFinder”]
[Figure: Nymble's state structure — name-class states such as Person and Org (plus five other name classes) and an Other state, connected to start-of-sentence and end-of-sentence states.]
Transition probabilities: $P(s_t \mid s_{t-1}, o_{t-1})$, with back-off to $P(s_t \mid s_{t-1})$, then $P(s_t)$.
Observation probabilities: $P(o_t \mid s_t, s_{t-1})$ or $P(o_t \mid s_t, o_{t-1})$, with back-off to $P(o_t \mid s_t)$, then $P(o_t)$.
Passage Selection (e.g., for IR)
[Figure: given a query and collection information, a document is segmented into relevant passages and background passages.]
How is a relevant passage different from a background passage in terms of language modeling?
HMMs: Main Lessons
HMMs: Generative probabilistic models of time series (with hidden state)
Forward-Backward: Algorithm for computing probabilities over hidden states
Learning models: EM, iterates estimation of hidden state and model fitting
Extremely practical, best known methods in speech, computer vision, robotics, …
Numerous extensions exist (continuous observations and states; factorial HMMs; controllable HMMs = POMDPs; …)