TRANSCRIPT
Hidden Markov Models
1
10-601 Introduction to Machine Learning
Matt Gormley
Lecture 20
Nov. 7, 2018
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Reminders
• Homework 6: PAC Learning / Generative Models
  – Out: Wed, Oct 31
  – Due: Wed, Nov 7 at 11:59pm (1 week)
• Homework 7: HMMs
  – Out: Wed, Nov 7
  – Due: Mon, Nov 19 at 11:59pm
2
HMM Outline
• Motivation
  – Time Series Data
• Hidden Markov Model (HMM)
  – Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]
  – Background: Markov Models
  – From Mixture Model to HMM
  – History of HMMs
  – Higher-order HMMs
• Training HMMs
  – (Supervised) Likelihood for HMM
  – Maximum Likelihood Estimation (MLE) for HMM
  – EM for HMM (aka. Baum-Welch algorithm)
• Forward-Backward Algorithm
  – Three Inference Problems for HMM
  – Great Ideas in ML: Message Passing
  – Example: Forward-Backward on 3-word Sentence
  – Derivation of Forward Algorithm
  – Forward-Backward Algorithm
  – Viterbi algorithm
3
This Lecture
Last Lecture
SUPERVISED LEARNING FOR HMMS
4
HMM Parameters:
Hidden Markov Model
6
X1 X2 X3 X4 X5
Y1 Y2 Y3 Y4 Y5

Transition probabilities (row = current state, column = next state):
      O    S    C
O    .9   .08  .02
S    .2   .7   .1
C    .9   0    .1

Emission probabilities (columns = travel times 1min, 2min, 3min, …):
      1min  2min  3min  …
O    .1    .2    .3
S    .01   .02   .03
C    0     0     0

Initial probabilities:
O  .8
S  .1
C  .1
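(Reading the tables above in probability notation, with Y_t the hidden state and X_t the observed travel time: the single-column table gives the initial probabilities p(Y_1 = k), the square state-by-state table gives the transition probabilities p(Y_t = k | Y_{t-1} = j), and the state-by-travel-time table gives the emission probabilities p(X_t = x | Y_t = k). For example, p(Y_1 = O) = .8, p(Y_t = S | Y_{t-1} = O) = .08, and p(X_t = 2min | Y_t = O) = .2.)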
Training HMMs
Whiteboard
– (Supervised) Likelihood for an HMM
– Maximum Likelihood Estimation (MLE) for HMM
7
Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models
8
Yt Yt+1
Xt
Yt
HMM Parameters:
Assumption:
Generative Story:
Hidden Markov Model
9
X1 X2 X3 X4 X5
Y0 Y1 Y2 Y3 Y4 Y5
y0 = START
For notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption.
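As a sketch of the generative story referred to above (using B for the transition matrix and A for the emission matrix, matching the A(·,·)/B(·,·) factor names used later in the trellis slides, and with the y0 = START fold-in so that the START row of B plays the role of the initial probabilities):

y_0 = START
for t = 1, …, T:
    y_t ~ Multinomial(B_{y_{t-1}, ·})     (choose the next hidden state)
    x_t ~ Multinomial(A_{y_t, ·})         (emit an observation given that state)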
Joint Distribution:
Hidden Markov Model
10
X1 X2 X3 X4 X5
Y0 Y1 Y2 Y3 Y4 Y5
y0 = START
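The joint distribution referred to above is the standard first-order HMM factorization (written with the y_0 = START convention from the previous slide):

p(x_1, …, x_T, y_1, …, y_T) = Π_{t=1}^{T} p(y_t | y_{t-1}) · p(x_t | y_t)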
Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models
11
Yt Yt+1
Xt
Yt
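As a concrete sketch of what "solving two independent Mixture Models" means in code: supervised MLE just counts transitions and emissions and normalizes each row. The snippet below is illustrative only (the function name, argument layout, and absence of smoothing are my assumptions, not the course's reference implementation):

import numpy as np

def hmm_mle(tag_seqs, word_seqs, K, V):
    # Supervised MLE for an HMM: count, then normalize.
    # tag_seqs[i][t] in {0..K-1}; word_seqs[i][t] in {0..V-1}.
    init = np.zeros(K)          # p(y_1 = k)
    trans = np.zeros((K, K))    # p(y_t = k | y_{t-1} = j)
    emit = np.zeros((K, V))     # p(x_t = w | y_t = k)
    for tags, words in zip(tag_seqs, word_seqs):
        init[tags[0]] += 1
        for t in range(len(tags)):
            emit[tags[t], words[t]] += 1
            if t > 0:
                trans[tags[t - 1], tags[t]] += 1
    # Normalize each distribution (no smoothing; a real implementation would add pseudocounts).
    init /= init.sum()
    trans /= trans.sum(axis=1, keepdims=True)
    emit /= emit.sum(axis=1, keepdims=True)
    return init, trans, emit

The transition estimate uses only (y_{t-1}, y_t) pairs and the emission estimate only (y_t, x_t) pairs, which is the sense in which the two problems decouple.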
HMMs: History
• Markov chains: Andrey Markov (1906)
– Random walks and Brownian motion
• Used in Shannon’s work on information theory (1948)
• Baum-Welch learning algorithm: late 60's, early 70's.
– Used mainly for speech in 60s-70s.
• Late 80’s and 90’s: David Haussler (major player in learning theory in 80’s) began to use HMMs for modeling biological sequences
• Mid-late 1990’s: Dayne Freitag/Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …
13
Slide from William Cohen
Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM)
• 2nd-order HMM (i.e. trigram HMM)
• 3rd-order HMM
14
Y1 Y2 Y3 Y4 Y5
X1 X2 X3 X4 X5
<START>
Y1 Y2 Y3 Y4 Y5
X1 X2 X3 X4 X5
<START>
Y1 Y2 Y3 Y4 Y5
X1 X2 X3 X4 X5
<START>
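In equations (a standard way to write these; the emission term is the same across orders, only the tag dependence reaches further back):

1st-order (bigram):   p(x, y) = Π_t p(y_t | y_{t-1}) p(x_t | y_t)
2nd-order (trigram):  p(x, y) = Π_t p(y_t | y_{t-2}, y_{t-1}) p(x_t | y_t)
3rd-order:            p(x, y) = Π_t p(y_t | y_{t-3}, y_{t-2}, y_{t-1}) p(x_t | y_t)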
BACKGROUND: MESSAGE PASSING
15
Great Ideas in ML: Message Passing
3 behind you
2 behind you
1 behind you
4 behind you
5 behind you
1 before you
2 before you
there's 1 of me
3 before you
4 before you
5 before you
Count the soldiers
16
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
3 behind you
2 before you
there's 1 of me
Belief: Must be 2 + 1 + 3 = 6 of us
only see my incoming messages
2  1  3
Count the soldiers
17
adapted from MacKay (2003) textbook
2 before you
Great Ideas in ML: Message Passing
4 behind you
1 before you
there's 1 of me
only see my incoming messages
Count the soldiers
18
adapted from MacKay (2003) textbook
Belief: Must be 2 + 1 + 3 = 6 of us
2  1  3
Belief: Must be 1 + 1 + 4 = 6 of us
1  1  4
Great Ideas in ML: Message Passing
7 here
3 here
11 here (= 7+3+1)
1 of me
Each soldier receives reports from all branches of tree
19
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
3 here
3 here
7 here (= 3+3+1)
Each soldier receives reports from all branches of tree
20
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
7 here
3 here
11 here (= 7+3+1)
Each soldier receives reports from all branches of tree
21
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
7 here
3 here
3 here
Belief: Must be 14 of us
Each soldier receives reports from all branches of tree
22
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
7 here
3 here
3 here
Belief: Must be 14 of us
23
adapted from MacKay (2003) textbook
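The soldier-counting trick is literally one forward pass and one backward pass of additive messages along a chain; each soldier's belief combines the two incoming messages with "there's 1 of me". A tiny illustrative sketch (names and framing are mine, not MacKay's):

def count_soldiers(n):
    forward = [0] * n                      # "... behind you" message arriving from the left
    backward = [0] * n                     # "... before you" message arriving from the right
    for i in range(1, n):
        forward[i] = forward[i - 1] + 1
    for i in range(n - 2, -1, -1):
        backward[i] = backward[i + 1] + 1
    # belief at each position = incoming messages + 1 (myself)
    return [forward[i] + backward[i] + 1 for i in range(n)]

print(count_soldiers(6))   # every soldier concludes: must be 6 of us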
THE FORWARD-BACKWARD ALGORITHM
24
Inference for HMMs
Whiteboard
– Three Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
25
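Stated in symbols (a standard formulation, consistent with the notation used later in the lecture):

1. Evaluation:        compute p(x_1, …, x_T) = Σ_y p(x, y)
2. Viterbi decoding:  compute ŷ = argmax_y p(y | x_1, …, x_T)
3. Marginals:         compute p(y_t = k | x_1, …, x_T) for each t and k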
Dataset for Supervised Part-of-Speech (POS) Tagging
26
Data: D = {x(n), y(n)}, n = 1 … N

Sample 1 (x(1), y(1)):  time/n  flies/v  like/p  an/d  arrow/n
Sample 2 (x(2), y(2)):  time/n  flies/n  like/v  an/d  arrow/n
Sample 3 (x(3), y(3)):  flies/n  fly/v  with/p  their/n  wings/n
Sample 4 (x(4), y(4)):  with/p  time/n  you/n  will/v  see/v
time flies like an arrow
n v p d n
<START>
Hidden Markov Model
28
A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.
Transition probabilities (row = previous tag, column = next tag):
      v    n    p    d
v    .1   .4   .2   .3
n    .8   .1   .1   0
p    .2   .3   .2   .3
d    .2   .8   0    0

Emission probabilities (columns = words time, flies, like, …):
      time  flies  like  …
v    .2     .5     .2
n    .3     .4     .2
p    .1     .1     .3
d    .1     .2     .1
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
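That product is just the joint factorization p(y, x) = Π_t p(y_t | y_{t-1}) p(x_t | y_t) with each factor read off one of the two tables above; for instance .3 = p(time | n) and .8 = p(v | n). (The exact ordering of the remaining factors in "(.3 * .8 * .2 * .5 * …)" is not recoverable from the transcript, so treat this mapping as illustrative.)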
X1 X2 X3
Y1 Y2 Y3
29
find preferred tags
Could be adjective or verb · Could be noun or verb · Could be verb or noun
Forward-Backward Algorithm
Forward-Backward Algorithm
30
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm
31
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm
32
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
Y1 Y2 Y3
X1 X2 X3: find preferred tags
33
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
Forward-Backward Algorithm
Y1 Y2 Y3
X1 X2 X3: find preferred tags
34
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …
Forward-Backward Algorithm
Y1 Y2 Y3
X1 X2 X3: find preferred tags
35
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …
Forward-Backward Algorithm

Emission factors (rows = tag, columns = words find, pref., tags, …):
      find  pref.  tags
v    3      5      3
n    4      5      2
a    0.1    0.2    0.1

Transition factors (row = previous tag, column = next tag):
      v     n     a
v    1      6     4
n    8      4     0.1
a    0.1    8     0

Y1 Y2 Y3
X1 X2 X3: find preferred tags
Viterbi Algorithm: Most Probable Assignment
36
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product
B(a, END)
A(tags, n)
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Viterbi Algorithm: Most Probable Assignment
37
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
B(a, END)
A(tags, n)
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm: Finds Marginals
38
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm: Finds Marginals
39
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm: Finds Marginals
40
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = v) = (1/Z) * total weight of all paths through v
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm: Finds Marginals
41
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n
Y1 Y2 Y3
X1 X2 X3: find preferred tags
α2(n) = total weight of these path prefixes
(found by dynamic programming: matrix-vector products)
Forward-Backward Algorithm: Finds Marginals
42
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Y1 Y2 Y3
X1 X2 X3: find preferred tags
β2(n) = total weight of these path suffixes
(found by dynamic programming: matrix-vector products)
Forward-Backward Algorithm: Finds Marginals
43
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Y1 Y2 Y3
X1 X2 X3: find preferred tags
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
α2(n) = total weight of these path prefixes
β2(n) = total weight of these path suffixes
Forward-Backward Algorithm: Finds Marginals
44
α2(n) · β2(n) = (a + b + c)(x + y + z)
Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths
Y1 Y2 Y3
X1 X2 X3: find preferred tags
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Forward-Backward Algorithm: Finds Marginals
45
total weight of all paths through n = α2(n) × A(pref., n) × β2(n)
"belief that Y2 = n"
Oops! The weight of a path through a state also includes a weight at that state. So α2(n) · β2(n) isn't enough. The extra weight is the opinion of the emission probability at this variable.
Y1 Y2 Y3
X1 X2 X3: find preferred tags
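To connect this with the whiteboard derivation: on these slides α2(·) and β2(·) are path-prefix and path-suffix weights that exclude the emission weight at position 2, which is why the extra factor A(pref., ·) is needed. Under the more common textbook definition where α_t(k) = p(x_1, …, x_t, y_t = k) already includes the emission at t, the same marginal is simply

p(Y_t = k | x) = α_t(k) · β_t(k) / p(x),   with p(x) = Σ_k α_T(k).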
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Forward-Backward Algorithm: Finds Marginals
46
total weight of all paths through v = α2(v) × A(pref., v) × β2(v)
"belief that Y2 = n"
"belief that Y2 = v"
Y1 Y2 Y3
X1 X2 X3: find preferred tags
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Forward-Backward Algorithm: Finds Marginals
47
total weight of all paths through a = α2(a) × A(pref., a) × β2(a)
"belief that Y2 = n", "belief that Y2 = v", "belief that Y2 = a"
Unnormalized beliefs at Y2:  v 0.1, n 0, a 0.4
sum = Z (total weight of all paths) = 0.5
Divide by Z = 0.5 to get marginal probs:  v 0.2, n 0, a 0.8
X1 X2 X3
Y1 Y2 Y3
48
find preferred tags
Could be adjective or verb · Could be noun or verb · Could be verb or noun
Forward-Backward Algorithm
Inference for HMMs
Whiteboard
– Derivation of Forward algorithm
– Forward-backward algorithm
– Viterbi algorithm
49
Derivation of Forward Algorithm
50
Derivation:
Definition:
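A sketch of what the whiteboard derivation fills in here (the standard forward recursion, written with the y_0 = START convention used earlier):

Definition:  α_t(k) = p(x_1, …, x_t, y_t = k)
Base case:   α_1(k) = p(y_1 = k | y_0 = START) · p(x_1 | y_1 = k)
Recursion:   α_t(k) = p(x_t | y_t = k) · Σ_j α_{t-1}(j) · p(y_t = k | y_{t-1} = j)
Evaluation:  p(x_1, …, x_T) = Σ_k α_T(k)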
Forward-Backward Algorithm
51
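A minimal illustrative implementation of the forward-backward recursions (not the course's reference code; init, trans, emit are the count-and-normalize arrays sketched earlier, and x is a list of word ids):

import numpy as np

def forward_backward(x, init, trans, emit):
    # alpha[t, k] = p(x_1..t, y_t = k); beta[t, k] = p(x_{t+1}..T | y_t = k)
    T, K = len(x), len(init)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = init * emit[:, x[0]]
    for t in range(1, T):
        alpha[t] = emit[:, x[t]] * (alpha[t - 1] @ trans)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, x[t + 1]] * beta[t + 1])
    Z = alpha[T - 1].sum()            # p(x): the Evaluation problem
    marginals = alpha * beta / Z      # p(y_t = k | x): the Marginals problem
    return alpha, beta, marginals, Z

In practice one would rescale each time step or work in log space to avoid underflow on long sequences.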
Viterbi Algorithm
52
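And a matching sketch of Viterbi decoding over the same (assumed) parameter arrays; sums become maxes and backpointers recover the best path:

import numpy as np

def viterbi(x, init, trans, emit):
    # delta[t, k] = max over paths ending in state k at time t of p(path, x_1..t)
    T, K = len(x), len(init)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = init * emit[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans     # scores[j, k] for transition j -> k
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, x[t]]
    y = [int(delta[T - 1].argmax())]               # best final state
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))              # follow backpointers
    return list(reversed(y))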
Inference in HMMs
What is the computational complexity of inference for HMMs?
• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T)
• The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T·K^2)
  – Thanks to dynamic programming!
53
Shortcomings of Hidden Markov Models
• HMM models capture dependences between each state and only its corresponding observation
– NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental stages), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective function
– HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
© Eric Xing @ CMU, 2005-2015 54
Y1 Y2 … … … Yn
X1 X2 … … … Xn
START
MBR DECODING
55
Inference for HMMs
– Four Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
4. MBR Decoding: Find the lowest loss sequence of hidden states, given a sequence of observations (Viterbi decoding is a special case)
56
Minimum Bayes Risk Decoding
• Suppose we are given a loss function ℓ(y', y) and are asked for a single tagging
• How should we choose just one from our probability distribution p(y|x)?
• A minimum Bayes risk (MBR) decoder h(x) returns the variable assignment with minimum expected loss under the model's distribution
57
h_θ(x) = argmin_ŷ  E_{y ~ p_θ(·|x)} [ ℓ(ŷ, y) ]
       = argmin_ŷ  Σ_y p_θ(y | x) ℓ(ŷ, y)
The 0-1 loss function returns 0 only if the two assignments are identical and 1 otherwise:
The MBR decoder is:
which is exactly the Viterbi decoding problem!
Minimum Bayes Risk Decoding
Consider some example loss functions:
58
ℓ(ŷ, y) = 1 − I(ŷ, y)

h_θ(x) = argmin_ŷ  Σ_y p_θ(y | x) (1 − I(ŷ, y))
       = argmax_ŷ  p_θ(ŷ | x)
h_θ(x) = argmin_ŷ  E_{y ~ p_θ(·|x)} [ ℓ(ŷ, y) ]
       = argmin_ŷ  Σ_y p_θ(y | x) ℓ(ŷ, y)
The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments:
The MBR decoder is:
This decomposes across variables and requires the variable marginals.
Minimum Bayes Risk Decoding
Consider some example loss functions:
59
ℓ(ŷ, y) = Σ_{i=1}^{V} (1 − I(ŷ_i, y_i))

ŷ_i = h_θ(x)_i = argmax_{ŷ_i}  p_θ(ŷ_i | x)

h_θ(x) = argmin_ŷ  E_{y ~ p_θ(·|x)} [ ℓ(ŷ, y) ]
       = argmin_ŷ  Σ_y p_θ(y | x) ℓ(ŷ, y)
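As the slide notes, the Hamming-loss MBR decoder only needs the per-position marginals, so it can reuse the forward-backward sketch from earlier (again illustrative, not the course's reference code):

def mbr_decode_hamming(x, init, trans, emit):
    # Minimum Bayes risk decoding under Hamming loss:
    # independently pick argmax_k p(y_t = k | x) at each position t.
    _, _, marginals, _ = forward_backward(x, init, trans, emit)
    return [int(k) for k in marginals.argmax(axis=1)]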
Learning Objectives
Hidden Markov Models

You should be able to…
1. Show that structured prediction problems yield high-computation inference problems
2. Define the first order Markov assumption
3. Draw a Finite State Machine depicting a first order Markov assumption
4. Derive the MLE parameters of an HMM
5. Define the three key problems for an HMM: evaluation, decoding, and marginal computation
6. Derive a dynamic programming algorithm for computing the marginal probabilities of an HMM
7. Interpret the forward-backward algorithm as a message passing algorithm
8. Implement supervised learning for an HMM
9. Implement the forward-backward algorithm for an HMM
10. Implement the Viterbi algorithm for an HMM
11. Implement a minimum Bayes risk decoder with Hamming loss for an HMM
60