TRANSCRIPT
Hidden Markov Models
1
10-601 Introduction to Machine Learning
Matt Gormley
Lecture 20
Nov. 7, 2018
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Reminders
• Homework 6: PAC Learning / Generative Models
  – Out: Wed, Oct 31
  – Due: Wed, Nov 7 at 11:59pm (1 week)
• Homework 7: HMMs
  – Out: Wed, Nov 7
  – Due: Mon, Nov 19 at 11:59pm
2
HMM Outline
• Motivation
  – Time Series Data
• Hidden Markov Model (HMM)
  – Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld]
  – Background: Markov Models
  – From Mixture Model to HMM
  – History of HMMs
  – Higher-order HMMs
• Training HMMs
  – (Supervised) Likelihood for HMM
  – Maximum Likelihood Estimation (MLE) for HMM
  – EM for HMM (aka. Baum-Welch algorithm)
• Forward-Backward Algorithm
  – Three Inference Problems for HMM
  – Great Ideas in ML: Message Passing
  – Example: Forward-Backward on 3-word Sentence
  – Derivation of Forward Algorithm
  – Forward-Backward Algorithm
  – Viterbi algorithm
3
This Lecture
Last Lecture
SUPERVISED LEARNING FOR HMMS
4
HMM Parameters:
Hidden Markov Model
6
X1 X2 X3 X4 X5
Y1 Y2 Y3 Y4 Y5

Transition probabilities (row = current state, column = next state):
      O    S    C
O    .9   .08  .02
S    .2   .7   .1
C    .9   0    .1

Emission probabilities (columns = travel times 1min, 2min, 3min, …):
      1min  2min  3min  …
O    .1    .2    .3
S    .01   .02   .03
C    0     0     0

Initial probabilities:
O  .8
S  .1
C  .1
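(Reading the tables above in probability notation, with Y_t the hidden state and X_t the observed travel time: the single-column table gives the initial probabilities p(Y_1 = k), the square state-by-state table gives the transition probabilities p(Y_t = k | Y_{t-1} = j), and the state-by-travel-time table gives the emission probabilities p(X_t = x | Y_t = k). For example, p(Y_1 = O) = .8, p(Y_t = S | Y_{t-1} = O) = .08, and p(X_t = 2min | Y_t = O) = .2.)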
Training HMMs
Whiteboard
– (Supervised) Likelihood for an HMM
– Maximum Likelihood Estimation (MLE) for HMM
7
Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models
8
Yt Yt+1
Xt
Yt
HMM Parameters:
Assumption:
Generative Story:
Hidden Markov Model
9
X1 X2 X3 X4 X5
Y0 Y1 Y2 Y3 Y4 Y5
y0 = START
For notational convenience, we fold the initial probabilities C into the transition matrix B by our assumption.
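As a sketch of the generative story referred to above (using B for the transition matrix and A for the emission matrix, matching the A(·,·)/B(·,·) factor names used later in the trellis slides, and with the y0 = START fold-in so that the START row of B plays the role of the initial probabilities):

y_0 = START
for t = 1, …, T:
    y_t ~ Multinomial(B_{y_{t-1}, ·})     (choose the next hidden state)
    x_t ~ Multinomial(A_{y_t, ·})         (emit an observation given that state)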
Joint Distribution:
Hidden Markov Model
10
X1 X2 X3 X4 X5
Y0 Y1 Y2 Y3 Y4 Y5
y0 = START
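The joint distribution referred to above is the standard first-order HMM factorization (written with the y_0 = START convention from the previous slide):

p(x_1, …, x_T, y_1, …, y_T) = Π_{t=1}^{T} p(y_t | y_{t-1}) · p(x_t | y_t)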
Supervised Learning for HMMs
Learning an HMM decomposes into solving two (independent) Mixture Models
11
Yt Yt+1
Xt
Yt
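As a concrete sketch of what "solving two independent Mixture Models" means in code: supervised MLE just counts transitions and emissions and normalizes each row. The snippet below is illustrative only (the function name, argument layout, and absence of smoothing are my assumptions, not the course's reference implementation):

import numpy as np

def hmm_mle(tag_seqs, word_seqs, K, V):
    # Supervised MLE for an HMM: count, then normalize.
    # tag_seqs[i][t] in {0..K-1}; word_seqs[i][t] in {0..V-1}.
    init = np.zeros(K)          # p(y_1 = k)
    trans = np.zeros((K, K))    # p(y_t = k | y_{t-1} = j)
    emit = np.zeros((K, V))     # p(x_t = w | y_t = k)
    for tags, words in zip(tag_seqs, word_seqs):
        init[tags[0]] += 1
        for t in range(len(tags)):
            emit[tags[t], words[t]] += 1
            if t > 0:
                trans[tags[t - 1], tags[t]] += 1
    # Normalize each distribution (no smoothing; a real implementation would add pseudocounts).
    init /= init.sum()
    trans /= trans.sum(axis=1, keepdims=True)
    emit /= emit.sum(axis=1, keepdims=True)
    return init, trans, emit

The transition estimate uses only (y_{t-1}, y_t) pairs and the emission estimate only (y_t, x_t) pairs, which is the sense in which the two problems decouple.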
HMMs: History
• Markov chains: Andrey Markov (1906)
– Random walks and Brownian motion
• Used in Shannon’s work on information theory (1948)
• Baum-Welch learning algorithm: late 60's, early 70's.
– Used mainly for speech in 60s-70s.
• Late 80’s and 90’s: David Haussler (major player in learning theory in 80’s) began to use HMMs for modeling biological sequences
• Mid-late 1990’s: Dayne Freitag/Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …
13
Slide from William Cohen
Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM)
• 2nd-order HMM (i.e. trigram HMM)
• 3rd-order HMM
14
Y1 Y2 Y3 Y4 Y5
X1 X2 X3 X4 X5
<START>
Y1 Y2 Y3 Y4 Y5
X1 X2 X3 X4 X5
<START>
Y1 Y2 Y3 Y4 Y5
X1 X2 X3 X4 X5
<START>
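In equations (a standard way to write these; the emission term is the same across orders, only the tag dependence reaches further back):

1st-order (bigram):   p(x, y) = Π_t p(y_t | y_{t-1}) p(x_t | y_t)
2nd-order (trigram):  p(x, y) = Π_t p(y_t | y_{t-2}, y_{t-1}) p(x_t | y_t)
3rd-order:            p(x, y) = Π_t p(y_t | y_{t-3}, y_{t-2}, y_{t-1}) p(x_t | y_t)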
BACKGROUND: MESSAGE PASSING
15
Great Ideas in ML: Message Passing
3 behind you
2 behind you
1 behind you
4 behind you
5 behind you
1 before you
2 before you
there's 1 of me
3 before you
4 before you
5 before you
Count the soldiers
16
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
3 behind you
2 before you
there's 1 of me
Belief: Must be 2 + 1 + 3 = 6 of us
only see my incoming messages
2  1  3
Count the soldiers
17
adapted from MacKay (2003) textbook
2 before you
Great Ideas in ML: Message Passing
4 behind you
1 before you
there's 1 of me
only see my incoming messages
Count the soldiers
18
adapted from MacKay (2003) textbook
Belief: Must be 2 + 1 + 3 = 6 of us
2  1  3
Belief: Must be 1 + 1 + 4 = 6 of us
1  1  4
Great Ideas in ML: Message Passing
7 here
3 here
11 here (= 7+3+1)
1 of me
Each soldier receives reports from all branches of tree
19
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
3 here
3 here
7 here (= 3+3+1)
Each soldier receives reports from all branches of tree
20
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
7 here
3 here
11 here (= 7+3+1)
Each soldier receives reports from all branches of tree
21
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
7 here
3 here
3 here
Belief: Must be 14 of us
Each soldier receives reports from all branches of tree
22
adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree
7 here
3 here
3 here
Belief: Must be 14 of us
23
adapted from MacKay (2003) textbook
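The soldier-counting trick is literally one forward pass and one backward pass of additive messages along a chain; each soldier's belief combines the two incoming messages with "there's 1 of me". A tiny illustrative sketch (names and framing are mine, not MacKay's):

def count_soldiers(n):
    forward = [0] * n                      # "... behind you" message arriving from the left
    backward = [0] * n                     # "... before you" message arriving from the right
    for i in range(1, n):
        forward[i] = forward[i - 1] + 1
    for i in range(n - 2, -1, -1):
        backward[i] = backward[i + 1] + 1
    # belief at each position = incoming messages + 1 (myself)
    return [forward[i] + backward[i] + 1 for i in range(n)]

print(count_soldiers(6))   # every soldier concludes: must be 6 of us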
THE FORWARD-BACKWARD ALGORITHM
24
Inference for HMMs
Whiteboard
– Three Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
25
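Stated in symbols (a standard formulation, consistent with the notation used later in the lecture):

1. Evaluation:        compute p(x_1, …, x_T) = Σ_y p(x, y)
2. Viterbi decoding:  compute ŷ = argmax_y p(y | x_1, …, x_T)
3. Marginals:         compute p(y_t = k | x_1, …, x_T) for each t and k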
Dataset for Supervised Part-of-Speech (POS) Tagging
26
Data: D = {x(n), y(n)}, n = 1 … N

Sample 1 (x(1), y(1)):  time/n  flies/v  like/p  an/d  arrow/n
Sample 2 (x(2), y(2)):  time/n  flies/n  like/v  an/d  arrow/n
Sample 3 (x(3), y(3)):  flies/n  fly/v  with/p  their/n  wings/n
Sample 4 (x(4), y(4)):  with/p  time/n  you/n  will/v  see/v
time flies like an arrow
n v p d n
<START>
Hidden Markov Model
28
A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.
Transition probabilities (row = previous tag, column = next tag):
      v    n    p    d
v    .1   .4   .2   .3
n    .8   .1   .1   0
p    .2   .3   .2   .3
d    .2   .8   0    0

Emission probabilities (columns = words time, flies, like, …):
      time  flies  like  …
v    .2     .5     .2
n    .3     .4     .2
p    .1     .1     .3
d    .1     .2     .1
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
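That product is just the joint factorization p(y, x) = Π_t p(y_t | y_{t-1}) p(x_t | y_t) with each factor read off one of the two tables above; for instance .3 = p(time | n) and .8 = p(v | n). (The exact ordering of the remaining factors in "(.3 * .8 * .2 * .5 * …)" is not recoverable from the transcript, so treat this mapping as illustrative.)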
X1 X2 X3
Y1 Y2 Y3
29
find preferred tags
Could be adjective or verb · Could be noun or verb · Could be verb or noun
Forward-Backward Algorithm
Forward-Backward Algorithm
30
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm
31
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm
32
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
Y1 Y2 Y3
X1 X2 X3: find preferred tags
33
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
Forward-Backward Algorithm
Y1 Y2 Y3
X1 X2 X3: find preferred tags
34
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …
Forward-Backward Algorithm
Y1 Y2 Y3
X1 X2 X3: find preferred tags
35
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …
Forward-Backward Algorithm

Emission factors (rows = tag, columns = words find, pref., tags, …):
      find  pref.  tags
v    3      5      3
n    4      5      2
a    0.1    0.2    0.1

Transition factors (row = previous tag, column = next tag):
      v     n     a
v    1      6     4
n    8      4     0.1
a    0.1    8     0

Y1 Y2 Y3
X1 X2 X3: find preferred tags
Viterbi Algorithm: Most Probable Assignment
36
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product
B(a, END)
A(tags, n)
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Viterbi Algorithm: Most Probable Assignment
37
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
B(a, END)
A(tags, n)
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm: Finds Marginals
38
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) * total weight of all paths through a
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm: Finds Marginals
39
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm: Finds Marginals
40
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = v) = (1/Z) * total weight of all paths through v
Y1 Y2 Y3
X1 X2 X3: find preferred tags
Forward-Backward Algorithm: Finds Marginals
41
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
• So p(v a n) = (1/Z) * product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) * total weight of all paths through n
Y1 Y2 Y3
X1 X2 X3: find preferred tags
α2(n) = total weight of these path prefixes
(found by dynamic programming: matrix-vector products)
Forward-Backward Algorithm: Finds Marginals
42
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Y1 Y2 Y3
X1 X2 X3: find preferred tags
β2(n) = total weight of these path suffixes
(found by dynamic programming: matrix-vector products)
Forward-Backward Algorithm: Finds Marginals
43
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Y1 Y2 Y3
X1 X2 X3: find preferred tags
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
α2(n) = total weight of these path prefixes
β2(n) = total weight of these path suffixes
Forward-Backward Algorithm: Finds Marginals
44
α2(n) · β2(n) = (a + b + c)(x + y + z)
Product gives ax+ay+az+bx+by+bz+cx+cy+cz = total weight of paths
Y1 Y2 Y3
X1 X2 X3: find preferred tags
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Forward-Backward Algorithm: Finds Marginals
45
total weight of all paths through n = α2(n) × A(pref., n) × β2(n)
"belief that Y2 = n"
Oops! The weight of a path through a state also includes a weight at that state. So α2(n) · β2(n) isn't enough. The extra weight is the opinion of the emission probability at this variable.
Y1 Y2 Y3
X1 X2 X3: find preferred tags
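To connect this with the whiteboard derivation: on these slides α2(·) and β2(·) are path-prefix and path-suffix weights that exclude the emission weight at position 2, which is why the extra factor A(pref., ·) is needed. Under the more common textbook definition where α_t(k) = p(x_1, …, x_t, y_t = k) already includes the emission at t, the same marginal is simply

p(Y_t = k | x) = α_t(k) · β_t(k) / p(x),   with p(x) = Σ_k α_T(k).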
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Forward-Backward Algorithm: Finds Marginals
46
total weight of all paths through v = α2(v) × A(pref., v) × β2(v)
"belief that Y2 = n"
"belief that Y2 = v"
Y1 Y2 Y3
X1 X2 X3: find preferred tags
[Trellis diagram: Y1, Y2, Y3 each range over tags {v, n, a}; START and END nodes at the ends]
Forward-Backward Algorithm: Finds Marginals
47
total weight of all paths through a = α2(a) × A(pref., a) × β2(a)
"belief that Y2 = n", "belief that Y2 = v", "belief that Y2 = a"
Unnormalized beliefs at Y2:  v 0.1, n 0, a 0.4
sum = Z (total weight of all paths) = 0.5
Divide by Z = 0.5 to get marginal probs:  v 0.2, n 0, a 0.8
X1 X2 X3
Y1 Y2 Y3
48
find preferred tags
Could be adjective or verb · Could be noun or verb · Could be verb or noun
Forward-Backward Algorithm
Inference for HMMs
Whiteboard
– Derivation of Forward algorithm
– Forward-backward algorithm
– Viterbi algorithm
49
Derivation of Forward Algorithm
50
Derivation:
Definition:
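A sketch of what the whiteboard derivation fills in here (the standard forward recursion, written with the y_0 = START convention used earlier):

Definition:  α_t(k) = p(x_1, …, x_t, y_t = k)
Base case:   α_1(k) = p(y_1 = k | y_0 = START) · p(x_1 | y_1 = k)
Recursion:   α_t(k) = p(x_t | y_t = k) · Σ_j α_{t-1}(j) · p(y_t = k | y_{t-1} = j)
Evaluation:  p(x_1, …, x_T) = Σ_k α_T(k)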
Forward-Backward Algorithm
51
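A minimal illustrative implementation of the forward-backward recursions (not the course's reference code; init, trans, emit are the count-and-normalize arrays sketched earlier, and x is a list of word ids):

import numpy as np

def forward_backward(x, init, trans, emit):
    # alpha[t, k] = p(x_1..t, y_t = k); beta[t, k] = p(x_{t+1}..T | y_t = k)
    T, K = len(x), len(init)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = init * emit[:, x[0]]
    for t in range(1, T):
        alpha[t] = emit[:, x[t]] * (alpha[t - 1] @ trans)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, x[t + 1]] * beta[t + 1])
    Z = alpha[T - 1].sum()            # p(x): the Evaluation problem
    marginals = alpha * beta / Z      # p(y_t = k | x): the Marginals problem
    return alpha, beta, marginals, Z

In practice one would rescale each time step or work in log space to avoid underflow on long sequences.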
Viterbi Algorithm
52
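And a matching sketch of Viterbi decoding over the same (assumed) parameter arrays; sums become maxes and backpointers recover the best path:

import numpy as np

def viterbi(x, init, trans, emit):
    # delta[t, k] = max over paths ending in state k at time t of p(path, x_1..t)
    T, K = len(x), len(init)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = init * emit[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans     # scores[j, k] for transition j -> k
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, x[t]]
    y = [int(delta[T - 1].argmax())]               # best final state
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))              # follow backpointers
    return list(reversed(y))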
Inference in HMMs
What is the computational complexity of inference for HMMs?
• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T)
• The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T·K^2)
  – Thanks to dynamic programming!
53
Shortcomings of Hidden Markov Models
• HMM models capture dependences between each state and only its corresponding observation
– NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental stages), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective function
– HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
© Eric Xing @ CMU, 2005-2015 54
Y1 Y2 … … … Yn
X1 X2 … … … Xn
START
MBR DECODING
55
Inference for HMMs
– Four Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
4. MBR Decoding: Find the lowest loss sequence of hidden states, given a sequence of observations (Viterbi decoding is a special case)
56
Minimum Bayes Risk Decoding
• Suppose we are given a loss function ℓ(y', y) and are asked for a single tagging
• How should we choose just one from our probability distribution p(y|x)?
• A minimum Bayes risk (MBR) decoder h(x) returns the variable assignment with minimum expected loss under the model's distribution
57
h_θ(x) = argmin_ŷ  E_{y ~ p_θ(·|x)} [ ℓ(ŷ, y) ]
       = argmin_ŷ  Σ_y p_θ(y | x) ℓ(ŷ, y)
The 0-1 loss function returns 0 only if the two assignments are identical and 1 otherwise:
The MBR decoder is:
which is exactly the Viterbi decoding problem!
Minimum Bayes Risk Decoding
Consider some example loss functions:
58
ℓ(ŷ, y) = 1 − I(ŷ, y)

h_θ(x) = argmin_ŷ  Σ_y p_θ(y | x) (1 − I(ŷ, y))
       = argmax_ŷ  p_θ(ŷ | x)
h_θ(x) = argmin_ŷ  E_{y ~ p_θ(·|x)} [ ℓ(ŷ, y) ]
       = argmin_ŷ  Σ_y p_θ(y | x) ℓ(ŷ, y)
The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments:
The MBR decoder is:
This decomposes across variables and requires the variable marginals.
Minimum Bayes Risk Decoding
Consider some example loss functions:
59
ℓ(ŷ, y) = Σ_{i=1}^{V} (1 − I(ŷ_i, y_i))

ŷ_i = h_θ(x)_i = argmax_{ŷ_i}  p_θ(ŷ_i | x)

h_θ(x) = argmin_ŷ  E_{y ~ p_θ(·|x)} [ ℓ(ŷ, y) ]
       = argmin_ŷ  Σ_y p_θ(y | x) ℓ(ŷ, y)
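As the slide notes, the Hamming-loss MBR decoder only needs the per-position marginals, so it can reuse the forward-backward sketch from earlier (again illustrative, not the course's reference code):

def mbr_decode_hamming(x, init, trans, emit):
    # Minimum Bayes risk decoding under Hamming loss:
    # independently pick argmax_k p(y_t = k | x) at each position t.
    _, _, marginals, _ = forward_backward(x, init, trans, emit)
    return [int(k) for k in marginals.argmax(axis=1)]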
Learning Objectives
Hidden Markov Models

You should be able to…
1. Show that structured prediction problems yield high-computation inference problems
2. Define the first order Markov assumption
3. Draw a Finite State Machine depicting a first order Markov assumption
4. Derive the MLE parameters of an HMM
5. Define the three key problems for an HMM: evaluation, decoding, and marginal computation
6. Derive a dynamic programming algorithm for computing the marginal probabilities of an HMM
7. Interpret the forward-backward algorithm as a message passing algorithm
8. Implement supervised learning for an HMM
9. Implement the forward-backward algorithm for an HMM
10. Implement the Viterbi algorithm for an HMM
11. Implement a minimum Bayes risk decoder with Hamming loss for an HMM
60