
Page 1: Lecture 9: Hidden Markov Models (HMMs)

(Chapter 9 of Manning and Schutze)

Dr. Mary P. Harper

ECE, Purdue University

[email protected]

yara.ecn.purdue.edu/~harper

Page 2: Markov Assumptions

• Often we want to consider a sequence of random variables that are not independent; rather, the value of each variable depends on previous elements in the sequence.
• Let X = (X_1, ..., X_T) be a sequence of random variables taking values in some finite set S = {s_1, ..., s_N}, the state space. If X possesses the following properties, then X is a Markov chain.
• Limited Horizon: P(X_{t+1} = s_k | X_1, ..., X_t) = P(X_{t+1} = s_k | X_t), i.e., a word's state only depends on the previous state.
• Time Invariant (Stationary): P(X_{t+1} = s_k | X_t) = P(X_2 = s_k | X_1), i.e., the dependency does not change over time.

Page 3: Definition of a Markov Chain

• A is a stochastic N x N matrix of transition probabilities with:
  – a_ij = Pr{transition from s_i to s_j} = Pr(X_{t+1} = s_j | X_t = s_i)
  – for all i, j: a_ij >= 0, and for all i: Σ_{j=1..N} a_ij = 1
• Π is a vector of N elements representing the initial state probability distribution with:
  – π_i = Pr{the initial state is s_i} = Pr(X_1 = s_i), and Σ_{i=1..N} π_i = 1
  – Can avoid Π by creating a special start state s_0.

Page 4: Markov Chain

Page 5: Markov Process & Language Models

• Chain rule:
  P(W) = P(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_1, w_2, ..., w_{i-1})
• n-gram language models:
  – Markov process (chain) of order n-1:
  P(W) = P(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})
• Using just one distribution (e.g., trigram model: p(w_i | w_{i-2}, w_{i-1})):

  Positions: 1  2   3     4    5 6   7      8     9   10 11  12    13   14 15  16
  Words:     My car broke down , and within hours Bob 's car broke down ,  too .

  p(, | broke down) = p(w_5 | w_3, w_4) = p(w_14 | w_12, w_13)
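As an aside, a tiny Python sketch of the trigram factorization above; the probability table is a made-up placeholder, purely to show the indexing p(w_i | w_{i-2}, w_{i-1}), and is not part of the original slides.

from collections import defaultdict

# Hypothetical trigram probabilities p(w_i | w_{i-2}, w_{i-1}); the values are placeholders.
trigram_p = defaultdict(lambda: 1e-6)      # tiny floor for unseen trigrams (illustrative only)
trigram_p[("car", "broke", "down")] = 0.6
trigram_p[("broke", "down", ",")] = 0.3

def sentence_prob(words, p=trigram_p):
    """P(W) = prod_i p(w_i | w_{i-2}, w_{i-1}), padding the first two positions with <s>."""
    padded = ["<s>", "<s>"] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= p[(padded[i - 2], padded[i - 1], padded[i])]
    return prob

print(sentence_prob("my car broke down".split()))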

Page 6: Example with a Markov Chain

[Figure: state-transition diagram over the states t, e, h, a, p, i with a Start state; the arcs carry transition probabilities such as 0.3, 0.4, 0.6, and 1.0.]

Page 7: Probability of a Sequence of States

• P(t, a, p, p) = 1.0 × 0.3 × 0.4 × 1.0 = 0.12
• P(t, a, p, e) = ?

For a Markov chain, the probability of a state sequence X_1 ... X_T is:

  P(X_1, ..., X_T) = P(X_1) P(X_2 | X_1) P(X_3 | X_1 X_2) ... P(X_T | X_1 ... X_{T-1})
                   = P(X_1) P(X_2 | X_1) P(X_3 | X_2) ... P(X_T | X_{T-1})
                   = π_{X_1} ∏_{t=1}^{T-1} a_{X_t X_{t+1}}
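A minimal Python sketch of this product. The states and numbers below are illustrative placeholders, not the exact chain drawn on Page 6; they are chosen only so that the product reproduces the 1.0 × 0.3 × 0.4 × 1.0 arithmetic shown above.

import numpy as np

states = ["t", "a", "p", "e"]
idx = {s: k for k, s in enumerate(states)}

pi = np.array([1.0, 0.0, 0.0, 0.0])      # always start in "t" (placeholder)
A = np.array([                            # a_ij = P(next = j | current = i) (placeholder values)
    [0.0, 0.3, 0.0, 0.7],                 # from t
    [0.0, 0.0, 0.4, 0.6],                 # from a
    [0.0, 0.0, 1.0, 0.0],                 # from p
    [1.0, 0.0, 0.0, 0.0],                 # from e
])

def sequence_prob(seq):
    """P(X_1 ... X_T) = pi_{X_1} * prod_t a_{X_t X_{t+1}}."""
    p = pi[idx[seq[0]]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[idx[prev], idx[cur]]
    return p

print(sequence_prob(["t", "a", "p", "p"]))  # 1.0 * 0.3 * 0.4 * 1.0 = 0.12 with these numbers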

Page 8: Hidden Markov Models (HMMs)

• Sometimes it is not possible to know precisely which states the model passes through; all we can do is observe some phenomenon that occurs in each state with some probability distribution.
• An HMM is an appropriate model for cases when you don't know the state sequence that the model passes through, but only some probabilistic function of it. For example:
  – Word recognition (discrete utterance or within continuous speech)
  – Phoneme recognition
  – Part-of-Speech Tagging
  – Linear Interpolation

Page 9: Example HMM

• Process:
  – According to Π, pick a state q_i. Select a ball from urn i (state is hidden).
  – Observe the color (observable).
  – Replace the ball.
  – According to A, transition to the next urn from which to pick a ball (state is hidden).
• Elements:
  – N states corresponding to urns: q_1, q_2, ..., q_N
  – M colors for balls found in urns: z_1, z_2, ..., z_M
  – A: N x N matrix; transition probability distribution (between urns)
  – B: N x M matrix; for each urn, a probability distribution over colors: b_ij = Pr(COLOR = z_j | URN = q_i)
  – Π: N vector; initial state probability distribution (where do we start?)

Page 10: What is an HMM?

• Green circles are hidden states

• Dependent only on the previous state

Page 11: What is an HMM?

• Purple nodes are observations

• Dependent only on their corresponding hidden state (for a state emit model)

Page 12: Elements of an HMM

• HMM (the most general case): a five-tuple (S, K, Π, A, B), where:
  – S = {s_1, s_2, ..., s_N} is the set of states,
  – K = {k_1, k_2, ..., k_M} is the output alphabet,
  – Π = {π_i}, i ∈ S, is the initial state probability set,
  – A = {a_ij}, i, j ∈ S, is the transition probability set,
  – B = {b_ijk}, i, j ∈ S, k ∈ K (arc emission), or B = {b_ik}, i ∈ S, k ∈ K (state emission), is the emission probability set.
• State Sequence: X = (x_1, x_2, ..., x_{T+1}), x_t : S → {1, 2, ..., N}
• Observation (output) Sequence: O = (o_1, ..., o_T), o_t ∈ K

Page 13: HMM Formalism

• {S, K, Π, A, B}
• S : {s_1 ... s_N} are the values for the hidden states
• K : {k_1 ... k_M} are the values for the observations

[Figure: trellis of hidden states S emitting observations K at each time step.]

Page 14: HMM Formalism

• {S, K, Π, A, B}
• Π = {π_i} are the initial state probabilities
• A = {a_ij} are the state transition probabilities
• B = {b_ik} are the observation probabilities

[Figure: the same trellis, with A labeling state-to-state transitions and B labeling state-to-observation emissions.]

Page 15: Example of an Arc Emit HMM

[Figure: a four-state arc-emission HMM (states 1-4) with an "enter here" start arrow; the arcs carry transition probabilities (0.4, 0.6, 0.12, 0.88, 1) and each arc has its own emission distribution over the symbols t, o, e, e.g., p(t)=.5, p(o)=.2, p(e)=.3 on one arc and p(t)=0, p(o)=0, p(e)=1 on another.]

Page 16: HMMs

1. N states in the model: a state has some measurable, distinctive properties.
2. At clock time t, you make a state transition (to a different state or back to the same state), based on a transition probability distribution that depends on the current state (i.e., the one you're in before making the transition).
3. After each transition, you output an observation symbol according to a probability distribution which depends on the current state or the current arc.

Goal: From the observations, determine which model generated the output, e.g., in word recognition with a different HMM for each word, the goal is to choose the one that best fits the input word (based on state transitions and output observations).

Page 17: Example HMM

• N states corresponding to urns: q_1, q_2, ..., q_N
• M colors for balls found in urns: z_1, z_2, ..., z_M
• A: N x N matrix; transition probability distribution (between urns)
• B: N x M matrix; for each urn, a probability distribution over colors:
  – b_ij = Pr(COLOR = z_j | URN = q_i)
• Π: N vector; initial state probability distribution (where do we start?)

Page 18: Example HMM

• Process:
  – According to Π, pick a state q_i. Select a ball from urn i (state is hidden).
  – Observe the color (observable).
  – Replace the ball.
  – According to A, transition to the next urn from which to pick a ball (state is hidden).
• Design: Choose N and M. Specify a model λ = (A, B, Π) from the training data. Adjust the model parameters (A, B, Π) to maximize P(O|λ).
• Use: Given O and λ = (A, B, Π), what is P(O|λ)? If we compute P(O|λ) for all models λ, then we can determine which model most likely generated O.

Page 19: Simulating a Markov Process

t := 1
Start in state s_i with probability π_i (i.e., X_1 = i)
Forever do:
    Move from state s_i to state s_j with probability a_ij (i.e., X_{t+1} = j)
    Emit observation symbol o_t = k with probability b_ik (b_ijk for arc emission)
    t := t + 1
End
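A minimal Python sketch of this simulation loop, bounded to T steps rather than running forever and assuming a state-emission HMM; the function and parameter names here are illustrative, not from the slides.

import numpy as np

def simulate_hmm(pi, A, B, T, rng=None):
    """Sample a length-T observation sequence from a state-emission HMM.

    pi: (N,) initial state probabilities
    A:  (N, N) transition probabilities a_ij
    B:  (N, M) emission probabilities b_ik
    """
    rng = rng or np.random.default_rng()
    states, obs = [], []
    i = rng.choice(len(pi), p=pi)                    # X_1 ~ pi
    for _ in range(T):
        states.append(i)
        obs.append(rng.choice(B.shape[1], p=B[i]))   # emit o_t ~ b_i. from the current state
        i = rng.choice(len(pi), p=A[i])              # move: X_{t+1} ~ a_i.
    return states, obs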

Page 20: Why Use Hidden Markov Models?

• HMMs are useful when one can think of underlying events probabilistically generating surface events. Example: Part-of-Speech Tagging.
• HMMs can be efficiently trained using the EM Algorithm.
• Another example where HMMs are useful is in generating the parameters for linear interpolation of n-gram models.
• If we assume that some set of data was generated by an HMM, then the HMM is useful for calculating the probabilities of possible underlying state sequences.

Page 21: Fundamental Questions for HMMs

• Given a model λ = (A, B, Π), how do we efficiently compute how likely a certain observation sequence is, that is, P(O|λ)?
• Given the observation sequence O and a model λ, how do we choose a state sequence (X_1, ..., X_{T+1}) that best explains the observations?
• Given an observation sequence O, and a space of possible models found by varying the model parameters λ = (A, B, Π), how do we find the model that best explains the observed data?

Page 22: Probability of an Observation

• Given the observation sequence O = (o_1, ..., o_T) and a model λ = (A, B, Π), we wish to know how to efficiently compute P(O|λ).
• Summing over all state sequences X = (X_1, ..., X_{T+1}), we find: P(O|λ) = Σ_X P(X|λ) P(O|X, λ)
• This is simply the probability of the observation sequence given the model.
• Direct evaluation of this expression is extremely inefficient; however, there are dynamic programming methods that compute it quite efficiently.

Page 23: Probability of an Observation

Given an observation sequence O = (o_1 ... o_T) and a model λ = (A, B, Π), compute the probability of the observation sequence, P(O|λ).

Pages 24-27: Probability of an Observation

For a fixed state sequence X = (x_1, ..., x_T) in the trellis (states x_1 ... x_T emitting observations o_1 ... o_T):

  P(O | X, λ) = b_{x_1 o_1} b_{x_2 o_2} ... b_{x_T o_T}

  P(X | λ) = π_{x_1} a_{x_1 x_2} a_{x_2 x_3} ... a_{x_{T-1} x_T}

  P(O, X | λ) = P(O | X, λ) P(X | λ)

Summing over all state sequences:

  P(O | λ) = Σ_X P(X | λ) P(O | X, λ)

Page 28: Probability of an Observation (State Emit)

Expanding the sum over state sequences for a state-emission HMM:

  P(O | λ) = Σ_{x_1 ... x_{T+1}} π_{x_1} ∏_{t=1}^{T} b_{x_t o_t} a_{x_t x_{t+1}}
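To make the cost of direct evaluation concrete, here is a minimal brute-force Python sketch that enumerates all N^T state sequences (state-emission form, ignoring the final transition, whose probabilities sum to 1); it is only workable for tiny N and T and is exactly what the forward procedure below avoids. The function name and array layout are my own, not from the slides.

import itertools
import numpy as np

def brute_force_likelihood(pi, A, B, obs):
    """P(O|lambda) by summing pi_{x1} * prod_t b_{x_t o_t} * a_{x_t x_{t+1}} over all state sequences.

    pi: (N,), A: (N, N), B: (N, M), obs: list of observation indices.
    Exponential in len(obs); for illustration only.
    """
    N, T = len(pi), len(obs)
    total = 0.0
    for seq in itertools.product(range(N), repeat=T):
        p = pi[seq[0]] * B[seq[0], obs[0]]
        for t in range(1, T):
            p *= A[seq[t - 1], seq[t]] * B[seq[t], obs[t]]
        total += p
    return total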

Page 29: Probability of an Observation with an Arc Emit HMM

[Figure: the four-state arc-emission HMM from Page 15, with its per-arc emission distributions over {t, o, e}. Summing over the paths that can generate the string "toe" gives p(toe) ≈ .237.]

Page 30: Making Computation Efficient

• To avoid this complexity, use dynamic programming or memoization techniques, exploiting the trellis structure of the problem.
• Use an array of states versus time to compute the probability of being in each state at time t+1 in terms of the probabilities of being in each state at time t.
• A trellis can record the probability of all initial subpaths of the HMM that end in a certain state at a certain time. The probability of longer subpaths can then be worked out in terms of the shorter subpaths.
• A forward variable, α_i(t) = P(o_1 o_2 ... o_{t-1}, X_t = i | λ), is stored at (s_i, t) in the trellis and expresses the total probability of ending up in state s_i at time t, having seen the observations o_1 ... o_{t-1}.

Page 31: The Trellis Structure

[Figure: the trellis of states (rows) versus time steps (columns).]

Page 32: Forward Procedure

• Special structure gives us an efficient solution using dynamic programming.
• Intuition: the probability of the first t observations is the same for all possible t+1 length state sequences (so don't recompute it!).
• Define the forward variable:

  α_i(t) = P(o_1 ... o_{t-1}, x_t = i | λ)

Pages 33-36: Forward Procedure

Derivation of the forward recursion (repeated over four slides; λ is omitted from the formulas for brevity):

  α_j(t+1) = P(o_1 ... o_t, x_{t+1} = j)
           = P(o_1 ... o_t | x_{t+1} = j) P(x_{t+1} = j)
           = P(o_1 ... o_{t-1} | x_{t+1} = j) P(o_t | x_{t+1} = j) P(x_{t+1} = j)
           = P(o_1 ... o_{t-1}, x_{t+1} = j) P(o_t | x_{t+1} = j)

Note: P(A, B) = P(A|B) · P(B).

Pages 37-40: Forward Procedure

Continuing the derivation by summing over the state at time t (repeated over four slides):

  α_j(t+1) = Σ_{i=1..N} P(o_1 ... o_{t-1}, x_t = i, x_{t+1} = j) P(o_t | x_{t+1} = j)
           = Σ_{i=1..N} P(o_1 ... o_{t-1}, x_{t+1} = j | x_t = i) P(x_t = i) P(o_t | x_{t+1} = j)
           = Σ_{i=1..N} P(o_1 ... o_{t-1}, x_t = i) P(x_{t+1} = j | x_t = i) P(o_t | x_{t+1} = j)
           = Σ_{i=1..N} α_i(t) a_ij b_{ij o_t}

Page 41: The Forward Procedure

• Forward variables are calculated as follows:
  – Initialization: α_i(1) = π_i, 1 ≤ i ≤ N
  – Induction: α_j(t+1) = Σ_{i=1..N} α_i(t) a_ij b_{ij o_t}, 1 ≤ t ≤ T, 1 ≤ j ≤ N
  – Total: P(O|λ) = Σ_{i=1..N} α_i(T+1) = Σ_{j=1..N} Σ_{i=1..N} α_i(T) a_ij b_{ij o_T}
• This algorithm requires 2N²T multiplications (much less than the direct method, which takes on the order of (2T+1)·N^(T+1) multiplications).
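A minimal Python sketch of the forward procedure, assuming a state-emission HMM for simplicity (b_{i o_t}, with the emission tied to the source state at each step, standing in for the arc-emission b_{ij o_t} above); the names and array layout are my own, not the slides'.

import numpy as np

def forward(pi, A, B, obs):
    """Return alpha (T+1 x N) and P(O|lambda) for a state-emission HMM.

    alpha[t, i] = P(o_1 ... o_t, X_{t+1} = i); pi: (N,), A: (N, N), B: (N, M).
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                   # alpha_i(1) = pi_i
    for t in range(T):
        # alpha_j(t+1) = sum_i alpha_i(t) * b_{i o_t} * a_ij
        alpha[t + 1] = (alpha[t] * B[:, obs[t]]) @ A
    return alpha, alpha[T].sum()                    # P(O|lambda) = sum_i alpha_i(T+1)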

Page 42: The Backward Procedure

• We could also compute these probabilities by working backward through time.
• The backward procedure computes backward variables, which give the total probability of seeing the rest of the observation sequence given that we were in state s_i at time t:

  β_i(t) = P(o_t ... o_T | X_t = i, λ)

• Backward variables are useful for the problem of parameter reestimation.

Page 43: Backward Procedure

  β_i(t) = P(o_t ... o_T | x_t = i)

  β_i(T+1) = 1

  β_i(t) = Σ_{j=1..N} a_ij b_{ij o_t} β_j(t+1)

Pages 44-45: The Backward Procedure

• Backward variables can be calculated working backward through the trellis as follows:
  – Initialization: β_i(T+1) = 1, 1 ≤ i ≤ N
  – Induction: β_i(t) = Σ_{j=1..N} a_ij b_{ij o_t} β_j(t+1), 1 ≤ t ≤ T, 1 ≤ i ≤ N
  – Total: P(O|λ) = Σ_{i=1..N} π_i β_i(1)
• Backward variables can also be combined with forward variables:

  P(O|λ) = Σ_{i=1..N} α_i(t) β_i(t), 1 ≤ t ≤ T+1

  since

  P(O, X_t = i | λ) = P(o_1 ... o_T, X_t = i)
                    = P(o_1 ... o_{t-1}, X_t = i, o_t ... o_T)
                    = P(o_1 ... o_{t-1}, X_t = i) P(o_t ... o_T | o_1 ... o_{t-1}, X_t = i)
                    = P(o_1 ... o_{t-1}, X_t = i) P(o_t ... o_T | X_t = i)
                    = α_i(t) β_i(t)
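A matching Python sketch of the backward procedure under the same state-emission simplification as the forward sketch above (again, the names and layout are my own, not the slides').

import numpy as np

def backward(pi, A, B, obs):
    """Return beta (T+1 x N) and P(O|lambda) for a state-emission HMM.

    beta[t, i] = P(o_{t+1} ... o_T | X_{t+1} = i); beta[T] = 1.
    """
    N, T = len(pi), len(obs)
    beta = np.ones((T + 1, N))                      # beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):
        # beta_i(t) = b_{i o_t} * sum_j a_ij * beta_j(t+1)   (emission tied to the source state)
        beta[t] = B[:, obs[t]] * (A @ beta[t + 1])
    return beta, (pi * beta[0]).sum()               # P(O|lambda) = sum_i pi_i beta_i(1)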

Page 46: Probability of an Observation

  Forward Procedure:   P(O|λ) = Σ_{i=1..N} α_i(T+1)

  Backward Procedure:  P(O|λ) = Σ_{i=1..N} π_i β_i(1)

  Combination:         P(O|λ) = Σ_{i=1..N} α_i(t) β_i(t), for any 1 ≤ t ≤ T+1

Page 47: Finding the Best State Sequence

• One method consists of optimizing over the states individually.
• For each t, 1 ≤ t ≤ T+1, we would like to find the X_t that maximizes P(X_t | O, λ).
• Let γ_i(t) = P(X_t = i | O, λ) = P(X_t = i, O | λ) / P(O|λ) = α_i(t) β_i(t) / Σ_{j=1..N} α_j(t) β_j(t)
• The individually most likely state is X̂_t = argmax_{1≤i≤N} γ_i(t), 1 ≤ t ≤ T+1
• This choice maximizes the expected number of states that will be guessed correctly. However, it may yield a quite unlikely state sequence.
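A short Python sketch of this per-state (posterior) decoding, reusing the forward and backward sketches above and the same state-emission simplification; the function name is my own.

import numpy as np

def posterior_decode(pi, A, B, obs):
    """X_hat_t = argmax_i P(X_t = i | O, lambda) for each t."""
    alpha, likelihood = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)
    gamma = alpha * beta / likelihood        # gamma_i(t) = alpha_i(t) * beta_i(t) / P(O|lambda)
    return gamma.argmax(axis=1)              # individually most likely state at each time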

Page 48: Best State Sequence

• Find the state sequence that best explains the observation sequence:

  X̂ = argmax_X P(X | O, λ)

• Viterbi algorithm

Page 49: Finding the Best State Sequence: The Viterbi Algorithm

• The Viterbi algorithm efficiently computes the most likely state sequence.
• To find the most likely complete path, compute: argmax_X P(X | O, λ)
• To do this, it is sufficient to maximize for a fixed O: argmax_X P(X, O | λ)
• We define δ_j(t) = max_{X_1 ... X_{t-1}} P(X_1 ... X_{t-1}, o_1 ... o_{t-1}, X_t = j | λ); ψ_j(t) records the node of the incoming arc that led to this most probable path.

Page 50: Viterbi Algorithm

  δ_j(t) = max_{x_1 ... x_{t-1}} P(x_1 ... x_{t-1}, o_1 ... o_{t-1}, x_t = j | λ)

The probability of the state sequence which maximizes the probability of seeing the observations up to time t-1 and landing in state j at time t.

Page 51: Viterbi Algorithm

  δ_j(t) = max_{x_1 ... x_{t-1}} P(x_1 ... x_{t-1}, o_1 ... o_{t-1}, x_t = j | λ)

Recursive computation:

  δ_j(t+1) = max_{1≤i≤N} δ_i(t) a_ij b_{ij o_t}

  ψ_j(t+1) = argmax_{1≤i≤N} δ_i(t) a_ij b_{ij o_t}

Page 52: Finding the Best State Sequence: The Viterbi Algorithm

The Viterbi Algorithm works as follows:
• Initialization: δ_j(1) = π_j, 1 ≤ j ≤ N
• Induction: δ_j(t+1) = max_{1≤i≤N} δ_i(t) a_ij b_{ij o_t}, 1 ≤ j ≤ N
  Store backtrace: ψ_j(t+1) = argmax_{1≤i≤N} δ_i(t) a_ij b_{ij o_t}, 1 ≤ j ≤ N
• Termination and path readout:
  X̂_{T+1} = argmax_{1≤i≤N} δ_i(T+1)
  X̂_t = ψ_{X̂_{t+1}}(t+1)
  P(X̂) = max_{1≤i≤N} δ_i(T+1)
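A minimal Viterbi sketch in Python, again assuming a state-emission HMM with the emission tied to the source state at each step (so b_{i o_t} stands in for the b_{ij o_t} above); variable names are illustrative, not from the slides.

import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence X_1 ... X_{T+1} and its probability.

    pi: (N,), A: (N, N), B: (N, M), obs: list of observation indices.
    """
    N, T = len(pi), len(obs)
    delta = np.zeros((T + 1, N))
    psi = np.zeros((T + 1, N), dtype=int)
    delta[0] = pi                                               # delta_j(1) = pi_j
    for t in range(T):
        scores = delta[t][:, None] * (B[:, obs[t]][:, None] * A)   # delta_i(t) * b_{i o_t} * a_ij
        psi[t + 1] = scores.argmax(axis=0)                      # backtrace pointers
        delta[t + 1] = scores.max(axis=0)
    path = [int(delta[T].argmax())]                             # X_{T+1}
    for t in range(T, 0, -1):
        path.append(int(psi[t][path[-1]]))                      # X_t = psi_{X_{t+1}}(t+1)
    return path[::-1], delta[T].max()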

Page 53: Viterbi Algorithm

Compute the most likely state sequence by working backwards through the stored backtraces:

  X̂_{T+1} = argmax_{1≤i≤N} δ_i(T+1)

  X̂_t = ψ_{X̂_{t+1}}(t+1)

  P(X̂) = max_{1≤i≤N} δ_i(T+1)

Page 54: Third Problem: Parameter Estimation

• Given a certain observation sequence, we want to find the values of the model parameters λ = (A, B, Π) which best explain what we observed.
• Using Maximum Likelihood Estimation, find values to maximize P(O|λ), i.e., argmax_λ P(O_training | λ).
• There is no known analytic method to choose λ to maximize P(O|λ). However, we can locally maximize it by an iterative hill-climbing algorithm known as Baum-Welch or the Forward-Backward algorithm. (This is a special case of the EM Algorithm.)

Page 55: Parameter Estimation: The Forward-Backward Algorithm

• We don’t know what the model is, but we can work out the probability of the observation sequence using some (perhaps randomly chosen) model.

• Looking at that calculation, we can see which state transitions and symbol emissions were probably used the most.

• By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.

Page 56: Parameter Estimation

• Given an observation sequence, find the model that is most likely to produce that sequence.
• No analytic method exists.
• Given a model and training sequences, update the model parameters to better fit the observations.

Page 57: Definitions

The probability of traversing the arc from state i to state j at time t, given the observation sequence O:

  p_t(i, j) = P(X_t = i, X_{t+1} = j | O, λ)
            = P(X_t = i, X_{t+1} = j, O | λ) / P(O | λ)
            = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_{m=1..N} α_m(t) β_m(t)

Let:

  γ_i(t) = Σ_{j=1..N} p_t(i, j)

Page 58: The Probability of Traversing an Arc

[Figure: the trellis around times t-1 to t+2, highlighting the arc from s_i to s_j with weight a_ij b_{ij o_t}, preceded by the forward probability α_i(t) and followed by the backward probability β_j(t+1).]

Page 59: Definitions

• The expected number of transitions from state i in O:

  Σ_{t=1..T} γ_i(t) = Σ_{t=1..T} Σ_{j=1..N} p_t(i, j)

• The expected number of transitions from state i to j in O:

  Σ_{t=1..T} p_t(i, j)

Page 60: Reestimation Procedure

• Begin with a model λ, perhaps selected at random.
• Run O through the current model to estimate the expectations of each model parameter.
• Change the model to maximize the values of the paths that are used a lot, while respecting the stochastic constraints.
• Repeat this process (hoping to converge on optimal values for the model parameters λ).

Page 61: The Reestimation Formulas

  π̂_i = γ_i(1) = the expected frequency in state i at time t = 1

  â_ij = (expected # transitions from state i to state j) / (expected # transitions from state i)
       = Σ_{t=1..T} p_t(i, j) / Σ_{t=1..T} γ_i(t)

From λ = (A, B, Π), derive λ̂ = (Â, B̂, Π̂). Note that P(O|λ̂) ≥ P(O|λ).

Page 62: Reestimation Formulas

  b̂_ijk = (expected # transitions from state i to j observing k) / (expected # transitions from state i to j)
        = Σ_{t: o_t = k, 1≤t≤T} p_t(i, j) / Σ_{t=1..T} p_t(i, j)

Page 63: Parameter Estimation

Probability of traversing an arc at time t:

  p_t(i, j) = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_{m=1..N} α_m(t) β_m(t)

Probability of being in state i at time t:

  γ_i(t) = Σ_{j=1..N} p_t(i, j)

Page 64: Parameter Estimation

Now we can compute the new estimates of the model parameters (state-emission form for B):

  π̂_i = γ_i(1)

  â_ij = Σ_{t=1..T} p_t(i, j) / Σ_{t=1..T} γ_i(t)

  b̂_ik = Σ_{t: o_t = k} γ_i(t) / Σ_{t=1..T} γ_i(t)
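Pulling the pieces together, here is a minimal sketch of one Baum-Welch (forward-backward) reestimation step for a state-emission HMM, reusing the forward and backward sketches above; as before, the emission is tied to the source state at each time step, and the function and variable names are my own, not the slides'.

import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One reestimation step: return (pi_hat, A_hat, B_hat); P(O|lambda_hat) >= P(O|lambda)."""
    N, T = len(pi), len(obs)
    alpha, likelihood = forward(pi, A, B, obs)      # forward() and backward() as sketched above
    beta, _ = backward(pi, A, B, obs)

    # p_t(i, j) = alpha_i(t) * b_{i o_t} * a_ij * beta_j(t+1) / P(O|lambda)
    p = np.zeros((T, N, N))
    for t in range(T):
        p[t] = (alpha[t] * B[:, obs[t]])[:, None] * A * beta[t + 1][None, :] / likelihood
    gamma = p.sum(axis=2)                           # gamma_i(t) = sum_j p_t(i, j)

    pi_hat = gamma[0]                               # expected frequency in state i at t = 1
    A_hat = p.sum(axis=0) / gamma.sum(axis=0)[:, None]
    B_hat = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_hat[:, k] = gamma[[t for t in range(T) if obs[t] == k]].sum(axis=0)
    B_hat /= gamma.sum(axis=0)[:, None]
    return pi_hat, A_hat, B_hat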

Page 65: HMM Applications

• Parameters for interpolated n-gram models

• Part of speech tagging

• Speech recognition