
Lecture 12: Hidden Markov Models

Machine Learning

Andrew Rosenberg

March 12, 2010

1 / 34

Last Time

Clustering

2 / 34

Today

Hidden Markov Models

3 / 34

Dice Example

Imagine a game of dice.

When the croupier rolls 4, 5, or 6, you win.

When the croupier rolls 1, 2, or 3, you lose.

Model the likelihood of winning.

IID multinomials

4 / 34

The Moody Croupier

Now imagine that the croupier cheats.

There are three dice.

One fair (fair)

One good for the house (bad)

One good for you (good)

5 / 34

The Moody Croupier

Model the Likelihood of winning.

IID multinomials

Latent variable

[Plate diagram: q_i → x_i, repeated for i ∈ {0, …, n−1}]

6 / 34

The Moody Croupier

Model the Likelihood of winning.

IID multinomials

Latent variable

Allow a prior over die choices

[Plate diagram: θ → q_i → x_i, repeated for i ∈ {0, …, n−1}]

7 / 34

The Moody Croupier

Now what if the dealer is moody?

The dealer doesn’t like to change the die that often.

The dealer doesn’t like to switch from the good die to the bad die.

No longer IID!

The die he uses at time t is dependent on the die used at t − 1.

[Graphical model: θ → each q_t; chain q_0 → q_1 → q_2 → … → q_{T−1}, with an emission q_t → x_t at every step]

8 / 34

Sequential Modeling

[Graphical model: chain q_0 → q_1 → q_2 → … → q_{T−1}, with an emission q_t → x_t at every step]

Temporal or sequence model.

Markov Assumption

future ⊥⊥ past | present

p(q_t | q_{t−1}, q_{t−2}, …, q_0) = p(q_t | q_{t−1})

Get the overall likelihood from the graphical model.

p(q, x) = p(q_0) ∏_{t=1}^{T−1} p(q_t | q_{t−1}) ∏_{t=0}^{T−1} p(x_t | q_t)

9 / 34
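The chain factorization above is easy to evaluate for a fully observed sequence of dice and rolls. A minimal sketch; the dice probabilities and variable names below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

# Illustrative parameters for the moody-croupier chain (assumed, not from
# the slides): 3 dice (states) and 2 outcomes (0 = lose, 1 = win).
pi = np.array([1/3, 1/3, 1/3])            # p(q_0): initial die choice
A = np.array([[0.5, 0.3, 0.2],            # A[i, j] = p(q_t = j | q_{t-1} = i)
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
B = np.array([[0.5, 0.5],                 # B[i, k] = p(x_t = k | q_t = i)
              [0.7, 0.3],                 # "bad" die: you win less often
              [0.3, 0.7]])                # "good" die: you win more often

def joint_log_likelihood(q, x):
    """log p(q, x) = log p(q_0) + sum_t log p(q_t|q_{t-1}) + sum_t log p(x_t|q_t)."""
    ll = np.log(pi[q[0]]) + np.log(B[q[0], x[0]])
    for t in range(1, len(q)):
        ll += np.log(A[q[t - 1], q[t]]) + np.log(B[q[t], x[t]])
    return ll
```

The point of the factorization is that the log-likelihood is a sum of one term per edge in the graphical model, so it costs O(T) rather than enumerating joint configurations.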

Sequential Modeling

future ⊥⊥ past | present

p(q_t | q_{t−1}, q_{t−2}, …, q_0) = p(q_t | q_{t−1})

Get the overall likelihood from the graphical model.

p(q, x) = p(q_0) ∏_{t=1}^{T−1} p(q_t | q_{t−1}) ∏_{t=0}^{T−1} p(x_t | q_t)

How do we model p(q_t | q_{t−1})?

10 / 34

HMMs as state machines

HMMs have two variables: state q and emission x

In general the state is an unobserved latent variable.

Can consider HMMs as stochastic automata: weighted finite-state machines.

11 / 34

HMM state machine

[State transition diagram over the three dice: states good, fair, and bad, each with self-transition probability 0.5 and probabilities of 0.2–0.3 on the transitions between the other states]

12 / 34

HMMs as state machines

HMMs have two variables: state q and emission x

In general the state is an unobserved latent variable.

No observation of q directly, only a related emission distribution: a “doubly stochastic” automaton.

13 / 34

HMM Applications

Speech Recognition (Rabiner): phonemes from audio cepstral vectors

Language (Jelinek): part of speech tag from words

Biology (Baldi): splice site from gene sequence

Gesture (Starner): word from hand coordinates

Emotion (Picard): emotion from EEG

14 / 34

Types of Variables

Continuous States

E.g., Kalman filters: p(q_t | q_{t−1}) = N(q_t | A q_{t−1}, Q)

Discrete States

E.g., Finite state machine

p(q_t | q_{t−1}) = ∏_{i=0}^{M−1} ∏_{j=0}^{M−1} [α_ij]^{q^i_{t−1} q^j_t}

Continuous Observations

E.g., time series data: p(x_t | q_t) = N(x_t | μ_{q_t}, Σ_{q_t})

Discrete Observations

E.g. strings

p(x_t | q_t) = ∏_{i=0}^{M−1} ∏_{j=0}^{N−1} [η_ij]^{q^i_t x^j_t}

15 / 34

HMM Parameters

M states and N-class observations. Complete likelihood from the graphical model:

p(q, x) = p(q_0) ∏_{t=1}^{T−1} p(q_t | q_{t−1}) ∏_{t=0}^{T−1} p(x_t | q_t)

Marginalize over unobserved hidden states:

p(x) = ∑_{q_0} ⋯ ∑_{q_{T−1}} p(q, x)

CPTs are reused: θ = {π, α, η}

p(q_0) = ∏_{i=0}^{M−1} [π_i]^{q^i_0}

p(q_t | q_{t−1}) = ∏_{i=0}^{M−1} ∏_{j=0}^{M−1} [α_ij]^{q^i_{t−1} q^j_t}

p(x_t | q_t) = ∏_{i=0}^{M−1} ∏_{j=0}^{N−1} [η_ij]^{q^i_t x^j_t}

∑_{i=0}^{M−1} π_i = 1    ∑_{j=0}^{M−1} α_ij = 1    ∑_{j=0}^{N−1} η_ij = 1

16 / 34
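The three CPTs and their normalization constraints can be written down directly as a sanity check. A sketch, where the sizes M and N and the construction by normalizing random scores are my assumptions:

```python
import numpy as np

M, N = 3, 6   # assumed sizes: M dice (states), N faces (observation classes)
rng = np.random.default_rng(0)

# theta = {pi, alpha, eta}: draw arbitrary positive scores, then normalize
# so each constraint holds: sum_i pi_i = 1, sum_j alpha_ij = 1, sum_j eta_ij = 1.
pi = rng.random(M)
pi /= pi.sum()
alpha = rng.random((M, M))
alpha /= alpha.sum(axis=1, keepdims=True)   # row i of alpha is p(q_t | q_{t-1} = i)
eta = rng.random((M, N))
eta /= eta.sum(axis=1, keepdims=True)       # row i of eta is p(x_t | q_t = i)

assert np.isclose(pi.sum(), 1.0)
assert np.allclose(alpha.sum(axis=1), 1.0)
assert np.allclose(eta.sum(axis=1), 1.0)
```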

HMM Operations

Evaluate

Evaluate the likelihood of a model given data.

Decode

Identify the most likely sequence of states

Max Likelihood

Estimate the parameters.

17 / 34

JTA on HMM

[Junction tree for the HMM: cliques (q_t, q_{t+1}) and (q_t, x_t), with separators φ(q_t) between state cliques and ζ(q_t) between state and emission cliques]

18 / 34

JTA on HMMs

Initialization

ψ(q_0, x_0) = p(q_0) p(x_0 | q_0)

ψ(q_t, q_{t+1}) = p(q_{t+1} | q_t) = A_{q_t, q_{t+1}}

ψ(q_t, x_t) = p(x_t | q_t)

Z = 1

φ(q_t) = 1

ζ(q_t) = 1

19 / 34

JTA on HMMs

Update

Collect up from leaves – don’t change zeta separators.

ζ*(q_t) = ∑_{x_t} ψ(q_t, x_t) = ∑_{x_t} p(x_t | q_t) = 1

ψ*(q_{t−1}, q_t) = [ζ*(q_t) / ζ(q_t)] ψ(q_{t−1}, q_t) = ψ(q_{t−1}, q_t)

20 / 34

JTA on HMMs

Update

Collect left-right over phi – state sequence becomes marginals.

φ*(q_0) = ∑_{x_0} ψ(q_0, x_0) = p(q_0)

φ*(q_t) = ∑_{q_{t−1}} ψ(q_{t−1}, q_t) = p(q_t)

ψ*(q_0, q_1) = [φ*(q_0) / φ(q_0)] ψ(q_0, q_1) = p(q_0, q_1)

21 / 34

JTA on HMMs

Distribute

Distribute to separators

ζ**(q_t) = ∑_{q_{t−1}} ψ*(q_{t−1}, q_t) = ∑_{q_{t−1}} p(q_{t−1}, q_t) = p(q_t)

ψ**(q_t, x_t) = [ζ**(q_t) / ζ*(q_t)] ψ(q_t, x_t) = [p(q_t) / 1] p(x_t | q_t) = p(x_t, q_t)

22 / 34

Introduction of Evidence

p(q, x̄) = p(q_0) ∏_{t=1}^{T−1} p(q_t | q_{t−1}) ∏_{t=0}^{T−1} p(x̄_t | q_t)

Observe a sequence of data.

Potentials become slices:

ψ(q_t, x̄_t) = p(x̄_t | q_t)

ζ*(q_t) = ψ(q_t, x̄_t) = p(x̄_t | q_t)

ζ*(q_t) ≠ ∑_{x_t} ψ(q_t, x_t)

Collect zeta separators bottom up: ζ*(q_t) = ψ(q_t, x̄_t) = p(x̄_t | q_t)

Collect phi separators to the right: φ*(q_0) = ∑_{x_0} ψ(q_0, x_0) δ(x_0 − x̄_0) = p(q_0, x̄_0)

23 / 34

JTA collect

Collecting up and to the left, updating potentials by the left and bottom separators:

ψ*(q_t, q_{t+1}) = [φ*(q_t) / 1] [ζ*(q_{t+1}) / 1] ψ(q_t, q_{t+1}) = φ*(q_t) p(x̄_{t+1} | q_{t+1}) α_{q_t q_{t+1}}

φ*(q_{t+1}) = ∑_{q_t} ψ*(q_t, q_{t+1}) = ∑_{q_t} φ*(q_t) p(x̄_{t+1} | q_{t+1}) α_{q_t q_{t+1}}

Note:

φ*(q_0) = p(x̄_0, q_0)

φ*(q_1) = ∑_{q_0} p(x̄_0, q_0) p(x̄_1 | q_1) p(q_1 | q_0) = p(x̄_0, x̄_1, q_1)

φ*(q_2) = ∑_{q_1} p(x̄_0, x̄_1, q_1) p(x̄_2 | q_2) p(q_2 | q_1) = p(x̄_0, x̄_1, x̄_2, q_2)

φ*(q_{t+1}) = ∑_{q_t} p(x̄_0, …, x̄_t, q_t) p(x̄_{t+1} | q_{t+1}) p(q_{t+1} | q_t) = p(x̄_0, …, x̄_{t+1}, q_{t+1})

24 / 34

Evaluation

Compute the likelihood of the sequence.

Collection is sufficient. From the previous slide:

φ*(q_{t+1}) = ∑_{q_t} p(x̄_0, …, x̄_t, q_t) p(x̄_{t+1} | q_{t+1}) p(q_{t+1} | q_t) = p(x̄_0, …, x̄_{t+1}, q_{t+1})

So the rightmost node gives:

φ*(q_{T−1}) = p(x̄_0, …, x̄_{T−1}, q_{T−1})

The likelihood just requires marginalization over q_{T−1}:

p(x̄_0, …, x̄_{T−1}) = ∑_{q_{T−1}} p(x̄_0, …, x̄_{T−1}, q_{T−1}) = ∑_{q_{T−1}} φ*(q_{T−1})

25 / 34
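The collect recursion over the φ* separators is exactly the classical forward algorithm. A minimal sketch, assuming tabular parameters pi, A, B stored as numpy arrays (the names are mine, not the lecture's):

```python
import numpy as np

def forward(pi, A, B, x):
    """Compute the trellis phi[t, i] = p(xbar_0..xbar_t, q_t = i) and p(xbar)."""
    T, M = len(x), len(pi)
    phi = np.zeros((T, M))
    phi[0] = pi * B[:, x[0]]                      # phi*(q_0) = p(xbar_0, q_0)
    for t in range(1, T):
        # phi*(q_{t+1}) = sum_{q_t} phi*(q_t) alpha_{q_t q_{t+1}} p(xbar_{t+1}|q_{t+1})
        phi[t] = (phi[t - 1] @ A) * B[:, x[t]]
    return phi, phi[-1].sum()                     # marginalize over q_{T-1}
```

Each step is a matrix-vector product, so evaluation costs O(T M²) even though there are M^T hidden-state configurations.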

Distribute

But the potentials cannot be read as marginals without the Distribute step of the JTA.

Last state of collection:

ψ*(q_{T−2}, q_{T−1}) = [φ*(q_{T−2}) / 1] [ζ*(q_{T−1}) / 1] ψ(q_{T−2}, q_{T−1}) = φ*(q_{T−2}) p(x̄_{T−1} | q_{T−1}) α_{q_{T−2} q_{T−1}}

Distribute ** along the state nodes to the left.

Distribute ** down from state nodes to observation nodes.

Update parameters:

ψ**(q_{T−2}, q_{T−1}) = ψ*(q_{T−2}, q_{T−1})

φ**(q_t) = ∑_{q_{t+1}} ψ**(q_t, q_{t+1})

ζ**(q_{t+1}) = ∑_{q_t} ψ**(q_t, q_{t+1})

ψ**(q_t, q_{t+1}) = [φ**(q_{t+1}) / φ*(q_{t+1})] ψ*(q_t, q_{t+1})

26 / 34
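After the distribute pass the separators hold smoothed marginals p(q_t | x̄). The same quantities can be sketched via the classical forward and backward recursions (this is my formulation of the computation, not the JTA bookkeeping itself):

```python
import numpy as np

def posterior_marginals(pi, A, B, x):
    """p(q_t | xbar_0..xbar_{T-1}) for every t, via forward and backward passes."""
    T, M = len(x), len(pi)
    fwd = np.zeros((T, M))
    bwd = np.ones((T, M))                            # beta_{T-1}(i) = 1
    fwd[0] = pi * B[:, x[0]]
    for t in range(1, T):
        fwd[t] = (fwd[t - 1] @ A) * B[:, x[t]]       # collect (forward)
    for t in range(T - 2, -1, -1):
        bwd[t] = A @ (B[:, x[t + 1]] * bwd[t + 1])   # distribute (backward)
    post = fwd * bwd                                 # proportional to p(q_t, xbar)
    return post / post.sum(axis=1, keepdims=True)
```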

Decoding

Decode: Given x_0, …, x_{T−1}, identify the most likely q_0, …, q_{T−1}.

Now that the JTA is finished we have marginals in the potentials and separators:

φ**(q_t) ∝ p(q_t | x̄_0, …, x̄_{T−1})

ζ**(q_{t+1}) ∝ p(q_{t+1} | x̄_0, …, x̄_{T−1})

ψ**(q_t, q_{t+1}) ∝ p(q_t, q_{t+1} | x̄_0, …, x̄_{T−1})

Need to find the most likely path from q_0 to q_{T−1}.

Argmax JTA: run the JTA, but rather than sums in the update rule, use the max operator. Then find the largest entry in the separators:

q̂_t = argmax_{q_t} φ**(q_t)

27 / 34

Viterbi Decoding

Finding an optimal state sequence can be intractable.

There are M^T possible paths, for M states and T time steps.

T can easily be on the order of 1000 in speech recognition.

Construct a lattice of state transitions.

28 / 34

Viterbi Decoding

Only continue to explore paths with likelihood greater than some threshold, or only continue to explore the top N paths.

Also known as beam search.

Polynomial-time algorithm to approximately decode a lattice.

Algorithm:

Initialize paths at every state.

For each transition, follow only the most likely edge,

or

Initialize paths at every state.

For each transition, follow only those paths that have a likelihood over some threshold.

29 / 34

Viterbi decoding

30 / 34
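The max-product recursion can be sketched directly in log space. This is a generic exact Viterbi implementation under assumed tabular parameters, not the pruned beam search described on the previous slide:

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Most likely state path: argmax_q p(q, xbar), computed in log space."""
    T, M = len(x), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((T, M))                 # best log score ending in each state
    back = np.zeros((T, M), dtype=int)       # backpointers
    delta[0] = np.log(pi) + logB[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA        # scores[i, j]: arrive at j via i
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, x[t]]
    path = [int(delta[-1].argmax())]                 # backtrace from the end
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Replacing the sums of the forward recursion with max gives the same O(T M²) cost, which is why the exact best path is tractable despite the M^T candidate paths.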

Maximum Likelihood

Training parameters with observed states.

Maximum likelihood (as ever):

l(θ) = log p(q, x̄)

= log ( p(q_0) ∏_{t=1}^{T−1} p(q_t | q_{t−1}) ∏_{t=0}^{T−1} p(x̄_t | q_t) )

= log p(q_0) + ∑_{t=1}^{T−1} log p(q_t | q_{t−1}) + ∑_{t=0}^{T−1} log p(x̄_t | q_t)

= log ∏_{i=0}^{M−1} [π_i]^{q^i_0} + ∑_{t=1}^{T−1} log ∏_{i=0}^{M−1} ∏_{j=0}^{M−1} [α_ij]^{q^i_{t−1} q^j_t} + ∑_{t=0}^{T−1} log ∏_{i=0}^{M−1} ∏_{j=0}^{N−1} [η_ij]^{q^i_t x̄^j_t}

= ∑_{i=0}^{M−1} q^i_0 log π_i + ∑_{t=1}^{T−1} ∑_{i=0}^{M−1} ∑_{j=0}^{M−1} q^i_{t−1} q^j_t log α_ij + ∑_{t=0}^{T−1} ∑_{i=0}^{M−1} ∑_{j=0}^{N−1} q^i_t x̄^j_t log η_ij

Introduce Lagrange multipliers, take partials, set to zero.

31 / 34

Maximum Likelihood

Training parameters with observed states.

Maximum likelihood, as ever:

l(θ) = ∑_{i=0}^{M−1} q^i_0 log π_i + ∑_{t=1}^{T−1} ∑_{i=0}^{M−1} ∑_{j=0}^{M−1} q^i_{t−1} q^j_t log α_ij + ∑_{t=0}^{T−1} ∑_{i=0}^{M−1} ∑_{j=0}^{N−1} q^i_t x̄^j_t log η_ij

Introduce Lagrange multipliers for the constraints, take partials, set to zero:

∑_{i=0}^{M−1} π_i = 1    ∑_{j=0}^{M−1} α_ij = 1    ∑_{j=0}^{N−1} η_ij = 1

π̂_i = q^i_0

α̂_ij = [∑_{t=0}^{T−2} q^i_t q^j_{t+1}] / [∑_{k=0}^{M−1} ∑_{t=0}^{T−2} q^i_t q^k_{t+1}]

η̂_ij = [∑_{t=0}^{T−1} q^i_t x̄^j_t] / [∑_{k=0}^{N−1} ∑_{t=0}^{T−1} q^i_t x̄^k_t]

32 / 34
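With fully observed states these estimates reduce to counting and normalizing. A minimal sketch for a single sequence with integer-coded states and observations; the function name and the uniform-row convention for unvisited states are my assumptions:

```python
import numpy as np

def ml_estimate(q, x, M, N):
    """Closed-form ML estimates pi_hat, alpha_hat, eta_hat from one observed
    state sequence q and emission sequence x (lists of integer codes)."""
    pi_hat = np.zeros(M)
    pi_hat[q[0]] = 1.0                                # pi_hat_i = q_0^i
    alpha_counts = np.zeros((M, M))
    eta_counts = np.zeros((M, N))
    for t in range(len(q) - 1):
        alpha_counts[q[t], q[t + 1]] += 1             # transition counts
    for t in range(len(q)):
        eta_counts[q[t], x[t]] += 1                   # emission counts

    def normalize(counts):
        # Normalize each row; rows with no counts get a uniform distribution.
        rows = counts.sum(axis=1, keepdims=True)
        out = np.full_like(counts, 1.0 / counts.shape[1])
        np.divide(counts, rows, out=out, where=rows > 0)
        return out

    return pi_hat, normalize(alpha_counts), normalize(eta_counts)
```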

Expectation Maximization

However, we may not have observed state sequences.

The Moody Croupier.

Need to do unsupervised learning (clustering) on the states.

Maximize the expected likelihood given a guess for p(q): Expectation Maximization, covered when we move to unsupervised techniques.

33 / 34

Bye

Next

Perceptron and Neural Networks

34 / 34
