Sequence Models With slides by me, Joshua Goodman, Fei Xia

Upload: afram

Post on 05-Jan-2016


TRANSCRIPT

Page 1: Sequence Models

Sequence Models

With slides by me, Joshua Goodman, Fei Xia

Page 2: Sequence Models

Outline

• Language Modeling
• Ngram Models
• Hidden Markov Models
– Supervised Parameter Estimation
– Probability of a sequence
– Viterbi (or decoding)
– Baum-Welch

Page 3: Sequence Models


A bad language model


Page 7: Sequence Models

What is a language model?

Language Model: A distribution that assigns a probability to language utterances.

e.g., PLM(“zxcv ./,mwea afsido”) is zero;

PLM(“mat cat on the sat”) is tiny;

PLM(“Colorless green ideas sleep furiously”) is bigger;

PLM(“A cat sat on the mat”) is bigger still.

Page 8: Sequence Models

What’s a language model for?

• Information Retrieval
• Handwriting recognition
• Speech recognition
• Spelling correction
• Optical character recognition
• Machine translation
• …

Page 9: Sequence Models

Example Language Model Application

Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file).

Straightforward model: P(text | sound)

But this can be hard to train effectively (although see CRFs later).

Page 10: Sequence Models

Example Language Model Application

Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file).

Traditional solution: Bayes’ Rule

P(text | sound) = P(sound | text) · P(text) / P(sound)

• P(sound): ignore; the denominator doesn't matter for picking a good text
• P(sound | text): acoustic model (easier to train)
• P(text): language model

Page 11: Sequence Models

Importance of Sequence

So far, we’ve been making the exchangeability, or bag-of-words, assumption:

The order of words is not important.

It turns out, that’s actually not true (duh!).

“cat mat on the the sat” ≠ “the cat sat on the mat”

“Mary loves John” ≠ “John loves Mary”

Page 12: Sequence Models

Language Models with Sequence Information

Problem: How can we define a model that

• assigns probability to sequences of words (a language model),
• makes the probability depend on the order of the words, and
• can be trained and computed tractably?

Page 13: Sequence Models

Outline

• Language Modeling
• Ngram Models
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence (decoding)
– Viterbi (best hidden layer sequence)
– Baum-Welch

Page 14: Sequence Models

Smoothing: Kneser-Ney

P(Francisco | eggplant) vs. P(stew | eggplant)

• "Francisco" is common, so backoff and interpolated methods say it is likely
• But it only occurs in the context of "San"
• "Stew" is common, and occurs in many contexts
• Kneser-Ney: weight the backoff probability by the number of contexts the word occurs in

Page 15: Sequence Models


Kneser-Ney smoothing (cont)

Interpolation:

Backoff:
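The interpolation and backoff equations did not survive the transcript. As a reconstruction, the standard Kneser-Ney forms for the bigram case are (D is an absolute discount, 0 < D < 1):

```latex
% Interpolated Kneser-Ney (bigram case), with absolute discount D:
P_{KN}(w_i \mid w_{i-1})
  = \frac{\max\bigl(c(w_{i-1} w_i) - D,\, 0\bigr)}{c(w_{i-1})}
  + \lambda(w_{i-1})\, P_{cont}(w_i)

% Backoff form: use the discounted estimate when the bigram was seen,
% otherwise back off to the continuation probability:
P_{KN}(w_i \mid w_{i-1})
  = \begin{cases}
      \dfrac{c(w_{i-1} w_i) - D}{c(w_{i-1})} & \text{if } c(w_{i-1} w_i) > 0 \\[6pt]
      \alpha(w_{i-1})\, P_{cont}(w_i)        & \text{otherwise}
    \end{cases}

% Continuation probability: in how many distinct contexts does w_i occur,
% relative to the number of distinct bigram types?
P_{cont}(w_i)
  = \frac{\bigl|\{\, w' : c(w' w_i) > 0 \,\}\bigr|}
         {\bigl|\{\, (w', w) : c(w' w) > 0 \,\}\bigr|}
```

The continuation probability is exactly the "number of contexts the word occurs in" idea from the previous slide: "Francisco" has a high raw count but a low continuation count.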

Page 16: Sequence Models

Outline

• Language Modeling• Ngram Models• Hidden Markov Models– Supervised parameter estimation– Probability of an observation sequence– Viterbi (Best hidden layer sequence)– Baum-Welch

Page 17: Sequence Models

The Hidden Markov Model

A dynamic Bayes Net (dynamic because the size can change).

The Oi nodes are called observed nodes. The Xi nodes are called hidden nodes.

[Figure: hidden chain X1 → X2 → … → XT, with each Xt emitting an observation Ot]

Page 18: Sequence Models

HMMs and Language Processing

• HMMs have been used in a variety of applications, but especially:
– Speech recognition (hidden nodes are text words, observations are spoken words)
– Part-of-speech tagging (hidden nodes are parts of speech, observations are words)


Page 19: Sequence Models

HMM Independence Assumptions

HMMs assume that:

• Xi is independent of X1 through Xi-2, given Xi-1 (Markov assumption)
• Oi is independent of all other nodes, given Xi
• P(Xi | Xi-1) and P(Oi | Xi) do not depend on i (stationarity assumption)

Not very realistic assumptions about language – but HMMs are often good enough, and very convenient.


Page 20: Sequence Models

HMM Formula

An HMM predicts that the probability of observing a sequence o = <o1, o2, …, oT> with a particular set of hidden states x = <x1, … xT> is:

To calculate, we need: - Prior: P(x1) for all values of x1

- Observation: P(oi|xi) for all values of oi and xi

- Transition: P(xi|xi-1) for all values of xi and xi-1

P(o, x) = P(x1) · P(o1 | x1) · ∏i=2..T P(xi | xi-1) · P(oi | xi)
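Once π, A, and B are known, this product is easy to compute directly. A minimal sketch in Python; the two-state model and its probabilities are invented purely for illustration:

```python
def hmm_joint_prob(obs, states, pi, A, B):
    """P(o, x) = P(x1) P(o1|x1) * prod over i=2..T of P(xi|xi-1) P(oi|xi)."""
    p = pi[states[0]] * B[(states[0], obs[0])]
    for i in range(1, len(obs)):
        p *= A[(states[i - 1], states[i])] * B[(states[i], obs[i])]
    return p

# Toy two-state model (numbers invented for illustration)
pi = {'N': 0.6, 'V': 0.4}
A = {('N', 'N'): 0.3, ('N', 'V'): 0.7, ('V', 'N'): 0.8, ('V', 'V'): 0.2}
B = {('N', 'cat'): 0.5, ('N', 'runs'): 0.5, ('V', 'cat'): 0.1, ('V', 'runs'): 0.9}

# P(x1=N) * P(cat|N) * P(V|N) * P(runs|V) = 0.6 * 0.5 * 0.7 * 0.9
p = hmm_joint_prob(['cat', 'runs'], ['N', 'V'], pi, A, B)
```

Note this is the joint probability of one particular hidden path; summing it over all paths gives the marginal probability of the observations, which is what the forward procedure later computes efficiently.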

Page 21: Sequence Models

HMM: Pieces

1) A set of hidden states H = {h1, …, hN}: the values which a hidden node may take.

2) A vocabulary V = {v1, …, vM}: the values which an observed node may take.

3) Initial probabilities P(x1=hj) for all j, written as a vector π of N initial probabilities, with πj = P(x1=hj).

4) Transition probabilities P(xt=hk | xt-1=hj) for all j, k, written as an N×N 'transition matrix' A, with Aj,k = P(xt=hk | xt-1=hj).

5) Observation probabilities P(ot=vk | xt=hj) for all j, k, written as an N×M 'observation matrix' B, with Bj,k = P(ot=vk | xt=hj).

Page 22: Sequence Models

HMM for POS Tagging

1) H = {DT, NN, VB, IN, …}, the set of all POS tags.

2) V = the set of all words in English.

3) Initial probabilities πj are the probability that POS tag hj can start a sentence.

4) Transition probabilities Ajk represent the probability that one tag (e.g., hk=VB) can follow another (e.g., hj=NN)

5) Observation probabilities Bjk represent the probability that a tag (e.g., hj=NN) will generate a particular word (e.g., vk=cat).
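Concretely, the five pieces amount to one probability vector and two row-stochastic matrices. A sketch with a tiny, invented tag set and vocabulary (every number below is illustrative only; rows are indexed by the previous/current hidden state):

```python
# A tiny, invented POS-tagging HMM.
H = ['DT', 'NN', 'VB']          # hidden states (tags)
V = ['the', 'cat', 'sat']       # observed vocabulary

pi = [0.7, 0.2, 0.1]            # pi[j] = P(x1 = H[j])

# A[j][k] = P(x_t = H[k] | x_{t-1} = H[j]); each row sums to 1
A = [[0.0, 0.9, 0.1],
     [0.2, 0.3, 0.5],
     [0.6, 0.3, 0.1]]

# B[j][k] = P(o_t = V[k] | x_t = H[j]); each row sums to 1
B = [[0.9, 0.05, 0.05],
     [0.1, 0.8, 0.1],
     [0.05, 0.05, 0.9]]
```

Storing A and B with rows indexed by the conditioning state makes each row a probability distribution, which is the convention the forward and Viterbi recursions later rely on.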

Page 23: Sequence Models

Outline

• Language Models
• Ngram modeling
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence
– Viterbi: what's the best hidden state sequence?
– Baum-Welch: unsupervised parameter estimation

Page 24: Sequence Models

Supervised Parameter Estimation

• Given an observation sequence and states, find the HMM parameters (π, A, and B) that are most likely to produce the sequence.

• For example, POS-tagged data from the Penn Treebank

[Trellis figure: hidden states X1 … XT linked by transition matrix A, each Xt emitting ot via observation matrix B]

Page 25: Sequence Models

Maximum Likelihood Parameter Estimation

π̂i = (# sentences starting with state hi) / (# sentences)

âij = (# times hi is followed by hj) / (# times hi is in the data)

b̂ik = (# times hi produces vk) / (# times hi is in the data)
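These maximum-likelihood estimates are just normalized counts over the tagged data. A sketch; the function name and the list-of-(word, tag)-pairs data format are my own choices, not from the slides:

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of (pi, A, B) from tagged data.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    start, trans, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        start[sent[0][1]] += 1                    # tag starting the sentence
        for i, (word, tag) in enumerate(sent):
            tag_count[tag] += 1                   # times tag is in the data
            emit[(tag, word)] += 1                # times tag produces word
            if i > 0:
                trans[(sent[i - 1][1], tag)] += 1 # times prev tag followed by tag
    n_sent = len(tagged_sentences)
    pi = {t: c / n_sent for t, c in start.items()}
    A = {(t1, t2): c / tag_count[t1] for (t1, t2), c in trans.items()}
    B = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}
    return pi, A, B

# Two tiny tagged sentences, invented for illustration
data = [[('the', 'DT'), ('cat', 'NN')], [('cat', 'NN')]]
pi, A, B = estimate_hmm(data)
```

In practice these raw relative frequencies would be smoothed, for the same reasons smoothing was needed for ngram models.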

Page 26: Sequence Models

Outline

• Language Modeling
• Ngram models
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence
– Viterbi
– Baum-Welch

Page 27: Sequence Models

What’s the probability of a sentence?

Suppose I asked you, ‘What’s the probability of seeing a sentence v1, …, vT on the web?’

If we have an HMM model of English, we can use it to estimate the probability.

(In other words, HMMs can be used as language models.)

Page 28: Sequence Models

Conditional Probability of a Sentence

• If we knew the hidden states that generated each word in the sentence, it would be easy:

P(o1, …, oT | x1, …, xT)
= P(o1, …, oT, x1, …, xT) / P(x1, …, xT)
= [ P(x1) · P(o1 | x1) · ∏i=2..T P(xi | xi-1) · P(oi | xi) ] / [ P(x1) · ∏i=2..T P(xi | xi-1) ]
= ∏i=1..T P(oi | xi)

Page 29: Sequence Models

Marginal Probability of a Sentence

Via marginalization, we have:

Unfortunately, if there are N values for each xi (h1 through hN),

Then there are NT values for x1,…,xT.

Brute-force computation of this sum is intractable.

P(o1, …, oT) = Σx1,…,xT P(o1, …, oT, x1, …, xT)
= Σx1,…,xT P(x1) · P(o1 | x1) · ∏i=2..T P(xi | xi-1) · P(oi | xi)

Page 30: Sequence Models

Forward Procedure

• Special structure gives us an efficient solution using dynamic programming.
• Intuition: the probability of the first t observations is the same for all possible length-(t+1) state sequences.
• Define:

αi(t) = P(o1, …, ot, xt = hi | π, A, B)

• Initialization: αi(1) = πi · Bi,o1

Page 31: Sequence Models

Forward Procedure

αj(t+1) = P(o1, …, ot+1, xt+1 = hj)
= P(o1, …, ot+1 | xt+1 = hj) · P(xt+1 = hj)
= P(ot+1 | xt+1 = hj) · P(o1, …, ot | xt+1 = hj) · P(xt+1 = hj)
= P(o1, …, ot, xt+1 = hj) · P(ot+1 | xt+1 = hj)

Page 35: Sequence Models

Forward Procedure

Continuing, and using the Markov assumption:

P(o1, …, ot, ot+1, xt+1 = hj)
= Σi=1..N P(o1, …, ot, xt = hi, xt+1 = hj) · P(ot+1 | xt+1 = hj)
= Σi=1..N P(o1, …, ot, xt = hi) · P(xt+1 = hj | xt = hi) · P(ot+1 | xt+1 = hj)

which gives the recursion:

αj(t+1) = Bj,ot+1 · Σi=1..N αi(t) · Ai,j
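The recursion αj(t+1) = Bj,ot+1 · Σi αi(t) · Ai,j turns the exponential sum over N^T paths into O(N²T) work. A sketch with integer-coded states and observation symbols (the toy numbers are invented); the brute-force sum over all paths is included only to check the result:

```python
from itertools import product

def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1 .. o_{t+1}, x_{t+1} = i), with 0-indexed t."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]   # initialization
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])                 # recursion
    return alpha

# Toy model: 2 states, 2 observation symbols (numbers invented)
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]

alpha = forward(obs, pi, A, B)
total = sum(alpha[-1])          # P(O) = sum_i alpha_i(T)

# Brute force over all N^T hidden paths, for comparison only
brute = 0.0
for x in product(range(2), repeat=len(obs)):
    p = pi[x[0]] * B[x[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[x[t - 1]][x[t]] * B[x[t]][obs[t]]
    brute += p
```

The dynamic program and the brute-force marginalization agree exactly; only their costs differ.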


Page 39: Sequence Models

Backward Procedure

βi(t) = P(ot+1, …, oT | xt = hi)

Initialization: βi(T) = 1

Recursion: βi(t) = Σj=1..N Ai,j · Bj,ot+1 · βj(t+1)

The probability of the rest of the observations, given the current state.

Page 40: Sequence Models

Decoding Solution

Forward procedure: P(O | π, A, B) = Σi=1..N αi(T)

Backward procedure: P(O | π, A, B) = Σi=1..N πi · Bi,o1 · βi(1)

Combination: P(O | π, A, B) = Σi=1..N αi(t) · βi(t), for any t
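All three identities compute the same quantity, which is easy to verify numerically. A sketch (the forward procedure is repeated here so the block is self-contained, and the toy numbers are invented):

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1 .. o_{t+1}, x_{t+1} = i), 0-indexed t."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(o_{t+2} .. o_T | x_{t+1} = i); last row is all ones."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

# Toy model (numbers invented)
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0, 1]

alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
N, T = len(pi), len(obs)

p_fwd = sum(alpha[T - 1])                                          # forward
p_bwd = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))   # backward
p_mix = [sum(alpha[t][i] * beta[t][i] for i in range(N))           # any t
         for t in range(T)]
```

The combination identity holding at every t is what Baum-Welch exploits later: αi(t)·βi(t) factors P(O) through the event "the chain is in state hi at time t".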

Page 41: Sequence Models

Outline

• Language modeling
• Ngram models
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence
– Viterbi: what's the best hidden state sequence?
– Baum-Welch

Page 42: Sequence Models

Best State Sequence

• Find the hidden state sequence that best explains the observations
• Viterbi algorithm

X̂ = argmaxX P(X | O)

Page 43: Sequence Models

Viterbi Algorithm

δj(t) = maxx1,…,xt-1 P(x1, …, xt-1, o1, …, ot-1, xt = hj, ot)

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state hj, and seeing the observation at time t.

Page 44: Sequence Models

Viterbi Algorithm

Recursive computation:

δj(t+1) = maxi=1..N δi(t) · Ai,j · Bj,ot+1

Ψj(t+1) = argmaxi=1..N δi(t) · Ai,j · Bj,ot+1

Page 45: Sequence Models

Viterbi Algorithm

Compute the most likely state sequence by working backwards:

X̂T = argmaxi δi(T)

X̂t = ΨX̂t+1(t+1)

P(X̂) = maxi δi(T)
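Putting the recursion and the backtrace together gives the full algorithm. A sketch with integer-coded states and observations; the deterministic toy model below is invented so that the best path is obvious by inspection:

```python
def viterbi(obs, pi, A, B):
    """Return (most likely hidden state sequence, its joint probability)."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]   # delta_i(1)
    psi = []                                            # backpointers Psi
    for t in range(1, len(obs)):
        new_delta, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs[t]])
        delta = new_delta
        psi.append(back)
    # Backtrack from the best final state
    path = [max(range(N), key=lambda i: delta[i])]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return path, max(delta)

# Deterministic toy model: states 0 and 1 alternate, each emitting its own symbol
pi = [1.0, 0.0]
A = [[0.0, 1.0], [1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0]]

path, prob = viterbi([0, 1, 0], pi, A, B)
```

The structure mirrors the forward procedure exactly, with max in place of sum, plus the backpointer table needed to recover the argmax path.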

Page 46: Sequence Models

Outline

• Language modeling
• Ngram models
• Hidden Markov Models
– Supervised parameter estimation
– Probability of a sequence
– Viterbi
– Baum-Welch: unsupervised parameter estimation

Page 47: Sequence Models

Unsupervised Parameter Estimation

• Given an observation sequence, find the model that is most likely to produce that sequence.
• No analytic method.
• Instead, given a model and an observation sequence, update the model parameters to better fit the observations.

Page 48: Sequence Models

Parameter Estimation

Probability of traversing the arc from state i to state j at time t:

pt(i, j) = αi(t) · Ai,j · Bj,ot+1 · βj(t+1) / Σm=1..N αm(t) · βm(t)

Probability of being in state i at time t:

γt(i) = Σj=1..N pt(i, j)

Page 49: Sequence Models

Parameter Estimation

Now we can compute the new estimates of the model parameters:

π̂i = γ1(i)

âij = Σt=1..T-1 pt(i, j) / Σt=1..T-1 γt(i)

b̂ik = Σ{t : ot = vk} γt(i) / Σt=1..T γt(i)
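One full re-estimation pass (expected counts via forward-backward, then the updates above) can be sketched as follows. The function names are my own and the toy numbers are invented; the final check relies on the EM guarantee that re-estimation never decreases the likelihood:

```python
def baum_welch_step(obs, pi, A, B):
    """One EM re-estimation pass over a single observation sequence (a sketch)."""
    N, T, M = len(pi), len(obs), len(B[0])
    # Forward and backward passes
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    total = sum(alpha[T - 1])
    # Expected counts: p_t(i,j) for arcs, gamma_t(i) for states
    p = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / total
           for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gamma = [[alpha[t][i] * beta[t][i] / total for i in range(N)]
             for t in range(T)]
    # Re-estimates
    new_pi = gamma[0][:]
    new_A = [[sum(p[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_pi, new_A, new_B

def likelihood(obs, pi, A, B):
    """P(O | pi, A, B) via the forward procedure."""
    a = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for t in range(1, len(obs)):
        a = [B[j][obs[t]] * sum(a[i] * A[i][j] for i in range(len(pi)))
             for j in range(len(pi))]
    return sum(a)

# Toy model and observation sequence (numbers invented)
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.8, 0.2], [0.1, 0.9]]
obs = [0, 0, 1, 0, 1, 1]

new_pi, new_A, new_B = baum_welch_step(obs, pi, A, B)
```

Iterating this step until the likelihood stops improving is the Baum-Welch algorithm; each update leaves π̂, Â, and B̂ properly normalized by construction.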

Page 50: Sequence Models

Parameter Estimation

• Guarantee: P(o1:T | π, A, B) ≤ P(o1:T | π̂, Â, B̂)
• In other words, by repeating this procedure, we can gradually improve how well the HMM fits the unlabeled data.
• There is no guarantee that this will converge to the best possible HMM, however; it is only guaranteed to find a local maximum.

Page 51: Sequence Models

The Most Important Thing

We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not tractable.