Part-of-Speech Tagging (Columbia University, madigan/dm08/hmm.pdf)


Part-of-Speech Tagging

• Assign grammatical tags to words
• Basic task in the analysis of natural language data
• Phrase identification, entity extraction, etc.
• Ambiguity: “tag” could be a noun or a verb
• “a tag is a part-of-speech label” – context resolves the ambiguity

The Penn Treebank POS Tag Set

POS Tagging Process

Berlin Chen

POS Tagging Algorithms

• Rule-based taggers: large numbers of hand-crafted rules

• Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM

[Diagram: tag1 → word1, tag2 → word2, tag3 → word3]

The Brown Corpus

• Comprises about 1 million English words
• HMMs first used for tagging on the Brown Corpus, 1967. Somewhat dated now.
• British National Corpus has 100 million words

Simple Charniak Model

[Diagram: t1 → w1, t2 → w2, t3 → w3]

• What about words that have never been seen before?
• Clever tricks for smoothing the number of parameters (aka priors)

some details…

Pr(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears)

For unseen words:

Pr(tag i | unseen word) = (number of times a word that had never been seen gets tag i) / (number of such occurrences in total)

Test data accuracy on Brown Corpus = 91.51%
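A minimal sketch of the word-conditional estimate above, assuming a toy corpus of (word, tag) pairs; the fixed fallback tag for unseen words is an illustrative assumption, not the smoothing scheme on the slide:

```python
# Most-frequent-tag baseline: Pr(tag i | word j) estimated by counting.
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word, keep its most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, best_tag, unseen_tag="NN"):
    # Unseen words get a fixed fallback tag here; the slides instead
    # estimate a distribution over tags for never-seen words.
    return [best_tag.get(w, unseen_tag) for w in words]

corpus = [("the", "DT"), ("tag", "NN"), ("is", "VBZ"), ("a", "DT"),
          ("label", "NN"), ("tag", "VB")]
model = train_baseline(corpus)
print(tag(["the", "tag", "is", "unknownword"], model))
```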

HMM

[Diagram: hidden tag chain t1 → t2 → t3, with emissions t1 → w1, t2 → w2, t3 → w3]

•Brown test set accuracy = 95.97%
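A sketch of how the HMM's transition and emission probabilities could be estimated from a tagged corpus by relative-frequency counting; the toy corpus and add-alpha smoothing are illustrative assumptions, not the procedure behind the Brown Corpus result:

```python
# Estimate Pr(tag | previous tag) and Pr(word | tag) by counting.
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences, alpha=0.1):
    """tagged_sentences: list of sentences, each a list of (word, tag)."""
    trans, emit, tags = defaultdict(Counter), defaultdict(Counter), Counter()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tags[tag] += 1
            prev = tag
    vocab = {w for c in emit.values() for w in c}

    def p_trans(t, prev):   # Pr(tag t | previous tag), add-alpha smoothed
        return (trans[prev][t] + alpha) / (sum(trans[prev].values()) + alpha * len(tags))

    def p_emit(w, t):       # Pr(word w | tag t), add-alpha smoothed
        return (emit[t][w] + alpha) / (tags[t] + alpha * (len(vocab) + 1))

    return p_trans, p_emit

sents = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
         [("a", "DT"), ("tag", "NN")]]
p_trans, p_emit = estimate_hmm(sents)
print(p_trans("NN", "DT"), p_emit("dog", "NN"))
```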

Morphological Features

• Knowledge that “quickly” ends in “ly” should help identify the word as an adverb
• “randomizing” -> “ing”
• Split each word into a root (“quick”) and a suffix (“ly”)

[Diagram: tag chain t1 → t2 → t3, each tag emitting a root ri and a suffix si]
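A minimal sketch of the root/suffix split; the hard-coded suffix list is an illustrative assumption, not the morphological analyzer behind the results reported below:

```python
# Split a word into (root, suffix) using a small list of known suffixes.
SUFFIXES = ["ing", "ly", "ed", "s"]

def split_word(word):
    """Return (root, suffix); suffix is "" when no listed suffix matches."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)], suf
    return word, ""

print(split_word("quickly"))      # ('quick', 'ly')
print(split_word("randomizing"))  # ('randomiz', 'ing')
```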

Morphological Features

• Typical morphological analyzers produce multiple possible splits
• “Gastroenteritis” ???

• Achieves 96.45% on the Brown Corpus

Inference in an HMM

• Compute the probability of a given observation sequence (see the sketch below)

• Given an observation sequence, compute the most likely hidden state sequence

• Given an observation sequence and a set of possible models, which model most closely fits the data?
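For the first problem, a sketch of the forward algorithm, which sums over hidden state sequences to get Pr(o_1, …, o_T); the dictionary parameters below reuse the numbers from the “Viterbi Small Example” later in the slides:

```python
# Forward algorithm: alpha[s] = Pr(o_1..o_t, x_t = s), updated left to right.
def forward_prob(obs, states, init, trans, emit):
    """Return Pr(obs) under the HMM."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())

# Two-state example matching the "Viterbi Small Example" slides.
states = ["T", "F"]
init = {"T": 0.2, "F": 0.8}
trans = {"T": {"T": 0.7, "F": 0.3}, "F": {"T": 0.1, "F": 0.9}}
emit = {"T": {"T": 0.4, "F": 0.6}, "F": {"T": 0.9, "F": 0.1}}
print(forward_prob(["T", "F"], states, init, trans, emit))  # 0.0336+0.0024+0.0432+0.0648 = 0.144
```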

David Blei

Viterbi Algorithm

δ_j(t) = max over x_1, …, x_(t-1) of Pr(x_1, …, x_(t-1), o_1, …, o_(t-1), x_t = j, o_t)

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.

[Trellis figure: states x_1, …, x_(t-1), j; observations o_1, …, o_(t-1), o_t, o_(t+1), …, o_T]

Viterbi Algorithm

δ_j(t) = max over x_1, …, x_(t-1) of Pr(x_1, …, x_(t-1), o_1, …, o_(t-1), x_t = j, o_t)

Recursive computation:

δ_j(t+1) = max_i δ_i(t) a_ij b_j(o_(t+1))
ψ_j(t+1) = argmax_i δ_i(t) a_ij b_j(o_(t+1))

where a_ij is the state transition probability and b_j(o_(t+1)) is the “emission” probability.

[Trellis figure: states x_1, …, x_(t-1), x_t, x_(t+1); observations o_1, …, o_T]

Viterbi Algorithm

X̂_T = argmax_j δ_j(T)
X̂_t = ψ_(X̂_(t+1))(t+1)
Pr(X̂) = max_i δ_i(T)

Compute the most likely state sequence by working backwards.

[Trellis figure: states x_1, …, x_(t-1), x_t, x_(t+1), …, x_T; observations o_1, …, o_T]
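A sketch of the recursion and backtracking above as code; the parameter names (init, trans, emit) and the dictionary representation are illustrative assumptions:

```python
# Viterbi: delta holds the best path score into each state,
# psi records the argmax predecessor for backtracking.
def viterbi(obs, states, init, trans, emit):
    delta = [{s: init[s] * emit[s][obs[0]] for s in states}]
    psi = [{}]
    for o in obs[1:]:
        d, p = {}, {}
        for j in states:
            # delta_j(t+1) = max_i delta_i(t) * a_ij * b_j(o_{t+1})
            best_i = max(states, key=lambda i: delta[-1][i] * trans[i][j])
            d[j] = delta[-1][best_i] * trans[best_i][j] * emit[j][o]
            p[j] = best_i
        delta.append(d)
        psi.append(p)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path)), delta[-1][last]

# Two-state example from the next slides: best path is (F, F) with score ~0.0648.
states = ["T", "F"]
init = {"T": 0.2, "F": 0.8}
trans = {"T": {"T": 0.7, "F": 0.3}, "F": {"T": 0.1, "F": 0.9}}
emit = {"T": {"T": 0.4, "F": 0.6}, "F": {"T": 0.9, "F": 0.1}}
print(viterbi(["T", "F"], states, init, trans, emit))
```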

Viterbi Small Example

[Diagram: x1 → x2, with emissions x1 → o1 and x2 → o2]

Pr(x1=T) = 0.2
Pr(x2=T | x1=T) = 0.7
Pr(x2=T | x1=F) = 0.1
Pr(o=T | x=T) = 0.4
Pr(o=T | x=F) = 0.9
Observations: o1=T, o2=F

Brute Force:
Pr(x1=T, x2=T, o1=T, o2=F) = 0.2 × 0.4 × 0.7 × 0.6 = 0.0336
Pr(x1=T, x2=F, o1=T, o2=F) = 0.2 × 0.4 × 0.3 × 0.1 = 0.0024
Pr(x1=F, x2=T, o1=T, o2=F) = 0.8 × 0.9 × 0.1 × 0.6 = 0.0432
Pr(x1=F, x2=F, o1=T, o2=F) = 0.8 × 0.9 × 0.9 × 0.1 = 0.0648

Pr(X1,X2 | o1=T,o2=F) ∝ Pr(X1,X2 , o1=T,o2=F)

Viterbi Small Example

[Diagram: x1 → x2, with emissions x1 → o1 and x2 → o2]

X̂_2 = argmax_j δ_j(2)

δ_j(2) = max_i δ_i(1) a_ij b_j(o2)

δ_T(1) = Pr(x1=T) Pr(o1=T | x1=T) = 0.2 × 0.4 = 0.08
δ_F(1) = Pr(x1=F) Pr(o1=T | x1=F) = 0.8 × 0.9 = 0.72

δ_T(2) = max( δ_F(1) × Pr(x2=T | x1=F) Pr(o2=F | x2=T), δ_T(1) × Pr(x2=T | x1=T) Pr(o2=F | x2=T) )
       = max(0.72 × 0.1 × 0.6, 0.08 × 0.7 × 0.6) = 0.0432

δ_F(2) = max( δ_F(1) × Pr(x2=F | x1=F) Pr(o2=F | x2=F), δ_T(1) × Pr(x2=F | x1=T) Pr(o2=F | x2=F) )
       = max(0.72 × 0.9 × 0.1, 0.08 × 0.3 × 0.1) = 0.0648
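A quick check of the numbers above: enumerating the four joint probabilities by brute force reproduces the table, and the maximizer (F, F) matches δ_F(2) = 0.0648.

```python
# Enumerate Pr(x1, x2, o1=T, o2=F) for all four state sequences.
from itertools import product

init = {"T": 0.2, "F": 0.8}
trans = {"T": {"T": 0.7, "F": 0.3}, "F": {"T": 0.1, "F": 0.9}}
emit = {"T": {"T": 0.4, "F": 0.6}, "F": {"T": 0.9, "F": 0.1}}
obs = ["T", "F"]  # o1 = T, o2 = F

joint = {}
for x1, x2 in product("TF", repeat=2):
    joint[(x1, x2)] = init[x1] * emit[x1][obs[0]] * trans[x1][x2] * emit[x2][obs[1]]
print(joint)                      # 0.0336, 0.0024, 0.0432, 0.0648
print(max(joint, key=joint.get))  # ('F', 'F'), matching delta_F(2)
```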

Bayesian Analysis of a Markov Chain

[Diagram: Markov chain y1 → y2 → y3 → y4 → y5; transition parameters π0, π1 with iid priors]

Bayesian Analysis of an HMM

• Widely used in speech recognition, finance, bioinformatics, etc.

• The y’s are observed but the z’s (discrete) are not

• Combines a first-order dependence structure with a mixture model.

[Diagram: hidden chain z1 → z2 → z3 with emissions z1 → y1, z2 → y2, z3 → y3]

Bayesian HMM (continued)

this is generally intractable but the conditionals are OK
(each conditional depends on the priors…)

y_i ~ N(µ_{z_i}, 1)

Serfling's method
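A generative sketch of the model on these slides: a two-state first-order Markov chain z with Gaussian emissions y_i ~ N(µ_{z_i}, 1). The particular transition matrix, state means, and chain length are illustrative assumptions:

```python
# Simulate a hidden Markov chain z and Gaussian observations y.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],       # Pr(z_{i+1} | z_i)
              [0.2, 0.8]])
mu = np.array([-2.0, 2.0])      # state-specific means

z = [0]
for _ in range(99):
    z.append(rng.choice(2, p=P[z[-1]]))
z = np.array(z)
y = rng.normal(mu[z], 1.0)      # y_i ~ N(mu_{z_i}, 1)
print(z[:10], np.round(y[:10], 2))
```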

Bayesian HMM for High-Frequency Data

• Observations may not arrive regularly

• Elapsed time between observations may be related to state

• Finance: tick-level stock data

• Molecular biology: single molecule experiments

• Assume now that z’s follow a continuous-time first-order Markov chain

HF-HMM

[Diagram: hidden chain z1 → z2 → z3 with emissions y1, y2, y3 and elapsed times τ2, τ3 between observations]

For K=2, suppose the time z stays in state i is Exp(λi) distributed. Then:

For K>2, from state i, z transitions to state i+1 with probability βi and to state i-1 with probability 1-βi (i.e. birth & death, with reflecting boundaries)

eigendecomposition…
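A sketch of the eigendecomposition step for the continuous-time chain: with rate matrix Q, the transition matrix over an elapsed time τ is P(τ) = exp(Qτ) = V diag(e^{λτ}) V⁻¹. The two-state rate matrix below (exponential holding times with rates λ1, λ2) is an illustrative assumption:

```python
# Transition probabilities over elapsed time via eigendecomposition of Q.
import numpy as np

lam1, lam2 = 0.5, 1.5                       # leaving rates for states 1 and 2
Q = np.array([[-lam1,  lam1],
              [ lam2, -lam2]])

def transition_matrix(Q, tau):
    """P(tau) = V diag(exp(evals * tau)) V^{-1}."""
    evals, V = np.linalg.eig(Q)
    return V @ np.diag(np.exp(evals * tau)) @ np.linalg.inv(V)

print(np.round(transition_matrix(Q, 0.7), 4))   # rows sum to 1
```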

HF-HMM Priors

[Diagram: hidden chain z1 → z2 → z3, emissions y1, y2, y3, elapsed times τ2, τ3, with state-specific parameters and priors (µ0, µ1; λ0, λ1; p0, p1)]

Posteriors not available in closed form…

HF-HMM Gibbs Sampler

Use a Metropolis step for this one

Metropolis Within Gibbs

Let

Generate a candidate from:

Accept with probability:
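The slide's specific candidate distribution and acceptance probability are not reproduced in this transcript, so the following is a generic random-walk Metropolis step for a single parameter inside a Gibbs sweep, assuming log_cond returns that parameter's full conditional log density up to a constant:

```python
# One Metropolis step within a Gibbs sweep (symmetric random-walk proposal).
import numpy as np

def metropolis_step(theta, log_cond, step_sd=0.1, rng=np.random.default_rng()):
    candidate = theta + rng.normal(0.0, step_sd)         # generate a candidate
    log_accept = log_cond(candidate) - log_cond(theta)   # symmetric proposal cancels
    if np.log(rng.uniform()) < log_accept:               # accept with prob min(1, ratio)
        return candidate
    return theta

# Example: sample a parameter whose full conditional is N(1, 0.5^2).
log_cond = lambda t: -0.5 * ((t - 1.0) / 0.5) ** 2
theta = 0.0
for _ in range(1000):
    theta = metropolis_step(theta, log_cond)
print(round(theta, 3))
```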

HMM vs. MEMM (maximum entropy Markov model)

MEMM

• MEMM learns a single multiclass logistic regression model for yi | yi-1, xi

• Predict y1 from x1, then y2 from y1 and x2, etc.

• No reason for the features not to include xi-1, xi+1, etc.

[Table: training instances with columns ti, ti-1, f1, f2, f3, …, fd; rows for tags such as PER, LOC, … with example feature values 1, 2.7, 0, …]
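A minimal MEMM sketch, assuming scikit-learn is available: a single multiclass logistic regression for Pr(y_i | y_{i-1}, x_i), decoded greedily left to right. The toy features and tag set are illustrative assumptions:

```python
# MEMM as a logistic regression over (previous tag, current word) features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train = [[("John", "PER"), ("lives", "O"), ("in", "O"), ("Paris", "LOC")],
         [("Mary", "PER"), ("visits", "O"), ("London", "LOC")]]

def features(word, prev_tag):
    return {"word=" + word.lower(): 1, "prev=" + prev_tag: 1,
            "cap": int(word[0].isupper())}

X, y = [], []
for sent in train:
    prev = "<s>"
    for word, tag in sent:
        X.append(features(word, prev))
        y.append(tag)
        prev = tag

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

def greedy_decode(words):
    # Predict y1 from x1, then y2 from y1 and x2, etc.
    prev, out = "<s>", []
    for w in words:
        prev = clf.predict(vec.transform([features(w, prev)]))[0]
        out.append(prev)
    return out

print(greedy_decode(["Mary", "lives", "in", "London"]))
```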

Dependency Network

• Toutanova et al., 2003, use a “dependency network” and richer feature set

• Idea: using the “next” tag as well as the “previous” tag should improve tagging performance
• Need modified Viterbi to find most likely sequence

[Table: training instances with columns ti, ti-1, ti+1, f1, f2, …, fd; rows for tags such as PER, LOC, PER with example feature values 1, 2.7, 0, …]

Conditional Random Fields

• Dependency network does not consider the tag sequence in its entirety
• CRFs optimize model parameters with respect to the entire sequence
• More expensive optimization; increased flexibility and accuracy

From Logistic Regression to CRF

• Logistic regression: p(y | x) = (1/Z(x)) exp(θ_y · x)   (θ_y is a weight vector for class y)

• Or, equivalently: p(y | x) = (1/Z(x)) exp( Σ_{k=1}^K λ_k f_k(y, x) )   (each λ_k is a scalar weight on a feature function)

• Linear chain CRF: p(y | x) = (1/Z(x)) exp( Σ_t Σ_{k=1}^K λ_k f_k(y_t, y_{t-1}, x) )
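A sketch of the linear-chain CRF formula: score a tag sequence by summing λ_k f_k(y_t, y_{t-1}, x) over positions and features, normalizing by Z(x) computed here by brute-force enumeration (workable only for toy examples). The features and weights are illustrative assumptions:

```python
# Unnormalized CRF score and its brute-force normalization.
from itertools import product
from math import exp

TAGS = ["DT", "NN", "VB"]

def feature_vector(y_t, y_prev, x, t):
    # f_k(y_t, y_{t-1}, x): a couple of indicator features for illustration.
    return {("word", x[t], y_t): 1.0, ("trans", y_prev, y_t): 1.0}

def score(y, x, weights):
    s = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "<s>"
        for k, v in feature_vector(y[t], y_prev, x, t).items():
            s += weights.get(k, 0.0) * v
    return s

def prob(y, x, weights):
    num = exp(score(y, x, weights))
    Z = sum(exp(score(list(yy), x, weights)) for yy in product(TAGS, repeat=len(x)))
    return num / Z

weights = {("word", "the", "DT"): 2.0, ("word", "dog", "NN"): 2.0,
           ("trans", "DT", "NN"): 1.0}
x = ["the", "dog"]
print(round(prob(["DT", "NN"], x, weights), 3))
```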

CRF Parameter Estimation

• Conditional log likelihood: ℓ(λ) = Σ_i log p(y^(i) | x^(i))

• Regularized log likelihood: ℓ(λ) − Σ_k λ_k² / (2σ²)

• Maximize with conjugate gradient, BFGS, etc.

• POS tagging with 45 tags and 10^6 words: about 1 week of training

Sutton and McCallum (2006)

Skip-Chain CRF’s

Bock’s Results: POS Tagging

Penn Treebank

Bock’s Results: Named Entity

CoNLL-03 task