Part-of-Speech Tagging
• Assign grammatical tags to words
• Basic task in the analysis of natural language data
• Phrase identification, entity extraction, etc.
• Ambiguity: “tag” could be a noun or a verb
• “a tag is a part-of-speech label” – context resolves the ambiguity
POS Tagging Algorithms
• Rule-based taggers: large numbers of hand-crafted rules
• Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM.
[Diagram: hidden tag sequence tag1 → tag2 → tag3, emitting word1, word2, word3]
The Brown Corpus
• Comprises about 1 million English words
• HMMs first used for tagging on the Brown Corpus, 1967. Somewhat dated now.
• British National Corpus has 100 million words
Simple Charniak Model
[Diagram: tag sequence t1, t2, t3 with words w1, w2, w3]
• What about words that have never been seen before?
• Clever tricks for smoothing the number of parameters (aka priors)
some details…
Pr(tag i | word j) = n(i, j) / n(j)

  n(i, j) = number of times word j appears with tag i
  n(j) = number of times word j appears

For a previously unseen word:

Pr(tag i | unseen word) = m(i) / m

  m(i) = number of times a word that had never been seen gets tag i
  m = number of such occurrences in total

(a counting sketch follows below)
Test data accuracy on Brown Corpus = 91.51%
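A minimal sketch of these counting estimates, assuming the corpus is a list of (word, tag) pairs; the function and variable names are mine, and treating each word's first occurrence as "unseen" is one common reading of the m(i)/m fallback:

```python
from collections import Counter

def train_tag_given_word(tagged_corpus):
    """Estimate Pr(tag i | word j) = n(i, j) / n(j),
    with the unseen-word fallback m(i) / m."""
    word_tag = Counter()    # n(i, j): times word j appears with tag i
    word_count = Counter()  # n(j):    times word j appears
    unseen_tag = Counter()  # m(i):    times a first-occurrence word gets tag i
    seen = set()
    for w, t in tagged_corpus:
        if w not in seen:   # first occurrence: counts as "unseen"
            unseen_tag[t] += 1
            seen.add(w)
        word_tag[(w, t)] += 1
        word_count[w] += 1
    m = sum(unseen_tag.values())
    def p(tag, w):
        if w in word_count:
            return word_tag[(w, tag)] / word_count[w]
        return unseen_tag[tag] / m   # word never seen in training
    return p

p = train_tag_given_word([("a", "DT"), ("tag", "NN"), ("is", "VB"),
                          ("a", "DT"), ("tag", "VB")])
print(p("NN", "tag"))  # 0.5: "tag" appears once as NN, once as VB
```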
Morphological Features
• Knowledge that “quickly” ends in “ly” should help identify the word as an adverb
• “randomizing” -> “ing”
• Split each word into a root (“quick”) and a suffix (“ly”) (a toy version is sketched below)
[Diagram: tags t1, t2, t3 over roots r1, r2 and suffixes s1, s2]
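A toy version of that root/suffix split; the suffix list is illustrative, not from the slides, and a real morphological analyzer uses far richer rules:

```python
SUFFIXES = ["ing", "ly", "ed", "s"]  # illustrative list only

def split_word(word):
    """Return (root, suffix); suffix is '' if nothing matches."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)], suf
    return word, ""

print(split_word("quickly"))      # ('quick', 'ly')
print(split_word("randomizing"))  # ('randomiz', 'ing')
```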
Morphological Features
• Typical morphological analyzers produce multiple possible splits
• “Gastroenteritis” ???
• Achieves 96.45% on the Brown Corpus
Inference in an HMM
• Compute the probability of a given observation sequence
• Given an observation sequence, compute the most likely hidden state sequence
• Given an observation sequence and set of possible models, which model most closely fits the data?
(Credit: David Blei)
[Trellis diagram: observations o_1 … o_{t-1}, o_t, o_{t+1} … o_T over hidden states]
Viterbi Algorithm
δ_j(t) = max over x_1 … x_{t-1} of Pr(x_1 … x_{t-1}, o_1 … o_{t-1}, x_t = j, o_t)

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm
δ_j(t) = max over x_1 … x_{t-1} of Pr(x_1 … x_{t-1}, o_1 … o_{t-1}, x_t = j, o_t)

Recursive computation:

δ_j(t+1) = max_i δ_i(t) · a_ij · b_{j,o_{t+1}}
ψ_j(t+1) = argmax_i δ_i(t) · a_ij · b_{j,o_{t+1}}

where a_ij is the state transition probability and b_{j,o_{t+1}} is the “emission” probability.
Viterbi Algorithm
X̂_T = argmax_j δ_j(T)
X̂_t = ψ_{X̂_{t+1}}(t+1)
Pr(X̂) = max_i δ_i(T)

Compute the most likely state sequence by working backwards.
Viterbi Small Example
[Diagram: x1 → x2 with observations o1, o2]

Pr(x1=T) = 0.2
Pr(x2=T | x1=T) = 0.7
Pr(x2=T | x1=F) = 0.1
Pr(o=T | x=T) = 0.4
Pr(o=T | x=F) = 0.9
o1=T; o2=F
Brute Force:
Pr(x1=T, x2=T, o1=T, o2=F) = 0.2 × 0.4 × 0.7 × 0.6 = 0.0336
Pr(x1=T, x2=F, o1=T, o2=F) = 0.2 × 0.4 × 0.3 × 0.1 = 0.0024
Pr(x1=F, x2=T, o1=T, o2=F) = 0.8 × 0.9 × 0.1 × 0.6 = 0.0432
Pr(x1=F, x2=F, o1=T, o2=F) = 0.8 × 0.9 × 0.9 × 0.1 = 0.0648
Pr(X1, X2 | o1=T, o2=F) ∝ Pr(X1, X2, o1=T, o2=F)
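The four joint probabilities above can be reproduced by direct enumeration; a minimal sketch, where the dictionary encoding of the model is mine rather than from the slides:

```python
from itertools import product

# Model from the slide; states and observations are True/False.
p_x1 = {True: 0.2, False: 0.8}                  # Pr(x1)
p_x2 = {True: {True: 0.7, False: 0.3},          # Pr(x2 | x1=T)
        False: {True: 0.1, False: 0.9}}         # Pr(x2 | x1=F)
p_o  = {True: {True: 0.4, False: 0.6},          # Pr(o | x=T)
        False: {True: 0.9, False: 0.1}}         # Pr(o | x=F)
o1, o2 = True, False

for x1, x2 in product([True, False], repeat=2):
    joint = p_x1[x1] * p_o[x1][o1] * p_x2[x1][x2] * p_o[x2][o2]
    print(x1, x2, round(joint, 4))  # 0.0336, 0.0024, 0.0432, 0.0648
```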
Viterbi Small Example
X̂_2 = argmax_j δ_j(2), where δ_j(2) = max_i δ_i(1) a_ij b_{j,o_2}

δ_T(1) = Pr(x1=T) Pr(o1=T | x1=T) = 0.2 × 0.4 = 0.08
δ_F(1) = Pr(x1=F) Pr(o1=T | x1=F) = 0.8 × 0.9 = 0.72

δ_T(2) = max( δ_F(1) Pr(x2=T | x1=F) Pr(o2=F | x2=T), δ_T(1) Pr(x2=T | x1=T) Pr(o2=F | x2=T) )
       = max(0.72 × 0.1 × 0.6, 0.08 × 0.7 × 0.6) = 0.0432

δ_F(2) = max( δ_F(1) Pr(x2=F | x1=F) Pr(o2=F | x2=F), δ_T(1) Pr(x2=F | x1=T) Pr(o2=F | x2=F) )
       = max(0.72 × 0.9 × 0.1, 0.08 × 0.3 × 0.1) = 0.0648
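For a cross-check, here is a compact Viterbi sketch (my encoding: state/observation 0 = T, 1 = F); on this model it reproduces δ_F(2) = 0.0648 and the sequence (F, F):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely HMM state sequence.
    pi[i] = Pr(x1=i); A[i,j] = Pr(x_{t+1}=j | x_t=i); B[i,k] = Pr(o=k | x=i)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # delta[t, j] as defined above
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # scores[i, j] = delta_i(t-1) * a_ij * b_{j, o_t}
        scores = delta[t - 1][:, None] * A * B[:, obs[t]]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):      # work backwards through psi
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

pi = np.array([0.2, 0.8])
A  = np.array([[0.7, 0.3], [0.1, 0.9]])
B  = np.array([[0.4, 0.6], [0.9, 0.1]])
print(viterbi(pi, A, B, [0, 1]))       # ([1, 1], 0.0648), i.e. (F, F)
```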
Bayesian Analysis of an HMM
• Widely used in speech recognition, finance, bioinformatics, etc.
• The y’s are observed but the z’s (discrete) are not
• Combines a first-order dependence structure with a mixture model.
[Diagram: hidden chain z1 → z2 → z3 with observations y1, y2, y3]
Bayesian HMM (continued)
This is generally intractable, but the conditionals are OK (each depends on the priors):
y_i ~ N(µ_{z_i}, 1)
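The slides don't show the conditional samplers themselves; as an illustration, here is a minimal Gibbs sweep over the z's, assuming a known initial distribution pi, transition matrix A, and emission means mu (all hypothetical names). The conditional for each µ_k is again tractable and, as noted, depends on the priors:

```python
import numpy as np

def gibbs_sweep_z(z, y, pi, A, mu):
    """One Gibbs sweep: each z_i is drawn from its full conditional given
    its neighbors in the chain and the likelihood y_i ~ N(mu[z_i], 1)."""
    n, K = len(y), len(pi)
    for i in range(n):
        p = pi.copy() if i == 0 else A[z[i - 1]].copy()  # link to previous
        if i < n - 1:
            p *= A[:, z[i + 1]]                          # link to next
        p *= np.exp(-0.5 * (y[i] - mu) ** 2)             # N(mu_k, 1) term
        p /= p.sum()
        z[i] = np.random.choice(K, p=p)
    return z
```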
Bayesian HMM for High-Frequency Data
• Observations may not arrive regularly
• Elapsed time between observations may be related to state
• Finance: tick-level stock data
• Molecular biology: single molecule experiments
• Assume now that z’s follow a continuous-time first-order Markov chain
HF-HMM
[Diagram: continuous-time hidden chain z1 → z2 → z3 with observations y1, y2, y3 and elapsed times τ2, τ3]
For K=2, suppose the time z stays in state i is Exponential(λ_i). Then:
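The slide's own equation is not shown here; for reference, the textbook closed form for the two-state chain's transition probabilities over an elapsed time τ is:

$$
P(z_{t+\tau}=2 \mid z_t=1) = \frac{\lambda_1}{\lambda_1+\lambda_2}\left(1 - e^{-(\lambda_1+\lambda_2)\tau}\right),
\qquad
P(z_{t+\tau}=1 \mid z_t=2) = \frac{\lambda_2}{\lambda_1+\lambda_2}\left(1 - e^{-(\lambda_1+\lambda_2)\tau}\right)
$$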
For K>2, from state i, z transitions to state i+1 with probability β_i and to state i-1 with probability 1-β_i (i.e. birth & death, reflecting boundaries)
eigendecomposition…
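A sketch of that step with numpy, under the birth-and-death parameterization above; the function name and example rates are illustrative. It builds the generator Q and computes P(τ) = exp(Qτ) through Q's eigenvectors:

```python
import numpy as np

def transition_matrix(lam, beta, tau):
    """P(tau) = exp(Q * tau) for a birth-death generator, via eigendecomposition.
    lam[i] = rate of leaving state i; beta[i] = prob. of moving up (i -> i+1).
    Reflecting boundaries: beta[0] = 1, beta[K-1] = 0."""
    K = len(lam)
    Q = np.zeros((K, K))
    for i in range(K):
        if i + 1 < K:
            Q[i, i + 1] = lam[i] * beta[i]
        if i - 1 >= 0:
            Q[i, i - 1] = lam[i] * (1 - beta[i])
        Q[i, i] = -Q[i].sum()        # generator rows sum to zero
    w, V = np.linalg.eig(Q)          # Q = V diag(w) V^{-1}
    P = V @ np.diag(np.exp(w * tau)) @ np.linalg.inv(V)
    return P.real                    # discard numerical imaginary noise

P = transition_matrix(lam=[1.0, 2.0, 1.5], beta=[1.0, 0.4, 0.0], tau=0.5)
print(P.sum(axis=1))  # rows of a stochastic matrix sum to 1
```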
MEMM
• MEMM learns a single multiclass logistic regression model for yi | yi-1, xi
• Predict y1 from x1, then y2 from y1 and x2, etc. (a decoding sketch follows the table below)
• No reason for the features not to include xi-1, xi+1, etc.
ti  | ti-1 | f1 | f2 | f3  | … | fd
PER | LOC  | T  | 1  | 2.7 | 0 | …
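As referenced above, decoding runs left to right, feeding each predicted tag into the next step. A minimal sketch, where clf (a trained multiclass logistic regression with a predict_proba-style interface), phi (the feature builder), and tags are all hypothetical placeholders:

```python
def memm_decode(xs, clf, phi, tags, start="<s>"):
    """Greedy left-to-right MEMM decoding: predict y1 from x1,
    then y2 from y1 and x2, etc."""
    ys = []
    prev = start
    for x in xs:
        probs = clf.predict_proba([phi(prev, x)])[0]  # Pr(y_i | y_{i-1}, x_i)
        prev = tags[probs.argmax()]                   # commit to the best tag
        ys.append(prev)
    return ys
```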
Dependency Network
• Toutanova et al., 2003, use a “dependency network” and a richer feature set
• Idea: using the “next” tag as well as the “previous” tag should improve tagging performance
• Need a modified Viterbi to find the most likely sequence
ti  | ti-1 | ti+1 | f1 | f2  | … | fd
PER | LOC  | PER  | 1  | 2.7 | 0 | …
Conditional Random Fields
• Dependency network does not consider the tag sequence in its entirety
• CRFs optimize model parameters with respect to the entire sequence
• More expensive optimization; increased flexibility and accuracy
From Logistic Regression to CRF
• Logistic regression:

  p(y | x) = (1/Z(x)) exp( λᵀ f(y, x) )   [λ and f(y, x) are vectors]

• Or:

  p(y | x) = (1/Z(x)) exp( Σ_k λ_k f_k(y, x) )   [each λ_k f_k(y, x) is a scalar]

• Linear chain CRF:

  p(y | x) = (1/Z(x)) exp( Σ_t Σ_{k=1..K} λ_k f_k(y_t, y_{t-1}, x) )
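The normalizer Z(x) sums over all tag sequences, but the chain structure lets a forward recursion compute it in O(T·|Y|²) time. A log-space sketch, assuming precomputed per-position scores log_psi (my notation, not from the slides):

```python
import numpy as np
from scipy.special import logsumexp

def log_Z(log_psi):
    """log Z(x) for a linear-chain CRF.
    log_psi[t, i, j] = sum_k lambda_k f_k(y_t = j, y_{t-1} = i, x);
    t = 0 reads row 0 as a dummy start state (an illustrative convention)."""
    T, Y = log_psi.shape[0], log_psi.shape[2]
    alpha = log_psi[0, 0]                  # start -> first tag
    for t in range(1, T):
        # alpha_j(t) = logsumexp_i [ alpha_i(t-1) + log_psi[t, i, j] ]
        alpha = logsumexp(alpha[:, None] + log_psi[t], axis=0)
    return logsumexp(alpha)
```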
CRF Parameter Estimation
• Conditional log likelihood
• Regularized log likelihood
• Conjugate gradient, BFGS, etc.
• POS tagging, 45 tags, 10^6 words ≈ 1 week of training
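The slide's formulas for the two likelihoods are not shown here; the standard forms, assuming the usual Gaussian penalty with variance σ² on the weights, are:

$$
\ell(\lambda) = \sum_i \log p\left(y^{(i)} \mid x^{(i)}; \lambda\right),
\qquad
\ell_{\mathrm{reg}}(\lambda) = \ell(\lambda) - \sum_k \frac{\lambda_k^2}{2\sigma^2}
$$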