
Page 1: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Conditional Markov Models: MaxEnt Tagging and MEMMs

William W. Cohen

Feb 8 IE Lecture

Page 2: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Top ten answers to “Is it cold enough for you?”

• No, but don’t change it just for me, the others might like it this way.

• No, in fact I need it to drop another 40 degrees to improve statistical significance of the results in my upcoming grant proposal to ExxonMobil refuting the theory of global warming.

• No, but …
• …
• …
• …

Page 3: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Projects and Critiques

• I believe I’ve responded to all submitted project proposals (even if briefly).
  – If you haven’t heard from me, check in after class.

• I believe two people have not submitted project proposals.
  – If you haven’t done that, definitely check in with me.

• Everyone: please look over these proposals – even if you’re pretty sure what you’re going to do.

• Likewise if you’re behind on critiques.
  – Note: ZMM does not have a happy ending.

• Reminder: next week, form teams.
  – Singleton teams are not encouraged.

Page 4: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Sample Critiques

D. Freitag and N. Kushmerick. Boosted wrapper induction. In Proc. of the 17th National Conference on Artificial Intelligence (AAAI-2000), pages 577–583, 2000.

… There were several things that made this a strong paper. First, they were clear about where they were starting from (wrapper induction) and what their contribution was. Also, they described their algorithm with sufficient generality and clarity for a reader to implement it or adapt it to a different problem. They did not do much “feature engineering” or plug many outside resources into their system. I liked this approach, since they showed that it can be done easily in their framework, but also that they don’t need to do massive feature engineering to gain a performance improvement over other rule-based systems. I always like when a paper abstracts its novel pieces away from the problem at hand and presents them more theoretically than would be necessary to simply communicate a problem solution. The paper clearly has a very different feel than, say, Borthwick et al. (1998), which focused more on a systems view and delved deeply into the choice of features in the system.

Borthwick, A., Sterling, J., Agichtein, E., & Grishman, R. (1998). Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora, New Brunswick, New Jersey. Association for Computational Linguistics.

… They describe the use of many knowledge sources in addition to standard features for named-entity recognition, but it’s unclear how much each of these helps performance. As a result, if researchers were to build a similar system based on this paper, they would probably have to discover these results themselves. It’s interesting how they didn’t do as well as the other systems on the surprise domain of the test data. Assuming they would be evaluated on data of the same domain as the training data, they did not hesitate to include domain-specific features in their model. This is in contrast to the Jansche and Abney paper, whose authors focused on features that would avoid overfitting on feature instances that are common in the training data.

… I was a bit disappointed that the authors stopped at L=4 in the table, when the first column was still increasing in accuracy – I wanted to see exactly how far it would go. (It also made me wonder if future work was possible to develop heuristics that would allow a larger window, with sub-exponential increase in training time, perhaps still using only L words, but also adding a distance parameter, such that you could have something like <pre-prefix, distance> <prefix> <suffix>.) …

They also mention that they might use hundreds of these high-precision, low-recall patterns. While this might work well empirically, it just doesn't seem very elegant. There is a certain appeal to looking for simple rules to identify fields, especially in highly structured text like certain CGI-generated web pages. But there is also an appeal to regularization: it does not seem that memorizing hundreds of one- or two-off rules is the best way to learn what's really going on in a document.

Page 5: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Sample Critiques

======================================
Information Extraction from Voicemail Transcripts - Jansche & Abney
======================================

One thing that was not clear to me, and hence I did not like, was the explanation of the exact feature representation for each task. For example, the authors repeatedly mention "a small set of lexical features" and a "handful of common first names" but do not explain where they came from, who selected them, or by what method …

I was wondering why the authors decided to use classification for predicting length of the caller phrases/names as opposed to regression. I realize that they have argued that these lengths are discrete valued and therefore they chose classification, but the length attribute has the significance of order in its values …

… There were two things that really bothered me about the evaluation, however. One thing was that in their numbers they included the empty calls (hangups). … Secondly, and perhaps a larger issue for me, is that I'm not clear if the hand-crafted rules were made before or after they looked at the corpora they were using. It seems to me that if you hand-craft rules after looking at your data, what you are in essence doing is training on your test data. This makes it seem a bit unfair to compare these results against strictly machine-learning based approaches.

I particularly like one of the concluding points of the article: the authors clearly demonstrated that generic NE extractors cannot be used on every task. In the phone number extraction task, longer-range patterns are necessary, but bi- or tri-gram features cannot reflect that. Their model, with features specifically designed for the task, is a clear winner.

Page 6: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Review: Hidden Markov Models

• Efficient dynamic programming algorithms exist for:
  – Finding Pr(S)
  – The highest-probability path P that maximizes Pr(S,P) (Viterbi)

• Training the model
  – (Baum-Welch algorithm)

[Figure: an example HMM with four states S1–S4, transition probabilities on the arcs (0.9, 0.5, 0.5, 0.8, 0.2, 0.1), and a per-state emission distribution over the symbols A and C (0.6/0.4, 0.3/0.7, 0.5/0.5, 0.9/0.1).]
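For reference, here is a minimal sketch of the Viterbi dynamic program mentioned above; it is not from the slides, and the NumPy array layout (start_p, trans_p, emit_p) is an illustrative assumption.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence.
    start_p[s], trans_p[s, s'] and emit_p[s, o] are probabilities (NumPy arrays)."""
    n_states, T = len(start_p), len(obs)
    delta = np.zeros((T, n_states))            # best score of any path ending in state s at time t
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            scores = delta[t - 1] * trans_p[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] * emit_p[s, obs[t]]
    # follow backpointers from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```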

Page 7: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

HMM for Segmentation

• Simplest Model: One state per entity type

Page 8: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

HMM Learning

• Manually pick the HMM’s graph (e.g., simple model, fully connected)

• Learn transition probabilities: Pr(si|sj)

• Learn emission probabilities: Pr(w|si)

Page 9: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Learning model parameters

• When training data defines a unique path through the HMM:

  – Transition probabilities:
    Pr(state j | state i) = (number of transitions from i to j) / (total number of transitions from state i)

  – Emission probabilities:
    Pr(symbol k | state i) = (number of times k is generated from i) / (total number of emissions from state i)

• When training data defines multiple paths:
  – A more general EM-like algorithm (Baum-Welch)
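When the path is fully observed, these ratios are just counts. A small illustrative sketch of that counting (the data layout and names are assumptions, not from the lecture):

```python
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Maximum-likelihood HMM parameters from fully labeled sequences.
    tagged_sentences: list of sentences, each a list of (word, state) pairs."""
    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sentences:
        for i, (word, state) in enumerate(sent):
            emit_counts[state][word] += 1
            if i > 0:
                trans_counts[sent[i - 1][1]][state] += 1
    # normalize counts into probabilities
    trans_p = {s: {t: c / sum(d.values()) for t, c in d.items()}
               for s, d in trans_counts.items()}
    emit_p = {s: {w: c / sum(d.values()) for w, c in d.items()}
              for s, d in emit_counts.items()}
    return trans_p, emit_p
```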

Page 10: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

What is a “symbol” ???

Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?

4601 => “4601”, “9999”, “9+”, “number”, … ?

[Figure: a taxonomy of abstraction levels for symbols –
  All
    Numbers: 3-digits (000 .. 999), 5-digits (00000 .. 99999), Others (0..99, 0000..9999, 000000..)
    Words: Chars (A .. z), Multi-letter (aa ..)
    Delimiters: . , / - + ? #]

Datamold: choose best abstraction level using holdout set

Page 11: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

What is a symbol?

Bikel et al. mix symbols from two abstraction levels.

Page 12: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

What is a symbol?

Ideally we would like to use many, arbitrary, overlapping features of words.

[Figure: graphical model with states St-1, St, St+1 and observations Ot-1, Ot, Ot+1. Example features of the observed word (here “Wisniewski”, part of a noun phrase): identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …]

Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …

Page 13: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Stupid HMM tricks

[Figure: a two-state HMM – from the start state, go to the “red” state with Pr(red) or the “green” state with Pr(green); Pr(green|green) = 1 and Pr(red|red) = 1, so each state then loops on itself.]

Page 14: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Stupid HMM tricks

[Figure: the same two-state HMM as on Page 13.]

Pr(y|x) = Pr(x|y) · Pr(y) / Pr(x)

argmax_y Pr(y|x) = argmax_y Pr(x|y) · Pr(y)
                 = argmax_y Pr(y) · Pr(x1|y) · Pr(x2|y) · … · Pr(xm|y)

Pr(“I voted for Ralph Nader” | ggggg) = Pr(g) · Pr(I|g) · Pr(voted|g) · Pr(for|g) · Pr(Ralph|g) · Pr(Nader|g)

Page 15: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

HMM’s = sequential NB

Page 16: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

From NB to Maxent

Naive Bayes, written as a conditional model:

Pr(y | x) = (1/Z) · Pr(y) · ∏_j Pr(w_{d_j} | y),   where w_{d_j} is the word at position j of x

Define one indicator feature per (position j, word k) combination:

f_{j,k}(x) = [word k appears at position j of x ? 1 : 0]

With α_{y,i} = Pr(w_k | y) for the i-th (j,k) combination, plus a bias feature f_0(x) = 1 with α_{y,0} = Pr(y), this becomes

Pr(y | x) = (1/Z) · ∏_i α_{y,i}^{f_i(x)}

Page 17: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

From NB to Maxent

Pr(y | x) = (1/Z) · ∏_i α_{y,i}^{f_i(x)},   where f_{j,k}(x) = [word k appears at position j of x ? 1 : 0]

Page 18: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

From NB to Maxent

Likelihood of the data x_1, …, x_n:

Pr(x_1, …, x_n) = ∏_{i=1..n} α_0 · ∏_j α_j^{f_j(x_i)}

Learning: set alpha parameters to maximize this: the ML model of the data, given we’re using the same functional form as NB.

Turns out this is the same as maximizing entropy of p(y|x) over all distributions.
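As a concrete (assumed) illustration: scikit-learn's LogisticRegression maximizes this same regularized conditional log-likelihood for an exponential-form model; the toy documents and the binary word-presence features below are made up to stand in for the f_i(x).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy documents and labels; data and feature choice are illustrative only.
docs = ["cheap pills now", "meeting at noon", "cheap meds", "lunch meeting today"]
labels = ["spam", "ham", "spam", "ham"]

# Binary word-presence features play the role of the f_i(x) above.
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)

# Fitting maximizes the (regularized) conditional log-likelihood, i.e. the maxent criterion.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
print(clf.predict_proba(vec.transform(["cheap lunch"])))
```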

Page 19: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MaxEnt Comments

– Implementation:
  • All methods are iterative.
  • Numerical issues (underflow, rounding) are important.
  • For NLP-like problems with many features, modern gradient-like or Newton-like methods work well – sometimes better(?) and faster than GIS and IIS.

– Smoothing:
  • Typically maxent will overfit the data if there are many infrequent features.
  • Common solutions: discard low-count features; early stopping with a holdout set; a Gaussian prior centered on zero to limit the size of the alphas (i.e., optimize log likelihood minus a sum of squared alphas).
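Both smoothing knobs map onto standard library options; a sketch under assumed settings (the specific thresholds are illustrative, and these objects would be fit exactly as in the earlier sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Discard low-count features: keep only words occurring in at least 10 documents.
vec = CountVectorizer(binary=True, min_df=10)

# penalty="l2" corresponds to a Gaussian prior centered on zero over the alphas;
# a smaller C is a tighter prior (a larger penalty on big weights).
clf = LogisticRegression(penalty="l2", C=0.5, max_iter=1000)
```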

Page 20: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MaxEnt Comments

– Performance:
  • Good MaxEnt methods are competitive with linear SVMs and other state-of-the-art classifiers in accuracy.
  • Can't as easily extend to higher-order interactions (e.g., kernel SVMs, AdaBoost) – but see [Lafferty, Zhu, Liu ICML 2004].
  • Training is relatively expensive.

– Embedding in a larger system:
  • MaxEnt optimizes Pr(y|x), not error rate.

Page 21: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MaxEnt Comments

– MaxEnt competitors:
  • Model Pr(y|x) with Pr(y|score(x)), using a score from SVMs, NB, …
  • Regularized Winnow, BPETs, …
  • Ranking-based methods that estimate whether Pr(y1|x) > Pr(y2|x).

– Things I don't understand:
  • Why don't we call it logistic regression?
  • Why is it always used to estimate the density of (y,x) pairs rather than a separate density for each class y?
  • When are its confidence estimates reliable?

Page 22: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

What is a symbol?

Ideally we would like to use many, arbitrary, overlapping features of words.

[Figure: the same graphical model and word-feature list as on Page 12.]

Page 23: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

What is a symbol?

[Figure: the same graphical model and word-feature list as on Page 12.]

Idea: replace the generative model in the HMM with a maxent model, where state depends on observations:

Pr(s_t | x_t, ...)

Page 24: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

What is a symbol?

[Figure: the same graphical model and word-feature list as on Page 12.]

Idea: replace the generative model in the HMM with a maxent model, where state depends on observations and previous state:

Pr(s_t | x_t, s_{t-1}, ...)
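For illustration, here is one way such a local maxent model might be trained, with the previous state included as just another feature. The feature names, the toy data, and the use of scikit-learn's LogisticRegression as the maxent learner are assumptions, not details from the slides.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(words, i, prev_tag):
    """Arbitrary, overlapping features of position i, plus the previous state."""
    w = words[i]
    return {
        "word=" + w.lower(): 1,
        "ends_in_ski": int(w.lower().endswith("ski")),
        "is_capitalized": int(w[0].isupper()),
        "prev_tag=" + prev_tag: 1,   # the previous state is just another feature
    }

# Toy tagged data (word, tag); purely illustrative.
train = [[("When", "O"), ("will", "O"), ("prof", "O"), ("Cohen", "B"), ("post", "O")]]

X_dicts, y = [], []
for sent in train:
    words = [w for w, _ in sent]
    prev = "START"
    for i, (_, tag) in enumerate(sent):
        X_dicts.append(token_features(words, i, prev))
        y.append(tag)
        prev = tag

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_dicts), y)
```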

Page 25: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

What is a symbol?

[Figure: the same graphical model and word-feature list as on Page 12.]

Idea: replace the generative model in the HMM with a maxent model, where state depends on observations and previous state history:

Pr(s_t | x_t, s_{t-1}, s_{t-2}, ...)

Page 26: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Ratnaparkhi’s MXPOST

• Sequential learning problem: predict POS tags of words.

• Uses MaxEnt model described above.

• Rich feature set.

• To smooth, discard features occurring < 10 times.

Page 27: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MXPOST

Page 28: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MXPOST: learning & inference

[Figure: MXPOST learning and inference; callouts: GIS (training) and feature selection.]

Page 29: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Alternative inference schemes

Page 30: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MXPost inference

Page 31: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Inference for MENE

[Figure: a lattice of candidate B/I/O tags, one column per word of the sentence below.]

When will prof Cohen post the notes …

Page 32: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Inference for MXPOST

[Figure: the same B/I/O lattice over the sentence below, now with arcs connecting tags at adjacent positions.]

When will prof Cohen post the notes …

Pr(y|x) = ∏_i Pr(y_i | x, y_{i-1}, …, y_{i-k})
        = ∏_i Pr(y_i | x, y_{i-1})

(Approx view): find best path, weights are now on arcs from state to state.

Page 33: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Inference for MXPOST

[Figure: the same B/I/O lattice with arcs, over the sentence below.]

When will prof Cohen post the notes …

More accurately: find total flow to each node, weights are now on arcs from state to state.

α_t(y) = Σ_{y'} α_{t-1}(y') · Pr(Y_t = y | x_t, Y_{t-1} = y')
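A minimal sketch of this flow recursion, assuming a function local_prob(y, x_t, y_prev) that returns the local classifier's Pr(y | x_t, y_prev); the function name and the "START" convention are assumptions for illustration.

```python
def total_flow(sentence, tags, local_prob):
    """Forward 'flow' alpha_t(y) for a conditional (MEMM-style) tagger.
    local_prob(y, x_t, y_prev) is assumed to return Pr(y | x_t, y_prev)."""
    alpha = [{y: local_prob(y, sentence[0], "START") for y in tags}]
    for t in range(1, len(sentence)):
        alpha.append({
            y: sum(alpha[t - 1][y_prev] * local_prob(y, sentence[t], y_prev)
                   for y_prev in tags)
            for y in tags
        })
    return alpha
```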

Page 34: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Inference for MXPOST

[Figure: the same B/I/O lattice over the sentence below, with hyperedges connecting pairs of previous tags to the next tag.]

When will prof Cohen post the notes …

Pr(y|x) = ∏_i Pr(y_i | x, y_{i-1}, …, y_{i-k})
        = ∏_i Pr(y_i | x, y_{i-1}, y_{i-2})

Find best path? tree? Weights are on hyperedges

Page 35: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Inference for MxPOST

[Figure: a beam-search lattice over the sentence “When will prof Cohen post the notes …”; each node is a tag-history state such as iI, iO, oI, oO, expanded left to right.]

Beam search is an alternative to Viterbi: at each stage, find all children, score them, and discard all but the top n states.
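A small sketch of that pruning loop, again assuming a hypothetical local_prob(y, x_t, y_prev) for the per-position maxent scores:

```python
def beam_search(sentence, tags, local_prob, beam_size=3):
    """Greedy left-to-right beam decoding for a conditional tagger.
    local_prob(y, x_t, y_prev) is assumed to return Pr(y | x_t, y_prev)."""
    beam = [([], 1.0)]  # (tag history, score)
    for x_t in sentence:
        children = []
        for history, score in beam:
            y_prev = history[-1] if history else "START"
            for y in tags:
                children.append((history + [y], score * local_prob(y, x_t, y_prev)))
        # keep only the top-n scoring histories
        beam = sorted(children, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0]  # best surviving history and its score
```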

Page 36: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MXPost results

• State-of-the-art accuracy (for 1996)

• Same approach used successfully for several other sequential classification steps of a stochastic parser (also state-of-the-art).

• Same (or similar) approaches used for NER by Borthwick, Malouf, Manning, and others.

Page 37: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MEMMs

• Basic difference from ME tagging:
  – ME tagging: previous state is a feature of the MaxEnt classifier.
  – MEMM: build a separate MaxEnt classifier for each state.
    • Can build any HMM architecture you want; e.g., parallel nested HMMs, etc.
    • Data is fragmented: examples where the previous tag is “proper noun” give no information about learning tags when the previous tag is “noun”.
  – Mostly a difference in viewpoint.
  – MEMM does allow the possibility of “hidden” states and Baum-Welch-like training.
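A minimal sketch of the MEMM variant, assuming a feature_fn(words, i) helper and using scikit-learn's LogisticRegression as the per-state MaxEnt learner (all names are illustrative); note how the training data gets fragmented by previous tag, as described above.

```python
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_memm(tagged_sentences, feature_fn):
    """One MaxEnt (logistic regression) classifier per previous state.
    tagged_sentences: list of [(word, tag), ...]; feature_fn(words, i) -> dict."""
    buckets = defaultdict(lambda: ([], []))  # prev_tag -> (feature dicts, next tags)
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        prev = "START"
        for i, (_, tag) in enumerate(sent):
            X, y = buckets[prev]
            X.append(feature_fn(words, i))
            y.append(tag)
            prev = tag
    models = {}
    for prev, (X, y) in buckets.items():
        vec = DictVectorizer()
        # Each classifier sees only the examples whose previous tag matches:
        # this is the data fragmentation the slide points out.
        models[prev] = (vec, LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y))
    return models
```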

Page 38: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MEMM task: FAQ parsing

Page 39: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MEMM features

Page 40: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

MEMMs

Page 41: Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen Feb 8 IE Lecture

Some interesting points to ponder

• “Easier to think of observations as part of the arcs, rather than the states.”

• FeatureHMM works surprisingly(?) well.

• Both approaches allow Pr(y_i | x, y_{i-1}, …) to be determined by arbitrary features of the history.
  – “Factored” MEMM