
Page 1: Maxent Models and Discriminative Estimation

Maxent Models and Discriminative Estimation

Generative vs. Discriminative models

Christopher Manning

Page 2: Maxent Models and Discriminative Estimation

Introduction

• So far we’ve looked at “generative models”
  – Language models, Naive Bayes

• But there is now much use of conditional or discriminative probabilistic models in NLP, Speech, IR (and ML generally)

• Because:
  – They give high accuracy performance
  – They make it easy to incorporate lots of linguistically important features
  – They allow automatic building of language-independent, retargetable NLP modules

Page 3: Maxent Models and Discriminative Estimation

Joint vs. Conditional Models

• We have some data {(d, c)} of paired observations d and hidden classes c.

• Joint (generative) models place probabilities over both observed data and the hidden stuff (generate the observed data from hidden stuff):
  – All the classic StatNLP models:

• n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models

P(c,d)

Page 4: Maxent Models and Discriminative Estimation

Joint vs. Conditional Models

• Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data:

• Logistic regression, conditional loglinear or maximum entropy models, conditional random fields

• Also, SVMs, (averaged) perceptron, etc. are discriminative classifiers (but not directly probabilistic)

P(c|d)

Page 5: Maxent Models and Discriminative Estimation

Bayes Net/Graphical Models

• Bayes net diagrams draw circles for random variables, and lines for direct dependencies

• Some variables are observed; some are hidden
• Each node is a little classifier (conditional probability table) based on incoming arcs

[Diagrams: Naive Bayes – generative, class c with arcs to observations d1, d2, d3; Logistic Regression – discriminative, observations d1, d2, d3 with arcs to class c]

Page 6: Maxent Models and Discriminative Estimation

Conditional vs. Joint Likelihood

• A joint model gives probabilities P(d,c) and tries to maximize this joint likelihood.
  – It turns out to be trivial to choose weights: just relative frequencies.
• A conditional model gives probabilities P(c|d). It takes the data as given and models only the conditional probability of the class.
  – We seek to maximize conditional likelihood.
  – Harder to do (as we’ll see…)
  – More closely related to classification error.

Page 7: Maxent Models and Discriminative Estimation

Conditional models work well: Word Sense Disambiguation

• Even with exactly the same features, changing from joint to conditional estimation increases performance

• That is, we use the same smoothing and the same word-class features; we just change the numbers (parameters)

Training Set
  Objective     Accuracy
  Joint Like.   86.8
  Cond. Like.   98.5

Test Set
  Objective     Accuracy
  Joint Like.   73.6
  Cond. Like.   76.1

(Klein and Manning 2002, using Senseval-1 Data)

Page 8: Maxent Models and Discriminative Estimation

Discriminative Model Features

Making features from text for discriminative NLP models

Christopher Manning

Page 9: Maxent Models and Discriminative Estimation

Features

• In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict

• A feature is a function with a bounded real value: f: C × D → ℝ

Page 10: Maxent Models and Discriminative Estimation

Example features
  – f1(c, d) ≡ [c = LOCATION ∧ w−1 = “in” ∧ isCapitalized(w)]
  – f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
  – f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]

• Models will assign to each feature a weight:
  – A positive weight votes that this configuration is likely correct
  – A negative weight votes that this configuration is likely incorrect

LOCATION: in Québec
PERSON: saw Sue
DRUG: taking Zantac
LOCATION: in Arcadia
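These indicator features can be read directly as small predicates. Below is a minimal Python sketch (not from the slides); `datum` is a hypothetical dict holding the current and previous word, and the accented-character check is a simplified stand-in for hasAccentedLatinChar:

# Hypothetical sketch of the three example features as 0/1 functions of (class, datum).
def f1(c, d):
    # c = LOCATION AND previous word is "in" AND current word is capitalized
    return int(c == "LOCATION" and d["w_prev"] == "in" and d["w"][0].isupper())

def f2(c, d):
    # c = LOCATION AND current word has an accented Latin character (simplified check)
    return int(c == "LOCATION" and any(ch in "áéíóúâêîôûàèçëïüñ" for ch in d["w"]))

def f3(c, d):
    # c = DRUG AND current word ends with "c"
    return int(c == "DRUG" and d["w"].endswith("c"))

datum = {"w_prev": "in", "w": "Québec"}          # "... in Québec"
print(f1("LOCATION", datum), f2("LOCATION", datum), f3("DRUG", datum))  # -> 1 1 1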

Page 11: Maxent Models and Discriminative Estimation

Feature Expectations

• We will crucially make use of two expectations – actual or predicted counts of a feature firing:
  – Empirical count (expectation) of a feature
  – Model expectation of a feature
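The two sums themselves are not reproduced in the transcript; in standard maxent notation they are (a reconstruction following the usual definitions):

\[ E_{\hat p}[f_i] \;=\; \sum_{(c,d) \in \text{observed}(C,D)} f_i(c,d) \qquad\qquad E_{p}[f_i] \;=\; \sum_{(c,d) \in (C,D)} P(c,d)\, f_i(c,d) \]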

Page 12: Maxent Models and Discriminative Estimation

Features

• In NLP uses, usually a feature specifies (1) an indicator function – a yes/no boolean matching function – of properties of the input and (2) a particular class

  – fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]
  – They pick out a data subset and suggest a label for it.

• We will say that Φ(d) is a feature of the data d when, for each cj, the conjunction Φ(d) ∧ c = cj is a feature of the data-class pair (c, d)

Page 13: Maxent Models and Discriminative Estimation

Feature-Based Models
• The decision about a data point is based only on the features active at that point.

Text Categorization
  Data: BUSINESS: Stocks hit a yearly low …
  Features: {…, stocks, hit, a, yearly, low, …}
  Label: BUSINESS

Word-Sense Disambiguation
  Data: … to restructure bank:MONEY debt.
  Features: {…, w-1=restructure, w+1=debt, L=12, …}
  Label: MONEY

POS Tagging
  Data: DT JJ NN … / The previous fall …
  Features: {w=fall, t-1=JJ, w-1=previous}
  Label: NN

Page 14: Maxent Models and Discriminative Estimation

Example: Text Categorization (Zhang and Oles 2001)
• Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words)
• Tests on classic Reuters data set (and others)
  – Naïve Bayes: 77.0% F1
  – Linear regression: 86.0%
  – Logistic regression: 86.4%
  – Support vector machine: 86.5%
• Paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods (not used in much early NLP/IR work)

Page 15: Maxent Models and Discriminative Estimation

Other Maxent Classifier Examples

• You can use a maxent classifier whenever you want to assign data points to one of a number of classes:
  – Sentence boundary detection (Mikheev 2000)
    • Is a period end of sentence or abbreviation?
  – Sentiment analysis (Pang and Lee 2002)
    • Word unigrams, bigrams, POS counts, …
  – PP attachment (Ratnaparkhi 1998)
    • Attach to verb or noun? Features of head noun, preposition, etc.
  – Parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999, etc.)

Page 16: Maxent Models and Discriminative Estimation

Feature-based Linear Classifiers

How to put features into a classifier


Page 17: Maxent Models and Discriminative Estimation

Feature-Based Linear Classifiers
• Linear classifiers at classification time:
  – Linear function from feature sets {fi} to classes {c}.
  – Assign a weight λi to each feature fi.
  – We consider each class for an observed datum d
  – For a pair (c,d), features vote with their weights:
    • vote(c) = Σi λi fi(c,d)
  – Choose the class c which maximizes vote(c) = Σi λi fi(c,d)

LOCATION: in Québec
DRUG: in Québec
PERSON: in Québec
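A minimal Python sketch of this voting rule (not the course's code), hard-coding the toy weights 1.8, −0.6, 0.3 used on these slides and which of the earlier features fire for the datum “in Québec”; the helper names are illustrative only:

# Hypothetical sketch: weighted feature voting for one datum, d = "in Québec".
weights = {"f1": 1.8, "f2": -0.6, "f3": 0.3}

def active_features(c, d):
    # Which of f1..f3 (from the earlier slide) fire for this (class, datum) pair.
    return {
        "f1": c == "LOCATION",   # previous word "in" and capitalized word
        "f2": c == "LOCATION",   # accented Latin character in the word
        "f3": c == "DRUG",       # word ends with "c"
    }

def vote(c, d):
    return sum(w for name, w in weights.items() if active_features(c, d)[name])

classes = ["LOCATION", "DRUG", "PERSON"]
d = "in Québec"
print({c: vote(c, d) for c in classes})          # LOCATION: 1.2, DRUG: 0.3, PERSON: 0.0
print(max(classes, key=lambda c: vote(c, d)))    # LOCATION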

Page 18: Maxent Models and Discriminative Estimation

Feature-Based Linear Classifiers

There are many ways to choose weights for features:

– Perceptron: find a currently misclassified example, and nudge weights in the direction of its correct classification

– Margin-based methods (Support Vector Machines)

Page 19: Maxent Models and Discriminative Estimation

Feature-Based Linear Classifiers
• Exponential (log-linear, maxent, logistic, Gibbs) models:
  – Make a probabilistic model from the linear combination Σi λi fi(c,d)
    • P(LOCATION | in Québec) = e^1.8 · e^–0.6 / (e^1.8 · e^–0.6 + e^0.3 + e^0) = 0.586
    • P(DRUG | in Québec) = e^0.3 / (e^1.8 · e^–0.6 + e^0.3 + e^0) = 0.238
    • P(PERSON | in Québec) = e^0 / (e^1.8 · e^–0.6 + e^0.3 + e^0) = 0.176
  – The weights are the parameters of the probability model, combined via a “soft max” function (exponentiation makes the votes positive; the denominator normalizes the votes)
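A small Python check (standard library only) reproducing the three probabilities above from the votes computed earlier:

# Softmax over the votes: exponentiation makes votes positive, the sum normalizes them.
import math

votes = {"LOCATION": 1.8 - 0.6, "DRUG": 0.3, "PERSON": 0.0}
Z = sum(math.exp(v) for v in votes.values())
probs = {c: math.exp(v) / Z for c, v in votes.items()}
print(probs)   # ≈ {'LOCATION': 0.586, 'DRUG': 0.238, 'PERSON': 0.176}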

Page 20: Maxent Models and Discriminative Estimation

Feature-Based Linear Classifiers

• Exponential (log-linear, maxent, logistic, Gibbs) models:
  – Given this model form, we will choose parameters {λi} that maximize the conditional likelihood of the data according to this model.

– We construct not only classifications, but probability distributions over classifications.

• There are other (good!) ways of discriminating classes – SVMs, boosting, even perceptrons – but these methods are not as trivial to interpret as distributions over classes.

Page 21: Maxent Models and Discriminative Estimation

Aside: logistic regression

• Maxent models in NLP are essentially the same as multiclass logistic regression models in statistics (or machine learning)– If you have seen these before you might think about:

• The parameterization is slightly different in a way that is advantageous for NLP-style models with tons of sparse features (but statistically inelegant)

• The key role of feature functions in NLP and in this presentation

– The features are more general, with f also being a function of the class – when might this be useful?


Page 22: Maxent Models and Discriminative Estimation

Quiz Question
• Assuming exactly the same setup (3-class decision: LOCATION, PERSON, or DRUG; 3 features as before, maxent), what are:
  – P(PERSON | by Goéric) =
  – P(LOCATION | by Goéric) =
  – P(DRUG | by Goéric) =

  – 1.8   f1(c, d) ≡ [c = LOCATION ∧ w−1 = “in” ∧ isCapitalized(w)]
  – −0.6  f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
  – 0.3   f3(c, d) ≡ [c = DRUG ∧ ends(w, “c”)]

PERSON: by Goéric
LOCATION: by Goéric
DRUG: by Goéric

Page 23: Maxent Models and Discriminative Estimation

Building a Maxent Model

The nuts and bolts

Page 24: Maxent Models and Discriminative Estimation

Building a Maxent Model
• We define features (indicator functions) over data points
  – Features represent sets of data points which are distinctive enough to deserve model parameters.
    • Words, but also “word contains number”, “word ends with ing”, etc.
• We will simply encode each Φ feature as a unique String
  – A datum will give rise to a set of Strings: the active Φ features
  – Each feature fi(c, d) ≡ [Φ(d) ∧ c = cj] gets a real number weight
• We concentrate on Φ features, but the math uses i indices of fi

Page 25: Maxent Models and Discriminative Estimation

Building a Maxent Model

• Features are often added during model development to target errors
  – Often, the easiest thing to think of are features that mark bad combinations

• Then, for any given feature weights, we want to be able to calculate:
  – Data conditional likelihood
  – Derivative of the likelihood wrt each feature weight
    • Uses expectations of each feature according to the model

• We can then find the optimum feature weights (discussed later).

Page 26: Maxent Models and Discriminative Estimation

Naive Bayes vs. Maxent models

Generative vs. Discriminative models: The problem of overcounting evidence

Christopher Manning

Page 27: Maxent Models and Discriminative Estimation

Text classification: Asia or Europe

NB FACTORS:
• P(A) = P(E) =
• P(M|A) =
• P(M|E) =

[NB model diagram: Class ∈ {Europe, Asia} → X1 = M]

NB Model PREDICTIONS:
• P(A,M) =
• P(E,M) =
• P(A|M) =
• P(E|M) =

Training Data

Monaco Monaco

Monaco Monaco Hong Kong

Hong Kong Monaco

Monaco Hong Kong

Hong Kong

Monaco Monaco

Page 28: Maxent Models and Discriminative Estimation

Text classification: Asia or Europe

NB FACTORS:
• P(A) = P(E) =
• P(H|A) = P(K|A) =
• P(H|E) = P(K|E) =

[NB model diagram: Class ∈ {Europe, Asia} → X1 = H, X2 = K]

NB Model PREDICTIONS:
• P(A,H,K) =
• P(E,H,K) =
• P(A|H,K) =
• P(E|H,K) =

Training Data

Monaco Monaco

Monaco Monaco Hong Kong

Hong Kong Monaco

Monaco Hong Kong

Hong Kong

Monaco Monaco

Page 29: Maxent Models and Discriminative Estimation

Text classification: Asia or Europe

NB FACTORS:
• P(A) = P(E) =
• P(M|A) =
• P(M|E) =
• P(H|A) = P(K|A) =
• P(H|E) = P(K|E) =

[NB model diagram: Class ∈ {Europe, Asia} → H, K, M]

NB Model PREDICTIONS:
• P(A,H,K,M) =
• P(E,H,K,M) =
• P(A|H,K,M) =
• P(E|H,K,M) =

Training Data


Monaco Monaco

Monaco Monaco Hong Kong

Hong Kong Monaco

Monaco Hong Kong

Hong Kong

Monaco Monaco

Page 30: Maxent Models and Discriminative Estimation

Naive Bayes vs. Maxent Models

• Naive Bayes models multi-count correlated evidence
  – Each feature is multiplied in, even when you have multiple features telling you the same thing

• Maximum Entropy models (pretty much) solve this problem
  – As we will see, this is done by weighting features so that model expectations match the observed (empirical) expectations

Page 31: Maxent Models and Discriminative Estimation

Maxent Models and Discriminative Estimation

Maximizing the likelihood

Page 32: Maxent Models and Discriminative Estimation

Exponential Model Likelihood

• Maximum (Conditional) Likelihood Models:
  – Given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.

Page 33: Maxent Models and Discriminative Estimation

The Likelihood Value

• The (log) conditional likelihood of iid data (C,D) according to a maxent model is a function of the data and the parameters λ:

• If there aren’t many values of c, it’s easy to calculate:
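The equation itself is not rendered in the transcript; in the standard maxent form it is:

\[ \log P(C \mid D, \lambda) \;=\; \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda) \;=\; \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)} \]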

Page 34: Maxent Models and Discriminative Estimation

The Likelihood Value

• We can separate this into two components:

• The derivative is the difference between the derivatives of each component

Page 35: Maxent Models and Discriminative Estimation

The Derivative I: Numerator

Derivative of the numerator is: the empirical count(fi, c)

Page 36: Maxent Models and Discriminative Estimation

The Derivative II: Denominator

= predicted count(fi, λ)
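Putting the two pieces together (the equations are not rendered in the transcript; this is the standard result they describe):

\[ \frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i} \;=\; \underbrace{\sum_{(c,d) \in (C,D)} f_i(c,d)}_{\text{actual count}(f_i,\, C)} \;-\; \underbrace{\sum_{d \in D} \sum_{c'} P(c' \mid d, \lambda)\, f_i(c',d)}_{\text{predicted count}(f_i,\, \lambda)} \]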

Page 37: Maxent Models and Discriminative Estimation

The Derivative III

• The optimum parameters are the ones for which each feature’s predicted expectation equals its empirical expectation. The optimum distribution is:
  – Always unique (but parameters may not be unique)
  – Always exists (if feature counts are from actual data).

• These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:

Page 38: Maxent Models and Discriminative Estimation

Finding the optimal parameters

• We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data

• To be able to do that, we’ve worked out how to calculate the function value and its partial derivatives (its gradient)

Page 39: Maxent Models and Discriminative Estimation

A likelihood surface

Page 40: Maxent Models and Discriminative Estimation

Finding the optimal parameters
• Use your favorite numerical optimization package….
• Commonly (and in our code), you minimize the negative of CLogLik

1. Gradient descent (GD); Stochastic gradient descent (SGD)
2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
3. Conjugate gradient (CG), perhaps with preconditioning
4. Quasi-Newton methods – limited memory variable metric (LMVM) methods, in particular, L-BFGS
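As a concrete illustration of option 4, here is a minimal sketch (not the course's code) that minimizes the negative conditional log-likelihood of a hypothetical toy dataset with SciPy's L-BFGS-B, supplying the gradient derived earlier (predicted minus actual counts, since we minimize the negative):

# Minimal maxent training sketch with L-BFGS; requires numpy and scipy.
import numpy as np
from scipy.optimize import minimize

# Toy setup: 3 classes, 4 features; each datum d is represented by a feature
# matrix F of shape (classes, features), i.e. F[c, i] = f_i(c, d).
C, I = 3, 4
rng = np.random.default_rng(0)
data = [(rng.integers(0, 2, size=(C, I)).astype(float), int(rng.integers(0, C)))
        for _ in range(50)]            # list of (feature matrix, gold class)

def neg_cond_log_lik(lam):
    """Negative conditional log-likelihood and its gradient."""
    nll = 0.0
    grad = np.zeros(I)
    for F, c in data:
        scores = F @ lam                # vote(c') = sum_i lambda_i f_i(c', d)
        scores -= scores.max()          # shift for numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        nll -= np.log(probs[c])
        grad += probs @ F - F[c]        # predicted counts - actual counts
    return nll, grad

result = minimize(neg_cond_log_lik, np.zeros(I), jac=True, method="L-BFGS-B")
print(result.x)                         # learned feature weights lambda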

Page 41: Maxent Models and Discriminative Estimation

Maxent Models and Discriminative Estimation

The maximum entropy model presentation

Page 42: Maxent Models and Discriminative Estimation

Maximum Entropy Models

• An equivalent approach:
  – Lots of distributions out there, most of them very spiked, specific, overfit.
  – We want a distribution which is uniform except in specific ways we require.
  – Uniformity means high entropy – we can search for distributions which have properties we desire, but also have high entropy.

• “Ignorance is preferable to error and he is less remote from the truth who believes nothing than he who believes what is wrong.” – Thomas Jefferson (1781)

Page 43: Maxent Models and Discriminative Estimation

(Maximum) Entropy

• Entropy: the uncertainty of a distribution.
• Quantifying uncertainty (“surprise”):
  – Event x
  – Probability px
  – “Surprise” log(1/px)
• Entropy: expected surprise (over p). A coin flip is most uncertain for a fair coin.

[Plot: entropy H as a function of pHEADS]
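The expected-surprise definition, written out (the slide's own equation is not reproduced in the transcript; this is the standard form):

\[ H(p) \;=\; \mathbb{E}_p\!\left[\log \tfrac{1}{p_x}\right] \;=\; -\sum_x p_x \log p_x \]

For the coin, H = −(pHEADS log pHEADS + (1 − pHEADS) log(1 − pHEADS)), which is maximized at pHEADS = 0.5.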

Page 44: Maxent Models and Discriminative Estimation

Maxent Examples I
• What do we want from a distribution?
  – Minimize commitment = maximize entropy.
  – Resemble some reference distribution (data).
• Solution: maximize entropy H, subject to feature-based constraints (written out below).
• Adding constraints (features):
  – Lowers maximum entropy
  – Raises maximum likelihood of data
  – Brings the distribution further from uniform
  – Brings the distribution closer to data

[Plots: entropy of pHEADS – unconstrained, max at 0.5; with the constraint pHEADS = 0.3]
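The constrained problem referred to above, in its standard form (the slide's own equation is not reproduced in the transcript):

\[ \max_{p} \; H(p) \quad \text{subject to} \quad E_{p}[f_i] = E_{\hat p}[f_i] \;\; \text{for all } i, \qquad \sum_x p_x = 1 \]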

Page 45: Maxent Models and Discriminative Estimation

Maxent Examples II

[Plots: H(pH, pT) with the constraints pH + pT = 1 and pH = 0.3; the function −x log x, maximized at x = 1/e]

Page 46: Maxent Models and Discriminative Estimation

Maxent Examples III
• Let’s say we have the following event space: NN, NNS, NNP, NNPS, VBZ, VBD
• … and the following empirical data (counts): 3, 5, 11, 13, 3, 1
• Maximize H: each cell goes to 1/e
• … but we want probabilities, so add the constraint E[NN,NNS,NNP,NNPS,VBZ,VBD] = 1: each cell becomes 1/6

              NN    NNS   NNP   NNPS  VBZ   VBD
  Empirical   3     5     11    13    3     1
  Maximize H  1/e   1/e   1/e   1/e   1/e   1/e
  E[…] = 1    1/6   1/6   1/6   1/6   1/6   1/6

Page 47: Maxent Models and Discriminative Estimation

Maxent Examples IV
• Too uniform!
• N* are more common than V*, so we add the feature fN = {NN, NNS, NNP, NNPS}, with E[fN] = 32/36

              NN    NNS   NNP   NNPS  VBZ   VBD
              8/36  8/36  8/36  8/36  2/36  2/36

• … and proper nouns are more frequent than common nouns, so we add fP = {NNP, NNPS}, with E[fP] = 24/36

              NN    NNS   NNP    NNPS   VBZ   VBD
              4/36  4/36  12/36  12/36  2/36  2/36

• … we could keep refining the models, e.g., by adding a feature to distinguish singular vs. plural nouns, or verb types.

Page 48: Maxent Models and Discriminative Estimation

Convexity

[Figures: a convex objective vs. a non-convex objective]

Convexity guarantees a single, global maximum because any higher points are greedily reachable.

Page 49: Maxent Models and Discriminative Estimation

Convexity II

• Constrained H(p) = −Σ x log x is concave:
  – −x log x is concave
  – −Σ x log x is concave (a sum of concave functions is concave).
  – The feasible region of constrained H is a linear subspace (which is convex)
  – The constrained entropy surface therefore has a single global maximum.
• The maximum likelihood exponential model (dual) formulation is likewise well-behaved: its negative log-likelihood is convex in the parameters.

Page 50: Maxent Models and Discriminative Estimation

Feature Overlap/Feature Interaction

How overlapping features work in maxent models

Page 51: Maxent Models and Discriminative Estimation

Feature Overlap
• Maxent models handle overlapping features well.
• Unlike a NB model, there is no double counting!

Empirical counts:
        A   a
    B   2   1
    b   2   1

Model distributions as overlapping constraints are added:

  Constraint “All = 1”:                                         weights: (none yet)
        A     a
    B   1/4   1/4
    b   1/4   1/4

  Constraint “A = 2/3” (one feature on column A):               weights on column A: λA
        A     a
    B   1/3   1/6
    b   1/3   1/6

  Constraint “A = 2/3” added twice (two overlapping features):  weights on column A: λ'A + λ''A
        A     a
    B   1/3   1/6
    b   1/3   1/6

Page 52: Maxent Models and Discriminative Estimation

Example: Named Entity Feature Overlap

Feature Weights:
  Feature Type          Feature    PERS    LOC
  Previous word         at         -0.73    0.94
  Current word          Grace       0.03    0.00
  Beginning bigram      <G          0.45   -0.04
  Current POS tag       NNP         0.47    0.45
  Prev and cur tags     IN NNP     -0.10    0.14
  Previous state        Other      -0.70   -0.92
  Current signature     Xx          0.80    0.46
  Prev state, cur sig   O-Xx        0.68    0.37
  Prev-cur-next sig     x-Xx-Xx    -0.69    0.37
  P. state - p-cur sig  O-x-Xx     -0.20    0.82
  …
  Total:                           -0.58    2.68

Local Context:
          Prev    Cur     Next
  State   Other   ???     ???
  Word    at      Grace   Road
  Tag     IN      NNP     NNP
  Sig     x       Xx      Xx

Grace is correlated with PERSON, but does not add much evidence on top of already knowing the prefix features.

Page 53: Maxent Models and Discriminative Estimation

Feature Interaction
• Maxent models handle overlapping features well, but do not automatically model feature interactions.

Empirical counts:
        A   a
    B   1   1
    b   1   0

Model distributions and cell weights as constraints are added:

  Constraint “All = 1”:                    weights: all 0
        A     a
    B   1/4   1/4
    b   1/4   1/4

  Constraint “A = 2/3”:                    weights: λA on column A
        A     a
    B   1/3   1/6
    b   1/3   1/6

  Constraints “A = 2/3” and “B = 2/3”:     weights: (A,B): λA+λB, (a,B): λB, (A,b): λA
        A     a
    B   4/9   2/9
    b   2/9   1/9

Page 54: Maxent Models and Discriminative Estimation

Feature Interaction
• If you want interaction terms, you have to add them:

Empirical counts:
        A   a
    B   1   1
    b   1   0

  Constraint “A = 2/3”:
        A     a
    B   1/3   1/6
    b   1/3   1/6

  Constraints “A = 2/3” and “B = 2/3”:
        A     a
    B   4/9   2/9
    b   2/9   1/9

  Constraints “A = 2/3”, “B = 2/3”, and “AB = 1/3”:
        A     a
    B   1/3   1/3
    b   1/3   0

• A disjunctive feature would also have done it (alone):
        A     a
    B   1/3   1/3
    b   1/3   0

Page 55: Maxent Models and Discriminative Estimation

Feature Interaction

• For loglinear/logistic regression models in statistics, it is standard to do a greedy stepwise search over the space of all possible interaction terms.

• This combinatorial space is exponential in size, but that’s okay as most statistics models only have 4–8 features.

• In NLP, our models commonly use hundreds of thousands of features, so that’s not okay.

• Commonly, interaction terms are added by hand based on linguistic intuitions.

Page 56: Maxent Models and Discriminative Estimation

Example: NER Interaction

Feature Weights:
  Feature Type          Feature    PERS    LOC
  Previous word         at         -0.73    0.94
  Current word          Grace       0.03    0.00
  Beginning bigram      <G          0.45   -0.04
  Current POS tag       NNP         0.47    0.45
  Prev and cur tags     IN NNP     -0.10    0.14
  Previous state        Other      -0.70   -0.92
  Current signature     Xx          0.80    0.46
  Prev state, cur sig   O-Xx        0.68    0.37
  Prev-cur-next sig     x-Xx-Xx    -0.69    0.37
  P. state - p-cur sig  O-x-Xx     -0.20    0.82
  …
  Total:                           -0.58    2.68

Local Context:
          Prev    Cur     Next
  State   Other   ???     ???
  Word    at      Grace   Road
  Tag     IN      NNP     NNP
  Sig     x       Xx      Xx

Previous-state and current-signature have interactions, e.g. P=PERS-C=Xx indicates C=PERS much more strongly than C=Xx and P=PERS do independently. This feature type allows the model to capture this interaction.

Page 57: Maxent Models and Discriminative Estimation

Conditional Maxent Models for Classification

The relationship between conditional and joint maxent/exponential models

Page 58: Maxent Models and Discriminative Estimation

Classification
• What do these joint models of P(X) have to do with conditional models P(C|D)?
• Think of the space C×D as a complex X.
  – C is generally small (e.g., 2–100 topic classes)
  – D is generally huge (e.g., space of documents)
• We can, in principle, build models over P(C,D).
• This will involve calculating expectations of features (over C×D):
• Generally impractical: can’t enumerate X efficiently.


Page 59: Maxent Models and Discriminative Estimation

Classification II
• D may be huge or infinite, but only a few d occur in our data.
• What if we add one feature for each d and constrain its expectation to match our empirical data?
• Now, most entries of P(c,d) will be zero.
• We can therefore use the much easier sum:
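The easier sum referred to above can be written as follows (a standard reconstruction, with observed(D) denoting the d values that actually occur in the training data):

\[ E_{p}[f_i] \;=\; \sum_{d \in \text{observed}(D)} \sum_{c \in C} P(c,d)\, f_i(c,d) \]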


Page 60: Maxent Models and Discriminative Estimation

Classification III

• But if we’ve constrained the D marginals

• then the only thing that can vary is the conditional distributions:

Page 61: Maxent Models and Discriminative Estimation

Classification IV

• This is the connection between joint and conditional maxent / exponential models:
  – Conditional models can be thought of as joint models with marginal constraints.
• Maximizing joint likelihood and conditional likelihood of the data in this model are equivalent!

Page 62: Maxent Models and Discriminative Estimation

Smoothing / Priors / Regularization for Maxent Models

Page 63: Maxent Models and Discriminative Estimation

Smoothing: Issues of Scale

Page 64: Maxent Models and Discriminative Estimation

Smoothing: Issues
• Assume the following empirical distribution:

      Heads   Tails
      h       t

• Features: {Heads}, {Tails}
• We’ll have the following model distribution (a logistic regression model):
• Really, only one degree of freedom (λ = λH − λT)
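The model distribution itself is not rendered in the transcript; in the standard two-feature exponential form it is:

\[ p_{\text{HEADS}} \;=\; \frac{e^{\lambda_H}}{e^{\lambda_H} + e^{\lambda_T}} \;=\; \frac{e^{\lambda}}{e^{\lambda} + 1}, \qquad \lambda = \lambda_H - \lambda_T \]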

Page 65: Maxent Models and Discriminative Estimation

Smoothing: Issues

• The data likelihood in this model is plotted for three training sets:

      Heads   Tails        Heads   Tails        Heads   Tails
      2       2            3       1            4       0

[Plots: log P as a function of λ for each of the three data sets]
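The likelihood expression is not rendered in the transcript; for h heads and t tails under the model above it takes the standard form

\[ \log P(\text{data} \mid \lambda) \;=\; h \log p_{\text{HEADS}} + t \log p_{\text{TAILS}} \;=\; h\lambda - (h+t)\log\!\left(1 + e^{\lambda}\right), \]

which, in the 4/0 case, keeps increasing as λ grows – the issue taken up on the next slide.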

Page 66: Maxent Models and Discriminative Estimation

Smoothing: Early Stopping

• In the 4/0 case, there were two problems:
  – The optimal value of λ was ∞, which is a long trip for an optimization procedure.
  – The learned distribution is just as spiked as the empirical one – no smoothing.

• One way to solve both issues is to just stop the optimization early, after a few iterations.
  – The value of λ will be finite (but presumably big).
  – The optimization won’t take forever (clearly).
  – Commonly used in early maxent work.

Input:
      Heads   Tails
      4       0

Output:
      Heads   Tails
      1       0

Page 67: Maxent Models and Discriminative Estimation

Smoothing: Priors (MAP)
• What if we had a prior expectation that parameter values wouldn’t be very large?
• We could then balance evidence suggesting large (or infinite) parameters against our prior.
• The evidence would never totally defeat the prior, and parameters would be smoothed (and kept finite!).
• We can do this explicitly by changing the optimization objective to maximum posterior likelihood (Posterior ∝ Prior × Evidence):
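The objective, written with the three labels above (the slide's own equation is not reproduced in the transcript):

\[ \underbrace{\log P(C, \lambda \mid D)}_{\text{Posterior}} \;=\; \underbrace{\log P(\lambda)}_{\text{Prior}} \;+\; \underbrace{\log P(C \mid D, \lambda)}_{\text{Evidence}} \]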

Page 68: Maxent Models and Discriminative Estimation

Smoothing: Priors

• Gaussian, or quadratic, or L2 priors:
  – Intuition: parameters shouldn’t be large.
  – Formalization: prior expectation that each parameter will be distributed according to a gaussian with mean μ and variance σ².
  – Penalizes parameters for drifting too far from their mean prior value (usually μ = 0).
  – 2σ² = 1 works surprisingly well.

[Plot: gaussian prior densities for 2σ² = 1, 2σ² = 10, and 2σ² = ∞]

Page 69: Maxent Models and Discriminative Estimation

Smoothing: Priors
• If we use gaussian priors:
  – Trade off some expectation-matching for smaller parameters.
  – When multiple features can be recruited to explain a data point, the more common ones generally receive more weight.
  – Accuracy generally goes up!
• Change the objective:
• Change the derivative:
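Neither formula is rendered in the transcript; with a gaussian prior (mean μ, variance σ²) the standard forms are:

\[ \log P(C, \lambda \mid D) \;=\; \sum_{(c,d)} \log P(c \mid d, \lambda) \;-\; \sum_i \frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2} \;+\; k \]

\[ \frac{\partial}{\partial \lambda_i} \;=\; \text{actual count}(f_i, C) \;-\; \text{predicted count}(f_i, \lambda) \;-\; \frac{\lambda_i - \mu_i}{\sigma_i^2} \]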

[Plots: likelihood surfaces for 2σ² = 1, 2σ² = 10, and 2σ² = ∞; prior mean = 0]

Page 70: Maxent Models and Discriminative Estimation

Example: NER Smoothing

Feature Weights:
  Feature Type          Feature    PERS    LOC
  Previous word         at         -0.73    0.94
  Current word          Grace       0.03    0.00
  Beginning bigram      <G          0.45   -0.04
  Current POS tag       NNP         0.47    0.45
  Prev and cur tags     IN NNP     -0.10    0.14
  Previous state        Other      -0.70   -0.92
  Current signature     Xx          0.80    0.46
  Prev state, cur sig   O-Xx        0.68    0.37
  Prev-cur-next sig     x-Xx-Xx    -0.69    0.37
  P. state - p-cur sig  O-x-Xx     -0.20    0.82
  …
  Total:                           -0.58    2.68

Local Context:
          Prev    Cur     Next
  State   Other   ???     ???
  Word    at      Grace   Road
  Tag     IN      NNP     NNP
  Sig     x       Xx      Xx

Because of smoothing, the more common prefix and single-tag features have larger weights even though entire-word and tag-pair features are more specific.

Page 71: Maxent Models and Discriminative Estimation

Example: POS Tagging

• From (Toutanova et al., 2003):

• Smoothing helps:
  – Softens distributions.
  – Pushes weight onto more explanatory features.
  – Allows many features to be dumped safely into the mix.
  – Speeds up convergence (if both are allowed to converge)!

                      Overall Accuracy   Unknown Word Acc
  Without Smoothing   96.54              85.20
  With Smoothing      97.10              88.20

Page 72: Maxent Models and Discriminative Estimation

Smoothing: Regularization

• Talking of “priors” and “MAP estimation” is Bayesian language

• In frequentist statistics, people will instead talk about using “regularization”, and in particular, a gaussian prior is “L2 regularization”

• The choice of names makes no difference to the math

Page 73: Maxent Models and Discriminative Estimation

Smoothing: Virtual Data
• Another option: smooth the data, not the parameters.
• Example:

      Heads   Tails            Heads   Tails
      4       0        →       5       1

  – Equivalent to adding two extra data points.
  – Similar to add-one smoothing for generative models.
• Hard to know what artificial data to create!

Page 74: Maxent Models and Discriminative Estimation

Smoothing: Count Cutoffs

• In NLP, features with low empirical counts are often dropped.
  – Very weak and indirect smoothing method.
  – Equivalent to locking their weight to be zero.
  – Equivalent to assigning them gaussian priors with mean zero and variance zero.
  – Dropping low counts does remove the features which were most in need of smoothing…
  – … and speeds up the estimation by reducing model size …
  – … but count cutoffs generally hurt accuracy in the presence of proper smoothing.
• We recommend: don’t use count cutoffs unless absolutely necessary for memory usage reasons.