Discriminative Estimation (Maxent Models and Perceptron)


Page 1: Discriminative Estimation (Maxent models and perceptron)

Discriminative Estimation (Maxent models and perceptron)

Generative vs. Discriminative models

Many slides are adapted from slides by Christopher Manning and perceptron slides by Alan Ritter

Page 2: Discriminative Estimation (Maxent models and perceptron)

Introduction

• So far we've looked at "generative models"
• Naive Bayes

• But there is now much use of conditional or discriminative probabilistic models in NLP, Speech, IR (and ML generally)

• Because:
• They give high accuracy performance
• They make it easy to incorporate lots of linguistically important features
• They allow automatic building of language-independent, retargetable NLP modules

Page 3: Discriminative Estimation (Maxent models and perceptron)

Joint vs. Conditional Models

• We have some data {(d, c)} of paired observations d and hidden classes c.

• Joint (generative) models place probabilities over both observed data and the hidden stuff (generate the observed data from hidden stuff):
• All the classic StatNLP models:
• n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models

P(c,d)

Page 4: Discriminative Estimation (Maxent models and perceptron)

Joint vs. Conditional Models

• Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data:

• Logistic regression, conditional loglinear or maximum entropy models, conditional random fields

• Also, SVMs, (averaged) perceptron, etc. are discriminative classifiers (but not directly probabilistic)

P(c|d)

Page 5: Discriminative Estimation (Maxent models and perceptron)

Bayes Net / Graphical Models

• Bayes net diagrams draw circles for random variables, and lines for direct dependencies

• Some variables are observed; some are hidden
• Each node is a little classifier (conditional probability table) based on incoming arcs

[Diagrams: Naive Bayes (generative), with class c pointing to d1, d2, d3; Logistic Regression (discriminative), with d1, d2, d3 pointing to c]

Page 6: Discriminative Estimation (Maxent models and perceptron)

Conditional vs. Joint Likelihood

• A joint model gives probabilities P(d, c) and tries to maximize this joint likelihood.
• It turns out to be trivial to choose weights: just relative frequencies.

• A conditional model gives probabilities P(c|d). It takes the data as given and models only the conditional probability of the class.
• We seek to maximize conditional likelihood.
• Harder to do (as we'll see…)
• More closely related to classification error.

Page 7: Discriminative Estimation (Maxent models and perceptron)

Maxent Models and Discriminative Estimation

Generative vs. Discriminative models

Page 8: Discriminative Estimation (Maxent models and perceptron)

Discriminative Model Features

Making features from text for discriminative NLP models

Page 9: Discriminative Estimation (Maxent models and perceptron)

Features

• In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict

• A feature is a function with a bounded real value: f: C × D → ℝ

• A belief: each feature creates a partition of the data

Page 10: Discriminative Estimation (Maxent models and perceptron)

Features

• In NLP uses, usually a feature specifies
1. an indicator function – a yes/no boolean matching function – of properties of the input, and
2. a particular class

fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]

• Each feature picks out a data subset and suggests a label for it

Page 11: Discriminative Estimation (Maxent models and perceptron)

Example features

• f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
• f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]

• Models will assign to each feature a weight:
• A positive weight votes that this configuration is likely correct
• A negative weight votes that this configuration is likely incorrect

LOCATION: in Québec

PERSON: saw Sue

DRUG: taking Zantac

LOCATION: in Arcadia
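These three features can be written down directly as indicator functions. The sketch below is illustrative only: the (word, previous-word) encoding of a datum and the two helper predicates are assumptions, not something specified on the slides.

```python
# Sketch of the three example features as indicator functions f(c, d) -> {0, 1}.
# A datum d is encoded here as a (word, previous_word) pair; the helper predicates
# are illustrative stand-ins.
import unicodedata

def is_capitalized(w):
    return w[:1].isupper()

def has_accented_latin_char(w):
    # True if any character has a canonical decomposition (e.g. "é" -> "e" + accent)
    return any(unicodedata.decomposition(ch) for ch in w)

def f1(c, d):  # [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
    w, prev = d
    return int(c == "LOCATION" and prev == "in" and is_capitalized(w))

def f2(c, d):  # [c = LOCATION ∧ hasAccentedLatinChar(w)]
    w, _ = d
    return int(c == "LOCATION" and has_accented_latin_char(w))

def f3(c, d):  # [c = DRUG ∧ ends(w, "c")]
    w, _ = d
    return int(c == "DRUG" and w.endswith("c"))

d = ("Québec", "in")
print(f1("LOCATION", d), f2("LOCATION", d), f3("DRUG", d))  # -> 1 1 1
```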

Page 12: Discriminative Estimation (Maxent models and perceptron)

Feature-Based Models
• The decision about a data point is based only on the features active at that point.

Text Categorization
Data: BUSINESS: Stocks hit a yearly low …
Features: {…, stocks, hit, a, yearly, low, …}
Label: BUSINESS

Word-Sense Disambiguation
Data: … to restructure bank:MONEY debt.
Features: {…, w-1=restructure, w+1=debt, …}
Label: MONEY

POS Tagging
Data: DT JJ NN … / The previous fall …
Features: {w=fall, t-1=JJ, w-1=previous}
Label: NN

Page 13: Discriminative Estimation (Maxent models and perceptron)

Example: Text Categorization

(Zhang and Oles 2001)
• Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words)
• Tests on classic Reuters data set (and others)

• Naïve Bayes: 77.0% F1
• Linear regression: 86.0%
• Logistic regression: 86.4%
• Support vector machine: 86.5%

• Paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods (not used in much early NLP/IR work)

Page 14: Discriminative Estimation (Maxent models and perceptron)

Other Maxent Classifier Examples

• You can use a maxent classifier whenever you want to assign data points to one of a number of classes:
• Sentence boundary detection (Mikheev 2000)
• Is a period end of sentence or abbreviation?
• Sentiment analysis (Pang and Lee 2002)
• Word unigrams, bigrams, POS counts, …
• PP attachment (Ratnaparkhi 1998)
• Attach to verb or noun? Features of head noun, preposition, etc.
• Parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999, etc.)

Page 15: Discriminative Estimation (Maxent models and perceptron)

Discriminative Model Features

Making features from text for discriminative NLP models

Page 16: Discriminative Estimation (Maxent models and perceptron)

Feature-based Linear Classifiers

How to put features into a classifier


Page 17: Discriminative Estimation (Maxent models and perceptron)

Feature-Based Linear Classifiers

• Linear classifiers at classification time:
• Linear function from feature sets {fi} to classes {c}.
• Assign a weight λi to each feature fi.

• We consider each class for an observed datum d
• For a pair (c, d), features vote with their weights:
• vote(c) = Σ λi fi(c, d)

• Choose the class c which maximizes Σ λi fi(c, d)

LOCATION: in Québec

DRUG: in Québec

PERSON: in Québec

Page 18: Discriminative Estimation (Maxent models and perceptron)

Feature-Based Linear Classifiers

• Linear classifiers at classification time:
• Linear function from feature sets {fi} to classes {c}.
• Assign a weight λi to each feature fi.

• We consider each class for an observed datum d
• For a pair (c, d), features vote with their weights:
• vote(c) = Σ λi fi(c, d)

• Choose the class c which maximizes Σ λi fi(c, d) = LOCATION

For "in Québec":
LOCATION: 1.8 + (–0.6)
DRUG: 0.3
PERSON: 0
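As a concrete illustration of the voting computation, here is a minimal sketch for the "in Québec" example with the weights from the slide (1.8, –0.6, 0.3). The feature implementations and the datum encoding are assumptions for illustration.

```python
# Sketch: linear classification by summed feature votes, vote(c) = Σ_i λ_i f_i(c, d).
# Weights and the "in Québec" example follow the slide; the feature code is illustrative.
CLASSES = ["LOCATION", "DRUG", "PERSON"]
WEIGHTS = (1.8, -0.6, 0.3)

def features(c, d):
    """Values of (f1, f2, f3) for class c and datum d = (word, previous word)."""
    w, prev = d
    return (int(c == "LOCATION" and prev == "in" and w[:1].isupper()),
            int(c == "LOCATION" and any(ch in "éèêëàâîïôûüç" for ch in w)),  # crude accent check
            int(c == "DRUG" and w.endswith("c")))

def vote(c, d):
    return sum(l * f for l, f in zip(WEIGHTS, features(c, d)))

d = ("Québec", "in")
for c in CLASSES:
    print(c, vote(c, d))                       # LOCATION ≈ 1.2, DRUG 0.3, PERSON 0.0
print(max(CLASSES, key=lambda c: vote(c, d)))  # LOCATION
```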

Page 19: Discriminative Estimation (Maxent models and perceptron)

Feature-Based Linear Classifiers

There are many ways to choose weights for features, with different loss functions as the optimization goal:

• Perceptron: find a currently misclassified example, and nudge weights in the direction of its correct classification

• Margin-based methods (Support Vector Machines)

Page 20: Discriminative Estimation (Maxent models and perceptron)

Feature-Based Linear Classifiers
• Exponential (log-linear, maxent, logistic, Gibbs) models:

• Make a probabilistic model from the linear combination Σ λi fi(c, d)

• P(LOCATION | in Québec) = e^1.8 e^–0.6 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.586

• P(DRUG | in Québec) = e^0.3 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.238

• P(PERSON | in Québec) = e^0 / (e^1.8 e^–0.6 + e^0.3 + e^0) = 0.176

• The weights are the parameters of the probability model, combined via a "soft max" function

$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$

(The exponential makes the votes positive; the sum over classes in the denominator normalizes the votes.)
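The three probabilities on this slide can be reproduced directly from the votes. A minimal sketch, with the votes taken from the slide:

```python
# Sketch: turning votes into probabilities with the "soft max":
# P(c|d) = exp(vote(c)) / Σ_c' exp(vote(c')).  Votes for "in Québec" are from the slide.
import math

votes = {"LOCATION": 1.8 - 0.6, "DRUG": 0.3, "PERSON": 0.0}
Z = sum(math.exp(v) for v in votes.values())            # normalizes the votes
probs = {c: math.exp(v) / Z for c, v in votes.items()}  # exp makes the votes positive
for c, p in probs.items():
    print(c, round(p, 3))   # LOCATION 0.586, DRUG 0.238, PERSON 0.176
```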

Page 21: Discriminative Estimation (Maxent models and perceptron)

Aside: logistic regression

• Maxent models in NLP are essentially the same as multiclass logistic regression models in statistics (or machine learning)

• The key role of feature functions in NLP and in this presentation
• The features are more general, with f also being a function of the class


Page 22: Discriminative Estimation (Maxent models and perceptron)

Quiz Question

• Assuming exactly the same set up (3-class decision: LOCATION, PERSON, or DRUG; 3 features as before, maxent), what are:
• P(PERSON | by Goéric) =
• P(LOCATION | by Goéric) =
• P(DRUG | by Goéric) =

• 1.8: f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
• –0.6: f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• 0.3: f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]

$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$

PERSON: by Goéric

LOCATION: by Goéric

DRUG: by Goéric

Page 23: Discriminative Estimation (Maxent models and perceptron)

Feature-based Linear Classifiers

How to put features into a classifier


Page 24: Discriminative Estimation (Maxent models and perceptron)

Building a Maxent Model

The nuts and bolts

Page 25: Discriminative Estimation (Maxent models and perceptron)

Building a Maxent Model

• We define features (indicator functions) over data points
• Features represent sets of data points which are distinctive enough to deserve model parameters.
• Words, but also "word contains number", "word ends with ing", etc.

• We will simply encode each Φ feature as a unique String (index)
• A datum will give rise to a set of Strings: the active Φ features
• Each feature fi(c, d) ≡ [Φ(d) ∧ c = cj] gets a real number weight

• We concentrate on Φ features but the math uses i indices of fi
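One simple way to realize the "unique String (index)" encoding is a dictionary that assigns the next free integer to each new feature string. The feature names and the helper below are illustrative assumptions, not the slides' actual implementation.

```python
# Sketch: encode each observed Φ feature as a unique string mapped to an integer index,
# so a datum becomes the set of indices of its active Φ features. Names are illustrative.
from collections import defaultdict

feature_index = defaultdict(lambda: len(feature_index))  # feature string -> unique int

def phi(word, prev_word):
    """Active Φ features of a datum, as strings."""
    feats = ["w=" + word.lower(), "w-1=" + prev_word.lower()]
    if word[:1].isupper():
        feats.append("isCapitalized")
    if any(ch.isdigit() for ch in word):
        feats.append("containsNumber")
    if word.lower().endswith("ing"):
        feats.append("endsWith-ing")
    return feats

active = [feature_index[f] for f in phi("Québec", "in")]
print(dict(feature_index))  # {'w=québec': 0, 'w-1=in': 1, 'isCapitalized': 2}
print(active)               # [0, 1, 2]
```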

Page 26: Discriminative Estimation (Maxent models and perceptron)

Building a Maxent Model
• Features are often added during model development to target errors
• Often, the easiest things to think of are features that mark bad combinations

• Then, for any given feature weights, we want to be able to calculate:
• Data conditional likelihood
• Derivative of the likelihood w.r.t. each feature weight
• Uses expectations of each feature according to the model

• We can then find the optimum feature weights (discussed later).

Page 27: Discriminative Estimation (Maxent models and perceptron)

Building a Maxent Model

The nuts and bolts

Page 28: Discriminative Estimation (Maxent models and perceptron)

Naive Bayes vs. Maxent Models

Generative vs. Discriminative models: The problem of overcounting evidence

Page 29: Discriminative Estimation (Maxent models and perceptron)

Text classification: Asia or Europe

NB FACTORS:
• P(A) = P(E) =
• P(M|A) =
• P(M|E) =
• P(H|A) = P(K|A) =
• P(H|E) = P(K|E) =

NB Model PREDICTIONS:
• P(A,H,K,M) =
• P(E,H,K,M) =
• P(A|H,K,M) =
• P(E|H,K,M) =

[Diagram: Naive Bayes model with Class node (Europe / Asia) generating the words H (Hong), K (Kong), and M (Monaco)]

Training Data:
Monaco Monaco
Monaco Monaco Hong Kong
Hong Kong Monaco
Monaco Hong Kong
Hong Kong
Monaco Monaco

Page 30: Discriminative Estimation (Maxent models and perceptron)

Naive Bayes vs. Maxent Models

• Naive Bayes models multi-count correlated evidence
• Each feature is multiplied in, even when you have multiple features telling you the same thing

• Maximum Entropy models (pretty much) solve this problem
• As we will see, this is done by weighting features so that model expectations match the observed (empirical) expectations

Page 31: Discriminative Estimation (Maxent models and perceptron)

Naive Bayes vs. Maxent Models

Generative vs. Discriminative models: The problem of overcounting evidence

Page 32: Discriminative Estimation (Maxent models and perceptron)

Maxent Models and Discriminative Estimation

Maximizing the likelihood

Page 33: Discriminative Estimation (Maxent models and perceptron)

Feature Expectations

• We will crucially make use of two expectations
• actual or predicted counts of a feature firing:

• Empirical count (expectation) of a feature:

$$E_{\text{empirical}}(f_i) = \sum_{(c,d) \in \text{observed}(C,D)} f_i(c,d)$$

• Model expectation of a feature:

$$E(f_i) = \sum_{(c,d) \in (C,D)} P(c,d)\, f_i(c,d)$$

Goal: fit the data well
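For a conditional model, both expectations can be computed by summing over the training pairs: the empirical count uses the observed class of each datum, while the model expectation replaces it with the model's distribution P(c|d) over classes. A minimal sketch; the toy data, classes, feature, and weight are all illustrative assumptions.

```python
# Sketch: empirical count vs. model expectation of one feature under a conditional model.
# The toy data, classes, feature, and weight are illustrative assumptions.
import math

CLASSES = ["LOCATION", "PERSON"]
data = [("LOCATION", ("Québec", "in")), ("PERSON", ("Sue", "saw"))]  # observed (c, d) pairs

def f_i(c, d):                        # example feature: [c = LOCATION ∧ w-1 = "in"]
    return int(c == "LOCATION" and d[1] == "in")

def p_cond(c, d, lam=1.5):            # P(c | d, λ) for this one-feature model
    Z = sum(math.exp(lam * f_i(c2, d)) for c2 in CLASSES)
    return math.exp(lam * f_i(c, d)) / Z

empirical = sum(f_i(c, d) for c, d in data)                              # actual count
model = sum(p_cond(c, d) * f_i(c, d) for _, d in data for c in CLASSES)  # predicted count
print(empirical, round(model, 3))   # 1 0.818 — model expectation below the empirical count
```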

Page 34: Discriminative Estimation (Maxent models and perceptron)

Exponential Model Likelihood

• Maximum (Conditional) Likelihood Models:
• Given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$

Page 35: Discriminative Estimation (Maxent models and perceptron)

The Likelihood Value

• The (log) conditional likelihood of iid data (C, D) according to a maxent model is a function of the data and the parameters λ:

$$\log P(C \mid D, \lambda) = \log \prod_{(c,d) \in (C,D)} P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda)$$

• If there aren't many values of c, it's easy to calculate:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$

Page 36: Discriminative Estimation (Maxent models and perceptron)

The Likelihood Value

• We can separate this into two components:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \exp \sum_i \lambda_i f_i(c,d) \;-\; \sum_{(c,d) \in (C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)$$

$$\log P(C \mid D, \lambda) = N(\lambda) - M(\lambda)$$

• The derivative is the difference between the derivatives of each component

Page 37: Discriminative Estimation (Maxent models and perceptron)

The Derivative I: Numerator

$$\frac{\partial N(\lambda)}{\partial \lambda_i} = \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in (C,D)} \log \exp \sum_i \lambda_i f_i(c,d) = \sum_{(c,d) \in (C,D)} \frac{\partial \sum_i \lambda_i f_i(c,d)}{\partial \lambda_i} = \sum_{(c,d) \in (C,D)} f_i(c,d)$$

Derivative of the numerator is: the empirical count(fi, C)

Page 38: Discriminative Estimation (Maxent models and perceptron)

The Derivative II: Denominator

$$\frac{\partial M(\lambda)}{\partial \lambda_i} = \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in (C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)$$

$$= \sum_{(c,d) \in (C,D)} \frac{1}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'',d)} \; \frac{\partial}{\partial \lambda_i} \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)$$

$$= \sum_{(c,d) \in (C,D)} \frac{1}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'',d)} \; \sum_{c'} \exp \Big( \sum_i \lambda_i f_i(c',d) \Big) \frac{\partial \sum_i \lambda_i f_i(c',d)}{\partial \lambda_i}$$

$$= \sum_{(c,d) \in (C,D)} \sum_{c'} \frac{\exp \sum_i \lambda_i f_i(c',d)}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'',d)} \, f_i(c',d)$$

$$= \sum_{(c,d) \in (C,D)} \sum_{c'} P(c' \mid d, \lambda) \, f_i(c',d) = \text{predicted count}(f_i, \lambda)$$

Page 39: Discriminative Estimation (Maxent models and perceptron)

The Derivative III

$$\frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i} = \text{actual count}(f_i, C) - \text{predicted count}(f_i, \lambda)$$

• The optimum parameters are the ones for which each feature's predicted expectation equals its empirical expectation. The optimum distribution is:
• Always unique (but parameters may not be unique)
• Always exists (if feature counts are from actual data).

• These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:

$$E_p(f_j) = E_{\tilde{p}}(f_j) \quad \forall j$$

Page 40: Discriminative Estimation (Maxent models and perceptron)

Finding the optimal parameters

• We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data

$$\text{LogLik}(C, D) = \sum_{i=1}^{n} \log P(c_i \mid d_i)$$

• To be able to do that, we've worked out how to calculate the function value and its partial derivatives (its gradient)
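Putting the last few slides together, the sketch below computes the conditional log-likelihood and its gradient (empirical count minus predicted count for each feature) on a tiny dataset and runs plain gradient ascent. Everything concrete here (data, features, learning rate) is an assumption for illustration.

```python
# Sketch: conditional log-likelihood and its gradient for a small maxent model:
#   ∂ log P(C|D,λ) / ∂λ_i = actual count(f_i, C) − predicted count(f_i, λ)
import math

CLASSES = ["LOCATION", "PERSON"]
data = [("LOCATION", ("Québec", "in")), ("PERSON", ("Sue", "saw"))]  # toy training pairs

def features(c, d):                   # two indicator features (illustrative)
    w, prev = d
    return [int(c == "LOCATION" and prev == "in"),
            int(c == "PERSON" and w[:1].isupper())]

def scores(lam, d):
    return {c: sum(l * f for l, f in zip(lam, features(c, d))) for c in CLASSES}

def cond_log_lik(lam):
    total = 0.0
    for c, d in data:
        s = scores(lam, d)
        total += s[c] - math.log(sum(math.exp(v) for v in s.values()))
    return total

def gradient(lam):
    grad = [0.0] * len(lam)
    for c, d in data:
        s = scores(lam, d)
        Z = sum(math.exp(v) for v in s.values())
        for i, f in enumerate(features(c, d)):
            grad[i] += f                                  # empirical (actual) count
        for c2 in CLASSES:
            p = math.exp(s[c2]) / Z
            for i, f in enumerate(features(c2, d)):
                grad[i] -= p * f                          # predicted count
    return grad

lam = [0.0, 0.0]
for _ in range(200):                                      # plain gradient ascent
    lam = [l + 0.5 * g for l, g in zip(lam, gradient(lam))]
print([round(l, 2) for l in lam], round(cond_log_lik(lam), 4))
# On this separable toy data the weights keep growing and the log-likelihood approaches
# 0 from below — the "infinite weights" issue that the smoothing slides address.
```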

Page 41: Discriminative Estimation (Maxent models and perceptron)

A likelihood surface

Page 42: Discriminative Estimation (Maxent models and perceptron)

Finding the optimal parameters

• Use your favorite numerical optimization package….
• Commonly, you minimize the negative of CLogLik

1. Gradient descent (GD); Stochastic gradient descent (SGD)
2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
3. Conjugate gradient (CG), perhaps with preconditioning
4. Quasi-Newton methods – limited memory variable metric (LMVM) methods, in particular, L-BFGS

Page 43: Discriminative Estimation (Maxent models and perceptron)

Gradient Descent (GD)


Page 44: Discriminative Estimation (Maxent models and perceptron)

Maxent Models and Discriminative Estimation

Maximizing the likelihood

Page 45: Discriminative Estimation (Maxent models and perceptron)

Feature Sparsity, Regularization

Combating overfitting

Page 46: Discriminative Estimation (Maxent models and perceptron)

Smoothing: Issues of Scale
• Lots of features:
• NLP maxent models can have well over a million features.
• Even storing a single array of parameter values can have a substantial memory cost.

• Lots of sparsity:
• Overfitting very easy – we need smoothing!
• Many features seen in training will never occur again at test time.

• Optimization problems:
• Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.
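One standard way to address the infinite-weights problem is an L2 penalty, i.e. a Gaussian prior on the weights; the exact regularizer used on the following slides is not preserved in this transcript, so this is a representative choice. A minimal sketch: the likelihood and gradient functions are passed in as callables (such as those in the earlier sketch), and sigma2 is an assumed smoothing hyperparameter.

```python
# Sketch: L2-regularized (Gaussian prior) objective. Maximizing
#   log P(C|D,λ) − Σ_i λ_i² / (2σ²)
# keeps the weights finite and combats overfitting. sigma2 is a smoothing hyperparameter.
def regularized_objective(lam, cond_log_lik, sigma2=1.0):
    penalty = sum(l * l for l in lam) / (2.0 * sigma2)
    return cond_log_lik(lam) - penalty

def regularized_gradient(lam, gradient, sigma2=1.0):
    # each weight is pulled back toward zero by λ_i / σ²
    return [g - l / sigma2 for g, l in zip(gradient(lam), lam)]
```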

Page 47: Discriminative Estimation (Maxent models and perceptron)

Smoothing/Priors/Regularization

Page 48: Discriminative Estimation (Maxent models and perceptron)

Standard vs. Regularized Updates


Page 49: Discriminative Estimation (Maxent models and perceptron)

Feature Sparsity, Regularization

Combating overfitting

Page 50: Discriminative Estimation (Maxent models and perceptron)

Batch vs. Online Learning

GD vs. SGD

Page 51: Discriminative Estimation (Maxent models and perceptron)

Stochastic Gradient Descent (SGD)


Batch vs. Online learning:
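A minimal sketch of the contrast: batch gradient descent computes the gradient over the whole training set before each update, while SGD (online learning) updates after every example. The per-example gradient is passed in as a callable, and the learning rate and epoch count are illustrative assumptions.

```python
# Sketch: batch gradient ascent vs. stochastic (online) gradient ascent on the
# conditional log-likelihood. grad_example(lam, ex) returns one example's gradient.
import random

def batch_gd(lam, data, grad_example, lr=0.1, epochs=50):
    for _ in range(epochs):
        g = [0.0] * len(lam)
        for ex in data:                              # whole dataset per update
            g = [gi + gj for gi, gj in zip(g, grad_example(lam, ex))]
        lam = [l + lr * gi for l, gi in zip(lam, g)]
    return lam

def sgd(lam, data, grad_example, lr=0.1, epochs=50):
    data = list(data)
    for _ in range(epochs):
        random.shuffle(data)                         # online: one example at a time
        for ex in data:
            g = grad_example(lam, ex)
            lam = [l + lr * gi for l, gi in zip(lam, g)]
    return lam
```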

Page 52: Discriminative Estimation (Maxent models and perceptron)

Batch vs. Online Learning

GD vs. SGD

Page 53: Discriminative Estimation (Maxent models and perceptron)

Perceptron

Another online learning algorithm

Page 54: Discriminative Estimation (Maxent models and perceptron)

Perceptron Algorithm

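The body of this slide is not preserved in the transcript, so as a reference point here is a sketch of the standard multiclass perceptron: predict with the current weights, and only when the prediction is wrong, add the feature vector of the correct class and subtract that of the predicted class. This is the common formulation, not necessarily the exact pseudocode on the original slide.

```python
# Sketch of the standard multiclass perceptron (a common formulation, not necessarily
# the exact pseudocode on the original slide).
def perceptron_train(data, classes, feats, n_features, epochs=10):
    lam = [0.0] * n_features
    for _ in range(epochs):
        for gold, d in data:
            # predict with the current weights: argmax_c Σ_i λ_i f_i(c, d)
            pred = max(classes, key=lambda c: sum(l * f for l, f in zip(lam, feats(c, d))))
            if pred != gold:                          # update only on mistakes
                for i, f in enumerate(feats(gold, d)):
                    lam[i] += f
                for i, f in enumerate(feats(pred, d)):
                    lam[i] -= f
    return lam
```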

Page 55: Discriminative Estimation (Maxent models and perceptron)

MaxEnt vs. Perceptron

• Perceptron doesn't always make updates (only on mistakes)
• Probabilities vs. scores


Page 56: Discriminative Estimation (Maxent models and perceptron)

Regularization in the Perceptron Algorithm

• No gradient is computed, so we can't directly include a regularizer in an objective function.
• Instead, run different numbers of iterations
• Use parameter averaging, for instance, the average of all parameters after seeing each data point
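A sketch of parameter averaging: keep a running sum of the weight vector after every data point and return its average at the end. This follows the standard averaged-perceptron formulation; the original slide's exact version is not in the transcript.

```python
# Sketch: averaged perceptron. The weight vector is accumulated after each data point
# (whether or not an update was made) and the average over all steps is returned.
def averaged_perceptron_train(data, classes, feats, n_features, epochs=10):
    lam = [0.0] * n_features
    total = [0.0] * n_features
    steps = 0
    for _ in range(epochs):
        for gold, d in data:
            pred = max(classes, key=lambda c: sum(l * f for l, f in zip(lam, feats(c, d))))
            if pred != gold:
                for i, f in enumerate(feats(gold, d)):
                    lam[i] += f
                for i, f in enumerate(feats(pred, d)):
                    lam[i] -= f
            total = [t + l for t, l in zip(total, lam)]   # average of all parameters ...
            steps += 1                                    # ... after seeing each data point
    return [t / steps for t in total]
```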
