Lecture 3: Structures and Decoding
Outline

1. Structures in NLP
2. HMMs as BNs; the Viterbi algorithm as variable elimination
3. Linear models
4. Five views of decoding
Two Meanings of "Structure"

• Yesterday: structure of a graph for modeling a collection of random variables together.
• Today: linguistic structure.
  – Sequence labelings (POS, IOB chunkings, …)
  – Parse trees (phrase-structure, dependency, …)
  – Alignments (word, phrase, tree, …)
  – Predicate-argument structures
  – Text-to-text (translation, paraphrase, answers, …)
A Useful Abstraction?

• We think so.
• Brings out commonalities:
  – Modeling formalisms (e.g., linear models with features)
  – Learning algorithms (lectures 4–6)
  – Generic inference algorithms
• Permits sharing across a wider space of problems.
• Disadvantage: hides engineering details.
Familiar Example: Hidden Markov Models
Hidden Markov Model

• X and Y are both sequences of symbols.
  – X is a sequence from the vocabulary Σ.
  – Y is a sequence from the state space Λ.
• Parameters:
  – Transitions p(y′ | y), including p(stop | y) and p(y | start).
  – Emissions p(x | y).
Hidden Markov Model

• The joint model's independence assumptions are easy to capture with a Bayesian network.

[Figure: Bayesian network Y0 → Y1 → Y2 → Y3 → … → Yn → stop, with each Yi emitting Xi.]
Hidden Markov Model

• The joint model instantiates dynamic Bayesian networks.

[Figure: template Yi−1 → Yi → Xi, rooted at Y0; the template gets copied as many times as needed.]
Hidden Markov Model

• Given X's value as evidence, the dynamic part becomes unnecessary, since we know n.

[Figure: the chain Y0 → Y1 → Y2 → … → Yn → stop, with each Xi clamped to its observed value xi.]
Hidden Markov Model

• The usual inference problem is to find the most probable value of Y given X = x.

[Figure: the same clamped chain Y0 → Y1 → … → Yn → stop, Xi = xi.]
Hidden Markov Model

• The usual inference problem is to find the most probable value of Y given X = x.
• Factor graph:

[Figure: factor graph over Y0, Y1, …, Yn, stop, with transition factors between adjacent states and emission factors tying each Yi to Xi = xi.]
Hidden Markov Model

• The usual inference problem is to find the most probable value of Y given X = x.
• Factor graph after reducing factors to respect evidence:

[Figure: the reduced factor graph over Y1, Y2, Y3, …, Yn alone.]
Hidden Markov Model

• The usual inference problem is to find the most probable value of Y given X = x.
• A clever elimination ordering should be apparent!

[Figure: the reduced chain Y1 – Y2 – Y3 – … – Yn.]
Hidden Markov Model

• When we eliminate Y1, we take a product of the three relevant factors:
  – p(Y1 | start)
  – η(Y1) = reduced p(x1 | Y1)
  – p(Y2 | Y1)

[Figure: the chain Y1 – Y2 – Y3 – … – Yn.]
Hidden Markov Model

• When we eliminate Y1, we first take a product of the two factors that only involve Y1.

[Figure: two factor tables over Y1, each with rows y1, y2, …, y|Λ|: p(Y1 | start) and η(Y1) = reduced p(x1 | Y1).]
Hidden Markov Model

• When we eliminate Y1, we first take a product of the two factors that only involve Y1.
• This is the Viterbi probability vector for Y1.

[Figure: the resulting single column φ1(Y1), with rows y1, y2, …, y|Λ|.]
Hidden Markov Model

• When we eliminate Y1, we first take a product of the two factors that only involve Y1.
• This is the Viterbi probability vector for Y1.
• Eliminating Y1 equates to solving the Viterbi probabilities for Y2.

[Figure: φ1(Y1) alongside the |Λ| × |Λ| table p(Y2 | Y1).]
Hidden Markov Model

• Take the product of all factors involving Y1, then reduce:
  φ2(Y2) = max_{y ∈ Val(Y1)} φ1(y) × p(Y2 | y)
• This factor holds the Viterbi probabilities for Y2.

[Figure: the shortened chain Y2 – Y3 – … – Yn.]
Hidden Markov Model

• When we eliminate Y2, we take a product of the analogous two relevant factors, then reduce:
  φ3(Y3) = max_{y ∈ Val(Y2)} φ2(y) × p(Y3 | y)

[Figure: the chain shortened again to Y3 – … – Yn.]
Hidden Markov Model

• At the end, we have one final factor with one row, φn+1.
• This is the score of the best sequence.
• Use backtrace to recover values (sketched below).
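To make the elimination concrete, here is a minimal Python sketch of Viterbi as max-product variable elimination. The parameter encoding (dicts `trans` and `emit` keyed by events, with "start" and "stop" as boundary states) is an illustrative assumption, not the lecture's notation.

```python
# A minimal sketch: Viterbi as max-product variable elimination.
# The dict-based parameter encoding here is an assumption.

def viterbi(x, states, trans, emit):
    """Most probable state sequence for observations x.

    trans[(y_prev, y)] = p(y | y_prev); emit[(y, w)] = p(w | y).
    """
    # phi plays the role of phi_i(Y_i): the factor produced when the
    # previous variable is eliminated.  back[i][y] stores the argmax.
    phi = {y: trans[("start", y)] * emit[(y, x[0])] for y in states}
    back = []
    for w in x[1:]:
        prev, phi, ptr = phi, {}, {}
        for y in states:
            # Eliminate the previous variable: maximize its Viterbi
            # factor times the transition into y, then multiply in the
            # reduced emission factor for the evidence w.
            best = max(states, key=lambda yp: prev[yp] * trans[(yp, y)])
            phi[y] = prev[best] * trans[(best, y)] * emit[(y, w)]
            ptr[y] = best
        back.append(ptr)
    # The final one-row factor phi_{n+1}: fold in stop probabilities.
    last = max(states, key=lambda y: phi[y] * trans[(y, "stop")])
    seq = [last]
    for ptr in reversed(back):          # backtrace to recover values
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))
```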
Why Think This Way?

• Easy to see how to generalize HMMs:
  – More evidence
  – More factors
  – More hidden structure
  – More dependencies
• A probabilistic interpretation of the factors is not central to finding the "best" Y…
  – Many factors are not conditional probability tables.
Generalization Example 1

• Each word also depends on the previous state.

[Figure: chain over Y1, …, Y5, with each word Xi depending on both Yi and Yi−1.]
Generalization Example 2

• "Trigram" HMM

[Figure: chain over Y1, …, Y5, with each state Yi depending on the two previous states; words X1, …, X5 as before.]
Generalization Example 3

• Aggregate bigram model (Saul and Pereira, 1997)

[Figure: network over words X1, …, X5 and latent classes Y1, …, Y5; each word generates a class, which generates the next word.]
General Decoding Problem

• Two structured random variables, X and Y.
  – Sometimes described as collections of random variables.
• "Decode" the observed value X = x into some value of Y.
• Usually, we seek to maximize some score.
  – E.g., MAP inference from yesterday.
Linear Models

• Define a feature vector function g that maps (x, y) pairs into d-dimensional real space.
• The score is linear in g(x, y): score(x, y) = w⊤g(x, y) for a weight vector w ∈ ℝᵈ.
• Results:
  – Decoding seeks y to maximize the score.
  – Learning seeks w to… do something we'll talk about later.
• Extremely general! (A minimal sketch follows.)
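As a sketch of how general the template is, the brute-force decoder below just enumerates candidate outputs; `g`, `w`, and the candidate generator are hypothetical placeholders. Real decoders exploit structure in g rather than enumerating, which is what the rest of the lecture is about.

```python
import numpy as np

# Hedged sketch of the linear-model template; g, w, and candidates()
# are hypothetical placeholders, not part of the lecture's notation.

def decode(x, candidates, g, w):
    """Brute-force linear decoding: argmax over y of w . g(x, y)."""
    return max(candidates(x), key=lambda y: float(np.dot(w, g(x, y))))
```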
Generic Noisy Channel as Linear Model

• Decoding maximizes p(y) p(x | y), equivalently log p(y) + log p(x | y): a linear score whose weights are log probabilities.
• Of course, the two probability terms are typically composed of "smaller" factors; each can be understood as an exponentiated weight.
MaxEnt Models as Linear Models

• p(y | x) = exp(w⊤g(x, y)) / Z(x); since Z(x) does not depend on y, decoding reduces to maximizing w⊤g(x, y).
HMMs as Linear Models

• log p(x, y) is a sum of log transition and log emission probabilities, one term per event: a linear score w⊤g(x, y) in which g counts events and w holds their log probabilities (sketched below).
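A sketch of that instantiation, assuming sparse feature counts keyed by event (the ("trans", …)/("emit", …) key names are hypothetical):

```python
from collections import Counter

# Sketch: an HMM's log joint probability as a linear score.  g counts
# transition and emission events; w holds their log probabilities.

def g(x, y):
    """Feature vector: one count per transition or emission event."""
    feats = Counter()
    for y_prev, y_cur in zip(["start"] + y, y + ["stop"]):
        feats[("trans", y_prev, y_cur)] += 1
    for y_cur, w_cur in zip(y, x):
        feats[("emit", y_cur, w_cur)] += 1
    return feats

def score(x, y, w):
    """w . g(x, y), which equals log p(x, y) for log-probability w."""
    return sum(w[f] * c for f, c in g(x, y).items())
```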
Running Example

• IOB sequence labeling, here applied to NER: tag each word B (begins an entity), I (inside an entity), or O (outside), e.g., Barack/B Obama/I visited/O Baghdad/B ./O
• Often solved with HMMs, CRFs, M3Ns, …
(What is Not a Linear Model?)

• Models with hidden variables
• Models based on non-linear kernels
Decoding

• For HMMs, the decoding algorithm we usually think of first is the Viterbi algorithm.
  – This is just one example.
• We will view decoding in five different ways.
  – Sequence models as a running example.
  – These views are not just for HMMs.
  – Sometimes they will lead us back to Viterbi!
Five Views of Decoding
1. Probabilistic Graphical Models

• View the linguistic structure as a collection of random variables that are interdependent.
• Represent the interdependencies as a directed or undirected graphical model.
• Conditional probability tables (BNs) or factors (MNs) encode the probability distribution.
Inference in Graphical Models

• General algorithm for exact MAP inference: variable elimination.
  – Iteratively solve for the best values of each variable conditioned on values of "preceding" neighbors.
  – Then trace back.

The Viterbi algorithm is an instance of max-product variable elimination!
MAP is Linear Decoding

• Bayesian network: log p(x, y) is a sum of log conditional probabilities, so MAP inference maximizes a linear score whose weights are log CPT entries.
• Markov network: the log of the unnormalized probability is a sum of log factor values, again a linear score.
• This only works if every variable is in X or Y.
Inference in Graphical Models

• Remember: more edges make inference more expensive.
  – Fewer edges means stronger independence.
• Really pleasant:

[Figure: a sparse, chain-structured graphical model.]
Inference in Graphical Models

• Remember: more edges make inference more expensive.
  – Fewer edges means stronger independence.
• Really unpleasant:

[Figure: a densely connected graphical model.]
2. Polytopes
"Parts"

• Assume that the feature function g breaks down into local parts.
• Each part has an alphabet of possible values.
  – Decoding is choosing values for all parts, with consistency constraints.
  – (In the graphical models view, a part is a clique.)
Example

• One part per word, each in {B, I, O}.
• No features look at multiple parts.
  – Fast inference
  – Not very expressive
Example

• One part per bigram, each in {BB, BI, BO, IB, II, IO, OB, OO}.
• Features and constraints can look at pairs.
  – Slower inference
  – A bit more expressive
Geometric View

• Let z_{i,π} be 1 if part i takes value π, and 0 otherwise.
• z is a vector in {0,1}^N (construction sketched below).
  – N = total number of localized part values.
  – Each z is a vertex of the unit cube.
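A small sketch of the encoding, assuming the bigram parts and the eight allowed BIO bigram values from the earlier example:

```python
# Sketch of the 0/1 encoding z for bigram parts over BIO labels.
# The part alphabet matches the earlier example (OI is disallowed).

PART_VALUES = ["BB", "BI", "BO", "IB", "II", "IO", "OB", "OO"]

def to_z(y):
    """Map a label sequence y to its indicator vector z in {0,1}^N."""
    z = []
    for i in range(len(y) - 1):                 # one part per bigram
        v = y[i] + y[i + 1]
        z.extend(1 if u == v else 0 for u in PART_VALUES)
    return z

# Each valid y is a vertex of the unit cube; here N = 2 * 8 = 16.
print(to_z(["B", "I", "O"]))
```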
Score is Linear in z

• w⊤g(x, y) = θ⊤z, where θ collects the local score of each part value.
  – Not really equal; we need to transform z back to get y.
Polyhedra

• Not all vertices of the N-dimensional unit cube satisfy the constraints.
  – E.g., we can't have z_{1,BI} = 1 and z_{2,BI} = 1: the shared middle word would have to be labeled both I and B.
• Sometimes we can write down a small (polynomial) number of linear constraints on z.
• Result: a linear objective, linear constraints, and integer constraints (sketched below).
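In math form, and assuming the bigram-part decomposition above, the resulting program looks roughly like this (θ collects the local scores; π₁ and π₂ denote a bigram value's first and second label):

```latex
\begin{aligned}
\max_{z}\quad & \theta^\top z \\
\text{s.t.}\quad
  & \textstyle\sum_{\pi} z_{i,\pi} = 1 \quad \forall i
      && \text{(each part takes exactly one value)} \\
  & \textstyle\sum_{\pi :\, \pi_2 = \ell} z_{i,\pi}
    \;=\; \textstyle\sum_{\pi' :\, \pi'_1 = \ell} z_{i+1,\pi'}
    \quad \forall i, \ell
      && \text{(adjacent bigrams agree on the shared label)} \\
  & z \in \{0,1\}^N .
\end{aligned}
```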
Integer Linear Programming

• Very easy to add new constraints and non-local features.
• Many decoding problems have been mapped to ILP (sequence labeling, parsing, …), but it's not always trivial.
• NP-hard in general.
  – But there are packages that often work well in practice (e.g., CPLEX).
  – Specialized algorithms in some cases.
  – LP relaxation for approximate solutions.
Remark

• Graphical models assumed a probabilistic interpretation.
  – Though they are not always learned using a probabilistic interpretation!
• The polytope view is agnostic about how you interpret the weights.
  – It only says that the decoding problem is an ILP.
3. Weighted Parsing
Grammars

• Grammars are often associated with natural language parsing, but they are extremely powerful for imposing constraints.
• We can add weights to them.
  – HMMs are a kind of weighted regular grammar (closely connected to WFSAs).
  – PCFGs are a kind of weighted CFG.
  – Many, many more.
• Weighted parsing: find the maximum-weighted derivation for a string x.
Decoding as Weighted Parsing

• Every valid y is a grammatical derivation (parse) for x.
  – HMM: a sequence of "grammatical" states is one allowed by the transition table.
• Augment parsing algorithms with weights and find the best parse.

The Viterbi algorithm is an instance of recognition by a weighted grammar!
BIO Tagging as a CFG

• Weighted (or probabilistic) CKY is a dynamic programming algorithm very similar in structure to classical CKY (sketched below).
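A hedged Python sketch of weighted CKY for a grammar in Chomsky normal form; the rule encoding is an assumption, and backpointers (omitted for brevity) would recover the best parse itself.

```python
# Sketch of weighted CKY; the dict-based grammar encoding is assumed.

def weighted_cky(words, binary, lexical, root="S"):
    """Best-derivation weight for words under a CNF grammar.

    lexical[(A, w)]   = weight of rule A -> w
    binary[(A, B, C)] = weight of rule A -> B C
    Weights multiply along a derivation; we maximize.
    """
    n = len(words)
    chart = {}  # (i, j, A) -> best weight of an A spanning words[i:j]
    for i, w in enumerate(words):
        for (A, word), wt in lexical.items():
            if word == w:
                chart[(i, i + 1, A)] = max(chart.get((i, i + 1, A), 0.0), wt)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # split point
                for (A, B, C), wt in binary.items():
                    left = chart.get((i, k, B), 0.0)
                    right = chart.get((k, j, C), 0.0)
                    if left and right:
                        cand = wt * left * right
                        if cand > chart.get((i, j, A), 0.0):
                            chart[(i, j, A)] = cand
    return chart.get((0, n, root), 0.0)
```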
4. Paths and Hyperpaths
Best Path

• General idea: take x and build a graph.
• The score of a path factors into the edges.
• Decoding is finding the best path.

The Viterbi algorithm is an instance of finding a best path!
"Lattice" View of Viterbi
Minimum Cost Hyperpath

• General idea: take x and build a hypergraph.
• The score of a hyperpath factors into the hyperedges.
• Decoding is finding the best hyperpath.
• This connection was elucidated by Klein and Manning (2002).
Parsing as a Hypergraph
Parsing as a Hypergraph

cf. "Dean for democracy"
Parsing as a Hypergraph

"Forced to work on his thesis, sunshine streaming in the window, Mike experienced a …"
Parsing as a Hypergraph

"Forced to work on his thesis, sunshine streaming in the window, Mike began to …"
Why Hypergraphs?

• A useful, compact encoding of the hypothesis space.
  – Build the hypothesis space using local features, maybe do some filtering.
  – Pass it off to another module for more fine-grained scoring with richer or more expensive features.
5. Weighted Logic Programming
Logic Programming

• Start with a set of axioms and a set of inference rules.
• The goal is to prove a specific theorem, goal.
• Many approaches, but we assume a deductive approach.
  – Start with axioms, iteratively produce more theorems.
Weighted Logic Programming

• Twist: axioms have weights.
• We want the proof of goal with the best score, i.e., the proof maximizing the combined weight of the axioms it uses (sketched below).
• Note that axioms can be used more than once in a proof (y).
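For instance, Viterbi decoding can be written as a small weighted logic program; this Dyna-style sketch (my rendering, not the lecture's) uses ⊕ = max and ⊗ = ×, with the transition, emission, and start weights as axioms. Swapping in ⊕ = + would compute the total probability instead, in the spirit of Goodman's semiring view cited on the next slide.

```latex
\begin{aligned}
\textit{path}(y, 1) \;&\oplus{=}\; p(y \mid \textrm{start}) \otimes p(x_1 \mid y) \\
\textit{path}(y, i) \;&\oplus{=}\; \textit{path}(y', i-1) \otimes p(y \mid y') \otimes p(x_i \mid y) \\
\textit{goal} \;&\oplus{=}\; \textit{path}(y, n) \otimes p(\textrm{stop} \mid y)
\end{aligned}
```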
Whence WLP?

• Shieber, Schabes, and Pereira (1995): many parsing algorithms can be understood in the same deductive logic framework.
• Goodman (1999): add weights, get many useful NLP algorithms.
• Eisner, Goldlust, and Smith (2004, 2005): semiring-generic algorithms, Dyna.
Dynamic Programming

• Most views (the exception is polytopes) can be understood as DP algorithms.
  – The low-level procedures we use are often DP.
  – Even DP is too high-level to know the best way to implement.
• DP does not imply polynomial time and space!
  – The most common approximations when the desired state space is too big: beam search (sketched below), cube pruning, agendas with early stopping, …
  – Other views suggest others.
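As an illustration of the first of these approximations, a minimal beam-search sketch for HMM-style scores; the parameter encoding matches the hypothetical one used in the earlier Viterbi sketch.

```python
import heapq

# Hedged sketch of beam search for sequence labeling: keep only the
# k best partial hypotheses per position instead of the full DP table.

def beam_search(x, states, trans, emit, k=5):
    """Approximate argmax sequence under an HMM-style score."""
    beam = heapq.nlargest(
        k, [(trans[("start", y)] * emit[(y, x[0])], [y]) for y in states])
    for w in x[1:]:
        cands = [(s * trans[(seq[-1], y)] * emit[(y, w)], seq + [y])
                 for s, seq in beam for y in states]
        beam = heapq.nlargest(k, cands)       # prune to the beam width
    _, seq = max((s * trans[(seq[-1], "stop")], seq) for s, seq in beam)
    return seq
```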
Summary

• Decoding is the general problem of choosing a complex structure.
  – Linguistic analysis, machine translation, speech recognition, …
  – Statistical models are usually involved (not necessarily probabilistic).
• No perfect general view, but much can be gained through a combination of views.