Algorithms for NLP (CMU 11-711, Fall 2017), Lecture 10: Parsing II
Tagging / Parsing. Taylor Berg-Kirkpatrick (CMU). Slides: Dan Klein (UC Berkeley). Algorithms for NLP.


Page 1: Algorithms for NLP - cs.cmu.edutbergkir/11711fa17/FA17 11-711 lecture 10 -- parsing II.pdf · Global Discriminative Taggers §Newer, higher-powered discriminative sequence models

Tagging / Parsing
Taylor Berg-Kirkpatrick, CMU

Slides: Dan Klein, UC Berkeley

Algorithms for NLP

Page 2:

So How Well Does It Work?
§ Choose the most common tag:
  § 90.3% with a bad unknown word model
  § 93.7% with a good one
§ TnT (Brants, 2000):
  § A carefully smoothed trigram tagger
  § Suffix trees for emissions
  § 96.7% on WSJ text (SOTA is 97+%)
§ Noise in the data:
  § Many errors in the training and test corpora
  § Probably about 2% guaranteed error from noise (on this data)

Possible taggings of the same phrase:
NN NN NN    chief executive officer
JJ NN NN    chief executive officer
JJ JJ NN    chief executive officer
NN JJ NN    chief executive officer

DT  NN      IN NN        VBD     NNS   VBD
The average of interbank offered rates plummeted …
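The "choose the most common tag" baseline from this slide is easy to sketch. A minimal version (toy data and function names are my own, not from the slides; unknown words fall back to the globally most common tag, i.e. a deliberately bad unknown-word model):

```python
from collections import Counter, defaultdict

def train_most_frequent(tagged_sents):
    """Count (word, tag) pairs and the overall tag distribution."""
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            word_tags[word][tag] += 1
            all_tags[tag] += 1
    return word_tags, all_tags

def tag(sent, word_tags, all_tags):
    """Tag each word with its most frequent training tag;
    unknown words get the globally most common tag."""
    default = all_tags.most_common(1)[0][0]
    return [word_tags[w].most_common(1)[0][0] if w in word_tags else default
            for w in sent]

train = [[("the", "DT"), ("can", "NN"), ("was", "VBD"), ("rusted", "VBN")],
         [("the", "DT"), ("average", "NN"), ("plummeted", "VBD")]]
wt, at = train_most_frequent(train)
print(tag(["the", "can", "plummeted"], wt, at))  # ['DT', 'NN', 'VBD']
```

On real WSJ data this is roughly the ~90% baseline the slide quotes; the interesting work is all in the unknown-word model.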

Page 3:

Overview: Accuracies
§ Roadmap of (known / unknown) accuracies:
  § Most freq tag:   ~90% / ~50%
  § Trigram HMM:     ~95% / ~55%
  § TnT (HMM++):     96.2% / 86.0%
  § Maxent P(t|w):   93.7% / 82.6%
  § MEMM tagger:     96.9% / 86.9%
  § State-of-the-art: 97+% / 89+%
  § Upper bound:     ~98%

Most errors are on unknown words.

Page 4:

Common Errors
§ Common errors [from Toutanova & Manning 00]:

NN/JJ NN          official knowledge
VBD RP/IN DT NN   made up the story
RB VBD/VBN NNS    recently sold shares

Page 5:

Richer Features

Page 6:

Better Features
§ Can do surprisingly well just looking at a word by itself:
  § Word:            the: the → DT
  § Lowercased word: Importantly: importantly → RB
  § Prefixes:        unfathomable: un- → JJ
  § Suffixes:        Surprisingly: -ly → RB
  § Capitalization:  Meridian: CAP → NNP
  § Word shapes:     35-year: d-x → JJ
§ Then build a maxent (or whatever) model to predict the tag
§ Maxent P(t|w): 93.7% / 82.6%

[Figure: tag s3 predicted from word w3 alone.]

Page 7:

Why Local Context is Useful
§ Lots of rich local information!
§ We could fix this with a feature that looked at the next word
§ We could fix this by linking capitalized words to their lowercase versions
§ Solution: discriminative sequence models (MEMMs, CRFs)
§ Reality check:
  § Taggers are already pretty good on newswire text…
  § What the world needs is taggers that work on other text!

PRP  VBD  IN RB   IN PRP VBD .
They left as soon as he  arrived .   (the first "as" should be RB)

NNP       NNS   VBD      VBN .
Intrinsic flaws remained undetected .   ("Intrinsic" should be JJ)

Page 8:

Sequence-Free Tagging?
§ What about looking at a word and its environment, but no sequence information?
  § Add in previous / next word:    the __
  § Previous / next word shapes:    X __ X
  § Crude entity detection:         __ ….. (Inc.|Co.)
  § Phrasal verb in sentence?       put …… __
  § Conjunctions of these things
§ All features except sequence: 96.6% / 86.8%
§ Uses lots of features: >200K

[Figure: tag t3 predicted from words w2, w3, w4.]

Page 9:

Named Entity Recognition
§ Other sequence tasks use similar models
§ Example: named entity recognition (NER)

Local Context:
        Prev   Cur    Next
State   ???    LOC    ???
Word    at     Grace  Road
Tag     IN     NNP    NNP
Sig     x      Xx     Xx

Tim Boon has signed a contract extension with Leicestershire which will keep him at Grace Road .
PER PER  O   O      O O        O         O    ORG            O     O    O    O   O  LOC   LOC  O

Page 10:

MEMM Taggers
§ Idea: left-to-right local decisions, condition on previous tags and also the entire input
§ Train up P(t_i | w, t_{i-1}, t_{i-2}) as a normal maxent model, then use it to score sequences
§ This is referred to as an MEMM tagger [Ratnaparkhi 96]
§ Beam search is effective! (Why?)
§ What about beam size 1?
§ Subtle issues with local normalization (cf. Lafferty et al. 01)
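The left-to-right decisions with a beam can be sketched as follows. The local scoring function here is a hypothetical stand-in for a trained maxent P(t_i | w, t_{i-1}); the toy score table and all names are illustrative, not from the slides:

```python
def beam_decode(words, tags, local_score, beam_size=2):
    """Left-to-right beam search for an MEMM-style tagger.
    local_score(words, i, prev_tag, tag) stands in for P(t_i | w, t_{i-1})."""
    beam = [((), 1.0)]  # (partial tag sequence, score)
    for i in range(len(words)):
        candidates = []
        for seq, score in beam:
            prev = seq[-1] if seq else "<s>"
            for t in tags:
                candidates.append((seq + (t,), score * local_score(words, i, prev, t)))
        beam = sorted(candidates, key=lambda c: -c[1])[:beam_size]
    return list(beam[0][0])

# Toy local model: determiners start sentences, nouns follow determiners,
# verbs follow nouns; everything else gets a small floor probability.
def toy_score(words, i, prev, t):
    table = {("<s>", "DT"): 0.9, ("DT", "NN"): 0.9, ("NN", "VBD"): 0.8}
    return table.get((prev, t), 0.05)

print(beam_decode(["the", "can", "rusted"], ["DT", "NN", "VBD"], toy_score))
# ['DT', 'NN', 'VBD']
```

Beam size 1 makes this a purely greedy tagger, which is the degenerate case the slide asks about.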

Page 11:

NER Features

Feature type           Feature    PERS    LOC
Previous word          at         -0.73    0.94
Current word           Grace       0.03    0.00
First char of word     G           0.45   -0.04
Current POS tag        NNP         0.47    0.45
Prev and cur tags      IN NNP     -0.10    0.14
Previous state         Other      -0.70   -0.92
Current signature      Xx          0.80    0.46
Prev state, cur sig    O-Xx        0.68    0.37
Prev-cur-next sig      x-Xx-Xx    -0.69    0.37
P. state - p-cur sig   O-x-Xx     -0.20    0.82
…
Total:                            -0.58    2.68

Local Context:
        Prev   Cur    Next
State   Other  LOC    ???
Word    at     Grace  Road
Tag     IN     NNP    NNP
Sig     x      Xx     Xx

Feature Weights: because of the regularization term, the more common prefixes have larger weights even though entire-word features are more specific.

Page 12:

Decoding
§ Decoding MEMM taggers:
  § Just like decoding HMMs, but with different local scores
  § Viterbi, beam search, posterior decoding
§ Viterbi algorithm (HMMs):   v_i(s) = max_{s'} v_{i-1}(s') · P(s | s') · P(w_i | s)
§ Viterbi algorithm (MEMMs):  v_i(s) = max_{s'} v_{i-1}(s') · P(s | s', w)
§ General:                    v_i(s) = max_{s'} v_{i-1}(s') · φ_i(s', s)
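The general recurrence, with any local score plugged in, can be written once and reused for HMMs and MEMMs. A minimal sketch (toy transition/emission tables and all names are my own, not the slides'):

```python
def viterbi(words, states, start, local_score):
    """Generic Viterbi: local_score(i, prev, s) plays the role of
    P(s|s')·P(w_i|s) for an HMM, or a conditional local score for an MEMM."""
    best = {start: (1.0, [])}          # state -> (best score, best path)
    for i in range(len(words)):
        new = {}
        for s in states:
            p, path = max(((q * local_score(i, sp, s), path)
                           for sp, (q, path) in best.items()),
                          key=lambda c: c[0])
            new[s] = (p, path + [s])
        best = new
    return max(best.values(), key=lambda c: c[0])[1]

# Toy HMM over two states for "Fed raises".
trans = {("^", "N"): 0.8, ("^", "V"): 0.2, ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}
emit = {("N", "Fed"): 0.9, ("V", "Fed"): 0.1,
        ("N", "raises"): 0.3, ("V", "raises"): 0.7}
sent = ["Fed", "raises"]
print(viterbi(sent, ["N", "V"], "^",
              lambda i, sp, s: trans[(sp, s)] * emit[(s, sent[i])]))
# ['N', 'V']
```

Switching to an MEMM only changes the `local_score` closure; the dynamic program is identical, which is the point of the "General" line above.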

Page 13:

Conditional Random Fields (and Friends)

Page 14:

Maximum Entropy II
§ Remember: the maximum entropy objective
§ Problem: lots of features allow a perfect fit to the training set
§ Regularization (compare to smoothing)

Page 15:

Derivative for Maximum Entropy

∂L/∂λ_n = Σ_i f_n(x_i, y_i)  −  Σ_i Σ_y P(y | x_i) f_n(x_i, y)  −  λ_n / σ²

(total count of feature n in correct candidates) − (expected count of feature n in predicted candidates) − (regularization: big weights are bad)

Page 16:

Perceptron Review

Page 17:

Perceptron
§ Linear models:
  § … that decompose along the sequence
  § … allow us to predict with the Viterbi algorithm
  § … which means we can train with the perceptron algorithm (or related updates, like MIRA)

[Collins 01]
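The structured perceptron update in [Collins 01] is: decode with Viterbi, then add the features of the gold sequence and subtract the features of the guessed sequence. A minimal sketch (the feature function and names are illustrative assumptions):

```python
from collections import Counter

def perceptron_update(weights, features, gold_tags, guess_tags, sentence):
    """One structured-perceptron step: w += f(gold) - f(guess).
    features(sentence, i, prev, cur) returns a list of feature names."""
    def seq_feats(tags):
        counts, prev = Counter(), "<s>"
        for i, t in enumerate(tags):
            counts.update(features(sentence, i, prev, t))
            prev = t
        return counts
    gold, guess = seq_feats(gold_tags), seq_feats(guess_tags)
    # Counter subtraction drops negatives, so apply both directions.
    for f, c in (gold - guess).items():
        weights[f] = weights.get(f, 0.0) + c
    for f, c in (guess - gold).items():
        weights[f] = weights.get(f, 0.0) - c
    return weights

feats = lambda s, i, prev, cur: ["tag=" + cur,
                                 "bigram=%s,%s" % (prev, cur),
                                 "word_tag=%s,%s" % (s[i], cur)]
w = perceptron_update({}, feats, ["DT", "NN"], ["DT", "VB"], ["the", "can"])
print(w["tag=NN"], w["tag=VB"])  # 1.0 -1.0
```

Features shared by gold and guess (here everything involving the correct DT) cancel and are never touched, which is what makes the update cheap.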

Page 18:

Conditional Random Fields
§ Make a maxent model over entire taggings
§ MEMM:  P(t | w) = Π_i P(t_i | t_{i-1}, w)
§ CRF:   P(t | w) = (1 / Z(w)) · Π_i exp( λ · f(t_{i-1}, t_i, w, i) )

Page 19:

CRFs
§ Like any maxent model, the derivative is: observed feature counts minus expected feature counts
§ So all we need is to be able to compute the expectation of each feature under the model distribution (for example, the number of times the label pair DT-NN occurs, or the number of times NN-interest occurs)
§ Critical quantity: counts of posterior marginals, P(t_i = s | w)

Page 20:

Computing Posterior Marginals
§ How many (expected) times is word w tagged with s?
§ How to compute that marginal?

[State lattice figure: a column of states {^, N, V, J, D, $} at each position of "START Fed raises interest rates END".]

Page 21:

Global Discriminative Taggers
§ Newer, higher-powered discriminative sequence models
  § CRFs (also perceptrons, M3Ns)
  § Do not decompose training into independent local regions
  § Can be deathly slow to train: they require repeated inference on the training set
§ Differences tend not to be too important for POS tagging
§ Differences are more substantial on other sequence tasks
§ However: one issue worth knowing about in local models
  § "Label bias" and other explaining-away effects
  § MEMM taggers' local scores can be near one without having both good "transitions" and "emissions"
  § This means that often evidence doesn't flow properly
  § Why isn't this a big deal for POS tagging?
  § Also: in decoding, condition on predicted, not gold, histories

Page 22:

Transformation-Based Learning
§ [Brill 95] presents a transformation-based tagger
§ Label the training set with the most frequent tags:

DT  MD  VBD VBD    .
The can was rusted .

§ Add transformation rules which reduce training mistakes:
  § MD → NN : DT __
  § VBD → VBN : VBD __ .
§ Stop when no transformations do sufficient good
§ Does this remind anyone of anything?
§ Probably the most widely used tagger (esp. outside NLP)
§ … but definitely not the most accurate: 96.6% / 82.0%
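Applying the learned transformations is the simple half of Brill tagging (the expensive half is searching for the rules). A minimal sketch of rule application, with the two rules from this slide in a simplified form (context predicates and names are my own):

```python
def apply_transformations(tags, words, rules):
    """Apply Brill-style transformation rules in order, left to right.
    Each rule is (from_tag, to_tag, context) where context(i, words, tags)
    says whether the trigger environment holds."""
    tags = list(tags)
    for frm, to, context in rules:
        for i in range(len(tags)):
            if tags[i] == frm and context(i, words, tags):
                tags[i] = to
    return tags

# The slide's two rules, simplified to left-context triggers:
# MD -> NN after DT, and VBD -> VBN after VBD.
rules = [
    ("MD", "NN", lambda i, w, t: i > 0 and t[i - 1] == "DT"),
    ("VBD", "VBN", lambda i, w, t: i > 0 and t[i - 1] == "VBD"),
]
words = ["The", "can", "was", "rusted", "."]
tagged = apply_transformations(["DT", "MD", "VBD", "VBD", "."], words, rules)
print(tagged)  # ['DT', 'NN', 'VBD', 'VBN', '.']
```

Each rule patches the residual errors of the rules before it, which is why the slide hints this should remind you of something (boosting-style greedy error reduction).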

Page 23:

Learned Transformations
§ What gets learned? [from Brill 95]

Page 24:

EngCG Tagger
§ English constraint grammar tagger
§ [Tapanainen and Voutilainen 94]
§ Something else you should know about
§ Hand-written and knowledge-driven
§ "Don't guess if you know" (a general point about modeling more structure!)
§ Tagset doesn't make all of the hard distinctions of the standard tagset (e.g. JJ/NN)
§ They get stellar accuracies: 99% on their tagset
§ Linguistic representation matters…
§ … but it's easier to win when you make up the rules

Page 25:

Domain Effects
§ Accuracies degrade outside of domain
  § Up to triple the error rate
  § Usually make the most errors on the things you care about in the domain (e.g. protein names)
§ Open questions:
  § How to effectively exploit unlabeled data from a new domain (what could we gain?)
  § How to best incorporate domain lexica in a principled way (e.g. UMLS specialist lexicon, ontologies)

Page 26:

Unsupervised Tagging

Page 27:

Unsupervised Tagging?
§ AKA part-of-speech induction
§ Task:
  § Raw sentences in
  § Tagged sentences out
§ Obvious thing to do:
  § Start with a (mostly) uniform HMM
  § Run EM
  § Inspect results

Page 28:

EM for HMMs: Process
§ Alternate between recomputing distributions over hidden variables (the tags) and re-estimating parameters
§ Crucial step: we want to tally up how many (fractional) counts of each kind of transition and emission we have under the current params
§ Same quantities we needed to train a CRF!

Page 29:

EM for HMMs: Quantities
§ Total path values (correspond to probabilities here)

Page 30:

The State Lattice / Trellis

[State lattice figure: a column of states {^, N, V, J, D, $} at each position of "START Fed raises interest rates END".]

Page 31:

EM for HMMs: Process
§ From these quantities, we can compute expected transitions
§ And emissions

Page 32:

Merialdo: Setup
§ Some (discouraging) experiments [Merialdo 94]
§ Setup:
  § You know the set of allowable tags for each word
  § Fix k training examples to their true labels
    § Learn P(w | t) on these examples
    § Learn P(t | t-1, t-2) on these examples
  § On n examples, re-estimate with EM
§ Note: we know the allowed tags but not their frequencies

Page 33:

Merialdo: Results

Page 34:

Distributional Clustering

president   the __ of ; the __ said
governor    the __ of ; the __ appointed
said        sources __ ¨ ; president __ that
reported    sources __ ¨

Clusters: {president, governor}, {said, reported}, {the, a}

¨ the president said that the downturn was over ¨

[Finch and Chater 92, Schütze 93, many others]

Page 35:

Distributional Clustering
§ Three main variants on the same idea:
  § Pairwise similarities and heuristic clustering
    § E.g. [Finch and Chater 92]
    § Produces dendrograms
  § Vector space methods
    § E.g. [Schütze 93]
    § Models of ambiguity
  § Probabilistic methods
    § Various formulations, e.g. [Lee and Pereira 99]

Page 36:

Nearest Neighbors

Page 37:

Dendrograms

Page 38:

Dendrograms

Page 39:

Vector Space Version
§ [Schütze 93] clusters words as points in R^n
§ Vectors too sparse; use SVD to reduce:

[Figure: the word-by-context count matrix M factored by SVD as U S V; each word w keeps its row of the reduced representation.]

Cluster these 50-200 dim vectors instead.
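The SVD reduction step can be sketched in a few lines. The toy count matrix below is invented for illustration (two words that share "the __ of"-style contexts, two that share "sources __"-style contexts); it is not data from the slides:

```python
import numpy as np

# Toy word-by-context count matrix M (rows: words, cols: context types).
words = ["president", "governor", "said", "reported"]
M = np.array([[5.0, 3.0, 0.0, 0.0],
              [4.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 6.0, 4.0],
              [0.0, 0.0, 5.0, 3.0]])

# SVD: M = U S V^T; keep the top-k dimensions and cluster rows of U[:, :k] * S[:k].
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
reduced = U[:, :k] * S[:k]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with similar context distributions land near each other in the
# reduced space, so a standard clusterer can group them.
print(cos(reduced[0], reduced[1]) > abs(cos(reduced[0], reduced[2])))  # True
```

In Schütze's setting the full matrix is huge and sparse; the point of the SVD is that the 50-200 dense dimensions are cheap to cluster and smooth over unseen contexts.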

Page 40:

A Probabilistic Version?

P(S, C) = Π_i P(c_i | c_{i-1}) · P(w_i | c_i)

P(S, C) = Π_i P(c_i) · P(w_i | c_i) · P(w_{i-1}, w_{i+1} | c_i)

¨ the president said that the downturn was over ¨
   c1  c2        c3   c4   c5  c6       c7  c8

(the same sentence, with a class c_i per word, illustrates both models)

Page 41:

What Else?
§ Various newer ideas:
  § Context distributional clustering [Clark 00]
  § Morphology-driven models [Clark 03]
  § Contrastive estimation [Smith and Eisner 05]
  § Feature-rich induction [Haghighi and Klein 06]
§ Also:
  § What about ambiguous words?
  § Using wider context signatures has been used for learning synonyms (what's wrong with this approach?)
  § Can extend these ideas for grammar induction (later)

Page 42:

Computing Marginals

P(t_i = s | w) = (sum of all paths through s at i) / (sum of all paths)

Page 43:

Forward Scores

α_i(s) = Σ_{s'} α_{i-1}(s') · P(s | s') · P(w_i | s)

Page 44:

Backward Scores

β_i(s) = Σ_{s'} P(s' | s) · P(w_{i+1} | s') · β_{i+1}(s')

Page 45:

Total Scores

P(w) = Σ_s α_i(s) · β_i(s),  for any position i
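The forward, backward, and marginal computations on these slides fit in one short function. A minimal sketch for an HMM with explicit start/stop transitions (the toy parameters are invented for illustration):

```python
def forward_backward(words, states, trans, emit, start="^", stop="$"):
    """Posterior marginals P(tag at i = s | words) for an HMM:
    alpha[i][s] = total score of paths into (i, s) including emission at i,
    beta[i][s]  = total score of continuations out of (i, s),
    marginal    = alpha * beta / total score of all paths."""
    n = len(words)
    alpha = [{} for _ in range(n)]
    for i in range(n):
        for s in states:
            inc = (sum(alpha[i-1][sp] * trans[(sp, s)] for sp in states)
                   if i > 0 else trans[(start, s)])
            alpha[i][s] = inc * emit[(s, words[i])]
    beta = [{} for _ in range(n)]
    for s in states:
        beta[n-1][s] = trans[(s, stop)]
    for i in range(n - 2, -1, -1):
        for s in states:
            beta[i][s] = sum(trans[(s, sn)] * emit[(sn, words[i+1])] * beta[i+1][sn]
                             for sn in states)
    total = sum(alpha[n-1][s] * beta[n-1][s] for s in states)
    return [{s: alpha[i][s] * beta[i][s] / total for s in states}
            for i in range(n)]

trans = {("^", "N"): 0.8, ("^", "V"): 0.2, ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4, ("N", "$"): 0.5, ("V", "$"): 0.5}
emit = {("N", "Fed"): 0.9, ("V", "Fed"): 0.1,
        ("N", "raises"): 0.3, ("V", "raises"): 0.7}
marg = forward_backward(["Fed", "raises"], ["N", "V"], trans, emit)
print(round(sum(marg[0].values()), 6))  # 1.0
```

These same fractional counts drive both CRF training (expected feature counts) and the E-step of EM for HMMs, as the earlier slides note. Real implementations work in log space or rescale per position to avoid underflow.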

Page 46:

Syntax

Page 47:

Parse Trees

The move followed a round of similar increases by other lenders, reflecting a continuing decline in that market

Page 48:

Phrase Structure Parsing
§ Phrase structure parsing organizes syntax into constituents or brackets
§ In general, this involves nested trees
§ Linguists can, and do, argue about details
§ Lots of ambiguity
§ Not the only kind of syntax…

[Tree figure over "new art critics write reviews with computers", with constituents labeled S, NP, N', VP, NP, PP.]

Page 49:

Constituency Tests
§ How do we know what nodes go in the tree?
§ Classic constituency tests:
  § Substitution by proform
  § Question answers
  § Semantic grounds
    § Coherence
    § Reference
    § Idioms
  § Dislocation
  § Conjunction
§ Cross-linguistic arguments, too

Page 50:

Conflicting Tests
§ Constituency isn't always clear
§ Units of transfer:
  § think about ~ penser à
  § talk about ~ hablar de
§ Phonological reduction:
  § I will go → I'll go
  § I want to go → I wanna go
  § à le centre → au centre
§ Coordination:
  § He went to and came from the store.

La vélocité des ondes sismiques

Page 51:

Classical NLP: Parsing
§ Write symbolic or logical rules:
§ Use deduction systems to prove parses from words
  § Minimal grammar on the "Fed raises" sentence: 36 parses
  § Simple 10-rule grammar: 592 parses
  § Real-size grammar: many millions of parses
§ This scaled very badly and didn't yield broad-coverage tools

Grammar (CFG):
ROOT → S
S → NP VP
NP → DT NN
NP → NN NNS
NP → NP PP
VP → VBP NP
VP → VBP NP PP
PP → IN NP

Lexicon:
NN → interest
NNS → raises
VBP → interest
VBZ → raises

Page 52:

Ambiguities

Page 53:

Ambiguities: PP Attachment

Page 54:

Attachments
§ I cleaned the dishes from dinner
§ I cleaned the dishes with detergent
§ I cleaned the dishes in my pajamas
§ I cleaned the dishes in the sink

Page 55:

Syntactic Ambiguities I
§ Prepositional phrases: They cooked the beans in the pot on the stove with handles.
§ Particle vs. preposition: The puppy tore up the staircase.
§ Complement structures: The tourists objected to the guide that they couldn't hear. / She knows you like the back of her hand.
§ Gerund vs. participial adjective: Visiting relatives can be boring. / Changing schedules frequently confused passengers.

Page 56:

Syntactic Ambiguities II
§ Modifier scope within NPs: impractical design requirements / plastic cup holder
§ Multiple gap constructions: The chicken is ready to eat. / The contractors are rich enough to sue.
§ Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall.

Page 57:

Dark Ambiguities
§ Dark ambiguities: most analyses are shockingly bad (meaning, they don't have an interpretation you can get your mind around)
§ Unknown words and new usages
§ Solution: we need mechanisms to focus attention on the best ones; probabilistic techniques do this

This analysis corresponds to the correct parse of "This will panic buyers!"

Page 58:

Ambiguities as Trees

Page 59:

PCFGs

Page 60:

Probabilistic Context-Free Grammars
§ A context-free grammar is a tuple <N, T, S, R>
  § N: the set of non-terminals
    § Phrasal categories: S, NP, VP, ADJP, etc.
    § Parts-of-speech (pre-terminals): NN, JJ, DT, VB
  § T: the set of terminals (the words)
  § S: the start symbol
    § Often written as ROOT or TOP
    § Not usually the sentence non-terminal S
  § R: the set of rules
    § Of the form X → Y1 Y2 … Yk, with X, Yi ∈ N
    § Examples: S → NP VP, VP → VP CC VP
    § Also called rewrites, productions, or local trees
§ A PCFG adds:
  § A top-down production probability per rule, P(Y1 Y2 … Yk | X)

Page 61:

Treebank Sentences

Page 62:

Treebank Grammars
§ Need a PCFG for broad coverage parsing.
§ Can take a grammar right off the trees (doesn't work well):

ROOT → S        1
S → NP VP .     1
NP → PRP        1
VP → VBD ADJP   1
…

§ Better results by enriching the grammar (e.g., lexicalization).
§ Can also get state-of-the-art parsers without lexicalization.

Page 63:

Treebank Grammar Scale
§ Treebank grammars can be enormous
§ As FSAs, the raw grammar has ~10K states, excluding the lexicon
§ Better parsers usually make the grammars larger, not smaller

[Figure: a fragment of the NP grammar as an FSA, with arcs labeled DET, ADJ, NOUN, PLURAL NOUN, NP, PP, CONJ.]

Page 64:

Chomsky Normal Form
§ Chomsky normal form:
  § All rules of the form X → Y Z or X → w
§ In principle, this is no limitation on the space of (P)CFGs
  § N-ary rules introduce new non-terminals
  § Unaries / empties are "promoted"
§ In practice it's kind of a pain:
  § Reconstructing n-aries is easy
  § Reconstructing unaries is trickier
  § The straightforward transformations don't preserve tree scores
§ Makes parsing algorithms simpler!

[Figure: VP → VBD NP PP PP binarized with intermediate symbols [VP → VBD NP •] and [VP → VBD NP PP •].]
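The left-to-right binarization in the figure can be sketched directly, using the slide's dotted-symbol naming for the introduced non-terminals (function name and representation are my own):

```python
def binarize(lhs, rhs):
    """Left-to-right binarization of an n-ary rule into CNF binary rules,
    introducing dotted intermediate symbols like [VP -> VBD NP .]."""
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules = []
    prev = rhs[0]
    for k in range(1, len(rhs) - 1):
        inter = "[%s -> %s .]" % (lhs, " ".join(rhs[:k + 1]))
        rules.append((inter, (prev, rhs[k])))   # intermediate covers a prefix
        prev = inter
    rules.append((lhs, (prev, rhs[-1])))        # original LHS closes the rule
    return rules

for r in binarize("VP", ["VBD", "NP", "PP", "PP"]):
    print(r)
```

Because each intermediate symbol names exactly the prefix of children it covers, reconstructing the original n-ary rule afterwards is the easy direction, as the slide says; unary chains are the trickier part.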

Page 65:

CKY Parsing

Page 66:

A Recursive Parser
§ Will this parser work?
§ Why or why not?
§ Memory requirements?

bestScore(X,i,j,s)
  if (j == i+1)
    return tagScore(X, s[i])
  else
    return max over k, X->YZ of
      score(X->YZ) * bestScore(Y,i,k,s) * bestScore(Z,k,j,s)

Page 67:

A Memoized Parser
§ One small change:

bestScore(X,i,j,s)
  if (scores[X][i][j] == null)
    if (j == i+1)
      score = tagScore(X, s[i])
    else
      score = max over k, X->YZ of
        score(X->YZ) * bestScore(Y,i,k,s) * bestScore(Z,k,j,s)
    scores[X][i][j] = score
  return scores[X][i][j]

Page 68:

A Bottom-Up Parser (CKY)
§ Can also organize things bottom-up

bestScore(s)
  for (i : [0, n-1])
    for (X : tags[s[i]])
      score[X][i][i+1] = tagScore(X, s[i])
  for (diff : [2, n])
    for (i : [0, n-diff])
      j = i + diff
      for (X->YZ : rule)
        for (k : [i+1, j-1])
          score[X][i][j] = max(score[X][i][j],
                               score(X->YZ) * score[Y][i][k] * score[Z][k][j])

[Figure: X over span [i,j] built from Y over [i,k] and Z over [k,j].]
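The bottom-up loops above translate almost line for line into runnable code. A minimal Viterbi-CKY sketch over a CNF grammar (the toy lexicon, rules, and probabilities are invented for illustration, not a real treebank grammar):

```python
from collections import defaultdict

def cky(words, lexicon, rules):
    """Bottom-up CKY: lexicon[word] is a list of (tag, score);
    rules is a list of binary rules (X, Y, Z, score)."""
    n = len(words)
    score = defaultdict(float)                  # score[(X, i, j)], default 0
    for i, w in enumerate(words):
        for tag, p in lexicon[w]:
            score[(tag, i, i + 1)] = p
    for diff in range(2, n + 1):                # span length
        for i in range(0, n - diff + 1):
            j = i + diff
            for X, Y, Z, p in rules:
                for k in range(i + 1, j):       # split point
                    cand = p * score[(Y, i, k)] * score[(Z, k, j)]
                    if cand > score[(X, i, j)]:
                        score[(X, i, j)] = cand
    return score

lexicon = {"Fed": [("NN", 0.8)],
           "raises": [("VBZ", 0.6), ("NNS", 0.4)],
           "interest": [("NN", 0.9)]}
rules = [("NP", "NN", "NNS", 1.0),
         ("VP", "VBZ", "NN", 1.0),
         ("S", "NN", "VP", 1.0)]               # toy rule so S covers the span
chart = cky(["Fed", "raises", "interest"], lexicon, rules)
print(round(chart[("S", 0, 3)], 3))  # 0.432
```

Adding backpointers alongside the max turns the score chart into a parse-recovery chart; the loop structure is exactly the |rules|·n³ computation analyzed on the "Time: Theory" slide.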

Page 69:

Unary Rules
§ Unary rules?

bestScore(X,i,j,s)
  if (j == i+1)
    return tagScore(X, s[i])
  else
    return max( max over k, X->YZ of
                  score(X->YZ) * bestScore(Y,i,k,s) * bestScore(Z,k,j,s),
                max over X->Y of
                  score(X->Y) * bestScore(Y,i,j,s) )

Page 70:

CNF + Unary Closure
§ We need unaries to be non-cyclic
§ Can address by pre-calculating the unary closure
§ Rather than having zero or more unaries, always have exactly one
  § Alternate unary and binary layers
  § Reconstruct unary chains afterwards

[Figure: example trees (NP → DT NN, VP → VBD NP, and an SBAR → S → VP chain) before and after collapsing unary chains.]

Page 71:

Alternating Layers

bestScoreU(X,i,j,s)
  if (j == i+1)
    return tagScore(X, s[i])
  else
    return max over X->Y of
      score(X->Y) * bestScoreB(Y,i,j,s)

bestScoreB(X,i,j,s)
  return max over k, X->YZ of
    score(X->YZ) * bestScoreU(Y,i,k,s) * bestScoreU(Z,k,j,s)

Page 72:

Analysis

Page 73:

Memory
§ How much memory does this require?
  § Have to store the score cache
  § Cache size: |symbols| · n² doubles
  § For the plain treebank grammar: X ~ 20K symbols, n = 40, double ~ 8 bytes → ~256 MB
  § Big, but workable.
§ Pruning: Beams
  § score[X][i][j] can get too large (when?)
  § Can keep beams (truncated maps score[i][j]) which only store the best few scores for the span [i,j]
§ Pruning: Coarse-to-Fine
  § Use a smaller grammar to rule out most X[i,j]
  § Much more on this later…

Page 74:

Time: Theory
§ How much time will it take to parse?
  § For each diff (<= n)
    § For each i (<= n)
      § For each rule X → Y Z
        § For each split point k: do constant work
§ Total time: |rules| · n³
§ Something like 5 sec for an unoptimized parse of a 20-word sentence

[Figure: X over span [i,j] split at k into Y and Z.]

Page 75:

Time: Practice
§ Parsing with the vanilla treebank grammar (~20K rules; not an optimized parser!)
§ Observed exponent: 3.6
§ Why's it worse in practice?
  § Longer sentences "unlock" more of the grammar
  § All kinds of systems issues don't scale

Page 76:

Same-Span Reachability

[Figure: same-span reachability among non-terminals. One large mutually reachable group: ADJP, ADVP, FRAG, INTJ, NP, PP, PRN, QP, S, SBAR, UCP, VP, WHNP. Separate or peripheral: TOP, LST, CONJP, WHADJP, WHADVP, WHPP, NX, NAC, SBARQ, SINV, RRC, SQ, X, PRT.]

Page 77:

Rule State Reachability
§ Many states are more likely to match larger spans!

Example: NP CC •
[Figure: over a length-n sentence, the dotted rule NP CC • has one alignment per end position.]

Example: NP CC NP •
[Figure: NP CC NP • has many more alignments, since the final NP can take multiple widths.]

Page 78:

Efficient CKY
§ Lots of tricks to make CKY efficient
§ Some of them are little engineering details:
  § E.g., first choose k, then enumerate through the Y:[i,k] which are non-zero, then loop through rules by left child.
  § Optimal layout of the dynamic program depends on the grammar, the input, even system details.
§ Another kind is more important (and interesting):
  § Many X[i,j] can be suppressed on the basis of the input string
  § We'll see this next class as figures-of-merit, A* heuristics, coarse-to-fine, etc.

Page 79:

Agenda-Based Parsing

Page 80:

Agenda-Based Parsing
§ Agenda-based parsing is like graph search (but over a hypergraph)
§ Concepts:
  § Numbering: we number fenceposts between words
  § "Edges" or items: spans with labels, e.g. PP[3,5], represent the sets of trees over those words rooted at that label (cf. search states)
  § A chart: records edges we've expanded (cf. closed set)
  § An agenda: a queue which holds edges (cf. a fringe or open set)

0 critics 1 write 2 reviews 3 with 4 computers 5   (PP[3,5] spans "with computers")

Page 81:

Word Items
§ Building an item for the first time is called discovery. Items go into the agenda on discovery.
§ To initialize, we discover all word items (with score 1.0).

0 critics 1 write 2 reviews 3 with 4 computers 5

AGENDA: critics[0,1], write[1,2], reviews[2,3], with[3,4], computers[4,5]
CHART: [EMPTY]

Page 82:

Unary Projection
§ When we pop a word item, the lexicon tells us the tag item successors (and scores), which go on the agenda

0 critics 1 write 2 reviews 3 with 4 computers 5
critics[0,1] write[1,2] reviews[2,3] with[3,4] computers[4,5]
NNS[0,1] VBP[1,2] NNS[2,3] IN[3,4] NNS[4,5]

Page 83:

Item Successors
§ When we pop items off of the agenda:
§ Graph successors: unary projections (NNS → critics, NP → NNS)
  Y[i,j] with X → Y forms X[i,j]
§ Hypergraph successors: combine with items already in our chart
  Y[i,j] and Z[j,k] with X → Y Z form X[i,k]
§ Enqueue / promote resulting items (if not in chart already)
§ Record backtraces as appropriate
§ Stick the popped edge in the chart (closed set)
§ Queries a chart must support:
§ Is edge X[i,j] in the chart? (What score?)
§ What edges with label Y end at position j?
§ What edges with label Z start at position i?
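The pop-combine-enqueue loop above can be sketched as a small uniform-cost agenda parser. The dict-based grammar encoding, function name, and scoring are illustrative assumptions, not from the slides; it returns best (Viterbi) scores for each discovered item.

```python
import heapq

def agenda_parse(words, lexicon, unary, binary):
    """Uniform-cost agenda parsing over a hypergraph of labeled spans.
    lexicon: {(tag, word): p}; unary: {(parent, child): p};
    binary: {(parent, left, right): p}. Returns best score per item (X, i, j)."""
    chart = {}                                  # closed set: (X, i, j) -> best score
    agenda = []                                 # max-heap via negated scores
    for i, w in enumerate(words):               # discover all word items
        for (tag, word), p in lexicon.items():
            if word == w:
                heapq.heappush(agenda, (-p, (tag, i, i + 1)))
    while agenda:
        neg, (x, i, j) = heapq.heappop(agenda)
        score = -neg
        if (x, i, j) in chart:                  # already expanded with a better score
            continue
        chart[(x, i, j)] = score                # stick the popped edge in the chart
        for (parent, child), p in unary.items():        # graph successors (unary)
            if child == x:
                heapq.heappush(agenda, (-(score * p), (parent, i, j)))
        for (parent, left, right), p in binary.items(): # hypergraph successors
            if left == x:                               # combine rightward with chart items
                for (y, k, l), s2 in list(chart.items()):
                    if y == right and k == j:
                        heapq.heappush(agenda, (-(score * s2 * p), (parent, i, l)))
            if right == x:                              # combine leftward with chart items
                for (y, k, l), s2 in list(chart.items()):
                    if y == left and l == i:
                        heapq.heappush(agenda, (-(s2 * score * p), (parent, k, j)))
    return chart
```

With a toy grammar over "critics write reviews", the loop discovers tags, then NPs, then VP[1,3] and S[0,3], matching the discovery order sketched on the next slide.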

Page 84:

An Example

0 critics 1 write 2 reviews 3 with 4 computers 5

Discovery order (tags first, then phrases built bottom-up):
NNS[0,1] VBP[1,2] NNS[2,3] IN[3,4] NNS[4,5]
NP[0,1] NP[2,3] NP[4,5]
VP[1,2] S[0,2] ROOT[0,2]
PP[3,5] VP[1,3] S[0,3] ROOT[0,3]
VP[1,5] NP[2,5] S[0,5] ROOT[0,5]

Page 85:

Empty Elements
§ Sometimes we want to posit nodes in a parse tree that don't contain any pronounced words:

  I want you to parse this sentence
  I want [ ] to parse this sentence

§ These are easy to add to an agenda-based parser!
§ For each position i, add the "word" edge e[i,i]
§ Add rules like NP → e to the grammar
§ That's it!

0 I 1 like 2 to 3 parse 4 empties 5    (with an e edge at every position)

Page 86:

UCS / A*
§ With weighted edges, order matters
§ Must expand optimal parse from bottom up (subparses first)
§ CKY does this by processing smaller spans before larger ones
§ UCS pops items off the agenda in order of decreasing Viterbi score
§ A* search also well defined
§ You can also speed up the search without sacrificing optimality
§ Can select which items to process first
§ Can do with any "figure of merit" [Charniak 98]
§ If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]

Page 87:

(Speech) Lattices
§ There was nothing magical about words spanning exactly one position.
§ When working with speech, we generally don't know how many words there are, or where they break.
§ We can represent the possibilities as a lattice and parse these just as easily.

[Figure: a word lattice with paths through "I", "awe", "of", "van", "eyes", "saw", "a", "'ve", "an", "Ivan"]

Page 88:

Learning PCFGs

Page 89:

Treebank PCFGs
§ Use PCFGs for broad coverage parsing
§ Can take a grammar right off the trees (doesn't work well):

ROOT → S          1
S → NP VP .       1
NP → PRP          1
VP → VBD ADJP     1
…

Model     F1
Baseline  72.0

[Charniak 96]
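"Taking a grammar right off the trees" is just counting local rules and relative-frequency normalizing per left-hand side. A minimal sketch, assuming trees encoded as nested (label, children...) tuples with string leaves (an encoding chosen here for illustration):

```python
from collections import Counter, defaultdict

def read_pcfg(trees):
    """Count rules X -> rhs from (label, children...) tuples and
    relative-frequency normalize per left-hand side."""
    counts = Counter()
    for t in trees:
        stack = [t]
        while stack:
            node = stack.pop()
            if isinstance(node, str):          # a word: no rule here
                continue
            label, *children = node
            rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
            counts[(label, rhs)] += 1          # one count per local tree
            stack.extend(children)
    totals = defaultdict(int)
    for (lhs, rhs), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}
```

Each rule probability is count(X → rhs) / count(X), exactly the maximum-likelihood estimate behind the 72.0 F1 baseline.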

Page 90:

Conditional Independence?
§ Not every NP expansion can fill every NP slot
§ A grammar with symbols like "NP" won't be context-free
§ Statistically, conditional independence too strong

Page 91:

Non-Independence
§ Independence assumptions are often too strong.
§ Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
§ Also: the subject and object expansions are correlated!

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%

Page 92:

Grammar Refinement
§ Example: PP attachment

Page 93:

Grammar Refinement
§ Structure Annotation [Johnson '98, Klein & Manning '03]
§ Lexicalization [Collins '99, Charniak '00]
§ Latent Variables [Matsuzaki et al. '05, Petrov et al. '06]

Page 94:

Structural Annotation

Page 95:

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Structural annotation

Page 96:

Typical Experimental Setup
§ Corpus: Penn Treebank, WSJ
  Training: sections 02-21
  Development: section 22 (here, first 20 files)
  Test: section 23
§ Accuracy (F1): harmonic mean of per-node labeled precision and recall.
§ Here: also size, the number of symbols in the grammar.

Page 97:

Vertical Markovization
§ Vertical Markov order: rewrites depend on past k ancestor nodes. (cf. parent annotation)

[Charts: F1 rises from ~72% to ~79%, and grammar size grows to ~25,000 symbols, as the vertical Markov order increases over {1, 2v, 2, 3v, 3}]
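Order-2 vertical Markovization is parent annotation: every nonterminal is re-labeled with its parent's symbol, so an NP under S becomes NP^S. A minimal sketch, assuming trees encoded as nested (label, children...) tuples with string leaves (the encoding is illustrative):

```python
def annotate_parents(tree, parent="TOP"):
    """Order-2 vertical Markovization: relabel each nonterminal X under
    parent P as X^P. Leaves (words) are left unannotated."""
    if isinstance(tree, str):              # a word: leave it alone
        return tree
    label, *children = tree
    new_label = label if parent == "TOP" else f"{label}^{parent}"
    return (new_label, *[annotate_parents(c, label) for c in children])
```

Running this over the treebank before reading off rules yields the larger, better-fitting grammars in the charts above; higher orders would thread in grandparents the same way.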

Page 98:

Horizontal Markovization

[Charts: F1 (~70-74%) and grammar size (~3,000-12,000 symbols) across horizontal Markov orders {0, 1, 2v, 2, ∞}]

Page 99:

Unary Splits
§ Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
§ Solution: mark unary rewrite sites with -U.

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K

Page 100:

Tag Splits
§ Problem: treebank tags are too coarse.
§ Example: sentential, PP, and other prepositions are all marked IN.
§ Partial solution: subdivide the IN tag.

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K

Page 101:

A Fully Annotated (Unlex) Tree

Page 102:

Some Test Set Results

Parser         LP    LR    F1    CB    0 CB
Magerman 95    84.9  84.6  84.7  1.26  56.6
Collins 96     86.3  85.8  86.0  1.14  59.9
Unlexicalized  86.9  85.7  86.3  1.10  60.3
Charniak 97    87.4  87.5  87.4  1.00  62.1
Collins 99     88.7  88.6  88.6  0.90  67.1

§ Beats "first generation" lexicalized parsers.
§ Lots of room to improve; more complex models next.

Page 103:

Efficient Parsing for Structural Annotation

Page 104:

Grammar Projections

Coarse grammar: NP → DT N'
Fine grammar: NP^S → DT^NP N'[…DT]^NP

Note: X-bar grammars are projections with rules like XP → Y X' or XP → X' Y or X' → X

Page 105:

Coarse-to-Fine Pruning
§ For each coarse chart item X[i,j], compute its posterior probability.
§ E.g. consider the span 5 to 12: coarse symbols … QP NP VP … are scored, and the refined versions of any symbol whose posterior is < threshold are pruned.
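The pruning step itself is a one-line filter once the coarse posteriors are in hand. A minimal sketch (data layout and function name are illustrative assumptions):

```python
def coarse_to_fine_keep(posteriors, projection, refined_items, threshold=1e-4):
    """Keep a refined item X_k[i,j] only if its coarse projection X[i,j]
    has posterior probability at or above threshold.
    posteriors: {(coarse_label, i, j): p}; projection: refined -> coarse label."""
    return [(x, i, j) for (x, i, j) in refined_items
            if posteriors.get((projection[x], i, j), 0.0) >= threshold]
```

All refined variants of a symbol stand or fall together with their coarse projection, which is what makes the coarse pass cheap relative to the savings in the fine pass.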

Page 106:

Computing (Max-)Marginals

Page 107:

Inside and Outside Scores
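The inside score IN(X, i, j) sums the probability of every derivation of X over words i..j; together with the outside score it gives the posteriors used for pruning. A minimal sketch of the inside pass for a binarized PCFG (the dict-based grammar encoding is an illustrative assumption):

```python
def inside(words, lexicon, binary):
    """Inside scores for a binary PCFG plus a lexicon.
    IN[(X, i, j)] = total probability of all trees rooted at X over words[i:j].
    lexicon: {(tag, word): p}; binary: {(X, Y, Z): p} for rules X -> Y Z."""
    n = len(words)
    IN = {}
    for i, w in enumerate(words):                       # base case: width-1 spans
        for (tag, word), p in lexicon.items():
            if word == w:
                IN[(tag, i, i + 1)] = IN.get((tag, i, i + 1), 0.0) + p
    for width in range(2, n + 1):                       # smaller spans before larger
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                   # split point
                for (x, y, z), p in binary.items():
                    left = IN.get((y, i, k), 0.0)
                    right = IN.get((z, k, j), 0.0)
                    if left and right:
                        IN[(x, i, j)] = IN.get((x, i, j), 0.0) + p * left * right
    return IN
```

The outside pass runs the same recurrences top-down; the posterior of X[i,j] is then IN(X,i,j) * OUT(X,i,j) divided by the total sentence probability.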

Page 108:

Page 109:

Pruning with A*
§ You can also speed up the search without sacrificing optimality
§ For agenda-based parsers:
§ Can select which items to process first
§ Can do with any "figure of merit" [Charniak 98]
§ If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]

Page 110:

A* Parsing

Page 111:

Lexicalization

Page 112:

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Structural annotation [Johnson '98, Klein and Manning '03]
§ Head lexicalization [Collins '99, Charniak '00]

Page 113:

Problems with PCFGs
§ If we do no annotation, these trees differ only in one rule:
§ VP → VP PP
§ NP → NP PP
§ Parse will go one way or the other, regardless of words
§ We addressed this in one way with unlexicalized grammars (how?)
§ Lexicalization allows us to be sensitive to specific words

Page 114:

Problems with PCFGs
§ What's different between basic PCFG scores here?
§ What (lexical) correlations need to be scored?

Page 115:

Lexicalized Trees
§ Add "head words" to each phrasal node
§ Syntactic vs. semantic heads
§ Headship not in (most) treebanks
§ Usually use head rules, e.g.:
§ NP: take leftmost NP; take rightmost N*; take rightmost JJ; take right child
§ VP: take leftmost VB*; take leftmost VP; take left child
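Head rules like the NP and VP rules above are just a priority-ordered list of (direction, target-labels) searches with a default child as fallback. A minimal sketch; the rule encoding is an illustrative assumption, and `startswith` stands in for patterns like "N*" and "VB*":

```python
def find_head(label, children, head_rules):
    """Pick the head child of a local tree using priority-ordered head rules.
    head_rules: {parent: [(direction, {label_prefixes}), ...]}.
    Falls back to the left child for VP and the right child otherwise."""
    for direction, targets in head_rules.get(label, []):
        order = children if direction == "left" else reversed(children)
        for child in order:
            if any(child.startswith(t) for t in targets):
                return child
    return children[0] if label == "VP" else children[-1]
```

For example, with the slide's rules, the head of (NP DT JJ NN) is NN (rightmost N*), and the head of (VP VBD NP) is VBD (leftmost VB*).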

Page 116:

Lexicalized PCFGs?
§ Problem: we now have to estimate probabilities like: [fully lexicalized rule probability]
§ Never going to get these atomically off of a treebank
§ Solution: break up derivation into smaller steps

Page 117:

Lexical Derivation Steps
§ A derivation of a local tree [Collins 99]:
§ Choose a head tag and word
§ Choose a complement bag
§ Generate children (incl. adjuncts)
§ Recursively derive children

Page 118:

Lexicalized CKY

bestScore(X, i, j, h):
  if (j == i+1):
    return tagScore(X, s[i])
  else:
    return max over k, h', X -> Y Z of:
      score(X[h] -> Y[h] Z[h']) * bestScore(Y, i, k, h) * bestScore(Z, k, j, h'),
      score(X[h] -> Y[h'] Z[h]) * bestScore(Y, i, k, h') * bestScore(Z, k, j, h)

Example combination: (VP -> VBD •)[saw] + NP[her] forms (VP -> VBD...NP •)[saw]

Page 119:

Efficient Parsing for Lexical Grammars

Page 120:

Quartic Parsing
§ Turns out, you can do (a little) better [Eisner 99]: combine the headed child Y[h] with an unheaded Z first, rather than enumerating both heads h and h' at once.
§ Gives an O(n^4) algorithm
§ Still prohibitive in practice if not pruned

Page 121:

Pruning with Beams
§ The Collins parser prunes with per-cell beams [Collins 99]
§ Essentially, run the O(n^5) CKY
§ Remember only a few hypotheses for each span <i,j>
§ If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
§ Keeps things more or less cubic (and in practice is more like linear!)
§ Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Page 122:

Pruning with a PCFG
§ The Charniak parser prunes using a two-pass, coarse-to-fine approach [Charniak 97+]
§ First, parse with the base grammar
§ For each X:[i,j] calculate P(X | i, j, s)
§ This isn't trivial, and there are clever speedups
§ Second, do the full O(n^5) CKY
§ Skip any X:[i,j] which had low (say, < 0.0001) posterior
§ Avoids almost all work in the second phase!
§ Charniak et al 06: can use more passes
§ Petrov et al 07: can use many more passes

Page 123:

Results
§ Some results:
§ Collins 99: 88.6 F1 (generative lexical)
§ Charniak and Johnson 05: 89.7 / 91.3 F1 (generative lexical / reranked)
§ Petrov et al 06: 90.7 F1 (generative unlexical)
§ McClosky et al 06: 92.1 F1 (gen + rerank + self-train)
§ However:
§ Bilexical counts rarely make a difference (why?)
§ Gildea 01: removing bilexical counts costs < 0.5 F1

Page 124:

Latent Variable PCFGs

Page 125:

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Parent annotation [Johnson '98]

Page 126:

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Parent annotation [Johnson '98]
§ Head lexicalization [Collins '99, Charniak '00]

Page 127:

The Game of Designing a Grammar
§ Annotation refines base treebank symbols to improve statistical fit of the grammar
§ Parent annotation [Johnson '98]
§ Head lexicalization [Collins '99, Charniak '00]
§ Automatic clustering?

Page 128:

Latent Variable Grammars

[Figure: a parse tree and a sentence; the parameters define a distribution over the derivations that refine the tree]

Page 129:

Learning Latent Annotations

EM algorithm (just like Forward-Backward for HMMs):
§ Brackets are known
§ Base categories are known
§ Only induce subcategories

[Figure: a tree over "He was right ." with latent node labels X1-X7]

Page 130:

Refinement of the DT tag

[Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]

Page 131:

Hierarchical refinement

Page 132:

Hierarchical Estimation Results

[Chart: parsing accuracy (F1), 74 to 90, against total number of grammar symbols, 100 to 1700]

Model                  F1
Flat Training          87.3
Hierarchical Training  88.4

Page 133:

Refinement of the "," tag
§ Splitting all categories equally is wasteful:

Page 134:

Adaptive Splitting
§ Want to split complex categories more
§ Idea: split everything, roll back splits which were least useful

Page 135:

Adaptive Splitting Results

Model             F1
Previous          88.4
With 50% Merging  89.5

Page 136:

Number of Phrasal Subcategories

[Chart: number of learned subcategories (0-40) per phrasal category, highest for NP, VP, and PP, lowest for categories like WHADJP, X, ROOT, and LST]

Page 137:

Number of Lexical Subcategories

[Chart: number of learned subcategories (0-70) per POS tag, highest for NNP, JJ, NNS, and NN, lowest for tags like SYM, RP, LS, and #]

Page 138:

Learned Splits
§ Proper nouns (NNP):

NNP-14  Oct.  Nov.       Sept.
NNP-12  John  Robert     James
NNP-2   J.    E.         L.
NNP-1   Bush  Noriega    Peters
NNP-15  New   San        Wall
NNP-3   York  Francisco  Street

§ Personal pronouns (PRP):

PRP-0   It    He    I
PRP-1   it    he    they
PRP-2   it    them  him

Page 139:

Learned Splits
§ Relative adverbs (RBR):

RBR-0   further  lower    higher
RBR-1   more     less     More
RBR-2   earlier  Earlier  later

§ Cardinal numbers (CD):

CD-7    one      two      Three
CD-4    1989     1990     1988
CD-11   million  billion  trillion
CD-0    1        50       100
CD-3    1        30       31
CD-9    78       58       34

Page 140:

Final Results (Accuracy)

                                          ≤ 40 words F1   all F1
ENG  Charniak & Johnson '05 (generative)  90.1            89.6
ENG  Split / Merge                        90.6            90.1
GER  Dubey '05                            76.3            -
GER  Split / Merge                        80.8            80.1
CHN  Chiang et al. '02                    80.0            76.6
CHN  Split / Merge                        86.3            83.4

Still higher numbers from reranking / self-training methods

Page 141:

Efficient Parsing for Hierarchical Grammars

Page 142:

Coarse-to-Fine Inference
§ Example: PP attachment

Page 143:

Hierarchical Pruning

coarse:         … QP NP VP …
split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … (each symbol split again) …

Page 144:

Bracket Posteriors

Page 145:

[Parsing times per pass: 1621 min, 111 min, 35 min, 15 min (no search error)]

Page 146:

Unsupervised Tagging

Page 147:

Unsupervised Tagging?
§ AKA part-of-speech induction
§ Task:
§ Raw sentences in
§ Tagged sentences out
§ Obvious thing to do:
§ Start with a (mostly) uniform HMM
§ Run EM
§ Inspect results

Page 148:

EM for HMMs: Process
§ Alternate between recomputing distributions over hidden variables (the tags) and reestimating parameters
§ Crucial step: we want to tally up how many (fractional) counts of each kind of transition and emission we have under current params:
§ Same quantities we needed to train a CRF!
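The fractional transition counts come from a forward-backward pass. A minimal E-step sketch for a first-order HMM over one sentence (parameter layout and function name are illustrative; a real implementation would also work in log space and accumulate emission counts the same way):

```python
def expected_counts(obs, states, pi, trans, emit):
    """Forward-backward over one sentence: return expected transition
    counts E[c(s -> s')] under the current HMM parameters.
    pi: initial probs; trans[s][s']: transition; emit[s][w]: emission."""
    T = len(obs)
    fwd = [{s: pi[s] * emit[s][obs[0]] for s in states}]        # forward pass
    for t in range(1, T):
        fwd.append({s: emit[s][obs[t]] * sum(fwd[t-1][r] * trans[r][s] for r in states)
                    for s in states})
    bwd = [dict.fromkeys(states, 1.0) for _ in range(T)]        # backward pass
    for t in range(T - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(trans[s][r] * emit[r][obs[t+1]] * bwd[t+1][r] for r in states)
    Z = sum(fwd[T-1][s] for s in states)        # total sentence probability (assumed > 0)
    counts = {(s, r): 0.0 for s in states for r in states}
    for t in range(T - 1):                      # posterior mass on each arc
        for s in states:
            for r in states:
                counts[(s, r)] += fwd[t][s] * trans[s][r] * emit[r][obs[t+1]] * bwd[t+1][r] / Z
    return counts
```

The M-step then renormalizes these counts into new transition probabilities, and the two steps alternate until convergence.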

Page 149:

Merialdo: Setup
§ Some (discouraging) experiments [Merialdo 94]
§ Setup:
§ You know the set of allowable tags for each word
§ Fix k training examples to their true labels
§ Learn P(w|t) on these examples
§ Learn P(t|t-1,t-2) on these examples
§ On n examples, re-estimate with EM
§ Note: we know allowed tags but not frequencies

Page 150:

EM for HMMs: Quantities
§ Total path values (correspond to probabilities here):

Page 151:

The State Lattice / Trellis

[Figure: the trellis for "START Fed raises interest rates END", with states ^, N, V, J, D, $ at each position]

Page 152:

EM for HMMs: Process
§ From these quantities, can compute expected transitions:
§ And emissions:

Page 153:

Merialdo: Results

Page 154:

Projection-Based A*

[Figure: the parse of "Factory payrolls fell in Sept." with its SYNTACTIC projection (S, VP, NP, PP over the words) and its SEMANTIC projection (head words: S:fell, VP:fell, NP:payrolls, PP:in)]

Page 155:

A* Speedup
§ Total time dominated by calculation of A* tables in each projection … O(n^3)

[Chart: time (sec, 0-60) vs. sentence length (0-40) for the PCFG phase, dependency phase, and combined phase]

Page 156:

Breaking Up the Symbols
§ We can relax independence assumptions by encoding dependencies into the PCFG symbols:
§ Parent annotation [Johnson 98]
§ Marking possessive NPs
§ What are the most useful "features" to encode?

Page 157:

Other Tag Splits

Annotation  F1    Size
UNARY-DT    80.4  8.1K   mark demonstratives as DT^U ("the X" vs. "those")
UNARY-RB    80.5  8.1K   mark phrasal adverbs as RB^U ("quickly" vs. "very")
TAG-PA      81.2  8.5K   mark tags with non-canonical parents ("not" is an RB^VP)
SPLIT-AUX   81.6  9.0K   mark auxiliary verbs with -AUX [cf. Charniak 97]
SPLIT-CC    81.7  9.1K   separate "but" and "&" from other conjunctions
SPLIT-%     81.8  9.3K   "%" gets its own tag