
Page 1: Sequence Models I

Sequence Models I

Wei Xu (many slides from Greg Durrett, Dan Klein, Vivek Srikumar, Chris Manning, Yoav Artzi)

Page 2: Sequence Models I

Administrivia

‣ Project 1 is out, due on Sep 20 (next Monday!).

‣ Reading: Eisenstein 7.0-7.4, Jurafsky + Martin Chapter 8

Page 3: Sequence Models I

This Lecture

‣ Sequence modeling

‣ HMMs for POS tagging

‣ Viterbi, forward-backward

‣ HMM parameter estimation

Page 4: Sequence Models I

Linguistic Structures

‣ Language is tree-structured

I ate the spaghetti with chopsticks / I ate the spaghetti with meatballs

‣ Understanding syntax fundamentally requires trees — the sentences have the same shallow analysis

I ate the spaghetti with chopsticks / I ate the spaghetti with meatballs
PRP VBZ DT NN IN NNS / PRP VBZ DT NN IN NNS

Page 5: Sequence Models I

Linguistic Structures

‣ Language is sequentially structured: interpreted in an online way

Tanenhaus et al. (1995)

Page 6: Sequence Models I

POS Tagging

Ghana’s ambassador should have set up the big meeting in DC yesterday.

‣ What tags are out there?

NNP POS NN MD VB VBN RP DT JJ NN IN NNP NN .

Page 7: Sequence Models I

POS Tagging

Slide credit: Dan Klein

Page 8: Sequence Models I

POS Tagging

Slide credit: Yoav Artzi

Page 9: Sequence Models I

POS Tagging

Fed raises interest rates 0.5 percent

Fed: VBD VBN NNP
raises: VBZ NNS
interest: VB VBP NN
rates: VBZ NNS
0.5: CD
percent: NN

I’m 0.5% interested in the Fed’s raises!

I hereby increase interest rates 0.5%

Fed raises interest rates 0.5 percent (tag lattice as above)

‣ Other paths are also plausible but even more semantically weird…
‣ What governs the correct choice? Word + context
‣ Word identity: most words have <= 2 tags, many have one (percent, the)
‣ Context: nouns start sentences, nouns follow verbs, etc.

Page 10: Sequence Models I

What is this good for?

‣ Text-to-speech: record, lead

‣ Preprocessing step for syntactic parsers

‣ Domain-independent disambiguation for other tasks

‣ (Very) shallow information extraction

Page 11: Sequence Models I

Sequence Models

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

‣ POS tagging: x is a sequence of words, y is a sequence of tags

‣ Today: generative models P(x, y); discriminative models next time

Page 12: Sequence Models I

Hidden Markov Models

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

‣ Model the sequence of y as a Markov process: y1 → y2 → y3 → …

‣ Markov property: the future is conditionally independent of the past given the present

‣ If y are tags, this roughly corresponds to assuming that the next tag only depends on the current tag, not anything before

P(y_3 \mid y_1, y_2) = P(y_3 \mid y_2)

‣ Lots of mathematical theory about how Markov chains behave

Page 13: Sequence Models I

Hidden Markov Models

[HMM graphical model: y1 → y2 → … → yn, with each y_i emitting x_i]

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

Fed raises … percent
NNP VBZ … NN

Page 14: Sequence Models I

Hidden Markov Models

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

[HMM graphical model: y1 → y2 → … → yn, with each y_i emitting x_i]

P(\mathbf{y}, \mathbf{x}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)

P(y_1): initial distribution; P(y_i \mid y_{i-1}): transition probabilities; P(x_i \mid y_i): emission probabilities

‣ P(x|y) is a distribution over all words in the vocabulary — not a distribution over features (but could be!)

‣ Multinomials: tag x tag transitions, tag x word emissions

‣ Observation (x) depends only on current state (y)
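To make this factorization concrete, here is a minimal sketch of scoring a tagged sentence under the HMM in log space. The dictionaries init_probs, trans_probs, and emit_probs are hypothetical names for (smoothed) parameters estimated as on the following slides:

```python
import math

def hmm_log_joint(words, tags, init_probs, trans_probs, emit_probs):
    """log P(y, x) = log P(y1) + sum_i log P(yi | yi-1) + sum_i log P(xi | yi).

    init_probs[tag], trans_probs[prev_tag][tag], emit_probs[tag][word] are
    assumed to be smoothed probabilities estimated from supervised data.
    """
    score = math.log(init_probs[tags[0]])
    for prev, curr in zip(tags, tags[1:]):          # transition terms
        score += math.log(trans_probs[prev][curr])
    for word, tag in zip(words, tags):              # emission terms
        score += math.log(emit_probs[tag][word])
    return score

# e.g. hmm_log_joint(["Fed", "raises"], ["NNP", "VBZ"],
#                    init_probs, trans_probs, emit_probs)
```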

Page 15: Sequence Models I

Transitions in POS Tagging

‣ Dynamics model:

P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1})

Fed raises interest rates 0.5 percent .

Fed: VBD VBN NNP
raises: VBZ NNS
interest: VB VBP NN
rates: VBZ NNS
0.5: CD
percent: NN

P(y_1 = NNP): likely because start of sentence
P(y_2 = VBZ \mid y_1 = NNP): likely because a verb often follows a noun
P(y_3 = NN \mid y_2 = VBZ): direct object follows verb; another verb rarely follows a past-tense verb (main verbs can follow modals though!)

NNP - proper noun, singular; VBZ - verb, 3rd person singular present; NN - noun, singular or mass.

Page 16: Sequence Models I

Estimating Transitions

‣ Similar to Naive Bayes estimation: maximum likelihood solution = normalized counts (with smoothing) read off supervised data

Fed raises interest rates 0.5 percent .
NNP VBZ NN NNS CD NN

‣ How to smooth?

‣ One method: smooth with the unigram distribution over tags

P(\text{tag} \mid \text{tag}_{-1}) = (1 - \lambda)\,\hat{P}(\text{tag} \mid \text{tag}_{-1}) + \lambda\,\hat{P}(\text{tag})

\hat{P} = empirical distribution (read off from data)

‣ P(tag | NN) = (0.5 ".", 0.5 NNS)
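As a concrete sketch of this estimation step, the following counts transitions from a tagged corpus and interpolates with the unigram tag distribution. The function and variable names (estimate_transitions, lam) are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

def estimate_transitions(tagged_sents, lam=0.1):
    """P(tag | prev) = (1 - lam) * MLE(tag | prev) + lam * MLE(tag).

    tagged_sents: list of tag sequences, e.g. [["NNP", "VBZ", "NN", ...], ...]
    """
    bigram_counts = defaultdict(Counter)
    unigram_counts = Counter()
    for tags in tagged_sents:
        unigram_counts.update(tags)
        for prev, curr in zip(tags, tags[1:]):
            bigram_counts[prev][curr] += 1

    total = sum(unigram_counts.values())
    unigram = {t: c / total for t, c in unigram_counts.items()}

    trans = {}
    for prev, counts in bigram_counts.items():
        prev_total = sum(counts.values())
        # interpolate the bigram MLE with the unigram tag distribution
        trans[prev] = {t: (1 - lam) * counts[t] / prev_total + lam * unigram[t]
                       for t in unigram}
    return trans

# e.g. estimate_transitions([["NNP", "VBZ", "NN", "NNS", "CD", "NN", "."]])["NN"]
```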

Page 17: Sequence Models I

Emissions in POS Tagging

‣ Emissions P(x|y) capture the distribution of words occurring with a given tag

‣ P(word | NN) = (0.05 person, 0.04 official, 0.03 interest, 0.03 percent, …)

‣ When you compute the posterior for a given word’s tags, the distribution favors tags that are more likely to generate that word

‣ How should we smooth this?

Fed raises interest rates 0.5 percent .
NNP VBZ NN NNS CD NN .

Page 18: Sequence Models I

Estimating Emissions

Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN

‣ P(word | NN) = (0.5 interest, 0.5 percent) — hard to smooth!

‣ Fancy techniques from language modeling, e.g. look at type fertility — P(tag | word) is flatter for some kinds of words than for others

‣ Alternative: use Bayes’ rule:

P(\text{word} \mid \text{tag}) = \frac{P(\text{tag} \mid \text{word})\,P(\text{word})}{P(\text{tag})}

‣ Can interpolate with a distribution looking at word shape, P(word shape | tag) (e.g., P(capitalized word of length >= 8 | tag))

‣ P(word | tag) can be a log-linear model — we’ll see this in a few lectures
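A minimal sketch of the Bayes’-rule alternative above, with P(tag | word), P(word), and P(tag) all estimated from counts. The add-lambda smoothing of P(tag | word) is just a placeholder choice, and the names are illustrative:

```python
from collections import Counter

def emission_via_bayes(tagged_words, lam=0.01):
    """P(word | tag) = P(tag | word) * P(word) / P(tag), estimated from counts.

    tagged_words: list of (word, tag) pairs from a supervised corpus.
    """
    word_counts = Counter(w for w, _ in tagged_words)
    tag_counts = Counter(t for _, t in tagged_words)
    pair_counts = Counter(tagged_words)
    n = len(tagged_words)
    num_tags = len(tag_counts)

    def p_word_given_tag(word, tag):
        # add-lambda smoothed P(tag | word), then rearranged by Bayes' rule
        p_tag_given_word = (pair_counts[(word, tag)] + lam) / (word_counts[word] + lam * num_tags)
        p_word = word_counts[word] / n
        p_tag = tag_counts[tag] / n
        return p_tag_given_word * p_word / p_tag

    return p_word_given_tag

# p = emission_via_bayes([("Fed", "NNP"), ("raises", "VBZ"), ("interest", "NN")])
# p("interest", "NN")
```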

Page 19: Sequence Models I

Inference in HMMs

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

[HMM graphical model: y1 → y2 → … → yn, with each y_i emitting x_i]

P(\mathbf{y}, \mathbf{x}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)

‣ Inference problem: \operatorname{argmax}_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \operatorname{argmax}_{\mathbf{y}} \frac{P(\mathbf{y}, \mathbf{x})}{P(\mathbf{x})}

‣ Exponentially many possible y here!

‣ Solution: dynamic programming (possible because of Markov structure!)

‣ Many neural sequence models depend on the entire previous tag sequence, need to use approximations like beam search

Page 20: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 21: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 22: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

best (partial) score for a sequence ending in state s

Page 23: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 24: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 25: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 26: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 27: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 28: Sequence Models I

Viterbi Algorithm

Slide credit: Dan Klein

‣ “Think about” all possible immediate prior state values. Everything before that has already been accounted for by earlier stages.
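Since the Viterbi recurrences themselves live in the slide figures, here is a minimal reference sketch of the algorithm in log space (data-structure names are illustrative):

```python
import math

def viterbi(words, tags, init_probs, trans_probs, emit_probs):
    """Return the highest-scoring tag sequence argmax_y P(y, x) under the HMM.

    init_probs[tag], trans_probs[prev][tag], emit_probs[tag][word] are assumed
    to be smoothed probabilities; scores are kept as log probabilities.
    """
    n = len(words)
    # score[i][s] = best log score of any tag sequence for words[:i+1] ending in s
    score = [{} for _ in range(n)]
    backptr = [{} for _ in range(n)]

    for s in tags:
        score[0][s] = math.log(init_probs[s]) + math.log(emit_probs[s][words[0]])

    for i in range(1, n):
        for s in tags:
            # "think about" all possible immediate prior states
            best_prev = max(tags, key=lambda p: score[i - 1][p] + math.log(trans_probs[p][s]))
            score[i][s] = (score[i - 1][best_prev]
                           + math.log(trans_probs[best_prev][s])
                           + math.log(emit_probs[s][words[i]]))
            backptr[i][s] = best_prev

    # follow back-pointers from the best final state
    last = max(tags, key=lambda s: score[n - 1][s])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))
```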

Page 29: Sequence Models I

Forward-Backward Algorithm

‣ In addition to finding the best path, we may want to compute marginal probabilities of paths:

P(y_i = s \mid \mathbf{x}) = \sum_{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n} P(\mathbf{y} \mid \mathbf{x})

‣ What did Viterbi compute? P(\mathbf{y}_{\max} \mid \mathbf{x}) = \max_{y_1, \ldots, y_n} P(\mathbf{y} \mid \mathbf{x})

‣ Can compute marginals with dynamic programming as well, using an algorithm called forward-backward

Page 30: Sequence Models I

Forward-Backward Algorithm

P(y_3 = 2 \mid \mathbf{x}) = \frac{\text{sum of all paths through state 2 at time 3}}{\text{sum of all paths}}

Page 31: Sequence Models I

Forward-Backward Algorithm

Slide credit: Dan Klein

P(y_3 = 2 \mid \mathbf{x}) = \frac{\text{sum of all paths through state 2 at time 3}}{\text{sum of all paths}}

‣ Easiest and most flexible to do one pass to compute the forward probabilities (α) and one to compute the backward probabilities (β)

Page 32: Sequence Models I

Forward-BackwardAlgorithm

‣ Initial: \alpha_1(s) = P(s)\,P(x_1 \mid s)

‣ Recurrence: \alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\,P(s_t \mid s_{t-1})\,P(x_t \mid s_t)

‣ Same as Viterbi but summing instead of maxing!

‣ These quantities get very small! Store everything as log probabilities

Page 33: Sequence Models I

Forward-Backward Algorithm

‣ Initial: \beta_n(s) = 1

‣ Recurrence: \beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1})\,P(s_{t+1} \mid s_t)\,P(x_{t+1} \mid s_{t+1})

‣ Big differences: count the emission for the next time step (not the current one)

Page 34: Sequence Models I

Forward-Backward Algorithm

\alpha_1(s) = P(s)\,P(x_1 \mid s)
\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\,P(s_t \mid s_{t-1})\,P(x_t \mid s_t)

\beta_n(s) = 1
\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1})\,P(s_{t+1} \mid s_t)\,P(x_{t+1} \mid s_{t+1})

‣ Big differences: count the emission for the next time step (not the current one)

Page 35: Sequence Models I

Forward-Backward Algorithm

\alpha_1(s) = P(s)\,P(x_1 \mid s)
\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\,P(s_t \mid s_{t-1})\,P(x_t \mid s_t)

\beta_n(s) = 1
\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1})\,P(s_{t+1} \mid s_t)\,P(x_{t+1} \mid s_{t+1})

P(s_3 = 2 \mid \mathbf{x}) = \frac{\alpha_3(2)\,\beta_3(2)}{\sum_i \alpha_3(i)\,\beta_3(i)}

‣ What is the denominator here? P(\mathbf{x})
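A minimal sketch of these recurrences and the resulting marginals. Probabilities are multiplied directly here for readability; as noted above, a real implementation would store log probabilities or rescale:

```python
def forward_backward(words, tags, init_probs, trans_probs, emit_probs):
    """Compute marginals P(y_t = s | x) via the forward-backward recurrences.

    Returns a list of dicts, one per position, mapping tag -> marginal probability.
    """
    n = len(words)

    # forward: alpha[t][s] sums over all tag prefixes ending in s at time t
    alpha = [{} for _ in range(n)]
    for s in tags:
        alpha[0][s] = init_probs[s] * emit_probs[s][words[0]]
    for t in range(1, n):
        for s in tags:
            alpha[t][s] = sum(alpha[t - 1][p] * trans_probs[p][s] for p in tags) \
                          * emit_probs[s][words[t]]

    # backward: beta[t][s] sums over all tag suffixes after time t
    beta = [{} for _ in range(n)]
    for s in tags:
        beta[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in tags:
            beta[t][s] = sum(trans_probs[s][nxt] * emit_probs[nxt][words[t + 1]] * beta[t + 1][nxt]
                             for nxt in tags)

    # P(y_t = s | x) = alpha_t(s) * beta_t(s) / sum_s' alpha_t(s') * beta_t(s')
    marginals = []
    for t in range(n):
        z = sum(alpha[t][s] * beta[t][s] for s in tags)  # equals P(x) at every t
        marginals.append({s: alpha[t][s] * beta[t][s] / z for s in tags})
    return marginals
```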

Page 36: Sequence Models I

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

Slide credit: Dan Klein

Page 37: Sequence Models I

Trigram Taggers

‣ Trigram model: y1 = (<S>, NNP), y2 = (NNP, VBZ), …

‣ P((VBZ, NN) | (NNP, VBZ)) — more context! Noun-verb-noun, S-V-O

Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN

‣ Tradeoff between model capacity and data size — trigrams are a “sweet spot” for POS tagging
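One way to picture this: a trigram tagger is still a bigram HMM, just over pair-valued states. A small illustrative sketch of the re-encoding (the helper name is ours, not from the slides):

```python
def to_trigram_states(tags, start="<S>"):
    """Re-encode a tag sequence as pair states, so a trigram tagger can reuse
    the bigram HMM machinery: transitions like P((VBZ, NN) | (NNP, VBZ))."""
    padded = [start] + tags
    return list(zip(padded, padded[1:]))

# to_trigram_states(["NNP", "VBZ", "NN", "NNS", "CD", "NN"])
# -> [("<S>", "NNP"), ("NNP", "VBZ"), ("VBZ", "NN"), ("NN", "NNS"), ("NNS", "CD"), ("CD", "NN")]
```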

Page 38: Sequence Models I

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

Slide credit: Dan Klein

‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+ on unks

https://arxiv.org/pdf/cs/0003055.pdf

Page 39: Sequence Models I

Errors

official knowledge: JJ/NN NN
made up the story: VBD RP/IN DT NN
recently sold shares: RB VBD/VBN NNS

Slide credit: Dan Klein / Toutanova + Manning (2000) (NN NN: tax cut, art gallery, …)

Page 40: Sequence Models I

Remaining Errors

‣ Underspecified/unclear, gold standard inconsistent/wrong: 58%

‣ Lexicon gap (word not seen with that tag in training): 4.5%
‣ Unknown word: 4.5%
‣ Could get right: 16% (many of these involve parsing!)

‣ Difficult linguistics: 20%

They set up absurd situations, detached from reality: VBD/VBP? (past or present?)

a $10 million fourth-quarter charge against discontinued operations: adjective or verbal participle? JJ/VBN?

Manning 2011, “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?”

Page 41: Sequence Models I

Other Languages

Petrov et al. 2012

Page 42: Sequence Models I

Other Languages

‣ Universal POS tagset (~12 tags); a cross-lingual model works as well as a tuned CRF using external resources

Gillick et al. 2016

Byte-to-Span

Page 43: Sequence Models I

Zero-shot Cross-lingual Transfer Learning

‣ Models are trained on annotated English data, then directly applied to Arabic text for POS tagging.

Lan, Chen, Xu, Ritter 2020

Page 44: Sequence Models I

Zero-shot Cross-lingual Transfer Learning

‣ Models are trained on annotated English data, then directly applied to Arabic text for POS tagging. (Lan, Chen, Xu, Ritter 2020)

Page 45: Sequence Models I

Next Up

‣ CRFs: feature-based discriminative models

‣ Named entity recognition