
Page 1: Sequence Models I

Sequence Models I

Wei Xu (many slides from Greg Durrett, Dan Klein, Vivek Srikumar, Chris Manning, Yoav Artzi)

Page 2: Sequence Models I

Administrivia

‣ Project 1 is out, due on Sep 20 (next Monday!).

‣ Reading: Eisenstein 7.0-7.4, Jurafsky + Martin Chapter 8

Page 3: Sequence Models I

This Lecture

‣ Sequence modeling

‣ HMMs for POS tagging

‣ Viterbi, forward-backward

‣ HMM parameter estimation

Page 4: Sequence Models I

Linguistic Structures

‣ Language is tree-structured

I ate the spaghetti with chopsticks / I ate the spaghetti with meatballs

‣ Understanding syntax fundamentally requires trees — the sentences have the same shallow analysis

I ate the spaghetti with chopsticks / I ate the spaghetti with meatballs
PRP VBZ DT NN IN NNS / PRP VBZ DT NN IN NNS

Page 5: Sequence Models I

Linguistic Structures

‣ Language is sequentially structured: interpreted in an online way

Tanenhaus et al. (1995)

Page 6: Sequence Models I

POS Tagging

Ghana’s ambassador should have set up the big meeting in DC yesterday.

‣ What tags are out there?

NNP POS NN MD VB VBN RP DT JJ NN IN NNP NN .

Page 7: Sequence Models I

POS Tagging

Slide credit: Dan Klein

Page 8: Sequence Models I

POS Tagging

Slide credit: Yoav Artzi

Page 9: Sequence Models I

POS Tagging

Fed raises interest rates 0.5 percent

Fed: VBD VBN NNP
raises: VBZ NNS
interest: VB VBP NN
rates: VBZ NNS
0.5: CD
percent: NN

I’m 0.5% interested in the Fed’s raises!

I hereby increase interest rates 0.5%

Fed raises interest rates 0.5 percent (tag lattice as above)

‣ Other paths are also plausible but even more semantically weird…
‣ What governs the correct choice? Word + context
‣ Word identity: most words have <= 2 tags, many have one (percent, the)
‣ Context: nouns start sentences, nouns follow verbs, etc.

Page 10: Sequence Models I

What is this good for?

‣ Text-to-speech: record, lead

‣ Preprocessing step for syntactic parsers

‣ Domain-independent disambiguation for other tasks

‣ (Very) shallow information extraction

Page 11: Sequence Models I

Sequence Models

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

‣ POS tagging: x is a sequence of words, y is a sequence of tags

‣ Today: generative models P(x, y); discriminative models next time

Page 12: Sequence Models I

Hidden Markov Models

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

‣ Model the sequence of y as a Markov process: y1 → y2 → y3 → …

‣ Markov property: the future is conditionally independent of the past given the present

‣ If y are tags, this roughly corresponds to assuming that the next tag only depends on the current tag, not anything before

P(y_3 \mid y_1, y_2) = P(y_3 \mid y_2)

‣ Lots of mathematical theory about how Markov chains behave

Page 13: Sequence Models I

Hidden Markov Models

[HMM graphical model: y1 → y2 → … → yn, with each y_i emitting x_i]

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

Fed raises … percent
NNP VBZ … NN

Page 14: Sequence Models I

Hidden Markov Models

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

[HMM graphical model: y1 → y2 → … → yn, with each y_i emitting x_i]

P(\mathbf{y}, \mathbf{x}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)

P(y_1): initial distribution; P(y_i \mid y_{i-1}): transition probabilities; P(x_i \mid y_i): emission probabilities

‣ P(x|y) is a distribution over all words in the vocabulary — not a distribution over features (but could be!)

‣ Multinomials: tag x tag transitions, tag x word emissions

‣ Observation (x) depends only on current state (y)
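To make this factorization concrete, here is a minimal sketch of scoring a tagged sentence under the HMM in log space. The dictionaries init_probs, trans_probs, and emit_probs are hypothetical names for (smoothed) parameters estimated as on the following slides:

```python
import math

def hmm_log_joint(words, tags, init_probs, trans_probs, emit_probs):
    """log P(y, x) = log P(y1) + sum_i log P(yi | yi-1) + sum_i log P(xi | yi).

    init_probs[tag], trans_probs[prev_tag][tag], emit_probs[tag][word] are
    assumed to be smoothed probabilities estimated from supervised data.
    """
    score = math.log(init_probs[tags[0]])
    for prev, curr in zip(tags, tags[1:]):          # transition terms
        score += math.log(trans_probs[prev][curr])
    for word, tag in zip(words, tags):              # emission terms
        score += math.log(emit_probs[tag][word])
    return score

# e.g. hmm_log_joint(["Fed", "raises"], ["NNP", "VBZ"],
#                    init_probs, trans_probs, emit_probs)
```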

Page 15: Sequence Models I

Transitions in POS Tagging

‣ Dynamics model:

P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1})

Fed raises interest rates 0.5 percent .

Fed: VBD VBN NNP
raises: VBZ NNS
interest: VB VBP NN
rates: VBZ NNS
0.5: CD
percent: NN

P(y_1 = NNP): likely because start of sentence
P(y_2 = VBZ \mid y_1 = NNP): likely because a verb often follows a noun
P(y_3 = NN \mid y_2 = VBZ): direct object follows verb; another verb rarely follows a past-tense verb (main verbs can follow modals though!)

NNP - proper noun, singular; VBZ - verb, 3rd person singular present; NN - noun, singular or mass.

Page 16: Sequence Models I

Estimating Transitions

‣ Similar to Naive Bayes estimation: maximum likelihood solution = normalized counts (with smoothing) read off supervised data

Fed raises interest rates 0.5 percent .
NNP VBZ NN NNS CD NN

‣ How to smooth?

‣ One method: smooth with the unigram distribution over tags

P(\text{tag} \mid \text{tag}_{-1}) = (1 - \lambda)\,\hat{P}(\text{tag} \mid \text{tag}_{-1}) + \lambda\,\hat{P}(\text{tag})

\hat{P} = empirical distribution (read off from data)

‣ P(tag | NN) = (0.5 ".", 0.5 NNS)
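As a concrete sketch of this estimation step, the following counts transitions from a tagged corpus and interpolates with the unigram tag distribution. The function and variable names (estimate_transitions, lam) are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

def estimate_transitions(tagged_sents, lam=0.1):
    """P(tag | prev) = (1 - lam) * MLE(tag | prev) + lam * MLE(tag).

    tagged_sents: list of tag sequences, e.g. [["NNP", "VBZ", "NN", ...], ...]
    """
    bigram_counts = defaultdict(Counter)
    unigram_counts = Counter()
    for tags in tagged_sents:
        unigram_counts.update(tags)
        for prev, curr in zip(tags, tags[1:]):
            bigram_counts[prev][curr] += 1

    total = sum(unigram_counts.values())
    unigram = {t: c / total for t, c in unigram_counts.items()}

    trans = {}
    for prev, counts in bigram_counts.items():
        prev_total = sum(counts.values())
        # interpolate the bigram MLE with the unigram tag distribution
        trans[prev] = {t: (1 - lam) * counts[t] / prev_total + lam * unigram[t]
                       for t in unigram}
    return trans

# e.g. estimate_transitions([["NNP", "VBZ", "NN", "NNS", "CD", "NN", "."]])["NN"]
```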

Page 17: Sequence Models I

Emissions in POS Tagging

‣ Emissions P(x|y) capture the distribution of words occurring with a given tag

‣ P(word | NN) = (0.05 person, 0.04 official, 0.03 interest, 0.03 percent, …)

‣ When you compute the posterior for a given word’s tags, the distribution favors tags that are more likely to generate that word

‣ How should we smooth this?

Fed raises interest rates 0.5 percent .
NNP VBZ NN NNS CD NN .

Page 18: Sequence Models I

Estimating Emissions

Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN

‣ P(word | NN) = (0.5 interest, 0.5 percent) — hard to smooth!

‣ Fancy techniques from language modeling, e.g. look at type fertility — P(tag | word) is flatter for some kinds of words than for others

‣ Alternative: use Bayes’ rule:

P(\text{word} \mid \text{tag}) = \frac{P(\text{tag} \mid \text{word})\,P(\text{word})}{P(\text{tag})}

‣ Can interpolate with a distribution looking at word shape, P(word shape | tag) (e.g., P(capitalized word of length >= 8 | tag))

‣ P(word | tag) can be a log-linear model — we’ll see this in a few lectures
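A minimal sketch of the Bayes’-rule alternative above, with P(tag | word), P(word), and P(tag) all estimated from counts. The add-lambda smoothing of P(tag | word) is just a placeholder choice, and the names are illustrative:

```python
from collections import Counter

def emission_via_bayes(tagged_words, lam=0.01):
    """P(word | tag) = P(tag | word) * P(word) / P(tag), estimated from counts.

    tagged_words: list of (word, tag) pairs from a supervised corpus.
    """
    word_counts = Counter(w for w, _ in tagged_words)
    tag_counts = Counter(t for _, t in tagged_words)
    pair_counts = Counter(tagged_words)
    n = len(tagged_words)
    num_tags = len(tag_counts)

    def p_word_given_tag(word, tag):
        # add-lambda smoothed P(tag | word), then rearranged by Bayes' rule
        p_tag_given_word = (pair_counts[(word, tag)] + lam) / (word_counts[word] + lam * num_tags)
        p_word = word_counts[word] / n
        p_tag = tag_counts[tag] / n
        return p_tag_given_word * p_word / p_tag

    return p_word_given_tag

# p = emission_via_bayes([("Fed", "NNP"), ("raises", "VBZ"), ("interest", "NN")])
# p("interest", "NN")
```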

Page 19: Sequence Models I

Inference in HMMs

‣ Input: x = (x1, ..., xn); Output: y = (y1, ..., yn)

[HMM graphical model: y1 → y2 → … → yn, with each y_i emitting x_i]

P(\mathbf{y}, \mathbf{x}) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)

‣ Inference problem: \operatorname{argmax}_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \operatorname{argmax}_{\mathbf{y}} \frac{P(\mathbf{y}, \mathbf{x})}{P(\mathbf{x})}

‣ Exponentially many possible y here!

‣ Solution: dynamic programming (possible because of Markov structure!)

‣ Many neural sequence models depend on the entire previous tag sequence, need to use approximations like beam search

Page 20: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 21: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 22: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

best (partial) score for a sequence ending in state s

Page 23: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 24: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 25: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 26: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 27: Sequence Models I

Viterbi Algorithm

Slide credit: Vivek Srikumar

Page 28: Sequence Models I

Viterbi Algorithm

Slide credit: Dan Klein

‣ “Think about” all possible immediate prior state values. Everything before that has already been accounted for by earlier stages.
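Since the Viterbi recurrences themselves live in the slide figures, here is a minimal reference sketch of the algorithm in log space (data-structure names are illustrative):

```python
import math

def viterbi(words, tags, init_probs, trans_probs, emit_probs):
    """Return the highest-scoring tag sequence argmax_y P(y, x) under the HMM.

    init_probs[tag], trans_probs[prev][tag], emit_probs[tag][word] are assumed
    to be smoothed probabilities; scores are kept as log probabilities.
    """
    n = len(words)
    # score[i][s] = best log score of any tag sequence for words[:i+1] ending in s
    score = [{} for _ in range(n)]
    backptr = [{} for _ in range(n)]

    for s in tags:
        score[0][s] = math.log(init_probs[s]) + math.log(emit_probs[s][words[0]])

    for i in range(1, n):
        for s in tags:
            # "think about" all possible immediate prior states
            best_prev = max(tags, key=lambda p: score[i - 1][p] + math.log(trans_probs[p][s]))
            score[i][s] = (score[i - 1][best_prev]
                           + math.log(trans_probs[best_prev][s])
                           + math.log(emit_probs[s][words[i]]))
            backptr[i][s] = best_prev

    # follow back-pointers from the best final state
    last = max(tags, key=lambda s: score[n - 1][s])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))
```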

Page 29: Sequence Models I

Forward-Backward Algorithm

‣ In addition to finding the best path, we may want to compute marginal probabilities of paths:

P(y_i = s \mid \mathbf{x}) = \sum_{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n} P(\mathbf{y} \mid \mathbf{x})

‣ What did Viterbi compute? P(\mathbf{y}_{\max} \mid \mathbf{x}) = \max_{y_1, \ldots, y_n} P(\mathbf{y} \mid \mathbf{x})

‣ Can compute marginals with dynamic programming as well, using an algorithm called forward-backward

Page 30: Sequence Models I

Forward-Backward Algorithm

P(y_3 = 2 \mid \mathbf{x}) = \frac{\text{sum of all paths through state 2 at time 3}}{\text{sum of all paths}}

Page 31: Sequence Models I

Forward-Backward Algorithm

Slide credit: Dan Klein

P(y_3 = 2 \mid \mathbf{x}) = \frac{\text{sum of all paths through state 2 at time 3}}{\text{sum of all paths}}

‣ Easiest and most flexible to do one pass to compute the forward probabilities (α) and one to compute the backward probabilities (β)

Page 32: Sequence Models I

Forward-BackwardAlgorithm

‣ Initial: \alpha_1(s) = P(s)\,P(x_1 \mid s)

‣ Recurrence: \alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\,P(s_t \mid s_{t-1})\,P(x_t \mid s_t)

‣ Same as Viterbi but summing instead of maxing!

‣ These quantities get very small! Store everything as log probabilities

Page 33: Sequence Models I

Forward-Backward Algorithm

‣ Initial: \beta_n(s) = 1

‣ Recurrence: \beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1})\,P(s_{t+1} \mid s_t)\,P(x_{t+1} \mid s_{t+1})

‣ Big differences: count the emission for the next time step (not the current one)

Page 34: Sequence Models I

Forward-Backward Algorithm

\alpha_1(s) = P(s)\,P(x_1 \mid s)
\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\,P(s_t \mid s_{t-1})\,P(x_t \mid s_t)

\beta_n(s) = 1
\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1})\,P(s_{t+1} \mid s_t)\,P(x_{t+1} \mid s_{t+1})

‣ Big differences: count the emission for the next time step (not the current one)

Page 35: Sequence Models I

Forward-Backward Algorithm

\alpha_1(s) = P(s)\,P(x_1 \mid s)
\alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\,P(s_t \mid s_{t-1})\,P(x_t \mid s_t)

\beta_n(s) = 1
\beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1})\,P(s_{t+1} \mid s_t)\,P(x_{t+1} \mid s_{t+1})

P(s_3 = 2 \mid \mathbf{x}) = \frac{\alpha_3(2)\,\beta_3(2)}{\sum_i \alpha_3(i)\,\beta_3(i)}

‣ What is the denominator here? P(\mathbf{x})
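A minimal sketch of these recurrences and the resulting marginals. Probabilities are multiplied directly here for readability; as noted above, a real implementation would store log probabilities or rescale:

```python
def forward_backward(words, tags, init_probs, trans_probs, emit_probs):
    """Compute marginals P(y_t = s | x) via the forward-backward recurrences.

    Returns a list of dicts, one per position, mapping tag -> marginal probability.
    """
    n = len(words)

    # forward: alpha[t][s] sums over all tag prefixes ending in s at time t
    alpha = [{} for _ in range(n)]
    for s in tags:
        alpha[0][s] = init_probs[s] * emit_probs[s][words[0]]
    for t in range(1, n):
        for s in tags:
            alpha[t][s] = sum(alpha[t - 1][p] * trans_probs[p][s] for p in tags) \
                          * emit_probs[s][words[t]]

    # backward: beta[t][s] sums over all tag suffixes after time t
    beta = [{} for _ in range(n)]
    for s in tags:
        beta[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in tags:
            beta[t][s] = sum(trans_probs[s][nxt] * emit_probs[nxt][words[t + 1]] * beta[t + 1][nxt]
                             for nxt in tags)

    # P(y_t = s | x) = alpha_t(s) * beta_t(s) / sum_s' alpha_t(s') * beta_t(s')
    marginals = []
    for t in range(n):
        z = sum(alpha[t][s] * beta[t][s] for s in tags)  # equals P(x) at every t
        marginals.append({s: alpha[t][s] * beta[t][s] / z for s in tags})
    return marginals
```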

Page 36: Sequence Models I

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

Slide credit: Dan Klein

Page 37: Sequence Models I

Trigram Taggers

‣ Trigram model: y1 = (<S>, NNP), y2 = (NNP, VBZ), …

‣ P((VBZ, NN) | (NNP, VBZ)) — more context! Noun-verb-noun, S-V-O

Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN

‣ Tradeoff between model capacity and data size — trigrams are a “sweet spot” for POS tagging
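One way to picture this: a trigram tagger is still a bigram HMM, just over pair-valued states. A small illustrative sketch of the re-encoding (the helper name is ours, not from the slides):

```python
def to_trigram_states(tags, start="<S>"):
    """Re-encode a tag sequence as pair states, so a trigram tagger can reuse
    the bigram HMM machinery: transitions like P((VBZ, NN) | (NNP, VBZ))."""
    padded = [start] + tags
    return list(zip(padded, padded[1:]))

# to_trigram_states(["NNP", "VBZ", "NN", "NNS", "CD", "NN"])
# -> [("<S>", "NNP"), ("NNP", "VBZ"), ("VBZ", "NN"), ("NN", "NNS"), ("NNS", "CD"), ("CD", "NN")]
```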

Page 38: Sequence Models I

HMM POS Tagging

‣ Baseline: assign each word its most frequent tag: ~90% accuracy

‣ Trigram HMM: ~95% accuracy / 55% on unknown words

‣ TnT tagger (Brants 1998, tuned HMM): 96.2% accuracy / 86.0% on unks

Slide credit: Dan Klein

‣ State-of-the-art (BiLSTM-CRFs): 97.5% / 89%+ on unks

https://arxiv.org/pdf/cs/0003055.pdf

Page 39: Sequence Models I

Errors

official knowledge: JJ/NN NN
made up the story: VBD RP/IN DT NN
recently sold shares: RB VBD/VBN NNS

Slide credit: Dan Klein / Toutanova + Manning (2000) (NN NN: tax cut, art gallery, …)

Page 40: Sequence Models I

Remaining Errors

‣ Underspecified/unclear, gold standard inconsistent/wrong: 58%

‣ Lexicon gap (word not seen with that tag in training): 4.5%
‣ Unknown word: 4.5%
‣ Could get right: 16% (many of these involve parsing!)

‣ Difficult linguistics: 20%

They set up absurd situations, detached from reality: VBD/VBP? (past or present?)

a $10 million fourth-quarter charge against discontinued operations: adjective or verbal participle? JJ/VBN?

Manning 2011, “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?”

Page 41: Sequence Models I

Other Languages

Petrov et al. 2012

Page 42: Sequence Models I

Other Languages

‣ Universal POS tagset (~12 tags); a cross-lingual model works as well as a tuned CRF using external resources

Gillick et al. 2016

Byte-to-Span

Page 43: Sequence Models I

Zero-shot Cross-lingual Transfer Learning

‣ Models are trained on annotated English data, then directly applied to Arabic text for POS tagging.

Lan, Chen, Xu, Ritter 2020

Page 44: Sequence Models I

Zero-shot Cross-lingual Transfer Learning

‣ Models are trained on annotated English data, then directly applied to Arabic text for POS tagging. (Lan, Chen, Xu, Ritter 2020)

Page 45: Sequence Models I

Next Up

‣ CRFs: feature-based discriminative models

‣ Named entity recognition