Reading Comprehension


TRANSCRIPT

Page 1:

Reading Comprehension

Page 2:

Bidirectional Attention Flow

Seo et al. (2016)

Each passage word now "knows about" the query
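To make "knows about" concrete, here is a minimal sketch (not Seo et al.'s code) of the context-to-query attention step in PyTorch: each passage word attends over the query and mixes that summary into its own representation. BiDAF's real similarity score is a learned function of [h; u; h∘u] and the model also has a query-to-context direction; the plain dot-product similarity below is a simplification.

import torch
import torch.nn.functional as F

def context_to_query_attention(H, U):
    """H: (T, d) encoded passage words, U: (J, d) encoded query words.
    Returns (T, 3d) passage vectors augmented with query information."""
    S = H @ U.T                    # (T, J): similarity of each passage word to each query word
    A = F.softmax(S, dim=1)        # attention weights over query positions, per passage word
    U_tilde = A @ U                # (T, d): query summary seen by each passage word
    return torch.cat([H, U_tilde, H * U_tilde], dim=-1)

# Toy usage: 20 passage words, 6 query words, 128-dim encodings
G = context_to_query_attention(torch.randn(20, 128), torch.randn(6, 128))
print(G.shape)                     # torch.Size([20, 384])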

Page 3:

QANet

‣ One of many models building on BiDAF in more complex ways

Yu et al. (2018)

‣ Similar structure to BiDAF, but transformer layers (next lecture) instead of LSTMs

Page 4:

SQuAD SOTA: Fall 2018

‣ nlnet, QANet, r-net: dueling, super-complex systems (much more than BiDAF…)

‣ BiDAF: 73 EM / 81 F1
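EM (exact match) and F1 here are the standard SQuAD answer-string metrics. A minimal sketch of how they are computed, omitting the official script's article/punctuation stripping and the max over multiple reference answers:

from collections import Counter

def exact_match(pred, gold):
    return float(pred.strip().lower() == gold.strip().lower())

def f1(pred, gold):
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p_toks), overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the 1990s", "1990s"), f1("the 1990s", "1990s"))   # 0.0 0.666...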

Page 5:

SQuAD 2.0 SOTA: Spring 2019

‣ Since spring 2019: SQuAD performance is dominated by large pre-trained models like BERT

‣ Harder variant of SQuAD

Page 6:

Adversarial Examples

‣ Can construct adversarial examples that fool these systems: add one carefully chosen sentence and performance drops to below 50%

Jia and Liang (2017)

‣ Still "surface-level" matching, not complex understanding

‣ Other challenges: recognizing when answers aren't present, doing multi-step reasoning

Page 7:

Pre-training / ELMo

Page 8:

What is pre-training?

‣ "Pre-train" a model on a large dataset for task X, then "fine-tune" it on a dataset for task Y (see the sketch after this slide's bullets)

‣ Key idea: X is somewhat related to Y, so a model that can do X will have some good neural representations for Y as well

‣ GloVe can be seen as pre-training: learn vectors with the skip-gram objective on large data (task X), then fine-tune them as part of a neural network for sentiment / any other task (task Y)

‣ ImageNet pre-training is huge in computer vision: learn generic visual features for recognizing objects
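The sketch referenced above: a minimal, hypothetical PyTorch setup (toy encoder and made-up dimensions, not any particular paper's model) showing the pre-train-then-fine-tune pattern, where the same encoder parameters are trained on task X and then reused inside a new head for task Y.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared encoder: these are the parameters that get transferred."""
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        states, _ = self.lstm(self.embed(tokens))
        return states                                 # (batch, seq_len, dim)

class TaskModel(nn.Module):
    """The shared encoder plus a task-specific prediction head."""
    def __init__(self, encoder, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(256, num_labels)        # randomly initialized per task

    def forward(self, tokens):
        return self.head(self.encoder(tokens)[:, -1]) # predict from the last hidden state

encoder = Encoder()
model_x = TaskModel(encoder, num_labels=10000)   # task X: e.g. next-word prediction on big data
# ... pre-train model_x here ...
model_y = TaskModel(encoder, num_labels=5)       # task Y: e.g. 5-class sentiment, same encoder
# ... fine-tune model_y on the (much smaller) task-Y dataset ...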

Page 9:

GloVe is insufficient

‣ GloVe uses a lot of data, but in a weak way

‣ Having a single embedding for each word is wrong

‣ Identifying discrete word senses is hard and doesn't scale: it's hard to identify how many senses each word has

‣ Take a powerful language model, train it on large amounts of data, then use those representations in downstream tasks

they hit the balls / they dance at balls

‣ How can we make our word embeddings more context-dependent?

Page 10:

Context-dependent Embeddings

Peters et al. (2018)

‣ Train a neural language model to predict the next word given the previous words in the sentence, and use the hidden states (outputs) at each step as word embeddings (see the sketch at the end of this slide)

they hit the balls / they dance at balls

‣ This is the key idea behind ELMo: language models can allow us to form useful word representations in the same way word2vec did
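The sketch referenced above: a toy PyTorch language model (hypothetical sizes, not ELMo itself) whose hidden states serve as context-dependent word embeddings, so "balls" gets different vectors in the two example sentences.

import torch
import torch.nn as nn

vocab = {w: i for i, w in enumerate(["they", "hit", "the", "balls", "dance", "at"])}

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.next_word = nn.Linear(dim, vocab_size)   # the language-modeling objective

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.next_word(hidden), hidden         # (next-word logits, hidden states)

lm = TinyLM(len(vocab))

def contextual_embeddings(sentence):
    ids = torch.tensor([[vocab[w] for w in sentence.split()]])
    _, hidden = lm(ids)
    return hidden[0]                                  # one vector per word, context-dependent

v1 = contextual_embeddings("they hit the balls")[3]   # "balls" after "they hit the"
v2 = contextual_embeddings("they dance at balls")[3]  # "balls" after "they dance at"
# Unlike a single GloVe vector, v1 and v2 differ because the contexts differ.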

Page 11:

ELMo

‣ CNN over each word => RNN (architecture sketched below)

[Architecture diagram: a CharCNN over each word of "John visited Madagascar yesterday" (2048 CNN filters projected down to 512-dim) feeds 4096-dim LSTMs trained to predict the next word; the representation of "visited" is the corresponding LSTM state, plus vectors from another LM running backwards]

Peters et al. (2018)

*Getting this model right took years
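A shape-level sketch of the pieces named on this slide: a character CNN builds a 512-dim vector for each word, and large LSTMs run a forward and a backward language model over the word vectors. This is only an illustration under those stated sizes; the real ELMo uses multiple CNN filter widths, highway layers, two projected LSTM layers, and a learned mix of layers.

import torch
import torch.nn as nn

class ELMoLikeLM(nn.Module):
    """Shape-level illustration only, not the released ELMo code."""
    def __init__(self, n_chars=262, char_dim=16, word_dim=512, lstm_dim=4096):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        # "2048 CNN filters projected down to 512-dim" (one filter width here; ELMo uses several)
        self.char_cnn = nn.Conv1d(char_dim, 2048, kernel_size=3, padding=1)
        self.proj = nn.Linear(2048, word_dim)
        self.fwd_lstm = nn.LSTM(word_dim, lstm_dim, batch_first=True)   # forward LM
        self.bwd_lstm = nn.LSTM(word_dim, lstm_dim, batch_first=True)   # backward LM

    def word_vectors(self, char_ids):                 # char_ids: (n_words, chars_per_word)
        c = self.char_embed(char_ids).transpose(1, 2) # (n_words, char_dim, chars_per_word)
        pooled = self.char_cnn(c).max(dim=2).values   # max-pool each filter over characters
        return self.proj(pooled)                      # (n_words, 512) word vectors

    def forward(self, char_ids):
        w = self.word_vectors(char_ids).unsqueeze(0)          # (1, n_words, 512)
        fwd, _ = self.fwd_lstm(w)                             # states predicting the next word
        bwd, _ = self.bwd_lstm(torch.flip(w, dims=[1]))       # run right-to-left
        return torch.cat([fwd, torch.flip(bwd, dims=[1])], dim=-1)   # per-word representation

# Toy usage: "John visited Madagascar yesterday" as 4 words of up to 10 characters each
reps = ELMoLikeLM()(torch.randint(0, 262, (4, 10)))
print(reps.shape)    # torch.Size([1, 4, 8192]); "visited" is reps[0, 1]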

Page 12:

Training ELMo

‣ Data: 1B Word Benchmark (Chelba et al., 2014)

‣ Pre-training time: 2 weeks on 3 NVIDIA GTX 1080 GPUs

‣ Much lower time cost if we used V100s / Google's TPUs, but still hundreds of dollars in compute cost to train once

‣ Larger BERT models trained on more data (next week) cost $10k+

‣ Pre-training is expensive, but fine-tuning is doable

Page 13:

How to apply ELMo?

[Diagram: ELMo embeddings of "they dance at balls" feed into some neural network that produces task predictions (sentiment, etc.)]

‣ Take those embeddings and feed them into whatever architecture you want to use for your task

‣ Frozen embeddings (most common): update the weights of your network but keep ELMo's parameters frozen (both options are sketched below)

‣ Fine-tuning: backpropagate all the way into ELMo when training your model
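The two options sketched, with hypothetical stand-in modules (any encoder producing contextual embeddings plus any downstream network); only the requires_grad/optimizer choices differ between the two regimes.

import torch
import torch.nn as nn

# Hypothetical stand-ins: `elmo` is whatever module produces the contextual
# embeddings, `task_model` is the downstream network that consumes them.
elmo = nn.LSTM(300, 512, batch_first=True)
task_model = nn.Linear(512, 5)        # e.g. 5-class sentiment

# Option 1: frozen embeddings (most common). ELMo contributes features only.
for p in elmo.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(task_model.parameters(), lr=1e-3)

# Option 2: fine-tuning. Backpropagate all the way into ELMo's weights as well.
for p in elmo.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam(
    list(elmo.parameters()) + list(task_model.parameters()), lr=1e-4)

# The forward pass is the same in both regimes:
tokens = torch.randn(2, 7, 300)                 # a batch of 2 sentences, 7 words each
embeddings, _ = elmo(tokens)                    # (2, 7, 512) contextual embeddings
predictions = task_model(embeddings[:, -1])     # task predictions (sentiment, etc.)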

Page 14:

Results: Frozen ELMo

‣ Massive improvements, beating models hand-crafted for each task

Peters et al. (2018)

[Results table callouts: QA; a five-class version of sentiment from A1-A2; a task that is (sort of) like dependency parsing]

‣ These are mostly text analysis tasks. Other pre-training approaches are needed for text generation tasks like translation

Page 15:

Why is language modeling a good objective?

‣ An "impossible" problem, but bigger models seem to do better and better at distributional modeling (no upper limit yet)

‣ Successfully predicting next words requires modeling lots of different effects in text

Page 16:

Probing ELMo

‣ From each layer of the ELMo model, attempt to predict something: POS tags, word senses, etc. (a probe sketch follows this slide)

‣ Higher accuracy => ELMo is capturing that thing more strongly

Peters et al. (2018)
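The probe sketch referenced above, under these assumptions: a frozen stand-in encoder takes the place of a pre-trained ELMo layer, and POS tagging is the probing task. Only the linear probe is trained; its held-out accuracy is read as a measure of how strongly that layer encodes POS.

import torch
import torch.nn as nn

# Stand-in frozen encoder; real probing would take states from a chosen ELMo layer.
encoder = nn.LSTM(100, 256, batch_first=True)
for p in encoder.parameters():
    p.requires_grad = False

num_pos_tags = 17
probe = nn.Linear(256, num_pos_tags)            # the only trainable parameters
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(word_vectors, pos_tags):         # (B, T, 100) inputs, (B, T) gold tags
    with torch.no_grad():
        states, _ = encoder(word_vectors)       # (B, T, 256) frozen representations
    logits = probe(states)                      # (B, T, num_pos_tags)
    loss = loss_fn(logits.reshape(-1, num_pos_tags), pos_tags.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = probe_step(torch.randn(4, 10, 100), torch.randint(0, num_pos_tags, (4, 10)))
# Held-out accuracy of the probe => how strongly that layer encodes POS.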

Page 17:

Analysis

Peters et al. (2018)

Page 18:

Takeaways

‣ Learning a large language model can be an effective way of generating "word embeddings" informed by their context

‣ Next class: transformers and BERT

‣ Pre-training on massive amounts of data can improve performance on tasks like QA