Reading Comprehension
Bidirectional Attention Flow (BiDAF)
Seo et al. (2016)
Each passage word now "knows about" the query
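A minimal sketch of the context-to-query half of this attention (not the full BiDAF model, which also has query-to-context attention and a learned similarity function; a plain dot product stands in for the similarity here):

```python
import torch
import torch.nn.functional as F

def context_to_query_attention(passage: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """Simplified BiDAF-style context-to-query attention.

    passage: (T, d) passage word encodings
    query:   (J, d) query word encodings
    Returns (T, d): for each passage word, a summary of the query it attends to.
    """
    sim = passage @ query.T          # (T, J) similarity of every passage/query pair
    alpha = F.softmax(sim, dim=-1)   # attention distribution over query words
    return alpha @ query             # attended query vector per passage word
```

Concatenating these attended vectors onto the passage encodings is what lets each passage word "know about" the query.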
QANet
‣ One of many models building on BiDAF in more complex ways
Yu et al. (2018)
‣ Similar structure to BiDAF, but transformer layers (next lecture) instead of LSTMs
SQuAD SOTA: Fall 2018
‣ nlnet, QANet, r-net: dueling, super-complex systems (much more than BiDAF…)
‣ BiDAF: 73 EM / 81 F1
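For reference, a minimal sketch of how SQuAD-style EM and token-level F1 are computed for a single prediction (ignoring SQuAD's answer normalization of punctuation and articles, and its max over multiple gold answers):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    # EM: 1 if the predicted span matches the gold span exactly.
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    # F1: harmonic mean of token-level precision and recall.
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```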
SQuAD 2.0 SOTA: Spring 2019
‣ Since spring 2019: SQuAD performance is dominated by large pre-trained models like BERT
‣ A harder variant of SQuAD (includes unanswerable questions)
Adversarial Examples
‣ Can construct adversarial examples that fool these systems: add one carefully chosen sentence to the passage and performance drops to below 50%
Jia and Liang (2017)
‣ Still "surface-level" matching, not complex understanding
‣ Other challenges: recognizing when answers aren't present, doing multi-step reasoning
Pre-training / ELMo
What is pre-training?
‣ "Pre-train" a model on a large dataset for task X, then "fine-tune" it on a dataset for task Y
‣ Key idea: X is somewhat related to Y, so a model that can do X will have some good neural representations for Y as well
‣ GloVe can be seen as pre-training: learn vectors with the GloVe objective on large data (task X), then fine-tune them as part of a neural network for sentiment or any other task (task Y); see the sketch after this list
‣ ImageNet pre-training is huge in computer vision: learn generic visual features for recognizing objects
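A minimal PyTorch sketch of this recipe; `pretrained_vectors` is a hypothetical stand-in for GloVe vectors loaded from disk:

```python
import torch
import torch.nn as nn

# Stand-in for a (vocab_size, 300) matrix of GloVe vectors, e.g. parsed
# from glove.6B.300d.txt (hypothetical; random here for self-containment).
vocab_size, dim = 10000, 300
pretrained_vectors = torch.randn(vocab_size, dim)

class SentimentNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Task X: initialize the embedding layer from pre-trained vectors
        # instead of randomly; freeze=False lets task Y fine-tune them.
        self.embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        self.classifier = nn.Linear(dim, 2)

    def forward(self, word_ids):
        # Average word vectors, then classify (task Y: sentiment).
        return self.classifier(self.embed(word_ids).mean(dim=0))
```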
GloVe is insufficient
‣ GloVe uses a lot of data, but in a weak way
‣ Having a single embedding for each word is wrong
‣ Identifying discrete word senses is hard and doesn't scale: it's hard to even identify how many senses each word has
‣ Instead: take a powerful language model, train it on large amounts of data, then use those representations in downstream tasks
they hit the balls / they dance at balls
‣ How can we make our word embeddings more context-dependent?
Context-dependent Embeddings
Peters et al. (2018)
‣ Train a neural language model to predict the next word given the previous words in the sentence, then use the hidden states (outputs) at each step as word embeddings
they hit the balls / they dance at balls
‣ This is the key idea behind ELMo: language models can allow us to form useful word representations, in the same way word2vec did
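A minimal sketch of the idea with an LSTM language model (untrained here; in practice it would first be trained on next-word prediction): run it over a sentence and take the hidden state at each position as that word's context-dependent embedding.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)  # predicts the next word

    def forward(self, word_ids):
        hidden, _ = self.lstm(self.embed(word_ids))  # (batch, T, dim)
        return self.out(hidden), hidden

lm = LSTMLanguageModel(vocab_size=10000)
sentence = torch.tensor([[3, 41, 5, 926]])   # hypothetical ids: "they hit the balls"
_, contextual_embeddings = lm(sentence)      # (1, 4, 512): one vector per word,
                                             # each informed by the words before it
```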
ELMo
‣ CNN over each word => RNN
[Figure: "John visited Madagascar yesterday"; a CharCNN over each word (2048 CNN filters projected down to 512-dim) feeds 4096-dim LSTMs trained to predict the next word. The state above visited is its representation (plus vectors from another LM running backwards).]
Peters et al. (2018)
*getting this model right took years
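A rough structural sketch of one direction of this architecture, with the dimensions from the slide. This is not the released ELMo code, which also has multiple CNN filter widths, highway layers, residual connections, and between-layer projections:

```python
import torch
import torch.nn as nn

class ELMoSketch(nn.Module):
    """Rough sketch of one direction of ELMo (not the real implementation)."""
    def __init__(self, char_vocab=262, n_filters=2048, proj=512, lstm_dim=4096):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, 16)
        # Character CNN over each word (single filter width here for brevity).
        self.char_cnn = nn.Conv1d(16, n_filters, kernel_size=3, padding=1)
        self.project = nn.Linear(n_filters, proj)  # 2048 filters -> 512-dim
        self.lstm = nn.LSTM(proj, lstm_dim, num_layers=2, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (batch, n_words, n_chars) character ids per word
        b, t, c = char_ids.shape
        x = self.char_embed(char_ids.view(b * t, c)).transpose(1, 2)
        x = self.char_cnn(x).max(dim=-1).values   # max-pool over characters
        x = self.project(x).view(b, t, -1)        # one 512-dim vector per word
        hidden, _ = self.lstm(x)                  # contextual representations
        return hidden                             # (batch, n_words, 4096)
```

The full model runs a second LM right-to-left and concatenates the two directions' states for each word.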
Training ELMo
‣ Data: 1B Word Benchmark (Chelba et al., 2014)
‣ Pre-training time: 2 weeks on 3 NVIDIA GTX 1080 GPUs
‣ Much lower time cost if we used V100s / Google's TPUs, but still hundreds of dollars in compute cost to train once
‣ Larger BERT models trained on more data (next week) cost $10k+
‣ Pre-training is expensive, but fine-tuning is doable
How to apply ELMo?
[Figure: ELMo embeddings of "they dance at balls" feed into some neural network that produces task predictions (sentiment, etc.)]
‣ Take those embeddings and feed them into whatever architecture you want to use for your task
‣ Frozen embeddings (most common): update the weights of your network but keep ELMo's parameters frozen (see the sketch below)
‣ Fine-tuning: backpropagate all the way into ELMo when training your model
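A minimal sketch of the frozen-embeddings recipe, treating the ELMo encoder as an opaque `elmo` module (hypothetical; e.g. an instantiated AllenNLP ELMo model):

```python
import torch.nn as nn

def build_task_model(elmo: nn.Module, elmo_dim: int, n_classes: int):
    # Frozen embeddings (most common): ELMo's parameters get no gradients...
    for p in elmo.parameters():
        p.requires_grad = False
    # ...while the task network on top is trained as usual.
    task_net = nn.Sequential(
        nn.Linear(elmo_dim, 256), nn.ReLU(),
        nn.Linear(256, n_classes),
    )
    return elmo, task_net

# For fine-tuning instead, skip the freezing loop and give the optimizer
# both elmo.parameters() and task_net.parameters().
```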
Results: Frozen ELMo
‣ Massive improvements, beating models handcrafted for each task
Peters et al. (2018)
[Results table: tasks include a five-class version of sentiment from A1-A2, QA, and a task (sort of) like dependency parsing]
‣ These are mostly text analysis tasks; other pre-training approaches are needed for text generation tasks like translation
Why is language modeling a good objective?
‣ An "impossible" problem, but bigger models seem to do better and better at distributional modeling (no upper limit yet)
‣ Successfully predicting next words requires modeling lots of different effects in text
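Concretely, the forward LM maximizes the standard log-likelihood of each word given its history (the backward LM does the same right-to-left):

```latex
% Forward language modeling objective over a sentence w_1, ..., w_T
\mathcal{L} = \sum_{t=1}^{T} \log P(w_t \mid w_1, \dots, w_{t-1})
```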
Probing ELMo
‣ From each layer of the ELMo model, attempt to predict something: POS tags, word senses, etc.
‣ Higher accuracy => ELMo is capturing that property more strongly (a minimal probe sketch follows)
Peters et al. (2018)
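A minimal sketch of such a probe, assuming `layer_reprs` holds per-token vectors extracted from one ELMo layer and `tags` holds gold POS tag ids (both hypothetical stand-ins): fit a linear classifier and read its accuracy as a measure of what the layer encodes.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: (n_tokens, 512) representations from one ELMo
# layer, plus a gold POS tag id per token (45 Penn Treebank tags).
layer_reprs = torch.randn(5000, 512)
tags = torch.randint(0, 45, (5000,))

probe = nn.Linear(512, 45)               # a linear probe, nothing more
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(layer_reprs), tags)
    loss.backward()
    opt.step()

# Probe accuracy: how linearly decodable POS is from this layer
accuracy = (probe(layer_reprs).argmax(-1) == tags).float().mean()
```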
Analysis
Peters et al. (2018)
Takeaways
‣ Learning a large language model can be an effective way of generating "word embeddings" informed by their context
‣ Next class: transformers and BERT
‣ Pre-training on massive amounts of data can improve performance on tasks like QA