180912 lec6 RNN Architectures (LSTM, GRU, and other recent RNN models, ...)
TRANSCRIPT
Algorithmic Intelligence Lab
RNN Architectures
EE807: Recent Advances in Deep Learning, Lecture 6
Slides made by Hyungwon Choi and Jongheon Jeong, KAIST EE
Recap: RNN basics
• Process a sequence of vectors x by applying a recurrence formula at every time step:
  h_t = f_W(h_{t-1}, x_t)
  where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at time step t, and f_W is a function parameterized by W
*reference: http://cs231n.stanford.edu/2017/
Recap: Vanilla RNN
• Simple RNN
  • The state consists of a single "hidden" vector
  • Vanilla RNN (sometimes called the Elman RNN); a minimal code sketch follows below
*reference: http://cs231n.stanford.edu/2017/
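As a concrete reference, here is a minimal NumPy sketch of one vanilla RNN step in the standard form h_t = tanh(W_hh h_{t-1} + W_xh x_t); the weight names and dimensions are illustrative assumptions, not taken from the original slides.

```python
import numpy as np

def vanilla_rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step of a vanilla (Elman) RNN: the whole state is a single hidden vector."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy usage (illustrative dimensions)
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a length-5 input sequence
    h = vanilla_rnn_step(x_t, h, W_xh, W_hh, b_h)
```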
Why do we develop RNN architectures?
• For the vanilla RNN, it is difficult to capture long-term dependencies
• Vanishing gradient problem in the vanilla RNN
  • The gradient vanishes over time
  • This relates to the same optimization difficulties seen in deep CNNs
*source: https://mediatum.ub.tum.de/doc/673554/file.pdf
Why do we develop RNN architectures?
• Much real-world temporal data is intrinsically long-term
  • Natural language
  • Speech
  • Video
• To solve more complicated real-world problems, we need better RNN architectures that capture the long-term dependencies in the data
Table of Contents
1. RNN Architectures and Comparisons
  • LSTM (Long Short-Term Memory) and its variants
  • GRU (Gated Recurrent Unit)
  • Stacked LSTM
  • Grid LSTM
  • Bi-directional LSTM
2. Breakthroughs of RNNs in Machine Translation
  • Sequence to Sequence Learning with Neural Networks
  • Neural Machine Translation with Attention
  • Google's Neural Machine Translation (GNMT)
3. Overcoming the heavy computations of RNNs
  • Convolutional Sequence to Sequence Learning
  • Exploring Sparsity in Recurrent Neural Networks
RNN Architectures: LSTM
• Long Short-Term Memory (LSTM)
  • A special type of RNN unit
  • i.e., LSTM networks = RNNs composed of LSTM units
  • Originally proposed by [Hochreiter and Schmidhuber, 1997]
• An RNN explicitly designed to
  • Capture long-term dependencies
  • Be more robust to the vanishing gradient problem
• Composed of a cell, an input gate, an output gate, and a forget gate (covered in detail soon)
*source: https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:The_LSTM_cell.png
RNN Architectures: LSTM
• A very old model, but why so popular?
  • Popularized by a series of follow-up works [Graves et al., 2013], [Sutskever et al., 2014], ...
  • Works very well on a variety of problems
    • Speech recognition
    • Machine translation
    • ...
Next, comparison with the vanilla RNN
*source: https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:The_LSTM_cell.png
RNN Architectures: Vanilla RNN
• Vanilla RNN (unrolled)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: Vanilla RNN
• The repeating module in a vanilla RNN contains a single layer
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
• The repeating module in an LSTM
  (Diagram legend: layer, pointwise operation, vector transfer, concatenate, copy)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
• The core idea behind the LSTM
  • Able to control how much information to preserve from the previous state
  • The cell state (or memory): the horizontal line running through the top of the diagram
    • Only linear interactions with the outputs of the "gates" (helps prevent vanishing gradients)
  • Gates: a way to optionally let information through
    • Control how much information to remove from or add to the cell state
Next, LSTM step-by-step computation
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
Step 1: Decide what information we're going to throw away from the cell state
• A sigmoid layer called the "forget gate": f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
  • Looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each entry of the cell state C_{t-1}
  • 1 means "completely keep"; 0 means "completely remove"
• e.g., a language model trying to predict the next word based on all previous ones
  • The cell state might include the gender of the present subject, so that the correct pronouns can be used
  • When we see a new subject, we want to forget the gender of the old subject
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
Step 2: Decide what information we're going to store in the cell state
• First, a sigmoid layer called the "input gate" decides which values to update: i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i)
• Next, a tanh layer creates a vector of new candidate values: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
Step 2 (cont.): Update the old cell state C_{t-1} into the new cell state C_t
• Multiply the old state by f_t (forgetting what we decided to forget)
• Add the new candidate values C̃_t scaled by how much to update: C_t = f_t * C_{t-1} + i_t * C̃_t
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
Step 3: Decide what we're going to output
• A sigmoid layer called the "output gate" decides which parts of the cell state to output: o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o)
• Then put the cell state through tanh (to push the values between -1 and 1) and multiply it by o_t: h_t = o_t * tanh(C_t)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
• Overall LSTM operations (the standard LSTM); a code sketch follows below
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
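Putting the three steps together, here is a minimal NumPy sketch of one standard LSTM step following the equations above; the stacked weight layout and all sizes are illustrative assumptions, not anything from the original slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One standard LSTM step. W maps the concatenation [h_prev, x_t] to the four
    gate pre-activations (forget, input, candidate, output) stacked together."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f_t = sigmoid(z[0:H])           # forget gate
    i_t = sigmoid(z[H:2*H])         # input gate
    c_tilde = np.tanh(z[2*H:3*H])   # candidate values
    o_t = sigmoid(z[3*H:4*H])       # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # update the cell state
    h_t = o_t * np.tanh(c_t)             # compute the new hidden state
    return h_t, c_t

# Toy usage (illustrative dimensions)
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)
h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)
```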
Next, variants of LSTM
RNN Architectures: GRU
• Gated Recurrent Unit (GRU) [Cho et al., 2014]
  • Combines the forget and input gates into a single "update gate"
    • Controls the ratio of information to keep between the previous state and the new state
  • The "reset gate" controls how much information to forget
  • Merges the cell state and the hidden state
  • (+) Results in a simpler model (fewer weights) than the standard LSTM; see the sketch below
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
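A minimal NumPy sketch of one GRU step, using one common formulation of the gating equations (the interpolation convention between old and new state varies slightly across references); weight names and sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: a single update gate replaces the LSTM's forget/input gates,
    and the cell state is merged into the hidden state."""
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]) + b_z)              # update gate
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]) + b_r)              # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)    # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                            # interpolate old/new

# Toy usage (illustrative dimensions)
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden_dim)
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)
```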
RNN Architectures: Stacked LSTM
• The standard LSTM, with a simplified diagram
  • c: cell state (or memory)
  • h: hidden state (or output)
  • A dashed line indicates an identity transformation
*source: https://arxiv.org/pdf/1507.01526.pdf
RNN Architectures: Stacked LSTM
• Stacked LSTM [Graves et al., 2013]
  • Adds capacity by simply stacking LSTM layers on top of each other
  • The output of the 1st-layer LSTM goes into the 2nd-layer LSTM as its input (see the sketch below)
  • But there are no vertical (depth-wise) memory connections between the layers
*source: https://arxiv.org/pdf/1507.01526.pdf
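A minimal sketch of stacking, reusing the lstm_step function sketched earlier: the hidden state of layer 1 becomes the input of layer 2 at every time step, while each layer keeps its own cell state. Per-layer weight shapes here are illustrative assumptions.

```python
import numpy as np
# assumes lstm_step (and sigmoid) from the earlier LSTM sketch

num_layers, hidden_dim, input_dim = 2, 8, 4
rng = np.random.default_rng(0)
# Layer 0 consumes the raw input; layer 1 consumes layer 0's hidden state.
in_dims = [input_dim, hidden_dim]
Ws = [rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + d)) for d in in_dims]
bs = [np.zeros(4 * hidden_dim) for _ in range(num_layers)]
hs = [np.zeros(hidden_dim) for _ in range(num_layers)]
cs = [np.zeros(hidden_dim) for _ in range(num_layers)]

for x_t in rng.normal(size=(5, input_dim)):   # a length-5 input sequence
    layer_input = x_t
    for l in range(num_layers):
        hs[l], cs[l] = lstm_step(layer_input, hs[l], cs[l], Ws[l], bs[l])
        layer_input = hs[l]   # hidden state feeds the next layer; cell states stay per-layer
```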
RNN Architectures: Grid LSTM
• Grid LSTM [Kalchbrenner et al., 2016]
  • An extended version of the stacked LSTM
  • LSTM units have memory connections along the depth dimension as well as the temporal dimension (2D Grid LSTM)
• Performance on the Wikipedia dataset (lower is better)
*source: https://github.com/coreylynch/grid-lstm
RNN Architectures: Limitation
• What is the limitation of all the previous models?
  • They learn representations only from previous time steps
  • It is also useful to look at future time steps in order to
    • Better understand the context
    • Eliminate ambiguity
• Example
  • "He said, Teddy bears are on sale"
  • "He said, Teddy Roosevelt was a great President"
  • In the two sentences above, seeing only the previous words is not enough to understand the sentence
• Solution
  • Also look ahead → Bidirectional RNNs
*reference: https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15
RNN Architectures: Bidirectional RNNs
• We can also extend RNNs into bi-directional models
  • The repeating blocks can be any type of RNN (vanilla RNN, LSTM, or GRU)
  • The only difference is that there are additional paths from future time steps (see the sketch below)
Next, comparison of variants
*reference: https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15
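As a rough illustration, a bidirectional layer can be sketched by running one recurrent cell left-to-right and another right-to-left, then concatenating the two hidden states at every time step; this reuses the earlier vanilla_rnn_step sketch and is only a schematic of the idea.

```python
import numpy as np
# assumes vanilla_rnn_step from the earlier vanilla RNN sketch;
# params_fwd / params_bwd are (W_xh, W_hh, b_h) tuples as in that sketch

def bidirectional_pass(xs, params_fwd, params_bwd, hidden_dim):
    """Run a forward and a backward RNN over the sequence xs and concatenate
    their hidden states at every time step."""
    T = len(xs)
    h_f = np.zeros(hidden_dim)
    h_b = np.zeros(hidden_dim)
    fwd, bwd = [None] * T, [None] * T
    for t in range(T):                       # left-to-right pass
        h_f = vanilla_rnn_step(xs[t], h_f, *params_fwd)
        fwd[t] = h_f
    for t in reversed(range(T)):             # right-to-left pass (sees "future" inputs)
        h_b = vanilla_rnn_step(xs[t], h_b, *params_bwd)
        bwd[t] = h_b
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```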
RNN Architectures: Comparisons
• Which architecture is the best?
  • There is no clear winner; it depends largely on the task
  • A more empirical exploration of RNNs can be found in [Jozefowicz et al., 2015]

Model: LSTM
  Advantages: Models long-term sequential dependencies better than a simple RNN; more robust to vanishing gradients than a simple RNN
  Disadvantages: Increased computational complexity; higher memory requirement than a simple RNN due to multiple memory cells
Model: Stacked LSTM
  Advantages: Models long-term sequential dependencies better due to the deeper architecture
  Disadvantages: Higher memory requirement and computational complexity than LSTM due to the stack of LSTM cells
Model: Bidirectional LSTM
  Advantages: Uses both future and past context of the input sequence, unlike LSTM
  Disadvantages: Higher computational complexity than LSTM
Model: GRU
  Advantages: Models long-term sequential dependencies; lower memory requirements than LSTM
  Disadvantages: Higher computational complexity and memory requirements than a simple RNN due to multiple hidden-state vectors
Model: Grid LSTM
  Advantages: Models multidimensional sequences with increased grid size
  Disadvantages: Higher memory requirement and computational complexity than LSTM due to multiple recurrent connections
*source: https://arxiv.org/pdf/1801.01078.pdf
Table of Contents
1. RNN Architectures and Comparisons
  • LSTM (Long Short-Term Memory) and its variants
  • GRU (Gated Recurrent Unit)
  • Stacked LSTM
  • Grid LSTM
  • Bi-directional LSTM
2. Breakthroughs of RNNs in Machine Translation
  • Sequence to Sequence Learning with Neural Networks
  • Neural Machine Translation with Attention
  • Google's Neural Machine Translation (GNMT)
3. Overcoming the heavy computations of RNNs
  • Convolutional Sequence to Sequence Learning
  • Exploring Sparsity in RNNs
Machine Translation
• What is machine translation?
  • The task of automatically converting source text in one language to another language
  • There is no single correct answer, due to the ambiguity/flexibility of human language (challenging)
• Classical machine translation methods
  • Rule-based machine translation (RBMT)
  • Statistical machine translation (SMT; uses a statistical model)
• Neural Machine Translation (NMT)
  • Uses neural network models to learn a statistical model for machine translation
Breakthroughs in NMT: Sequence-to-Sequence Learning
• Difficulties in neural machine translation
  • Intrinsic difficulties of MT (ambiguity of language)
  • Variable lengths of input and output sequences (difficult to learn with a single model)
• The core idea of the sequence-to-sequence model [Sutskever et al., 2014]
  • Encoder-decoder architecture (input → vector → output)
  • One RNN (the encoder) reads the input sequence one step at a time to obtain a large fixed-length vector representation
  • Another RNN (the decoder) extracts the output sequence from that vector
• Example: input sequence "ABC" and output sequence "WXYZ"
Breakthroughs in NMT: Sequence-to-Sequence Learning
• Encoder
  • Reads the input sentence, a sequence of vectors x = (x_1, ..., x_T), into a vector c
  • Uses an RNN such that h_t = f(x_t, h_{t-1}) and c = q({h_1, ..., h_T}), where f and q are some non-linear functions
  • LSTMs are used as f and q in the original seq2seq model
• Decoder
  • Trained to predict the next word y_t given the context vector c and the previously predicted words {y_1, ..., y_{t-1}}
  • Defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:
    p(y) = ∏_{t=1}^{T} p(y_t | {y_1, ..., y_{t-1}}, c), where y = (y_1, ..., y_T)
  • Each conditional probability is modeled as
    p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c)
    where g is a nonlinear, potentially multi-layered function that outputs the probability of y_t, and s_t is the hidden state of the decoder RNN (a schematic decoding loop follows below)
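To make the decomposition concrete, here is a schematic greedy decoding loop under the assumptions above; it reuses the earlier lstm_step sketch, initializes the decoder state from the context vector c as an illustrative choice, and uses made-up parameter names (embed, W_dec, W_out), not any specific library's API or the paper's exact setup.

```python
import numpy as np
# assumes lstm_step (and sigmoid) from the earlier LSTM sketch

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(context, embed, W_dec, b_dec, W_out, b_out, bos_id, eos_id, max_len=20):
    """Schematic greedy decoding: at each step, p(y_t | y_<t, c) = g(y_{t-1}, s_t, c),
    realized here with an LSTM decoder whose state is initialized from the context vector c."""
    hidden_dim = context.shape[0]
    h, c_state = context.copy(), np.zeros(hidden_dim)   # initialize the decoder state from c
    y_prev, output = bos_id, []
    for _ in range(max_len):
        x_t = embed[y_prev]                              # embedding of the previous word
        h, c_state = lstm_step(x_t, h, c_state, W_dec, b_dec)
        probs = softmax(W_out @ h + b_out)               # distribution over the vocabulary
        y_prev = int(np.argmax(probs))                   # greedy choice of the next word
        if y_prev == eos_id:
            break
        output.append(y_prev)
    return output
```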
Breakthroughs in NMT: Sequence-to-Sequence Learning
• Example of the seq2seq model
  • For an English → French task
  • With a 2-layer LSTM for both the encoder and the decoder
*source: https://towardsdatascience.com/seq2seq-model-in-tensorflow-ec0c557e560f
Breakthroughs in NMT: Sequence-to-Sequence Learning
• Results on the WMT'14 English-to-French dataset
  • Measure: BLEU (Bilingual Evaluation Understudy) score, a widely used quantitative measure for MT tasks
  • On par with the state-of-the-art SMT system (which uses no neural network)
  • Achieved better results than the previous baselines
• Simple but very powerful for MT tasks
Next, seq2seq with attention
*source: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning
Breakthroughs in NMT: Joint Learning to Align and Translate
• NMT by Jointly Learning to Align and Translate [Bahdanau et al., 2015]
• Problem with the original encoder-decoder (seq2seq) model
  • It needs to compress all the necessary information of a source sentence into a fixed-length vector
  • Very difficult to cope with long sentences, especially when the test sequence is longer than the sentences in the training corpus
• Extension of the encoder-decoder model with an attention mechanism
  • Encodes the input sentence into a sequence of vectors
  • Chooses a subset of these vectors adaptively while decoding the translation
  • Frees the model from having to squash all the information into a single fixed-length vector
Breakthroughs in NMT: Joint Learning to Align and Translate
• Define each conditional probability as:
  p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)
  where s_i is an RNN hidden state for time i, computed by s_i = f(s_{i-1}, y_{i-1}, c_i)
• A distinct context vector c_i for each target word y_i
  • The context vector c_i is computed as a weighted sum of the encoder annotations h_j:
    c_i = Σ_j α_ij h_j
  • The weight α_ij of each h_j is computed by
    α_ij = exp(e_ij) / Σ_k exp(e_ik), with e_ij = a(s_{i-1}, h_j)
    where a is an alignment model which scores how well the inputs around position j and the output at position i match (a small code sketch follows below)
(Figure: illustration of the model)
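A minimal NumPy sketch of the attention step above, using a small feed-forward alignment model in the spirit of the paper's additive attention; all weight names and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    """Additive attention: score each encoder annotation h_j against the previous
    decoder state s_{i-1}, normalize with a softmax, and return the weighted sum c_i."""
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # e_ij
    alpha = softmax(scores)                                                     # α_ij
    return alpha @ H, alpha                                                     # c_i, weights

# Toy usage (illustrative dimensions): 6 encoder annotations of size 8
rng = np.random.default_rng(0)
enc_dim, dec_dim, att_dim, T = 8, 8, 10, 6
H = rng.normal(size=(T, enc_dim))            # encoder annotations h_1..h_T
s_prev = rng.normal(size=dec_dim)            # previous decoder state s_{i-1}
W_a = rng.normal(scale=0.1, size=(att_dim, dec_dim))
U_a = rng.normal(scale=0.1, size=(att_dim, enc_dim))
v_a = rng.normal(scale=0.1, size=att_dim)
c_i, alpha = attention_context(s_prev, H, W_a, U_a, v_a)
```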
Breakthroughs in NMT: Joint Learning to Align and Translate
• Graphical illustration of seq2seq with attention
  • e.g., Chinese to English
*source: https://google.github.io/seq2seq/
Breakthroughs in NMT: Joint Learning to Align and Translate
• Results
  • RNNsearch (proposed) is better than RNNenc (vanilla seq2seq)
  • RNNsearch-50: model trained with sentences of length up to 50 words
  • Sample alignment results (figure)
Next, Google's NMT
Google's Neural Machine Translation (GNMT)
• Google's NMT [Wu et al., 2016]
  • Improves over previous NMT systems in accuracy and speed
  • 8-layer LSTMs for the encoder/decoder, with attention
  • Achieves model parallelism by assigning each LSTM layer to a different GPU
  • Adds residual connections to the standard stacked LSTM
  • ...and lots of domain-specific details to apply it to a production model
Google's Neural Machine Translation (GNMT)
• Adding residual connections to the stacked LSTM
  • In practice, LSTMs also suffer from vanishing gradients when stacking more layers
  • Empirically, 4 layers work okay, 6 layers have problems, and 8 layers do not work at all
  • Applying residual connections to an 8-layer stacked LSTM worked best (see the sketch below)
  • Figures: standard stacked LSTM vs. stacked LSTM with residual connections
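A schematic of a residual connection between stacked layers, reusing the earlier lstm_step and stacked-LSTM sketches: each layer's input is added to its output before being fed to the next layer. This is only an illustration of the idea, not GNMT's exact formulation.

```python
import numpy as np
# assumes lstm_step and the stacked-LSTM setup (Ws, bs, hs, cs) from the earlier sketches

def stacked_lstm_step_with_residual(x_t, hs, cs, Ws, bs):
    """One time step of a stacked LSTM where each layer after the first adds its
    input to its output, so gradients can also flow through the identity path."""
    layer_input = x_t
    for l in range(len(Ws)):
        hs[l], cs[l] = lstm_step(layer_input, hs[l], cs[l], Ws[l], bs[l])
        if l > 0:
            layer_input = hs[l] + layer_input  # residual connection between hidden layers
        else:
            layer_input = hs[l]                # first layer's input size may differ, so no skip
    return layer_input, hs, cs
```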
Google's Neural Machine Translation (GNMT)
• Results
  • State-of-the-art results on various MT datasets
  • Also comparable with human experts
  • Table: GNMT with different configurations
Google's Multilingual Neural Machine Translation (Multilingual GNMT)
• Further improved in [Johnson et al., 2016]
  • Extends the model into a multilingual NMT system by adding an artificial token indicating the required target language (see the example below)
  • e.g., the token "<2es>" indicates that the target sentence is in Spanish
  • Can do multilingual NMT using a single model without increasing the number of parameters
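As a concrete illustration of the artificial-token trick, the source sentence is simply prepended with the target-language token before being fed to the shared model; the sentence below is a made-up example.

```python
def add_target_token(source_sentence, target_lang_token):
    """Prepend the artificial target-language token to the source sentence."""
    return f"{target_lang_token} {source_sentence}"

# English -> Spanish: the same shared model is told to produce Spanish via "<2es>"
print(add_target_token("Hello, how are you?", "<2es>"))
# "<2es> Hello, how are you?"
```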
Table of Contents
1. RNN Architectures and Comparisons
  • LSTM (Long Short-Term Memory) and its variants
  • GRU (Gated Recurrent Unit)
  • Stacked LSTM
  • Grid LSTM
  • Bi-directional LSTM
2. Breakthroughs of RNNs in Machine Translation
  • Sequence to Sequence Learning with Neural Networks
  • Neural Machine Translation with Attention
  • Google's Neural Machine Translation (GNMT)
3. Overcoming the heavy computations of RNNs
  • Convolutional Sequence to Sequence Learning
  • Exploring Sparsity in RNNs
Problem in RNNs
• The most fundamental problem of RNNs
  • They require heavy computation (slow), especially when we stack multiple layers
  • GNMT addressed this with model parallelism
• How can we alleviate this issue in terms of architectures?
  • CNN encoder
  • CNN encoder + decoder
  • Optimizing RNNs (pruning approach)
Solution 1-1. CNN encoder
• A convolutional encoder model for neural machine translation [Gehring et al., 2016]
  • CNN encoder + RNN decoder
  • Replaces the RNN encoder with a stack of 1-D convolutions with nonlinearities (see the sketch below)
  • Two different CNNs: one for attention score computation and one for conditional input aggregation
  • More parallelizable than using an RNN
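A minimal sketch of the core idea of replacing the recurrent encoder with a stack of 1-D convolutions over the input embeddings; the layer sizes and the tanh nonlinearity are illustrative assumptions, not the paper's exact architecture (which uses gated linear units, position embeddings, and residual connections).

```python
import numpy as np

def conv1d_layer(X, W, b):
    """X: (T, d_in) sequence of embeddings; W: (k, d_in, d_out) kernel of width k.
    Pads the sequence so the output length stays T, then applies a tanh nonlinearity."""
    k, d_in, d_out = W.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    out = np.stack([
        sum(Xp[t + i] @ W[i] for i in range(k)) + b
        for t in range(X.shape[0])
    ])
    return np.tanh(out)

# Toy usage: encode a length-7 sequence with two stacked 1-D conv layers
rng = np.random.default_rng(0)
T, d = 7, 8
X = rng.normal(size=(T, d))                        # input embeddings
W1, b1 = rng.normal(scale=0.1, size=(3, d, d)), np.zeros(d)
W2, b2 = rng.normal(scale=0.1, size=(3, d, d)), np.zeros(d)
H = conv1d_layer(conv1d_layer(X, W1, b1), W2, b2)  # encoder outputs, computed in parallel over t
```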
Solution 1-2. CNN encoder + decoder
• Convolutional sequence to sequence learning [Gehring et al., 2017]
  • CNNs for both the encoder and the decoder
  • Figures: performance and generation speed comparisons
Solution 2. Optimizing RNNs
• Exploring Sparsity in Recurrent Neural Networks [Narang et al., 2017]
  • Prunes RNN weights to improve inference time with a marginal performance drop (see the sketch below)
  • A simple heuristic calculates a threshold, which is then applied to the binary mask corresponding to each weight
  • Reduces the size of the model by 90%
  • Significant inference-time speed-up using sparse matrix multiplies
  • Figure: GEMM (General Matrix-Matrix Multiply) time comparison
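A minimal sketch of magnitude-based pruning with a binary mask, in the spirit of the approach above; the gradual-pruning schedule of the paper is omitted, and the quantile-based threshold rule here is only an illustrative assumption.

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that `sparsity` of them are removed,
    and return the pruned weights together with the binary mask."""
    threshold = np.quantile(np.abs(W), sparsity)    # heuristic threshold from weight magnitudes
    mask = (np.abs(W) > threshold).astype(W.dtype)  # binary mask: 1 = keep, 0 = pruned
    return W * mask, mask

# Toy usage: prune a recurrent weight matrix to ~90% sparsity
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(128, 128))
W_sparse, mask = prune_by_magnitude(W_hh, sparsity=0.9)
# During further training, gradients would also be multiplied by `mask`
# so that pruned weights stay at zero.
```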
Summary
• RNN architectures have developed in ways that
  • Better model long-term dependencies
  • Are more robust to vanishing gradient problems
  • While having lower memory or computational costs
• Breakthroughs in machine translation
  • Seq2seq model with attention
  • GNMT and its extension to multilingual NMT
• Alleviating the heavy computation of RNNs
  • Convolutional sequence to sequence learning
  • Pruning approach
• There are various applications combining RNNs with other networks
  • Image caption generation, visual question answering (VQA), etc.
  • Will be covered in later lectures
References
[Hochreiter and Schmidhuber, 1997] "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780. Link: http://www.bioinf.jku.at/publications/older/2604.pdf
[Graves et al., 2005] "Framewise phoneme classification with bidirectional LSTM and other neural network architectures." Neural Networks 18.5-6 (2005): 602-610. Link: ftp://ftp.idsia.ch/pub/juergen/nn_2005.pdf
[Graves et al., 2013] "Speech recognition with deep recurrent neural networks." ICASSP 2013. Link: https://www.cs.toronto.edu/~graves/icassp_2013.pdf
[Cho et al., 2014] "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014). Link: https://arxiv.org/pdf/1406.1078v3.pdf
[Sutskever et al., 2014] "Sequence to sequence learning with neural networks." NIPS 2014. Link: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning
[Bahdanau et al., 2015] "Neural machine translation by jointly learning to align and translate." ICLR 2015. Link: https://arxiv.org/pdf/1409.0473.pdf
[Jozefowicz et al., 2015] "An empirical exploration of recurrent network architectures." ICML 2015. Link: http://proceedings.mlr.press/v37/jozefowicz15.pdf
[Kalchbrenner et al., 2016] "Grid long short-term memory." ICLR 2016. Link: https://arxiv.org/pdf/1507.01526.pdf
[Gehring et al., 2016] "A convolutional encoder model for neural machine translation." arXiv preprint arXiv:1611.02344 (2016). Link: https://arxiv.org/pdf/1611.02344.pdf
[Wu et al., 2016] "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016). Link: https://arxiv.org/pdf/1609.08144.pdf
[Johnson et al., 2016] "Google's multilingual neural machine translation system: Enabling zero-shot translation." arXiv preprint arXiv:1611.04558 (2016). Link: https://arxiv.org/pdf/1611.04558.pdf
[Gehring et al., 2017] "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017). Link: https://arxiv.org/pdf/1705.03122.pdf
[Narang et al., 2017] "Exploring sparsity in recurrent neural networks." ICLR 2017. Link: https://arxiv.org/pdf/1704.05119.pdf
[Fei-Fei and Karpathy, 2017] "CS231n: Convolutional Neural Networks for Visual Recognition," 2017, Stanford University. Link: http://cs231n.stanford.edu/2017/
[Salehinejad et al., 2017] "Recent Advances in Recurrent Neural Networks." arXiv preprint arXiv:1801.01078 (2017). Link: https://arxiv.org/pdf/1801.01078.pdf