180912 lec6 RNN Architectures (LSTM, GRU, and other recent RNN models, ...)
TRANSCRIPT
Algorithmic Intelligence Lab
RNN Architectures
EE807: Recent Advances in Deep Learning, Lecture 6
Slides made by Hyungwon Choi and Jongheon Jeong, KAIST EE
Recap: RNN basics
• Process a sequence of vectors x by applying a recurrence formula at every time step:
  h_t = f_W(h_{t-1}, x_t)
  where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at time step t, and f_W is a function parameterized by W
*reference: http://cs231n.stanford.edu/2017/
Recap: Vanilla RNN
• Simple RNN
  • The state consists of a single "hidden" vector
  • Vanilla RNN (sometimes called the Elman RNN); a minimal code sketch follows below
*reference: http://cs231n.stanford.edu/2017/
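As a concrete reference, here is a minimal NumPy sketch of one vanilla RNN step in the standard form h_t = tanh(W_hh h_{t-1} + W_xh x_t); the weight names and dimensions are illustrative assumptions, not taken from the original slides.

```python
import numpy as np

def vanilla_rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step of a vanilla (Elman) RNN: the whole state is a single hidden vector."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy usage (illustrative dimensions)
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a length-5 input sequence
    h = vanilla_rnn_step(x_t, h, W_xh, W_hh, b_h)
```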
Why do we develop RNN architectures?
• For the vanilla RNN, it is difficult to capture long-term dependencies
• Vanishing gradient problem in the vanilla RNN
  • The gradient vanishes over time
  • This relates to the same optimization difficulties seen in deep CNNs
*source: https://mediatum.ub.tum.de/doc/673554/file.pdf
Why do we develop RNN architectures?
• Much real-world temporal data is intrinsically long-term
  • Natural language
  • Speech
  • Video
• To solve more complicated real-world problems, we need better RNN architectures that capture the long-term dependencies in the data
Table of Contents
1. RNN Architectures and Comparisons
  • LSTM (Long Short-Term Memory) and its variants
  • GRU (Gated Recurrent Unit)
  • Stacked LSTM
  • Grid LSTM
  • Bi-directional LSTM
2. Breakthroughs of RNNs in Machine Translation
  • Sequence to Sequence Learning with Neural Networks
  • Neural Machine Translation with Attention
  • Google's Neural Machine Translation (GNMT)
3. Overcoming the heavy computations of RNNs
  • Convolutional Sequence to Sequence Learning
  • Exploring Sparsity in Recurrent Neural Networks
RNN Architectures: LSTM
• Long Short-Term Memory (LSTM)
  • A special type of RNN unit
  • i.e., LSTM networks = RNNs composed of LSTM units
  • Originally proposed by [Hochreiter and Schmidhuber, 1997]
• An RNN explicitly designed to
  • Capture long-term dependencies
  • Be more robust to the vanishing gradient problem
• Composed of a cell, an input gate, an output gate, and a forget gate (covered in detail soon)
*source: https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:The_LSTM_cell.png
RNN Architectures: LSTM
• A very old model, but why so popular?
  • Popularized by a series of follow-up works [Graves et al., 2013], [Sutskever et al., 2014], ...
  • Works very well on a variety of problems
    • Speech recognition
    • Machine translation
    • ...
Next, comparison with the vanilla RNN
*source: https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:The_LSTM_cell.png
RNN Architectures: Vanilla RNN
• Vanilla RNN (unrolled)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: Vanilla RNN
• The repeating module in a vanilla RNN contains a single layer
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
• The repeating module in an LSTM
  (Diagram legend: layer, pointwise operation, vector transfer, concatenate, copy)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
• The core idea behind the LSTM
  • Able to control how much information to preserve from the previous state
  • The cell state (or memory): the horizontal line running through the top of the diagram
    • Only linear interactions with the outputs of the "gates" (helps prevent vanishing gradients)
  • Gates: a way to optionally let information through
    • Control how much information to remove from or add to the cell state
Next, LSTM step-by-step computation
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
Step 1: Decide what information we're going to throw away from the cell state
• A sigmoid layer called the "forget gate": f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
  • Looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each entry of the cell state C_{t-1}
  • 1 means "completely keep"; 0 means "completely remove"
• e.g., a language model trying to predict the next word based on all previous ones
  • The cell state might include the gender of the present subject, so that the correct pronouns can be used
  • When we see a new subject, we want to forget the gender of the old subject
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
Step 2: Decide what information we're going to store in the cell state
• First, a sigmoid layer called the "input gate" decides which values to update: i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i)
• Next, a tanh layer creates a vector of new candidate values: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
Step 2 (cont.): Update the old cell state C_{t-1} into the new cell state C_t
• Multiply the old state by f_t (forgetting what we decided to forget)
• Add the new candidate values C̃_t scaled by how much to update: C_t = f_t * C_{t-1} + i_t * C̃_t
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
Step 3: Decide what we're going to output
• A sigmoid layer called the "output gate" decides which parts of the cell state to output: o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o)
• Then put the cell state through tanh (to push the values between -1 and 1) and multiply it by o_t: h_t = o_t * tanh(C_t)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN Architectures: LSTM
• Overall LSTM operations (the standard LSTM); a code sketch follows below
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
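Putting the three steps together, here is a minimal NumPy sketch of one standard LSTM step following the equations above; the stacked weight layout and all sizes are illustrative assumptions, not anything from the original slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One standard LSTM step. W maps the concatenation [h_prev, x_t] to the four
    gate pre-activations (forget, input, candidate, output) stacked together."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f_t = sigmoid(z[0:H])           # forget gate
    i_t = sigmoid(z[H:2*H])         # input gate
    c_tilde = np.tanh(z[2*H:3*H])   # candidate values
    o_t = sigmoid(z[3*H:4*H])       # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # update the cell state
    h_t = o_t * np.tanh(c_t)             # compute the new hidden state
    return h_t, c_t

# Toy usage (illustrative dimensions)
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)
h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)
```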
Next, variants of LSTM
RNN Architectures: GRU
• Gated Recurrent Unit (GRU) [Cho et al., 2014]
  • Combines the forget and input gates into a single "update gate"
    • Controls the ratio of information to keep between the previous state and the new state
  • The "reset gate" controls how much information to forget
  • Merges the cell state and the hidden state
  • (+) Results in a simpler model (fewer weights) than the standard LSTM; see the sketch below
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
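A minimal NumPy sketch of one GRU step, using one common formulation of the gating equations (the interpolation convention between old and new state varies slightly across references); weight names and sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: a single update gate replaces the LSTM's forget/input gates,
    and the cell state is merged into the hidden state."""
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]) + b_z)              # update gate
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]) + b_r)              # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)    # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                            # interpolate old/new

# Toy usage (illustrative dimensions)
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden_dim)
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)
```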
RNN Architectures: Stacked LSTM
• The standard LSTM, with a simplified diagram
  • c: cell state (or memory)
  • h: hidden state (or output)
  • A dashed line indicates an identity transformation
*source: https://arxiv.org/pdf/1507.01526.pdf
RNN Architectures: Stacked LSTM
• Stacked LSTM [Graves et al., 2013]
  • Adds capacity by simply stacking LSTM layers on top of each other
  • The output of the 1st-layer LSTM goes into the 2nd-layer LSTM as its input (see the sketch below)
  • But there are no vertical (depth-wise) memory connections between the layers
*source: https://arxiv.org/pdf/1507.01526.pdf
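A minimal sketch of stacking, reusing the lstm_step function sketched earlier: the hidden state of layer 1 becomes the input of layer 2 at every time step, while each layer keeps its own cell state. Per-layer weight shapes here are illustrative assumptions.

```python
import numpy as np
# assumes lstm_step (and sigmoid) from the earlier LSTM sketch

num_layers, hidden_dim, input_dim = 2, 8, 4
rng = np.random.default_rng(0)
# Layer 0 consumes the raw input; layer 1 consumes layer 0's hidden state.
in_dims = [input_dim, hidden_dim]
Ws = [rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + d)) for d in in_dims]
bs = [np.zeros(4 * hidden_dim) for _ in range(num_layers)]
hs = [np.zeros(hidden_dim) for _ in range(num_layers)]
cs = [np.zeros(hidden_dim) for _ in range(num_layers)]

for x_t in rng.normal(size=(5, input_dim)):   # a length-5 input sequence
    layer_input = x_t
    for l in range(num_layers):
        hs[l], cs[l] = lstm_step(layer_input, hs[l], cs[l], Ws[l], bs[l])
        layer_input = hs[l]   # hidden state feeds the next layer; cell states stay per-layer
```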
RNN Architectures: Grid LSTM
• Grid LSTM [Kalchbrenner et al., 2016]
  • An extended version of the stacked LSTM
  • LSTM units have memory connections along the depth dimension as well as the temporal dimension (2D Grid LSTM)
• Performance on the Wikipedia dataset (lower is better)
*source: https://github.com/coreylynch/grid-lstm
RNN Architectures: Limitation
• What is the limitation of all the previous models?
  • They learn representations only from previous time steps
  • It is also useful to look at future time steps in order to
    • Better understand the context
    • Eliminate ambiguity
• Example
  • "He said, Teddy bears are on sale"
  • "He said, Teddy Roosevelt was a great President"
  • In the two sentences above, seeing only the previous words is not enough to understand the sentence
• Solution
  • Also look ahead → Bidirectional RNNs
*reference: https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15
RNN Architectures: Bidirectional RNNs
• We can also extend RNNs into bi-directional models
  • The repeating blocks can be any type of RNN (vanilla RNN, LSTM, or GRU)
  • The only difference is that there are additional paths from future time steps (see the sketch below)
Next, comparison of variants
*reference: https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15
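As a rough illustration, a bidirectional layer can be sketched by running one recurrent cell left-to-right and another right-to-left, then concatenating the two hidden states at every time step; this reuses the earlier vanilla_rnn_step sketch and is only a schematic of the idea.

```python
import numpy as np
# assumes vanilla_rnn_step from the earlier vanilla RNN sketch;
# params_fwd / params_bwd are (W_xh, W_hh, b_h) tuples as in that sketch

def bidirectional_pass(xs, params_fwd, params_bwd, hidden_dim):
    """Run a forward and a backward RNN over the sequence xs and concatenate
    their hidden states at every time step."""
    T = len(xs)
    h_f = np.zeros(hidden_dim)
    h_b = np.zeros(hidden_dim)
    fwd, bwd = [None] * T, [None] * T
    for t in range(T):                       # left-to-right pass
        h_f = vanilla_rnn_step(xs[t], h_f, *params_fwd)
        fwd[t] = h_f
    for t in reversed(range(T)):             # right-to-left pass (sees "future" inputs)
        h_b = vanilla_rnn_step(xs[t], h_b, *params_bwd)
        bwd[t] = h_b
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```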
RNN Architectures: Comparisons
• Which architecture is the best?
  • There is no clear winner; it depends largely on the task
  • A more empirical exploration of RNNs can be found in [Jozefowicz et al., 2015]

Model: LSTM
  Advantages: Models long-term sequential dependencies better than a simple RNN; more robust to vanishing gradients than a simple RNN
  Disadvantages: Increased computational complexity; higher memory requirement than a simple RNN due to multiple memory cells
Model: Stacked LSTM
  Advantages: Models long-term sequential dependencies better due to the deeper architecture
  Disadvantages: Higher memory requirement and computational complexity than LSTM due to the stack of LSTM cells
Model: Bidirectional LSTM
  Advantages: Uses both future and past context of the input sequence, unlike LSTM
  Disadvantages: Higher computational complexity than LSTM
Model: GRU
  Advantages: Models long-term sequential dependencies; lower memory requirements than LSTM
  Disadvantages: Higher computational complexity and memory requirements than a simple RNN due to multiple hidden-state vectors
Model: Grid LSTM
  Advantages: Models multidimensional sequences with increased grid size
  Disadvantages: Higher memory requirement and computational complexity than LSTM due to multiple recurrent connections
*source: https://arxiv.org/pdf/1801.01078.pdf
Table of Contents
1. RNN Architectures and Comparisons
  • LSTM (Long Short-Term Memory) and its variants
  • GRU (Gated Recurrent Unit)
  • Stacked LSTM
  • Grid LSTM
  • Bi-directional LSTM
2. Breakthroughs of RNNs in Machine Translation
  • Sequence to Sequence Learning with Neural Networks
  • Neural Machine Translation with Attention
  • Google's Neural Machine Translation (GNMT)
3. Overcoming the heavy computations of RNNs
  • Convolutional Sequence to Sequence Learning
  • Exploring Sparsity in RNNs
Machine Translation
• What is machine translation?
  • The task of automatically converting source text in one language to another language
  • There is no single correct answer, due to the ambiguity/flexibility of human language (challenging)
• Classical machine translation methods
  • Rule-based machine translation (RBMT)
  • Statistical machine translation (SMT; uses a statistical model)
• Neural Machine Translation (NMT)
  • Uses neural network models to learn a statistical model for machine translation
Breakthroughs in NMT: Sequence-to-Sequence Learning
• Difficulties in neural machine translation
  • Intrinsic difficulties of MT (ambiguity of language)
  • Variable lengths of input and output sequences (difficult to learn with a single model)
• The core idea of the sequence-to-sequence model [Sutskever et al., 2014]
  • Encoder-decoder architecture (input → vector → output)
  • One RNN (the encoder) reads the input sequence one step at a time to obtain a large fixed-length vector representation
  • Another RNN (the decoder) extracts the output sequence from that vector
• Example: input sequence "ABC" and output sequence "WXYZ"
Breakthroughs in NMT: Sequence-to-Sequence Learning
• Encoder
  • Reads the input sentence, a sequence of vectors x = (x_1, ..., x_T), into a vector c
  • Uses an RNN such that h_t = f(x_t, h_{t-1}) and c = q({h_1, ..., h_T}), where f and q are some non-linear functions
  • LSTMs are used as f and q in the original seq2seq model
• Decoder
  • Trained to predict the next word y_t given the context vector c and the previously predicted words {y_1, ..., y_{t-1}}
  • Defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:
    p(y) = ∏_{t=1}^{T} p(y_t | {y_1, ..., y_{t-1}}, c), where y = (y_1, ..., y_T)
  • Each conditional probability is modeled as
    p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c)
    where g is a nonlinear, potentially multi-layered function that outputs the probability of y_t, and s_t is the hidden state of the decoder RNN (a schematic decoding loop follows below)
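To make the decomposition concrete, here is a schematic greedy decoding loop under the assumptions above; it reuses the earlier lstm_step sketch, initializes the decoder state from the context vector c as an illustrative choice, and uses made-up parameter names (embed, W_dec, W_out), not any specific library's API or the paper's exact setup.

```python
import numpy as np
# assumes lstm_step (and sigmoid) from the earlier LSTM sketch

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(context, embed, W_dec, b_dec, W_out, b_out, bos_id, eos_id, max_len=20):
    """Schematic greedy decoding: at each step, p(y_t | y_<t, c) = g(y_{t-1}, s_t, c),
    realized here with an LSTM decoder whose state is initialized from the context vector c."""
    hidden_dim = context.shape[0]
    h, c_state = context.copy(), np.zeros(hidden_dim)   # initialize the decoder state from c
    y_prev, output = bos_id, []
    for _ in range(max_len):
        x_t = embed[y_prev]                              # embedding of the previous word
        h, c_state = lstm_step(x_t, h, c_state, W_dec, b_dec)
        probs = softmax(W_out @ h + b_out)               # distribution over the vocabulary
        y_prev = int(np.argmax(probs))                   # greedy choice of the next word
        if y_prev == eos_id:
            break
        output.append(y_prev)
    return output
```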
Breakthroughs in NMT: Sequence-to-Sequence Learning
• Example of the seq2seq model
  • For an English → French task
  • With a 2-layer LSTM for both the encoder and the decoder
*source: https://towardsdatascience.com/seq2seq-model-in-tensorflow-ec0c557e560f
Breakthroughs in NMT: Sequence-to-Sequence Learning
• Results on the WMT'14 English-to-French dataset
  • Measure: BLEU (Bilingual Evaluation Understudy) score, a widely used quantitative measure for MT tasks
  • On par with the state-of-the-art SMT system (which uses no neural network)
  • Achieved better results than the previous baselines
• Simple but very powerful for MT tasks
Next, seq2seq with attention
*source: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning
Breakthroughs in NMT: Joint Learning to Align and Translate
• NMT by Jointly Learning to Align and Translate [Bahdanau et al., 2015]
• Problem with the original encoder-decoder (seq2seq) model
  • It needs to compress all the necessary information of a source sentence into a fixed-length vector
  • Very difficult to cope with long sentences, especially when the test sequence is longer than the sentences in the training corpus
• Extension of the encoder-decoder model with an attention mechanism
  • Encodes the input sentence into a sequence of vectors
  • Chooses a subset of these vectors adaptively while decoding the translation
  • Frees the model from having to squash all the information into a single fixed-length vector
Breakthroughs in NMT: Joint Learning to Align and Translate
• Define each conditional probability as:
  p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)
  where s_i is an RNN hidden state for time i, computed by s_i = f(s_{i-1}, y_{i-1}, c_i)
• A distinct context vector c_i for each target word y_i
  • The context vector c_i is computed as a weighted sum of the encoder annotations h_j:
    c_i = Σ_j α_ij h_j
  • The weight α_ij of each h_j is computed by
    α_ij = exp(e_ij) / Σ_k exp(e_ik), with e_ij = a(s_{i-1}, h_j)
    where a is an alignment model which scores how well the inputs around position j and the output at position i match (a small code sketch follows below)
(Figure: illustration of the model)
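A minimal NumPy sketch of the attention step above, using a small feed-forward alignment model in the spirit of the paper's additive attention; all weight names and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    """Additive attention: score each encoder annotation h_j against the previous
    decoder state s_{i-1}, normalize with a softmax, and return the weighted sum c_i."""
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # e_ij
    alpha = softmax(scores)                                                     # α_ij
    return alpha @ H, alpha                                                     # c_i, weights

# Toy usage (illustrative dimensions): 6 encoder annotations of size 8
rng = np.random.default_rng(0)
enc_dim, dec_dim, att_dim, T = 8, 8, 10, 6
H = rng.normal(size=(T, enc_dim))            # encoder annotations h_1..h_T
s_prev = rng.normal(size=dec_dim)            # previous decoder state s_{i-1}
W_a = rng.normal(scale=0.1, size=(att_dim, dec_dim))
U_a = rng.normal(scale=0.1, size=(att_dim, enc_dim))
v_a = rng.normal(scale=0.1, size=att_dim)
c_i, alpha = attention_context(s_prev, H, W_a, U_a, v_a)
```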
Breakthroughs in NMT: Joint Learning to Align and Translate
• Graphical illustration of seq2seq with attention
  • e.g., Chinese to English
*source: https://google.github.io/seq2seq/
Breakthroughs in NMT: Joint Learning to Align and Translate
• Results
  • RNNsearch (proposed) is better than RNNenc (vanilla seq2seq)
  • RNNsearch-50: model trained with sentences of length up to 50 words
  • Sample alignment results (figure)
Next, Google's NMT
Google's Neural Machine Translation (GNMT)
• Google's NMT [Wu et al., 2016]
  • Improves over previous NMT systems in accuracy and speed
  • 8-layer LSTMs for the encoder/decoder, with attention
  • Achieves model parallelism by assigning each LSTM layer to a different GPU
  • Adds residual connections to the standard stacked LSTM
  • ...and lots of domain-specific details to apply it to a production model
Google's Neural Machine Translation (GNMT)
• Adding residual connections to the stacked LSTM
  • In practice, LSTMs also suffer from vanishing gradients when stacking more layers
  • Empirically, 4 layers work okay, 6 layers have problems, and 8 layers do not work at all
  • Applying residual connections to an 8-layer stacked LSTM worked best (see the sketch below)
  • Figures: standard stacked LSTM vs. stacked LSTM with residual connections
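A schematic of a residual connection between stacked layers, reusing the earlier lstm_step and stacked-LSTM sketches: each layer's input is added to its output before being fed to the next layer. This is only an illustration of the idea, not GNMT's exact formulation.

```python
import numpy as np
# assumes lstm_step and the stacked-LSTM setup (Ws, bs, hs, cs) from the earlier sketches

def stacked_lstm_step_with_residual(x_t, hs, cs, Ws, bs):
    """One time step of a stacked LSTM where each layer after the first adds its
    input to its output, so gradients can also flow through the identity path."""
    layer_input = x_t
    for l in range(len(Ws)):
        hs[l], cs[l] = lstm_step(layer_input, hs[l], cs[l], Ws[l], bs[l])
        if l > 0:
            layer_input = hs[l] + layer_input  # residual connection between hidden layers
        else:
            layer_input = hs[l]                # first layer's input size may differ, so no skip
    return layer_input, hs, cs
```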
Google's Neural Machine Translation (GNMT)
• Results
  • State-of-the-art results on various MT datasets
  • Also comparable with human experts
  • Table: GNMT with different configurations
Google's Multilingual Neural Machine Translation (Multilingual GNMT)
• Further improved in [Johnson et al., 2016]
  • Extends the model into a multilingual NMT system by adding an artificial token indicating the required target language (see the example below)
  • e.g., the token "<2es>" indicates that the target sentence is in Spanish
  • Can do multilingual NMT using a single model without increasing the number of parameters
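As a concrete illustration of the artificial-token trick, the source sentence is simply prepended with the target-language token before being fed to the shared model; the sentence below is a made-up example.

```python
def add_target_token(source_sentence, target_lang_token):
    """Prepend the artificial target-language token to the source sentence."""
    return f"{target_lang_token} {source_sentence}"

# English -> Spanish: the same shared model is told to produce Spanish via "<2es>"
print(add_target_token("Hello, how are you?", "<2es>"))
# "<2es> Hello, how are you?"
```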
Table of Contents
1. RNN Architectures and Comparisons
  • LSTM (Long Short-Term Memory) and its variants
  • GRU (Gated Recurrent Unit)
  • Stacked LSTM
  • Grid LSTM
  • Bi-directional LSTM
2. Breakthroughs of RNNs in Machine Translation
  • Sequence to Sequence Learning with Neural Networks
  • Neural Machine Translation with Attention
  • Google's Neural Machine Translation (GNMT)
3. Overcoming the heavy computations of RNNs
  • Convolutional Sequence to Sequence Learning
  • Exploring Sparsity in RNNs
Problem in RNNs
• The most fundamental problem of RNNs
  • They require heavy computation (slow), especially when we stack multiple layers
  • GNMT addressed this with model parallelism
• How can we alleviate this issue in terms of architectures?
  • CNN encoder
  • CNN encoder + decoder
  • Optimizing RNNs (pruning approach)
Solution 1-1. CNN encoder
• A convolutional encoder model for neural machine translation [Gehring et al., 2016]
  • CNN encoder + RNN decoder
  • Replaces the RNN encoder with a stack of 1-D convolutions with nonlinearities (see the sketch below)
  • Two different CNNs: one for attention score computation and one for conditional input aggregation
  • More parallelizable than using an RNN
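A minimal sketch of the core idea of replacing the recurrent encoder with a stack of 1-D convolutions over the input embeddings; the layer sizes and the tanh nonlinearity are illustrative assumptions, not the paper's exact architecture (which uses gated linear units, position embeddings, and residual connections).

```python
import numpy as np

def conv1d_layer(X, W, b):
    """X: (T, d_in) sequence of embeddings; W: (k, d_in, d_out) kernel of width k.
    Pads the sequence so the output length stays T, then applies a tanh nonlinearity."""
    k, d_in, d_out = W.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    out = np.stack([
        sum(Xp[t + i] @ W[i] for i in range(k)) + b
        for t in range(X.shape[0])
    ])
    return np.tanh(out)

# Toy usage: encode a length-7 sequence with two stacked 1-D conv layers
rng = np.random.default_rng(0)
T, d = 7, 8
X = rng.normal(size=(T, d))                        # input embeddings
W1, b1 = rng.normal(scale=0.1, size=(3, d, d)), np.zeros(d)
W2, b2 = rng.normal(scale=0.1, size=(3, d, d)), np.zeros(d)
H = conv1d_layer(conv1d_layer(X, W1, b1), W2, b2)  # encoder outputs, computed in parallel over t
```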
Solution 1-2. CNN encoder + decoder
• Convolutional sequence to sequence learning [Gehring et al., 2017]
  • CNNs for both the encoder and the decoder
  • Figures: performance and generation speed comparisons
Solution 2. Optimizing RNNs
• Exploring Sparsity in Recurrent Neural Networks [Narang et al., 2017]
  • Prunes RNN weights to improve inference time with a marginal performance drop (see the sketch below)
  • A simple heuristic calculates a threshold, which is then applied to the binary mask corresponding to each weight
  • Reduces the size of the model by 90%
  • Significant inference-time speed-up using sparse matrix multiplies
  • Figure: GEMM (General Matrix-Matrix Multiply) time comparison
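A minimal sketch of magnitude-based pruning with a binary mask, in the spirit of the approach above; the gradual-pruning schedule of the paper is omitted, and the quantile-based threshold rule here is only an illustrative assumption.

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that `sparsity` of them are removed,
    and return the pruned weights together with the binary mask."""
    threshold = np.quantile(np.abs(W), sparsity)    # heuristic threshold from weight magnitudes
    mask = (np.abs(W) > threshold).astype(W.dtype)  # binary mask: 1 = keep, 0 = pruned
    return W * mask, mask

# Toy usage: prune a recurrent weight matrix to ~90% sparsity
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(128, 128))
W_sparse, mask = prune_by_magnitude(W_hh, sparsity=0.9)
# During further training, gradients would also be multiplied by `mask`
# so that pruned weights stay at zero.
```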
Summary
• RNN architectures have developed in ways that
  • Better model long-term dependencies
  • Are more robust to vanishing gradient problems
  • While having lower memory or computational costs
• Breakthroughs in machine translation
  • Seq2seq model with attention
  • GNMT and its extension to multilingual NMT
• Alleviating the heavy computation of RNNs
  • Convolutional sequence to sequence learning
  • Pruning approach
• There are various applications combining RNNs with other networks
  • Image caption generation, visual question answering (VQA), etc.
  • Will be covered in later lectures
References
[Hochreiter and Schmidhuber, 1997] "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780. Link: http://www.bioinf.jku.at/publications/older/2604.pdf
[Graves et al., 2005] "Framewise phoneme classification with bidirectional LSTM and other neural network architectures." Neural Networks 18.5-6 (2005): 602-610. Link: ftp://ftp.idsia.ch/pub/juergen/nn_2005.pdf
[Graves et al., 2013] "Speech recognition with deep recurrent neural networks." ICASSP 2013. Link: https://www.cs.toronto.edu/~graves/icassp_2013.pdf
[Cho et al., 2014] "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014). Link: https://arxiv.org/pdf/1406.1078v3.pdf
[Sutskever et al., 2014] "Sequence to sequence learning with neural networks." NIPS 2014. Link: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning
[Bahdanau et al., 2015] "Neural machine translation by jointly learning to align and translate." ICLR 2015. Link: https://arxiv.org/pdf/1409.0473.pdf
[Jozefowicz et al., 2015] "An empirical exploration of recurrent network architectures." ICML 2015. Link: http://proceedings.mlr.press/v37/jozefowicz15.pdf
[Kalchbrenner et al., 2016] "Grid long short-term memory." ICLR 2016. Link: https://arxiv.org/pdf/1507.01526.pdf
[Gehring et al., 2016] "A convolutional encoder model for neural machine translation." arXiv preprint arXiv:1611.02344 (2016). Link: https://arxiv.org/pdf/1611.02344.pdf
[Wu et al., 2016] "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016). Link: https://arxiv.org/pdf/1609.08144.pdf
[Johnson et al., 2016] "Google's multilingual neural machine translation system: Enabling zero-shot translation." arXiv preprint arXiv:1611.04558 (2016). Link: https://arxiv.org/pdf/1611.04558.pdf
[Gehring et al., 2017] "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017). Link: https://arxiv.org/pdf/1705.03122.pdf
[Narang et al., 2017] "Exploring sparsity in recurrent neural networks." ICLR 2017. Link: https://arxiv.org/pdf/1704.05119.pdf
[Fei-Fei and Karpathy, 2017] "CS231n: Convolutional Neural Networks for Visual Recognition," 2017, Stanford University. Link: http://cs231n.stanford.edu/2017/
[Salehinejad et al., 2017] "Recent Advances in Recurrent Neural Networks." arXiv preprint arXiv:1801.01078 (2017). Link: https://arxiv.org/pdf/1801.01078.pdf