
Page 1: RNN Architectures
Algorithmic Intelligence Lab
EE807: Recent Advances in Deep Learning, Lecture 6
Slides made by Hyungwon Choi and Jongheon Jeong, KAIST EE

Page 2: Recap: RNN Basics
• Process a sequence of vectors x by applying a recurrence formula at every time step t:
  h_t = f_W(h_{t-1}, x_t)
  where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at time step t, and f_W is a function parameterized by weights W.
*reference: http://cs231n.stanford.edu/2017/

Page 3: Recap: Vanilla RNN
• Simple RNN: the state consists of a single "hidden" vector h
• Vanilla RNN (sometimes called the Elman RNN); see the sketch below:
  h_t = tanh(W_hh h_{t-1} + W_xh x_t)
  y_t = W_hy h_t
*reference: http://cs231n.stanford.edu/2017/
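A minimal NumPy sketch of one vanilla-RNN step and its unrolling over a sequence (a hedged illustration; the weight names W_xh, W_hh, W_hy follow the cs231n convention and are not defined elsewhere in these slides):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # One time step of the recurrence: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y          # per-step output
    return h_t, y_t

def rnn_forward(xs, h0, params):
    # Unroll the same cell (shared weights) over every time step of the input sequence
    h, ys = h0, []
    for x_t in xs:
        h, y = rnn_step(x_t, h, *params)
        ys.append(y)
    return ys, h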

Page 4: Why Do We Develop RNN Architectures?
• For the vanilla RNN, it is difficult to capture long-term dependencies
• Vanishing gradient problem in the vanilla RNN
  • The gradient vanishes over time
  • This relates to the optimization difficulties also seen in deep CNNs
*source: https://mediatum.ub.tum.de/doc/673554/file.pdf

Page 5: Why Do We Develop RNN Architectures?
• Much real-world temporal data is intrinsically long-term: natural language, speech, video
• In order to solve such complicated real-world problems, we need better RNN architectures that can capture long-term dependencies in the data

Page 6: Table of Contents
1. RNN Architectures and Comparisons
  • LSTM (Long Short-Term Memory) and its variants
  • GRU (Gated Recurrent Unit)
  • Stacked LSTM
  • Grid LSTM
  • Bi-directional LSTM
2. Breakthroughs of RNNs in Machine Translation
  • Sequence to Sequence Learning with Neural Networks
  • Neural Machine Translation with Attention
  • Google's Neural Machine Translation (GNMT)
3. Overcoming the Heavy Computations of RNNs
  • Convolutional Sequence to Sequence Learning
  • Exploring Sparsity in Recurrent Neural Networks

Page 7: Table of Contents (section divider; same outline as Page 6)

Page 8: RNN Architectures: LSTM
• Long Short-Term Memory (LSTM)
  • A special type of RNN unit, i.e., LSTM networks = RNNs composed of LSTM units
  • Originally proposed by [Hochreiter and Schmidhuber, 1997]
• An RNN explicitly designed to
  • Capture long-term dependencies
  • Be more robust to the vanishing gradient problem
• Composed of a cell, an input gate, an output gate, and a forget gate (covered in detail soon)
*source: https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:The_LSTM_cell.png

Page 9: RNN Architectures: LSTM
• A very old model, but why so popular?
  • Popularized by a series of follow-up works: [Graves et al., 2013], [Sutskever et al., 2014], ...
  • Works very well on a variety of problems: speech recognition, machine translation, ...
Next, comparison with the vanilla RNN
*source: https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:The_LSTM_cell.png

Page 10: RNN Architectures: Vanilla RNN
• Vanilla RNN (unrolled)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 11: RNN Architectures: Vanilla RNN
• The repeating module in the vanilla RNN contains a single layer
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 12: RNN Architectures: LSTM
• The repeating module in LSTM
(Diagram legend: layer, pointwise operation, vector transfer, concatenate, copy)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 13: RNN Architectures: LSTM
• The core idea behind LSTM
  • Able to control how much information to preserve from the previous state
  • The horizontal line running through the top of the diagram is the cell state (or memory)
  • Only linear interactions with the outputs of the "gates", which helps prevent vanishing gradients
  • Gates are a way to optionally let information through: they control how much information to remove from or add to the cell state
Next, LSTM step-by-step computation
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 14: RNN Architectures: LSTM
Step 1: Decide what information to throw away from the cell state
• A sigmoid layer called the "forget gate"
• Looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each entry of the cell state C_{t-1}:
  f_t = σ(W_f [h_{t-1}, x_t] + b_f)
  • 1 means "completely keep"; 0 means "completely remove"
• e.g., a language model trying to predict the next word based on all previous ones
  • The cell state might include the gender of the present subject, so that the correct pronouns can be used
  • When we see a new subject, we want to forget the gender of the old subject
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 15: RNN Architectures: LSTM
Step 2: Decide what new information to store in the cell state
• First, a sigmoid layer called the "input gate" decides which values to update:
  i_t = σ(W_i [h_{t-1}, x_t] + b_i)
• Next, a tanh layer creates a vector of new candidate values:
  C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 16: RNN Architectures: LSTM
Step 2 (continued): Update the old cell state C_{t-1} into the new cell state C_t
• Multiply the old state by f_t (forgetting what we decided to forget)
• Add the new candidate values C̃_t, scaled by how much we decided to update each value:
  C_t = f_t * C_{t-1} + i_t * C̃_t
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 17: RNN Architectures: LSTM
Step 3: Decide what to output
• A sigmoid layer called the "output gate" first decides which parts of the cell state to output:
  o_t = σ(W_o [h_{t-1}, x_t] + b_o)
• Then, put the cell state through tanh (to push the values between -1 and 1) and multiply it by o_t:
  h_t = o_t * tanh(C_t)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 18: RNN Architectures: LSTM
• Overall LSTM operations (see the sketch below)
Standard LSTM (figure)
Next, variants of LSTM
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
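Putting Steps 1-3 together, a minimal NumPy sketch of one LSTM step (a hedged illustration of the equations above; the weight shapes and the concatenation [h, x] follow the colah.github.io formulation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate (Step 1)
    i_t = sigmoid(W_i @ z + b_i)           # input gate (Step 2)
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate values (Step 2)
    C_t = f_t * C_prev + i_t * C_tilde     # new cell state (Step 2)
    o_t = sigmoid(W_o @ z + b_o)           # output gate (Step 3)
    h_t = o_t * np.tanh(C_t)               # new hidden state (Step 3)
    return h_t, C_t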

Page 19: RNN Architectures: GRU
• Gated Recurrent Unit (GRU) [Cho et al., 2014]
  • Combines the forget and input gates into a single "update gate", which controls the ratio of information kept between the previous state and the new state
  • A "reset gate" controls how much information to forget
  • Merges the cell state and the hidden state
  • (+) Results in a simpler model (fewer weights) than the standard LSTM (see the sketch below)
Gated Recurrent Unit (figure)
*reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
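A minimal NumPy sketch of one GRU step for comparison (a hedged illustration; it follows the common formulation with update gate z_t and reset gate r_t, and minor conventions may differ from the figure):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in + b_z)        # update gate (merged forget + input gates)
    r_t = sigmoid(W_r @ z_in + b_r)        # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde   # single merged state (no separate cell state)
    return h_t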

Page 20: RNN Architectures: Stacked LSTM
• Standard LSTM, shown with a simplified diagram
  • c: cell state (or memory)
  • h: hidden state (or output)
  • The dashed line indicates an identity transformation
*source: https://arxiv.org/pdf/1507.01526.pdf

Page 21: RNN Architectures: Stacked LSTM
• Stacked LSTM [Graves et al., 2013]
  • Adds capacity by simply stacking LSTM layers on top of each other
  • The output of the 1st-layer LSTM goes into the 2nd-layer LSTM as its input (see the sketch below)
  • But there are no vertical (depth-wise) memory interactions
1st-layer LSTM / 2nd-layer LSTM (figure labels)
*source: https://arxiv.org/pdf/1507.01526.pdf
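A minimal sketch of a two-layer (or deeper) stacked LSTM step, reusing the lstm_step function sketched earlier (hedged; the per-layer parameter tuples are illustrative):

def stacked_lstm_step(x_t, states, layer_params):
    # states: list of (h, C) pairs, one per layer; layer_params: list of parameter tuples per layer.
    # Assumes the lstm_step function from the earlier sketch.
    inp = x_t
    new_states = []
    for (h, C), params in zip(states, layer_params):
        h, C = lstm_step(inp, h, C, *params)   # temporal recurrence within each layer
        new_states.append((h, C))
        inp = h                                # this layer's output feeds the next layer (no depth-wise memory link)
    return inp, new_states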

Page 22: RNN Architectures: Grid LSTM
• Grid LSTM [Kalchbrenner et al., 2016]
  • An extended version of the stacked LSTM
  • LSTM units have memory connections along the depth dimension as well as the temporal dimension
2D Grid LSTM (figure)
Performance on the Wikipedia dataset (lower is better) (figure)
*source: https://github.com/coreylynch/grid-lstm

Page 23: RNN Architectures: Limitation
• What is the limitation of all the previous models?
  • They learn representations only from previous time steps
  • It is useful to also look at future time steps in order to
    • Better understand the context
    • Eliminate ambiguity
• Example
  • "He said, Teddy bears are on sale"
  • "He said, Teddy Roosevelt was a great President"
  • In the two sentences above, seeing only the previous words is not enough to understand the sentence
• Solution: also look ahead → Bidirectional RNNs
*reference: https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15

Page 24: RNN Architectures: Bidirectional RNNs
• We can also extend RNNs into bi-directional models
  • The repeating blocks can be any type of RNN (vanilla RNN, LSTM, or GRU)
  • The only difference is that there are additional paths from future time steps (see the sketch below)
Next, comparison of variants
*reference: https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15
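A minimal sketch of a bidirectional pass: run one cell forward and another backward over the sequence, then concatenate the two hidden states at each time step (hedged; rnn_step is the vanilla-RNN sketch from earlier, but any recurrent cell could be substituted):

import numpy as np

def bidirectional_forward(xs, h0_fwd, h0_bwd, params_fwd, params_bwd):
    # Forward pass over x_1 ... x_T
    h, hs_fwd = h0_fwd, []
    for x_t in xs:
        h, _ = rnn_step(x_t, h, *params_fwd)
        hs_fwd.append(h)
    # Backward pass over x_T ... x_1
    h, hs_bwd = h0_bwd, []
    for x_t in reversed(xs):
        h, _ = rnn_step(x_t, h, *params_bwd)
        hs_bwd.append(h)
    hs_bwd.reverse()
    # Each time step now sees both past and future context
    return [np.concatenate([f, b]) for f, b in zip(hs_fwd, hs_bwd)]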

Page 25: RNN Architectures: Comparisons
• Which architecture is the best?
  • There is no clear winner; it depends largely on the task
  • A more extensive empirical exploration of RNN architectures can be found in [Jozefowicz et al., 2015]

Model comparison:
• LSTM
  Advantages: capable of modeling long-term sequential dependencies better than a simple RNN; more robust to vanishing gradients than a simple RNN
  Disadvantages: increased computational complexity; higher memory requirement than a simple RNN due to multiple memory cells
• Stacked LSTM
  Advantages: models long-term sequential dependencies due to the deeper architecture
  Disadvantages: higher memory requirement and computational complexity than LSTM due to the stack of LSTM cells
• Bidirectional LSTM
  Advantages: predicts using both the future and past context of the input sequence, better than LSTM
  Disadvantages: increased computational complexity compared to LSTM
• GRU
  Advantages: capable of modeling long-term sequential dependencies; lower memory requirements than LSTM
  Disadvantages: higher computational complexity and memory requirements than a simple RNN due to multiple hidden state vectors
• Grid LSTM
  Advantages: models multidimensional sequences with increased grid size
  Disadvantages: higher memory requirement and computational complexity than LSTM due to multiple recurrent connections
*source: https://arxiv.org/pdf/1801.01078.pdf

Page 26: Table of Contents (section divider; same outline as Page 6)

Page 27: Machine Translation
• What is machine translation?
  • The task of automatically converting source text in one language into another language
  • There is no single correct answer, due to the ambiguity/flexibility of human language (challenging)
• Classical machine translation methods
  • Rule-based machine translation (RBMT)
  • Statistical machine translation (SMT; uses a statistical model)
• Neural machine translation (NMT)
  • Uses neural network models to learn a statistical model for machine translation

Page 28: Breakthroughs in NMT: Sequence-to-Sequence Learning
• Difficulties in neural machine translation
  • Intrinsic difficulties of MT (ambiguity of language)
  • Variable lengths of the input and output sequences (difficult to learn with a single model)
• The core idea of the sequence-to-sequence model [Sutskever et al., 2014]
  • Encoder-decoder architecture (input → vector → output)
  • Use one RNN (the encoder) to read the input sequence one step at a time and obtain a large fixed-length vector representation
  • Use another RNN (the decoder) to extract the output sequence from that vector
Input sequence "ABC" and output sequence "WXYZ" (figure)

Page 29: Breakthroughs in NMT: Sequence-to-Sequence Learning
• Encoder
  • Reads the input sentence, a sequence of vectors x = (x_1, ..., x_{T_x}), into a vector c
  • Uses RNNs such that h_t = f(x_t, h_{t-1}) and c = q({h_1, ..., h_{T_x}}), where f and q are some non-linear functions
  • LSTMs are used as f and q in the original seq2seq model
• Decoder
  • Trained to predict the next word y_t given the context vector c and the previously predicted words {y_1, ..., y_{t-1}}
  • Defines a probability over the translation y = (y_1, ..., y_{T_y}) by decomposing the joint probability into ordered conditionals:
    p(y) = ∏_{t=1}^{T_y} p(y_t | {y_1, ..., y_{t-1}}, c)
  • Each conditional probability is modeled as
    p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c)
    where g is a nonlinear, potentially multi-layered function that outputs the probability of y_t, and s_t is the hidden state of the RNN
A minimal decoding sketch follows below.
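A minimal, self-contained sketch of this encoder-decoder idea with plain tanh-RNN cells and greedy decoding (hedged; the real model uses multi-layer LSTMs, learned embeddings, and beam search, and all parameter names here are illustrative):

import numpy as np

def rnn_cell(x, h, W_x, W_h, b):
    return np.tanh(W_x @ x + W_h @ h + b)

def encode(xs, enc):
    # Compress the whole input sequence into the final hidden state (the fixed-length vector c)
    h = np.zeros(enc["W_h"].shape[0])
    for x in xs:
        h = rnn_cell(x, h, enc["W_x"], enc["W_h"], enc["b"])
    return h

def greedy_decode(c, dec, embed, W_out, b_out, eos_id, max_len=50):
    # Start the decoder from the encoder's summary vector c and emit one token at a time
    h, y, out = c, eos_id, []
    for _ in range(max_len):
        h = rnn_cell(embed[y], h, dec["W_x"], dec["W_h"], dec["b"])
        probs = np.exp(W_out @ h + b_out)
        probs /= probs.sum()             # softmax over the target vocabulary
        y = int(np.argmax(probs))        # greedy choice (beam search in practice)
        if y == eos_id:
            break
        out.append(y)
    return out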

Page 30: Breakthroughs in NMT: Sequence-to-Sequence Learning
• Example of the seq2seq model
  • For an English → French task
  • With a 2-layer LSTM for the encoder and the decoder
*source: https://towardsdatascience.com/seq2seq-model-in-tensorflow-ec0c557e560f

Page 31: Breakthroughs in NMT: Sequence-to-Sequence Learning
• Results on the WMT'14 English-to-French dataset
  • Measure: BLEU (Bilingual Evaluation Understudy) score, a widely used quantitative measure for MT tasks
  • On par with the state-of-the-art SMT system (which uses no neural network)
  • Achieved better results than the previous baselines
• Simple but very powerful for the MT task
Next, seq2seq with attention
*source: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning

Page 32: Breakthroughs in NMT: Joint Learning to Align and Translate
• NMT by Jointly Learning to Align and Translate [Bahdanau et al., 2015]
• Problem with the original encoder-decoder (seq2seq) model
  • It needs to compress all the necessary information of a source sentence into a fixed-length vector
  • Very difficult to cope with long sentences, especially when the test sequence is longer than the sentences in the training corpus
• Extension of the encoder-decoder model with an attention mechanism
  • Encode the input sentence into a sequence of vectors
  • Choose a subset of these vectors adaptively while decoding the translation
  • Frees the model from having to squash all the information into a single fixed-length vector

Page 33: Breakthroughs in NMT: Joint Learning to Align and Translate
• Define each conditional probability as
  p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)
  where s_i is an RNN hidden state for time i, computed by s_i = f(s_{i-1}, y_{i-1}, c_i)
• A distinct context vector c_i for each target word y_i
  • The context vector c_i is computed as a weighted sum of the encoder annotations h_j:
    c_i = Σ_j α_{ij} h_j
  • The weight α_{ij} of each h_j is computed by
    α_{ij} = exp(e_{ij}) / Σ_k exp(e_{ik}),  with  e_{ij} = a(s_{i-1}, h_j)
    where a is an alignment model which scores how well the inputs around position j and the output at position i match (a sketch follows below)
Illustration of the model (figure)
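A minimal sketch of computing the attention weights and context vector for one decoder step (hedged; the alignment model here is a small feed-forward scorer in the spirit of Bahdanau et al., but the parameter names W_a, U_a, v_a are illustrative):

import numpy as np

def attention_context(s_prev, hs, W_a, U_a, v_a):
    # s_prev: previous decoder state s_{i-1}; hs: list of encoder annotations h_j
    # Alignment scores e_ij = a(s_{i-1}, h_j), using a one-hidden-layer scorer
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in hs])
    # Softmax over source positions gives the attention weights alpha_ij
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector c_i is the weighted sum of the annotations
    c_i = sum(w * h_j for w, h_j in zip(weights, hs))
    return c_i, weights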

Page 34: Breakthroughs in NMT: Joint Learning to Align and Translate
• Graphical illustration of seq2seq with attention
  • e.g., Chinese to English
*source: https://google.github.io/seq2seq/

Page 35: Breakthroughs in NMT: Joint Learning to Align and Translate
• Results
  • RNNsearch (the proposed model) is better than RNNenc (vanilla seq2seq)
  • RNNsearch-50: model trained with sentences of length up to 50 words
Sample alignment results (figure)
Next, Google's NMT

Page 36: Google's Neural Machine Translation (GNMT)
• Google's NMT [Wu et al., 2016]
  • Improves over previous NMT systems in both accuracy and speed
  • 8-layer LSTMs for the encoder/decoder, with attention
  • Achieves model parallelism by assigning each LSTM layer to a different GPU
  • Adds residual connections to the standard stacked LSTM
  • ... plus lots of domain-specific details to turn it into a production model

Page 37: Google's Neural Machine Translation (GNMT)
• Adding residual connections to the stacked LSTM
  • In practice, LSTMs also suffer from vanishing gradients when more layers are stacked
  • Empirically, 4 layers work well, 6 layers have problems, and 8 layers do not work at all
  • Applying residual connections to an 8-layer stacked LSTM worked best (see the sketch below)
Standard stacked LSTM vs. stacked LSTM with residual connections (figure)
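A minimal sketch of the residual variant of the stacked step (hedged; it modifies the stacked_lstm_step sketch above by adding each layer's input to its output before feeding the next layer, assuming matching dimensions):

def residual_stacked_lstm_step(x_t, states, layer_params):
    # Assumes the lstm_step function from the earlier sketch.
    inp = x_t
    new_states = []
    for (h, C), params in zip(states, layer_params):
        h, C = lstm_step(inp, h, C, *params)
        new_states.append((h, C))
        inp = h + inp      # residual connection: gradients can also flow around each LSTM layer
    return inp, new_states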

Page 38: Google's Neural Machine Translation (GNMT)
• Results
  • State-of-the-art results on various MT datasets
  • Also comparable with human experts
GNMT with different configurations (figure)

Page 39: Google's Multilingual Neural Machine Translation (Multilingual GNMT)
• Further improved in [Johnson et al., 2016]
  • Extended into a multilingual NMT system by adding an artificial token that indicates the required target language
  • e.g., the token "<2es>" indicates that the target sentence is in Spanish
  • Enables multilingual NMT using a single model without increasing the number of parameters

Page 40: Table of Contents (section divider; same outline as Page 6)

Page 41: Problems in RNNs
• The most fundamental problem of RNNs
  • They require heavy computation (slow), especially when multiple layers are stacked
  • GNMT addressed this with model parallelism
• How can we alleviate this issue at the architecture level?
  • CNN encoder
  • CNN encoder + decoder
  • Optimizing RNNs (pruning approach)

Page 42: Solution 1-1. CNN Encoder
• A Convolutional Encoder Model for Neural Machine Translation [Gehring et al., 2016]
  • CNN encoder + RNN decoder
  • Replaces the RNN encoder with a stack of 1-D convolutions with nonlinearities (see the sketch below)
  • Two different CNNs for attention score computation and conditional input aggregation
  • More parallelizable than using an RNN
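A minimal sketch of the core idea of a 1-D convolutional encoder layer over a sequence of embeddings (hedged; the actual model also uses gated linear units, residual connections, and position embeddings, whereas this uses a plain tanh nonlinearity for brevity):

import numpy as np

def conv1d_layer(X, W, b, k=3):
    # X: (T, d_in) sequence of embeddings; W: (d_out, k * d_in); returns (T, d_out)
    T = X.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))      # pad so the output keeps length T
    out = []
    for t in range(T):
        window = Xp[t:t + k].reshape(-1)      # k neighboring embeddings, flattened
        out.append(np.tanh(W @ window + b))   # each position is computed independently -> parallelizable
    return np.stack(out)

# Stacking several such layers widens the receptive field, replacing the sequential RNN encoder.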

Page 43: Solution 1-2. CNN Encoder + Decoder
• Convolutional Sequence to Sequence Learning [Gehring et al., 2017]
  • CNNs for both the encoder and the decoder
Performance / Generation speed (figures)

Page 44: Solution 2. Optimizing RNNs
• Exploring Sparsity in Recurrent Neural Networks [Narang et al., 2017]
  • Prunes RNNs to improve inference time with only a marginal performance drop
  • Uses simple heuristics to calculate a threshold
  • The threshold is applied to a binary mask corresponding to each weight (see the sketch below)
  • Reduces the size of the model by 90%
  • Significant inference-time speed-up using sparse matrix multiplies
GEMM (General Matrix-Matrix Multiply) time comparison (figure)
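A minimal sketch of magnitude-based pruning with a binary mask (hedged; the paper grows the threshold gradually during training using heuristics, whereas this shows only the masking step on a single weight matrix):

import numpy as np

def prune_weights(W, threshold):
    # Binary mask: zero out every weight whose magnitude falls below the threshold
    mask = (np.abs(W) >= threshold).astype(W.dtype)
    return W * mask, mask

# Example: sparsify a recurrent weight matrix and report the resulting sparsity
W = np.random.randn(256, 256) * 0.1
W_sparse, mask = prune_weights(W, threshold=0.15)
print("sparsity:", 1.0 - mask.mean())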

Page 45: Summary
• RNN architectures have developed in ways that
  • Better model long-term dependencies
  • Are more robust to vanishing gradient problems
  • While requiring less memory or computation
• Breakthroughs in machine translation
  • Seq2seq model with attention
  • GNMT and its extension to multilingual NMT
• Alleviating the heavy computation of RNNs
  • Convolutional sequence-to-sequence learning
  • Pruning approach
• There are various applications combining RNNs with other networks
  • Image caption generation, visual question answering (VQA), etc.
  • These will be covered in later lectures

Page 46: References
[Hochreiter and Schmidhuber, 1997] "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780. Link: http://www.bioinf.jku.at/publications/older/2604.pdf
[Graves et al., 2005] "Framewise phoneme classification with bidirectional LSTM and other neural network architectures." Neural Networks 18.5-6 (2005): 602-610. Link: ftp://ftp.idsia.ch/pub/juergen/nn_2005.pdf
[Graves et al., 2013] "Speech recognition with deep recurrent neural networks." ICASSP 2013. Link: https://www.cs.toronto.edu/~graves/icassp_2013.pdf
[Cho et al., 2014] "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014). Link: https://arxiv.org/pdf/1406.1078v3.pdf
[Sutskever et al., 2014] "Sequence to sequence learning with neural networks." NIPS 2014. Link: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning
[Bahdanau et al., 2015] "Neural machine translation by jointly learning to align and translate." ICLR 2015. Link: https://arxiv.org/pdf/1409.0473.pdf
[Jozefowicz et al., 2015] "An empirical exploration of recurrent network architectures." ICML 2015. Link: http://proceedings.mlr.press/v37/jozefowicz15.pdf

Page 47: References
[Kalchbrenner et al., 2016] "Grid long short-term memory." ICLR 2016. Link: https://arxiv.org/pdf/1507.01526.pdf
[Gehring et al., 2016] "A convolutional encoder model for neural machine translation." arXiv preprint arXiv:1611.02344 (2016). Link: https://arxiv.org/pdf/1611.02344.pdf
[Wu et al., 2016] "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016). Link: https://arxiv.org/pdf/1609.08144.pdf
[Johnson et al., 2016] "Google's multilingual neural machine translation system: Enabling zero-shot translation." arXiv preprint arXiv:1611.04558 (2016). Link: https://arxiv.org/pdf/1611.04558.pdf
[Gehring et al., 2017] "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017). Link: https://arxiv.org/pdf/1705.03122.pdf
[Narang et al., 2017] "Exploring sparsity in recurrent neural networks." ICLR 2017. Link: https://arxiv.org/pdf/1704.05119.pdf
[Fei-Fei and Karpathy, 2017] "CS231n: Convolutional Neural Networks for Visual Recognition", Stanford University, 2017. Link: http://cs231n.stanford.edu/2017/
[Salehinejad et al., 2017] "Recent advances in recurrent neural networks." arXiv preprint arXiv:1801.01078 (2017). Link: https://arxiv.org/pdf/1801.01078.pdf