Source: hitir.hit.edu.cn/~car/talks/ijcnlp17-tutorial/lec02-deep_learning.pdf
TRANSCRIPT
Part 2: Deep Learning
Part 2.1: Deep Learning Background
What is Machine Learning?
• From Data to Knowledge
• Traditional program: Input + Algorithm → Output
• ML program: Input + Output → "Algorithm"
A Standard Example of ML
• The MNIST (Modified NIST) database for hand-written digit recognition
– Publicly available
– A huge amount is known about how well various ML methods do on it
– 60,000 training + 10,000 test hand-written digits (28x28 pixels each)
Very hard to say what makes a 2
Traditional Model (before 2012)
• Fixed/engineered features + trainable classifier
– Designing a feature extractor requires considerable effort by experts
– Examples: SIFT, GIST, Shape Context
Deep Learning (after 2012)
• Learning hierarchical representations
• DEEP means more than one stage of non-linear feature transformation
Deep Learning Architecture
Deep Learning is Not New
• 1980s technology (neural networks)
About Neural Networks
• Pros
– Simple to learn p(y|x)
– Performance is OK for shallow nets
• Cons
– Trouble with > 3 layers
– Overfits
– Slow to train
Deep Learning Beats NN
• Pros
– Simple to learn p(y|x)
– Performance is OK for shallow nets
• Cons, and what deep learning adds to fix each of them
– Trouble with > 3 layers → new activation functions (ReLU, …), gated mechanisms
– Overfits → Dropout, Maxout, stochastic pooling
– Slow to train → GPUs
Results on MNIST
• Naïve neural network: 96.59%
• SVM (default settings for libsvm): 94.35%
• Optimal SVM [Andreas Mueller]: 98.56%
• The state of the art, a convolutional NN (2013): 99.79%
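For reference, a sketch of how the libsvm baseline could be reproduced today with scikit-learn (whose SVC wraps libsvm) and the OpenML copy of MNIST. This is an assumed modern setup, not the tutorial's code; the numbers above come from the slide, not from running this.

    # Hypothetical reproduction sketch; sklearn.svm.SVC is a libsvm wrapper,
    # matching the "default settings for libsvm" row above.
    from sklearn.datasets import fetch_openml
    from sklearn.svm import SVC

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0                                # scale pixels to [0, 1]
    X_train, y_train = X[:60000], y[:60000]      # standard 60k/10k split
    X_test, y_test = X[60000:], y[60000:]

    clf = SVC()                                  # default RBF kernel, libsvm backend
    clf.fit(X_train, y_train)                    # slow on the full 60k examples
    print("test accuracy:", clf.score(X_test, y_test))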
Deep Learning for Speech Recognition
DL for NLP: Representation Learning
DL for NLP: End-to-End Learning
[Figure: a transition-based dependency-parser configuration: stack (ROOT, has_VBZ), buffer (good_JJ, Control_NN, ._.), and the arc He_PRP ←nsubj. The traditional parser encodes the configuration as a sparse binary feature vector (0 0 1 0 1 … 0 1 0 0); the Stack-LSTM parser learns the representation end to end.]
Part 2.2: Feedforward Neural Networks
The Standard Perceptron Architecture
• Input units → feature units (via hand-coded programs) → decision unit (via learned weights)
[Figure: for the input "IPhone is very good .", hand-coded feature units (good, very/good, very, …) are weighted by learned weights (0.8, 0.9, 0.1, …), and the decision unit thresholds the weighted sum (> 5?).]
The Limitations of Perceptrons
• The hand-coded features
– Have a great influence on the performance
– Finding suitable features is costly
• A linear classifier with a hyperplane
– Cannot separate non-linear data; e.g., the XOR function cannot be learned by a single-layer perceptron
[Figure: the four XOR points (0,0), (1,0), (0,1), (1,1) with outputs 0/1 in the weight plane; the positive and negative cases cannot be separated by a plane.]
Learning with Non-linear Hidden Layers
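Where the slide's figure stood, a tiny sketch can make the point concrete: with one non-linear hidden layer, XOR becomes separable. The weights here are set by hand for illustration (they could equally be learned); the unit decomposition into OR and AND is an assumption of this example, not from the slide.

    def step(z):                    # perceptron-style threshold unit
        return 1 if z > 0 else 0

    def xor_mlp(x1, x2):
        h1 = step(x1 + x2 - 0.5)    # hidden unit 1: computes OR
        h2 = step(x1 + x2 - 1.5)    # hidden unit 2: computes AND
        return step(h1 - h2 - 0.5)  # output: OR and not AND = XOR

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, "->", xor_mlp(x1, x2))   # 0, 1, 1, 0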
Feedforward Neural Networks
• Multi-layer perceptron (MLP)
• The information is propagated from the inputs to the outputs
• NO cycle between outputs and inputs
• Learning the weights of hidden units is equivalent to learning features
• Networks without hidden layers are very limited in the input-output mappings they can model
– More layers of linear units do not help; it is still linear
– Fixed output non-linearities are not enough
[Figure: x1, x2, …, xn → 1st hidden layer → 2nd hidden layer → output layer]
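As an illustration, a minimal numpy sketch of the forward pass through such a network. The layer sizes and the tanh/softmax choices are assumptions for the example, not taken from the slide.

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [4, 8, 8, 3]                       # input, two hidden layers, output
    Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(m) for m in sizes[1:]]

    def forward(x):
        h = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            h = np.tanh(W @ h + b)             # non-linear hidden layers
        z = Ws[-1] @ h + bs[-1]
        e = np.exp(z - z.max())                # softmax output layer
        return e / e.sum()

    print(forward(rng.normal(size=4)))         # a probability vector over 3 classes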
Multiple-Layer Neural Networks
• What are those hidden neurons doing?
– Maybe they represent outlines
General Optimizing (Learning) Algorithms
• Gradient Descent
• Stochastic Gradient Descent (SGD)
– Minibatch SGD (m > 1), Online GD (m = 1); a sketch follows below
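A minimal sketch of minibatch SGD on a toy least-squares objective. The objective, learning rate, and batch size are illustrative assumptions; setting m = 1 recovers online GD, and m = dataset size recovers plain gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))             # toy least-squares problem
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.01 * rng.normal(size=1000)

    w, lr, m = np.zeros(5), 0.1, 32            # m = minibatch size
    for epoch in range(20):
        order = rng.permutation(len(X))        # reshuffle each epoch
        for start in range(0, len(X), m):
            batch = order[start:start + m]
            grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= lr * grad                     # one SGD step on the minibatch
    print(np.linalg.norm(w - w_true))          # ~0: converged near the true weights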
Computational/Flow Graphs
• Describing mathematical expressions
• For example
– e = (a + b) * (b + 1)
• c = a + b, d = b + 1, e = c * d
– If a = 2, b = 1, then c = 3, d = 2, e = 6
Derivatives on Computational Graphs
Computational Graph Backward Pass (Backpropagation)
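The figures for these two slides are lost, but the running example can be worked through in a few lines: the backward pass walks the graph from e back to a and b, and b accumulates gradient along both of its paths (through c and through d).

    # Forward pass on the graph e = c * d, with c = a + b, d = b + 1
    a, b = 2.0, 1.0
    c = a + b                           # c = 3
    d = b + 1                           # d = 2
    e = c * d                           # e = 6

    # Backward pass: propagate de/d(node) from the output back to the inputs
    de_de = 1.0
    de_dc = d * de_de                   # de/dc = d = 2
    de_dd = c * de_de                   # de/dd = c = 3
    de_da = de_dc * 1.0                 # path a -> c -> e: de/da = 2
    de_db = de_dc * 1.0 + de_dd * 1.0   # b reaches e via c AND d: de/db = 2 + 3 = 5
    print(de_da, de_db)                 # 2.0 5.0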
Part 2.3: Recurrent and Other Neural Networks
Language Models
• A language model computes a probability for a sequence of words, $P(w_1, \dots, w_T)$, or predicts a probability for the next word, $P(w_{T+1} \mid w_1, \dots, w_T)$
• Useful for machine translation, speech recognition, and so on
– Word ordering
• P(the cat is small) > P(small the is cat)
– Word choice
• P(there are four cats) > P(there are for cats)
Traditional Language Models
• An incorrect but necessary Markov assumption!
– Probability is usually conditioned on the n previous words
– $P(w_1, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$
• Disadvantages
– There are A LOT of n-grams!
– Cannot see too long a history
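For concreteness, a toy bigram (n = 2) model with maximum-likelihood estimates; the corpus here is an invented example, not from the tutorial.

    from collections import Counter

    corpus = "the cat is small . the cat is cute .".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p(word, prev):      # P(word | prev) = count(prev, word) / count(prev)
        return bigrams[(prev, word)] / unigrams[prev]

    print(p("cat", "the"))      # 1.0: "the" is always followed by "cat"
    print(p("small", "is"))     # 0.5: "is" is followed by "small" or "cute"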
Recurrent Neural Networks (RNNs)
• Condition the neural network on all previous inputs
• RAM requirement only scales with the number of inputs
[Figure: an RNN unrolled over three time steps: inputs x_{t-1}, x_t, x_{t+1} feed hidden states h_{t-1}, h_t, h_{t+1} via input weights W_2; the states are chained by recurrent weights W_1 and produce outputs y_{t-1}, y_t, y_{t+1} via output weights W_3.]
Recurrent Neural Networks (RNNs)
• At a single time step t
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $\hat{y}_t = \mathrm{softmax}(W_3 h_t)$
[Figure: one RNN cell: x_t enters through W_2, the previous state h_{t-1} through W_1, and the output $\hat{y}_t$ leaves through W_3.]
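A direct numpy transcription of the two equations above; the dimensions and initialization are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h, d_out = 5, 8, 3
    W1 = rng.normal(0, 0.1, (d_h, d_h))     # recurrent weights
    W2 = rng.normal(0, 0.1, (d_h, d_in))    # input weights
    W3 = rng.normal(0, 0.1, (d_out, d_h))   # output weights

    def rnn_step(h_prev, x_t):
        h_t = np.tanh(W1 @ h_prev + W2 @ x_t)        # h_t = tanh(W1 h_{t-1} + W2 x_t)
        z = W3 @ h_t
        y_t = np.exp(z - z.max()); y_t /= y_t.sum()  # y_t = softmax(W3 h_t)
        return h_t, y_t

    h = np.zeros(d_h)
    for x in rng.normal(size=(4, d_in)):    # run over a length-4 input sequence
        h, y = rnn_step(h, x)
    print(y)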
Training RNNs is Hard
• Ideally, inputs from many time steps ago can modify output y
• For example, with 2 time steps
[Figure: the RNN unrolled over three time steps, as above.]
Back Propagation Through Time (BPTT)
• Total error is the sum of the error at each time step t
– $\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
• $\frac{\partial E_t}{\partial W_3} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial W_3}$ is easy to calculate
• But calculating $\frac{\partial E_t}{\partial W_1} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial W_1}$ is hard (also for $W_2$)
• Because $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$ depends on $h_{t-1}$, which depends on $W_1$ and $h_{t-2}$, and so on
• So $\frac{\partial E_t}{\partial W_1} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_1}$
The Vanishing Gradient Problem
• $\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$, with $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W_1 \, \mathrm{diag}[\tanh'(\cdots)]$
• $\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \le \beta \| W_1 \| \le \beta \gamma_1$
– where $\beta$ bounds $\| \mathrm{diag}[\tanh'(\cdots)] \|$ and $\gamma_1$ is the largest singular value of $W_1$
• $\left\| \frac{\partial h_t}{\partial h_k} \right\| \le (\beta \gamma_1)^{t-k} \to 0$
– if $\beta \gamma_1 < 1$, this can become very small (vanishing gradient)
– if $\beta \gamma_1 > 1$, this can become very large (exploding gradient)
• Trick for the exploding gradient: the clipping trick (set a threshold); a sketch follows below
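Below is a sketch of both phenomena: a norm-clipping helper for the exploding case (the threshold 5.0 is an assumed value), and a small numeric demonstration of the product of step-Jacobians shrinking toward zero in the vanishing case.

    import numpy as np

    def clip_by_norm(grad, threshold=5.0):    # the clipping trick: rescale if too large
        norm = np.linalg.norm(grad)
        return grad if norm <= threshold else grad * (threshold / norm)

    # Vanishing in action: multiply 50 step-Jacobians W1 diag[tanh'(...)] together
    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (20, 20))         # small weights, so beta * gamma_1 < 1
    J = np.eye(20)
    for _ in range(50):
        z = rng.normal(size=20)               # stand-in pre-activations, for illustration
        J = W1 @ np.diag(1 - np.tanh(z) ** 2) @ J
    print(np.linalg.norm(J))                  # ~0: the gradient has vanished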
A “solution”
• Intuition
– Ensure $\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \ge 1$ → to prevent vanishing gradients
• So…
– Proper initialization of the W
– Use ReLU instead of tanh or sigmoid activation functions
A better “solution”
• Recall the original transition equation
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• We can instead update the state additively
– $u_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $h_t = h_{t-1} + u_t$
– then $\frac{\partial h_t}{\partial h_{t-1}} = 1 + \frac{\partial u_t}{\partial h_{t-1}} \ge 1$
– On the other hand
• $h_t = h_{t-1} + u_t = h_{t-2} + u_{t-1} + u_t = \cdots$
A better “solution” (cont.)
• Interpolate between the old state and the new state (“choosing to forget”)
– $f_t = \sigma(W_f x_t + U_f h_{t-1})$
– $h_t = f_t \odot h_{t-1} + (1 - f_t) \odot u_t$
• Introduce a separate input gate $i_t$
– $i_t = \sigma(W_i x_t + U_i h_{t-1})$
– $h_t = f_t \odot h_{t-1} + i_t \odot u_t$
• Selectively expose memory cell $c_t$ with an output gate $o_t$
– $o_t = \sigma(W_o x_t + U_o h_{t-1})$
– $c_t = f_t \odot c_{t-1} + i_t \odot u_t$
– $h_t = o_t \odot \tanh(c_t)$
Long Short-Term Memory (LSTM)
• Hochreiter & Schmidhuber, 1997
• LSTM = additive updates + gating
[Figure: the LSTM cell: x_t and h_{t-1} feed three σ gates and a tanh candidate; the cell state C_{t-1} is multiplied by the forget gate and added to the gated candidate to give C_t, and h_t is the output gate times tanh(C_t).]
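A numpy sketch of a single LSTM step, implementing the forget-, input-, and output-gate equations from the previous slide; sizes and initialization are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h = 4, 6
    def mats():  # one (W, U) pair per gate / candidate
        return rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
    (Wf, Uf), (Wi, Ui), (Wo, Uo), (Wu, Uu) = mats(), mats(), mats(), mats()
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev):
        f = sigmoid(Wf @ x_t + Uf @ h_prev)   # forget gate
        i = sigmoid(Wi @ x_t + Ui @ h_prev)   # input gate
        o = sigmoid(Wo @ x_t + Uo @ h_prev)   # output gate
        u = np.tanh(Wu @ x_t + Uu @ h_prev)   # candidate update
        c = f * c_prev + i * u                # additive cell-state update
        h = o * np.tanh(c)                    # selectively exposed memory
        return h, c

    h, c = np.zeros(d_h), np.zeros(d_h)
    for x in rng.normal(size=(3, d_in)):
        h, c = lstm_step(x, h, c)
    print(h)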
Gated Recurrent Units, GRU (Cho et al. 2014)
• Main ideas
– Keep around memories to capture long-distance dependencies
– Allow error messages to flow at different strengths depending on the inputs
• Update gate
– Based on the current input and hidden state
– $z_t = \sigma(W_z x_t + U_z h_{t-1})$
• Reset gate
– Computed similarly but with different weights
– $r_t = \sigma(W_r x_t + U_r h_{t-1})$
GRU
• Memory at time step t combines the current and previous time steps
– $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$
– The update gate z controls how much of the past state should matter now
• If z is close to 1, we can copy information in that unit through many time steps → less vanishing gradient!
• New memory content
– $\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$
– If a reset-gate unit r is close to 0, it ignores the previous memory and stores only the new input information → allows the model to drop information that is irrelevant in the future
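And the matching numpy sketch of one GRU step, following the update-gate, reset-gate, and new-memory equations above (again with illustrative sizes and initialization).

    import numpy as np

    rng = np.random.default_rng(1)
    d_in, d_h = 4, 6
    Wz, Uz = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
    Wr, Ur = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
    W,  U  = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    def gru_step(x_t, h_prev):
        z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
        r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
        h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))  # new memory content
        return z * h_prev + (1 - z) * h_tilde          # interpolate old and new

    h = np.zeros(d_h)
    for x in rng.normal(size=(3, d_in)):
        h = gru_step(x, h)
    print(h)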
LSTM vs. GRU
• No clear winner!
• Tuning hyperparameters like layer size is probably more important than picking the ideal architecture
• GRUs have fewer parameters and thus may train a bit faster or need less data to generalize
• If you have enough data, the greater expressive power of LSTMs may lead to better results.
More RNNs
• Bidirectional RNN
• Stacked Bidirectional RNN
Tree-LSTMs
• Traditional sequential composition
• Tree-structured composition
More Applications of RNN
• Neural machine translation
• Handwriting generation
• Image caption generation
• …
Convolutional Neural Network
[Figure: convolution and pooling operations, from CS231n: Convolutional Neural Networks for Visual Recognition.]
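A numpy sketch of the two operations in the figure: valid 2-D convolution (implemented as cross-correlation, as CNN libraries do) followed by non-overlapping 2x2 max pooling. The input and kernel sizes are illustrative assumptions.

    import numpy as np

    def conv2d(image, kernel):          # "valid" cross-correlation, as CNNs use
        kh, kw = kernel.shape
        H, W = image.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def max_pool2x2(fmap):              # non-overlapping 2x2 max pooling
        H, W = fmap.shape
        return fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    rng = np.random.default_rng(0)
    image = rng.normal(size=(8, 8))
    kernel = rng.normal(size=(3, 3))
    print(max_pool2x2(conv2d(image, kernel)).shape)   # (3, 3)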
CNN for NLP
Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.
Recursive Neural Network
Socher, R., Manning, C., & Ng, A. (2011). Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Network. NIPS.
Summary
• Deep Learning
– Representation learning
– End-to-end learning
• Popular Networks
– Feedforward neural networks
– Recurrent neural networks
– Convolutional neural networks