Source: hitir.hit.edu.cn/~car/talks/ijcnlp17-tutorial/lec02-deep_learning.pdf
TRANSCRIPT
Part 2: Deep Learning
Part 2.1: Deep Learning Background
What is Machine Learning?
• From Data to Knowledge
• Traditional program: Input + Algorithm → Output
• ML program: Input + Output → "Algorithm"
A Standard Example of ML
• The MNIST (Modified NIST) database for hand-written digit recognition
– Publicly available
– A huge amount is known about how well various ML methods do on it
– 60,000 training + 10,000 test hand-written digits (28x28 pixels each)
Very hard to say what makes a 2
Traditional Model (before 2012)
• Fixed/engineered features + trainable classifier
– Designing a feature extractor requires considerable effort by experts
– Examples: SIFT, GIST, Shape Context
Deep Learning (after 2012)
• Learning hierarchical representations
• DEEP means more than one stage of non-linear feature transformation
Deep Learning Architecture
Deep Learning is Not New
• 1980s technology (neural networks)
About Neural Networks
• Pros
– Simple to learn p(y|x)
– Performance is OK for shallow nets
• Cons
– Trouble with > 3 layers
– Overfits
– Slow to train
Deep Learning Beats NN
• Pros
– Simple to learn p(y|x)
– Performance is OK for shallow nets
• Cons, and what deep learning adds to fix each of them
– Trouble with > 3 layers → new activation functions (ReLU, …), gated mechanisms
– Overfits → Dropout, Maxout, stochastic pooling
– Slow to train → GPUs
Results on MNIST
• Naïve neural network: 96.59%
• SVM (default settings for libsvm): 94.35%
• Optimal SVM [Andreas Mueller]: 98.56%
• The state of the art, a convolutional NN (2013): 99.79%
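For reference, a sketch of how the libsvm baseline could be reproduced today with scikit-learn (whose SVC wraps libsvm) and the OpenML copy of MNIST. This is an assumed modern setup, not the tutorial's code; the numbers above come from the slide, not from running this.

    # Hypothetical reproduction sketch; sklearn.svm.SVC is a libsvm wrapper,
    # matching the "default settings for libsvm" row above.
    from sklearn.datasets import fetch_openml
    from sklearn.svm import SVC

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0                                # scale pixels to [0, 1]
    X_train, y_train = X[:60000], y[:60000]      # standard 60k/10k split
    X_test, y_test = X[60000:], y[60000:]

    clf = SVC()                                  # default RBF kernel, libsvm backend
    clf.fit(X_train, y_train)                    # slow on the full 60k examples
    print("test accuracy:", clf.score(X_test, y_test))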
Deep Learning for Speech Recognition
DL for NLP: Representation Learning
DL for NLP: End-to-End Learning
[Figure: a transition-based dependency-parser configuration: stack (ROOT, has_VBZ), buffer (good_JJ, Control_NN, ._.), and the arc He_PRP ←nsubj. The traditional parser encodes the configuration as a sparse binary feature vector (0 0 1 0 1 … 0 1 0 0); the Stack-LSTM parser learns the representation end to end.]
Part 2.2: Feedforward Neural Networks
The Standard Perceptron Architecture
• Input units → feature units (via hand-coded programs) → decision unit (via learned weights)
[Figure: for the input "IPhone is very good .", hand-coded feature units (good, very/good, very, …) are weighted by learned weights (0.8, 0.9, 0.1, …), and the decision unit thresholds the weighted sum (> 5?).]
The Limitations of Perceptrons
• The hand-coded features
– Have a great influence on the performance
– Finding suitable features is costly
• A linear classifier with a hyperplane
– Cannot separate non-linear data; e.g., the XOR function cannot be learned by a single-layer perceptron
[Figure: the four XOR points (0,0), (1,0), (0,1), (1,1) with outputs 0/1 in the weight plane; the positive and negative cases cannot be separated by a plane.]
Learning with Non-linear Hidden Layers
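Where the slide's figure stood, a tiny sketch can make the point concrete: with one non-linear hidden layer, XOR becomes separable. The weights here are set by hand for illustration (they could equally be learned); the unit decomposition into OR and AND is an assumption of this example, not from the slide.

    def step(z):                    # perceptron-style threshold unit
        return 1 if z > 0 else 0

    def xor_mlp(x1, x2):
        h1 = step(x1 + x2 - 0.5)    # hidden unit 1: computes OR
        h2 = step(x1 + x2 - 1.5)    # hidden unit 2: computes AND
        return step(h1 - h2 - 0.5)  # output: OR and not AND = XOR

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, "->", xor_mlp(x1, x2))   # 0, 1, 1, 0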
Feedforward Neural Networks
• Multi-layer perceptron (MLP)
• The information is propagated from the inputs to the outputs
• NO cycle between outputs and inputs
• Learning the weights of hidden units is equivalent to learning features
• Networks without hidden layers are very limited in the input-output mappings they can model
– More layers of linear units do not help; it is still linear
– Fixed output non-linearities are not enough
[Figure: x1, x2, …, xn → 1st hidden layer → 2nd hidden layer → output layer]
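As an illustration, a minimal numpy sketch of the forward pass through such a network. The layer sizes and the tanh/softmax choices are assumptions for the example, not taken from the slide.

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [4, 8, 8, 3]                       # input, two hidden layers, output
    Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(m) for m in sizes[1:]]

    def forward(x):
        h = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            h = np.tanh(W @ h + b)             # non-linear hidden layers
        z = Ws[-1] @ h + bs[-1]
        e = np.exp(z - z.max())                # softmax output layer
        return e / e.sum()

    print(forward(rng.normal(size=4)))         # a probability vector over 3 classes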
Multiple-Layer Neural Networks
• What are those hidden neurons doing?
– Maybe they represent outlines
General Optimizing (Learning) Algorithms
• Gradient Descent
• Stochastic Gradient Descent (SGD)
– Minibatch SGD (m > 1), Online GD (m = 1); a sketch follows below
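A minimal sketch of minibatch SGD on a toy least-squares objective. The objective, learning rate, and batch size are illustrative assumptions; setting m = 1 recovers online GD, and m = dataset size recovers plain gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))             # toy least-squares problem
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.01 * rng.normal(size=1000)

    w, lr, m = np.zeros(5), 0.1, 32            # m = minibatch size
    for epoch in range(20):
        order = rng.permutation(len(X))        # reshuffle each epoch
        for start in range(0, len(X), m):
            batch = order[start:start + m]
            grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= lr * grad                     # one SGD step on the minibatch
    print(np.linalg.norm(w - w_true))          # ~0: converged near the true weights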
Computational/Flow Graphs
• Describing mathematical expressions
• For example
– e = (a + b) * (b + 1)
• c = a + b, d = b + 1, e = c * d
– If a = 2, b = 1, then c = 3, d = 2, e = 6
Derivatives on Computational Graphs
Computational Graph Backward Pass (Backpropagation)
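The figures for these two slides are lost, but the running example can be worked through in a few lines: the backward pass walks the graph from e back to a and b, and b accumulates gradient along both of its paths (through c and through d).

    # Forward pass on the graph e = c * d, with c = a + b, d = b + 1
    a, b = 2.0, 1.0
    c = a + b                           # c = 3
    d = b + 1                           # d = 2
    e = c * d                           # e = 6

    # Backward pass: propagate de/d(node) from the output back to the inputs
    de_de = 1.0
    de_dc = d * de_de                   # de/dc = d = 2
    de_dd = c * de_de                   # de/dd = c = 3
    de_da = de_dc * 1.0                 # path a -> c -> e: de/da = 2
    de_db = de_dc * 1.0 + de_dd * 1.0   # b reaches e via c AND d: de/db = 2 + 3 = 5
    print(de_da, de_db)                 # 2.0 5.0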
Part 2.3: Recurrent and Other Neural Networks
Language Models
• A language model computes a probability for a sequence of words, $P(w_1, \dots, w_T)$, or predicts a probability for the next word, $P(w_{T+1} \mid w_1, \dots, w_T)$
• Useful for machine translation, speech recognition, and so on
– Word ordering
• P(the cat is small) > P(small the is cat)
– Word choice
• P(there are four cats) > P(there are for cats)
Traditional Language Models
• An incorrect but necessary Markov assumption!
– Probability is usually conditioned on the n previous words
– $P(w_1, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$
• Disadvantages
– There are A LOT of n-grams!
– Cannot see too long a history
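For concreteness, a toy bigram (n = 2) model with maximum-likelihood estimates; the corpus here is an invented example, not from the tutorial.

    from collections import Counter

    corpus = "the cat is small . the cat is cute .".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p(word, prev):      # P(word | prev) = count(prev, word) / count(prev)
        return bigrams[(prev, word)] / unigrams[prev]

    print(p("cat", "the"))      # 1.0: "the" is always followed by "cat"
    print(p("small", "is"))     # 0.5: "is" is followed by "small" or "cute"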
Recurrent Neural Networks (RNNs)
• Condition the neural network on all previous inputs
• RAM requirement only scales with the number of inputs
[Figure: an RNN unrolled over three time steps: inputs x_{t-1}, x_t, x_{t+1} feed hidden states h_{t-1}, h_t, h_{t+1} via input weights W_2; the states are chained by recurrent weights W_1 and produce outputs y_{t-1}, y_t, y_{t+1} via output weights W_3.]
Recurrent Neural Networks (RNNs)
• At a single time step t
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $\hat{y}_t = \mathrm{softmax}(W_3 h_t)$
[Figure: one RNN cell: x_t enters through W_2, the previous state h_{t-1} through W_1, and the output $\hat{y}_t$ leaves through W_3.]
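A direct numpy transcription of the two equations above; the dimensions and initialization are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h, d_out = 5, 8, 3
    W1 = rng.normal(0, 0.1, (d_h, d_h))     # recurrent weights
    W2 = rng.normal(0, 0.1, (d_h, d_in))    # input weights
    W3 = rng.normal(0, 0.1, (d_out, d_h))   # output weights

    def rnn_step(h_prev, x_t):
        h_t = np.tanh(W1 @ h_prev + W2 @ x_t)        # h_t = tanh(W1 h_{t-1} + W2 x_t)
        z = W3 @ h_t
        y_t = np.exp(z - z.max()); y_t /= y_t.sum()  # y_t = softmax(W3 h_t)
        return h_t, y_t

    h = np.zeros(d_h)
    for x in rng.normal(size=(4, d_in)):    # run over a length-4 input sequence
        h, y = rnn_step(h, x)
    print(y)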
Training RNNs is Hard
• Ideally, inputs from many time steps ago can modify output y
• For example, with 2 time steps
[Figure: the RNN unrolled over three time steps, as above.]
Back Propagation Through Time (BPTT)
• Total error is the sum of the error at each time step t
– $\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
• $\frac{\partial E_t}{\partial W_3} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial W_3}$ is easy to calculate
• But calculating $\frac{\partial E_t}{\partial W_1} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial W_1}$ is hard (also for $W_2$)
• Because $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$ depends on $h_{t-1}$, which depends on $W_1$ and $h_{t-2}$, and so on
• So $\frac{\partial E_t}{\partial W_1} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_1}$
The Vanishing Gradient Problem
• $\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$, with $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W_1 \, \mathrm{diag}[\tanh'(\cdots)]$
• $\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \le \beta \| W_1 \| \le \beta \gamma_1$
– where $\beta$ bounds $\| \mathrm{diag}[\tanh'(\cdots)] \|$ and $\gamma_1$ is the largest singular value of $W_1$
• $\left\| \frac{\partial h_t}{\partial h_k} \right\| \le (\beta \gamma_1)^{t-k} \to 0$
– if $\beta \gamma_1 < 1$, this can become very small (vanishing gradient)
– if $\beta \gamma_1 > 1$, this can become very large (exploding gradient)
• Trick for the exploding gradient: the clipping trick (set a threshold); a sketch follows below
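Below is a sketch of both phenomena: a norm-clipping helper for the exploding case (the threshold 5.0 is an assumed value), and a small numeric demonstration of the product of step-Jacobians shrinking toward zero in the vanishing case.

    import numpy as np

    def clip_by_norm(grad, threshold=5.0):    # the clipping trick: rescale if too large
        norm = np.linalg.norm(grad)
        return grad if norm <= threshold else grad * (threshold / norm)

    # Vanishing in action: multiply 50 step-Jacobians W1 diag[tanh'(...)] together
    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (20, 20))         # small weights, so beta * gamma_1 < 1
    J = np.eye(20)
    for _ in range(50):
        z = rng.normal(size=20)               # stand-in pre-activations, for illustration
        J = W1 @ np.diag(1 - np.tanh(z) ** 2) @ J
    print(np.linalg.norm(J))                  # ~0: the gradient has vanished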
A “solution”
• Intuition
– Ensure $\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \ge 1$ → to prevent vanishing gradients
• So…
– Proper initialization of the W
– Use ReLU instead of tanh or sigmoid activation functions
A better “solution”
• Recall the original transition equation
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• We can instead update the state additively
– $u_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $h_t = h_{t-1} + u_t$
– then $\frac{\partial h_t}{\partial h_{t-1}} = 1 + \frac{\partial u_t}{\partial h_{t-1}} \ge 1$
– On the other hand
• $h_t = h_{t-1} + u_t = h_{t-2} + u_{t-1} + u_t = \cdots$
A better “solution” (cont.)
• Interpolate between the old state and the new state (“choosing to forget”)
– $f_t = \sigma(W_f x_t + U_f h_{t-1})$
– $h_t = f_t \odot h_{t-1} + (1 - f_t) \odot u_t$
• Introduce a separate input gate $i_t$
– $i_t = \sigma(W_i x_t + U_i h_{t-1})$
– $h_t = f_t \odot h_{t-1} + i_t \odot u_t$
• Selectively expose memory cell $c_t$ with an output gate $o_t$
– $o_t = \sigma(W_o x_t + U_o h_{t-1})$
– $c_t = f_t \odot c_{t-1} + i_t \odot u_t$
– $h_t = o_t \odot \tanh(c_t)$
Long Short-Term Memory (LSTM)
• Hochreiter & Schmidhuber, 1997
• LSTM = additive updates + gating
[Figure: the LSTM cell: x_t and h_{t-1} feed three σ gates and a tanh candidate; the cell state C_{t-1} is multiplied by the forget gate and added to the gated candidate to give C_t, and h_t is the output gate times tanh(C_t).]
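A numpy sketch of a single LSTM step, implementing the forget-, input-, and output-gate equations from the previous slide; sizes and initialization are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h = 4, 6
    def mats():  # one (W, U) pair per gate / candidate
        return rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
    (Wf, Uf), (Wi, Ui), (Wo, Uo), (Wu, Uu) = mats(), mats(), mats(), mats()
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev):
        f = sigmoid(Wf @ x_t + Uf @ h_prev)   # forget gate
        i = sigmoid(Wi @ x_t + Ui @ h_prev)   # input gate
        o = sigmoid(Wo @ x_t + Uo @ h_prev)   # output gate
        u = np.tanh(Wu @ x_t + Uu @ h_prev)   # candidate update
        c = f * c_prev + i * u                # additive cell-state update
        h = o * np.tanh(c)                    # selectively exposed memory
        return h, c

    h, c = np.zeros(d_h), np.zeros(d_h)
    for x in rng.normal(size=(3, d_in)):
        h, c = lstm_step(x, h, c)
    print(h)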
Gated Recurrent Units, GRU (Cho et al. 2014)
• Main ideas
– Keep around memories to capture long-distance dependencies
– Allow error messages to flow at different strengths depending on the inputs
• Update gate
– Based on the current input and hidden state
– $z_t = \sigma(W_z x_t + U_z h_{t-1})$
• Reset gate
– Computed similarly but with different weights
– $r_t = \sigma(W_r x_t + U_r h_{t-1})$
GRU
• Memory at time step t combines the current and previous time steps
– $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$
– The update gate z controls how much of the past state should matter now
• If z is close to 1, we can copy information in that unit through many time steps → less vanishing gradient!
• New memory content
– $\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$
– If a reset-gate unit r is close to 0, it ignores the previous memory and stores only the new input information → allows the model to drop information that is irrelevant in the future
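And the matching numpy sketch of one GRU step, following the update-gate, reset-gate, and new-memory equations above (again with illustrative sizes and initialization).

    import numpy as np

    rng = np.random.default_rng(1)
    d_in, d_h = 4, 6
    Wz, Uz = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
    Wr, Ur = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
    W,  U  = rng.normal(0, 0.1, (d_h, d_in)), rng.normal(0, 0.1, (d_h, d_h))
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    def gru_step(x_t, h_prev):
        z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
        r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
        h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))  # new memory content
        return z * h_prev + (1 - z) * h_tilde          # interpolate old and new

    h = np.zeros(d_h)
    for x in rng.normal(size=(3, d_in)):
        h = gru_step(x, h)
    print(h)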
LSTM vs. GRU
• No clear winner!
• Tuning hyperparameters like layer size is probably more important than picking the ideal architecture
• GRUs have fewer parameters and thus may train a bit faster or need less data to generalize
• If you have enough data, the greater expressive power of LSTMs may lead to better results.
More RNNs
• Bidirectional RNN
• Stacked Bidirectional RNN
Tree-LSTMs
• Traditional sequential composition
• Tree-structured composition
More Applications of RNN
• Neural machine translation
• Handwriting generation
• Image caption generation
• …
Convolutional Neural Network
[Figure: convolution and pooling operations, from CS231n: Convolutional Neural Networks for Visual Recognition.]
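A numpy sketch of the two operations in the figure: valid 2-D convolution (implemented as cross-correlation, as CNN libraries do) followed by non-overlapping 2x2 max pooling. The input and kernel sizes are illustrative assumptions.

    import numpy as np

    def conv2d(image, kernel):          # "valid" cross-correlation, as CNNs use
        kh, kw = kernel.shape
        H, W = image.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def max_pool2x2(fmap):              # non-overlapping 2x2 max pooling
        H, W = fmap.shape
        return fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    rng = np.random.default_rng(0)
    image = rng.normal(size=(8, 8))
    kernel = rng.normal(size=(3, 3))
    print(max_pool2x2(conv2d(image, kernel)).shape)   # (3, 3)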
CNN for NLP
Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.
Recursive Neural Network
Socher, R., Manning, C., & Ng, A. (2011). Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Network. NIPS.
Summary
• Deep Learning
– Representation learning
– End-to-end learning
• Popular Networks
– Feedforward neural networks
– Recurrent neural networks
– Convolutional neural networks