Deep Learning for NLP Part 3 CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio)


Page 1

Deep Learning for NLP Part 3

CS224N Christopher Manning

(Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio)

Page 2

Backpropagation Training
Part 1.5: The Basics

Page 3

Backprop

• Compute gradient of example-wise loss wrt parameters

• Simply applying the derivative chain rule wisely

• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient

Page 4

Simple Chain Rule
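The figure on this slide did not survive extraction; the rule it illustrates is the standard scalar chain rule:

$$z = f(y),\ y = g(x) \;\Rightarrow\; \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}$$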

Page 5

Multiple Paths Chain Rule

Page 6

Multiple Paths Chain Rule - General
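This figure is also lost; the general rule it shows is that when z depends on x through intermediate variables y1, …, yn, the contributions along all paths add:

$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}$$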

Page 7

Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
node = computation result
arc = computation dependency

With {y1, …, yn} = successors of x:

$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}$$

Page 8

Back-Prop in Multi-Layer Net


h = sigmoid(Vx)
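A minimal sketch of one forward/backward pass for such a net. The slide gives only h = sigmoid(Vx); the linear output layer and the squared-error loss below are assumptions added so the backward pass has a starting point:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(V, W, x, y):
    # Forward pass
    a = V @ x                    # hidden pre-activation
    h = sigmoid(a)               # h = sigmoid(Vx), as on the slide
    y_hat = W @ h                # assumed linear output layer
    loss = 0.5 * np.sum((y_hat - y) ** 2)   # assumed squared-error loss
    # Backward pass: chain rule applied in reverse order of the forward pass
    d_yhat = y_hat - y           # dL/dy_hat
    dW = np.outer(d_yhat, h)     # dL/dW
    d_h = W.T @ d_yhat           # dL/dh
    d_a = d_h * h * (1 - h)      # sigmoid'(a) = h * (1 - h)
    dV = np.outer(d_a, x)        # dL/dV
    return loss, dV, dW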

Page 9

Back-Prop in General Flow Graph

Single scalar output z; {y1, …, yn} = successors of a node.

1. Fprop: visit nodes in topo-sort order
   - Compute value of node given predecessors
2. Bprop:
   - initialize output gradient = 1
   - visit nodes in reverse order:
     compute gradient wrt each node using gradient wrt its successors
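A toy sketch of this recipe (illustrative code, not the course's): each node stores its value from the fprop and its local derivatives with respect to its parents, and bprop accumulates gradients in reverse topological order:

import math

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # filled in during fprop
        self.parents = parents          # predecessor nodes
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0                 # dz/d(self), filled in during bprop

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, (a,), (s * (1.0 - s),))

def bprop(topo_order):
    # topo_order lists nodes with parents before children
    topo_order[-1].grad = 1.0           # initialize output gradient = 1
    for node in reversed(topo_order):   # visit nodes in reverse order
        for parent, g in zip(node.parents, node.local_grads):
            parent.grad += node.grad * g   # sum over successors

# z = sigmoid(x * w)
x, w = Node(2.0), Node(-1.0)
xw = mul(x, w)
z = sigmoid(xw)
bprop([x, w, xw, z])   # x.grad and w.grad now hold dz/dx and dz/dw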

Page 10

Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop.

• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.

• Easy and fast prototyping. See:
  • Theano (Python),
  • TensorFlow (Python/C++), or
  • Autograd (Lua/C++ for Torch)

Page 11

Deep Learning General Strategy and Tricks

Page 12

General Strategy

1. Select network structure appropriate for problem
   1. Structure: single words, fixed windows vs. convolutional vs. recurrent/recursive sentence-based vs. bag of words
   2. Nonlinearities [covered earlier]
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make model "larger"
   2. If you can overfit: Regularize

Page 13

Gradient Checks are Awesome!

• Allows you to know that there are no bugs in your neural network implementation! (But makes it run really slow.)

• Steps:
  1. Implement your gradient
  2. Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (∼10⁻⁴), and estimating derivatives
  3. Compare the two and make sure they are almost the same (a sketch of steps 2 and 3 follows)
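A minimal sketch of steps 2 and 3 using a centered difference, (J(θ+ε) − J(θ−ε)) / 2ε; all names here are illustrative:

import numpy as np

def gradient_check(loss_fn, params, analytic_grad, eps=1e-4, tol=1e-5):
    # loss_fn(params) -> scalar loss; analytic_grad has params' shape
    for i in range(params.size):
        old = params.flat[i]
        params.flat[i] = old + eps
        loss_plus = loss_fn(params)
        params.flat[i] = old - eps
        loss_minus = loss_fn(params)
        params.flat[i] = old                       # restore the parameter
        numeric = (loss_plus - loss_minus) / (2 * eps)
        if abs(numeric - analytic_grad.flat[i]) > tol:
            print(f"param {i}: analytic {analytic_grad.flat[i]:.6g}, "
                  f"numeric {numeric:.6g}")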

Page 14

Parameter Initialization

• Parameter initialization can be very important for success!
• Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target).
• Initialize weights ∼ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size): for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010] (sketched below)
• Make initialization slightly positive for ReLU, to avoid dead units
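A sketch of this scheme, assuming the usual formula from the cited paper, r = √(6 / (fan-in + fan-out)):

import numpy as np

def init_weights(fan_in, fan_out, unit="tanh", rng=np.random):
    # Weights ~ Uniform(-r, r); r = sqrt(6 / (fan_in + fan_out)) for tanh,
    # 4x bigger for sigmoid units [Glorot & Bengio, AISTATS 2010]
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit == "sigmoid":
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_out, fan_in))

V = init_weights(fan_in=300, fan_out=100)  # hidden layer weights
b = np.zeros(100)                          # hidden biases start at 0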

Page 15

Stochastic Gradient Descent (SGD)

• Gradient descent uses total gradient over all examples per update
  • Shouldn't be used. Very slow.
• SGD updates after each example:

$$\theta^{new} = \theta^{old} - \varepsilon_t \, \nabla_\theta L(z_t, \theta)$$

• L = loss function, z_t = current example, θ = parameter vector, and ε_t = learning rate.
• You process an example and then move each parameter a small distance by subtracting a fraction of the gradient
• ε_t should be small … more in a following slide
• Important: apply all SGD updates at once after the backprop pass
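In code the loop is short (a sketch; grad_fn stands in for whatever backprop routine computes dL/dθ for one example):

def sgd_epoch(params, grad_fn, examples, lr):
    # One pass of plain SGD: one update per training example
    for z in examples:
        params -= lr * grad_fn(params, z)   # theta <- theta - eps_t * grad
    return params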

Page 16

Stochastic Gradient Descent (SGD)

• Rather than do SGD on a single example, people usually do it on a minibatch of 32, 64, or 128 examples.
  • You sum the gradients in the minibatch (and scale down the learning rate)
• Minor advantage: the gradient estimate is much more robust when estimated on a bunch of examples rather than just one.
• Major advantage: code can run much faster if you can do a whole minibatch at once via matrix-matrix multiplies
• There is a whole panoply of fancier online learning algorithms commonly used now with NNs. Good ones include:
  • Adagrad
  • RMSprop
  • ADAM
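A sketch of the minibatch variant; averaging the gradients, as here, is the same as summing them and scaling down the learning rate. A real implementation would evaluate the whole batch with matrix-matrix multiplies instead of this Python loop:

def sgd_minibatch_epoch(params, grad_fn, examples, lr, batch_size=64):
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        g = sum(grad_fn(params, z) for z in batch) / len(batch)
        params -= lr * g                    # one update per minibatch
    return params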

Page 17

Learning Rates

• Setting the learning rate ε_t correctly is tricky
• Simplest recipe: keep it fixed and the same for all parameters
• Or start with a learning rate just small enough to be stable in the first pass through the data (epoch), then halve it on subsequent epochs
• Better results can usually be obtained by using a curriculum for decreasing learning rates, typically in O(1/t) because of theoretical convergence guarantees, e.g., with hyper-parameters ε_0 and τ (one common form is sketched below)
• Better yet: no hand-set learning rates, by using methods like AdaGrad (Duchi, Hazan, & Singer 2011) [but it may converge too soon; try resetting the accumulated gradients]
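One common O(1/t) curriculum with those two hyper-parameters (the exact schedule on the slide is an assumption here; this is the form recommended in Bengio 2012):

def learning_rate(t, eps0=0.01, tau=10000):
    # Hold eps0 for the first tau updates, then decay as O(1/t)
    return eps0 * tau / max(t, tau)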

Page 18

Attempt to overfit training data

Assuming you found the right network structure, implemented it correctly, and optimized it properly, you can make your model totally overfit on your training data (99%+ accuracy)

• If not:
  • Change architecture
  • Make model bigger (bigger vectors, more layers)
  • Fix optimization
• If yes:
  • Now, it's time to regularize the network

Page 19

Prevent Overfitting: Model Size and Regularization

• Simple first step: Reduce model size by lowering the number of units and layers and other parameters
• Standard L1 or L2 regularization on weights
• Early Stopping: Use parameters that gave best validation error
• Sparsity constraints on hidden activations, e.g., add a penalty term to the cost (one common form is sketched below)
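The penalty formula on the slide did not survive extraction; one common choice (an assumption here, not necessarily the slide's) is an L1 penalty on the activations, alongside L2 weight decay:

import numpy as np

def regularized_cost(data_loss, W, h, l2=1e-4, l1=1e-4):
    # data_loss: unregularized loss; W: weights; h: hidden activations
    # L2 weight decay plus an L1 sparsity penalty on the activations
    return data_loss + l2 * np.sum(W ** 2) + l1 * np.sum(np.abs(h))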

Page 20

Prevent Feature Co-adaptation

Dropout (Hinton et al. 2012) http://jmlr.org/papers/v15/srivastava14a.html

• Training time: at each instance of evaluation (in online SGD training), randomly set 50% of the inputs to each neuron to 0
• Test time: halve the model weights (now twice as many units are active)
• This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features
• A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all the others)
• Can be thought of as a form of model bagging
• It acts as a strong regularizer; see Wager et al. 2013, http://arxiv.org/abs/1307.1493
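A sketch of the mechanics. Scaling the activations by 0.5 at test time, as here, has the same effect on the next layer as the slide's "halve the model weights":

import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=np.random):
    if train:
        mask = rng.rand(*h.shape) >= p_drop   # keep each unit with prob 1 - p_drop
        return h * mask                       # dropped units contribute 0
    return h * (1.0 - p_drop)                 # test time: rescale instead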

Page 21

Deep Learning Tricks of the Trade

• Y. Bengio (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures" http://arxiv.org/abs/1206.5533
  • Unsupervised pre-training
  • Stochastic gradient descent and setting learning rates
  • Main hyper-parameters
  • Learning rate schedule & early stopping, minibatches, parameter initialization, number of hidden units, L1 or L2 weight decay, …
• Y. Bengio, I. Goodfellow, and A. Courville (in press), "Deep Learning". MIT Press. http://goodfeli.github.io/dlbook/
  • Many chapters on deep learning, including optimization tricks
  • Some more recent material than the 2012 paper

Page 22

Sharing statistical strength
Part 1.7

Page 23

Sharing Statistical Strength

• Besides very fast prediction, the main advantage of deep learning is statistical

• Potential to learn from fewer labeled examples because of sharing of statistical strength:
  • Unsupervised pre-training
  • Multi-task learning
  • Semi-supervised learning

Page 24

Multi-Task Learning

• Generalizing better to new tasks is crucial to approach AI

• Deep architectures learn good intermediate representations that can be shared across tasks

• Good representations make sense for many tasks

[Figure: task 1 output y1, task 2 output y2, and task 3 output y3 all computed from a shared intermediate representation h, which is computed from the raw input x]

Page 25

Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: the purely supervised setting]

Page 26

Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: the semi-supervised setting]

Page 27

Advantages of Deep Learning
Part 2

Page 28

#4 Unsupervised feature learning

Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)

But almost all data is unlabeled

Most information must be acquired unsupervised

Fortunately, a good model of observed data can really help you learn classification decisions

Commentary: This is more the dream than the reality; most of the recent successes of deep learning have come from regular supervised learning over very large datasets

Page 29

#5 Handling the recursivity of language

Human sentences are composed from words and phrases

We need compositionality in our ML models

Recursion: the same operator (same parameters) is applied repeatedly on different components

Recurrent models: recursion along a temporal sequence

[Figure: the sentence "A small crowd quietly enters the historic church" parsed as a tree (S → NP VP; "A small crowd" as NP; "quietly enters" in the VP; "the historic church" as Det. Adj. N.) feeding semantic representations, alongside a recurrent chain with inputs x_{t−1}, x_t, x_{t+1} and hidden states z_{t−1}, z_t, z_{t+1}]
Page 30

#6 Why now?

Despite prior investigation and understanding of many of the algorithmic techniques …

Before 2006, training deep architectures was unsuccessful

What has changed?
• New methods for unsupervised pre-training (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.) and deep model training developed
• More efficient parameter estimation methods
• Better understanding of model regularization
• More data and more computational power

Page 31

Deep Learning models have already achieved impressive results for HLT

Neural Language Model [Mikolov et al. Interspeech 2011]

Model (WSJ ASR task)        Eval WER
KN5 Baseline                17.2
Discriminative LM           16.9
Recurrent NN combination    14.4

MSR MAVIS Speech System [Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011]

"The algorithms represent the first time a company has released a deep-neural-networks (DNN)-based speech-recognition algorithm in a commercial product."

Acoustic model & training           Recog           RT03S FSH     Hub5 SWB
GMM 40-mix, BMMI, SWB 309h          1-pass −adapt   27.4          23.6
DBN-DNN 7 layer x 2048, SWB 309h    1-pass −adapt   18.5 (−33%)   16.1 (−32%)
GMM 72-mix, BMMI, FSH 2000h         k-pass +adapt   18.6          17.1

(entries are word error rates, WER)

Page 32

Deep Learning Models Have Interesting Performance Characteristics

Deep learning models can now be very fast in some circumstances

• SENNA [Collobert et al. 2011] can do POS or NER faster than other SOTA taggers (16x to 122x), using 25x less memory
• WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL Chunking 94.32% F1

Changes in computing technology favor deep learning
• In NLP, speed has traditionally come from exploiting sparsity
• But with modern machines, branches and widely spaced memory accesses are costly
• Uniform parallel operations on dense vectors are faster

These trends are even stronger with multi-core CPUs and GPUs

Page 33

TREE STRUCTURES WITH CONTINUOUS VECTORS

Page 34

Page 35

Compositionality

Page 36

We need more than word vectors and bags! What of larger semantic units?

How can we know when larger units are similar in meaning?

• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air

People interpret the meaning of larger text units (entities, descriptive terms, facts, arguments, stories) by semantic composition of smaller elements

Page 37


Representing Phrases as Vectors

Can we extend ideas of word vector spaces to phrases?

If the vector space captures syntactic and semantic information, the vectors can be used as features for parsing and interpretation

Vectors for single words are useful as features but limited

[Figure: a 2-d vector space (axes x1 from 0 to 10, x2 from 1 to 5) with Monday at (9, 2) near Tuesday at (9.5, 1.5), France at (2, 2.5) near Germany at (1, 3), and the phrases "the country of my birth" and "the place where I was born" close together at roughly (1, 5) and (1.1, 4)]

Page 38


How should we map phrases into a vector space?

Use the principle of compositionality! The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.

Model jointly learns compositional vector representations and tree structure.

[Figure: the phrase "the country of my birth" composed bottom-up in a binary tree, with each of the five words and each internal node annotated with a 2-d vector (values such as (0.4, 0.3), (2.3, 3.6), (4, 4.5), (7, 7), (2.1, 3.3), (2.5, 3.8), (5.5, 6.1), (1, 3.5), (1, 5)); in the same 2-d space as before, the composed phrases "the country of my birth" and "the place where I was born" land near each other, with Monday, Tuesday, France, and Germany plotted as on the previous slide]

Page 39


Tree Recursive Neural Networks (TreeRNNs)

Computational unit: Simple Neural Network layer, applied recursively

[Figure: the phrase "on the mat." parsed as a binary tree; each word and each merged node carries a small vector (e.g., (9, 1), (4, 3), (3, 3), (8, 3), (8, 5)), and a Neural Network box maps two child vectors to a parent vector together with a score (e.g., 1.3)]

(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)

Page 40


Version 1: Simple concatenation TreeRNN

• Only a single weight matrix = composition function!
• No real interaction between the input words!
• Not adequate as a composition function for human language

[Figure: children c1 and c2 are concatenated and fed through W to produce the parent p, which is scored by V]

$$p = \tanh\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right), \quad \text{where } \tanh \text{ is applied element-wise}$$

$$\text{score} = V^{\top} p$$
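A numpy sketch of this composition unit (toy 2-d vectors, matching the slides' examples):

import numpy as np

def compose(c1, c2, W, b, V):
    children = np.concatenate([c1, c2])   # [c1; c2]
    p = np.tanh(W @ children + b)         # p = tanh(W [c1; c2] + b)
    score = V @ p                         # score = V^T p
    return p, score

d = 2
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2 * d))
b = np.zeros(d)
V = rng.normal(size=d)
p, s = compose(np.array([8.0, 5.0]), np.array([3.0, 3.0]), W, b, V)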

Page 41


Version 4: Recursive Neural Tensor Network

Allows the two word or phrase vectors to interact multiplicatively

Page 42


Beyond the bag of words: Sentiment detection

Is the tone of a piece of text positive, negative, or neutral?

• The lore is that sentiment is "easy"
• Detection accuracy for longer documents ~90%, BUT

… loved … … great … … impressed … … marvelous …

Page 43


Stanford Sentiment Treebank

• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions

http://nlp.stanford.edu:8080/sentiment/

Page 44


Better Dataset Helped All Models

• Hard negation cases are still mostly incorrect
• We also need a more powerful model!

[Figure: bar chart, accuracy from 75 to 84%, comparing BiNB, RNN, and MV-RNN when training with sentence labels vs. training with the Treebank; the Treebank labels improve every model]

Page 45


Version 4: Recursive Neural Tensor Network

Idea: Allow both additive and mediated multiplicative interactions of vectors
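The slide's equation did not survive extraction; in the published RNTN (Socher et al. 2013) each output dimension adds a bilinear tensor term to the Version-1 affine term, roughly

$$p = \tanh\left( \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}^{\top} T^{[1:d]} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right)$$

(the tensor is written T here to keep V for the scoring vector). A numpy sketch:

import numpy as np

def rntn_compose(c1, c2, T, W, b):
    # T has one (2d x 2d) slice per output dimension: the multiplicative part
    x = np.concatenate([c1, c2])                       # [c1; c2]
    bilinear = np.array([x @ T[k] @ x for k in range(T.shape[0])])
    return np.tanh(bilinear + W @ x + b)               # additive + multiplicative

d = 2
rng = np.random.default_rng(0)
T = rng.normal(size=(d, 2 * d, 2 * d))
W = rng.normal(size=(d, 2 * d))
p = rntn_compose(np.array([8.0, 5.0]), np.array([3.0, 3.0]), T, W, np.zeros(d))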

Page 46


Recursive Neural Tensor Network

Page 47


Recursive Neural Tensor Network

Page 48


Recursive Neural Tensor Network

• Use resulting vectors in tree as input to a classifier like logistic regression

• Train all weights jointly with gradient descent

Page 49


Positive/Negative Results on Treebank

Classifying Sentences: Accuracy improves to 85.4

[Figure: bar chart, accuracy from 74 to 86%, comparing BiNB, RNN, MV-RNN, and RNTN when training with sentence labels vs. training with the Treebank]

Note: for more recent work, see Le & Mikolov (2014), Irsoy & Cardie (2014), Tai et al. (2015)

Page 50


Experimental Results on Treebank

• RNTN can capture constructions like X but Y
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)

Page 51


Negation Results

When negating negatives, positive activation should increase!

Demo: http://nlp.stanford.edu:8080/sentiment/

Page 52

Conclusion

Developing intelligent machines involves being able to recognize and exploit compositional structure

It also involves other things like top-down prediction, of course

Work is now underway on how to do more complex tasks than straight classification inside deep learning systems