Deep Learning for NLP Part 3 CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio)


Page 1

Deep Learning for NLP Part 3

CS224N Christopher Manning

(Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio)

Page 2

Backpropagation Training
Part 1.5: The Basics

Page 3

Backprop

• Compute gradient of example-wise loss wrt parameters

• Simply applying the derivative chain rule wisely

• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient

Page 4

Simple Chain Rule
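The figure on this slide did not survive extraction; the rule it illustrates is the standard scalar chain rule:

$$z = f(y),\ y = g(x) \;\Rightarrow\; \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}$$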

Page 5

Multiple Paths Chain Rule

Page 6

Multiple Paths Chain Rule - General
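This figure is also lost; the general rule it shows is that when z depends on x through intermediate variables y1, …, yn, the contributions along all paths add:

$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}$$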

Page 7

Chain Rule in Flow Graph

Flow graph: any directed acyclic graph
node = computation result
arc = computation dependency

With {y1, …, yn} = successors of x:

$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}$$

Page 8

Back-Prop in Multi-Layer Net


h = sigmoid(Vx)
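A minimal sketch of one forward/backward pass for such a net. The slide gives only h = sigmoid(Vx); the linear output layer and the squared-error loss below are assumptions added so the backward pass has a starting point:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(V, W, x, y):
    # Forward pass
    a = V @ x                    # hidden pre-activation
    h = sigmoid(a)               # h = sigmoid(Vx), as on the slide
    y_hat = W @ h                # assumed linear output layer
    loss = 0.5 * np.sum((y_hat - y) ** 2)   # assumed squared-error loss
    # Backward pass: chain rule applied in reverse order of the forward pass
    d_yhat = y_hat - y           # dL/dy_hat
    dW = np.outer(d_yhat, h)     # dL/dW
    d_h = W.T @ d_yhat           # dL/dh
    d_a = d_h * h * (1 - h)      # sigmoid'(a) = h * (1 - h)
    dV = np.outer(d_a, x)        # dL/dV
    return loss, dV, dW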

Page 9

Back-Prop in General Flow Graph

Single scalar output z; {y1, …, yn} = successors of a node.

1. Fprop: visit nodes in topo-sort order
   - Compute value of node given predecessors
2. Bprop:
   - initialize output gradient = 1
   - visit nodes in reverse order:
     compute gradient wrt each node using gradient wrt its successors
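A toy sketch of this recipe (illustrative code, not the course's): each node stores its value from the fprop and its local derivatives with respect to its parents, and bprop accumulates gradients in reverse topological order:

import math

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # filled in during fprop
        self.parents = parents          # predecessor nodes
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0                 # dz/d(self), filled in during bprop

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, (a,), (s * (1.0 - s),))

def bprop(topo_order):
    # topo_order lists nodes with parents before children
    topo_order[-1].grad = 1.0           # initialize output gradient = 1
    for node in reversed(topo_order):   # visit nodes in reverse order
        for parent, g in zip(node.parents, node.local_grads):
            parent.grad += node.grad * g   # sum over successors

# z = sigmoid(x * w)
x, w = Node(2.0), Node(-1.0)
xw = mul(x, w)
z = sigmoid(xw)
bprop([x, w, xw, z])   # x.grad and w.grad now hold dz/dx and dz/dw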

Page 10

Automatic Differentiation

• The gradient computation can be automatically inferred from the symbolic expression of the fprop.

• Each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output.

• Easy and fast prototyping. See:
  • Theano (Python),
  • TensorFlow (Python/C++), or
  • Autograd (Lua/C++ for Torch)

Page 11

Deep Learning General Strategy and Tricks

Page 12

General Strategy

1. Select network structure appropriate for problem
   1. Structure: single words, fixed windows vs. convolutional vs. recurrent/recursive sentence-based vs. bag of words
   2. Nonlinearities [covered earlier]
2. Check for implementation bugs with gradient checks
3. Parameter initialization
4. Optimization
5. Check if the model is powerful enough to overfit
   1. If not, change model structure or make model "larger"
   2. If you can overfit: Regularize

Page 13

Gradient Checks are Awesome!

• Allows you to know that there are no bugs in your neural network implementation! (But makes it run really slow.)

• Steps:
  1. Implement your gradient
  2. Implement a finite difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (∼10⁻⁴), and estimating derivatives
  3. Compare the two and make sure they are almost the same (a sketch of steps 2 and 3 follows)
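A minimal sketch of steps 2 and 3 using a centered difference, (J(θ+ε) − J(θ−ε)) / 2ε; all names here are illustrative:

import numpy as np

def gradient_check(loss_fn, params, analytic_grad, eps=1e-4, tol=1e-5):
    # loss_fn(params) -> scalar loss; analytic_grad has params' shape
    for i in range(params.size):
        old = params.flat[i]
        params.flat[i] = old + eps
        loss_plus = loss_fn(params)
        params.flat[i] = old - eps
        loss_minus = loss_fn(params)
        params.flat[i] = old                       # restore the parameter
        numeric = (loss_plus - loss_minus) / (2 * eps)
        if abs(numeric - analytic_grad.flat[i]) > tol:
            print(f"param {i}: analytic {analytic_grad.flat[i]:.6g}, "
                  f"numeric {numeric:.6g}")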

Page 14

Parameter Initialization

• Parameter initialization can be very important for success!
• Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0 (e.g., mean target or inverse sigmoid of mean target).
• Initialize weights ∼ Uniform(−r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size): for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010] (sketched below)
• Make initialization slightly positive for ReLU, to avoid dead units
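A sketch of this scheme, assuming the usual formula from the cited paper, r = √(6 / (fan-in + fan-out)):

import numpy as np

def init_weights(fan_in, fan_out, unit="tanh", rng=np.random):
    # Weights ~ Uniform(-r, r); r = sqrt(6 / (fan_in + fan_out)) for tanh,
    # 4x bigger for sigmoid units [Glorot & Bengio, AISTATS 2010]
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit == "sigmoid":
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_out, fan_in))

V = init_weights(fan_in=300, fan_out=100)  # hidden layer weights
b = np.zeros(100)                          # hidden biases start at 0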

Page 15

Stochastic Gradient Descent (SGD)

• Gradient descent uses total gradient over all examples per update
  • Shouldn't be used. Very slow.
• SGD updates after each example:

$$\theta^{new} = \theta^{old} - \varepsilon_t \, \nabla_\theta L(z_t, \theta)$$

• L = loss function, z_t = current example, θ = parameter vector, and ε_t = learning rate.
• You process an example and then move each parameter a small distance by subtracting a fraction of the gradient
• ε_t should be small … more in a following slide
• Important: apply all SGD updates at once after the backprop pass
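In code the loop is short (a sketch; grad_fn stands in for whatever backprop routine computes dL/dθ for one example):

def sgd_epoch(params, grad_fn, examples, lr):
    # One pass of plain SGD: one update per training example
    for z in examples:
        params -= lr * grad_fn(params, z)   # theta <- theta - eps_t * grad
    return params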

Page 16

Stochastic Gradient Descent (SGD)

• Rather than do SGD on a single example, people usually do it on a minibatch of 32, 64, or 128 examples.
  • You sum the gradients in the minibatch (and scale down the learning rate)
• Minor advantage: the gradient estimate is much more robust when estimated on a bunch of examples rather than just one.
• Major advantage: code can run much faster if you can do a whole minibatch at once via matrix-matrix multiplies
• There is a whole panoply of fancier online learning algorithms commonly used now with NNs. Good ones include:
  • Adagrad
  • RMSprop
  • ADAM
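A sketch of the minibatch variant; averaging the gradients, as here, is the same as summing them and scaling down the learning rate. A real implementation would evaluate the whole batch with matrix-matrix multiplies instead of this Python loop:

def sgd_minibatch_epoch(params, grad_fn, examples, lr, batch_size=64):
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        g = sum(grad_fn(params, z) for z in batch) / len(batch)
        params -= lr * g                    # one update per minibatch
    return params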

Page 17

Learning Rates

• Setting the learning rate ε_t correctly is tricky
• Simplest recipe: keep it fixed and the same for all parameters
• Or start with a learning rate just small enough to be stable in the first pass through the data (epoch), then halve it on subsequent epochs
• Better results can usually be obtained by using a curriculum for decreasing learning rates, typically in O(1/t) because of theoretical convergence guarantees, e.g., with hyper-parameters ε_0 and τ (one common form is sketched below)
• Better yet: no hand-set learning rates, by using methods like AdaGrad (Duchi, Hazan, & Singer 2011) [but it may converge too soon; try resetting the accumulated gradients]
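One common O(1/t) curriculum with those two hyper-parameters (the exact schedule on the slide is an assumption here; this is the form recommended in Bengio 2012):

def learning_rate(t, eps0=0.01, tau=10000):
    # Hold eps0 for the first tau updates, then decay as O(1/t)
    return eps0 * tau / max(t, tau)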

Page 18

Attempt to overfit training data

Assuming you found the right network structure, implemented it correctly, and optimized it properly, you can make your model totally overfit on your training data (99%+ accuracy)

• If not:
  • Change architecture
  • Make model bigger (bigger vectors, more layers)
  • Fix optimization
• If yes:
  • Now, it's time to regularize the network

Page 19

Prevent Overfitting: Model Size and Regularization

• Simple first step: Reduce model size by lowering the number of units and layers and other parameters
• Standard L1 or L2 regularization on weights
• Early Stopping: Use parameters that gave best validation error
• Sparsity constraints on hidden activations, e.g., add a penalty term to the cost (one common form is sketched below)
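The penalty formula on the slide did not survive extraction; one common choice (an assumption here, not necessarily the slide's) is an L1 penalty on the activations, alongside L2 weight decay:

import numpy as np

def regularized_cost(data_loss, W, h, l2=1e-4, l1=1e-4):
    # data_loss: unregularized loss; W: weights; h: hidden activations
    # L2 weight decay plus an L1 sparsity penalty on the activations
    return data_loss + l2 * np.sum(W ** 2) + l1 * np.sum(np.abs(h))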

Page 20

Prevent Feature Co-adaptation

Dropout (Hinton et al. 2012) http://jmlr.org/papers/v15/srivastava14a.html

• Training time: at each instance of evaluation (in online SGD training), randomly set 50% of the inputs to each neuron to 0
• Test time: halve the model weights (now twice as many units are active)
• This prevents feature co-adaptation: a feature cannot be useful only in the presence of particular other features
• A kind of middle ground between Naïve Bayes (where all feature weights are set independently) and logistic regression models (where weights are set in the context of all the others)
• Can be thought of as a form of model bagging
• It acts as a strong regularizer; see Wager et al. 2013, http://arxiv.org/abs/1307.1493
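A sketch of the mechanics. Scaling the activations by 0.5 at test time, as here, has the same effect on the next layer as the slide's "halve the model weights":

import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=np.random):
    if train:
        mask = rng.rand(*h.shape) >= p_drop   # keep each unit with prob 1 - p_drop
        return h * mask                       # dropped units contribute 0
    return h * (1.0 - p_drop)                 # test time: rescale instead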

Page 21

Deep Learning Tricks of the Trade

• Y. Bengio (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures" http://arxiv.org/abs/1206.5533
  • Unsupervised pre-training
  • Stochastic gradient descent and setting learning rates
  • Main hyper-parameters
  • Learning rate schedule & early stopping, minibatches, parameter initialization, number of hidden units, L1 or L2 weight decay, …
• Y. Bengio, I. Goodfellow, and A. Courville (in press), "Deep Learning". MIT Press. http://goodfeli.github.io/dlbook/
  • Many chapters on deep learning, including optimization tricks
  • Some more recent material than the 2012 paper

Page 22

Sharing statistical strength
Part 1.7

Page 23

Sharing Statistical Strength

• Besides very fast prediction, the main advantage of deep learning is statistical

• Potential to learn from fewer labeled examples because of sharing of statistical strength:
  • Unsupervised pre-training
  • Multi-task learning
  • Semi-supervised learning

Page 24

Multi-Task Learning

• Generalizing better to new tasks is crucial to approach AI

• Deep architectures learn good intermediate representations that can be shared across tasks

• Good representations make sense for many tasks

[Figure: task 1 output y1, task 2 output y2, and task 3 output y3 all computed from a shared intermediate representation h, which is computed from the raw input x]

Page 25

Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: the purely supervised setting]

Page 26

Semi-Supervised Learning

• Hypothesis: P(c|x) can be more accurately computed using shared structure with P(x)

[Figure: the semi-supervised setting]

Page 27

Advantages of Deep Learning
Part 2

Page 28

#4 Unsupervised feature learning

Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)

But almost all data is unlabeled

Most information must be acquired unsupervised

Fortunately, a good model of observed data can really help you learn classification decisions

Commentary: This is more the dream than the reality; most of the recent successes of deep learning have come from regular supervised learning over very large datasets

Page 29

#5 Handling the recursivity of language

Human sentences are composed from words and phrases

We need compositionality in our ML models

Recursion: the same operator (same parameters) is applied repeatedly on different components

Recurrent models: recursion along a temporal sequence

[Figure: the sentence "A small crowd quietly enters the historic church" parsed as a tree (S → NP VP; "A small crowd" as NP; "quietly enters" in the VP; "the historic church" as Det. Adj. N.) feeding semantic representations, alongside a recurrent chain with inputs x_{t−1}, x_t, x_{t+1} and hidden states z_{t−1}, z_t, z_{t+1}]
Page 30

#6 Why now?

Despite prior investigation and understanding of many of the algorithmic techniques …

Before 2006, training deep architectures was unsuccessful

What has changed?
• New methods for unsupervised pre-training (Restricted Boltzmann Machines = RBMs, autoencoders, contrastive estimation, etc.) and deep model training developed
• More efficient parameter estimation methods
• Better understanding of model regularization
• More data and more computational power

Page 31

Deep Learning models have already achieved impressive results for HLT

Neural Language Model [Mikolov et al. Interspeech 2011]

Model (WSJ ASR task)        Eval WER
KN5 Baseline                17.2
Discriminative LM           16.9
Recurrent NN combination    14.4

MSR MAVIS Speech System [Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011]

"The algorithms represent the first time a company has released a deep-neural-networks (DNN)-based speech-recognition algorithm in a commercial product."

Acoustic model & training           Recog           RT03S FSH     Hub5 SWB
GMM 40-mix, BMMI, SWB 309h          1-pass −adapt   27.4          23.6
DBN-DNN 7 layer x 2048, SWB 309h    1-pass −adapt   18.5 (−33%)   16.1 (−32%)
GMM 72-mix, BMMI, FSH 2000h         k-pass +adapt   18.6          17.1

(entries are word error rates, WER)

Page 32

Deep Learning Models Have Interesting Performance Characteristics

Deep learning models can now be very fast in some circumstances

• SENNA [Collobert et al. 2011] can do POS or NER faster than other SOTA taggers (16x to 122x), using 25x less memory
• WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL Chunking 94.32% F1

Changes in computing technology favor deep learning
• In NLP, speed has traditionally come from exploiting sparsity
• But with modern machines, branches and widely spaced memory accesses are costly
• Uniform parallel operations on dense vectors are faster

These trends are even stronger with multi-core CPUs and GPUs

Page 33

TREE STRUCTURES WITH CONTINUOUS VECTORS

Page 34

Page 35

Compositionality

Page 36

We need more than word vectors and bags! What of larger semantic units?

How can we know when larger units are similar in meaning?

• The snowboarder is leaping over the mogul
• A person on a snowboard jumps into the air

People interpret the meaning of larger text units (entities, descriptive terms, facts, arguments, stories) by semantic composition of smaller elements

Page 37


Representing Phrases as Vectors

Can we extend ideas of word vector spaces to phrases?

If the vector space captures syntactic and semantic information, the vectors can be used as features for parsing and interpretation

Vectors for single words are useful as features but limited

[Figure: a 2-d vector space (axes x1 from 0 to 10, x2 from 1 to 5) with Monday at (9, 2) near Tuesday at (9.5, 1.5), France at (2, 2.5) near Germany at (1, 3), and the phrases "the country of my birth" and "the place where I was born" close together at roughly (1, 5) and (1.1, 4)]

Page 38


How should we map phrases into a vector space?

Use the principle of compositionality! The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) the rules that combine them.

Model jointly learns compositional vector representations and tree structure.

[Figure: the phrase "the country of my birth" composed bottom-up in a binary tree, with each of the five words and each internal node annotated with a 2-d vector (values such as (0.4, 0.3), (2.3, 3.6), (4, 4.5), (7, 7), (2.1, 3.3), (2.5, 3.8), (5.5, 6.1), (1, 3.5), (1, 5)); in the same 2-d space as before, the composed phrases "the country of my birth" and "the place where I was born" land near each other, with Monday, Tuesday, France, and Germany plotted as on the previous slide]

Page 39


Tree Recursive Neural Networks (TreeRNNs)

Computational unit: Simple Neural Network layer, applied recursively

[Figure: the phrase "on the mat." parsed as a binary tree; each word and each merged node carries a small vector (e.g., (9, 1), (4, 3), (3, 3), (8, 3), (8, 5)), and a Neural Network box maps two child vectors to a parent vector together with a score (e.g., 1.3)]

(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)

Page 40


Version 1: Simple concatenation TreeRNN

• Only a single weight matrix = composition function!
• No real interaction between the input words!
• Not adequate as a composition function for human language

[Figure: children c1 and c2 are concatenated and fed through W to produce the parent p, which is scored by V]

$$p = \tanh\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right), \quad \text{where } \tanh \text{ is applied element-wise}$$

$$\text{score} = V^{\top} p$$
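A numpy sketch of this composition unit (toy 2-d vectors, matching the slides' examples):

import numpy as np

def compose(c1, c2, W, b, V):
    children = np.concatenate([c1, c2])   # [c1; c2]
    p = np.tanh(W @ children + b)         # p = tanh(W [c1; c2] + b)
    score = V @ p                         # score = V^T p
    return p, score

d = 2
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2 * d))
b = np.zeros(d)
V = rng.normal(size=d)
p, s = compose(np.array([8.0, 5.0]), np.array([3.0, 3.0]), W, b, V)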

Page 41


Version 4: Recursive Neural Tensor Network

Allows the two word or phrase vectors to interact multiplicatively

Page 42


Beyond the bag of words: Sentiment detection

Is the tone of a piece of text positive, negative, or neutral?

• The lore is that sentiment is "easy"
• Detection accuracy for longer documents ~90%, BUT

… loved … … great … … impressed … … marvelous …

Page 43


Stanford Sentiment Treebank

• 215,154 phrases labeled in 11,855 sentences
• Can actually train and test compositions

http://nlp.stanford.edu:8080/sentiment/

Page 44


Better Dataset Helped All Models

• Hard negation cases are still mostly incorrect
• We also need a more powerful model!

[Figure: bar chart, accuracy from 75 to 84%, comparing BiNB, RNN, and MV-RNN when training with sentence labels vs. training with the Treebank; the Treebank labels improve every model]

Page 45


Version 4: Recursive Neural Tensor Network

Idea: Allow both additive and mediated multiplicative interactions of vectors
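The slide's equation did not survive extraction; in the published RNTN (Socher et al. 2013) each output dimension adds a bilinear tensor term to the Version-1 affine term, roughly

$$p = \tanh\left( \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}^{\top} T^{[1:d]} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right)$$

(the tensor is written T here to keep V for the scoring vector). A numpy sketch:

import numpy as np

def rntn_compose(c1, c2, T, W, b):
    # T has one (2d x 2d) slice per output dimension: the multiplicative part
    x = np.concatenate([c1, c2])                       # [c1; c2]
    bilinear = np.array([x @ T[k] @ x for k in range(T.shape[0])])
    return np.tanh(bilinear + W @ x + b)               # additive + multiplicative

d = 2
rng = np.random.default_rng(0)
T = rng.normal(size=(d, 2 * d, 2 * d))
W = rng.normal(size=(d, 2 * d))
p = rntn_compose(np.array([8.0, 5.0]), np.array([3.0, 3.0]), T, W, np.zeros(d))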

Page 46


Recursive Neural Tensor Network

Page 47


Recursive Neural Tensor Network

Page 48


Recursive Neural Tensor Network

• Use resulting vectors in tree as input to a classifier like logistic regression

• Train all weights jointly with gradient descent

Page 49


Positive/Negative Results on Treebank

Classifying Sentences: Accuracy improves to 85.4

[Figure: bar chart, accuracy from 74 to 86%, comparing BiNB, RNN, MV-RNN, and RNTN when training with sentence labels vs. training with the Treebank]

Note: for more recent work, see Le & Mikolov (2014), Irsoy & Cardie (2014), Tai et al. (2015)

Page 50


Experimental Results on Treebank

• RNTN can capture constructions like X but Y
• RNTN accuracy of 72%, compared to MV-RNN (65%), biword NB (58%), and RNN (54%)

Page 51


Negation Results

When negating negatives, positive activation should increase!

Demo: http://nlp.stanford.edu:8080/sentiment/

Page 52

Conclusion

Developing intelligent machines involves being able to recognize and exploit compositional structure

It also involves other things like top-down prediction, of course

Work is now underway on how to do more complex tasks than straight classification inside deep learning systems