CS224N/Ling284 - Stanford University
TRANSCRIPT
Natural Language Processing with Deep Learning
CS224N/Ling284
Lecture 5: Backpropagation
Kevin Clark
Announcements
• Assignment 1 due Thursday, 11:59
  • You can use up to 3 late days (making it due Sunday at midnight)
• Default final project will be released February 1st
  • To help you choose which project option you want to do
• Final project proposal due February 8th
  • See website for details and inspiration
Overview
Today:
• From one-layer to multilayer neural networks!
• Fully vectorized gradient computation
• The backpropagation algorithm
• (Time permitting) Class project tips
Remember: One-layer Neural Net
x = [x_museums ; x_in ; x_Paris ; x_are ; x_amazing]
Two-layer Neural Net
x = [x_museums ; x_in ; x_Paris ; x_are ; x_amazing]
Repeat as Needed!
x = [x_museums ; x_in ; x_Paris ; x_are ; x_amazing]
Why Have Multiple Layers?
• Hierarchical representations -> neural net can represent complicated features
• Better results!

#Layers    Machine Translation Score
2          23.7
4          25.3
8          25.5

From Transformer network (will cover in a later lecture)
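As an illustration of "repeat as needed", here is a minimal NumPy sketch of stacking layers of the form h = f(Wx + b); the layer sizes and the choice of sigmoid are made up for illustration, not taken from the lecture.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=20)                  # e.g. 5 concatenated 4-d word vectors

W1, b1 = rng.normal(size=(10, 20)), np.zeros(10)
W2, b2 = rng.normal(size=(10, 10)), np.zeros(10)
u = rng.normal(size=10)

h1 = sigmoid(W1 @ x + b1)                # layer 1
h2 = sigmoid(W2 @ h1 + b2)               # layer 2 ("repeat as needed")
s = u @ h2                               # scalar score, as in the window model
print(s)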
Remember: Stochastic Gradient Descent
• Update equation: $\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta)$
• $\alpha$ = step size or learning rate
• This Lecture: How do we compute $\nabla_\theta J(\theta)$?
  • By hand
  • Algorithmically (the backpropagation algorithm)
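A minimal sketch of one SGD step in NumPy, assuming we already have a function that returns the gradient of the loss (grad_J below is a stand-in, not something defined in the lecture):

import numpy as np

def sgd_step(theta, grad_J, alpha=0.1):
    """One stochastic gradient descent update: theta <- theta - alpha * gradient."""
    return theta - alpha * grad_J(theta)

# toy example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, lambda t: 2 * t, alpha=0.1)
print(theta)   # close to [0, 0]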
Whylearnallthesedetailsaboutgradients?
• Moderndeeplearningframeworkscomputegradientsforyou• Butwhytakeaclassoncompilersorsystemswhentheyare
implementedforyou?• Understandingwhatisgoingonunderthehoodisuseful!
• BackpropagaGondoesn’talwaysworkperfectly.• Understandingwhyiscrucialfordebuggingandimproving
models• Exampleinfuturelecture:explodingandvanishinggradients
10
Quickly Computing Gradients by Hand
• Review of multivariable derivatives
• Fully vectorized gradients
  • Much faster and more useful than non-vectorized gradients
  • But doing a non-vectorized gradient can be good practice; see the slides from last week's lecture for an example
• Lecture notes cover this material in more detail
Gradients
• Given a function with 1 output and n inputs: $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n)$
• Its gradient is a vector of partial derivatives with respect to each input:
  $\frac{\partial f}{\partial \mathbf{x}} = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]$
Jacobian Matrix: Generalization of the Gradient
• Given a function with m outputs and n inputs: $\mathbf{f}(\mathbf{x}) = [f_1(x_1, \ldots, x_n), \ldots, f_m(x_1, \ldots, x_n)]$
• Its Jacobian is an m x n matrix of partial derivatives:
  $\left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)_{ij} = \frac{\partial f_i}{\partial x_j}$
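A small worked instance (not from the slides) to make the definition concrete: for $\mathbf{f}(x_1, x_2) = [\,x_1 x_2,\; x_1 + x_2\,]$ (m = 2 outputs, n = 2 inputs),

$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2}
\end{bmatrix}
=
\begin{bmatrix}
x_2 & x_1 \\
1 & 1
\end{bmatrix}$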
Chain Rule For Jacobians
• For one-variable functions: multiply derivatives. If $z = f(y)$ and $y = g(x)$, then $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
• For multiple variables: multiply Jacobians. If $\mathbf{h} = f(\mathbf{z})$ and $\mathbf{z} = g(\mathbf{x})$, then $\frac{\partial \mathbf{h}}{\partial \mathbf{x}} = \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{x}}$
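A worked instance (an illustration, not copied from the slides), using the composition that reappears later in the lecture and the elementwise-activation Jacobian derived on the next slide: with $\mathbf{h} = f(\mathbf{z})$ applied elementwise and $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$,

$\frac{\partial \mathbf{h}}{\partial \mathbf{x}} = \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathrm{diag}(f'(\mathbf{z}))\, \mathbf{W}$

an (n x n) Jacobian times an (n x m) Jacobian, giving an (n x m) Jacobian overall.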
Example Jacobian: Activation Function
• $\mathbf{h} = f(\mathbf{z})$, where $f$ is applied elementwise: $h_i = f(z_i)$
• Function has n outputs and n inputs -> n by n Jacobian
• $\left( \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \right)_{ij} = \frac{\partial h_i}{\partial z_j} = f'(z_i)$ if $i = j$, and $0$ otherwise
• So the Jacobian is a diagonal matrix: $\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathrm{diag}(f'(\mathbf{z}))$
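A minimal NumPy sketch of this Jacobian for a sigmoid activation (the choice of sigmoid is illustrative; the lecture's f is a generic elementwise nonlinearity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([0.5, -1.0, 2.0])
# Jacobian of h = sigmoid(z) with respect to z: the diagonal matrix diag(f'(z))
jacobian = np.diag(sigmoid_prime(z))
print(jacobian)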
Other Jacobians
• Compute these at home for practice! For example:
  • $\frac{\partial}{\partial \mathbf{x}} (\mathbf{W}\mathbf{x} + \mathbf{b})$
  • $\frac{\partial}{\partial \mathbf{b}} (\mathbf{W}\mathbf{x} + \mathbf{b})$
  • $\frac{\partial}{\partial \mathbf{u}} (\mathbf{u}^\top \mathbf{h})$
• Check your answers with the lecture notes
Back to Neural Nets!
x = [x_museums ; x_in ; x_Paris ; x_are ; x_amazing]
• Let's find $\frac{\partial s}{\partial \mathbf{b}}$
• In practice we care about the gradient of the loss, but we will compute the gradient of the score for simplicity
1. Break up equations into simple pieces
$s = \mathbf{u}^\top \mathbf{h}$
$\mathbf{h} = f(\mathbf{z})$
$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$
$\mathbf{x}$ (input: the concatenated word vectors)
2. Apply the chain rule
$\frac{\partial s}{\partial \mathbf{b}} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{b}}$
3. Write out the Jacobians
Useful Jacobians from the previous slides:
$\frac{\partial s}{\partial \mathbf{h}} = \mathbf{u}^\top$
$\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathrm{diag}(f'(\mathbf{z}))$
$\frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{I}$
Putting them together:
$\frac{\partial s}{\partial \mathbf{b}} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{u}^\top \mathrm{diag}(f'(\mathbf{z}))\, \mathbf{I} = \mathbf{u}^\top \mathrm{diag}(f'(\mathbf{z}))$ (element $i$ is $u_i f'(z_i)$)
Re-using Computation
• Suppose we now want to compute $\frac{\partial s}{\partial \mathbf{W}}$
• Using the chain rule again:
  $\frac{\partial s}{\partial \mathbf{W}} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{W}}$
• The first two terms are the same as in $\frac{\partial s}{\partial \mathbf{b}}$! Let's avoid duplicated computation...
• Define $\boldsymbol{\delta} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathbf{u}^\top \mathrm{diag}(f'(\mathbf{z}))$, the local error signal
Derivative with respect to a Matrix
• What does $\frac{\partial s}{\partial \mathbf{W}}$ look like?
• 1 output, nm inputs: 1 by nm Jacobian?
  • Inconvenient to do
• Instead, follow the convention: the shape of the gradient is the shape of the parameters
• So $\frac{\partial s}{\partial \mathbf{W}}$ is n by m, a matrix whose (i, j) entry is $\frac{\partial s}{\partial W_{ij}}$, matching the shape of $\mathbf{W}$
Derivative with respect to a Matrix
• Remember $\boldsymbol{\delta} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}}$
• $\boldsymbol{\delta}$ is going to be in our answer
• The other term should be $\mathbf{x}$ because $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$
• It turns out $\frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\delta}^\top \mathbf{x}^\top$, the outer product of $\boldsymbol{\delta}$ (1 by n) and $\mathbf{x}$ (m by 1)
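A minimal NumPy sketch of these by-hand gradients for the score s = u^T f(Wx + b), with a sigmoid as an illustrative choice of f and made-up dimensions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, m = 4, 6
W, b = rng.normal(size=(n, m)), rng.normal(size=n)
u, x = rng.normal(size=n), rng.normal(size=m)

z = W @ x + b
h = sigmoid(z)
s = u @ h                         # the score

f_prime = h * (1 - h)             # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
delta = u * f_prime               # delta = u^T diag(f'(z)), stored as a vector

grad_b = delta                    # shape (n,), matches the shape of b
grad_W = np.outer(delta, x)       # delta^T x^T, shape (n, m), matches the shape of W
print(grad_b.shape, grad_W.shape)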
Why the Transposes?
• Hacky answer: this makes the dimensions work out
  • $\boldsymbol{\delta}^\top$ is n by 1 and $\mathbf{x}^\top$ is 1 by m, so $\boldsymbol{\delta}^\top \mathbf{x}^\top$ is n by m, the shape of $\mathbf{W}$
• Useful trick for checking your work!
• Full explanation in the lecture notes
What shape should derivatives be?
• $\frac{\partial s}{\partial \mathbf{b}}$ is a row vector
• But convention says our gradient should be a column vector because $\mathbf{b}$ is a column vector...
• Disagreement between Jacobian form (which makes the chain rule easy) and the shape convention (which makes implementing SGD easy)
  • We expect answers to follow the shape convention
  • But Jacobian form is useful for computing the answers
What shape should derivatives be?
• Two options:
• 1. Use Jacobian form as much as possible; reshape to follow the convention at the end:
  • What we just did. But at the end, transpose to make the derivative a column vector, resulting in $\boldsymbol{\delta}^\top$
• 2. Always follow the convention
  • Look at dimensions to figure out when to transpose and/or reorder terms
Notes on PA1
• Don't worry if you used some other method for gradient computation (as long as your answer is right and you are consistent!)
• This lecture we computed the gradient of the score, but in PA1 it's of the loss
• Don't forget to replace f' with the actual derivative
• PA1 uses a different form for the linear transformation: the gradients are different!
Backpropagation
• Compute gradients algorithmically
• Converting what we just did by hand into an algorithm
• Used by deep learning frameworks (TensorFlow, PyTorch, etc.)
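As an aside (not from the slides), here is a minimal PyTorch sketch of a framework computing this kind of gradient automatically; the sigmoid choice and sizes are illustrative:

import torch

torch.manual_seed(0)
n, m = 4, 6
W = torch.randn(n, m, requires_grad=True)
b = torch.randn(n, requires_grad=True)
u = torch.randn(n)
x = torch.randn(m)

s = u @ torch.sigmoid(W @ x + b)    # the score
s.backward()                        # backpropagation, done for us
print(b.grad.shape, W.grad.shape)   # gradients with the shapes of b and W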
Computational Graphs
• Representing our neural net equations as a graph
  • Source nodes: inputs
  • Interior nodes: operations
  • Edges pass along the result of the operation
• Running the graph from inputs to outputs: "Forward Propagation"
Backpropagation
• Go backwards along edges
• Pass along gradients
Backpropagation: Single Node
• Node receives an "upstream gradient"
• Goal is to pass on the correct "downstream gradient"
• Each node has a local gradient
  • The gradient of its output with respect to its input
• Chain rule!
• [downstream gradient] = [upstream gradient] x [local gradient]
Backpropagation: Single Node
• What about nodes with multiple inputs (e.g., a * node)?
• Multiple inputs -> multiple local gradients, one downstream gradient per input
An Example
$f(x, y, z) = (x + y) \cdot \max(y, z)$, evaluated at $x = 1$, $y = 2$, $z = 0$

Forward prop steps:
$a = x + y = 3$
$b = \max(y, z) = 2$
$f = a \cdot b = 6$

Local gradients:
$\frac{\partial a}{\partial x} = 1$, $\frac{\partial a}{\partial y} = 1$
$\frac{\partial b}{\partial y} = 1$ (since $y > z$), $\frac{\partial b}{\partial z} = 0$
$\frac{\partial f}{\partial a} = b = 2$, $\frac{\partial f}{\partial b} = a = 3$

Backward pass (upstream * local = downstream):
$\frac{\partial f}{\partial f} = 1$
* node: downstream to $a$ is $1 \cdot 2 = 2$; downstream to $b$ is $1 \cdot 3 = 3$
max node: downstream to $y$ is $3 \cdot 1 = 3$; downstream to $z$ is $3 \cdot 0 = 0$
+ node: downstream to $x$ is $2 \cdot 1 = 2$; downstream to $y$ is $2 \cdot 1 = 2$
Gradients add at branches
• When a variable feeds into multiple nodes, its gradient is the sum of the gradients from each branch
• Here $y$ flows into both the + node and the max node, so $\frac{\partial f}{\partial y} = 2 + 3 = 5$
Node Intuitions
• + "distributes" the upstream gradient to each of its inputs
• max "routes" the upstream gradient: the larger input receives it, the other receives 0
• * "switches" the upstream gradient: each input's gradient is the upstream gradient times the other input's value
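A minimal Python sketch of this worked example, doing the forward and backward passes by hand (the names a and b for the intermediate nodes are just for exposition):

# f(x, y, z) = (x + y) * max(y, z), evaluated at x = 1, y = 2, z = 0
x, y, z = 1.0, 2.0, 0.0

# forward pass
a = x + y            # + node: 3
b = max(y, z)        # max node: 2
f = a * b            # * node: 6

# backward pass: downstream = upstream * local
df_df = 1.0
df_da = df_df * b                 # * node "switches": 1 * 2 = 2
df_db = df_df * a                 # * node "switches": 1 * 3 = 3
df_dx = df_da * 1.0               # + node "distributes": 2
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)   # gradients add at branches: 2 + 3 = 5
df_dz = df_db * (1.0 if z > y else 0.0)                  # max "routes": 0

print(f, df_dx, df_dy, df_dz)     # 6.0 2.0 5.0 0.0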
Efficiency: compute all gradients at once
• Incorrect way of doing backprop:
  • First compute the gradient with respect to one parameter
  • Then independently compute the gradient with respect to another
  • Duplicated computation!
• Correct way:
  • Compute all the gradients at once
  • Analogous to using $\boldsymbol{\delta}$ when we computed gradients by hand
Backprop Implementations
Implementation: forward/backward API
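The original slides show framework-style code here; below is a minimal Python sketch of what such a forward/backward API can look like (the class and method names are illustrative, not taken from any particular framework):

class MultiplyGate:
    """A * node: forward caches its inputs, backward applies downstream = upstream * local."""
    def forward(self, a, b):
        self.a, self.b = a, b          # cache inputs for the backward pass
        return a * b
    def backward(self, upstream):
        # local gradients: d(a*b)/da = b, d(a*b)/db = a
        return upstream * self.b, upstream * self.a

class AddGate:
    """A + node: it distributes the upstream gradient to both inputs."""
    def forward(self, a, b):
        return a + b
    def backward(self, upstream):
        return upstream, upstream

# chaining gates reproduces the (x + y) * max(y, z)-style example above
add, mul = AddGate(), MultiplyGate()
out = mul.forward(add.forward(1.0, 2.0), 2.0)
d_sum, d_right = mul.backward(1.0)
d_x, d_y = add.backward(d_sum)
print(out, d_x, d_y, d_right)    # 6.0 2.0 2.0 3.0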
Alternative to backprop: Numeric Gradient
• For small h, $f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$
• Easy to implement
• But approximate and very slow:
  • Have to recompute f for every parameter of our model
• Useful for checking your implementation
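A minimal sketch of a numeric gradient check in NumPy, applied to the score function from earlier in the lecture (the sigmoid choice and the sizes are again illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numeric_gradient(f, theta, h=1e-5):
    """Central-difference estimate of df/dtheta, recomputing f twice per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += h
        theta_minus[i] -= h
        grad[i] = (f(theta_plus) - f(theta_minus)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
n, m = 4, 6
W, b = rng.normal(size=(n, m)), rng.normal(size=n)
u, x = rng.normal(size=n), rng.normal(size=m)

score = lambda b_: u @ sigmoid(W @ x + b_)       # score as a function of b
h_vec = sigmoid(W @ x + b)
analytic = u * h_vec * (1 - h_vec)               # the by-hand gradient ds/db from earlier
print(np.allclose(numeric_gradient(score, b), analytic))   # expect True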
Summary
• Backpropagation: recursively apply the chain rule along the computational graph
  • [downstream gradient] = [upstream gradient] x [local gradient]
• Forward pass: compute results of operations and save intermediate values
• Backward pass: apply the chain rule to compute gradients
Project Types
1. Apply an existing neural network model to a new task
2. Implement a complex neural architecture
  • This is what PA4 will have you do!
3. Come up with a new model/training algorithm/etc.
  • Get 1 or 2 working first
• See the project page for some inspiration
Must-haves (choose-your-own final project)
• 10,000+ labeled examples by the milestone
• Feasible task
• Automatic evaluation metric
• NLP is central
Details matter!
• Split your data into train/dev/test: only look at test for final experiments
• Look at your data, collect summary statistics
• Look at your model's outputs, do error analysis
• Tuning hyperparameters is important
• Write-up quality is important
  • Look at last year's prize winners for examples
Project Advice
• Implement the simplest possible model first (e.g., average word vectors and apply logistic regression) and improve it
  • Having a baseline system is crucial
• First overfit your model to the train set (get really good training set results)
  • Then regularize it so it does well on the dev set
• Start early!