CS224N/Ling284 - Stanford University
TRANSCRIPT
Natural Language Processing with Deep Learning
CS224N/Ling284
Lecture 5: Backpropagation
Kevin Clark
Announcements
• Assignment 1 due Thursday, 11:59
  • You can use up to 3 late days (making it due Sunday at midnight)
• Default final project will be released February 1st
  • To help you choose which project option you want to do
• Final project proposal due February 8th
  • See website for details and inspiration
Overview
Today:
• From one-layer to multilayer neural networks!
• Fully vectorized gradient computation
• The backpropagation algorithm
• (Time permitting) Class project tips
Remember: One-layer Neural Net
x = [x_museums ; x_in ; x_Paris ; x_are ; x_amazing]
Two-layer Neural Net
x = [x_museums ; x_in ; x_Paris ; x_are ; x_amazing]
Repeat as Needed!
x = [x_museums ; x_in ; x_Paris ; x_are ; x_amazing]
Why Have Multiple Layers?
• Hierarchical representations -> neural net can represent complicated features
• Better results!

#Layers    Machine Translation Score
2          23.7
4          25.3
8          25.5

From Transformer network (will cover in a later lecture)
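As an illustration of "repeat as needed", here is a minimal NumPy sketch of stacking layers of the form h = f(Wx + b); the layer sizes and the choice of sigmoid are made up for illustration, not taken from the lecture.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=20)                  # e.g. 5 concatenated 4-d word vectors

W1, b1 = rng.normal(size=(10, 20)), np.zeros(10)
W2, b2 = rng.normal(size=(10, 10)), np.zeros(10)
u = rng.normal(size=10)

h1 = sigmoid(W1 @ x + b1)                # layer 1
h2 = sigmoid(W2 @ h1 + b2)               # layer 2 ("repeat as needed")
s = u @ h2                               # scalar score, as in the window model
print(s)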
Remember: Stochastic Gradient Descent
• Update equation: $\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta)$
• $\alpha$ = step size or learning rate
• This Lecture: How do we compute $\nabla_\theta J(\theta)$?
  • By hand
  • Algorithmically (the backpropagation algorithm)
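A minimal sketch of one SGD step in NumPy, assuming we already have a function that returns the gradient of the loss (grad_J below is a stand-in, not something defined in the lecture):

import numpy as np

def sgd_step(theta, grad_J, alpha=0.1):
    """One stochastic gradient descent update: theta <- theta - alpha * gradient."""
    return theta - alpha * grad_J(theta)

# toy example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, lambda t: 2 * t, alpha=0.1)
print(theta)   # close to [0, 0]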
Whylearnallthesedetailsaboutgradients?
• Moderndeeplearningframeworkscomputegradientsforyou• Butwhytakeaclassoncompilersorsystemswhentheyare
implementedforyou?• Understandingwhatisgoingonunderthehoodisuseful!
• BackpropagaGondoesn’talwaysworkperfectly.• Understandingwhyiscrucialfordebuggingandimproving
models• Exampleinfuturelecture:explodingandvanishinggradients
10
Quickly Computing Gradients by Hand
• Review of multivariable derivatives
• Fully vectorized gradients
  • Much faster and more useful than non-vectorized gradients
  • But doing a non-vectorized gradient can be good practice; see the slides from last week's lecture for an example
• Lecture notes cover this material in more detail
Gradients
• Given a function with 1 output and n inputs: $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n)$
• Its gradient is a vector of partial derivatives with respect to each input:
  $\frac{\partial f}{\partial \mathbf{x}} = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]$
Jacobian Matrix: Generalization of the Gradient
• Given a function with m outputs and n inputs: $\mathbf{f}(\mathbf{x}) = [f_1(x_1, \ldots, x_n), \ldots, f_m(x_1, \ldots, x_n)]$
• Its Jacobian is an m x n matrix of partial derivatives:
  $\left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)_{ij} = \frac{\partial f_i}{\partial x_j}$
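A small worked instance (not from the slides) to make the definition concrete: for $\mathbf{f}(x_1, x_2) = [\,x_1 x_2,\; x_1 + x_2\,]$ (m = 2 outputs, n = 2 inputs),

$\frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2}
\end{bmatrix}
=
\begin{bmatrix}
x_2 & x_1 \\
1 & 1
\end{bmatrix}$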
Chain Rule For Jacobians
• For one-variable functions: multiply derivatives. If $z = f(y)$ and $y = g(x)$, then $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
• For multiple variables: multiply Jacobians. If $\mathbf{h} = f(\mathbf{z})$ and $\mathbf{z} = g(\mathbf{x})$, then $\frac{\partial \mathbf{h}}{\partial \mathbf{x}} = \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{x}}$
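A worked instance (an illustration, not copied from the slides), using the composition that reappears later in the lecture and the elementwise-activation Jacobian derived on the next slide: with $\mathbf{h} = f(\mathbf{z})$ applied elementwise and $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$,

$\frac{\partial \mathbf{h}}{\partial \mathbf{x}} = \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \mathrm{diag}(f'(\mathbf{z}))\, \mathbf{W}$

an (n x n) Jacobian times an (n x m) Jacobian, giving an (n x m) Jacobian overall.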
Example Jacobian: Activation Function
• $\mathbf{h} = f(\mathbf{z})$, where $f$ is applied elementwise: $h_i = f(z_i)$
• Function has n outputs and n inputs -> n by n Jacobian
• $\left( \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \right)_{ij} = \frac{\partial h_i}{\partial z_j} = f'(z_i)$ if $i = j$, and $0$ otherwise
• So the Jacobian is a diagonal matrix: $\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathrm{diag}(f'(\mathbf{z}))$
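A minimal NumPy sketch of this Jacobian for a sigmoid activation (the choice of sigmoid is illustrative; the lecture's f is a generic elementwise nonlinearity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([0.5, -1.0, 2.0])
# Jacobian of h = sigmoid(z) with respect to z: the diagonal matrix diag(f'(z))
jacobian = np.diag(sigmoid_prime(z))
print(jacobian)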
Other Jacobians
• Compute these at home for practice! For example:
  • $\frac{\partial}{\partial \mathbf{x}} (\mathbf{W}\mathbf{x} + \mathbf{b})$
  • $\frac{\partial}{\partial \mathbf{b}} (\mathbf{W}\mathbf{x} + \mathbf{b})$
  • $\frac{\partial}{\partial \mathbf{u}} (\mathbf{u}^\top \mathbf{h})$
• Check your answers with the lecture notes
Back to Neural Nets!
x = [x_museums ; x_in ; x_Paris ; x_are ; x_amazing]
• Let's find $\frac{\partial s}{\partial \mathbf{b}}$
• In practice we care about the gradient of the loss, but we will compute the gradient of the score for simplicity
1. Break up equations into simple pieces
$s = \mathbf{u}^\top \mathbf{h}$
$\mathbf{h} = f(\mathbf{z})$
$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$
$\mathbf{x}$ (input: the concatenated word vectors)
2. Apply the chain rule
$\frac{\partial s}{\partial \mathbf{b}} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{b}}$
3. Write out the Jacobians
Useful Jacobians from the previous slides:
$\frac{\partial s}{\partial \mathbf{h}} = \mathbf{u}^\top$
$\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathrm{diag}(f'(\mathbf{z}))$
$\frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{I}$
Putting them together:
$\frac{\partial s}{\partial \mathbf{b}} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{u}^\top \mathrm{diag}(f'(\mathbf{z}))\, \mathbf{I} = \mathbf{u}^\top \mathrm{diag}(f'(\mathbf{z}))$ (element $i$ is $u_i f'(z_i)$)
Re-using Computation
• Suppose we now want to compute $\frac{\partial s}{\partial \mathbf{W}}$
• Using the chain rule again:
  $\frac{\partial s}{\partial \mathbf{W}} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{W}}$
• The first two terms are the same as in $\frac{\partial s}{\partial \mathbf{b}}$! Let's avoid duplicated computation...
• Define $\boldsymbol{\delta} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathbf{u}^\top \mathrm{diag}(f'(\mathbf{z}))$, the local error signal
Derivative with respect to a Matrix
• What does $\frac{\partial s}{\partial \mathbf{W}}$ look like?
• 1 output, nm inputs: 1 by nm Jacobian?
  • Inconvenient to do
• Instead, follow the convention: the shape of the gradient is the shape of the parameters
• So $\frac{\partial s}{\partial \mathbf{W}}$ is n by m, a matrix whose (i, j) entry is $\frac{\partial s}{\partial W_{ij}}$, matching the shape of $\mathbf{W}$
Derivative with respect to a Matrix
• Remember $\boldsymbol{\delta} = \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}}$
• $\boldsymbol{\delta}$ is going to be in our answer
• The other term should be $\mathbf{x}$ because $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$
• It turns out $\frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\delta}^\top \mathbf{x}^\top$, the outer product of $\boldsymbol{\delta}$ (1 by n) and $\mathbf{x}$ (m by 1)
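A minimal NumPy sketch of these by-hand gradients for the score s = u^T f(Wx + b), with a sigmoid as an illustrative choice of f and made-up dimensions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, m = 4, 6
W, b = rng.normal(size=(n, m)), rng.normal(size=n)
u, x = rng.normal(size=n), rng.normal(size=m)

z = W @ x + b
h = sigmoid(z)
s = u @ h                         # the score

f_prime = h * (1 - h)             # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
delta = u * f_prime               # delta = u^T diag(f'(z)), stored as a vector

grad_b = delta                    # shape (n,), matches the shape of b
grad_W = np.outer(delta, x)       # delta^T x^T, shape (n, m), matches the shape of W
print(grad_b.shape, grad_W.shape)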
Why the Transposes?
• Hacky answer: this makes the dimensions work out
  • $\boldsymbol{\delta}^\top$ is n by 1 and $\mathbf{x}^\top$ is 1 by m, so $\boldsymbol{\delta}^\top \mathbf{x}^\top$ is n by m, the shape of $\mathbf{W}$
• Useful trick for checking your work!
• Full explanation in the lecture notes
What shape should derivatives be?
• $\frac{\partial s}{\partial \mathbf{b}}$ is a row vector
• But convention says our gradient should be a column vector because $\mathbf{b}$ is a column vector...
• Disagreement between Jacobian form (which makes the chain rule easy) and the shape convention (which makes implementing SGD easy)
  • We expect answers to follow the shape convention
  • But Jacobian form is useful for computing the answers
What shape should derivatives be?
• Two options:
• 1. Use Jacobian form as much as possible; reshape to follow the convention at the end:
  • What we just did. But at the end, transpose to make the derivative a column vector, resulting in $\boldsymbol{\delta}^\top$
• 2. Always follow the convention
  • Look at dimensions to figure out when to transpose and/or reorder terms
Notes on PA1
• Don't worry if you used some other method for gradient computation (as long as your answer is right and you are consistent!)
• This lecture we computed the gradient of the score, but in PA1 it's of the loss
• Don't forget to replace f' with the actual derivative
• PA1 uses a different form for the linear transformation: the gradients are different!
Backpropagation
• Compute gradients algorithmically
• Converting what we just did by hand into an algorithm
• Used by deep learning frameworks (TensorFlow, PyTorch, etc.)
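As an aside (not from the slides), here is a minimal PyTorch sketch of a framework computing this kind of gradient automatically; the sigmoid choice and sizes are illustrative:

import torch

torch.manual_seed(0)
n, m = 4, 6
W = torch.randn(n, m, requires_grad=True)
b = torch.randn(n, requires_grad=True)
u = torch.randn(n)
x = torch.randn(m)

s = u @ torch.sigmoid(W @ x + b)    # the score
s.backward()                        # backpropagation, done for us
print(b.grad.shape, W.grad.shape)   # gradients with the shapes of b and W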
Computational Graphs
• Representing our neural net equations as a graph
  • Source nodes: inputs
  • Interior nodes: operations
  • Edges pass along the result of the operation
• Running the graph from inputs to outputs: "Forward Propagation"
Backpropagation
• Go backwards along edges
• Pass along gradients
Backpropagation: Single Node
• Node receives an "upstream gradient"
• Goal is to pass on the correct "downstream gradient"
• Each node has a local gradient
  • The gradient of its output with respect to its input
• Chain rule!
• [downstream gradient] = [upstream gradient] x [local gradient]
Backpropagation: Single Node
• What about nodes with multiple inputs (e.g., a * node)?
• Multiple inputs -> multiple local gradients, one downstream gradient per input
An Example
$f(x, y, z) = (x + y) \cdot \max(y, z)$, evaluated at $x = 1$, $y = 2$, $z = 0$

Forward prop steps:
$a = x + y = 3$
$b = \max(y, z) = 2$
$f = a \cdot b = 6$

Local gradients:
$\frac{\partial a}{\partial x} = 1$, $\frac{\partial a}{\partial y} = 1$
$\frac{\partial b}{\partial y} = 1$ (since $y > z$), $\frac{\partial b}{\partial z} = 0$
$\frac{\partial f}{\partial a} = b = 2$, $\frac{\partial f}{\partial b} = a = 3$

Backward pass (upstream * local = downstream):
$\frac{\partial f}{\partial f} = 1$
* node: downstream to $a$ is $1 \cdot 2 = 2$; downstream to $b$ is $1 \cdot 3 = 3$
max node: downstream to $y$ is $3 \cdot 1 = 3$; downstream to $z$ is $3 \cdot 0 = 0$
+ node: downstream to $x$ is $2 \cdot 1 = 2$; downstream to $y$ is $2 \cdot 1 = 2$
Gradients add at branches
• When a variable feeds into multiple nodes, its gradient is the sum of the gradients from each branch
• Here $y$ flows into both the + node and the max node, so $\frac{\partial f}{\partial y} = 2 + 3 = 5$
Node Intuitions
• + "distributes" the upstream gradient to each of its inputs
• max "routes" the upstream gradient: the larger input receives it, the other receives 0
• * "switches" the upstream gradient: each input's gradient is the upstream gradient times the other input's value
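A minimal Python sketch of this worked example, doing the forward and backward passes by hand (the names a and b for the intermediate nodes are just for exposition):

# f(x, y, z) = (x + y) * max(y, z), evaluated at x = 1, y = 2, z = 0
x, y, z = 1.0, 2.0, 0.0

# forward pass
a = x + y            # + node: 3
b = max(y, z)        # max node: 2
f = a * b            # * node: 6

# backward pass: downstream = upstream * local
df_df = 1.0
df_da = df_df * b                 # * node "switches": 1 * 2 = 2
df_db = df_df * a                 # * node "switches": 1 * 3 = 3
df_dx = df_da * 1.0               # + node "distributes": 2
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)   # gradients add at branches: 2 + 3 = 5
df_dz = df_db * (1.0 if z > y else 0.0)                  # max "routes": 0

print(f, df_dx, df_dy, df_dz)     # 6.0 2.0 5.0 0.0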
Efficiency: compute all gradients at once
• Incorrect way of doing backprop:
  • First compute the gradient with respect to one parameter
  • Then independently compute the gradient with respect to another
  • Duplicated computation!
• Correct way:
  • Compute all the gradients at once
  • Analogous to using $\boldsymbol{\delta}$ when we computed gradients by hand
Backprop Implementations
Implementation: forward/backward API
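The original slides show framework-style code here; below is a minimal Python sketch of what such a forward/backward API can look like (the class and method names are illustrative, not taken from any particular framework):

class MultiplyGate:
    """A * node: forward caches its inputs, backward applies downstream = upstream * local."""
    def forward(self, a, b):
        self.a, self.b = a, b          # cache inputs for the backward pass
        return a * b
    def backward(self, upstream):
        # local gradients: d(a*b)/da = b, d(a*b)/db = a
        return upstream * self.b, upstream * self.a

class AddGate:
    """A + node: it distributes the upstream gradient to both inputs."""
    def forward(self, a, b):
        return a + b
    def backward(self, upstream):
        return upstream, upstream

# chaining gates reproduces the (x + y) * max(y, z)-style example above
add, mul = AddGate(), MultiplyGate()
out = mul.forward(add.forward(1.0, 2.0), 2.0)
d_sum, d_right = mul.backward(1.0)
d_x, d_y = add.backward(d_sum)
print(out, d_x, d_y, d_right)    # 6.0 2.0 2.0 3.0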
Alternative to backprop: Numeric Gradient
• For small h, $f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$
• Easy to implement
• But approximate and very slow:
  • Have to recompute f for every parameter of our model
• Useful for checking your implementation
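A minimal sketch of a numeric gradient check in NumPy, applied to the score function from earlier in the lecture (the sigmoid choice and the sizes are again illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numeric_gradient(f, theta, h=1e-5):
    """Central-difference estimate of df/dtheta, recomputing f twice per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += h
        theta_minus[i] -= h
        grad[i] = (f(theta_plus) - f(theta_minus)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
n, m = 4, 6
W, b = rng.normal(size=(n, m)), rng.normal(size=n)
u, x = rng.normal(size=n), rng.normal(size=m)

score = lambda b_: u @ sigmoid(W @ x + b_)       # score as a function of b
h_vec = sigmoid(W @ x + b)
analytic = u * h_vec * (1 - h_vec)               # the by-hand gradient ds/db from earlier
print(np.allclose(numeric_gradient(score, b), analytic))   # expect True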
Summary
• Backpropagation: recursively apply the chain rule along the computational graph
  • [downstream gradient] = [upstream gradient] x [local gradient]
• Forward pass: compute results of operations and save intermediate values
• Backward pass: apply the chain rule to compute gradients
Project Types
1. Apply an existing neural network model to a new task
2. Implement a complex neural architecture
  • This is what PA4 will have you do!
3. Come up with a new model/training algorithm/etc.
  • Get 1 or 2 working first
• See the project page for some inspiration
Must-haves (choose-your-own final project)
• 10,000+ labeled examples by the milestone
• Feasible task
• Automatic evaluation metric
• NLP is central
Details matter!
• Split your data into train/dev/test: only look at test for final experiments
• Look at your data, collect summary statistics
• Look at your model's outputs, do error analysis
• Tuning hyperparameters is important
• Write-up quality is important
  • Look at last year's prize winners for examples
Project Advice
• Implement the simplest possible model first (e.g., average word vectors and apply logistic regression) and improve it
  • Having a baseline system is crucial
• First overfit your model to the train set (get really good training set results)
  • Then regularize it so it does well on the dev set
• Start early!