
Page 1: CS224N/Ling284 - Stanford University

Natural Language Processing with Deep Learning

CS224N/Ling284

Lecture 5: Backpropagation

Kevin Clark


Page 2: CS224N/Ling284 - Stanford University

Announcements

• Assignment 1 due Thursday, 11:59
  • You can use up to 3 late days (making it due Sunday at midnight)

• Default final project will be released February 1st
  • To help you choose which project option you want to do

• Final project proposal due February 8th
  • See website for details and inspiration

2

Page 3: CS224N/Ling284 - Stanford University

Overview

Today:

• From one-layer to multilayer neural networks!

• Fully vectorized gradient computation

• The backpropagation algorithm

• (Time permitting) Class project tips

3

Page 4: CS224N/Ling284 - Stanford University

Remember: One-layer Neural Net

x = [x_museums  x_in  x_Paris  x_are  x_amazing]

4

Page 5: CS224N/Ling284 - Stanford University

Two-layer Neural Net

x = [x_museums  x_in  x_Paris  x_are  x_amazing]

5

Page 6: CS224N/Ling284 - Stanford University

Repeat as Needed!

x = [x_museums  x_in  x_Paris  x_are  x_amazing]

6
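For concreteness, here is a minimal NumPy sketch of the kind of stacked network these slides describe, assuming a concatenated window vector x, layers of the form f(Wx + b), and a scalar score u·h; all names and sizes are illustrative, not taken from the slides:

import numpy as np

def f(z):                      # elementwise nonlinearity (assumed here to be the sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, h1, h2 = 5 * 4, 8, 8        # e.g. a window of 5 word vectors of dimension 4 (illustrative sizes)

x = rng.standard_normal(d)     # x = [x_museums x_in x_Paris x_are x_amazing] concatenated
W1, b1 = rng.standard_normal((h1, d)),  np.zeros(h1)
W2, b2 = rng.standard_normal((h2, h1)), np.zeros(h2)
u = rng.standard_normal(h2)

h_1 = f(W1 @ x + b1)           # first layer
h_2 = f(W2 @ h_1 + b2)         # second layer: "repeat as needed"
s = u @ h_2                    # scalar score
print(s)

Adding more layers is just repeating the h = f(Wh + b) step, which is the "repeat as needed" idea.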

Page 7: CS224N/Ling284 - Stanford University

Why Have Multiple Layers?

• Hierarchical representations -> the neural net can represent complicated features

• Better results!

  # Layers    Machine Translation Score
  2           23.7
  4           25.3
  8           25.5

From the Transformer network (will cover in a later lecture)

7

Page 8: CS224N/Ling284 - Stanford University

Remember: Stochastic Gradient Descent

• Update equation:

α = step size or learning rate

8

Page 9: CS224N/Ling284 - Stanford University

Remember: Stochastic Gradient Descent

• Update equation:

• This lecture: How do we compute these gradients?
  • By hand
  • Algorithmically (the backpropagation algorithm)

α = step size or learning rate

9
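The update equation itself did not survive extraction; the standard SGD update it refers to, with α as defined above and writing the objective as J(θ), is:

θ_new = θ_old − α ∇_θ J(θ)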

Page 10: CS224N/Ling284 - Stanford University

Why learn all these details about gradients?

• Modern deep learning frameworks compute gradients for you

• But why take a class on compilers or systems when they are implemented for you?
  • Understanding what is going on under the hood is useful!

• Backpropagation doesn't always work perfectly.
  • Understanding why is crucial for debugging and improving models
  • Example in a future lecture: exploding and vanishing gradients

10

Page 11: CS224N/Ling284 - Stanford University

Quickly Computing Gradients by Hand

• Review of multivariable derivatives

• Fully vectorized gradients
  • Much faster and more useful than non-vectorized gradients
  • But doing a non-vectorized gradient can be good practice; see the slides in last week's lecture for an example

• Lecture notes cover this material in more detail

11

Page 12: CS224N/Ling284 - Stanford University

Gradients

• Given a function with 1 output and n inputs

• Its gradient is a vector of partial derivatives

12
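The formulas on this slide were images that did not extract; the standard definitions being described are (writing the inputs as x_1, …, x_n):

f(x) = f(x_1, x_2, …, x_n)

∂f/∂x = [ ∂f/∂x_1 , ∂f/∂x_2 , … , ∂f/∂x_n ]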

Page 13: CS224N/Ling284 - Stanford University

Jacobian Matrix: Generalization of the Gradient

• Given a function with m outputs and n inputs

• Its Jacobian is an m x n matrix of partial derivatives

13
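Again the formula is missing from the extraction; the definition described is a matrix whose (i, j) entry is the partial derivative of the i-th output with respect to the j-th input:

(∂f/∂x)_{ij} = ∂f_i/∂x_j,   an m × n matrix for f : R^n → R^m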

Page 14: CS224N/Ling284 - Stanford University

Chain Rule for Jacobians

• For one-variable functions: multiply derivatives

• For multiple variables: multiply Jacobians

14
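Spelled out (the specific composite function shown on the slide is not recoverable, so this is the general statement): for z = g(y) and y = h(x),

dz/dx = (dz/dy)(dy/dx)   (one variable)

∂z/∂x = (∂z/∂y)(∂y/∂x)   (vectors: a product of Jacobian matrices)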

Page 15: CS224N/Ling284 - Stanford University

Example Jacobian: Activation Function

15

Page 16: CS224N/Ling284 - Stanford University

Example Jacobian: Activation Function

Function has n outputs and n inputs -> n by n Jacobian

16
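The builds on the surrounding slides work this Jacobian out; assuming h = f(z) means applying f to each element (h_i = f(z_i)), the result they arrive at is a diagonal matrix:

(∂h/∂z)_{ij} = f′(z_i) if i = j, and 0 otherwise,   i.e.   ∂h/∂z = diag(f′(z))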

Page 20: CS224N/Ling284 - Stanford University

Other Jacobians

• Compute these at home for practice!

• Check your answers with the lecture notes

20
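The expressions to practice on were images and did not extract. Assuming they are the standard affine-layer Jacobians worked out in the lecture notes, the answers to check against would be:

∂/∂x (Wx + b) = W      ∂/∂b (Wx + b) = I      ∂/∂u (uᵀh) = hᵀ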

Page 24: CS224N/Ling284 - Stanford University

Back to Neural Nets!

x = [x_museums  x_in  x_Paris  x_are  x_amazing]

24

Page 25: CS224N/Ling284 - Stanford University

Back to Neural Nets!

x = [x_museums  x_in  x_Paris  x_are  x_amazing]

• Let's find the gradients of the score with respect to the parameters

• In practice we care about the gradient of the loss, but we will compute the gradient of the score for simplicity

25

Page 26: CS224N/Ling284 - Stanford University

1. Break up equations into simple pieces

26
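As an illustration of this step (the slide's own equations did not extract, so the notation here is assumed): the window-based score from the earlier slides can be broken into intermediate variables, each simple enough to differentiate on its own:

z = Wx + b,     h = f(z),     s = uᵀh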

Page 27: CS224N/Ling284 - Stanford University

2. Apply the chain rule

27
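Continuing with the assumed notation above, applying the chain rule to, say, the gradient with respect to b gives a product of three Jacobians, one per intermediate step:

∂s/∂b = (∂s/∂h)(∂h/∂z)(∂z/∂b)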

Page 31: CS224N/Ling284 - Stanford University

3. Write out the Jacobians

Useful Jacobians from previous slide

31
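Substituting the Jacobians from before (∂s/∂h = uᵀ, ∂h/∂z = diag(f′(z)), ∂z/∂b = I) turns the symbolic product into a concrete expression, still under the assumed notation:

∂s/∂b = uᵀ diag(f′(z)) I = (u ∘ f′(z))ᵀ,   where ∘ is the element-wise product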

Page 36: CS224N/Ling284 - Stanford University

Re-using Computation

• Suppose we now want to compute the gradient with respect to W
• Using the chain rule again:

36

Page 37: CS224N/Ling284 - Stanford University

Re-using Computation

• Suppose we now want to compute the gradient with respect to W
• Using the chain rule again:

The same! Let's avoid duplicated computation…

37
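The repeated factor the slide points out is the first two terms of each chain; naming it once (the usual symbol in the lecture notes is δ, assumed here) lets both gradients reuse it:

δ = (∂s/∂h)(∂h/∂z) = uᵀ diag(f′(z))

∂s/∂b = δ,   and the gradient with respect to W also contains δ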

Page 39: CS224N/Ling284 - Stanford University

Derivative with respect to Matrix

• What does the gradient with respect to W look like?

• 1 output, nm inputs: 1 by nm Jacobian?

• Inconvenient to do

39

Page 40: CS224N/Ling284 - Stanford University

Derivative with respect to Matrix

• What does the gradient with respect to W look like?

• 1 output, nm inputs: 1 by nm Jacobian?

• Inconvenient to do

• Instead follow the convention: the shape of the gradient is the shape of the parameters

• So the gradient with respect to W is n by m:

40

Page 41: CS224N/Ling284 - Stanford University

Derivative with respect to Matrix

• Remember the repeated term from the previous slides
  • It is going to be in our answer

• The other term should be the input to the linear transformation

• It turns out the answer is an outer product of the two (a sketch follows below)

41
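Under the shape convention (the gradient has the same n × m shape as W), and with δ and x as assumed above, the result referred to here works out to an outer product:

∂s/∂W = δᵀ xᵀ     (n × 1 times 1 × m gives n × m)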

Page 42: CS224N/Ling284 - Stanford University

Why the Transposes?

• Hacky answer: this makes the dimensions work out
  • Useful trick for checking your work!

• Full explanation in the lecture notes

42

Page 44: CS224N/Ling284 - Stanford University

What shape should derivatives be?

• The Jacobian we computed is a row vector

• But convention says our gradient should be a column vector, because the parameter it is taken with respect to is a column vector…

• Disagreement between Jacobian form (which makes the chain rule easy) and the shape convention (which makes implementing SGD easy)

• We expect answers to follow the shape convention

• But Jacobian form is useful for computing the answers

44

Page 45: CS224N/Ling284 - Stanford University

What shape should derivatives be?

• Two options:

  1. Use Jacobian form as much as possible, reshape to follow the convention at the end:
     • What we just did. But at the end, transpose to make the derivative a column vector

  2. Always follow the convention
     • Look at dimensions to figure out when to transpose and/or reorder terms.

45

Page 46: CS224N/Ling284 - Stanford University

Notes on PA1

• Don't worry if you used some other method for gradient computation (as long as your answer is right and you are consistent!)

• This lecture we computed the gradient of the score, but in PA1 it's of the loss

• Don't forget to replace f' with the actual derivative

• PA1 uses a different form for the linear transformation: the gradients are different!

46

Page 47: CS224N/Ling284 - Stanford University

Backpropagation

• Compute gradients algorithmically

• Converting what we just did by hand into an algorithm

• Used by deep learning frameworks (TensorFlow, PyTorch, etc.)

47

Page 48: CS224N/Ling284 - Stanford University

Computational Graphs

• Representing our neural net equations as a graph

• Source nodes: inputs

• Interior nodes: operations

48

Page 49: CS224N/Ling284 - Stanford University

Computational Graphs

• Representing our neural net equations as a graph

• Source nodes: inputs

• Interior nodes: operations

• Edges pass along the result of the operation

49

Page 50: CS224N/Ling284 - Stanford University

Computational Graphs

• Representing our neural net equations as a graph

• Source nodes: inputs

• Interior nodes: operations

• Edges pass along the result of the operation

"Forward Propagation"

50

Page 51: CS224N/Ling284 - Stanford University

Backpropagation

• Go backwards along edges
• Pass along gradients

51

Page 52: CS224N/Ling284 - Stanford University

Backpropagation: Single Node

• Node receives an "upstream gradient"

• Goal is to pass on the correct "downstream gradient"

Upstream gradient

Downstream gradient

52

Page 53: CS224N/Ling284 - Stanford University

Backpropagation: Single Node

Downstream gradient

Upstream gradient

• Each node has a local gradient
  • The gradient of its output with respect to its input

Local gradient

53

Page 54: CS224N/Ling284 - Stanford University

Backpropagation: Single Node

Downstream gradient

Upstream gradient

• Each node has a local gradient
  • The gradient of its output with respect to its input

Local gradient

54

Chain rule!

Page 55: CS224N/Ling284 - Stanford University

Backpropagation: Single Node

Downstream gradient

Upstream gradient

• Each node has a local gradient
  • The gradient of its output with respect to its input

Local gradient

• [downstream gradient] = [upstream gradient] x [local gradient]

55

Page 56: CS224N/Ling284 - Stanford University

Backpropagation: Single Node

• What about nodes with multiple inputs?

56

Page 57: CS224N/Ling284 - Stanford University

Backpropagation: Single Node

Downstream gradients

Upstream gradient

Local gradients

• Multiple inputs -> multiple local gradients

57

Page 58: CS224N/Ling284 - Stanford University

An Example

58

Page 59: CS224N/Ling284 - Stanford University

An Example

Forward prop steps

[Computational graph with +, max, and * nodes]

59

Page 60: CS224N/Ling284 - Stanford University

An Example

Forward prop steps

[Computational graph: inputs 1, 2, 0; the + node outputs 3, the max node outputs 2, and the * node outputs 6]

60

Page 61: CS224N/Ling284 - Stanford University

An Example

Forward prop steps

Local gradients

[Same graph: inputs 1, 2, 0; intermediate values 3 and 2; output 6]

61

Page 65: CS224N/Ling284 - Stanford University

An Example

upstream * local = downstream

[Backward pass at the * node: upstream gradient 1; downstream gradients 1 * 3 = 3 and 1 * 2 = 2]

65

Page 66: CS224N/Ling284 - Stanford University

An Example

upstream * local = downstream

[Backward pass at the max node: upstream gradient 3; downstream gradients 3 * 1 = 3 and 3 * 0 = 0]

66

Page 67: CS224N/Ling284 - Stanford University

An Example

upstream * local = downstream

[Backward pass at the + node: upstream gradient 2; downstream gradients 2 * 1 = 2 and 2 * 1 = 2]

67

Page 68: CS224N/Ling284 - Stanford University

An Example

[Completed backward pass: gradient 1 at the output; gradients 3 and 2 flow into the max and + branches; the max node passes on 3 and 0; the + node passes on 2 and 2]

68
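The values on these slides are consistent with the graph computing f(x, y, z) = (x + y) · max(y, z) on inputs x = 1, y = 2, z = 0; that reading is inferred from the figure residue, so treat the following Python as a sketch rather than the slides' own example. It reproduces the forward values (3, 2, 6) and the backward pass, including the branch sum the next slide mentions:

# Forward pass: intermediate values match the slides (a = 3, b = 2, f = 6)
x, y, z = 1.0, 2.0, 0.0
a = x + y            # = 3
b = max(y, z)        # = 2
f = a * b            # = 6

# Backward pass: downstream = upstream * local at each node
df_df = 1.0
df_da = df_df * b    # * node: local gradient w.r.t. a is b  -> 1 * 2 = 2
df_db = df_df * a    # * node: local gradient w.r.t. b is a  -> 1 * 3 = 3
df_dx = df_da * 1.0                                     # + node distributes         -> 2
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)   # gradients add at branches  -> 2 + 3 = 5
df_dz = df_db * (1.0 if z > y else 0.0)                 # max routes to the larger input -> 0

print(df_dx, df_dy, df_dz)  # 2.0 5.0 0.0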

Page 69: CS224N/Ling284 - Stanford University

Gradients add at branches

69

Page 71: CS224N/Ling284 - Stanford University

Node Intuitions

[Same computational graph example as above]

• + "distributes" the upstream gradient

71

Page 72: CS224N/Ling284 - Stanford University

Node Intuitions

[Same computational graph example as above]

• + "distributes" the upstream gradient

• max "routes" the upstream gradient

72

Page 73: CS224N/Ling284 - Stanford University

Node Intuitions

[Same computational graph example as above]

• + "distributes" the upstream gradient

• max "routes" the upstream gradient

• * "switches" the upstream gradient

73

Page 74: CS224N/Ling284 - Stanford University

Efficiency: compute all gradients at once

• Incorrect way of doing backprop:
  • First compute the gradient with respect to one parameter

74

Page 75: CS224N/Ling284 - Stanford University

Efficiency: compute all gradients at once

• Incorrect way of doing backprop:
  • First compute the gradient with respect to one parameter

  • Then independently compute the gradient with respect to another

  • Duplicated computation!

75

Page 76: CS224N/Ling284 - Stanford University

Efficiency: compute all gradients at once

• Correct way:
  • Compute all the gradients at once

• Analogous to re-using the shared intermediate term when we computed gradients by hand

76

Page 77: CS224N/Ling284 - Stanford University

Backprop Implementations

77

Page 78: CS224N/Ling284 - Stanford University

Implementation: forward/backward API

78

Page 79: CS224N/Ling284 - Stanford University

Implementation: forward/backward API

79
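The code on these two slides did not extract. As a rough sketch of what a forward/backward node API usually looks like (the class and method names below are illustrative, not the slides'):

class MultiplyGate:
    """A node that computes z = x * y and backpropagates the upstream gradient."""
    def forward(self, x, y):
        # Save the inputs: they are the local gradients needed in backward
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # downstream = upstream * local  (the * node "switches" its inputs)
        dx = dz * self.y
        dy = dz * self.x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)     # forward pass: 6.0
dx, dy = gate.backward(1.0)      # backward pass with upstream gradient 1.0 -> (2.0, 3.0)
print(out, dx, dy)

Each node caches whatever it needs from the forward pass (here its inputs), which is exactly the "save intermediate values" point from the summary slide.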

Page 80: CS224N/Ling284 - Stanford University

Alternative to backprop: Numeric Gradient

• For small h, a finite-difference quotient approximates each partial derivative

• Easy to implement

• But approximate and very slow:
  • Have to recompute f for every parameter of our model

• Useful for checking your implementation

80
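A minimal sketch of such a check, assuming the usual two-sided finite-difference formula (f(x + h·e_i) − f(x − h·e_i)) / 2h; the function name is illustrative:

import numpy as np

def numeric_gradient(f, x, h=1e-4):
    """Approximate df/dx_i as (f(x + h*e_i) - f(x - h*e_i)) / (2h) for every parameter."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; fp = f(x)   # f has to be recomputed for every parameter...
        x.flat[i] = old - h; fm = f(x)   # ...which is why this is slow, but fine for checking
        x.flat[i] = old
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Example: check the analytic gradient of f(x) = sum(x**2), which is 2x
x = np.array([1.0, -2.0, 3.0])
print(numeric_gradient(lambda v: np.sum(v ** 2), x))   # approx [ 2. -4.  6.]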

Page 81: CS224N/Ling284 - Stanford University

Summary

• Backpropagation: recursively apply the chain rule along a computational graph
  • [downstream gradient] = [upstream gradient] x [local gradient]

• Forward pass: compute results of operations and save intermediate values

• Backward pass: apply the chain rule to compute gradients

81

Page 82: CS224N/Ling284 - Stanford University

82

Page 83: CS224N/Ling284 - Stanford University

Project Types

1. Apply an existing neural network model to a new task

2. Implement a complex neural architecture(s)
   • This is what PA4 will have you do!

3. Come up with a new model / training algorithm / etc.
   • Get 1 or 2 working first

• See the project page for some inspiration

83

Page 84: CS224N/Ling284 - Stanford University

Must-haves (choose-your-own final project)

• 10,000+ labeled examples by the milestone

• Feasible task

• Automatic evaluation metric

• NLP is central

84

Page 85: CS224N/Ling284 - Stanford University

Details matter!

• Split your data into train/dev/test: only look at test for final experiments

• Look at your data, collect summary statistics

• Look at your model's outputs, do error analysis

• Tuning hyperparameters is important

• Write-up quality is important
  • Look at last year's prize winners for examples

85

Page 86: CS224N/Ling284 - Stanford University

Project Advice

• Implement the simplest possible model first (e.g., average word vectors and apply logistic regression) and improve it
  • Having a baseline system is crucial

• First overfit your model to the train set (get really good training set results)
  • Then regularize it so it does well on the dev set

• Start early!

86