Transcript
Page 1:

Deep Learning Basics Lecture 2: Backpropagation

Princeton University COS 495

Instructor: Yingyu Liang

Page 2:

How to train the dragon?

[Figure: feedforward network with input $x$, hidden layers $h_1, h_2, \dots, h_L$, and output $y$]

Page 3:

How to get the expected output

Loss of the system: $l(x; \theta) = l(f_\theta, x, y)$

[Figure: input $x$ passes through $f_\theta(x)$ and is compared with the target $y$; currently the loss $l(x; \theta) \neq 0$]

Page 4:

How to get the expected output

Find direction $d$ so that the new loss $l(x; \theta + d) \approx 0$

[Figure: the same network, now with perturbed parameters $\theta + d$ and loss $l(x; \theta + d)$]

Page 5:

How to get the expected output

How to find $d$: by the first-order (Taylor) approximation, $l(x; \theta + \epsilon v) \approx l(x; \theta) + \nabla l(x; \theta) \cdot \epsilon v$ for a small scalar $\epsilon$

Goal: $l(x; \theta + d) \approx 0$

[Figure: the network with parameters $\theta + d$ and loss $l(x; \theta + d)$]

Page 6:

How to get the expected output

Conclusion: move $\theta$ along $-\nabla l(x; \theta)$ for a small amount (a sketch of one such step follows below)

[Figure: the network with updated parameters $\theta + d$ and loss $l(x; \theta + d)$]

Page 7:

Neural Networks as real circuits

Pictorial illustration of gradient descent

Page 8:

Gradient

• Gradient of the loss is simple
• E.g., for the squared loss $l(f_\theta, x, y) = (f_\theta(x) - y)^2 / 2$:

$\frac{\partial l}{\partial \theta} = (f_\theta(x) - y)\,\frac{\partial f}{\partial \theta}$

• Key part: gradient of the hypothesis (a small numeric check of this formula follows below)
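As an illustration (not from the slides), a minimal sketch that checks the squared-loss gradient numerically for a hypothetical scalar hypothesis $f_\theta(x) = \theta x$, where $\partial f / \partial \theta = x$:

```python
import numpy as np

# Hypothetical hypothesis f_theta(x) = theta * x, so df/dtheta = x.
def f(theta, x):
    return theta * x

def loss(theta, x, y):
    return 0.5 * (f(theta, x) - y) ** 2

theta, x, y = 0.7, 2.0, 1.0

# Analytic gradient from the slide: dl/dtheta = (f(x) - y) * df/dtheta.
analytic = (f(theta, x) - y) * x

# Finite-difference check of the same quantity.
h = 1e-6
numeric = (loss(theta + h, x, y) - loss(theta - h, x, y)) / (2 * h)

print(analytic, numeric)  # the two values should agree closely
```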

Page 9:

Open the box: real circuit

Page 10:

Single neuron

[Figure: inputs $x_1$ and $x_2$ feed a subtraction node that outputs $f$]

Function: $f = x_1 - x_2$

Page 11:

Single neuron

[Figure: the same neuron, with the edge from $x_1$ labeled $1$ and the edge from $x_2$ labeled $-1$]

Function: $f = x_1 - x_2$

Gradient: $\frac{\partial f}{\partial x_1} = 1$, $\frac{\partial f}{\partial x_2} = -1$

Page 12:

Two neurons

[Figure: $x_3$ and $x_4$ feed an addition node that outputs $x_2$; $x_1$ and $x_2$ feed the subtraction node that outputs $f$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$

Page 13:

Two neurons

[Figure: the two-neuron circuit, with the edges into the subtraction node labeled $1$ and $-1$, and the edges from $x_3$ and $x_4$ annotated with $\frac{\partial x_2}{\partial x_3} = 1$ and $\frac{\partial x_2}{\partial x_4} = 1$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$

Gradient: $\frac{\partial x_2}{\partial x_3} = 1$, $\frac{\partial x_2}{\partial x_4} = 1$. What about $\frac{\partial f}{\partial x_3}$?

Page 14:

Two neurons

[Figure: the two-neuron circuit; the backward pass writes $-1$ on the edges from $x_3$ and $x_4$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$

Gradient (chain rule): $\frac{\partial f}{\partial x_3} = \frac{\partial f}{\partial x_2}\,\frac{\partial x_2}{\partial x_3} = (-1) \times 1 = -1$
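As an illustration (not from the slides), the same chain-rule computation written as code; the forward values are hypothetical, and the local derivatives are multiplied backward through the two nodes:

```python
# Forward pass through the two-neuron circuit (values are hypothetical).
x1, x3, x4 = 2.0, 1.0, 4.0
x2 = x3 + x4          # addition neuron
f = x1 - x2           # subtraction neuron

# Backward pass: multiply local derivatives along the path f <- x2 <- x3.
df_dx2 = -1.0         # from the subtraction neuron
dx2_dx3 = 1.0         # from the addition neuron
df_dx3 = df_dx2 * dx2_dx3   # chain rule: -1
print(df_dx3)
```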

Page 15:

Multiple input

[Figure: $x_3$, $x_4$, and $x_5$ all feed the addition node that outputs $x_2$; the edge from $x_5$ is annotated with $\frac{\partial x_2}{\partial x_5} = 1$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_5 + x_4)$

Gradient: $\frac{\partial x_2}{\partial x_5} = 1$

Page 16:

Multiple input

[Figure: the multiple-input circuit; the backward pass writes $-1$ on the edges from $x_3$, $x_4$, and $x_5$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_5 + x_4)$

Gradient: $\frac{\partial f}{\partial x_5} = \frac{\partial f}{\partial x_2}\,\frac{\partial x_2}{\partial x_5} = (-1) \times 1 = -1$

Page 17:

Weights on the edges

[Figure: $x_3$ and $x_4$ feed the addition node through edges with weights $w_3$ and $w_4$; the addition node outputs $x_2$]

Function: $f = x_1 - x_2 = x_1 - (w_3 x_3 + w_4 x_4)$

Page 18:

Weights on the edges

[Figure: the same weighted circuit, with weights $w_3$ and $w_4$ on the edges from $x_3$ and $x_4$]

Function: $f = x_1 - x_2 = x_1 - (w_3 x_3 + w_4 x_4)$

Page 19:

Weights on the edges

[Figure: the weighted circuit; the backward pass writes $-x_3$ and $-x_4$ on the weight edges $w_3$ and $w_4$]

Function: $f = x_1 - x_2 = x_1 - (w_3 x_3 + w_4 x_4)$

Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2}\,\frac{\partial x_2}{\partial w_3} = (-1) \times x_3 = -x_3$

Page 20:

Activation

[Figure: the weighted sum of $x_3$ and $x_4$ now passes through an activation $\sigma$ before producing $x_2$]

Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$

Page 21:

Activation

[Figure: the same circuit, with the pre-activation value labeled $net_2$]

Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$

Let $net_2 = w_3 x_3 + w_4 x_4$

Page 22:

Activation

[Figure: the activation circuit with $net_2 = w_3 x_3 + w_4 x_4$ and $x_2 = \sigma(net_2)$]

Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$

Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2}\,\frac{\partial x_2}{\partial net_2}\,\frac{\partial net_2}{\partial w_3} = (-1) \times \sigma' \times x_3 = -\sigma' x_3$

where $\frac{\partial x_2}{\partial net_2} = \sigma'$ and $\frac{\partial net_2}{\partial w_3} = x_3$

Page 23:

Activation

[Figure: the activation circuit; the backward pass writes $-\sigma'$ at $net_2$ and $-\sigma' x_3$ on the weight edge $w_3$]

Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$

Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2}\,\frac{\partial x_2}{\partial net_2}\,\frac{\partial net_2}{\partial w_3} = (-1) \times \sigma' \times x_3 = -\sigma' x_3$

Page 24:

Multiple paths

[Figure: $x_3$ now reaches $f$ along two paths: through the weighted edge $w_3$ into $\sigma$, and through an addition node (together with $x_5$) that produces $x_1$]

Function: $f = x_1 - x_2 = (x_3 + x_5) - \sigma(w_3 x_3 + w_4 x_4)$

Page 25:

Multiple paths

[Figure: the same multiple-path circuit; $x_3$ feeds both the weighted edge $w_3$ and the addition node, with pre-activation $net_2$]

Function: $f = x_1 - x_2 = (x_3 + x_5) - \sigma(w_3 x_3 + w_4 x_4)$

Page 26:

Multiple paths

[Figure: the multiple-path circuit with pre-activation $net_2$]

Function: $f = x_1 - x_2 = (x_3 + x_5) - \sigma(w_3 x_3 + w_4 x_4)$

Gradient (sum over the two paths):
$\frac{\partial f}{\partial x_3} = \frac{\partial f}{\partial x_2}\,\frac{\partial x_2}{\partial net_2}\,\frac{\partial net_2}{\partial x_3} + \frac{\partial f}{\partial x_1}\,\frac{\partial x_1}{\partial x_3} = (-1) \times \sigma' \times w_3 + 1 \times 1 = -\sigma' w_3 + 1$
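As an illustration (not from the slides), the two-path gradient computed in code, again assuming $\sigma$ is the sigmoid and hypothetical input values; the contributions of the two paths are accumulated by addition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass (hypothetical values).
x3, x4, x5 = 0.5, -2.0, 1.5
w3, w4 = 0.8, 0.3
x1 = x3 + x5                    # path 1: x3 -> + -> x1
net2 = w3 * x3 + w4 * x4        # path 2: x3 -> (w3) -> net2
x2 = sigmoid(net2)
f = x1 - x2

# Backward pass: each path contributes one term, and the terms add up.
sigma_prime = x2 * (1.0 - x2)
path_through_sigma = (-1.0) * sigma_prime * w3
path_through_plus = 1.0 * 1.0
df_dx3 = path_through_sigma + path_through_plus   # = -sigma' * w3 + 1
print(df_dx3)
```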

Page 27:

Summary

• Forward to compute $f$

• Backward to compute the gradients (a code sketch of one forward/backward pass follows below)

[Figure: a small network in which $x_1$ and $x_2$ feed $net_{11}$ and $net_{21}$, each passed through $\sigma$ to give $h_{11}$ and $h_{21}$, which an addition node combines into $f$]
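As an illustration (not from the slides), a compact forward/backward sketch for a network of this shape; the sigmoid activation, the weights, and the input values are all hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and inputs for the two-hidden-unit network.
x = np.array([1.0, -0.5])              # x1, x2
W = np.array([[0.2, 0.4],              # weights into net_11
              [-0.3, 0.1]])            # weights into net_21

# Forward: compute f.
net = W @ x                            # net_11, net_21
h = sigmoid(net)                       # h_11, h_21
f = h.sum()                            # addition node

# Backward: compute the gradients by reversing each step.
df_dh = np.ones_like(h)                # + node passes gradient 1 to each input
df_dnet = df_dh * h * (1.0 - h)        # through sigma: multiply by sigma'(net)
df_dW = np.outer(df_dnet, x)           # d net_i / d W_ij = x_j
df_dx = W.T @ df_dnet                  # gradients w.r.t. the inputs
print(df_dW)
print(df_dx)
```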

Page 28:

Math form

Page 29:

Gradient descent

• Minimize loss $L(\theta)$, where the hypothesis is parametrized by $\theta$

• Gradient descent (a code sketch of this loop follows below):
  • Initialize $\theta^0$
  • $\theta^{t+1} = \theta^t - \eta_t \nabla L(\theta^t)$

Page 30:

Stochastic gradient descent (SGD)

• Suppose data points arrive one by one

• $L(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t)$, but we only know $l(\theta, x_t, y_t)$ at time $t$

• Idea: simply do what you can based on local information (a code sketch follows below)
  • Initialize $\theta^0$
  • $\theta^{t+1} = \theta^t - \eta_t \nabla l(\theta^t, x_t, y_t)$

Page 31:

Mini-batch

• Instead of one data point, work with a small batch of $b$ points

$(x_{tb+1}, y_{tb+1}), \dots, (x_{tb+b}, y_{tb+b})$

• Update rule (a code sketch follows below):

$\theta^{t+1} = \theta^t - \eta_t \nabla \frac{1}{b} \sum_{1 \le i \le b} l(\theta^t, x_{tb+i}, y_{tb+i})$

• Typical batch size: $b = 128$
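As an illustration (not from the slides), the mini-batch version of the same loop: each update averages the gradient over $b$ points. The linear model and synthetic data are the same hypothetical setup as above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 1024, 128                     # dataset size and batch size
X = rng.normal(size=n)
Y = 2.0 * X + 0.1 * rng.normal(size=n)

theta = 0.0
eta = 0.1
for t in range(n // b):
    xb = X[t * b:(t + 1) * b]        # the t-th batch: x_{tb+1}, ..., x_{tb+b}
    yb = Y[t * b:(t + 1) * b]
    grad = np.mean((theta * xb - yb) * xb)   # (1/b) * sum of per-point gradients
    theta = theta - eta * grad               # mini-batch update
print(theta)                         # moves toward the true slope 2.0
```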

