Deep Learning Basics Lecture 2: Backpropagation
Princeton University COS 495
Instructor: Yingyu Liang
How to train the dragon?
[Figure: a feedforward network with input $x$, hidden layers $h_1, h_2, \dots, h_L$, and output $y$.]
How to get the expected output
Loss of the system: $f(x; \theta) = l(f_\theta, x, y)$
[Figure: input $x$ and label $y$ go into the system; we want the loss $f(x; \theta) \approx 0$.]
Find a direction $\epsilon$ so that the new loss $f(x; \theta + \epsilon) \approx 0$.
How to find $\epsilon$: $f(x; \theta + \epsilon v) \approx f(x; \theta) + \nabla f(x; \theta) \cdot \epsilon v$ for a small scalar $\epsilon$.
Conclusion: move $\theta$ along $-\nabla f(x; \theta)$ for a small amount.
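To make this concrete, here is a minimal numeric sketch (not from the slides; the toy quadratic loss and step size are arbitrary) showing that a small move along $-\nabla f$ decreases the loss:

```python
# A toy check (not from the slides): a small step along -grad f lowers f.
def f(theta):
    return (theta - 3.0) ** 2            # toy quadratic loss, minimum at 3

def grad_f(theta):
    return 2.0 * (theta - 3.0)           # its exact gradient

theta = 0.0
eps = 0.1                                # small step size
theta_new = theta - eps * grad_f(theta)  # move along the negative gradient
print(f(theta), f(theta_new))            # 9.0 5.76 -- the loss decreased
```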
Neural Networks as real circuits
[Figure: pictorial illustration of gradient descent.]
Gradient
• Gradient of the loss is simple
• E.g., $l(f_\theta, x, y) = (f_\theta(x) - y)^2 / 2$, so $\frac{\partial l}{\partial f_\theta} = f_\theta(x) - y$
• Key part: gradient of the hypothesis
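A quick sanity check of this derivative (a sketch with arbitrary values), comparing the formula against a finite difference:

```python
# Sanity check (arbitrary values): dl/df = f(x) - y for the squared loss.
def loss(pred, y):
    return (pred - y) ** 2 / 2

pred, y, h = 2.5, 1.0, 1e-6
analytic = pred - y                                          # formula above
numeric = (loss(pred + h, y) - loss(pred - h, y)) / (2 * h)  # finite difference
print(analytic, numeric)                                     # 1.5 vs ~1.5
```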
Open the box: real circuit
Single neuron
[Figure: a single neuron computing subtraction: inputs $x_1, x_2$, output $f$; gradient labels $1$ on the $x_1$ edge and $-1$ on the $x_2$ edge.]
Function: $f = x_1 - x_2$
Gradient: $\frac{\partial f}{\partial x_1} = 1$, $\frac{\partial f}{\partial x_2} = -1$
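As a sketch of the "circuit as code" view (the function names are mine, not the lecture's), the neuron's forward and backward passes are one line each:

```python
# The subtraction neuron as code (a sketch): forward computes f,
# backward returns the local gradients df/dx1 and df/dx2.
def forward(x1, x2):
    return x1 - x2

def backward():
    return 1.0, -1.0

print(forward(5.0, 3.0), backward())   # 2.0 (1.0, -1.0)
```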
Two neurons
[Figure: an addition neuron computes $x_2 = x_3 + x_4$, which feeds the subtraction neuron that computes $f = x_1 - x_2$.]
Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$
Gradient: $\frac{\partial x_2}{\partial x_3} = 1$, $\frac{\partial x_2}{\partial x_4} = 1$. What about $\frac{\partial f}{\partial x_3}$?
By the chain rule: $\frac{\partial f}{\partial x_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial x_3} = -1 \times 1 = -1$
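In code, the chain rule is just a product of the local gradients along the path (a sketch):

```python
# Chain rule for the two-neuron circuit (sketch): multiply the local
# gradients along the path x3 -> x2 -> f.
df_dx2 = -1.0              # local gradient of the subtraction neuron
dx2_dx3 = 1.0              # local gradient of the addition neuron
df_dx3 = df_dx2 * dx2_dx3
print(df_dx3)              # -1.0, matching the slide
```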
Multiple input
[Figure: the addition neuron now takes three inputs $x_3, x_4, x_5$.]
Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4 + x_5)$
Gradient: $\frac{\partial x_2}{\partial x_5} = 1$, so $\frac{\partial f}{\partial x_5} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial x_5} = -1$
Weights on the edges
[Figure: the input edges now carry weights $w_3, w_4$, so the addition neuron computes $x_2 = w_3 x_3 + w_4 x_4$.]
Function: $f = x_1 - x_2 = x_1 - (w_3 x_3 + w_4 x_4)$
Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial w_3} = -1 \times x_3 = -x_3$, and similarly $\frac{\partial f}{\partial w_4} = -x_4$
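A sketch of the same computation with an arbitrary input value; a weight's gradient is the upstream gradient times the input carried on that weight's edge:

```python
# Sketch with an arbitrary input: the gradient w.r.t. a weight is the
# upstream gradient times the input on that weight's edge.
x3 = 2.0                   # arbitrary input on w3's edge
df_dx2 = -1.0              # upstream gradient from the subtraction neuron
dx2_dw3 = x3               # local gradient of w3*x3 w.r.t. w3
df_dw3 = df_dx2 * dx2_dw3  # = -x3, matching the slide
print(df_dw3)              # -2.0
```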
Activation
[Figure: the neuron now applies an activation $\sigma$ to its pre-activation, labeled $net_2$.]
Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$
Let $net_2 = w_3 x_3 + w_4 x_4$, so $x_2 = \sigma(net_2)$.
Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial net_2} \frac{\partial net_2}{\partial w_3} = -1 \times \sigma' \times x_3 = -\sigma' x_3$,
where $\frac{\partial x_2}{\partial net_2} = \sigma'$ and $\frac{\partial net_2}{\partial w_3} = x_3$.
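A sketch with arbitrary values (taking $\sigma$ to be the logistic sigmoid, so $\sigma' = \sigma(1-\sigma)$), verifying the chain-rule gradient against a finite difference:

```python
# Sketch: backprop through x2 = sigma(net2), net2 = w3*x3 + w4*x4, f = x1 - x2.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x3, x4, w3, w4 = 1.0, 2.0, 3.0, 0.5, -0.5    # arbitrary values

def f(w3):
    # f = x1 - sigma(w3*x3 + w4*x4), viewed as a function of w3
    return x1 - sigma(w3 * x3 + w4 * x4)

net2 = w3 * x3 + w4 * x4
sigma_prime = sigma(net2) * (1.0 - sigma(net2))  # sigma'(net2)
df_dw3 = -1.0 * sigma_prime * x3                 # chain rule from the slide

h = 1e-6
numeric = (f(w3 + h) - f(w3 - h)) / (2 * h)      # finite-difference check
print(df_dw3, numeric)                           # should agree closely
```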
Multiple paths
[Figure: $x_3$ also feeds a new addition neuron computing $x_1 = x_3 + x_5$, so $x_3$ reaches $f$ along two paths.]
Function: $f = x_1 - x_2 = (x_3 + x_5) - \sigma(w_3 x_3 + w_4 x_4)$
Gradient: the contributions of the two paths add up:
$\frac{\partial f}{\partial x_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial net_2} \frac{\partial net_2}{\partial x_3} + \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial x_3} = -1 \times \sigma' \times w_3 + 1 \times 1 = -\sigma' w_3 + 1$
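The same kind of finite-difference check for the two-path gradient (arbitrary values again):

```python
# Sketch: when x3 reaches f along two paths, the path gradients add.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

x3, x4, x5, w3, w4 = 2.0, 3.0, 1.0, 0.5, -0.5    # arbitrary values

def f(x3):
    # f = (x3 + x5) - sigma(w3*x3 + w4*x4): x3 enters along two paths
    return (x3 + x5) - sigma(w3 * x3 + w4 * x4)

net2 = w3 * x3 + w4 * x4
sigma_prime = sigma(net2) * (1.0 - sigma(net2))
df_dx3 = -sigma_prime * w3 + 1.0                 # path via x2, plus path via x1

h = 1e-6
numeric = (f(x3 + h) - f(x3 - h)) / (2 * h)
print(df_dx3, numeric)                           # should agree closely
```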
Summary
• Forward pass to compute $f$
• Backward pass to compute the gradients
[Figure: a small example circuit with inputs $x_1, x_2$, sigmoid units $h_1^1, h_2^1$ with pre-activations $net_1^1, net_2^1$, and an addition neuron producing $f$.]
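The whole recipe fits in a toy reverse-mode sketch (my own minimal design, not the lecture's code): each node stores its parents and the local gradient on each edge; the forward pass builds values, and backward multiplies along edges and sums over paths.

```python
# A minimal reverse-mode sketch (toy design, not the lecture's code).
import math

class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_gradient)
        self.grad = 0.0

    def backward(self, upstream=1.0):
        # Multiply along edges and sum over paths (fine for small circuits).
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def sub(a, b):
    return Node(a.value - b.value, [(a, 1.0), (b, -1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, [(a, s * (1.0 - s))])

# The final circuit from the slides: f = (x3 + x5) - sigma(w3*x3 + w4*x4).
x3, x4, x5 = Node(2.0), Node(3.0), Node(1.0)     # arbitrary values
w3, w4 = Node(0.5), Node(-0.5)
f = sub(add(x3, x5), sigmoid(add(mul(w3, x3), mul(w4, x4))))
f.backward()                      # backward pass from the output
print(f.value, x3.grad, w3.grad)  # x3.grad sums over its two paths
```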
Math form
Gradient descent
• Minimize loss $L(\theta)$, where the hypothesis is parametrized by $\theta$
• Gradient descent
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t)$
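The update rule as a loop (a sketch; the toy loss and step size are arbitrary):

```python
# Gradient descent sketch: repeat theta_{t+1} = theta_t - eta * grad L(theta_t).
def gradient_descent(grad_L, theta0, eta, steps):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad_L(theta)
    return theta

# Toy example: L(theta) = (theta - 3)^2 has gradient 2*(theta - 3).
print(gradient_descent(lambda t: 2.0 * (t - 3.0), 0.0, 0.1, 100))  # ~3.0
```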
Stochastic gradient descent (SGD)
• Suppose data points arrive one by one
• $L(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t)$, but we only know $l(\theta, x_t, y_t)$ at time $t$
• Idea: simply do what you can based on local information
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla l(\theta_t, x_t, y_t)$
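A sketch of this update loop, with a hypothetical per-example gradient function `grad_l`:

```python
# SGD sketch: one update per arriving point.
def sgd(grad_l, theta0, data, eta):
    theta = theta0
    for x_t, y_t in data:                        # points arrive one by one
        theta = theta - eta * grad_l(theta, x_t, y_t)
    return theta

# Toy problem: fit theta with l = (theta*x - y)^2 / 2, true theta = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 50
print(sgd(lambda th, x, y: (th * x - y) * x, 0.0, data, 0.05))  # ~2.0
```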
Mini-batch
• Instead of one data point, work with a small batch of $b$ points
  $(x_{tb+1}, y_{tb+1}), \dots, (x_{tb+b}, y_{tb+b})$
• Update rule
  $\theta_{t+1} = \theta_t - \eta_t \nabla \frac{1}{b} \sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i})$
• Typical batch size: $b = 128$
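The same toy problem as above with averaged per-batch gradients (b = 2 here purely for illustration; the slide's typical value is 128):

```python
# Mini-batch sketch: average the per-example gradients over b points per step.
def minibatch_sgd(grad_l, theta0, data, eta, b):
    theta = theta0
    for start in range(0, len(data) - b + 1, b):
        batch = data[start:start + b]
        avg_grad = sum(grad_l(theta, x, y) for x, y in batch) / b
        theta = theta - eta * avg_grad
    return theta

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 50
print(minibatch_sgd(lambda th, x, y: (th * x - y) * x, 0.0, data, 0.05, 2))  # ~2.0
```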