Back-Propagation — cs.cmu.edu 16385/s17/slides/9.5_Backpropagation.pdf
TRANSCRIPT
[Page 1]
Back-Propagation
16-385 Computer Vision (Kris Kitani), Carnegie Mellon University
[Page 2]
Back to the… World’s Smallest Perceptron!
y = wx — a function of ONE parameter (a.k.a. the line equation; linear regression)
[diagram: input x → weight w → function f → output y]
[Page 3]
Training the world’s smallest perceptron
This is just gradient descent — that means the weight update should use the gradient of the loss function. Now where does this come from?
[Page 4]
The loss function: L = ½(y − ŷ)²
The model: ŷ = wx
dL/dw …is the rate at which this (the loss function) will change… per unit change of this (the weight parameter).
Let’s compute the derivative…
[Page 5]
Compute the derivative:
dL/dw = d/dw { ½(y − ŷ)² }
      = −(y − ŷ) d(wx)/dw
      = −(y − ŷ)x ≡ ∇w   (just shorthand)
That means the weight update for gradient descent is:
w = w − ∇w = w + (y − ŷ)x   (move in direction of negative gradient)
[Page 6]
Gradient Descent (world’s smallest perceptron)
For each sample {xᵢ, yᵢ}:
1. Predict
   a. Forward pass: ŷ = w xᵢ
   b. Compute loss: Lᵢ = ½(yᵢ − ŷ)²
2. Update
   a. Back-propagation: dLᵢ/dw = −(yᵢ − ŷ)xᵢ = ∇w
   b. Gradient update: w = w − ∇w
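The loop above can be sketched directly in Python. The data below is made up for illustration (the true weight is 2.0), and a small step size η is added, which the slides introduce a few pages later:

```python
# Training the world's smallest perceptron (y = w*x) with per-sample
# gradient descent, following the slide's update w = w - eta * dL/dw.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x_i, y_i) with y = 2x
w = 0.0          # initial weight
eta = 0.1        # step size

for epoch in range(100):
    for x_i, y_i in samples:
        y_hat = w * x_i                 # 1a. forward pass
        loss = 0.5 * (y_i - y_hat)**2   # 1b. compute loss
        grad = -(y_i - y_hat) * x_i     # 2a. back-propagation: dL/dw
        w = w - eta * grad              # 2b. gradient update

print(round(w, 3))  # converges toward 2.0
```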
[Page 7]
Training the world’s smallest perceptron
[Page 8]
World’s (second) smallest perceptron! — a function of TWO parameters
[diagram: inputs x₁, x₂ → weights w₁, w₂ → output y]
[Page 9]
Gradient Descent
For each sample {xᵢ, yᵢ}:
1. Predict
   a. Forward pass
   b. Compute loss
2. Update
   a. Back-propagation
   b. Gradient update
We just need to compute partial derivatives for this network.
[Page 10]
Back-Propagation
∂L/∂w₁ = ∂/∂w₁ { ½(y − ŷ)² }
       = −(y − ŷ) ∂ŷ/∂w₁
       = −(y − ŷ) ∂(Σᵢ wᵢxᵢ)/∂w₁
       = −(y − ŷ) ∂(w₁x₁)/∂w₁
       = −(y − ŷ)x₁ = ∇w₁
∂L/∂w₂ = ∂/∂w₂ { ½(y − ŷ)² }
       = −(y − ŷ) ∂ŷ/∂w₂
       = −(y − ŷ) ∂(Σᵢ wᵢxᵢ)/∂w₂
       = −(y − ŷ) ∂(w₂x₂)/∂w₂
       = −(y − ŷ)x₂ = ∇w₂
Why do we have partial derivatives now?
[Page 11]
Back-Propagation (the same partial derivatives as above), plus the Gradient Update:
w₁ = w₁ − η∇w₁ = w₁ + η(y − ŷ)x₁
w₂ = w₂ − η∇w₂ = w₂ + η(y − ŷ)x₂
[Page 12]
Gradient Descent
For each sample {xᵢ, yᵢ}:
1. Predict
   a. Forward pass: ŷ = f_MLP(xᵢ; θ)
   b. Compute loss: Lᵢ = ½(yᵢ − ŷ)²
2. Update
   a. Back-propagation (two BP lines now):
      ∇w₁ = −(yᵢ − ŷ)x₁ᵢ
      ∇w₂ = −(yᵢ − ŷ)x₂ᵢ
   b. Gradient update (η: adjustable step size, since gradients are approximated from a stochastic sample):
      w₁ = w₁ + η(yᵢ − ŷ)x₁ᵢ
      w₂ = w₂ + η(yᵢ − ŷ)x₂ᵢ
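The two-weight version can be sketched the same way. The data is again made up (the generating weights here are w₁ = 1.5, w₂ = −0.5), and each weight gets its own partial-derivative update:

```python
import random

# Two-parameter perceptron y = w1*x1 + w2*x2, trained per sample with
# the slide's updates: wk = wk + eta*(y - y_hat)*xk.
random.seed(0)
samples = [(x1, x2, 1.5 * x1 - 0.5 * x2)
           for x1, x2 in [(random.uniform(-1, 1), random.uniform(-1, 1))
                          for _ in range(20)]]

w1, w2 = 0.0, 0.0
eta = 0.1

for epoch in range(200):
    for x1, x2, y in samples:
        y_hat = w1 * x1 + w2 * x2        # forward pass
        # partial derivatives: dL/dwk = -(y - y_hat) * xk
        w1 += eta * (y - y_hat) * x1     # gradient update per weight
        w2 += eta * (y - y_hat) * x2

print(round(w1, 3), round(w2, 3))  # approaches 1.5 and -0.5
```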
[Page 13]
We haven’t seen a lot of ‘propagation’ yet because our perceptrons only had one layer…
[Page 14]
Multi-layer perceptron — a function of FOUR parameters and FOUR layers!
[diagram: x → (w₁, b₁) → h₁ → w₂ → h₂ → w₃ → o → y]
[Page 15]
[diagram: input layer 1 — input x, weight w₁, bias b₁, sum a₁ → hidden layer 2 — activation f₁, weight w₂, sum a₂ → hidden layer 3 — activation f₂, weight w₃, sum a₃ → output layer 4 — activation f₃, output ŷ]
[Page 17]
[diagram: x → w₁, b₁ → a₁ → f₁ → w₂ → a₂ → f₂ → w₃ → a₃ → f₃ → ŷ]
a₁ = w₁·x + b₁
a₂ = w₂·f₁(w₁·x + b₁)
a₃ = w₃·f₂(w₂·f₁(w₁·x + b₁))
ŷ = f₃(w₃·f₂(w₂·f₁(w₁·x + b₁)))
[Page 24]
The entire network can be written out as one long equation:
a₁ = w₁·x + b₁
a₂ = w₂·f₁(w₁·x + b₁)
a₃ = w₃·f₂(w₂·f₁(w₁·x + b₁))
ŷ = f₃(w₃·f₂(w₂·f₁(w₁·x + b₁)))
We need to train the network: what is known? What is unknown?
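The nested equation can be sketched as a forward pass. Sigmoid activations for f₁, f₂, f₃ and all numeric values are assumptions for illustration:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2, w3, b1):
    """Forward pass through the scalar 4-layer MLP from the slides."""
    a1 = w1 * x + b1        # layer 1 -> 2: weighted input plus bias
    f1 = sigmoid(a1)        # hidden layer 2 activation
    a2 = w2 * f1
    f2 = sigmoid(a2)        # hidden layer 3 activation
    a3 = w3 * f2
    y_hat = sigmoid(a3)     # output layer 4
    return y_hat

print(forward(x=1.0, w1=0.5, w2=-0.3, w3=0.8, b1=0.1))
```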
[Page 25]
What is known? The training samples — the input x and the target output y.
[Page 26]
What is unknown? The weights w₁, w₂, w₃ and the bias b₁ (and the activation function sometimes has parameters of its own).
[Page 27]
Learning an MLP
Given a set of samples {xᵢ, yᵢ} and an MLP ŷ = f_MLP(x; θ),
estimate the parameters of the MLP: θ = {f, w, b}.
[Page 28]
Stochastic Gradient Descent
For each random sample {xᵢ, yᵢ}:
1. Predict
   a. Forward pass: ŷ = f_MLP(xᵢ; θ)
   b. Compute loss
2. Update
   a. Back-propagation: ∂L/∂θ (vector of parameter partial derivatives)
   b. Gradient update: θ ← θ − η∇θ (vector of parameter update equations)
[Page 29]
So we need to compute the partial derivatives
∂L/∂θ = [ ∂L/∂w₃  ∂L/∂w₂  ∂L/∂w₁  ∂L/∂b ]ᵀ
[Page 30]
[diagram: x → w₁ → a₁ → f₁ → w₂ → a₂ → f₂ → w₃ → a₃ → f₃ → ŷ → L (loss layer), with bias b₁]
The partial derivative ∂L/∂w₁ describes… how does this (w₁) affect this (the loss)?
So, how do you compute it? Remember…
[Page 31]
The Chain Rule
[Page 32]
According to the chain rule…
∂L/∂w₃ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂w₃)
[diagram: f₂ → w₃ → a₃ → f₃ → rest of the network → L(y, ŷ)]
Intuitively, the effect of the weight w₃ on the loss function: L depends on f₃, which depends on a₃, which depends on w₃.
[Page 33]
Chain Rule!
∂L/∂w₃ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂w₃)
        = −(y − ŷ)(∂f₃/∂a₃)(∂a₃/∂w₃)
        = −(y − ŷ) f₃(1 − f₃)(∂a₃/∂w₃)
        = −(y − ŷ) f₃(1 − f₃) f₂
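This three-factor product is easy to check against a numerical derivative. A small sketch, assuming a sigmoid f₃, an L2 loss, and made-up values for w₃, f₂, and y:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss_wrt_w3(w3, f2, y):
    """L = 0.5*(y - y_hat)^2 where y_hat = f3 = sigmoid(w3 * f2)."""
    y_hat = sigmoid(w3 * f2)
    return 0.5 * (y - y_hat) ** 2

# Analytic gradient from the chain rule: -(y - f3) * f3*(1 - f3) * f2
w3, f2, y = 0.7, 0.4, 1.0
f3 = sigmoid(w3 * f2)
analytic = -(y - f3) * f3 * (1.0 - f3) * f2

# Central-difference check of dL/dw3
eps = 1e-6
numeric = (loss_wrt_w3(w3 + eps, f2, y) - loss_wrt_w3(w3 - eps, f2, y)) / (2 * eps)

print(abs(analytic - numeric) < 1e-7)  # prints True: the gradients agree
```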
[Page 34]
The first factor, ∂L/∂f₃ = −(y − ŷ), is just the partial derivative of the L2 loss.
[Page 35]
For the middle factor, let’s use a Sigmoid activation function:
ds(x)/dx = s(x)(1 − s(x))
so ∂f₃/∂a₃ = f₃(1 − f₃).
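The sigmoid-derivative identity can be sanity-checked numerically. A minimal sketch (the test point 0.37 is arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# ds/dx = s(x) * (1 - s(x)) -- compare against a central difference
x = 0.37
s = sigmoid(x)
identity = s * (1.0 - s)

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(abs(identity - numeric) < 1e-8)  # prints True
```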
[Page 38]
[diagram: x → w₁ → a₁ → f₁ → w₂ → a₂ → f₂ → w₃ → a₃ → f₃ → ŷ, with bias b₁]
∂L/∂w₂ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂w₂)
[Page 39]
∂L/∂w₂ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂w₂)
The leading factors (∂L/∂f₃)(∂f₃/∂a₃) were already computed for ∂L/∂w₃ — re-use them (propagate)!
[Page 40]
The Chain rule says… (each quantity depends on the one before it, all the way back to w₁)
∂L/∂w₁ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂f₁)(∂f₁/∂a₁)(∂a₁/∂w₁)
[Page 41]
∂L/∂w₁ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂f₁)(∂f₁/∂a₁)(∂a₁/∂w₁)
The leading factors were already computed for ∂L/∂w₃ and ∂L/∂w₂ — re-use them (propagate)!
[Page 42]
∂L/∂w₃ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂w₃)
∂L/∂w₂ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂w₂)
∂L/∂w₁ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂f₁)(∂f₁/∂a₁)(∂a₁/∂w₁)
∂L/∂b  = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂f₁)(∂f₁/∂a₁)(∂a₁/∂b)
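All four chains share their leading factors, which is exactly what a backward pass exploits: carry a running product (often called delta) and extend it layer by layer. A sketch for the scalar 4-layer network, assuming sigmoid activations for all three layers (the slides only fix f₃ as a sigmoid, so sigmoids for f₁ and f₂ are an assumption):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward_backward(x, y, w1, w2, w3, b1):
    """One forward + backward pass through the scalar 4-layer MLP.
    Returns (dL/dw3, dL/dw2, dL/dw1, dL/db) for L = 0.5*(y - y_hat)^2."""
    # Forward pass: cache every intermediate value
    a1 = w1 * x + b1; f1 = sigmoid(a1)
    a2 = w2 * f1;     f2 = sigmoid(a2)
    a3 = w3 * f2;     f3 = sigmoid(a3)   # f3 is y_hat

    # Backward pass: build each chain by extending the previous one
    delta = -(y - f3) * f3 * (1 - f3)    # (dL/df3)(df3/da3), reused below
    dw3 = delta * f2                     # * da3/dw3
    delta = delta * w3 * f2 * (1 - f2)   # * (da3/df2)(df2/da2)
    dw2 = delta * f1                     # * da2/dw2
    delta = delta * w2 * f1 * (1 - f1)   # * (da2/df1)(df1/da1)
    dw1 = delta * x                      # * da1/dw1
    db  = delta * 1.0                    # * da1/db
    return dw3, dw2, dw1, db

print(forward_backward(x=0.5, y=1.0, w1=0.3, w2=-0.5, w3=0.8, b1=0.1))
```

Note that each weight's gradient reuses everything already computed for the layer above it — the "propagate" in back-propagation.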
[Page 45]
Stochastic Gradient Descent
For each sample {xᵢ, yᵢ}:
1. Predict
   a. Forward pass: ŷ = f_MLP(xᵢ; θ)
   b. Compute loss: Lᵢ
2. Update
   a. Back-propagation (the four chain-rule partials):
      ∂L/∂w₃ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂w₃)
      ∂L/∂w₂ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂w₂)
      ∂L/∂w₁ = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂f₁)(∂f₁/∂a₁)(∂a₁/∂w₁)
      ∂L/∂b  = (∂L/∂f₃)(∂f₃/∂a₃)(∂a₃/∂f₂)(∂f₂/∂a₂)(∂a₂/∂f₁)(∂f₁/∂a₁)(∂a₁/∂b)
   b. Gradient update:
      w₃ = w₃ − η∇w₃
      w₂ = w₂ − η∇w₂
      w₁ = w₁ − η∇w₁
      b = b − η∇b
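The whole recipe can be put together as one self-contained training loop for the scalar 4-layer MLP. Sigmoid activations everywhere and the toy dataset are assumptions for illustration:

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Made-up training data: map x in [0, 1] to a low or high target in (0, 1)
samples = [(x / 10.0, 0.8 if x > 5 else 0.2) for x in range(11)]

# Initialize the four parameters
w1, w2, w3, b = 0.5, 0.5, 0.5, 0.0
eta = 0.5  # step size

random.seed(0)
for step in range(5000):
    x, y = random.choice(samples)          # stochastic sample
    # 1. Predict: forward pass, caching intermediates
    a1 = w1 * x + b;  f1 = sigmoid(a1)
    a2 = w2 * f1;     f2 = sigmoid(a2)
    a3 = w3 * f2;     f3 = sigmoid(a3)     # y_hat
    # 2a. Back-propagation: extend the running product (delta) per layer
    delta = -(y - f3) * f3 * (1 - f3)
    dw3 = delta * f2
    delta *= w3 * f2 * (1 - f2)
    dw2 = delta * f1
    delta *= w2 * f1 * (1 - f1)
    dw1, db = delta * x, delta
    # 2b. Gradient update on all four parameters
    w3 -= eta * dw3; w2 -= eta * dw2; w1 -= eta * dw1; b -= eta * db

# Average loss over the dataset after training
avg_loss = sum(0.5 * (y - sigmoid(w3 * sigmoid(w2 * sigmoid(w1 * x + b)))) ** 2
               for x, y in samples) / len(samples)
print(avg_loss)
```

This tiny scalar network has limited capacity, so it will not fit the step exactly, but SGD still drives the average loss well below its starting value.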
[Page 46]
Stochastic Gradient Descent
For each random sample {xᵢ, yᵢ}:
1. Predict
   a. Forward pass: ŷ = f_MLP(xᵢ; θ)
   b. Compute loss: Lᵢ
2. Update
   a. Back-propagation: ∂L/∂θ (vector of parameter partial derivatives)
   b. Gradient update: θ ← θ − η ∂L/∂θ (vector of parameter update equations)