Deep Learning Basics Lecture 2: Backpropagation

Princeton University COS 495

Instructor: Yingyu Liang

How to train the dragon?

[Figure: a deep network with hidden layers $h_1, h_2, \dots, h_L$]

How to get the expected output

Loss of the system: $l(x; \theta) = l(f_\theta, x, y)$

[Figure: $x \to f_\theta(x) \to l(x; \theta) \neq 0$]

Find direction $d$ so that: $l(x; \theta + d) \approx 0$

[Figure: plot of the loss $l(x; \theta + d)$]

How to find $d$: $l(x; \theta + \epsilon v) \approx l(x; \theta) + \nabla l(x; \theta) \cdot \epsilon v$ for small scalar $\epsilon$

Conclusion: Move $\theta$ along $-\nabla l(x; \theta)$ for a small amount
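The first-order argument can be checked numerically. The following is a minimal sketch, not part of the lecture: the quadratic loss, the values of `theta` and `eps`, and the names `loss` and `grad` are all illustrative assumptions.

```python
import numpy as np

# Hypothetical loss for illustration: l(theta) = ||theta - target||^2 / 2.
target = np.array([1.0, -2.0])

def loss(theta):
    return 0.5 * np.sum((theta - target) ** 2)

def grad(theta):
    # Analytic gradient of the quadratic loss above.
    return theta - target

theta = np.array([3.0, 1.0])
eps = 1e-3
v = -grad(theta)  # candidate direction: the negative gradient

# First-order prediction: l(theta + eps*v) ≈ l(theta) + grad . (eps*v)
predicted = loss(theta) + grad(theta) @ (eps * v)
actual = loss(theta + eps * v)

print(actual < loss(theta))      # True when the step lowers the loss
print(abs(actual - predicted))   # gap between the true loss and the linear prediction
```

With $v = -\nabla l$, the first-order term is $-\epsilon \|\nabla l\|^2$, which is negative whenever the gradient is nonzero, so a small enough step decreases the loss.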

Neural Networks as real circuits

Pictorial illustration of gradient descent

Gradient

• Gradient of the loss is simple

  • E.g., $l(f_\theta, x, y) = (f_\theta(x) - y)^2 / 2$

  • $\frac{\partial l}{\partial \theta} = (f_\theta(x) - y) \frac{\partial f}{\partial \theta}$

• Key part: gradient of the hypothesis
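As a small illustration of this split between the loss and the hypothesis, the sketch below uses a linear model as a stand-in for $f_\theta$. That choice is an assumption made here, as are the names `f`, `loss`, and `loss_grad`. For a linear $f$, $\partial f / \partial \theta = x$, and the result is compared against central finite differences.

```python
import numpy as np

def f(theta, x):
    # Hypothetical hypothesis for illustration: a linear model.
    return theta @ x

def loss(theta, x, y):
    return 0.5 * (f(theta, x) - y) ** 2

def loss_grad(theta, x, y):
    # dl/dtheta = (f_theta(x) - y) * df/dtheta, and df/dtheta = x for a linear f.
    return (f(theta, x) - y) * x

theta = np.array([0.5, -1.0])
x = np.array([2.0, 3.0])
y = 1.0

# Central finite differences as an independent check.
h = 1e-6
numeric = np.array([
    (loss(theta + h * e, x, y) - loss(theta - h * e, x, y)) / (2 * h)
    for e in np.eye(2)
])
print(loss_grad(theta, x, y), numeric)
```

For a deep network only `f` and its gradient change; the factor $(f_\theta(x) - y)$ stays the same.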

Open the box: real circuit

Single neuron

[Figure: a single "−" neuron with output $f$]

Function: $f = x_1 - x_2$

Gradient: $\frac{\partial f}{\partial x_1} = 1$, $\frac{\partial f}{\partial x_2} = -1$

Two neurons

[Figure: a "+" neuron with output $x_2$ feeding a "−" neuron with output $f$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$

Gradient: $\frac{\partial x_2}{\partial x_3} = 1$, $\frac{\partial x_2}{\partial x_4} = 1$. What about $\frac{\partial f}{\partial x_3}$?

Gradient: $\frac{\partial f}{\partial x_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial x_3} = -1$

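Multiple input

[Figure: a "+" neuron with output $x_2$ feeding a "−" neuron with output $f$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_5 + x_4)$

Gradient: $\frac{\partial x_2}{\partial x_5} = 1$

Gradient: $\frac{\partial f}{\partial x_5} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial x_5} = -1$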

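Weights on the edges

[Figure: a "+" neuron with output $x_2$ feeding a "−" neuron with output $f$, with gradient labels $-x_3$ and $-x_4$ on the weighted edges]

Function: $f = x_1 - x_2 = x_1 - (w_3 x_3 + w_4 x_4)$

Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial w_3} = -1 \times x_3 = -x_3$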
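The weighted circuit can be traced by hand in a few lines. This is a sketch under the assumption that $x_2 = w_3 x_3 + w_4 x_4$ feeds the "−" neuron. The function names `forward` and `backward` and the input values are hypothetical.

```python
def forward(x1, x3, x4, w3, w4):
    x2 = w3 * x3 + w4 * x4   # "+" neuron with weighted inputs
    f = x1 - x2              # "−" neuron
    return x2, f

def backward(x3, x4):
    df_df = 1.0              # gradient at the output
    df_dx2 = -1.0 * df_df    # through the "−" neuron
    df_dw3 = df_dx2 * x3     # dx2/dw3 = x3
    df_dw4 = df_dx2 * x4     # dx2/dw4 = x4
    return df_dw3, df_dw4

x1, x3, x4, w3, w4 = 5.0, 2.0, 3.0, 0.5, -1.0
_, f = forward(x1, x3, x4, w3, w4)
print(f, backward(x3, x4))
```

The gradients with respect to the weights depend only on the inputs carried by those edges, which is why the forward values need to be available during the backward pass.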

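Activation

[Figure: a $\sigma$ neuron with pre-activation $\mathrm{net}_2$ and output $x_2$, feeding a "−" neuron with output $f$]

Function: $f = x_1 - x_2 = x_1 - \sigma(w_3 x_3 + w_4 x_4)$

Let $\mathrm{net}_2 = w_3 x_3 + w_4 x_4$

$\frac{\partial x_2}{\partial \mathrm{net}_2} = \sigma'$, $\frac{\partial \mathrm{net}_2}{\partial w_3} = x_3$

Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial \mathrm{net}_2} \frac{\partial \mathrm{net}_2}{\partial w_3} = -1 \times \sigma' \times x_3 = -\sigma' x_3$

[Figure: the same circuit with gradient labels $-\sigma'$ at $\mathrm{net}_2$ and $-\sigma' x_3$ at $w_3$]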
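The three-factor chain rule can be verified numerically. The sketch below assumes $\sigma$ is the logistic sigmoid, which the slides do not specify, so that $\sigma' = \sigma (1 - \sigma)$ evaluated at $\mathrm{net}_2$. The input values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x1, x3, x4, w3, w4):
    net2 = w3 * x3 + w4 * x4
    return x1 - sigmoid(net2)

def df_dw3(x3, x4, w3, w4):
    net2 = w3 * x3 + w4 * x4
    s = sigmoid(net2)
    sigma_prime = s * (1.0 - s)      # derivative of the logistic sigmoid
    return -1.0 * sigma_prime * x3   # df/dx2 * dx2/dnet2 * dnet2/dw3

x1, x3, x4, w3, w4 = 1.0, 2.0, -1.0, 0.3, 0.8
h = 1e-6
numeric = (f(x1, x3, x4, w3 + h, w4) - f(x1, x3, x4, w3 - h, w4)) / (2 * h)
analytic = df_dw3(x3, x4, w3, w4)
print(analytic, numeric)
```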

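Multiple paths

[Figure: input $x_3$ feeds both a "+" neuron with output $x_1$ and, through weight $w_3$, a $\sigma$ neuron with pre-activation $\mathrm{net}_2$ and output $x_2$; both feed a "−" neuron with output $f$]

Function: $f = x_1 - x_2 = (x_3 + x_5) - \sigma(w_3 x_3 + w_4 x_4)$

Gradient: $\frac{\partial f}{\partial x_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial \mathrm{net}_2} \frac{\partial \mathrm{net}_2}{\partial x_3} + \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial x_3} = -1 \times \sigma' \times w_3 + 1 \times 1 = -\sigma' w_3 + 1$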
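When an input reaches the output along several paths, the per-path products are summed. The sketch below assumes $x_1 = x_3 + x_5$, which is what $\partial x_1 / \partial x_3 = 1$ in the gradient implies, and a logistic sigmoid for $\sigma$. Both are assumptions, and the input values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x3, x4, x5, w3, w4):
    x1 = x3 + x5                          # "+" neuron
    x2 = sigmoid(w3 * x3 + w4 * x4)       # sigma neuron
    return x1 - x2                        # "−" neuron

def df_dx3(x3, x4, w3, w4):
    s = sigmoid(w3 * x3 + w4 * x4)
    via_x2 = -1.0 * s * (1.0 - s) * w3    # path x3 -> net2 -> x2 -> f
    via_x1 = 1.0 * 1.0                    # path x3 -> x1 -> f
    return via_x2 + via_x1                # contributions from all paths add up

x3, x4, x5, w3, w4 = 2.0, -1.0, 0.5, 0.3, 0.8
h = 1e-6
numeric = (f(x3 + h, x4, x5, w3, w4) - f(x3 - h, x4, x5, w3, w4)) / (2 * h)
analytic = df_dx3(x3, x4, w3, w4)
print(analytic, numeric)
```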

Summary

• Forward to compute 𝑓

• Backward to compute the gradients

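[Figure: a network with $\sigma$ units $h_{11}$ and $h_{21}$, pre-activations $\mathrm{net}_{11}$ and $\mathrm{net}_{21}$, and a "+" unit with output $f$]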
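The forward and backward passes can be sketched as a tiny computation graph. Everything here is illustrative rather than the lecture's implementation: the `Node` class and the `sigmoid` and `backward` helpers are hypothetical names, $\sigma$ is assumed to be the logistic sigmoid, and the example circuit is assumed to be $f = (x_3 + x_5) - \sigma(w_3 x_3 + w_4 x_4)$.

```python
import math

class Node:
    """A scalar in a computation graph (hypothetical helper, for illustration)."""

    def __init__(self, value, parents=()):
        self.value = value        # set during the forward pass
        self.grad = 0.0           # d(output)/d(this node), set during backward
        self.parents = parents    # pairs (parent_node, local_derivative)

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

def sigmoid(node):
    s = 1.0 / (1.0 + math.exp(-node.value))
    return Node(s, ((node, s * (1.0 - s)),))

def backward(output):
    # Order the nodes so each one appears after everything it depends on.
    order, seen = [], set()

    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for parent, _ in n.parents:
                visit(parent)
            order.append(n)

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, summing over all paths.
    for n in reversed(order):
        for parent, local in n.parents:
            parent.grad += n.grad * local

# Forward pass: f = (x3 + x5) - sigmoid(w3*x3 + w4*x4)
x3, x4, x5 = Node(2.0), Node(-1.0), Node(0.5)
w3, w4 = Node(0.3), Node(0.8)
x2 = sigmoid(w3 * x3 + w4 * x4)
f = (x3 + x5) - x2

# Backward pass: fills in .grad on every node that f depends on.
backward(f)
print(f.value, x3.grad, w3.grad)
```

Each operation records its local derivative during the forward pass, so the backward pass only multiplies and adds. One backward sweep yields the gradient with respect to every input and weight at once.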

Math form

Gradient descent

• Minimize loss $L(\theta)$, where the hypothesis is parametrized by $\theta$

• Gradient descent

  • Initialize $\theta_0$

  • $\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t)$
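The update rule translates directly into a loop. This is a minimal sketch with a constant step size. The function name `gradient_descent`, the quadratic loss, and the values of `eta` and `steps` are assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta, steps):
    # theta_{t+1} = theta_t - eta_t * grad L(theta_t), with constant eta_t = eta.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_L(theta)
    return theta

# Hypothetical loss for illustration: L(theta) = ||theta - target||^2 / 2,
# whose gradient is theta - target.
target = np.array([1.0, -2.0])
theta = gradient_descent(lambda th: th - target, theta0=[0.0, 0.0], eta=0.1, steps=200)
print(theta)
```

For this loss each step shrinks the distance to `target` by a factor of $1 - \eta$, so the iterate approaches the minimizer geometrically.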

Stochastic gradient descent (SGD)

• Suppose data points arrive one by one

• $L(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t)$, but we only know $l(\theta, x_t, y_t)$ at time $t$

• Idea: simply do what you can based on local information

  • Initialize $\theta_0$

  • $\theta_{t+1} = \theta_t - \eta_t \nabla l(\theta_t, x_t, y_t)$
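A sketch of the one-point-at-a-time update, assuming a linear hypothesis with squared loss and synthetic noise-free data. The data, the step size, and the variable names are illustrative assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = <true_theta, x>, noise-free.
true_theta = np.array([2.0, -1.0])
X = rng.normal(size=(2000, 2))
Y = X @ true_theta

theta = np.zeros(2)                      # initialize theta_0
eta = 0.05                               # constant step size eta_t
for x_t, y_t in zip(X, Y):               # data points arrive one by one
    grad = (theta @ x_t - y_t) * x_t     # gradient of (f_theta(x_t) - y_t)^2 / 2
    theta = theta - eta * grad
print(theta)
```

Each update uses only the current point, so the cost per step does not grow with $n$.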

Mini-batch

• Instead of one data point, work with a small batch of $b$ points

$(x_{tb+1}, y_{tb+1}), \dots, (x_{tb+b}, y_{tb+b})$

• Update rule

$\theta_{t+1} = \theta_t - \eta_t \nabla \frac{1}{b} \sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i})$

• Typical batch size: $b = 128$
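The mini-batch rule averages the per-point gradients over $b$ consecutive points before each step. The sketch below again assumes a linear hypothesis with squared loss on synthetic noise-free data. The step size and the number of batches are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = <true_theta, x>, noise-free.
true_theta = np.array([2.0, -1.0])
X = rng.normal(size=(128 * 50, 2))
Y = X @ true_theta

b = 128                                   # batch size
eta = 0.5                                 # constant step size eta_t
theta = np.zeros(2)
for t in range(len(X) // b):
    xb = X[t * b:(t + 1) * b]             # points tb+1, ..., tb+b (1-based)
    yb = Y[t * b:(t + 1) * b]
    residual = xb @ theta - yb
    grad = xb.T @ residual / b            # gradient of the batch-averaged loss
    theta = theta - eta * grad
print(theta)
```

Averaging over a batch reduces the variance of each gradient estimate, and the whole batch is handled with a single matrix product.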
