Lecture 7: Training Neural Networks, Part 2


Page 1:

Lecture 7: Training Neural Networks, Part 2

Page 2:

Administrative

- Assignment 1 is being graded, stay tuned
- Project proposals due tomorrow by 11:59pm on Gradescope
- Assignment 2 is out, due Wednesday 5/2 11:59pm

Page 3:

Last time: Activation Functions

Sigmoid

tanh

ReLU

Leaky ReLU

Maxout

ELU

Page 4:

Last time: Activation Functions

Sigmoid

tanh

ReLU (good default choice)

Leaky ReLU

Maxout

ELU

Page 5:

Last time: Weight Initialization

Initialization too small: activations go to zero, gradients also zero, no learning

Initialization too big: activations saturate (for tanh), gradients zero, no learning

Initialization just right: nice distribution of activations at all layers, learning proceeds nicely

Page 6:

Last time: Data Preprocessing

Page 7:

Last time: Data Preprocessing

Before normalization: classification loss very sensitive to changes in weight matrix; hard to optimize

After normalization: less sensitive to small changes in weights; easier to optimize

Page 8:

Last time: Babysitting Learning

Page 9:

Last time: Hyperparameter Search

(Figure: Grid Layout vs. Random Layout; axes are an important parameter vs. an unimportant parameter)

Coarse to fine search

Page 10:

Today

- More normalization
- Fancier optimization
- Regularization
- Transfer Learning

Page 11:

Last time: Batch Normalization

Input: x: N × D
Learnable params: ɣ, β: 1 × D
Intermediates: μ, σ: 1 × D (per-dimension minibatch mean and std)
Output: y = ɣ (x - μ) / σ + β

Page 12:

Last time: Batch Normalization

Input: x: N × D
Learnable params: ɣ, β: 1 × D
Intermediates: μ, σ: 1 × D (per-dimension minibatch mean and std)
Output: y = ɣ (x - μ) / σ + β

Estimate mean and variance from the minibatch; can't do this at test time

Page 13:

Batch Normalization: Test Time

Input: x: N × D
Learnable params: ɣ, β: 1 × D
Intermediates: μ, σ: 1 × D, each a (running) average of values seen during training
Output: y = ɣ (x - μ) / σ + β
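A minimal NumPy sketch of this train/test asymmetry, with running averages maintained during training and reused at test time (function and variable names here are assumptions, not the assignment's API):

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      momentum=0.9, eps=1e-5, train=True):
    # x: (N, D); gamma, beta, running_mean, running_var: (D,)
    if train:
        mu = x.mean(axis=0)     # per-dimension minibatch mean
        var = x.var(axis=0)     # per-dimension minibatch variance
        # update the running averages that will be used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var   # fixed statistics at test time
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_hat + beta
    return y, running_mean, running_var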

Page 14:

Batch Normalization for ConvNets

Batch Normalization for fully-connected networks:
x: N × D
μ, σ: 1 × D
ɣ, β: 1 × D
y = ɣ (x - μ) / σ + β

Batch Normalization for convolutional networks (Spatial BatchNorm, BatchNorm2D):
x: N × C × H × W
μ, σ: 1 × C × 1 × 1
ɣ, β: 1 × C × 1 × 1
y = ɣ (x - μ) / σ + β

Page 15:

Layer Normalization

Batch Normalization for fully-connected networks:
x: N × D
μ, σ: 1 × D
ɣ, β: 1 × D
y = ɣ (x - μ) / σ + β

Layer Normalization for fully-connected networks:
x: N × D
μ, σ: N × 1
ɣ, β: 1 × D
y = ɣ (x - μ) / σ + β

Same behavior at train and test! Can be used in recurrent networks

Ba, Kiros, and Hinton, “Layer Normalization”, arXiv 2016

Page 16:

Instance Normalization

Batch Normalization for convolutional networks:
x: N × C × H × W
μ, σ: 1 × C × 1 × 1
ɣ, β: 1 × C × 1 × 1
y = ɣ (x - μ) / σ + β

Instance Normalization for convolutional networks:
x: N × C × H × W
μ, σ: N × C × 1 × 1
ɣ, β: 1 × C × 1 × 1
y = ɣ (x - μ) / σ + β

Same behavior at train / test!

Ulyanov et al, Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis, CVPR 2017

Page 17:

Comparison of Normalization Layers

Wu and He, “Group Normalization”, arXiv 2018
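The four normalization layers differ mainly in which axes the mean and variance are computed over; a NumPy sketch of just that part (the tensor shape and group count are assumptions for illustration):

import numpy as np

x = np.random.randn(8, 32, 14, 14)   # N, C, H, W
G = 4                                 # number of groups for GroupNorm (assumed)

mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)   # BatchNorm:    over N, H, W -> (1, C, 1, 1)
mu_ln = x.mean(axis=(1, 2, 3), keepdims=True)   # LayerNorm:    over C, H, W -> (N, 1, 1, 1)
mu_in = x.mean(axis=(2, 3), keepdims=True)      # InstanceNorm: over H, W    -> (N, C, 1, 1)

xg = x.reshape(8, G, 32 // G, 14, 14)           # GroupNorm: split channels into G groups
mu_gn = xg.mean(axis=(2, 3, 4), keepdims=True)  # over channels-within-group, H, W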

Page 18:

Group Normalization

Wu and He, “Group Normalization”, arXiv 2018 (Appeared 3/22/2018)

Page 19:

Decorrelated Batch Normalization

Huang et al, “Decorrelated Batch Normalization”, arXiv 2018 (Appeared 4/23/2018)

(Figure: Batch Normalization vs. Decorrelated Batch Normalization)

BatchNorm normalizes the data, but cannot correct for correlations among the input features

DBN whitens the data using the full covariance matrix of the minibatch; this corrects for correlations

Page 20:

Optimization

(Figure: loss landscape over weights W_1 and W_2)
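For reference, a runnable toy sketch of the plain gradient descent loop that the next slides critique (the quadratic loss here is an assumption for illustration):

import numpy as np

def loss_grad(x):
    # toy, poorly conditioned quadratic loss: steep in one direction, shallow in the other
    return np.array([1.0, 100.0]) * x

x = np.array([1.0, 1.0])
learning_rate = 1e-3
for _ in range(500):
    dx = loss_grad(x)          # in practice: gradient estimated from a minibatch
    x -= learning_rate * dx    # vanilla SGD step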

Page 21:

Optimization: Problems with SGD

What if loss changes quickly in one direction and slowly in another? What does gradient descent do?

Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large

Page 22:

Optimization: Problems with SGD

What if loss changes quickly in one direction and slowly in another? What does gradient descent do? Very slow progress along shallow dimension, jitter along steep direction.

Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large

Page 23:

Optimization: Problems with SGD

What if the loss function has a local minimum or saddle point?

Page 24:

Optimization: Problems with SGD

What if the loss function has a local minimum or saddle point?

Zero gradient, gradient descent gets stuck

Page 25:

Optimization: Problems with SGD

What if the loss function has a local minimum or saddle point?

Saddle points much more common in high dimension

Dauphin et al, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, NIPS 2014

Page 26:

Optimization: Problems with SGD

Our gradients come from minibatches so they can be noisy!

Page 27:

SGD + Momentum

SGD: x_{t+1} = x_t - α ∇f(x_t)

SGD+Momentum: v_{t+1} = ρ v_t + ∇f(x_t);  x_{t+1} = x_t - α v_{t+1}

- Build up "velocity" as a running mean of gradients
- Rho gives "friction"; typically rho=0.9 or 0.99

Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013
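A NumPy sketch of the momentum update above (the toy poorly-conditioned gradient is an assumption for illustration):

import numpy as np

def grad(x):
    return np.array([1.0, 100.0]) * x   # toy quadratic loss gradient (assumed)

x = np.array([1.0, 1.0])
v = np.zeros_like(x)
rho, learning_rate = 0.9, 1e-3
for _ in range(500):
    dx = grad(x)
    v = rho * v + dx           # accumulate a "velocity": running mean of gradients
    x -= learning_rate * v     # step along the velocity instead of the raw gradient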

Page 28:

SGD + Momentum

You may see SGD+Momentum formulated in different ways, but they are equivalent: they give the same sequence of x.

One common form: v_{t+1} = ρ v_t + ∇f(x_t), x_{t+1} = x_t - α v_{t+1}. Another: v_{t+1} = ρ v_t - α ∇f(x_t), x_{t+1} = x_t + v_{t+1}.

Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013

Page 29:

SGD + Momentum

(Animations: SGD vs SGD+Momentum on examples with local minima, saddle points, poor conditioning, and gradient noise)

Page 30:

SGD+Momentum

Momentum update: (diagram: velocity + gradient at the current point → actual step)

Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)", 1983
Nesterov, "Introductory lectures on convex optimization: a basic course", 2004
Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013

Combine gradient at current point with velocity to get step used to update weights

Page 31:

Nesterov Momentum

Momentum update: (diagram: velocity + gradient at the current point → actual step)

Nesterov Momentum: (diagram: velocity + gradient at the look-ahead point → actual step)

Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)", 1983
Nesterov, "Introductory lectures on convex optimization: a basic course", 2004
Sutskever et al, "On the importance of initialization and momentum in deep learning", ICML 2013

Combine gradient at current point with velocity to get step used to update weights

“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction

Page 32:

Nesterov Momentum

(Diagram: velocity + gradient at the look-ahead point → actual step)

“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction

Page 33:

Nesterov Momentum

Annoying, usually we want the update in terms of x_t and ∇f(x_t)

(Diagram: velocity + gradient at the look-ahead point → actual step)

“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction

Page 34:

Nesterov Momentum

Annoying, usually we want the update in terms of x_t and ∇f(x_t)

(Diagram: velocity + gradient at the look-ahead point → actual step)

“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction

Change of variables and rearrange:

Page 35:

Nesterov Momentum

Annoying, usually we want the update in terms of x_t and ∇f(x_t)

Change of variables and rearrange:
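A NumPy sketch of the rearranged Nesterov update in terms of the shifted variable (this follows the standard change-of-variables form; the toy gradient is an assumption):

import numpy as np

def grad(x):
    return np.array([1.0, 100.0]) * x    # toy quadratic loss gradient (assumed)

x = np.array([1.0, 1.0])                  # the shifted variable after the change of variables
v = np.zeros_like(x)
rho, learning_rate = 0.9, 1e-3
for _ in range(500):
    dx = grad(x)                          # gradient evaluated at the look-ahead point
    old_v = v
    v = rho * v - learning_rate * dx      # velocity update
    x += -rho * old_v + (1 + rho) * v     # equivalently: x += v + rho * (v - old_v)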

Page 36:

Nesterov Momentum

(Animation: SGD vs SGD+Momentum vs Nesterov)

Page 37:

AdaGrad

Added element-wise scaling of the gradient based on the historical sum of squares in each dimension

“Per-parameter learning rates” or “adaptive learning rates”

Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
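A NumPy sketch of the AdaGrad update described above (the toy gradient is an assumption):

import numpy as np

def grad(x):
    return np.array([1.0, 100.0]) * x    # toy quadratic loss gradient (assumed)

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 1e-1
for _ in range(500):
    dx = grad(x)
    grad_squared += dx * dx                                    # historical sum of squares, per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # element-wise scaled step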

Page 38:

AdaGrad

Q: What happens with AdaGrad?

Page 39:

AdaGrad

Q: What happens with AdaGrad? Progress along “steep” directions is damped; progress along “flat” directions is accelerated

Page 40:

AdaGrad

Q2: What happens to the step size over long time?

Page 41:

AdaGrad

Q2: What happens to the step size over long time? Decays to zero

Page 42:

RMSProp

AdaGrad

RMSProp

Tieleman and Hinton, 2012
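A NumPy sketch of the RMSProp update, which replaces AdaGrad's ever-growing sum with a leaky (decaying) average so step sizes do not decay to zero (toy gradient assumed):

import numpy as np

def grad(x):
    return np.array([1.0, 100.0]) * x    # toy quadratic loss gradient (assumed)

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 1e-2, 0.99
for _ in range(500):
    dx = grad(x)
    # leaky accumulation of squared gradients instead of AdaGrad's running sum
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)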

Page 43:

RMSProp

(Animation: SGD vs SGD+Momentum vs RMSProp)

Page 44:

Adam (almost)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Page 45:

Adam (almost)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Momentum

AdaGrad / RMSProp

Sort of like RMSProp with momentum

Q: What happens at first timestep?

Page 46:

Adam (full form)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Momentum

AdaGrad / RMSProp

Bias correction

Bias correction for the fact that first and second moment estimates start at zero

Page 47:

Adam (full form)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Momentum

AdaGrad / RMSProp

Bias correction

Bias correction for the fact that first and second moment estimates start at zero

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
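A NumPy sketch of the full Adam update with bias correction (the toy gradient is an assumption):

import numpy as np

def grad(x):
    return np.array([1.0, 100.0]) * x    # toy quadratic loss gradient (assumed)

x = np.array([1.0, 1.0])
m, v = np.zeros_like(x), np.zeros_like(x)
beta1, beta2, learning_rate, eps = 0.9, 0.999, 1e-3, 1e-8
for t in range(1, 1001):
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx          # momentum-like first moment
    v = beta2 * v + (1 - beta2) * dx * dx     # RMSProp-like second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction: moment estimates start at zero
    v_hat = v / (1 - beta2 ** t)
    x -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)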

Page 48:

Adam

(Animation: SGD vs SGD+Momentum vs RMSProp vs Adam)

Page 49:

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

Q: Which one of these learning rates is best to use?

Page 50:

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

=> Learning rate decay over time!

step decay: e.g. decay learning rate by half every few epochs.

exponential decay: α = α_0 e^(-kt)

1/t decay: α = α_0 / (1 + kt)
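A small sketch of these schedules as Python functions (alpha0, k and the step interval are assumed values):

import numpy as np

alpha0, k = 1e-2, 0.1   # initial learning rate and decay rate (assumed)

def step_decay(epoch, every=30):        # halve the rate every `every` epochs
    return alpha0 * (0.5 ** (epoch // every))

def exponential_decay(epoch):           # alpha = alpha0 * exp(-k t)
    return alpha0 * np.exp(-k * epoch)

def one_over_t_decay(epoch):            # alpha = alpha0 / (1 + k t)
    return alpha0 / (1 + k * epoch)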

Page 51:

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

Learning rate decay! (Plot: training loss vs. epoch; the loss drops sharply each time the learning rate is decayed)

Page 52:

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

Learning rate decay! (Plot: training loss vs. epoch; the loss drops sharply each time the learning rate is decayed)

More critical with SGD+Momentum, less common with Adam

Page 53:

First-Order Optimization

(Plot: loss as a function of a single weight w1)

Page 54:

First-Order Optimization

(Plot: loss as a function of a single weight w1)

(1) Use gradient to form linear approximation
(2) Step to minimize the approximation

Page 55:

Second-Order Optimization

(Plot: loss as a function of a single weight w1)

(1) Use gradient and Hessian to form quadratic approximation
(2) Step to the minimum of the approximation

Page 56:

Second-Order Optimization

Second-order Taylor expansion:
J(θ) ≈ J(θ_0) + (θ - θ_0)^T ∇_θ J(θ_0) + ½ (θ - θ_0)^T H (θ - θ_0)

Solving for the critical point we obtain the Newton parameter update:
θ* = θ_0 - H^(-1) ∇_θ J(θ_0)

Q: What is nice about this update?

Page 57:

Second-Order Optimization

Second-order Taylor expansion; solving for the critical point gives the Newton parameter update: θ* = θ_0 - H^(-1) ∇_θ J(θ_0)

Q: What is nice about this update?

No hyperparameters! No learning rate!
(Though you might use one in practice)

Page 58:

Second-Order Optimization

Second-order Taylor expansion; solving for the critical point gives the Newton parameter update: θ* = θ_0 - H^(-1) ∇_θ J(θ_0)

Q2: Why is this bad for deep learning?

Page 59:

Second-Order Optimization

Second-order Taylor expansion; solving for the critical point gives the Newton parameter update: θ* = θ_0 - H^(-1) ∇_θ J(θ_0)

Q2: Why is this bad for deep learning?

Hessian has O(N^2) elements
Inverting takes O(N^3)
N = (Tens or Hundreds of) Millions

Page 60:

Second-Order Optimization

- Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank 1 updates over time (O(n^2) each).

- L-BFGS (Limited memory BFGS): Does not form/store the full inverse Hessian.

Page 61:

L-BFGS

- Usually works very well in full batch, deterministic mode, i.e. if you have a single, deterministic f(x) then L-BFGS will probably work very nicely

- Does not transfer very well to mini-batch setting. Gives bad results. Adapting second-order methods to large-scale, stochastic setting is an active area of research.

Le et al, "On optimization methods for deep learning", ICML 2011
Ba et al, "Distributed second-order optimization using Kronecker-factored approximations", ICLR 2017

Page 62:

In practice:

- Adam is a good default choice in many cases
- SGD+Momentum with learning rate decay often outperforms Adam by a bit, but requires more tuning
- If you can afford to do full batch updates then try out L-BFGS (and don't forget to disable all sources of noise)

Page 63:

Beyond Training Error

Better optimization algorithms help reduce training loss

But we really care about error on new data - how to reduce the gap?

Page 64:

Early Stopping

(Plots: training loss vs. iteration; train and val accuracy vs. iteration, with a "stop training here" marker where val accuracy peaks)

Stop training the model when accuracy on the validation set decreases. Or train for a long time, but always keep track of the model snapshot that worked best on val.

Page 65:

Model Ensembles

1. Train multiple independent models
2. At test time average their results
(Take average of predicted probability distributions, then choose argmax)

Enjoy 2% extra performance

Page 66:

Model Ensembles: Tips and Tricks

Instead of training independent models, use multiple snapshots of a single model during training!

Loshchilov and Hutter, "SGDR: Stochastic gradient descent with restarts", arXiv 2016
Huang et al, "Snapshot ensembles: train 1, get M for free", ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.

Page 67:

Model Ensembles: Tips and Tricks

Instead of training independent models, use multiple snapshots of a single model during training!

Cyclic learning rate schedules can make this work even better!

Loshchilov and Hutter, "SGDR: Stochastic gradient descent with restarts", arXiv 2016
Huang et al, "Snapshot ensembles: train 1, get M for free", ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.

Page 68:

Model Ensembles: Tips and Tricks

Instead of using the actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging)

Polyak and Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on Control and Optimization, 1992.
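A minimal sketch of keeping such a Polyak-style moving average alongside training (the decay value and the toy train_step are assumptions):

import numpy as np

params = np.random.randn(10)    # stand-in for the model's parameter vector
params_ema = params.copy()      # the averaged copy used at test time
ema_decay = 0.995               # assumed decay rate

def train_step(p):
    return p - 1e-2 * p         # stand-in for one optimizer update

for _ in range(1000):
    params = train_step(params)
    # exponential moving average of the parameter vector
    params_ema = ema_decay * params_ema + (1 - ema_decay) * params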

Page 69:

How to improve single-model performance?

Regularization

Page 70:

Regularization: Add term to loss

L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λR(W)

In common use:
L2 regularization: R(W) = Σ_k Σ_l W_{k,l}^2 (weight decay)
L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}^2 + |W_{k,l}|)

Page 71:

Regularization: Dropout

In each forward pass, randomly set some neurons to zero.
Probability of dropping is a hyperparameter; 0.5 is common.

Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014

Page 72:

Regularization: Dropout

Example forward pass with a 3-layer network using dropout
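The code from this slide is not preserved in the transcript; below is a sketch in the same spirit (vanilla dropout with drop probability 0.5; the toy layer sizes are assumptions):

import numpy as np

drop_p = 0.5   # probability of dropping a unit

# assumed toy shapes for a 3-layer net
W1, b1 = np.random.randn(20, 10), np.zeros(20)
W2, b2 = np.random.randn(20, 20), np.zeros(20)
W3, b3 = np.random.randn(5, 20), np.zeros(5)

def train_step(x):
    H1 = np.maximum(0, W1 @ x + b1)
    U1 = np.random.rand(*H1.shape) >= drop_p   # first dropout mask
    H1 *= U1                                    # drop!
    H2 = np.maximum(0, W2 @ H1 + b2)
    U2 = np.random.rand(*H2.shape) >= drop_p   # second dropout mask
    H2 *= U2                                    # drop!
    return W3 @ H2 + b3

out = train_step(np.random.randn(10))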

Page 73:

Regularization: Dropout

How can this possibly be a good idea?

Forces the network to have a redundant representation; prevents co-adaptation of features

(Diagram: features such as "has an ear", "has a tail", "is furry", "has claws", "mischievous look" feed into the cat score; dropout randomly crosses some of them out)

Page 74:

Regularization: Dropout

How can this possibly be a good idea?

Another interpretation:

Dropout is training a large ensemble of models (that share parameters).

Each binary mask is one model

An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks!
Only ~ 10^82 atoms in the universe...

Page 75:

Dropout: Test time

Dropout makes our output random! y = f_W(x, z), where x is the input (image), y is the output (label), and z is a random mask.

Want to "average out" the randomness at test-time: y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

But this integral seems hard …

Page 76:

Dropout: Test time

Want to approximate the integral

Consider a single neuron a with inputs x, y and weights w1, w2: a = w1 x + w2 y

Page 77:

Dropout: Test time

Want to approximate the integral

Consider a single neuron a with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1 x + w2 y

Page 78:

Dropout: Test time

Want to approximate the integral

Consider a single neuron a with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1 x + w2 y
During training we have: E[a] = 1/4 (w1 x + w2 y) + 1/4 (w1 x + 0·y) + 1/4 (0·x + w2 y) + 1/4 (0·x + 0·y) = 1/2 (w1 x + w2 y)

Page 79:

Dropout: Test time

Want to approximate the integral

Consider a single neuron a with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1 x + w2 y
During training we have: E[a] = 1/4 (w1 x + w2 y) + 1/4 (w1 x + 0·y) + 1/4 (0·x + w2 y) + 1/4 (0·x + 0·y) = 1/2 (w1 x + w2 y)

At test time, multiply by dropout probability: a = (1/2)(w1 x + w2 y), matching the training-time expectation

Page 80:

Dropout: Test time

At test time all neurons are active always
=> We must scale the activations so that, for each neuron:
output at test time = expected output at training time

Page 81:

Dropout Summary

drop in forward pass

scale at test time

Page 82:

More common: “Inverted dropout”

test time is unchanged!
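A minimal sketch of inverted dropout, assuming a tiny 2-layer net: the mask is rescaled by 1/(1 - drop_p) during training so the test-time forward pass needs no change:

import numpy as np

drop_p = 0.5
W1, b1 = np.random.randn(20, 10), np.zeros(20)   # assumed toy shapes
W2, b2 = np.random.randn(5, 20), np.zeros(5)

def forward(x, train=True):
    H1 = np.maximum(0, W1 @ x + b1)
    if train:
        mask = (np.random.rand(*H1.shape) >= drop_p) / (1 - drop_p)  # drop and rescale now...
        H1 *= mask
    # ...so at test time this is just the plain forward pass: no extra scaling needed
    return W2 @ H1 + b2

y_train = forward(np.random.randn(10), train=True)
y_test = forward(np.random.randn(10), train=False)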

Page 83:

Regularization: A common pattern

Training: Add some kind of randomness

Testing: Average out randomness (sometimes approximate)

Page 84:

Regularization: A common pattern

Training: Add some kind of randomness

Testing: Average out randomness (sometimes approximate)

Example: Batch Normalization

Training: Normalize using stats from random minibatches

Testing: Use fixed stats to normalize

Page 85:

Regularization: Data Augmentation

Load image and label ("cat") → CNN → Compute loss

This image by Nikita is licensed under CC-BY 2.0

Page 86:

Regularization: Data Augmentation

Load image and label ("cat") → Transform image → CNN → Compute loss

Page 87:

Data Augmentation

Horizontal Flips

Page 88:

Data Augmentation
Random crops and scales

Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
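A minimal NumPy sketch of the training-time crop plus a horizontal flip (the resize-to-random-L step is omitted; shapes are assumed):

import numpy as np

def random_crop_and_flip(img, crop=224):
    # img: (H, W, 3) array with H, W >= crop
    H, W, _ = img.shape
    y = np.random.randint(0, H - crop + 1)   # random top-left corner
    x = np.random.randint(0, W - crop + 1)
    patch = img[y:y + crop, x:x + crop]
    if np.random.rand() < 0.5:               # horizontal flip half the time
        patch = patch[:, ::-1]
    return patch

patch = random_crop_and_flip(np.random.rand(256, 320, 3))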

Page 89:

Data Augmentation
Random crops and scales

Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch

Testing: average a fixed set of crops
ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips

Page 90:

Data Augmentation

Color Jitter

Simple: Randomize contrast and brightness

Page 91:

Data Augmentation

Color Jitter

Simple: Randomize contrast and brightness

More Complex:

1. Apply PCA to all [R, G, B] pixels in training set

2. Sample a “color offset” along principal component directions

3. Add offset to all pixels of a training image

(As seen in [Krizhevsky et al. 2012], ResNet, etc)

Page 92:

Data Augmentation
Get creative for your problem!

Random mix/combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)

Page 93:

Regularization: A common pattern

Training: Add random noise
Testing: Marginalize over the noise

Examples:
Dropout
Batch Normalization
Data Augmentation

Page 94:

Regularization: A common pattern

Training: Add random noise
Testing: Marginalize over the noise

Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect

Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013

Page 95:

Regularization: A common pattern

Training: Add random noise
Testing: Marginalize over the noise

Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling

Graham, “Fractional Max Pooling”, arXiv 2014

Page 96:

Regularization: A common pattern

Training: Add random noise
Testing: Marginalize over the noise

Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth

Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016

Page 97:

Transfer Learning

"You need a lot of data if you want to train/use CNNs"

Page 98:

Transfer Learning

"You need a lot of data if you want to train/use CNNs"

BUSTED

Page 99:

Transfer Learning with CNNs

1. Train on Imagenet
Image
Conv-64, Conv-64, MaxPool
Conv-128, Conv-128, MaxPool
Conv-256, Conv-256, MaxPool
Conv-512, Conv-512, MaxPool
Conv-512, Conv-512, MaxPool
FC-4096, FC-4096, FC-1000

Donahue et al, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", ICML 2014
Razavian et al, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition", CVPR Workshops 2014

Page 100:

Transfer Learning with CNNs

1. Train on Imagenet
Image
Conv-64, Conv-64, MaxPool
Conv-128, Conv-128, MaxPool
Conv-256, Conv-256, MaxPool
Conv-512, Conv-512, MaxPool
Conv-512, Conv-512, MaxPool
FC-4096, FC-4096, FC-1000

2. Small Dataset (C classes)
Same architecture, but FC-1000 is replaced by FC-C.
Freeze these: all layers except the last
Reinitialize this and train: the new FC-C layer

Donahue et al, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", ICML 2014
Razavian et al, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition", CVPR Workshops 2014

Page 101:

Transfer Learning with CNNs

1. Train on Imagenet
Image
Conv-64, Conv-64, MaxPool
Conv-128, Conv-128, MaxPool
Conv-256, Conv-256, MaxPool
Conv-512, Conv-512, MaxPool
Conv-512, Conv-512, MaxPool
FC-4096, FC-4096, FC-1000

2. Small Dataset (C classes)
Same architecture, but FC-1000 is replaced by FC-C.
Freeze these: all layers except the last
Reinitialize this and train: the new FC-C layer

3. Bigger dataset
Freeze these: the early conv layers
Train these: the later layers as well
With bigger dataset, train more layers

Lower learning rate when finetuning; 1/10 of original LR is good starting point

Donahue et al, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", ICML 2014
Razavian et al, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition", CVPR Workshops 2014
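A hedged PyTorch sketch of steps 2-3 above (the slide names no framework; torchvision's pretrained VGG-16, NUM_CLASSES, and the learning rate are assumptions):

import torch
import torch.nn as nn
import torchvision

NUM_CLASSES = 10                                    # placeholder for your C classes
model = torchvision.models.vgg16(pretrained=True)   # 1. weights trained on ImageNet

# 2. Freeze everything, then reinitialize the last FC layer for C classes
for param in model.parameters():
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)  # new layer trains from scratch

# 3. With more data, unfreeze later layers and finetune at roughly 1/10 the original LR
optimizer = torch.optim.SGD(model.classifier.parameters(), lr=1e-3, momentum=0.9)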

Page 102:

Image
Conv-64, Conv-64, MaxPool
Conv-128, Conv-128, MaxPool
Conv-256, Conv-256, MaxPool
Conv-512, Conv-512, MaxPool
Conv-512, Conv-512, MaxPool
FC-4096, FC-4096, FC-1000
(earlier layers are more generic; later layers are more specific)

very little data + very similar dataset: ?
very little data + very different dataset: ?
quite a lot of data + very similar dataset: ?
quite a lot of data + very different dataset: ?

Page 103:

Image
Conv-64, Conv-64, MaxPool
Conv-128, Conv-128, MaxPool
Conv-256, Conv-256, MaxPool
Conv-512, Conv-512, MaxPool
Conv-512, Conv-512, MaxPool
FC-4096, FC-4096, FC-1000
(earlier layers are more generic; later layers are more specific)

very little data + very similar dataset: Use Linear Classifier on top layer
very little data + very different dataset: ?
quite a lot of data + very similar dataset: Finetune a few layers
quite a lot of data + very different dataset: ?

Page 104:

Image
Conv-64, Conv-64, MaxPool
Conv-128, Conv-128, MaxPool
Conv-256, Conv-256, MaxPool
Conv-512, Conv-512, MaxPool
Conv-512, Conv-512, MaxPool
FC-4096, FC-4096, FC-1000
(earlier layers are more generic; later layers are more specific)

very little data + very similar dataset: Use Linear Classifier on top layer
very little data + very different dataset: You're in trouble… Try linear classifier from different stages
quite a lot of data + very similar dataset: Finetune a few layers
quite a lot of data + very different dataset: Finetune a larger number of layers

Page 105:

Transfer learning with CNNs is pervasive… (it's the norm, not an exception)

Object Detection (Fast R-CNN)
Girshick, "Fast R-CNN", ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.

Image Captioning: CNN + RNN
Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Page 106:

Transfer learning with CNNs is pervasive… (it's the norm, not an exception)

Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Girshick, "Fast R-CNN", ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.

Image Captioning: CNN + RNN
Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Page 107:

Transfer learning with CNNs is pervasive… (it's the norm, not an exception)

Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Girshick, "Fast R-CNN", ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.

Image Captioning: CNN + RNN: CNN pretrained on ImageNet; word vectors pretrained with word2vec
Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Page 108:

Takeaway for your projects and beyond:

Have some dataset of interest but it has < ~1M images?

1. Find a very large dataset that has similar data, train a big ConvNet there

2. Transfer learn to your dataset

Deep learning frameworks provide a "Model Zoo" of pretrained models so you don't need to train your own
Caffe: https://github.com/BVLC/caffe/wiki/Model-Zoo
TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision

Page 109:

Summary
- Lots of Batch Normalization variants
- Optimization
  - Momentum, RMSProp, Adam, etc
- Regularization
  - Dropout, etc
- Transfer learning
  - Use this for your projects!

Page 110:

Next time: Deep Learning Software!