
Page 1:

Training Neural Networks: Optimization (update rule)

M. Soleymani

Sharif University of Technology

Fall 2017

Some slides have been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017,

some from Andrew Ng's lectures, “Deep Learning Specialization”, Coursera, 2017,

and a few from Hinton's lectures, “Neural Networks for Machine Learning”, Coursera, 2015.

Page 2:

Outline

• Mini-batch gradient descent

• Gradient descent with momentum

• RMSprop

• Adam

• Learning rate and learning rate decay

• Batch normalization

Page 3:

Reminder: The error surface

• For a linear neuron with a squared error, it is a quadratic bowl.
  – Vertical cross-sections are parabolas.
  – Horizontal cross-sections are ellipses.

• For multi-layer nets the error surface is much more complicated.
  – But locally, a piece of a quadratic bowl is usually a very good approximation.

Page 4:

Mini-batch gradient descent

• Large datasets
  – Divide the dataset into smaller batches, each containing one subset of the main training set.
  – Weights are updated after seeing the training data in each of these batches.

• Vectorization provides efficiency

Page 5:

Gradient descent methods

Batches: $X^{\{1\}}, Y^{\{1\}}$;  $X^{\{2\}}, Y^{\{2\}}$;  …;  $X^{\{m\}}, Y^{\{m\}}$

• Batch size = n (the size of the training set): batch gradient descent
• Batch size = 1: stochastic gradient descent
• Batch size in between (e.g., 32, 64, 128, 256): stochastic mini-batch gradient descent

Notation: n = the whole number of training samples, bs = the size of each batch, m = n/bs = the number of batches.

Page 6:

Mini-batch gradient descent

For epoch = 1, …, k
  For t = 1, …, m  (one mini-batch $X^{\{t\}}, Y^{\{t\}}$ at a time; computation within each batch is vectorized)
    1. Forward propagation on $X^{\{t\}}$:
       $A^{[0]} = X^{\{t\}}$
       For $l = 1, …, L$:  $Z^{[l]} = W^{[l]} A^{[l-1]}$,  $A^{[l]} = f^{[l]}(Z^{[l]})$
       $\hat{Y}_n^{\{t\}} = A_n^{[L]}$
    2. Mini-batch cost:  $J^{\{t\}} = \frac{1}{bs} \sum_{n \in \text{Batch}_t} L\big(\hat{Y}_n^{\{t\}}, Y_n^{\{t\}}\big) + \lambda R(W)$
    3. Backpropagation on $J^{\{t\}}$ to compute the gradients $dW^{[l]}$
    4. For $l = 1, …, L$:  $W^{[l]} = W^{[l]} - \alpha\, dW^{[l]}$

1 epoch: a single pass over all training samples ($X^{\{1\}}, Y^{\{1\}}$ through $X^{\{m\}}, Y^{\{m\}}$).
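As a concrete illustration, here is a minimal NumPy sketch of this loop for a single linear layer with a squared loss and L2 regularization; the model, the loss, and the default values of bs, alpha, and lam are illustrative assumptions, not part of the slides.

```python
import numpy as np

def minibatch_gd(X, Y, bs=64, alpha=0.1, lam=1e-4, epochs=10, seed=0):
    """Mini-batch gradient descent for a linear model Y_hat = W @ X with squared loss."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]                               # number of training samples (columns)
    W = 0.01 * rng.standard_normal((Y.shape[0], X.shape[0]))
    for epoch in range(epochs):                  # 1 epoch = a single pass over all samples
        perm = rng.permutation(n)                # reshuffle before forming the mini-batches
        for t in range(0, n, bs):
            idx = perm[t:t + bs]
            Xb, Yb = X[:, idx], Y[:, idx]        # mini-batch X^{t}, Y^{t}
            Y_hat = W @ Xb                       # forward propagation (vectorized)
            J = np.sum((Y_hat - Yb) ** 2) / Xb.shape[1] + lam * np.sum(W ** 2)  # J^{t}, for monitoring
            dW = 2 * (Y_hat - Yb) @ Xb.T / Xb.shape[1] + 2 * lam * W            # backpropagation
            W -= alpha * dW                      # weight update after each mini-batch
    return W
```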

Page 7:

Gradient descent methods

• Batch gradient descent (batch size = n, the size of the training set):
  – needs to process the whole training set for each weight update.

• Stochastic gradient descent (batch size = 1):
  – does not use the vectorized form and is thus not computationally efficient.

• Stochastic mini-batch gradient descent (e.g., batch size = 32, 64, 128, 256):
  – vectorization, and the fastest learning (for a proper batch size).

Page 8:

Batch size

• Full batch (batch size = N)

• SGD (batch size = 1)

• Mini-batch (e.g., batch size = 128)

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 9:

Choosing mini-batch size

• For small training sets (e.g., n < 2000), use full-batch gradient descent.

• Typical mini-batch sizes for larger training sets: 64, 128, 256, 512.

• Make sure that one batch of training data, together with the forward and backward quantities that need to be cached for it, fits in CPU/GPU memory.

Page 10:

Mini-batch gradient descent: loss-#epoch curve

Page 11:

Mini-batch gradient descent vs. full-batch

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 12:

Problems with gradient descent

• Poor conditioning

• Saddle points and local minima

• Noisy gradients in SGD

Page 13:

Problems with gradient descent

• What if the loss changes quickly in one direction and slowly in another? What does gradient descent do?

Page 14:

Problems with gradient descent

• What if the loss changes quickly in one direction and slowly in another? What does gradient descent do?

• Very slow progress along the shallow dimension, jitter along the steep direction.

Poor conditioning

The loss function has a high condition number: the ratio of the largest to the smallest singular value of the Hessian matrix is large.

Page 15:

Convergence speed of gradient descent

• Learning rate
  – If the learning rate is big, the weights slosh to and fro across the ravine.
  – If the learning rate is too big, this oscillation diverges.
  – What we would like to achieve:
    • Move quickly in directions with small but consistent gradients.
    • Move slowly in directions with big but inconsistent gradients.

• How can we train our neural networks much more quickly?

Page 16:

Optimization: Problems with SGD

• What if the loss function has a local minimum or saddle point?

Page 17:

Optimization: Problems with SGD

• What if the loss function has a local minimum or saddle point?

Zero gradient: gradient descent gets stuck.

Saddle points are much more common in high dimensions.

Page 18:

The problem of local optima

In high-dimensional parameter spaces, most points of zero gradient are not local optima but saddle points.

What about very high-dimensional spaces?

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 19:

The problem of local optima

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 20:

The problem of local optima

• For a function over a very high-dimensional space:
  – If the gradient is zero, then along each direction the function can be either convex-like (curving up) or concave-like (curving down).
  – For the point to be a local optimum, all directions would need to curve the same way (e.g., all convex-like for a local minimum).
    • The chance of that happening in very high dimensions is very small, so most zero-gradient points are saddle points.

Page 21:

Problem of plateaus

• The real problem is not local optima but the plateaus near saddle points.

• A plateau is a region where the derivative stays close to zero for a long time.
  – Progress there is very, very slow.
  – Only eventually can the algorithm find its way off the plateau.

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 22:

Problems with stochastic gradient descent

• Gradients computed by SGD are noisy.

• Training might therefore take a long time.

Page 23:

SGD + Momentum

• $v_{dW} = 0$

• On each iteration:
  – Compute $\nabla_W J$ on the current mini-batch
  – $v_{dW} = \beta\, v_{dW} + (1-\beta)\, \nabla_W J$
  – $W = W - \alpha\, v_{dW}$

• We will show that $v_{dW}$ is a moving average of recent gradients with exponentially decaying weights.

Build up “velocity” as a running mean of gradients, e.g., $\beta = 0.9$ or $0.99$.
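A minimal NumPy sketch of this update; dW (written $\nabla_W J$ above) is assumed to be the gradient already computed on the current mini-batch, and alpha = 0.01 is only a placeholder.

```python
import numpy as np

def momentum_update(W, dW, v, alpha=0.01, beta=0.9):
    """One SGD+momentum step: v is a running, exponentially weighted mean of gradients."""
    v = beta * v + (1.0 - beta) * dW    # build up "velocity"
    W = W - alpha * v                   # step in the averaged direction
    return W, v

# usage: v starts at zero with the same shape as W
# W, v = momentum_update(W, dW, v)
```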

Page 24:

Momentum helps to reduce problems with SGD

Page 25:

Exponentially weighted averages

$v_0 = 0$
$v_1 = 0.9\, v_0 + 0.1\, \theta_1$
…
$v_t = 0.9\, v_{t-1} + 0.1\, \theta_t$

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 26:

Exponentially weighted averages

• $v_0 = 0$

• $v_t = \beta\, v_{t-1} + (1-\beta)\, \theta_t$

(Plot: the resulting averages for $\beta = 0.9$ and for $\beta = 0.98$.)

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 27:

Exponentially weighted averages

• $v_0 = 0$

• $v_t = \beta\, v_{t-1} + (1-\beta)\, \theta_t$

Unrolling the recursion:
$v_t = (1-\beta)\big(\theta_t + \beta\,\theta_{t-1} + \beta^2\theta_{t-2} + \cdots + \beta^{t-1}\theta_1\big)$

Since $(1-\epsilon)^{1/\epsilon} \approx \frac{1}{e}$, we have $0.9^{10} \approx \frac{1}{e}$ and $0.98^{50} \approx \frac{1}{e}$: the weights decay to about $\frac{1}{e}$ after roughly $\frac{1}{1-\beta}$ steps, so $v_t$ is approximately an average over the last $\frac{1}{1-\beta}$ values of $\theta$.

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 28:

Bias correction

• To compute these averages more accurately (correcting the start-up bias):

$v_t^{c} = \frac{v_t}{1-\beta^t}$

• Example ($\beta = 0.98$):
$v_0 = 0$
$v_1 = 0.98\, v_0 + 0.02\, \theta_1 = 0.02\, \theta_1$
$v_2 = 0.98\, v_1 + 0.02\, \theta_2 = 0.0196\, \theta_1 + 0.02\, \theta_2$
$v_2^{c} = \frac{0.0196\, \theta_1 + 0.02\, \theta_2}{1 - 0.98^2} = \frac{0.0196\, \theta_1 + 0.02\, \theta_2}{0.0396}$

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.
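A minimal NumPy sketch of the exponentially weighted average with and without this bias correction; the constant example sequence is only for illustration.

```python
import numpy as np

def ewa(theta, beta=0.98, correct_bias=True):
    """Exponentially weighted averages v_t, optionally bias-corrected by 1 - beta**t."""
    v, out = 0.0, []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1.0 - beta) * x                  # v_t = beta*v_{t-1} + (1-beta)*theta_t
        out.append(v / (1.0 - beta ** t) if correct_bias else v)
    return np.array(out)

theta = 10.0 * np.ones(100)                              # example: a constant signal of 10
print(ewa(theta)[:3])                                    # corrected: [10. 10. 10.]
print(ewa(theta, correct_bias=False)[:3])                # uncorrected: starts near 0.2 and warms up slowly
```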

Page 29:

Bias correction

$v_t^{c} = \frac{v_t}{1-\beta^t}$

• While you are still warming up your estimates, the bias correction can help.

• When t is large enough, the bias correction makes almost no difference.

• In practice, people often don't bother to implement the bias correction.
  – They rather just wait out that initial period, accept a slightly more biased estimate, and go from there.

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 30:

Gradient descent with momentum

• $v_{dW} = 0$

• On each iteration:
  – Compute $dW$ on the current mini-batch
  – $v_{dW} = \beta\, v_{dW} + (1-\beta)\, dW$
  – $W = W - \alpha\, v_{dW}$

• Hyperparameters: $\beta$ and $\alpha$ (typically $\beta = 0.9$)

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 31:

Gradient descent with momentum

• $v_{dW} = 0$

• On each iteration:
  – Compute $dW$ on the current mini-batch
  – $v_{dW} = \beta\, v_{dW} + (1-\beta)\, dW$
  – $W = W - \alpha\, v_{dW}$

• Hyperparameters: $\beta$ and $\alpha$ (typically $\beta = 0.9$)

• An alternative formulation omits the $(1-\beta)$ factor:
  $v_{dW} = \beta\, v_{dW} + dW$
  The learning rate would need to be tuned differently for these two different versions.

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 32:

Gradient descent with momentum

• $v_{dW} = 0$

• On each iteration:
  – Compute $dW$ on the current mini-batch
  – $v_{dW} = \beta\, v_{dW} + dW$
  – $W = W - \alpha\, v_{dW}$

Page 33:

Nesterov Momentum

• Nesterov momentum is a slightly different version of the momentum update.
  – It enjoys stronger theoretical convergence guarantees for convex functions, and in practice it also consistently works slightly better than standard momentum.

Page 34:

Nesterov Momentum

Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004
Sutskever et al., “On the importance of initialization and momentum in deep learning”, ICML 2013

Page 35:

Nesterov Momentum

(Figure: the simple update rule, momentum, and Nesterov momentum compared.)

Page 36:

Nesterov Accelerated Gradient (NAG)

• Nesterov accelerated gradient
  – A little annoying: it requires the gradient at $x_t + \rho v_t$ instead of at $x_t$.
    • We would like to evaluate the gradient and the loss at the same point.
  – With a change of variables, this is resolved (see the sketch below).
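As a sketch, in the common cs231n formulation with velocity $v$, momentum coefficient $\rho$, and learning rate $\alpha$ (the exact notation here is an assumption, since the slide's formulas are not reproduced in this transcript):

$v_{t+1} = \rho\, v_t - \alpha\, \nabla f(x_t + \rho\, v_t)$
$x_{t+1} = x_t + v_{t+1}$

Substituting $\tilde{x}_t = x_t + \rho\, v_t$, the gradient is evaluated at the current iterate:

$v_{t+1} = \rho\, v_t - \alpha\, \nabla f(\tilde{x}_t)$
$\tilde{x}_{t+1} = \tilde{x}_t + v_{t+1} + \rho\,(v_{t+1} - v_t)$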

Page 37:

Nesterov Momentum

Page 38:

RMSprop (Root Mean Square prop)

• How to adjust the scale of each dimension of 𝑑𝑊?

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 39:

RMSprop

• $S_{dW} = 0$

• On iteration t:
  – Compute $dW$ on the current mini-batch
  – $S_{dW} = \beta\, S_{dW} + (1-\beta)\, dW^2$   (element-wise: $dW_i^2 = dW_i \cdot dW_i$)
  – $W = W - \alpha\, \frac{dW}{\sqrt{S_{dW}}}$

$S_{dW}$: an exponentially weighted average of the squares of the derivatives (second-order moment).

Compare Adagrad:  $S_{dW} = S_{dW} + dW^2$  ($S_{dW}$: the sum of the squares of the derivatives), so the effective step size keeps shrinking during training.

Page 40:

RMSprop

• Updates in the vertical (steep) direction are divided by a much larger number, while updates in the horizontal (shallow) direction are divided by a smaller number.

• You can therefore use a larger learning rate $\alpha$.

Page 41:

RMSprop

• $S_{dW} = 0$

• On iteration t:
  – Compute $dW$ on the current mini-batch
  – $S_{dW} = \beta\, S_{dW} + (1-\beta)\, dW^2$
  – $W = W - \alpha\, \frac{dW}{\sqrt{S_{dW}} + \epsilon}$,  with $\epsilon = 10^{-8}$ for numerical stability
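A minimal NumPy sketch of one RMSprop step under these equations; dW is assumed to be the gradient already computed on the current mini-batch, and alpha = 0.001 is only a placeholder.

```python
import numpy as np

def rmsprop_update(W, dW, S, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step: S is an exponentially weighted average of squared gradients."""
    S = beta * S + (1.0 - beta) * dW ** 2         # element-wise square
    W = W - alpha * dW / (np.sqrt(S) + eps)       # per-parameter scaled step
    return W, S

# usage: S starts at zero with the same shape as W
# W, S = rmsprop_update(W, dW, S)
```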

Page 42:

RMSprop

Page 43:

Adam (Adaptive Moment Estimation)

• Combines the RMSprop and momentum ideas.

• $V_{dW} = 0$, $S_{dW} = 0$

• On each iteration:
  – Compute $dW$ on the current mini-batch
  – $V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\, dW$
  – $S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2$
  – $W = W - \alpha\, \frac{V_{dW}}{\sqrt{S_{dW}} + \epsilon}$

Problem: this may cause very large steps at the beginning.
Solution: bias correction, since the first- and second-moment estimates start at zero.

Page 44:

Adam (full form)

• Combines the RMSprop and momentum ideas.

• $V_{dW} = 0$, $S_{dW} = 0$

• On each iteration:
  – Compute $dW$ on the current mini-batch
  – $V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\, dW$
  – $S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2$
  – $V_{dW}^{c} = \frac{V_{dW}}{1-\beta_1^t}$
  – $S_{dW}^{c} = \frac{S_{dW}}{1-\beta_2^t}$
  – $W = W - \alpha\, \frac{V_{dW}^{c}}{\sqrt{S_{dW}^{c}} + \epsilon}$

Page 45:

Adam

• Combines the RMSprop and momentum ideas.

• $V_{dW} = 0$, $S_{dW} = 0$

• On each iteration:
  – Compute $dW$ on the current mini-batch
  – $V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\, dW$   (momentum)
  – $S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2$   (RMSprop)
  – $V_{dW}^{c} = \frac{V_{dW}}{1-\beta_1^t}$,  $S_{dW}^{c} = \frac{S_{dW}}{1-\beta_2^t}$   (bias correction)
  – $W = W - \alpha\, \frac{V_{dW}^{c}}{\sqrt{S_{dW}^{c}} + \epsilon}$

Hyperparameter choice:
  – $\alpha$: needs to be tuned
  – $\beta_1 = 0.9$
  – $\beta_2 = 0.999$
  – $\epsilon = 10^{-8}$
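A minimal NumPy sketch of one Adam step with these default hyperparameters; dW is assumed to be the gradient already computed on the current mini-batch, t is the 1-based iteration counter, and alpha = 0.001 is only a placeholder.

```python
import numpy as np

def adam_update(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (V), RMSprop (S), and bias correction combined."""
    V = beta1 * V + (1.0 - beta1) * dW            # first moment (momentum)
    S = beta2 * S + (1.0 - beta2) * dW ** 2       # second moment (RMSprop)
    V_c = V / (1.0 - beta1 ** t)                  # bias correction
    S_c = S / (1.0 - beta2 ** t)
    W = W - alpha * V_c / (np.sqrt(S_c) + eps)    # parameter update
    return W, V, S

# usage: V and S start at zero with the same shape as W; t = 1, 2, 3, ...
# W, V, S = adam_update(W, dW, V, S, t)
```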

Page 46:

Adam

Page 47:

Example

RMSprop, AdaGrad, and Adam can adaptively tune the size of the update vector (per parameter). They are also known as methods with adaptive learning rates.

Source: http://cs231n.github.io/neural-networks-3/

Page 48:

Example

Source: http://cs231n.github.io/neural-networks-3/

Page 49:

Learning rate

• SGD, SGD+Momentum, Adagrad, RMSProp, and Adam all have the learning rate as a hyperparameter.

• The learning rate is an important hyperparameter that is usually adjusted first.

Page 50:

Which one of these learning rates is best to use?

Page 51:

Choosing learning rate parameter

• Loss not going down: learning rate too low.

• Loss exploding: learning rate too high.

• A rough range of learning rates to cross-validate over is somewhere in [1e-3 … 1e-5].

A cost of NaN almost always means the learning rate is too high.

Page 52:

Learning rate decay

• During the initial steps of learning, you can afford to take much bigger steps.

• But as learning approaches convergence, a slower learning rate allows you to take smaller steps.

Page 53:

Learning rate decay

• With a fixed learning rate, mini-batch gradient descent won't exactly converge; it ends up oscillating around the minimum.

• Slowly reducing the learning rate over time makes it oscillate in a tighter region around this minimum.

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 54:

Learning rate decay over time

• Step decay: decay the learning rate by half every few epochs.

• Exponential decay:
  – $\alpha = \alpha_0\, e^{-\text{decay\_rate} \times \text{epoch\_num}}$,  or, e.g.,  $\alpha = \alpha_0\, 0.95^{\text{epoch\_num}}$

• 1/t decay:
  – $\alpha = \frac{\alpha_0}{1 + \text{decay\_rate} \times \text{epoch\_num}}$
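A minimal sketch of these three schedules as Python functions of the epoch number; alpha0 and the decay constants are placeholders to be tuned, and the step-decay drop factor and interval are illustrative assumptions.

```python
import math

def step_decay(alpha0, epoch_num, drop=0.5, every=10):
    """Decay the learning rate by `drop` every `every` epochs."""
    return alpha0 * (drop ** (epoch_num // every))

def exponential_decay(alpha0, epoch_num, decay_rate=0.1):
    """alpha = alpha0 * exp(-decay_rate * epoch_num)."""
    return alpha0 * math.exp(-decay_rate * epoch_num)

def inverse_time_decay(alpha0, epoch_num, decay_rate=1.0):
    """1/t decay: alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1.0 + decay_rate * epoch_num)
```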

Page 55:

Learning rate

Page 56:

Learning rate

Page 57:

Monitoring loss function during iterations

Page 58:

Track the ratio of weight updates / weight magnitudes:

Ratio between the scale of the updates and the scale of the values: e.g., 0.0002 / 0.02 = 0.01 (about okay). You want this ratio to be somewhere around 0.001 or so.
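A minimal NumPy sketch of this check for a single weight matrix; it assumes plain SGD, so the applied update is -alpha * dW (with another optimizer, use the actual step that was applied).

```python
import numpy as np

def update_to_weight_ratio(W, dW, alpha):
    """Norm of the parameter update divided by the norm of the parameters."""
    update = -alpha * dW                              # the step that would be applied
    return np.linalg.norm(update) / np.linalg.norm(W)

# A ratio around 1e-3 is the usual target: much larger suggests the learning
# rate is too high, much smaller suggests it is too low.
```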

Page 59:

First-order optimization

Page 60:

First-order optimization

Page 61:

Second-order optimization

Page 62:

Second-order optimization

What is nice about this update? No hyperparameters! No learning rate!

Page 63:

Second-order optimization

Why is this bad for deep learning?
– The Hessian has O(N^2) elements.
– Inverting it takes O(N^3).
– N = (tens or hundreds of) millions of parameters.

Page 64:

Second-order optimization: L-BFGS

• Quasi-Newton methods (BFGS is the most popular):
  – Instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).

• L-BFGS (Limited-memory BFGS):
  – Does not form/store the full inverse Hessian.

• L-BFGS usually works very well in full-batch, deterministic mode,
  – i.e., it works very well when you have a single, deterministic cost function.

• But it does not transfer very well to the mini-batch setting.
  – It gives bad results.
  – Adapting L-BFGS to the large-scale, stochastic setting is an active area of research.

Page 65:

In practice

• Adam is a good default choice in most cases

• If you can afford to do full-batch updates, then try out L-BFGS (and don't forget to disable all sources of noise).

Page 66:

Before going to generalization in the next lecture

We now look at an important technique that helps to speed up training while also acting as a form of regularization.

Page 67:

Recall: Normalizing inputs

• On the training set, compute the mean and variance of each input (feature):
  – $\mu_i = \frac{1}{N}\sum_{n=1}^{N} x_i^{(n)}$
  – $\sigma_i^2 = \frac{1}{N}\sum_{n=1}^{N} \big(x_i^{(n)} - \mu_i\big)^2$

• Remove the mean: subtract from each input feature the mean of that feature.
  – $x_i \leftarrow x_i - \mu_i$

• Normalize the variance:
  – $x_i \leftarrow \frac{x_i}{\sigma_i}$
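A minimal NumPy sketch, assuming X holds one training sample per row; the training-set mu and sigma must be reused on the test set.

```python
import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    """Zero-mean, unit-variance normalization per feature, fitted on the training set."""
    mu = X_train.mean(axis=0)                     # per-feature mean
    sigma = X_train.std(axis=0)                   # per-feature standard deviation
    X_train_n = (X_train - mu) / (sigma + eps)
    X_test_n = (X_test - mu) / (sigma + eps)      # reuse the *training* statistics
    return X_train_n, X_test_n, mu, sigma
```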

Page 68:

Batch normalization

• “You want unit Gaussian activations? Just make them so.”

• Can we normalize $a^{[l]}$ too?

• Normalizing $z^{[l]}$ is much more common.

• Makes neural networks more robust to the range of hyperparameters.
  – You can much more easily train a very deep network.

(Diagram: $x \to z^{[1]} \to a^{[1]} \to \cdots \to a^{[L-1]} \to z^{[L]} \to a^{[L]} = \text{output}$, with weights $W^{[l]}$ and activations $f^{[l]}$ at each layer.)

Page 69:

Batch normalization

• Consider a batch of activations at some layer.

• To make each dimension unit Gaussian, apply (b: mini-batch size, i indexes the samples in the batch):

$\mu^{[l]} = \frac{1}{b}\sum_{i} z^{(i)[l]}$

$\sigma^{2[l]} = \frac{1}{b}\sum_{i} \big(z^{(i)[l]} - \mu^{[l]}\big)^2$

$z_{norm}^{(i)[l]} = \frac{z^{(i)[l]} - \mu^{[l]}}{\sqrt{\sigma^{2[l]} + \epsilon}}$

[Ioffe and Szegedy, 2015]

Page 70:

Batch normalization

• To make each dimension unit Gaussian, apply:

$\mu^{[l]} = \frac{1}{b}\sum_{i=1}^{b} z^{(i)[l]}$,  $\sigma^{2[l]} = \frac{1}{b}\sum_{i=1}^{b} \big(z^{(i)[l]} - \mu^{[l]}\big)^2$,  $z_{norm}^{(i)[l]} = \frac{z^{(i)[l]} - \mu^{[l]}}{\sqrt{\sigma^{2[l]} + \epsilon}}$

• Problem: do we necessarily want a unit Gaussian input to an activation layer?

$\tilde{z}^{(i)[l]} = \gamma^{[l]} z_{norm}^{(i)[l]} + \beta^{[l]}$

$\gamma^{[l]}$ and $\beta^{[l]}$ are learnable parameters of the model.
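A minimal NumPy sketch of this forward pass for one layer during training, assuming Z holds the pre-activations $z^{[l]}$ of a mini-batch with one sample per row; gamma and beta are the learnable per-unit parameters.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Batch normalization of pre-activations Z (shape: batch_size x units)."""
    mu = Z.mean(axis=0)                           # per-unit mean over the mini-batch
    var = Z.var(axis=0)                           # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)        # make each unit approximately unit Gaussian
    Z_tilde = gamma * Z_norm + beta               # learnable scale and shift
    return Z_tilde, mu, var
```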

Page 71:

Batch normalization

$z_{norm}^{(i)[l]} = \frac{z^{(i)[l]} - \mu^{[l]}}{\sqrt{\sigma^{2[l]} + \epsilon}}$

• Allow the network to squash the range if it wants to:

$\tilde{z}^{(i)[l]} = \gamma^{[l]} z_{norm}^{(i)[l]} + \beta^{[l]}$

• The network can learn to recover the identity mapping:

$\gamma^{[l]} = \sqrt{\sigma^{2[l]} + \epsilon}$,  $\beta^{[l]} = \mu^{[l]}$  $\Rightarrow$  $\tilde{z}^{(i)[l]} = z^{(i)[l]}$

(Forcing strictly unit-Gaussian inputs would, e.g., keep a tanh unit mostly in its linear area; $\gamma^{[l]}$ and $\beta^{[l]}$ let the network choose the range it wants.)

[Ioffe and Szegedy, 2015]

Page 72:

Batch normalization at test time

• At test time, the BatchNorm layer functions differently:
  – The mean/std are not computed based on the batch.
  – Instead, a single fixed empirical mean/std of the activations seen during training is used (e.g., it can be estimated during training with running averages).

• Running-average estimate of the mean (and similarly for the variance):
  – $\mu^{[l]} = 0$
  – for each mini-batch $i = 1, \ldots, m$:  $\mu^{[l]} \leftarrow \beta\, \mu^{[l]} + (1-\beta)\, \frac{1}{b}\sum_{j=1}^{b} z^{(j)[l]}$   (the sum is over the b samples of mini-batch $i$)
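A minimal NumPy sketch of keeping such running statistics during training and using them at test time; it builds on the batchnorm_forward sketch above, and beta_ewa here is the running-average coefficient, not the learnable shift parameter.

```python
import numpy as np

def update_running_stats(mu_batch, var_batch, mu_run, var_run, beta_ewa=0.9):
    """Exponentially weighted running estimates of the mean/variance seen in training."""
    mu_run = beta_ewa * mu_run + (1.0 - beta_ewa) * mu_batch
    var_run = beta_ewa * var_run + (1.0 - beta_ewa) * var_batch
    return mu_run, var_run

def batchnorm_test_forward(Z, gamma, beta, mu_run, var_run, eps=1e-8):
    """At test time, normalize with the fixed running statistics, not the batch's own."""
    Z_norm = (Z - mu_run) / np.sqrt(var_run + eps)
    return gamma * Z_norm + beta
```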

Page 73:

How does Batch Norm work?

• Learning on shifting input distributions

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 74:

Why is this a problem for neural networks?

Batch norm limits the amount to which updating the weights in earlier layers can affect the distribution of values that the subsequent layers see.

The values become more stable, giving the later layers firmer ground to stand on.

(Diagram: a deep network with weights $W^{[1]}, W^{[2]}, W^{[3]}, W^{[4]}, W^{[5]}$.)

[Andrew Ng, Deep Learning Specialization, 2017]

© 2017 Coursera Inc.

Page 75:

Batch normalization: summary

• Speeds up the learning process
  – Weakens the coupling between the parameters of the earlier and later layers

• Improves gradient flow through the network

• Allows higher learning rates

• Reduces the strong dependence on initialization

• Acts as a form of regularization in a funny way (as we will see)

Page 76:

Resources

• Deep Learning Book, Chapter 8.

• Please see the following note: http://cs231n.github.io/neural-networks-3/