Overview of gradient descent optimization algorithms
HYUNG IL KOO
Based on http://sebastianruder.com/optimizing-gradient-descent/


Page 1: Overview of gradient descent optimization algorithms

HYUNG IL KOO

Based on http://sebastianruder.com/optimizing-gradient-descent/

Page 2: Overview of gradient descent optimization algorithms

Problem Statement

• Machine Learning Optimization Problem
  • Training samples: (x^(i), y^(i))
  • Cost function: J(θ; X, Y) = Σ_i d(f(x^(i); θ), y^(i))

  θ* = argmin_θ J(θ; X, Y)
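
As a concrete illustration (not part of the original slides), here is a minimal Python/NumPy sketch that evaluates J(θ; X, Y) for a hypothetical linear model f(x; θ) = x·θ with a squared-error distance d; the toy data and the function name cost() are illustrative assumptions.

    import numpy as np

    def cost(theta, X, Y):
        predictions = X @ theta                  # f(x^(i); θ) for every sample
        return np.sum((predictions - Y) ** 2)    # Σ_i d(f(x^(i); θ), y^(i)) with squared error

    # Toy data (illustrative): 100 samples, 3 features, known parameters plus noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_theta = np.array([1.0, -2.0, 0.5])
    Y = X @ true_theta + 0.1 * rng.normal(size=100)

    print(cost(true_theta, X, Y))    # near the argmin of J, so small
    print(cost(np.zeros(3), X, Y))   # far from the argmin, so much larger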

Page 3: Overview of gradient descent optimization algorithms

Optimization method

• Gradient Descent
  • The most common way to optimize neural networks
  • Every deep learning library contains implementations of various gradient descent algorithms
  • Minimizes an objective function J(θ), parameterized by a model's parameters θ ∈ ℝ^d, by updating the parameters in the opposite direction of the gradient of the objective function, ∇_θ J(θ), with respect to the parameters

Page 4: Overview of gradient descent optimization algorithms

CONTENTS

Page 5: Overview of gradient descent optimization algorithms

• Gradient descent variants
  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent
• Challenges

Page 6: Overview of gradient descent optimization algorithms

• Gradient descent optimization algorithms
  • Momentum
  • Adaptive Gradient
• Visualization
• Which optimizer to choose?
• Additional strategies for optimizing SGD
  • Shuffling and Curriculum Learning
  • Batch normalization

Page 7: Overview of gradient descent optimization algorithms

GRADIENT DESCENT VARIANTS

Page 8: Overview of gradient descent optimization algorithms

Batch gradient descent

• Computes the gradient of the cost function w.r.t. θ for the entire training dataset (see the sketch below):

  θ_new = θ_old − η ∇_θ J(θ; X, Y)

• Properties
  • Very slow
  • Intractable for datasets that don't fit in memory
  • No online learning
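
A minimal sketch of a batch gradient descent loop under the same hypothetical least-squares setup as in the problem-statement sketch; grad_J() computes the full-dataset gradient ∇_θ J(θ; X, Y), so every single update touches all training samples.

    import numpy as np

    def grad_J(theta, X, Y):
        return 2 * X.T @ (X @ theta - Y)    # gradient of Σ_i (x^(i)·θ − y^(i))² over the whole dataset

    # Toy data (illustrative).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    theta = np.zeros(3)
    eta = 0.001                                       # learning rate η
    for epoch in range(500):
        theta = theta - eta * grad_J(theta, X, Y)     # θ_new = θ_old − η ∇_θ J(θ; X, Y)
    print(theta)                                      # approaches [1.0, -2.0, 0.5]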

Page 9: Overview of gradient descent optimization algorithms

Stochastic gradient descent (SGD)

• Performs a parameter update for each training example x^(i) and label y^(i) (see the sketch below):

  θ_new = θ_old − η ∇_θ J(θ; x^(i), y^(i))

• Properties
  • Faster
  • Online learning
  • Heavy fluctuation
  • Able to jump to new (and potentially better) local minima
  • Complicated convergence (overshooting)
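
A minimal sketch under the same toy assumptions, now performing one update per training example; grad_i() is the single-example gradient ∇_θ J(θ; x^(i), y^(i)).

    import numpy as np

    def grad_i(theta, x_i, y_i):
        return 2 * (x_i @ theta - y_i) * x_i    # gradient for a single example

    # Toy data (illustrative).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    theta = np.zeros(3)
    eta = 0.01                                   # learning rate η
    for epoch in range(50):
        for i in rng.permutation(len(X)):        # shuffle the examples each epoch
            theta = theta - eta * grad_i(theta, X[i], Y[i])   # θ ← θ − η ∇_θ J(θ; x^(i), y^(i))
    print(theta)                                 # fluctuates around [1.0, -2.0, 0.5]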

Page 10: Overview of gradient descent optimization algorithms

SGD fluctuation

Page 11: Overview of gradient descent optimization algorithms

Batch gradient descent vs. SGD

Batch gradient descent
• It converges to the minimum of the basin the parameters are placed in.

Stochastic gradient descent
• It is able to jump to new and potentially better local minima.
• This complicates convergence to the exact minimum, as SGD will keep overshooting.

Page 12: Overview of gradient descent optimization algorithms

Mini-batch (stochastic) gradient descent

• Performs an update for every mini-batch of n training examples:

  θ_new = θ_old − η ∇_θ J(θ; x^(i:i+n), y^(i:i+n))
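
A minimal sketch under the same toy assumptions, updating once per mini-batch of n = 10 shuffled examples.

    import numpy as np

    # Toy data (illustrative).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    theta, eta, n = np.zeros(3), 0.01, 10
    for epoch in range(100):
        order = rng.permutation(len(X))
        for start in range(0, len(X), n):
            idx = order[start:start + n]                        # one mini-batch of n examples
            grad = 2 * X[idx].T @ (X[idx] @ theta - Y[idx])     # ∇_θ J(θ; x^(i:i+n), y^(i:i+n))
            theta = theta - eta * grad                          # θ_new = θ_old − η ∇_θ J
    print(theta)                                                # close to [1.0, -2.0, 0.5]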

Page 13: Overview of gradient descent optimization algorithms

Properties of mini-batch gradient descent

• Compared with SGD
  • It reduces the variance of the parameter updates, which can lead to more stable convergence.
  • It can make use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient.
• Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually employed even when mini-batches are used.

Page 14: Overview of gradient descent optimization algorithms

CHALLENGES

Page 15: Overview of gradient descent optimization algorithms

Challenges

• Choosing a proper learning rate can be difficult.
  • A learning rate that is too small leads to painfully slow convergence.
  • A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
• Learning rate schedules and thresholds
  • They have to be defined in advance and are unable to adapt to a dataset's characteristics (see the sketch after this list).
• The same learning rate applies to all parameter updates.
  • If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
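
To illustrate the point about schedules, a minimal sketch of a hypothetical step-decay schedule: the decay factor and interval are fixed before training and cannot react to the dataset or to how the loss actually evolves.

    def step_decay(initial_eta, epoch, drop=0.5, epochs_per_drop=10):
        """Halve the learning rate every epochs_per_drop epochs (fixed in advance)."""
        return initial_eta * (drop ** (epoch // epochs_per_drop))

    for epoch in (0, 9, 10, 25, 40):
        print(epoch, step_decay(0.1, epoch))   # 0.1, 0.1, 0.05, 0.025, 0.00625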

Page 16: Overview of gradient descent optimization algorithms

Challenges

• Avoiding getting trapped in the numerous suboptimal local minima and saddle points of highly non-convex error functions
  • Dauphin et al. argue that the difficulty arises in fact not from local minima but from saddle points.
  • These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

Page 17: Overview of gradient descent optimization algorithms

MOMENTUM

Page 18: Overview of gradient descent optimization algorithms

Momentum

• One of the main limitations of gradient descent is local minima.
  • When the gradient descent algorithm reaches a local minimum, the gradient becomes zero and the weights converge to a sub-optimal solution.
• A very popular method to avoid local minima is to compute a temporal average of the direction in which the weights have been moving recently.
  • An easy way to implement this is by using an exponential average (see the sketch below):

  v_t = γ v_{t−1} + η ∇_θ J(θ)
  θ_new = θ_old − v_t

• The term γ is called the momentum.
  • The momentum has a value between 0 and 1 (typically 0.9).
• Properties
  • Fast convergence
  • Less oscillation
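
A minimal sketch of the momentum update above; grad_J() stands for any gradient function, and the toy quadratic "ravine" used in the demo is an illustrative assumption.

    import numpy as np

    def momentum_step(theta, v, grad_J, eta=0.01, gamma=0.9):
        v_new = gamma * v + eta * grad_J(theta)   # v_t = γ v_{t−1} + η ∇_θ J(θ)
        return theta - v_new, v_new               # θ_new = θ_old − v_t

    # Toy quadratic ravine (illustrative): J(θ) = 0.5·(100·θ₀² + θ₁²)
    grad = lambda th: np.array([100.0 * th[0], th[1]])
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        theta, v = momentum_step(theta, v, grad, eta=0.005, gamma=0.9)
    print(theta)   # approaches the minimum at [0, 0]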

Page 19: Overview of gradient descent optimization algorithms

Momentum

• Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance).
• The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.

Page 20: Overview of gradient descent optimization algorithms

Momentum

• The momentum term is also useful in spaces with long ravines, characterized by sharp curvature across the ravine and a gently sloping floor.
  • Sharp curvature tends to cause divergent oscillations across the ravine.
  • To avoid this problem, we could decrease the learning rate, but this is too slow.
  • The momentum term filters out the high curvature and allows the effective weight steps to be bigger.
  • It turns out that ravines are not uncommon in optimization problems, so the use of momentum can be helpful in many situations.
• However, a momentum term can hurt when the search is close to the minimum (think of the error surface as a bowl).
  • As the network approaches the bottom of the error surface, it builds enough momentum to propel the weights in the opposite direction, creating an undesirable oscillation that results in slower convergence.

Page 21: Overview of gradient descent optimization algorithms

Smarter Ball?

Page 22: Overview of gradient descent optimization algorithms

NAG (Nesterov accelerated gradient)

• Nesterov accelerated gradient improves on the momentum algorithm.
• It evaluates the gradient at an approximation of the next position of the parameters, θ − γ v_{t−1}:

  v_t = γ v_{t−1} + η ∇_θ J(θ − γ v_{t−1})
  θ_new = θ_old − v_t
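
A minimal sketch of the NAG update above; it differs from plain momentum only in that the gradient is evaluated at the look-ahead point θ − γ v_{t−1} (grad_J() and the toy ravine are illustrative assumptions).

    import numpy as np

    def nag_step(theta, v, grad_J, eta=0.01, gamma=0.9):
        lookahead = theta - gamma * v                 # approximate next position of the parameters
        v_new = gamma * v + eta * grad_J(lookahead)   # v_t = γ v_{t−1} + η ∇_θ J(θ − γ v_{t−1})
        return theta - v_new, v_new                   # θ_new = θ_old − v_t

    # Same toy quadratic ravine as in the momentum sketch.
    grad = lambda th: np.array([100.0 * th[0], th[1]])
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        theta, v = nag_step(theta, v, grad, eta=0.005, gamma=0.9)
    print(theta)   # approaches [0, 0], typically with less overshoot than plain momentum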

Page 23: Overview of gradient descent optimization algorithms

ADAPTIVE GRADIENTS

Page 24: Overview of gradient descent optimization algorithms

Adaptive Gradient Methods

• Let us adapt the learning rate of each parameter, performing larger updates for infrequent and smaller updates for frequent parameters.
• Methods
  • AdaGrad (Adaptive Gradient Method)
  • AdaDelta
  • RMSProp (Root Mean Square Propagation)
  • Adam (Adaptive Moment Estimation)

Page 25: Overview of gradient descent optimization algorithms

Adaptive Gradient Methods

• These methods use a different learning rate for each parameter θ_i ∈ ℝ at every time step t.
• For brevity, we set g_i^(t) to be the gradient of the objective function w.r.t. θ_i ∈ ℝ at time step t:

  θ_i^(t+1) = θ_i^(t) − η · g_i^(t)

• These methods modify the learning rate η at each time step t for every parameter θ_i based on the past gradients that have been computed for θ_i.

Page 26: Overview of gradient descent optimization algorithms

Adagrad

• Adagrad modifies the general learning rate η at each time step t for every parameter θ_i based on the past gradients that have been computed for θ_i:

  θ_i^(t+1) = θ_i^(t) − (η / sqrt(G_t,i + ε)) · g_i^(t)

  where G_t,i = Σ_{k≤t} (g_i^(k))² and g_i^(k) = ∂J(θ)/∂θ_i evaluated at θ^(k).
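
A minimal sketch of the Adagrad update above; each parameter keeps its own accumulator G of squared gradients (grad_J() and the toy problem are illustrative; the demo uses η = 0.1 rather than the common default of 0.01 so the tiny example converges quickly).

    import numpy as np

    def adagrad_step(theta, G, grad_J, eta=0.01, eps=1e-8):
        g = grad_J(theta)
        G_new = G + g ** 2                                    # G_t,i = Σ_{k≤t} (g_i^(k))², per parameter
        theta_new = theta - eta / np.sqrt(G_new + eps) * g    # θ_i ← θ_i − η / sqrt(G_t,i + ε) · g_i^(t)
        return theta_new, G_new

    # Toy quadratic ravine (illustrative): gradients differ by 100× across coordinates.
    grad = lambda th: np.array([100.0 * th[0], th[1]])
    theta, G = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(500):
        theta, G = adagrad_step(theta, G, grad, eta=0.1)
    print(theta)   # both coordinates decay at a similar rate despite the 100× curvature gap
    print(G)       # the accumulator only grows, so the effective learning rate keeps shrinking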

Page 27: Overview of gradient descent optimization algorithms

Adagrad

• Pros
  • It eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01.
• Cons
  • Its accumulation of the squared gradients in the denominator: the accumulated sum keeps growing during training. This causes the learning rate to shrink and eventually become infinitesimally small. The following algorithms aim to resolve this flaw.

Page 28: Overview of gradient descent optimization algorithms

RMSprop

• RMSprop has been developed to resolve Adagrad's diminishing learning rates. It replaces the accumulated sum with an exponentially decaying average of squared gradients:

  λ_t,i = γ λ_{t−1,i} + (1 − γ) (g_i^(t))²

  θ_i^(t+1) = θ_i^(t) − (η / sqrt(λ_t,i + ε)) · g_i^(t)

  η : learning rate, suggested to be set to 0.001
  γ : fixed momentum term
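
A minimal sketch of the RMSprop update above, using the suggested η = 0.001; grad_J() and the toy ravine are illustrative assumptions.

    import numpy as np

    def rmsprop_step(theta, lam, grad_J, eta=0.001, gamma=0.9, eps=1e-8):
        g = grad_J(theta)
        lam_new = gamma * lam + (1 - gamma) * g ** 2            # λ_t,i = γ λ_{t−1,i} + (1 − γ) (g_i^(t))²
        theta_new = theta - eta / np.sqrt(lam_new + eps) * g    # θ_i ← θ_i − η / sqrt(λ_t,i + ε) · g_i^(t)
        return theta_new, lam_new

    # Toy quadratic ravine (illustrative).
    grad = lambda th: np.array([100.0 * th[0], th[1]])
    theta, lam = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(2000):
        theta, lam = rmsprop_step(theta, lam, grad)
    print(theta)   # both coordinates approach 0 (up to small fluctuations); the step size does not decay to zero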

Page 29: Overview of gradient descent optimization algorithms

Adam (Adaptive Moment Estimation)

• Adam keeps an exponentially decaying average of past gradients m_t, similar to momentum, and an exponentially decaying average of past squared gradients v_t:

  m_t,i = β₁ m_{t−1,i} + (1 − β₁) g_i^(t)
  v_t,i = β₂ v_{t−1,i} + (1 − β₂) (g_i^(t))²

  m_t : estimate of the first moment (the mean)
  v_t : estimate of the second moment (the uncentered variance)

• Since m_t and v_t are initialized as vectors of zeros, they are biased towards zero. Adam counteracts these biases by computing bias-corrected first and second moment estimates:

  m̂_t,i = m_t,i / (1 − β₁^t)
  v̂_t,i = v_t,i / (1 − β₂^t)

• This yields the Adam update rule (see the sketch below):

  θ_i^(t+1) = θ_i^(t) − (η / (sqrt(v̂_t,i) + ε)) · m̂_t,i

  β₁ : suggested to be set to 0.9
  β₂ : suggested to be set to 0.999
  ε : suggested to be set to 10⁻⁸
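
A minimal sketch of the Adam update above with the suggested defaults; grad_J() and the toy ravine are illustrative assumptions.

    import numpy as np

    def adam_step(theta, m, v, t, grad_J, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        g = grad_J(theta)
        m_new = beta1 * m + (1 - beta1) * g         # m_t,i = β₁ m_{t−1,i} + (1 − β₁) g_i^(t)
        v_new = beta2 * v + (1 - beta2) * g ** 2    # v_t,i = β₂ v_{t−1,i} + (1 − β₂) (g_i^(t))²
        m_hat = m_new / (1 - beta1 ** t)            # bias-corrected first moment
        v_hat = v_new / (1 - beta2 ** t)            # bias-corrected second moment
        return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m_new, v_new   # Adam update rule

    # Toy quadratic ravine (illustrative).
    grad = lambda th: np.array([100.0 * th[0], th[1]])
    theta, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
    for t in range(1, 3001):                        # t starts at 1 so the bias correction is defined
        theta, m, v = adam_step(theta, m, v, t, grad)
    print(theta)                                    # both coordinates approach 0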

Page 30: Overview of gradient descent optimization algorithms

COMPARISON

Page 31: Overview of gradient descent optimization algorithms

Visualization of algorithms

Page 32: Overview of gradient descent optimization algorithms

Which optimizer to choose?

• RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates.
• Adam slightly outperforms RMSprop towards the end of optimization as gradients become sparser.

Page 33: Overview of gradient descent optimization algorithms

CONCLUSION

Page 34: Overview of gradient descent optimization algorithms

Conclusion

• Three variants of gradient descent, among which mini-batch gradient descent is the most popular.