COMPUTATIONAL STATISTICS
OPTIMISATION
Luca Bortolussi
Department of Mathematics and Geosciences, University of Trieste
Office 238, third floor, [email protected]
Trieste, Winter Semester 2016/2017
OUTLINE
1 STOCHASTIC GRADIENT DESCENT
2 CONJUGATE GRADIENTS
3 NEWTON’S METHODS
BASICS
Consider a twice differentiable function $f : \mathbb{R}^n \to \mathbb{R}$. Its minima are points such that $\nabla f(x) = 0$.

At a minimum $x^*$ of $f$, the Hessian matrix $H_f(x^*)$ is positive semidefinite, i.e. $v^T H_f(x^*) v \geq 0$ for all $v$.

Conversely, if a point $x^*$ is such that (a) $\nabla f(x^*) = 0$ and (b) $H_f(x^*)$ is positive definite, then $x^*$ is a (local) minimum of $f$.

For a quadratic function $f(x) = \frac{1}{2} x^T A x - b^T x + c$, the condition $\nabla f(x) = 0$ reads $Ax - b = 0$. If $A$ is invertible and positive definite, then the point $x^* = A^{-1} b$ is the unique minimum of $f$, as $f$ is then a convex function.
GRADIENT DESCENT
Notation: $x_k$ denotes the sequence of points of the descent, and $g_k = \nabla f(x_k)$. The update is in the direction $p_k$:
$$x_{k+1} = x_k + \eta_k p_k$$

In gradient descent, at a point $x$ we take a step towards $-\nabla f(x)$, hence in the update rule we set $p_k = -g_k$.

In the simplest case, $\eta_k = \eta$ is constant. If $\eta$ is not small enough, we can step over the minimum. If $\eta$ is very small this usually does not happen, but convergence is very slow.

For a quadratic function, we have $p_k = -g_k = b - A x_k$.
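A minimal sketch of fixed-step gradient descent on the quadratic case above; the matrix, vector and hyperparameters are illustrative choices, not from the slides:

```python
# Fixed-step gradient descent for f(x) = 1/2 x^T A x - b^T x.
import numpy as np

def gradient_descent(A, b, x0, eta=0.1, n_steps=200):
    x = x0.copy()
    for _ in range(n_steps):
        g = A @ x - b        # gradient of the quadratic: grad f(x) = A x - b
        x = x - eta * g      # step along p_k = -g_k
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite
b = np.array([1.0, 1.0])
x_min = gradient_descent(A, b, x0=np.zeros(2))
print(x_min, np.linalg.solve(A, b))      # both approach the unique minimum A^{-1} b
```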
STOCHASTIC GRADIENT DESCENT
If the function to minimise has the form
$$f(x) = \sum_{i=1}^N f_i(x),$$
as is the case for ML problems, then we can use stochastic gradient descent, which instead of taking a step along $-g_k$ steps along the direction $-\nabla f_i(x_k)$ for a single observation $i$.

The algorithm iterates over the dataset one or more times, typically shuffling it each time.

As an alternative to single observations, small batches (mini-batches) of observations can be used to improve the method.
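A minimal SGD sketch on a toy least-squares problem, where $f_i(\theta) = \frac{1}{2}(a_i^T \theta - y_i)^2$; the data, learning rate and number of epochs are illustrative assumptions:

```python
# Stochastic gradient descent: one observation per step, dataset shuffled each epoch.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 5))            # one row a_i per observation
theta_true = rng.normal(size=5)
y = A @ theta_true + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)
eta = 0.01
for epoch in range(20):
    order = rng.permutation(len(y))       # shuffle the dataset at each pass
    for i in order:
        g_i = (A[i] @ theta - y[i]) * A[i]   # gradient of a single f_i
        theta -= eta * g_i
print(np.linalg.norm(theta - theta_true))    # small: close to the true parameters
```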
SGD, CROSS ENTROPY, AND MINI-BATCHES
The cross-entropy between distributions $p$ and $q$ is:
$$H(p, q) = H(p) + D_{KL}(p \,\|\, q) = -\sum_x p(x) \log q(x)$$

The empirical distribution $p_{emp}$ of the dataset $(x_i, y_i)$ gives each observed point probability $1/N$, for $N$ total points.

Maximizing the log-likelihood is the same as minimizing the cross-entropy between the empirical distribution and the distribution predicted by the model. Calling the loss function $L(f(x_i, \theta), y_i) = -\log p(y_i \mid x_i, \theta)$, this is
$$H(p_{emp}, p) = \frac{1}{N} \sum_i L(f(x_i, \theta), y_i)$$
SGD, CROSS ENTROPY, AND MINI-BATCHES
The cross-entropy between the model and the true data distribution is
$$H(p_{data}, p) = \mathbb{E}_{(x,y) \sim p_{data}}\left[ L(f(x, \theta), y) \right]$$

If we sample $N$ points from $p_{data}$, then $H(p_{data}, p)$ is approximated by $H(p_{emp}, p)$ in a statistical sense. Similarly, the gradient $\nabla_\theta H(p_{data}, p)$ can be approximated by $\nabla_\theta H(p_{emp}, p)$.

We can therefore see the use of a mini-batch of size $m$ (with a single pass over the data) in SGD as a statistical approximation of the gradient $\nabla_\theta H(p_{data}, p)$: hence we are minimizing the generalization error.

For very big data, we may not even use all data points in training.
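A small numerical sketch of this approximation on a toy least-squares model (all data and sizes are illustrative assumptions): mini-batch gradients are noisy but unbiased estimates of the full-data gradient.

```python
# Mini-batch gradients approximate the full gradient in expectation.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(10000, 5))
theta_true = rng.normal(size=5)
y = A @ theta_true + 0.1 * rng.normal(size=10000)

theta = np.zeros(5)
def grad(idx):                     # gradient of the mean squared loss on a subset
    r = A[idx] @ theta - y[idx]
    return A[idx].T @ r / len(idx)

full = grad(np.arange(len(y)))
batches = [grad(rng.choice(len(y), size=100, replace=False)) for _ in range(500)]
print(np.linalg.norm(np.mean(batches, axis=0) - full))   # small: unbiased estimate
```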
SGD AND LEARNING RATE
The learning rate $\eta$ of the SGD algorithm cannot be kept fixed across iterations: with a constant rate the algorithm would not converge, due to the noisy evaluations of the gradient. Hence $\eta_k$ must depend on the iteration $k$.

A sufficient condition for convergence of SGD is:
$$\sum_{k=1}^{\infty} \eta_k = \infty \qquad \text{and} \qquad \sum_{k=1}^{\infty} \eta_k^2 < \infty$$

Typically, one sets
$$\eta_k = \left(1 - \frac{k}{\tau}\right) \eta_0 + \frac{k}{\tau}\, \eta_\tau \quad \text{for } k \leq \tau,$$
keeping $\eta_k = \eta_\tau$ constant afterwards. Here $\tau$ equals the number of iterations in a few epochs of the algorithm (epoch = one pass over the dataset). For deep (i.e. very complex) models, $\tau \approx 100$. Furthermore, $\eta_\tau \approx 0.01\, \eta_0$.

The choice of $\eta_0$ is delicate: too large and the algorithm may diverge, too small and learning may take forever. Strategy: monitor the first 50-100 iterations (plot the estimated cost function, computed on the same minibatch used for the gradient) and find an "optimal" $\eta_0$; then choose a slightly larger one, avoiding instabilities.
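A sketch of this linear decay schedule (the default values are just the slide's suggestions):

```python
# Linear learning-rate decay from eta0 to eta_tau over tau iterations.
def learning_rate(k, eta0=0.1, tau=100, eta_tau=None):
    """Decay linearly for k <= tau, then stay constant at eta_tau."""
    if eta_tau is None:
        eta_tau = 0.01 * eta0          # the slide's suggestion: eta_tau ~ 0.01 eta0
    if k >= tau:
        return eta_tau
    frac = k / tau
    return (1 - frac) * eta0 + frac * eta_tau

print([round(learning_rate(k), 4) for k in (0, 50, 100, 200)])
```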
MOMENTUM AND NESTEROV-MOMENTUM
Momentum introduces memory in the gradient, by averaging the current value with previous ones.
The momentum method (Polyak, 1964) accelerates learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. It accumulates an exponentially decaying moving average of past gradients and continues to move in their direction, introducing a velocity variable $v$. The update rule is:
$$v \leftarrow \alpha v - \eta\, \nabla_\theta \frac{1}{m} \sum_{i=1}^m L\big(f(x^{(i)}; \theta), y^{(i)}\big), \qquad \theta \leftarrow \theta + v$$
A hyperparameter $\alpha \in [0, 1)$ determines how quickly the contributions of previous gradients decay.

Momentum addresses two problems: poor conditioning of the Hessian and variance in the stochastic gradient. A poorly conditioned quadratic objective looks like a long, narrow valley with steep sides: momentum traverses the valley lengthwise, while plain gradient steps waste time moving back and forth across its narrow axis.
If all gradients are aligned, momentum accelerates the descent, multiplying the step size by $1/(1 - \alpha)$. Typical values are $\alpha = 0.5$, $0.9$ or $0.99$.

We can see the algorithm as a physical system subject to Newtonian forces and evolving in continuous time: the cost function is taken as a potential, modulated by $\eta$, and the momentum term corresponds to viscous friction (a force proportional to velocity). The initial velocity is equal to the initial gradient.
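A minimal SGD-with-momentum sketch on the toy least-squares problem used earlier (data and hyperparameters are illustrative):

```python
# SGD with momentum: the velocity is a decaying average of past gradients.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = A @ theta_true + 0.1 * rng.normal(size=1000)

theta, v = np.zeros(5), np.zeros(5)
eta, alpha, m = 0.01, 0.9, 32
for epoch in range(10):
    order = rng.permutation(len(y))
    for start in range(0, len(y), m):
        idx = order[start:start + m]
        g = A[idx].T @ (A[idx] @ theta - y[idx]) / len(idx)  # minibatch gradient
        v = alpha * v - eta * g       # accumulate velocity
        theta = theta + v             # move along the velocity
print(np.linalg.norm(theta - theta_true))
```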
MOMENTUM AND NESTEROV-MOMENTUM
With momentum, the size of the step no longer depends only on the norm of the current gradient: it depends on how large and how aligned a sequence of gradients is, and is largest when many successive gradients point in exactly the same direction. If the algorithm always observes the same gradient $g$, it accelerates in the direction of $-g$ until reaching a terminal velocity, where the size of each step is
$$\frac{\eta\, \|g\|}{1 - \alpha}$$
It is thus helpful to think of the momentum hyperparameter in terms of $1/(1 - \alpha)$: for example, $\alpha = 0.9$ corresponds to multiplying the maximum speed by 10 relative to plain gradient descent.

Like the learning rate, $\alpha$ may also be adapted over time, typically starting from a small value and later raised; adapting $\alpha$ is less important than shrinking $\eta$ over time.

In the physical analogy, the position of the particle at time $t$ is $\theta(t)$, and the net force causes an acceleration $f(t) = \frac{\partial^2}{\partial t^2} \theta(t)$. Introducing the velocity $v(t) = \frac{\partial}{\partial t} \theta(t)$ rewrites this second-order Newtonian dynamics as a first-order differential equation.
MOMENTUM AND NESTEROV-MOMENTUM
Nesterov momentum evaluates the gradient at an intermediate point, adding a correction factor to the standard momentum method.
The update rules (Sutskever et al., 2013, inspired by Nesterov's accelerated gradient method) are:
$$v \leftarrow \alpha v - \eta\, \nabla_\theta \frac{1}{m} \sum_{i=1}^m L\big(f(x^{(i)}; \theta + \alpha v), y^{(i)}\big), \qquad \theta \leftarrow \theta + v$$
where $\alpha$ and $\eta$ play the same role as in the standard momentum method. The difference lies in where the gradient is evaluated: with Nesterov momentum the gradient is evaluated after the current velocity is applied, i.e. at the interim point $\tilde{\theta} = \theta + \alpha v$.

In the convex batch gradient case, Nesterov momentum improves the rate of convergence of the excess error from $O(1/k)$ (after $k$ steps) to $O(1/k^2)$ (Nesterov, 1983). Unfortunately, in the stochastic gradient case, it does not improve the rate of convergence.
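A sketch of the Nesterov variant on the same toy problem (illustrative data and hyperparameters); the only change from plain momentum is where the gradient is evaluated:

```python
# SGD with Nesterov momentum: gradient taken at the look-ahead point theta + alpha*v.
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = A @ theta_true + 0.1 * rng.normal(size=1000)

theta, v = np.zeros(5), np.zeros(5)
eta, alpha, m = 0.01, 0.9, 32
for epoch in range(10):
    order = rng.permutation(len(y))
    for start in range(0, len(y), m):
        idx = order[start:start + m]
        interim = theta + alpha * v                             # interim update
        g = A[idx].T @ (A[idx] @ interim - y[idx]) / len(idx)   # gradient at interim
        v = alpha * v - eta * g
        theta = theta + v
print(np.linalg.norm(theta - theta_true))
```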
INITIALIZATION OF THE OPTIMISATION ALGORITHM
The initial point of the optimization algorithm is crucial for convergence, especially in high dimensions. If we start in the basin of attraction of a good minimum, or in an area where the cost function behaves well, then SGD will work fine; otherwise it will not. We can randomise the initial conditions and try the optimisation several times.

If we have some extra information about the solution, we had better incorporate it. As a general rule, always use asymmetric initial conditions (especially if the model has symmetries: see neural networks).

Sample from a (zero-mean) Gaussian or a uniform distribution. The range is important: too large may result in instabilities, too small may introduce too little variation. Good heuristics depend on the model to learn.
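An illustrative sketch of random initialization with restarts; the toy objective, the scale and the number of restarts are all assumptions for the demo, not prescriptions from the slides:

```python
# Zero-mean Gaussian initialization plus random restarts, keeping the best run.
import numpy as np

rng = np.random.default_rng(5)

def random_init(n_params, scale=0.1):
    # asymmetric random draws break symmetries between parameters
    return scale * rng.normal(size=n_params)

def run_gd(x0, steps=200, eta=0.1):
    # toy objective f(x) = ||x - 1||^2, minimised by plain gradient descent
    x = x0.copy()
    for _ in range(steps):
        x -= eta * 2 * (x - 1.0)
    return x, float(np.sum((x - 1.0) ** 2))

best = min((run_gd(random_init(3)) for _ in range(5)), key=lambda r: r[1])
print(best)   # (parameters, cost) of the best of five random starts
```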
ADAPTIVE LEARNING RATE: ADAGRAD
Adaptive methods introduce a different learning rate for each parameter, and modify the rates to take the curvature of the search space into account.

AdaGrad scales each learning rate inversely proportional to the square root of the sum of the historical squared values of the corresponding gradient component. It is a good choice for convex problems.
The AdaGrad algorithm, given a global learning rate $\eta$, a small constant $\delta$ (perhaps $10^{-7}$) for numerical stability, and the gradient accumulation variable initialized to $r = 0$, repeats until a stopping criterion is met:

1. Sample a minibatch of $m$ examples $\{x^{(1)}, \ldots, x^{(m)}\}$ with targets $y^{(i)}$, and compute the gradient $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$.
2. Accumulate the squared gradient: $r \leftarrow r + g \odot g$.
3. Compute the update $\Delta\theta \leftarrow -\dfrac{\eta}{\delta + \sqrt{r}} \odot g$ (division and square root applied element-wise) and apply $\theta \leftarrow \theta + \Delta\theta$.
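A sketch of AdaGrad on the toy least-squares problem (illustrative data and hyperparameters): per-parameter rates shrink as squared gradients accumulate.

```python
# AdaGrad: learning rate scaled by the inverse square root of accumulated g^2.
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = A @ theta_true + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)
r = np.zeros(5)                      # accumulated squared gradients
eta, delta, m = 0.5, 1e-7, 32
for epoch in range(10):
    order = rng.permutation(len(y))
    for start in range(0, len(y), m):
        idx = order[start:start + m]
        g = A[idx].T @ (A[idx] @ theta - y[idx]) / len(idx)
        r += g * g                               # accumulate the whole history
        theta -= eta / (delta + np.sqrt(r)) * g  # element-wise scaled step
print(np.linalg.norm(theta - theta_true))
```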
ADAPTIVE LEARNING RATE: RMSPROP
RMSProp performs better than AdaGrad in non-convex settings, by changing the gradient accumulation into an exponentially weighted moving average.
The RMSProp algorithm, given a global learning rate $\eta$, a decay rate $\rho$, a small constant $\delta$ (usually $10^{-6}$) to stabilize division by small numbers, and the accumulation variable initialized to $r = 0$, repeats until a stopping criterion is met:

1. Sample a minibatch of $m$ examples with targets $y^{(i)}$, and compute the gradient $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$.
2. Accumulate the squared gradient: $r \leftarrow \rho\, r + (1 - \rho)\, g \odot g$.
3. Compute the update $\Delta\theta \leftarrow -\dfrac{\eta}{\sqrt{\delta + r}} \odot g$ (applied element-wise) and apply $\theta \leftarrow \theta + \Delta\theta$.

The exponentially decaying average discards history from the extreme past, so RMSProp can converge rapidly after finding a convex bowl, as if it were an instance of AdaGrad initialized within that bowl. Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks.
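A sketch of RMSProp on the same toy problem (illustrative data and hyperparameters); the only change from AdaGrad is the moving average in place of the sum.

```python
# RMSProp: exponentially weighted moving average of squared gradients.
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = A @ theta_true + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)
r = np.zeros(5)
eta, rho, delta, m = 0.05, 0.9, 1e-6, 32
for epoch in range(10):
    order = rng.permutation(len(y))
    for start in range(0, len(y), m):
        idx = order[start:start + m]
        g = A[idx].T @ (A[idx] @ theta - y[idx]) / len(idx)
        r = rho * r + (1 - rho) * g * g          # moving average, not a sum
        theta -= eta / np.sqrt(delta + r) * g
print(np.linalg.norm(theta - theta_true))
```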
ADAPTIVE LEARNING RATE: RMSPROP
RMSProp can also be combined with Nesterov momentum. The decay rate $\rho$ is an extra hyperparameter controlling the length scale of the moving average.
RMSProp with Nesterov momentum, given a global learning rate $\eta$, decay rate $\rho$, momentum coefficient $\alpha$, initial velocity $v$, and accumulation variable $r = 0$, repeats until a stopping criterion is met:

1. Sample a minibatch of $m$ examples with targets $y^{(i)}$, compute the interim update $\tilde{\theta} \leftarrow \theta + \alpha v$, and compute the gradient there: $g \leftarrow \frac{1}{m} \nabla_{\tilde{\theta}} \sum_i L(f(x^{(i)}; \tilde{\theta}), y^{(i)})$.
2. Accumulate the squared gradient: $r \leftarrow \rho\, r + (1 - \rho)\, g \odot g$.
3. Update the velocity $v \leftarrow \alpha v - \dfrac{\eta}{\sqrt{r}} \odot g$ (element-wise) and apply $\theta \leftarrow \theta + v$.

The choice among these optimizers is largely a matter of the user's familiarity with the algorithm, for ease of hyperparameter tuning: no single best adaptive method has emerged.
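A compact sketch combining the two ingredients (illustrative; the small constant added inside the square root is a safety choice of this demo, not part of the algorithm as stated above):

```python
# RMSProp + Nesterov momentum: gradient at the interim point, rescaled by r.
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = A @ theta_true + 0.1 * rng.normal(size=1000)

theta, v, r = np.zeros(5), np.zeros(5), np.zeros(5)
eta, rho, alpha, m = 0.05, 0.9, 0.9, 32
for epoch in range(10):
    order = rng.permutation(len(y))
    for start in range(0, len(y), m):
        idx = order[start:start + m]
        interim = theta + alpha * v
        g = A[idx].T @ (A[idx] @ interim - y[idx]) / len(idx)
        r = rho * r + (1 - rho) * g * g
        v = alpha * v - eta / np.sqrt(r + 1e-8) * g   # 1e-8 added for safety
        theta = theta + v
print(np.linalg.norm(theta - theta_true))
```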
ADAPTIVE LEARNING RATE: ADAM
Adam integrates RMSProp with momentum, and introduces bias corrections for the first- and second-order moment estimates. It is quite stable w.r.t. its hyperparameters.
The Adam algorithm ("adaptive moments"; Kingma and Ba, 2014). Suggested defaults: step size $\eta = 0.001$; exponential decay rates for the moment estimates $\rho_1 = 0.9$ and $\rho_2 = 0.999$, both in $[0, 1)$; small constant $\delta = 10^{-8}$ for numerical stabilization. Initialize the moment variables $s = 0$, $r = 0$ and the time step $t = 0$; then repeat until a stopping criterion is met:

1. Sample a minibatch of $m$ examples with targets $y^{(i)}$, compute the gradient $g \leftarrow \frac{1}{m} \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)})$, and set $t \leftarrow t + 1$.
2. Update the biased moment estimates: $s \leftarrow \rho_1 s + (1 - \rho_1)\, g$ and $r \leftarrow \rho_2 r + (1 - \rho_2)\, g \odot g$.
3. Correct the bias: $\hat{s} \leftarrow \dfrac{s}{1 - \rho_1^t}$, $\hat{r} \leftarrow \dfrac{r}{1 - \rho_2^t}$.
4. Compute the update $\Delta\theta = -\eta\, \dfrac{\hat{s}}{\sqrt{\hat{r}} + \delta}$ (operations applied element-wise) and apply $\theta \leftarrow \theta + \Delta\theta$.

The bias corrections account for the initialization of $s$ and $r$ at the origin. RMSProp also uses an (uncentered) second-order moment estimate, but lacks the correction factor, so its estimate may have high bias early in training.
DIGRESSION: REGULARIZATION BY EARLY STOPPING
One of the most used regularization strategies, particularly for deep models, is early stopping. The idea is that, for a complex model (prone to overfitting), the best generalization is not found at an optimum of the training cost: a better solution can be found along the trajectory leading to it.

One uses a validation dataset to check how the validation error evolves during optimization, and stops at a minimum of the validation curve. Training time is thus treated as a hyperparameter.
[Figure 7.3 of the Deep Learning book: learning curves (negative log-likelihood vs. epochs) for a maxout network trained on MNIST. The training set loss decreases consistently over time, but the validation set loss eventually begins to increase again, forming an asymmetric U-shaped curve.]

This behavior occurs very reliably when training large models with sufficient capacity to overfit. It means we can obtain a model with better validation error (and thus, hopefully, better test error) by returning to the parameter setting with the lowest validation error: every time the validation error improves, we store a copy of the model parameters, and when the training algorithm terminates we return these parameters rather than the latest ones.
DIGRESSION: REGULARIZATION BY EARLY STOPPING
One can show that early stopping, for linear models, has a similar effect to $L^2$ regularization. Early stopping is a very cheap form of regularization.
[Figure 7.4 of the Deep Learning book: (left) the SGD trajectory, starting from the origin, stops at a point $\tilde{w}$ before reaching the minimum $w^*$ of the negative log-likelihood; (right) for comparison, the $L^2$ penalty moves the minimum of the total cost from $w^*$ towards the origin.]

To see the connection, we study the trajectory followed by the parameter vector during training. For simplicity, set the initial parameter vector to the origin, $w^{(0)} = 0$ (for neural networks this conflicts with symmetry breaking, but the argument holds for any other initial value). Analyze gradient descent on the quadratic approximation $\hat{J}$ of the cost around its minimum $w^*$, with Hessian $H$:
$$w^{(\tau)} = w^{(\tau-1)} - \eta\, \nabla_w \hat{J}(w^{(\tau-1)}) = w^{(\tau-1)} - \eta H \big(w^{(\tau-1)} - w^*\big)$$
$$w^{(\tau)} - w^* = (I - \eta H)\big(w^{(\tau-1)} - w^*\big)$$
Rewriting this in the space of the eigenvectors of $H$, with the eigendecomposition $H = Q \Lambda Q^T$ ($\Lambda$ diagonal, $Q$ an orthonormal basis of eigenvectors):
$$Q^T \big(w^{(\tau)} - w^*\big) = (I - \eta \Lambda)\, Q^T \big(w^{(\tau-1)} - w^*\big)$$
DIGRESSION: REGULARIZATION BY EARLY STOPPING
The early stopping meta-algorithm (Algorithm 7.1 of the Deep Learning book) determines the best amount of time to train; it is a general strategy that works with a variety of training algorithms and ways of quantifying validation error.

Let $n$ be the number of steps between evaluations, and $p$ the "patience": the number of times to observe worsening validation error before giving up. Starting from the initial parameters $\theta_o$, set $\theta \leftarrow \theta_o$, $i \leftarrow 0$, $j \leftarrow 0$, $v \leftarrow \infty$, $\theta^* \leftarrow \theta$, $i^* \leftarrow i$. While $j < p$:

1. Update $\theta$ by running the training algorithm for $n$ steps, and set $i \leftarrow i + n$.
2. Compute $v' \leftarrow \text{ValidationSetError}(\theta)$.
3. If $v' < v$, record the best point so far ($j \leftarrow 0$, $\theta^* \leftarrow \theta$, $i^* \leftarrow i$, $v \leftarrow v'$); otherwise set $j \leftarrow j + 1$.

The best parameters are $\theta^*$, and the best number of training steps is $i^*$.

Early stopping is probably the most commonly used form of regularization in deep learning, popular for both its effectiveness and its simplicity. One way to think of it is as a very efficient hyperparameter selection algorithm, in which the number of training steps is just another hyperparameter with a U-shaped validation performance curve.
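A compact sketch of this meta-algorithm; the training and validation callbacks, chunk size and patience in the toy usage are illustrative assumptions:

```python
# Early stopping: train in chunks of n steps, keep the parameters with the
# lowest validation error, stop after p evaluations without improvement.
import copy
import random

def early_stopping(train_n_steps, validation_error, theta0, n=100, p=5):
    theta = copy.deepcopy(theta0)
    best_theta, best_err, best_i = copy.deepcopy(theta), float("inf"), 0
    i = j = 0
    while j < p:
        theta = train_n_steps(theta, n)     # run the optimiser for n more steps
        i += n
        err = validation_error(theta)
        if err < best_err:                  # validation improved: record and reset
            best_theta, best_err, best_i, j = copy.deepcopy(theta), err, i, 0
        else:                               # no improvement: lose patience
            j += 1
    return best_theta, best_i

# Toy usage: "training" shrinks a scalar towards 0; validation error is noisy.
random.seed(0)
step = lambda th, n: th * (0.9 ** n)
val = lambda th: abs(th) + 0.01 * random.random()
print(early_stopping(step, val, 2.0, n=5, p=3))
```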
OUTLINE
1 STOCHASTIC GRADIENT DESCENT
2 CONJUGATE GRADIENTS
3 NEWTON’S METHODS
GRADIENT DESCENT
Notation: $x_k$ denotes the sequence of points of the descent, and $g_k = \nabla f(x_k)$. The update is in the direction $p_k$:
$$x_{k+1} = x_k + \eta_k p_k$$

In gradient descent, at a point $x$ we take a step towards $-\nabla f(x)$, hence in the update rule we set $p_k = -g_k$.

In the simplest case, $\eta_k = \eta$ is constant. If $\eta$ is not small enough, we can step over the minimum. If $\eta$ is very small this usually does not happen, but convergence is very slow.

For a quadratic function, we have $p_k = -g_k = b - A x_k$.
GRADIENT DESCENT WITH LINE SEARCH
One possibility to improve gradient descent is to take the best step possible, i.e. set $\eta_k$ to the value minimising the function $f(x_k + \lambda p_k)$ along the line with direction $p_k$.

The minimum is obtained by solving for $\lambda$ the equation
$$\nabla f(x_k + \lambda p_k)^T p_k = g_{k+1}^T p_k = 0$$
and setting $\eta_k$ to this solution.

For a quadratic function, the best learning rate is given by
$$\eta_k = \frac{(b - A x_k)^T p_k}{p_k^T A p_k}$$
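A sketch of steepest descent with this exact line search on a quadratic (matrix and vector are illustrative):

```python
# Steepest descent with exact line search on f(x) = 1/2 x^T A x - b^T x.
import numpy as np

def steepest_descent_exact(A, b, x0, n_steps=50):
    x = x0.copy()
    for _ in range(n_steps):
        p = b - A @ x                    # p_k = -g_k
        eta = (p @ p) / (p @ (A @ p))    # (b - A x_k)^T p_k / (p_k^T A p_k)
        x = x + eta * p
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(steepest_descent_exact(A, b, np.zeros(2)), np.linalg.solve(A, b))
```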
CONJUGATE GRADIENTS
Consider a quadratic minimisation problem. If the matrix $A$ were diagonal, we could solve separately $n$ different one-dimensional optimisation problems.

We can change coordinates by a matrix $P$ that diagonalises the quadratic form, i.e. such that $P^T A P$ is diagonal. By letting $x = P y$, we can rewrite the function $f(x)$ as
$$f(y) = \frac{1}{2} y^T P^T A P\, y - b^T P y + c$$

The columns of $P$ are called conjugate vectors and satisfy $p_i^T A p_j = 0$ for $i \neq j$ and $p_i^T A p_i > 0$. They are linearly independent and are very good directions to follow in a descent method.
CONJUGATE GRADIENTS
To construct conjugate vectors, we can use the Gram-Schmidt conjugation procedure: if $v$ is linearly independent of $p_1, \ldots, p_k$, then
$$p_{k+1} = v - \sum_{j=1}^{k} \frac{p_j^T A v}{p_j^T A p_j}\, p_j$$

We can start from a basis and construct the conjugate vectors $p_1, \ldots, p_n$.

In the conjugate directions algorithm, we take step $k+1$ along $p_{k+1}$. The best $\eta_k$, according to line search, is
$$\eta_k = \frac{-p_k^T g_k}{p_k^T A p_k}$$

It holds that $\nabla f(x_{k+1})^T p_i = 0$ for all $i = 1, \ldots, k$ (Luenberger's expanding subspace theorem).
CONJUGATE GRADIENTS
The conjugate gradients method constructs the $p_k$'s on the fly. It works well also for non-quadratic problems; for quadratic problems it converges in at most $n$ steps.

A good choice for a linearly independent vector $v$ at step $k+1$, used to construct $p_{k+1}$, is thus $\nabla f(x_{k+1})$. In this case, after some algebra, we can compute:
$$\eta_{k+1} = \frac{g_{k+1}^T g_{k+1}}{p_{k+1}^T A p_{k+1}}, \qquad p_{k+1} = -g_{k+1} + \beta_k p_k$$
with
$$\beta_k = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k} \quad \text{or} \quad \beta_k = \frac{g_{k+1}^T (g_{k+1} - g_k)}{g_k^T g_k}$$
known as the Fletcher-Reeves and Polak-Ribière formulae, respectively (the latter preferable for non-quadratic problems).
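A sketch of conjugate gradients for a quadratic, using the Fletcher-Reeves formula (matrix and vector are illustrative):

```python
# Conjugate gradients for f(x) = 1/2 x^T A x - b^T x.
import numpy as np

def conjugate_gradients(A, b, x0, tol=1e-10):
    x = x0.copy()
    g = A @ x - b                 # gradient
    p = -g                        # first direction: steepest descent
    for _ in range(len(b)):       # at most n steps on a quadratic
        eta = (g @ g) / (p @ (A @ p))
        x = x + eta * p
        g_new = A @ x - b
        if np.linalg.norm(g_new) < tol:
            break
        beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves
        p = -g_new + beta * p
        g = g_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(conjugate_gradients(A, b, np.zeros(2)), np.linalg.solve(A, b))
```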
OUTLINE
1 STOCHASTIC GRADIENT DESCENT
2 CONJUGATE GRADIENTS
3 NEWTON’S METHODS
NEWTON-RAPHSON METHOD
As an alternative optimisation method for small $n$, we can use the Newton-Raphson method, which has better convergence properties than gradient descent.

By Taylor expansion,
$$f(x + \Delta) \approx f(x) + \Delta^T \nabla f(x) + \frac{1}{2} \Delta^T H_f(x) \Delta$$
where $H_f$ is the Hessian of $f(x)$.

Differentiating w.r.t. $\Delta$, the minimum of the r.h.s. is attained when $\nabla f(x) = -H_f(x) \Delta$, hence for $\Delta = -H_f^{-1}(x) \nabla f(x)$.

Thus we obtain the update rule:
$$x_{k+1} = x_k - \eta\, H_f^{-1}(x_k) \nabla f(x_k)$$
with $0 < \eta < 1$ to improve convergence.
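A sketch of the damped Newton-Raphson update; the example objective and the damping factor are illustrative, and the linear system is solved rather than forming $H_f^{-1}$ explicitly:

```python
# Damped Newton-Raphson: x <- x - eta * H^{-1} grad, via a linear solve.
import numpy as np

def newton_raphson(grad, hess, x0, eta=0.8, n_steps=50, tol=1e-10):
    x = x0.copy()
    for _ in range(n_steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        delta = np.linalg.solve(hess(x), g)   # solve H_f(x) delta = grad f(x)
        x = x - eta * delta                   # damped step, 0 < eta < 1
    return x

# Example: f(x) = x1^4 + x2^2, with minimum at the origin.
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2, 0.0], [0.0, 2.0]])
print(newton_raphson(grad, hess, np.array([1.0, 1.0])))
```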