
Optimization for Training Deep Models
Deep learning reading group

Henrique Morimitsu
December 13, 2016

INRIA

Presentation based on Chapter 8 of the Deep Learning book by [Goodfellow et al., 2016]


Table of contents

1. Difference between learning and pure optimization
2. Challenges in network optimization
3. Basic algorithms
4. Parameter initialization strategies
5. Algorithms with adaptive learning rate
6. Approximate second-order methods
7. Optimization strategies and meta-algorithms


What is in here?

If you want to know:

• Best batch size?
• How to avoid local minima?
• Best training algorithm?
• Optimal learning rate?
• Best weight initialization policy?


No luck for you.


What is in here?

However, if you want to understand:

• How batch size affects training
• The problems of local minima and saddle points
• Different training algorithms
• The importance of the learning rate
• Existing initialization policies and their implications

This is for you.

Difference between learning and pure optimization


What is optimization?

The process of finding maxima or minima subject to constraints
• Involved in many contexts of deep learning, but the hardest one is neural network training
• A single instance can take days to months to solve
• Special techniques have been developed specifically for this case
• This presentation focuses on one particular case of optimization:

Finding the parameters θ of a network that reduce a cost function J(θ)


Difference between learning and pure optimization

• There are several differences
• Machine learning acts indirectly, unlike pure optimization
• Usually we want to optimize a performance measure P, defined on the test set
• The problem may be intractable
• So we optimize a different cost function J instead
• Hoping that doing so improves P as well


Example: recognizing cats

Application to recognize cats

• You don’t know which images you may receive


Example: recognizing cats

• The performance measure P is the number of correct classifications
• But you don’t have access to these images
• So instead you build a training set
• And measure a cost J on this set
• Minimizing J does not necessarily optimize P

[Figures: example training and test images for the cat recognizer]

Generalization error (loss function)

• Typically the loss function is a simple average over the training set:

J(θ) = E_{(x,y)∼p̂_data} L(f(x;θ), y) = (1/m) ∑_{i=1}^{m} L(f(x^{(i)};θ), y^{(i)})   (1)

where
• E is the expectation operator
• L is the per-example loss function
• f(x;θ) is the predicted output for input x
• p̂_data is the empirical distribution (notice the hat)
• y is the target output
• m is the number of training examples
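To make Eq. (1) concrete, here is a minimal NumPy sketch (not from the original slides) that evaluates the empirical average of a per-example loss; the linear model f and the squared-error loss are placeholder assumptions:

```python
import numpy as np

def empirical_loss(theta, X, y, f, per_example_loss):
    """J(theta): average per-example loss over the m training pairs (Eq. 1)."""
    predictions = f(X, theta)                      # f(x^(i); theta) for every example
    return np.mean(per_example_loss(predictions, y))

# Hypothetical choices, just to make the sketch runnable.
f = lambda X, theta: X @ theta                     # assumed linear model
squared_error = lambda pred, target: (pred - target) ** 2

X = np.random.randn(100, 3)                        # m = 100 examples, 3 features
y = np.random.randn(100)
theta = np.zeros(3)
print(empirical_loss(theta, X, y, f, squared_error))
```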


Surrogate loss

• Often, minimizing the real loss is intractable (e.g. the 0-1 loss)
• Minimize a surrogate loss instead
• E.g. the negative log-likelihood is a surrogate for the 0-1 loss
• Sometimes, the surrogate loss may learn more
• The test-set 0-1 loss keeps decreasing even after the training 0-1 loss reaches zero
• Even when the 0-1 loss is zero, the classifier can be improved by pushing the classes even farther from each other


Batch and minibatch algorithms

• In machine learning, the objective function decomposes as a sum over examples
• Compute each update based on the cost function of a subset
• For example, the commonly used gradient is an expectation over the training set

∇_θ J(θ) = E_{(x,y)∼p̂_data} ∇_θ L(f(x;θ), y)   (2)

• Computing the expectation exactly at each update is very expensive


Batch and minibatch algorithms

• The gain of using more samples is less than linear
• Consider an estimator µ̂_m of the true mean µ of a distribution
• How far off is the estimated mean µ̂_m?
• Compute its standard error:

SE(µ̂_m) = √( Var[ (1/m) ∑_{i=1}^{m} x^{(i)} ] ) = σ/√m

• The error drops only proportionally to the square root of m
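A quick worked example of this sub-linear return (added for illustration): with σ = 1, a batch of m = 100 examples gives SE = 1/√100 = 0.1, while m = 10 000 gives SE = 1/√10 000 = 0.01; 100 times more examples buy only a 10-fold reduction in the error.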


Batch and minibatch algorithms

• Generate batches by random sampling
• Random sampling may also have some benefits:
• With a redundant training set, fewer samples suffice for a correct gradient estimate
• Full redundancy does not happen in practice, but many samples do contribute similarly


Batch and minibatch algorithms

• Algorithms that use the whole training set are called batch or deterministic
• If only subsets are used, the algorithm is called minibatch
• However, minibatch algorithms are often simply referred to as batch
• If the algorithm uses only one example at a time, it is called stochastic or online
• Most algorithms use more than one but fewer than all of the examples
• They are traditionally called minibatch, minibatch stochastic, or simply stochastic
• The canonical example is stochastic gradient descent, which will be covered later


Batch and minibatch algorithms

• Considerations about minibatch size:
• Larger batches give more accurate gradient estimates, but the return is less than linear
• Very small batches do not make good use of multicore capabilities
• If minibatches are processed in parallel, the amount of memory scales with the batch size
• Some hardware achieves better performance with specific sizes (typically a power of 2)
• Small batches may offer a regularizing effect, probably due to the noise they add
• But they may require small learning rates to maintain stability
• The number of steps needed for convergence may increase


Batch and minibatch algorithms

• Minibatches must be selected randomly, for unbiased estimation
• Subsequent estimates must also be independent of each other
• In practice, we do not enforce true randomness, but just shuffle the data (see the sketch below)
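A minimal sketch of this shuffle-then-slice strategy (my own illustration, not from the slides); the batch size and toy arrays are assumptions:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle the data once, then yield successive minibatches."""
    perm = rng.permutation(len(X))                 # shuffle instead of true random draws
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
for xb, yb in minibatches(X, y, batch_size=128, rng=rng):
    pass  # compute the gradient on (xb, yb) and update the parameters here
```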


Batch and minibatch algorithms

• Minibatch stochastic gradient descent follows the gradient of the true generalization error, as long as no examples are repeated
• Most algorithms shuffle the data once and then pass through it many times
• In the first pass, each minibatch gives an unbiased estimate
• In subsequent passes the estimate becomes biased


Challenges in network optimization


Ill conditioning

• Condition number: how much a change in the input affects the output
• A high value greatly amplifies errors
• It may cause learning to become ”stuck”
• Even very small steps increase the cost


When ill conditioning is a problem

• Approximate the cost function by a quadratic Taylor expansion

f(x) ≈ f(x^{(0)}) + (x − x^{(0)})^T g + (1/2)(x − x^{(0)})^T H (x − x^{(0)})   (3)

where:
• g is the gradient
• H is the Hessian matrix


When ill conditioning is a problem

f(x) ≈ f(x^{(0)}) + (x − x^{(0)})^T g + (1/2)(x − x^{(0)})^T H (x − x^{(0)})   (3)

• Updating using a learning rate of ϵ:

f(x^{(0)} − ϵg) ≈ f(x^{(0)}) − ϵ g^T g + (1/2) ϵ^2 g^T H g   (4)

• The first of the two update terms involves the squared gradient norm (positive), so it decreases the cost
• If the second term grows too much, the cost increases
• Monitor both terms to see whether ill conditioning is a problem (see the sketch below)
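As a rough illustration of this monitoring idea (my own sketch, not from the slides), consider a toy ill-conditioned quadratic cost whose gradient and Hessian are known analytically; the matrix H and the step size are arbitrary assumptions:

```python
import numpy as np

# Toy ill-conditioned quadratic: f(theta) = 0.5 * theta^T H theta
H = np.diag([1.0, 500.0])                  # condition number 500
theta = np.array([1.0, 1.0])
eps = 0.01

for step in range(100):
    g = H @ theta                          # gradient of the quadratic
    decrease = eps * g @ g                 # eps * g^T g (first-order decrease)
    curvature = 0.5 * eps**2 * g @ H @ g   # 0.5 * eps^2 * g^T H g (curvature term)
    if step % 20 == 0:
        print(f"step {step}: eps*g^Tg = {decrease:.4f}, 0.5*eps^2*g^THg = {curvature:.4f}")
    theta -= eps * g                       # if the curvature term exceeds the decrease, the step raises the cost
```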


Local minima

• In convex problems, any local minimum is a global minimum
• Non-convex functions may have several local minima
• Neural networks are non-convex!
• But in practice this is not a major problem


Model identifiability

• A model is identifiable if a sufficiently large training set yields a unique set of parameters
• Models with latent variables are often not identifiable
• Neural nets are not identifiable
• The input and output weights of any ReLU or maxout unit may be rescaled (one by α, the other by 1/α) to produce the same result


Weight space symmetry

• Also, the same result can be obtained by swapping the weights between units
• This is known as weight space symmetry
• For m layers of n hidden units each, there are n!^m such permutations of the hidden units


Local minima

• Therefore, the cost function may have an extremely large, even uncountable, number of local minima
• But they are not a big problem
• Local minima are a problem only when their cost is much higher than that of the global minimum


Local minima

• Local minima are still an open problem
• But nowadays many believe most local minima have a low cost [Choromanska et al., 2015, Dauphin et al., 2014, Goodfellow et al., 2015, Saxe et al., 2013]
• A test to rule out local minima as the problem:
• Plot the gradient norm over time; if it does not shrink to insignificant values, the problem is not local minima


Local minima

Source: [Goodfellow et al., 2016]


Saddle points and flat regions

• Many attribute the majority of optimization problems to local minima
• But there is another type of point that appears even more often and causes problems:
• Saddle points
• The gradient is zero there

Source: [Goodfellow et al., 2016]


Are there many saddle points?

• The eigenvalues of the Hessian are all positive at a minimum
• Saddle points have a mix of positive and negative eigenvalues
• Suppose the cost function is random
• Then the sign of each eigenvalue can be thought of as determined by a coin flip
• A minimum only happens if all coins come up heads
• In high dimensions, it is therefore far more likely to find a saddle point
• Studies suggest this happens in practice [Saxe et al., 2013]


Are saddle points a problem?

• For second-order methods (Newton’s method), yes!
• But there is research in this direction [Dauphin et al., 2014]
• For gradient descent, it is unclear
• Intuitively, it may be a problem
• But studies suggest it is not [Goodfellow et al., 2015]

Source: [Goodfellow et al., 2016]


Are saddle points a problem?

• Flat regions also have zero gradient
• They take a long time to traverse
• Gradient descent wastes time circumnavigating tall mountains in the cost surface

Source: [Goodfellow et al., 2015]


Cliffs and exploding gradients

• Very steep cliffs cause gradients to explode
• This happens more often in recurrent networks

Source: [Goodfellow et al., 2016]


Local and global structure

• Gradient descent only looks at local structure
• Actually, most useful approaches have the same problem!
• If initialized at a bad position, they cannot find the lowest minima
• However, this is less likely in high dimensions

Source:[Goodfellow et al., 2016]


Basic algorithms


Stochastic gradient descent (SGD)

• The most used algorithm for deep learning
• Do not confuse it with (deterministic) gradient descent
• The stochastic version uses minibatches
• The algorithm is similar, but there are some important modifications


Gradient descent algorithm

• Full training set of samples {x^{(1)}, ..., x^{(m)}} with targets y^{(i)}
• Compute the gradient

g ← (1/m) ∇_θ ∑_{i=1}^{m} L(f(x^{(i)};θ), y^{(i)})   (5)

• Apply the update

θ ← θ − ϵg   (6)

where
• ϵ is the learning rate
• θ are the network parameters
• L(·) is the loss function


Stochastic gradient descent algorithm

• Minibatch of training samples {x^{(1)}, ..., x^{(m)}} with targets y^{(i)}
• Compute the gradient

g ← (1/m) ∇_θ ∑_{i=1}^{m} L(f(x^{(i)};θ), y^{(i)})   (7)

• Apply the update

θ ← θ − ϵ_k g   (8)
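A minimal NumPy sketch of minibatch SGD following Eqs. (7) and (8); the linear model and squared-error loss are placeholder assumptions, and with the full training set as a single batch it reduces to the gradient descent algorithm above:

```python
import numpy as np

def sgd_epoch(theta, X, y, eps_k, batch_size, rng):
    """One pass of minibatch SGD with a squared-error loss on an assumed linear model."""
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        # Gradient of (1/m) * sum_i (x_i . theta - y_i)^2 with respect to theta
        g = 2.0 * xb.T @ (xb @ theta - yb) / len(xb)
        theta = theta - eps_k * g              # theta <- theta - eps_k * g
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(5.0)
y = X @ true_theta
theta = np.zeros(5)
for epoch in range(50):
    theta = sgd_epoch(theta, X, y, eps_k=0.05, batch_size=64, rng=rng)
print(theta)  # should approach true_theta
```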


Learning rate for SGD

• The learning rate ϵ_k must change over the iterations k
• Minibatches introduce noise that does not disappear even at the minimum
• A sufficient condition for convergence:

∑_{k=1}^{∞} ϵ_k = ∞  and  ∑_{k=1}^{∞} ϵ_k^2 < ∞   (9)

• Implying lim_{k→∞} ϵ_k = 0


Learning rate for SGD

• It is common to decay the learning rate linearly until iteration τ
• It can also be decayed at intervals

ϵ_k = (1 − α) ϵ_0 + α ϵ_τ, with α = k/τ   (10)

• There are three parameters to choose
• Usually:
• τ should allow a few hundred passes through the training set
• ϵ_τ should be roughly 1% of ϵ_0 (see the sketch below)
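A tiny sketch of the linear decay schedule of Eq. (10), assuming α = k/τ; the specific values of ϵ_0, ϵ_τ, and τ are arbitrary:

```python
def learning_rate(k, eps0=0.1, tau=1000, eps_tau=0.001):
    """Linear decay: eps_k = (1 - alpha) * eps0 + alpha * eps_tau, with alpha = k / tau."""
    if k >= tau:
        return eps_tau                 # keep the rate constant after iteration tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau

print(learning_rate(0), learning_rate(500), learning_rate(2000))
# 0.1 0.0505 0.001
```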


Learning rate for SGD

• The main problem is how to choose ϵ0

• Typically:
• Choose it higher than the value that performs best over the first ~100 iterations
• Monitor the initial results and use a somewhat higher value
• But too high a value will cause instability


Momentum

[Figure slides: illustrations motivating momentum. Adapted from [Goodfellow et al., 2016]]

Momentum

• In these cases, momentum can help
• It is derived from the physics term (momentum = mass × velocity)
• Assume unit mass, so we just consider the velocity


Momentum

• Accumulates previous gradients

v ← αv − ϵ ∇_θ ( (1/m) ∑_{i=1}^{m} L(f(x^{(i)};θ), y^{(i)}) )   (11)

θ ← θ + v   (12)

where
• α ∈ [0, 1) is a hyperparameter
• ∇_θ is the gradient
• θ are the network parameters


Momentum

Source: [Goodfellow et al., 2016]


SGD with momentum

• Compute the gradient

g ← (1/m) ∇_θ ∑_{i=1}^{m} L(f(x^{(i)};θ), y^{(i)})   (13)

• Compute the velocity update

v ← αv − ϵg   (14)

• Apply the update

θ ← θ + v   (15)

where
• α is the momentum coefficient
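A minimal sketch of the velocity update of Eqs. (13) to (15), reusing the same assumed linear model and squared-error gradient as the earlier SGD sketch:

```python
import numpy as np

def sgd_momentum_epoch(theta, v, X, y, eps, alpha, batch_size, rng):
    """One epoch of SGD with momentum: v <- alpha*v - eps*g, theta <- theta + v."""
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        g = 2.0 * xb.T @ (xb @ theta - yb) / len(xb)   # minibatch gradient
        v = alpha * v - eps * g                        # accumulate velocity
        theta = theta + v                              # apply update
    return theta, v

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(5.0)
theta, v = np.zeros(5), np.zeros(5)
for epoch in range(30):
    theta, v = sgd_momentum_epoch(theta, v, X, y, eps=0.02, alpha=0.9, batch_size=64, rng=rng)
print(theta)  # should approach [0, 1, 2, 3, 4]
```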


Parameter initialization strategies


Parameter initialization

• Deep learning methods are strongly affected by initialization
• It can determine whether the algorithm converges at all
• Network optimization is still not well understood
• Modern techniques are simple and heuristic


Breaking symmetry

• It is known that initialization must break symmetry between units
• Units with the same inputs and activation function would otherwise be updated in the same way
• Even when using dropout, it is better to avoid symmetry
• Common initialization choices are random values and orthogonal matrices
• The former is cheaper and works well
• Biases, however, are usually constants chosen heuristically


Gaussian initialization

• The exact choice of Gaussian (or uniform) distribution does not seem to matter much
• However, the scale of the distribution does matter
• Larger weights break symmetry more strongly, but may explode


How to choose the mean?

• A weight can be interpreted as how much two units interact with each other
• If the initial weight is high, we put a prior on which units should interact
• Therefore, it is a good idea to initialize the weights around zero


Normalized initialization

• Some heuristics adapt the scale to the layer's input and output sizes (Glorot and Bengio, 2010)

W_{i,j} ∼ U( −√(6/(m + n)), √(6/(m + n)) )   (16)

where m and n are the number of inputs and outputs of the layer
• Enforces the same forward / backward variance between layers
• Derived for a network with no non-linearities
• But in practice it works well for other networks as well
• Known as Xavier initialization in Caffe (see the sketch below)
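A short sketch of the normalized (Xavier) initialization of Eq. (16); the names n_in and n_out for the layer sizes are my own:

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    """Glorot / Xavier initialization: U(-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out)))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = normalized_init(256, 128, rng)
print(W.std(), np.sqrt(2.0 / (256 + 128)))   # empirical std vs. the target sqrt(2/(n_in+n_out))
```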


Sparse initialization

• If the layer is huge, normalized initialization yields very low weights
• Martens (2010) proposes to give each unit exactly k non-zero weights
• This helps keep higher values and increases diversity
• But it puts a strong prior on some connections
• It may take a long time to fix wrong priors
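A possible sketch of sparse initialization in this spirit: each unit receives exactly k non-zero Gaussian weights; the values of k and the scale are assumptions, not values from Martens (2010):

```python
import numpy as np

def sparse_init(n_in, n_out, k, scale, rng):
    """Each output unit receives exactly k non-zero Gaussian incoming weights."""
    W = np.zeros((n_in, n_out))
    for j in range(n_out):
        idx = rng.choice(n_in, size=k, replace=False)   # k random incoming connections
        W[idx, j] = rng.normal(scale=scale, size=k)
    return W

rng = np.random.default_rng(0)
W = sparse_init(1000, 200, k=15, scale=1.0, rng=rng)
print((W != 0).sum(axis=0)[:5])   # every column has exactly 15 non-zeros
```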


Orthogonal matrices

• [Saxe et al., 2013] recommend using orthogonal weight matrices for initialization
• If using a non-linear function ϕ that saturates as x → ∞, there exists a gain γ > γ_0 in

x^{l+1}_i = ∑_j γ W^{(l+1,l)}_{i,j} ϕ(x^l_j)   (17)

that avoids vanishing gradients regardless of the number of network layers, where:
• x^l_i is the activity of neuron i in layer l
• W^{(l+1,l)}_{i,j} is a random orthogonal weight matrix
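One common recipe for obtaining a random orthogonal matrix is the QR decomposition of a Gaussian matrix; the sketch below follows that recipe with an assumed gain γ (an illustration, not the exact procedure of [Saxe et al., 2013]):

```python
import numpy as np

def orthogonal_init(n, gain, rng):
    """Random orthogonal matrix (via QR of a Gaussian matrix), scaled by a gain gamma."""
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))        # fix column signs so the distribution is uniform
    return gain * Q

rng = np.random.default_rng(0)
W = orthogonal_init(64, gain=1.1, rng=rng)
print(np.allclose(W.T @ W, 1.1**2 * np.eye(64)))   # gamma^2 * I, since Q is orthogonal
```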


Bias initialization

• Bias initialization is typically easier
• It is common to initialize biases to zero
• Sometimes, we might want to use other constants:
• For output units, it may be beneficial to initialize them according to the marginal statistics of the output
• To avoid saturation, e.g., 0.1 for ReLU units
• Initialize to 1 for gate units (LSTM)


Considerations about initialization

• Often, even an optimal initialization does not lead to optimal results
• This may happen because:
• The criteria may be wrong: maybe it is not good to preserve the norm of a signal through the whole network
• Initial properties may be lost during training
• Speed increases, but we may be sacrificing generalization capability
• If resources allow it, the initialization should be treated as a hyperparameter of the network


Algorithms with adaptive learning rate


Adaptive learning rate

• The learning rate is one of the hyperparameters with the largest impact
• The cost is highly sensitive to some directions in parameter space and insensitive to others
• If we assume that this sensitivity is axis-aligned, it makes sense to use a separate rate for each parameter

Source:[Goodfellow et al., 2016]


Delta-bar-delta [Jacobs, 1988]

• An early heuristic approach
• Simple idea: if the partial derivative with respect to a parameter keeps the same sign, increase its learning rate; otherwise, decrease it
• It can only be used with full batch methods


AdaGrad [Duchi et al., 2011]

• Scales each parameter's learning rate according to the history of its gradients
• Learning rates of parameters with large partial derivatives decrease quickly
• This enforces more progress in gently sloped directions
• It has nice properties for convex optimization
• But for deep learning it decreases the rate excessively


AdaGrad algorithm

• Accumulate the squared gradients

r ← r + g ⊙ g   (18)

• Compute the element-wise update

Δθ ← − (ϵ / (δ + √r)) ⊙ g   (19)

• Update the parameters

θ ← θ + Δθ   (20)

where
• g is the gradient
• δ is a small constant for numerical stabilization
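A per-parameter sketch of the AdaGrad update of Eqs. (18) to (20); the toy quadratic problem and the hyperparameter values are assumptions for illustration:

```python
import numpy as np

def adagrad_step(theta, r, g, eps=0.01, delta=1e-7):
    """One AdaGrad update: accumulate squared gradients and rescale element-wise."""
    r = r + g * g                                   # r <- r + g (.) g
    delta_theta = -eps / (delta + np.sqrt(r)) * g   # element-wise rescaled step
    return theta + delta_theta, r

# Usage sketch on an assumed toy quadratic with gradient g = H @ theta.
H = np.diag([1.0, 10.0])
theta, r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, r = adagrad_step(theta, r, H @ theta, eps=0.1)
print(theta)   # approaches the minimum at the origin
```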


RMSProp [Hinton, 2012]

• A modification of AdaGrad to perform better on non-convex problems
• AdaGrad accumulates from the very beginning, so the effective rate may be too small before reaching a locally convex region
• RMSProp uses an exponentially weighted moving average instead


RMSProp algorithm

• Accumulate the squared gradients

r ← ρr + (1 − ρ) g ⊙ g   (21)

• Compute the element-wise update

Δθ ← − (ϵ / (δ + √r)) ⊙ g   (22)

• Update the parameters

θ ← θ + Δθ   (23)

where
• ρ is the decay rate
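The same sketch adapted to RMSProp, Eqs. (21) to (23); the only change from AdaGrad is the exponentially weighted moving average, and ρ = 0.9 and δ are assumed, commonly used values:

```python
import numpy as np

def rmsprop_step(theta, r, g, eps=0.001, rho=0.9, delta=1e-6):
    """One RMSProp update: moving average of squared gradients, then rescale."""
    r = rho * r + (1.0 - rho) * g * g               # r <- rho*r + (1 - rho) g (.) g
    delta_theta = -eps / (delta + np.sqrt(r)) * g
    return theta + delta_theta, r

# Usage mirrors the AdaGrad sketch above: call rmsprop_step(theta, r, H @ theta) in a loop.
```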


Adam [Kingma and Ba, 2014]

• Adaptive Moments: a variation of RMSProp + momentum
• Momentum is incorporated directly as an estimate of the first-order moment of the gradient
• In RMSProp (with momentum), momentum is applied only after rescaling the gradients
• Adam also adds bias correction to the moment estimates


Adam algorithm

• Update the time step: t ← t + 1
• Update the biased moment estimates

s ← ρ_1 s + (1 − ρ_1) g   (24)
r ← ρ_2 r + (1 − ρ_2) g ⊙ g   (25)

• Correct the biases

ŝ ← s / (1 − ρ_1^t)   (26)
r̂ ← r / (1 − ρ_2^t)   (27)

• Update the parameters

Δθ ← − ϵ ŝ / (δ + √r̂)   (28)
θ ← θ + Δθ   (29)
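A sketch of the full Adam update of Eqs. (24) to (29); ϵ = 0.001, ρ_1 = 0.9, ρ_2 = 0.999, and δ = 1e-8 are the defaults suggested in [Kingma and Ba, 2014], and the toy problem is an assumption:

```python
import numpy as np

def adam_step(theta, s, r, t, g, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    t = t + 1
    s = rho1 * s + (1.0 - rho1) * g              # biased first moment
    r = rho2 * r + (1.0 - rho2) * g * g          # biased second moment
    s_hat = s / (1.0 - rho1 ** t)                # bias correction
    r_hat = r / (1.0 - rho2 ** t)
    theta = theta - eps * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r, t

# Usage sketch on an assumed toy quadratic with gradient g = H @ theta.
H = np.diag([1.0, 10.0])
theta = np.array([1.0, 1.0])
s, r, t = np.zeros(2), np.zeros(2), 0
for _ in range(2000):
    theta, s, r, t = adam_step(theta, s, r, t, H @ theta)
print(theta)   # approaches the minimum at the origin
```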


Thank you!

Questions?


References

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.

Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.

Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.