Optimization for Deep Networks
Ishan Misra
Overview
• Vanilla SGD
• SGD + Momentum
• NAG
• Rprop
• AdaGrad
• RMSProp
• AdaDelta
• Adam
More tricks
• Batch Normalization
• Natural Neural Networks
Gradient (Steepest) Descent
• Move in the opposite direction of the gradient
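Written out in generic notation (the symbols are assumptions here, since the slide's equation is not reproduced):

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta f(\theta_t)
```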
Conjugate Gradient Methods
• See Møller, 1993 [A scaled conjugate gradient algorithm for fast supervised learning] and Martens, 2010 [Deep learning via Hessian-free optimization]
Notation
• Parameters of the network
• Function of the network parameters
Properties of Loss function for SGD
The loss function over all samples must decompose into a per-sample loss, for example as sketched below.
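A minimal sketch of this decomposition, using generic symbols (𝜃 for the parameters, 𝑓ᵢ for the per-sample loss, B for a mini-batch; these names are assumptions, not taken from the slides):

```latex
f(\theta) = \frac{1}{N}\sum_{i=1}^{N} f_i(\theta),
\qquad
\nabla f(\theta) \approx \frac{1}{|B|}\sum_{i \in B} \nabla f_i(\theta)
```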
Vanilla SGD
• Parameters of the network
• Function of the network parameters
• Iteration number
• Step size / learning rate
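A minimal NumPy sketch of the vanilla SGD loop (the function and variable names are illustrative assumptions, not from the slides):

```python
import numpy as np

def sgd(params, grad_fn, data, lr=0.01, epochs=10):
    """Vanilla SGD: step against the gradient of the per-sample loss."""
    for _ in range(epochs):
        np.random.shuffle(data)            # visit samples in random order
        for sample in data:
            g = grad_fn(params, sample)    # gradient of the loss on this sample
            params = params - lr * g       # move opposite to the gradient
    return params
```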
SGD + Momentum
• Plain SGD can make erratic updates on non-smooth loss functions
  • Consider an outlier example which “throws off” the learning process
• Maintain some history of updates
• Physics example
  • A moving ball acquires “momentum”, at which point it becomes less sensitive to the direct force (gradient)
SGD + Momentum
• Parameters of the network
• Function of the network parameters
• Iteration number
• Step size / learning rate
• Momentum
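The momentum update these labels refer to, in a common formulation (assumed here, since the slide's equation image is not reproduced; 𝜇 is the momentum, 𝜂 the step size):

```latex
v_{t+1} = \mu v_t - \eta \nabla f(\theta_t),
\qquad
\theta_{t+1} = \theta_t + v_{t+1}
```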
SGD + Momentum
• At iteration 𝑡 you add the update from the previous iteration 𝑡−𝑛 with weight 𝜇^𝑛
• You effectively multiply your updates by 1/(1−𝜇)
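The 1/(1−𝜇) factor comes from summing the geometric series of decayed past contributions (for a roughly constant gradient g):

```latex
\sum_{n=0}^{\infty} \mu^{n} (-\eta g) = \frac{-\eta g}{1-\mu}
```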
Nesterov Accelerated Gradient (NAG)
• Ilya Sutskever, 2012
• First make a jump as directed by momentum
• Then depending on where you land, correct the parameters
NAG
• Parameters of the network
• Function of the network parameters
• Iteration number
• Step size / learning rate
• Momentum
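In the common formulation (assumed here), NAG evaluates the gradient after the momentum jump, then corrects:

```latex
v_{t+1} = \mu v_t - \eta \nabla f(\theta_t + \mu v_t),
\qquad
\theta_{t+1} = \theta_t + v_{t+1}
```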
NAG vs Standard Momentum
Why anything new (beyond Momentum/NAG)?
• How to set learning rate and decay of learning rates?
• Ideally want adaptive learning rates
Why anything new (beyond Momentum/NAG)?
• Neurons in each layer learn differently
  • Gradient magnitudes vary across layers
  • Early layers get “vanishing gradients”
• Should ideally use separate adaptive learning rates
  • One of the reasons for having “gain” or lr multipliers in caffe
• Adaptive learning rate algorithms
  • Jacobs 1989 – agreement in sign between the current gradient for a weight and the velocity for that weight
• Use larger mini-batches
Adaptive Learning Rates
Resilient Propagation (Rprop)
• Riedmiller and Braun 1993
• Addresses the problem of adaptive learning rates
• Increase the learning rate for a weight multiplicatively if signs of last two gradients agree
• Else decrease learning rate multiplicatively
Rprop Update
Rprop Initialization
• Initialize all updates at iteration 0 to a constant value
  • If you set both learning rates to 1, you get the “Manhattan update rule”
• Rprop effectively divides the gradient by its magnitude
  • You never update using the gradient itself, but by its sign
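A minimal sketch of the sign-based Rprop update (a simplified variant without backtracking; the per-weight step sizes and the increase/decrease factors and bounds are typical values assumed here, not taken from the slides):

```python
import numpy as np

def rprop_step(w, g, g_prev, delta,
               eta_plus=1.2, eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    """One (simplified) Rprop update.

    Grow each weight's step size when the last two gradient signs agree,
    shrink it when they disagree, then move by the *sign* of the gradient."""
    agree = np.sign(g) * np.sign(g_prev)
    delta = np.where(agree > 0, np.minimum(delta * eta_plus, delta_max), delta)
    delta = np.where(agree < 0, np.maximum(delta * eta_minus, delta_min), delta)
    w = w - np.sign(g) * delta      # the gradient magnitude itself is never used
    return w, delta
```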
Problems with Rprop
• Consider a weight that gets gradients of +0.1 in nine mini-batches, and −0.9 in the tenth mini-batch
• SGD would keep this weight roughly where it started
• Rprop would increment the weight nine times by 𝛿, and then for the tenth update decrease it by 𝛿
  • Effective update: 9𝛿 − 𝛿 = 8𝛿
• Across mini-batches we scale updates very differently
Adaptive Gradient (AdaGrad)
• Duchi et al., 2010
• We need to scale updates across mini-batches similarly
• Use the accumulated gradients as an indicator for scaling
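The standard AdaGrad update (per parameter, with a small 𝜖 for numerical stability; written here as a sketch since the slide's equation is not reproduced):

```latex
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2} + \epsilon}\, g_t
```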
Problems with AdaGrad
• Lowers the update size very aggressively: the sum of squared gradients only grows, so the effective learning rate shrinks monotonically toward zero
RMSProp = Rprop + SGD
• Tieleman & Hinton, 2012 (Coursera Lecture 6, slide 29)
• Scale updates similarly across mini-batches
• Scale by a decaying average of the squared gradient
  • Rather than the sum of squared gradients as in AdaGrad
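The RMSProp update in its common form (a sketch; 𝜌 is the decay of the running average, typically around 0.9):

```latex
E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2,
\qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t
```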
RMSProp
• Has shown success for training recurrent models
• Adding momentum generally does not show much improvement
Fancy RMSProp
• “No more pesky learning rates” – Schaul et al.
  • Computes a diagonal Hessian and uses something similar to RMSProp
• The diagonal Hessian computation requires an additional forward-backward pass
  • Double the time of SGD
Units of update
• The SGD update is expressed in units of the gradient
Unitless updates
• Updates are not in units of parameters
Hessian updates
Hessian gives correct units
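A sketch of the units argument (following the reasoning used in the AdaDelta paper; the loss is treated as unitless): the SGD step has the units of the inverse parameters, while the Newton/Hessian step has the units of the parameters themselves.

```latex
\Delta\theta_{\mathrm{SGD}} \propto \frac{\partial f}{\partial \theta} \sim \frac{1}{\text{units of } \theta},
\qquad
\Delta\theta_{\mathrm{Newton}} \propto \left(\frac{\partial^2 f}{\partial \theta^2}\right)^{-1}\frac{\partial f}{\partial \theta} \sim \text{units of } \theta
```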
AdaDelta
• Zeiler, 2012
• Get updates that match units
• Keep properties from RMSProp
• Updates should be of the form
AdaDelta
• Approximate denominator by sqrt(decaying average of gradient squares)
• Approximate numerator by decaying average of squared updates
AdaDelta Update Rule
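The rule this slide refers to, reproduced in the paper's standard form since the equation image is not included (𝜌 is the decay rate and RMS[x]_t = √(E[x²]_t + 𝜖)):

```latex
E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2,
\qquad
\Delta\theta_t = -\frac{\mathrm{RMS}[\Delta\theta]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t,
\qquad
E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1-\rho)\Delta\theta_t^2
```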
Problems with AdaDelta
• The moving averages are biased toward zero by their initialization (the accumulators start at 0)
Problems with AdaDelta
• Not the first intuitive shot at fixing the “units of updates” problem
• This may just be me nitpicking
• Why do the updates have a time delay?
  • There is some explanation in the paper, but I am not convinced …
Adam
• Kingma & Ba, 2015
• Keeps decaying averages of the gradient and of the squared gradient
• Bias correction
Adam update rule
Updates are not in the correct units
Simplification: does not include the decay of 𝛾₁ over iterations
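The Adam update in its common form, written with the slide's 𝛾₁, 𝛾₂ in place of the paper's β₁, β₂ (a sketch; the 1−𝛾ᵗ factors are the bias correction):

```latex
m_t = \gamma_1 m_{t-1} + (1-\gamma_1) g_t,
\qquad
v_t = \gamma_2 v_{t-1} + (1-\gamma_2) g_t^2
```

```latex
\hat{m}_t = \frac{m_t}{1-\gamma_1^t},
\quad
\hat{v}_t = \frac{v_t}{1-\gamma_2^t},
\qquad
\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```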
AdaMax (a variant of Adam)
Adam Results – Logistic Regression
Adam Results - MLP
Adam Results – Conv Nets
Visualization
Alec Radford
Batch Normalization
Ioffe and Szegedy, 2015
Distribution of input
• Having a fixed input distribution is known to help training of linear classifiers
  • Normalize inputs for SVM
• Normalize inputs for Deep Networks
Distribution of input at each layer
• Each layer would benefit if its input had constant distribution
Normalize the input to each layer!
Is this normalization a good idea?
• Consider inputs to a sigmoid layer
• If the inputs are normalized, the sigmoid may never “saturate”
Modify normalization …
• Accommodate the identity transform
• Learned scale and shift parameters
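The batch-normalization transform with its learned parameters, in the paper's standard notation (𝛾, 𝛽 are learned; setting 𝛾 = √(𝜎²+𝜖) and 𝛽 = 𝜇 recovers the identity):

```latex
\hat{x} = \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}},
\qquad
y = \gamma \hat{x} + \beta
```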
Batch Normalization Results
Batch Normalization for ensembles
Natural Gradients
Related work: PRONG, or Natural Neural Networks
Gradients and orthonormal coordinates
• In Euclidean space, we define length using orthonormal coordinates
What happens on a manifold?
• Use metric tensors (they generalize the dot product to manifolds)
• G is generally a symmetric PSD matrix
• It is a tensor because it transforms as G' = JᵀGJ under a change of coordinates, where J is the Jacobian of the coordinate transformation
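Concretely, the metric tensor defines inner products and the squared length of a small displacement (a standard formulation, written here as a sketch):

```latex
\langle u, v \rangle_G = u^\top G\, v,
\qquad
\|d\theta\|_G^2 = d\theta^\top G(\theta)\, d\theta
```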
Gradient descent revisited
• What exactly does a gradient descent update step do?
Gradient descent revisited
• What exactly does a gradient descent update step do?
Distance is measured in orthonormal coordinates
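One way to make this precise (a standard derivation, assumed here): the gradient step minimizes the linearized loss penalized by Euclidean, i.e. orthonormal-coordinate, distance.

```latex
\theta_{t+1} = \arg\min_{\theta}\; f(\theta_t) + \nabla f(\theta_t)^\top(\theta - \theta_t) + \frac{1}{2\eta}\|\theta - \theta_t\|_2^2
\;=\; \theta_t - \eta \nabla f(\theta_t)
```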
Relationship: Natural Gradient vs. Gradient
F is the Fisher information matrix
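The relationship in its standard form (the Euclidean metric is replaced by the Fisher metric):

```latex
\tilde{\nabla} f(\theta) = F^{-1} \nabla f(\theta),
\qquad
\theta_{t+1} = \theta_t - \eta\, F^{-1} \nabla f(\theta_t)
```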
PRONG: Projected Natural Gradient Descent
• Main idea is to avoid computing the inverse of the Fisher matrix
• Reparametrize the network so that the Fisher matrix is the identity
  • Hint: whitening …
Reparametrization
• Each layer has two components
  • The old 𝜃 contains all the weights; learned
  • Whitening matrix; estimated
Reparametrization
Exact forms of weights
• 𝑈𝑖 is the normalized eigen-decomposition of Σ𝑖
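As a reminder of the construction (a general whitening fact, not a quote from the paper): if Σᵢ has eigendecomposition Σᵢ = QΛQᵀ, a whitening matrix can be taken as

```latex
U_i = (\Lambda + \epsilon I)^{-1/2} Q^\top,
\qquad
U_i\, \Sigma_i\, U_i^\top \approx I
```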
PRONG Update rule
Similarity to Batch Normalization
PRONG: Results
PRONG: Results
Thanks!
References
• http://climin.readthedocs.org/en/latest/rmsprop.html
• http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
• Training a 3-Node Neural Network is NP-Complete – Blum and Rivest, 1993
• Equilibrated adaptive learning rates for non-convex optimization
• Practical Recommendations for Gradient-Based Training of Deep Architectures
• Efficient BackProp – Yann LeCun et al.
• Stochastic Gradient Descent Tricks – Léon Bottou