Review Series of Recent Deep Learning Papers: Parameter Prediction

Paper: Learning to Learn by Gradient Descent by Gradient Descent
Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas
NIPS 2016

Reviewed by: Arshdeep Sekhon
Department of Computer Science, University of Virginia
https://qdata.github.io/deep2Read/
August 25, 2018
Optimizer

A standard machine learning algorithm solves

θ* = arg min_θ f(θ)    (1)

A standard optimizer (gradient descent) updates

θ_{t+1} = θ_t − α_t ∇f(θ_t)    (2)
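As a concrete instance of Eq. (2), a minimal sketch of hand-designed gradient descent on a toy quadratic objective (the specific objective, the random W and y, and the fixed step size α are illustrative assumptions, not from the paper):

```python
import numpy as np

# Plain gradient descent on f(theta) = ||W theta - y||^2 with a
# hand-chosen, fixed step size alpha (Eq. (2)).
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10))
y = rng.standard_normal(10)

def f(theta):
    r = W @ theta - y
    return float(r @ r)

def grad_f(theta):
    return 2.0 * W.T @ (W @ theta - y)

theta = np.zeros(10)
alpha = 0.01                       # fixed, hand-designed step size
losses = [f(theta)]
for t in range(100):
    theta = theta - alpha * grad_f(theta)   # theta_{t+1} = theta_t - alpha * grad
    losses.append(f(theta))
```

With a sufficiently small step size the loss decreases monotonically; the point of the paper is that α (and the whole update form) is chosen by hand here.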
Motivation

Optimization strategies are tailored to different classes of tasks:

- Deep learning: high-dimensional, non-convex optimization problems (Adagrad, RMSprop, Rprop, etc.)
- Combinatorial optimization: relaxations

Generally, these update rules are hand-designed.
Learning to Learn

Meta-learning / learning to learn: replace hand-designed update rules with a learned update rule

θ_{t+1} = θ_t + g_t(∇f(θ_t), φ)    (3)

- g_t is the optimizer, with its own parameters φ
- g_t is produced by a recurrent neural network, parameterized by φ, that predicts the update at each timestep
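To make Eq. (3) concrete, here is a minimal sketch where the learned update g is just a gradient scaled by a single learnable parameter φ; this trivial stand-in for the paper's RNN is purely an illustrative assumption:

```python
# Sketch of Eq. (3): the update is produced by a parameterized function g
# rather than a fixed rule. Here phi plays the role of a learned step
# size; the actual paper uses an LSTM in place of this g.
def g(grad, phi):
    return -phi * grad

def f(theta):           # toy scalar objective f(theta) = theta^2
    return theta ** 2

def grad_f(theta):
    return 2.0 * theta

phi = 0.1               # imagine this value was meta-learned
theta = 5.0
for t in range(50):
    theta = theta + g(grad_f(theta), phi)   # theta_{t+1} = theta_t + g_t
```

After 50 steps θ has contracted toward the minimum at 0; meta-learning asks how to choose φ (and the form of g) automatically.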
Generalization for optimizers

Goal: find an optimizer with learned updates (instead of hand-designed updates) that performs well on a class of optimization problems.

- Generalization in machine learning: the capacity to make predictions about the target at novel, unseen points.
- Generalization in the learning-to-learn context: the optimizer should perform well on unseen problems of the same 'type' (transfer learning).
Learning to Learn with Recurrent Neural Networks

1. When can you say an optimizer is good?
2. θ* (the optimal parameters) is a function of the class of functions f being optimized and of the optimizer parameters: θ*(f, φ)    (4)
3. Expected loss:

   L(φ) = E_f [ f(θ*(f, φ)) ]    (5)

4. Expected loss with an RNN optimizer:

   L(φ) = E_f [ Σ_{t=1}^{T} w_t f(θ_t) ]    (6)

   θ_{t+1} = θ_t + g_t    (7)
   (g_t, h_{t+1}) = m(∇_t, h_t, φ)

5. Generate updates at each timestep using a recurrent neural network.
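The trajectory-based meta-loss of Eq. (6) can be sketched as follows; the scalar learned step size again stands in for the RNN m(·), and the specific weights and horizon are illustrative assumptions (the paper uses w_t = 1 for all t):

```python
import numpy as np

# Meta-loss of Eq. (6): sum the optimizee loss f(theta_t) over the whole
# optimization trajectory, weighted by w_t, for a given optimizer phi.
def f(theta):
    return float(theta @ theta)

def meta_loss(phi, theta0, T=20, w=None):
    w = np.ones(T) if w is None else w
    theta, total = theta0.copy(), 0.0
    for t in range(T):
        g_t = -phi * 2.0 * theta        # update proposed by the "optimizer"
        theta = theta + g_t             # Eq. (7): theta_{t+1} = theta_t + g_t
        total += w[t] * f(theta)        # accumulate the weighted loss
    return total

theta0 = np.ones(5)
# A better-tuned phi yields a smaller meta-loss on this toy problem:
good, bad = meta_loss(0.4, theta0), meta_loss(0.01, theta0)
```

Minimizing this quantity with respect to φ, in expectation over the function family f, is exactly the meta-training objective.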
Optimizing the optimizer

A computational graph is used for computing the gradient:

1. Minimize the loss L(φ) w.r.t. the optimizer's parameters φ.
2. Calculate ∂L/∂φ and backpropagate through time.
Optimizing the optimizer

1. Backpropagate through the computational graph.
2. Assumption: the optimizee gradients do not depend on the optimizer parameters.
3. ∇_t = ∇_θ f(θ_t); ∂∇_t/∂φ = 0: drop the gradients along the dotted edges of the graph.
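A minimal sketch of what this dropped-gradient assumption means, using the scalar learned-step-size optimizer as an illustrative stand-in for the LSTM: the "truncated" backpropagation below treats each ∇_t as a constant with respect to φ, and a central finite difference gives the exact gradient for comparison.

```python
import numpy as np

def f(theta):      return theta * theta          # toy scalar optimizee
def grad_f(theta): return 2.0 * theta

def meta_loss(phi, theta0=5.0, T=10):
    theta, total = theta0, 0.0
    for _ in range(T):
        theta = theta - phi * grad_f(theta)
        total += f(theta)
    return total

def truncated_grad(phi, theta0=5.0, T=10):
    # BPTT under the assumption d(grad_t)/d(phi) = 0: the update's
    # phi-dependence is only through the explicit factor -grad_t, so the
    # term -phi * d(grad_t)/d(phi) is dropped from the recursion.
    theta, dtheta_dphi, total = theta0, 0.0, 0.0
    for _ in range(T):
        g = grad_f(theta)
        dtheta_dphi = dtheta_dphi - g            # dropped: -phi * dg/dphi
        theta = theta - phi * g
        total += grad_f(theta) * dtheta_dphi     # chain rule into f(theta_{t+1})
    return total

phi, eps = 0.05, 1e-5
exact = (meta_loss(phi + eps) - meta_loss(phi - eps)) / (2 * eps)
approx = truncated_grad(phi)
```

The truncated gradient is biased, but on this toy problem it still points in the same descent direction as the exact one, which is what makes the assumption workable in practice.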
Coordinate-wise LSTM optimizer

1. Issue: a fully connected RNN would need a huge hidden state to produce an update for every parameter:

   θ_{t+1} = θ_t + g_t    (8)
   (g_t, h_{t+1}) = m(∇_t, h_t, φ)

2. Solution: use an RNN that produces updates coordinate-wise.
3. The coordinate-wise LSTM shares its weights across all parameters: separate hidden states, but shared parameters φ.
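The coordinate-wise scheme can be sketched as follows: one small update rule with shared parameters φ is applied independently to every coordinate, each coordinate carrying its own hidden state. Here a one-number momentum-like state stands in for the paper's two-layer LSTM; the state update and the (lr, beta) parameterization are illustrative assumptions.

```python
import numpy as np

def m(grad_i, h_i, phi):
    """Per-coordinate update rule: returns (g_i, new hidden state h_i)."""
    lr, beta = phi
    h_new = beta * h_i + (1 - beta) * grad_i    # per-coordinate state
    return -lr * h_new, h_new

def step(theta, grad, h, phi):
    g = np.empty_like(theta)
    for i in range(theta.size):                 # SAME phi for every coordinate
        g[i], h[i] = m(grad[i], h[i], phi)
    return theta + g, h                         # Eq. (8)

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
y = rng.standard_normal(8)
grad = lambda th: 2.0 * W.T @ (W @ th - y)
loss = lambda th: float(np.sum((W @ th - y) ** 2))

theta, h = np.zeros(8), np.zeros(8)
phi = (0.01, 0.9)                               # shared across all coordinates
l0 = loss(theta)
for _ in range(300):
    theta, h = step(theta, grad(theta), h, phi)
```

Because φ is shared, the number of optimizer parameters is independent of the optimizee's dimensionality, which is what lets the same trained optimizer scale to networks of different sizes.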
Experiments and Results: Quadratic Functions

1. Quadratic functions:

   f(θ) = ||Wθ − y||₂²    (9)

2. The optimizer is trained on random functions from this family.
3. It is tested on a random function sampled from the same distribution.
4. Figure: learning curves.
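The experiment's train/test protocol can be sketched as follows: sample random quadratics f(θ) = ||Wθ − y||², fit the optimizer's parameter on training functions, then evaluate on a freshly sampled function. A scalar learned step size fitted by grid search stands in for the trained LSTM; the grid, horizon, and baseline value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_problem(n=10):
    """Draw one quadratic f(theta) = ||W theta - y||^2 from the family."""
    return rng.standard_normal((n, n)), rng.standard_normal(n)

def trajectory_loss(phi, W, y, T=20):
    """Meta-loss of Eq. (6) with w_t = 1, using a scalar step size phi."""
    theta, total = np.zeros(W.shape[1]), 0.0
    for _ in range(T):
        grad = 2.0 * W.T @ (W @ theta - y)
        theta = theta - phi * grad
        total += float(np.sum((W @ theta - y) ** 2))
    return total

# "Meta-train": pick the phi minimizing the summed loss over training problems.
train = [sample_problem() for _ in range(8)]
candidates = np.linspace(0.001, 0.02, 40)
meta = [sum(trajectory_loss(p, W, y) for W, y in train) for p in candidates]
phi_star = candidates[int(np.argmin(meta))]

# "Meta-test": evaluate on an unseen problem from the same distribution.
W, y = sample_problem()
learned, baseline = trajectory_loss(phi_star, W, y), trajectory_loss(0.0005, W, y)
```

The learned φ transfers to the fresh test problem because train and test functions are drawn from the same distribution, which is exactly the paper's notion of optimizer generalization.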