
CSC413 Tutorial: Optimization for Machine Learning

Summer Tao

January 28, 2021

Based on tutorials/slides by Harris Chan, Ladislav Rampasek, Jake Snell, Kevin Swersky, Shenlong Wang & others

“How to train your neural network”

Overview

● Review: Overall Training Loop
● Initialization
● Optimization
  ○ Gradient Descent
  ○ Momentum, Nesterov Accelerated Momentum
  ○ Learning Rate Schedulers: Adagrad, RMSProp, Adam
● Hyperparameter tuning: learning rate, batch size, regularization
● Jupyter/Colab Demo in PyTorch

Neural Network Training Loop


Initialization of Parameters

The initial parameters of the neural network can affect the gradients and learning.

Idea 1: Constant initialization

● Result: for fully connected layers, identical gradients and identical neurons. Bad!

Idea 2: Random weights, to break symmetry

● Too large an initialization: exploding gradients
● Too small an initialization: vanishing gradients

Initialization: Calibrate the Variance

Two popular initialization schemes (n_{l−1} = # of neurons in layer l−1):

1: Xavier Init, for Tanh activations:  Var(W) = 1 / n_{l−1}   or   Var(W) = 2 / (n_{l−1} + n_l)

2: Kaiming He Init, for ReLU activations:  Var(W) = 2 / n_{l−1}

1: Glorot & Bengio: Understanding the Difficulty of Training Deep Feedforward Neural Networks
2: He et al.: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Initialization: Xavier Initialization Intuition

For networks with Tanh activation functions, Xavier Init aims for the following behaviour of the activations:

1. Variance of the activations ~ constant across every layer

2. Mean of the activations and weights = zero
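As a rough sketch (not part of the original slides; the layer sizes are made up), both schemes can be applied with PyTorch's built-in initializers:

```python
import torch.nn as nn

# Hypothetical layers, sized arbitrarily for illustration.
tanh_layer = nn.Linear(256, 128)   # followed by Tanh
relu_layer = nn.Linear(128, 64)    # followed by ReLU

# Xavier/Glorot init: variance calibrated for Tanh activations.
nn.init.xavier_uniform_(tanh_layer.weight)
nn.init.zeros_(tanh_layer.bias)

# Kaiming/He init: variance calibrated for ReLU activations.
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")
nn.init.zeros_(relu_layer.bias)
```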

Interactive Demo: Initialization Schemes

Source: Initializing neural networks. https://www.deeplearning.ai/ai-notes/initialization/

Batch Normalization Layer

Ioffe & Szegedy: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Can we compensate for bad initializations in some other way?

BatchNorm’s Idea:

● Explicitly normalize the activations of each layer to be unit Gaussian.

● Apply immediately after fully connected/conv layers and before non-linearities

● Learn an additional scale and shift, and track running statistics for use at test time
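A minimal sketch of this placement (the layer sizes are assumptions, not from the slides):

```python
import torch.nn as nn

# BatchNorm goes right after the fully connected layer and before the non-linearity.
# It learns a per-feature scale (gamma) and shift (beta), and tracks running
# statistics that are used at test time (after calling model.eval()).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
```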

Batch Normalization Layer

Ioffe & Szegedy: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

● BatchNorm significantly speeds up training in practice (Fig 1a)

● The distribution after the fully connected layer is much more stable with BN, reducing the “internal covariate shift”, i.e. the change in the distribution of network activations due to the change in network parameters during training

Neural Network Training Loop


Optimization: Formal Definition

● Given a training set: D = {(x^(i), y^(i))}, i = 1, …, N

● Prediction function: f(x; θ)

● Define a loss function: L(f(x; θ), y)

● Find the parameters θ* which minimize the empirical risk:

  J(θ) = (1/N) Σ_i L(f(x^(i); θ), y^(i))

Optimization: Formal Definition

● Empirical risk: J(θ) = (1/N) Σ_i L(f(x^(i); θ), y^(i))

● The optimum satisfies: ∇_θ J(θ*) = 0

● where ∇_θ J is the gradient of the empirical risk with respect to the parameters θ

● Sometimes this equation has a closed-form solution (e.g. linear regression)

Optimization: Batch Gradient Descent

Batch Gradient Descent:

● Initialize the parameters randomly

● For each iteration, do until convergence:

  θ ← θ − α ∇_θ J(θ)

  where α is the learning rate (a small step)
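A minimal sketch of this loop on a made-up least-squares problem (the data, learning rate, and iteration count are arbitrary):

```python
import torch

# Toy linear-regression data, for illustration only.
X = torch.randn(100, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5])

theta = torch.zeros(3, requires_grad=True)   # parameter initialization
lr = 0.1                                     # learning rate (a small step)

for _ in range(200):
    loss = ((X @ theta - y) ** 2).mean()     # empirical risk over the full batch
    loss.backward()                          # gradient of the empirical risk
    with torch.no_grad():
        theta -= lr * theta.grad             # theta <- theta - lr * grad J(theta)
    theta.grad.zero_()
```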

Gradient Descent

Geometric interpretation:
● The gradient is perpendicular to the tangent of the level-set curve
● At the current point, the negative gradient direction decreases the function fastest

Alternative interpretation:
● Minimize the first-order Taylor approximation of the objective while keeping the new point close to the current point

Source: Wikipedia

Stochastic Gradient Descent

● Initialize the parameters randomly
● For each iteration, do until convergence:
  ○ Randomly select a training sample (or a small subset of the training samples)
  ○ Conduct gradient descent:  θ ← θ − α ∇_θ L(f(x^(i); θ), y^(i))

● Intuition: a noisy approximation of the gradient over the whole dataset

● Pro: each update requires only a small amount of training data, which is good for training on large-scale datasets

● Tips
  ○ Subsample without replacement so that you visit each point on each pass through the dataset ("epoch")
  ○ Divide the log-likelihood estimate by the size of the mini-batch, making the learning rate invariant to the mini-batch size
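A sketch of mini-batch SGD following these tips (the data and hyperparameters are made up): shuffling with a DataLoader visits each point once per epoch, and averaging the loss over the mini-batch keeps the learning rate roughly invariant to the batch size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up regression data, for illustration only.
X = torch.randn(1000, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5])

# shuffle=True reshuffles each epoch; every point is visited once per epoch.
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

theta = torch.zeros(3, requires_grad=True)
lr = 0.05

for epoch in range(5):
    for xb, yb in loader:
        # Mean (not sum) over the mini-batch, so lr is ~invariant to batch size.
        loss = ((xb @ theta - yb) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            theta -= lr * theta.grad         # noisy estimate of the full gradient
        theta.grad.zero_()
```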

Gradient Descent with Momentum

● Initialize the parameters randomly
● For each iteration, do until convergence:
  ○ Update the momentum:  v ← μ v − α ∇_θ J(θ)
  ○ Conduct gradient descent:  θ ← θ + v

● Pro: “accelerates” learning by accumulating some “velocity/momentum” from the past gradients
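A minimal sketch of the two updates above, using a hypothetical gradient function (the quadratic and the hyperparameters are made up):

```python
import torch

def grad_fn(theta):
    # Hypothetical gradient of a simple quadratic bowl, for illustration only.
    return 2.0 * theta

theta = torch.tensor([5.0, -3.0])
v = torch.zeros_like(theta)           # velocity/momentum buffer
lr, mu = 0.1, 0.9

for _ in range(100):
    v = mu * v - lr * grad_fn(theta)  # accumulate velocity from past gradients
    theta = theta + v                 # step along the accumulated velocity
```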

Nesterov Accelerated Gradient

● Initialize the parameters randomly
● For each iteration, do until convergence:
  ○ Update the momentum:  v ← μ v − α ∇_θ J(θ + μ v)
  ○ Conduct gradient descent:  θ ← θ + v

● Pro: looks into the future to see how much momentum is required
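The same sketch with the Nesterov lookahead (again on a made-up quadratic): the gradient is evaluated at the point the momentum would carry us to.

```python
import torch

def grad_fn(theta):
    # Hypothetical gradient of a simple quadratic, for illustration only.
    return 2.0 * theta

theta = torch.tensor([5.0, -3.0])
v = torch.zeros_like(theta)
lr, mu = 0.1, 0.9

for _ in range(100):
    lookahead = theta + mu * v              # "look into the future" along the momentum
    v = mu * v - lr * grad_fn(lookahead)    # correct using the gradient at the lookahead
    theta = theta + v
```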

Nesterov Accelerated Gradient

● First make a big jump in the direction of the previous accumulated gradient
● Then measure the gradient where you end up and make a correction

[Figure: standard momentum update vs. Nesterov’s “jump, then correction”]

Learning Rate Schedulers

What if we want a per-parameter learning rate?

● Certain parameters may be more sensitive (i.e. have higher curvature)

Learning Rate Schedulers: Adagrad

● Initialize the parameters randomly
● For each iteration, do until convergence:
  ○ Conduct gradient descent on the i-th parameter (with g_i = ∇_{θ_i} J(θ)):

    G_i ← G_i + g_i²,   θ_i ← θ_i − α g_i / (√G_i + ε)

Intuition: according to the history of the gradients, it increases the effective learning rate for sparser features and decreases it for denser ones.
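A per-parameter sketch of this rule on a made-up gradient function (the step size is arbitrary):

```python
import torch

def grad_fn(theta):
    return 2.0 * theta   # hypothetical gradient, for illustration only

theta = torch.tensor([5.0, -3.0])
G = torch.zeros_like(theta)          # running sum of squared gradients, per parameter
lr, eps = 0.5, 1e-8

for _ in range(100):
    g = grad_fn(theta)
    G = G + g ** 2                             # accumulate the full gradient history
    theta = theta - lr * g / (G.sqrt() + eps)  # per-parameter effective learning rate
```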

Learning Rate Schedulers: RMSprop/Adadelta

● Initialize the parameters randomly
● For each iteration, do until convergence:
  ○ Conduct gradient descent on the i-th parameter (with g_i = ∇_{θ_i} J(θ)):

    G_i ← ρ G_i + (1 − ρ) g_i²,   θ_i ← θ_i − α g_i / (√G_i + ε)

Intuition: unlike Adagrad, the denominator places significant weight on the most recent gradients. This also helps avoid decreasing the learning rate too much.
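The same sketch with the exponential moving average used by RMSprop (the decay rate ρ is a typical but made-up value):

```python
import torch

def grad_fn(theta):
    return 2.0 * theta   # hypothetical gradient, for illustration only

theta = torch.tensor([5.0, -3.0])
G = torch.zeros_like(theta)
lr, rho, eps = 0.01, 0.9, 1e-8

for _ in range(200):
    g = grad_fn(theta)
    G = rho * G + (1 - rho) * g ** 2           # recent gradients dominate the average
    theta = theta - lr * g / (G.sqrt() + eps)
```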

Learning Rate Schedulers: Adam

● Initialize the parameters randomly
● For each iteration t, do until convergence:
  ○ Conduct gradient descent on the i-th parameter (with g_i = ∇_{θ_i} J(θ)):

    m_i ← β₁ m_i + (1 − β₁) g_i,   v_i ← β₂ v_i + (1 − β₂) g_i²

    θ_i ← θ_i − α m̂_i / (√v̂_i + ε)

  where m̂_i = m_i / (1 − β₁^t) and v̂_i = v_i / (1 − β₂^t) are the bias-corrected forms of m_i, v_i

Paper: Kingma & Ba, Adam: A Method for Stochastic Optimization
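A sketch of the Adam update with bias correction (the gradient function is made up; the hyperparameters are the paper's defaults). In practice one would simply use torch.optim.Adam, which implements this update.

```python
import torch

def grad_fn(theta):
    return 2.0 * theta   # hypothetical gradient, for illustration only

theta = torch.tensor([5.0, -3.0])
m = torch.zeros_like(theta)   # first moment (moving average of gradients)
v = torch.zeros_like(theta)   # second moment (moving average of squared gradients)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)
```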

Optimizers Comparison (excluding Adam)

Source: Sebastian Ruder, https://ruder.io/optimizing-gradient-descent/, Image: Alec Radford

SGD optimization on loss surface contours

Interactive Demo: Optimizers

Source: Parameter optimization in neural networks: https://www.deeplearning.ai/ai-notes/optimization/

Neural Network Training Loop


Learning Rate

The ideal learning rate:

● Should not be too big (the objective will blow up)
● Should not be too small (it takes longer to converge)

Convergence criteria:

● Change in the objective function is close to zero
● Gradient norm is close to zero
● Validation error starts to increase (early stopping)

[Figure: idealized cartoon depiction of loss curves for different learning rates.]

Image Credit: Andrej Karpathy

Learning Rate: Decay Schedule

Anneal (decay) the learning rate over time so the parameters can settle into a local minimum. Typical decay strategies:

1. Step Decay: reduce by a factor every few epochs (e.g. by half every 5 epochs, or by 0.1 every 20 epochs), or when validation error stops improving

2. Exponential Decay: α_t = α_0 · e^(−k·t)

3. 1/t Decay: α_t = α_0 / (1 + k·t)

where t is the iteration number and k is a hyperparameter.
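A sketch of step decay with PyTorch's built-in scheduler (the model and numbers are made up; ExponentialLR and LambdaLR cover the other two strategies):

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Toy model/optimizer, for illustration only.
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Step decay: multiply the learning rate by 0.5 every 5 epochs.
scheduler = StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(20):
    # ... run the mini-batch updates for this epoch here ...
    optimizer.step()       # placeholder for the per-batch parameter updates
    scheduler.step()       # anneal the learning rate at the end of the epoch
```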

Neural Network Training Loop


Batch Size

Batch size: the number of training data points used to compute the empirical risk at each iteration.

● Typical small batches are powers of 2: 32, 64, 128, 256, 512
● Large batches are in the thousands

A large batch size gives:

● Less frequent updates
● A more accurate gradient estimate
● More parallelization efficiency / faster wall-clock training
● Possibly worse generalization, perhaps by causing the algorithm to find poorer local optima/plateaus

Batch Size

Related papers on batch size:

● Goyal et al., Accurate, Large Minibatch SGD
  ○ Proposes to scale the learning rate linearly with the minibatch size
● Hoffer et al., Train Longer, Generalize Better
  ○ Proposes to scale the learning rate with the square root of the minibatch size
● Smith et al., Don't Decay the Learning Rate, Increase the Batch Size
  ○ Increasing the batch size reduces gradient noise while maintaining the same step size
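A small worked example of the first two heuristics (the base values are assumptions, not from the papers):

```python
base_lr, base_batch = 0.1, 256
batch_size = 1024

# Linear scaling rule (Goyal et al.): lr grows proportionally to the batch size.
lr_linear = base_lr * batch_size / base_batch           # -> 0.4

# Square-root scaling (Hoffer et al.): lr grows with the square root of the ratio.
lr_sqrt = base_lr * (batch_size / base_batch) ** 0.5    # -> 0.2

print(lr_linear, lr_sqrt)
```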

Hyperparameter Tuning

Several approaches for tuning multiple hyperparameters together:

[Figure: grid search vs. random search. Image source: Random Search for Hyper-Parameter Optimization]

Prefer random search over grid search: it has a higher chance of finding better-performing hyperparameters.

Hyperparameter Tuning

Search hyperparameters on a log scale:

● Learning rate and regularization strength have multiplicative effects on the training dynamics
● Start from coarse ranges, then narrow down, or expand the range if the best value lies near its boundary

One validation fold vs. cross-validation:

● Using a single (sizeable) validation set simplifies the code base compared to doing cross-validation
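A sketch of random search on a log scale for the learning rate and regularization strength (the ranges and the number of trials are made up):

```python
import math
import random

def sample_config():
    # Sample uniformly in log-space, then exponentiate back.
    log_lr = random.uniform(math.log10(1e-5), math.log10(1e-1))
    log_wd = random.uniform(math.log10(1e-6), math.log10(1e-2))
    return {"lr": 10 ** log_lr, "weight_decay": 10 ** log_wd}

# Draw a handful of random configurations; train each one and
# keep whichever does best on the (single, sizeable) validation set.
configs = [sample_config() for _ in range(20)]
```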

Jupyter/Colab Demo in PyTorch

See Colab notebook

References

● Notes and tutorials from other courses:
  ○ ECE521 (Winter 2017) tutorial on Training Neural Networks
  ○ Stanford's CS231n notes on Stochastic Gradient Descent, Setting Up Data and Loss, and Training Neural Networks
  ○ Deeplearning.ai's interactive notes on Initialization and Parameter Optimization in Neural Networks
  ○ Jimmy Ba's talk on Optimization in Deep Learning at Deep Learning Summer School 2019

● Academic/white papers:
  ○ SGD Tips and Tricks from Leon Bottou
  ○ Efficient BackProp from Yann LeCun
  ○ Practical Recommendations for Gradient-Based Training of Deep Architectures from Yoshua Bengio
