
Page 1:

Optimization for Machine Learning

Stochastic Variance Reduced Gradient Methods

Lecturers: Francis Bach & Robert M. Gower

Tutorials: Hadrien Hendrikx, Rui Yuan, Nidham Gazagnadou

African Master's in Machine Intelligence (AMMI), Kigali

Page 2:

References for this class

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.

Section 6.3 of: Sébastien Bubeck (2015). Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning.

How to transform convergence results into iteration complexity: Section 1.3.5 of R.M. Gower (2016). Ph.D. thesis: Sketch and Project: Randomized Iterative Methods for Linear Systems and Inverting Matrices. University of Edinburgh.

R.M. Gower, P. Richtárik and F. Bach (2018). Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching.

Page 3:

Solving the Finite Sum Training Problem

Page 4:

A Datum Function

Finite Sum Training Problem

Optimization Sum of Terms
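As a sketch of the setup (symbols are assumed, not taken from the slides): each datum contributes a loss f_i, and training minimizes their average.

```latex
% Finite-sum training problem; each datum function f_i is the loss on
% the i-th example (possibly including a regularizer).
\min_{w \in \mathbb{R}^d} \; f(w) \;:=\; \frac{1}{n} \sum_{i=1}^{n} f_i(w),
\qquad
f_i(w) \;=\; \ell\big(h_w(x_i),\, y_i\big).
```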

Page 5:

SGD shrinking stepsize

Convergence for Strongly Convex
● f(w) is μ-strongly convex
● Subgradients are bounded
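As a sketch of the scheme and rate under these assumptions (the exact stepsize schedule and constant are assumptions, not taken from the slide):

```latex
% SGD with a shrinking stepsize, assuming \mu-strong convexity and
% stochastic (sub)gradients g_t with \mathbb{E}\|g_t\|^2 \le B^2:
w_{t+1} = w_t - \gamma_t\, g_t, \qquad \gamma_t = \frac{1}{\mu (t+1)}
\;\;\Longrightarrow\;\;
\mathbb{E}\,\|w_t - w_*\|^2 \;\le\; \frac{C\, B^2}{\mu^2\, t}
\quad \text{for a modest constant } C.
```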

Page 6:

SGD Theory

SGD recap

Page 7:

SGD initially fast, slow later

SGD suffers from high variance

Page 8:

Can we get best of both?

Today we learn about methods like this one

Page 9:

Variance reduced methods

Pages 10-14:

Build an Estimate of the Gradient

Instead of using the sampled stochastic gradient directly, use it to update an estimate of the full gradient.

We would like a gradient estimate that stays similar to the full gradient and converges to it in L2:
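As a sketch of what this asks for (notation assumed):

```latex
% Desired properties of the gradient estimate g_t:
g_t \;\approx\; \nabla f(w_t),
\qquad
\mathbb{E}\big[\|g_t - \nabla f(w_t)\|^2\big] \;\longrightarrow\; 0
\quad \text{as } w_t \to w_*.
```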

Pages 15-17:

Covariates

Let x and z be random variables. We say that x and z are covariates if:

Variance Reduced Estimate:

EXE:
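As a sketch of the standard control-variate construction behind these definitions (notation assumed):

```latex
% Control-variate ("covariate") estimate of x built from a correlated
% variable z whose mean \mathbb{E}[z] is known:
\hat{x} \;=\; x - z + \mathbb{E}[z],
\qquad
\mathbb{E}[\hat{x}] = \mathbb{E}[x],
\qquad
\mathrm{Var}(\hat{x}) = \mathrm{Var}(x) + \mathrm{Var}(z) - 2\,\mathrm{Cov}(x,z).
```

The variance is reduced whenever 2 Cov(x, z) > Var(z), i.e. when x and z are sufficiently correlated; in the gradient setting, z plays the role of a stored or reference gradient.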

Page 18:

SVRG: Stochastic Variance Reduced Gradients

grad estimate

Reference point

Sample
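As a sketch of the estimate these labels annotate (symbols assumed):

```latex
% SVRG gradient estimate at the current iterate w_t, with reference
% point \bar{w} and sampled index j (uniform over 1..n):
g_t \;=\; \nabla f_j(w_t) \;-\; \nabla f_j(\bar{w}) \;+\; \nabla f(\bar{w}).
```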

Page 19:

SVRG: Stochastic Variance Reduced Gradients

Freeze reference point for m iterations
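A minimal NumPy sketch of the method just described, assuming access to a per-datum gradient oracle grad_i(i, w); all names are illustrative, not from the slides.

```python
import numpy as np

def svrg(grad_i, w0, n, stepsize, n_outer, m, rng=None):
    """SVRG sketch: grad_i(i, w) returns the gradient of f_i at w."""
    rng = np.random.default_rng(0) if rng is None else rng
    w_bar = np.array(w0, dtype=float)
    for _ in range(n_outer):
        # Full gradient at the frozen reference point.
        full_grad = sum(grad_i(i, w_bar) for i in range(n)) / n
        w = w_bar.copy()
        for _ in range(m):                # reference point frozen for m iterations
            j = rng.integers(n)           # sample one datum
            # Variance-reduced, unbiased gradient estimate.
            g = grad_i(j, w) - grad_i(j, w_bar) + full_grad
            w = w - stepsize * g
        w_bar = w                         # update the reference point
    return w_bar
```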

Page 20:

SAGA: Stochastic Average Gradient (unbiased version)

grad estimate

Store gradient

Sample

Page 21:

SAGA: Stochastic Average Gradient

Page 22:

SAGA: Stochastic Average Gradient

Stores a matrix of past gradients. No inner loop; rolling update.
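A minimal NumPy sketch of the SAGA update as described above, with illustrative names (grad_i(i, w) is an assumed per-datum gradient oracle):

```python
import numpy as np

def saga(grad_i, w0, n, stepsize, n_iters, rng=None):
    """SAGA sketch: grad_i(i, w) returns the gradient of f_i at w."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = np.array(w0, dtype=float)
    # Table of the most recently seen gradient for each datum (the stored "matrix").
    grad_table = np.array([grad_i(i, w) for i in range(n)], dtype=float)
    grad_avg = grad_table.mean(axis=0)
    for _ in range(n_iters):
        j = rng.integers(n)                      # sample one datum
        new_grad = grad_i(j, w)
        # Unbiased variance-reduced estimate; rolling update, no inner loop.
        g = new_grad - grad_table[j] + grad_avg
        grad_avg = grad_avg + (new_grad - grad_table[j]) / n
        grad_table[j] = new_grad
        w = w - stepsize * g
    return w
```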

Page 23:

SAG: Stochastic Average Gradient (Biased version)

grad estimate

Store gradient

Sample

Page 24:

SAG: Stochastic Average Gradient

EXE: Introduce a variable G. Re-write the SAG algorithm so that G is updated efficiently at each iteration.

Page 25:

SAG: Stochastic Average Gradient

Stores a matrix of past gradients. Very easy to implement.

EXE: Introduce a variable G. Re-write the SAG algorithm so that G is updated efficiently at each iteration.
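For comparison with SAGA, a minimal NumPy sketch of the (biased) SAG update with illustrative names; the running sum grad_sum plays the role of the variable G from the exercise, maintained in O(d) per iteration.

```python
import numpy as np

def sag(grad_i, w0, n, stepsize, n_iters, rng=None):
    """SAG sketch (biased): the direction is the average of the most
    recently seen gradient of every datum."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = np.array(w0, dtype=float)
    grad_table = np.zeros((n, w.size))   # stored past gradients, one row per datum
    grad_sum = np.zeros_like(w)          # G: running sum of the rows of grad_table
    for _ in range(n_iters):
        j = rng.integers(n)
        new_grad = grad_i(j, w)
        grad_sum += new_grad - grad_table[j]   # keep G up to date in O(d) per step
        grad_table[j] = new_grad
        w = w - stepsize * grad_sum / n        # biased estimate of the full gradient
    return w
```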

Page 26:

The Stochastic Average Gradient

Page 27:

The Stochastic Average Gradient

How to prove this converges? Is this the only option?

Page 28:

Stochastic Gradient Descent, α = 0.5

Page 29:

Convergence Theorems

Page 30:

Assumptions for Convergence

Smoothness + convexity

Strong Convexity

EXE:
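As a sketch of the standard assumptions these headings refer to (the constants L_i and μ are assumed notation):

```latex
\begin{align*}
&\text{Each } f_i \text{ is convex and } L_i\text{-smooth:}\\
&\quad f_i(w) \le f_i(v) + \langle \nabla f_i(v),\, w - v\rangle
   + \tfrac{L_i}{2}\|w - v\|^2 \quad \forall\, w, v \in \mathbb{R}^d, \\[4pt]
&\text{and } f \text{ is } \mu\text{-strongly convex:}\\
&\quad f(w) \ge f(v) + \langle \nabla f(v),\, w - v\rangle
   + \tfrac{\mu}{2}\|w - v\|^2 \quad \forall\, w, v \in \mathbb{R}^d.
\end{align*}
```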

Page 31:

Convergence SVRG

Theorem

In practice use

Johnson, R. & Zhang, T. (2013). Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. NIPS.
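As a sketch of the theorem referenced above, one common statement from Johnson & Zhang (constants and notation are assumptions and may differ from the slide):

```latex
% With stepsize \gamma < 1/(2 L_{\max}) and inner-loop length m large
% enough that
\rho \;=\; \frac{1}{\mu\,\gamma\,(1 - 2 L_{\max}\gamma)\, m}
      \;+\; \frac{2 L_{\max}\gamma}{1 - 2 L_{\max}\gamma} \;<\; 1,
% the outer iterates \bar{w}_s satisfy
\mathbb{E}\big[f(\bar{w}_s) - f(w_*)\big]
  \;\le\; \rho^{\,s}\,\big(f(\bar{w}_0) - f(w_*)\big).
```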

Page 32:

Convergence SAG

Theorem SAG

A practical convergence result!

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.

Because of the biased gradient estimates, the proof is difficult and relies on computer-assisted steps.

Page 33:

Convergence SAGA

Theorem SAGA

An even more practical convergence result!

A. Defazio, F. Bach and J. Lacoste-Julien (2014). SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. NIPS.

Much easier proof due to unbiased gradients

Page 34:

Comparisons in complexity for strongly convex (approximate solution)

Gradient descent

SGD

SVRG/SAGA/SAG

Variance reduction is faster than GD when (see the sketch below):
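As a sketch of the standard complexities being compared (constants omitted; L_max = max_i L_i is assumed notation):

```latex
% Number of stochastic gradient evaluations to reach an
% \epsilon-approximate solution of a \mu-strongly convex, L-smooth
% finite sum (up to constants):
\text{GD: } O\!\Big(n\,\tfrac{L}{\mu}\log\tfrac{1}{\epsilon}\Big),
\qquad
\text{SGD: } O\!\Big(\tfrac{1}{\mu\,\epsilon}\Big),
\qquad
\text{SVRG/SAGA/SAG: } O\!\Big(\big(n + \tfrac{L_{\max}}{\mu}\big)\log\tfrac{1}{\epsilon}\Big).
```

So variance reduction beats gradient descent roughly when n + L_max/μ is much smaller than n·L/μ.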

How did I get these complexity results from the convergence results?

Section 1.3.5, R.M. Gower, Ph.D. thesis: Sketch and Project: Randomized Iterative Methods for Linear Systems and Inverting Matrices, University of Edinburgh, 2016.

Pages 35-38:

Practical implementation of SAG for Linear Classifiers

Finite Sum Training Problem: L2 regularizer + linear hypothesis

The loss is nonlinear in w, but w enters each term only through the linear prediction ⟨x_i, w⟩. The stochastic gradient of each datum is therefore a scalar multiple of x_i (plus the regularization term), so only that real number needs to be stored per datum.

Stochastic gradient estimate / Full gradient estimate

Reduce storage to O(n)
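A minimal NumPy sketch of this O(n)-storage implementation for an L2-regularized linear model; loss_grad(t, y), the derivative of the loss with respect to the linear prediction t, and all other names are illustrative assumptions.

```python
import numpy as np

def sag_linear(X, y, loss_grad, lam, stepsize, n_iters, rng=None):
    """SAG for an L2-regularized linear model.

    Each datum gradient is loss_grad(<x_i, w>, y_i) * x_i + lam * w,
    so only the scalar loss_grad value is stored per datum: O(n) memory.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    alpha = np.zeros(n)            # stored scalars, one per datum
    grad_sum = np.zeros(d)         # running sum of stored datum gradients (loss part only)
    for _ in range(n_iters):
        i = rng.integers(n)
        new_alpha = loss_grad(X[i] @ w, y[i])        # scalar
        grad_sum += (new_alpha - alpha[i]) * X[i]    # O(d) update of the sum
        alpha[i] = new_alpha
        g = grad_sum / n + lam * w                   # (biased) full-gradient estimate
        w = w - stepsize * g
    return w

# Example usage with the logistic loss (an assumption, not from the slides):
# loss_grad = lambda t, y: -y / (1.0 + np.exp(y * t))
```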

Page 39:

Take-home: Variance Reduction

● Variance reduced methods use only one stochastic gradient per iteration and converge linearly on strongly convex functions

● Choice of fixed stepsize possible

● SAGA only needs to know the smoothness parameter to work, but requires storing n past stochastic gradients

● SVRG only has O(d) storage, but requires full gradient computations every so often. Has an extra “number of inner iterations” parameter to tune

Page 40:

Proving Convergence of SVRG

Page 41:

Proof:

Unbiased estimator. Taking expectation with respect to j:

Must control this!
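As a sketch of the expansion behind this step (assuming the update w_{t+1} = w_t - γ g_t):

```latex
% Expanding the step and taking the expectation over the sampled index j,
% using unbiasedness \mathbb{E}_j[g_t] = \nabla f(w_t):
\mathbb{E}_j\,\|w_{t+1} - w_*\|^2
  \;=\; \|w_t - w_*\|^2
  \;-\; 2\gamma\,\langle \nabla f(w_t),\, w_t - w_*\rangle
  \;+\; \gamma^2\,\mathbb{E}_j\,\|g_t\|^2 .
```

The last term is the one that must be controlled.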

Page 42:

Smoothness Consequences I

Smoothness

EXE: Lemma 1

Proof:
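The inequality typically used at this point of SVRG proofs, as a sketch (assuming each f_i is convex and L_i-smooth; this may not be the slide's exact statement of Lemma 1):

```latex
\|\nabla f_i(w) - \nabla f_i(v)\|^2
  \;\le\; 2 L_i \big( f_i(w) - f_i(v) - \langle \nabla f_i(v),\, w - v \rangle \big)
  \qquad \forall\, w, v \in \mathbb{R}^d .
```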

Pages 43-45:

Smoothness

Smoothness Consequences II

EXE: Lemma 2

Proof:

Lemma 1
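As a sketch of the bound usually labelled this way, obtained by averaging the previous inequality at v = w_* and using ∇f(w_*) = 0 (L_max = max_i L_i is assumed notation):

```latex
\frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(w) - \nabla f_i(w_*)\|^2
  \;\le\; 2 L_{\max}\,\big( f(w) - f(w_*) \big).
```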

Pages 46-48:

Bounding gradient estimate

EXE: Lemma 3

Proof:

Lemma 2
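As a sketch of the bound usually proved here, using ||a + b||² ≤ 2||a||² + 2||b||² on the SVRG estimate together with the averaged smoothness bound (constants assumed):

```latex
% For g_t = \nabla f_j(w_t) - \nabla f_j(\bar{w}) + \nabla f(\bar{w}):
\mathbb{E}_j\,\|g_t\|^2
  \;\le\; 4 L_{\max}\big( f(w_t) - f(w_*) \big)
  \;+\; 4 L_{\max}\big( f(\bar{w}) - f(w_*) \big).
```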

Page 49:

Proof:

Unbiased estimator. Taking expectation with respect to j:

Must control this!

Pages 50-51:

Proof (continued I):

Unbiased estimator. Taking expectation with respect to j:

Pages 52-56:

Proof (continued II):

Jensen's inequality

Re-arranging again

= 7/8