TRANSCRIPT
Introduction to Machine Learning and Stochastic Optimization
Robert M. Gower
Spring School on Optimization and Data Science, Novi Sad, March 2017
Solving the Finite Sum Training Problem
A Datum Function
Each training example contributes a datum function $f_i : \mathbb{R}^d \to \mathbb{R}$, the loss of the model with parameters $w$ on the $i$-th example.

The Finite Sum Training Problem
The training problem is the optimization of a sum of terms:
$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w).$$
Gradient Descent Example
A logistic regression problem using the fourclass labelled data from LIBSVM, $(n, d) = (862, 2)$.
[Plot: gradient descent iterates converging to the optimal point]
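To make the example concrete, here is a minimal sketch of gradient descent on regularized logistic regression. The synthetic make_data stand-in (used instead of the actual fourclass LIBSVM file) and all constants are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def make_data(n=862, d=2, seed=0):
    # Synthetic stand-in for the fourclass LIBSVM data: n points in R^d
    # with +/-1 labels (loading the real file would be analogous).
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = np.where(X @ w_true + 0.1 * rng.normal(size=n) >= 0, 1.0, -1.0)
    return X, y

def full_gradient(w, X, y, lam):
    # Gradient of f(w) = (1/n) sum_i log(1 + exp(-y_i <x_i, w>)) + (lam/2)||w||^2.
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))    # derivative of the logistic loss
    return X.T @ coeff / len(y) + lam * w

X, y = make_data()
lam = 1.0 / len(y)                          # regularization parameter
w, alpha = np.zeros(X.shape[1]), 0.5
for t in range(200):                        # gradient descent: w <- w - alpha grad f(w)
    w -= alpha * full_gradient(w, X, y, lam)
```

Each iteration costs a full pass over all $n$ data points, which motivates the stochastic methods below.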
Stochastic Gradient Descent
Is it possible to design a method that uses only the gradient of a single datum function at each iteration?

Unbiased Estimate
Let $j$ be a random index sampled from $\{1, \dots, n\}$ uniformly at random. Then
$$\mathbb{E}_j[\nabla f_j(w)] = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w) = \nabla f(w).$$
Stochastic Gradient Descent
Sample $j$ uniformly from $\{1, \dots, n\}$ and iterate
$$w^{t+1} = w^t - \alpha \, \nabla f_j(w^t).$$
[Plot: SGD iterates and the optimal point]
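A minimal sketch of this loop, continuing the toy logistic-regression setup from the snippet above (the stepsize and iteration count are illustrative assumptions):

```python
def sgd(X, y, lam, alpha=0.1, iters=5000, seed=1):
    # Stochastic gradient descent: each iteration touches the gradient of a
    # single datum function f_j rather than the full sum.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(iters):
        j = rng.integers(n)                # uniform index => unbiased estimate
        margin = y[j] * (X[j] @ w)
        grad_j = -y[j] / (1.0 + np.exp(margin)) * X[j] + lam * w
        w -= alpha * grad_j
    return w
```

Each iteration now costs $O(d)$ instead of $O(nd)$, at the price of noise in the update direction.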
Assumptions for Convergence

Strong Convexity: there exists $\mu > 0$ such that
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2 \quad \text{for all } x, y.$$
The strong convexity parameter $\mu$ is often the same as the regularization parameter.

Expected Bounded Stochastic Gradients: there exists $B > 0$ such that
$$\mathbb{E}_j\big[\|\nabla f_j(w)\|^2\big] \leq B^2 \quad \text{for all } w.$$

EXE: Using that …, show that …
Example of Strong Convexity
Hinge loss + L2 regularization: the regularizer supplies the quadratic lower bound.
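One way to make this concrete (with $\lambda$ assumed to be the regularization parameter, since the slide's formula is not shown): each regularized hinge-loss term is
$$f_i(w) = \max\big(0,\; 1 - y_i \langle x_i, w \rangle\big) + \frac{\lambda}{2}\|w\|^2.$$
The max term is convex, so the quadratic regularizer alone supplies the quadratic lower bound
$$f_i(w) \geq f_i(v) + \langle g, w - v \rangle + \frac{\lambda}{2}\|w - v\|^2 \quad \text{for any } g \in \partial f_i(v),$$
i.e. $f_i$ is strongly convex with $\mu = \lambda$.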
Complexity / Convergence
Theorem
Proof: the estimator is unbiased, so take expectation with respect to $j$; then take total expectation and apply strong convexity and the bounded stochastic gradients assumption.
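A standard reconstruction of the argument outlined by the proof labels above (assuming a constant stepsize $\alpha \leq 1/(2\mu)$; the exact constants in the lecture may differ). One SGD step gives, using the unbiased estimator and taking expectation with respect to $j$,
$$\mathbb{E}_j\|w^{t+1} - w^*\|^2 = \|w^t - w^*\|^2 - 2\alpha \langle \nabla f(w^t), w^t - w^* \rangle + \alpha^2 \, \mathbb{E}_j\|\nabla f_j(w^t)\|^2.$$
Strong convexity gives $\langle \nabla f(w^t), w^t - w^* \rangle \geq \mu \|w^t - w^*\|^2$ (since $\nabla f(w^*) = 0$), and the bounded stochastic gradients assumption bounds the last term by $\alpha^2 B^2$. Taking total expectation,
$$\mathbb{E}\|w^{t+1} - w^*\|^2 \leq (1 - 2\alpha\mu)\, \mathbb{E}\|w^t - w^*\|^2 + \alpha^2 B^2,$$
and unrolling the recursion yields
$$\mathbb{E}\|w^t - w^*\|^2 \leq (1 - 2\alpha\mu)^t \, \|w^0 - w^*\|^2 + \frac{\alpha B^2}{2\mu}:$$
linear convergence to a neighbourhood of $w^*$ whose radius is proportional to the stepsize.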
[Plots: Stochastic Gradient Descent iterates for stepsizes α = 0.01, 0.1, 0.2, 0.5]
Complexity / Convergence
Theorem (Shrinking stepsize)
With an appropriately shrinking stepsize $\alpha_t = O(1/t)$, the neighbourhood term vanishes and SGD converges sublinearly, $\mathbb{E}\|w^t - w^*\|^2 = O(1/t)$.
Shrinking Stepsize
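A minimal variant of the sgd sketch above with a shrinking schedule; the particular schedule $\alpha_t = \alpha_0 / (1 + \lambda \alpha_0 t)$ and its constants are illustrative assumptions:

```python
def sgd_shrinking(X, y, lam, alpha0=1.0, iters=5000, seed=1):
    # Same loop as sgd above, but with a stepsize shrinking like O(1/t):
    # the oscillation neighbourhood vanishes at the price of a slower rate.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(iters):
        alpha_t = alpha0 / (1.0 + lam * alpha0 * t)
        j = rng.integers(n)
        margin = y[j] * (X[j] @ w)
        grad_j = -y[j] / (1.0 + np.exp(margin)) * X[j] + lam * w
        w -= alpha_t * grad_j
    return w
```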
Comparison SGD vs GD
[Plot: suboptimality versus time for SGD, GD, and the Stochastic Average Gradient (SAG)]

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.
Maybe just an unbiased estimate is not enough.

Variance Reduced Methods through Sketching
Build an Estimate of the Gradient
Instead of using $\nabla f_j(w^t)$ directly, use it to update an estimate $g^t$ of the gradient.
We would like a gradient estimate $g^t$ such that:
- Unbiased: $\mathbb{E}[g^t] = \nabla f(w^t)$
- Converges in $L^2$: $\mathbb{E}\|g^t - \nabla f(w^t)\|^2 \to 0$
Example: The Stochastic Average Gradient
M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.
The Stochastic Average Gradient
Keep the most recently seen gradient of each datum function and average them: sample $j$, replace the stored gradient of $f_j$ with $\nabla f_j(w^t)$, and step in the direction of the average of all $n$ stored gradients.
How to prove this converges? Is this the only option?
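A minimal sketch of SAG, continuing the toy logistic-regression setup from the earlier snippets (the stepsize is an illustrative assumption, not the paper's tuned choice):

```python
def sag(X, y, lam, alpha=0.1, iters=5000, seed=1):
    # Stochastic Average Gradient: keep the most recently seen gradient of
    # each f_i as a column of J, and step with the average of all columns.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    J = np.zeros((d, n))                   # stored gradients, one per datum
    g = np.zeros(d)                        # running average (1/n) J @ ones
    for t in range(iters):
        j = rng.integers(n)
        margin = y[j] * (X[j] @ w)
        grad_j = -y[j] / (1.0 + np.exp(margin)) * X[j] + lam * w
        g += (grad_j - J[:, j]) / n        # swap old column out of the average
        J[:, j] = grad_j
        w -= alpha * g
    return w
```

Note the design trade-off: one extra gradient evaluation per step, but an $O(dn)$ table of stored gradients.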
Introducing the Jacobian
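The standard setup in the Jacobian-sketching line of work, with notation assumed here:
$$\nabla F(w) := [\nabla f_1(w), \dots, \nabla f_n(w)] \in \mathbb{R}^{d \times n}, \qquad \nabla f(w) = \frac{1}{n} \nabla F(w)\, \mathbf{1},$$
so the gradient of the training objective is the average of the columns of the Jacobian $\nabla F(w)$, and SAG's table of stored gradients is exactly an estimate $J^t \approx \nabla F(w^t)$.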
Stochastic Sparse Sketches
Stochastic Sketch: a stochastic linear measurement of the Jacobian, formed by multiplying it with a sparse stochastic matrix $S$.
Eg: SGD Sketch
Eg: Mini-batch SGD Sketch
Many examples: sparse Rademacher matrices, sampling with replacement, nonuniform sampling, etc.
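Concretely, in the assumed notation above, the two named examples correspond to
$$S = e_j \in \mathbb{R}^{n} \quad \text{(SGD sketch)}, \qquad S = I_{:C} \in \mathbb{R}^{n \times |C|},\; C \subseteq \{1, \dots, n\} \quad \text{(mini-batch SGD sketch)},$$
so that $\nabla F(w)\, S$ reads off the gradient of a single datum function, or the gradients of a mini-batch, respectively.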
A Jacobian Based Method
Jacobian Sketching Algorithm: at each step, sample a stochastic sketch, use it to maintain a Jacobian estimate $J^t$, and form an improved guess of the gradient from $J^t$.
Updating the Jacobian Estimate: Sketch and Project
R. M. Gower and P. Richtárik (2015). Randomized Iterative Methods for Linear Systems. SIAM Journal on Matrix Analysis and Applications, 36(4).
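In the assumed notation above, the sketch-and-project update of this line of work reads
$$J^{t+1} = \operatorname*{arg\,min}_{J \in \mathbb{R}^{d \times n}} \|J - J^t\|_F^2 \quad \text{subject to} \quad J S_t = \nabla F(w^t) S_t,$$
i.e. project the current estimate onto the set of matrices that agree with the true Jacobian on the sketched measurement.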
Exercise
Show that the solution $J^t$ is given by the closed form derived below.
Proof: The Lagrangian is given by the objective plus a multiplier term for the sketched constraint. Differentiating in $J$ and setting to zero gives (1); the constraint gives (2). Substituting (1) into (2) gives the solution.
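Carrying those steps through (the multiplier $\Lambda$ and the pseudoinverse notation are assumptions of this reconstruction):
$$\mathcal{L}(J, \Lambda) = \|J - J^t\|_F^2 + \langle \Lambda,\; (J - \nabla F(w^t)) S_t \rangle.$$
Differentiating in $J$ and setting to zero:
$$2(J - J^t) + \Lambda S_t^\top = 0 \quad \Longrightarrow \quad J = J^t - \tfrac{1}{2} \Lambda S_t^\top. \quad (1)$$
The constraint:
$$J S_t = \nabla F(w^t) S_t. \quad (2)$$
Substituting (1) into (2) gives $\tfrac{1}{2}\Lambda = (J^t - \nabla F(w^t))\, S_t (S_t^\top S_t)^{\dagger}$, hence
$$J^{t+1} = J^t - (J^t - \nabla F(w^t))\, S_t (S_t^\top S_t)^{\dagger} S_t^\top.$$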
Solution: sketch and project the Jacobian.
Unbiased Condition
Lemma.
Proof:
Exercise
Proof:
A Jacobian Based Method
Archetype Jacobian Sketching Algorithm
Looks expensive and complicated. Investigate.
Homework:
Example: minibatch-SAGA, with its gradient estimate and Jacobian update (a sketch follows below).
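A minimal sketch of what the minibatch-SAGA gradient estimate and Jacobian update can look like, continuing the toy problem from the earlier snippets; the batch size, stepsize, and vectorization details are illustrative assumptions:

```python
def minibatch_saga(X, y, lam, alpha=0.1, batch=8, iters=2000, seed=1):
    # minibatch-SAGA: an unbiased gradient estimate built from the Jacobian
    # estimate J, then a column-replacement (sketch-and-project) update of J.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    J = np.zeros((d, n))                   # Jacobian estimate, column i ~ grad f_i
    Jbar = np.zeros(d)                     # (1/n) J @ ones, kept up to date
    for t in range(iters):
        C = rng.choice(n, size=batch, replace=False)
        margins = y[C] * (X[C] @ w)
        grads = (-(y[C] / (1.0 + np.exp(margins)))[:, None] * X[C]).T + lam * w[:, None]
        # gradient estimate: g = (1/|C|) sum_{i in C}(grad_i - J_i) + (1/n) J 1
        g = (grads - J[:, C]).mean(axis=1) + Jbar
        Jbar += (grads - J[:, C]).sum(axis=1) / n   # keep the column average current
        J[:, C] = grads                              # Jacobian update
        w -= alpha * g
    return w
```

With the mini-batch sketch $S = I_{:C}$, the column replacement above is exactly the sketch-and-project update $J^{t+1} = J^t - (J^t - \nabla F(w^t)) S (S^\top S)^{\dagger} S^\top$.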
Proving Convergence of Variance Reduced Methods