Introduction to Machine Learning and Stochastic Optimization. Robert M. Gower. Spring School on Optimization and Data Science, Novi Sad, March 2017.

Solving the Finite Sum Training Problem

A Datum Function

Finite Sum Training Problem

Optimization Sum of Terms
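
The slide's formulas are not preserved in this transcript. As a minimal sketch of what a datum function and the finite sum objective can look like, the code below assumes an L2-regularized logistic loss and a small synthetic data set. The names `LAM`, `datum_loss` and `training_loss` and the data are illustrative assumptions, not taken from the slides.

```python
import math
import random

LAM = 0.1  # hypothetical regularization parameter


def datum_loss(w, x, y):
    """f_i(w): regularized logistic loss of one labelled data point (x, y)."""
    margin = y * sum(wk * xk for wk, xk in zip(w, x))
    return math.log(1.0 + math.exp(-margin)) + 0.5 * LAM * sum(wk * wk for wk in w)


def training_loss(w, X, Y):
    """f(w) = (1/n) * sum_i f_i(w): the finite sum training objective."""
    return sum(datum_loss(w, x, y) for x, y in zip(X, Y)) / len(X)


# Synthetic stand-in data: 20 points in the square [-1, 1]^2 with +/-1 labels.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(20)]
Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

# At w = 0 every datum function equals log(2), so the average does too.
print(training_loss([0.0, 0.0], X, Y), math.log(2))
```

Each term of the sum touches a single data point, which is what later allows methods that only evaluate one datum function per iteration.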

The Training Problem

Gradient Descent Example

A Logistic Regression problem using the fourclass labelled data from LIBSVM

(n, d) = (862, 2)

[Figure: gradient descent iterates, with the optimal point marked]
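
A minimal sketch of gradient descent on a logistic regression problem follows. It does not load the fourclass data: a small synthetic 2-dimensional data set stands in for it, and the regularization parameter `LAM`, the stepsize `alpha = 1.0` and all function names are assumptions made for illustration.

```python
import math
import random

LAM = 0.1  # hypothetical L2 regularization parameter


def make_data(n=20, seed=0):
    """Synthetic stand-in for a labelled data set (this is not fourclass)."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]
    return X, Y


def loss(w, X, Y):
    """Average logistic loss plus (LAM / 2) * ||w||^2."""
    data = sum(math.log(1.0 + math.exp(-y * (w[0] * x[0] + w[1] * x[1])))
               for x, y in zip(X, Y)) / len(X)
    return data + 0.5 * LAM * (w[0] ** 2 + w[1] ** 2)


def full_grad(w, X, Y):
    """Gradient of the full objective: needs one pass over all n points."""
    n = len(X)
    g = [LAM * w[0], LAM * w[1]]
    for x, y in zip(X, Y):
        c = -y / (1.0 + math.exp(y * (w[0] * x[0] + w[1] * x[1])))
        g[0] += c * x[0] / n
        g[1] += c * x[1] / n
    return g


def gradient_descent(X, Y, alpha=1.0, iters=500):
    """Plain gradient descent; alpha = 1.0 is below 1/L for this data."""
    w = [0.0, 0.0]
    history = [loss(w, X, Y)]
    for _ in range(iters):
        g = full_grad(w, X, Y)
        w = [w[0] - alpha * g[0], w[1] - alpha * g[1]]
        history.append(loss(w, X, Y))
    return w, history


X, Y = make_data()
w_gd, history = gradient_descent(X, Y)
print("first loss:", history[0], "last loss:", history[-1])
```

Because the features lie in [-1, 1]^2, the objective is smooth with constant at most 0.25 * 2 + `LAM`, so this stepsize decreases the loss at every iteration.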

The Training Problem

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Unbiased Estimate: Let j be an index sampled uniformly at random from {1, …, n}. Then $\mathbb{E}_j[\nabla f_j(w)] = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(w) = \nabla f(w)$.
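
The unbiasedness claim can be checked numerically. The sketch below assumes an L2-regularized logistic loss on synthetic data (neither is given in the transcript) and compares a Monte Carlo average of single-datum gradients with the full gradient. All names are illustrative.

```python
import math
import random

LAM = 0.1  # hypothetical regularization parameter


def grad_i(w, x, y):
    """Gradient of one datum function f_i(w) (regularized logistic loss)."""
    c = -y / (1.0 + math.exp(y * (w[0] * x[0] + w[1] * x[1])))
    return [c * x[0] + LAM * w[0], c * x[1] + LAM * w[1]]


def full_grad(w, X, Y):
    """Gradient of f(w) = (1/n) * sum_i f_i(w)."""
    n = len(X)
    g = [0.0, 0.0]
    for x, y in zip(X, Y):
        gi = grad_i(w, x, y)
        g[0] += gi[0] / n
        g[1] += gi[1] / n
    return g


data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

w = [0.3, -0.7]            # arbitrary test point
rng = random.Random(1)
num_samples = 20000
mc_mean = [0.0, 0.0]
for _ in range(num_samples):
    j = rng.randrange(len(X))          # j uniform over the data indices
    g = grad_i(w, X[j], Y[j])
    mc_mean[0] += g[0] / num_samples
    mc_mean[1] += g[1] / num_samples

exact = full_grad(w, X, Y)
print("Monte Carlo mean:", mc_mean)
print("full gradient:   ", exact)
```

The two vectors agree up to sampling noise, which shrinks like one over the square root of the number of samples.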

Stochastic Gradient Descent
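
The algorithm on this slide is not preserved in the transcript. A minimal sketch of stochastic gradient descent with a constant stepsize, assuming a regularized logistic loss and synthetic data, could look as follows. The stepsize 0.05, the iteration count and all names are assumptions.

```python
import math
import random

LAM = 0.1  # hypothetical regularization parameter


def grad_i(w, x, y):
    """Gradient of one datum function f_i(w) (regularized logistic loss)."""
    c = -y / (1.0 + math.exp(y * (w[0] * x[0] + w[1] * x[1])))
    return [c * x[0] + LAM * w[0], c * x[1] + LAM * w[1]]


def loss(w, X, Y):
    """Full training objective, used here only to monitor progress."""
    data = sum(math.log(1.0 + math.exp(-y * (w[0] * x[0] + w[1] * x[1])))
               for x, y in zip(X, Y)) / len(X)
    return data + 0.5 * LAM * (w[0] ** 2 + w[1] ** 2)


def sgd(X, Y, alpha=0.05, iters=5000, seed=0):
    """Each iteration uses the gradient of a single sampled datum function."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(iters):
        j = rng.randrange(len(X))      # sample j uniformly at random
        g = grad_i(w, X[j], Y[j])
        w = [w[0] - alpha * g[0], w[1] - alpha * g[1]]
    return w


data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

w_sgd = sgd(X, Y)
print("loss at start:", loss([0.0, 0.0], X, Y))
print("loss after SGD:", loss(w_sgd, X, Y))
```

One iteration costs a single datum gradient, independent of n, at the price of noisy steps.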

[Figure: stochastic gradient descent iterates, with the optimal point marked]

Assumptions for Convergence

Strong Convexity

Strong convexity parameter! Often the same as the regularization parameter.

Expected Bounded Stochastic Gradients

EXE: Using that

Show that

Example of Strong Convexity

Hinge loss + L2

Quadratic lower bound
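
The quadratic lower bound can be checked numerically for the hinge loss with L2 regularization. The sketch below assumes the objective f(w) = (1/n) Σ max(0, 1 − y_i ⟨x_i, w⟩) + (λ/2)‖w‖² and tests f(w) ≥ f(v) + ⟨g, w − v⟩ + (λ/2)‖w − v‖², where g is a subgradient at v. The data, `LAM` and the function names are assumptions.

```python
import random

LAM = 0.1  # hypothetical regularization = strong convexity parameter


def hinge_objective(w, X, Y):
    """Average hinge loss plus (LAM / 2) * ||w||^2."""
    data = sum(max(0.0, 1.0 - y * (w[0] * x[0] + w[1] * x[1]))
               for x, y in zip(X, Y)) / len(X)
    return data + 0.5 * LAM * (w[0] ** 2 + w[1] ** 2)


def hinge_subgradient(w, X, Y):
    """One valid subgradient (the hinge term contributes 0 at the kink)."""
    n = len(X)
    g = [LAM * w[0], LAM * w[1]]
    for x, y in zip(X, Y):
        if y * (w[0] * x[0] + w[1] * x[1]) < 1.0:
            g[0] -= y * x[0] / n
            g[1] -= y * x[1] / n
    return g


def lower_bound_gap(w, v, X, Y):
    """f(w) minus the quadratic lower bound built at v; should be >= 0."""
    g = hinge_subgradient(v, X, Y)
    diff = [w[0] - v[0], w[1] - v[1]]
    bound = (hinge_objective(v, X, Y)
             + g[0] * diff[0] + g[1] * diff[1]
             + 0.5 * LAM * (diff[0] ** 2 + diff[1] ** 2))
    return hinge_objective(w, X, Y) - bound


data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

rng = random.Random(2)
gaps = []
for _ in range(200):
    w = [rng.uniform(-3, 3), rng.uniform(-3, 3)]
    v = [rng.uniform(-3, 3), rng.uniform(-3, 3)]
    gaps.append(lower_bound_gap(w, v, X, Y))
print("smallest gap over random pairs:", min(gaps))
```

The hinge term is only convex, so all of the curvature in the bound comes from the regularizer, which is why the strong convexity parameter coincides with the regularization parameter here.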

Complexity / Convergence

Theorem

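Proof: Expand the squared distance to the optimal point after one step. Taking expectation with respect to j and using the unbiased estimator turns the stochastic gradient in the cross term into the full gradient. Strong convexity bounds the cross term and the bounded stochastic gradient assumption bounds the squared gradient norm. Taking total expectation and unrolling the recurrence gives the result.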
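
The steps of the proof can be written out as follows. This is a sketch under assumed notation that the transcript does not preserve: iterates w^t with update w^{t+1} = w^t − α∇f_j(w^t), minimizer w^*, strong convexity parameter λ, and bound B² on the expected squared norm of the stochastic gradients. The constants in the slide's theorem may differ from the ones obtained here.

```latex
% Assumed: f is \lambda-strongly convex, E_j ||\nabla f_j(w^t)||^2 <= B^2,
% and w^{t+1} = w^t - \alpha \nabla f_j(w^t) with 0 < \alpha <= 1/\lambda.
\begin{align*}
\|w^{t+1} - w^*\|^2
  &= \|w^t - w^*\|^2
     - 2\alpha \langle \nabla f_j(w^t), w^t - w^* \rangle
     + \alpha^2 \|\nabla f_j(w^t)\|^2 \\
\mathbb{E}_j \|w^{t+1} - w^*\|^2
  &\le \|w^t - w^*\|^2
     - 2\alpha \langle \nabla f(w^t), w^t - w^* \rangle
     + \alpha^2 B^2
  && \text{(unbiased estimator, bounded stoch.\ grad.)} \\
  &\le (1 - \alpha\lambda)\, \|w^t - w^*\|^2 + \alpha^2 B^2
  && \text{(strong convexity)} \\
\mathbb{E} \|w^{t} - w^*\|^2
  &\le (1 - \alpha\lambda)^{t}\, \|w^0 - w^*\|^2 + \frac{\alpha B^2}{\lambda}
  && \text{(total expectation, geometric sum)}
\end{align*}
```

The strong convexity step uses ⟨∇f(w^t), w^t − w^*⟩ ≥ f(w^t) − f(w^*) + (λ/2)‖w^t − w^*‖² ≥ (λ/2)‖w^t − w^*‖². The last term does not vanish as t grows, so with a constant stepsize the bound only guarantees convergence to a neighborhood of w^* whose size scales with α.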

Stochastic Gradient Descent α = 0.01

Stochastic Gradient Descent α = 0.1

Stochastic Gradient Descent α = 0.2

Stochastic Gradient Descent α = 0.5
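
The effect of the constant stepsize can be reproduced in a small experiment. The sketch below runs SGD with the four stepsizes from the slides on a synthetic regularized logistic regression problem (an assumption, the original plots use other data) and reports the average squared distance to the optimum over the last iterations. The reference optimum is computed with gradient descent. All names are illustrative.

```python
import math
import random

LAM = 0.1  # hypothetical regularization parameter


def grad_i(w, x, y):
    """Gradient of one datum function f_i(w) (regularized logistic loss)."""
    c = -y / (1.0 + math.exp(y * (w[0] * x[0] + w[1] * x[1])))
    return [c * x[0] + LAM * w[0], c * x[1] + LAM * w[1]]


def reference_optimum(X, Y, iters=2000):
    """Gradient descent with stepsize 1.0, run long enough to converge."""
    n = len(X)
    w = [0.0, 0.0]
    for _ in range(iters):
        g = [0.0, 0.0]
        for x, y in zip(X, Y):
            gi = grad_i(w, x, y)
            g[0] += gi[0] / n
            g[1] += gi[1] / n
        w = [w[0] - g[0], w[1] - g[1]]
    return w


def tail_error(X, Y, w_star, alpha, iters=20000, tail=5000, seed=0):
    """Mean squared distance to w_star over the last `tail` SGD iterates."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    total = 0.0
    for t in range(iters):
        j = rng.randrange(len(X))
        g = grad_i(w, X[j], Y[j])
        w = [w[0] - alpha * g[0], w[1] - alpha * g[1]]
        if t >= iters - tail:
            total += (w[0] - w_star[0]) ** 2 + (w[1] - w_star[1]) ** 2
    return total / tail


data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

w_star = reference_optimum(X, Y)
errors = {alpha: tail_error(X, Y, w_star, alpha)
          for alpha in (0.01, 0.1, 0.2, 0.5)}
for alpha, err in errors.items():
    print("alpha =", alpha, "tail error =", err)
```

Larger stepsizes move faster at first but settle into a larger noise ball around the optimum, in line with a bound that has a term proportional to the stepsize.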

Complexity / Convergence

Theorem (Shrinking stepsize)

Sublinear convergence

Shrinking Stepsize
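
The stepsize schedule of the theorem is not preserved in the transcript. As a sketch, the code below assumes the schedule α_t = α_0 / (1 + α_0 λ t), which decays like 1/(λ t), and compares it with the constant stepsize α_0 = 0.5 on a synthetic regularized logistic regression problem. The schedule, the data and all names are assumptions.

```python
import math
import random

LAM = 0.1  # hypothetical regularization parameter


def grad_i(w, x, y):
    """Gradient of one datum function f_i(w) (regularized logistic loss)."""
    c = -y / (1.0 + math.exp(y * (w[0] * x[0] + w[1] * x[1])))
    return [c * x[0] + LAM * w[0], c * x[1] + LAM * w[1]]


def reference_optimum(X, Y, iters=2000):
    """Gradient descent with stepsize 1.0, run long enough to converge."""
    n = len(X)
    w = [0.0, 0.0]
    for _ in range(iters):
        g = [0.0, 0.0]
        for x, y in zip(X, Y):
            gi = grad_i(w, x, y)
            g[0] += gi[0] / n
            g[1] += gi[1] / n
        w = [w[0] - g[0], w[1] - g[1]]
    return w


def run_sgd(X, Y, w_star, stepsize, iters=20000, tail=5000, seed=0):
    """SGD with stepsize(t); returns mean squared error over the tail."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    total = 0.0
    for t in range(iters):
        alpha = stepsize(t)
        j = rng.randrange(len(X))
        g = grad_i(w, X[j], Y[j])
        w = [w[0] - alpha * g[0], w[1] - alpha * g[1]]
        if t >= iters - tail:
            total += (w[0] - w_star[0]) ** 2 + (w[1] - w_star[1]) ** 2
    return total / tail


data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

w_star = reference_optimum(X, Y)
ALPHA0 = 0.5
constant_err = run_sgd(X, Y, w_star, lambda t: ALPHA0)
shrinking_err = run_sgd(X, Y, w_star,
                        lambda t: ALPHA0 / (1.0 + ALPHA0 * LAM * t))
print("constant stepsize tail error: ", constant_err)
print("shrinking stepsize tail error:", shrinking_err)
```

With the shrinking schedule the noise ball keeps contracting, so the iterates approach the optimum, but only at a sublinear rate.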

Comparison SGD vs GD

[Figure: progress over time of SGD, GD and the Stochastic Average Gradient (SAG)]

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.

Maybe just an unbiased estimate is not enough.

Variance reduced methods through Sketching

Build an Estimate of the Gradient

Instead of using the stochastic gradient directly, use it to update an estimate of the gradient.

We would like a gradient estimate such that:

Unbiased

Converges in L2

Example: The Stochastic Average Gradient

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.

The Stochastic Average Gradient

How to prove this converges? Is this the only option?
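
The SAG update itself is not preserved in the transcript. The following is a minimal sketch of the method as commonly described: keep the most recent gradient seen for every data point, refresh one of them per iteration, and step along the average of the stored gradients. The regularized logistic loss, the synthetic data, the stepsize 1/(16 L_max) and all names are assumptions.

```python
import math
import random

LAM = 0.1  # hypothetical regularization parameter


def grad_i(w, x, y):
    """Gradient of one datum function f_i(w) (regularized logistic loss)."""
    c = -y / (1.0 + math.exp(y * (w[0] * x[0] + w[1] * x[1])))
    return [c * x[0] + LAM * w[0], c * x[1] + LAM * w[1]]


def full_grad(w, X, Y):
    """Full gradient, used here only to check the final iterate."""
    n = len(X)
    g = [0.0, 0.0]
    for x, y in zip(X, Y):
        gi = grad_i(w, x, y)
        g[0] += gi[0] / n
        g[1] += gi[1] / n
    return g


def sag(X, Y, iters=20000, seed=0):
    """Stochastic Average Gradient with a table of stored gradients."""
    rng = random.Random(seed)
    n = len(X)
    # Smoothness constant of each f_i is at most 0.25 * ||x_i||^2 + LAM.
    l_max = 0.25 * max(x[0] ** 2 + x[1] ** 2 for x in X) + LAM
    alpha = 1.0 / (16.0 * l_max)
    w = [0.0, 0.0]
    table = [[0.0, 0.0] for _ in range(n)]   # last gradient seen per point
    total = [0.0, 0.0]                       # running sum of the table
    for _ in range(iters):
        j = rng.randrange(n)
        g = grad_i(w, X[j], Y[j])
        total = [total[0] + g[0] - table[j][0], total[1] + g[1] - table[j][1]]
        table[j] = g
        # Step along the average of the stored (partly stale) gradients.
        w = [w[0] - alpha * total[0] / n, w[1] - alpha * total[1] / n]
    return w


data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

w_sag = sag(X, Y)
print("gradient norm at SAG iterate:", math.hypot(*full_grad(w_sag, X, Y)))
```

Unlike SGD with a constant stepsize, the iterates do not stall in a noise ball: the stored gradients all approach the gradients at the optimum, so the estimate's error vanishes. The price is memory for n gradients and an estimate that is not unbiased.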

Introducing the Jacobian
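
The definition on this slide is not preserved. A common convention, assumed here, is to stack the datum gradients as the columns of a d × n matrix ∇F(w) = [∇f_1(w), …, ∇f_n(w)], so that the full gradient is (1/n) ∇F(w) 1. The sketch below builds this matrix for a synthetic regularized logistic regression problem and checks the identity against a finite-difference gradient. All names are illustrative.

```python
import math
import random

LAM = 0.1  # hypothetical regularization parameter


def loss(w, X, Y):
    """f(w): average regularized logistic loss."""
    data = sum(math.log(1.0 + math.exp(-y * (w[0] * x[0] + w[1] * x[1])))
               for x, y in zip(X, Y)) / len(X)
    return data + 0.5 * LAM * (w[0] ** 2 + w[1] ** 2)


def grad_i(w, x, y):
    """Gradient of one datum function f_i(w)."""
    c = -y / (1.0 + math.exp(y * (w[0] * x[0] + w[1] * x[1])))
    return [c * x[0] + LAM * w[0], c * x[1] + LAM * w[1]]


def jacobian(w, X, Y):
    """d x n matrix (list of rows) whose i-th column is grad f_i(w)."""
    columns = [grad_i(w, x, y) for x, y in zip(X, Y)]
    return [[col[k] for col in columns] for k in range(len(w))]


data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

w = [0.3, -0.7]
J = jacobian(w, X, Y)
n = len(X)

# (1/n) * J * 1: average the columns, i.e. take the mean of every row.
grad_from_jacobian = [sum(row) / n for row in J]

# Central finite differences of f as an independent reference.
h = 1e-6
grad_fd = []
for k in range(2):
    w_plus = list(w)
    w_minus = list(w)
    w_plus[k] += h
    w_minus[k] -= h
    grad_fd.append((loss(w_plus, X, Y) - loss(w_minus, X, Y)) / (2 * h))

print(grad_from_jacobian, grad_fd)
```

In this view SAG maintains an estimate of the Jacobian, and each sampled gradient reveals one of its columns.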

The Stochastic Average Gradient

Is this the only option? How to prove this converges?

Stoch. Linear Measurement

Stochastic Sparse Sketches

Stochastic Sketch

Sparse Stochastic Matrix

Eg: SGD Sketch

Stochastic Sparse Sketches

Many examples: sparse Rademacher matrices, sampling with replacement, nonuniform sampling, etc.

Eg: Mini-batch SGD Sketch
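
The sketch matrices on these slides are not preserved. Under the assumption that a sketch is a sparse n × τ matrix S applied to the d × n Jacobian from the right, the SGD sketch is a single unit vector e_j and the mini-batch sketch collects the unit vectors of the batch as columns. The code below builds both and shows that the sketched Jacobian contains exactly the selected datum gradients. The matrix used is a random stand-in and all names are illustrative.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def sgd_sketch(n, j):
    """n x 1 sketch e_j: picks out column j of the Jacobian."""
    return [[1.0 if i == j else 0.0] for i in range(n)]


def minibatch_sketch(n, batch):
    """n x tau sketch whose columns are e_i for every i in the batch."""
    return [[1.0 if i == b else 0.0 for b in batch] for i in range(n)]


rng = random.Random(4)
d, n = 2, 6
# Random stand-in for the Jacobian: column i plays the role of grad f_i(w).
J = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(d)]

single = matmul(J, sgd_sketch(n, 3))       # d x 1: the gradient of datum 3
batch = rng.sample(range(n), 3)            # three distinct indices
sketched = matmul(J, minibatch_sketch(n, batch))   # d x 3

print("batch:", batch)
print("sketched Jacobian:", sketched)
```

Each sketch costs only as many datum gradients as it has nonzero rows, which is what makes sparse sketches cheap measurements of the Jacobian.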

A Jacobian Based Method

Maintain Jacobian Estimate

Sample Stochastic Sketch

Improved Guess (still to be determined)

A Jacobian Based Method

Jacobian Sketching Algorithm

The gradient estimate and the Jacobian update are still to be specified.

Updating the Jacobian Estimate: Sketch and project

Sketch and Project the Jacobian

R. M. Gower and P. Richtárik (2015). Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications 36(4).

Exercise

Show that the solution Jt is given by

Solution:

Proof: The Lagrangian is given by

Differentiating in J and setting to zero: (1), (2)

Substituting (1) into (2) gives the solution.
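
The projection problem and its solution are not preserved in the transcript. A standard form, assumed here, is to pick the matrix closest to the current estimate J in Frobenius norm among all matrices that agree with the true Jacobian ∇F on the sketch, that is J_new s = ∇F s. For a single sketch vector s the solution is J_new = J + (∇F − J) s sᵀ / (sᵀ s). The sketch below implements this update and checks both the constraint and the special case s = e_j. The matrices are random stand-ins and all names are illustrative.

```python
import random


def sketch_and_project(J, JF, s):
    """Closest matrix to J (Frobenius norm) satisfying J_new s = JF s.

    J and JF are d x n matrices stored as lists of rows, s has length n.
    """
    n = len(s)
    s_norm_sq = sum(v * v for v in s)
    # Residual of the sketched equation: (JF - J) s, a vector of length d.
    residual = [sum((JF[k][i] - J[k][i]) * s[i] for i in range(n))
                for k in range(len(J))]
    return [[J[k][i] + residual[k] * s[i] / s_norm_sq for i in range(n)]
            for k in range(len(J))]


rng = random.Random(3)
d, n = 2, 5
JF = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(d)]  # "true" Jacobian
J = [[0.0] * n for _ in range(d)]                                # current estimate

# Rademacher sketch: the sketched equation holds after the update.
s = [rng.choice([-1.0, 1.0]) for _ in range(n)]
J_rad = sketch_and_project(J, JF, s)
lhs = [sum(J_rad[k][i] * s[i] for i in range(n)) for k in range(d)]
rhs = [sum(JF[k][i] * s[i] for i in range(n)) for k in range(d)]

# Unit vector sketch e_2: only column 2 is replaced by the true column.
e_2 = [0.0] * n
e_2[2] = 1.0
J_col = sketch_and_project(J, JF, e_2)

print(lhs, rhs)
print(J_col)
```

With unit vector sketches the update reduces to overwriting the sampled column, which is the table update used by SAG and SAGA.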

Solution:

Sketch and project the Jacobian

Unbiased Condition

Lemma.

Proof:

Exercise

Proof:

A Jacobian Based Method

Archetype Jacobian Sketching Algorithm

Looks expensive and complicated. Investigate

Homework:

Example: minibatch-SAGA

Gradient estimate

Jacobian update
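
The formulas for the gradient estimate and the Jacobian update are not preserved in the transcript. The following is one possible sketch of a minibatch SAGA method, not necessarily the form intended in the homework: the gradient estimate is the average of the stored columns plus the averaged correction over a uniformly sampled batch (which keeps it unbiased), and the Jacobian update overwrites the batch columns. The loss, the data, the batch size, the stepsize 1/(3 L_max) and all names are assumptions.

```python
import math
import random

LAM = 0.1  # hypothetical regularization parameter


def grad_i(w, x, y):
    """Gradient of one datum function f_i(w) (regularized logistic loss)."""
    c = -y / (1.0 + math.exp(y * (w[0] * x[0] + w[1] * x[1])))
    return [c * x[0] + LAM * w[0], c * x[1] + LAM * w[1]]


def full_grad(w, X, Y):
    """Full gradient, used here only to check the final iterate."""
    n = len(X)
    g = [0.0, 0.0]
    for x, y in zip(X, Y):
        gi = grad_i(w, x, y)
        g[0] += gi[0] / n
        g[1] += gi[1] / n
    return g


def minibatch_saga(X, Y, tau=4, iters=10000, seed=0):
    rng = random.Random(seed)
    n = len(X)
    l_max = 0.25 * max(x[0] ** 2 + x[1] ** 2 for x in X) + LAM
    alpha = 1.0 / (3.0 * l_max)
    w = [0.0, 0.0]
    J = [[0.0, 0.0] for _ in range(n)]   # J[i] stores column i of the estimate
    J_mean = [0.0, 0.0]                  # (1/n) * J * 1, kept up to date
    for _ in range(iters):
        batch = rng.sample(range(n), tau)            # tau distinct indices
        fresh = {i: grad_i(w, X[i], Y[i]) for i in batch}
        # Gradient estimate: stored average plus batch-averaged correction.
        g = [J_mean[k] + sum(fresh[i][k] - J[i][k] for i in batch) / tau
             for k in range(2)]
        # Jacobian update: overwrite the sampled columns.
        for i in batch:
            for k in range(2):
                J_mean[k] += (fresh[i][k] - J[i][k]) / n
            J[i] = fresh[i]
        w = [w[k] - alpha * g[k] for k in range(2)]
    return w


data_rng = random.Random(0)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(20)]
Y = [1.0 if x[0] + 0.5 * x[1] > 0 else -1.0 for x in X]

w_saga = minibatch_saga(X, Y)
print("gradient norm at final iterate:", math.hypot(*full_grad(w_saga, X, Y)))
```

The estimate is computed with the old columns before they are overwritten. That ordering is what makes its expectation over the batch equal to the full gradient.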

Proving Convergence of Variance reduced methods