TRANSCRIPT
Introduction to Machine Learning and Stochastic Optimization
Robert M. Gower
Spring School on Optimization and Data Science, Novi Sad, March 2017
Solving the Finite Sum Training Problem
A Datum Function
Each training example contributes a datum function $f_i : \mathbb{R}^d \to \mathbb{R}$, the loss of the model with parameters $w$ on the $i$-th example.

The Finite Sum Training Problem
The training problem is the optimization of a sum of terms:
$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w).$$
Gradient Descent Example
A logistic regression problem using the fourclass labelled data from LIBSVM, $(n, d) = (862, 2)$.
[Plot: gradient descent iterates converging to the optimal point]
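To make the example concrete, here is a minimal sketch of gradient descent on regularized logistic regression. The synthetic make_data stand-in (used instead of the actual fourclass LIBSVM file) and all constants are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def make_data(n=862, d=2, seed=0):
    # Synthetic stand-in for the fourclass LIBSVM data: n points in R^d
    # with +/-1 labels (loading the real file would be analogous).
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = np.where(X @ w_true + 0.1 * rng.normal(size=n) >= 0, 1.0, -1.0)
    return X, y

def full_gradient(w, X, y, lam):
    # Gradient of f(w) = (1/n) sum_i log(1 + exp(-y_i <x_i, w>)) + (lam/2)||w||^2.
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))    # derivative of the logistic loss
    return X.T @ coeff / len(y) + lam * w

X, y = make_data()
lam = 1.0 / len(y)                          # regularization parameter
w, alpha = np.zeros(X.shape[1]), 0.5
for t in range(200):                        # gradient descent: w <- w - alpha grad f(w)
    w -= alpha * full_gradient(w, X, y, lam)
```

Each iteration costs a full pass over all $n$ data points, which motivates the stochastic methods below.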
Stochastic Gradient Descent
Is it possible to design a method that uses only the gradient of a single datum function at each iteration?

Unbiased Estimate
Let $j$ be a random index sampled from $\{1, \dots, n\}$ uniformly at random. Then
$$\mathbb{E}_j[\nabla f_j(w)] = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w) = \nabla f(w).$$
Stochastic Gradient Descent
Sample $j$ uniformly from $\{1, \dots, n\}$ and iterate
$$w^{t+1} = w^t - \alpha \, \nabla f_j(w^t).$$
[Plot: SGD iterates and the optimal point]
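A minimal sketch of this loop, continuing the toy logistic-regression setup from the snippet above (the stepsize and iteration count are illustrative assumptions):

```python
def sgd(X, y, lam, alpha=0.1, iters=5000, seed=1):
    # Stochastic gradient descent: each iteration touches the gradient of a
    # single datum function f_j rather than the full sum.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(iters):
        j = rng.integers(n)                # uniform index => unbiased estimate
        margin = y[j] * (X[j] @ w)
        grad_j = -y[j] / (1.0 + np.exp(margin)) * X[j] + lam * w
        w -= alpha * grad_j
    return w
```

Each iteration now costs $O(d)$ instead of $O(nd)$, at the price of noise in the update direction.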
Assumptions for Convergence

Strong Convexity: there exists $\mu > 0$ such that
$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2 \quad \text{for all } x, y.$$
The strong convexity parameter $\mu$ is often the same as the regularization parameter.

Expected Bounded Stochastic Gradients: there exists $B > 0$ such that
$$\mathbb{E}_j\big[\|\nabla f_j(w)\|^2\big] \leq B^2 \quad \text{for all } w.$$

EXE: Using that …, show that …
Example of Strong Convexity
Hinge loss + L2 regularization: the regularizer supplies the quadratic lower bound.
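One way to make this concrete (with $\lambda$ assumed to be the regularization parameter, since the slide's formula is not shown): each regularized hinge-loss term is
$$f_i(w) = \max\big(0,\; 1 - y_i \langle x_i, w \rangle\big) + \frac{\lambda}{2}\|w\|^2.$$
The max term is convex, so the quadratic regularizer alone supplies the quadratic lower bound
$$f_i(w) \geq f_i(v) + \langle g, w - v \rangle + \frac{\lambda}{2}\|w - v\|^2 \quad \text{for any } g \in \partial f_i(v),$$
i.e. $f_i$ is strongly convex with $\mu = \lambda$.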
Complexity / Convergence
Theorem
Proof: the estimator is unbiased, so take expectation with respect to $j$; then take total expectation and apply strong convexity and the bounded stochastic gradients assumption.
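A standard reconstruction of the argument outlined by the proof labels above (assuming a constant stepsize $\alpha \leq 1/(2\mu)$; the exact constants in the lecture may differ). One SGD step gives, using the unbiased estimator and taking expectation with respect to $j$,
$$\mathbb{E}_j\|w^{t+1} - w^*\|^2 = \|w^t - w^*\|^2 - 2\alpha \langle \nabla f(w^t), w^t - w^* \rangle + \alpha^2 \, \mathbb{E}_j\|\nabla f_j(w^t)\|^2.$$
Strong convexity gives $\langle \nabla f(w^t), w^t - w^* \rangle \geq \mu \|w^t - w^*\|^2$ (since $\nabla f(w^*) = 0$), and the bounded stochastic gradients assumption bounds the last term by $\alpha^2 B^2$. Taking total expectation,
$$\mathbb{E}\|w^{t+1} - w^*\|^2 \leq (1 - 2\alpha\mu)\, \mathbb{E}\|w^t - w^*\|^2 + \alpha^2 B^2,$$
and unrolling the recursion yields
$$\mathbb{E}\|w^t - w^*\|^2 \leq (1 - 2\alpha\mu)^t \, \|w^0 - w^*\|^2 + \frac{\alpha B^2}{2\mu}:$$
linear convergence to a neighbourhood of $w^*$ whose radius is proportional to the stepsize.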
[Plots: Stochastic Gradient Descent iterates for stepsizes α = 0.01, 0.1, 0.2, 0.5]
Complexity / Convergence
Theorem (Shrinking stepsize)
With an appropriately shrinking stepsize $\alpha_t = O(1/t)$, the neighbourhood term vanishes and SGD converges sublinearly, $\mathbb{E}\|w^t - w^*\|^2 = O(1/t)$.
Shrinking Stepsize
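A minimal variant of the sgd sketch above with a shrinking schedule; the particular schedule $\alpha_t = \alpha_0 / (1 + \lambda \alpha_0 t)$ and its constants are illustrative assumptions:

```python
def sgd_shrinking(X, y, lam, alpha0=1.0, iters=5000, seed=1):
    # Same loop as sgd above, but with a stepsize shrinking like O(1/t):
    # the oscillation neighbourhood vanishes at the price of a slower rate.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(iters):
        alpha_t = alpha0 / (1.0 + lam * alpha0 * t)
        j = rng.integers(n)
        margin = y[j] * (X[j] @ w)
        grad_j = -y[j] / (1.0 + np.exp(margin)) * X[j] + lam * w
        w -= alpha_t * grad_j
    return w
```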
Comparison SGD vs GD
[Plot: suboptimality versus time for SGD, GD, and the Stochastic Average Gradient (SAG)]

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.
Maybe just an unbiased estimate is not enough.

Variance Reduced Methods through Sketching
Build an Estimate of the Gradient
Instead of using $\nabla f_j(w^t)$ directly, use it to update an estimate $g^t$ of the gradient.
We would like a gradient estimate $g^t$ such that:
- Unbiased: $\mathbb{E}[g^t] = \nabla f(w^t)$
- Converges in $L^2$: $\mathbb{E}\|g^t - \nabla f(w^t)\|^2 \to 0$
Example: The Stochastic Average Gradient
M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.
The Stochastic Average Gradient
Keep the most recently seen gradient of each datum function and average them: sample $j$, replace the stored gradient of $f_j$ with $\nabla f_j(w^t)$, and step in the direction of the average of all $n$ stored gradients.
How to prove this converges? Is this the only option?
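A minimal sketch of SAG, continuing the toy logistic-regression setup from the earlier snippets (the stepsize is an illustrative assumption, not the paper's tuned choice):

```python
def sag(X, y, lam, alpha=0.1, iters=5000, seed=1):
    # Stochastic Average Gradient: keep the most recently seen gradient of
    # each f_i as a column of J, and step with the average of all columns.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    J = np.zeros((d, n))                   # stored gradients, one per datum
    g = np.zeros(d)                        # running average (1/n) J @ ones
    for t in range(iters):
        j = rng.integers(n)
        margin = y[j] * (X[j] @ w)
        grad_j = -y[j] / (1.0 + np.exp(margin)) * X[j] + lam * w
        g += (grad_j - J[:, j]) / n        # swap old column out of the average
        J[:, j] = grad_j
        w -= alpha * g
    return w
```

Note the design trade-off: one extra gradient evaluation per step, but an $O(dn)$ table of stored gradients.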
Introducing the Jacobian
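The standard setup in the Jacobian-sketching line of work, with notation assumed here:
$$\nabla F(w) := [\nabla f_1(w), \dots, \nabla f_n(w)] \in \mathbb{R}^{d \times n}, \qquad \nabla f(w) = \frac{1}{n} \nabla F(w)\, \mathbf{1},$$
so the gradient of the training objective is the average of the columns of the Jacobian $\nabla F(w)$, and SAG's table of stored gradients is exactly an estimate $J^t \approx \nabla F(w^t)$.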
Stochastic Sparse Sketches
Stochastic Sketch: a stochastic linear measurement of the Jacobian, formed by multiplying it with a sparse stochastic matrix $S$.
Eg: SGD Sketch
Eg: Mini-batch SGD Sketch
Many examples: sparse Rademacher matrices, sampling with replacement, nonuniform sampling, etc.
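Concretely, in the assumed notation above, the two named examples correspond to
$$S = e_j \in \mathbb{R}^{n} \quad \text{(SGD sketch)}, \qquad S = I_{:C} \in \mathbb{R}^{n \times |C|},\; C \subseteq \{1, \dots, n\} \quad \text{(mini-batch SGD sketch)},$$
so that $\nabla F(w)\, S$ reads off the gradient of a single datum function, or the gradients of a mini-batch, respectively.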
A Jacobian Based Method
Jacobian Sketching Algorithm: at each step, sample a stochastic sketch, use it to maintain a Jacobian estimate $J^t$, and form an improved guess of the gradient from $J^t$.
Updating the Jacobian Estimate: Sketch and Project
R. M. Gower and P. Richtárik (2015). Randomized Iterative Methods for Linear Systems. SIAM Journal on Matrix Analysis and Applications, 36(4).
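In the assumed notation above, the sketch-and-project update of this line of work reads
$$J^{t+1} = \operatorname*{arg\,min}_{J \in \mathbb{R}^{d \times n}} \|J - J^t\|_F^2 \quad \text{subject to} \quad J S_t = \nabla F(w^t) S_t,$$
i.e. project the current estimate onto the set of matrices that agree with the true Jacobian on the sketched measurement.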
Exercise
Show that the solution $J^t$ is given by the closed form derived below.
Proof: The Lagrangian is given by the objective plus a multiplier term for the sketched constraint. Differentiating in $J$ and setting to zero gives (1); the constraint gives (2). Substituting (1) into (2) gives the solution.
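Carrying those steps through (the multiplier $\Lambda$ and the pseudoinverse notation are assumptions of this reconstruction):
$$\mathcal{L}(J, \Lambda) = \|J - J^t\|_F^2 + \langle \Lambda,\; (J - \nabla F(w^t)) S_t \rangle.$$
Differentiating in $J$ and setting to zero:
$$2(J - J^t) + \Lambda S_t^\top = 0 \quad \Longrightarrow \quad J = J^t - \tfrac{1}{2} \Lambda S_t^\top. \quad (1)$$
The constraint:
$$J S_t = \nabla F(w^t) S_t. \quad (2)$$
Substituting (1) into (2) gives $\tfrac{1}{2}\Lambda = (J^t - \nabla F(w^t))\, S_t (S_t^\top S_t)^{\dagger}$, hence
$$J^{t+1} = J^t - (J^t - \nabla F(w^t))\, S_t (S_t^\top S_t)^{\dagger} S_t^\top.$$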
Solution: sketch and project the Jacobian.
Unbiased Condition
Lemma.
Proof:
Exercise
Proof:
A Jacobian Based Method
Archetype Jacobian Sketching Algorithm
Looks expensive and complicated. Investigate.
Homework:
Example: minibatch-SAGA, with its gradient estimate and Jacobian update (a sketch follows below).
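A minimal sketch of what the minibatch-SAGA gradient estimate and Jacobian update can look like, continuing the toy problem from the earlier snippets; the batch size, stepsize, and vectorization details are illustrative assumptions:

```python
def minibatch_saga(X, y, lam, alpha=0.1, batch=8, iters=2000, seed=1):
    # minibatch-SAGA: an unbiased gradient estimate built from the Jacobian
    # estimate J, then a column-replacement (sketch-and-project) update of J.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    J = np.zeros((d, n))                   # Jacobian estimate, column i ~ grad f_i
    Jbar = np.zeros(d)                     # (1/n) J @ ones, kept up to date
    for t in range(iters):
        C = rng.choice(n, size=batch, replace=False)
        margins = y[C] * (X[C] @ w)
        grads = (-(y[C] / (1.0 + np.exp(margins)))[:, None] * X[C]).T + lam * w[:, None]
        # gradient estimate: g = (1/|C|) sum_{i in C}(grad_i - J_i) + (1/n) J 1
        g = (grads - J[:, C]).mean(axis=1) + Jbar
        Jbar += (grads - J[:, C]).sum(axis=1) / n   # keep the column average current
        J[:, C] = grads                              # Jacobian update
        w -= alpha * g
    return w
```

With the mini-batch sketch $S = I_{:C}$, the column replacement above is exactly the sketch-and-project update $J^{t+1} = J^t - (J^t - \nabla F(w^t)) S (S^\top S)^{\dagger} S^\top$.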
Proving Convergence of Variance Reduced Methods