Stochastic Gradient Descent Methods
Stochastic Optimization STUDY GROUP
Xuetong Wu & Viktoria Schram
Department of EEE, University of Melbourne
October 22, 2020



Overview

1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods
   Non-Smooth Optimization Problems
   Optimization Considering Additional Information
   Optimization in Case of Non-I.i.d. Data
5 Conclusion

Introduction

Parameter Estimation Problems

- Communications
- Tracking
- Control theory
- System identification
- Machine learning
- ...

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications

Classification Problems

Consider the typical image classification problem: given training examples of dogs and cats with labels, we wish to learn a good model h to minimise the prediction error:

min_h (1/n) ∑_{i=1}^n 1{h(X_i) ≠ Y_i}


Regression Problems

Consider the simple regression problem: we wish to learn a good model Y = aX + b to minimise the mean squared error. Mathematically,

min_{a,b} (1/n) ∑_{i=1}^n (Y_i − aX_i − b)²

Optimization in Learning Problems

Many machine learning problems can be formulated as the following problem:

min_{w∈W} F(w) = (1/n) ∑_{i=1}^n f(w, Z_i)

- Z_i: training sample / data pair (X_i, Y_i)
- w: model parameters (e.g., a, b in the least squares problem)
- f: loss function

Gradient Descent

Gradient Descent

min_{w∈W} F(w) = (1/n) ∑_{i=1}^n f(w, Z_i)

If f is convex and differentiable w.r.t. w, the first-order Taylor approximation with η > 0 gives

F(w + η∆w) ≈ F(w) + η∆w^T ∇_w F(w)

The best ∆w that minimises the R.H.S. is

∆w = −∇_w F(w)

We choose an initial point w_0 and a step size η_t at each time t.

(Batch) gradient descent:

w_{t+1} = w_t − η_t ∇_w F(w_t) = w_t − (η_t/n) ∑_{i=1}^n ∇_w f(w_t, Z_i)

Stop at a point such that

F(w_t) − F(w*) ≤ ε
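The batch update above can be sketched in a few lines of Python on a least squares instance (a minimal sketch; the synthetic data and all names are illustrative):

```python
import numpy as np

def batch_gd(X, y, eta=0.1, iters=500):
    """(Batch) gradient descent for min_w (1/n) * ||Xw - y||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # full-data gradient at w_t
        w = w - eta * grad                    # w_{t+1} = w_t - eta_t * grad
    return w

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=100), np.ones(100)])  # feature + intercept
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=100)
w = batch_gd(X, y)
```

Note that every iteration touches all n samples, which is exactly the cost issue motivating the stochastic variants.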

Gradient Descent

Figure: Visualization of Gradient Descent

Convergence Rate for GD

Assume f is convex and differentiable w.r.t. w, and further assume the gradient ∇_w F(w) = (1/n) ∑_{i=1}^n ∇_w f(w, Z_i) is L-Lipschitz continuous (∇²F ⪯ LI). Then:

Theorem
Gradient descent with fixed step size η ≤ 1/L satisfies
F(w_t) − F(w*) ≤ ‖w_0 − w*‖² / (2ηt)

Convergence rate ∼ O(1/t), iteration complexity ∼ O(1/ε).

R. Tibshirani, Convex Optimization 10-725

Convergence Rate for GD with Strong Convexity

Furthermore, if F(w) is µ-strongly convex (∇²F ⪰ µI):

Theorem
Gradient descent with fixed step size η ≤ 2/(µ + L), or with backtracking line search, satisfies
F(w_t) − F(w*) ≤ c^t (L/2) ‖w_0 − w*‖²
where 0 < c < 1.

Convergence rate ∼ O(c^t); iterations needed for error ε ∼ O(log(1/ε)).

R. Tibshirani, Convex Optimization 10-725

Problems

Two main drawbacks of gradient descent:

- If n is relatively large, computing the full gradient is memory- and time-consuming.
- If the loss function is nonconvex, the solution can get stuck at a stationary point (e.g., a saddle point).

Stochastic Gradient Descent

Stochastic Gradient Descent

A practical alternative is to simulate a data stream by picking Z_t uniformly at random from the training examples at each time t.

This gives the stochastic gradient descent update:

w_{t+1} = w_t − η_t ∇_w f(w_t, Z_t)

Why does this work? By uniform sampling,

E_{Z_t}[∇_w f(w_t, Z_t)] = (1/n) ∑_{i=1}^n ∇_w f(w_t, Z_i) = ∇_w F(w_t)

The estimate is unbiased but has high variance; SGD usually works well in large-scale problems.
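A minimal sketch of this update on a least squares problem (synthetic, noiseless data so the iterates can converge exactly; all names are illustrative), sampling one example uniformly per step:

```python
import numpy as np

def sgd(X, y, eta=0.02, iters=20000, seed=0):
    """SGD for min_w (1/n) * ||Xw - y||^2: one uniformly sampled example per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(n)                    # pick Z_t uniformly at random
        grad = 2.0 * X[i] * (X[i] @ w - y[i])  # single-sample (unbiased) gradient
        w = w - eta * grad
    return w

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=200), np.ones(200)])
y = X @ np.array([1.5, 0.5])  # noiseless targets
w = sgd(X, y)
```

Each step costs O(1) in n, at the price of a noisy search direction.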

Stochastic GD vs. GD

Figure: Stochastic GD vs. GD

Remarks on SGD

- Computational cost for n samples and p iterations: GD ∼ O(np), SGD ∼ O(p).
- SGD does not always produce descent directions, and the gradient estimate is very noisy!
- With a constant step size, the SGD iterate bounces around the optimal value.
- Convergence properties?

Convergence Rate Analysis

minimize_w F(w) := (1/n) ∑_{i=1}^n f(w, Z_i)

We wish to achieve ε-optimality,

E[F(w_t)] − F(w*) ≤ ε

after t iterations.

Assumptions

- F(w) is µ-strongly convex and its gradient is L-Lipschitz continuous, with µ/L ≤ 1.
- ∇f(w_t, Z_t) is an unbiased estimate of ∇F(w_t).
- For all w, the variance of the gradient is bounded:
  E_Z[‖∇f(w, Z)‖²₂] − ‖E_Z[∇f(w, Z)]‖²₂ ≤ σ²

Constant Step Size

Theorem (Convergence with Fixed Step Sizes)
Under the assumptions, if η_t = η ≤ 1/L, then SGD achieves

E[F(w_t)] − F(w*) ≤ ηLσ²/(2µ) + (1 − ηµ)^t (F(w_0) − F(w*))

- Linear convergence at the beginning.
- When t → ∞,
  E[F(w_t) − F(w*)] ≤ ηLσ²/(2µ)
  i.e., SGD converges only to a neighborhood of w*: the variance in the gradient computation prevents further progress.

Theorem 4.6 in Bottou, 2018, Optimization Methods for Large-Scale Machine Learning

Diminishing Step Size

Theorem (Convergence with Diminishing Step Sizes)
Under the assumptions, if η_t = θ/(t + 1) for some θ > 1/µ, then SGD achieves

E[F(w_t) − F(w*)] ≤ ν/(t + 1)

where

ν := max{ θ²Lσ²/(2(θµ − 1)), F(w_0) − F(w*) }

Convergence rate is O(1/t); iterations needed are O(1/ε) with diminishing step size η_t ∼ 1/t.

Theorem 4.7 in Bottou, 2018, Optimization Methods for Large-Scale Machine Learning
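The two regimes can be seen on a toy strongly convex quadratic F(w) = w²/2 with artificially noisy gradients (a sketch with illustrative constants, µ = L = 1 and σ = 1): a constant step stalls at a noise floor, while a θ/(t+1)-type schedule keeps converging.

```python
import numpy as np

def sgd_quadratic(step, iters=4000, seed=0):
    """SGD on F(w) = 0.5 * w**2 with noisy gradient g = w + noise."""
    rng = np.random.default_rng(seed)
    w = 5.0
    for t in range(iters):
        g = w + rng.normal()        # unbiased gradient estimate, sigma^2 = 1
        w -= step(t) * g
    return abs(w)                   # distance to the optimum w* = 0

err_const = sgd_quadratic(lambda t: 0.1)            # stalls near the noise floor
err_dimin = sgd_quadratic(lambda t: 2.0 / (t + 2))  # diminishing, theta > 1/mu
```

Averaged over several seeds, the diminishing schedule ends much closer to w* than the constant one.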

Convergence Rate and Time Comparison

Risk minimization under strong convexity and L-smoothness:

         iteration complexity   per-iteration cost   total comput. cost
  GD     log(1/ε)               n                    n log(1/ε)
  SGD    1/ε                    1                    1/ε

Advantages

Compared to (batch) gradient descent, SGD has the following advantages:

- Lower computational cost per iteration.
- For large datasets (large n and moderate ε), lower total computational cost.
- In non-convex cases, the gradient noise can sometimes help SGD escape saddle points.

Rong Ge et al., COLT'2015, Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition

Variance Reduction and Acceleration

To reduce the variance of the gradient estimate, we can use mini-batch SGD with a random batch I_t of size k ≪ n:

w_{t+1} = w_t − (η_t/k) ∑_{i∈I_t} ∇f(w_t, Z_i)

Figure: SGD for logistic regression problems with n = 10000

R. Tibshirani, Convex Optimization 10-725
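A sketch of the mini-batch update on least squares (illustrative data and names; the batch is drawn without replacement each step):

```python
import numpy as np

def minibatch_sgd(X, y, k=10, eta=0.05, iters=3000, seed=0):
    """Mini-batch SGD for min_w (1/n) * ||Xw - y||^2 with batch size k << n."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=k, replace=False)  # random mini-batch I_t
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / k) * Xb.T @ (Xb @ w - yb)     # averaged batch gradient
        w -= eta * grad
    return w

rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(size=300), np.ones(300)])
y = X @ np.array([0.7, 1.2])
w = minibatch_sgd(X, y)
```

Averaging over the batch divides the gradient variance by roughly k, at k times the per-iteration cost.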

Variance Reduction and Acceleration

Can we have one gradient evaluation per iteration and only O(log(1/ε)) iterations? Yes! Stochastic Average Gradient (SAG):

w_t = w_{t−1} − (η_t/n) ∑_{i=1}^n g_i^t  with  g_i^t = ∇f(w_{t−1}, Z_i) if i = i(t), and g_i^t = g_i^{t−1} otherwise

SAG gradient estimates are no longer unbiased, but they have greatly reduced variance. With the fixed step size η_t = 1/(16L),

E[F(w_t)] − F(w*) ≤ O((1 − min{µ/(16L), 1/(8n)})^t)

Iteration complexity ∼ O(log(1/ε))!

Other variants with similar convergence: SDCA, SVRG, SAGA

Roux et al., NIPS'12, A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
Shai & Zhang, JMLR'13, Stochastic dual coordinate ascent methods for regularized loss minimization
Johnson & Zhang, NIPS'13, Accelerating stochastic gradient descent using predictive variance reduction
Defazio & Bach, NIPS'14, SAGA
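The table-of-gradients idea behind SAG can be sketched as follows (a least squares instance; the data, the smoothness estimate, and all names are illustrative):

```python
import numpy as np

def sag(X, y, iters=6000, seed=0):
    """Stochastic Average Gradient for min_w (1/n) * ||Xw - y||^2.

    Keeps the most recent gradient for every sample and steps along the
    average of the table, refreshing one randomly chosen entry per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = 2.0 * np.max(np.sum(X**2, axis=1))  # per-sample smoothness bound
    eta = 1.0 / (16.0 * L)                  # fixed step 1/(16L) from the slide
    w = np.zeros(d)
    table = np.zeros((n, d))                # stored gradients g_i^t
    avg = np.zeros(d)                       # running average of the table
    for _ in range(iters):
        i = rng.integers(n)
        g_new = 2.0 * X[i] * (X[i] @ w - y[i])
        avg += (g_new - table[i]) / n       # update the average incrementally
        table[i] = g_new
        w -= eta * avg
    return w

rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(size=50), np.ones(50)])
y = X @ np.array([1.0, -2.0])
w = sag(X, y)
```

The price for the variance reduction is O(nd) memory for the gradient table.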

More on SGD

If n is not very large, is it enough to reach

(1/n) ∑_{i=1}^n (f(w_t, Z_i) − f(w*, Z_i)) ≤ ε ?

Sometimes simply minimising the training loss will cause overfitting.

Caveats: Data Overfitting for y = sin(2πx) + noise

Figure: fits after (a) t = 2, (b) t = 7, (c) t = 100, (d) t = 5000 iterations.

Often, small training error ⇏ small testing error!

Early Stopping

If the samples are regarded as random variables drawn from some distribution P, we may instead consider minimising the true risk,

F_true(w_t) = E_{Z∼P}[f(w_t, Z)]

Figure: Early Stopping with SGD
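A common practical proxy for the true risk is a held-out validation set; a sketch of SGD with early stopping (all data and names are illustrative):

```python
import numpy as np

def sgd_early_stopping(X, y, Xval, yval, eta=0.01, max_iters=20000, patience=500, seed=0):
    """Run SGD on the training loss but return the iterate with the lowest
    validation loss, stopping once it has not improved for `patience` steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_best, val_best, since_best = w.copy(), np.inf, 0
    for _ in range(max_iters):
        i = rng.integers(n)
        w -= eta * 2.0 * X[i] * (X[i] @ w - y[i])
        val = np.mean((Xval @ w - yval) ** 2)   # validation risk estimate
        if val < val_best:
            w_best, val_best, since_best = w.copy(), val, 0
        else:
            since_best += 1
            if since_best >= patience:          # early stop
                break
    return w_best, val_best

rng = np.random.default_rng(4)
X = np.column_stack([rng.normal(size=100), np.ones(100)])
y = X @ np.array([1.0, 0.0]) + 0.3 * rng.normal(size=100)
Xval = np.column_stack([rng.normal(size=50), np.ones(50)])
yval = Xval @ np.array([1.0, 0.0]) + 0.3 * rng.normal(size=50)
w_best, val_best = sgd_early_stopping(X, y, Xval, yval)
```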

SGD Applications in Supervised Learning

SGD methods are widely applied in machine learning problems, for various differentiable loss functions:

- Adaline: ∑_i (1/2)(y_i − w^T Φ(x_i))²
- Tikhonov: ∑_i (y_i − w^T x_i)² + λ‖w‖²₂
- Logistic regression: −∑_i [ y_i log(1/(1 + e^{−w^T x_i})) + (1 − y_i) log(1 − 1/(1 + e^{−w^T x_i})) ]

What if the loss is not differentiable?

- SVM: (1/2)‖w‖²₂ + C ∑_i max(0, 1 − y_i(w^T x_i + b))
- Lasso: ∑_i (y_i − w^T x_i)² + λ‖w‖₁
- Perceptron: ∑_i max{0, −y_i w^T Φ(x_i)}
- Neural network: ∑_i f(w, Z_i)
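For the non-differentiable losses, the gradient is replaced by a subgradient. For example, a subgradient of the per-sample SVM objective above can be computed as follows (a sketch; names are illustrative):

```python
import numpy as np

def hinge_subgradient(w, b, x, y, C=1.0):
    """A subgradient of (1/2) * ||w||^2 + C * max(0, 1 - y * (w.x + b)) at (w, b)."""
    margin = y * (x @ w + b)
    if margin >= 1.0:             # hinge inactive: only the regulariser contributes
        return w, 0.0
    return w - C * y * x, -C * y  # hinge active (any valid choice at margin == 1)

gw, gb = hinge_subgradient(np.array([1.0, -1.0]), 0.0, np.array([0.5, 0.5]), 1.0)
```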

Stochastic Subgradient Methods

Short note: to be consistent with the notation used in the first part of this presentation, the following slides were adapted, so the nomenclature differs from that used in the recorded presentation. Sorry for any inconvenience this might cause.

Optimization Problem - Optimal Scenario

Finding zeros (roots) of a real-valued smooth function ∇f(w): R^d → R^d, for w ∈ R^d.

⇒ Newton's method:

w_{t+1} = w_t − [∇²f(w_t)]^{−1} ∇f(w_t)

- t: iteration index
- w: parameters of the function f(·)
- ∇f(w): gradient of f(·)
- ∇²f(w): derivative of ∇f(·) w.r.t. w (the Hessian)

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications

Optimization Problem - Reality

Finding zeros (roots) of an unknown real-valued function ∇f(w): R^d → R^d, for w ∈ R^d, which can be observed, but the observation may be corrupted by (i.i.d.) errors (ε_t)_{t≥1}.

⇒ Stochastic Approximation (Robbins & Monro '51):

w_{t+1} = w_t − η_t[∇f(w_t) + ε_t]

- η_t: step size (learning rate) for iteration t
- ε_t: zero-mean noise for iteration t
- ∇f(w): gradient of the function f(·)

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications
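A minimal sketch of the Robbins-Monro iteration on a toy root-finding problem (the map g(w) = w - 3 and the noise level are illustrative), with the classic step size η_t = 1/(t+1):

```python
import numpy as np

def robbins_monro(noisy_g, w0=0.0, iters=20000, seed=0):
    """w_{t+1} = w_t - eta_t * (g(w_t) + noise) with eta_t = 1/(t+1)."""
    rng = np.random.default_rng(seed)
    w = w0
    for t in range(iters):
        w -= (1.0 / (t + 1)) * noisy_g(w, rng)
    return w

# Toy target: root of g(w) = w - 3, observed with zero-mean noise.
root = robbins_monro(lambda w, rng: (w - 3.0) + rng.normal(scale=0.5))
```

The diminishing steps average out the noise, so the iterate settles at the root rather than bouncing around it.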

Gradient Descent

w_{t+1} = w_t − (η_t/n) ∑_{i=1}^n ∇f(w_t, Z_i)

Stochastic Gradient Descent

w_{t+1} = w_t − η_t ∇f(w_t, Z_{i_t})

- f(w_t, Z_{i_t}): loss function for parameters w_t and sample Z_{i_t}
- n: set (batch) size
- η_t: learning rate at iteration t
- t: iteration, t = 1, 2, ...
- i_t ∈ {1, ..., n}: index chosen uniformly at random at iteration t

R. Tibshirani, Convex Optimization 10-725

Stochastic Gradient Descent

w_{t+1} = w_t − η_t ∇f(w_t, Z_{i_t})

Random selection of the sample index gives an unbiased estimate:

E[∇f(w_t, Z_{i_t})] = ∇F(w_t)

R. Tibshirani, Convex Optimization 10-725

Question

What if the objective is non-smooth, additional information is available, or the input samples are non-i.i.d.?

Goal:

min_w (1/n) ∑_{i=1}^n f(w, Z_i)

f(w, Z_i): loss function with parameters w for the ith sample Z_i

H. Li et al., 2018, Visualizing the Loss Landscape of Neural Nets.


Goal

- Non-differentiable (non-smooth) objective ⇒ ?
- Additional information ⇒ ?
- Non-i.i.d. data ⇒ ?

Non-Smooth Optimization Problems

Subgradient of a Function

g is a subgradient of f at x₂ if

f(x) ≥ f(x₂) + g^T(x − x₂)  ∀x

∂f(x): the subdifferential, i.e. the set of all subgradients g_i ∈ ∂f(x)
(x ≙ w: slight abuse of notation compared to before)

S. Boyd, Stanford University, EE364b
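For example, f(x) = |x| is non-differentiable at 0, where every g ∈ [−1, 1] satisfies the inequality; a quick numeric check of the definition (illustrative):

```python
import numpy as np

def is_subgradient(f, g, x0, xs):
    """Check f(x) >= f(x0) + g * (x - x0) on a grid of test points."""
    return all(f(x) >= f(x0) + g * (x - x0) - 1e-12 for x in xs)

xs = np.linspace(-2.0, 2.0, 401)
ok_half = is_subgradient(abs, 0.5, 0.0, xs)  # g = 0.5 in [-1, 1]: a subgradient at 0
ok_two = is_subgradient(abs, 2.0, 0.0, xs)   # g = 2 violates the inequality
```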


Subgradient of a Function

Goal: min_x f(x); Use: gradient descent method

⇒ A negative subgradient doesn't necessarily give a descent direction.

D. S. Rosenberg, 2018, Foundations of Machine Learning, https://bloomberg.github.io/foml/#home


Subgradient Method

w_{t+1} = w_t − η_t g_t

⇒ Keep track of the best iterate w^best_{t+1} among w_1, ..., w_{t+1}, i.e.,

f(w^best_{t+1}) = min_{j=1,...,t+1} f(w_j)

- w_t: parameter estimate at iteration t
- η_t: learning rate
- g_t: subgradient
- t: iteration, t = 1, 2, ...
(w ≙ x: switch back to the notation used in the beginning)

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725
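A sketch of the method with best-iterate tracking on the toy non-smooth objective f(w) = |w − 2| (the diminishing schedule η_t = 1/√(t+1) and all names are illustrative):

```python
import numpy as np

def subgradient_method(subgrad, f, w0, iters=2000):
    """Subgradient method with steps eta_t = 1/sqrt(t+1), tracking the best
    iterate, since subgradient steps need not decrease f."""
    w = w0
    w_best, f_best = w, f(w)
    for t in range(iters):
        w = w - (1.0 / np.sqrt(t + 1.0)) * subgrad(w)
        if f(w) < f_best:
            w_best, f_best = w, f(w)
    return w_best, f_best

f = lambda w: abs(w - 2.0)
subgrad = lambda w: np.sign(w - 2.0)  # a subgradient of |w - 2|
w_best, f_best = subgradient_method(subgrad, f, w0=10.0)
```

The raw iterate keeps oscillating around the minimiser, which is exactly why the best iterate is the quantity the convergence results are stated for.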

Stochastic Subgradient Method

Noisy subgradients: g̃ = g + v, where g ∈ ∂f(w), E[v] = 0

w_{t+1} = w_t − η_t g̃_t

⇒ Random choice of a (sample) index i at iteration t (out of a set/batch of samples of size n):

w_{t+1} = w_t − η_t g_{i_t}

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725
J. Zhu, University of Melbourne, 2020, Discussion after IT lecture

Stochastic Subgradient Method

w_{t+1} = w_t − η_t g_{i_t}

⇒ Keep track of the best iterate w^best_{t+1} among w_1, ..., w_{t+1}, i.e.,

f(w^best_{t+1}) = min_{j=1,...,t+1} f(w_j)

- w_t: parameter estimate at iteration t
- η_t: learning rate
- g_{i_t}: subgradient for the uniformly at random chosen sample Z_{i_t}
- t: iteration, t = 1, 2, ...

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725

Convergence Results

Fixed step size η:

Subgradient method (SGM):
lim_{t→∞} F(w^best_t) ≤ F* + L²η/2

Stochastic SGM:
lim_{t→∞} F(w^best_t) ≤ F* + 5n²L²η/2

S. Boyd et al., 2003, Subgradient Methods
R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
*Convergence rates for f(w) Lipschitz continuous with constant L > 0

Convergence Results

Diminishing step size (Robbins-Monro conditions):

(Stochastic) SGM:
lim_{t→∞} F(w^best_t) ≤ F*

η_t > 0,  lim_{t→∞} η_t = 0,  ∑_{t=1}^∞ η_t = ∞

e.g. the square-summable condition:

η_t > 0,  ∑_{t=1}^∞ η_t² < ∞,  ∑_{t=1}^∞ η_t = ∞

*More about how to choose η: S. Boyd et al., 2003, Subgradient Methods
R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
**Convergence rates for f(w) Lipschitz continuous with constant L > 0

Applications

- Algorithms for non-differentiable convex optimization
- Convex analysis
- ML/DL

⇒ Methods based on stochastic subgradients are used to (approximately) optimize (nonconvex, nonsmooth) deep neural networks (DNNs)
⇒ E.g.: Adagrad, ADAM, NADAM, RMSProp, ...

Udell, Operations Research and Information Engineering, Cornell, 2017, Presentation
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Optimization Considering Additional Information

Adagrad

w_{t+1} = w_t − η_t (G_t)^{−1/2} g_{i_t}

⇒ Incorporates knowledge of the geometry of past iterations

- w_t: parameter estimate at iteration t
- η_t: learning rate
- g_{i_t}: subgradient for the uniformly at random chosen sample Z_{i_t}
- t: iteration, t = 1, 2, ...
- G_t: outer product matrix of past (sub)gradients up to time step t: ∑_{τ=1}^t g_τ g_τ^T

Udell, Cornell, 2017, Operations Research and Information Engineering
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html

Adagrad

The diagonal version uses a per-coordinate step size:

w_{t+1,j} = w_{t,j} − η_t (∑_{τ=1}^t g²_{τ,j})^{−1/2} g_{i_t,j}

Example: min_w f(w) = 100w₁² + w₂²

- j: jth feature/parameter of w
- η_t: learning rate
- g_{i_t}: subgradient for the uniformly at random chosen sample Z_{i_t}
- t: iteration, t = 1, 2, ...
- ∑_{τ=1}^t g²_{τ,j}: sum of past squared gradients of coordinate j up to time step t

Udell, Cornell, 2017, Operations Research and Information Engineering
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html
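A sketch of diagonal Adagrad on the slide's example f(w) = 100w₁² + w₂² (the step size and epsilon are illustrative choices):

```python
import numpy as np

def adagrad(grad, w0, eta=1.0, iters=500, eps=1e-8):
    """Diagonal Adagrad: per-coordinate steps scaled by the accumulated
    squared gradients; eps guards against division by zero."""
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                    # running sums of squared gradients
    for _ in range(iters):
        g = grad(w)
        G += g * g
        w -= eta * g / (np.sqrt(G) + eps)   # coordinate-wise scaling
    return w

# f(w) = 100*w1^2 + w2^2, so grad f = (200*w1, 2*w2).
w = adagrad(lambda w: np.array([200.0 * w[0], 2.0 * w[1]]), [1.0, 1.0])
```

The per-coordinate scaling equalises progress across the badly conditioned directions; plain GD would need a step size dictated by the stiff w₁ direction.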

Adagrad

⇒ A variable metric projected subgradient method

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods


Variable Metric Projected Subgradient Method

The projection is carried out in the metric H_t = (1/η_t)(G_t)^{1/2}; for W = R^d this reduces to the Adagrad update w_{t+1} = w_t − η_t (G_t)^{−1/2} g_{i_t}. In general,

w_{t+1} = P^{H_t}_W(w_t − H_t^{−1} g_{i_t}) = P^{H_t}_W(y)

where

P^{H_t}_W(y) = argmin_{w∈W} ‖w − y‖²_{H_t}

- P^{H_t}_W(y): projection of a vector y onto W according to the H_t metric
- ‖w − y‖_{H_t} = √((w − y)^T H_t (w − y)): Mahalanobis norm, a weighted l2-distance

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science


But now, what if...

- the parameter of interest lies on a non-Euclidean manifold?
- e.g., probability vectors

G. Raskutti, The information geometry of mirror descent
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science

Convergence Analysis

Basic inequality for the (projected) subgradient method:

F(w^best_t) − F* ≤ (R² + L² ∑_{i=1}^t η_i²) / (2 ∑_{i=1}^t η_i)

With η_i = (R/L)/√t,

F(w^best_t) − F* ≤ RL/√t

for L = max_{w∈W} ‖g_{i_t}‖₂ and R = max_{w,w*∈W} ‖w − w*‖₂.

⇒ The analysis and convergence results depend on the l2 norm.

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods

Subgradient Method Update Rule(s)

min_{w∈W} F(w)

w_{t+1} = w_t − η_t g_{i_t}
w_{t+1} = P_W(w_t − η_t g_{i_t})
w_{t+1} = argmin_{w∈W} ‖w − (w_t − η_t g_{i_t})‖²₂

Using some math:

w_{t+1} = argmin_{w∈W} { g_{i_t}^T w + (1/(2η_t)) ‖w − w_t‖²₂ }

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
J. Duchi et al., 2003, Proximal and First-Order Methods for Convex Optimization


Page 75: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

Stochastic Mirror Descent

wt+1 = argmin_w {g_it^T w + (1/(2ηt)) B_φ(w, wt)}

Bregman divergence:

B_φ(w, wt) = φ(w) − φ(wt) − ∇φ(wt)^T (w − wt)

∇φ(·): mirror map, invertible
φ(·): potential function, strictly convex, differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity
N. Azizan et al., Stochastic Interpretation of SMD: Risk-Sensitive Optimality

75 / 91
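The definition above can be made concrete with the two standard potentials that appear on the next slide. A short sketch (illustrative values, not from the slides), computing B_φ directly from the definition and checking it against the known closed forms:

```python
import numpy as np

# Bregman divergence from the slide's definition:
# B(w, wt) = phi(w) - phi(wt) - grad_phi(wt)^T (w - wt)
def bregman(phi, grad_phi, w, wt):
    return phi(w) - phi(wt) - grad_phi(wt) @ (w - wt)

w = np.array([0.2, 0.3, 0.5])              # both points on the unit simplex
wt = np.array([0.5, 0.25, 0.25])

# phi(w) = 1/2 ||w||_2^2  ->  B = 1/2 ||w - wt||_2^2 (Euclidean case)
b_euclid = bregman(lambda v: 0.5 * v @ v, lambda v: v, w, wt)
assert np.isclose(b_euclid, 0.5 * np.sum((w - wt) ** 2))

# phi(w) = sum_i w_i log w_i (negative entropy)  ->  B = KL(w || wt) on the simplex
b_entropy = bregman(lambda v: np.sum(v * np.log(v)),
                    lambda v: np.log(v) + 1.0, w, wt)
assert np.isclose(b_entropy, np.sum(w * np.log(w / wt)))
print(b_euclid, b_entropy)
```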

Page 76: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

Generalization/Examples

Gradient descent: φ(w) = (1/2)||w||_2^2, so B_φ = (1/2)||w − wt||_2^2
⇒ Mirror descent = projected subgradient method

Negative entropy: φ(w) = Σ_{i=1}^n wi log wi (for w in the unit simplex)
⇒ Exponentiated gradient descent

p-norm algorithm: φ(w) = (1/2)||w||_p^2

Sparse mirror descent

...

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
N. Azizan et al., 2019, SMD on Overparameterized Nonlinear Models: Conv., Implicit Regul., and General.

76 / 91

Page 77: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

wt+1 = argmin_w {g_it^T w + (1/(2ηt)) B_φ(w, wt)}

Using some math:

yt+1 = (∇φ)^(−1)(∇φ(wt) − ηt g_it)

wt+1 = argmin_{w∈W∩D} B_φ(w, yt+1)

wt+1 = P_W^φ(yt+1)

Alternative update rule for stochastic mirror descent:

∇φ(wt+1) = ∇φ(wt) − ηt g_it

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity

77 / 91
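The dual-space view above can be sketched for the negative-entropy potential (an assumed example setup, not from the slides): with ∇φ(w) = log w + 1, the update ∇φ(y_{t+1}) = ∇φ(wt) − ηt g becomes y_{t+1} = wt · exp(−ηt g), and the Bregman projection onto the simplex is a renormalisation, i.e. the exponentiated-gradient update.

```python
import numpy as np

# Stochastic mirror descent with the negative-entropy potential on the simplex.
rng = np.random.default_rng(1)
n, eta, T = 5, 0.1, 500
c = np.array([3.0, 1.0, 2.0, 4.0, 5.0])   # minimise F(w) = c^T w over the simplex

w = np.full(n, 1.0 / n)
for _ in range(T):
    g = c + 0.1 * rng.normal(size=n)       # noisy gradient of the linear loss
    y = w * np.exp(-eta * g)               # step in the dual (mirror) space
    w = y / y.sum()                        # Bregman projection back to the simplex

# The iterate concentrates its mass on the smallest coefficient (index 1).
assert np.argmax(w) == 1 and w[1] > 0.9
print(np.round(w, 3))
```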


Page 82: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

Stochastic Mirror Descent

∇φ(wt+1) = ∇φ(wt) − ηt g_it

∇φ(·): mirror map, invertible
φ(·): potential function, strictly convex, differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
M.S. Alkousa et al., 2019, On Some SMD Methods for Constrained Online Optimization Problems
Z. Zhou et al., 2020, On the Convergence of MD Beyond Stochastic Convex Programming

82 / 91

Page 83: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

Applications

Non-smooth and/or non-convex stochastic optimization problems

Highly overparameterized nonlinear learning problems

Large-scale optimization problems

Online learning

Reinforcement learning

Z. Zhou et al., 2020, On the Convergence of MD Beyond Stochastic Convex Programming
N. Azizan et al., 2019, SMD on Overparameterized Nonlinear Models: Conv., Implicit Regul., and General.
M. Raginsky et al., Sparse Q-learning with Mirror Descent
S. Mahadevan et al., Continuous-Time SMD on a Network: Variance Reduction, Consensus, Convergence

83 / 91

Page 84: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization in Case of Non-I.i.d. Data

Outline

1 Introduction

2 Gradient Descent

3 Stochastic Gradient Descent

4 Stochastic Subgradient Methods
Non-Smooth Optimization Problems
Optimization Considering Additional Information
Optimization in Case of Non-I.i.d. Data

5 Conclusion

84 / 91

Page 85: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization in Case of Non-I.i.d. Data

Ergodic Mirror Descent

Update rule:

wt+1 = argmin_w {g_t^T w + (1/(2ηt)) B_φ(w, wt)}

⇒ Based on stochastic mirror descent

J. Duchi et al., 2012, Ergodic Mirror Descent

85 / 91
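A minimal sketch of this setting (assumed toy example, not from the slides): with the Euclidean potential the mirror step reduces to plain SGD, run here on samples from a two-state Markov chain. The data are non-i.i.d., but the chain mixes to its stationary distribution Π, and the averaged iterate still approaches the minimiser of E_Π[f(w; Z)].

```python
import numpy as np

# Ergodic mirror descent with Euclidean potential on Markovian samples.
rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],                  # sticky chain: strongly correlated samples
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])      # stationary distribution of P
values = np.array([0.0, 3.0])              # Z = 0.0 in state 0, Z = 3.0 in state 1
w_star = pi @ values                       # argmin of E_Pi[(w - Z)^2 / 2] is E_Pi[Z]

T, state, w, avg = 200_000, 0, 0.0, 0.0
for t in range(1, T + 1):
    state = rng.choice(2, p=P[state])      # Markovian, non-i.i.d. sample
    z = values[state]
    g = w - z                              # gradient of f(w; z) = (w - z)^2 / 2
    w -= 0.5 / np.sqrt(t) * g              # decaying step size
    avg += (w - avg) / t                   # running average of the iterates

assert abs(avg - w_star) < 0.1             # averaged iterate near the Pi-minimiser
print(avg, w_star)
```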


Page 87: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization in Case of Non-I.i.d. Data

Ergodic Mirror Descent

min_{w∈W} F(w) := E_Π[f(w; Z)]

Stochastic process with marginal distributions P_i

Stationary distribution Π such that P_i → Π

Training samples (Z1, ..., Zn) ∼ P

Loss function for w on sample Zi is f(w, Zi)

J. Duchi et al., 2012, Ergodic Mirror Descent

87 / 91

Page 88: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization in Case of Non-I.i.d. Data

Ergodic Mirror Descent

Convergence in expectation and with high probability shown for:

Distributed convex optimization

(Potentially nonlinear) ARMA processes

Learning ranking facts

Pseudo-random sanity

J. Duchi et al., 2012, Ergodic Mirror Descent
Microsoft Research, 2016, Learning and stochastic optimization with non-iid data, https://www.youtube.com/watch?v=_yRnHRQVMgw

88 / 91

Page 89: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Conclusion

Outline

1 Introduction

2 Gradient Descent

3 Stochastic Gradient Descent

4 Stochastic Subgradient Methods

5 Conclusion

89 / 91

Page 90: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Conclusion

Differentiable (smooth) objective

⇒ Stochastic gradient descent

Non-differentiable (non-smooth) objective

⇒ Stochastic subgradient methods

Additional information (e.g. problem geometry)

⇒ Stochastic mirror descent

Non-i.i.d. input data

⇒ Ergodic mirror descent

90 / 91

Page 91: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Conclusion

Thank you

91 / 91