Stochastic Gradient Descent Methods
Stochastic Optimization STUDY GROUP
Xuetong Wu & Viktoria Schram
Department of EEE, University of Melbourne
October 22, 2020



Overview

1 Introduction
2 Gradient Descent
3 Stochastic Gradient Descent
4 Stochastic Subgradient Methods
   Non-Smooth Optimization Problems
   Optimization Considering Additional Information
   Optimization in Case of Non-I.i.d. Data
5 Conclusion

Introduction

Parameter Estimation Problems

- Communications
- Tracking
- Control theory
- System identification
- Machine learning
- ...

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications

Classification Problems

Consider the typical image classification problem: given training examples of dogs and cats with labels, we wish to learn a good model h to minimise the prediction error:

min_h (1/n) ∑_{i=1}^n 1{h(X_i) ≠ Y_i}


Regression Problems

Consider the simple regression problem: we wish to learn a good model Y = aX + b to minimise the mean squared error. Mathematically,

min_{a,b} (1/n) ∑_{i=1}^n (Y_i − aX_i − b)²

Optimization in Learning Problems

Many machine learning problems can be formulated as the following problem:

min_{w∈W} F(w) = (1/n) ∑_{i=1}^n f(w, Z_i)

- Z_i: training sample / data pair (X_i, Y_i)
- w: model parameters (e.g., a, b in the least squares problem)
- f: loss function

Gradient Descent

Gradient Descent

min_{w∈W} F(w) = (1/n) ∑_{i=1}^n f(w, Z_i)

If f is convex and differentiable w.r.t. w, the first-order Taylor approximation with η > 0 gives

F(w + η∆w) ≈ F(w) + η∆w^T ∇_w F(w)

The best ∆w that minimises the R.H.S. is

∆w = −∇_w F(w)

We choose an initial point w_0 and a step size η_t at each time t.

(Batch) gradient descent:

w_{t+1} = w_t − η_t ∇_w F(w_t) = w_t − (η_t/n) ∑_{i=1}^n ∇_w f(w_t, Z_i)

Stop at a point such that

F(w_t) − F(w*) ≤ ε
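The batch update above can be sketched in a few lines of Python on a least squares instance (a minimal sketch; the synthetic data and all names are illustrative):

```python
import numpy as np

def batch_gd(X, y, eta=0.1, iters=500):
    """(Batch) gradient descent for min_w (1/n) * ||Xw - y||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # full-data gradient at w_t
        w = w - eta * grad                    # w_{t+1} = w_t - eta_t * grad
    return w

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=100), np.ones(100)])  # feature + intercept
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=100)
w = batch_gd(X, y)
```

Note that every iteration touches all n samples, which is exactly the cost issue motivating the stochastic variants.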

Gradient Descent

Figure: Visualization of Gradient Descent

Convergence Rate for GD

Assume f is convex and differentiable w.r.t. w, and further assume the gradient ∇_w F(w) = (1/n) ∑_{i=1}^n ∇_w f(w, Z_i) is L-Lipschitz continuous (∇²F ⪯ LI). Then:

Theorem
Gradient descent with fixed step size η ≤ 1/L satisfies
F(w_t) − F(w*) ≤ ‖w_0 − w*‖² / (2ηt)

Convergence rate ∼ O(1/t), iteration complexity ∼ O(1/ε).

R. Tibshirani, Convex Optimization 10-725

Convergence Rate for GD with Strong Convexity

Furthermore, if F(w) is µ-strongly convex (∇²F ⪰ µI):

Theorem
Gradient descent with fixed step size η ≤ 2/(µ + L), or with backtracking line search, satisfies
F(w_t) − F(w*) ≤ c^t (L/2) ‖w_0 − w*‖²
where 0 < c < 1.

Convergence rate ∼ O(c^t); iterations needed for error ε ∼ O(log(1/ε)).

R. Tibshirani, Convex Optimization 10-725

Problems

Two main drawbacks of gradient descent:

- If n is relatively large, computing the full gradient is memory- and time-consuming.
- If the loss function is nonconvex, the solution can get stuck at a stationary point (e.g., a saddle point).

Stochastic Gradient Descent

Stochastic Gradient Descent

A practical alternative is to simulate a data stream by picking Z_t uniformly at random from the training examples at each time t.

This gives the stochastic gradient descent update:

w_{t+1} = w_t − η_t ∇_w f(w_t, Z_t)

Why does this work? By uniform sampling,

E_{Z_t}[∇_w f(w_t, Z_t)] = (1/n) ∑_{i=1}^n ∇_w f(w_t, Z_i) = ∇_w F(w_t)

The estimate is unbiased but has high variance; SGD usually works well in large-scale problems.
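A minimal sketch of this update on a least squares problem (synthetic, noiseless data so the iterates can converge exactly; all names are illustrative), sampling one example uniformly per step:

```python
import numpy as np

def sgd(X, y, eta=0.02, iters=20000, seed=0):
    """SGD for min_w (1/n) * ||Xw - y||^2: one uniformly sampled example per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(n)                    # pick Z_t uniformly at random
        grad = 2.0 * X[i] * (X[i] @ w - y[i])  # single-sample (unbiased) gradient
        w = w - eta * grad
    return w

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=200), np.ones(200)])
y = X @ np.array([1.5, 0.5])  # noiseless targets
w = sgd(X, y)
```

Each step costs O(1) in n, at the price of a noisy search direction.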

Stochastic GD vs. GD

Figure: Stochastic GD vs. GD

Remarks on SGD

- Computational cost for n samples and p iterations: GD ∼ O(np), SGD ∼ O(p).
- SGD does not always produce descent directions, and the gradient estimate is very noisy!
- With a constant step size, the SGD iterate bounces around the optimal value.
- Convergence properties?

Convergence Rate Analysis

minimize_w F(w) := (1/n) ∑_{i=1}^n f(w, Z_i)

We wish to achieve ε-optimality,

E[F(w_t)] − F(w*) ≤ ε

after t iterations.

Assumptions

- F(w) is µ-strongly convex and its gradient is L-Lipschitz continuous, with µ/L ≤ 1.
- ∇f(w_t, Z_t) is an unbiased estimate of ∇F(w_t).
- For all w, the variance of the gradient is bounded:
  E_Z[‖∇f(w, Z)‖²₂] − ‖E_Z[∇f(w, Z)]‖²₂ ≤ σ²

Constant Step Size

Theorem (Convergence with Fixed Step Sizes)
Under the assumptions, if η_t = η ≤ 1/L, then SGD achieves

E[F(w_t)] − F(w*) ≤ ηLσ²/(2µ) + (1 − ηµ)^t (F(w_0) − F(w*))

- Linear convergence at the beginning.
- When t → ∞,
  E[F(w_t) − F(w*)] ≤ ηLσ²/(2µ)
  i.e., SGD converges only to a neighborhood of w*: the variance in the gradient computation prevents further progress.

Theorem 4.6 in Bottou, 2018, Optimization Methods for Large-Scale Machine Learning

Diminishing Step Size

Theorem (Convergence with Diminishing Step Sizes)
Under the assumptions, if η_t = θ/(t + 1) for some θ > 1/µ, then SGD achieves

E[F(w_t) − F(w*)] ≤ ν/(t + 1)

where

ν := max{ θ²Lσ²/(2(θµ − 1)), F(w_0) − F(w*) }

Convergence rate is O(1/t); iterations needed are O(1/ε) with diminishing step size η_t ∼ 1/t.

Theorem 4.7 in Bottou, 2018, Optimization Methods for Large-Scale Machine Learning
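The two regimes can be seen on a toy strongly convex quadratic F(w) = w²/2 with artificially noisy gradients (a sketch with illustrative constants, µ = L = 1 and σ = 1): a constant step stalls at a noise floor, while a θ/(t+1)-type schedule keeps converging.

```python
import numpy as np

def sgd_quadratic(step, iters=4000, seed=0):
    """SGD on F(w) = 0.5 * w**2 with noisy gradient g = w + noise."""
    rng = np.random.default_rng(seed)
    w = 5.0
    for t in range(iters):
        g = w + rng.normal()        # unbiased gradient estimate, sigma^2 = 1
        w -= step(t) * g
    return abs(w)                   # distance to the optimum w* = 0

err_const = sgd_quadratic(lambda t: 0.1)            # stalls near the noise floor
err_dimin = sgd_quadratic(lambda t: 2.0 / (t + 2))  # diminishing, theta > 1/mu
```

Averaged over several seeds, the diminishing schedule ends much closer to w* than the constant one.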

Convergence Rate and Time Comparison

Risk minimization under strong convexity and L-smoothness:

         iteration complexity   per-iteration cost   total comput. cost
  GD     log(1/ε)               n                    n log(1/ε)
  SGD    1/ε                    1                    1/ε

Advantages

Compared to (batch) gradient descent, SGD has the following advantages:

- Lower computational cost per iteration.
- For large datasets (large n and moderate ε), lower total computational cost.
- In non-convex cases, the gradient noise can sometimes help SGD escape saddle points.

Rong Ge et al., COLT'2015, Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition

Variance Reduction and Acceleration

To reduce the variance of the gradient estimate, we can use mini-batch SGD with a random batch I_t of size k ≪ n:

w_{t+1} = w_t − (η_t/k) ∑_{i∈I_t} ∇f(w_t, Z_i)

Figure: SGD for logistic regression problems with n = 10000

R. Tibshirani, Convex Optimization 10-725
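A sketch of the mini-batch update on least squares (illustrative data and names; the batch is drawn without replacement each step):

```python
import numpy as np

def minibatch_sgd(X, y, k=10, eta=0.05, iters=3000, seed=0):
    """Mini-batch SGD for min_w (1/n) * ||Xw - y||^2 with batch size k << n."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=k, replace=False)  # random mini-batch I_t
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / k) * Xb.T @ (Xb @ w - yb)     # averaged batch gradient
        w -= eta * grad
    return w

rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(size=300), np.ones(300)])
y = X @ np.array([0.7, 1.2])
w = minibatch_sgd(X, y)
```

Averaging over the batch divides the gradient variance by roughly k, at k times the per-iteration cost.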

Variance Reduction and Acceleration

Can we have one gradient evaluation per iteration and only O(log(1/ε)) iterations? Yes! Stochastic Average Gradient (SAG):

w_t = w_{t−1} − (η_t/n) ∑_{i=1}^n g_i^t  with  g_i^t = ∇f(w_{t−1}, Z_i) if i = i(t), and g_i^t = g_i^{t−1} otherwise

SAG gradient estimates are no longer unbiased, but they have greatly reduced variance. With the fixed step size η_t = 1/(16L),

E[F(w_t)] − F(w*) ≤ O((1 − min{µ/(16L), 1/(8n)})^t)

Iteration complexity ∼ O(log(1/ε))!

Other variants with similar convergence: SDCA, SVRG, SAGA

Roux et al., NIPS'12, A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
Shai & Zhang, JMLR'13, Stochastic dual coordinate ascent methods for regularized loss minimization
Johnson & Zhang, NIPS'13, Accelerating stochastic gradient descent using predictive variance reduction
Defazio & Bach, NIPS'14, SAGA
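The table-of-gradients idea behind SAG can be sketched as follows (a least squares instance; the data, the smoothness estimate, and all names are illustrative):

```python
import numpy as np

def sag(X, y, iters=6000, seed=0):
    """Stochastic Average Gradient for min_w (1/n) * ||Xw - y||^2.

    Keeps the most recent gradient for every sample and steps along the
    average of the table, refreshing one randomly chosen entry per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = 2.0 * np.max(np.sum(X**2, axis=1))  # per-sample smoothness bound
    eta = 1.0 / (16.0 * L)                  # fixed step 1/(16L) from the slide
    w = np.zeros(d)
    table = np.zeros((n, d))                # stored gradients g_i^t
    avg = np.zeros(d)                       # running average of the table
    for _ in range(iters):
        i = rng.integers(n)
        g_new = 2.0 * X[i] * (X[i] @ w - y[i])
        avg += (g_new - table[i]) / n       # update the average incrementally
        table[i] = g_new
        w -= eta * avg
    return w

rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(size=50), np.ones(50)])
y = X @ np.array([1.0, -2.0])
w = sag(X, y)
```

The price for the variance reduction is O(nd) memory for the gradient table.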

More on SGD

If n is not very large, is it enough to reach

(1/n) ∑_{i=1}^n (f(w_t, Z_i) − f(w*, Z_i)) ≤ ε ?

Sometimes simply minimising the training loss will cause overfitting.

Caveats: Data Overfitting for y = sin(2πx) + noise

Figure: fits after (a) t = 2, (b) t = 7, (c) t = 100, (d) t = 5000 iterations.

Often, small training error ⇏ small testing error!

Early Stopping

If the samples are regarded as random variables drawn from some distribution P, we may instead consider minimising the true risk,

F_true(w_t) = E_{Z∼P}[f(w_t, Z)]

Figure: Early Stopping with SGD
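A common practical proxy for the true risk is a held-out validation set; a sketch of SGD with early stopping (all data and names are illustrative):

```python
import numpy as np

def sgd_early_stopping(X, y, Xval, yval, eta=0.01, max_iters=20000, patience=500, seed=0):
    """Run SGD on the training loss but return the iterate with the lowest
    validation loss, stopping once it has not improved for `patience` steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_best, val_best, since_best = w.copy(), np.inf, 0
    for _ in range(max_iters):
        i = rng.integers(n)
        w -= eta * 2.0 * X[i] * (X[i] @ w - y[i])
        val = np.mean((Xval @ w - yval) ** 2)   # validation risk estimate
        if val < val_best:
            w_best, val_best, since_best = w.copy(), val, 0
        else:
            since_best += 1
            if since_best >= patience:          # early stop
                break
    return w_best, val_best

rng = np.random.default_rng(4)
X = np.column_stack([rng.normal(size=100), np.ones(100)])
y = X @ np.array([1.0, 0.0]) + 0.3 * rng.normal(size=100)
Xval = np.column_stack([rng.normal(size=50), np.ones(50)])
yval = Xval @ np.array([1.0, 0.0]) + 0.3 * rng.normal(size=50)
w_best, val_best = sgd_early_stopping(X, y, Xval, yval)
```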

SGD Applications in Supervised Learning

SGD methods are widely applied in machine learning problems, for various differentiable loss functions:

- Adaline: ∑_i (1/2)(y_i − w^T Φ(x_i))²
- Tikhonov: ∑_i (y_i − w^T x_i)² + λ‖w‖²₂
- Logistic regression: −∑_i [ y_i log(1/(1 + e^{−w^T x_i})) + (1 − y_i) log(1 − 1/(1 + e^{−w^T x_i})) ]

What if the loss is not differentiable?

- SVM: (1/2)‖w‖²₂ + C ∑_i max(0, 1 − y_i(w^T x_i + b))
- Lasso: ∑_i (y_i − w^T x_i)² + λ‖w‖₁
- Perceptron: ∑_i max{0, −y_i w^T Φ(x_i)}
- Neural network: ∑_i f(w, Z_i)
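For the non-differentiable losses, the gradient is replaced by a subgradient. For example, a subgradient of the per-sample SVM objective above can be computed as follows (a sketch; names are illustrative):

```python
import numpy as np

def hinge_subgradient(w, b, x, y, C=1.0):
    """A subgradient of (1/2) * ||w||^2 + C * max(0, 1 - y * (w.x + b)) at (w, b)."""
    margin = y * (x @ w + b)
    if margin >= 1.0:             # hinge inactive: only the regulariser contributes
        return w, 0.0
    return w - C * y * x, -C * y  # hinge active (any valid choice at margin == 1)

gw, gb = hinge_subgradient(np.array([1.0, -1.0]), 0.0, np.array([0.5, 0.5]), 1.0)
```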

Stochastic Subgradient Methods

Short note: to be consistent with the notation used in the first part of this presentation, the following slides were adapted, so the nomenclature differs from that used in the recorded presentation. Sorry for any inconvenience this might cause.

Optimization Problem - Optimal Scenario

Finding zeros (roots) of a real-valued smooth function ∇f(w): R^d → R^d, for w ∈ R^d.

⇒ Newton's method:

w_{t+1} = w_t − [∇²f(w_t)]^{−1} ∇f(w_t)

- t: iteration index
- w: parameters of the function f(·)
- ∇f(w): gradient of f(·)
- ∇²f(w): derivative of ∇f(·) w.r.t. w (the Hessian)

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications

Optimization Problem - Reality

Finding zeros (roots) of an unknown real-valued function ∇f(w): R^d → R^d, for w ∈ R^d, which can be observed, but the observation may be corrupted by (i.i.d.) errors (ε_t)_{t≥1}.

⇒ Stochastic Approximation (Robbins & Monro '51):

w_{t+1} = w_t − η_t[∇f(w_t) + ε_t]

- η_t: step size (learning rate) for iteration t
- ε_t: zero-mean noise for iteration t
- ∇f(w): gradient of the function f(·)

Kushner, 1997, Stochastic Approximation
Han-Fu Chen, 2003, Stochastic Approximation and Its Applications
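A minimal sketch of the Robbins-Monro iteration on a toy root-finding problem (the map g(w) = w - 3 and the noise level are illustrative), with the classic step size η_t = 1/(t+1):

```python
import numpy as np

def robbins_monro(noisy_g, w0=0.0, iters=20000, seed=0):
    """w_{t+1} = w_t - eta_t * (g(w_t) + noise) with eta_t = 1/(t+1)."""
    rng = np.random.default_rng(seed)
    w = w0
    for t in range(iters):
        w -= (1.0 / (t + 1)) * noisy_g(w, rng)
    return w

# Toy target: root of g(w) = w - 3, observed with zero-mean noise.
root = robbins_monro(lambda w, rng: (w - 3.0) + rng.normal(scale=0.5))
```

The diminishing steps average out the noise, so the iterate settles at the root rather than bouncing around it.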

Gradient Descent

w_{t+1} = w_t − (η_t/n) ∑_{i=1}^n ∇f(w_t, Z_i)

Stochastic Gradient Descent

w_{t+1} = w_t − η_t ∇f(w_t, Z_{i_t})

- f(w_t, Z_{i_t}): loss function for parameters w_t and sample Z_{i_t}
- n: set (batch) size
- η_t: learning rate at iteration t
- t: iteration, t = 1, 2, ...
- i_t ∈ {1, ..., n}: index chosen uniformly at random at iteration t

R. Tibshirani, Convex Optimization 10-725

Stochastic Gradient Descent

w_{t+1} = w_t − η_t ∇f(w_t, Z_{i_t})

Random selection of the sample index gives an unbiased estimate:

E[∇f(w_t, Z_{i_t})] = ∇F(w_t)

R. Tibshirani, Convex Optimization 10-725

Question

What if the objective is non-smooth, additional information is available, or the input samples are non-i.i.d.?

Goal:

min_w (1/n) ∑_{i=1}^n f(w, Z_i)

f(w, Z_i): loss function with parameters w for the ith sample Z_i

H. Li et al., 2018, Visualizing the Loss Landscape of Neural Nets.


Goal

- Non-differentiable (non-smooth) objective ⇒ ?
- Additional information ⇒ ?
- Non-i.i.d. data ⇒ ?

Non-Smooth Optimization Problems

Subgradient of a Function

g is a subgradient of f at x₂ if

f(x) ≥ f(x₂) + g^T(x − x₂)  ∀x

∂f(x): the subdifferential, i.e. the set of all subgradients g_i ∈ ∂f(x)
(x ≙ w: slight abuse of notation compared to before)

S. Boyd, Stanford University, EE364b
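For example, f(x) = |x| is non-differentiable at 0, where every g ∈ [−1, 1] satisfies the inequality; a quick numeric check of the definition (illustrative):

```python
import numpy as np

def is_subgradient(f, g, x0, xs):
    """Check f(x) >= f(x0) + g * (x - x0) on a grid of test points."""
    return all(f(x) >= f(x0) + g * (x - x0) - 1e-12 for x in xs)

xs = np.linspace(-2.0, 2.0, 401)
ok_half = is_subgradient(abs, 0.5, 0.0, xs)  # g = 0.5 in [-1, 1]: a subgradient at 0
ok_two = is_subgradient(abs, 2.0, 0.0, xs)   # g = 2 violates the inequality
```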


Subgradient of a Function

Goal: min_x f(x); Use: gradient descent method

⇒ A negative subgradient doesn't necessarily give a descent direction.

D. S. Rosenberg, 2018, Foundations of Machine Learning, https://bloomberg.github.io/foml/#home


Subgradient Method

w_{t+1} = w_t − η_t g_t

⇒ Keep track of the best iterate w^best_{t+1} among w_1, ..., w_{t+1}, i.e.,

f(w^best_{t+1}) = min_{j=1,...,t+1} f(w_j)

- w_t: parameter estimate at iteration t
- η_t: learning rate
- g_t: subgradient
- t: iteration, t = 1, 2, ...
(w ≙ x: switch back to the notation used in the beginning)

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725
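A sketch of the method with best-iterate tracking on the toy non-smooth objective f(w) = |w − 2| (the diminishing schedule η_t = 1/√(t+1) and all names are illustrative):

```python
import numpy as np

def subgradient_method(subgrad, f, w0, iters=2000):
    """Subgradient method with steps eta_t = 1/sqrt(t+1), tracking the best
    iterate, since subgradient steps need not decrease f."""
    w = w0
    w_best, f_best = w, f(w)
    for t in range(iters):
        w = w - (1.0 / np.sqrt(t + 1.0)) * subgrad(w)
        if f(w) < f_best:
            w_best, f_best = w, f(w)
    return w_best, f_best

f = lambda w: abs(w - 2.0)
subgrad = lambda w: np.sign(w - 2.0)  # a subgradient of |w - 2|
w_best, f_best = subgradient_method(subgrad, f, w0=10.0)
```

The raw iterate keeps oscillating around the minimiser, which is exactly why the best iterate is the quantity the convergence results are stated for.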

Stochastic Subgradient Method

Noisy subgradients: g̃ = g + v, where g ∈ ∂f(w), E[v] = 0

w_{t+1} = w_t − η_t g̃_t

⇒ Random choice of a (sample) index i at iteration t (out of a set/batch of samples of size n):

w_{t+1} = w_t − η_t g_{i_t}

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725
J. Zhu, University of Melbourne, 2020, Discussion after IT lecture

Stochastic Subgradient Method

w_{t+1} = w_t − η_t g_{i_t}

⇒ Keep track of the best iterate w^best_{t+1} among w_1, ..., w_{t+1}, i.e.,

f(w^best_{t+1}) = min_{j=1,...,t+1} f(w_j)

- w_t: parameter estimate at iteration t
- η_t: learning rate
- g_{i_t}: subgradient for the uniformly at random chosen sample Z_{i_t}
- t: iteration, t = 1, 2, ...

S. Boyd, Stanford University, EE364b
R. Tibshirani, Convex Optimization 10-725/36-725

Convergence Results

Fixed step size η:

Subgradient method (SGM):
lim_{t→∞} F(w^best_t) ≤ F* + L²η/2

Stochastic SGM:
lim_{t→∞} F(w^best_t) ≤ F* + 5n²L²η/2

S. Boyd et al., 2003, Subgradient Methods
R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
*Convergence rates for f(w) Lipschitz continuous with constant L > 0

Convergence Results

Diminishing step size (Robbins-Monro conditions):

(Stochastic) SGM:
lim_{t→∞} F(w^best_t) ≤ F*

η_t > 0,  lim_{t→∞} η_t = 0,  ∑_{t=1}^∞ η_t = ∞

e.g. the square-summable condition:

η_t > 0,  ∑_{t=1}^∞ η_t² < ∞,  ∑_{t=1}^∞ η_t = ∞

*More about how to choose η: S. Boyd et al., 2003, Subgradient Methods
R. Tibshirani, Convex Optimization 10-725/36-725, Subgradient Method
**Convergence rates for f(w) Lipschitz continuous with constant L > 0

Applications

- Algorithms for non-differentiable convex optimization
- Convex analysis
- ML/DL

⇒ Methods based on stochastic subgradients are used to (approximately) optimize (nonconvex, nonsmooth) deep neural networks (DNNs)
⇒ E.g.: Adagrad, ADAM, NADAM, RMSProp, ...

Udell, Operations Research and Information Engineering, Cornell, 2017, Presentation
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Optimization Considering Additional Information

Adagrad

w_{t+1} = w_t − η_t (G_t)^{−1/2} g_{i_t}

⇒ Incorporates knowledge of the geometry of past iterations

- w_t: parameter estimate at iteration t
- η_t: learning rate
- g_{i_t}: subgradient for the uniformly at random chosen sample Z_{i_t}
- t: iteration, t = 1, 2, ...
- G_t: outer product matrix of past (sub)gradients up to time step t: ∑_{τ=1}^t g_τ g_τ^T

Udell, Cornell, 2017, Operations Research and Information Engineering
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html

Adagrad

The diagonal version uses a per-coordinate step size:

w_{t+1,j} = w_{t,j} − η_t (∑_{τ=1}^t g²_{τ,j})^{−1/2} g_{i_t,j}

Example: min_w f(w) = 100w₁² + w₂²

- j: jth feature/parameter of w
- η_t: learning rate
- g_{i_t}: subgradient for the uniformly at random chosen sample Z_{i_t}
- t: iteration, t = 1, 2, ...
- ∑_{τ=1}^t g²_{τ,j}: sum of past squared gradients of coordinate j up to time step t

Udell, Cornell, 2017, Operations Research and Information Engineering
J. Duchi et al., 2011, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
AdaGrad - Adaptive Subgradient Methods, https://ppasupat.github.io/a9online/1107.html
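A sketch of diagonal Adagrad on the slide's example f(w) = 100w₁² + w₂² (the step size and epsilon are illustrative choices):

```python
import numpy as np

def adagrad(grad, w0, eta=1.0, iters=500, eps=1e-8):
    """Diagonal Adagrad: per-coordinate steps scaled by the accumulated
    squared gradients; eps guards against division by zero."""
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                    # running sums of squared gradients
    for _ in range(iters):
        g = grad(w)
        G += g * g
        w -= eta * g / (np.sqrt(G) + eps)   # coordinate-wise scaling
    return w

# f(w) = 100*w1^2 + w2^2, so grad f = (200*w1, 2*w2).
w = adagrad(lambda w: np.array([200.0 * w[0], 2.0 * w[1]]), [1.0, 1.0])
```

The per-coordinate scaling equalises progress across the badly conditioned directions; plain GD would need a step size dictated by the stiff w₁ direction.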

Adagrad

⇒ A variable metric projected subgradient method

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods


Variable Metric Projected Subgradient Method

The projection is carried out in the metric H_t = (1/η_t)(G_t)^{1/2}; for W = R^d this reduces to the Adagrad update w_{t+1} = w_t − η_t (G_t)^{−1/2} g_{i_t}. In general,

w_{t+1} = P^{H_t}_W(w_t − H_t^{−1} g_{i_t}) = P^{H_t}_W(y)

where

P^{H_t}_W(y) = argmin_{w∈W} ‖w − y‖²_{H_t}

- P^{H_t}_W(y): projection of a vector y onto W according to the H_t metric
- ‖w − y‖_{H_t} = √((w − y)^T H_t (w − y)): Mahalanobis norm, a weighted l2-distance

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science


But now, what if...

- the parameter of interest lies on a non-Euclidean manifold?
- e.g., probability vectors

G. Raskutti, The information geometry of mirror descent
Y. Chen, Princeton University, 2019, ELE 522: Large-Scale Optimization for Data Science

Convergence Analysis

Basic inequality for the (projected) subgradient method:

F(w^best_t) − F* ≤ (R² + L² ∑_{i=1}^t η_i²) / (2 ∑_{i=1}^t η_i)

With η_i = (R/L)/√t,

F(w^best_t) − F* ≤ RL/√t

for L = max_{w∈W} ‖g_{i_t}‖₂ and R = max_{w,w*∈W} ‖w − w*‖₂.

⇒ The analysis and convergence results depend on the l2 norm.

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods

Subgradient Method Update Rule(s)

min_{w∈W} F(w)

w_{t+1} = w_t − η_t g_{i_t}
w_{t+1} = P_W(w_t − η_t g_{i_t})
w_{t+1} = argmin_{w∈W} ‖w − (w_t − η_t g_{i_t})‖²₂

Using some math:

w_{t+1} = argmin_{w∈W} { g_{i_t}^T w + (1/(2η_t)) ‖w − w_t‖²₂ }

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
J. Duchi et al., 2003, Proximal and First-Order Methods for Convex Optimization


Page 75: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

Stochastic Mirror Descent

wt+1 = argmin_w {g_it^T w + (1/(2ηt)) B_φ(w, wt)}

Bregman divergence:

B_φ(w, wt) = φ(w) − φ(wt) − ∇φ(wt)^T (w − wt)

∇φ(·): mirror map, invertible
φ(·): potential function, strictly convex, differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity
N. Azizan et al., Stochastic Interpretation of SMD: Risk-Sensitive Optimality

75 / 91
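The definition above can be made concrete with the two standard potentials that appear on the next slide. A short sketch (illustrative values, not from the slides), computing B_φ directly from the definition and checking it against the known closed forms:

```python
import numpy as np

# Bregman divergence from the slide's definition:
# B(w, wt) = phi(w) - phi(wt) - grad_phi(wt)^T (w - wt)
def bregman(phi, grad_phi, w, wt):
    return phi(w) - phi(wt) - grad_phi(wt) @ (w - wt)

w = np.array([0.2, 0.3, 0.5])              # both points on the unit simplex
wt = np.array([0.5, 0.25, 0.25])

# phi(w) = 1/2 ||w||_2^2  ->  B = 1/2 ||w - wt||_2^2 (Euclidean case)
b_euclid = bregman(lambda v: 0.5 * v @ v, lambda v: v, w, wt)
assert np.isclose(b_euclid, 0.5 * np.sum((w - wt) ** 2))

# phi(w) = sum_i w_i log w_i (negative entropy)  ->  B = KL(w || wt) on the simplex
b_entropy = bregman(lambda v: np.sum(v * np.log(v)),
                    lambda v: np.log(v) + 1.0, w, wt)
assert np.isclose(b_entropy, np.sum(w * np.log(w / wt)))
print(b_euclid, b_entropy)
```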

Page 76: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

Generalization/Examples

Gradient descent: φ(w) = (1/2)||w||_2^2, so B_φ = (1/2)||w − wt||_2^2
⇒ Mirror descent = projected subgradient method

Negative entropy: φ(w) = Σ_{i=1}^n wi log wi (for w in the unit simplex)
⇒ Exponentiated gradient descent

p-norm algorithm: φ(w) = (1/2)||w||_p^2

Sparse mirror descent

...

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
N. Azizan et al., 2019, SMD on Overparameterized Nonlinear Models: Conv., Implicit Regul., and General.

76 / 91

Page 77: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

wt+1 = argmin_w {g_it^T w + (1/(2ηt)) B_φ(w, wt)}

Using some math:

yt+1 = (∇φ)^(−1)(∇φ(wt) − ηt g_it)

wt+1 = argmin_{w∈W∩D} B_φ(w, yt+1)

wt+1 = P_W^φ(yt+1)

Alternative update rule for stochastic mirror descent:

∇φ(wt+1) = ∇φ(wt) − ηt g_it

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
S. Bubeck, 2015, Convex Optimization: Algorithms and Complexity

77 / 91
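The dual-space view above can be sketched for the negative-entropy potential (an assumed example setup, not from the slides): with ∇φ(w) = log w + 1, the update ∇φ(y_{t+1}) = ∇φ(wt) − ηt g becomes y_{t+1} = wt · exp(−ηt g), and the Bregman projection onto the simplex is a renormalisation, i.e. the exponentiated-gradient update.

```python
import numpy as np

# Stochastic mirror descent with the negative-entropy potential on the simplex.
rng = np.random.default_rng(1)
n, eta, T = 5, 0.1, 500
c = np.array([3.0, 1.0, 2.0, 4.0, 5.0])   # minimise F(w) = c^T w over the simplex

w = np.full(n, 1.0 / n)
for _ in range(T):
    g = c + 0.1 * rng.normal(size=n)       # noisy gradient of the linear loss
    y = w * np.exp(-eta * g)               # step in the dual (mirror) space
    w = y / y.sum()                        # Bregman projection back to the simplex

# The iterate concentrates its mass on the smallest coefficient (index 1).
assert np.argmax(w) == 1 and w[1] > 0.9
print(np.round(w, 3))
```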


Page 82: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

Stochastic Mirror Descent

∇φ(wt+1) = ∇φ(wt) − ηt g_it

∇φ(·): mirror map, invertible
φ(·): potential function, strictly convex, differentiable

S. Boyd, J. Duchi, M. Pilanci, Stanford University, 2019, EE364b: Mirror Descent and Variable Metric Methods
M.S. Alkousa et al., 2019, On Some SMD Methods for Constrained Online Optimization Problems
Z. Zhou et al., 2020, On the Convergence of MD Beyond Stochastic Convex Programming

82 / 91

Page 83: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization Considering Additional Information

Applications

Non-smooth and/or non-convex stochastic optimization problems

Highly overparameterized nonlinear learning problems

Large-scale optimization problems

Online learning

Reinforcement learning

Z. Zhou et al., 2020, On the Convergence of MD Beyond Stochastic Convex Programming
N. Azizan et al., 2019, SMD on Overparameterized Nonlinear Models: Conv., Implicit Regul., and General.
M. Raginsky et al., Sparse Q-learning with Mirror Descent
S. Mahadevan et al., Continuous-Time SMD on a Network: Variance Reduction, Consensus, Convergence

83 / 91

Page 84: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization in Case of Non-I.i.d. Data

Outline

1 Introduction

2 Gradient Descent

3 Stochastic Gradient Descent

4 Stochastic Subgradient Methods
Non-Smooth Optimization Problems
Optimization Considering Additional Information
Optimization in Case of Non-I.i.d. Data

5 Conclusion

84 / 91

Page 85: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization in Case of Non-I.i.d. Data

Ergodic Mirror Descent

Update rule:

wt+1 = argmin_w {g_t^T w + (1/(2ηt)) B_φ(w, wt)}

⇒ Based on stochastic mirror descent

J. Duchi et al., 2012, Ergodic Mirror Descent

85 / 91
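A minimal sketch of this setting (assumed toy example, not from the slides): with the Euclidean potential the mirror step reduces to plain SGD, run here on samples from a two-state Markov chain. The data are non-i.i.d., but the chain mixes to its stationary distribution Π, and the averaged iterate still approaches the minimiser of E_Π[f(w; Z)].

```python
import numpy as np

# Ergodic mirror descent with Euclidean potential on Markovian samples.
rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],                  # sticky chain: strongly correlated samples
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])      # stationary distribution of P
values = np.array([0.0, 3.0])              # Z = 0.0 in state 0, Z = 3.0 in state 1
w_star = pi @ values                       # argmin of E_Pi[(w - Z)^2 / 2] is E_Pi[Z]

T, state, w, avg = 200_000, 0, 0.0, 0.0
for t in range(1, T + 1):
    state = rng.choice(2, p=P[state])      # Markovian, non-i.i.d. sample
    z = values[state]
    g = w - z                              # gradient of f(w; z) = (w - z)^2 / 2
    w -= 0.5 / np.sqrt(t) * g              # decaying step size
    avg += (w - avg) / t                   # running average of the iterates

assert abs(avg - w_star) < 0.1             # averaged iterate near the Pi-minimiser
print(avg, w_star)
```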


Page 87: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization in Case of Non-I.i.d. Data

Ergodic Mirror Descent

min_{w∈W} F(w) := E_Π[f(w; Z)]

Stochastic process with marginal distributions P_i

Stationary distribution Π such that P_i → Π

Training samples (Z1, ..., Zn) ∼ P

Loss function for w on sample Zi is f(w, Zi)

J. Duchi et al., 2012, Ergodic Mirror Descent

87 / 91

Page 88: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Stochastic Subgradient Methods

Optimization in Case of Non-I.i.d. Data

Ergodic Mirror Descent

Convergence in expectation and with high probability shown for:

Distributed convex optimization

(Potentially nonlinear) ARMA processes

Learning ranking facts

Pseudo-random sanity

J. Duchi et al., 2012, Ergodic Mirror Descent
Microsoft Research, 2016, Learning and stochastic optimization with non-iid data, https://www.youtube.com/watch?v=_yRnHRQVMgw

88 / 91

Page 89: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Conclusion

Outline

1 Introduction

2 Gradient Descent

3 Stochastic Gradient Descent

4 Stochastic Subgradient Methods

5 Conclusion

89 / 91

Page 90: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Conclusion

Differentiable (smooth) objective

⇒ Stochastic gradient descent

Non-differentiable (non-smooth) objective

⇒ Stochastic subgradient methods

Additional information (e.g. problem geometry)

⇒ Stochastic mirror descent

Non-i.i.d. input data

⇒ Ergodic mirror descent

90 / 91

Page 91: Stochastic Gradient Descent Methods

Stochastic Optimization STUDY GROUP

Conclusion

Thank you

91 / 91