
# Stochastic Optimization Algorithms for Machine Learning

Guanghui (George) Lan

H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech, USA

Workshop on Optimization and Learning (CIMI Workshop), Toulouse, France

September 10-13, 2018


Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary

Machine learning (ML)

ML exploits optimization, statistics, and high-performance computing, among other techniques, to transform raw data into knowledge in order to support decision-making in various areas, e.g., biomedicine, health care, logistics, energy and transportation.


The role of optimization

Optimization provides theoretical insights and efficient solution methods for ML models.


Typical ML models

Given a set of observed data $S = \{(u_i, v_i)\}_{i=1}^m$, drawn from a certain unknown distribution $D$ on $U \times V$.

Goal: to describe the relation between $u_i$ and the $v_i$'s for prediction.

Applications: predicting strokes and seizures, identifying heart failure, stopping credit card fraud, predicting machine failure, identifying spam, ......

Classic models:

- Lasso regression: $\min_x \mathbb{E}[(\langle x, u\rangle - v)^2] + \lambda\|x\|_1$.
- Support vector machine: $\min_x \mathbb{E}_{u,v}[\max\{0, v\langle x, u\rangle\}] + \lambda\|x\|_2^2$.
- Deep learning: $\min_{\theta, W} \mathbb{E}_{u,v}\,(F(\theta^T \sigma(Wu)) - v)^2$.


Optimization models/methods in ML

Stochastic optimization: $\min_{x\in X}\{f(x) := \mathbb{E}[F(x, \xi)]\}$.
$F$ is the regularized loss function and $\xi = (u, v)$:
$F(x, \xi) = (\langle x, u\rangle - v)^2 + \lambda\|x\|_1$, or
$F(x, \xi) = \max\{0, v\langle x, u\rangle\} + \lambda\|x\|_2^2$.
Often called population risk minimization in ML.

Finite-sum minimization: $\min_{x\in X}\big\{f(x) := \tfrac{1}{N}\sum_{i=1}^N f_i(x)\big\}$.

- Empirical risk minimization: $f_i(x) = F(x, \xi_i)$.
- Distributed/decentralized ML: $f_i$ is the loss for each agent $i$.


Outline

- Deterministic first-order methods
- Stochastic optimization methods
  - Stochastic gradient descent (SGD)
  - Accelerated SGD (SGD with momentum)
  - Nonconvex SGD and its acceleration
  - Adaptive and accelerated methods
- Finite-sum and distributed optimization methods
  - Variance-reduced gradient methods
  - Primal-dual gradient methods
  - Decentralized/distributed SGD
- Uncovered topics: projection-free and second-order methods


(Sub)Gradient descent

Problem

$f^* := \min_{x\in X} f(x)$. Here $\emptyset \neq X \subseteq \mathbb{R}^n$ is a convex set and $f$ is a convex function.

Basic Idea

Starting from $x_1 \in \mathbb{R}^n$, update $x_t$ by $x_{t+1} = x_t - \gamma_t \nabla f(x_t)$, $t = 1, 2, \ldots$.

Two essential enhancements

- $f$ may be non-differentiable: replace $\nabla f(x_t)$ by a subgradient $g(x_t) \in \partial f(x_t)$.
- $x_{t+1}$ may be infeasible: project back to $X$:

$x_{t+1} := \arg\min_{x\in X} \|x - (x_t - \gamma_t g(x_t))\|_2^2, \quad t = 1, 2, \ldots.$
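As a concrete illustration, here is a minimal Python sketch of the projected (sub)gradient update; the helper names (`subgrad`, `project`, `steps`) and the box-constrained example are illustrative assumptions, not from the slides.

```python
import numpy as np

def projected_subgradient(subgrad, project, x1, steps, num_iters):
    """Projected subgradient descent: x_{t+1} = Proj_X(x_t - gamma_t * g(x_t))."""
    x = x1
    for t in range(num_iters):
        x = project(x - steps(t) * subgrad(x))
    return x

# Toy example: minimize f(x) = ||x||_1 over the box X = [1, 2]^n.
# A subgradient of ||x||_1 is sign(x); the minimizer is x = (1, ..., 1).
n = 5
subgrad = lambda x: np.sign(x)
project = lambda x: np.clip(x, 1.0, 2.0)   # Euclidean projection onto the box
steps = lambda t: 0.5 / np.sqrt(t + 1)     # diminishing stepsizes gamma_t
x_final = projected_subgradient(subgrad, project, np.full(n, 2.0), steps, 200)
# x_final is (numerically) the all-ones vector
```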


Interpretation

From the proximity control point of view,

$x_{t+1} = \arg\min_{x\in X} \tfrac12\|x - (x_t - \gamma_t g(x_t))\|_2^2$
$= \arg\min_{x\in X} \gamma_t\langle g(x_t), x - x_t\rangle + \tfrac12\|x - x_t\|_2^2$
$= \arg\min_{x\in X} \gamma_t[f(x_t) + \langle g(x_t), x - x_t\rangle] + \tfrac12\|x - x_t\|_2^2$
$= \arg\min_{x\in X} \gamma_t\langle g(x_t), x\rangle + \tfrac12\|x - x_t\|_2^2.$

Implication

To minimize the linear approximation $f(x_t) + \langle g(x_t), x - x_t\rangle$ of $f(x)$ over $X$, without moving too far away from $x_t$.

The role of the stepsize

$\gamma_t$ controls how much we trust the linear model and depends on the problem class to be solved.


Convergence of (Sub)Gradient Descent (GD)

Nonsmooth problems

$f$ is $M$-Lipschitz continuous, i.e., $|f(x) - f(y)| \le M\|x - y\|_2$.

Theorem

Let $\bar{x}_s^k := \big(\sum_{t=s}^k \gamma_t\big)^{-1}\sum_{t=s}^k (\gamma_t x_t)$. Then

$f(\bar{x}_s^k) - f^* \le \big(2\sum_{t=s}^k \gamma_t\big)^{-1}\big[\|x^* - x_s\|_2^2 + M^2\sum_{t=s}^k \gamma_t^2\big].$

Selection of $\gamma_t$

- If $\gamma_t = \sqrt{D_X^2/(kM^2)}$ for some fixed $k$, then $f(\bar{x}_1^k) - f^* \le \frac{MD_X}{\sqrt{k}}$, where $D_X \equiv \max_{x_1,x_2\in X}\|x_1 - x_2\|_2$.
- If $\gamma_t = \sqrt{D_X^2/(tM^2)}$, then $f(\bar{x}_{\lceil k/2\rceil}^k) - f^* \le O(1)\,(MD_X/\sqrt{k})$.
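To see the $MD_X/\sqrt{k}$ guarantee numerically, a small sketch (the setup is an illustrative assumption, not from the slides): the subgradient method on $f(x) = |x|$ over $X = [-1, 1]$, where $M = 1$ and $D_X = 2$.

```python
import numpy as np

# Subgradient method on f(x) = |x| over X = [-1, 1], so M = 1 and D_X = 2.
# Constant stepsize gamma_t = sqrt(D_X^2/(k M^2)) = D_X/(M sqrt(k)) for a fixed budget k.
M, DX, k = 1.0, 2.0, 100
gamma = DX / (M * np.sqrt(k))
x, iterates = 1.0, []
for _ in range(k):
    iterates.append(x)
    g = np.sign(x)                                # a subgradient of |x|
    x = float(np.clip(x - gamma * g, -1.0, 1.0))  # projection onto [-1, 1]
xbar = np.mean(iterates)  # gamma-weighted average reduces to the plain mean here
# The theorem guarantees f(xbar) - f* <= M * DX / sqrt(k) = 0.2.
```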


Convergence of GD

Smooth problems

$f$ is differentiable, and $\nabla f$ is $L$-Lipschitz continuous, i.e., $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$.

$\exists\,\mu \ge 0$ s.t. $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \tfrac{\mu}{2}\|y - x\|^2$.

Theorem

Let $\gamma \in (0, \tfrac{2}{L}]$. If $\gamma_t = \gamma$, $t = 1, 2, \ldots$, then

$f(x_k) - f^* \le \frac{\|x_0 - x^*\|^2}{2k\gamma(2 - L\gamma)}.$

Moreover, if $\mu > 0$ and $Q_f = L/\mu$, then

$\|x_k - x^*\| \le \big(\tfrac{Q_f - 1}{Q_f + 1}\big)^k\|x_0 - x^*\|, \qquad f(x_k) - f^* \le \tfrac{L}{2}\big(\tfrac{Q_f - 1}{Q_f + 1}\big)^{2k}\|x_0 - x^*\|^2.$


Adaptation to Geometry: Mirror Descent (MD)

GD is intrinsically linked to the Euclidean structure of $\mathbb{R}^n$:

- the method relies on the Euclidean projection;
- $D_X$, $L$ and $M$ are defined in terms of the Euclidean norm.

Bregman Distance

Let $\|\cdot\|$ be a (general) norm on $\mathbb{R}^n$ with dual norm $\|x\|_* = \sup_{\|y\|\le 1}\langle x, y\rangle$, and let $\omega$ be a continuously differentiable and strongly convex function with modulus $\nu$ with respect to $\|\cdot\|$. Define

$V(x, z) = \omega(z) - [\omega(x) + \nabla\omega(x)^T(z - x)].$

Mirror Descent (Nemirovski and Yudin 83, Beck and Teboulle 03)

$x_{t+1} = \arg\min_{x\in X} \gamma_t\langle g(x_t), x\rangle + V(x_t, x), \quad t = 1, 2, \ldots.$

GD is a special case of MD: $V(x_t, x) = \|x - x_t\|_2^2/2$.
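On the probability simplex with the entropy prox-function $\omega(x) = \sum_i x_i \log x_i$, the MD update has a well-known closed form (the exponentiated-gradient rule). A minimal sketch, with the function names and the linear toy objective as illustrative assumptions:

```python
import numpy as np

def entropy_mirror_descent(grad, x1, gamma, num_iters):
    """Mirror descent over the probability simplex with the entropy
    prox-function omega(x) = sum_i x_i log x_i.  The update
        x_{t+1} = argmin_x gamma * <g(x_t), x> + V(x_t, x)
    has the closed form x_{t+1,i} proportional to x_{t,i} * exp(-gamma * g_i(x_t))."""
    x = x1
    for _ in range(num_iters):
        x = x * np.exp(-gamma * grad(x))
        x = x / x.sum()                 # renormalize onto the simplex
    return x

# Toy example: minimize the linear function f(x) = <c, x> over the simplex;
# the minimum sits at the vertex with the smallest c_i (index 1 here).
c = np.array([0.9, 0.1, 0.5])
x_final = entropy_mirror_descent(lambda x: c, np.ones(3) / 3, gamma=0.5, num_iters=300)
```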


Acceleration scheme

Assume that $f$ is smooth: $\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$.

Accelerated gradient descent (AGD) (Nesterov 83, 04; Tseng 08)

Choose $x_0 = \bar{x}_0 \in X$.
1) Set $\underline{x}_k = (1 - \alpha_k)\bar{x}_{k-1} + \alpha_k x_{k-1}$.
2) Compute $\nabla f(\underline{x}_k)$ and set
$x_k = \arg\min_{x\in X}\{\alpha_k\langle\nabla f(\underline{x}_k), x\rangle + \eta_k V(x_{k-1}, x)\}$,
$\bar{x}_k = (1 - \alpha_k)\bar{x}_{k-1} + \alpha_k x_k$.
3) Set $k \leftarrow k + 1$ and go to step 1).

Theorem

If $\eta_k \ge L\alpha_k^2$ and $\Gamma_k = (1 - \alpha_k)\Gamma_{k-1}$ for $k \ge 1$, then
$f(\bar{x}_k) - f(x) \le \Gamma_k(1 - \alpha_1)[f(\bar{x}_0) - f(x)] + \eta_k V(x_0, x), \quad \forall x \in X.$
In particular, if $\alpha_k = 2/(k+1)$ and $\eta_k = 4L/[k(k+1)]$, then
$f(\bar{x}_k) - f(x) \le \tfrac{4L}{k(k+1)}\,V(x_0, x).$
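A Euclidean instantiation of this scheme ($V(x, z) = \|z - x\|_2^2/2$, unconstrained $X$) can be sketched as follows; with $\alpha_k = 2/(k+1)$ and $\eta_k = 4L/[k(k+1)]$ the prox step reduces to a gradient step of size $\alpha_k/\eta_k = k/(2L)$. The example problem and names are assumptions, not from the slides.

```python
import numpy as np

def agd(grad, x0, L, num_iters):
    """AGD with alpha_k = 2/(k+1); with V(x, z) = ||z - x||^2/2 and X = R^n the
    prox step x_k = argmin alpha_k*<grad, x> + eta_k*V(x_{k-1}, x) becomes an
    explicit gradient step of size alpha_k/eta_k = k/(2L)."""
    x = xbar = x0
    for k in range(1, num_iters + 1):
        alpha = 2.0 / (k + 1)
        xunder = (1 - alpha) * xbar + alpha * x   # extrapolation point
        x = x - (k / (2.0 * L)) * grad(xunder)    # prox/gradient step
        xbar = (1 - alpha) * xbar + alpha * x     # output (averaged) sequence
    return xbar

# Toy smooth problem: f(x) = 0.5 * sum(d_i x_i^2), with L = max(d) and x* = 0.
d = np.array([1.0, 0.5, 0.1])
L = 1.0
f = lambda x: 0.5 * np.sum(d * x**2)
xbar = agd(lambda x: d * x, np.ones(3), L, 100)
# Guarantee: f(xbar) - f* <= 4L/(k(k+1)) * V(x0, x*) = 4/(100*101) * 1.5, about 6e-4
```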


Acceleration scheme (strongly convex case)

Assumption: $\exists\,\mu > 0$ s.t. $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \tfrac{\mu}{2}\|y - x\|^2$.

AGD for strongly convex problems

Idea: restart AGD every $N = \big\lceil\sqrt{8L/\mu}\big\rceil$ iterations.

The algorithm. Input: $p_0 \in X$. Phase $t = 1, 2, \ldots$:

Set $p_t = \bar{x}_N$, where $\bar{x}_N$ is obtained from AGD with $x_0 = p_{t-1}$.

Theorem

For any $t \ge 1$, we have $\|p_t - x^*\|^2 \le \big(\tfrac12\big)^t\|p_0 - x^*\|^2$.

To have $\|p_t - x^*\|^2 \le \epsilon$, the total number of iterations is bounded by

$\sqrt{8L/\mu}\,\log\tfrac{\|p_0 - x^*\|^2}{\epsilon}.$
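The restart scheme is easy to sketch on top of a Euclidean accelerated method (a hypothetical setup; `agd` below is an illustrative inner solver, not necessarily the slides' exact algorithm):

```python
import numpy as np

def agd(grad, x0, L, num_iters):
    # Inner solver: a Euclidean accelerated gradient method (alpha_k = 2/(k+1),
    # gradient step of size k/(2L) taken at the extrapolation point).
    x = xbar = x0
    for k in range(1, num_iters + 1):
        alpha = 2.0 / (k + 1)
        xunder = (1 - alpha) * xbar + alpha * x
        x = x - (k / (2.0 * L)) * grad(xunder)
        xbar = (1 - alpha) * xbar + alpha * x
    return xbar

def restarted_agd(grad, p0, L, mu, num_phases):
    """Restart AGD every N = ceil(sqrt(8 L / mu)) iterations; under mu-strong
    convexity each phase at least halves the squared distance to x*."""
    N = int(np.ceil(np.sqrt(8.0 * L / mu)))
    p = p0
    for _ in range(num_phases):
        p = agd(grad, p, L, N)
    return p

# Toy strongly convex problem: f(x) = 0.5 * sum(d_i x_i^2), L = max(d), mu = min(d).
d = np.array([1.0, 0.3, 0.1])
p0 = np.ones(3)
p = restarted_agd(lambda x: d * x, p0, L=1.0, mu=0.1, num_phases=10)
# After 10 phases: ||p - x*||^2 <= (1/2)^10 * ||p0 - x*||^2
```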


Stochastic optimization problems

The Problem: $\min_{x\in X}\{f(x) := \mathbb{E}[F(x, \xi)]\}$.

Challenge: computing exact (sub)gradients is computationally prohibitive.

Stochastic oracle (SO)

At iteration $t$, with $x_t \in X$ as the input, the SO outputs a vector $G(x_t, \xi_t)$, where $\{\xi_t\}_{t\ge1}$ are i.i.d. random variables s.t.

$\mathbb{E}[G(x_t, \xi_t)] = g(x_t) \in \partial f(x_t).$

Examples:

- $f(x) = \mathbb{E}[F(x, \xi)]$: $G(x_t, \xi_t) = F'(x_t, \xi_t)$, with $\xi_t$ a random realization of $\xi$.
- $f(x) = \sum_{i=1}^m f_i(x)/m$: $G(x_t, i_t) = \nabla f_{i_t}(x_t)$, with $i_t$ a uniform random variable on $\{1, \ldots, m\}$.
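The finite-sum oracle in the second example is unbiased by construction: averaging many draws of $G(x, i_t)$ recovers $\nabla f(x)$. A small numerical sketch (least-squares components; all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite-sum objective f(x) = (1/m) sum_i f_i(x), f_i(x) = 0.5 * (a_i . x - b_i)^2.
m, n = 50, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def full_grad(x):
    return A.T @ (A @ x - b) / m

def oracle(x):
    # G(x, i_t) = grad f_{i_t}(x) with i_t uniform on {1, ..., m}:
    # E[G(x, i_t)] = grad f(x), so the oracle is unbiased.
    i = rng.integers(m)
    return A[i] * (A[i] @ x - b[i])

x = np.ones(n)
est = np.mean([oracle(x) for _ in range(100000)], axis=0)
# est approaches full_grad(x) as the number of draws grows
```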


Stochastic mirror descent

Mirror descent stochastic approximation (MDSA)

The algorithm: replace the exact linear model in MD with its stochastic approximation (goes back to Robbins and Monro 51, Nemirovski and Yudin 83):

$x_{t+1} = \arg\min_{x\in X} \gamma_t\langle G_t, x\rangle + V(x_t, x), \quad t = 1, 2, \ldots$

Theorem (Nemirovski, Juditsky, Lan and Shapiro 07 (09))

Assume $\mathbb{E}[\|G(x, \xi)\|_*^2] \le M^2$. Then

$\mathbb{E}[f(\bar{x}_s^k)] - f^* \le \big(\sum_{t=s}^k \gamma_t\big)^{-1}\big[\mathbb{E}[V(x_s, x^*)] + (2\nu)^{-1}M^2\sum_{t=s}^k \gamma_t^2\big].$

The selection of $\gamma_t$

- If $\gamma_t = \sqrt{2\nu\Omega/(kM^2)}$ for some fixed $k$, then $\mathbb{E}[f(\bar{x}_1^k) - f^*] \le M\sqrt{2\Omega/(\nu k)}$, where $\Omega \equiv \max_{x_1,x_2\in X} V(x_1, x_2)$.
- If $\gamma_t = \sqrt{2\nu\Omega/(tM^2)}$, then $\mathbb{E}[f(\bar{x}_{\lceil k/2\rceil}^k) - f^*] \le O(1)\,(M\sqrt{\Omega/(\nu k)})$.
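A one-dimensional MDSA/SGD run with $\gamma_t \propto 1/\sqrt{t}$ and averaging over the second half of the trajectory (a toy problem chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stochastic problem: f(x) = E[0.5 * (x - xi)^2], xi ~ N(theta, 1), so x* = theta.
theta = 2.0

def stoch_grad(x):
    xi = theta + rng.standard_normal()
    return x - xi    # G(x, xi) = x - xi, with E[G(x, xi)] = x - theta = f'(x)

k = 20000
x = 0.0
tail_w, tail_x = [], []
for t in range(1, k + 1):
    gamma = 1.0 / np.sqrt(t)   # gamma_t proportional to 1/sqrt(t)
    x = x - gamma * stoch_grad(x)
    if t > k // 2:             # keep the second half of the trajectory
        tail_w.append(gamma)
        tail_x.append(x)
# gamma-weighted average over the tail, i.e. the averaged iterate from the theorem
xbar = np.dot(tail_w, tail_x) / np.sum(tail_w)
```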


Stochastic mirror descent

Complexity?

Stochastic optimization: $\min_{x\in X}\mathbb{E}[F(x, \xi)]$. One $F'(x_t, \xi_t)$ per iteration, $O(1/\epsilon^2)$ iterations in total: optimal sampling complexity.

Deterministic finite-sum optimization: $\min_{x\in X}\big\{f(x) := \tfrac{1}{m}\sum_{i=1}^m f_i(x)\big\}$, with

$|f_i(x) - f_i(y)| \le M_i\|x - y\|, \quad |f(x) - f(y)| \le M\|x - y\|, \quad \forall x, y \in X,$

$M \le \bar{M} \equiv \max_i M_i.$

Iteration complexity