
# Stochastic Optimization Algorithms for Machine Learning

Guanghui (George) Lan

H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech, USA

Workshop on Optimization and Learning (CIMI Workshop), Toulouse, France

September 10-13, 2018


Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary

Machine learning (ML)

ML exploits optimization, statistics, and high-performance computing, among other techniques, to transform raw data into knowledge in order to support decision-making in various areas, e.g., biomedicine, health care, logistics, energy and transportation.


The role of optimization

Optimization provides theoretical insights and efficient solution methods for ML models.


Typical ML models

Given a set of observed data $S = \{(u_i, v_i)\}_{i=1}^m$, drawn from a certain unknown distribution $D$ on $U \times V$.

Goal: to describe the relation between $u_i$ and the $v_i$'s for prediction.

Applications: predicting strokes and seizures, identifying heart failure, stopping credit card fraud, predicting machine failure, identifying spam, ......

Classic models:

- Lasso regression: $\min_x \mathbb{E}[(\langle x, u\rangle - v)^2] + \lambda\|x\|_1$.
- Support vector machine: $\min_x \mathbb{E}_{u,v}[\max\{0, v\langle x, u\rangle\}] + \lambda\|x\|_2^2$.
- Deep learning: $\min_{\theta, W} \mathbb{E}_{u,v}\,(F(\theta^T \sigma(Wu)) - v)^2$.


Optimization models/methods in ML

Stochastic optimization: $\min_{x\in X}\{f(x) := \mathbb{E}[F(x, \xi)]\}$.
$F$ is the regularized loss function and $\xi = (u, v)$:
$F(x, \xi) = (\langle x, u\rangle - v)^2 + \lambda\|x\|_1$, or
$F(x, \xi) = \max\{0, v\langle x, u\rangle\} + \lambda\|x\|_2^2$.
Often called population risk minimization in ML.

Finite-sum minimization: $\min_{x\in X}\big\{f(x) := \tfrac{1}{N}\sum_{i=1}^N f_i(x)\big\}$.

- Empirical risk minimization: $f_i(x) = F(x, \xi_i)$.
- Distributed/decentralized ML: $f_i$ is the loss for each agent $i$.


Outline

- Deterministic first-order methods
- Stochastic optimization methods
  - Stochastic gradient descent (SGD)
  - Accelerated SGD (SGD with momentum)
  - Nonconvex SGD and its acceleration
  - Adaptive and accelerated methods
- Finite-sum and distributed optimization methods
  - Variance-reduced gradient methods
  - Primal-dual gradient methods
  - Decentralized/distributed SGD
- Uncovered topics: projection-free and second-order methods


(Sub)Gradient descent

Problem

$f^* := \min_{x\in X} f(x)$. Here $\emptyset \neq X \subseteq \mathbb{R}^n$ is a convex set and $f$ is a convex function.

Basic Idea

Starting from $x_1 \in \mathbb{R}^n$, update $x_t$ by $x_{t+1} = x_t - \gamma_t \nabla f(x_t)$, $t = 1, 2, \ldots$.

Two essential enhancements

- $f$ may be non-differentiable: replace $\nabla f(x_t)$ by a subgradient $g(x_t) \in \partial f(x_t)$.
- $x_{t+1}$ may be infeasible: project back to $X$:

$x_{t+1} := \arg\min_{x\in X} \|x - (x_t - \gamma_t g(x_t))\|_2^2, \quad t = 1, 2, \ldots.$
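As a concrete illustration, here is a minimal Python sketch of the projected (sub)gradient update; the helper names (`subgrad`, `project`, `steps`) and the box-constrained example are illustrative assumptions, not from the slides.

```python
import numpy as np

def projected_subgradient(subgrad, project, x1, steps, num_iters):
    """Projected subgradient descent: x_{t+1} = Proj_X(x_t - gamma_t * g(x_t))."""
    x = x1
    for t in range(num_iters):
        x = project(x - steps(t) * subgrad(x))
    return x

# Toy example: minimize f(x) = ||x||_1 over the box X = [1, 2]^n.
# A subgradient of ||x||_1 is sign(x); the minimizer is x = (1, ..., 1).
n = 5
subgrad = lambda x: np.sign(x)
project = lambda x: np.clip(x, 1.0, 2.0)   # Euclidean projection onto the box
steps = lambda t: 0.5 / np.sqrt(t + 1)     # diminishing stepsizes gamma_t
x_final = projected_subgradient(subgrad, project, np.full(n, 2.0), steps, 200)
# x_final is (numerically) the all-ones vector
```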


Interpretation

From the proximity control point of view,

$x_{t+1} = \arg\min_{x\in X} \tfrac12\|x - (x_t - \gamma_t g(x_t))\|_2^2$
$= \arg\min_{x\in X} \gamma_t\langle g(x_t), x - x_t\rangle + \tfrac12\|x - x_t\|_2^2$
$= \arg\min_{x\in X} \gamma_t[f(x_t) + \langle g(x_t), x - x_t\rangle] + \tfrac12\|x - x_t\|_2^2$
$= \arg\min_{x\in X} \gamma_t\langle g(x_t), x\rangle + \tfrac12\|x - x_t\|_2^2.$

Implication

To minimize the linear approximation $f(x_t) + \langle g(x_t), x - x_t\rangle$ of $f(x)$ over $X$, without moving too far away from $x_t$.

The role of the stepsize

$\gamma_t$ controls how much we trust the linear model and depends on the problem class to be solved.


Convergence of (Sub)Gradient Descent (GD)

Nonsmooth problems

$f$ is $M$-Lipschitz continuous, i.e., $|f(x) - f(y)| \le M\|x - y\|_2$.

Theorem

Let $\bar{x}_s^k := \big(\sum_{t=s}^k \gamma_t\big)^{-1}\sum_{t=s}^k (\gamma_t x_t)$. Then

$f(\bar{x}_s^k) - f^* \le \big(2\sum_{t=s}^k \gamma_t\big)^{-1}\big[\|x^* - x_s\|_2^2 + M^2\sum_{t=s}^k \gamma_t^2\big].$

Selection of $\gamma_t$

- If $\gamma_t = \sqrt{D_X^2/(kM^2)}$ for some fixed $k$, then $f(\bar{x}_1^k) - f^* \le \frac{MD_X}{\sqrt{k}}$, where $D_X \equiv \max_{x_1,x_2\in X}\|x_1 - x_2\|_2$.
- If $\gamma_t = \sqrt{D_X^2/(tM^2)}$, then $f(\bar{x}_{\lceil k/2\rceil}^k) - f^* \le O(1)\,(MD_X/\sqrt{k})$.
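To see the $MD_X/\sqrt{k}$ guarantee numerically, a small sketch (the setup is an illustrative assumption, not from the slides): the subgradient method on $f(x) = |x|$ over $X = [-1, 1]$, where $M = 1$ and $D_X = 2$.

```python
import numpy as np

# Subgradient method on f(x) = |x| over X = [-1, 1], so M = 1 and D_X = 2.
# Constant stepsize gamma_t = sqrt(D_X^2/(k M^2)) = D_X/(M sqrt(k)) for a fixed budget k.
M, DX, k = 1.0, 2.0, 100
gamma = DX / (M * np.sqrt(k))
x, iterates = 1.0, []
for _ in range(k):
    iterates.append(x)
    g = np.sign(x)                                # a subgradient of |x|
    x = float(np.clip(x - gamma * g, -1.0, 1.0))  # projection onto [-1, 1]
xbar = np.mean(iterates)  # gamma-weighted average reduces to the plain mean here
# The theorem guarantees f(xbar) - f* <= M * DX / sqrt(k) = 0.2.
```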


Convergence of GD

Smooth problems

$f$ is differentiable, and $\nabla f$ is $L$-Lipschitz continuous, i.e., $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$.

$\exists\,\mu \ge 0$ s.t. $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \tfrac{\mu}{2}\|y - x\|^2$.

Theorem

Let $\gamma \in (0, \tfrac{2}{L}]$. If $\gamma_t = \gamma$, $t = 1, 2, \ldots$, then

$f(x_k) - f^* \le \frac{\|x_0 - x^*\|^2}{2k\gamma(2 - L\gamma)}.$

Moreover, if $\mu > 0$ and $Q_f = L/\mu$, then

$\|x_k - x^*\| \le \big(\tfrac{Q_f - 1}{Q_f + 1}\big)^k\|x_0 - x^*\|, \qquad f(x_k) - f^* \le \tfrac{L}{2}\big(\tfrac{Q_f - 1}{Q_f + 1}\big)^{2k}\|x_0 - x^*\|^2.$


Adaptation to Geometry: Mirror Descent (MD)

GD is intrinsically linked to the Euclidean structure of $\mathbb{R}^n$:

- the method relies on the Euclidean projection;
- $D_X$, $L$ and $M$ are defined in terms of the Euclidean norm.

Bregman Distance

Let $\|\cdot\|$ be a (general) norm on $\mathbb{R}^n$ with dual norm $\|x\|_* = \sup_{\|y\|\le 1}\langle x, y\rangle$, and let $\omega$ be a continuously differentiable and strongly convex function with modulus $\nu$ with respect to $\|\cdot\|$. Define

$V(x, z) = \omega(z) - [\omega(x) + \nabla\omega(x)^T(z - x)].$

Mirror Descent (Nemirovski and Yudin 83, Beck and Teboulle 03)

$x_{t+1} = \arg\min_{x\in X} \gamma_t\langle g(x_t), x\rangle + V(x_t, x), \quad t = 1, 2, \ldots.$

GD is a special case of MD: $V(x_t, x) = \|x - x_t\|_2^2/2$.
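On the probability simplex with the entropy prox-function $\omega(x) = \sum_i x_i \log x_i$, the MD update has a well-known closed form (the exponentiated-gradient rule). A minimal sketch, with the function names and the linear toy objective as illustrative assumptions:

```python
import numpy as np

def entropy_mirror_descent(grad, x1, gamma, num_iters):
    """Mirror descent over the probability simplex with the entropy
    prox-function omega(x) = sum_i x_i log x_i.  The update
        x_{t+1} = argmin_x gamma * <g(x_t), x> + V(x_t, x)
    has the closed form x_{t+1,i} proportional to x_{t,i} * exp(-gamma * g_i(x_t))."""
    x = x1
    for _ in range(num_iters):
        x = x * np.exp(-gamma * grad(x))
        x = x / x.sum()                 # renormalize onto the simplex
    return x

# Toy example: minimize the linear function f(x) = <c, x> over the simplex;
# the minimum sits at the vertex with the smallest c_i (index 1 here).
c = np.array([0.9, 0.1, 0.5])
x_final = entropy_mirror_descent(lambda x: c, np.ones(3) / 3, gamma=0.5, num_iters=300)
```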


Acceleration scheme

Assume that $f$ is smooth: $\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$.

Accelerated gradient descent (AGD) (Nesterov 83, 04; Tseng 08)

Choose $x_0 = \bar{x}_0 \in X$.
1) Set $\underline{x}_k = (1 - \alpha_k)\bar{x}_{k-1} + \alpha_k x_{k-1}$.
2) Compute $\nabla f(\underline{x}_k)$ and set
$x_k = \arg\min_{x\in X}\{\alpha_k\langle\nabla f(\underline{x}_k), x\rangle + \eta_k V(x_{k-1}, x)\}$,
$\bar{x}_k = (1 - \alpha_k)\bar{x}_{k-1} + \alpha_k x_k$.
3) Set $k \leftarrow k + 1$ and go to step 1).

Theorem

If $\eta_k \ge L\alpha_k^2$ and $\Gamma_k = (1 - \alpha_k)\Gamma_{k-1}$ for $k \ge 1$, then
$f(\bar{x}_k) - f(x) \le \Gamma_k(1 - \alpha_1)[f(\bar{x}_0) - f(x)] + \eta_k V(x_0, x), \quad \forall x \in X.$
In particular, if $\alpha_k = 2/(k+1)$ and $\eta_k = 4L/[k(k+1)]$, then
$f(\bar{x}_k) - f(x) \le \tfrac{4L}{k(k+1)}\,V(x_0, x).$
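A Euclidean instantiation of this scheme ($V(x, z) = \|z - x\|_2^2/2$, unconstrained $X$) can be sketched as follows; with $\alpha_k = 2/(k+1)$ and $\eta_k = 4L/[k(k+1)]$ the prox step reduces to a gradient step of size $\alpha_k/\eta_k = k/(2L)$. The example problem and names are assumptions, not from the slides.

```python
import numpy as np

def agd(grad, x0, L, num_iters):
    """AGD with alpha_k = 2/(k+1); with V(x, z) = ||z - x||^2/2 and X = R^n the
    prox step x_k = argmin alpha_k*<grad, x> + eta_k*V(x_{k-1}, x) becomes an
    explicit gradient step of size alpha_k/eta_k = k/(2L)."""
    x = xbar = x0
    for k in range(1, num_iters + 1):
        alpha = 2.0 / (k + 1)
        xunder = (1 - alpha) * xbar + alpha * x   # extrapolation point
        x = x - (k / (2.0 * L)) * grad(xunder)    # prox/gradient step
        xbar = (1 - alpha) * xbar + alpha * x     # output (averaged) sequence
    return xbar

# Toy smooth problem: f(x) = 0.5 * sum(d_i x_i^2), with L = max(d) and x* = 0.
d = np.array([1.0, 0.5, 0.1])
L = 1.0
f = lambda x: 0.5 * np.sum(d * x**2)
xbar = agd(lambda x: d * x, np.ones(3), L, 100)
# Guarantee: f(xbar) - f* <= 4L/(k(k+1)) * V(x0, x*) = 4/(100*101) * 1.5, about 6e-4
```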


Acceleration scheme (strongly convex case)

Assumption: $\exists\,\mu > 0$ s.t. $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \tfrac{\mu}{2}\|y - x\|^2$.

AGD for strongly convex problems

Idea: restart AGD every $N = \big\lceil\sqrt{8L/\mu}\big\rceil$ iterations.

The algorithm. Input: $p_0 \in X$. Phase $t = 1, 2, \ldots$:

Set $p_t = \bar{x}_N$, where $\bar{x}_N$ is obtained from AGD with $x_0 = p_{t-1}$.

Theorem

For any $t \ge 1$, we have $\|p_t - x^*\|^2 \le \big(\tfrac12\big)^t\|p_0 - x^*\|^2$.

To have $\|p_t - x^*\|^2 \le \epsilon$, the total number of iterations is bounded by

$\sqrt{8L/\mu}\,\log\tfrac{\|p_0 - x^*\|^2}{\epsilon}.$
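The restart scheme is easy to sketch on top of a Euclidean accelerated method (a hypothetical setup; `agd` below is an illustrative inner solver, not necessarily the slides' exact algorithm):

```python
import numpy as np

def agd(grad, x0, L, num_iters):
    # Inner solver: a Euclidean accelerated gradient method (alpha_k = 2/(k+1),
    # gradient step of size k/(2L) taken at the extrapolation point).
    x = xbar = x0
    for k in range(1, num_iters + 1):
        alpha = 2.0 / (k + 1)
        xunder = (1 - alpha) * xbar + alpha * x
        x = x - (k / (2.0 * L)) * grad(xunder)
        xbar = (1 - alpha) * xbar + alpha * x
    return xbar

def restarted_agd(grad, p0, L, mu, num_phases):
    """Restart AGD every N = ceil(sqrt(8 L / mu)) iterations; under mu-strong
    convexity each phase at least halves the squared distance to x*."""
    N = int(np.ceil(np.sqrt(8.0 * L / mu)))
    p = p0
    for _ in range(num_phases):
        p = agd(grad, p, L, N)
    return p

# Toy strongly convex problem: f(x) = 0.5 * sum(d_i x_i^2), L = max(d), mu = min(d).
d = np.array([1.0, 0.3, 0.1])
p0 = np.ones(3)
p = restarted_agd(lambda x: d * x, p0, L=1.0, mu=0.1, num_phases=10)
# After 10 phases: ||p - x*||^2 <= (1/2)^10 * ||p0 - x*||^2
```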


Stochastic optimization problems

The Problem: $\min_{x\in X}\{f(x) := \mathbb{E}[F(x, \xi)]\}$.

Challenge: computing exact (sub)gradients is computationally prohibitive.

Stochastic oracle (SO)

At iteration $t$, with $x_t \in X$ as the input, the SO outputs a vector $G(x_t, \xi_t)$, where $\{\xi_t\}_{t\ge1}$ are i.i.d. random variables s.t.

$\mathbb{E}[G(x_t, \xi_t)] = g(x_t) \in \partial f(x_t).$

Examples:

- $f(x) = \mathbb{E}[F(x, \xi)]$: $G(x_t, \xi_t) = F'(x_t, \xi_t)$, with $\xi_t$ a random realization of $\xi$.
- $f(x) = \sum_{i=1}^m f_i(x)/m$: $G(x_t, i_t) = \nabla f_{i_t}(x_t)$, with $i_t$ a uniform random variable on $\{1, \ldots, m\}$.
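The finite-sum oracle in the second example is unbiased by construction: averaging many draws of $G(x, i_t)$ recovers $\nabla f(x)$. A small numerical sketch (least-squares components; all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite-sum objective f(x) = (1/m) sum_i f_i(x), f_i(x) = 0.5 * (a_i . x - b_i)^2.
m, n = 50, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def full_grad(x):
    return A.T @ (A @ x - b) / m

def oracle(x):
    # G(x, i_t) = grad f_{i_t}(x) with i_t uniform on {1, ..., m}:
    # E[G(x, i_t)] = grad f(x), so the oracle is unbiased.
    i = rng.integers(m)
    return A[i] * (A[i] @ x - b[i])

x = np.ones(n)
est = np.mean([oracle(x) for _ in range(100000)], axis=0)
# est approaches full_grad(x) as the number of draws grows
```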


Stochastic mirror descent

Mirror descent stochastic approximation (MDSA)

The algorithm: replace the exact linear model in MD with its stochastic approximation (goes back to Robbins and Monro 51, Nemirovski and Yudin 83):

$x_{t+1} = \arg\min_{x\in X} \gamma_t\langle G_t, x\rangle + V(x_t, x), \quad t = 1, 2, \ldots$

Theorem (Nemirovski, Juditsky, Lan and Shapiro 07 (09))

Assume $\mathbb{E}[\|G(x, \xi)\|_*^2] \le M^2$. Then

$\mathbb{E}[f(\bar{x}_s^k)] - f^* \le \big(\sum_{t=s}^k \gamma_t\big)^{-1}\big[\mathbb{E}[V(x_s, x^*)] + (2\nu)^{-1}M^2\sum_{t=s}^k \gamma_t^2\big].$

The selection of $\gamma_t$

- If $\gamma_t = \sqrt{2\nu\Omega/(kM^2)}$ for some fixed $k$, then $\mathbb{E}[f(\bar{x}_1^k) - f^*] \le M\sqrt{2\Omega/(\nu k)}$, where $\Omega \equiv \max_{x_1,x_2\in X} V(x_1, x_2)$.
- If $\gamma_t = \sqrt{2\nu\Omega/(tM^2)}$, then $\mathbb{E}[f(\bar{x}_{\lceil k/2\rceil}^k) - f^*] \le O(1)\,(M\sqrt{\Omega/(\nu k)})$.
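A one-dimensional MDSA/SGD run with $\gamma_t \propto 1/\sqrt{t}$ and averaging over the second half of the trajectory (a toy problem chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stochastic problem: f(x) = E[0.5 * (x - xi)^2], xi ~ N(theta, 1), so x* = theta.
theta = 2.0

def stoch_grad(x):
    xi = theta + rng.standard_normal()
    return x - xi    # G(x, xi) = x - xi, with E[G(x, xi)] = x - theta = f'(x)

k = 20000
x = 0.0
tail_w, tail_x = [], []
for t in range(1, k + 1):
    gamma = 1.0 / np.sqrt(t)   # gamma_t proportional to 1/sqrt(t)
    x = x - gamma * stoch_grad(x)
    if t > k // 2:             # keep the second half of the trajectory
        tail_w.append(gamma)
        tail_x.append(x)
# gamma-weighted average over the tail, i.e. the averaged iterate from the theorem
xbar = np.dot(tail_w, tail_x) / np.sum(tail_w)
```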


Stochastic mirror descent

Complexity?

Stochastic optimization: $\min_{x\in X}\mathbb{E}[F(x, \xi)]$. One $F'(x_t, \xi_t)$ per iteration, $O(1/\epsilon^2)$ iterations in total: optimal sampling complexity.

Deterministic finite-sum optimization: $\min_{x\in X}\big\{f(x) := \tfrac{1}{m}\sum_{i=1}^m f_i(x)\big\}$, with

$|f_i(x) - f_i(y)| \le M_i\|x - y\|, \quad |f(x) - f(y)| \le M\|x - y\|, \quad \forall x, y \in X,$

$M \le \bar{M} \equiv \max_i M_i.$

Iteration complexity