
  • Stochastic Optimization Algorithms for Machine Learning

    Guanghui (George) Lan

    H. Milton Stewart School of Industrial and Systems Engineering, Georgia Tech, USA

    Workshop on Optimization and Learning, CIMI Workshop, Toulouse, France

    September 10-13, 2018


    Background Deterministic methods Stochastic optimization Finite-sum/distributed optimization GEM RGEM Communication sliding Summary

    Machine learning (ML)

    ML exploits optimization, statistics, and high-performance computing, among other techniques, to transform raw data into knowledge in order to support decision-making in various areas, e.g., biomedicine, health care, logistics, energy, and transportation.


    The role of optimization

    Optimization provides theoretical insights and efficient solutionmethods for ML models.


    Typical ML models

    Given a set of observed data S = \{(u_i, v_i)\}_{i=1}^m, drawn from a certain unknown distribution D on U \times V.

    Goal: to describe the relation between u_i and v_i for prediction.

    Applications: predicting strokes and seizures, identifying heart failure, stopping credit card fraud, predicting machine failure, identifying spam, ...

    Classic models:

    Lasso regression: \min_x \mathbb{E}[(\langle x, u\rangle - v)^2] + \lambda\|x\|_1.
    Support vector machine: \min_x \mathbb{E}_{u,v}[\max\{0, -v\langle x, u\rangle\}] + \lambda\|x\|_2^2.
    Deep learning: \min_{\theta, W} \mathbb{E}_{u,v}[(F(\theta^T \sigma(Wu)) - v)^2].
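As a concrete reading of the two convex objectives above, each sample loss is a one-line function of x and a data pair (u, v); a minimal sketch (the regularization weight `lam` and the test values are illustrative, not from the slides):

```python
import numpy as np

def lasso_loss(x, u, v, lam=0.1):
    """Lasso sample loss: (<x, u> - v)^2 + lam * ||x||_1."""
    return (np.dot(x, u) - v) ** 2 + lam * np.abs(x).sum()

def svm_loss(x, u, v, lam=0.1):
    """SVM sample loss: max{0, -v <x, u>} + lam * ||x||_2^2."""
    return max(0.0, -v * np.dot(x, u)) + lam * np.dot(x, x)

x = np.array([1.0, -1.0])
u = np.array([1.0, 0.0])
# For a correctly classified pair (v and <x, u> share a sign),
# only the regularization term remains in the SVM loss.
print(svm_loss(x, u, v=1.0))    # 0.0 + 0.1 * 2 = 0.2
print(lasso_loss(x, u, v=0.5))  # (1 - 0.5)^2 + 0.1 * 2 = 0.45
```

The expectations in the models above are then just averages of these sample losses over the data distribution.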


    Optimization models/methods in ML

    Stochastic optimization: \min_{x\in X}\{f(x) := \mathbb{E}[F(x, \xi)]\}.
    F is the regularized loss function and \xi = (u, v), e.g.,
    F(x, \xi) = (\langle x, u\rangle - v)^2 + \lambda\|x\|_1 or
    F(x, \xi) = \max\{0, -v\langle x, u\rangle\} + \lambda\|x\|_2^2.
    Often called population risk minimization in ML.

    Finite-sum minimization: \min_{x\in X}\{f(x) := \tfrac{1}{N}\sum_{i=1}^N f_i(x)\}.

    Empirical risk minimization: f_i(x) = F(x, \xi_i).
    Distributed/decentralized ML: f_i is the loss for each agent i.


    Outline

    Deterministic first-order methods

    Stochastic optimization methods:
    Stochastic gradient descent (SGD)
    Accelerated SGD (SGD with momentum)
    Nonconvex SGD and its acceleration
    Adaptive and accelerated methods

    Finite-sum and distributed optimization methods:
    Variance-reduced gradient methods
    Primal-dual gradient methods
    Decentralized/distributed SGD

    Uncovered topics: projection-free and second-order methods


    (Sub)Gradient descent

    Problem

    f^* := \min_{x\in X} f(x). Here \emptyset \neq X \subseteq \mathbb{R}^n is a convex set and f is a convex function.

    Basic Idea

    Starting from x_1 \in \mathbb{R}^n, update x_t by x_{t+1} = x_t - \gamma_t \nabla f(x_t), t = 1, 2, \ldots.

    Two essential enhancements

    f may be non-differentiable: replace \nabla f(x_t) by a subgradient g(x_t) \in \partial f(x_t).

    x_{t+1} may be infeasible: project back onto X:

    x_{t+1} := \mathrm{argmin}_{x\in X}\|x - (x_t - \gamma_t g(x_t))\|_2^2, \quad t = 1, 2, \ldots.
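As an illustration (not from the slides), the projected update is a one-line computation once the projection onto X is available; a minimal sketch for f(x) = \|x - c\|_1 over the box [0, 1]^2, where the Euclidean projection is a clip, returning a plain averaged iterate:

```python
import numpy as np

def projected_subgradient(subgrad, project, x1, gammas):
    """x_{t+1} = proj_X(x_t - gamma_t * g(x_t)); returns the averaged iterate."""
    x = np.asarray(x1, dtype=float)
    xs = [x.copy()]
    for gamma in gammas:
        x = project(x - gamma * subgrad(x))
        xs.append(x.copy())
    return np.mean(xs, axis=0)

# Minimize f(x) = ||x - c||_1 over the box [0, 1]^2 with c = (2, -1);
# the minimizer is the clipped point (1, 0).
c = np.array([2.0, -1.0])
subgrad = lambda x: np.sign(x - c)            # a subgradient of f
project = lambda x: np.clip(x, 0.0, 1.0)      # Euclidean projection onto the box
k = 2000
gammas = [1.0 / np.sqrt(t) for t in range(1, k + 1)]  # gamma_t ~ 1/sqrt(t)
xbar = projected_subgradient(subgrad, project, np.zeros(2), gammas)
```

The averaged iterate ends up close to (1, 0), the projection of c onto the box.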


    Interpretation

    From the proximity control point of view,

    x_{t+1} = \mathrm{argmin}_{x\in X} \tfrac{1}{2}\|x - (x_t - \gamma_t g(x_t))\|_2^2
            = \mathrm{argmin}_{x\in X} \gamma_t\langle g(x_t), x - x_t\rangle + \tfrac{1}{2}\|x - x_t\|_2^2
            = \mathrm{argmin}_{x\in X} \gamma_t[f(x_t) + \langle g(x_t), x - x_t\rangle] + \tfrac{1}{2}\|x - x_t\|_2^2
            = \mathrm{argmin}_{x\in X} \gamma_t\langle g(x_t), x\rangle + \tfrac{1}{2}\|x - x_t\|_2^2.

    Implication

    To minimize the linear approximation f(x_t) + \langle g(x_t), x - x_t\rangle of f(x) over X, without moving too far away from x_t.

    The role of stepsize

    \gamma_t controls how much we trust the linear model and depends on which problem class is to be solved.


    Convergence of (Sub)Gradient Descent (GD)

    Nonsmooth problems

    f is M-Lipschitz continuous, i.e., |f(x) - f(y)| \le M\|x - y\|_2.

    Theorem

    Let \bar{x}_s^k := \left(\sum_{t=s}^k \gamma_t\right)^{-1} \sum_{t=s}^k (\gamma_t x_t). Then

    f(\bar{x}_s^k) - f^* \le \left(2\sum_{t=s}^k \gamma_t\right)^{-1}\left[\|x^* - x_s\|_2^2 + M^2 \sum_{t=s}^k \gamma_t^2\right].

    Selection of \gamma_t

    If \gamma_t = \sqrt{D_X^2/(k M^2)} for some fixed k, then f(\bar{x}_1^k) - f^* \le \frac{M D_X}{\sqrt{k}}, where D_X \ge \max_{x_1, x_2 \in X}\|x_1 - x_2\|_2.

    If \gamma_t = \sqrt{D_X^2/(t M^2)}, then f(\bar{x}_{\lceil k/2\rceil}^k) - f^* \le O(1)\,(M D_X/\sqrt{k}).


    Convergence of GD

    Smooth problems

    f is differentiable, and \nabla f is L-Lipschitz continuous, i.e., \|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|_2.

    \exists \mu \ge 0 s.t. f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \tfrac{\mu}{2}\|y - x\|^2.

    Theorem

    Let \gamma \in (0, \tfrac{2}{L}]. If \gamma_t = \gamma, t = 1, 2, \ldots, then

    f(x_k) - f^* \le \frac{\|x_0 - x^*\|^2}{2k\gamma(2 - L\gamma)}.

    Moreover, if \mu > 0 and Q_f = L/\mu, then

    \|x_k - x^*\| \le \left(\tfrac{Q_f - 1}{Q_f + 1}\right)^k \|x_0 - x^*\|,
    f(x_k) - f^* \le \tfrac{L}{2}\left(\tfrac{Q_f - 1}{Q_f + 1}\right)^{2k}\|x_0 - x^*\|^2.
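The linear rate can be checked directly on a quadratic: with the classic stepsize \gamma = 2/(L + \mu) (which lies in (0, 2/L]), each coordinate of this test problem contracts by exactly (Q_f - 1)/(Q_f + 1) per iteration. A minimal sketch (problem constants assumed for illustration):

```python
import numpy as np

# f(x) = 0.5 * x^T diag(d) x: mu = 1, L = 10, so Q_f = 10 and x* = 0.
d = np.array([1.0, 10.0])
mu, L = d.min(), d.max()
grad = lambda x: d * x

gamma = 2.0 / (L + mu)                    # in (0, 2/L], as the theorem requires
rate = (L / mu - 1.0) / (L / mu + 1.0)    # (Q_f - 1)/(Q_f + 1)

x = np.ones(2)
for k in range(50):
    x = x - gamma * grad(x)
# Each coordinate contracts by exactly |1 - gamma * d_i| = rate, so
# ||x_k|| = rate**k * ||x_0||.
```

Doubling Q_f roughly doubles the number of iterations needed for a target accuracy, which is the practical meaning of the condition number in the bound.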


    Adaption to Geometry: Mirror Descent (MD)

    GD is intrinsically linked to the Euclidean structure of Rn:

    The method relies on the Euclidean projection,

    DX , L and M are defined in terms of the Euclidean norm.

    Bregman Distance

    Let \|\cdot\| be a (general) norm on \mathbb{R}^n with dual norm \|x\|_* = \sup_{\|y\|\le 1}\langle x, y\rangle, and let \omega be a continuously differentiable and strongly convex function with modulus \alpha with respect to \|\cdot\|. Define

    V(x, z) = \omega(z) - [\omega(x) + \nabla\omega(x)^T(z - x)].

    Mirror Descent (Nemirovski and Yudin 83, Beck and Teboulle 03)

    x_{t+1} = \mathrm{argmin}_{x\in X} \gamma_t\langle g(x_t), x\rangle + V(x_t, x), \quad t = 1, 2, \ldots.

    GD is a special case of MD: V(x_t, x) = \|x - x_t\|_2^2/2.
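A standard non-Euclidean instance (not spelled out on this slide) takes X to be the probability simplex and \omega(x) = \sum_i x_i \log x_i, which is strongly convex with modulus 1 w.r.t. \|\cdot\|_1 on the simplex; the MD step then has a closed multiplicative form (exponentiated gradient). A minimal sketch:

```python
import numpy as np

def entropic_md_step(x, g, gamma):
    """Mirror-descent step with the entropy prox-function on the simplex:
    argmin_x gamma*<g, x> + V(x_t, x) reduces to a multiplicative update."""
    y = x * np.exp(-gamma * g)
    return y / y.sum()

# Minimize the linear function f(x) = <c, x> over the simplex;
# the minimizer is the vertex at the smallest coordinate of c.
c = np.array([3.0, 1.0, 2.0])
x = np.ones(3) / 3.0
for _ in range(200):
    x = entropic_md_step(x, c, gamma=0.1)
```

The iterates stay on the simplex by construction, with no explicit projection step, which is exactly the point of adapting the prox-function to the geometry of X.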


    Acceleration scheme

    Assume that f is smooth: \|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|.

    Accelerated gradient descent (AGD) (Nesterov 83, 04; Tseng 08)

    Choose \bar{x}_0 = x_0 \in X.
    1) Set \underline{x}_k = (1 - q_k)\bar{x}_{k-1} + q_k x_{k-1}.
    2) Compute \nabla f(\underline{x}_k) and set
       x_k = \mathrm{argmin}_{x\in X}\{q_k\langle\nabla f(\underline{x}_k), x\rangle + \eta_k V(x_{k-1}, x)\},
       \bar{x}_k = (1 - q_k)\bar{x}_{k-1} + q_k x_k.
    3) Set k \leftarrow k + 1 and go to step 1).

    Theorem

    If \eta_k \ge L q_k^2 and \Gamma_k := (1 - q_k)\Gamma_{k-1} for k \ge 1, then

    f(\bar{x}_k) - f(x) \le \Gamma_k(1 - q_1)[f(\bar{x}_0) - f(x)] + \tfrac{\Gamma_k \eta_1}{\Gamma_1} V(x_0, x), \quad \forall x \in X.

    In particular, if q_k = 2/(k+1) and \eta_k = 4L/[k(k+1)], then

    f(\bar{x}_k) - f(x^*) \le \tfrac{4L}{k(k+1)} V(x_0, x^*).
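In the unconstrained Euclidean case, acceleration is often implemented in a common momentum variant (Nesterov 83), not the exact prox form on the slide; a minimal sketch on a synthetic quadratic (problem and constants assumed for illustration), checking the O(L/k^2) behavior:

```python
import numpy as np

# Smooth test problem: f(x) = 0.5 * x^T diag(d) x, with L = max(d), f* = 0.
d = np.array([1.0, 100.0])
L = d.max()
f = lambda x: 0.5 * np.dot(d * x, x)
grad = lambda x: d * x

def agd(grad, L, x0, iters):
    """Accelerated gradient descent in momentum form."""
    x, x_prev = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)  # extrapolation step
        x_prev = x
        x = y - grad(y) / L                           # gradient step at y
    return x

x0 = np.ones(2)
xk = agd(grad, L, x0, 300)
# Theory: f(x_k) - f* = O(L * ||x0 - x*||^2 / k^2), much faster than
# the O(L/k) rate of plain gradient descent on the same problem.
```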


    Acceleration scheme (strongly convex case)

    Assumption: \exists \mu > 0 s.t. f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \tfrac{\mu}{2}\|y - x\|^2.

    AGD for strongly convex problems

    Idea: restart AGD every N = \lceil\sqrt{8L/\mu}\,\rceil iterations.

    The algorithm. Input: p_0 \in X. Phase t = 1, 2, \ldots:

    Set p_t = \bar{x}_N, where \bar{x}_N is obtained from AGD with \bar{x}_0 = p_{t-1}.

    Theorem

    For any t \ge 1, we have \|p_t - x^*\|^2 \le (\tfrac{1}{2})^t \|p_0 - x^*\|^2.

    To have \|p_t - x^*\|^2 \le \epsilon, the total number of iterations is bounded by

    \sqrt{8L/\mu}\,\log_2 \tfrac{\|p_0 - x^*\|^2}{\epsilon}.
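A minimal sketch of the restart loop, using the momentum form of AGD as the inner solver (the quadratic test problem and its constants are assumed for illustration):

```python
import numpy as np

# Strongly convex test problem: f(x) = 0.5 * x^T diag(d) x,
# with mu = min(d), L = max(d), and minimizer x* = 0.
d = np.array([1.0, 100.0])
mu, L = d.min(), d.max()
grad = lambda x: d * x

def agd(grad, L, x0, N):
    """Plain AGD (momentum form), used as the inner loop of each phase."""
    x, x_prev = x0.copy(), x0.copy()
    for k in range(1, N + 1):
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)
        x_prev = x
        x = y - grad(y) / L
    return x

N = int(np.ceil(np.sqrt(8.0 * L / mu)))   # restart period from the slide
p = np.ones(2)                            # p_0
p0_sq = np.dot(p, p)
for t in range(10):                       # each phase at least halves ||p_t - x*||^2
    p = agd(grad, L, p, N)
```

Since the inner run contracts the squared distance by a constant factor independent of the target accuracy, the total iteration count grows only logarithmically in 1/\epsilon.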


    Stochastic optimization problems

    The Problem: \min_{x\in X}\{f(x) := \mathbb{E}[F(x, \xi)]\}.

    Challenge: computing exact (sub)gradients is computationally prohibitive.

    Stochastic oracle

    At iteration t, with x_t \in X as input, the SO outputs a vector G(x_t, \xi_t), where \{\xi_t\}_{t\ge 1} are i.i.d. random variables s.t.

    \mathbb{E}[G(x_t, \xi_t)] = g(x_t) \in \partial f(x_t).

    Examples:

    f(x) = \mathbb{E}[F(x, \xi)]: G(x_t, \xi_t) = \nabla F(x_t, \xi_t), \xi_t being a random realization of \xi.

    f(x) = \sum_{i=1}^m f_i(x)/m: G(x_t, i_t) = \nabla f_{i_t}(x_t), i_t being a uniform random variable on \{1, \ldots, m\}.
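The finite-sum oracle in the second example can be checked for unbiasedness directly: averaging \nabla f_i(x) over all indices i must recover \nabla f(x). A minimal sketch with least-squares components f_i(x) = \tfrac{1}{2}(a_i^T x - b_i)^2 (data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

def full_grad(x):
    """Gradient of f(x) = (1/m) * sum_i 0.5 * (a_i^T x - b_i)^2."""
    return A.T @ (A @ x - b) / m

def oracle(x):
    """G(x, i_t) = grad f_{i_t}(x) for i_t uniform on {1, ..., m}: unbiased."""
    i = rng.integers(m)
    return A[i] * (A[i] @ x - b[i])

# Unbiasedness: the average of the oracle over all m indices is the full gradient.
x = rng.normal(size=n)
avg = np.mean([A[i] * (A[i] @ x - b[i]) for i in range(m)], axis=0)
```

One oracle call costs O(n) here versus O(mn) for the exact gradient, which is the computational point of the stochastic oracle model.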


    stochastic mirror descent

    Mirror descent stochastic approximation (MDSA)

    The algorithm: replace the exact linear model in MD with its stochastic approximation (goes back to Robbins and Monro 51, Nemirovski and Yudin 83).

    x_{t+1} = \mathrm{argmin}_{x\in X} \gamma_t\langle G_t, x\rangle + V(x_t, x), \quad t = 1, 2, \ldots

    Theorem (Nemirovski, Juditsky, Lan and Shapiro 07 (09))

    Assume \mathbb{E}[\|G(x, \xi)\|_*^2] \le M^2. Then

    \mathbb{E}[f(\bar{x}_s^k)] - f^* \le \left(\sum_{t=s}^k \gamma_t\right)^{-1}\left[\mathbb{E}[V(x_s, x^*)] + (2\alpha)^{-1} M^2 \sum_{t=s}^k \gamma_t^2\right].

    The selection of \gamma_t

    If \gamma_t = \sqrt{2\alpha\Omega/(k M^2)} for some fixed k, then \mathbb{E}[f(\bar{x}_1^k) - f^*] \le M\sqrt{2\Omega/(\alpha k)}, where \Omega \ge \max_{x_1, x_2 \in X} V(x_1, x_2).

    If \gamma_t = \sqrt{2\alpha\Omega/(t M^2)}, then \mathbb{E}[f(\bar{x}_{\lceil k/2\rceil}^k) - f^*] \le O(1)\,(M\sqrt{\Omega/(\alpha k)}).
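A Euclidean instance of MDSA (V(x, z) = \tfrac{1}{2}\|z - x\|_2^2, i.e. plain SGD) with the second stepsize rule, averaging the last half of the iterates; the least-squares problem and all constants below are synthetic, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 10
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true + 0.1 * rng.normal(size=m)

def f(x):
    """Average least-squares objective."""
    return 0.5 * np.mean((A @ x - b) ** 2)

x = np.zeros(n)
k = 5000
tail = []
for t in range(1, k + 1):
    i = rng.integers(m)                   # sample xi_t
    g = A[i] * (A[i] @ x - b[i])          # stochastic gradient G(x_t, xi_t)
    x = x - 0.1 / np.sqrt(t) * g          # gamma_t proportional to 1/sqrt(t)
    if t > k // 2:
        tail.append(x.copy())
xbar = np.mean(tail, axis=0)              # averaged iterate, as in the theorem
```

Averaging the tail iterates is what tames the noise: the last raw iterate keeps fluctuating at the scale of \gamma_t, while the average converges at the O(1/\sqrt{k}) rate of the theorem.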


    stochastic mirror descent

    Complexity?

    Stochastic optimization: \min_{x\in X} \mathbb{E}[F(x, \xi)].
    One \nabla F(x_t, \xi_t) per iteration; O(1/\epsilon^2) iterations in total.
    Optimal sampling complexity.

    Deterministic finite-sum optimization: \min_{x\in X}\{f(x) := \tfrac{1}{m}\sum_{i=1}^m f_i(x)\}

    |f_i(x) - f_i(y)| \le M_i\|x - y\|, \quad |f(x) - f(y)| \le M\|x - y\|, \quad \forall x, y \in X

    M \le \bar{M} \equiv \max_i M_i

    Iteration complexity
