
Page 1

Offline and Online Optimization in Machine Learning

Peter Sunehag

ANU

September 2010

Page 2

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 3

Steepest descent

- We want to find the minimum of a real-valued function f : R^n → R.

- We will primarily discuss optimization using gradients.

- Gradient descent is also called the steepest descent method (a minimal sketch follows below).
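To make the update rule concrete, here is a minimal sketch in Python. The quadratic test function, the fixed step size, and all names are assumptions for illustration, not part of the slides.

```python
import numpy as np

def gradient_descent(grad, w0, step=0.1, iters=100):
    """Steepest descent: repeatedly move against the gradient, w <- w - step * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - step * grad(w)
    return w

# Example: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
print(gradient_descent(lambda w: w, w0=[3.0, -2.0]))  # ends up close to the minimizer [0, 0]
```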

Page 4

Problem: Curvature

- If the level sets are very curved (far from round), gradient descent might take a long route to the optimum.

- Newton's method takes curvature into account and "corrects" for it. It relies on calculating the inverse Hessian (second derivatives) of the objective function.

- Quasi-Newton methods estimate the "correction" by looking at the path so far. (In the figure: red is Newton, green is gradient descent.)

Page 5

Problem: Zig-Zag

- Gradient descent has a tendency to zig-zag.

- This is particularly problematic in valleys.

- One solution: introduce momentum (see the sketch below).

- Another solution: conjugate gradients.
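A generic heavy-ball style momentum update, as a hedged sketch; the ill-conditioned quadratic, the step size, and the momentum coefficient are assumptions, and this is not claimed to be the exact scheme behind the slide's figure.

```python
import numpy as np

def gd_momentum(grad, w0, step=0.005, beta=0.9, iters=200):
    """Gradient descent with momentum: a velocity term accumulates past gradients."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(iters):
        v = beta * v - step * grad(w)   # momentum smooths out alternating directions
        w = w + v
    return w

# Ill-conditioned quadratic f(w) = 0.5 * (100 * w[0]^2 + w[1]^2), a classic zig-zag example.
grad = lambda w: np.array([100.0 * w[0], w[1]])
print(gd_momentum(grad, [1.0, 1.0]))
```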

Page 6

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 7

Optimization in Machine Learning

- Data is becoming "unlimited" in many areas.

- Computational resources are not.

- Taking advantage of the data requires efficient optimization.

- Machine learning researchers have realized that we are in a different situation compared to the contexts in which the optimization (say, convex optimization) field developed.

- Some of those methods and theories are still useful.

- Different ultimate goal.

- We optimize surrogates for what we really want.

- We also often have much higher dimensionality than other applications.

Page 8

Optimization in Machine Learning

- Known training data A, unknown test data B.

- We want optimal performance on the test data.

- Alternatively, we have streaming data (or pretend that we do).

- Given a loss function L(z, w) (parameters w ∈ W, data sample(s) z), we want the loss ∑_{z∈B} L(z, w) on the test set to be as low as possible.

- Since we do not have access to the test set, we minimize the regularized empirical loss on the training set: ∑_{z∈A} L(z, w) + λR(w).

- Getting extremely close to the minimum of this objective might not increase performance on the test set.

- In machine learning, the optimization methods of interest are those that quickly get a good estimate w_k of the optimum w*.

Page 9

Optimization in Machine Learning

- How large must k be to get ε_k = f(w_k) − f(w*) ≤ ε for a given ε > 0?

- Is this number like 1/ε, 1/ε², 1/√ε, log(1/ε)?

- Is ε_k like 1/k, 1/k², etc.?

- How long does an iteration take?

- Quasi-Newton methods (BFGS, L-BFGS) are rapid when we are already close to the optimum, but otherwise not better than methods with much cheaper iterations (gradient descent).

- We care most about the early phase, when we are not close to the optimum.

Page 10

Examples

- Linear classifier (classes ±1): f(x) = sgn(w^T x + b).

- Training data {(x_i, y_i)}.

- SVMs (hinge loss): min_w ∑_i max(0, 1 − y_i(w^T x_i + b)) + λ‖w‖₂²

- Logistic multi-class: min_w ∑_i −log( exp(w_{y_i}^T x_i) / ∑_l exp(w_l^T x_i) ) + λ‖w‖₂²

- Decision: l* = argmax_l w_l^T x.

- Different regularizers, e.g. ‖w‖₁, encourage sparsity.

- Hinge loss and ℓ1 regularization cause non-differentiability on null sets (a toy evaluation of these objectives follows below).
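As a concrete illustration, the following sketch evaluates regularized hinge and logistic objectives on toy data; the data, λ, the use of a mean instead of a sum, and the binary rather than multi-class logistic loss are all assumptions for the example.

```python
import numpy as np

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]])   # toy inputs
y = np.array([1, -1, -1])                               # labels in {+1, -1}
lam = 0.1                                               # regularization strength

def hinge_objective(w, b):
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + lam * np.dot(w, w)

def logistic_objective(w, b):
    margins = y * (X @ w + b)
    return np.mean(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

w, b = np.array([0.2, -0.1]), 0.0
print(hinge_objective(w, b), logistic_objective(w, b))
```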

Page 11

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 12

Gradients

- We will make use of gradients with respect to a given coordinate system and its associated Euclidean geometry.

- If f(x) is a real-valued function that is differentiable at x = (x_1, ..., x_d), then

  ∇_x f = (∂f/∂x_1, ..., ∂f/∂x_d)

- The gradient points in the direction of steepest ascent.

Page 13

Subgradients

- If f(x) is a convex real-valued function, and if g ∈ R^d is such that

  f(y) − f(x) ≥ g^T(y − x)

  for all y in some open ball around x, then g is in the subgradient set of f at x.

- If f is differentiable at x, then ∇f(x) is the only element of the subgradient set.

- A line with slope g can be drawn through (x, f(x)) that touches the graph of f at x and lies below it elsewhere (see the example below).
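For example, f(x) = |x| is convex but not differentiable at 0, and any g ∈ [−1, 1] is a subgradient there. A small sketch (the function choice and names are assumptions):

```python
def subgradient_abs(x):
    """Return one element of the subgradient set of f(x) = |x|."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0   # at x = 0, any value in [-1, 1] is a valid subgradient

# Check the defining inequality f(y) - f(x) >= g * (y - x) at x = 0 with g = 0.
x, g = 0.0, subgradient_abs(0.0)
print(all(abs(y) - abs(x) >= g * (y - x) for y in [-2.0, -0.5, 0.3, 1.7]))  # True
```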

Page 14

Convex set

- A set X ⊂ R^n is convex if for all x, y ∈ X and 0 ≤ λ ≤ 1,

  (1 − λ)x + λy ∈ X

Page 15

Convex function

- A function f : R^n → R is convex if ∀λ ∈ [0, 1] and x, y ∈ R^n,

  f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y).

- The set of points on or above the function's graph is called its epigraph.

- A function is convex iff its epigraph is a convex set.

- A convex function is called closed if its epigraph is closed.

Page 16

Strong Convexity

- A function f : R^n → R is ρ-strongly convex (ρ > 0) if ∀λ ∈ [0, 1] and x, y ∈ R^n,

  f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y) − (ρ/2)λ(1 − λ)‖x − y‖₂²

- or equivalently, if f(x) − (ρ/2)‖x‖₂² is convex

- or equivalently, if f(y) ≥ f(x) + <∇f(x), y − x> + (ρ/2)‖x − y‖₂²

- or equivalently (if twice differentiable), ∇²f(x) ≥ ρI (I the identity matrix)

- (In the figure: green (y = x^4) is too flat near the origin; blue (y = x²) is not.)

Page 17

Lipschitz Continuity

- f is L-Lipschitz if

  |f(x) − f(y)| ≤ L‖x − y‖₂

- Lipschitz continuity implies continuity but not differentiability.

- If f is differentiable, then bounded gradient norms imply Lipschitz continuity.

- The secant slopes are bounded by L.

Page 18

Lipschitz Continuous Gradients

- If f is differentiable, then having L-Lipschitz gradients (L-LipG) means that

  ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂

  which is equivalent to

  f(y) ≤ f(x) + <∇f(x), y − x> + (L/2)‖x − y‖₂²

  which, if f is twice differentiable, is equivalent to ∇²f(x) ≤ L·I.

Page 19

Legendre transform

- Transformation between points and lines (the set of tangents).

- One-dimensional functions (generalizes to the Fenchel transform).

- f*(p) = max_x (px − f(x)).

- If f is differentiable, the maximizing x has the property that f'(x) = p.

Page 20

Fenchel duality

- The Fenchel dual of a function f is

  f*(s) = sup_x (<s, x> − f(x))

- If f is a closed convex function, then f** = f.

- f*(0) = max_x (−f(x)) = −min_x f(x)

- ∇f*(s) = argmax_x (<s, x> − f(x)) and ∇f(x) = argmax_s (<s, x> − f*(s)).

- ∇f*(∇f(x)) = x and ∇f(∇f*(s)) = s (a small numeric check follows below).
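A quick numeric sanity check of the definition (an illustrative sketch, not from the slides): for f(x) = ½x² the dual is f*(s) = ½s², approximated here by a sup over a finite grid.

```python
import numpy as np

def fenchel_dual(f, s, grid):
    """Approximate f*(s) = sup_x (s*x - f(x)) by taking the sup over a finite grid."""
    return np.max(s * grid - f(grid))

f = lambda x: 0.5 * x ** 2
grid = np.linspace(-10.0, 10.0, 200001)

for s in [-2.0, 0.0, 1.5, 3.0]:
    print(s, fenchel_dual(f, s, grid), 0.5 * s ** 2)   # numeric sup vs. the closed form s^2/2
```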

Page 21

Duality between strong convexity and Lipschitz gradients

- f is ρ-strongly convex iff f* has (1/ρ)-Lipschitz gradients.

- f has L-Lipschitz gradients iff f* is (1/L)-strongly convex.

- If a non-differentiable function f can be identified as g*, we can "smooth it out" by considering (g + δ‖·‖₂²)*, which is differentiable and has (1/δ)-Lipschitz gradients.

- (δ = µ/2 in the accompanying figure.)

Page 22

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 23

(Sub)Gradient Methods

- Gradient descent: w_{t+1} = w_t − a_t g_t, where g_t is a (sub)gradient of the objective f at w_t.

- a_t can be decided by line search or by a predetermined scheme.

- If we are constrained to a convex set X, then use the projected update (see the sketch below)

  w_{t+1} = Π_X(w_t − a_t g_t) = argmin_{w∈X} ‖w − (w_t − a_t g_t)‖₂
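A minimal projected (sub)gradient sketch, assuming the feasible set X is a Euclidean ball of radius r (so Π_X is a simple rescaling); the objective, radius, and step sizes are illustrative assumptions.

```python
import numpy as np

def project_ball(w, r):
    """Euclidean projection onto the set {w : ||w||_2 <= r}."""
    norm = np.linalg.norm(w)
    return w if norm <= r else (r / norm) * w

def projected_subgradient(subgrad, w0, r, steps):
    """w_{t+1} = Pi_X(w_t - a_t * g_t) with X the ball of radius r."""
    w = np.asarray(w0, dtype=float)
    for a_t in steps:
        w = project_ball(w - a_t * subgrad(w), r)
    return w

# Minimize f(w) = ||w - c||_1 over the unit ball; a subgradient is sign(w - c).
c = np.array([2.0, -0.5])
w = projected_subgradient(lambda w: np.sign(w - c), [0.0, 0.0], r=1.0,
                          steps=[0.5 / np.sqrt(t) for t in range(1, 501)])
print(w, np.linalg.norm(w))   # the iterate stays inside the unit ball
```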

Page 24

First and Second order information

- First order information: ∇f(w_t), t = 0, 1, 2, ..., n.

- Second order information is about ∇²f and is expensive to compute and store for high dimensional problems.

- Newton's method is a second order method where w_{t+1} = w_t − H_t^{-1} g_t (a sketch follows below).

- Quasi-Newton methods use first order information to try to approximate Newton's method.

- Conjugate gradient methods also use first order information.
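For comparison with the plain gradient step, here is a bare-bones Newton iteration; the quadratic test problem and the explicit Hessian are assumptions of this sketch.

```python
import numpy as np

def newton(grad, hess, w0, iters=10):
    """Newton's method: w <- w - H(w)^{-1} * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - np.linalg.solve(hess(w), grad(w))   # solve H d = g rather than forming H^{-1}
    return w

# f(w) = 0.5 * w^T A w with an ill-conditioned A: Newton lands on the optimum in one step.
A = np.array([[100.0, 0.0], [0.0, 1.0]])
print(newton(lambda w: A @ w, lambda w: A, [1.0, 1.0], iters=1))   # ~[0, 0]
```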

Page 25

Lower Bounds

- There are lower bounds on how well a first order method, i.e. a method for which w_{k+1} − w_k ∈ span(∇f(w_0), ..., ∇f(w_k)), can do. They depend on the function class!

- Given w_0 (and L, ρ), the bound says that for any ε > 0 there exists an f in the class such that any first order method takes at least n_ε steps, and a statement about n_ε is made.

- Differentiable functions with Lipschitz continuous gradients: lower bound is O(√(1/ε)).

- Lipschitz continuous gradients and strong convexity: lower bound is O(log(1/ε)).

Page 26

Gradient Descent with Lipschitz gradients

- If we have L-Lipschitz continuous gradients, then define gradient descent by

  w_{k+1} = w_k − (1/L)∇f(w_k).

- Monotone: f(w_{k+1}) ≤ f(w_k).

- If f is also ρ-strongly convex (ρ > 0), then

  f(w_k) − f(w*) ≤ (1 − ρ/L)^k (f(w_0) − f(w*)).

- This result translates into the optimal O(log(1/ε)) rate for first order methods.

Page 27

Gradient Descent without strong convexity

- Suppose that we do not have strong convexity (ρ = 0) but we have L-LipG.

- Then the gradient descent procedure w_{k+1} = w_k − (1/L)∇f(w_k) has rate O(1/ε): f(w_k) − f(w*) ≤ (L/k)‖w* − w_0‖².

- The lower bound for this class is O(√(1/ε)).

- There is a gap!

- The question arose whether the bound was loose or whether there is a better first order method.

Page 28

Nesterov's methods

- Nesterov answered the question in 1983 by constructing an "Optimal Gradient Method" with the rate O(√(1/ε)) (O(√(L/ε)) if we have not fixed L).

- Later variations (1988, 2005) widened the applicability.

- Introduces a momentum term.

- w_{k+1} = y_k − ∇f(y_k), y_{k+1} = w_k + α_k(w_k − w_{k−1})

- α_k = (β_k − 1)/β_{k+1}, β_{k+1} = (1 + √(4β_k² + 1))/2.

- Iteration cost similar to gradient descent (a hedged sketch follows below).
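A sketch of the accelerated scheme using the β-recursion above; the 1/L gradient step at the extrapolated point, the quadratic test problem, and the fixed iteration count are assumptions of this sketch rather than a verbatim transcription of the slide.

```python
import numpy as np

def nesterov(grad, w0, L, iters=200):
    """Accelerated gradient: extrapolate with momentum, then take a 1/L gradient step."""
    w_prev = w = np.asarray(w0, dtype=float)
    beta = 1.0
    for _ in range(iters):
        beta_next = (1.0 + np.sqrt(4.0 * beta ** 2 + 1.0)) / 2.0
        alpha = (beta - 1.0) / beta_next
        y = w + alpha * (w - w_prev)          # momentum / extrapolation step
        w_prev, w = w, y - grad(y) / L        # gradient step at the extrapolated point
        beta = beta_next
    return w

# Quadratic f(w) = 0.5 * w^T A w; L is the largest eigenvalue of A.
A = np.diag([100.0, 1.0])
print(nesterov(lambda w: A @ w, [1.0, 1.0], L=100.0))
```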

Page 29

Non-differentiable objectives: Subgradient descent

- Consider continuous but not necessarily differentiable objectives.

- Given a subgradient g_k at w_k we can perform an update

  w_{k+1} = w_k − η_k g_k.

- In the typical machine learning situation we only have non-differentiability on a set of Lebesgue measure 0, so g_k will actually often be a gradient.

- However, the gradients "jump" at the non-differentiable points, so we cannot have Lipschitz continuity of a (sub)gradient field in this situation.

- Subgradient descent has the bad rate O(1/ε²).

Page 30

Nesterov’s smoothing trick

- If we have a continuous non-differentiable function f that can be identified with g*, then we can use the following trick.

- Given a desired accuracy ε > 0, apply Nesterov's method to h = (g + δ‖·‖₂²)* where δ = ε/2, but go to precision ε/2.

- h has L-Lipschitz gradients where L = 1/δ.

- f(w) − h(w) ≤ δ = ε/2.

- Combined error at most ε.

- Since h has Lipschitz gradients it takes O(√(L/ε)) = O(1/ε) iterations (a small numeric illustration follows below).
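A hypothetical numeric illustration with f(x) = |x|, which is g* for g the indicator of [−1, 1]; the grid-based sup and the value of δ are assumptions of this sketch.

```python
import numpy as np

# Smooth f(x) = |x| by forming h = (g + delta * ||.||^2)*, i.e.
# h(x) = sup_{|s| <= 1} (s * x - delta * s^2), computed here by a sup over a grid.
delta = 0.1
s = np.linspace(-1.0, 1.0, 100001)            # grid over the domain of g
for x in np.linspace(-2.0, 2.0, 9):
    f = abs(x)                                 # original non-smooth objective
    h = np.max(s * x - delta * s ** 2)         # smoothed objective
    print(f"x = {x:+.2f}   f = {f:.4f}   h = {h:.4f}   f - h = {f - h:.4f}")
# f - h stays between 0 and delta, and h is differentiable with gradients whose
# Lipschitz constant is at most 1/delta, as on the slide.
```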

Page 31

Summary of offline gradient methods

- Normal gradient descent has the optimal rate, O(log(1/ε)), for strongly convex objectives with Lipschitz gradients.

- Normal gradient descent is not optimal (it is O(1/ε)) for convex but not strongly convex objectives. Nesterov's gradient method is optimal (O(√(1/ε))).

- For non-differentiable objectives, subgradient descent is in general O(1/ε²).

- By utilizing special properties this can be improved. If the objective can be identified with the dual of another function, we can use the smoothing trick and get O(1/ε).

Page 32

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 33

Online Optimization

- I will consider optimization problems of the form C(w) = E_z L(z, w).

- If the underlying distribution for z is unknown, then we do not know C either.

- We observe a sequence z_1, z_2, ... and we use that information to update parameters w_t.

Page 34

Performance Measures

- There are different ways of measuring performance.

- C(w_t), ‖w_t − w*‖ (if there exists a unique minimum w*)

- ∑_{i=1}^t L(z_i, w_i), or ∑_{i=1}^t L(z_i, w_i) − inf_w ∑_{i=1}^t L(z_i, w)

- Interesting function classes (C and L).

- E.g. L(z, w) = ‖w − z‖₂².

Page 35

Stochastic Gradient Descent

- When an empirical average C_n(w) = (1/n) ∑_{i=1}^n L(z_i, w) is minimized, "batch" optimizers have to calculate the entire sum for each evaluation of C_n(w) and ∇_w C_n(w). (The gradient is taken with respect to a chosen metric, e.g. the standard Euclidean one w.r.t. a given coordinate system.)

- Stochastic gradient methods work with gradient estimates obtained from subsets of the training data.

- On large redundant data sets, simple stochastic gradient descent (SGD) typically outperforms second order "batch" methods by orders of magnitude.

- SGD is defined by w_{t+1} = w_t − a_t Y_t, where E(Y_t) = ∇_w C(w_t) and a_t > 0 (a sketch follows below).

- There have been many different attempts at using second order information.

- w_{t+1} = w_t − a_t B_t Y_t, where B_t is a positive scaling matrix.
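A minimal SGD sketch over a stream of samples, assuming a squared-error loss L(z, w) = ½‖w − z‖₂² (a scaled version of the example a few slides back, so the per-sample gradient is w − z) and a 1/t step size; the synthetic data and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_sample, w0, samples, step_fn):
    """SGD: w <- w - a_t * Y_t, where Y_t is the gradient on one fresh sample."""
    w = np.asarray(w0, dtype=float)
    for t, z in enumerate(samples, start=1):
        w = w - step_fn(t) * grad_sample(w, z)
    return w

# With L(z, w) = 0.5 * ||w - z||^2 the per-sample gradient is (w - z), and the
# minimizer of C(w) = E_z L(z, w) is the mean of the distribution of z.
samples = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(10000, 2))
w_hat = sgd(lambda w, z: w - z, [0.0, 0.0], samples, step_fn=lambda t: 1.0 / t)
print(w_hat)   # close to the true mean [1, -2]
```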

Page 36

Some Historical Landmarks

- Robbins and Monro (1951) introduced one-dimensional SGD and proved a convergence theorem.

- Blum (1954) generalized this to the multivariate case, but with restrictive assumptions.

- Robbins and Siegmund (1971) achieved a more general result for the multivariate case using super-martingale theory.

- Lai and Robbins (1979) investigated one-dimensional "adaptive procedures" (step sizes not predetermined) by introducing estimated curvature.

- Wei (1987) did the multivariate case (proving asymptotic efficiency).

- More on second order methods: Amari 1998, Murata 1998, Bottou and LeCun 2005.

- Second order information only changes the convergence by a constant factor; the rate remains the same.

- Hazan et al. (2006) show logarithmic (instead of square root) regret for general online convex optimization with a Newton step.

Page 37

Conditions for SGD convergence

- The cost function C(w) = E_z L(z, w) is three times continuously differentiable and bounded from below (this can be replaced by other assumptions, e.g. a bound on the subgradient sets).

- ∑_t a_t² < ∞, ∑_t a_t = ∞.

- E_z(L(z, w)²) ≤ A + B‖w‖₂²

- inf_{‖w‖>D} w · ∇_w C(w) > 0

- sup_{‖w‖<E} ‖L(z, w)‖ ≤ K

- Conclusion: ‖w_t − w*‖₂² = O(1/t) a.s.

- Addition: updates with scaling matrices converge a.s. if the eigenvalues are uniformly bounded from below by a strictly positive constant and from above by a finite constant (Sunehag et al. 2009).

Page 38

Step sizes

- The Robbins-Monro conditions ∑_t a_t² < ∞, ∑_t a_t = ∞ are satisfied, e.g., by a_t = 1/t.

- a_t = b/(t+b) (or 1/(t+b)) also works: it is halved after b steps, giving an initial search phase with a rather flat schedule.

- Step sizes that do not satisfy the Robbins-Monro conditions are also of interest.

- Constant step sizes get us to a ball around the optimum whose radius depends on this constant. This is also good for tracking a changing environment.

- Intermediate alternatives like 1/√t.

- Various adaptive (data-dependent) schemes have been tried, e.g. Stochastic Meta-Descent by Schraudolph. The verdict is unclear: they show improvement in special cases but cause extra computation. (The simple schedules are sketched below.)
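The simple (non-adaptive) schedules side by side, as a sketch; the constants a0, b, and the constant step are placeholders.

```python
import math

def step_size(t, scheme="one_over_t", a0=1.0, b=10.0, const=0.01):
    """A few of the step-size schedules discussed above (t = 1, 2, ...)."""
    if scheme == "one_over_t":         # satisfies the Robbins-Monro conditions
        return a0 / t
    if scheme == "shifted":            # a0 * b / (t + b): flatter initial search phase
        return a0 * b / (t + b)
    if scheme == "sqrt":               # intermediate alternative 1/sqrt(t)
        return a0 / math.sqrt(t)
    if scheme == "constant":           # reaches a ball around the optimum; good for tracking
        return const
    raise ValueError(scheme)

print([round(step_size(t, "shifted"), 3) for t in (1, 10, 100)])
```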

Page 39

Regret bounds

- Finite time analysis.

- The regret is ∑_{i=1}^t L(z_i, w_i) − inf_w ∑_{i=1}^t L(z_i, w).

- Under a strong convexity assumption one can prove that SGD with the right step sizes, 1/(ρt), has logarithmic regret, which is the best possible. This depends on knowing ρ.

- Without strong convexity it is O(√t).

- Using Newton steps, one can still have logarithmic regret.

- See the later online learning lecture.

Page 40

Second order methods

- w_t − w_{t−1} ≈ H_{w_t}^{-1}(∇C(w_t) − ∇C(w_{t−1})) (used by quasi-Newton methods).

- Estimating full Hessians (various methods) gives cost of order d² per step instead of d. Prohibitive for high dimensional problems.

- Low rank approximations (online LBFGS, Schraudolph et al. 2007) can cost order kd, where k is the memory length.

- The convergence speed improves by at most a constant factor, while the cost increases by a constant factor.

Page 41

Second order on the cheap

- Online LBFGS with memory 10 multiplies the cost per step by at least 11. It is only worth it for really ill-conditioned problems.

- Becker and LeCun (1989) estimated a diagonal scaling matrix for neural nets.

- SGD-QN by Bordes, Bottou, and Gallinari won the wild track of the PASCAL Large Scale Learning Challenge 2008 by estimating a diagonal scaling matrix (for SVMs) with an LBFGS-inspired method.

Page 42

Support Vector Machines with SGD

- SVMs (hinge loss): min_w (1/m) ∑_{i=1}^m max(0, 1 − y_i(w^T x_i)) + λ‖w‖₂²

- Instantaneous objective: max(0, 1 − y_i(w_t^T x_i)) + λ‖w_t‖₂², if we have drawn sample i at the current time t.

- (Sub)gradient: if the example is correctly classified with margin y_i(w_t^T x_i) ≥ 1, then g_t = 2λw_t; otherwise g_t = 2λw_t − y_i x_i.

- λ-strong convexity suggests step size 1/(λt) or 1/(λ(t + t_0)) (a sketch follows below).
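A hedged sketch of this SGD for the hinge-loss objective, using the subgradient case split above and the 1/(λ(t + t_0)) step; the toy data, λ, t_0, and the iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def svm_sgd(X, y, lam, iters=5000, t0=2):
    """SGD on the instantaneous objective max(0, 1 - y_i w^T x_i) + lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    for t in range(1, iters + 1):
        i = rng.integers(len(y))                 # draw one sample
        step = 1.0 / (lam * (t + t0))            # step suggested by strong convexity
        if y[i] * (w @ X[i]) >= 1.0:             # correct with margin: only the regularizer acts
            g = 2.0 * lam * w
        else:                                    # hinge term is active
            g = 2.0 * lam * w - y[i] * X[i]
        w = w - step * g
    return w

# Toy (roughly separable) data, purely for illustration.
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w = svm_sgd(X, y, lam=0.1)
print(np.mean(np.sign(X @ w) == y))   # training accuracy of the learned linear classifier
```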

Page 43

Support Vector Machines with Pegasos

- Note: plugging in w = 0 makes the objective equal to 1. If we have a w with ‖w‖ ≥ 1/√λ, then the objective is at least 1.

- Conclusion: we only have to consider ‖w‖₂ ≤ 1/√λ.

- This is utilized by the Pegasos algorithm, which alternates the gradient step above with a projection (scaling) onto this ball (sketched below).

- Proven O(1/ε).
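The extra Pegasos step is just a rescaling onto that ball. A one-function sketch (the λ value is a placeholder), which would be applied after each (sub)gradient step in a loop like the one above:

```python
import numpy as np

def pegasos_project(w, lam):
    """Scale w back onto the ball ||w||_2 <= 1/sqrt(lam), where the optimum must lie."""
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

# Example use inside an SGD loop:  w = pegasos_project(w - step * g, lam)
print(pegasos_project(np.array([5.0, 5.0]), lam=0.1))   # rescaled to norm 1/sqrt(0.1)
```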

Page 44

Summary

- Simple methods with cheap updates can be the best idea for parameter estimation in machine learning.

- This is particularly true in large scale learning.

- With SGD, the time per iteration does not depend on the data set size. Having more data just means that we do not rerun on old data.

- Faster progress is made when using "fresh" data.

- More data, therefore, means LESS runtime (for SGD) until we reach the desired true accuracy.

Page 45

Literature

- Numerical Optimization, Jorge Nocedal and Stephen J. Wright, Springer, August 2000.

- Introductory Lectures on Convex Optimization: A Basic Course, Yurii Nesterov, Springer, 2003.

- Convex Analysis, R. Tyrrell Rockafellar, Princeton University Press, 1996.

- Convex Optimization, Stephen Boyd and Lieven Vandenberghe, Cambridge University Press, 2004.

- L. Bottou's SGD page: http://leon.bottou.org/research/stochastic