
Page 1

Offline and Online Optimization in Machine Learning

Peter Sunehag

ANU

September 2010

Page 2

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 3

Steepest descent

- We want to find the minimum of a real-valued function f : R^n → R.

- We will primarily discuss optimization using gradients.

- Gradient descent is also called the steepest descent method (a minimal sketch follows below).
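To make the update rule concrete, here is a minimal sketch in Python. The quadratic test function, the fixed step size, and all names are assumptions for illustration, not part of the slides.

```python
import numpy as np

def gradient_descent(grad, w0, step=0.1, iters=100):
    """Steepest descent: repeatedly move against the gradient, w <- w - step * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - step * grad(w)
    return w

# Example: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
print(gradient_descent(lambda w: w, w0=[3.0, -2.0]))  # ends up close to the minimizer [0, 0]
```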

Page 4

Problem: Curvature

- If the level sets are very curved (far from round), gradient descent might take a long route to the optimum.

- Newton's method takes curvature into account and "corrects" for it. It relies on calculating the inverse Hessian (second derivatives) of the objective function.

- Quasi-Newton methods estimate the "correction" by looking at the path so far. (In the figure: red is Newton, green is gradient descent.)

Page 5

Problem: Zig-Zag

- Gradient descent has a tendency to zig-zag.

- This is particularly problematic in valleys.

- One solution: introduce momentum (see the sketch below).

- Another solution: conjugate gradients.
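A generic heavy-ball style momentum update, as a hedged sketch; the ill-conditioned quadratic, the step size, and the momentum coefficient are assumptions, and this is not claimed to be the exact scheme behind the slide's figure.

```python
import numpy as np

def gd_momentum(grad, w0, step=0.005, beta=0.9, iters=200):
    """Gradient descent with momentum: a velocity term accumulates past gradients."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(iters):
        v = beta * v - step * grad(w)   # momentum smooths out alternating directions
        w = w + v
    return w

# Ill-conditioned quadratic f(w) = 0.5 * (100 * w[0]^2 + w[1]^2), a classic zig-zag example.
grad = lambda w: np.array([100.0 * w[0], w[1]])
print(gd_momentum(grad, [1.0, 1.0]))
```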

Page 6

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 7

Optimization in Machine Learning

- Data is becoming "unlimited" in many areas.

- Computational resources are not.

- Taking advantage of the data requires efficient optimization.

- Machine learning researchers have realized that we are in a different situation compared to the contexts in which the optimization (say, convex optimization) field developed.

- Some of those methods and theories are still useful.

- Different ultimate goal.

- We optimize surrogates for what we really want.

- We also often have much higher dimensionality than other applications.

Page 8

Optimization in Machine Learning

- Known training data A, unknown test data B.

- We want optimal performance on the test data.

- Alternatively, we have streaming data (or pretend that we do).

- Given a loss function L(z, w) (parameters w ∈ W, data sample(s) z), we want the loss ∑_{z∈B} L(z, w) on the test set to be as low as possible.

- Since we do not have access to the test set, we minimize the regularized empirical loss on the training set: ∑_{z∈A} L(z, w) + λR(w).

- Getting extremely close to the minimum of this objective might not increase performance on the test set.

- In machine learning, the optimization methods of interest are those that quickly get a good estimate w_k of the optimum w*.

Page 9

Optimization in Machine Learning

- How large must k be to get ε_k = f(w_k) − f(w*) ≤ ε for a given ε > 0?

- Is this number like 1/ε, 1/ε², 1/√ε, log(1/ε)?

- Is ε_k like 1/k, 1/k², etc.?

- How long does an iteration take?

- Quasi-Newton methods (BFGS, L-BFGS) are rapid when we are already close to the optimum, but otherwise not better than methods with much cheaper iterations (gradient descent).

- We care most about the early phase, when we are not close to the optimum.

Page 10

Examples

- Linear classifier (classes ±1): f(x) = sgn(w^T x + b).

- Training data {(x_i, y_i)}.

- SVMs (hinge loss): min_w ∑_i max(0, 1 − y_i(w^T x_i + b)) + λ‖w‖₂²

- Logistic multi-class: min_w ∑_i −log( exp(w_{y_i}^T x_i) / ∑_l exp(w_l^T x_i) ) + λ‖w‖₂²

- Decision: l* = argmax_l w_l^T x.

- Different regularizers, e.g. ‖w‖₁, encourage sparsity.

- Hinge loss and ℓ1 regularization cause non-differentiability on null sets (a toy evaluation of these objectives follows below).
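As a concrete illustration, the following sketch evaluates regularized hinge and logistic objectives on toy data; the data, λ, the use of a mean instead of a sum, and the binary rather than multi-class logistic loss are all assumptions for the example.

```python
import numpy as np

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]])   # toy inputs
y = np.array([1, -1, -1])                               # labels in {+1, -1}
lam = 0.1                                               # regularization strength

def hinge_objective(w, b):
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + lam * np.dot(w, w)

def logistic_objective(w, b):
    margins = y * (X @ w + b)
    return np.mean(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

w, b = np.array([0.2, -0.1]), 0.0
print(hinge_objective(w, b), logistic_objective(w, b))
```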

Page 11

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 12

Gradients

- We will make use of gradients with respect to a given coordinate system and its associated Euclidean geometry.

- If f(x) is a real-valued function that is differentiable at x = (x_1, ..., x_d), then

  ∇_x f = (∂f/∂x_1, ..., ∂f/∂x_d)

- The gradient points in the direction of steepest ascent.

Page 13

Subgradients

- If f(x) is a convex real-valued function, and if g ∈ R^d is such that

  f(y) − f(x) ≥ g^T(y − x)

  for all y in some open ball around x, then g is in the subgradient set of f at x.

- If f is differentiable at x, then ∇f(x) is the only element of the subgradient set.

- A line with slope g can be drawn through (x, f(x)) that touches the graph of f at x and lies below it elsewhere (see the example below).
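For example, f(x) = |x| is convex but not differentiable at 0, and any g ∈ [−1, 1] is a subgradient there. A small sketch (the function choice and names are assumptions):

```python
def subgradient_abs(x):
    """Return one element of the subgradient set of f(x) = |x|."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0   # at x = 0, any value in [-1, 1] is a valid subgradient

# Check the defining inequality f(y) - f(x) >= g * (y - x) at x = 0 with g = 0.
x, g = 0.0, subgradient_abs(0.0)
print(all(abs(y) - abs(x) >= g * (y - x) for y in [-2.0, -0.5, 0.3, 1.7]))  # True
```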

Page 14

Convex set

- A set X ⊂ R^n is convex if for all x, y ∈ X and 0 ≤ λ ≤ 1,

  (1 − λ)x + λy ∈ X

Page 15

Convex function

- A function f : R^n → R is convex if ∀λ ∈ [0, 1] and x, y ∈ R^n,

  f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y).

- The set of points on or above the function's graph is called its epigraph.

- A function is convex iff its epigraph is a convex set.

- A convex function is called closed if its epigraph is closed.

Page 16

Strong Convexity

- A function f : R^n → R is ρ-strongly convex (ρ > 0) if ∀λ ∈ [0, 1] and x, y ∈ R^n,

  f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y) − (ρ/2)λ(1 − λ)‖x − y‖₂²

- or equivalently, if f(x) − (ρ/2)‖x‖₂² is convex

- or equivalently, if f(y) ≥ f(x) + <∇f(x), y − x> + (ρ/2)‖x − y‖₂²

- or equivalently (if twice differentiable), ∇²f(x) ≥ ρI (I the identity matrix)

- (In the figure: green (y = x^4) is too flat near the origin; blue (y = x²) is not.)

Page 17

Lipschitz Continuity

- f is L-Lipschitz if

  |f(x) − f(y)| ≤ L‖x − y‖₂

- Lipschitz continuity implies continuity but not differentiability.

- If f is differentiable, then bounded gradient norms imply Lipschitz continuity.

- The secant slopes are bounded by L.

Page 18

Lipschitz Continuous Gradients

- If f is differentiable, then having L-Lipschitz gradients (L-LipG) means that

  ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂

  which is equivalent to

  f(y) ≤ f(x) + <∇f(x), y − x> + (L/2)‖x − y‖₂²

  which, if f is twice differentiable, is equivalent to ∇²f(x) ≤ L·I.

Page 19

Legendre transform

- Transformation between points and lines (the set of tangents).

- One-dimensional functions (generalizes to the Fenchel transform).

- f*(p) = max_x (px − f(x)).

- If f is differentiable, the maximizing x has the property that f'(x) = p.

Page 20

Fenchel duality

- The Fenchel dual of a function f is

  f*(s) = sup_x (<s, x> − f(x))

- If f is a closed convex function, then f** = f.

- f*(0) = max_x (−f(x)) = −min_x f(x)

- ∇f*(s) = argmax_x (<s, x> − f(x)) and ∇f(x) = argmax_s (<s, x> − f*(s)).

- ∇f*(∇f(x)) = x and ∇f(∇f*(s)) = s (a small numeric check follows below).
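A quick numeric sanity check of the definition (an illustrative sketch, not from the slides): for f(x) = ½x² the dual is f*(s) = ½s², approximated here by a sup over a finite grid.

```python
import numpy as np

def fenchel_dual(f, s, grid):
    """Approximate f*(s) = sup_x (s*x - f(x)) by taking the sup over a finite grid."""
    return np.max(s * grid - f(grid))

f = lambda x: 0.5 * x ** 2
grid = np.linspace(-10.0, 10.0, 200001)

for s in [-2.0, 0.0, 1.5, 3.0]:
    print(s, fenchel_dual(f, s, grid), 0.5 * s ** 2)   # numeric sup vs. the closed form s^2/2
```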

Page 21

Duality between strong convexity and Lipschitz gradients

- f is ρ-strongly convex iff f* has (1/ρ)-Lipschitz gradients.

- f has L-Lipschitz gradients iff f* is (1/L)-strongly convex.

- If a non-differentiable function f can be identified as g*, we can "smooth it out" by considering (g + δ‖·‖₂²)*, which is differentiable and has (1/δ)-Lipschitz gradients.

- (δ = µ/2 in the accompanying figure.)

Page 22

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 23

(Sub)Gradient Methods

- Gradient descent: w_{t+1} = w_t − a_t g_t, where g_t is a (sub)gradient of the objective f at w_t.

- a_t can be decided by line search or by a predetermined scheme.

- If we are constrained to a convex set X, then use the projected update (see the sketch below)

  w_{t+1} = Π_X(w_t − a_t g_t) = argmin_{w∈X} ‖w − (w_t − a_t g_t)‖₂
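A minimal projected (sub)gradient sketch, assuming the feasible set X is a Euclidean ball of radius r (so Π_X is a simple rescaling); the objective, radius, and step sizes are illustrative assumptions.

```python
import numpy as np

def project_ball(w, r):
    """Euclidean projection onto the set {w : ||w||_2 <= r}."""
    norm = np.linalg.norm(w)
    return w if norm <= r else (r / norm) * w

def projected_subgradient(subgrad, w0, r, steps):
    """w_{t+1} = Pi_X(w_t - a_t * g_t) with X the ball of radius r."""
    w = np.asarray(w0, dtype=float)
    for a_t in steps:
        w = project_ball(w - a_t * subgrad(w), r)
    return w

# Minimize f(w) = ||w - c||_1 over the unit ball; a subgradient is sign(w - c).
c = np.array([2.0, -0.5])
w = projected_subgradient(lambda w: np.sign(w - c), [0.0, 0.0], r=1.0,
                          steps=[0.5 / np.sqrt(t) for t in range(1, 501)])
print(w, np.linalg.norm(w))   # the iterate stays inside the unit ball
```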

Page 24

First and Second order information

- First order information: ∇f(w_t), t = 0, 1, 2, ..., n.

- Second order information is about ∇²f and is expensive to compute and store for high dimensional problems.

- Newton's method is a second order method where w_{t+1} = w_t − H_t^{-1} g_t (a sketch follows below).

- Quasi-Newton methods use first order information to try to approximate Newton's method.

- Conjugate gradient methods also use first order information.
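For comparison with the plain gradient step, here is a bare-bones Newton iteration; the quadratic test problem and the explicit Hessian are assumptions of this sketch.

```python
import numpy as np

def newton(grad, hess, w0, iters=10):
    """Newton's method: w <- w - H(w)^{-1} * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - np.linalg.solve(hess(w), grad(w))   # solve H d = g rather than forming H^{-1}
    return w

# f(w) = 0.5 * w^T A w with an ill-conditioned A: Newton lands on the optimum in one step.
A = np.array([[100.0, 0.0], [0.0, 1.0]])
print(newton(lambda w: A @ w, lambda w: A, [1.0, 1.0], iters=1))   # ~[0, 0]
```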

Page 25

Lower Bounds

- There are lower bounds on how well a first order method, i.e. a method for which w_{k+1} − w_k ∈ span(∇f(w_0), ..., ∇f(w_k)), can do. They depend on the function class!

- Given w_0 (and L, ρ), the bound says that for any ε > 0 there exists an f in the class such that any first order method takes at least n_ε steps, and a statement about n_ε is made.

- Differentiable functions with Lipschitz continuous gradients: lower bound is O(√(1/ε)).

- Lipschitz continuous gradients and strong convexity: lower bound is O(log(1/ε)).

Page 26

Gradient Descent with Lipschitz gradients

- If we have L-Lipschitz continuous gradients, then define gradient descent by

  w_{k+1} = w_k − (1/L)∇f(w_k).

- Monotone: f(w_{k+1}) ≤ f(w_k).

- If f is also ρ-strongly convex (ρ > 0), then

  f(w_k) − f(w*) ≤ (1 − ρ/L)^k (f(w_0) − f(w*)).

- This result translates into the optimal O(log(1/ε)) rate for first order methods.

Page 27

Gradient Descent without strong convexity

- Suppose that we do not have strong convexity (ρ = 0) but we have L-LipG.

- Then the gradient descent procedure w_{k+1} = w_k − (1/L)∇f(w_k) has rate O(1/ε): f(w_k) − f(w*) ≤ (L/k)‖w* − w_0‖².

- The lower bound for this class is O(√(1/ε)).

- There is a gap!

- The question arose whether the bound was loose or whether there is a better first order method.

Page 28

Nesterov's methods

- Nesterov answered the question in 1983 by constructing an "Optimal Gradient Method" with the rate O(√(1/ε)) (O(√(L/ε)) if we have not fixed L).

- Later variations (1988, 2005) widened the applicability.

- Introduces a momentum term.

- w_{k+1} = y_k − ∇f(y_k), y_{k+1} = w_k + α_k(w_k − w_{k−1})

- α_k = (β_k − 1)/β_{k+1}, β_{k+1} = (1 + √(4β_k² + 1))/2.

- Iteration cost similar to gradient descent (a hedged sketch follows below).
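A sketch of the accelerated scheme using the β-recursion above; the 1/L gradient step at the extrapolated point, the quadratic test problem, and the fixed iteration count are assumptions of this sketch rather than a verbatim transcription of the slide.

```python
import numpy as np

def nesterov(grad, w0, L, iters=200):
    """Accelerated gradient: extrapolate with momentum, then take a 1/L gradient step."""
    w_prev = w = np.asarray(w0, dtype=float)
    beta = 1.0
    for _ in range(iters):
        beta_next = (1.0 + np.sqrt(4.0 * beta ** 2 + 1.0)) / 2.0
        alpha = (beta - 1.0) / beta_next
        y = w + alpha * (w - w_prev)          # momentum / extrapolation step
        w_prev, w = w, y - grad(y) / L        # gradient step at the extrapolated point
        beta = beta_next
    return w

# Quadratic f(w) = 0.5 * w^T A w; L is the largest eigenvalue of A.
A = np.diag([100.0, 1.0])
print(nesterov(lambda w: A @ w, [1.0, 1.0], L=100.0))
```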

Page 29

Non-differentiable objectives: Subgradient descent

- Consider continuous but not necessarily differentiable objectives.

- Given a subgradient g_k at w_k we can perform an update

  w_{k+1} = w_k − η_k g_k.

- In the typical machine learning situation we only have non-differentiability on a set of Lebesgue measure 0, so g_k will actually often be a gradient.

- However, the gradients "jump" at the non-differentiable points, so we cannot have Lipschitz continuity of a (sub)gradient field in this situation.

- Subgradient descent has the bad rate O(1/ε²).

Page 30

Nesterov’s smoothing trick

- If we have a continuous non-differentiable function f that can be identified with g*, then we can use the following trick.

- Given a desired accuracy ε > 0, apply Nesterov's method to h = (g + δ‖·‖₂²)* where δ = ε/2, but go to precision ε/2.

- h has L-Lipschitz gradients where L = 1/δ.

- f(w) − h(w) ≤ δ = ε/2.

- Combined error at most ε.

- Since h has Lipschitz gradients it takes O(√(L/ε)) = O(1/ε) iterations (a small numeric illustration follows below).
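A hypothetical numeric illustration with f(x) = |x|, which is g* for g the indicator of [−1, 1]; the grid-based sup and the value of δ are assumptions of this sketch.

```python
import numpy as np

# Smooth f(x) = |x| by forming h = (g + delta * ||.||^2)*, i.e.
# h(x) = sup_{|s| <= 1} (s * x - delta * s^2), computed here by a sup over a grid.
delta = 0.1
s = np.linspace(-1.0, 1.0, 100001)            # grid over the domain of g
for x in np.linspace(-2.0, 2.0, 9):
    f = abs(x)                                 # original non-smooth objective
    h = np.max(s * x - delta * s ** 2)         # smoothed objective
    print(f"x = {x:+.2f}   f = {f:.4f}   h = {h:.4f}   f - h = {f - h:.4f}")
# f - h stays between 0 and delta, and h is differentiable with gradients whose
# Lipschitz constant is at most 1/delta, as on the slide.
```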

Page 31

Summary of offline gradient methods

- Normal gradient descent has the optimal rate, O(log(1/ε)), for strongly convex objectives with Lipschitz gradients.

- Normal gradient descent is not optimal (it is O(1/ε)) for convex but not strongly convex objectives. Nesterov's gradient method is optimal (O(√(1/ε))).

- For non-differentiable objectives, subgradient descent is in general O(1/ε²).

- By utilizing special properties this can be improved. If the objective can be identified with the dual of another function, we can use the smoothing trick and get O(1/ε).

Page 32

Section

Introduction: Optimization with gradient methods

Introduction: Machine Learning and Optimization

Convex Analysis

(Sub)Gradient Methods

Online (Sub)Gradient Methods

Page 33

Online Optimization

- I will consider optimization problems of the form C(w) = E_z L(z, w).

- If the underlying distribution for z is unknown, then we do not know C either.

- We observe a sequence z_1, z_2, ... and we use that information to update parameters w_t.

Page 34

Performance Measures

- There are different ways of measuring performance.

- C(w_t), ‖w_t − w*‖ (if there exists a unique minimum w*)

- ∑_{i=1}^t L(z_i, w_i), or ∑_{i=1}^t L(z_i, w_i) − inf_w ∑_{i=1}^t L(z_i, w)

- Interesting function classes (C and L).

- E.g. L(z, w) = ‖w − z‖₂².

Page 35

Stochastic Gradient Descent

- When an empirical average C_n(w) = (1/n) ∑_{i=1}^n L(z_i, w) is minimized, "batch" optimizers have to calculate the entire sum for each evaluation of C_n(w) and ∇_w C_n(w). (The gradient is taken with respect to a chosen metric, e.g. the standard Euclidean one w.r.t. a given coordinate system.)

- Stochastic gradient methods work with gradient estimates obtained from subsets of the training data.

- On large redundant data sets, simple stochastic gradient descent (SGD) typically outperforms second order "batch" methods by orders of magnitude.

- SGD is defined by w_{t+1} = w_t − a_t Y_t, where E(Y_t) = ∇_w C(w_t) and a_t > 0 (a sketch follows below).

- There have been many different attempts at using second order information.

- w_{t+1} = w_t − a_t B_t Y_t, where B_t is a positive scaling matrix.
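A minimal SGD sketch over a stream of samples, assuming a squared-error loss L(z, w) = ½‖w − z‖₂² (a scaled version of the example a few slides back, so the per-sample gradient is w − z) and a 1/t step size; the synthetic data and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_sample, w0, samples, step_fn):
    """SGD: w <- w - a_t * Y_t, where Y_t is the gradient on one fresh sample."""
    w = np.asarray(w0, dtype=float)
    for t, z in enumerate(samples, start=1):
        w = w - step_fn(t) * grad_sample(w, z)
    return w

# With L(z, w) = 0.5 * ||w - z||^2 the per-sample gradient is (w - z), and the
# minimizer of C(w) = E_z L(z, w) is the mean of the distribution of z.
samples = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(10000, 2))
w_hat = sgd(lambda w, z: w - z, [0.0, 0.0], samples, step_fn=lambda t: 1.0 / t)
print(w_hat)   # close to the true mean [1, -2]
```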

Page 36

Some Historical Landmarks

- Robbins and Monro (1951) introduced one-dimensional SGD and proved a convergence theorem.

- Blum (1954) generalized this to the multivariate case, but with restrictive assumptions.

- Robbins and Siegmund (1971) achieved a more general result for the multivariate case using super-martingale theory.

- Lai and Robbins (1979) investigated one-dimensional "adaptive procedures" (step sizes not predetermined) by introducing estimated curvature.

- Wei (1987) did the multivariate case (proving asymptotic efficiency).

- More on second order methods: Amari 1998, Murata 1998, Bottou and LeCun 2005.

- Second order information only changes the convergence by a constant factor; the rate remains the same.

- Hazan et al. (2006) show logarithmic (instead of square root) regret for general online convex optimization with a Newton step.

Page 37

Conditions for SGD convergence

- The cost function C(w) = E_z L(z, w) is three times continuously differentiable and bounded from below (this can be replaced by other assumptions, e.g. a bound on the subgradient sets).

- ∑_t a_t² < ∞, ∑_t a_t = ∞.

- E_z(L(z, w)²) ≤ A + B‖w‖₂²

- inf_{‖w‖>D} w · ∇_w C(w) > 0

- sup_{‖w‖<E} ‖L(z, w)‖ ≤ K

- Conclusion: ‖w_t − w*‖₂² = O(1/t) a.s.

- Addition: updates with scaling matrices converge a.s. if the eigenvalues are uniformly bounded from below by a strictly positive constant and from above by a finite constant (Sunehag et al. 2009).

Page 38

Step sizes

- The Robbins-Monro conditions ∑_t a_t² < ∞, ∑_t a_t = ∞ are satisfied, e.g., by a_t = 1/t.

- a_t = b/(t+b) (or 1/(t+b)) also works: it is halved after b steps, giving an initial search phase with a rather flat schedule.

- Step sizes that do not satisfy the Robbins-Monro conditions are also of interest.

- Constant step sizes get us to a ball around the optimum whose radius depends on this constant. This is also good for tracking a changing environment.

- Intermediate alternatives like 1/√t.

- Various adaptive (data-dependent) schemes have been tried, e.g. Stochastic Meta-Descent by Schraudolph. The verdict is unclear: they show improvement in special cases but cause extra computation. (The simple schedules are sketched below.)
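The simple (non-adaptive) schedules side by side, as a sketch; the constants a0, b, and the constant step are placeholders.

```python
import math

def step_size(t, scheme="one_over_t", a0=1.0, b=10.0, const=0.01):
    """A few of the step-size schedules discussed above (t = 1, 2, ...)."""
    if scheme == "one_over_t":         # satisfies the Robbins-Monro conditions
        return a0 / t
    if scheme == "shifted":            # a0 * b / (t + b): flatter initial search phase
        return a0 * b / (t + b)
    if scheme == "sqrt":               # intermediate alternative 1/sqrt(t)
        return a0 / math.sqrt(t)
    if scheme == "constant":           # reaches a ball around the optimum; good for tracking
        return const
    raise ValueError(scheme)

print([round(step_size(t, "shifted"), 3) for t in (1, 10, 100)])
```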

Page 39

Regret bounds

- Finite time analysis.

- The regret is ∑_{i=1}^t L(z_i, w_i) − inf_w ∑_{i=1}^t L(z_i, w).

- Under a strong convexity assumption one can prove that SGD with the right step sizes, 1/(ρt), has logarithmic regret, which is the best possible. This depends on knowing ρ.

- Without strong convexity it is O(√t).

- Using Newton steps, one can still have logarithmic regret.

- See the later online learning lecture.

Page 40

Second order methods

- w_t − w_{t−1} ≈ H_{w_t}^{-1}(∇C(w_t) − ∇C(w_{t−1})) (used by quasi-Newton methods).

- Estimating full Hessians (various methods) gives cost of order d² per step instead of d. Prohibitive for high dimensional problems.

- Low rank approximations (online LBFGS, Schraudolph et al. 2007) can cost order kd, where k is the memory length.

- The convergence speed improves by at most a constant factor, while the cost increases by a constant factor.

Page 41

Second order on the cheap

- Online LBFGS with memory 10 multiplies the cost per step by at least 11. It is only worth it for really ill-conditioned problems.

- Becker and LeCun (1989) estimated a diagonal scaling matrix for neural nets.

- SGD-QN by Bordes, Bottou, and Gallinari won the wild track of the PASCAL Large Scale Learning Challenge 2008 by estimating a diagonal scaling matrix (for SVMs) with an LBFGS-inspired method.

Page 42

Support Vector Machines with SGD

- SVMs (hinge loss): min_w (1/m) ∑_{i=1}^m max(0, 1 − y_i(w^T x_i)) + λ‖w‖₂²

- Instantaneous objective: max(0, 1 − y_i(w_t^T x_i)) + λ‖w_t‖₂², if we have drawn sample i at the current time t.

- (Sub)gradient: if the example is correctly classified with margin y_i(w_t^T x_i) ≥ 1, then g_t = 2λw_t; otherwise g_t = 2λw_t − y_i x_i.

- λ-strong convexity suggests step size 1/(λt) or 1/(λ(t + t_0)) (a sketch follows below).
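A hedged sketch of this SGD for the hinge-loss objective, using the subgradient case split above and the 1/(λ(t + t_0)) step; the toy data, λ, t_0, and the iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def svm_sgd(X, y, lam, iters=5000, t0=2):
    """SGD on the instantaneous objective max(0, 1 - y_i w^T x_i) + lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    for t in range(1, iters + 1):
        i = rng.integers(len(y))                 # draw one sample
        step = 1.0 / (lam * (t + t0))            # step suggested by strong convexity
        if y[i] * (w @ X[i]) >= 1.0:             # correct with margin: only the regularizer acts
            g = 2.0 * lam * w
        else:                                    # hinge term is active
            g = 2.0 * lam * w - y[i] * X[i]
        w = w - step * g
    return w

# Toy (roughly separable) data, purely for illustration.
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w = svm_sgd(X, y, lam=0.1)
print(np.mean(np.sign(X @ w) == y))   # training accuracy of the learned linear classifier
```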

Page 43

Support Vector Machines with Pegasos

- Note: plugging in w = 0 makes the objective equal to 1. If we have a w with ‖w‖ ≥ 1/√λ, then the objective is at least 1.

- Conclusion: we only have to consider ‖w‖₂ ≤ 1/√λ.

- This is utilized by the Pegasos algorithm, which alternates the gradient step above with a projection (scaling) onto this ball (sketched below).

- Proven O(1/ε).
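The extra Pegasos step is just a rescaling onto that ball. A one-function sketch (the λ value is a placeholder), which would be applied after each (sub)gradient step in a loop like the one above:

```python
import numpy as np

def pegasos_project(w, lam):
    """Scale w back onto the ball ||w||_2 <= 1/sqrt(lam), where the optimum must lie."""
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

# Example use inside an SGD loop:  w = pegasos_project(w - step * g, lam)
print(pegasos_project(np.array([5.0, 5.0]), lam=0.1))   # rescaled to norm 1/sqrt(0.1)
```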

Page 44

Summary

- Simple methods with cheap updates can be the best idea for parameter estimation in machine learning.

- This is particularly true in large scale learning.

- With SGD, the time per iteration does not depend on the data set size. Having more data just means that we do not rerun on old data.

- Faster progress is made when using "fresh" data.

- More data, therefore, means LESS runtime (for SGD) until we reach the desired true accuracy.

Page 45

Literature

- Numerical Optimization, Jorge Nocedal and Stephen J. Wright, Springer, August 2000.

- Introductory Lectures on Convex Optimization: A Basic Course, Yurii Nesterov, Springer, 2003.

- Convex Analysis, R. Tyrrell Rockafellar, Princeton University Press, 1996.

- Convex Optimization, Stephen Boyd and Lieven Vandenberghe, Cambridge University Press, 2004.

- L. Bottou's SGD page: http://leon.bottou.org/research/stochastic