
CS639: Data Management for Data Science

Lecture 20: Optimization/Gradient Descent

Theodoros Rekatsinas


Page 1: Lecture 20 opt - GitHub Pages · Convex Sets Definition AsetC ⊆ Rn is convex if for x,y ∈ C and any α ∈ [0,1], αx+(1−α)y ∈ C. x y Duchi (UC Berkeley) Convex Optimization


Today

1. Optimization

2. Gradient Descent


What is Optimization?

Find the minimum or maximum of an objective function given a set of constraints:

$$
\begin{aligned}
\arg\min_{x}\quad & f_0(x) \\
\text{s.t.}\quad & f_i(x) \le 0, \quad i = 1, \dots, k \\
& h_j(x) = 0, \quad j = 1, \dots, l
\end{aligned}
$$
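To make the problem statement concrete, here is a minimal sketch that solves a tiny instance by brute force. The objective `(x - 2)^2`, the single inequality constraint `x - 1 <= 0`, and the grid are all assumptions chosen for illustration; there is no equality constraint in this toy instance, and real solvers do not enumerate a grid.

```python
def objective(x):
    # f0(x): the quantity to minimize (hypothetical example).
    return (x - 2.0) ** 2


def inequality(x):
    # f1(x) <= 0 encodes the constraint x <= 1.
    return x - 1.0


def grid_argmin(lo=-300, hi=300, scale=100):
    """Brute-force argmin of the objective over feasible grid points i / scale."""
    best_x, best_val = None, float("inf")
    for i in range(lo, hi + 1):
        x = i / scale
        if inequality(x) > 0:  # skip infeasible points
            continue
        val = objective(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val


x_star, f_star = grid_argmin()
```

The unconstrained minimizer would be x = 2, but the constraint pushes the solution to the boundary of the feasible set at x = 1.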


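Why Do We Care?

Linear Classification:

$$\arg\min_{w}\ \sum_{i=1}^{n} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.}\quad 1 - y_i x_i^T w \le \xi_i, \quad \xi_i \ge 0$$

Maximum Likelihood:

$$\arg\max_{\theta}\ \sum_{i=1}^{n} \log p_\theta(x_i)$$

K-Means:

$$\arg\min_{\mu_1, \mu_2, \dots, \mu_k}\ J(\mu) = \sum_{j=1}^{k} \sum_{i \in C_j} \|x_i - \mu_j\|^2$$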
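As a small illustration of how one of these objectives is evaluated, the sketch below computes the K-Means cost J(μ) for a fixed cluster assignment. The points, assignments, and centroids are made-up toy values, and the function name `kmeans_objective` is hypothetical.

```python
def kmeans_objective(points, assignments, centroids):
    """J(mu) = sum over clusters j and points i in C_j of ||x_i - mu_j||^2."""
    total = 0.0
    for point, j in zip(points, assignments):
        mu = centroids[j]
        total += sum((p - m) ** 2 for p, m in zip(point, mu))
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
assignments = [0, 0, 1, 1]              # index of the cluster each point belongs to
centroids = [(0.0, 1.0), (10.0, 1.0)]   # cluster means for this assignment
J = kmeans_objective(points, assignments, centroids)
```

Moving a centroid away from its cluster mean can only increase J for a fixed assignment, which is why K-Means sets each centroid to the mean of its cluster.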


Prefer Convex Problems

Local (non-global) minima and maxima:

What is Optimization

But generally speaking... we're screwed.

• Local (non-global) minima of $f_0$
• All kinds of constraints (even restricting to continuous functions):

$$h(x) = \sin(2\pi x) = 0$$

[Figure: 3-D surface plot; both horizontal axes run from −3 to 3 and the vertical axis from −50 to 250.]

Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 7 / 535
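The difficulty with local (non-global) minima can be seen on a one-dimensional toy function. The sketch below scans a grid and reports every point that is lower than both of its neighbours. The polynomial `x^4 - 3x^2 + x` is an assumption picked only because it has two local minima with different values.

```python
def f(x):
    # A hypothetical non-convex objective with two local minima.
    return x ** 4 - 3.0 * x ** 2 + x


def grid_local_minima(func, lo=-2.0, hi=2.0, steps=400):
    """Return the grid points whose value is below both neighbours."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    vals = [func(x) for x in xs]
    return [
        xs[i]
        for i in range(1, steps)
        if vals[i] < vals[i - 1] and vals[i] < vals[i + 1]
    ]


minima = grid_local_minima(f)
global_min = min(minima, key=f)
```

A method that only looks at local information can end up in either basin depending on where it starts, and only one of the two is the global minimum.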


Convex Functions and Sets

Convex Sets

Definition: A set $C \subseteq \mathbb{R}^n$ is convex if for $x, y \in C$ and any $\alpha \in [0, 1]$,

$$\alpha x + (1 - \alpha) y \in C.$$

[Figure: illustration with two labeled points, x and y.]

Convex Functions

Definition: A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for $x, y \in \operatorname{dom} f$ and any $\alpha \in [0, 1]$,

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y).$$

[Figure: graph of f with f(x), f(y), and the chord value αf(x) + (1 − α)f(y) labeled.]
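The defining inequality for convex functions can be spot-checked numerically. The sketch below searches a small grid of points and mixing weights for a violation; the helper name `violates_convexity`, the grids, and the tolerance are assumptions. Finding no violation on a finite grid does not prove convexity, whereas a single violation does disprove it.

```python
def violates_convexity(f, xs, alphas, tol=1e-12):
    """Return the first (x, y, a) with f(a x + (1-a) y) > a f(x) + (1-a) f(y), else None."""
    for x in xs:
        for y in xs:
            for a in alphas:
                lhs = f(a * x + (1 - a) * y)
                rhs = a * f(x) + (1 - a) * f(y)
                if lhs > rhs + tol:
                    return (x, y, a)
    return None


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
square_check = violates_convexity(lambda x: x * x, xs, alphas)     # convex function
concave_check = violates_convexity(lambda x: -x * x, xs, alphas)   # not convex
```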


Important Convex Functions

Convex Functions Examples

Important examples in Machine Learning

• SVM loss:

$$f(w) = \left[ 1 - y_i x_i^T w \right]_+$$

• Binary logistic loss:

$$f(w) = \log\left( 1 + \exp(-y_i x_i^T w) \right)$$

[Figure: plot of $[1 - x]_+$ and $\log(1 + e^x)$ for x from −2 to 3, with values from 0 to 3.]
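Both losses are functions of the margin $y_i x_i^T w$, which a few lines of code make explicit. This is a minimal sketch; the function names and the example weight and feature vectors are assumptions, and `math.exp` can overflow for very negative margins, which a production implementation would need to guard against.

```python
import math


def margin(w, x, y):
    # y * x^T w for a label y in {-1, +1}.
    return y * sum(wi * xi for wi, xi in zip(w, x))


def hinge_loss(z):
    # SVM loss [1 - z]_+ as a function of the margin z.
    return max(0.0, 1.0 - z)


def logistic_loss(z):
    # Binary logistic loss log(1 + exp(-z)) as a function of the margin z.
    return math.log1p(math.exp(-z))


z = margin([0.5, -1.0], [2.0, 1.0], +1)
```

The hinge loss is exactly zero once the margin reaches 1, while the logistic loss stays positive and only decays toward zero.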


Convex Optimization Problem

Convex Optimization Problems

Definition: An optimization problem is convex if its objective is a convex function, the inequality constraints $f_i$ are convex, and the equality constraints $h_j$ are affine:

$$
\begin{aligned}
\min_{x}\quad & f_0(x) && \text{(convex function)} \\
\text{s.t.}\quad & f_i(x) \le 0 && \text{(convex sets)} \\
& h_j(x) = 0 && \text{(affine)}
\end{aligned}
$$


Lagrangian Dual


First Order Methods: Gradient Descent, Newton's Method
Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009

Subgradient Descent
Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009

Stochastic Gradient Descent
Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML'10

Trust Regions
Trust Region Newton Method for Large-Scale Logistic Regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008

Dual Coordinate Descent
Dual Coordinate Descent Methods for Logistic Regression and Maximum Entropy Models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, Machine Learning Journal, 2011

Linear Classification
Recent Advances of Large-Scale Linear Classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin, Proceedings of the IEEE, 100 (2012)


Gradient Descent

Optimization Algorithms: Gradient Methods

The simplest algorithm in the world (almost). Goal:

$$\min_{x} f(x)$$

Just iterate

$$x_{t+1} = x_t - \eta_t \nabla f(x_t)$$

where $\eta_t$ is the stepsize.
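The iteration can be sketched in a few lines. The quadratic objective, the constant stepsize of 0.1, and the iteration count below are assumptions chosen so the toy example converges; they are not values from the lecture.

```python
def grad_f(x):
    # Gradient of the hypothetical objective f(x) = (x0 - 3)^2 + 2 (x1 + 1)^2.
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]


def gradient_descent(grad, x0, stepsize=0.1, iterations=200):
    """Iterate x_{t+1} = x_t - eta * grad(x_t) with a constant stepsize eta."""
    x = list(x0)
    for _ in range(iterations):
        g = grad(x)
        x = [xi - stepsize * gi for xi, gi in zip(x, g)]
    return x


x_min = gradient_descent(grad_f, [0.0, 0.0])
```

With this stepsize each coordinate's distance to the minimizer (3, −1) shrinks by a constant factor per step (0.8 and 0.6 respectively); a stepsize that is too large for the curvature would make the iterates diverge instead.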

Single Step Illustration

Optimization Algorithms: Gradient Methods

[Figure: the curve $f(x)$ with the point $(x_t, f(x_t))$ and the line labeled $f(x_t) - \eta \nabla f(x_t)^T (x - x_t)$.]


Full Gradient Descent Illustration


Newton's Method

Optimization Algorithms: Gradient Methods

Idea: use a second-order approximation to the function.

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x$$

Choose $\Delta x$ to minimize the above:

$$\Delta x = -\left[ \nabla^2 f(x) \right]^{-1} \nabla f(x)$$

Here $\left[ \nabla^2 f(x) \right]^{-1}$ is the inverse Hessian and $\nabla f(x)$ is the gradient.

This is a descent direction:

$$\nabla f(x)^T \Delta x = -\nabla f(x)^T \left[ \nabla^2 f(x) \right]^{-1} \nabla f(x) < 0.$$
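In one dimension the inverse Hessian is simply 1 / f''(x), so the Newton step reduces to a scalar division. The sketch below is that scalar special case, not the general matrix version; the objective `exp(x) - 2x` (minimized at x = ln 2) and the iteration count are assumptions, and the second derivative is positive everywhere, which is what makes each step a descent direction here.

```python
import math


def newton_minimize(grad, hess, x0, iterations=10):
    """One-dimensional Newton iteration: x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(iterations):
        x = x - grad(x) / hess(x)  # scalar analogue of -[Hessian]^{-1} gradient
    return x


# Hypothetical objective f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x) > 0.
x_star = newton_minimize(lambda x: math.exp(x) - 2.0, lambda x: math.exp(x), x0=0.0)
```

Starting from 0, the iterates are 1, then about 0.736, then about 0.694, closing in on ln 2 ≈ 0.693 far faster than a fixed-stepsize gradient method would.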


Newton's Method Picture

Optimization Algorithms: Gradient Methods

Newton step picture

[Figure: Newton step from $(x, f(x))$ to $(x + \Delta x, f(x + \Delta x))$; $\hat{f}$ is the 2nd-order approximation, $f$ is the true function.]


Subgradient Descent Motivation

Convex Functions

Actually, more general than that.

Definition: The subgradient set, or subdifferential set, $\partial f(x)$ of $f$ at $x$ is

$$\partial f(x) = \left\{ g : f(y) \ge f(x) + g^T (y - x) \text{ for all } y \right\}.$$

Theorem: $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if it has a non-empty subdifferential set everywhere.

[Figure: graph of $f(y)$ with the line $f(x) + g^T (y - x)$ through the point $(x, f(x))$.]

Optimization Algorithms: Subgradient Methods

Why subgradient descent?

• Lots of non-differentiable convex functions are used in machine learning:

$$f(x) = \left[ 1 - a^T x \right]_+, \qquad f(x) = \|x\|_1, \qquad f(X) = \sum_{r=1}^{k} \sigma_r(X)$$

where $\sigma_r$ is the $r$th singular value of $X$.

• Easy to analyze
• Do not even need a true subgradient: it is enough to have $\mathbb{E}[g_t] \in \partial f(x_t)$.


Subgradient Descent: Algorithm

Optimization Algorithms: Subgradient Methods

Really, the simplest algorithm in the world. Goal:

$$\min_{x} f(x)$$

Just iterate

$$x_{t+1} = x_t - \eta_t g_t$$

where $\eta_t$ is a stepsize and $g_t \in \partial f(x_t)$.
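A minimal sketch of this iteration on a non-differentiable function follows. The objective (a shifted L1 norm), the choice of 0 as the subgradient at the kink, and the diminishing stepsize 1/t are assumptions made for illustration.

```python
def sign(v):
    return (v > 0) - (v < 0)


def subgradient(x, target):
    # One valid subgradient of f(x) = sum_i |x_i - target_i| (0 is chosen at the kink).
    return [sign(xi - ti) for xi, ti in zip(x, target)]


def subgradient_descent(x0, target, iterations=1000):
    """Iterate x_{t+1} = x_t - eta_t * g_t with the diminishing stepsize eta_t = 1/t."""
    x = list(x0)
    for t in range(1, iterations + 1):
        g = subgradient(x, target)
        x = [xi - gi / t for xi, gi in zip(x, g)]
    return x


x_final = subgradient_descent([0.0, 0.0], target=[1.0, -2.0])
```

Unlike gradient descent on a smooth function, the iterates can overshoot and hop back and forth across the kink, so the objective need not decrease at every step; the shrinking stepsize is what bounds the size of those hops.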


Online learning and optimization

• Goal of machine learning:
  • Minimize expected loss given samples
• This is Stochastic Optimization
  • Assume loss function is convex


Batch (sub)gradient descent for ML

• Process all examples together in each step
• Entire training set examined at each step
• Very slow when n is very large


Stochastic (sub)gradient descent

• “Optimize” one example at a time
• Choose examples randomly (or reorder and choose in order)
• Learning representative of example distribution


Stochastic (sub)gradient descent

• Equivalent to online learning (the weight vector w changes with every example)
• Convergence guaranteed for convex functions (to a local minimum, which for a convex function is also a global minimum)
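The one-example-at-a-time update can be sketched on a toy regression problem. Everything specific here is an assumption: the squared-error loss, the linear model with parameters `w` and `b`, the constant stepsize, and the noiseless data (which is why a constant stepsize is enough to fit it exactly).

```python
import random


def sgd_linear_regression(xs, ys, stepsize=0.1, epochs=500, seed=0):
    """SGD for the per-example loss 0.5 * (w * x + b - y)^2, one example per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # reorder, then visit the examples in that order
        for i in order:
            residual = w * xs[i] + b - ys[i]
            w -= stepsize * residual * xs[i]  # gradient with respect to w
            b -= stepsize * residual          # gradient with respect to b
    return w, b


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]  # noiseless toy data generated from w = 2, b = 1
w, b = sgd_linear_regression(xs, ys)
```

The weights change after every single example, which is the sense in which this matches online learning.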


Hybrid!

• Stochastic: 1 example per iteration
• Batch: all the examples!
• Sample Average Approximation (SAA):
  • Sample m examples at each step and perform SGD on them
• Allows for parallelization, but choice of m is based on heuristics
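The hybrid scheme can be sketched by averaging the gradients of m sampled examples per step. As before, the squared-error loss, the toy data, the batch size m = 2, and the stepsize are assumptions chosen for illustration rather than recommendations.

```python
import random


def minibatch_sgd(xs, ys, batch_size=2, stepsize=0.1, steps=5000, seed=0):
    """Each step averages the gradients of m = batch_size randomly sampled examples."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(range(len(xs)), batch_size)
        grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / batch_size
        grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / batch_size
        w -= stepsize * grad_w
        b -= stepsize * grad_b
    return w, b


xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]  # noiseless toy data generated from w = 2, b = 1
w, b = minibatch_sgd(xs, ys)
```

The per-example gradients inside a batch are independent of one another, which is what makes this variant straightforward to parallelize.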


SGD - Issues

• Convergence is very sensitive to the learning rate (oscillations near the solution due to the probabilistic nature of sampling)
  • The learning rate might need to decrease with time to ensure the algorithm converges eventually
• Basically, SGD is good for machine learning with large data sets!
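The effect of a decreasing learning rate shows up clearly on a toy problem. The sketch below runs SGD on the expected loss E[0.5 (x − y)^2] with noisy samples y; the Gaussian data, the 1/t schedule, and the constant rate of 0.5 are assumptions for illustration.

```python
import random


def sgd_on_noisy_quadratic(samples, stepsize_fn, x0=0.0):
    """SGD on f(x) = E[0.5 * (x - y)^2]; each sample y gives the gradient estimate x - y."""
    x = x0
    for t, y in enumerate(samples, start=1):
        x -= stepsize_fn(t) * (x - y)
    return x


rng = random.Random(0)
samples = [rng.gauss(5.0, 1.0) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)

x_decay = sgd_on_noisy_quadratic(samples, lambda t: 1.0 / t)  # decreasing learning rate
x_const = sgd_on_noisy_quadratic(samples, lambda t: 0.5)      # constant learning rate
```

With the 1/t schedule the iterate is exactly the running average of the samples seen so far, so the noise averages out. With the constant rate of 0.5 each new iterate is half the previous iterate plus half the latest sample, so it keeps fluctuating with the noise no matter how long it runs.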


Statistical Learning Crash Course

A staggering amount of machine learning/stats can be written as:

$$\min_{x} \sum_{i=1}^{N} f(x, y_i)$$

N (number of $y_i$'s, data) is typically in the billions.

Ex: Classification, Recommendation, Deep Learning.

De facto iteration to solve large-scale problems: SGD.

Billions of tiny iterations:

$$x_{k+1} = x_k - \alpha N \nabla f(x_k, y_j)$$

Select one term, j, and estimate the gradient.
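The factor N in the update is what makes a single term a fair estimate of the full gradient: averaged over a uniformly chosen j, N times one term's gradient equals the gradient of the whole sum. The sketch below checks this on a tiny example; the per-term loss 0.5 (x − y)^2 and the data values are assumptions.

```python
def grad_term(x, y):
    # Gradient in x of the hypothetical per-example term f(x, y) = 0.5 * (x - y)^2.
    return x - y


ys = [1.0, 2.0, 4.0, 7.0]
N = len(ys)
x = 0.5

full_gradient = sum(grad_term(x, y) for y in ys)             # gradient of the whole sum
single_term_estimates = [N * grad_term(x, y) for y in ys]    # one estimate per choice of j
average_estimate = sum(single_term_estimates) / N            # expectation over uniform j
```

Each individual estimate can be far from the full gradient (here they range from −2 to −26 around a true value of −12), which is the source of the noise in SGD.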


Parallel SGD (Centralized)

[Figure: model workers, each reading from data shards, exchange w and Δw with a parameter server, which applies w = w − αΔw.]

• Centralized parameter updates guarantee convergence
• Parameter server is a bottleneck
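The centralized pattern can be simulated in a single process. In the sketch below the class `ParameterServer`, its `pull` and `push` methods, and the helper `worker_gradient` are hypothetical names, the loss 0.5 (w − y)^2 is a toy choice, and the workers run one after another rather than on separate machines.

```python
class ParameterServer:
    """Holds the shared parameter w and applies w = w - alpha * delta_w."""

    def __init__(self, w0, alpha):
        self.w = w0
        self.alpha = alpha

    def pull(self):
        return self.w

    def push(self, delta_w):
        self.w -= self.alpha * delta_w


def worker_gradient(w, shard):
    # Mean gradient of 0.5 * (w - y)^2 over the worker's data shard.
    return sum(w - y for y in shard) / len(shard)


shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one data shard per model worker
server = ParameterServer(w0=0.0, alpha=0.5)
for _ in range(50):
    w = server.pull()                                          # workers fetch the current w
    deltas = [worker_gradient(w, shard) for shard in shards]   # computed independently
    server.push(sum(deltas) / len(deltas))                     # server applies the update
```

Every round passes through `pull` and `push` on the one server object, which is the bottleneck the slide refers to: adding workers does not help once the server saturates.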


Parallel SGD (HogWild! - asynchronous)

Data Systems Perspective of SGD

Insane conflicts: billions of tiny jobs (~100 instructions), RW conflicts on x

$$x_{k+1} = x_k - \alpha N \nabla f(x_k, y_j)$$

Multiple workers need to communicate!

HogWild!: For sparse convex models (e.g., logistic regression), run without locks; SGD still converges (answer is statistically correct).
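The lock-free access pattern can be sketched with threads that update a shared vector without any synchronization. This is a toy illustration only: the per-coordinate loss 0.5 (x_j − target_j)^2, the function name `hogwild_worker`, and all constants are assumptions, and in standard CPython builds the global interpreter lock serializes bytecode execution, so the sketch shows the unsynchronized read-modify-write pattern rather than a real parallel speed-up.

```python
import random
import threading


def hogwild_worker(x, targets, stepsize, iterations, seed):
    """Apply sparse SGD updates to the shared list x without taking any lock."""
    rng = random.Random(seed)
    for _ in range(iterations):
        j = rng.randrange(len(targets))         # each example touches one coordinate
        x[j] -= stepsize * (x[j] - targets[j])  # unsynchronized read-modify-write


targets = [1.0, -2.0, 0.5, 3.0, -1.5]
x = [0.0] * len(targets)  # shared parameter vector
threads = [
    threading.Thread(target=hogwild_worker, args=(x, targets, 0.1, 2000, seed))
    for seed in range(4)
]
for th in threads:
    th.start()
for th in threads:
    th.join()
max_error = max(abs(xi - ti) for xi, ti in zip(x, targets))
```

Because each update touches a single coordinate, two workers rarely collide, and a collision at worst loses one small step toward the target; that sparsity is the property the HogWild! claim on the slide relies on.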