CS639: Data Management for Data Science
Lecture 20: Optimization / Gradient Descent
Theodoros Rekatsinas
Today
1. Optimization
2. Gradient Descent
What is Optimization?
Find the minimum or maximum of an objective function given a set of constraints:

  argmin_x f_0(x)
  s.t. f_i(x) ≤ 0, i ∈ {1, . . . , k}
       h_j(x) = 0, j ∈ {1, . . . , l}
Why Do We Care?

Maximum Likelihood:
  argmax_θ Σ_{i=1}^n log p_θ(x_i)

K-Means:
  argmin_{μ_1, μ_2, . . . , μ_k} J(μ) = Σ_{j=1}^k Σ_{i ∈ C_j} ‖x_i − μ_j‖²

Linear Classification (SVM):
  argmin_w ‖w‖² + C Σ_{i=1}^n ξ_i
  s.t. 1 − y_i x_i^T w ≤ ξ_i,  ξ_i ≥ 0
Prefer Convex Problems
Local (non-global) minima and maxima are the problem. But generally speaking, we're screwed:
- Local (non-global) minima of f_0
- All kinds of constraints (even restricting to continuous functions), e.g. h(x) = sin(2πx) = 0
[Figure: 3-D surface of a wildly non-convex function over [−3, 3]², with values ranging from about −50 to 250.]
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 7 / 53
Convex Sets
Definition. A set C ⊆ Rⁿ is convex if for x, y ∈ C and any α ∈ [0, 1],
  αx + (1 − α)y ∈ C.
[Figure: the segment between any two points x, y of a convex set stays inside the set.]
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 9 / 53
Convex Functions
Definition. A function f : Rⁿ → R is convex if for x, y ∈ dom f and any α ∈ [0, 1],
  f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).
[Figure: the chord αf(x) + (1 − α)f(y) lies above the graph of f between x and y.]
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 15 / 53
Convex Functions and Sets
A function f : Rⁿ → R is convex if for x, y ∈ dom f and any α ∈ [0, 1],
  f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).
A set C ⊆ Rⁿ is convex if for x, y ∈ C and any α ∈ [0, 1],
  αx + (1 − α)y ∈ C.
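As a quick worked check of the definition (my own example, not from the slides), take f(x) = x²:

```latex
\alpha f(x) + (1-\alpha) f(y) - f\bigl(\alpha x + (1-\alpha) y\bigr)
  = \alpha x^2 + (1-\alpha) y^2 - \bigl(\alpha x + (1-\alpha) y\bigr)^2
  = \alpha (1-\alpha)(x - y)^2 \;\ge\; 0,
```

so the defining inequality holds for every x, y and every α ∈ [0, 1], and f(x) = x² is convex.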
Important Convex Functions
Important examples in machine learning:
- SVM loss: f(w) = [1 − y_i x_i^T w]_+
- Binary logistic loss: f(w) = log(1 + exp(−y_i x_i^T w))
[Figure: plot over x ∈ [−2, 3] of the two losses, labeled [1 − x]_+ and log(1 + eˣ).]
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 22 / 53
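To make the two losses concrete, here is a minimal numpy sketch (an illustration of mine, not part of the deck) evaluating both as functions of the margin m = y_i x_i^T w:

```python
import numpy as np

def hinge_loss(margin):
    """SVM loss [1 - m]_+ as a function of the margin m = y * (x @ w)."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Binary logistic loss log(1 + exp(-m))."""
    return np.log1p(np.exp(-margin))

for m in np.linspace(-2.0, 3.0, 6):
    print(f"m={m:+.1f}  hinge={hinge_loss(m):.3f}  logistic={logistic_loss(m):.3f}")
```

Both are convex in w; the hinge loss is the non-differentiable one (at margin 1), which is where the subgradients covered later come in.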
Convex Optimization Problems
Definition. An optimization problem is convex if its objective f_0 is a convex function, the inequality constraints f_i are convex, and the equality constraints h_j are affine:
  minimize_x f_0(x)   (convex function)
  s.t. f_i(x) ≤ 0     (convex sets)
       h_j(x) = 0     (affine)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 23 / 53
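As a concrete instance (illustrative, not from the slides), non-negative least squares fits the template:

```latex
\begin{aligned}
\text{minimize}_x \quad & \lVert Ax - b \rVert_2^2 && \text{(convex objective)} \\
\text{s.t.} \quad & -x_i \le 0,\ i = 1, \dots, n && \text{(convex inequality constraints; affine, hence convex)}
\end{aligned}
```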
Lagrangian Dual
First-Order Methods: References
- Gradient Descent, Newton's Method, Subgradient Descent: Introduction to Convex Optimization for Machine Learning, John Duchi, UC Berkeley, Tutorial, 2009
- Stochastic Gradient Descent: Stochastic Optimization for Machine Learning, Nathan Srebro and Ambuj Tewari, presented at ICML '10
- Trust Regions: Trust Region Newton Method for Large-scale Logistic Regression, C.-J. Lin, R. C. Weng, and S. S. Keerthi, Journal of Machine Learning Research, 2008
- Dual Coordinate Descent: Dual Coordinate Descent Methods for Logistic Regression and Maximum Entropy Models, H.-F. Yu, F.-L. Huang, and C.-J. Lin, Machine Learning Journal, 2011
- Linear Classification: Recent Advances of Large-scale Linear Classification, G.-X. Yuan, C.-H. Ho, and C.-J. Lin, Proceedings of the IEEE, 100 (2012)
Gradient Descent
The simplest algorithm in the world (almost). Goal:
  minimize_x f(x)
Just iterate
  x_{t+1} = x_t − η_t ∇f(x_t)
where η_t is a stepsize.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 39 / 53
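The iteration is short enough to write out. Below is a minimal Python sketch (mine, not from the lecture) on a toy quadratic, where the exact minimizer is known and can be checked:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, iters=100):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t) with a constant stepsize eta."""
    x = x0.copy()
    for _ in range(iters):
        x = x - eta * grad_f(x)
    return x

# Toy example: f(x) = 1/2 x^T A x - b^T x, so grad f(x) = A x - b
# and the minimizer is the solution of A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = gradient_descent(lambda x: A @ x - b, x0=np.zeros(2))
print(x_star, np.linalg.solve(A, b))  # the two should be close
```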
Single Step Illustration
[Figure: one gradient step from (x_t, f(x_t)); the linear model f(x_t) − η∇f(x_t)^T (x − x_t) shown against the true function f(x).]
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 40 / 53
Full Gradient Descent Illustration
Newton's Method
Idea: use a second-order approximation to the function:
  f(x + Δx) ≈ f(x) + ∇f(x)^T Δx + ½ Δx^T ∇²f(x) Δx
Choose Δx to minimize the approximation:
  Δx = −(∇²f(x))⁻¹ ∇f(x)   (inverse Hessian times gradient)
This is a descent direction:
  ∇f(x)^T Δx = −∇f(x)^T (∇²f(x))⁻¹ ∇f(x) < 0.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 44 / 53
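A minimal Python sketch of the Newton iteration (my own toy example with an assumed l2-regularized logistic loss, not code from the lecture):

```python
import numpy as np

def newton(grad_f, hess_f, x0, iters=10):
    """Newton's method: x <- x - (inverse Hessian) times (gradient)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x

# Toy data and objective: f(w) = sum_i log(1 + exp(-y_i x_i^T w)) + (lam/2)||w||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=50))
lam = 1.0

def grad(w):
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))        # sigmoid(-margin)
    return -X.T @ (s * y) + lam * w

def hess(w):
    p = 1.0 / (1.0 + np.exp(-y * (X @ w)))       # sigmoid(margin)
    return (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(2)

w = newton(grad, hess, np.zeros(2))
print(w, np.linalg.norm(grad(w)))  # gradient norm should be near 0 after a few steps
```

The np.linalg.solve call is the (∇²f(x))⁻¹ ∇f(x) step; solving the linear system is preferred to forming the inverse explicitly.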
Newton's Method: Picture
[Figure: the true function f and its second-order approximation f̂; the Newton step moves from (x, f(x)) to (x + Δx, f(x + Δx)), the minimizer of f̂.]
f̂ is the 2nd-order approximation, f is the true function.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 45 / 53
Subgradient Descent: Motivation
Convexity is actually more general than differentiability suggests.
Definition. The subgradient set, or subdifferential set, ∂f(x) of f at x is
  ∂f(x) = { g : f(y) ≥ f(x) + g^T (y − x) for all y }.
Theorem. f : Rⁿ → R is convex if and only if it has a non-empty subdifferential set everywhere.
[Figure: a supporting line f(x) + g^T (y − x) through (x, f(x)) lying below f(y).]
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 17 / 53

Why subgradient descent? Lots of non-differentiable convex functions are used in machine learning:
  f(x) = [1 − a^T x]_+,   f(x) = ‖x‖₁,   f(X) = Σ_{r=1}^k σ_r(X)
where σ_r is the r-th singular value of X.
- Easy to analyze.
- Do not even need a true subgradient: it suffices that E[g_t] ∈ ∂f(x_t).
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 52 / 53
Subgradient Descent: Algorithm
Really, the simplest algorithm in the world. Goal:
  minimize_x f(x)
Just iterate
  x_{t+1} = x_t − η_t g_t
where η_t is a stepsize and g_t ∈ ∂f(x_t).
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 51 / 53
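A minimal sketch (mine, not from the deck), assuming the l1-regularized least-squares objective f(x) = ½‖Ax − b‖² + λ‖x‖₁ and the classic decaying stepsize η_t = η₀/√(t+1):

```python
import numpy as np

def subgradient_descent(subgrad_f, x0, eta0=0.5, iters=500):
    """Iterate x_{t+1} = x_t - eta_t * g_t with g_t in the subdifferential of f,
    using the decreasing stepsize eta_t = eta0 / sqrt(t + 1)."""
    x = x0.copy()
    for t in range(iters):
        x = x - (eta0 / np.sqrt(t + 1)) * subgrad_f(x)
    return x

# f(x) = 1/2 ||Ax - b||^2 + lam * ||x||_1.
# np.sign(0) == 0 is a valid element of the subdifferential of |.| at 0.
A = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([1.0, 0.0])
lam = 0.5
subgrad = lambda x: A.T @ (A @ x - b) + lam * np.sign(x)
print(subgradient_descent(subgrad, np.zeros(2)))  # expect approximately [0.5, 0]
```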
Online Learning and Optimization
- Goal of machine learning: minimize the expected loss, given samples.
- This is stochastic optimization.
- Assume the loss function is convex.
Batch (Sub)gradient Descent for ML
- Process all examples together in each step.
- The entire training set is examined at each step.
- Very slow when n is very large.
Stochastic (Sub)gradient Descent
- "Optimize" one example at a time.
- Choose examples randomly (or reorder and choose in order).
- Learning is then representative of the example distribution.
- Equivalent to online learning (the weight vector w changes with every example).
- Convergence is guaranteed for convex functions (to a local minimum, which for a convex function is also the global minimum); see the sketch below.
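A minimal SGD sketch (my own toy least-squares example, not from the slides); note the decaying stepsize, anticipating the issues slide below:

```python
import numpy as np

def sgd(grad_i, n, x0, eta0=0.1, epochs=20, seed=0):
    """Stochastic (sub)gradient descent: one example per step, with the
    decaying stepsize eta_t = eta0 / sqrt(1 + t) (one common schedule)."""
    rng = np.random.default_rng(seed)
    x, t = x0.copy(), 0
    for _ in range(epochs):
        for i in rng.permutation(n):          # reshuffle the examples each epoch
            x = x - (eta0 / np.sqrt(1 + t)) * grad_i(x, i)
            t += 1
    return x

# Toy data: least squares with per-example loss (x_i^T w - y_i)^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=1000)
grad_i = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]
print(sgd(grad_i, n=1000, x0=np.zeros(3)))    # should approach [2, -1, 0.5]
```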
Hybrid!
- Stochastic: 1 example per iteration.
- Batch: all the examples!
- Sample Average Approximation (SAA): sample m examples at each step and perform SGD on them (see the mini-batch sketch below).
- Allows for parallelization, but the choice of m is based on heuristics.
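A sketch of the SAA/mini-batch variant under the same assumptions as the SGD sketch above; the m per-step gradients are independent, which is what makes the parallelization possible:

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.1, m=32, steps=2000, seed=0):
    """SAA/mini-batch SGD: average the gradients of m sampled examples per step.
    The m gradients are independent, so they could be computed in parallel."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.integers(0, len(y), size=m)                 # sample m examples
        grads = 2.0 * (X[idx] @ w - y[idx])[:, None] * X[idx]
        w -= eta * grads.mean(axis=0)                         # averaged gradient
    return w
```

With the toy X, y from the previous sketch, minibatch_sgd(X, y) should again approach [2, -1, 0.5].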
SGD: Issues
- Convergence is very sensitive to the learning rate (η): oscillations near the solution arise from the probabilistic nature of the sampling.
- η might need to decrease with time to ensure the algorithm eventually converges.
- Bottom line: SGD is good for machine learning with large datasets!
Statistical Learning Crash Course
A staggering amount of machine learning/statistics can be written as
  min_x Σ_{i=1}^N f(x, y_i)
where N (the number of y_i's, i.e., the data) is typically in the billions.
Examples: classification, recommendation, deep learning.
The de facto iteration for solving large-scale problems is SGD: billions of tiny iterations
  x_{k+1} = x_k − α N ∇f(x_k, y_j)
where we select one term, j, at random and use it to estimate the gradient.
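A quick numerical sanity check (mine, not from the slides) that the single-term estimate N∇f(x_k, y_j), with j drawn uniformly, is an unbiased estimate of the gradient of the sum:

```python
import numpy as np

# Objective: sum_{i=1}^N f(x, y_i) with f(x, y) = 0.5 * (x - y)^2, so
# grad_x f(x, y) = x - y and the full gradient is sum_i (x - y_i).
rng = np.random.default_rng(0)
N = 1_000_000
y = rng.normal(size=N)
x = 0.7

full_grad = np.sum(x - y)               # exact gradient of the sum
j = rng.integers(0, N, size=10_000)     # uniformly sampled terms
estimates = N * (x - y[j])              # N times the gradient of one term
print(full_grad, estimates.mean())      # the sample mean should be close
```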
Parallel SGD (Centralized)
Model workers compute updates Δw on their data shards and send them to a parameter server, which applies w = w − αΔw and sends the new w back to the workers.
- Centralized parameter updates guarantee convergence.
- The parameter server is a bottleneck.
Parallel SGD (HogWild!, Asynchronous): A Data Systems Perspective of SGD
Insane conflicts: billions of tiny jobs (~100 instructions each), with read/write conflicts on x:
  x_{k+1} = x_k − α N ∇f(x_k, y_j)
Multiple workers need to communicate! HogWild!: for sparse convex models (e.g., logistic regression), run without locks; SGD still converges (the answer is statistically correct). A toy sketch follows.
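A toy lock-free sketch in the HogWild! spirit (my own illustration; in CPython the GIL serializes much of the work, so this shows the access pattern rather than a real speedup):

```python
import numpy as np
import threading

# Several threads run SGD on sparse logistic regression over a shared weight
# vector w with NO locks. Races on individual coordinates are tolerated; with
# sparse examples the conflicts are rare and SGD still converges statistically.
rng = np.random.default_rng(0)
n, d = 5000, 100
X = rng.normal(size=(n, d)) * (rng.random((n, d)) < 0.05)   # sparse-ish features
y = np.sign(X @ rng.normal(size=d) + 1e-9)

w = np.zeros(d)        # shared state, updated without synchronization
eta = 0.1

def worker(rows):
    for i in rows:
        nz = np.nonzero(X[i])[0]                    # only the nonzero coordinates
        m = y[i] * (X[i, nz] @ w[nz])               # margin
        g = -y[i] * X[i, nz] / (1.0 + np.exp(m))    # sparse logistic gradient
        w[nz] -= eta * g                            # unsynchronized read-modify-write

threads = [threading.Thread(target=worker, args=(part,))
           for part in np.array_split(rng.permutation(n), 4)]
for t in threads: t.start()
for t in threads: t.join()
print("training accuracy:", np.mean(np.sign(X @ w) == y))   # should fit the toy data well
```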