Optimization in Machine Learning

Tong Zhang

Rutgers University

T. Zhang (Rutgers) Optimization 1 / 24

Topics

Gradient Descent
Proximal Projection Method
Coordinate Descent
Convex Duality and Dual Coordinate Descent
LBFGS


Supervised Learning

Training data: $(X_i, Y_i)$ $(i = 1, \ldots, n)$
Example: linear prediction function $w^T x$
Training algorithm: SVM

$$w = \arg\min_w \left[ n^{-1} \sum_{i=1}^n (1 - w^T X_i Y_i)_+ + \lambda w^T w \right].$$

This is an optimization problem: how to find w?


Unconstrained Optimization

Consider a general unconstrained optimization problem:

$$w_* = \arg\min_w f(w).$$

How to find the optimal solution?
Global solution: $w$ such that $f(w) \le f(w')$ for all $w'$.
Local solution: $w$ such that $f(w) \le f(w')$ when $w'$ is close to $w$.
A global solution is a local solution, but not necessarily vice versa.
For differentiable $f$, a local solution (and thus also a global solution) satisfies $\nabla f(w) = 0$.
For convex problems, local and global solutions are the same.


Gradient Descent

$$w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}).$$

How fast does this method converge to the optimal solution?

General result: it converges to a local minimum under suitable conditions.
What is the convergence rate?
Answer: it depends on the conditions satisfied by $f(\cdot)$.
This lecture focuses on convex problems.
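The update rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy objective `f`, its gradient `grad_f`, the constant step size 0.1, and the step count are all made up for the example and do not come from the lecture.

```python
def gradient_descent(grad, w0, eta, num_steps):
    """Run w_k = w_{k-1} - eta * grad(w_{k-1}) with a constant step size."""
    w = list(w0)
    for _ in range(num_steps):
        g = grad(w)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w


def f(w):
    # Toy smooth, strongly convex objective (an assumption for illustration).
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2


def grad_f(w):
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)]


# Hypothetical settings: start at the origin, constant step size 0.1.
w_hat = gradient_descent(grad_f, [0.0, 0.0], eta=0.1, num_steps=200)
```

For this objective the minimizer is $(1, -0.5)$, and each coordinate error shrinks by a constant factor per step, which is the geometric behaviour expected for smooth and strongly convex problems.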



Convexity

For all $\alpha \in [0,1]$, we have

$$f(\alpha x + (1-\alpha) x') \le \alpha f(x) + (1-\alpha) f(x').$$

A subgradient $\nabla f(x_0)$ at $x_0$ satisfies, for all $x$:

$$f(x) \ge f(x_0) + (x - x_0)^T \nabla f(x_0).$$

The subgradient generalizes the gradient to functions that may not be differentiable.
A subgradient $v_0$ is not necessarily unique:
for $f(x) = |x|$ ($x \in \mathbb{R}$) at $x_0 = 0$, any $v_0 \in [-1,1]$ satisfies the requirement (and is thus a subgradient).

In the following we assume that a subgradient always exists.


Common Conditions of Objective Function

Convexity:

$$f(x') - f(x) - \nabla f(x)^T (x' - x) \ge 0.$$

Nonsmooth: the first-order derivative may be discontinuous, e.g. hinge loss or L1 regularization.
Smooth: the first-order derivative is Lipschitz, or the second-order derivative is bounded:

$$f(x') - f(x) - \nabla f(x)^T (x' - x) \le \frac{L}{2} \|x' - x\|_2^2.$$

Strongly convex:

$$f(x') - f(x) - \nabla f(x)^T (x' - x) \ge \frac{\mu}{2} \|x' - x\|_2^2.$$

Convexity is the case where the above is satisfied with $\mu = 0$.
A function can be strongly convex and nonsmooth: $f(x) = x^2 + |x|$.


Results

Smooth and strongly convex: gradient descent with a sufficiently small constant $\eta_k$ has linear (or geometric) convergence:

$$f(w_k) - f(w_*) = O(\gamma^k)$$

for some $\gamma < 1$.
Smooth but not strongly convex:

$$f(w_k) - f(w_*) = O(1/k),$$

with learning rate $\eta_k = O(1/k)$.
Nonsmooth:

$$f(\bar{w}_k) - f(w_*) = O(1/\sqrt{k}),$$

for $\eta_k = O(1/\sqrt{k})$ and the averaged iterate $\bar{w}_k = k^{-1} \sum_{j=1}^k w_j$.

The learning rate can be tuned with line search.
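The nonsmooth case can be illustrated with a small sketch of subgradient descent with iterate averaging. The one-dimensional objective $f(w) = |w - 1|$, the starting point, and the step size $\eta_k = 1/\sqrt{k}$ are assumptions chosen for the example; the function names are hypothetical.

```python
import math


def subgradient_abs(w, target):
    """Return one subgradient of f(w) = |w - target| (0 is chosen at the kink)."""
    if w > target:
        return 1.0
    if w < target:
        return -1.0
    return 0.0


def averaged_subgradient_descent(w0, target, num_steps):
    """Subgradient steps with eta_k = 1/sqrt(k); returns the average of w_1..w_k."""
    w = w0
    running_sum = 0.0
    for k in range(1, num_steps + 1):
        eta_k = 1.0 / math.sqrt(k)
        w = w - eta_k * subgradient_abs(w, target)
        running_sum += w
    return running_sum / num_steps


w_bar = averaged_subgradient_descent(w0=0.3, target=1.0, num_steps=10000)
gap = abs(w_bar - 1.0)  # equals f(w_bar) - f(w_*) for this objective
```

The individual iterates keep oscillating around the minimizer with amplitude on the order of $\eta_k$, which is why the guarantee is stated for the averaged iterate rather than the last one.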


Reformulation of Gradient Descent

Gradient descent can be derived from:

$$w_k = \arg\min_w Q_k(w),$$

$$Q_k(w) := f(w_{k-1}) + \nabla f(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k} \|w - w_{k-1}\|_2^2.$$

Key properties (assume smoothness for simplicity, and $1/\eta_k \ge L$, the smoothness parameter of $f$):

$Q_k(w_{k-1}) = f(w_{k-1})$
$Q_k(w) \ge f(w)$
$Q_k(w)$ is easy to optimize

Consequence: minimizing $Q_k(w)$ reduces the objective value of $f(w)$:

$$f(w_{k-1}) - f(w_k) \ge Q_k(w_{k-1}) - Q_k(w_k).$$

This idea can be generalized to other convex upper bounds of $f(w)$.
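As a worked check of the reformulation, setting the gradient of the quadratic model to zero recovers the gradient descent step, and substituting back gives the size of the guaranteed decrease (using only the definitions above):

$$\nabla Q_k(w) = \nabla f(w_{k-1}) + \frac{1}{\eta_k}(w - w_{k-1}) = 0 \;\Longrightarrow\; w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}),$$

$$Q_k(w_{k-1}) - Q_k(w_k) = \eta_k \|\nabla f(w_{k-1})\|_2^2 - \frac{\eta_k}{2} \|\nabla f(w_{k-1})\|_2^2 = \frac{\eta_k}{2} \|\nabla f(w_{k-1})\|_2^2.$$

Combined with $f(w_k) \le Q_k(w_k)$ and $Q_k(w_{k-1}) = f(w_{k-1})$, each step decreases $f$ by at least $\frac{\eta_k}{2} \|\nabla f(w_{k-1})\|_2^2$.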






Proximal Gradient Method

Assume

$$f(w) = \phi(w) + g(w),$$

then we may consider the following upper bound of $f(w)$:

$$Q_k(w) := \phi(w_{k-1}) + \nabla \phi(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k} \|w - w_{k-1}\|_2^2 + g(w),$$

with $1/\eta_k$ larger than the smoothness parameter of $\phi$. Then solve for

$$w_k = \arg\min_w Q_k(w).$$

We assume that this minimization problem is easy.

This generalization of gradient descent is called proximal gradient descent.
It is useful when $g(w)$ is a simple nonsmooth function such as L1 regularization, $g(w) = \lambda \|w\|_1$.


Example: L1 regularization

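$$f(w) = \sum_{i=1}^n (w^T x_i - y_i)^2 + \lambda \|w\|_1.$$

For example, $\phi(w) = \sum_{i=1}^n (w^T x_i - y_i)^2$ and $g(w) = \lambda \|w\|_1$. Then

$$Q_k(w) := \phi(w_{k-1}) + \nabla \phi(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k} \|w - w_{k-1}\|_2^2 + \lambda \|w\|_1.$$

The solution is

$$w_k = \mathrm{trunc}(w_{k-1} - \eta_k \nabla \phi(w_{k-1})),$$

where

$$\mathrm{trunc}([u_1, \ldots, u_d]) = [\mathrm{trunc}(u_j)]_{j=1,\ldots,d}, \qquad \mathrm{trunc}(u_j) = \mathrm{sign}(u_j)\,(|u_j| - \lambda \eta_k)_+.$$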
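The truncation update can be sketched in plain Python as follows. The function names, the toy identity design matrix, and the values of `lam`, `eta`, and `num_steps` are assumptions for illustration. With an identity design the objective separates into $(w_j - y_j)^2 + \lambda |w_j|$, whose minimizer is $\mathrm{sign}(y_j)(|y_j| - \lambda/2)_+$, so the result can be checked by hand.

```python
import math


def soft_threshold(u, t):
    """trunc(u) = sign(u) * max(|u| - t, 0)."""
    return math.copysign(max(abs(u) - t, 0.0), u)


def grad_phi(w, X, y):
    """Gradient of phi(w) = sum_i (w^T x_i - y_i)^2."""
    g = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        r = sum(wj * xj for wj, xj in zip(w, x_i)) - y_i
        for j, xj in enumerate(x_i):
            g[j] += 2.0 * r * xj
    return g


def proximal_gradient(X, y, lam, eta, num_steps):
    """Iterate w_k = trunc(w_{k-1} - eta * grad_phi(w_{k-1})) with threshold lam * eta."""
    w = [0.0] * len(X[0])
    for _ in range(num_steps):
        g = grad_phi(w, X, y)
        w = [soft_threshold(wj - eta * gj, lam * eta) for wj, gj in zip(w, g)]
    return w


# Toy data (hypothetical): identity design, so phi has smoothness parameter 2
# and eta = 0.25 satisfies 1/eta >= 2.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
w_hat = proximal_gradient(X, y, lam=1.0, eta=0.25, num_steps=100)
```

Here the second coordinate is set exactly to zero by the truncation, which illustrates why this update is attractive for L1 regularization.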


Property of Proximal Gradient

Smoothness depends on $\phi$ rather than $f$ (can tolerate nonsmooth $g$).
Convergence is similar to gradient descent:
if $\phi$ is smooth and $f$ is strongly convex: convergence is linear;
if $\phi$ is smooth but $f$ is not strongly convex: convergence is $1/k$;
if $\phi$ is not smooth: convergence is $1/\sqrt{k}$.


Nesterov’s Accelerated Gradient (one version)

Procedure:
Pick $\eta_1, \eta_2, \ldots \ge 0$.
Pick $w_1 = y_1 = z_0$.
Define $\alpha_0 = 0$ and $\alpha_i^{-2} - \alpha_i^{-1} = \alpha_{i-1}^{-2}$ for $i \ge 1$ (may also set $\alpha_i = (1 + i/2)^{-1}$).
Iterate for $i = 1, 2, \ldots, T$:

$$z_i = \arg\min_z \left[ g(z) + \frac{1}{2\eta_i} \|z\|_2^2 - \left(\eta_i^{-1} z_{i-1} - \alpha_i^{-1} \nabla \phi(y_i)\right)^T z \right],$$

$$w_i = (1 - \alpha_{i-1}) w_{i-1} + \alpha_{i-1} z_i,$$

$$y_{i+1} = (1 - \alpha_i) w_i + \alpha_i z_i.$$

Advantage: faster convergence of $1/k^2$ for smooth $\phi$.
Disadvantage: for smooth and strongly convex $f$, the algorithm has to be modified to achieve geometric convergence, and the modification depends on the strong convexity parameter $\mu$.
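For intuition, here is a sketch of a different, commonly used two-sequence accelerated proximal gradient variant (often called FISTA). It is not the three-sequence procedure listed above; it is swapped in because it is short. The separable toy objective $\sum_j (w_j - a_j)^2 + \lambda |w_j|$, the function names, and all parameter values are assumptions for illustration.

```python
import math


def soft_threshold(u, t):
    return math.copysign(max(abs(u) - t, 0.0), u)


def fista_separable(a, lam, eta, num_steps):
    """Accelerated proximal gradient on sum_j (w_j - a_j)^2 + lam * |w_j|."""
    w = [0.0] * len(a)
    y = list(w)   # extrapolated point where the gradient is evaluated
    t = 1.0       # momentum sequence
    for _ in range(num_steps):
        grad = [2.0 * (yj - aj) for yj, aj in zip(y, a)]
        w_new = [soft_threshold(yj - eta * gj, lam * eta) for yj, gj in zip(y, grad)]
        t_new = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        beta = (t - 1.0) / t_new
        # Extrapolate along the direction of the last move.
        y = [wn + beta * (wn - wo) for wn, wo in zip(w_new, w)]
        w, t = w_new, t_new
    return w


w_acc = fista_separable(a=[3.0, 0.2], lam=1.0, eta=0.25, num_steps=200)
```

The only change relative to plain proximal gradient is that the gradient is evaluated at an extrapolated point `y` instead of the current iterate `w`.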


Beyond First Order Method: LBFGS (high level view)

Recall gradient descent: successive minimization of

$$Q_k(w) = f(w_{k-1}) + \nabla f(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k} \|w - w_{k-1}\|_2^2,$$

which is an upper bound of $f(w)$.

Locally, a more accurate approximation of $f$ uses the Hessian $H$:

$$Q_k(w) = f(w_{k-1}) + \nabla f(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2} (w - w_{k-1})^T H (w - w_{k-1}).$$

BFGS: approximate $H$ using first-order gradients.
LBFGS: use limited memory (store a few vectors) to approximate $H$.
Very effective for optimization of smooth objective functions.
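As a worked step, minimizing the Hessian-based model (assuming $H$ is positive definite) gives the Newton-type update that quasi-Newton methods approximate:

$$\nabla Q_k(w) = \nabla f(w_{k-1}) + H (w - w_{k-1}) = 0 \;\Longrightarrow\; w_k = w_{k-1} - H^{-1} \nabla f(w_{k-1}).$$

Gradient descent corresponds to the special case $H = \eta_k^{-1} I$; BFGS and LBFGS instead build an approximation of $H$ (or of $H^{-1}$) from differences of successive gradients.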


Coordinate Descent (CD)

Let $f(w) = f([w_1, \ldots, w_d])$.
Algorithm:

for $j = 1, \ldots, d$:
    $w_j \leftarrow \arg\min_u f([w_1, \ldots, w_{j-1}, u, w_{j+1}, \ldots, w_d])$
repeat until convergence

Idea: optimize one parameter at a time while fixing the others.

Assumptions:
each one-dimensional problem can be solved easily;
each coordinate update for variable $j$ is inexpensive compared to a gradient descent step.
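A minimal sketch of the algorithm, assuming a quadratic objective $f(w) = \frac{1}{2} w^T A w - b^T w$ with symmetric positive definite $A$ so that each one-dimensional problem has a closed-form solution. The objective, the function name, and the toy values of `A` and `b` are assumptions for the example.

```python
def coordinate_descent_quadratic(A, b, num_passes):
    """Minimize f(w) = 0.5 * w^T A w - b^T w by exact coordinate minimization.

    Assumes A is symmetric positive definite (so each A[j][j] > 0).
    """
    d = len(b)
    w = [0.0] * d
    for _ in range(num_passes):
        for j in range(d):
            # Solve df/dw_j = sum_k A[j][k] * w[k] - b[j] = 0 for w[j], others fixed.
            off_diag = sum(A[j][k] * w[k] for k in range(d) if k != j)
            w[j] = (b[j] - off_diag) / A[j][j]
    return w


A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
w_cd = coordinate_descent_quadratic(A, b, num_passes=50)
```

The minimizer solves $A w = b$, which for these values is $w = (0.2, 0.6)$.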



Linear Regularization Problem

Consider regularized logistic regression:

$$w = \arg\min_w \left[ \sum_{i=1}^n \ln(1 + \exp(-w^T x_i y_i)) + \lambda \|w\|_1 \right]$$

or more generally the following problem with scalar functions $f_i$ and $h_j$:

$$w = \arg\min_w \left[ \sum_{i=1}^n f_i(w^T x_i) + \sum_{j=1}^d h_j(w_j) \right].$$

Iteration complexity:
maintain $z_i = w^T x_i$ for $i = 1, \ldots, n$;
coordinate $j$: updating $w_j$ and $\{z_i\}$ requires scanning one feature column;
one pass over $j = 1, \ldots, d$ costs about the same as one gradient descent step.
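The bookkeeping with maintained $z_i$ can be sketched as follows. To keep each one-dimensional problem in closed form, this sketch assumes squared loss $f_i(z) = (z - y_i)^2$ and $h_j(w_j) = \lambda |w_j|$ rather than the logistic loss above; the function names, the column-wise data layout, and the toy data are also assumptions.

```python
import math


def soft_threshold(u, t):
    return math.copysign(max(abs(u) - t, 0.0), u)


def cd_l1_least_squares(columns, y, lam, num_passes):
    """Coordinate descent for sum_i (w^T x_i - y_i)^2 + lam * ||w||_1.

    columns[j][i] is feature j of example i; z[i] = w^T x_i is kept up to date.
    """
    d, n = len(columns), len(y)
    w = [0.0] * d
    z = [0.0] * n
    for _ in range(num_passes):
        for j in range(d):
            col = columns[j]
            s = sum(c * c for c in col)
            if s == 0.0:
                continue
            # Correlation of column j with the residual that excludes w_j.
            rho = sum(c * (y[i] - z[i] + w[j] * c) for i, c in enumerate(col))
            w_new = soft_threshold(rho, lam / 2.0) / s
            delta = w_new - w[j]
            for i, c in enumerate(col):
                z[i] += delta * c  # one scan of the feature column
            w[j] = w_new
    return w, z


columns = [[1.0, 0.0], [0.0, 1.0]]  # identity design (toy data)
y = [3.0, 0.2]
w_cd, z_cd = cd_l1_least_squares(columns, y, lam=1.0, num_passes=5)
```

Each coordinate update touches only one feature column, so a full pass over all $d$ coordinates reads the data once, which matches the cost of one gradient evaluation.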


Convergence

Practice:
for suitable problems, coordinate descent works much better than gradient descent, e.g., regularized logistic regression.

Theory: incomplete.
Current analyses either show no improvement or show improvement only under very restricted scenarios (Paul Tseng, Yurii Nesterov, ...).

It is still an open question to develop a better theoretical understanding of when coordinate descent performs better.


Convex Duality

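Given a convex function $f(w)$, we can define its conjugate or dual

$$f^*(\alpha) = \sup_w \left[ w^T \alpha - f(w) \right].$$

The optimal $w$ satisfies $\alpha = \nabla f(w)$.

The dual of $f^*$ is $f$:

$$f(w) = \sup_\alpha \left[ w^T \alpha - f^*(\alpha) \right].$$

The optimal $\alpha$ satisfies $\nabla f^*(\alpha) = w$.
We have the following property: for all $w$ and $\alpha$,

$$f(w) + f^*(\alpha) \ge w^T \alpha,$$

and equality holds only at $w = \nabla f^*(\alpha)$, which is equivalent to $\alpha = \nabla f(w)$.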
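A short worked example, using $f(w) = \frac{1}{2} \|w\|_2^2$ as an assumed illustration: the supremum in the definition of the conjugate is attained at $w = \alpha$, so

$$f^*(\alpha) = \sup_w \left[ w^T \alpha - \tfrac{1}{2} \|w\|_2^2 \right] = \tfrac{1}{2} \|\alpha\|_2^2,$$

$$f(w) + f^*(\alpha) - w^T \alpha = \tfrac{1}{2} \|w\|_2^2 + \tfrac{1}{2} \|\alpha\|_2^2 - w^T \alpha = \tfrac{1}{2} \|w - \alpha\|_2^2 \ge 0.$$

Equality holds exactly when $w = \alpha$, which is $\alpha = \nabla f(w)$ for this choice of $f$.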



Dual of Linear Regularization Method

Primal optimization problem:

$$w_* = \arg\min_w P(w), \qquad P(w) := \sum_{i=1}^n f_i(w^T x_i) + \lambda g(w).$$

Dual optimization problem:

$$\alpha_* = \arg\max_\alpha D(\alpha), \qquad D(\alpha) = \sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\Big(\lambda^{-1} \sum_{i=1}^n \alpha_i x_i\Big).$$

Strong duality:
$P(w) \ge D(\alpha)$ for all $w$ and $\alpha$;
$P(w_*) = D(\alpha_*)$, with the relationship

$$w_* = \nabla g^*\Big(\lambda^{-1} \sum_{i=1}^n \alpha_{*,i} x_i\Big), \qquad \alpha_{*,i} = -f_i'(w_*^T x_i).$$

Solve the dual instead of the primal problem.

Quick Justification of Strong Duality

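$$\begin{aligned}
P(w) - D(\alpha) &= \left[ \sum_{i=1}^n f_i(w^T x_i) + \lambda g(w) \right] - \left[ \sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\Big(\lambda^{-1} \sum_{i=1}^n \alpha_i x_i\Big) \right] \\
&= \sum_{i=1}^n \left[ f_i(w^T x_i) + f_i^*(-\alpha_i) + \alpha_i w^T x_i \right] + \lambda \left[ g(w) + g^*\Big(\lambda^{-1} \sum_{i=1}^n \alpha_i x_i\Big) - w^T \Big(\lambda^{-1} \sum_{i=1}^n \alpha_i x_i\Big) \right] \ge 0.
\end{aligned}$$

Each bracket is nonnegative by the conjugate inequality $f(w) + f^*(\alpha) \ge w^T \alpha$.

Equality holds at $f_i'(w^T x_i) = -\alpha_i$ and $w = \nabla g^*(\lambda^{-1} \sum_i \alpha_i x_i)$.

One can check that this gives the first-order optimality conditions for $w_*$ and $\alpha_*$.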



Example: Linear Support Vector Machine

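Primal formulation:

$$P(w) = \sum_{i=1}^n (1 - w^T x_i y_i)_+ + 0.5 \lambda \|w\|_2^2,$$

with $f_i(u) = (1 - u y_i)_+$ and $g(w) = 0.5 \|w\|_2^2$.

Dual formulation:

$$D(\alpha) = \sum_{i=1}^n \alpha_i y_i - 0.5 \lambda^{-1} \Big\| \sum_{i=1}^n \alpha_i x_i \Big\|_2^2, \qquad \alpha_i y_i \in [0,1],$$

where $-f_i^*(-\alpha_i) = \alpha_i y_i$ with the constraint $\alpha_i y_i \in [0,1]$, and $g^*(w) = 0.5 \|w\|_2^2$.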
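The two conjugates used here can be worked out directly, assuming labels $y_i \in \{-1, +1\}$. Substituting $v = u y_i$ in the definition of the conjugate of the hinge loss $f_i(u) = (1 - u y_i)_+$:

$$f_i^*(a) = \sup_v \left[ (a y_i) v - (1 - v)_+ \right] = \begin{cases} a y_i & \text{if } a y_i \in [-1, 0], \\ +\infty & \text{otherwise,} \end{cases}$$

so $-f_i^*(-\alpha_i) = \alpha_i y_i$ with $\alpha_i y_i \in [0,1]$. For the regularizer $g(w) = 0.5 \|w\|_2^2$:

$$g^*(v) = \sup_w \left[ w^T v - 0.5 \|w\|_2^2 \right] = 0.5 \|v\|_2^2, \qquad \lambda g^*\Big(\lambda^{-1} \sum_{i=1}^n \alpha_i x_i\Big) = 0.5 \lambda^{-1} \Big\| \sum_{i=1}^n \alpha_i x_i \Big\|_2^2.$$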


Dual Coordinate Descent

Dual optimization problem:

$$\alpha_* = \arg\max_\alpha D(\alpha), \qquad D(\alpha) = \sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\Big(\lambda^{-1} \sum_{i=1}^n \alpha_i x_i\Big).$$

Apply coordinate descent on the dual:
maintain $w = \lambda^{-1} \sum_i \alpha_i x_i$;
for $i = 1, \ldots, n$, update $\alpha_i$ one at a time while fixing the others.

Computation: the total computation of one pass over the data is comparable to one gradient descent step.
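A minimal sketch of dual coordinate descent for the linear SVM, under stated assumptions: labels $y_i \in \{-1, +1\}$, the primal $\sum_i (1 - w^T x_i y_i)_+ + 0.5 \lambda \|w\|_2^2$, and the reparametrization $\beta_i = \alpha_i y_i \in [0,1]$, so that $w = \lambda^{-1} \sum_i \beta_i y_i x_i$. The function names and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def primal(w, X, y, lam):
    hinge = sum(max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y))
    return hinge + 0.5 * lam * dot(w, w)


def dual(beta, w, lam):
    # D = sum_i beta_i - 0.5 * lam * ||w||^2 when w = (1/lam) * sum_i beta_i y_i x_i.
    return sum(beta) - 0.5 * lam * dot(w, w)


def dual_coordinate_descent(X, y, lam, num_passes):
    n, d = len(X), len(X[0])
    beta = [0.0] * n  # beta_i = alpha_i * y_i, constrained to [0, 1]
    w = [0.0] * d     # maintained as (1/lam) * sum_i beta_i y_i x_i
    for _ in range(num_passes):
        for i in range(n):
            sq_norm = dot(X[i], X[i])
            if sq_norm == 0.0:
                continue
            margin = y[i] * dot(w, X[i])
            # Exact one-dimensional maximization, then projection onto [0, 1].
            new_beta = min(1.0, max(0.0, beta[i] + lam * (1.0 - margin) / sq_norm))
            step = (new_beta - beta[i]) * y[i] / lam
            w = [wj + step * xj for wj, xj in zip(w, X[i])]
            beta[i] = new_beta
    return w, beta


# Toy separable data (hypothetical).
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1.0, 1.0, -1.0, -1.0]
lam = 1.0
w_svm, beta_svm = dual_coordinate_descent(X, y, lam, num_passes=100)
gap = primal(w_svm, X, y, lam) - dual(beta_svm, w_svm, lam)
```

The duality gap `gap` is nonnegative up to rounding and can serve as a stopping criterion; each coordinate update costs one pass over a single example $x_i$.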


Convergence

Previous analysis of the method only shows slow convergence.

Our new analysis (work in progress with Shai Shalev-Shwartz): to achieve accuracy $\epsilon$,

for smooth loss (e.g., logistic), the method requires

$$O\left( \ln n + \frac{\ln(1/\epsilon)}{n} \right) \text{ passes over the data}$$

(gradient descent: $O(\ln(1/\epsilon))$);

for nonsmooth loss (e.g., SVM), the method requires

$$O\left( \ln n + \frac{1}{n \epsilon} \right) \text{ passes over the data,}$$

and convergence becomes geometric asymptotically (gradient descent: $O(1/\epsilon)$).


References

LBFGS: Dong C. Liu and Jorge Nocedal, "On the limited memory BFGS method for large scale optimization", Mathematical Programming, 1989.
Stephen Boyd and Lieven Vandenberghe: Convex Optimization (book), http://www.stanford.edu/~boyd/cvxbook/
Yurii Nesterov (proximal gradient and accelerated proximal gradient): Introductory Lectures on Convex Optimization: A Basic Course; Gradient methods for minimizing composite objective function.
Arkadi Nemirovski: optimization lecture notes, http://www2.isye.gatech.edu/~nemirovs/

