Optimization in Machine Learning
Tong Zhang
Rutgers University
T. Zhang (Rutgers) Optimization 1 / 24
Topics
Gradient Descent
Proximal Projection Method
Coordinate Descent
Convex Duality and Dual Coordinate Descent
LBFGS
Supervised Learning
Training data: $(X_i, Y_i)$ $(i = 1, \ldots, n)$
Example: linear prediction function $w^T x$
Training algorithm: SVM
$$w = \arg\min_w \left[ n^{-1} \sum_{i=1}^n (1 - w^T X_i Y_i)_+ + \lambda w^T w \right].$$
This is an optimization problem: how to find w?
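To make the objective concrete, the following minimal sketch evaluates the SVM formula above in plain Python. The function name `svm_objective` and the toy data are hypothetical choices for illustration, not part of the lecture.

```python
def svm_objective(w, X, Y, lam):
    """Average hinge loss plus lam * w^T w, following the SVM formula above."""
    n = len(X)
    hinge_sum = 0.0
    for x_i, y_i in zip(X, Y):
        margin = y_i * sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        hinge_sum += max(0.0, 1.0 - margin)  # (1 - w^T X_i Y_i)_+
    return hinge_sum / n + lam * sum(w_j * w_j for w_j in w)

# Hypothetical toy data: two points per class in the plane.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0]]
Y = [1, 1, -1, -1]

# At w = 0 every hinge term equals 1, so the objective is 1.
obj_at_zero = svm_objective([0.0, 0.0], X, Y, lam=0.1)
# At w = (1, 1) all margins are at least 1, so only the penalty lam * w^T w remains.
obj_at_sep = svm_objective([1.0, 1.0], X, Y, lam=0.1)
```

Finding the `w` that minimizes this function is the optimization problem the rest of the lecture addresses.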
Unconstrained Optimization
Consider a general unconstrained optimization problem:
$$w_* = \arg\min_w f(w).$$
How to find the optimal solution?
global solution: $w$ such that $f(w) \le f(w')$ for all $w'$.
local solution: $w$ such that $f(w) \le f(w')$ when $w'$ is close to $w$.
a global solution is a local solution, but not necessarily vice versa.
a local optimal (and thus global optimal) solution satisfies $\nabla f(w) = 0$.
for convex problems: local and global solutions are the same.
Gradient Descent
$$w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}).$$
How fast does this method converge to the optimal solution?
General result: converges to a local minimum under suitable conditions.
What's the convergence rate?
Answer: it depends on the conditions satisfied by $f(\cdot)$.
This lecture focuses on convex problems.
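The update rule can be sketched in a few lines of Python. This is a minimal illustration with a constant step size; the quadratic objective and the names `gradient_descent` and `grad_f` are assumptions made for the example.

```python
def gradient_descent(grad, w0, eta, num_steps):
    """Iterate w_k = w_{k-1} - eta * grad(w_{k-1}) with a constant step size eta."""
    w = list(w0)
    for _ in range(num_steps):
        g = grad(w)
        w = [w_j - eta * g_j for w_j, g_j in zip(w, g)]
    return w

# Hypothetical objective: f(w) = (w_1 - 1)^2 + 5 * (w_2 + 2)^2, minimized at (1, -2).
def grad_f(w):
    return [2.0 * (w[0] - 1.0), 10.0 * (w[1] + 2.0)]

w_final = gradient_descent(grad_f, w0=[0.0, 0.0], eta=0.1, num_steps=200)
```

With `eta = 0.1` the error in the first coordinate shrinks by a factor of 0.8 per step, which is an instance of the geometric convergence discussed later for smooth, strongly convex objectives.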
Convexity
For all $\alpha \in [0,1]$, we have
$$f(\alpha x + (1-\alpha) x') \le \alpha f(x) + (1-\alpha) f(x').$$
A subgradient $\nabla f(x_0)$ at $x_0$ satisfies:
$$f(x) \ge f(x_0) + (x - x_0)^T \nabla f(x_0)$$
It generalizes the gradient to functions that are not differentiable.
A subgradient $v_0$ is not necessarily unique:
  $f(x) = |x|$ $(x \in \mathbb{R})$
  at $x_0 = 0$: any $v_0 \in [-1,1]$ satisfies the requirement (and is thus a subgradient)
In the following we assume that a subgradient always exists.
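As a quick check of the $|x|$ example, the subgradient inequality at $x_0 = 0$ can be written out directly:

```latex
|x| \;\ge\; |0| + (x - 0)\, v_0 \;=\; v_0\, x \quad \text{for all } x \in \mathbb{R}
\quad\Longleftrightarrow\quad |v_0| \le 1 .
```

If $|v_0| \le 1$ then $v_0 x \le |v_0|\,|x| \le |x|$; if $v_0 > 1$ the choice $x = 1$ violates the inequality, and if $v_0 < -1$ the choice $x = -1$ does.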
Common Conditions of Objective Function
Convexity:
$$f(x') - f(x) - \nabla f(x)^T (x' - x) \ge 0$$
Nonsmooth: first order derivative may be discontinuous, e.g. hinge loss or L1 regularization
Smooth: first order derivative is Lipschitz, or second order derivative is bounded:
$$f(x') - f(x) - \nabla f(x)^T (x' - x) \le \frac{L}{2} \|x' - x\|_2^2$$
Strongly convex:
$$f(x') - f(x) - \nabla f(x)^T (x' - x) \ge \frac{\mu}{2} \|x' - x\|_2^2$$
Convexity is the case where the above is satisfied with $\mu = 0$.
A function can be strongly convex and nonsmooth: $f(x) = x^2 + |x|$.
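A worked example may help fix the two constants. The quadratic below is an illustrative assumption, with $A$ a symmetric positive semidefinite matrix not taken from the slides:

```latex
f(x) = \tfrac{1}{2}\, x^T A x
\;\Longrightarrow\;
f(x') - f(x) - \nabla f(x)^T (x' - x) = \tfrac{1}{2}\, (x' - x)^T A\, (x' - x),
\qquad
\tfrac{\lambda_{\min}(A)}{2}\, \|x' - x\|_2^2
\;\le\; \tfrac{1}{2}\, (x' - x)^T A\, (x' - x)
\;\le\; \tfrac{\lambda_{\max}(A)}{2}\, \|x' - x\|_2^2 .
```

So for this $f$ the smoothness parameter is $L = \lambda_{\max}(A)$ and the strong convexity parameter is $\mu = \lambda_{\min}(A)$; $f$ is strongly convex exactly when $A$ is positive definite.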
Results
Smooth and strongly convex: gradient descent with a sufficiently small constant $\eta_k$ has linear (or geometric) convergence:
$$f(w_k) - f(w_*) = O(\gamma^k)$$
for some $\gamma < 1$.
Smooth but not strongly convex:
$$f(w_k) - f(w_*) = O(1/k),$$
with learning rate $\eta_k = O(1/k)$.
Nonsmooth:
$$f(\bar{w}_k) - f(w_*) = O(1/\sqrt{k}),$$
for $\eta_k = O(1/\sqrt{k})$ and the averaged iterate $\bar{w}_k = k^{-1} \sum_{j=1}^{k} w_j$.
The learning rate can be tuned with line search.
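The nonsmooth case can be sketched as a subgradient method with iterate averaging. The objective $|x - 3|$, the step size $\eta_k = 1/\sqrt{k}$, and the function names are assumptions chosen for illustration.

```python
import math

def averaged_subgradient_method(subgrad, x0, num_steps):
    """Run x_k = x_{k-1} - eta_k * subgrad(x_{k-1}) with eta_k = 1/sqrt(k); return the average of x_1..x_k."""
    x = x0
    running_sum = 0.0
    for k in range(1, num_steps + 1):
        x = x - subgrad(x) / math.sqrt(k)
        running_sum += x
    return running_sum / num_steps

def subgrad_abs_shifted(x):
    """One valid subgradient of the nonsmooth function |x - 3|."""
    if x > 3.0:
        return 1.0
    if x < 3.0:
        return -1.0
    return 0.0

x_bar = averaged_subgradient_method(subgrad_abs_shifted, x0=0.0, num_steps=10000)
```

The individual iterates keep oscillating around the minimizer at 3 because the subgradient does not shrink near the optimum; averaging smooths out that oscillation.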
Reformulation of Gradient Descent
Gradient descent can be derived from:
$$w_k = \arg\min_w Q_k(w)$$
$$Q_k(w) := f(w_{k-1}) + \nabla f(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k} \|w - w_{k-1}\|_2^2$$
Key properties: assume smoothness for simplicity and $1/\eta_k \ge L$ (smoothness parameter of $f$).
  $Q_k(w_{k-1}) = f(w_{k-1})$
  $Q_k(w) \ge f(w)$
  $Q_k(w)$ is easy to optimize
Consequence: minimizing $Q_k(w)$ reduces the objective value of $f(w)$:
$$f(w_{k-1}) - f(w_k) \ge Q_k(w_{k-1}) - Q_k(w_k).$$
This idea can be generalized to other convex upper bounds of $f(w)$.
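The step from $Q_k$ to the gradient descent update, and the size of the guaranteed decrease, can be worked out explicitly:

```latex
\nabla Q_k(w) = \nabla f(w_{k-1}) + \frac{1}{\eta_k}\,(w - w_{k-1}) = 0
\;\Longrightarrow\;
w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}),
\qquad
Q_k(w_{k-1}) - Q_k(w_k)
= \eta_k \|\nabla f(w_{k-1})\|_2^2 - \frac{\eta_k}{2} \|\nabla f(w_{k-1})\|_2^2
= \frac{\eta_k}{2} \|\nabla f(w_{k-1})\|_2^2 .
```

Combined with $Q_k(w_{k-1}) = f(w_{k-1})$ and $Q_k(w_k) \ge f(w_k)$, each step decreases the objective by at least $\frac{\eta_k}{2} \|\nabla f(w_{k-1})\|_2^2$ whenever $1/\eta_k \ge L$.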
Proximal Gradient Method
Assume
$$f(w) = \phi(w) + g(w),$$
then we may consider the following upper bound of $f(w)$:
$$Q_k(w) := \phi(w_{k-1}) + \nabla \phi(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k} \|w - w_{k-1}\|_2^2 + g(w),$$
with $1/\eta_k$ larger than the smoothness parameter of $\phi$. Then solve for
$$w_k = \arg\min_w Q_k(w).$$
We assume that this minimization problem is easy.
This generalization of gradient descent is called proximal gradient descent.
It is useful when $g(w)$ is a simple nonsmooth function such as L1 regularization $g(w) = \lambda \|w\|_1$.
Example: L1 regularization
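$$f(w) = \sum_{i=1}^n (w^T x_i - y_i)^2 + \lambda \|w\|_1.$$
For example, $\phi(w) = \sum_{i=1}^n (w^T x_i - y_i)^2$ and $g(w) = \lambda \|w\|_1$. Then
$$Q_k(w) := \phi(w_{k-1}) + \nabla \phi(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k} \|w - w_{k-1}\|_2^2 + \lambda \|w\|_1.$$
Solution is
$$w_k = \mathrm{trunc}(w_{k-1} - \eta_k \nabla \phi(w_{k-1})),$$
where
$$\mathrm{trunc}([u_1, \ldots, u_d]) = [\mathrm{trunc}(u_j)]_{j=1,\ldots,d}$$
$$\mathrm{trunc}(u_j) = \mathrm{sign}(u_j)\,(|u_j| - \lambda \eta_k)_+$$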
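The proximal gradient update for this L1 example can be sketched as follows. The code mirrors the `trunc` operator above; the data, the step size `eta = 0.25`, and the helper names `grad_phi` and `proximal_gradient_lasso` are assumptions made for illustration.

```python
def trunc(u, threshold):
    """Soft-thresholding: sign(u_j) * (|u_j| - threshold)_+ for each coordinate."""
    return [(1.0 if u_j > 0 else -1.0) * max(abs(u_j) - threshold, 0.0) for u_j in u]

def grad_phi(w, X, y):
    """Gradient of phi(w) = sum_i (w^T x_i - y_i)^2."""
    g = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        r = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) - y_i
        for j, x_ij in enumerate(x_i):
            g[j] += 2.0 * r * x_ij
    return g

def proximal_gradient_lasso(X, y, lam, eta, num_steps):
    """Iterate w_k = trunc(w_{k-1} - eta * grad_phi(w_{k-1})) with threshold lam * eta."""
    w = [0.0] * len(X[0])
    for _ in range(num_steps):
        g = grad_phi(w, X, y)
        w = trunc([w_j - eta * g_j for w_j, g_j in zip(w, g)], lam * eta)
    return w

# Hypothetical toy problem with an identity design, so the minimizer is known in closed form:
# each coordinate minimizes (w_j - y_j)^2 + lam * |w_j|, giving sign(y_j) * (|y_j| - lam / 2)_+.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
w_lasso = proximal_gradient_lasso(X, y, lam=1.0, eta=0.25, num_steps=100)
thresholded = trunc([3.0, -0.5, 0.1], 0.25)
```

Here the smoothness parameter of $\phi$ is 2, so `eta = 0.25` satisfies the requirement that $1/\eta_k$ exceed it. The second coordinate is set exactly to zero, which is the sparsity effect that motivates L1 regularization.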
Property of Proximal Gradient
Smoothness depends on $\phi$ rather than $f$ (can tolerate nonsmooth $g$)
Convergence similar to gradient descent:
  if $\phi$ is smooth and $f$ is strongly convex: convergence is linear
  if $\phi$ is smooth but not strongly convex: convergence is $1/k$
  if $\phi$ is not smooth: convergence is $1/\sqrt{k}$
Nesterov’s Accelerated Gradient (one version)
Procedure:
  Pick $\eta_1, \eta_2, \ldots \ge 0$
  Pick $w_1 = y_1 = z_0$, then
  Define $\alpha_0 = 0$ and $\alpha_i^{-2} - \alpha_i^{-1} = \alpha_{i-1}^{-2}$ for $i \ge 1$ (may also set $\alpha_i = (1 + i/2)^{-1}$)
  Iterate for $i = 1, 2, \ldots, T$:
$$z_i = \arg\min_z \left[ g(z) + \frac{1}{2\eta_i} \|z\|_2^2 - \left(\eta_i^{-1} z_{i-1} - \alpha_i^{-1} \nabla \phi(y_i)\right)^T z \right],$$
$$w_i = (1 - \alpha_{i-1})\, w_{i-1} + \alpha_{i-1} z_i$$
$$y_{i+1} = (1 - \alpha_i)\, w_i + \alpha_i z_i$$
Advantage: faster convergence of $1/k^2$ for smooth $\phi$
Disadvantage:
  for smooth and strongly convex $f$: the algorithm has to be modified to achieve geometric convergence
  the modification depends on the strong convexity parameter $\mu$.
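For a runnable illustration of acceleration, the sketch below uses the widely known FISTA-style two-sequence form rather than the three-sequence version listed above; it is a swapped-in variant chosen for brevity, and the problem data and function names are assumptions.

```python
import math

def soft_threshold(u, threshold):
    """sign(u_j) * (|u_j| - threshold)_+ for each coordinate."""
    return [(1.0 if u_j > 0 else -1.0) * max(abs(u_j) - threshold, 0.0) for u_j in u]

def accelerated_proximal_gradient(grad_phi, prox, w0, eta, num_steps):
    """FISTA-style iteration: take the proximal gradient step at an extrapolated point y."""
    w_prev = list(w0)
    y = list(w0)
    t = 1.0
    for _ in range(num_steps):
        g = grad_phi(y)
        w = prox([y_j - eta * g_j for y_j, g_j in zip(y, g)], eta)
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        momentum = (t - 1.0) / t_next
        # Extrapolate along the direction of the most recent change in w.
        y = [w_j + momentum * (w_j - p_j) for w_j, p_j in zip(w, w_prev)]
        w_prev, t = w, t_next
    return w_prev

# Hypothetical problem: phi(w) = 0.5 * sum_j a_j * (w_j - b_j)^2 and g(w) = lam * ||w||_1.
# Its minimizer is sign(b_j) * (|b_j| - lam / a_j)_+ per coordinate, here (2, 0).
a, b, lam = [1.0, 100.0], [3.0, 0.005], 1.0

def grad(w):
    return [a_j * (w_j - b_j) for a_j, w_j, b_j in zip(a, w, b)]

def prox(u, eta):
    return soft_threshold(u, lam * eta)

w_acc = accelerated_proximal_gradient(grad, prox, [0.0, 0.0], eta=1.0 / max(a), num_steps=2000)
```

The step size `1.0 / max(a)` is the reciprocal of the smoothness parameter of this particular $\phi$; no strong convexity parameter is used, consistent with the $1/k^2$ rate quoted for smooth $\phi$.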
Beyond First Order Method: LBFGS (high level view)
Recall gradient descent: successive minimization of
$$Q_k(w) = f(w_{k-1}) + \nabla f(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2\eta_k} \|w - w_{k-1}\|_2^2,$$
  an upper bound of $f(w)$
Locally, a more accurate approximation of $f(w)$ uses the Hessian $H$:
$$Q_k(w) = f(w_{k-1}) + \nabla f(w_{k-1})^T (w - w_{k-1}) + \frac{1}{2} (w - w_{k-1})^T H (w - w_{k-1}).$$
BFGS: approximate $H$ using first order gradients.
LBFGS: use limited memory (store a few vectors) to approximate $H$
Very effective for optimization of smooth objective functions.
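For reference, minimizing the Hessian-based model gives the Newton step, and quasi-Newton methods such as BFGS replace $H$ with an approximation built from gradient differences. The symbols $B_k$, $s_k$ and $y_k$ below are notation introduced here for illustration, and $H$ is assumed positive definite:

```latex
\nabla Q_k(w) = \nabla f(w_{k-1}) + H\,(w - w_{k-1}) = 0
\;\Longrightarrow\;
w_k = w_{k-1} - H^{-1} \nabla f(w_{k-1}),
\qquad
B_k s_k = y_k, \quad s_k = w_{k-1} - w_{k-2}, \quad y_k = \nabla f(w_{k-1}) - \nabla f(w_{k-2}) .
```

The second relation is the secant condition that the approximation $B_k \approx H$ is required to satisfy; it only involves first order gradients, which is why no Hessian needs to be computed.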
Coordinate Descent (CD)
Let $f(w) = f([w_1, \ldots, w_d])$
Algorithm:
  for $j = 1, \ldots, d$
    $w_j \leftarrow \arg\min_u f([w_1, \ldots, w_{j-1}, u, w_{j+1}, \ldots, w_d])$
  repeat until convergence
Idea: optimize one parameter at a time and fix others
Assumption:
  each one dimensional problem can be solved easily.
  each coordinate update for variable $j$ is inexpensive compared to gradient descent.
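The loop above can be sketched for a case where each one dimensional problem has a closed-form solution. The quadratic objective, the matrix `A`, and the vector `b` are assumptions chosen for illustration.

```python
def coordinate_descent_quadratic(A, b, num_passes):
    """Minimize f(w) = 0.5 * w^T A w - b^T w one coordinate at a time (A symmetric positive definite)."""
    d = len(b)
    w = [0.0] * d
    for _ in range(num_passes):
        for j in range(d):
            # Solve df/dw_j = sum_l A[j][l] * w_l - b_j = 0 for w_j with the other coordinates fixed.
            rest = sum(A[j][l] * w[l] for l in range(d) if l != j)
            w[j] = (b[j] - rest) / A[j][j]
    return w

# Hypothetical problem whose minimizer solves A w = b, namely w = (0.2, 0.6).
A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
w_cd = coordinate_descent_quadratic(A, b, num_passes=50)
```

Each update uses the newest values of the other coordinates, so no step size has to be chosen.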
Linear Regularization Problem
Consider regularized logistic regression:
$$w = \arg\min_w \left[ \sum_{i=1}^n \ln(1 + \exp(-w^T x_i y_i)) + \lambda \|w\|_1 \right]$$
or more generally the following problem with scalar functions $f_i$ and $h_j$:
$$w = \arg\min_w \left[ \sum_{i=1}^n f_i(w^T x_i) + \sum_{j=1}^d h_j(w_j) \right].$$
Iteration complexity:
  maintain $z_i = w^T x_i$ for $i = 1, \ldots, n$
  coordinate $j$: updating $w_j$ and $\{z_i\}$ requires scanning a feature column
  one pass over $j = 1, \ldots, d$ costs about as much as one gradient descent step
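The bookkeeping described above can be sketched as follows. To keep every coordinate update in closed form, this example assumes the squared loss $f_i(z) = (z - y_i)^2$ with $h_j(w_j) = \lambda |w_j|$ instead of the logistic loss; the data and function names are likewise hypothetical.

```python
def lasso_objective(w, X, y, lam):
    """sum_i (w^T x_i - y_i)^2 + lam * ||w||_1."""
    loss = sum((sum(w_j * x_j for w_j, x_j in zip(w, x_i)) - y_i) ** 2 for x_i, y_i in zip(X, y))
    return loss + lam * sum(abs(w_j) for w_j in w)

def coordinate_descent_lasso(X, y, lam, num_passes):
    n, d = len(X), len(X[0])
    w = [0.0] * d
    z = [0.0] * n  # maintained predictions z_i = w^T x_i
    for _ in range(num_passes):
        for j in range(d):
            column = [X[i][j] for i in range(n)]  # scan feature column j
            c = sum(v * v for v in column)        # assumed nonzero
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(column[i] * (y[i] - z[i] + w[j] * column[i]) for i in range(n))
            sign = 1.0 if rho > 0 else -1.0
            new_wj = sign * max(abs(rho) - lam / 2.0, 0.0) / c
            for i in range(n):
                z[i] += (new_wj - w[j]) * column[i]  # keep z consistent with the new w_j
            w[j] = new_wj
    return w, z

X = [[1.0, 0.5], [0.0, 1.0], [1.0, -1.0]]
y = [1.0, 2.0, 0.0]
w_cd, z_cd = coordinate_descent_lasso(X, y, lam=0.5, num_passes=100)
```

Updating one coordinate touches only column `j` of the data, so a full pass over all `d` coordinates reads the data matrix once, like one gradient evaluation.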
Convergence
Practice:
  for suitable problems, coordinate descent works much better than gradient descent
  e.g., regularized logistic regression
Theory: incomplete
  current analysis either shows no improvement or shows improvement only under very restricted scenarios
  Paul Tseng, Yurii Nesterov, ...
It is still an open question to develop a better theoretical understanding of when coordinate descent performs better.
Convex Duality
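Given a convex function $f(w)$, we can define its conjugate or dual
$$f^*(\alpha) = \sup_w \left[ w^T \alpha - f(w) \right].$$
The optimal $w$ satisfies $\alpha = \nabla f(w)$.
The dual of $f^*$ is $f$:
$$f(w) = \sup_\alpha \left[ w^T \alpha - f^*(\alpha) \right].$$
The optimal $\alpha$ satisfies $\nabla f^*(\alpha) = w$.
We have the following property: for all $w$ and $\alpha$:
$$f(w) + f^*(\alpha) \ge w^T \alpha,$$
where equality holds only at $w = \nabla f^*(\alpha)$, which is equivalent to $\alpha = \nabla f(w)$.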
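A standard worked example, added here for illustration, is the squared Euclidean norm, which is its own conjugate:

```latex
f(w) = \tfrac{1}{2} \|w\|_2^2
\;\Longrightarrow\;
f^*(\alpha) = \sup_w \left[ w^T \alpha - \tfrac{1}{2} \|w\|_2^2 \right] = \tfrac{1}{2} \|\alpha\|_2^2
\quad \text{(attained at } w = \alpha = \nabla f(w)\text{)},
\qquad
f(w) + f^*(\alpha) - w^T \alpha = \tfrac{1}{2} \|w - \alpha\|_2^2 \ge 0 .
```

The last identity shows both the inequality and the equality condition $w = \alpha$ directly for this choice of $f$.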
Dual of Linear Regularization Method
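Primal optimization problem:
$$w_* = \arg\min_w P(w), \qquad P(w) := \sum_{i=1}^n f_i(w^T x_i) + \lambda g(w).$$
Dual optimization problem:
$$\alpha_* = \arg\max_\alpha D(\alpha), \qquad D(\alpha) = \sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\left( \lambda^{-1} \sum_{i=1}^n \alpha_i x_i \right).$$
Strong duality:
  $P(w) \ge D(\alpha)$ for all $w$ and $\alpha$
  $P(w_*) = D(\alpha_*)$ with the relationship:
$$w_* = \nabla g^*\left( \lambda^{-1} \sum_{i=1}^n \alpha_{*,i}\, x_i \right), \qquad \alpha_{*,i} = -f_i'(w_*^T x_i).$$
Solve the dual instead of the primal problem.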
Quick Justification of Strong Duality
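$$P(w) - D(\alpha) = \left[ \sum_{i=1}^n f_i(w^T x_i) + \lambda g(w) \right] - \left[ \sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\left( \lambda^{-1} \sum_{i=1}^n \alpha_i x_i \right) \right]$$
$$= \sum_{i=1}^n \left[ f_i(w^T x_i) + f_i^*(-\alpha_i) + \alpha_i w^T x_i \right] + \lambda \left[ g(w) + g^*\left( \lambda^{-1} \sum_{i=1}^n \alpha_i x_i \right) - w^T \left( \lambda^{-1} \sum_{i=1}^n \alpha_i x_i \right) \right] \ge 0.$$
Both brackets are nonnegative by the conjugate inequality $f(w) + f^*(\alpha) \ge w^T \alpha$, and the added terms $\sum_i \alpha_i w^T x_i$ and $-\lambda w^T (\lambda^{-1} \sum_i \alpha_i x_i)$ cancel.
Equality holds at $f_i'(w^T x_i) = -\alpha_i$ and $w = \nabla g^*(\lambda^{-1} \sum_i \alpha_i x_i)$.
One can check that this gives the first order optimality conditions for $w_*$ and $\alpha_*$.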
Example: Linear Support Vector Machine
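Primal formulation:
$$P(w) = \sum_{i=1}^n (1 - w^T x_i y_i)_+ + 0.5 \lambda \|w\|_2^2$$
  $f_i(u) = (1 - u y_i)_+$
  $g(w) = 0.5 \|w\|_2^2$
Dual formulation:
$$D(\alpha) = \sum_{i=1}^n \alpha_i y_i - 0.5 \lambda^{-1} \left\| \sum_{i=1}^n \alpha_i x_i \right\|_2^2, \qquad \alpha_i y_i \in [0,1].$$
  $-f_i^*(-\alpha_i) = \alpha_i y_i$ with constraint $\alpha_i y_i \in [0,1]$
  $g^*(w) = 0.5 \|w\|_2^2$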
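The conjugate of the hinge loss used in this example can be derived in a few lines, assuming labels $y_i \in \{-1, +1\}$ (an assumption the slides do not state explicitly):

```latex
f_i^*(a) = \sup_u \left[ a u - (1 - u y_i)_+ \right]
= \sup_v \left[ (a y_i)\, v - (1 - v)_+ \right] \quad (v = u y_i)
= \begin{cases} a y_i & \text{if } a y_i \in [-1, 0], \\ +\infty & \text{otherwise.} \end{cases}
```

For $a y_i \in [-1, 0]$ the supremum is attained at $v = 1$. Substituting $a = -\alpha_i$ gives $-f_i^*(-\alpha_i) = \alpha_i y_i$ under the constraint $\alpha_i y_i \in [0, 1]$.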
Dual Coordinate Descent
Dual optimization problem:
$$\alpha_* = \arg\max_\alpha D(\alpha), \qquad D(\alpha) = \sum_{i=1}^n -f_i^*(-\alpha_i) - \lambda g^*\left( \lambda^{-1} \sum_{i=1}^n \alpha_i x_i \right).$$
Apply coordinate descent on the dual:
  maintain $w = \lambda^{-1} \sum_i \alpha_i x_i$
  for $i = 1, \ldots, n$, we update $\alpha_i$ one at a time while fixing the others
Computation: the total computation of one pass over the data is comparable to one gradient descent step.
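The procedure can be sketched for the linear SVM with hinge loss and $g(w) = 0.5\|w\|_2^2$. This is a minimal illustration assuming labels in $\{-1, +1\}$, a cyclic pass order, and the dual $D(\alpha) = \sum_i \alpha_i y_i - 0.5 \lambda \|w\|_2^2$ with $w = \lambda^{-1} \sum_i \alpha_i x_i$; the data and function names are hypothetical.

```python
def dual_coordinate_descent_svm(X, y, lam, num_passes):
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d  # maintained as w = (1 / lam) * sum_i alpha_i * x_i
    for _ in range(num_passes):
        for i in range(n):
            x_i, y_i = X[i], y[i]
            sq_norm = sum(v * v for v in x_i)
            margin = y_i * sum(w_j * x_j for w_j, x_j in zip(w, x_i))
            # Exact maximizer of the concave quadratic in beta = alpha_i * y_i, clipped to [0, 1].
            beta = alpha[i] * y_i + lam * (1.0 - margin) / sq_norm
            beta = min(1.0, max(0.0, beta))
            delta = beta * y_i - alpha[i]
            alpha[i] += delta
            w = [w_j + delta * x_j / lam for w_j, x_j in zip(w, x_i)]
    return w, alpha

def primal_value(w, X, y, lam):
    hinge = sum(max(0.0, 1.0 - y_i * sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
                for x_i, y_i in zip(X, y))
    return hinge + 0.5 * lam * sum(w_j * w_j for w_j in w)

def dual_value(alpha, X, y, lam):
    v = [sum(alpha[i] * X[i][j] for i in range(len(X))) for j in range(len(X[0]))]
    return sum(a_i * y_i for a_i, y_i in zip(alpha, y)) - 0.5 / lam * sum(v_j * v_j for v_j in v)

X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0]]
y = [1, 1, -1, -1]
w_dual, alpha = dual_coordinate_descent_svm(X, y, lam=1.0, num_passes=500)
gap = primal_value(w_dual, X, y, 1.0) - dual_value(alpha, X, y, 1.0)
```

Each update reads a single example $x_i$, and the duality gap `gap` gives a computable certificate of how far the current `w_dual` is from optimal.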
Convergence
Previous analysis of the method only shows slow convergence.
Our new analysis (work in progress with Shai Shalev-Shwartz):
To achieve accuracy $\epsilon$
  for smooth loss (e.g., logistic), requires
  $$O\left( \ln n + \frac{\ln(1/\epsilon)}{n} \right) \text{ passes over data}$$
    gradient descent: $O(\ln(1/\epsilon))$
  for nonsmooth loss (e.g., SVM), requires
  $$O\left( \ln n + \frac{1}{n\epsilon} \right) \text{ passes over data}$$
    and convergence becomes geometric asymptotically
    gradient descent: $O(1/\epsilon)$
References
LBFGS: Dong C. Liu and Jorge Nocedal, "On the limited memory BFGS method for large scale optimization", Mathematical Programming, 1989.
Stephen Boyd and Lieven Vandenberghe: Convex Optimization (book), http://www.stanford.edu/~boyd/cvxbook/
Yurii Nesterov: proximal gradient and accelerated proximal gradient
  Introductory Lectures on Convex Optimization: A Basic Course
  Gradient methods for minimizing composite objective function
Arkadi Nemirovski: optimization lecture notes, http://www2.isye.gatech.edu/~nemirovs/