ECS289: Scalable Machine Learning
Outline
Linear Support Vector Machines and Dual Problems
General Empirical Risk Minimization
Optimization for SVM (and general ERM problems)
Support Vector Machines
SVM is a widely used classifier. Given:
Training data points x_1, …, x_n. Each x_i ∈ R^d is a feature vector.
Consider a simple case with two classes: y_i ∈ {+1, −1}.
Goal: Find a hyperplane separating the two classes:
if y_i = 1, w^T x_i ≥ 1; if y_i = −1, w^T x_i ≤ −1.
Support Vector Machines (hard constraints)
Given training data x_1, …, x_n ∈ R^d with labels y_i ∈ {+1, −1}. SVM primal problem (with hard constraints):

min_w  ½ w^T w
s.t.  y_i (w^T x_i) ≥ 1,  i = 1, …, n
What if there are outliers?
Support Vector Machines
Given training data x_1, …, x_n ∈ R^d with labels y_i ∈ {+1, −1}. SVM primal problem:

min_{w,ξ}  ½ w^T w + C ∑_{i=1}^n ξ_i
s.t.  y_i (w^T x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, n
Support Vector Machines
The SVM primal problem can be written as

min_w  ½‖w‖² + C ∑_{i=1}^n max(0, 1 − y_i w^T x_i)

(first term: L2 regularization; second term: hinge loss, with the penalty parameter C carried over from the constrained formulation)

Non-differentiable when y_i w^T x_i = 1 for some i.
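The unconstrained objective above is easy to evaluate directly; here is a minimal NumPy sketch (an illustrative helper, names not from the slides):

```python
import numpy as np

def svm_primal_objective(w, X, y, C):
    """Regularized hinge-loss objective:
    0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w^T x_i)."""
    margins = y * (X @ w)                  # y_i * w^T x_i for all i
    hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss per sample
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)
```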
Next, we show how to derive the dual form of SVM
Support Vector Machines (dual)
Primal problem:
min_{w,ξ}  ½‖w‖² + C ∑_i ξ_i
s.t.  y_i w^T x_i − 1 + ξ_i ≥ 0 and ξ_i ≥ 0,  ∀i = 1, …, n
Equivalent to:
min_{w,ξ} max_{α≥0, β≥0}  ½‖w‖² + C ∑_i ξ_i − ∑_i α_i (y_i w^T x_i − 1 + ξ_i) − ∑_i β_i ξ_i
Under certain conditions (e.g., Slater's condition), exchanging min and max does not change the optimal value:
max_{α≥0, β≥0} min_{w,ξ}  ½‖w‖² + C ∑_i ξ_i − ∑_i α_i (y_i w^T x_i − 1 + ξ_i) − ∑_i β_i ξ_i
Support Vector Machines (dual)
Reorganize the equation:

max_{α≥0, β≥0} min_{w,ξ}  ½‖w‖² − ∑_i α_i y_i w^T x_i + ∑_i ξ_i (C − α_i − β_i) + ∑_i α_i

Now, for any given α, β, the minimizer over w satisfies

∂L/∂w = w − ∑_i α_i y_i x_i = 0  ⇒  w* = ∑_i y_i α_i x_i
Also, we must have C = α_i + β_i; otherwise ξ_i can drive the objective to −∞. Substituting these two equations back, we get
max_{α≥0, β≥0, α_i+β_i=C}  −½ ∑_{i,j} α_i α_j y_i y_j x_i^T x_j + ∑_i α_i
Support Vector Machines (dual)
Therefore, we get the following dual problem:

max_{0 ≤ α ≤ C}  D(α) := −½ α^T Q α + e^T α,

where Q is an n × n matrix with Q_ij = y_i y_j x_i^T x_j.

Based on the derivation, we know:
1. Primal minimum = dual maximum (under Slater's condition).
2. If α* is the dual solution and w* the primal solution, then w* = ∑_i y_i α_i* x_i.

We can therefore solve the dual problem instead of the primal problem.
General Empirical Risk Minimization
L2-regularized ERM:
min_{w ∈ R^d}  P(w) := ∑_{i=1}^n ℓ_i(w^T x_i) + ½‖w‖²

ℓ_i(·): loss function

Dual problem for L2-regularized ERM:

min_α  D(α) := ½‖∑_{i=1}^n α_i x_i‖² + ∑_{i=1}^n ℓ_i*(−α_i),

ℓ_i*(·): conjugate of ℓ_i
Primal-dual relationship: w* = ∑_{i=1}^n α_i* x_i
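As a standard worked example (a routine computation, not from the slides; taking C = 1 so the loss is the plain hinge), the conjugate of the hinge loss recovers the SVM dual:

```latex
\ell_i(u) = \max(0,\, 1 - y_i u)
\;\Longrightarrow\;
\ell_i^*(v) = \sup_u \{\, uv - \ell_i(u) \,\}
= \begin{cases} y_i v & \text{if } y_i v \in [-1, 0] \\ +\infty & \text{otherwise.} \end{cases}
```

Hence ℓ_i*(−α_i) = −y_i α_i for y_i α_i ∈ [0, 1]; substituting ᾱ_i = y_i α_i turns D(α) into ½‖∑_i ᾱ_i y_i x_i‖² − ∑_i ᾱ_i with 0 ≤ ᾱ_i ≤ 1, i.e., the SVM dual with C = 1.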
General Empirical Risk Minimization
Regularized ERM:
min_{w ∈ R^d}  P(w) := ∑_{i=1}^n ℓ_i(w^T x_i) + R(w)

ℓ_i(·): loss function
R(w): regularization
The dual problem may take a different form depending on R(w).
Examples
Loss functions (writing u = w^T x_i):
Regression (squared loss): ℓ_i(u) = (u − y_i)²
SVM (hinge loss): ℓ_i(u) = max(1 − y_i u, 0)
Squared hinge loss: ℓ_i(u) = max(1 − y_i u, 0)²
Logistic regression: ℓ_i(u) = log(1 + e^{−y_i u})

Regularizations:
L2-regularization: ‖w‖₂²
L1-regularization: ‖w‖₁
Group Lasso: ‖w_{S_1}‖₂ + ‖w_{S_2}‖₂ + ⋯ + ‖w_{S_k}‖₂
Nuclear norm: ‖W‖_*
Optimization Methods for ERM
                    smooth loss and      smooth loss,
                    regularization       nonsmooth regularization
gradient descent    Yes
proximal gradient   Yes                  Yes
SGD                 Yes                  (Yes, with modification)
CD                  Yes                  Yes
Newton              Yes
prox Newton         Yes                  Yes

(assume the non-smooth regularization is “simple”)
Non-smooth loss: generally harder; one needs subgradients.
Stochastic Gradient
Decompose the problem into n parts:
f(w) = (1/n) ∑_i ( ½‖w‖² + nC max(0, 1 − y_i w^T x_i) ) := (1/n) ∑_i f_i(w)
Stochastic Gradient (SG):
For t = 1, 2, …
  Randomly pick an index i
  w^{t+1} ← w^t − η_t ∇f_i(w^t)

Can be directly applied to smooth loss functions.
But for SVM, f_i is non-differentiable.
Stochastic Subgradient Method
A vector g is a subgradient of f at a point x0 if
f(x) − f(x_0) ≥ g^T (x − x_0)  ∀x
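For example, for f(x) = |x| at x_0 = 0, the definition gives a whole interval of valid slopes (a standard illustration, not from the slides):

```latex
|x| - |0| \;\ge\; g\,(x - 0) \quad \forall x
\;\iff\; g \in [-1, 1],
\qquad\text{so}\quad \partial f(0) = [-1, 1].
```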
Stochastic subgradient descent:
For t = 1, 2, …
  Randomly pick an index i
  w^{t+1} ← w^t − η_t g_i, where g_i is a subgradient of f_i at w^t
Stochastic Subgradient Method for SVM
A subgradient of ℓ_i(w) = max(0, 1 − y_i w^T x_i):
  −y_i x_i  if 1 − y_i w^T x_i > 0
  0         if 1 − y_i w^T x_i ≤ 0
(at the kink 1 − y_i w^T x_i = 0, any −t y_i x_i with t ∈ [0, 1] is a valid subgradient; 0 is a convenient choice)
Stochastic Subgradient descent for SVM:
For t = 1, 2, …
  Randomly pick an index i
  If y_i w^T x_i < 1:
    w ← (1 − η_t) w + η_t nC y_i x_i
  Else (if y_i w^T x_i ≥ 1):
    w ← (1 − η_t) w
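The update rule above can be sketched in NumPy (a minimal illustration; the step size η_t = 1/t is one common choice and an assumption here, not specified on the slide):

```python
import numpy as np

def stochastic_subgradient_svm(X, y, C, n_iters=1000, seed=0):
    """Stochastic subgradient descent for
    f(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i),
    using the decomposition f = (1/n) sum_i f_i from the slides."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        eta = 1.0 / t                      # assumed step-size schedule
        i = rng.integers(n)                # pick a random sample
        if y[i] * (X[i] @ w) < 1:          # margin violated: hinge is active
            w = (1 - eta) * w + eta * n * C * y[i] * X[i]
        else:                              # hinge inactive: only shrink
            w = (1 - eta) * w
    return w
```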
Stochastic Subgradient Method
Improve the time complexity when xi is sparse:
w ← (1 − η_t)w: O(d) time complexity.
Instead, we maintain w = a·v, so this scaling requires only O(1) time (update a).
w ← w + η_t nC y_i x_i: O(n_i) time, where n_i is the number of nonzeros in x_i.
This algorithm was proposed in:
Shalev-Shwartz et al., “Pegasos: Primal Estimated sub-GrAdient SOlver for SVM”, in ICML 2007.
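The w = a·v scaling trick above can be sketched as follows (an illustrative helper class, not the Pegasos implementation; the method names are made up):

```python
import numpy as np

class ScaledVector:
    """Maintain w = a * v so that scaling w by a constant is O(1),
    while a sparse additive update touches only the nonzeros."""
    def __init__(self, d):
        self.a = 1.0
        self.v = np.zeros(d)

    def scale(self, factor):           # w <- factor * w, O(1)
        self.a *= factor               # (factor must stay nonzero)

    def add(self, coeff, idx, vals):   # w <- w + coeff * x, O(nnz(x))
        self.v[idx] += (coeff / self.a) * vals

    def dot(self, idx, vals):          # w^T x for sparse x, O(nnz(x))
        return self.a * (self.v[idx] @ vals)

    def dense(self):                   # materialize w, O(d)
        return self.a * self.v
```

With the Pegasos-style update w ← (1 − η_t)w the factor 1 − η_t stays nonzero for η_t < 1, so `a` never degenerates.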
Dual Form of SVM
Given training data x_1, …, x_n ∈ R^d with labels y_i ∈ {+1, −1}. SVM dual problem:

min_{0 ≤ α ≤ C}  ½ α^T Q α − ∑_{i=1}^n α_i

where Q is an n × n matrix with Q_ij = y_i y_j x_i^T x_j.

After computing the dual optimal solution α*, the primal solution is

w* = ∑_{i=1}^n y_i α_i* x_i
Coordinate Descent Algorithm
Stochastic Coordinate descent for solving min0≤α≤C f (α):
For t = 1, 2, …
  Randomly pick an index i
  Compute the optimal one-variable update:
    δ* = argmin_{δ : 0 ≤ α_i + δ ≤ C} f(α + δe_i)
  Update α_i ← α_i + δ*
Dual Coordinate Descent for SVM
One-variable subproblem:
f(α + δe_i) = ½ (α + δe_i)^T Q (α + δe_i) − ∑_{j=1}^n α_j − δ
            = ½ α^T Q α + (Qα)_i δ + (Q_ii/2) δ² − ∑_{j=1}^n α_j − δ

Compute argmin_δ f(α + δe_i): set the derivative to zero:

(Qα)_i + Q_ii δ* − 1 = 0  ⇒  δ* = (1 − (Qα)_i) / Q_ii

However, we require 0 ≤ α_i + δ ≤ C, so the constrained optimum is

δ* = max( −α_i, min( C − α_i, (1 − (Qα)_i) / Q_ii ) )
Dual Coordinate Descent for SVM
Main computation: the i-th element of Qα
Time complexity:
(Qα)_i = ∑_{j=1}^n Q_ij α_j = ∑_{j=1}^n y_i y_j x_i^T x_j α_j = y_i ∑_{j=1}^n y_j α_j x_j^T x_i
Naive implementation: O(nd) time.
Speedup the Computation
A faster way to compute coordinate descent updates.
(Qα)_i = y_i ( ∑_{j=1}^n y_j α_j x_j )^T x_i

Maintain w = ∑_{j=1}^n y_j α_j x_j in memory:
⇒ O(d) time for computing (Qα)_i
w is the primal variable corresponding to the current dual solution α!
Dual Coordinate Descent for SVM
After updating α_i ← α_i + δ*, we need to maintain w:

w ← w + δ* y_i x_i
Time complexity: O(d)
After convergence,
w* = ∑_{i=1}^n α_i* y_i x_i
is the optimal primal solution.
Dual Coordinate Descent for SVM
Initialize α and w = ∑_{i=1}^n α_i y_i x_i
For t = 1, 2, …
  Randomly pick an index i
  Compute the optimal one-variable update:
    δ* = max( −α_i, min( C − α_i, (1 − y_i w^T x_i) / Q_ii ) )
  Update α_i ← α_i + δ*
  Update w ← w + δ* y_i x_i
(Hsieh et al., “A Dual Coordinate Descent Method for Large-scale Linear SVM”, ICML 2008)

Can be applied to general L2-regularized ERM problems (Shalev-Shwartz and Zhang, “Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization”, JMLR 2013)
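The full algorithm above can be sketched in NumPy (a minimal dense-matrix illustration; production solvers such as LIBLINEAR use sparse data structures and additional heuristics):

```python
import numpy as np

def dual_cd_svm(X, y, C, n_epochs=20, seed=0):
    """Dual coordinate descent for linear SVM, following the slide:
    maintain w = sum_i alpha_i y_i x_i so each update costs O(d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    Qii = np.sum(X * X, axis=1)          # Q_ii = ||x_i||^2 since y_i^2 = 1
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            if Qii[i] == 0.0:
                continue
            # Optimal one-variable update, clipped so 0 <= alpha_i <= C
            delta = (1.0 - y[i] * (w @ X[i])) / Qii[i]
            delta = max(-alpha[i], min(C - alpha[i], delta))
            alpha[i] += delta
            w += delta * y[i] * X[i]     # maintain the primal vector
    return w, alpha
```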
Convergence of Dual Coordinate Descent
Is the dual SVM problem strongly convex? Not necessarily: Q = X̄X̄^T may be low-rank.

Sublinear convergence in dual objective function value, and sublinear convergence in the duality gap (primal obj − dual obj) (shown in Shalev-Shwartz and Zhang, “Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization”, JMLR 2013).

Nevertheless, a global linear convergence rate holds in terms of the dual objective function value (shown in Wang and Lin, “Iteration Complexity of Feasible Descent Methods for Convex Optimization”, JMLR 2014).
Experimental comparison
RCV1: 677,399 training samples; 47,236 features; 49,556,258 nonzeros in the whole dataset.
[Figure: time vs. primal objective function value]
[Figure: time vs. prediction accuracy]
LIBLINEAR
Implemented in LIBLINEAR:
https://www.csie.ntu.edu.tw/~cjlin/liblinear/
Other functionalities:
Logistic regression (L1 or L2 regularization)
Multi-class SVM
Support vector regression
Cross-validation
Linear SVM on a multi-core machine
Asynchronous dual coordinate descent algorithm:
Each thread repeatedly performs the following updates:
For t = 1, 2, …
  Randomly pick an index i
  Compute the optimal one-variable update:
    δ_i* = max( −α_i, min( C − α_i, (1 − y_i w^T x_i) / Q_ii ) )
  Update α_i ← α_i + δ_i*
  Update w ← w + δ_i* y_i x_i
Different mechanisms on accessing w in the shared memory.
Hsieh et al., “PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent”, in ICML 2015.
Huan and Hsieh, “Fixing the Convergence Problems in Parallel Asynchronous Dual Coordinate Descent”, in ICDM 2016.
Other multi-core algorithms for SVM
Asynchronous stochastic (sub-)gradient descent for the primal problem:
Niu et al., “Hogwild! A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”, in NIPS 2011.
Another approach (parallelized variable selection):
Chiang et al., “Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environment”, in KDD 2016.
Other parallel primal solvers:
Lee et al., “Fast Matrix-vector Multiplications for Large-scale Logistic Regression on Shared-memory Systems”, in ICDM 2015.
Large-scale linear ERM: out-of-core version
Question: how to solve linear SVM on a single machine when data cannot fit in memory?
Dual coordinate descent needs random access, so it is not suitable when training samples are stored on disk.
Large-scale linear ERM: out-of-core version
Block coordinate descent:
Partition data into blocks S_1, …, S_k
For t = 1, 2, …
  Load a block S_i into memory
  Update this block of dual variables by dual coordinate descent
(Yu et al., “Large Linear Classification when Data Cannot Fit In Memory”, in KDD 2010)

Selective block coordinate descent: keep important samples in memory.
(Chang and Roth, “Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models”, in KDD 2011)

StreamSVM: one thread keeps loading data while another thread keeps running coordinate descent updates.
(Matsushima et al., “Linear Support Vector Machines via Dual Cached Loops”, in KDD 2012)
Distributed Linear ERM
Each machine stores a subset of the training samples.
Each machine conducts dual coordinate updates and keeps a local w.
How do we communicate and synchronize the updates?
Can the methods generalize to other local solvers?
What if there are millions of machines (nodes), and the data is distributed non-iid?
Distributed Linear ERM
COCOA: local dual coordinate descent updates, then averaging. (Jaggi et al., “Communication-Efficient Distributed Dual Coordinate Ascent”, in NIPS 2014.)
Block Quadratic Optimization (COCOA with improved step-size selection). (Lee et al., “Distributed Box-Constrained Quadratic Optimization for Dual Linear Support Vector Machines”, in ICML 2015.)
COCOA+: improves COCOA by solving a modified subproblem. (Ma et al., “Adding vs. Averaging in Distributed Primal-Dual Optimization”, in ICML 2015.)
A more general framework (for other regularizations). (Zheng et al., “A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization”, arXiv 2016.)
Other distributed algorithms
Primal solver: (Zhuang et al., “Distributed Newton Method for Regularized Logistic Regression”, in PAKDD 2015.)
DANE: distributed Newton-type method. (Shamir et al., “Communication-Efficient Distributed Optimization using an Approximate Newton-type Method”, in ICML 2014.)
DiSCO: based on an inexact Newton method. (Zhang et al., “Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss”, in ICML 2015.)