ECS289: Scalable Machine Learning
Outline
Linear Support Vector Machines and Dual Problems
General Empirical Risk Minimization
Optimization for SVM (and general ERM problems)
Support Vector Machines
SVM is a widely used classifier. Given:
Training data points x_1, …, x_n. Each x_i ∈ R^d is a feature vector.
Consider a simple case with two classes: y_i ∈ {+1, −1}.
Goal: Find a hyperplane separating the two classes:
if y_i = 1, w^T x_i ≥ 1; if y_i = −1, w^T x_i ≤ −1.
Support Vector Machines (hard constraints)
Given training data x_1, …, x_n ∈ R^d with labels y_i ∈ {+1, −1}. SVM primal problem (with hard constraints):

min_w  ½ w^T w
s.t.  y_i (w^T x_i) ≥ 1,  i = 1, …, n
What if there are outliers?
Support Vector Machines
Given training data x_1, …, x_n ∈ R^d with labels y_i ∈ {+1, −1}. SVM primal problem:

min_{w,ξ}  ½ w^T w + C ∑_{i=1}^n ξ_i
s.t.  y_i (w^T x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, n
Support Vector Machines
The SVM primal problem can be written as

min_w  ½‖w‖² + C ∑_{i=1}^n max(0, 1 − y_i w^T x_i)

(first term: L2 regularization; second term: hinge loss, with the penalty parameter C carried over from the constrained formulation)

Non-differentiable when y_i w^T x_i = 1 for some i.
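The unconstrained objective above is easy to evaluate directly; here is a minimal NumPy sketch (an illustrative helper, names not from the slides):

```python
import numpy as np

def svm_primal_objective(w, X, y, C):
    """Regularized hinge-loss objective:
    0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w^T x_i)."""
    margins = y * (X @ w)                  # y_i * w^T x_i for all i
    hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss per sample
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)
```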
Next, we show how to derive the dual form of SVM
Support Vector Machines (dual)
Primal problem:
min_{w,ξ}  ½‖w‖² + C ∑_i ξ_i
s.t.  y_i w^T x_i − 1 + ξ_i ≥ 0 and ξ_i ≥ 0,  ∀i = 1, …, n
Equivalent to:
min_{w,ξ} max_{α≥0, β≥0}  ½‖w‖² + C ∑_i ξ_i − ∑_i α_i (y_i w^T x_i − 1 + ξ_i) − ∑_i β_i ξ_i
Under certain conditions (e.g., Slater's condition), exchanging min and max does not change the optimal value:
max_{α≥0, β≥0} min_{w,ξ}  ½‖w‖² + C ∑_i ξ_i − ∑_i α_i (y_i w^T x_i − 1 + ξ_i) − ∑_i β_i ξ_i
Support Vector Machines (dual)
Reorganize the equation:

max_{α≥0, β≥0} min_{w,ξ}  ½‖w‖² − ∑_i α_i y_i w^T x_i + ∑_i ξ_i (C − α_i − β_i) + ∑_i α_i

Now, for any given α, β, the minimizer over w satisfies

∂L/∂w = w − ∑_i α_i y_i x_i = 0  ⇒  w* = ∑_i y_i α_i x_i
Also, we must have C = α_i + β_i; otherwise ξ_i can drive the objective to −∞. Substituting these two equations back, we get
max_{α≥0, β≥0, α_i+β_i=C}  −½ ∑_{i,j} α_i α_j y_i y_j x_i^T x_j + ∑_i α_i
Support Vector Machines (dual)
Therefore, we get the following dual problem:

max_{0 ≤ α ≤ C}  D(α) := −½ α^T Q α + e^T α,

where Q is an n × n matrix with Q_ij = y_i y_j x_i^T x_j.

Based on the derivation, we know:
1. Primal minimum = dual maximum (under Slater's condition).
2. If α* is the dual solution and w* the primal solution, then w* = ∑_i y_i α_i* x_i.

We can therefore solve the dual problem instead of the primal problem.
General Empirical Risk Minimization
L2-regularized ERM:
min_{w ∈ R^d}  P(w) := ∑_{i=1}^n ℓ_i(w^T x_i) + ½‖w‖²

ℓ_i(·): loss function

Dual problem for L2-regularized ERM:

min_α  D(α) := ½‖∑_{i=1}^n α_i x_i‖² + ∑_{i=1}^n ℓ_i*(−α_i),

ℓ_i*(·): conjugate of ℓ_i
Primal-dual relationship: w* = ∑_{i=1}^n α_i* x_i
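As a standard worked example (a routine computation, not from the slides; taking C = 1 so the loss is the plain hinge), the conjugate of the hinge loss recovers the SVM dual:

```latex
\ell_i(u) = \max(0,\, 1 - y_i u)
\;\Longrightarrow\;
\ell_i^*(v) = \sup_u \{\, uv - \ell_i(u) \,\}
= \begin{cases} y_i v & \text{if } y_i v \in [-1, 0] \\ +\infty & \text{otherwise.} \end{cases}
```

Hence ℓ_i*(−α_i) = −y_i α_i for y_i α_i ∈ [0, 1]; substituting ᾱ_i = y_i α_i turns D(α) into ½‖∑_i ᾱ_i y_i x_i‖² − ∑_i ᾱ_i with 0 ≤ ᾱ_i ≤ 1, i.e., the SVM dual with C = 1.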
General Empirical Risk Minimization
Regularized ERM:
min_{w ∈ R^d}  P(w) := ∑_{i=1}^n ℓ_i(w^T x_i) + R(w)

ℓ_i(·): loss function
R(w): regularization
The dual problem may take a different form depending on R(w).
Examples
Loss functions (writing u = w^T x_i):
Regression (squared loss): ℓ_i(u) = (u − y_i)²
SVM (hinge loss): ℓ_i(u) = max(1 − y_i u, 0)
Squared hinge loss: ℓ_i(u) = max(1 − y_i u, 0)²
Logistic regression: ℓ_i(u) = log(1 + e^{−y_i u})

Regularizations:
L2-regularization: ‖w‖₂²
L1-regularization: ‖w‖₁
Group Lasso: ‖w_{S_1}‖₂ + ‖w_{S_2}‖₂ + ⋯ + ‖w_{S_k}‖₂
Nuclear norm: ‖W‖_*
Optimization Methods for ERM
                    smooth loss and      smooth loss,
                    regularization       nonsmooth regularization
gradient descent    Yes
proximal gradient   Yes                  Yes
SGD                 Yes                  (Yes, with modification)
CD                  Yes                  Yes
Newton              Yes
prox Newton         Yes                  Yes

(assume the non-smooth regularization is “simple”)
Non-smooth loss: generally harder; one needs subgradients.
Stochastic Gradient
Decompose the problem into n parts:
f(w) = (1/n) ∑_i ( ½‖w‖² + nC max(0, 1 − y_i w^T x_i) ) := (1/n) ∑_i f_i(w)
Stochastic Gradient (SG):
For t = 1, 2, …
  Randomly pick an index i
  w^{t+1} ← w^t − η_t ∇f_i(w^t)

Can be directly applied to smooth loss functions.
But for SVM, f_i is non-differentiable.
Stochastic Subgradient Method
A vector g is a subgradient of f at a point x0 if
f(x) − f(x_0) ≥ g^T (x − x_0)  ∀x
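For example, for f(x) = |x| at x_0 = 0, the definition gives a whole interval of valid slopes (a standard illustration, not from the slides):

```latex
|x| - |0| \;\ge\; g\,(x - 0) \quad \forall x
\;\iff\; g \in [-1, 1],
\qquad\text{so}\quad \partial f(0) = [-1, 1].
```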
Stochastic subgradient descent:
For t = 1, 2, …
  Randomly pick an index i
  w^{t+1} ← w^t − η_t g_i, where g_i is a subgradient of f_i at w^t
Stochastic Subgradient Method for SVM
A subgradient of ℓ_i(w) = max(0, 1 − y_i w^T x_i):
  −y_i x_i  if 1 − y_i w^T x_i > 0
  0         if 1 − y_i w^T x_i ≤ 0
(at the kink 1 − y_i w^T x_i = 0, any −t y_i x_i with t ∈ [0, 1] is a valid subgradient; 0 is a convenient choice)
Stochastic Subgradient descent for SVM:
For t = 1, 2, …
  Randomly pick an index i
  If y_i w^T x_i < 1:
    w ← (1 − η_t) w + η_t nC y_i x_i
  Else (if y_i w^T x_i ≥ 1):
    w ← (1 − η_t) w
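The update rule above can be sketched in NumPy (a minimal illustration; the step size η_t = 1/t is one common choice and an assumption here, not specified on the slide):

```python
import numpy as np

def stochastic_subgradient_svm(X, y, C, n_iters=1000, seed=0):
    """Stochastic subgradient descent for
    f(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i),
    using the decomposition f = (1/n) sum_i f_i from the slides."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        eta = 1.0 / t                      # assumed step-size schedule
        i = rng.integers(n)                # pick a random sample
        if y[i] * (X[i] @ w) < 1:          # margin violated: hinge is active
            w = (1 - eta) * w + eta * n * C * y[i] * X[i]
        else:                              # hinge inactive: only shrink
            w = (1 - eta) * w
    return w
```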
Stochastic Subgradient Method
Improve the time complexity when xi is sparse:
w ← (1 − η_t)w: O(d) time complexity.
Instead, we maintain w = a·v, so this scaling requires only O(1) time (update a).
w ← w + η_t nC y_i x_i: O(n_i) time, where n_i is the number of nonzeros in x_i.
This algorithm was proposed in:
Shalev-Shwartz et al., “Pegasos: Primal Estimated sub-GrAdient SOlver for SVM”, in ICML 2007.
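The w = a·v scaling trick above can be sketched as follows (an illustrative helper class, not the Pegasos implementation; the method names are made up):

```python
import numpy as np

class ScaledVector:
    """Maintain w = a * v so that scaling w by a constant is O(1),
    while a sparse additive update touches only the nonzeros."""
    def __init__(self, d):
        self.a = 1.0
        self.v = np.zeros(d)

    def scale(self, factor):           # w <- factor * w, O(1)
        self.a *= factor               # (factor must stay nonzero)

    def add(self, coeff, idx, vals):   # w <- w + coeff * x, O(nnz(x))
        self.v[idx] += (coeff / self.a) * vals

    def dot(self, idx, vals):          # w^T x for sparse x, O(nnz(x))
        return self.a * (self.v[idx] @ vals)

    def dense(self):                   # materialize w, O(d)
        return self.a * self.v
```

With the Pegasos-style update w ← (1 − η_t)w the factor 1 − η_t stays nonzero for η_t < 1, so `a` never degenerates.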
Dual Form of SVM
Given training data x_1, …, x_n ∈ R^d with labels y_i ∈ {+1, −1}. SVM dual problem:

min_{0 ≤ α ≤ C}  ½ α^T Q α − ∑_{i=1}^n α_i

where Q is an n × n matrix with Q_ij = y_i y_j x_i^T x_j.

After computing the dual optimal solution α*, the primal solution is

w* = ∑_{i=1}^n y_i α_i* x_i
Coordinate Descent Algorithm
Stochastic Coordinate descent for solving min0≤α≤C f (α):
For t = 1, 2, …
  Randomly pick an index i
  Compute the optimal one-variable update:
    δ* = argmin_{δ : 0 ≤ α_i + δ ≤ C} f(α + δe_i)
  Update α_i ← α_i + δ*
Dual Coordinate Descent for SVM
One-variable subproblem:
f(α + δe_i) = ½ (α + δe_i)^T Q (α + δe_i) − ∑_{j=1}^n α_j − δ
            = ½ α^T Q α + (Qα)_i δ + (Q_ii/2) δ² − ∑_{j=1}^n α_j − δ

Compute argmin_δ f(α + δe_i): set the derivative to zero:

(Qα)_i + Q_ii δ* − 1 = 0  ⇒  δ* = (1 − (Qα)_i) / Q_ii

However, we require 0 ≤ α_i + δ ≤ C, so the constrained optimum is

δ* = max( −α_i, min( C − α_i, (1 − (Qα)_i) / Q_ii ) )
Dual Coordinate Descent for SVM
Main computation: the i-th element of Qα
Time complexity:
(Qα)_i = ∑_{j=1}^n Q_ij α_j = ∑_{j=1}^n y_i y_j x_i^T x_j α_j = y_i ∑_{j=1}^n y_j α_j x_j^T x_i
Naive implementation: O(nd) time.
Speedup the Computation
A faster way to compute coordinate descent updates.
(Qα)_i = y_i ( ∑_{j=1}^n y_j α_j x_j )^T x_i

Maintain w = ∑_{j=1}^n y_j α_j x_j in memory:
⇒ O(d) time for computing (Qα)_i
w is the primal variable corresponding to the current dual solution α!
Dual Coordinate Descent for SVM
After updating α_i ← α_i + δ*, we need to maintain w:

w ← w + δ* y_i x_i
Time complexity: O(d)
After convergence,
w* = ∑_{i=1}^n α_i* y_i x_i
is the optimal primal solution.
Dual Coordinate Descent for SVM
Initialize α and w = ∑_{i=1}^n α_i y_i x_i
For t = 1, 2, …
  Randomly pick an index i
  Compute the optimal one-variable update:
    δ* = max( −α_i, min( C − α_i, (1 − y_i w^T x_i) / Q_ii ) )
  Update α_i ← α_i + δ*
  Update w ← w + δ* y_i x_i
(Hsieh et al., “A Dual Coordinate Descent Method for Large-scale Linear SVM”, ICML 2008)

Can be applied to general L2-regularized ERM problems (Shalev-Shwartz and Zhang, “Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization”, JMLR 2013)
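The full algorithm above can be sketched in NumPy (a minimal dense-matrix illustration; production solvers such as LIBLINEAR use sparse data structures and additional heuristics):

```python
import numpy as np

def dual_cd_svm(X, y, C, n_epochs=20, seed=0):
    """Dual coordinate descent for linear SVM, following the slide:
    maintain w = sum_i alpha_i y_i x_i so each update costs O(d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    Qii = np.sum(X * X, axis=1)          # Q_ii = ||x_i||^2 since y_i^2 = 1
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            if Qii[i] == 0.0:
                continue
            # Optimal one-variable update, clipped so 0 <= alpha_i <= C
            delta = (1.0 - y[i] * (w @ X[i])) / Qii[i]
            delta = max(-alpha[i], min(C - alpha[i], delta))
            alpha[i] += delta
            w += delta * y[i] * X[i]     # maintain the primal vector
    return w, alpha
```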
Convergence of Dual Coordinate Descent
Is the dual SVM problem strongly convex? Not necessarily: Q = X̄X̄^T may be low-rank.

Sublinear convergence in dual objective function value, and sublinear convergence in the duality gap (primal obj − dual obj) (shown in Shalev-Shwartz and Zhang, “Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization”, JMLR 2013).

Nevertheless, a global linear convergence rate holds in terms of the dual objective function value (shown in Wang and Lin, “Iteration Complexity of Feasible Descent Methods for Convex Optimization”, JMLR 2014).
Experimental comparison
RCV1: 677,399 training samples; 47,236 features; 49,556,258 nonzeros in the whole dataset.
[Figure: time vs. primal objective function value]
[Figure: time vs. prediction accuracy]
LIBLINEAR
Implemented in LIBLINEAR:
https://www.csie.ntu.edu.tw/~cjlin/liblinear/
Other functionalities:
Logistic regression (L1 or L2 regularization)
Multi-class SVM
Support vector regression
Cross-validation
Linear SVM on a multi-core machine
Asynchronous dual coordinate descent algorithm:
Each thread repeatedly performs the following updates:
For t = 1, 2, …
  Randomly pick an index i
  Compute the optimal one-variable update:
    δ_i* = max( −α_i, min( C − α_i, (1 − y_i w^T x_i) / Q_ii ) )
  Update α_i ← α_i + δ_i*
  Update w ← w + δ_i* y_i x_i
Different mechanisms on accessing w in the shared memory.
Hsieh et al., “PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent”, in ICML 2015.
Huan and Hsieh, “Fixing the Convergence Problems in Parallel Asynchronous Dual Coordinate Descent”, in ICDM 2016.
Other multi-core algorithms for SVM
Asynchronous stochastic (sub-)gradient descent for the primal problem:
Niu et al., “Hogwild! A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”, in NIPS 2011.
Another approach (parallelized variable selection):
Chiang et al., “Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environment”, in KDD 2016.
Other parallel primal solvers:
Lee et al., “Fast Matrix-vector Multiplications for Large-scale Logistic Regression on Shared-memory Systems”, in ICDM 2015.
Large-scale linear ERM: out-of-core version
Question: how to solve linear SVM on a single machine when data cannot fit in memory?
Dual coordinate descent needs random access, so it is not suitable when training samples are stored on disk.
Large-scale linear ERM: out-of-core version
Block coordinate descent:
Partition data into blocks S_1, …, S_k
For t = 1, 2, …
  Load a block S_i into memory
  Update this block of dual variables by dual coordinate descent
(Yu et al., “Large Linear Classification when Data Cannot Fit In Memory”, in KDD 2010)

Selective block coordinate descent: keep important samples in memory.
(Chang and Roth, “Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models”, in KDD 2011)

StreamSVM: one thread keeps loading data while another thread keeps running coordinate descent updates.
(Matsushima et al., “Linear Support Vector Machines via Dual Cached Loops”, in KDD 2012)
Distributed Linear ERM
Each machine stores a subset of the training samples.
Each machine conducts dual coordinate updates and keeps a local w.
How do we communicate and synchronize the updates?
Can the methods generalize to other local solvers?
What if there are millions of machines (nodes), and the data is distributed non-iid?
Distributed Linear ERM
COCOA: local dual coordinate descent updates, then averaging. (Jaggi et al., “Communication-Efficient Distributed Dual Coordinate Ascent”, in NIPS 2014.)
Block Quadratic Optimization (COCOA with improved step-size selection). (Lee et al., “Distributed Box-Constrained Quadratic Optimization for Dual Linear Support Vector Machines”, in ICML 2015.)
COCOA+: improves COCOA by solving a modified subproblem. (Ma et al., “Adding vs. Averaging in Distributed Primal-Dual Optimization”, in ICML 2015.)
A more general framework (for other regularizations). (Zheng et al., “A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization”, arXiv 2016.)
Other distributed algorithms
Primal solver: (Zhuang et al., “Distributed Newton Method for Regularized Logistic Regression”, in PAKDD 2015.)
DANE: distributed Newton-type method. (Shamir et al., “Communication-Efficient Distributed Optimization using an Approximate Newton-type Method”, in ICML 2014.)
DiSCO: based on an inexact Newton method. (Zhang et al., “Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss”, in ICML 2015.)