Mini-Batch Primal and Dual Methods for SVMs (prichtar/talks/talk-2013-03-20-paris.pdf · 3/20/2013)
TRANSCRIPT
Mini-Batch Primal and Dual Methods for SVMs

Peter Richtárik
School of Mathematics, The University of Edinburgh

Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago)

arXiv:1303.2314

Fete Parisienne in Computation, Inference and Optimization, March 20, 2013
1 / 31
Peter Richtárik and Martin Takáč
Parallel coordinate descent methods for big data optimization, 2012, arXiv:1212.0873

Shai Shalev-Shwartz and Tong Zhang
Stochastic dual coordinate ascent methods for regularized loss minimization, 2012, arXiv:1209.1873

Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro
Pegasos: Primal estimated sub-gradient solver for SVM, ICML 2007

Martin Takáč, Avleen Bijral, Peter Richtárik and Nathan Srebro
Mini-batch primal and dual methods for SVMs, 2013, arXiv:1303.2314
2 / 31
Support Vector Machine
<w,x> - b = 0
<w,x> - b = 1
<w,x> - b = -1
3 / 31
Family Support Machine
4 / 31
PART I:
Stochastic Gradient Descent (SGD)
5 / 31
SVM: Primal Problem

Data: {(x_i, y_i) ∈ ℝ^d × {+1, −1} : i ∈ S := {1, 2, . . . , n}}

- Examples: x_1, . . . , x_n (assumption: max_i ‖x_i‖ ≤ 1)
- Labels: y_i ∈ {+1, −1}

Optimization formulation of SVM:

  min_{w ∈ ℝ^d} { P_S(w) := (λ/2) ‖w‖² + L_S(w) },   (P)

where

- L_A(w) := (1/|A|) Σ_{i ∈ A} ℓ(y_i ⟨w, x_i⟩)  (average hinge loss on the examples in A)
- ℓ(ζ) := max{0, 1 − ζ}  (hinge loss)
6 / 31
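As a quick illustration of the objective above (not from the talk; the function names are mine), the primal P_S(w) can be evaluated in a few lines of NumPy:

```python
import numpy as np

def hinge(z):
    # hinge loss: l(zeta) = max{0, 1 - zeta}
    return np.maximum(0.0, 1.0 - z)

def primal_objective(w, X, y, lam):
    # P_S(w) = (lam/2)*||w||^2 + (1/n) * sum_i l(y_i * <w, x_i>)
    margins = y * (X @ w)                   # y_i * <w, x_i>, one entry per example
    return 0.5 * lam * (w @ w) + hinge(margins).mean()
```

At w = 0 every margin is zero, so the average hinge loss is 1 and P_S(0) = 1.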
Pegasos (SGD)

Algorithm
1. Choose w_1 = 0 ∈ ℝ^d
2. Iterate for t = 1, 2, . . . , T
   2.1 Choose A_t ⊂ S = {1, 2, . . . , n}, |A_t| = b, uniformly at random
   2.2 Set stepsize η_t ← 1/(λt)
   2.3 Update w_{t+1} ← w_t − η_t ∂P_{A_t}(w_t)

Theorem
For w̄ = (1/T) Σ_{t=1}^T w_t we have

  E[P(w̄)] ≤ P(w*) + c log(T) · 1/(λT),

where c = (√λ + 1)².

Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, ICML 2007
7 / 31
Pegasos (SGD)

Algorithm
1. Choose w_1 = 0 ∈ ℝ^d
2. Iterate for t = 1, 2, . . . , T
   2.1 Choose A_t ⊂ S = {1, 2, . . . , n}, |A_t| = b, uniformly at random
   2.2 Set stepsize η_t ← 1/(λt)
   2.3 Update w_{t+1} ← (1 − η_t λ) w_t + (η_t/b) Σ_{i ∈ A_t : y_i ⟨w_t, x_i⟩ < 1} y_i x_i

Theorem 1
For w̄ = (2/T) Σ_{t=⌊T/2⌋+1}^T w_t we have

  E[P(w̄)] ≤ P(w*) + (30 β_b/b) · 1/(λT),

where β_b = 1 + (b − 1)(nσ² − 1)/(n − 1), σ² := (1/n)‖Q‖ and Q_ij = ⟨y_i x_i, y_j x_j⟩.

Martin Takáč, Avleen Bijral, P. R. and Nathan Srebro
Mini-batch primal and dual methods for SVMs, 2013
7 / 31
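A minimal sketch of the mini-batch update in step 2.3 above, assuming dense NumPy arrays and averaging the second half of the iterates as in Theorem 1 (illustrative code, not the authors' implementation):

```python
import numpy as np

def pegasos_minibatch(X, y, lam, b, T, rng=None):
    """Mini-batch Pegasos sketch:
    w_{t+1} = (1 - eta_t*lam)*w_t + (eta_t/b) * sum_{i in A_t: y_i<w_t,x_i> < 1} y_i x_i,
    with eta_t = 1/(lam*t)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        A = rng.choice(n, size=b, replace=False)    # uniform mini-batch A_t
        eta = 1.0 / (lam * t)
        margins = y[A] * (X[A] @ w)                 # y_i <w_t, x_i> on the batch
        viol = A[margins < 1]                       # examples with active hinge subgradient
        w = (1 - eta * lam) * w + (eta / b) * (y[viol] @ X[viol])
        iterates.append(w.copy())
    # Theorem 1 averages the second half of the iterates
    return np.mean(iterates[T // 2:], axis=0)
```

On a trivially separable toy set the returned average separates the two classes.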
Insight into β_b/b

  β_b/b = (1 + (b − 1)(nσ² − 1)/(n − 1)) / b

Letting X = [x_1, x_2, . . . , x_n], Z = [y_1 x_1, . . . , y_n x_n] and assuming ‖x_i‖ = 1 for all i, we have

  nσ² := ‖Q‖ = ‖ZZᵀ‖ = ‖ZᵀZ‖ = λ_max(ZᵀZ) ∈ [tr(ZᵀZ)/n, tr(ZᵀZ)] = [tr(XᵀX)/n, tr(XᵀX)] = [1, n]

- nσ² = n ⇒ β_b/b = 1 (no parallelization speedup; mini-batching does not help)
- nσ² = 1 ⇒ β_b/b = 1/b (speedup equal to the batch size!)

A similar expression appears in
P. R. and Martin Takáč
Parallel coordinate descent methods for big data optimization, 2012
with nσ² replaced by ω (the degree of partial separability of the loss function)
8 / 31
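The two extreme cases above can be checked numerically; a small sketch (my construction, with the rows of Z taken to be the unit-norm signed examples y_i x_i):

```python
import numpy as np

def beta_over_b(Z, b):
    """beta_b / b, where n*sigma^2 = ||Z Z^T|| = sigma_max(Z)^2
    and the rows of Z are the signed examples y_i x_i (unit norm)."""
    n = Z.shape[0]
    nsigma2 = np.linalg.norm(Z, 2) ** 2       # largest singular value, squared
    beta_b = 1.0 + (b - 1) * (nsigma2 - 1) / (n - 1)
    return beta_b / b

# Identical examples: n*sigma^2 = n, so beta_b/b = 1 (no speedup)
Z_same = np.tile(np.array([[1.0, 0.0]]), (4, 1))
# Orthogonal examples: n*sigma^2 = 1, so beta_b/b = 1/b (linear speedup)
Z_orth = np.eye(4)
```

With b = 4 the two constructions give β_b/b = 1 and β_b/b = 1/4 respectively.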
Computing β_b: SVM Datasets

To run SDCA with safe mini-batching, we need to compute β_b:

  β_b = 1 + (b − 1)(nσ² − 1)/(n − 1),  where nσ² = λ_max(ZᵀZ)

Two options:

- Compute the largest eigenvalue (e.g., by the power method)
- Replace nσ² by an upper bound: the degree of partial separability ω. General SDCA methods based on ω are described in:
  P. R. and Martin Takáč
  Parallel coordinate descent methods for big data optimization, 2012

| Dataset  | # Examples (n) | # Features (d) | ‖A‖₀ | nσ² ∈ [1, n] | ω |
|----------|----------------|----------------|-------------|--------------|--------|
| a1a      | 1,605          | 123            | 22,249      | 13.879       | 14     |
| a9a      | 32,561         | 123            | 451,592     | 13.885       | 14     |
| rcv1     | 20,242         | 47,236         | 1,498,952   | 105.570      | 980    |
| real-sim | 72,309         | 20,958         | 3,709,191   | 44.253       | 3,484  |
| news20   | 19,996         | 1,355,191      | 9,097,958   | 9,674.184    | 16,423 |
| url      | 2,396,130      | 3,231,961      | 277,058,644 | 114.956      | 414    |
| webspam  | 350,000        | 16,609,143     | 29,796,333  | 50.436       | 127    |
| kdda2010 | 8,407,752      | 20,216,830     | 305,613,510 | 47.459       | 85     |
| kddb2010 | 19,264,097     | 29,890,095     | 566,345,888 | 41.467       | 75     |
9 / 31
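The first option, estimating nσ² = λ_max(ZᵀZ) by the power method, needs only matrix-vector products with Z and Zᵀ, so it stays cheap even for sparse data. A sketch (illustrative, not the authors' implementation; assumes Z is nonzero):

```python
import numpy as np

def nsigma2_power_method(Z, iters=200, tol=1e-10, rng=0):
    """Estimate n*sigma^2 = lambda_max(Z^T Z) by power iteration on Z^T Z,
    touching the data only through products with Z and Z^T."""
    rng = np.random.default_rng(rng)
    v = rng.standard_normal(Z.shape[1])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        u = Z.T @ (Z @ v)            # one multiplication by Z^T Z
        lam_new = v @ u              # Rayleigh quotient estimate
        v = u / np.linalg.norm(u)
        if abs(lam_new - lam) < tol * max(lam_new, 1.0):
            lam = lam_new
            break
        lam = lam_new
    return lam
```

For sparse datasets such as the ones in the table, Z would be a SciPy sparse matrix; the same two products suffice.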
Where does β_b Come from?

Lemma 1
Consider any symmetric Q ∈ ℝ^{n×n}, a random subset A ⊂ {1, 2, . . . , n} with |A| = b, and v ∈ ℝⁿ. Then

  E[v_[A]ᵀ Q v_[A]] = (b/n) [ (1 − (b − 1)/(n − 1)) Σ_{i=1}^n Q_ii v_i² + ((b − 1)/(n − 1)) vᵀQv ].

Moreover, if Q_ii ≤ 1 for all i, then we get the following ESO (Expected Separable Overapproximation):

  E[v_[A]ᵀ Q v_[A]] ≤ (b/n) β_b ‖v‖².

Remark: ESO inequalities are systematically developed in
P. R. and Martin Takáč
Parallel coordinate descent methods for big data optimization, 2012
10 / 31
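Lemma 1 can be sanity-checked by enumerating all b-element subsets on a small instance (a check I added; the matrix and vector are arbitrary):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, b = 6, 3
M = rng.standard_normal((n, n))
Q = M @ M.T                         # an arbitrary symmetric matrix for the check
v = rng.standard_normal(n)

# Right-hand side of Lemma 1
p = (b - 1) / (n - 1)
lemma_value = (b / n) * ((1 - p) * np.sum(np.diag(Q) * v**2) + p * (v @ Q @ v))

# Exact expectation: average v_[A]^T Q v_[A] over all b-element subsets A
total = 0.0
subsets = list(combinations(range(n), b))
for A in subsets:
    vA = np.zeros(n)
    vA[list(A)] = v[list(A)]        # v_[A]: v with coordinates outside A zeroed out
    total += vA @ Q @ vA
expectation = total / len(subsets)
```

The two quantities agree to machine precision, which is exactly the identity in the lemma.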
Insight into the Analysis

Classical Pegasos Analysis
Uses the inequality

  ‖∇L_{A_t}(w)‖² ≤ 1,

which holds for any A_t ⊂ S = {1, 2, . . . , n}.

New Analysis
Uses the inequality

  E‖∇L_{A_t}(w)‖² ≤ β_b/b,

which holds for A_t with |A_t| = b chosen uniformly at random (established by the previous lemma).
11 / 31
PART II:
Stochastic Dual Coordinate Ascent(SDCA)
12 / 31
Stochastic Dual Coordinate Ascent (SDCA)

Problem:

  max_{α ∈ ℝⁿ, 0 ≤ α_i ≤ 1} { D(α) := (1/n) Σ_{i=1}^n α_i − (1/(2λn²)) αᵀQα }   (D)

Algorithm
1. Choose α_0 = 0 ∈ ℝⁿ
2. For t = 0, 1, 2, . . . iterate:
   2.1 Choose i ∈ {1, . . . , n} uniformly at random
   2.2 Set δ* ← argmax{ D(α_t + δe_i) : 0 ≤ α_i + δ ≤ 1 }
   2.3 α_{t+1} ← α_t + δ* e_i

First proposed for SVM by
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan
A dual coordinate descent method for large-scale linear SVM, ICML 2008

General analysis in
P. R. and M. Takáč
Iteration complexity of randomized block-coordinate descent methods . . . , MAPR 2012
[INFORMS Computing Society Best Student Paper Prize (runner-up), 2012]
13 / 31
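For the SVM dual above, step 2.2 has a closed form: D(α + δe_i) is a concave quadratic in δ, its unconstrained maximizer is δ = (λn − (Qα)_i)/Q_ii, and clipping to the box [−α_i, 1 − α_i] gives δ*. A sketch operating directly on a dense Q (my code and names, assuming Q_ii > 0):

```python
import numpy as np

def sdca(Q, lam, T, rng=0):
    """Serial SDCA on D(alpha) = (1/n) sum_i alpha_i - alpha^T Q alpha / (2*lam*n^2),
    subject to 0 <= alpha_i <= 1; each step maximizes D exactly in one coordinate."""
    rng = np.random.default_rng(rng)
    n = Q.shape[0]
    alpha = np.zeros(n)
    Qalpha = np.zeros(n)                              # maintain Q @ alpha incrementally
    for _ in range(T):
        i = rng.integers(n)
        # unconstrained maximizer of the 1-D concave quadratic, then clip to the box
        delta = (lam * n - Qalpha[i]) / Q[i, i]
        delta = float(np.clip(delta, -alpha[i], 1.0 - alpha[i]))
        alpha[i] += delta
        Qalpha += delta * Q[:, i]
    return alpha

def dual_objective(alpha, Q, lam):
    n = alpha.size
    return alpha.mean() - (alpha @ Q @ alpha) / (2 * lam * n**2)
```

On the 2×2 example used later in the talk, this serial method reaches the optimal dual value 0.25.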
“Naive” Mini-Batching / Parallelization

Problem:

  max_{α ∈ ℝⁿ, 0 ≤ α_i ≤ 1} { D(α) := (1/n) Σ_{i=1}^n α_i − (1/(2λn²)) αᵀQα }   (D)

Algorithm
1. Choose α_0 = 0 ∈ ℝⁿ
2. For t = 0, 1, 2, . . . iterate:
   2.1 Choose A_t ⊂ {1, . . . , n}, |A_t| = b, uniformly at random
   2.2 Set α_{t+1} ← α_t
   2.3 For i ∈ A_t do
       2.3.1 Set δ* ← argmax{ D(α_t + δe_i) : 0 ≤ α_i + δ ≤ 1 }
       2.3.2 α_{t+1} ← α_{t+1} + δ* e_i

Analyzed in
Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin
Parallel coordinate descent for L1-regularized loss minimization, ICML 2011

- Convergence guaranteed only for “small” b (β_b ≤ 2)
- Analysis does not cover the SVM dual (D)
14 / 31
Example: Failure of Naive Parallelization

[Figure: a two-dimensional illustration, with both axes running from 0 to 1; the construction is detailed on the next slide.]
15 / 31
Example: Details

Problem:

  Q = [1 1; 1 1],  λ = 1/n = 1/2,  b = 2
  ⇒ D(α) = (1/2) eᵀα − (1/4) αᵀQα

The naive approach will produce the sequence:

- α_0 = (0, 0)ᵀ with D(α_0) = 0
- α_1 = (1, 1)ᵀ with D(α_1) = 0
- α_2 = (0, 0)ᵀ with D(α_2) = 0
- α_3 = (1, 1)ᵀ with D(α_3) = 0
- · · ·

Optimal solution: D(α*) = D((1/2, 1/2)ᵀ) = 0.25
16 / 31
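The oscillation above is easy to reproduce; a short sketch of the naive parallel step on this exact instance (my code):

```python
import numpy as np

Q = np.array([[1.0, 1.0], [1.0, 1.0]])
n, lam = 2, 0.5                                       # b = n = 2: both coordinates updated

def D(alpha):
    # D(alpha) = (1/n) sum_i alpha_i - alpha^T Q alpha / (2*lam*n^2)
    return alpha.mean() - (alpha @ Q @ alpha) / (2 * lam * n**2)

alpha = np.zeros(n)
for t in range(4):
    # "naive" parallel step: both coordinates maximize D from the *old* alpha
    Qalpha = Q @ alpha
    delta = (lam * n - Qalpha) / np.diag(Q)           # per-coordinate maximizers
    alpha = np.clip(alpha + delta, 0.0, 1.0)          # applied simultaneously
    # alpha oscillates (0,0) -> (1,1) -> (0,0) -> ... with D(alpha) = 0 throughout
```

After any even number of steps the iterate is back at (0, 0)ᵀ, still with objective value 0.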
Safe Mini-Batching

Instead of choosing δ “naively” via maximizing the original function

  D(α + δ) = −(αᵀQα + 2αᵀQδ + δᵀQδ)/(2λn²) + Σ_{i=1}^n (α_i + δ_i)/n,

work with its Expected Separable Underapproximation:

  H_{β_b}(δ, α) := −(αᵀQα + 2αᵀQδ + β_b ‖δ‖²)/(2λn²) + Σ_{i=1}^n (α_i + δ_i)/n.

That is, instead of

  δ* ← argmax{ D(α + δe_i) : 0 ≤ α_i + δ ≤ 1 },  i ∈ A_t,

do

  δ* ← argmax{ H_{β_b}(δe_i, α) : 0 ≤ α_i + δ ≤ 1 },  i ∈ A_t.
17 / 31
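On the 2×2 failure example, replacing the curvature Q_ii by β_b = 2 in each coordinate step makes the parallel update land exactly on the maximizer. A sketch (my code, following the H_{β_b} formula above):

```python
import numpy as np

Q = np.array([[1.0, 1.0], [1.0, 1.0]])
n, b, lam = 2, 2, 0.5
nsigma2 = np.linalg.eigvalsh(Q).max()                 # = 2 for this Q
beta_b = 1 + (b - 1) * (nsigma2 - 1) / (n - 1)        # = 2

alpha = np.zeros(n)
for t in range(50):
    Qalpha = Q @ alpha
    # maximize H_{beta_b} coordinate-wise: the curvature Q_ii is replaced by beta_b,
    # so each coordinate takes the smaller, "safe" step (lam*n - (Q alpha)_i)/beta_b
    delta = (lam * n - Qalpha) / beta_b
    alpha = np.clip(alpha + delta, 0.0, 1.0)
```

The very first safe step moves from (0, 0)ᵀ to the maximizer (1/2, 1/2)ᵀ, and all later steps are zero.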
Safe Mini-Batching: General Theory

Developed in
P. R. and Martin Takáč
Parallel coordinate descent methods for big data optimization, 2012

Based on the idea:
“If you can’t guarantee descent/ascent, guarantee it in expectation.”

Definition (Expected Separable Overapproximation)
Let f : ℝⁿ → ℝ be convex and smooth. Let S be a random subset of {1, 2, . . . , n} such that P(i ∈ S) = const for all i (uniform sampling), and for w ∈ ℝⁿ₊₊ define ‖x‖_w := (Σ_{i=1}^n w_i x_i²)^{1/2}.

Then we say that f admits a (β, w)-ESO w.r.t. S if for all x, h ∈ ℝⁿ:

  E[f(x + h_[S])] ≤ f(x) + (E[|S|]/n) ( ⟨∇f(x), h⟩ + (β/2) ‖h‖²_w )
18 / 31
ESO for SVM Dual

The ESO can also be written as

  E[f(x + h_[S])] ≤ (1 − E[|S|]/n) f(x) + (E[|S|]/n) · H_β(h; x),

where H_β(h; x) := f(x) + ⟨∇f(x), h⟩ + (β/2) ‖h‖²_w.

SVM setting: f(x) = −D(α), h = δ, S = A_t, |S| = b, w = (1, . . . , 1)ᵀ.

Lemma 3
For the SVM dual loss we have, for all α, δ ∈ ℝⁿ, the following ESO:

  E[D(α + δ_[A_t])] ≥ (1 − b/n) D(α) + (b/n) H_{β_b}(δ; α),

where

  H_{β_b}(δ, α) = −(αᵀQα + 2αᵀQδ + β_b ‖δ‖²)/(2λn²) + Σ_{i=1}^n (α_i + δ_i)/n,
  β_b := 1 + (b − 1)(nσ² − 1)/(n − 1).
19 / 31
Primal Suboptimality for SDCA with Safe Mini-Batching

Theorem 2
Let us run SDCA with safe mini-batching, with α_0 = 0 ∈ ℝⁿ and ε > 0. If we let

  t_0 ≥ max{0, ⌈(n/b) log(2λn/β_b)⌉},
  T_0 ≥ t_0 + (β_b/b) [ 4/(λε) − 2n/β_b ]₊,
  T ≥ T_0 + max{ ⌈n/b⌉, (β_b/b) · 1/(λε) },
  ᾱ := (1/(T − T_0)) Σ_{t=T_0}^{T−1} α_t,

then

  w(ᾱ) := (1/(λn)) Σ_{i=1}^n ᾱ_i y_i x_i

is an ε-approximate solution to the PRIMAL problem, i.e.,

  E[P(w(ᾱ))] − P(w*) ≤ E[ P(w(ᾱ)) − D(ᾱ) ] ≤ ε,

where P(w(ᾱ)) − D(ᾱ) is the duality gap.
20 / 31
Primal Suboptimality: Simple Expression

  (β_b/b) · 5/(λε) + (n/b) (1 + log(2λn/β_b))
21 / 31
PART III:
SGD vs SDCA: Theory and Numerics
22 / 31
SGD vs. SDCA: Theory

Stochastic Gradient Descent (SGD)
SGD needs

  T = (β_b/b) · 30/(λε)

Stochastic Dual Coordinate Ascent (SDCA)
SDCA (with safe mini-batching) needs

  T = (β_b/b) · 5/(λε) + (n/b) (1 + log(2λn/β_b))
24 / 31
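To get a feel for the two bounds above, they can be evaluated for concrete parameter values; a small Python sketch, with all numbers made up for illustration:

```python
import math

def sgd_iters(beta_b, b, lam, eps):
    # mini-batch Pegasos bound: T = (beta_b/b) * 30/(lam*eps)
    return (beta_b / b) * 30 / (lam * eps)

def sdca_iters(beta_b, b, lam, eps, n):
    # safe mini-batch SDCA bound:
    # T = (beta_b/b) * 5/(lam*eps) + (n/b) * (1 + log(2*lam*n/beta_b))
    return (beta_b / b) * 5 / (lam * eps) + (n / b) * (1 + math.log(2 * lam * n / beta_b))
```

For example, with the illustrative values n = 100,000, λ = 1e-4, ε = 1e-3, b = 100 and nσ² = 100 (so β_b ≈ 1.1), the SDCA bound is the smaller of the two, reflecting the constants 5 vs. 30 plus the extra logarithmic term.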
Numerical Experiments: Datasets

| Data     | # train | # test  | # features (d) | Sparsity % | λ        |
|----------|---------|---------|----------------|------------|----------|
| cov      | 522,911 | 58,101  | 54             | 22         | 0.000010 |
| rcv1     | 20,242  | 677,399 | 47,236         | 0.16       | 0.000100 |
| astro-ph | 29,882  | 32,487  | 99,757         | 0.08       | 0.000050 |
| news20   | 15,020  | 4,976   | 1,355,191      | 0.04       | 0.000125 |
25 / 31
Batch Size vs Iterations, ε = 0.001

[Figure: covertype dataset; iterations (10²–10⁶, log scale) versus batch size (10⁰–10⁴, log scale) for Pegasos (SGD), Naive SDCA, Safe SDCA and Aggressive SDCA, with β_b/b (10⁻²–10⁰) plotted on the right axis.]
26 / 31
Batch Size vs Iterations, ε = 0.001

[Figure: astro-ph dataset; iterations (10²–10⁶, log scale) versus batch size (10⁰–10⁴, log scale); same methods and right-axis β_b/b scale (10⁻²–10⁰) as the previous plot.]
27 / 31
Numerical Experiments

[Figure: astro-ph, b = 8192. Left panel: test error (0–0.25) versus iterations (10⁰–10³, log scale). Right panel: primal/dual suboptimality (10⁻⁸–10⁰, log scale) versus iterations.]
28 / 31
Test Error and Primal/Dual Suboptimality

[Figure: news20, b = 256. Left panel: test error (0–0.5) versus iterations (10⁰–10⁴, log scale) for Pegasos (SGD), Naive SDCA, Safe SDCA and Aggressive SDCA. Right panel: primal/dual suboptimality (10⁻⁸–10⁰, log scale) versus iterations.]
29 / 31
Summary 1

Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, ICML 2007
+ Analysis of SGD for b = 1
− Weak analysis for b > 1 (no speedup)

P. R. and Martin Takáč
Parallel coordinate descent methods for big data optimization, 2012
+ General analysis of SDCA for b > 1 (even variable b)
+ ESO: Expected Separable Overapproximation
− Dual suboptimality only

Shai Shalev-Shwartz and Tong Zhang
Stochastic dual coordinate ascent methods for regularized loss minimization, 2012
+ Primal suboptimality for SDCA with b = 1
− No analysis for b > 1
30 / 31
Summary 2

Martin Takáč, Avleen Bijral, P. R. and Nathan Srebro
Mini-batch primal and dual methods for SVMs, 2013

- First analysis of mini-batched SGD for the SVM primal that works
- New mini-batch SDCA method for the SVM dual
  - with safe mini-batching
  - with aggressive mini-batching
- Both SGD and SDCA
  - have guarantees in terms of primal suboptimality
  - have parallelization speedup controlled by the spectral norm of the data
  - have essentially identical iterations
31 / 31