Peter Richtárik
Coordinate Descent Methods with Arbitrary Sampling
Optimization and Statistical Learning – Les Houches – France – January 11-16, 2015
Papers & Coauthors
Coauthors: Zheng Qu, Martin Takáč, Tong Zhang
Zheng Qu, P.R. and Tong Zhang. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014
P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. In NIPS Workshop on Optimization for Machine Learning, 2013 (arXiv:1310.3438)
Zheng Qu and P.R. Coordinate descent with arbitrary sampling I: algorithms and complexity. arXiv:1412.8060, 2014
Zheng Qu and P.R. Coordinate descent with arbitrary sampling II: expected separable overapproximation. arXiv:1412.8063, 2014
Warmup
Part A: NSync
P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. In NIPS Workshop on Optimization for Machine Learning, 2013 (arXiv:1310.3438)
Problem
Minimize a smooth and strongly convex function f over x in R^n.
NSync: in each iteration, update an i.i.d. random subset of coordinates drawn from an arbitrary distribution.
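For reference, a sketch of the setup and iteration as I read them from the NSync paper (σ denotes the strong convexity constant; the v_i are the stepsize parameters from the Key Assumption below):
\[\min_{x \in \mathbb{R}^n} f(x), \qquad x^{k+1}_i = x^k_i - \frac{1}{v_i} \nabla_i f(x^k) \ \text{ for } i \in S_k, \qquad S_k \stackrel{\text{i.i.d.}}{\sim} \hat{S}.\]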
Key Assumption
The inequality below must hold for all x, h in R^n.
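This is the Expected Separable Overapproximation (ESO). The formula did not survive extraction, so here is my reconstruction from the arXiv versions of the papers, with p_i = P(i ∈ Ŝ) and h_{[Ŝ]} = Σ_{i∈Ŝ} h_i e_i:
\[\mathbb{E}\left[f\bigl(x + h_{[\hat{S}]}\bigr)\right] \le f(x) + \sum_{i=1}^n p_i \nabla_i f(x)\, h_i + \frac{1}{2} \sum_{i=1}^n p_i v_i h_i^2.\]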
Complexity Theorem
(σ = strong convexity constant)
Proof: copy-paste from the paper.
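Up to precise constants, the statement I recall from the arXiv version is:
\[k \ge \max_i \frac{v_i}{p_i \sigma} \log\left(\frac{f(x^0) - f^*}{\epsilon}\right) \quad \Longrightarrow \quad \mathbb{E}\left[f(x^k)\right] - f^* \le \epsilon.\]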
Uniform vs Optimal Sampling
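With the leading factor max_i v_i/(p_i σ), the comparison behind this slide works out as follows (assuming the rate sketched above):
\[\text{uniform } (p_i = 1/n): \ \frac{n \max_i v_i}{\sigma}, \qquad \text{optimal } (p_i \propto v_i): \ \frac{\sum_{i=1}^n v_i}{\sigma},\]
so optimal probabilities can improve the bound by a factor of up to n.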
Two-level sampling
Definition of a parametric family of random subsets of {1, 2, …, n} of fixed cardinality, constructed in three steps: STEP 0, STEP 1, STEP 2.
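A minimal Python sketch of the NSync iteration under an arbitrary sampling, assuming the sampling is supplied as a callable returning a random subset of {0, …, n−1}. The quadratic objective, the τ-nice sampling, and the choice v = τ·diag(M) (a conservative ESO choice for a dense quadratic) are illustrative placeholders, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def nsync(grad, v, sampling, x0, iters=2000):
    """NSync iteration: for each i in the sampled set S_k,
    x_i <- x_i - (1/v_i) * (i-th partial derivative of f at x)."""
    x = x0.copy()
    for _ in range(iters):
        S = sampling()          # random subset of {0, ..., n-1}
        g = grad(x)             # full gradient for simplicity; only g[S] is used
        x[S] -= g[S] / v[S]     # coordinate-wise steps with ESO parameters v
    return x

# Illustrative strongly convex quadratic f(x) = 0.5 x^T M x - b^T x
n, tau = 10, 3
M = rng.standard_normal((n, n))
M = M.T @ M + np.eye(n)
b = rng.standard_normal(n)
grad = lambda x: M @ x - b

v = tau * np.diag(M)            # conservative ESO choice for a dense quadratic
sampling = lambda: rng.choice(n, size=tau, replace=False)   # tau-nice sampling
x = nsync(grad, v, sampling, np.zeros(n))
```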
Part B: ALPHA
Zheng Qu and P.R. Coordinate descent with arbitrary sampling I: algorithms and complexity. arXiv:1412.8060, 2014
Problem
Minimize F(x) = f(x) + ψ(x), where f is smooth & convex and ψ is convex (and separable).
ALPHA (for smooth minimization)
STEP 0:
STEP 1:
STEP 2:
\[z^{t+1}_{i} \leftarrow z^t_{i} - \frac{p_i}{v_i \theta_t} \nabla_{i} f(y^t)\]
STEP 3: (see the sketch below for one reading of the full recursion)
S_t: i.i.d. random subsets of coordinates (any distribution allowed)
v: same as in NSync
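A minimal Python sketch of ALPHA for smooth minimization, under the assumption that it follows the APPROX-style recursion. Only the STEP 2 formula above is from the slide; STEP 1, STEP 3, the θ_t update, and the initialization are my reconstruction. Here p_i = P(i ∈ Ŝ), e.g. p_i = τ/n for τ-nice sampling:

```python
import numpy as np

def alpha(grad, v, p, sampling, x0, iters=1000):
    """One reading of ALPHA (APPROX-style recursion):
    STEP 1: y = (1 - theta) x + theta z
    STEP 2: z_i <- z_i - (p_i / (v_i * theta)) * grad_i f(y), i in S
    STEP 3: x_i <- y_i + (theta / p_i) * (z_i^new - z_i^old)."""
    x, z = x0.copy(), x0.copy()
    theta = float(np.min(p))                      # one admissible initialization
    for _ in range(iters):
        y = (1.0 - theta) * x + theta * z
        S = sampling()
        g = grad(y)
        dz = -(p[S] / (v[S] * theta)) * g[S]      # STEP 2 (slide formula)
        z[S] += dz
        x = y
        x[S] += (theta / p[S]) * dz               # STEP 3 (x-correction)
        theta = 0.5 * (np.sqrt(theta**4 + 4 * theta**2) - theta**2)  # theta update
    return x
```

The same placeholder quadratic and τ-nice sampling as in the NSync sketch can be plugged in, with p = (tau/n)*np.ones(n).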
Complexity Theorem
Holds for an arbitrary starting point; the parameters v are the same as in NSync.
Part C: PRIMAL-DUAL FRAMEWORK
Zheng Qu, P.R. and Tong Zhang. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014
Primal Problem
\[\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w)\]
d = # features (parameters), n = # samples; g is a 1-strongly convex regularizer; each loss φ_i is smooth & convex; λ is the regularization parameter.
Assumption 1
Each loss function φ_i has a Lipschitz-continuous gradient, with Lipschitz constant 1/γ.
Assumption 2
The regularizer g is 1-strongly convex (stated via a subgradient, since g need not be differentiable).
Dual Problem
\[\max_{\alpha \in \mathbb{R}^n} D(\alpha) = -\lambda g^*\left(\frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i\right) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)\]
Each conjugate φ_i^* is γ-strongly convex; g^* is 1-smooth & convex.
C.1 ALGORITHM: Quartz
Fenchel Duality
Weak duality
Optimality conditions
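Spelled out for this primal-dual pair, with ᾱ = (1/(λn)) Σ_i A_i α_i (a standard reconstruction, since the slide formulas did not survive extraction):
\[P(w) \ge D(\alpha) \ \text{ for all } w, \alpha \quad \text{(weak duality)}; \qquad w^* = \nabla g^*(\bar{\alpha}^*), \quad \alpha_i^* = -\nabla \phi_i(A_i^\top w^*) \quad \text{(optimality)}.\]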
The Algorithm
Quartz: Bird's Eye View
STEP 1: PRIMAL UPDATE (move w toward ∇g*(ᾱ) via a convex combination with constant θ)
STEP 2: DUAL UPDATE (update α_i for each i in the sampled set)
(A Python sketch of both steps follows.)
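A minimal Python sketch of the two Quartz steps, specialized to g(w) = ½‖w‖² (so ∇g*(ᾱ) = ᾱ) and square loss φ_i(a) = ½(a − y_i)². The dual-update form α_i ← (1 − θ/p_i) α_i + (θ/p_i)(−∇φ_i(A_i^⊤ w)) is my reading of STEP 2; θ and the sampling are passed in:

```python
import numpy as np

def quartz(A, y, lam, p, sampling, theta, iters=3000):
    """Quartz sketch for g(w) = 0.5 ||w||^2 and square loss.
    A is d x n; column A[:, i] holds example i."""
    d, n = A.shape
    alpha = np.zeros(n)
    abar = np.zeros(d)                    # abar = (1/(lam*n)) * sum_i A_i alpha_i
    w = np.zeros(d)
    for _ in range(iters):
        w = (1.0 - theta) * w + theta * abar        # STEP 1: primal update
        for i in sampling():                        # STEP 2: dual update
            target = -(A[:, i] @ w - y[i])          # -grad phi_i(A_i^T w)
            step = (theta / p[i]) * (target - alpha[i])
            alpha[i] += step
            abar += (step / (lam * n)) * A[:, i]    # keep abar in sync
    return w, alpha
```

Tracking the duality gap P(w) − D(α) then gives a natural stopping criterion.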
Randomized Primal-Dual Methods
SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
mSDCA: M. Takáč, A. Bijral, P.R. & N. Srebro, 03/2013
ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
DisDCA: T. Yang, 2013
Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
SPDC: Y. Zhang & L. Xiao, 09/2014
Quartz: Z. Qu, P.R. & T. Zhang, 11/2014
C.2 MAIN RESULT
Assumption 3 (Expected Separable Overapproximation)
The ESO inequality (as written out in Part A) must hold for all h.
Complexity Theorem (QRZ’14)
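Sketched from the arXiv version: with ESO parameters v and p_i = P(i ∈ Ŝ), the theorem guarantees
\[t \ge \max_i \left(\frac{1}{p_i} + \frac{v_i}{p_i \lambda \gamma n}\right) \log\left(\frac{P(w^0) - D(\alpha^0)}{\epsilon}\right) \quad \Longrightarrow \quad \mathbb{E}\left[P(w^t) - D(\alpha^t)\right] \le \epsilon.\]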
C.3 UPDATING ONE DUAL VARIABLE AT A TIME
Complexity of Quartz specialized to serial sampling
[Table: datasets and the resulting complexity bounds]
Experiment: Quartz vs SDCA, uniform vs optimal sampling (plots compare the standard primal update with the "aggressive" primal update)
C.4 TAU-NICE SAMPLING (STANDARD MINIBATCHING)
Data sparsity
Denote by ω̃ a normalized measure of average sparsity of the data, ranging from ω̃ = 1 ("fully sparse data") to ω̃ = n ("fully dense data").
Complexity of Quartz
Speedup
Assume the data is normalized. Then:
Linear speedup up to a certain data-independent minibatch size.
Further data-dependent speedup, up to the extreme case (fully sparse data).
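My reconstruction of the τ-nice complexity bound from the Quartz paper, up to constants, for normalized data (‖A_i‖ ≤ 1):
\[t \ge \left(\frac{n}{\tau} + \frac{1}{\lambda \gamma \tau}\left(1 + \frac{(\tilde{\omega} - 1)(\tau - 1)}{n - 1}\right)\right) \log\left(\frac{P(w^0) - D(\alpha^0)}{\epsilon}\right).\]
Reading off the two regimes: since ω̃ ≤ n, the bracket is at most τ, so the bound is never worse than n/τ + 1/(λγ), which still improves linearly in τ until τ ≈ λγn (data-independent); for sparse data (ω̃ close to 1) the bracket stays near 1, giving speedup all the way to the extreme case n/τ + 1/(λγτ).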
[Plots: speedup as a function of the minibatch size τ, for sparse, denser, and fully dense data. Datasets: astro_ph (n = 29,882, density = 0.08%) and CCAT (n = 781,265, density = 0.16%).]
Primal-dual methods with tau-nice sampling
Comparison against the accelerated methods of S. Shalev-Shwartz & T. Zhang '13 (ASDCA, AccProx-SDCA) and Y. Zhang & L. Xiao '14 (SPDC).
For sufficiently sparse data, Quartz wins even when compared against accelerated methods.
C.5 DISTRIBUTED QUARTZ
Distributed Quartz: Perform the Dual Updates in a Distributed Manner
Quartz STEP 2: DUAL UPDATE
Data required to compute the update
Distribution of Data: n = # dual variables; the columns of the data matrix are partitioned across the nodes.
Distributed sampling
A random set of dual variables: each node draws from its own partition of the data (see the sketch below).
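A small Python sketch of what I take the distributed sampling to be, following the Hydra paper: the n dual variables are partitioned across c nodes, and each node independently picks τ of its own variables uniformly at random. The contiguous partition here is for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_distributed_sampling(n, c, tau):
    """(c, tau)-distributed sampling: partition {0,...,n-1} into c blocks,
    one per node; each node samples tau of its own indices without replacement."""
    blocks = np.array_split(np.arange(n), c)
    def sampling():
        return np.concatenate(
            [rng.choice(block, size=tau, replace=False) for block in blocks])
    return sampling

S = make_distributed_sampling(n=20, c=4, tau=2)()
# S holds 2 indices from each of the 4 node-local blocks: c*tau = 8 per iteration
```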
Distributed sampling & distributed coordinate descent
P.R. and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013
Previously studied (not in the primal-dual setup):
Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč. Fast distributed coordinate descent for minimizing non-strongly convex losses. 2014 IEEE Int. Workshop on Machine Learning for Signal Processing, May 2014
Jakub Mareček, P.R. and Martin Takáč. Fast distributed coordinate descent for minimizing partially separable functions. arXiv:1406.0238, June 2014
Settings covered: strongly convex & smooth; convex & smooth.
Complexity of distributed Quartz
Reallocating load: theoretical speedup
[Plots: n = 1,000,000 with density = 100%, and n = 1,000,000 with density = 0.01%.]
Extra material (in the zero-probability event that I will have time for it)
Part D: ESO
Zheng Qu and P.R. Coordinate descent with arbitrary sampling II: expected separable overapproximation. arXiv:1412.8063, 2014
Computation of ESO parameters
Lemma (QR'14b). (For simplicity, assume that m = 1.) For any sampling Ŝ, an ESO holds with parameters v computable in closed form from the sampling.
Theorem (QR'14b). Gives the ESO parameters v for structured samplings (e.g., τ-nice) in terms of the data.
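A sketch of how the ESO parameters can be computed for τ-nice sampling, using the closed form I recall from QR'14b (and the earlier APPROX paper) for f(x) = Σ_j φ_j(A_{j:} x) with 1-smooth φ_j; the example matrix is illustrative:

```python
import numpy as np

def eso_tau_nice(A, tau):
    """ESO parameters for tau-nice sampling:
    v_i = sum_j (1 + (omega_j - 1)(tau - 1)/(n - 1)) * A[j, i]**2,
    where omega_j is the number of nonzeros in row j of A."""
    m, n = A.shape
    omega = (A != 0).sum(axis=1)                    # row sparsity
    w = 1.0 + (omega - 1.0) * (tau - 1.0) / max(n - 1.0, 1.0)
    return (w[:, None] * A**2).sum(axis=0)          # v in R^n

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
print(eso_tau_nice(A, tau=2))   # fully dense rows would give v_i = tau * (A^T A)_{ii}
```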
Experiment
Machine: 128 nodes of Hector Supercomputer (4,096 cores)
Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB
Algorithm: Hydra with c = 512
P.R. and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013
LASSO: 3TB data + 128 nodes
Experiment
Machine: 128 nodes of Archer Supercomputer
Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A)
Algorithm: Hydra² with c = 256
Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč. Fast distributed coordinate descent for minimizing non-strongly convex losses. IEEE Int. Workshop on Machine Learning for Signal Processing, 2014
LASSO: 5TB data (d = 50b) + 128 nodes
END