Peter Richtárik
Coordinate Descent Methods with Arbitrary Sampling
Optimization and Statistical Learning – Les Houches – France – January 11-16, 2015
Papers & Coauthors
Coauthors: Zheng Qu, Martin Takáč, Tong Zhang
Zheng Qu, P.R. and Tong Zhang. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014
P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. In NIPS Workshop on Optimization for Machine Learning, 2013 (arXiv:1310.3438)
Zheng Qu and P.R. Coordinate descent with arbitrary sampling I: algorithms and complexity. arXiv:1412.8060, 2014
Zheng Qu and P.R. Coordinate descent with arbitrary sampling II: expected separable overapproximation. arXiv:1412.8063, 2014
Warmup
Part A: NSync
P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. In NIPS Workshop on Optimization for Machine Learning, 2013 (arXiv:1310.3438)
Problem
Minimize a smooth and strongly convex function f over x in R^n.
NSync: in each iteration, update an i.i.d. random subset of coordinates drawn from an arbitrary distribution.
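For reference, a sketch of the setup and iteration as I read them from the NSync paper (σ denotes the strong convexity constant; the v_i are the stepsize parameters from the Key Assumption below):
\[\min_{x \in \mathbb{R}^n} f(x), \qquad x^{k+1}_i = x^k_i - \frac{1}{v_i} \nabla_i f(x^k) \ \text{ for } i \in S_k, \qquad S_k \stackrel{\text{i.i.d.}}{\sim} \hat{S}.\]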
Key Assumption
The inequality below must hold for all x, h in R^n.
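This is the Expected Separable Overapproximation (ESO). The formula did not survive extraction, so here is my reconstruction from the arXiv versions of the papers, with p_i = P(i ∈ Ŝ) and h_{[Ŝ]} = Σ_{i∈Ŝ} h_i e_i:
\[\mathbb{E}\left[f\bigl(x + h_{[\hat{S}]}\bigr)\right] \le f(x) + \sum_{i=1}^n p_i \nabla_i f(x)\, h_i + \frac{1}{2} \sum_{i=1}^n p_i v_i h_i^2.\]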
Complexity Theorem
(σ = strong convexity constant)
Proof: copy-paste from the paper.
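Up to precise constants, the statement I recall from the arXiv version is:
\[k \ge \max_i \frac{v_i}{p_i \sigma} \log\left(\frac{f(x^0) - f^*}{\epsilon}\right) \quad \Longrightarrow \quad \mathbb{E}\left[f(x^k)\right] - f^* \le \epsilon.\]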
Uniform vs Optimal Sampling
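With the leading factor max_i v_i/(p_i σ), the comparison behind this slide works out as follows (assuming the rate sketched above):
\[\text{uniform } (p_i = 1/n): \ \frac{n \max_i v_i}{\sigma}, \qquad \text{optimal } (p_i \propto v_i): \ \frac{\sum_{i=1}^n v_i}{\sigma},\]
so optimal probabilities can improve the bound by a factor of up to n.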
Two-level sampling
Definition of a parametric family of random subsets of {1, 2, …, n} of fixed cardinality, constructed in three steps: STEP 0, STEP 1, STEP 2.
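A minimal Python sketch of the NSync iteration under an arbitrary sampling, assuming the sampling is supplied as a callable returning a random subset of {0, …, n−1}. The quadratic objective, the τ-nice sampling, and the choice v = τ·diag(M) (a conservative ESO choice for a dense quadratic) are illustrative placeholders, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def nsync(grad, v, sampling, x0, iters=2000):
    """NSync iteration: for each i in the sampled set S_k,
    x_i <- x_i - (1/v_i) * (i-th partial derivative of f at x)."""
    x = x0.copy()
    for _ in range(iters):
        S = sampling()          # random subset of {0, ..., n-1}
        g = grad(x)             # full gradient for simplicity; only g[S] is used
        x[S] -= g[S] / v[S]     # coordinate-wise steps with ESO parameters v
    return x

# Illustrative strongly convex quadratic f(x) = 0.5 x^T M x - b^T x
n, tau = 10, 3
M = rng.standard_normal((n, n))
M = M.T @ M + np.eye(n)
b = rng.standard_normal(n)
grad = lambda x: M @ x - b

v = tau * np.diag(M)            # conservative ESO choice for a dense quadratic
sampling = lambda: rng.choice(n, size=tau, replace=False)   # tau-nice sampling
x = nsync(grad, v, sampling, np.zeros(n))
```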
Part B: ALPHA
Zheng Qu and P.R. Coordinate descent with arbitrary sampling I: algorithms and complexity. arXiv:1412.8060, 2014
Problem
Minimize F(x) = f(x) + ψ(x), where f is smooth & convex and ψ is convex (and separable).
ALPHA (for smooth minimization)
STEP 0:
STEP 1:
STEP 2:
\[z^{t+1}_{i} \leftarrow z^t_{i} - \frac{p_i}{v_i \theta_t} \nabla_{i} f(y^t)\]
STEP 3: (see the sketch below for one reading of the full recursion)
S_t: i.i.d. random subsets of coordinates (any distribution allowed)
v: same as in NSync
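A minimal Python sketch of ALPHA for smooth minimization, under the assumption that it follows the APPROX-style recursion. Only the STEP 2 formula above is from the slide; STEP 1, STEP 3, the θ_t update, and the initialization are my reconstruction. Here p_i = P(i ∈ Ŝ), e.g. p_i = τ/n for τ-nice sampling:

```python
import numpy as np

def alpha(grad, v, p, sampling, x0, iters=1000):
    """One reading of ALPHA (APPROX-style recursion):
    STEP 1: y = (1 - theta) x + theta z
    STEP 2: z_i <- z_i - (p_i / (v_i * theta)) * grad_i f(y), i in S
    STEP 3: x_i <- y_i + (theta / p_i) * (z_i^new - z_i^old)."""
    x, z = x0.copy(), x0.copy()
    theta = float(np.min(p))                      # one admissible initialization
    for _ in range(iters):
        y = (1.0 - theta) * x + theta * z
        S = sampling()
        g = grad(y)
        dz = -(p[S] / (v[S] * theta)) * g[S]      # STEP 2 (slide formula)
        z[S] += dz
        x = y
        x[S] += (theta / p[S]) * dz               # STEP 3 (x-correction)
        theta = 0.5 * (np.sqrt(theta**4 + 4 * theta**2) - theta**2)  # theta update
    return x
```

The same placeholder quadratic and τ-nice sampling as in the NSync sketch can be plugged in, with p = (tau/n)*np.ones(n).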
Complexity Theorem
Holds for an arbitrary starting point; the parameters v are the same as in NSync.
Part C: PRIMAL-DUAL FRAMEWORK
Zheng Qu, P.R. and Tong Zhang. Randomized dual coordinate ascent with arbitrary sampling. arXiv:1411.5873, 2014
Primal Problem
\[\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n} \sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w)\]
d = # features (parameters), n = # samples; g is a 1-strongly convex regularizer; each loss φ_i is smooth & convex; λ is the regularization parameter.
Assumption 1
Each loss function φ_i has a Lipschitz-continuous gradient, with Lipschitz constant 1/γ.
Assumption 2
The regularizer g is 1-strongly convex (stated via a subgradient, since g need not be differentiable).
Dual Problem
\[\max_{\alpha \in \mathbb{R}^n} D(\alpha) = -\lambda g^*\left(\frac{1}{\lambda n} \sum_{i=1}^n A_i \alpha_i\right) - \frac{1}{n} \sum_{i=1}^n \phi_i^*(-\alpha_i)\]
Each conjugate φ_i^* is γ-strongly convex; g^* is 1-smooth & convex.
C.1 ALGORITHM: Quartz
Fenchel Duality
Weak duality
Optimality conditions
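Spelled out for this primal-dual pair, with ᾱ = (1/(λn)) Σ_i A_i α_i (a standard reconstruction, since the slide formulas did not survive extraction):
\[P(w) \ge D(\alpha) \ \text{ for all } w, \alpha \quad \text{(weak duality)}; \qquad w^* = \nabla g^*(\bar{\alpha}^*), \quad \alpha_i^* = -\nabla \phi_i(A_i^\top w^*) \quad \text{(optimality)}.\]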
The Algorithm
Quartz: Bird's Eye View
STEP 1: PRIMAL UPDATE (move w toward ∇g*(ᾱ) via a convex combination with constant θ)
STEP 2: DUAL UPDATE (update α_i for each i in the sampled set)
(A Python sketch of both steps follows.)
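A minimal Python sketch of the two Quartz steps, specialized to g(w) = ½‖w‖² (so ∇g*(ᾱ) = ᾱ) and square loss φ_i(a) = ½(a − y_i)². The dual-update form α_i ← (1 − θ/p_i) α_i + (θ/p_i)(−∇φ_i(A_i^⊤ w)) is my reading of STEP 2; θ and the sampling are passed in:

```python
import numpy as np

def quartz(A, y, lam, p, sampling, theta, iters=3000):
    """Quartz sketch for g(w) = 0.5 ||w||^2 and square loss.
    A is d x n; column A[:, i] holds example i."""
    d, n = A.shape
    alpha = np.zeros(n)
    abar = np.zeros(d)                    # abar = (1/(lam*n)) * sum_i A_i alpha_i
    w = np.zeros(d)
    for _ in range(iters):
        w = (1.0 - theta) * w + theta * abar        # STEP 1: primal update
        for i in sampling():                        # STEP 2: dual update
            target = -(A[:, i] @ w - y[i])          # -grad phi_i(A_i^T w)
            step = (theta / p[i]) * (target - alpha[i])
            alpha[i] += step
            abar += (step / (lam * n)) * A[:, i]    # keep abar in sync
    return w, alpha
```

Tracking the duality gap P(w) − D(α) then gives a natural stopping criterion.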
Randomized Primal-Dual Methods
SDCA: S. Shalev-Shwartz & T. Zhang, 09/2012
mSDCA: M. Takáč, A. Bijral, P.R. & N. Srebro, 03/2013
ASDCA: S. Shalev-Shwartz & T. Zhang, 05/2013
AccProx-SDCA: S. Shalev-Shwartz & T. Zhang, 10/2013
DisDCA: T. Yang, 2013
Iprox-SDCA: P. Zhao & T. Zhang, 01/2014
APCG: Q. Lin, Z. Lu & L. Xiao, 07/2014
SPDC: Y. Zhang & L. Xiao, 09/2014
Quartz: Z. Qu, P.R. & T. Zhang, 11/2014
C.2 MAIN RESULT
Assumption 3 (Expected Separable Overapproximation)
The ESO inequality (as written out in Part A) must hold for all h.
Complexity Theorem (QRZ’14)
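Sketched from the arXiv version: with ESO parameters v and p_i = P(i ∈ Ŝ), the theorem guarantees
\[t \ge \max_i \left(\frac{1}{p_i} + \frac{v_i}{p_i \lambda \gamma n}\right) \log\left(\frac{P(w^0) - D(\alpha^0)}{\epsilon}\right) \quad \Longrightarrow \quad \mathbb{E}\left[P(w^t) - D(\alpha^t)\right] \le \epsilon.\]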
C.3 UPDATING ONE DUAL VARIABLE AT A TIME
Complexity of Quartz specialized to serial sampling
[Table: datasets and the resulting complexity bounds]
Experiment: Quartz vs SDCA, uniform vs optimal sampling (plots compare the standard primal update with the "aggressive" primal update)
C.4 TAU-NICE SAMPLING (STANDARD MINIBATCHING)
Data sparsity
Denote by ω̃ a normalized measure of average sparsity of the data, ranging from ω̃ = 1 ("fully sparse data") to ω̃ = n ("fully dense data").
Complexity of Quartz
Speedup
Assume the data is normalized. Then:
Linear speedup up to a certain data-independent minibatch size.
Further data-dependent speedup, up to the extreme case (fully sparse data).
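My reconstruction of the τ-nice complexity bound from the Quartz paper, up to constants, for normalized data (‖A_i‖ ≤ 1):
\[t \ge \left(\frac{n}{\tau} + \frac{1}{\lambda \gamma \tau}\left(1 + \frac{(\tilde{\omega} - 1)(\tau - 1)}{n - 1}\right)\right) \log\left(\frac{P(w^0) - D(\alpha^0)}{\epsilon}\right).\]
Reading off the two regimes: since ω̃ ≤ n, the bracket is at most τ, so the bound is never worse than n/τ + 1/(λγ), which still improves linearly in τ until τ ≈ λγn (data-independent); for sparse data (ω̃ close to 1) the bracket stays near 1, giving speedup all the way to the extreme case n/τ + 1/(λγτ).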
[Plots: speedup as a function of the minibatch size τ, for sparse, denser, and fully dense data. Datasets: astro_ph (n = 29,882, density = 0.08%) and CCAT (n = 781,265, density = 0.16%).]
Primal-dual methods with tau-nice sampling
Comparison against the accelerated methods of S. Shalev-Shwartz & T. Zhang '13 (ASDCA, AccProx-SDCA) and Y. Zhang & L. Xiao '14 (SPDC).
For sufficiently sparse data, Quartz wins even when compared against accelerated methods.
C.5 DISTRIBUTED QUARTZ
Distributed Quartz: Perform the Dual Updates in a Distributed Manner
Quartz STEP 2: DUAL UPDATE
Data required to compute the update
Distribution of Data: n = # dual variables; the columns of the data matrix are partitioned across the nodes.
Distributed sampling
A random set of dual variables: each node draws from its own partition of the data (see the sketch below).
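A small Python sketch of what I take the distributed sampling to be, following the Hydra paper: the n dual variables are partitioned across c nodes, and each node independently picks τ of its own variables uniformly at random. The contiguous partition here is for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_distributed_sampling(n, c, tau):
    """(c, tau)-distributed sampling: partition {0,...,n-1} into c blocks,
    one per node; each node samples tau of its own indices without replacement."""
    blocks = np.array_split(np.arange(n), c)
    def sampling():
        return np.concatenate(
            [rng.choice(block, size=tau, replace=False) for block in blocks])
    return sampling

S = make_distributed_sampling(n=20, c=4, tau=2)()
# S holds 2 indices from each of the 4 node-local blocks: c*tau = 8 per iteration
```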
Distributed sampling & distributed coordinate descent
P.R. and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013
Previously studied (not in the primal-dual setup):
Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč. Fast distributed coordinate descent for minimizing non-strongly convex losses. 2014 IEEE Int. Workshop on Machine Learning for Signal Processing, May 2014
Jakub Mareček, P.R. and Martin Takáč. Fast distributed coordinate descent for minimizing partially separable functions. arXiv:1406.0238, June 2014
Settings covered: strongly convex & smooth; convex & smooth.
Complexity of distributed Quartz
Reallocating load: theoretical speedup
[Plots: n = 1,000,000 with density = 100%, and n = 1,000,000 with density = 0.01%.]
Extra material (in the zero-probability event that I will have time for it)
Part D: ESO
Zheng Qu and P.R. Coordinate descent with arbitrary sampling II: expected separable overapproximation. arXiv:1412.8063, 2014
Computation of ESO parameters
Lemma (QR'14b). (For simplicity, assume that m = 1.) For any sampling Ŝ, an ESO holds with parameters v computable in closed form from the sampling.
Theorem (QR'14b). Gives the ESO parameters v for structured samplings (e.g., τ-nice) in terms of the data.
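A sketch of how the ESO parameters can be computed for τ-nice sampling, using the closed form I recall from QR'14b (and the earlier APPROX paper) for f(x) = Σ_j φ_j(A_{j:} x) with 1-smooth φ_j; the example matrix is illustrative:

```python
import numpy as np

def eso_tau_nice(A, tau):
    """ESO parameters for tau-nice sampling:
    v_i = sum_j (1 + (omega_j - 1)(tau - 1)/(n - 1)) * A[j, i]**2,
    where omega_j is the number of nonzeros in row j of A."""
    m, n = A.shape
    omega = (A != 0).sum(axis=1)                    # row sparsity
    w = 1.0 + (omega - 1.0) * (tau - 1.0) / max(n - 1.0, 1.0)
    return (w[:, None] * A**2).sum(axis=0)          # v in R^n

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
print(eso_tau_nice(A, tau=2))   # fully dense rows would give v_i = tau * (A^T A)_{ii}
```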
Experiment
Machine: 128 nodes of Hector Supercomputer (4,096 cores)
Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB
Algorithm: Hydra with c = 512
P.R. and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv:1310.2059, 2013
LASSO: 3TB data + 128 nodes
Experiment
Machine: 128 nodes of Archer Supercomputer
Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A)
Algorithm: Hydra² with c = 256
Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč. Fast distributed coordinate descent for minimizing non-strongly convex losses. IEEE Int. Workshop on Machine Learning for Signal Processing, 2014
LASSO: 5TB data (d = 50b) + 128 nodes
END