
Large Scale Machine Learning Over Networks

Francis Bach

INRIA - École Normale Supérieure, Paris, France

Joint work with Kevin Scaman, Hadrien Hendrikx, Laurent Massoulié,
Sébastien Bubeck, Yin-Tat Lee

PAISS Summer school - October 5, 2019

Scientific context

• Proliferation of digital data
  – Personal data
  – Industry
  – Scientific, from bioinformatics to humanities

• Need for automated processing of massive data

• Series of “hypes”:
  Big data → Data science → Machine Learning → Deep Learning → Artificial Intelligence

Recent progress in perception (vision, audio, text)

[Figure: “person ride dog” — from translate.google.fr and from Peyré et al. (2017)]

(1) Massive data

(2) Computing power

(3) Methodological and scientific progress

“Intelligence” = models + algorithms + data + computing power

Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms

Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n

• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d

• Advertising: n > 10⁹
  – Φ(x) ∈ {0, 1}^d, d > 10⁹
  – Navigation history + ad

• Linear predictions
  – h(x, θ) = θ⊤Φ(x)

Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n

• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d

  [Figure: example images x_1, …, x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1]

  – Neural networks (n, d > 10⁶): h(x, θ) = θ_m⊤ σ(θ_{m−1}⊤ σ(⋯ θ_2⊤ σ(θ_1⊤ x)))

  [Figure: network diagram x → θ_1 → θ_2 → θ_3 → y]

Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n

• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d

• (regularized) empirical risk minimization:

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ) = (1/n) ∑_{i=1}^n f_i(θ)

  data fitting term + regularizer

• Least-squares regression:

  min_{θ∈ℝ^d} (1/2n) ∑_{i=1}^n (y_i − h(x_i, θ))² + λ Ω(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Logistic regression:

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n log(1 + exp(−y_i h(x_i, θ))) + λ Ω(θ) = (1/n) ∑_{i=1}^n f_i(θ)

Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n, i.i.d.

• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d

• (regularized) empirical risk minimization:

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ) = (1/n) ∑_{i=1}^n f_i(θ)

  data fitting term + regularizer

• Optimization: optimization of regularized risk (training cost)

• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
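
To make the finite-sum structure concrete, here is a minimal numpy sketch (not from the slides) of the regularized empirical risk and its gradient for logistic regression with linear predictions h(x, θ) = θ⊤Φ(x); for simplicity the feature map Φ is the identity and the regularizer is Ω(θ) = ‖θ‖²/2.

```python
import numpy as np

def logistic_erm(theta, X, y, lam):
    """Regularized empirical risk: (1/n) sum_i log(1 + exp(-y_i x_i^T theta)) + lam/2 ||theta||^2."""
    margins = y * (X @ theta)                      # y_i * h(x_i, theta)
    losses = np.log1p(np.exp(-margins))            # logistic loss per observation
    return losses.mean() + 0.5 * lam * theta @ theta

def logistic_erm_grad(theta, X, y, lam):
    """Gradient of the objective above."""
    margins = y * (X @ theta)
    # d/dtheta log(1 + exp(-m)) = -y_i x_i / (1 + exp(m))
    coeffs = -y / (1.0 + np.exp(margins))
    return X.T @ coeffs / len(y) + lam * theta

# Tiny synthetic example
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = np.sign(X @ theta_true + 0.1 * rng.normal(size=n))
theta0 = np.zeros(d)
print(logistic_erm(theta0, X, y, lam=0.1))         # equals log(2) at theta = 0
```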

Smoothness and (strong) convexity

• A function g : ℝ^d → ℝ is L-smooth if and only if it is twice differentiable and
  ∀θ ∈ ℝ^d, |eigenvalues[g″(θ)]| ≤ L

  [Figure: smooth vs. non-smooth functions]

• Machine learning
  – with g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Smooth prediction function θ ↦ h(x_i, θ) + smooth loss

• A twice differentiable function g : ℝ^d → ℝ is convex if and only if
  ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ 0

• A twice differentiable function g : ℝ^d → ℝ is μ-strongly convex if and only if
  ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ μ

  [Figure: convex vs. strongly convex functions]

  – Condition number κ = L/μ ≥ 1   [Figure: small κ = L/μ vs. large κ = L/μ]

• Convexity in machine learning
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis

• Strong convexity in machine learning
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2)‖θ‖²
  – creates additional bias unless μ is small, but reduces variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
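
For ℓ₂-regularized least-squares with linear predictions, the Hessian is constant, (1/n) ∑_i Φ(x_i)Φ(x_i)⊤ + μ I, so L, μ and κ = L/μ can be read directly from its extreme eigenvalues. A minimal numpy sketch under these assumptions (identity feature map, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
mu_reg = 1e-3                                 # regularization (mu/2) ||theta||^2

# Hessian of (1/2n) sum_i (y_i - x_i^T theta)^2 + (mu/2) ||theta||^2
H = X.T @ X / n + mu_reg * np.eye(d)
eigs = np.linalg.eigvalsh(H)                  # ascending eigenvalues

L, mu = eigs[-1], eigs[0]                     # smoothness and strong-convexity constants
print(f"L = {L:.3f}, mu = {mu:.5f}, condition number kappa = {L / mu:.1f}")
```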

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})   (line search)

  g(θ_t) − g(θ*) ≤ O(1/t)
  g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex

• Acceleration (Nesterov, 1983): second-order recursion

  θ_t = η_{t−1} − γ_t g′(η_{t−1})   and   η_t = θ_t + δ_t (θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)

  g(θ_t) − g(θ*) ≤ O(1/t²)
  g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex

  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
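
A minimal numpy sketch contrasting plain gradient descent (step size 1/L) with Nesterov's second-order recursion on a strongly convex quadratic; the constant momentum δ = (√κ − 1)/(√κ + 1) is a standard choice for the strongly convex case and is an assumption here, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
A = rng.normal(size=(d, d))
H = A.T @ A / d + 0.01 * np.eye(d)            # Hessian of g(theta) = 1/2 theta^T H theta - b^T theta
b = rng.normal(size=d)
theta_star = np.linalg.solve(H, b)

eigs = np.linalg.eigvalsh(H)
L, mu = eigs[-1], eigs[0]
kappa = L / mu

def g(theta):
    return 0.5 * theta @ H @ theta - b @ theta

def grad(theta):
    return H @ theta - b

# Plain gradient descent: theta_t = theta_{t-1} - (1/L) g'(theta_{t-1})
theta = np.zeros(d)
for t in range(200):
    theta = theta - grad(theta) / L

# Nesterov acceleration: theta_t = eta_{t-1} - (1/L) g'(eta_{t-1}),
#                        eta_t  = theta_t + delta * (theta_t - theta_{t-1})
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
theta_acc, eta = np.zeros(d), np.zeros(d)
for t in range(200):
    theta_prev, theta_acc = theta_acc, eta - grad(eta) / L
    eta = theta_acc + delta * (theta_acc - theta_prev)

print("GD excess cost:          ", g(theta) - g(theta_star))
print("Accelerated excess cost: ", g(theta_acc) - g(theta_star))
```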

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations
    ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ·2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations
    ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)

  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
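
As a rough illustration of the cost trade-off above, here is a sketch of the (undamped) Newton iteration θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1}) on regularized logistic regression; forming the Hessian costs O(nd²) and solving the linear system O(d³) per iteration, but very few iterations are needed. Synthetic data and no line search are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 10, 1e-2
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

def grad_hess(theta):
    m = y * (X @ theta)
    p = 1.0 / (1.0 + np.exp(m))                # sigmoid(-m)
    grad = -X.T @ (y * p) / n + lam * theta
    w = p * (1.0 - p)                          # per-sample curvature of the logistic loss
    hess = X.T @ (X * w[:, None]) / n + lam * np.eye(d)
    return grad, hess

theta = np.zeros(d)
for t in range(10):                            # quadratic convergence: very few iterations
    g_t, H_t = grad_hess(theta)
    theta = theta - np.linalg.solve(H_t, g_t)  # Newton step, O(nd^2 + d^3) per iteration
    print(t, np.linalg.norm(g_t))              # gradient norm before the step
```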

Stochastic gradient descent (SGD) for finite sums

  min_{θ∈ℝ^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex, L-smooth and g is μ-strongly convex:

  E g(θ̄_t) − g(θ*) ≤ O(1/√t)              if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(μt)) = O(κ/t)   if γ_t = 1/(μt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: single pass leads to bounds on testing error
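
A minimal sketch of SGD with sampling with replacement and Polyak-Ruppert averaging on a least-squares problem, with the step size γ_t = 1/(L√t) from the first rate above; the data and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 20
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

L = np.max(np.sum(X**2, axis=1))               # smoothness bound for the individual f_i
theta = np.zeros(d)
theta_avg = np.zeros(d)

for t in range(1, 50 * n + 1):
    i = rng.integers(n)                        # sampling with replacement
    grad_i = (X[i] @ theta - y[i]) * X[i]      # gradient of f_i(theta) = 1/2 (y_i - x_i^T theta)^2
    theta -= grad_i / (L * np.sqrt(t))         # gamma_t = 1 / (L sqrt(t))
    theta_avg += (theta - theta_avg) / (t + 1) # Polyak-Ruppert running average

print("distance to the generating theta (proxy for theta*):",
      np.linalg.norm(theta_avg - theta_true))
```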

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λ Ω(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})
  – Linear (exponential) convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n

• Goal = best of both worlds: linear rate with O(d) iteration cost and a simple choice of step size

  [Figure: log(excess cost) vs. time — deterministic, stochastic, and new methods]
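
The figure's trade-off can be reproduced in a few lines: batch gradient descent makes steady progress at a cost of n gradients per step, while constant-step SGD drops quickly and then stalls at a noise floor. A hedged sketch on noisy least-squares (one batch step vs. n stochastic steps per pass over the data; all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 10
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.5 * rng.normal(size=n)

H = X.T @ X / n
L = np.linalg.eigvalsh(H)[-1]                  # smoothness of the full objective
L_i = np.max(np.sum(X**2, axis=1))             # smoothness of the individual losses
theta_star = np.linalg.solve(H, X.T @ y / n)   # empirical risk minimizer

def cost(theta):
    return 0.5 * np.mean((y - X @ theta) ** 2)

theta_gd, theta_sgd = np.zeros(d), np.zeros(d)
for epoch in range(20):
    # batch GD: one full-gradient step per pass
    theta_gd -= (X.T @ (X @ theta_gd - y) / n) / L
    # SGD: n constant-step stochastic steps per pass
    for t in range(n):
        i = rng.integers(n)
        theta_sgd -= (X[i] @ theta_sgd - y[i]) * X[i] / L_i
    print(epoch, cost(theta_gd) - cost(theta_star), cost(theta_sgd) - cost(theta_star))
```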

Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running-time to reach precision ε (with κ = condition number)

  Stochastic gradient descent : d × κ × 1/ε
  Gradient descent            : d × nκ × log(1/ε)
  Variance reduction          : d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
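
A minimal sketch of a variance-reduced step in the spirit of the update above (SAGA flavor): keep a table of stored per-function gradients y_i and combine the fresh gradient with the table mean, at O(d) cost per step beyond the bookkeeping. Least-squares losses and the step size 1/(3 L_max) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 10
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

def grad_i(theta, i):
    # gradient of f_i(theta) = 1/2 (y_i - x_i^T theta)^2
    return (X[i] @ theta - y[i]) * X[i]

L_max = np.max(np.sum(X**2, axis=1))           # max smoothness over the f_i
gamma = 1.0 / (3.0 * L_max)                    # standard SAGA-style step size

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])   # stored gradients y_i
table_mean = table.mean(axis=0)

for t in range(20 * n):
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    # variance-reduced direction: fresh gradient + mean of stored gradients - stored gradient
    direction = g_new + table_mean - table[i]
    theta -= gamma * direction
    # update the table and its running mean in O(d)
    table_mean += (g_new - table[i]) / n
    table[i] = g_new

print("distance to the empirical minimizer:",
      np.linalg.norm(theta - np.linalg.solve(X.T @ X, X.T @ y)))
```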

Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of model defined by θ on dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Each dataset/function f_i only accessible by node i in a graph

  [Figure: network graph with nodes 1, …, 9]

  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distributing a simple single machine algorithm?
  – (accelerated) gradient descent (see, e.g., Nesterov, 2004):
    θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

  θ* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

  [Figure: network graph and a spanning tree rooted at the master node]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – “Optimal” (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
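
A small sketch of the master/slave idea (a toy illustration, not taken from the references): partial sums are aggregated up a given spanning tree to the root, and the exact average is broadcast back down, so the number of communication rounds is proportional to the tree depth. The tree below, given by parent pointers, is a made-up example.

```python
import numpy as np

# Spanning tree given as parent pointers (node 0 is the root/master)
parents = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
nodes = [0] + list(parents)
values = {i: float(v) for i, v in zip(nodes, np.random.default_rng(5).normal(size=len(nodes)))}

children = {i: [j for j, p in parents.items() if p == i] for i in nodes}

def subtree_sum(i):
    """Aggregation phase: each node sends the sum and size of its subtree to its parent."""
    s, c = values[i], 1
    for j in children[i]:
        sj, cj = subtree_sum(j)
        s, c = s + sj, c + cj
    return s, c

total, count = subtree_sum(0)                  # up the tree: depth(T) communication steps
average = total / count
consensus = {i: average for i in nodes}        # down the tree: broadcast the exact average

print(average, np.isclose(average, np.mean(list(values.values()))))
```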

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute θ* = (1/n) ∑_{i=1}^n ξ_i

• Decentralized algorithms - gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors, ∑_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip
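
A minimal sketch of synchronous gossip on a ring: build a symmetric doubly stochastic W, read off the eigengap γ = 1 − λ₂(W), and iterate θ_t = W θ_{t−1}; all values converge to the exact average at a rate governed by γ. The ring topology and the 1/3 weights are assumptions for illustration.

```python
import numpy as np

n = 20
xi = np.random.default_rng(6).normal(size=n)   # one observation per node

# Gossip matrix of a ring: each node averages itself and its two neighbors
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
# W is symmetric and doubly stochastic

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gap = eigs[0] - eigs[1]                        # eigengap gamma = 1 - lambda_2(W)
print("eigengap:", gap)

theta = xi.copy()
for t in range(500):                           # theta_t = W theta_{t-1} = W^t xi
    theta = W @ theta

print("max deviation from the true average:", np.max(np.abs(theta - xi.mean())))
```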

Decentralized optimization

• Mixing gossip and optimization
  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013);
    Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016);
    Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈ℝ^d} ∑_{i=1}^n f_i(θ)

  = min_{θ(1),…,θ(n)∈ℝ^d} ∑_{i=1}^n f_i(θ(i)) such that ∀ i ∼ j, θ(i) = θ(j)

  = min_{θ(1),…,θ(n)∈ℝ^d} max_{∀ i∼j, λ_ij∈ℝ^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

  = max_{∀ i∼j, λ_ij∈ℝ^d} min_{θ(1),…,θ(n)∈ℝ^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i=1}^n [θ(i)]⊤ linear_i(λ)

  = max_{∀ i∼j, λ_ij∈ℝ^d} ∑_{i=1}^n function_i(λ) = max_{∀ i∼j, λ_ij∈ℝ^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip

Accelerated gossip

• Regular gossip
  – Iterations: θ_t = W^t θ_0

• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – Optimal polynomial is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g.,
    θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−1}
  – Replace γ^{−1} by γ^{−1/2} in rates

[Figure: illustration of accelerated gossip]

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
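
A hedged sketch of accelerated gossip using a simple two-register recursion θ_t = ω W θ_{t−1} + (1 − ω) θ_{t−2}, i.e., one particular linear combination of powers of W applied to ξ (shift-register style). The weight ω derived from λ₂(W) is a heuristic assumption, and this is not claimed to be the exact Chebyshev recursion of the slides; on a poorly connected ring it nevertheless converges far faster than plain gossip.

```python
import numpy as np

n = 50
rng = np.random.default_rng(7)
xi = rng.normal(size=n)

# Ring gossip matrix (symmetric, doubly stochastic), poorly connected: small eigengap
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

lam2 = np.linalg.eigvalsh(W)[-2]                 # second-largest eigenvalue
omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))   # heuristic weight from lambda_2 (assumption)

plain = xi.copy()
acc_prev, acc = xi.copy(), xi.copy()
for t in range(300):
    plain = W @ plain                            # regular gossip: W^t xi
    acc_prev, acc = acc, omega * (W @ acc) + (1.0 - omega) * acc_prev  # accelerated

avg = xi.mean()
print("plain gossip error:      ", np.max(np.abs(plain - avg)))
print("accelerated gossip error:", np.max(np.abs(acc - avg)))
```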

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of model defined by θ on dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Single machine vs. “optimal” decentralized algorithm:

  Algorithm                      gradient steps    communication
  Single machine algorithm       nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)     m√κ               √(κ/γ)

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – “Optimal”, but still not adapted to machine learning:
    – Huge slow down when going from 1 to 2 machines
    – Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) ∑_{j=1}^m ℓ(y_ij, θ⊤Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each:

  Algorithm                      gradient steps    communication
  Single machine algorithm       nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)     m√κ               √(κ/γ)
  ADFS (Hendrikx et al., 2019)   m + √(mκ)         √(κ/γ)

ADFS - Algorithm principle

• Minimizing ∑_{i=1}^n ( ∑_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² )

  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10⁴ observations per node in ℝ^28
  – Two-dimensional grid network
  [Figure: running-time curves for n = 4 and n = 100 nodes]

• Running times on an actual cluster
  – Logistic regression with mn ≈ 10⁵ observations in ℝ^47236
  – Two-dimensional grid network with n = 100 nodes
  [Figure: running-time curves]

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 2: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Scientific context

bull Proliferation of digital data

ndash Personal data

ndash Industry

ndash Scientific from bioinformatics to humanities

bull Need for automated processing of massive data

Scientific context

bull Proliferation of digital data

ndash Personal data

ndash Industry

ndash Scientific from bioinformatics to humanities

bull Need for automated processing of massive data

bull Series of ldquohypesrdquo

Big data rarr Data science rarr Machine Learning

rarr Deep Learning rarr Artificial Intelligence

Recent progress in perception (vision audio text)

person ride dog

From translategooglefr From Peyre et al (2017)

Recent progress in perception (vision audio text)

person ride dog

From translategooglefr From Peyre et al (2017)

(1) Massive data

(2) Computing power

(3) Methodological and scientific progress

Recent progress in perception (vision audio text)

person ride dog

From translategooglefr From Peyre et al (2017)

(1) Massive data

(2) Computing power

(3) Methodological and scientific progress

ldquoIntelligencerdquo = models + algorithms + data

+ computing power

Recent progress in perception (vision audio text)

person ride dog

From translategooglefr From Peyre et al (2017)

(1) Massive data

(2) Computing power

(3) Methodological and scientific progress

ldquoIntelligencerdquo = models + algorithms + data

+ computing power

Outline

1 Parametric supervised learning on a single machine

minus Machine learning asymp optimization of finite sums

minus From batch to stochastic gradient methods

minus Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

ndash Centralized and decentralized methods

ndash From network averaging to optimization

ndash Distributing the fastest single machine algorithms

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

- Linear predictions

- h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

bull Linear predictions

ndash h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))

x y

θ1θ3

θ2

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

2n

nsum

i=1

(

yi minus h(xi θ))2

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(least-squares regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if m_i observations

• Each dataset/function f_i only accessible by node i in a graph

[Figure: network graph with nodes 1–9]

  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs

Distribution in machine learning (and beyond)

• Machine learning through optimization: min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distributing a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
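A centralized version of this idea is easy to sketch: each node computes the gradient of its local function and a coordinator averages them before broadcasting the update. The sketch below simulates the nodes with plain Python lists and uses hypothetical local least-squares losses; it illustrates the communication pattern only, not the accelerated method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, m, d, lam = 5, 200, 10, 1e-2
theta_true = rng.standard_normal(d)
data = []                                     # one local dataset (X_i, y_i) per node
for _ in range(n_nodes):
    Xi = rng.standard_normal((m, d))
    yi = Xi @ theta_true + 0.1 * rng.standard_normal(m)
    data.append((Xi, yi))

def local_grad(theta, Xi, yi):
    # gradient of f_i(theta) = (1/2m) ||X_i theta - y_i||^2 + (lam/2) ||theta||^2
    return Xi.T @ (Xi @ theta - yi) / len(yi) + lam * theta

theta = np.zeros(d)
gamma = 0.2
for t in range(300):
    grads = [local_grad(theta, Xi, yi) for Xi, yi in data]  # computed in parallel in practice
    theta -= gamma * np.mean(grads, axis=0)                 # coordinator averages and broadcasts
```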

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R
  – Compute θ* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

[Figure: network graph with nodes 1–9 and a spanning tree]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
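For concreteness, exact averaging along a spanning tree can be sketched in a few lines: an upward sweep accumulates partial sums towards the master and a downward sweep broadcasts the result, so the number of communication rounds is proportional to the tree depth (at most ∆ in each direction). The 9-node tree and local values below are hypothetical.

```python
# children of each node in a hypothetical spanning tree rooted at the master (node 1)
tree = {1: [2, 3], 2: [4, 5], 3: [6], 6: [7, 8, 9], 4: [], 5: [], 7: [], 8: [], 9: []}
xi = {i: float(i) for i in tree}      # local observation held by each node

def collect(node):
    # upward sweep: each node forwards the (sum, count) of its subtree to its parent
    s, c = xi[node], 1
    for child in tree[node]:
        cs, cc = collect(child)
        s, c = s + cs, c + cc
    return s, c

def broadcast(node, value):
    # downward sweep: the master pushes the exact average back down the tree
    estimate[node] = value
    for child in tree[node]:
        broadcast(child, value)

estimate = {}
total, count = collect(1)
broadcast(1, total / count)
print(estimate)                       # every node now holds the exact average 5.0
```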

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R
  – Compute θ* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

[Figure: network graph with nodes 1–9]

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)

[Figure: network graph with nodes 1–9]

Illustration of synchronous gossip
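In place of the figure, a minimal NumPy sketch of synchronous gossip on a hypothetical 9-node ring (a symmetric doubly stochastic W averaging each node with its two neighbours); the printed eigengap governs the geometric convergence of every node's value to the global average.

```python
import numpy as np

n = 9
xi = np.arange(1.0, n + 1)            # local values; the exact average is 5.0

W = np.zeros((n, n))                  # symmetric doubly stochastic matrix on a ring
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigvals[0] - eigvals[1]       # eigengap 1 - lambda_2(W)

theta = xi.copy()
for t in range(200):                  # theta_t = W theta_{t-1} = W^t xi
    theta = W @ theta
print(gamma, np.abs(theta - xi.mean()).max())
```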

Decentralized optimization

• Mixing gossip and optimization
  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?
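The simplest way to mix gossip and optimization is to alternate one gossip step on the local iterates with one local gradient step, in the spirit of the distributed (sub)gradient methods cited above. The sketch below uses hypothetical local least-squares functions on a ring; with a constant step size this scheme only converges to a neighbourhood of the optimum, one illustration of why plain gossip of iterates is not optimal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, lam = 9, 50, 5, 1e-1
theta_true = rng.standard_normal(d)
data = []
for _ in range(n):
    Xi = rng.standard_normal((m, d))
    yi = Xi @ theta_true + 0.1 * rng.standard_normal(m)
    data.append((Xi, yi))

def local_grad(theta, Xi, yi):
    # gradient of f_i(theta) = (1/2m) ||X_i theta - y_i||^2 + (lam/2) ||theta||^2
    return Xi.T @ (Xi @ theta - yi) / m + lam * theta

W = np.zeros((n, n))                   # ring gossip matrix, as in the previous sketch
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

Theta = np.zeros((n, d))               # row i = local iterate of node i
gamma = 0.05
for t in range(2000):
    G = np.array([local_grad(Theta[i], *data[i]) for i in range(n)])
    Theta = W @ Theta - gamma * G      # one gossip step + one local gradient step
```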

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} ∑_{i=1}^n f_i(θ)
  = min_{θ^(1),...,θ^(n)∈R^d} ∑_{i=1}^n f_i(θ^(i)) such that ∀ i∼j, θ^(i) = θ^(j)
  = min_{θ^(1),...,θ^(n)∈R^d} max_{∀ i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ^(i)) + ∑_{i∼j} λ_ij^⊤ (θ^(i) − θ^(j))
  = max_{∀ i∼j, λ_ij∈R^d} min_{θ^(1),...,θ^(n)∈R^d} ∑_{i=1}^n f_i(θ^(i)) + ∑_{i=1}^n [θ^(i)]^⊤ linear_i(λ)
  = max_{∀ i∼j, λ_ij∈R^d} ∑_{i=1}^n function_i(λ) = max_{∀ i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip
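To make the shorthand above explicit, one standard way to write the i-th dual term (under an edge-orientation convention not spelled out in the slides) is the following, where f_i^* denotes the Fenchel conjugate of f_i:

```latex
% assuming each edge {i,j} is counted once with an arbitrary orientation i -> j
\mathrm{linear}_i(\lambda)
  = \sum_{j:(i,j)\in E} \lambda_{ij} \;-\; \sum_{j:(j,i)\in E} \lambda_{ji},
\qquad
\mathrm{function}_i(\lambda)
  = \min_{\theta^{(i)}\in\mathbb{R}^d}
    \Big\{ f_i(\theta^{(i)}) + [\theta^{(i)}]^\top \mathrm{linear}_i(\lambda) \Big\}
  = -\, f_i^{*}\!\big(-\mathrm{linear}_i(\lambda)\big).
```

Each function_i(λ) only involves the dual variables of the edges adjacent to node i, which is why a gradient step on the dual decomposes into local computations plus one round of communication with the neighbours.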

Accelerated gossip

• Regular gossip
  – Iterations: θ_t = W^t θ_0

• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – Optimal polynomial is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
  – Replace γ^{−1} by γ^{−1/2} in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)

Illustration of accelerated gossip

[Figure: accelerated vs. regular gossip]
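A minimal sketch of this polynomial acceleration (assuming a known bound β on the magnitude of all eigenvalues of W except the leading one): running the three-term Chebyshev recurrence costs one multiplication by W per round, exactly like regular gossip, but the consensus error decays like 1/T_t(1/β) instead of β^t. Dividing z_t by a_t enforces P_t(1) = 1, so the average is preserved while the other eigencomponents are damped.

```python
import numpy as np

n = 9
xi = np.arange(1.0, n + 1)                   # local values; exact average is 5.0

W = np.zeros((n, n))                         # ring gossip matrix, as before
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

eig = np.sort(np.linalg.eigvalsh(W))
beta = max(abs(eig[0]), eig[-2])             # bound on |eigenvalues| except the top one

T = 30
# accelerated gossip: theta_t = P_t(W) xi with P_t(x) = T_t(x / beta) / T_t(1 / beta)
z_prev, z = xi.copy(), W @ xi / beta         # z_t = T_t(W / beta) xi (Chebyshev recurrence)
a_prev, a = 1.0, 1.0 / beta                  # a_t = T_t(1 / beta)
for t in range(2, T + 1):
    z, z_prev = 2.0 * (W @ z) / beta - z_prev, z
    a, a_prev = 2.0 * a / beta - a_prev, a
theta_acc = z / a

plain = xi.copy()                            # regular gossip with the same T rounds
for t in range(T):
    plain = W @ plain

print(np.abs(theta_acc - xi.mean()).max(), np.abs(plain - xi.mean()).max())
```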

Distribution in machine learning (and beyond)

• Machine learning through optimization: min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                      gradient steps   communication
  Single machine algorithm       nm + √(nmκ)      0
  MSDA (Scaman et al., 2017)     m√κ              √(κ/γ)

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slow down when going from 1 to 2 machines
  – Only synchronous

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) ∑_{j=1}^m ℓ(y_ij, θ^⊤Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                      gradient steps   communication
  Single machine algorithm       nm + √(nmκ)      0
  MSDA (Scaman et al., 2017)     m√κ              √(κ/γ)
  ADFS (Hendrikx et al., 2019)   m + √(mκ)        √(κ/γ)

ADFS - Algorithm principle

• Minimizing ∑_{i=1}^n [ ∑_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² ]

  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10^4 observations per node in R^28, on a two-dimensional grid network ([Figure: n = 4 and n = 100])
  – Logistic regression with nm ≈ 10^5 observations in R^47236, on a two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurélien Bellet, Joseph Salmon, and Stephan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.

Dušan Jakovetić, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.


Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 5: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Recent progress in perception (vision audio text)

person ride dog

From translategooglefr From Peyre et al (2017)

(1) Massive data

(2) Computing power

(3) Methodological and scientific progress

Recent progress in perception (vision audio text)

person ride dog

From translategooglefr From Peyre et al (2017)

(1) Massive data

(2) Computing power

(3) Methodological and scientific progress

ldquoIntelligencerdquo = models + algorithms + data

+ computing power

Recent progress in perception (vision audio text)

person ride dog

From translategooglefr From Peyre et al (2017)

(1) Massive data

(2) Computing power

(3) Methodological and scientific progress

ldquoIntelligencerdquo = models + algorithms + data

+ computing power

Outline

1 Parametric supervised learning on a single machine

minus Machine learning asymp optimization of finite sums

minus From batch to stochastic gradient methods

minus Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

ndash Centralized and decentralized methods

ndash From network averaging to optimization

ndash Distributing the fastest single machine algorithms

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

- Linear predictions

- h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

bull Linear predictions

ndash h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))

x y

θ1θ3

θ2

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

2n

nsum

i=1

(

yi minus h(xi θ))2

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(least-squares regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

  θ* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006):
  – Replace θ_i by a weighted average of its neighbors, ∑_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

  [Figure: a 9-node network]

• Synchronous gossip (all nodes simultaneously): (a code sketch follows below)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
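
A minimal sketch of synchronous gossip. The matrix W is an illustrative choice (a "lazy" ring with uniform neighbor weights, which is symmetric and doubly stochastic); at every round, all nodes simultaneously replace their value by the local weighted average of their neighbors.

import numpy as np

def ring_gossip_matrix(n, self_weight=0.5):
    # symmetric doubly stochastic W for a ring network: keep self_weight locally
    # and split the rest equally between the two neighbors
    W = np.eye(n) * self_weight
    for i in range(n):
        W[i, (i - 1) % n] += (1 - self_weight) / 2
        W[i, (i + 1) % n] += (1 - self_weight) / 2
    return W

def synchronous_gossip(xi, W, iters):
    theta = xi.copy()
    for _ in range(iters):
        theta = W @ theta          # one communication round: theta_t = W theta_{t-1}
    return theta

xi = np.arange(10, dtype=float)    # local observations xi_1, ..., xi_n
W = ring_gossip_matrix(len(xi))
print(synchronous_gossip(xi, W, 200))   # every entry approaches the mean 4.5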

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously):
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging); a numerical check follows below
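
A small numerical check of these quantities, under the same illustrative lazy-ring assumption as in the previous sketch: it computes the eigengap γ = 1 − λ_2(W) and the corresponding (1/γ) log(1/ε) iteration estimate, which grows quickly with the ring size.

import numpy as np

def lazy_ring(n):
    # same illustrative W as before: lazy random walk on a ring
    W = np.eye(n) * 0.5
    for i in range(n):
        W[i, (i - 1) % n] += 0.25
        W[i, (i + 1) % n] += 0.25
    return W

def eigengap(W):
    # W symmetric doubly stochastic: real eigenvalues, the largest being 1
    eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
    return eigs[0] - eigs[1]          # gamma = lambda_1(W) - lambda_2(W)

eps = 1e-6
for n in (10, 50, 100):
    gamma = eigengap(lazy_ring(n))
    iters = int(np.ceil(np.log(1.0 / eps) / gamma))
    print(f"n={n:4d}  gamma={gamma:.5f}  iterations ~ {iters}")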

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization (a plain "gossip + local gradient step" sketch follows below)
  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017):
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip not optimal! (need to gossip gradients with increasing precision)

• Is this lower bound achievable?
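
As a minimal sketch of the "mixing gossip and optimization" template (in the spirit of the decentralized (sub)gradient methods cited above, not an exact reproduction of any of them): each node keeps its own iterate, averages it with its neighbors through W, then takes a local gradient step. The quadratic local objectives, the complete-graph W and the decaying step size are illustrative assumptions.

import numpy as np

def decentralized_gradient_descent(W, local_grads, d, iters=500):
    # theta has one row per node; each iteration mixes the iterates through W
    # (communication) and then applies one local gradient step (computation)
    n = W.shape[0]
    theta = np.zeros((n, d))
    for t in range(1, iters + 1):
        theta = W @ theta                                 # gossip on the iterates
        gamma = 1.0 / np.sqrt(t)                          # decaying step size
        theta -= gamma * np.array([g(theta[i]) for i, g in enumerate(local_grads)])
    return theta

# toy usage: f_i(theta) = 0.5 ||theta - xi_i||^2, whose global minimizer is the mean of the xi_i
rng = np.random.default_rng(1)
n, d = 8, 3
xi = rng.standard_normal((n, d))
local_grads = [lambda th, x=xi[i]: th - x for i in range(n)]
W = np.full((n, n), 1.0 / n)                              # complete graph, for simplicity
theta = decentralized_gradient_descent(W, local_grads, d)
print(np.abs(theta.mean(axis=0) - xi.mean(axis=0)).max())  # close to 0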

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈ℝ^d} ∑_{i=1}^n f_i(θ)
  = min_{θ^(1),…,θ^(n)∈ℝ^d} ∑_{i=1}^n f_i(θ^(i)) such that ∀ i∼j, θ^(i) = θ^(j)
  = min_{θ^(1),…,θ^(n)∈ℝ^d} max_{∀i∼j, λ_ij∈ℝ^d} ∑_{i=1}^n f_i(θ^(i)) + ∑_{i∼j} λ_ij^⊤ (θ^(i) − θ^(j))
  = max_{∀i∼j, λ_ij∈ℝ^d} min_{θ^(1),…,θ^(n)∈ℝ^d} ∑_{i=1}^n f_i(θ^(i)) + ∑_{i=1}^n [θ^(i)]^⊤ linear_i(λ)
  = max_{∀i∼j, λ_ij∈ℝ^d} ∑_{i=1}^n function_i(λ) = max_{∀i∼j, λ_ij∈ℝ^d} function(λ)

(a worked quadratic example of this dual follows below)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip
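
A hedged worked example of this dual, specialized (an assumption made here, not on the slides) to quadratic local functions f_i(θ) = ½‖θ − ξ_i‖², to show why ascent on λ only requires communication along edges:

\[
\max_{\lambda}\ \min_{\theta^{(1)},\dots,\theta^{(n)}}\ \sum_{i=1}^n \tfrac12 \|\theta^{(i)} - \xi_i\|^2
+ \sum_{i\sim j} \lambda_{ij}^\top \big(\theta^{(i)} - \theta^{(j)}\big).
\]
Writing $u_i(\lambda) = \sum_{j:\, i\sim j} s_{ij}\,\lambda_{ij}$, with $s_{ij} = +1$ if $i$ is the first endpoint of the edge and $-1$ otherwise (this $u_i$ plays the role of $\mathrm{linear}_i(\lambda)$ above), the inner minimization gives $\theta^{(i)} = \xi_i - u_i(\lambda)$ and the dual function
\[
F(\lambda) = \sum_{i=1}^n \Big( \xi_i^\top u_i(\lambda) - \tfrac12 \|u_i(\lambda)\|^2 \Big),
\qquad
\frac{\partial F}{\partial \lambda_{ij}} = \theta^{(i)} - \theta^{(j)}.
\]
Dual gradient ascent, $\lambda_{ij} \leftarrow \lambda_{ij} + \gamma\,(\theta^{(i)} - \theta^{(j)})$, therefore only exchanges the current local iterates along the edges of the network, i.e., it is a gossip-style communication step.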

Accelerated gossip

• Regular gossip:
  – Iterations: θ_t = W^t θ_0

• Accelerated gossip: (a Chebyshev-recurrence sketch follows below)
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – Optimal polynomial is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−1}
  – Replace γ^{−1} by γ^{−1/2} in the rates

Illustration of accelerated gossip

• ⇒ Optimal complexity for optimization (Scaman et al., 2017)
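
A minimal sketch of polynomial-accelerated gossip using the classical three-term Chebyshev recurrence (one standard way to realize P_t(W), not necessarily the exact online recursion used on the slides). It assumes the non-leading eigenvalues of W lie in [−μ, μ] for a known bound μ < 1, reuses the lazy-ring W from the earlier sketches, and compares the error with plain gossip after the same number of communication rounds.

import numpy as np

def chebyshev_gossip(xi, W, mu, iters):
    # computes theta_t = P_t(W) xi with P_t(x) = T_t(x/mu) / T_t(1/mu), via the
    # Chebyshev recurrence T_{t+1}(y) = 2 y T_t(y) - T_{t-1}(y); since P_t(1) = 1,
    # the exact average is preserved while every other eigen-direction is damped
    theta_prev, theta = xi.copy(), W @ xi        # P_0 = 1 and P_1(x) = x
    a_prev, a = 1.0, 1.0 / mu                    # a_t = T_t(1 / mu)
    for _ in range(1, iters):
        a_next = (2.0 / mu) * a - a_prev
        theta_next = (2.0 * a / (mu * a_next)) * (W @ theta) - (a_prev / a_next) * theta_prev
        theta_prev, theta = theta, theta_next
        a_prev, a = a, a_next
    return theta

n = 20
W = np.eye(n) * 0.5
for i in range(n):
    W[i, (i - 1) % n] += 0.25
    W[i, (i + 1) % n] += 0.25
xi = np.arange(n, dtype=float)
eigs = np.sort(np.linalg.eigvalsh(W))
mu = max(abs(eigs[0]), eigs[-2])                 # bound on the non-leading eigenvalues
plain = np.linalg.matrix_power(W, 50) @ xi
accel = chebyshev_gossip(xi, W, mu, 50)
print(np.abs(plain - xi.mean()).max(), np.abs(accel - xi.mean()).max())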

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ^⊤ Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm:

  Algorithm                      gradient steps     communication
  Single machine algorithm       nm + √(nm κ)       0
  MSDA (Scaman et al., 2017)     m √κ               √(κ/γ)

• MSDA (Scaman et al., 2017):
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning:
  – Huge slow-down when going from 1 to 2 machines
  – Only synchronous

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) ∑_{j=1}^{m} ℓ(y_ij, θ^⊤ Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each (an illustrative step-count comparison follows below):

  Algorithm                      gradient steps     communication
  Single machine algorithm       nm + √(nm κ)       0
  MSDA (Scaman et al., 2017)     m √κ               √(κ/γ)
  ADFS (Hendrikx et al., 2019)   m + √(m κ)         √(κ/γ)
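
To make the table above concrete, here is an illustrative back-of-the-envelope comparison of the gradient-step counts only (constants and log(1/ε) factors are ignored, and the values of n, m and κ are made up); it is only meant to show in which conditioning regimes the ADFS-style rate keeps a near-linear speed-up over a single machine while the MSDA-style rate does not.

import math

def gradient_steps(n, m, kappa):
    single = n * m + math.sqrt(n * m * kappa)      # single machine: nm + sqrt(nm kappa)
    msda   = m * math.sqrt(kappa)                  # MSDA: m sqrt(kappa)
    adfs   = m + math.sqrt(m * kappa)              # ADFS: m + sqrt(m kappa)
    return single, msda, adfs

n, m = 100, 10_000                                 # 100 nodes, 10^4 observations per node
for kappa in (10.0, math.sqrt(n * m), float(n * m)):   # from well- to badly-conditioned
    single, msda, adfs = gradient_steps(n, m, kappa)
    print(f"kappa={kappa:10.0f}  single={single:12.0f}  msda={msda:12.0f}  adfs={adfs:12.0f}")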

ADFS - Algorithm principle

• Minimizing ∑_{i=1}^n ( ∑_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² )

  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster:
  – Logistic regression with m = 10^4 observations per node in ℝ^28
  – Two-dimensional grid network

  [Figures: running-time curves for n = 4 and n = 100 nodes]

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster:
  – Logistic regression with nm ≈ 10^5 observations in ℝ^47236
  – Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nm κ) (single machine) to m + √(m κ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.


Page 7: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Recent progress in perception (vision audio text)

person ride dog

From translategooglefr From Peyre et al (2017)

(1) Massive data

(2) Computing power

(3) Methodological and scientific progress

ldquoIntelligencerdquo = models + algorithms + data

+ computing power

Outline

1 Parametric supervised learning on a single machine

minus Machine learning asymp optimization of finite sums

minus From batch to stochastic gradient methods

minus Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

ndash Centralized and decentralized methods

ndash From network averaging to optimization

ndash Distributing the fastest single machine algorithms

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

- Linear predictions

- h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

bull Linear predictions

ndash h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))

x y

θ1θ3

θ2

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

2n

nsum

i=1

(

yi minus h(xi θ))2

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(least-squares regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

• A twice differentiable function g : R^d → R is µ-strongly convex if and only if

  ∀θ ∈ R^d, eigenvalues[g''(θ)] ≥ µ

• Strong convexity in machine learning

  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))

  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

• A twice differentiable function g : R^d → R is µ-strongly convex if and only if

  ∀θ ∈ R^d, eigenvalues[g''(θ)] ≥ µ

• Strong convexity in machine learning

  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))

  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

  – Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d

  – Even when µ > 0, µ may be arbitrarily small

Smoothness and (strong) convexity

• A twice differentiable function g : R^d → R is µ-strongly convex if and only if

  ∀θ ∈ R^d, eigenvalues[g''(θ)] ≥ µ

• Strong convexity in machine learning

  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))

  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

  – Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d

  – Even when µ > 0, µ may be arbitrarily small

• Adding regularization by (µ/2)‖θ‖²

  – creates additional bias unless µ is small, but reduces variance

  – Typically L/√n ≥ µ ≥ L/n ⇒ κ ∈ [√n, n]
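To make these constants concrete, here is a minimal NumPy sketch (an illustration with synthetic data, not taken from the slides) that computes L, µ and κ = L/µ for ℓ2-regularized least-squares, where the Hessian is the empirical covariance of the features plus µ times the identity; the feature matrix Phi and the regularization level mu are assumptions chosen for the example.

```python
import numpy as np

# Sketch: g(theta) = 1/(2n) ||y - Phi theta||^2 + mu/2 ||theta||^2
# Hessian g''(theta) = (1/n) Phi^T Phi + mu I, so L and the strong-convexity
# constant are the extreme eigenvalues of that matrix.
rng = np.random.default_rng(0)
n, d = 1000, 20
Phi = rng.standard_normal((n, d))        # rows are the features Phi(x_i)^T
mu = 1.0 / np.sqrt(n)                    # regularization in the typical range L/sqrt(n) >= mu >= L/n

cov = Phi.T @ Phi / n                    # empirical covariance (1/n) sum Phi(x_i) Phi(x_i)^T
eigs = np.linalg.eigvalsh(cov)           # ascending eigenvalues
L = eigs[-1] + mu                        # largest eigenvalue of the Hessian (smoothness)
mu_total = eigs[0] + mu                  # smallest eigenvalue (strong convexity)
kappa = L / mu_total                     # condition number
print(f"L = {L:.3f}, mu = {mu_total:.3f}, kappa = {kappa:.1f}")
```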

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1}) (line search)

  g(θ_t) − g(θ_*) ≤ O(1/t)

  g(θ_t) − g(θ_*) ≤ O(e^{−t(µ/L)}) = O(e^{−t/κ})

  (figures: small κ = L/µ vs. large κ = L/µ)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1}) (line search)

  g(θ_t) − g(θ_*) ≤ O(1/t)

  g(θ_t) − g(θ_*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex

  (figures: small κ = L/µ vs. large κ = L/µ)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1})

  g(θ_t) − g(θ_*) ≤ O(1/t)

  g(θ_t) − g(θ_*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex

• Acceleration (Nesterov, 1983): second-order recursion

  θ_t = η_{t−1} − γ_t g'(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1})

  g(θ_t) − g(θ_*) ≤ O(1/t)

  g(θ_t) − g(θ_*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex

• Acceleration (Nesterov, 1983): second-order recursion

  θ_t = η_{t−1} − γ_t g'(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)

  g(θ_t) − g(θ_*) ≤ O(1/t²)

  g(θ_t) − g(θ_*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex

  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
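As a quick illustration of the two recursions above, here is a minimal NumPy sketch (a synthetic strongly convex quadratic, an assumption for the example) comparing plain gradient descent with step size 1/L and Nesterov's accelerated variant with the standard strongly convex momentum δ = (√κ − 1)/(√κ + 1).

```python
import numpy as np

# Illustrative objective: g(theta) = 0.5 theta^T A theta, minimized at 0
rng = np.random.default_rng(0)
Q = rng.standard_normal((20, 20))
A = Q.T @ Q + 0.1 * np.eye(20)
eigs = np.linalg.eigvalsh(A)
L, mu = eigs[-1], eigs[0]
kappa = L / mu
grad = lambda theta: A @ theta                        # g'(theta)

theta_gd = rng.standard_normal(20)
theta_acc = theta_gd.copy()
eta = theta_gd.copy()
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum term delta_t

for t in range(200):
    # Gradient descent: theta_t = theta_{t-1} - (1/L) g'(theta_{t-1})
    theta_gd = theta_gd - grad(theta_gd) / L
    # Nesterov: gradient step at eta, then momentum extrapolation
    theta_new = eta - grad(eta) / L
    eta = theta_new + delta * (theta_new - theta_acc)
    theta_acc = theta_new

print(0.5 * theta_gd @ A @ theta_gd, 0.5 * theta_acc @ A @ theta_acc)
```

With κ in the hundreds, the accelerated iterate typically reaches a much smaller objective value after the same number of iterations, matching the e^{−t/κ} vs. e^{−t/√κ} rates.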

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1})

  – O(1/t) convergence rate for convex functions

  – O(e^{−t/κ}) linear rate if strongly convex

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1})

  – O(1/t) convergence rate for convex functions

  – O(e^{−t/κ}) linear rate if strongly convex

• Newton method: θ_t = θ_{t−1} − g''(θ_{t−1})^{−1} g'(θ_{t−1})

  – O(e^{−ρ·2^t}) quadratic rate

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1})

  – O(1/t) convergence rate for convex functions

  – O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations

• Newton method: θ_t = θ_{t−1} − g''(θ_{t−1})^{−1} g'(θ_{t−1})

  – O(e^{−ρ·2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1})

  – O(1/t) convergence rate for convex functions

  – O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g''(θ_{t−1})^{−1} g'(θ_{t−1})

  – O(e^{−ρ·2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1})

  – O(1/t) convergence rate for convex functions

  – O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g''(θ_{t−1})^{−1} g'(θ_{t−1})

  – O(e^{−ρ·2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)

  1. No need to optimize below statistical error

  2. Cost functions are averages

  3. Testing error is more important than training error

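A minimal sketch of the Newton step above, on ℓ2-regularized logistic regression with linear predictions (the synthetic data and the regularization level lam are assumptions for the example); a handful of iterations already drives the gradient to machine precision, matching the log log(1/ε) iteration count.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 500, 10, 1e-2
X = rng.standard_normal((n, d))                       # rows are Phi(x_i)^T
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def grad_hess(theta):
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))         # sigma(-y_i x_i^T theta)
    g = -(X.T @ (y * s)) / n + lam * theta            # gradient of the regularized risk
    H = (X.T * (s * (1.0 - s))) @ X / n + lam * np.eye(d)   # Hessian
    return g, H

theta = np.zeros(d)
for t in range(10):                                   # Newton: theta <- theta - H^{-1} g
    g, H = grad_hess(theta)
    theta -= np.linalg.solve(H, g)

print(np.linalg.norm(grad_hess(theta)[0]))            # gradient norm at the final iterate
```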

Stochastic gradient descent (SGD) for finite sums

  min_{θ∈R^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f'_{i(t)}(θ_{t−1})

  – Sampling with replacement: i(t) random element of {1, . . . , n}

  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

Stochastic gradient descent (SGD) for finite sums

  min_{θ∈R^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f'_{i(t)}(θ_{t−1})

  – Sampling with replacement: i(t) random element of {1, . . . , n}

  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex L-smooth and g µ-strongly convex:

  E g(θ_t) − g(θ_*) ≤ O(1/√t) if γ_t = 1/(L√t)

  E g(θ_t) − g(θ_*) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)

  – No adaptivity to strong convexity in general

  – Running-time complexity: O(d · κ/ε)

Stochastic gradient descent (SGD) for finite sums

  min_{θ∈R^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f'_{i(t)}(θ_{t−1})

  – Sampling with replacement: i(t) random element of {1, . . . , n}

  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex L-smooth and g µ-strongly convex:

  E g(θ_t) − g(θ_*) ≤ O(1/√t) if γ_t = 1/(L√t)

  E g(θ_t) − g(θ_*) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)

  – No adaptivity to strong convexity in general

  – Running-time complexity: O(d · κ/ε)

• NB: single pass leads to bounds on testing error
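The following minimal sketch (synthetic least-squares losses, an assumption for the example) implements the SGD iteration above with sampling with replacement, the γ_t = 1/(L√t) step size and a running Polyak-Ruppert average.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 20
Phi = rng.standard_normal((n, d))                      # rows are Phi(x_i)^T
y = Phi @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
L = np.max(np.sum(Phi ** 2, axis=1))                   # max_i ||Phi(x_i)||^2: smoothness of each f_i

theta = np.zeros(d)
theta_avg = np.zeros(d)                                # running average, initialized at theta_0
for t in range(1, 20 * n + 1):
    i = rng.integers(n)                                # i(t): random element of {1, ..., n}
    grad_i = -(y[i] - Phi[i] @ theta) * Phi[i]         # f_i'(theta) for f_i = 0.5 (y_i - Phi(x_i)^T theta)^2
    theta -= grad_i / (L * np.sqrt(t))                 # step size gamma_t = 1 / (L sqrt(t))
    theta_avg += (theta - theta_avg) / (t + 1)         # Polyak-Ruppert running average
print(0.5 * np.mean((y - Phi @ theta_avg) ** 2))       # training cost at the averaged iterate
```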

Stochastic vs deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

Stochastic vs deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})

  – Exponential convergence rate in O(e^{−t/κ}) for convex problems

  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)

  – Iteration complexity is linear in n

Stochastic vs deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})

Stochastic vs deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})

  – Exponential convergence rate in O(e^{−t/κ}) for convex problems

  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)

  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})

  – Sampling with replacement: i(t) random element of {1, . . . , n}

  – Convergence rate in O(κ/t)

  – Iteration complexity is independent of n

Stochastic vs deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})

Stochastic vs deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost

  Simple choice of step size

  (figure: log(excess cost) vs. time for deterministic and stochastic methods)

Stochastic vs deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost

  Simple choice of step size

  (figure: log(excess cost) vs. time for deterministic, stochastic and new methods)

Recent progress in single machine optimization

• Variance reduction

  – Exponential convergence with O(d) iteration cost

  – SAG (Le Roux, Schmidt, and Bach, 2012)

  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)

  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

Recent progress in single machine optimization

• Variance reduction

  – Exponential convergence with O(d) iteration cost

  – SAG (Le Roux, Schmidt, and Bach, 2012)

  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)

  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t: stored value at time t of the gradient of the i-th function)

Recent progress in single machine optimization

• Variance reduction

  – Exponential convergence with O(d) iteration cost

  – SAG (Le Roux, Schmidt, and Bach, 2012)

  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)

  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

• Running-time to reach precision ε (with κ = condition number)

  Stochastic gradient descent:   d × κ        × 1/ε

  Gradient descent:              d × nκ       × log(1/ε)

  Variance reduction:            d × (n + κ)  × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)

  – Matching upper and lower bounds of complexity
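Here is a minimal SAGA-style sketch of the variance-reduced update above, keeping one stored gradient per function (same kind of synthetic least-squares finite sum as in the earlier sketches; the step size 1/(3L) is a standard safe choice, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 20
Phi = rng.standard_normal((n, d))
y = Phi @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(theta, i):
    return -(y[i] - Phi[i] @ theta) * Phi[i]          # gradient of f_i

L = np.max(np.sum(Phi ** 2, axis=1))                  # max_i smoothness of f_i
gamma = 1.0 / (3.0 * L)                               # standard safe SAGA step size
theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])   # stored gradients y_i^t
table_mean = table.mean(axis=0)

for t in range(20 * n):
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    # theta_t = theta_{t-1} - gamma [ grad f_i(theta_{t-1}) - y_i^{t-1} + (1/n) sum_j y_j^{t-1} ]
    theta -= gamma * (g_new - table[i] + table_mean)
    table_mean += (g_new - table[i]) / n              # keep the running mean of the table
    table[i] = g_new                                  # store the new gradient
print(0.5 * np.mean((y - Phi @ theta) ** 2))
```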

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of model defined by θ on dataset indexed by i

  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_{ij}, θ⊤Φ(x_{ij})) if m_i observations

• Each dataset/function f_i only accessible by node i in a graph

  (figure: network of 9 nodes)


Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of model defined by θ on dataset indexed by i

  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_{ij}, θ⊤Φ(x_{ij})) if m_i observations

• Each dataset/function f_i only accessible by node i in a graph

  – Massive datasets, multiple machines/cores

  – Communication/legal constraints

• Goal: Minimize communication and local computation costs

Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of model defined by θ on dataset indexed by i

• Why not simply distributing a simple single machine algorithm?

  – (accelerated) gradient descent (see, e.g., Nesterov, 2004)

    θ_t = θ_{t−1} − γ ∇g(θ_{t−1})

  – Requires √κ log(1/ε) full gradient computations to reach precision ε

  – Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, . . . , ξ_n ∈ R

  – Compute θ_* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, . . . , ξ_n ∈ R

  – Compute θ_* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

  – Compute a spanning tree with diameter ≤ 2∆

  – Master/slave algorithm

  (figure: 9-node network and a spanning tree)

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, . . . , ξ_n ∈ R

  – Compute θ_* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

  – Compute a spanning tree with diameter ≤ 2∆

  – Master/slave algorithm: ∆ communication steps + no error

  (figure: 9-node network and a spanning tree)


Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, . . . , ξ_n ∈ R

  – Compute θ_* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

  – Compute a spanning tree with diameter ≤ 2∆

  – Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization

  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps

  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, . . . , ξ_n ∈ R

  – Compute θ_* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

  – Compute a spanning tree with diameter ≤ 2∆

  – Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization

  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps

  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness?
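A minimal sketch of the centralized scheme above: build a BFS spanning tree, aggregate partial sums toward the root, then broadcast the average back down, so the number of communication rounds is proportional to the tree depth. The 9-node adjacency list is an illustrative assumption (the slides only show a small network figure).

```python
from collections import deque

# Illustrative undirected 9-node graph and one value xi_i per node
edges = {1: [3, 4], 3: [1, 9], 2: [4, 5], 4: [1, 2, 7], 5: [2, 6],
         7: [4], 6: [5, 9], 9: [3, 6, 8], 8: [9]}
xi = {i: float(i) for i in edges}

root = 1
parent, order = {root: None}, []
queue = deque([root])
while queue:                               # BFS spanning tree rooted at `root`
    u = queue.popleft()
    order.append(u)
    for v in edges[u]:
        if v not in parent:
            parent[v] = u
            queue.append(v)

partial = dict(xi)                         # upward pass: sum values toward the root
for u in reversed(order):                  # children are processed before their parent
    if parent[u] is not None:
        partial[parent[u]] += partial[u]

average = partial[root] / len(edges)       # downward pass would broadcast this value
print(average, sum(xi.values()) / len(xi))
```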

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, . . . , ξ_n ∈ R

  – Compute θ_* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006)

  – Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_{ij} θ_j

  – Potential asynchrony, changing network

  (figure: network of 9 nodes)


Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, . . . , ξ_n ∈ R

  – Compute θ_* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006)

  – Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_{ij} θ_j

  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)

  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ

  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ

  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ

  – Typical assumption: W symmetric doubly stochastic matrix

  – Consequence: Eigenvalues(W) ∈ [−1, 1]

  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)

  – γ^{−1} = mixing time of the associated Markov chain

  (figure: network of 9 nodes)

  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
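A minimal simulation of synchronous gossip θ_t = Wθ_{t−1} (a ring graph with uniform weights is an illustrative assumption): it computes the eigengap γ = 1 − λ_2(W) and checks that every coordinate of θ_t approaches the average of ξ at a rate governed by 1/γ.

```python
import numpy as np

n = 20
W = np.zeros((n, n))
for i in range(n):                       # each node averages with its two ring neighbours
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25             # symmetric, doubly stochastic

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigs[0] - eigs[1]                # eigengap 1 - lambda_2(W)

rng = np.random.default_rng(4)
xi = rng.standard_normal(n)
theta = xi.copy()
for t in range(200):
    theta = W @ theta                    # one synchronous gossip round
print(gamma, np.max(np.abs(theta - xi.mean())))
```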

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization

  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

Decentralized optimization

• Mixing gossip and optimization

• Lower bound on complexity (Scaman et al., 2017)

  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

  – Plain gossip not optimal (need to gossip gradients with increasing precision)

Decentralized optimization

• Mixing gossip and optimization

• Lower bound on complexity (Scaman et al., 2017)

  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

  – Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀i ∼ j, θ(i) = θ(j)

  min_{θ(1),...,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀i ∼ j, θ(i) = θ(j)

  = min_{θ(1),...,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀i ∼ j, θ(i) = θ(j)

  = min_{θ(1),...,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

  = max_{∀i∼j, λ_ij∈R^d} min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i=1}^n [θ(i)]⊤ linear_i(λ)

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀i ∼ j, θ(i) = θ(j)

  = min_{θ(1),...,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

  = max_{∀i∼j, λ_ij∈R^d} min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i=1}^n [θ(i)]⊤ linear_i(λ)

  = max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n function_i(λ) = max_{∀i∼j, λ_ij∈R^d} function(λ)

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀i ∼ j, θ(i) = θ(j)

  = min_{θ(1),...,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

  = max_{∀i∼j, λ_ij∈R^d} min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i=1}^n [θ(i)]⊤ linear_i(λ)

  = max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n function_i(λ) = max_{∀i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)

  ⇔ alternating local gradient computations and a gossip step

  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

  – Not optimal ⇒ need accelerated gossip

Accelerated gossip

• Regular gossip

  – Iterations: θ_t = W^t θ_0

• Accelerated gossip

  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)

  – Shift-register gossip (Cao et al., 2006)

  – Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ

  – Optimal polynomial is the Chebyshev polynomial

  – Can be computed online with same cost as regular gossip, e.g.,

    θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}

  – Replace γ^{−1} by γ^{−1/2} in rates


Illustration of accelerated gossip

Accelerated gossip

• Regular gossip

  – Iterations: θ_t = W^t θ_0

• Accelerated gossip

  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)

  – Shift-register gossip (Cao et al., 2006)

  – Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ

  – Optimal polynomial is the Chebyshev polynomial

  – Can be computed online with same cost as regular gossip, e.g.,

    θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}

  – Replace γ^{−1} by γ^{−1/2} in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
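A minimal sketch of Chebyshev-accelerated gossip in the form θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}, where the ω_t come from the three-term Chebyshev recurrence for a_t = T_t(1/λ_2(W)) (same illustrative ring-graph W as in the plain gossip sketch; this particular weighting is an assumption, one standard way of implementing P_t(W)ξ, not the exact scheme of the papers cited above).

```python
import numpy as np

n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25
lam2 = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]   # second largest |eigenvalue| of W

rng = np.random.default_rng(5)
xi = rng.standard_normal(n)
theta_prev, theta = xi.copy(), W @ xi               # P_0(W) xi and P_1(W) xi
a_prev, a = 1.0, 1.0 / lam2                         # a_t = T_t(1 / lambda_2)
for t in range(60):
    a_next = 2.0 * a / lam2 - a_prev                # Chebyshev three-term recurrence
    omega = 2.0 * a / (lam2 * a_next)               # omega_t such that 1 - omega_t = -a_{t-1}/a_{t+1}
    theta, theta_prev = omega * (W @ theta) + (1.0 - omega) * theta_prev, theta
    a_prev, a = a, a_next
print(np.max(np.abs(theta - xi.mean())))            # distance to consensus after 60 rounds
```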

Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of model defined by θ on dataset indexed by i

  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_{ij}, θ⊤Φ(x_{ij})) if m_i observations

Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of model defined by θ on dataset indexed by i

  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_{ij}, θ⊤Φ(x_{ij})) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                      gradient steps    communication
  Single machine algorithm       nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)     m√κ               √(κ/γ)

Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of model defined by θ on dataset indexed by i

  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_{ij}, θ⊤Φ(x_{ij})) if m_i observations

• MSDA (Scaman et al., 2017)

  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

  – "Optimal", but still not adapted to machine learning

  – Huge slow down when going from 1 to 2 machines

  – Only synchronous

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps

  – Adapted to functions of the type f_i(θ) = (1/m) ∑_{j=1}^m ℓ(y_{ij}, θ⊤Φ(x_{ij}))

  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                      gradient steps    communication
  Single machine algorithm       nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)     m√κ               √(κ/γ)
  ADFS (Hendrikx et al., 2019)   m + √(mκ)         √(κ/γ)

ADFS - Algorithm principle

• Minimizing ∑_{i=1}^n [ ∑_{j=1}^m f_{ij}(θ) + (σ_i/2) ‖θ‖² ]

  – Create an equivalent graph

  – Dual randomized coordinate ascent (with non-uniform sampling)

  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

  – Logistic regression with m = 10^4 observations per node in R^28

  – Two-dimensional grid network

  (figures: n = 4 and n = 100)

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

  – Logistic regression with mn ≈ 10^5 observations in R^47236

  – Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

  – Distributing the fastest single machine algorithms

  – n machines and m observations per machine

  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps

  – Linear speed-ups for well-conditioned problems

Conclusions

• Distributed decentralized machine learning

  – Distributing the fastest single machine algorithms

  – n machines and m observations per machine

  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps

  – Linear speed-ups for well-conditioned problems

• Extensions

  – Beyond convex problems

  – Matching running time complexity lower bounds

  – Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 1901.09865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1/k²)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 8: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Outline

1 Parametric supervised learning on a single machine

minus Machine learning asymp optimization of finite sums

minus From batch to stochastic gradient methods

minus Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

ndash Centralized and decentralized methods

ndash From network averaging to optimization

ndash Distributing the fastest single machine algorithms

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

- Linear predictions

- h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

bull Linear predictions

ndash h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))

x y

θ1θ3

θ2

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

2n

nsum

i=1

(

yi minus h(xi θ))2

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(least-squares regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ ∈ R^d} (1/n) Σ_{i=1}^{n} f_i(θ) = g(θ)

  – f_i(θ): error of model defined by θ on dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤ Φ(x_ij)) if m_i observations

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slow-down when going from 1 to 2 machines
  – Only synchronous
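A quick back-of-the-envelope illustration of that slow-down, plugging illustrative values into the gradient-step counts of the table above:

```python
from math import sqrt

# Gradient-step counts from the table above, ignoring log(1/eps) factors and constants.
def single_machine(n, m, kappa):                   # nm + sqrt(nm * kappa)
    return n * m + sqrt(n * m * kappa)

def msda(m, kappa):                                # m * sqrt(kappa) local gradients per node
    return m * sqrt(kappa)

N, kappa = 10**6, 10**3                            # illustrative: N observations, condition number kappa
print(single_machine(1, N, kappa))                 # ~1.03e6: one machine holding all the data
print(msda(N // 2, kappa))                         # ~1.6e7: two machines with m = N/2 each
# With these values, each of the two MSDA machines computes ~15x more local gradients
# than the single-machine algorithm: naive distribution can slow things down.
```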

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^{m} ℓ(y_ij, θ^⊤ Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                       gradient steps    communication
  Single machine algorithm        nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)      m√κ               √(κ/γ)
  ADFS (Hendrikx et al., 2019)    m + √(mκ)         √(κ/γ)

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^{n} [ Σ_{j=1}^{m} f_ij(θ) + (σ_i/2) ‖θ‖² ]
  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps
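A rough sketch of the first step, the "equivalent graph" (my simplified illustration with hypothetical names; the precise construction and the non-uniform sampling probabilities are given in Hendrikx et al., 2019):

```python
# Rough sketch of the "equivalent graph": attach to each compute node i one virtual node
# per local function f_ij, so that data steps and gossip steps are both edge updates
# of a single augmented graph (names and layout are illustrative).

def augmented_graph(n, m, comm_edges):
    virtual = lambda i, j: n + i * m + j          # index of the j-th virtual node of node i
    data_edges = [(i, virtual(i, j)) for i in range(n) for j in range(m)]
    return comm_edges + data_edges

# Illustrative usage: a 2 x 2 grid of compute nodes, m = 3 observations each.
comm_edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
edges = augmented_graph(n=4, m=3, comm_edges=comm_edges)
print(len(edges), "edges")                        # 4 communication edges + 12 data edges
# ADFS then runs dual randomized coordinate ascent, sampling one edge at a time
# (non-uniformly): a communication edge gives a gossip-like exchange,
# a data edge gives a local computation on one observation.
```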

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10^4 observations per node in R^28
  – Two-dimensional grid network
  – [Figure: running-time curves for n = 4 and n = 100 nodes]

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster
  – Logistic regression with mn ≈ 10^5 observations in R^47236
  – Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds
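To see why the last bullet amounts to a linear speed-up, compare the two gradient-step counts above; a back-of-the-envelope computation, under the assumption κ = O(m):

```latex
% Gradient-step counts (per log(1/epsilon) accuracy), single machine vs. distributed:
\frac{nm + \sqrt{nm\kappa}}{\,m + \sqrt{m\kappa}\,} \;\approx\; \frac{nm}{m} \;=\; n
\qquad \text{when } \kappa = O(m),
% i.e., for well-conditioned problems the n machines cut the number of gradient steps
% by a factor of order n (communication, of order \sqrt{\kappa/\gamma}, comes on top).
```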

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 10: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

- Linear predictions

- h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull Advertising n gt 109

ndash Φ(x) isin 0 1d d gt 109

ndash Navigation history + ad

bull Linear predictions

ndash h(x θ) = θ⊤Φ(x)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))

x y

θ1θ3

θ2

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

2n

nsum

i=1

(

yi minus h(xi θ))2

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(least-squares regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^{n} y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t the stored value at time t of the gradient of the i-th function)
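As an illustration of the variance-reduced update above, here is a minimal SAGA-style sketch for least-squares; the storage layout, the constant step size, and the reuse of the synthetic X, y, L from the SGD sketch are assumptions for the example.

```python
import numpy as np

def saga_least_squares(X, y, step, n_epochs, rng=np.random.default_rng(0)):
    """SAGA-style update: the stored gradients play the role of y_i^{t-1}."""
    n, d = X.shape
    theta = np.zeros(d)
    stored = (X @ theta - y)[:, None] * X        # y_i^0: initial gradient of each f_i
    stored_mean = stored.mean(axis=0)            # (1/n) sum_i y_i
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        grad_i = (X[i] @ theta - y[i]) * X[i]    # fresh gradient of f_{i(t)}
        # theta_t = theta_{t-1} - gamma [ grad_i + mean(stored) - stored_i ]
        theta -= step * (grad_i + stored_mean - stored[i])
        stored_mean += (grad_i - stored[i]) / n  # keep the running mean in sync
        stored[i] = grad_i
    return theta

# Usage on the synthetic data above (step size 1/(3L) is a common heuristic choice)
theta_vr = saga_least_squares(X, y, step=1.0 / (3 * L), n_epochs=20)
```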

Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

• Running-time to reach precision ε (with κ = condition number)

  Stochastic gradient descent:  d × κ × (1/ε)
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity

Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) ∑_{i=1}^{n} f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if m_i observations

• Each dataset/function f_i only accessible by node i in a graph

  [Figure: network of 9 computing nodes]

  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) ∑_{i=1}^{n} f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R
  – Compute θ* = (1/n) ∑_{i=1}^{n} ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^{n} (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

  [Figure: the 9-node network and a spanning tree extracted from it]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness
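A minimal sketch of the master/slave pattern on a spanning tree: partial sums travel up to the root and the average is broadcast back down, one communication round per tree level (so about ∆ steps each way). The parent-pointer tree below is an assumed example topology.

```python
import numpy as np
from collections import deque

def tree_average(parents, values):
    """Sum values up a spanning tree to the root, then the root holds the exact average."""
    n = len(values)
    children = [[] for _ in range(n)]
    root = 0
    for v, p in enumerate(parents):
        if p < 0:
            root = v
        else:
            children[p].append(v)
    order, queue = [], deque([root])          # BFS order from the root
    while queue:
        u = queue.popleft()
        order.append(u)
        queue.extend(children[u])
    subtree_sum = np.array(values, dtype=float)
    for u in reversed(order):                 # up-sweep: children send partial sums to parents
        for c in children[u]:
            subtree_sum[u] += subtree_sum[c]
    return subtree_sum[root] / n              # broadcast this value back down: no error

parents = [-1, 0, 0, 1, 1, 2, 2, 3, 3]        # assumed 9-node spanning tree
values = np.arange(9.0)
print(tree_average(parents, values), np.mean(values))
```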

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R
  – Compute θ* = (1/n) ∑_{i=1}^{n} ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^{n} (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^{n} W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
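A minimal simulation of the synchronous gossip iteration; the ring topology and uniform 1/3 weights are assumptions chosen for the example (any symmetric doubly stochastic W supported on the graph would do).

```python
import numpy as np

n = 9
I = np.eye(n)
W = (I + np.roll(I, 1, axis=0) + np.roll(I, -1, axis=0)) / 3   # ring: symmetric, doubly stochastic
rng = np.random.default_rng(0)
xi = rng.standard_normal(n)          # local values xi_1, ..., xi_n
theta = xi.copy()
for t in range(50):
    theta = W @ theta                # synchronous gossip: theta_t = W theta_{t-1} = W^t xi
print(np.max(np.abs(theta - xi.mean())))   # all nodes converge to the average
```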

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
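To connect the eigengap to the iteration count, here is a short numerical check on the ring matrix from the previous sketch; the ring choice and target precision are again assumptions.

```python
import numpy as np

n = 9
I = np.eye(n)
W = (I + np.roll(I, 1, axis=0) + np.roll(I, -1, axis=0)) / 3   # ring gossip matrix
lam = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = lam[0] - lam[1]                          # eigengap = 1 - lambda_2(W)
eps = 1e-6
print(gamma, np.ceil(np.log(1 / eps) / gamma))   # roughly (1/gamma) log(1/eps) iterations
```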

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization
  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} ∑_{i=1}^{n} f_i(θ)
  = min_{θ^(1),...,θ^(n) ∈ R^d} ∑_{i=1}^{n} f_i(θ^(i)) such that ∀ i ∼ j, θ^(i) = θ^(j)
  = min_{θ^(1),...,θ^(n) ∈ R^d} max_{∀ i∼j, λ_ij ∈ R^d} ∑_{i=1}^{n} f_i(θ^(i)) + ∑_{i∼j} λ_ij^⊤ (θ^(i) − θ^(j))
  = max_{∀ i∼j, λ_ij ∈ R^d} min_{θ^(1),...,θ^(n) ∈ R^d} ∑_{i=1}^{n} f_i(θ^(i)) + ∑_{i=1}^{n} [θ^(i)]^⊤ linear_i(λ)
  = max_{∀ i∼j, λ_ij ∈ R^d} ∑_{i=1}^{n} function_i(λ) = max_{∀ i∼j, λ_ij ∈ R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip
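To spell out the grouped terms in the derivation above, one can exchange the sums over edges and nodes; the edge-orientation sign convention ε_ij below is an assumption made here for concreteness, not part of the slides.

```latex
\sum_{i \sim j} \lambda_{ij}^\top \bigl(\theta^{(i)} - \theta^{(j)}\bigr)
   = \sum_{i=1}^{n} [\theta^{(i)}]^\top
     \underbrace{\sum_{j \,:\, j \sim i} \varepsilon_{ij}\, \lambda_{ij}}_{\mathrm{linear}_i(\lambda)},
\qquad
\mathrm{function}_i(\lambda)
   = \min_{\theta^{(i)}} \; f_i(\theta^{(i)}) + [\theta^{(i)}]^\top \mathrm{linear}_i(\lambda)
   = -f_i^*\bigl(-\mathrm{linear}_i(\lambda)\bigr),
```

where ε_ij = +1 at one endpoint of the edge and −1 at the other, and f_i^* is the Fenchel conjugate of f_i. Since f_i^* is smooth as soon as f_i is strongly convex, the dual function can be maximized by (accelerated) gradient methods, and each dual gradient step amounts to one local computation per node plus one exchange along each edge, which is the alternation described above.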

Accelerated gossip

• Regular gossip
  – Iterations: θ_t = W^t θ_0

• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = ∑_{k=0}^{t} α_k θ_k = ∑_{k=0}^{t} α_k W^k ξ = P_t(W) ξ
  – Optimal polynomial is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g.,
      θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
  – Replace γ^{−1} by γ^{−1/2} in rates

Illustration of accelerated gossip

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
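A sketch comparing plain gossip with the two-term recursion above; the classical Chebyshev semi-iterative weights, the ring matrix, and the iteration budget are assumptions chosen for the example.

```python
import numpy as np

def chebyshev_gossip(W, xi, n_iters):
    """Accelerated gossip: theta_t = omega_t W theta_{t-1} + (1 - omega_t) theta_{t-2},
    with Chebyshev (semi-iterative) weights built from the spectrum of W."""
    eigs = np.sort(np.linalg.eigvalsh(W))
    rho = max(abs(eigs[-2]), abs(eigs[0]))   # spectral radius orthogonal to consensus
    theta_prev, theta = xi.copy(), W @ xi    # omega_1 = 1: plain first step
    omega = 2.0 / (2.0 - rho ** 2)           # omega_2
    for _ in range(n_iters - 1):
        theta, theta_prev = omega * (W @ theta) + (1 - omega) * theta_prev, theta
        omega = 4.0 / (4.0 - rho ** 2 * omega)
    return theta

n = 9
I = np.eye(n)
W = (I + np.roll(I, 1, axis=0) + np.roll(I, -1, axis=0)) / 3   # same ring as before
xi = np.random.default_rng(0).standard_normal(n)
plain = np.linalg.matrix_power(W, 30) @ xi
accel = chebyshev_gossip(W, xi, 30)
print(np.abs(plain - xi.mean()).max(), np.abs(accel - xi.mean()).max())
```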

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) ∑_{i=1}^{n} f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                    | gradient steps | communication
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slow-down when going from 1 to 2 machines
  – Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) ∑_{j=1}^{m} ℓ(y_ij, θ^⊤Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                     | gradient steps | communication
  Single machine algorithm      | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)    | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019)  | m + √(mκ)      | √(κ/γ)
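To see what the table means in practice, here is a back-of-the-envelope comparison of gradient-step counts; the values of n, m, and κ below are assumptions for illustration, not numbers from the talk.

```python
import numpy as np

n, m = 100, 10_000              # computing nodes and observations per node (assumed)
kappa = np.sqrt(n * m)          # a typical statistically-motivated condition number (assumed)

single = n * m + np.sqrt(n * m * kappa)   # single machine: nm + sqrt(nm kappa)
msda   = m * np.sqrt(kappa)               # MSDA: m sqrt(kappa)
adfs   = m + np.sqrt(m * kappa)           # ADFS: m + sqrt(m kappa)
print(f"single: {single:.3g}, MSDA: {msda:.3g}, ADFS: {adfs:.3g}")
# ADFS is roughly the single-machine count divided by n: a linear speed-up.
```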

ADFS - Algorithm principle

• Minimizing ∑_{i=1}^{n} [ ∑_{j=1}^{m} f_ij(θ) + (σ_i/2) ‖θ‖² ]
  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10^4 observations per node in R^28
  – Two-dimensional grid network

  [Figure: running-time comparison for n = 4 and n = 100 nodes]

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
  – Logistic regression with nm ≈ 10^5 observations in R^47236
  – Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running-time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.




Page 12: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))

x y

θ1θ3

θ2

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

2n

nsum

i=1

(

yi minus h(xi θ))2

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(least-squares regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

• A function g : R^d → R is L-smooth if and only if it is twice differentiable and
  ∀ θ ∈ R^d, |eigenvalues[g′′(θ)]| ≤ L

  [Figure: a smooth vs. a non-smooth function]

• Machine learning
  – with g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Smooth prediction function θ ↦ h(x_i, θ) + smooth loss

• A twice differentiable function g : R^d → R is convex if and only if
  ∀ θ ∈ R^d, eigenvalues[g′′(θ)] ≥ 0
  and μ-strongly convex if and only if
  ∀ θ ∈ R^d, eigenvalues[g′′(θ)] ≥ μ

  [Figure: a convex vs. a strongly convex function]

  – Condition number κ = L/μ ≥ 1
  [Figure: level sets for small κ = L/μ and large κ = L/μ]

• Convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ^⊤ Φ(x)

• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis

• Strong convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Strongly convex loss and linear predictions h(x, θ) = θ^⊤ Φ(x)
  – Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i) Φ(x_i)^⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2) ‖θ‖²
  – creates additional bias unless μ is small, but reduces variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
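For regularized least-squares with linear predictions, L and μ can be read off the Hessian (1/n) Φ^⊤Φ + μ I. The sketch below, with an assumed random design, just computes the resulting condition number κ = L/μ; the helper name is hypothetical.

```python
import numpy as np

# Sketch under assumptions: ridge-regularized least-squares, whose Hessian is
# X^T X / n + mu * I; L and the effective strong-convexity constant are its
# extreme eigenvalues, and kappa is their ratio.

def condition_number(X, mu):
    n, d = X.shape
    hessian = X.T @ X / n + mu * np.eye(d)
    eigs = np.linalg.eigvalsh(hessian)           # real eigenvalues (symmetric matrix)
    return eigs.max() / eigs.min()               # kappa = L / mu_effective

X = np.random.default_rng(0).normal(size=(1000, 20))
print(condition_number(X, mu=1e-2))              # larger mu -> smaller kappa
```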

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})  (possibly with line search)
  g(θ_t) − g(θ*) ≤ O(1/t)
  g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex
  ⇔ O(κ log(1/ε)) iterations, complexity = O(nd · κ log(1/ε))

  [Figure: gradient-descent iterates for small κ = L/μ and large κ = L/μ]

• Acceleration (Nesterov, 1983): second-order recursion
  θ_t = η_{t−1} − γ_t g′(η_{t−1})  and  η_t = θ_t + δ_t (θ_t − θ_{t−1})
  – Good choice of momentum term δ_t ∈ [0, 1)
  g(θ_t) − g(θ*) ≤ O(1/t²)
  g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)

• Newton method: θ_t = θ_{t−1} − g′′(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ 2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations, complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
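The two recursions above are short enough to state in code. In the sketch below, the quadratic test objective and the constant momentum choice δ = (√κ − 1)/(√κ + 1) for the strongly convex case are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Sketch (assumed quadratic objective g(theta) = 0.5 * theta^T A theta):
# plain gradient descent vs. Nesterov's second-order recursion
#   theta_t = eta_{t-1} - gamma * g'(eta_{t-1}),  eta_t = theta_t + delta * (theta_t - theta_{t-1}).

def gradient_descent(grad, theta0, gamma, iters):
    theta = theta0.copy()
    for _ in range(iters):
        theta -= gamma * grad(theta)
    return theta

def accelerated_gd(grad, theta0, gamma, delta, iters):
    theta_prev = theta0.copy()
    eta = theta0.copy()
    for _ in range(iters):
        theta = eta - gamma * grad(eta)
        eta = theta + delta * (theta - theta_prev)
        theta_prev = theta
    return theta

L, mu = 1.0, 1e-2                                  # condition number kappa = 100
A = np.diag(np.linspace(mu, L, 50))
grad = lambda theta: A @ theta                     # minimizer is theta* = 0
delta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)
theta0 = np.ones(50)
print(np.linalg.norm(gradient_descent(grad, theta0, 1 / L, 200)))      # ~ (1 - 1/kappa)^t
print(np.linalg.norm(accelerated_gd(grad, theta0, 1 / L, delta, 200))) # ~ (1 - 1/sqrt(kappa))^t
```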

Stochastic gradient descent (SGD) for finite sums

min_{θ ∈ R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, . . . , n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex L-smooth and g is μ-strongly convex:
  E g(θ̄_t) − g(θ*) ≤ O(1/√t)  if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(μt)) = O(κ/t)  if γ_t = 1/(μt)
  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
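A minimal sketch of the iteration above, under assumptions: least-squares losses f_i, the step size γ_t = 1/(μt), and the pretense that the average g is μ-strongly convex; the averaging is the running mean of the iterates.

```python
import numpy as np

# Sketch under assumptions: SGD with gamma_t = 1/(mu * t) and Polyak-Ruppert
# averaging on a least-squares finite sum f_i(theta) = 0.5 * (y_i - x_i^T theta)^2.

def sgd_with_averaging(X, y, mu, iters, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t in range(1, iters + 1):
        i = rng.integers(n)                      # sampling with replacement
        grad_i = (X[i] @ theta - y[i]) * X[i]    # gradient of f_i at theta
        theta -= grad_i / (mu * t)               # gamma_t = 1 / (mu * t)
        theta_bar += (theta - theta_bar) / t     # running average of the iterates
    return theta_bar
```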

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λ Ω(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Linear (exponential) convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, . . . , n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n

• Goal = best of both worlds: linear rate with O(d) iteration cost and a simple choice of step size

  [Figure: log(excess cost) vs. time — the deterministic method has a linear rate but expensive iterations, the stochastic method has cheap iterations but a sublinear rate, and the new methods combine both advantages]

Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the value of the gradient of the i-th function stored at time t)

• Running time to reach precision ε (with κ = condition number)

  Stochastic gradient descent:  d × κ × 1/ε
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
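A sketch of the variance-reduced update displayed above, in SAGA style: the assumed losses (ridge-regularized least-squares), step size, and function names are mine, not from the slides.

```python
import numpy as np

# Sketch under assumptions: SAGA-style variance reduction on ridge-regularized
# least-squares, following
#   theta <- theta - gamma * ( grad_i(theta) - y_i_stored + (1/n) sum_j y_j_stored ),
# where y_i_stored is the last gradient evaluated for f_i (refreshed after each step).

def saga(X, y, lam, gamma, iters, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    stored = np.zeros((n, d))                    # stored per-function gradients y_i
    stored_mean = stored.mean(axis=0)
    for _ in range(iters):
        i = rng.integers(n)
        g_i = (X[i] @ theta - y[i]) * X[i] + lam * theta
        theta -= gamma * (g_i - stored[i] + stored_mean)
        stored_mean += (g_i - stored[i]) / n     # keep the average of stored gradients exact
        stored[i] = g_i
    return theta
```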

Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤ Φ(x_ij)) if node i holds m_i observations

• Each dataset / function f_i only accessible by node i in a graph

  [Figure: network of 9 nodes]

  – Massive datasets, multiple machines / cores
  – Communication / legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network (see the sketch below)
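A simulated, single-process sketch of this naive approach: each "node" computes its local gradient ∇f_i and a master averages them before every global step. The local least-squares datasets, step size, and function names are illustrative assumptions.

```python
import numpy as np

# Simulated sketch (illustrative): distributing plain gradient descent. Each node i
# holds a local least-squares dataset (X_i, y_i); every iteration the master averages
# the n local gradients, which is exactly the distributed-averaging step.

def local_gradient(theta, X_i, y_i):
    return X_i.T @ (X_i @ theta - y_i) / len(y_i)

def distributed_gradient_descent(datasets, d, gamma, iters):
    theta = np.zeros(d)
    for _ in range(iters):
        grads = [local_gradient(theta, X_i, y_i) for X_i, y_i in datasets]
        theta -= gamma * np.mean(grads, axis=0)  # averaging over nodes = communication
    return theta

rng = np.random.default_rng(0)
d, n_nodes, m = 10, 5, 200
theta_true = rng.normal(size=d)
datasets = []
for _ in range(n_nodes):
    X_i = rng.normal(size=(m, d))
    datasets.append((X_i, X_i @ theta_true + 0.01 * rng.normal(size=m)))
theta_hat = distributed_gradient_descent(datasets, d, gamma=0.5, iters=300)
print(np.linalg.norm(theta_hat - theta_true))
```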

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, . . . , ξ_n ∈ R
  – Compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

  [Figure: network of 9 nodes and a spanning tree rooted at a master node]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?

• Decentralized algorithms - gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors, Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)

  [Figure: network of 9 nodes]

Illustration of synchronous gossip
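The illustration can be reproduced numerically. In the sketch below, the ring network and the lazy random-walk weight matrix are assumptions chosen to give a simple symmetric doubly stochastic W.

```python
import numpy as np

# Sketch under assumptions: synchronous gossip theta_t = W theta_{t-1} on a ring of
# n nodes, with a lazy random-walk matrix W (symmetric, doubly stochastic). Every
# coordinate converges to the average of the initial values xi at rate lambda_2(W)^t.

def ring_gossip_matrix(n):
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
    return W

n = 20
xi = np.random.default_rng(0).normal(size=n)
W = ring_gossip_matrix(n)
spectrum = np.sort(np.linalg.eigvalsh(W))[::-1]
eigengap = spectrum[0] - spectrum[1]             # gamma = 1 - lambda_2(W)

theta = xi.copy()
for _ in range(int(10 / eigengap)):              # ~ (1/gamma) log(1/eps) iterations
    theta = W @ theta
print(np.max(np.abs(theta - xi.mean())))         # all nodes close to the network average
```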

Decentralized optimization

• Mixing gossip and optimization
  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip is not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

min_{θ ∈ R^d} Σ_{i=1}^n f_i(θ)
  = min_{θ^(1), . . . , θ^(n) ∈ R^d} Σ_{i=1}^n f_i(θ^(i))  such that  ∀ i ∼ j, θ^(i) = θ^(j)
  = min_{θ^(1), . . . , θ^(n) ∈ R^d} max_{∀ i∼j, λ_ij ∈ R^d} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i∼j} λ_ij^⊤ (θ^(i) − θ^(j))
  = max_{∀ i∼j, λ_ij ∈ R^d} min_{θ^(1), . . . , θ^(n) ∈ R^d} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i=1}^n [θ^(i)]^⊤ linear_i(λ)
  = max_{∀ i∼j, λ_ij ∈ R^d} Σ_{i=1}^n function_i(λ) = max_{∀ i∼j, λ_ij ∈ R^d} function(λ)

• Accelerated gradient descent on the dual (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip

Accelerated gossip

• Regular gossip
  – Iterations: θ_t = W^t θ_0

• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – The optimal polynomial is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g.,
    θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
  – Replace γ^{−1} by γ^{−1/2} in the rates

Illustration of accelerated gossip

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
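The sketch below contrasts plain gossip with a constant-weight shift-register variant θ_t = ω W θ_{t−1} + (1 − ω) θ_{t−2}. The constant ω = 2/(1 + √(1 − λ_2(W)²)) is an assumed simplification of the time-varying Chebyshev weights ω_t used on the slides, and the ring network is the same assumption as in the earlier gossip sketch.

```python
import numpy as np

# Sketch under assumptions: constant-weight shift-register gossip
#   theta_t = omega * W theta_{t-1} + (1 - omega) * theta_{t-2},
# a simplified stand-in for the Chebyshev recursion with time-varying omega_t.

def plain_gossip(W, xi, iters):
    theta = xi.copy()
    for _ in range(iters):
        theta = W @ theta
    return theta

def shift_register_gossip(W, xi, iters):
    lam2 = np.sort(np.linalg.eigvalsh(W))[-2]        # second largest eigenvalue of W
    omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))   # assumed constant weight
    prev, theta = xi.copy(), W @ xi
    for _ in range(iters - 1):
        prev, theta = theta, omega * (W @ theta) + (1.0 - omega) * prev
    return theta

n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25
xi = np.random.default_rng(0).normal(size=n)
for iters in (25, 50, 100):
    print(iters,
          np.max(np.abs(plain_gossip(W, xi, iters) - xi.mean())),
          np.max(np.abs(shift_register_gossip(W, xi, iters) - xi.mean())))
```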

Distribution in machine learning (and beyond)

• Machine learning through optimization: min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤ Φ(x_ij)) if node i holds m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                    | gradient steps | communication steps
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slow-down when going from 1 to 2 machines
  – Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤ Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                    | gradient steps | communication steps
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019) | m + √(mκ)      | √(κ/γ)

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² ]
  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10^4 observations per node in R^28, two-dimensional grid network
    [Figure: running-time comparison for n = 4 and n = 100 nodes]
  – Logistic regression with nm ≈ 10^5 observations in R^47236, two-dimensional grid network with n = 100 nodes
    [Figure: running-time comparison]

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 2013 IEEE 52nd Annual Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.

Dušan Jakovetić, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Mathematics Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 13: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

x1 x2 x3 x4 x5 x6

y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1

ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))

x y

θ1θ3

θ2

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

2n

nsum

i=1

(

yi minus h(xi θ))2

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(least-squares regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

• Regular gossip

– Iterations: θ_t = W^t θ_0

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)

– Shift-register gossip (Cao et al., 2006)

– Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ

– Optimal polynomial is the Chebyshev polynomial

– Can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}

– Replace γ^{−1} by γ^{−1/2} in the rates

Illustration of accelerated gossip

• ⇒ Optimal complexity for optimization (Scaman et al., 2017)
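A hedged sketch of this acceleration, reusing the same ring graph and W as in the earlier gossip example: it runs plain gossip and the Chebyshev recursion θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2} side by side, with ω_t computed from the three-term Chebyshev recurrence. The stopping time and the spectral-bound estimate are illustrative choices, not part of the original slides.

```python
import numpy as np

# Plain gossip vs. Chebyshev-accelerated gossip on the same ring graph.
n = 9
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(0)
xi = rng.normal(size=n)
avg = xi.mean()

# Bound on the magnitude of the non-principal eigenvalues of W (= 1 - gamma).
eigs = np.linalg.eigvalsh(W)                # ascending order, eigs[-1] = 1
lam = max(abs(eigs[0]), abs(eigs[-2]))

T = 60
# Plain gossip: theta_t = W^t xi.
plain = xi.copy()
for t in range(T):
    plain = W @ plain

# Chebyshev acceleration: x_t = P_t(W) xi with P_t(x) = T_t(x/lam) / T_t(1/lam),
# implemented via the three-term recurrence of the Chebyshev polynomials.
a_prev, a_curr = 1.0, 1.0 / lam             # T_0(1/lam), T_1(1/lam)
x_prev, x_curr = xi.copy(), W @ xi
for t in range(2, T + 1):
    a_next = (2.0 / lam) * a_curr - a_prev
    omega = 2.0 * a_curr / (lam * a_next)
    x_next = omega * (W @ x_curr) + (1.0 - omega) * x_prev
    a_prev, a_curr = a_curr, a_next
    x_prev, x_curr = x_curr, x_next

print("plain gossip error      :", np.max(np.abs(plain - avg)))
print("accelerated gossip error:", np.max(np.abs(x_curr - avg)))
```

With the same number of matrix-vector products, the accelerated iterates are many orders of magnitude closer to the average, which is the γ^{−1} → γ^{−1/2} improvement mentioned above.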

Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i

– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if node i holds m_i observations
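For concreteness, here is a small sketch of one such local function and its gradient for the logistic loss, with Φ taken as the identity feature map; the data, sizes and names are illustrative assumptions.

```python
import numpy as np

# Local function f_i(theta) = (1/m_i) * sum_j log(1 + exp(-y_ij * theta^T x_ij))
# held by node i, and its gradient -- the quantity node i computes locally.
def f_i(theta, X_i, y_i):
    margins = y_i * (X_i @ theta)                # shape (m_i,)
    return np.mean(np.log1p(np.exp(-margins)))

def grad_f_i(theta, X_i, y_i):
    margins = y_i * (X_i @ theta)
    # gradient of mean_j log(1 + exp(-margin_j)) w.r.t. theta:
    # mean over j of  -y_ij * x_ij / (1 + exp(margin_j))
    coeffs = -y_i / (1.0 + np.exp(margins))
    return (X_i * coeffs[:, None]).mean(axis=0)

rng = np.random.default_rng(0)
m_i, d = 100, 5
X_i = rng.normal(size=(m_i, d))
y_i = np.sign(rng.normal(size=m_i))
theta = np.zeros(d)
print(f_i(theta, X_i, y_i), grad_f_i(theta, X_i, y_i))
```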

Distribution in machine learning (and beyond)

• Machine learning through optimization: min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

• Single machine vs. "optimal" decentralized algorithm

Algorithm                        gradient steps    communication
Single machine algorithm         nm + √(nmκ)       0
MSDA (Scaman et al., 2017)       m√κ               √(κ/γ)

Distribution in machine learning (and beyond)

• MSDA (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– "Optimal", but still not adapted to machine learning

– Huge slow-down when going from 1 to 2 machines

– Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps

– Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤Φ(x_ij))

– Allows for partial asynchrony

• n computing nodes with m observations each (the resulting complexities are illustrated numerically in the short sketch after the table)

Algorithm                        gradient steps    communication
Single machine algorithm         nm + √(nmκ)       0
MSDA (Scaman et al., 2017)       m√κ               √(κ/γ)
ADFS (Hendrikx et al., 2019)     m + √(mκ)         √(κ/γ)
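To get a feel for the table, the short script below evaluates the three gradient-step counts (up to logarithmic factors) for one arbitrary setting of n, m and κ, taking κ at the statistical scale √(nm) suggested earlier in the deck; the numbers are only meant to show the near-linear speed-up regime, not a measurement.

```python
import math

# Gradient-step counts from the table above, for one illustrative setting.
n, m = 100, 10_000            # nodes, observations per node
kappa = math.sqrt(n * m)      # a typical "statistical" condition number

single_machine = n * m + math.sqrt(n * m * kappa)
msda = m * math.sqrt(kappa)
adfs = m + math.sqrt(m * kappa)

print(f"single machine : {single_machine:,.0f}")
print(f"MSDA           : {msda:,.0f}")
print(f"ADFS           : {adfs:,.0f}")
print(f"speed-up ADFS vs single machine: {single_machine / adfs:,.1f}x")
```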

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² ]

– Create an equivalent graph

– Dual randomized coordinate ascent (with non-uniform sampling)

– Decoupling of data and gossip steps (schematically illustrated below)
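The decoupling idea can be caricatured with a much simpler scheme than ADFS: each node keeps its own parameter and, at every iteration, either takes a local stochastic gradient step on its own data (no communication) or performs a gossip averaging step (no data access), chosen at random. The sketch below is only that caricature, with arbitrary step size, probabilities and problem; with a constant step size it only reaches a neighbourhood of the optimum, and it is not ADFS.

```python
import numpy as np

# Schematic decentralized least squares on a ring of n nodes.
n, m, d = 9, 50, 3
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(0)
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, m, d))
y = X @ theta_star + 0.01 * rng.normal(size=(n, m))

Theta = np.zeros((n, d))                    # one parameter vector per node
step, p_data = 0.05, 0.8

for t in range(5000):
    if rng.random() < p_data:               # "data step": local SGD at every node
        j = rng.integers(m)
        residual = (X[:, j, :] * Theta).sum(axis=1) - y[:, j]
        Theta -= step * residual[:, None] * X[:, j, :]
    else:                                   # "gossip step": neighbourhood averaging
        Theta = W @ Theta

print("distance to theta*:", np.linalg.norm(Theta - theta_star, axis=1))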

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

– Logistic regression with m = 10^4 observations per node in R^28

– Two-dimensional grid network

(figures: running-time curves for n = 4 and n = 100 nodes)

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

– Logistic regression with nm ≈ 10^5 observations in R^47236

– Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms

– n machines and m observations per machine

– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps

– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems

– Matching running-time complexity lower bounds

– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stephan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.
Dušan Jakovetić, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 14: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

2n

nsum

i=1

(

yi minus h(xi θ))2

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(least-squares regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 15: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

2n

nsum

i=1

(

yi minus h(xi θ))2

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(least-squares regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

• Variance reduction

– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

• Running-time to reach precision ε (with κ = condition number)

  Stochastic gradient descent    d × κ × 1/ε
  Gradient descent               d × nκ × log(1/ε)
  Variance reduction             d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity

Outline

1. Parametric supervised learning on a single machine

– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks

– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization

min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i

– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Each dataset / function f_i only accessible by node i in a graph (see the sketch of a local oracle below)

[Figure: network of 9 computing nodes]
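
To make the distributed setting concrete, here is a small Python/NumPy sketch (our own notation and synthetic data) of a node i holding its local observations (Φ(x_ij), y_ij) and answering local function-value and gradient queries for a regularized logistic loss; the decentralized algorithms discussed below only ever combine such local oracles with communication between neighbors.

import numpy as np

rng = np.random.default_rng(0)
n_nodes, m, d = 9, 200, 5                    # 9 nodes as in the figure, m local observations each

# node i only ever touches its own data; labels y_ij are in {-1, +1}
local_X = [rng.standard_normal((m, d)) for _ in range(n_nodes)]
local_y = [np.sign(rng.standard_normal(m)) for _ in range(n_nodes)]

def local_f_and_grad(i, theta, lam=0.1):
    # f_i(theta) = (1/m) sum_j log(1 + exp(-y_ij theta' Phi(x_ij))) + lam/2 ||theta||^2
    margins = local_y[i] * (local_X[i] @ theta)
    loss = np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * theta @ theta
    weights = 1.0 / (1.0 + np.exp(margins))  # sigma(-margin), the per-observation loss slope
    grad = -(local_X[i] * (local_y[i] * weights)[:, None]).mean(axis=0) + lam * theta
    return loss, grad

loss0, grad0 = local_f_and_grad(0, np.zeros(d))
print("node 0: local loss", loss0, "local gradient norm", np.linalg.norm(grad0))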

Distribution in machine learning (and beyond)

• Machine learning through optimization

min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i

– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Each dataset / function f_i only accessible by node i in a graph

– Massive datasets, multiple machines / cores
– Communication / legal constraints

• Goal: Minimize communication and local computation costs

Distribution in machine learning (and beyond)

• Machine learning through optimization

min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distribute a simple single machine algorithm?

– (accelerated) gradient descent (see, e.g., Nesterov, 2004): θt = θt−1 − γ ∇g(θt−1)
– Requires √κ log(1/ε) full gradient computations to reach precision ε
– Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ R

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ R

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm

[Figure: network graph and its spanning tree]

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ R

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error (see the sketch below)

[Figure: network graph and its spanning tree]
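
A toy Python sketch of the master/slave idea (the 9-node graph below is made up): build a BFS spanning tree rooted at a master node, aggregate partial sums and counts from the leaves to the root, and let the root broadcast the exact average back; the number of communication rounds is proportional to the tree depth, hence of order ∆.

import numpy as np
from collections import deque

# made-up undirected graph on 9 nodes (adjacency lists)
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4, 6], 4: [1, 3, 5, 7],
       5: [2, 4, 8], 6: [3, 7], 7: [4, 6, 8], 8: [5, 7]}
rng = np.random.default_rng(0)
xi = rng.standard_normal(len(adj))           # one observation xi_i per node

# 1) BFS spanning tree rooted at the master (node 0)
parent, order = {0: None}, []
queue = deque([0])
while queue:
    u = queue.popleft()
    order.append(u)
    for v in adj[u]:
        if v not in parent:
            parent[v] = u
            queue.append(v)

# 2) leaves-to-root aggregation of (sum, count) along the tree
sums = {u: xi[u] for u in adj}
counts = {u: 1 for u in adj}
for u in reversed(order):                    # children are handled before their parent
    if parent[u] is not None:
        sums[parent[u]] += sums[u]
        counts[parent[u]] += counts[u]

# 3) the root holds the exact average and broadcasts it down the same tree
average = sums[0] / counts[0]
print("master average:", average, "- direct check:", xi.mean())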

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ R

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization

– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ R

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization

– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ R

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006)

– Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

[Figure: network of 9 computing nodes]

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ R

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006)

– Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θt = W θt−1 = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θt = W θt−1 = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θt = W θt−1 = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain

[Figure: network of 9 computing nodes]

– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging); see the sketch below
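
A minimal Python/NumPy sketch of synchronous gossip on a made-up ring of 10 nodes: Metropolis-style weights give a symmetric doubly stochastic W, we read off the eigengap γ = 1 − λ_2(W), and iterate θt = W θt−1 starting from θ_0 = ξ; the deviation from the mean shrinks geometrically, at a rate governed by the second-largest eigenvalue magnitude of W.

import numpy as np

n = 10
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}       # ring graph
W = np.zeros((n, n))
for i in range(n):
    for j in adj[i]:
        W[i, j] = 1.0 / (1 + max(len(adj[i]), len(adj[j])))   # Metropolis weights
    W[i, i] = 1.0 - W[i].sum()               # self-loop makes each row sum to one

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
print("eigengap gamma = 1 - lambda_2 =", eigs[0] - eigs[1])

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)
theta = xi.copy()
for t in range(1, 201):
    theta = W @ theta                        # theta_t = W theta_{t-1} = W^t xi
    if t % 50 == 0:
        print(t, "max deviation from the mean:", np.max(np.abs(theta - xi.mean())))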

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization

– Nedic and Ozdaglar (2009), Duchi et al. (2012), Wei and Ozdaglar (2012), Iutzeler et al. (2013), Shi et al. (2015), Jakovetic et al. (2015), Nedich et al. (2016), Mokhtari et al. (2016), Colin et al. (2016), Scaman et al. (2017), etc. (a sketch of the simplest such scheme follows)
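
To give a feel for how gossip and optimization are mixed, here is a Python/NumPy sketch of the simplest scheme in this family, plain decentralized gradient descent in the spirit of Nedic and Ozdaglar (2009) (not the optimal method discussed below, and all data here is synthetic): every node averages its iterate with its neighbors through W and takes a local gradient step; with a constant step size it only reaches a neighborhood of the optimum, which is one reason gradients have to be gossiped with increasing precision.

import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 3

# made-up local objectives f_i(theta) = 0.5 ||A_i theta - b_i||^2, one per node
A = rng.standard_normal((n, 5, d))
b = rng.standard_normal((n, 5))

def local_grad(i, theta):
    return A[i].T @ (A[i] @ theta - b[i])

# ring graph with Metropolis weights (symmetric doubly stochastic), as in the gossip sketch
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
W = np.zeros((n, n))
for i in range(n):
    for j in adj[i]:
        W[i, j] = 1.0 / (1 + max(len(adj[i]), len(adj[j])))
    W[i, i] = 1.0 - W[i].sum()

# reference: minimizer of (1/n) sum_i f_i
theta_star = np.linalg.lstsq(A.reshape(-1, d), b.reshape(-1), rcond=None)[0]

Theta = np.zeros((n, d))                     # row i = local iterate held by node i
gamma = 0.01
for t in range(2000):
    grads = np.stack([local_grad(i, Theta[i]) for i in range(n)])
    Theta = W @ Theta - gamma * grads        # one gossip step + one local gradient step

print("max distance of the local iterates to theta_star:",
      np.max(np.linalg.norm(Theta - theta_star, axis=1)))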

Decentralized optimization

• Mixing gossip and optimization

• Lower bound on complexity (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Plain gossip is not optimal (need to gossip gradients with increasing precision)

Decentralized optimization

• Mixing gossip and optimization

• Lower bound on complexity (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Plain gossip is not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al 2015)

min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀ i∼j, θ(i) = θ(j)

= min_{θ(1),…,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

Dual reformulation (Jakovetic et al 2015)

min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀ i∼j, θ(i) = θ(j)

= min_{θ(1),…,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

Dual reformulation (Jakovetic et al 2015)

min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀ i∼j, θ(i) = θ(j)

= min_{θ(1),…,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

= max_{∀i∼j, λ_ij∈R^d} min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i=1}^n [θ(i)]⊤ linear_i(λ)

Dual reformulation (Jakovetic et al 2015)

min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀ i∼j, θ(i) = θ(j)

= min_{θ(1),…,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

= max_{∀i∼j, λ_ij∈R^d} min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i=1}^n [θ(i)]⊤ linear_i(λ)

= max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n function_i(λ) = max_{∀i∼j, λ_ij∈R^d} function(λ)

Dual reformulation (Jakovetic et al 2015)

min_{θ∈R^d} ∑_{i=1}^n f_i(θ) = min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀ i∼j, θ(i) = θ(j)

= min_{θ(1),…,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))

= max_{∀i∼j, λ_ij∈R^d} min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i=1}^n [θ(i)]⊤ linear_i(λ)

= max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n function_i(λ) = max_{∀i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017) ⇔ alternating local gradient computations and a gossip step

– √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Not optimal ⇒ need accelerated gossip
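
For completeness, here is our own spelling-out (in LaTeX) of the regrouping step hidden in the equalities above: gathering the edge terms node by node identifies the linear map linear_i(λ) seen by node i (each multiplier enters its two endpoints with opposite signs, for an arbitrary but fixed orientation of every edge), and each local term is then a Fenchel conjugate of f_i, which is what makes the dual amenable to per-node computations.

\sum_{i \sim j} \lambda_{ij}^\top \big( \theta^{(i)} - \theta^{(j)} \big)
  = \sum_{i=1}^{n} [\theta^{(i)}]^\top \, \mathrm{linear}_i(\lambda),
\qquad
\mathrm{linear}_i(\lambda) = \sum_{j \,:\, j \sim i} \pm \lambda_{ij}
\quad (+\text{ if the edge is oriented from } i,\ -\text{ otherwise}),

\mathrm{function}_i(\lambda)
  = \min_{\theta^{(i)} \in \mathbb{R}^d}
    \Big\{ f_i(\theta^{(i)}) + [\theta^{(i)}]^\top \mathrm{linear}_i(\lambda) \Big\}
  = - f_i^{*}\big( - \mathrm{linear}_i(\lambda) \big).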

Accelerated gossip

• Regular gossip

– Iterations: θt = W^t θ_0

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
– Shift-register gossip (Cao et al., 2006)
– Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ
– Optimal polynomial is the Chebyshev polynomial
– Can be computed online with the same cost as regular gossip, e.g., θt = ω_t W θt−1 + (1 − ω_t) θt−1
– Replace γ^{−1} by γ^{−1/2} in rates

Illustration of accelerated gossip

Accelerated gossip

• Regular gossip

– Iterations: θt = W^t θ_0

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
– Shift-register gossip (Cao et al., 2006)
– Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ
– Optimal polynomial is the Chebyshev polynomial
– Can be computed online with the same cost as regular gossip, e.g., θt = ω_t W θt−1 + (1 − ω_t) θt−1
– Replace γ^{−1} by γ^{−1/2} in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017); a sketch of the Chebyshev recursion follows
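
A Python/NumPy sketch of accelerated gossip via the Chebyshev recursion (our own implementation choices; the slide states the online recursion with the previous iterate, while the classical Chebyshev semi-iterative form below keeps the two previous iterates at the same per-iteration cost). Assuming all eigenvalues of W except the top one lie in [−ρ, ρ], the iterate P_t(W) ξ, with P_t the rescaled Chebyshev polynomial, is compared with plain gossip W^t ξ:

import numpy as np

# same made-up ring with Metropolis weights as in the plain gossip sketch
n = 10
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
W = np.zeros((n, n))
for i in range(n):
    for j in adj[i]:
        W[i, j] = 1.0 / (1 + max(len(adj[i]), len(adj[j])))
    W[i, i] = 1.0 - W[i].sum()

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
rho = max(abs(eigs[1]), abs(eigs[-1]))       # bound on |eigenvalues| other than lambda_1 = 1
rng = np.random.default_rng(0)
xi = rng.standard_normal(n)
mean = xi.mean()

# Chebyshev acceleration: theta_t = T_t(W / rho) xi / T_t(1 / rho) = P_t(W) xi
u_prev, u = xi.copy(), (W @ xi) / rho        # u_t = T_t(W / rho) xi
z_prev, z = 1.0, 1.0 / rho                   # z_t = T_t(1 / rho), so that P_t(1) = 1
plain = W @ xi                               # plain gossip, for comparison
for t in range(2, 51):
    u, u_prev = (2.0 / rho) * (W @ u) - u_prev, u
    z, z_prev = (2.0 / rho) * z - z_prev, z
    plain = W @ plain
    if t % 10 == 0:
        print(t, "accelerated error:", np.max(np.abs(u / z - mean)),
              " plain error:", np.max(np.abs(plain - mean)))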

Distribution in machine learning (and beyond)

• Machine learning through optimization

min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i

– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

Distribution in machine learning (and beyond)

• Machine learning through optimization

min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i

– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                     gradient steps    communication
  Single machine algorithm      nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)    m√κ               √(κ/γ)

Distribution in machine learning (and beyond)

• Machine learning through optimization

min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i

– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• MSDA (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– "Optimal", but still not adapted to machine learning
– Huge slowdown when going from 1 to 2 machines
– Only synchronous

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps

– Adapted to functions of the type f_i(θ) = (1/m) ∑_{j=1}^m ℓ(y_ij, θ⊤Φ(x_ij))
– Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                     gradient steps    communication
  Single machine algorithm      nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)    m√κ               √(κ/γ)
  ADFS (Hendrikx et al., 2019)  m + √(mκ)         √(κ/γ)

ADFS - Algorithm principle

• Minimizing ∑_{i=1}^n ( ∑_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² )

– Create an equivalent graph
– Dual randomized coordinate ascent (with non-uniform sampling)
– Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

– Logistic regression with m = 10⁴ observations per node in R^28
– Two-dimensional grid network

[Figures: running-time curves for n = 4 and n = 100 nodes]

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

– Logistic regression with mn ≈ 10⁵ observations in R^47236
– Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms
– n machines and m observations per machine
– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
– Linear speed-ups for well-conditioned problems

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms
– n machines and m observations per machine
– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems
– Matching running-time complexity lower bounds
– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
M. Cao, D. A. Spielman, and E. M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
I. Colin, A. Bellet, J. Salmon, and S. Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
H. Hendrikx, F. Bach, and L. Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.
D. Jakovetic, J. M. F. Moura, and J. Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
E. Wei and A. Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 16: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

log(

1 + exp(minusyih(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

(logistic regression)

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 17: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})   (with line search)
  g(θ_t) − g(θ*) ≤ O(1/t)
  g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex

• Acceleration (Nesterov, 1983): second-order recursion
  θ_t = η_{t−1} − γ_t g′(η_{t−1})   and   η_t = θ_t + δ_t (θ_t − θ_{t−1})
– Good choice of momentum term δ_t ∈ [0, 1)
  g(θ_t) − g(θ*) ≤ O(1/t²)
  g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex
– Optimal rates after t = O(d) iterations (Nesterov, 2004)
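A minimal NumPy sketch of the two recursions above on a synthetic strongly convex quadratic; the problem, the step size γ = 1/L and the momentum δ = (√κ − 1)/(√κ + 1) are illustrative choices, not taken from the slides.

```python
import numpy as np

# Gradient descent vs. Nesterov acceleration on g(theta) = 0.5 theta'A theta - b'theta
rng = np.random.default_rng(0)
d = 50
Q = rng.standard_normal((d, d))
A = Q.T @ Q / d + 0.1 * np.eye(d)              # eigenvalues in [mu, L]
b = rng.standard_normal(d)
L, mu = np.linalg.eigvalsh(A).max(), np.linalg.eigvalsh(A).min()
kappa = L / mu
theta_star = np.linalg.solve(A, b)

def grad(theta):
    return A @ theta - b

# Plain gradient descent: theta_t = theta_{t-1} - (1/L) g'(theta_{t-1})
theta = np.zeros(d)
for t in range(200):
    theta -= grad(theta) / L

# Nesterov: theta_t = eta_{t-1} - (1/L) g'(eta_{t-1}), eta_t = theta_t + delta (theta_t - theta_{t-1})
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # classical momentum choice
theta_acc, eta = np.zeros(d), np.zeros(d)
for t in range(200):
    theta_next = eta - grad(eta) / L
    eta = theta_next + delta * (theta_next - theta_acc)
    theta_acc = theta_next

print("GD error      :", np.linalg.norm(theta - theta_star))
print("Nesterov error:", np.linalg.norm(theta_acc - theta_star))
```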

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log 1/ε) iterations
  ⇔ complexity = O(nd · κ log 1/ε)

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
– O(e^{−ρ 2^t}) quadratic rate ⇔ O(log log 1/ε) iterations
  ⇔ complexity = O((nd² + d³) · log log 1/ε)

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
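For comparison, a hypothetical Newton iteration for ℓ2-regularized logistic regression, showing where the O(nd²) Hessian build and the O(d³) linear solve in the complexity above come from; the data, regularization level and number of iterations are made up for illustration.

```python
import numpy as np

# Newton's method for l2-regularized logistic regression (illustrative sketch)
rng = np.random.default_rng(1)
n, d, lam = 1000, 20, 1e-2
X = rng.standard_normal((n, d))
y = np.where(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n) > 0, 1.0, -1.0)

def gradient_and_hessian(theta):
    margins = y * (X @ theta)
    sigma = 1.0 / (1.0 + np.exp(margins))           # sigmoid(-y x'theta)
    grad = -(X.T @ (y * sigma)) / n + lam * theta   # O(nd)
    w = sigma * (1 - sigma)
    hess = (X.T * w) @ X / n + lam * np.eye(d)      # O(n d^2)
    return grad, hess

theta = np.zeros(d)
for t in range(10):                                  # O(log log 1/eps) iterations in theory
    grad, hess = gradient_and_hessian(theta)
    theta -= np.linalg.solve(hess, grad)             # O(d^3) solve
    print("iteration", t, "gradient norm", np.linalg.norm(grad))
```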

Stochastic gradient descent (SGD) for finite sums

  min_{θ∈ℝ^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) random element of {1, …, n}
– Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex L-smooth and g is μ-strongly convex:
  E g(θ̄_t) − g(θ*) ≤ O(1/√t)              if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(μt)) = O(κ/t)   if γ_t = 1/(μt)
– No adaptivity to strong convexity in general
– Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
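A possible implementation of the iteration above, with sampling with replacement and a running Polyak-Ruppert average, on a synthetic least-squares finite sum; the problem, constants and step-size schedule are illustrative choices, not prescribed by the slides.

```python
import numpy as np

# SGD with Polyak-Ruppert averaging on a least-squares finite sum (illustrative sketch)
rng = np.random.default_rng(2)
n, d = 5000, 20
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.1 * rng.standard_normal(n)
mu = 1e-2                                    # strong convexity from an l2 regularizer
L = np.max(np.sum(X**2, axis=1)) + mu        # rough smoothness constant of the f_i

def grad_i(theta, i):                        # gradient of f_i = squared loss + (mu/2)||theta||^2
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

theta = np.zeros(d)
theta_bar = np.zeros(d)
for t in range(1, n + 1):                    # single pass: bounds then apply to the testing error
    i = rng.integers(n)                      # sampling with replacement
    gamma = 1.0 / (L * np.sqrt(t))           # or min(1/(2L), 1/(mu*t)) in the strongly convex case
    theta -= gamma * grad_i(theta, i)
    theta_bar += (theta - theta_bar) / (t + 1)   # running Polyak-Ruppert average

print("last iterate distance    :", np.linalg.norm(theta - theta_true))
print("averaged iterate distance:", np.linalg.norm(theta_bar - theta_true))
```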

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})
– Exponential convergence rate in O(e^{−t/κ}) for convex problems
– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
– Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) random element of {1, …, n}
– Convergence rate in O(κ/t)
– Iteration complexity is independent of n

• Goal = best of both worlds: linear rate with O(d) iteration cost, and a simple choice of step size
  (figure: log(excess cost) vs. time for deterministic, stochastic, and the new hybrid methods)

Recent progress in single machine optimization

• Variance reduction
– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running-time to reach precision ε (with κ = condition number)

  Stochastic gradient descent : d × κ × 1/ε
  Gradient descent            : d × nκ × log(1/ε)
  Variance reduction          : d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
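A sketch of a SAGA-style update implementing the variance-reduced direction above (stored gradients y_i and their running mean maintained in O(d) per iteration); the problem, the step size 1/(3L) and the number of passes are illustrative choices.

```python
import numpy as np

# SAGA-style variance-reduced SGD on a least-squares finite sum (illustrative sketch)
rng = np.random.default_rng(3)
n, d = 2000, 20
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.05 * rng.standard_normal(n)
mu = 1e-2
L = np.max(np.sum(X**2, axis=1)) + mu

def grad_i(theta, i):                                    # f_i = squared loss + (mu/2)||theta||^2
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])   # stored gradients y_i^t
table_mean = table.mean(axis=0)
gamma = 1.0 / (3 * L)                                    # a standard SAGA step-size choice

for t in range(5 * n):                                   # a few passes over the data
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    theta -= gamma * (g_new - table[i] + table_mean)     # variance-reduced direction
    table_mean += (g_new - table[i]) / n                 # keep the running mean in O(d)
    table[i] = g_new

print("SAGA iterate distance:", np.linalg.norm(theta - theta_true))
```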

Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i
– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Each dataset / function f_i only accessible by node i in a graph
  (figure: network of 9 nodes)
– Massive datasets, multiple machines / cores
– Communication / legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
– (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
– Requires √κ log(1/ε) full gradient computations to reach precision ε
– Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ
– Compute θ* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2Δ
– Master/slave algorithm: Δ communication steps + no error
  (figure: spanning tree extracted from the 9-node network)

• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ Δ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness? (see the spanning-tree sketch below)
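As a sketch of the master/slave idea: averaging over a BFS spanning tree with one up-sweep of partial sums followed by one broadcast, each taking at most Δ communication steps. The 9-node adjacency list and the choice of node 1 as master are made up for illustration and are not taken from the slides.

```python
from collections import deque

# Master/slave averaging over a spanning tree (illustrative sketch)
graph = {1: [3, 4], 3: [1, 2], 4: [1, 7], 2: [3, 9], 7: [4, 5],
         5: [7, 6], 9: [2, 8], 6: [5, 8], 8: [9, 6]}
values = {i: float(i) for i in graph}          # one observation xi_i per node

# 1) BFS spanning tree rooted at the master (node 1)
parent, order = {1: None}, []
queue = deque([1])
while queue:
    u = queue.popleft()
    order.append(u)
    for v in graph[u]:
        if v not in parent:
            parent[v] = u
            queue.append(v)

# 2) Up-sweep: each node sends (partial sum, count) to its parent -> at most Delta steps
sums = {i: (values[i], 1) for i in graph}
for u in reversed(order[1:]):                  # children are processed before their parents
    s, c = sums[u]
    ps, pc = sums[parent[u]]
    sums[parent[u]] = (ps + s, pc + c)

# 3) The master now holds the exact average; broadcasting it back down the tree
#    costs at most Delta additional communication steps.
total, count = sums[1]
print("exact average:", total / count)
```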

Classical algorithms for distributed averaging

• Decentralized algorithms - gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors, ∑_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
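A short simulation of synchronous gossip with a symmetric doubly stochastic W, illustrating the 1/γ scaling of the number of iterations; the ring topology and the 1/2–1/4–1/4 weights are illustrative choices.

```python
import numpy as np

# Synchronous gossip averaging theta_t = W theta_{t-1} on a ring of n nodes (illustrative sketch)
n = 20
rng = np.random.default_rng(4)
xi = rng.standard_normal(n)                      # one observation per node

# Symmetric doubly stochastic W: keep 1/2, exchange 1/4 with each ring neighbor
W = 0.5 * np.eye(n)
for i in range(n):
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigvals[0] - eigvals[1]                  # eigengap 1 - lambda_2(W)
print("eigengap:", gamma, "-> about", round(1 / gamma), "iterations per digit of accuracy")

theta = xi.copy()
for t in range(int(5 / gamma)):
    theta = W @ theta                            # every node averages with its neighbors
print("max deviation from the true mean:", np.max(np.abs(theta - xi.mean())))
```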

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization
– Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013);
  Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016);
  Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈ℝ^d} ∑_{i=1}^n f_i(θ)
    = min_{θ^(1),…,θ^(n)∈ℝ^d} ∑_{i=1}^n f_i(θ^(i))   such that ∀i ∼ j, θ^(i) = θ^(j)
    = min_{θ^(1),…,θ^(n)∈ℝ^d} max_{∀i∼j, λ_ij∈ℝ^d} ∑_{i=1}^n f_i(θ^(i)) + ∑_{i∼j} λ_ij⊤ (θ^(i) − θ^(j))
    = max_{∀i∼j, λ_ij∈ℝ^d} min_{θ^(1),…,θ^(n)∈ℝ^d} ∑_{i=1}^n f_i(θ^(i)) + ∑_{i=1}^n [θ^(i)]⊤ linear_i(λ)
    = max_{∀i∼j, λ_ij∈ℝ^d} ∑_{i=1}^n function_i(λ) = max_{∀i∼j, λ_ij∈ℝ^d} function(λ)

• Accelerated gradient descent on the dual (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
– √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Not optimal ⇒ need accelerated gossip

Accelerated gossip

• Regular gossip
– Iterations: θ_t = W^t θ_0

• Accelerated gossip
– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
– Shift-register gossip (Cao et al., 2006)
– Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ
– Optimal polynomial is the Chebyshev polynomial
– Can be computed online with the same cost as regular gossip, e.g.,
  θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
– Replace γ^{−1} by γ^{−1/2} in rates

Illustration of accelerated gossip

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
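A sketch contrasting plain gossip with a two-term accelerated recursion of the form above. For simplicity it uses a fixed weight ω chosen from λ_2(W) instead of the time-varying Chebyshev weights ω_t, which already exhibits the γ^{−1/2} behaviour on this example; the ring topology and weights are illustrative choices.

```python
import numpy as np

# Plain vs. accelerated (two-term) gossip on a ring (illustrative sketch)
n = 50
rng = np.random.default_rng(5)
xi = rng.standard_normal(n)

W = 0.5 * np.eye(n)
for i in range(n):
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

lam2 = np.sort(np.linalg.eigvalsh(W))[-2]                 # second largest eigenvalue
omega = 2 * (1 - np.sqrt(1 - lam2**2)) / lam2**2          # fixed weight: contraction ~ 1 - sqrt(2*gamma)

plain = xi.copy()
prev, curr = xi.copy(), W @ xi                            # two-term recursion needs two initial iterates
for t in range(200):
    plain = W @ plain                                     # theta_t = W theta_{t-1}
    prev, curr = curr, omega * (W @ curr) + (1 - omega) * prev   # theta_t = w W theta_{t-1} + (1-w) theta_{t-2}

mean = xi.mean()
print("plain gossip error      :", np.max(np.abs(plain - mean)))
print("accelerated gossip error:", np.max(np.abs(curr - mean)))
```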

Distribution in machine learning (and beyond)

• Machine learning through optimization: min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)
– f_i(θ): error of the model defined by θ on the dataset indexed by i
– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                    | gradient steps | communication
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)

• MSDA (Scaman et al., 2017)
– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– "Optimal", but still not adapted to machine learning
– Huge slow down when going from 1 to 2 machines
– Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps
– Adapted to functions of the type f_i(θ) = (1/m) ∑_{j=1}^m ℓ(y_ij, θ⊤Φ(x_ij))
– Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                    | gradient steps | communication
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019) | m + √(mκ)      | √(κ/γ)

ADFS - Algorithm principle

• Minimizing ∑_{i=1}^n ∑_{j=1}^m f_ij(θ) + (σ_i/2)‖θ‖²
– Create an equivalent graph
– Dual randomized coordinate ascent (with non-uniform sampling)
– Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
– Logistic regression with m = 10^4 observations per node in ℝ^28, two-dimensional grid network
  (figures: n = 4 and n = 100)
– Logistic regression with nm ≈ 10^5 observations in ℝ^47236, two-dimensional grid network with n = 100 nodes
  (figure)

Conclusions

• Distributed decentralized machine learning
– Distributing the fastest single machine algorithms
– n machines and m observations per machine
– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
– Linear speed-ups for well-conditioned problems

• Extensions
– Beyond convex problems
– Matching running time complexity lower bounds
– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 18: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 20: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Parametric supervised machine learning

bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid

bull Prediction function h(x θ) isin R parameterized by θ isin Rd

bull (regularized) empirical risk minimization

minθisinRd

1

n

nsum

i=1

ℓ(

yi h(xi θ))

+ λΩ(θ)

=1

n

nsum

i=1

fi(θ)

data fitting term + regularizer

bull Optimization optimization of regularized risk training cost

bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

smooth non-smooth

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈ℝ^d} Σ_{i=1}^n f_i(θ)
  = min_{θ^(1),...,θ^(n)∈ℝ^d} Σ_{i=1}^n f_i(θ^(i))   such that ∀ i∼j, θ^(i) = θ^(j)
  = min_{θ^(1),...,θ^(n)∈ℝ^d} max_{∀i∼j, λ_ij∈ℝ^d} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i∼j} λ_ij^⊤ (θ^(i) - θ^(j))
  = max_{∀i∼j, λ_ij∈ℝ^d} min_{θ^(1),...,θ^(n)∈ℝ^d} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i=1}^n [θ^(i)]^⊤ linear_i(λ)
  = max_{∀i∼j, λ_ij∈ℝ^d} Σ_{i=1}^n function_i(λ) = max_{∀i∼j, λ_ij∈ℝ^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip
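
To spell out the step hidden behind linear_i(λ) and function_i(λ): grouping the edge terms by node and writing a_i(λ) = Σ_{j: i∼j} ± λ_ij (the sign coming from an assumed, arbitrary orientation of each edge), the inner minimization separates across nodes and each piece is a Fenchel conjugate:

  min_{θ^(i)∈ℝ^d}  f_i(θ^(i)) + [θ^(i)]^⊤ a_i(λ)  =  - f_i^*( - a_i(λ) ),

so that linear_i(λ) = a_i(λ), function_i(λ) = - f_i^*( - a_i(λ) ), and the dual problem is

  max_{∀i∼j, λ_ij∈ℝ^d}  Σ_{i=1}^n  - f_i^*( - a_i(λ) ).

When each f_i is μ-strongly convex and L-smooth, each conjugate f_i^* is (1/L)-strongly convex and (1/μ)-smooth, so accelerated gradient ascent applies to the dual; a gradient of f_i^* is a purely local computation, and the coupling between nodes enters only through a_i(λ), which is exactly what a gossip/communication step updates.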

Accelerated gossip

• Regular gossip
  – Iterations: θ_t = W^t θ_0

• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – Optimal polynomial is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t-1} + (1 - ω_t) θ_{t-2}
  – Replace γ^{-1} by γ^{-1/2} in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)

Illustration of accelerated gossip
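
A numerical sketch of the acceleration (illustration only, not the slides' code): instead of the time-varying Chebyshev weights ω_t, it uses the fixed-weight shift-register recursion θ_t = ω W θ_{t-1} + (1 - ω) θ_{t-2} with ω = 2 / (1 + sqrt(1 - λ_2(W)^2)), a standard simplification that already exhibits the γ → √γ improvement; the ring network and all constants are assumptions.

import numpy as np

# Plain gossip vs. fixed-weight shift-register (accelerated) gossip, illustrative.
def metropolis_ring(n):
    W = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        W[i, j] = W[j, i] = 1.0 / 3.0
    return W + np.diag(1.0 - W.sum(axis=1))

n, T = 50, 300
W = metropolis_ring(n)
lam2 = np.sort(np.linalg.eigvalsh(W))[-2]
omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))   # fixed weight built from lambda_2(W)

rng = np.random.default_rng(3)
xi = rng.normal(size=n)
plain = xi.copy()
acc_prev, acc = xi.copy(), xi.copy()             # theta_{-1} = theta_0 = xi
for t in range(T):
    plain = W @ plain                                            # plain: theta_t = W theta_{t-1}
    acc, acc_prev = omega * (W @ acc) + (1 - omega) * acc_prev, acc   # two-term recursion

def err(x):
    return np.max(np.abs(x - xi.mean()))
print("plain gossip error after", T, "steps      :", err(plain))
print("accelerated gossip error after", T, "steps:", err(acc))

On this 50-node ring, λ_2(W) ≈ 0.995, so plain gossip has barely converged after 300 steps while the two-term recursion is already at machine-precision scale, matching the γ^{-1} vs. γ^{-1/2} rates on the slide.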

Distribution in machine learning (and beyond)

• Machine learning through optimization: min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of model defined by θ on dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤ Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

    Algorithm                      gradient steps     communication
    Single machine algorithm       nm + √(nmκ)        0
    MSDA (Scaman et al., 2017)     m√κ                √(κ/γ)

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slow-down when going from 1 to 2 machines
  – Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach and Massoulie, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤ Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

    Algorithm                       gradient steps     communication
    Single machine algorithm        nm + √(nmκ)        0
    MSDA (Scaman et al., 2017)      m√κ                √(κ/γ)
    ADFS (Hendrikx et al., 2019)    m + √(mκ)          √(κ/γ)
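
To get an order-of-magnitude feel for the table above, a tiny back-of-the-envelope script; the values of n, m, κ and γ are illustrative assumptions (κ near the optimistic end √(nm) of the usual range, γ ≈ 1/n as for a two-dimensional grid), not measurements from the paper.

import math

# Illustrative comparison of the gradient-step counts in the table above.
n, m = 100, 10_000                  # nodes and observations per node (assumed)
kappa = math.sqrt(n * m)            # well-conditioned regime: kappa ~ sqrt(nm)
gamma = 1.0 / n                     # eigengap of a 2D grid, roughly 1/n

single = n * m + math.sqrt(n * m * kappa)
msda = m * math.sqrt(kappa)
adfs = m + math.sqrt(m * kappa)
comm = math.sqrt(kappa / gamma)

print(f"single machine: {single:.3g} gradient steps")
print(f"MSDA          : {msda:.3g} gradient steps, {comm:.3g} communication steps")
print(f"ADFS          : {adfs:.3g} gradient steps, {comm:.3g} communication steps")
print(f"ADFS vs single machine: ~{single / adfs:.0f}x fewer gradient steps")

With these numbers ADFS needs on the order of n times fewer gradient steps than the single-machine baseline, which is the linear speed-up for well-conditioned problems claimed in the conclusions.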

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖^2 ]
  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach and Massoulie, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10^4 observations per node, in ℝ^28
  – Two-dimensional grid network

[Figures: running-time curves for n = 4 and n = 100]

Decentralized algorithms for machine learning
(Hendrikx, Bach and Massoulie, 2019)

• Running times on an actual cluster
  – Logistic regression with mn ≈ 10^5 observations, in ℝ^47236
  – Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 22: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Smoothness and (strong) convexity

bull A function g Rd rarr R is L-smooth if and only if it is twice

differentiable and

forallθ isin Rd

∣eigenvalues[

gprimeprime(θ)]∣

∣ 6 L

bull Machine learning

ndash with g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt 0

convex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ R,

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

[figure: network graph and one of its spanning trees]

• Application to centralized distributed optimization

– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– “Optimal” (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
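A minimal sketch of the master/slave idea on a spanning tree (the tree encoding and function name are mine): partial sums flow towards the root and the exact average is then broadcast back, so no error is incurred and the number of communication rounds scales with the tree depth.

    # Sketch: exact averaging by reduce-then-broadcast on a spanning tree (illustrative).
    def tree_average(children, values, root=0):
        """children[i]: children of node i in the spanning tree, values[i]: observation xi_i."""
        def subtree_sum_and_count(i):
            s, c = values[i], 1
            for j in children[i]:
                sj, cj = subtree_sum_and_count(j)   # messages flowing towards the root
                s, c = s + sj, c + cj
            return s, c
        total, count = subtree_sum_and_count(root)
        return total / count                        # value broadcast back down the tree

    # Example: star graph with node 0 as master.
    children = {0: [1, 2, 3], 1: [], 2: [], 3: []}
    values = {0: 1.0, 1: 2.0, 2: 3.0, 3: 6.0}
    print(tree_average(children, values))           # 3.0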

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ R,

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006)

– Replace θ_i by a weighted average of its neighbors, ∑_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
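A minimal numpy sketch of synchronous gossip on a ring of n nodes (the particular W, with each node mixing itself and its two neighbours, is just one convenient symmetric doubly stochastic choice):

    # Sketch: synchronous gossip theta_t = W theta_{t-1} on a ring (illustrative).
    import numpy as np

    n = 20
    rng = np.random.default_rng(0)
    xi = rng.normal(size=n)                 # local observations xi_1, ..., xi_n

    W = np.zeros((n, n))                    # symmetric doubly stochastic matrix
    for i in range(n):
        W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3

    theta = xi.copy()
    for t in range(200):
        theta = W @ theta                   # one gossip round
    print(np.abs(theta - xi.mean()).max())  # distance to the exact average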

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: Eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
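The eigengap and the resulting (1/γ) log(1/ε) iteration estimate can be checked numerically on the ring example above (purely illustrative):

    # Sketch: eigengap of the ring gossip matrix and the predicted iteration count.
    import numpy as np

    n = 20
    W = np.zeros((n, n))
    for i in range(n):
        W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3   # same ring matrix as above

    lam = np.sort(np.linalg.eigvalsh(W))[::-1]
    gamma = lam[0] - lam[1]                            # eigengap 1 - lambda_2(W)
    eps = 1e-6
    print(gamma, np.log(1 / eps) / gamma)              # predicted number of gossip rounds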

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization

– Nedic and Ozdaglar (2009), Duchi et al. (2012), Wei and Ozdaglar (2012), Iutzeler et al. (2013), Shi et al. (2015), Jakovetic et al. (2015), Nedich et al. (2016), Mokhtari et al. (2016), Colin et al. (2016), Scaman et al. (2017), etc.

• Lower bound on complexity (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?
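One simple way of mixing gossip and optimization, in the spirit of the decentralized (sub)gradient methods cited above: each node alternates a gossip averaging of the iterates with a local gradient step. The sketch below (names and data are placeholders) is not the optimal scheme discussed next; with a constant step size it only converges to a neighbourhood of the solution.

    # Sketch: decentralized gradient descent = gossip on iterates + local gradient steps.
    import numpy as np

    def decentralized_gd(W, local_grads, d, gamma=0.05, n_iters=500):
        """W: (n, n) gossip matrix; local_grads[i](theta) = gradient of f_i at theta."""
        n = W.shape[0]
        Theta = np.zeros((n, d))                 # one local iterate theta_(i) per node
        for _ in range(n_iters):
            Theta = W @ Theta                    # gossip step on the iterates
            for i in range(n):
                Theta[i] -= gamma * local_grads[i](Theta[i])   # local gradient step
        return Theta.mean(axis=0)

    # Example: node i holds f_i(theta) = 0.5 * ||theta - c_i||^2, so the global
    # minimizer of (1/n) sum_i f_i is the average of the c_i.
    rng = np.random.default_rng(1)
    n, d = 20, 3
    C = rng.normal(size=(n, d))
    W = np.zeros((n, n))
    for i in range(n):
        W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3
    grads = [lambda th, c=C[i]: th - c for i in range(n)]
    print(np.abs(decentralized_gd(W, grads, d) - C.mean(axis=0)).max())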

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} ∑_{i=1}^n f_i(θ)
  = min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀ i ∼ j, θ(i) = θ(j)
  = min_{θ(1),…,θ(n)∈R^d} max_{∀ i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤ (θ(i) − θ(j))
  = max_{∀ i∼j, λ_ij∈R^d} min_{θ(1),…,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i=1}^n [θ(i)]⊤ linear_i(λ)
  = max_{∀ i∼j, λ_ij∈R^d} ∑_{i=1}^n function_i(λ) = max_{∀ i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step

– √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Not optimal ⇒ need accelerated gossip

Accelerated gossip

• Regular gossip

– Iterations: θ_t = W^t θ_0

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
– Shift-register gossip (Cao et al., 2006)
– Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ
– Optimal polynomial is the Chebyshev polynomial
– Can be computed online with the same cost as regular gossip, e.g., with the second-order recursion θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
– Replace γ^{−1} by γ^{−1/2} in the rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)

Illustration of accelerated gossip
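A hedged numpy sketch of this idea on the ring example, using a second-order recursion with a fixed relaxation parameter computed from λ_2(W); this constant-parameter tuning is a standard heavy-ball-style variant of the Chebyshev/shift-register schemes cited above, not a formula taken from the slides.

    # Sketch: plain gossip vs second-order (shift-register style) accelerated gossip.
    import numpy as np

    n = 100
    rng = np.random.default_rng(0)
    xi = rng.normal(size=n)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3        # ring gossip matrix

    lam2 = np.sort(np.linalg.eigvalsh(W))[-2]               # second largest eigenvalue
    omega = 2 * (1 - np.sqrt(1 - lam2 ** 2)) / lam2 ** 2    # fixed relaxation parameter

    plain = xi.copy()
    prev, curr = xi.copy(), xi.copy()
    for t in range(60):
        plain = W @ plain                                   # regular gossip
        prev, curr = curr, omega * (W @ curr) + (1 - omega) * prev   # accelerated
    target = xi.mean()
    print(np.abs(plain - target).max(), np.abs(curr - target).max())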

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of model defined by θ on dataset indexed by i
– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Single machine vs. “optimal” decentralized algorithm

  Algorithm                      gradient steps     communication
  Single machine algorithm       nm + √(nmκ)        0
  MSDA (Scaman et al., 2017)     m √κ               √(κ/γ)

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of model defined by θ on dataset indexed by i
– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• MSDA (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– “Optimal”, but still not adapted to machine learning
– Huge slow down when going from 1 to 2 machines
– Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps

– Adapted to functions of the type f_i(θ) = (1/m) ∑_{j=1}^m ℓ(y_ij, θ⊤Φ(x_ij))
– Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                      gradient steps     communication
  Single machine algorithm       nm + √(nmκ)        0
  MSDA (Scaman et al., 2017)     m √κ               √(κ/γ)
  ADFS (Hendrikx et al., 2019)   m + √(mκ)          √(κ/γ)

ADFS - Algorithm principle

• Minimizing ∑_{i=1}^n [ ∑_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² ]

– Create an equivalent graph
– Dual randomized coordinate ascent (with non-uniform sampling)
– Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

– Logistic regression with m = 10^4 observations per node in R^28
– Two-dimensional grid network

[figure: running-time comparison for n = 4 and n = 100 nodes]

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

– Logistic regression with nm ≈ 10^5 observations in R^47236
– Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms
– n machines and m observations per machine
– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems
– Matching running time complexity lower bounds
– Experiments on large-scale clouds
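To illustrate the speed-up claim with back-of-the-envelope numbers (the values of n, m and κ below are arbitrary placeholders), comparing the two gradient-step counts from the conclusion:

    # Sketch: gradient-step counts, single machine vs distributed (per node).
    import math

    n, m = 100, 10_000                 # machines and observations per machine
    for kappa in (1e2, 1e4, 1e6):      # condition number
        single = n * m + math.sqrt(n * m * kappa)
        per_node = m + math.sqrt(m * kappa)
        print(f"kappa={kappa:.0e}  single={single:.2e}  "
              f"per-node={per_node:.2e}  speed-up={single / per_node:.1f}x")

For well-conditioned problems (κ small compared with m), the ratio is close to n, i.e., a linear speed-up; for very ill-conditioned problems it degrades towards √n.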

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.



Page 24: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

convex

stronglyconvex

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

• Variance reduction
– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running-time to reach precision ε (with κ = condition number)

  Stochastic gradient descent :  d × κ × (1/ε)
  Gradient descent            :  d × nκ × log(1/ε)
  Variance reduction          :  d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
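Below is a minimal sketch of the variance-reduced update displayed above, in the SAGA style (stored per-function gradients plus their running mean), on a synthetic regularized logistic regression. The constant step size 1/(3L) is a common heuristic choice, not a value prescribed by the slides.

import numpy as np

# SAGA-style variance-reduced update on a synthetic ℓ²-regularized logistic regression.
rng = np.random.default_rng(0)
n, d, mu = 500, 10, 1e-3
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))
L = 0.25 * np.max(np.sum(X ** 2, axis=1)) + mu

def grad_fi(theta, i):
    m = y[i] * (X[i] @ theta)
    return -y[i] * X[i] / (1 + np.exp(m)) + mu * theta

theta = np.zeros(d)
table = np.array([grad_fi(theta, i) for i in range(n)])   # stored gradients y_i^t
table_mean = table.mean(axis=0)
gamma = 1.0 / (3 * L)                                     # heuristic constant step
for t in range(20 * n):
    i = rng.integers(n)
    g = grad_fi(theta, i)
    theta -= gamma * (g - table[i] + table_mean)          # ∇f_i(θ) + mean(y) − y_i
    table_mean += (g - table[i]) / n                      # keep the average in sync
    table[i] = g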

Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i
– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if node i holds m_i observations

• Each dataset/function f_i is only accessible by node i in a graph

[Figure: network of nodes 1–9 connected in a graph]

– Massive datasets, multiple machines/cores
– Communication/legal constraints

• Goal: minimize communication and local computation costs
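As a small illustration of what one node holds, here is a sketch of a local objective f_i and its gradient for the logistic loss with Φ = identity; the names Xi and yi for the node's m_i local observations are assumptions made for the example.

import numpy as np

# Sketch of one node's local objective f_i(θ) = (1/m_i) Σ_j ℓ(y_ij, θᵀΦ(x_ij))
# for the logistic loss with Φ = identity; Xi, yi are illustrative names for
# the m_i observations held locally by node i.
def local_value_and_grad(theta, Xi, yi):
    margins = yi * (Xi @ theta)
    value = np.mean(np.log1p(np.exp(-margins)))
    coeff = -yi / (1 + np.exp(margins))
    return value, (Xi * coeff[:, None]).mean(axis=0)

# Node i evaluates (f_i(θ), ∇f_i(θ)) on its own data; only θ and gradients
# would ever travel over the network, never the raw observations.
rng = np.random.default_rng(0)
Xi = rng.standard_normal((50, 8))
yi = np.sign(rng.standard_normal(50))
val, grad = local_value_and_grad(np.zeros(8), Xi, yi)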

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distribute a simple single machine algorithm?
– (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
– Requires √κ log(1/ε) full gradient computations to reach precision ε
– Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R,

  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

[Figure: a 9-node network and a spanning tree extracted from it]

• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness
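A sketch of the master/slave averaging scheme follows: values are summed up a BFS spanning tree to the root, which then knows the exact average and can broadcast it back down. The toy graph, the node values and the helper names are illustrative, and the message exchanges are simulated within one process.

from collections import deque

# Master/slave averaging over a spanning tree: sum values upward to the root,
# then broadcast the exact average back down (communication simulated here).
def bfs_spanning_tree(adj, root=0):
    parent = {root: None}
    order = [root]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
                queue.append(v)
    return parent, order                        # BFS order: parents before children

def tree_average(adj, values, root=0):
    parent, order = bfs_spanning_tree(adj, root)
    partial = dict(values)                      # running sums per node
    counts = {u: 1 for u in values}
    for u in reversed(order):                   # up-pass: leaves toward the root
        if parent[u] is not None:
            partial[parent[u]] += partial[u]
            counts[parent[u]] += counts[u]
    return partial[root] / counts[root]         # root now holds the exact average

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}    # toy 4-node network
values = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
print(tree_average(adj, values))                # 2.5; rounds scale with the tree depth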

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R,

  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap: γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
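A small numerical sketch of synchronous gossip follows. The matrix W is built with Metropolis weights, which is one standard way to obtain a symmetric doubly stochastic matrix on a given graph (the slides only assume that such a W is available); the graph and the values ξ_i are illustrative.

import numpy as np

# Synchronous gossip θ_t = W θ_{t−1} on a toy 5-node graph, with W built from
# Metropolis weights so that it is symmetric and doubly stochastic.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
deg = np.zeros(n)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))            # rows and columns sum to one

xi = np.array([1.0, 2.0, 3.0, 4.0, 10.0])    # local observations ξ_i
theta = xi.copy()
for t in range(50):
    theta = W @ theta                        # one synchronous gossip round
print(np.round(theta, 4), xi.mean())         # all entries approach the mean 4.0

gap = 1.0 - np.sort(np.linalg.eigvalsh(W))[-2]   # eigengap γ = 1 − λ₂(W)
# Roughly (1/gap)·log(1/ε) rounds are needed to reach precision ε.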

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization
– Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} Σ_{i=1}^n f_i(θ)
  = min_{θ(1),...,θ(n)∈R^d} Σ_{i=1}^n f_i(θ(i))   such that ∀ i ∼ j, θ(i) = θ(j)
  = min_{θ(1),...,θ(n)∈R^d} max_{∀ i∼j, λ_ij∈R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i∼j} λ_ij^⊤ (θ(i) − θ(j))
  = max_{∀ i∼j, λ_ij∈R^d} min_{θ(1),...,θ(n)∈R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i=1}^n [θ(i)]^⊤ linear_i(λ)
  = max_{∀ i∼j, λ_ij∈R^d} Σ_{i=1}^n function_i(λ) = max_{∀ i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
– √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Not optimal ⇒ need accelerated gossip
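To make the alternation of local computation and neighbor communication concrete, here is a toy sketch of plain (unaccelerated) dual gradient ascent for the special case f_i(θ) = ½(θ − ξ_i)², i.e. the averaging problem: the inner minimization over θ(i) is available in closed form, and the dual gradient for λ_ij is the disagreement θ(i) − θ(j) along edge (i, j). The graph and step size are illustrative assumptions.

import numpy as np

# Dual gradient ascent for the consensus formulation of averaging:
# local closed-form minimization, then one disagreement exchange per edge.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
A = np.zeros((len(edges), n))                # edge-node incidence matrix
for e, (i, j) in enumerate(edges):
    A[e, i], A[e, j] = 1.0, -1.0

xi = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
lam = np.zeros(len(edges))                   # one multiplier λ_ij per edge
step = 0.2
for t in range(300):
    theta = xi - A.T @ lam                   # local minimization at each node
    lam += step * (A @ theta)                # exchange disagreements on edges
print(np.round(theta, 4), xi.mean())         # nodes converge to the average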

Accelerated gossip

• Regular gossip
– Iterations: θ_t = W^t θ_0

• Accelerated gossip
– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
– Shift-register gossip (Cao et al., 2006)
– Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
– The optimal polynomial is the Chebyshev polynomial
– Can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
– Replace γ^{−1} by γ^{−1/2} in the rates

Illustration of accelerated gossip

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
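Below is a sketch of accelerated gossip: η_t = P_t(W)ξ with P_t a rescaled Chebyshev polynomial, computed online by a two-term recurrence whose per-round cost matches plain gossip. It assumes the contraction factor ρ of W off the consensus direction is known (consistent with the rate discussion above); the graph, the Metropolis weights and the number of rounds are illustrative.

import numpy as np

# Chebyshev-accelerated gossip: apply P_t(W) to ξ, where P_t(x) = T_t(x/ρ)/T_t(1/ρ)
# and T_t are Chebyshev polynomials; P_t(1) = 1 so the average is preserved.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n = 5
deg = np.zeros(n)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))          # symmetric doubly stochastic

xi = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
eigs = np.linalg.eigvalsh(W)
rho = max(abs(eigs[0]), abs(eigs[-2]))     # spectrum of W excluding λ₁ = 1

y_prev, y = xi.copy(), W @ xi              # P₀(W)ξ = ξ and P₁(W)ξ = Wξ
a_prev, a = 1.0, 1.0 / rho                 # a_t = T_t(1/ρ) (Chebyshev values)
for t in range(2, 16):
    a_next = (2.0 / rho) * a - a_prev
    y_next = (2.0 * a / (rho * a_next)) * (W @ y) - (a_prev / a_next) * y_prev
    y_prev, y = y, y_next
    a_prev, a = a, a_next
print(np.round(y, 6), xi.mean())           # noticeably closer to the mean than plain gossip after the same number of rounds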

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i
– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if node i holds m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                    | gradient steps | communication
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)

• MSDA (Scaman et al., 2017)
– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– "Optimal", but still not adapted to machine learning
– Huge slow-down when going from 1 to 2 machines
– Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps
– Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤Φ(x_ij))
– Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                    | gradient steps | communication
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019) | m + √(mκ)      | √(κ/γ)
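For intuition about the table, here is a back-of-envelope comparison of the three rows for some arbitrary illustrative values of n, m, κ and γ (the log(1/ε) factors are omitted, as in the table):

import math

n, m = 100, 10_000             # nodes and observations per node (illustrative)
kappa, gamma = 1_000.0, 0.01   # condition number and eigengap (assumed values)

single = n * m + math.sqrt(n * m * kappa)   # single machine: nm + √(nmκ)
msda = m * math.sqrt(kappa)                 # MSDA: m√κ gradient steps
adfs = m + math.sqrt(m * kappa)             # ADFS: m + √(mκ) gradient steps
comm = math.sqrt(kappa / gamma)             # communication for MSDA and ADFS

print(f"single machine: {single:.3g} gradient steps")
print(f"MSDA:           {msda:.3g} gradient steps, {comm:.3g} communications")
print(f"ADFS:           {adfs:.3g} gradient steps, {comm:.3g} communications")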

ADFS – Algorithm principle

• Minimizing Σ_{i=1}^n ( Σ_{j=1}^m f_{ij}(θ) + (σ_i/2) ‖θ‖² )
– Create an equivalent graph
– Dual randomized coordinate ascent (with non-uniform sampling)
– Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster
– Logistic regression with m = 10^4 observations per node in R^28
– Two-dimensional grid network

[Figure: running-time curves for n = 4 and n = 100 nodes]

• Running times on an actual cluster
– Logistic regression with nm ≈ 10^5 observations in R^47236
– Two-dimensional grid network with n = 100 nodes

[Figure: running-time curves]

Conclusions

• Distributed decentralized machine learning
– Distributing the fastest single machine algorithms
– n machines and m observations per machine
– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
– Linear speed-ups for well-conditioned problems

• Extensions
– Beyond convex problems
– Matching running time complexity lower bounds
– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 25: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

ndash Condition number κ = Lmicro gt 1

(small κ = Lmicro) (large κ = Lmicro)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 26: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)

bull Relevance of convex optimization

ndash Easier design and analysis of algorithms

ndash Global minimum vs local minimum vs stationary points

ndash Gradient-based algorithms only need convexity for their analysis

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

min_{θ∈R^d} g(θ) = (1/n) Σ_{i=1}^n fi(θ)

• Iteration: θt = θt−1 − γt f′_{i(t)}(θt−1)

– Sampling with replacement: i(t) random element of {1, …, n}

– Polyak–Ruppert averaging: θ̄t = (1/(t+1)) Σ_{u=0}^t θu

• Convergence rate if each fi is convex L-smooth and g is µ-strongly convex:

E g(θ̄t) − g(θ∗) ≤ O(1/√t) if γt = 1/(L√t), and ≤ O(L/(µt)) = O(κ/t) if γt = 1/(µt)

– No adaptivity to strong convexity in general

– Running-time complexity: O(d · κ/ε)

Stochastic gradient descent (SGD) for finite sums

min_{θ∈R^d} g(θ) = (1/n) Σ_{i=1}^n fi(θ)

• Iteration: θt = θt−1 − γt f′_{i(t)}(θt−1) (a minimal SGD sketch follows below)

– Sampling with replacement: i(t) random element of {1, …, n}

– Polyak–Ruppert averaging: θ̄t = (1/(t+1)) Σ_{u=0}^t θu

• Convergence rate if each fi is convex L-smooth and g is µ-strongly convex:

E g(θ̄t) − g(θ∗) ≤ O(1/√t) if γt = 1/(L√t), and ≤ O(L/(µt)) = O(κ/t) if γt = 1/(µt)

– No adaptivity to strong convexity in general

– Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
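A minimal SGD sketch for the finite sum above, with the 1/(µt)-type step size and Polyak–Ruppert averaging; the regularized least-squares losses, the step-size offset and the problem sizes are illustrative assumptions, not the slides' experiments.

import numpy as np

np.random.seed(2)
n, d, mu = 2000, 10, 0.1
X = np.random.randn(n, d)
y = X @ np.random.randn(d) + 0.1 * np.random.randn(n)
L = np.max(np.sum(X ** 2, axis=1)) + mu               # smoothness of the individual fi

def grad_i(theta, i):                                  # fi(θ) = ½(yi − xiᵀθ)² + (µ/2)‖θ‖²
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

theta = np.zeros(d)
theta_bar = np.zeros(d)
T = 50 * n
for t in range(1, T + 1):
    i = np.random.randint(n)                           # sampling with replacement
    gamma = 1.0 / (mu * (t + L / mu))                  # shifted 1/(µt) step size (offset avoids early blow-up)
    theta -= gamma * grad_i(theta, i)
    theta_bar += (theta - theta_bar) / t               # running Polyak–Ruppert average of the iterates

theta_star = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
print("distance to θ*, last iterate    :", np.linalg.norm(theta - theta_star))
print("distance to θ*, averaged iterate:", np.linalg.norm(theta_bar - theta_star))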

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

• Batch gradient descent: θt = θt−1 − γ ∇g(θt−1) = θt−1 − (γ/n) Σ_{i=1}^n ∇fi(θt−1)

– Exponential convergence rate in O(e^{−t/κ}) for convex problems

– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)

– Iteration complexity is linear in n

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

• Batch gradient descent: θt = θt−1 − γ ∇g(θt−1) = θt−1 − (γ/n) Σ_{i=1}^n ∇fi(θt−1)

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

• Batch gradient descent: θt = θt−1 − γ ∇g(θt−1) = θt−1 − (γ/n) Σ_{i=1}^n ∇fi(θt−1)

– Exponential convergence rate in O(e^{−t/κ}) for convex problems

– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)

– Iteration complexity is linear in n

• Stochastic gradient descent: θt = θt−1 − γt ∇f_{i(t)}(θt−1)

– Sampling with replacement: i(t) random element of {1, …, n}

– Convergence rate in O(κ/t)

– Iteration complexity is independent of n

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

• Batch gradient descent: θt = θt−1 − γ ∇g(θt−1) = θt−1 − (γ/n) Σ_{i=1}^n ∇fi(θt−1)

• Stochastic gradient descent: θt = θt−1 − γt ∇f_{i(t)}(θt−1)

Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost, simple choice of step size

[Figure: log(excess cost) vs. time, deterministic vs. stochastic methods]

Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost, simple choice of step size

[Figure: log(excess cost) vs. time, deterministic and stochastic methods vs. a new method combining both]

Recent progress in single machine optimization

• Variance reduction

– Exponential convergence with O(d) iteration cost

– SAG (Le Roux, Schmidt, and Bach, 2012)

– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)

– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θt = θt−1 − γ [ ∇f_{i(t)}(θt−1) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

Recent progress in single machine optimization

• Variance reduction

– Exponential convergence with O(d) iteration cost

– SAG (Le Roux, Schmidt, and Bach, 2012)

– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)

– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θt = θt−1 − γ [ ∇f_{i(t)}(θt−1) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t the stored value at time t of the gradient of the i-th function; a SAGA-style sketch follows below)
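A compact SAGA-style sketch of the variance-reduced update above, on the same kind of regularized least-squares finite sum; the data, the constant step size γ = 1/(3L) and the problem sizes are illustrative assumptions.

import numpy as np

np.random.seed(3)
n, d, mu = 1000, 10, 0.1
X = np.random.randn(n, d)
y = X @ np.random.randn(d) + 0.1 * np.random.randn(n)

def grad_i(theta, i):                                   # ∇fi with fi(θ) = ½(yi − xiᵀθ)² + (µ/2)‖θ‖²
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

L = np.max(np.sum(X ** 2, axis=1)) + mu
gamma = 1.0 / (3 * L)                                   # constant step size, no decay needed

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])  # stored gradients y_i
table_mean = table.mean(axis=0)

for t in range(20 * n):
    i = np.random.randint(n)
    g_new = grad_i(theta, i)
    theta -= gamma * (g_new - table[i] + table_mean)    # θt = θt−1 − γ[∇f_i(θt−1) − y_i + (1/n)Σ y_j]
    table_mean += (g_new - table[i]) / n                # keep the average of stored gradients up to date
    table[i] = g_new

theta_star = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
print("distance to θ* after SAGA-style updates:", np.linalg.norm(theta - theta_star))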

Recent progress in single machine optimization

• Variance reduction

– Exponential convergence with O(d) iteration cost

– SAG (Le Roux, Schmidt, and Bach, 2012)

– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)

– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

• Running-time to reach precision ε (with κ = condition number); a small numerical reading of this table follows below

Stochastic gradient descent: d × κ × (1/ε)
Gradient descent: d × nκ × log(1/ε)
Variance reduction: d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)

– Matching upper and lower bounds of complexity
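To read the table, a tiny back-of-the-envelope computation; the values of n, d, κ and ε below are assumptions chosen only for illustration.

import math

n, d, kappa, eps = 10**6, 10**3, 10**4, 10**-4
sgd    = d * kappa / eps                                       # d × κ × (1/ε)
gd     = d * n * kappa * math.log(1 / eps)                     # d × nκ × log(1/ε)
vr     = d * (n + kappa) * math.log(1 / eps)                   # d × (n + κ) × log(1/ε)
vr_acc = d * (n + math.sqrt(n * kappa)) * math.log(1 / eps)    # accelerated: n + √(nκ)
print(f"SGD {sgd:.1e}, GD {gd:.1e}, variance reduction {vr:.1e}, accelerated {vr_acc:.1e}")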

Outline

1. Parametric supervised learning on a single machine

– Machine learning ≈ optimization of finite sums

– From batch to stochastic gradient methods

– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks

– Centralized and decentralized methods

– From network averaging to optimization

– Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈R^d} (1/n) Σ_{i=1}^n fi(θ) = g(θ)

– fi(θ): error of model defined by θ on dataset indexed by i

– Example: fi(θ) = (1/mi) Σ_{j=1}^{mi} ℓ(yij, θ⊤Φ(xij)) if mi observations

• Each dataset/function fi only accessible by node i in a graph

[Figure: network graph with 9 nodes]


Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈R^d} (1/n) Σ_{i=1}^n fi(θ) = g(θ)

– fi(θ): error of model defined by θ on dataset indexed by i

– Example: fi(θ) = (1/mi) Σ_{j=1}^{mi} ℓ(yij, θ⊤Φ(xij)) if mi observations

• Each dataset/function fi only accessible by node i in a graph

– Massive datasets, multiple machines/cores

– Communication/legal constraints

• Goal: minimize communication and local computation costs

Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈R^d} (1/n) Σ_{i=1}^n fi(θ) = g(θ)

– fi(θ): error of model defined by θ on dataset indexed by i

• Why not simply distributing a simple single machine algorithm?

– (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θt = θt−1 − γ ∇g(θt−1)

– Requires √κ log(1/ε) full gradient computations to reach precision ε

– Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: given n observations ξ1, …, ξn ∈ R

– Compute θ∗ = (1/n) Σ_{i=1}^n ξi = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξi)²

Classical algorithms for distributed averaging

• Goal: given n observations ξ1, …, ξn ∈ R

– Compute θ∗ = (1/n) Σ_{i=1}^n ξi = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξi)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆

– Master/slave algorithm

[Figure: 9-node network graph and a spanning tree rooted at the master]

Classical algorithms for distributed averaging

• Goal: given n observations ξ1, …, ξn ∈ R

– Compute θ∗ = (1/n) Σ_{i=1}^n ξi = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξi)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆

– Master/slave algorithm: ∆ communication steps + no error (a toy spanning-tree averaging sketch follows below)

[Figure: 9-node network graph and a spanning tree rooted at the master]
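A small sketch of the master/slave idea on a spanning tree; for simplicity the assumed tree is a star rooted at the master (so ∆ = 2): values are aggregated up to the master and the exact average is broadcast back down.

import numpy as np

np.random.seed(4)
n = 9
xi = np.random.randn(n)                     # one observation per node
master = 0
children = [i for i in range(n) if i != master]

# Upward pass: each node sends its value (or partial subtree sum) towards the master
total = xi[master] + sum(xi[i] for i in children)

# Downward pass: the master broadcasts the exact average back down the tree
theta_star = total / n
estimates = np.full(n, theta_star)          # every node ends up with the exact average

print("exact average:", xi.mean(), "  value held at every node:", estimates[0])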


Classical algorithms for distributed averaging

• Goal: given n observations ξ1, …, ξn ∈ R

– Compute θ∗ = (1/n) Σ_{i=1}^n ξi = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξi)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆

– Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization

– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps

– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

Classical algorithms for distributed averaging

• Goal: given n observations ξ1, …, ξn ∈ R

– Compute θ∗ = (1/n) Σ_{i=1}^n ξi = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξi)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆

– Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization

– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps

– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness?

Classical algorithms for distributed averaging

• Goal: given n observations ξ1, …, ξn ∈ R

– Compute θ∗ = (1/n) Σ_{i=1}^n ξi = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξi)²

• Decentralized algorithms – gossip (Boyd et al., 2006)

– Replace θi by a weighted average of its neighbors: Σ_{j=1}^n Wij θj

– Potential asynchrony, changing network

[Figure: network graph with 9 nodes]


Classical algorithms for distributed averaging

• Goal: given n observations ξ1, …, ξn ∈ R

– Compute θ∗ = (1/n) Σ_{i=1}^n ξi = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξi)²

• Decentralized algorithms – gossip (Boyd et al., 2006)

– Replace θi by a weighted average of its neighbors: Σ_{j=1}^n Wij θj

– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θt = W θt−1 = W^t θ0 = W^t ξ

– Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θt = W θt−1 = W^t θ0 = W^t ξ

– Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θt = W θt−1 = W^t θ0 = W^t ξ

– Typical assumption: W symmetric doubly stochastic matrix

– Consequence: eigenvalues(W) ∈ [−1, 1]

– Eigengap γ = λ1(W) − λ2(W) = 1 − λ2(W)

– γ^{−1} = mixing time of the associated Markov chain

[Figure: network graph with 9 nodes]

– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging); a small gossip simulation follows below
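A small simulation of synchronous gossip on an assumed example graph (a ring, with weight 1/3 on each node and its two neighbours, hence a symmetric doubly stochastic W): the error to the true mean decays roughly like (1 − γ)^t, where γ is the eigengap.

import numpy as np

np.random.seed(5)
n = 20
xi = np.random.randn(n)

# Gossip matrix of a ring graph: symmetric and doubly stochastic
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigs[0] - eigs[1]                    # eigengap 1 − λ2(W)
print("eigengap:", gamma)

theta = xi.copy()
for t in range(200):                         # θt = W θt−1 = W^t ξ
    theta = W @ theta

print("max deviation from the mean:", np.abs(theta - xi.mean()).max())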

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization

– Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

Decentralized optimization

• Mixing gossip and optimization

• Lower bound on complexity (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– Plain gossip not optimal (need to gossip gradients with increasing precision)

Decentralized optimization

• Mixing gossip and optimization (a plain decentralized gradient sketch follows below)

• Lower bound on complexity (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?
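One simple way of mixing gossip and optimization (not the optimal algorithm of these slides) is plain decentralized gradient descent: every node takes a local gradient step and then gossips its iterate. The quadratic local functions and the ring network below are illustrative assumptions; the sketch also shows the residual disagreement that motivates gossiping gradients with increasing precision.

import numpy as np

np.random.seed(6)
n, d = 9, 5
targets = np.random.randn(n, d)                     # f_i(θ) = ½‖θ − ξ_i‖², one per node

# Gossip matrix of a ring over the n nodes (assumed network)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0

Theta = np.zeros((n, d))                            # row i = local iterate θ(i)
step = 0.1
for t in range(500):
    grads = Theta - targets                         # local gradients ∇f_i(θ(i))
    Theta = W @ (Theta - step * grads)              # local gradient step, then one gossip round

theta_star = targets.mean(axis=0)                   # minimizer of (1/n) Σ f_i

# With a constant step, the network average matches θ*, but individual nodes keep an O(step)
# disagreement: this is the gap behind "plain gossip not optimal" on the slide above.
print("distance of the network average to θ*:", np.linalg.norm(Theta.mean(axis=0) - theta_star))
print("max distance of a single node to θ*  :", np.linalg.norm(Theta - theta_star, axis=1).max())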

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} Σ_{i=1}^n fi(θ) = min_{θ(1),…,θ(n)∈R^d} Σ_{i=1}^n fi(θ(i)) such that ∀ i ∼ j, θ(i) = θ(j)

min_{θ(1),…,θ(n)∈R^d} max_{∀ i∼j, λij∈R^d} Σ_{i=1}^n fi(θ(i)) + Σ_{i∼j} λij⊤ (θ(i) − θ(j))

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} Σ_{i=1}^n fi(θ) = min_{θ(1),…,θ(n)∈R^d} Σ_{i=1}^n fi(θ(i)) such that ∀ i ∼ j, θ(i) = θ(j)

= min_{θ(1),…,θ(n)∈R^d} max_{∀ i∼j, λij∈R^d} Σ_{i=1}^n fi(θ(i)) + Σ_{i∼j} λij⊤ (θ(i) − θ(j))

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} Σ_{i=1}^n fi(θ) = min_{θ(1),…,θ(n)∈R^d} Σ_{i=1}^n fi(θ(i)) such that ∀ i ∼ j, θ(i) = θ(j)

= min_{θ(1),…,θ(n)∈R^d} max_{∀ i∼j, λij∈R^d} Σ_{i=1}^n fi(θ(i)) + Σ_{i∼j} λij⊤ (θ(i) − θ(j))

= max_{∀ i∼j, λij∈R^d} min_{θ(1),…,θ(n)∈R^d} Σ_{i=1}^n fi(θ(i)) + Σ_{i=1}^n [θ(i)]⊤ lineari(λ)

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} Σ_{i=1}^n fi(θ) = min_{θ(1),…,θ(n)∈R^d} Σ_{i=1}^n fi(θ(i)) such that ∀ i ∼ j, θ(i) = θ(j)

= min_{θ(1),…,θ(n)∈R^d} max_{∀ i∼j, λij∈R^d} Σ_{i=1}^n fi(θ(i)) + Σ_{i∼j} λij⊤ (θ(i) − θ(j))

= max_{∀ i∼j, λij∈R^d} min_{θ(1),…,θ(n)∈R^d} Σ_{i=1}^n fi(θ(i)) + Σ_{i=1}^n [θ(i)]⊤ lineari(λ)

= max_{∀ i∼j, λij∈R^d} Σ_{i=1}^n functioni(λ) = max_{∀ i∼j, λij∈R^d} function(λ)

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} Σ_{i=1}^n fi(θ) = min_{θ(1),…,θ(n)∈R^d} Σ_{i=1}^n fi(θ(i)) such that ∀ i ∼ j, θ(i) = θ(j)

= min_{θ(1),…,θ(n)∈R^d} max_{∀ i∼j, λij∈R^d} Σ_{i=1}^n fi(θ(i)) + Σ_{i∼j} λij⊤ (θ(i) − θ(j))

= max_{∀ i∼j, λij∈R^d} min_{θ(1),…,θ(n)∈R^d} Σ_{i=1}^n fi(θ(i)) + Σ_{i=1}^n [θ(i)]⊤ lineari(λ)

= max_{∀ i∼j, λij∈R^d} Σ_{i=1}^n functioni(λ) = max_{∀ i∼j, λij∈R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)

⇔ alternating local gradient computations and a gossip step (a dual-ascent sketch follows below)

– √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– Not optimal ⇒ need accelerated gossip
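To see why the dual reformulation turns communication into (dual) gradient steps, here is a small dual-ascent sketch for quadratic local functions f_i(θ) = ½‖θ − ξ_i‖²: the gradient of the dual with respect to λij is the disagreement θ(i) − θ(j), so each dual step only exchanges values along edges. The ring graph, the step size and the quadratic losses are illustrative assumptions (the slides use accelerated gradient descent on this dual instead of plain ascent).

import numpy as np

np.random.seed(7)
n, d = 9, 3
xi = np.random.randn(n, d)                               # f_i(θ) = ½‖θ − ξ_i‖²
edges = [(i, (i + 1) % n) for i in range(n)]             # ring graph, one λ_ij per edge
lam = {e: np.zeros(d) for e in edges}

step = 0.3
for t in range(2000):
    # Inner minimization in θ(i) is closed form: θ(i) = ξ_i − Σ_{(i,j)} λ_ij + Σ_{(j,i)} λ_ji
    theta = xi.copy()
    for (i, j), l in lam.items():
        theta[i] -= l
        theta[j] += l
    # Dual gradient ascent: the partial gradient with respect to λ_ij is θ(i) − θ(j)
    for (i, j) in edges:
        lam[(i, j)] += step * (theta[i] - theta[j])

print("consensus violation:", max(np.linalg.norm(theta[i] - theta[j]) for i, j in edges))
print("distance of θ(1) to the average of the ξ_i:", np.linalg.norm(theta[0] - xi.mean(axis=0)))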

Accelerated gossip

• Regular gossip

– Iterations: θt = W^t θ0

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)

– Shift-register gossip (Cao et al., 2006)

– Linear combinations ⇔ ηt = Σ_{k=0}^t αk θk = Σ_{k=0}^t αk W^k ξ = Pt(W) ξ

– Optimal polynomial is the Chebyshev polynomial

– Can be computed online with same cost as regular gossip, e.g., θt = ωt W θt−1 + (1 − ωt) θt−1

– Replace γ^{−1} by γ^{−1/2} in rates


Illustration of accelerated gossip

Accelerated gossip

• Regular gossip

– Iterations: θt = W^t θ0

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)

– Shift-register gossip (Cao et al., 2006)

– Linear combinations ⇔ ηt = Σ_{k=0}^t αk θk = Σ_{k=0}^t αk W^k ξ = Pt(W) ξ

– Optimal polynomial is the Chebyshev polynomial

– Can be computed online with same cost as regular gossip, e.g., θt = ωt W θt−1 + (1 − ωt) θt−1 (a two-term recursion sketch follows below)

– Replace γ^{−1} by γ^{−1/2} in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
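A small accelerated-gossip sketch on the same assumed ring network. The slide writes the online Chebyshev recursion with time-varying weights; the sketch below uses the closely related fixed-weight shift-register recursion (in the spirit of Cao et al., 2006), which already replaces the γ^{−1} dependence by roughly γ^{−1/2}.

import numpy as np

np.random.seed(8)
n = 20
xi = np.random.randn(n)

# Ring gossip matrix as before (assumed example network)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0

lam2 = np.sort(np.linalg.eigvalsh(W))[-2]                # second largest eigenvalue, = 1 − γ
omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))           # fixed shift-register weight

plain = xi.copy()
prev, curr = xi.copy(), W @ xi                            # the two-term recursion needs two initial iterates
for t in range(60):
    plain = W @ plain                                     # regular gossip: θt = W^t ξ
    prev, curr = curr, omega * (W @ curr) + (1 - omega) * prev   # accelerated two-term recursion

print("error, plain gossip      :", np.abs(plain - xi.mean()).max())
print("error, accelerated gossip:", np.abs(curr - xi.mean()).max())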

Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈R^d} (1/n) Σ_{i=1}^n fi(θ) = g(θ)

– fi(θ): error of model defined by θ on dataset indexed by i

– Example: fi(θ) = (1/mi) Σ_{j=1}^{mi} ℓ(yij, θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈R^d} (1/n) Σ_{i=1}^n fi(θ) = g(θ)

– fi(θ): error of model defined by θ on dataset indexed by i

– Example: fi(θ) = (1/mi) Σ_{j=1}^{mi} ℓ(yij, θ⊤Φ(xij)) if mi observations

• Single machine vs. "optimal" decentralized algorithm

Algorithm | gradient steps | communication
Single machine algorithm | nm + √(nmκ) | 0
MSDA (Scaman et al., 2017) | m√κ | √(κ/γ)

Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈R^d} (1/n) Σ_{i=1}^n fi(θ) = g(θ)

– fi(θ): error of model defined by θ on dataset indexed by i

– Example: fi(θ) = (1/mi) Σ_{j=1}^{mi} ℓ(yij, θ⊤Φ(xij)) if mi observations

• MSDA (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– "Optimal", but still not adapted to machine learning

– Huge slow-down when going from 1 to 2 machines

– Only synchronous

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps

– Adapted to functions of the type fi(θ) = (1/m) Σ_{j=1}^m ℓ(yij, θ⊤Φ(xij))

– Allows for partial asynchrony

• n computing nodes with m observations each (a small numeric comparison follows below)

Algorithm | gradient steps | communication
Single machine algorithm | nm + √(nmκ) | 0
MSDA (Scaman et al., 2017) | m√κ | √(κ/γ)
ADFS (Hendrikx et al., 2019) | m + √(mκ) | √(κ/γ)
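To make the trade-offs in the table concrete, a tiny computation; the values of n, m and κ are assumptions for illustration only, not the paper's experiments.

import math

n, m, kappa = 100, 10**4, 10**3               # illustrative: 100 nodes, 10^4 points each
single = n * m + math.sqrt(n * m * kappa)     # single machine: nm + √(nmκ)
msda   = m * math.sqrt(kappa)                 # MSDA: m√κ
adfs   = m + math.sqrt(m * kappa)             # ADFS: m + √(mκ)
print(f"gradient steps: single machine {single:.2e}, MSDA {msda:.2e}, ADFS {adfs:.2e}")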

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n Σ_{j=1}^m fij(θ) + (σi/2) ‖θ‖²

– Create an equivalent graph

– Dual randomized coordinate ascent (with non-uniform sampling)

– Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

– Logistic regression with m = 10^4 observations per node in R^28

– Two-dimensional grid network

[Figures: error vs. time for n = 4 and n = 100 nodes]

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

– Logistic regression with nm ≈ 10^5 observations in R^47236

– Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms

– n machines and m observations per machine

– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps

– Linear speed-ups for well-conditioned problems

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms

– n machines and m observations per machine

– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps

– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems

– Matching running time complexity lower bounds

– Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013





Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})

– O(1/t) convergence rate for convex functions

– O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g′′(θ_{t−1})^{−1} g′(θ_{t−1})

– O(e^{−ρ 2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)

1. No need to optimize below statistical error

2. Cost functions are averages

3. Testing error is more important than training error
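As a rough illustration of the cost trade-off above, a Newton step only needs a gradient, a Hessian, and one linear solve per iteration; the sketch below uses ℓ2-regularized logistic regression as an assumed example (the data X, labels y in {0, 1}, and the regularization λ are placeholders, not part of the slides).

import numpy as np

def newton_logreg(X, y, lam=1e-3, iters=10):
    # Regularized logistic regression: g(θ) = (1/n) Σ log(1 + exp(−y_i x_i'θ)) + (λ/2)||θ||²
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):                        # O(log log 1/ε) iterations in theory
        p = 1.0 / (1.0 + np.exp(-X @ theta))      # σ(x_i'θ), labels y in {0, 1}
        grad = X.T @ (p - y) / n + lam * theta
        W = p * (1 - p)
        hess = (X * W[:, None]).T @ X / n + lam * np.eye(d)   # O(nd² + d³) per step
        theta -= np.linalg.solve(hess, grad)      # Newton step: θ ← θ − g''(θ)^{-1} g'(θ)
    return theta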

Stochastic gradient descent (SGD) for finite sums

min_{θ∈R^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

– Sampling with replacement: i(t) random element of {1, ..., n}

– Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex L-smooth and g is µ-strongly convex:

E g(θ̄_t) − g(θ∗) ≤ O(1/√t) if γ_t = 1/(L√t)

E g(θ̄_t) − g(θ∗) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)

– No adaptivity to strong convexity in general

– Running-time complexity: O(d · κ/ε)

• NB: single pass leads to bounds on testing error
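A minimal sketch of this recursion with the γ_t = 1/(µt) step size and Polyak-Ruppert averaging; the per-function gradient oracle grad_fi(i, θ) is an assumed input.

import numpy as np

def sgd_finite_sum(grad_fi, n, d, mu, iters=10**5, seed=0):
    # θ_t = θ_{t−1} − γ_t f'_{i(t)}(θ_{t−1}),  γ_t = 1/(µ t),  i(t) uniform in {1,...,n}
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    theta_bar = np.zeros(d)                       # Polyak-Ruppert running average
    for t in range(1, iters + 1):
        i = rng.integers(n)                       # sampling with replacement
        theta = theta - (1.0 / (mu * t)) * grad_fi(i, theta)
        theta_bar += (theta - theta_bar) / (t + 1)
    return theta_bar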

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})

– Exponential convergence rate in O(e^{−t/κ}) for convex problems

– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)

– Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})

– Sampling with replacement: i(t) random element of {1, ..., n}

– Convergence rate in O(κ/t)

– Iteration complexity is independent of n

• Goal = best of both worlds: linear rate with O(d) iteration cost, simple choice of step size

(plot: log(excess cost) vs. time, comparing deterministic, stochastic, and new methods)

Recent progress in single machine optimization

• Variance reduction

– Exponential convergence with O(d) iteration cost

– SAG (Le Roux, Schmidt, and Bach, 2012)

– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)

– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t: stored value at time t of the gradient of the i-th function)

• Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent:  d × κ × (1/ε)
Gradient descent:             d × nκ × log(1/ε)
Variance reduction:           d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)

– Matching upper and lower bounds of complexity
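A compact sketch of a SAGA-style update implementing the formula above, keeping a table y of the last gradient seen for each function; the gradient oracle grad_fi(i, θ) is an assumed input and the step size γ = 1/(3L) is one common choice, not prescribed by the slides.

import numpy as np

def saga(grad_fi, n, d, L, iters=10**5, seed=0):
    # θ_t = θ_{t−1} − γ [ ∇f_i(θ_{t−1}) − y_i + (1/n) Σ_j y_j ], then y_i is refreshed
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    y = np.zeros((n, d))              # stored gradients y_i
    y_mean = np.zeros(d)              # running value of (1/n) Σ_j y_j
    gamma = 1.0 / (3 * L)
    for _ in range(iters):
        i = rng.integers(n)
        g = grad_fi(i, theta)
        theta = theta - gamma * (g - y[i] + y_mean)
        y_mean += (g - y[i]) / n      # keep the average consistent with the table
        y[i] = g
    return theta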

Outline

1. Parametric supervised learning on a single machine

– Machine learning ≈ optimization of finite sums

– From batch to stochastic gradient methods

– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks

– Centralized and decentralized methods

– From network averaging to optimization

– Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of model defined by θ on dataset indexed by i

– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Each dataset / function f_i only accessible by node i in a graph

(figure: network of 9 nodes)

– Massive datasets, multiple machines / cores

– Communication / legal constraints

• Goal: Minimize communication and local computation costs

• Why not simply distributing a simple single machine algorithm?

– (accelerated) gradient descent (see, e.g., Nesterov, 2004)

θ_t = θ_{t−1} − γ∇g(θ_{t−1})

– Requires √κ log(1/ε) full gradient computations to reach precision ε

– Need to perform distributed averaging over a network
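The last bullet can be read as follows: one distributed (full) gradient step is itself an averaging problem, since each node only knows ∇f_i(θ). A simulated master/slave sketch, with the local gradient functions as assumed inputs:

import numpy as np

def distributed_gradient_descent(local_grads, d, gamma, iters=100):
    # local_grads[i](θ) = ∇f_i(θ), held by node i; a "master" averages them.
    theta = np.zeros(d)
    for _ in range(iters):
        # each node computes its gradient in parallel (simulated sequentially here)
        grads = [g(theta) for g in local_grads]
        # master averages and broadcasts the new iterate: θ ← θ − γ ∇g(θ)
        theta = theta - gamma * np.mean(grads, axis=0)
    return theta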

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, ..., ξ_n ∈ R

– Compute θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆

– Master/slave algorithm: ∆ communication steps + no error

(figure: 9-node network and its spanning tree)

• Application to centralized distributed optimization

– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps

– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness?

• Decentralized algorithms - gossip (Boyd et al., 2006)

– Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_ij θ_j

– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ

– Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ

– Typical assumption: W symmetric doubly stochastic matrix

– Consequence: Eigenvalues(W) ∈ [−1, 1]

– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)

– γ^{−1} = mixing time of the associated Markov chain

(figure: 9-node network)

– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
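A small sketch of synchronous gossip on a ring of n nodes; the ring topology and the 1/2–1/4–1/4 neighbor weights are illustrative assumptions (any symmetric doubly stochastic W supported on the graph would do), and the eigengap γ computed below is the quantity controlling the (1/γ) log(1/ε) rate.

import numpy as np

n = 20
xi = np.random.default_rng(0).standard_normal(n)    # one value per node

# Symmetric doubly stochastic gossip matrix on a ring: 1/2 self, 1/4 each neighbor
W = 0.5 * np.eye(n)
for i in range(n):
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigvals[0] - eigvals[1]                      # eigengap 1 − λ₂(W)

theta = xi.copy()
for t in range(int(5 / gamma)):                      # ≈ (1/γ) log(1/ε) iterations
    theta = W @ theta                                # θ_t = W θ_{t−1}
# every coordinate of theta is now close to the average xi.mean()

On this ring γ is small (a few percent), so plain gossip needs a few hundred rounds; the accelerated variants discussed below reduce the dependence from 1/γ to 1/√γ.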

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization

– Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– Plain gossip not optimal! (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} ∑_{i=1}^n f_i(θ)

= min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) such that ∀i∼j, θ(i) = θ(j)

= min_{θ(1),...,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i∼j} λ_ij⊤(θ(i) − θ(j))

= max_{∀i∼j, λ_ij∈R^d} min_{θ(1),...,θ(n)∈R^d} ∑_{i=1}^n f_i(θ(i)) + ∑_{i=1}^n [θ(i)]⊤ linear_i(λ)

= max_{∀i∼j, λ_ij∈R^d} ∑_{i=1}^n function_i(λ) = max_{∀i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent on the dual (Scaman et al., 2017)

⇔ alternating local gradient computations and a gossip step

– √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– Not optimal ⇒ need accelerated gossip

Accelerated gossip

• Regular gossip

– Iterations: θ_t = W^t θ_0

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)

– Shift-register gossip (Cao et al., 2006)

– Linear combinations ⇔ η_t = ∑_{k=0}^t α_k θ_k = ∑_{k=0}^t α_k W^k ξ = P_t(W) ξ

– Optimal polynomial is the Chebyshev polynomial

– Can be computed online with same cost as regular gossip, e.g.,
  θ_t = ω_t Wθ_{t−1} + (1 − ω_t) θ_{t−2}

– Replace γ^{−1} by γ^{−1/2} in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)

Illustration of accelerated gossip
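A sketch of the second-order recursion above run with a constant weight ω, reusing a symmetric doubly stochastic W such as the ring matrix from the earlier gossip sketch. The closed-form choice of ω from λ₂(W) is a standard one for shift-register/Chebyshev-type acceleration and is an assumption here, not taken from the slides.

import numpy as np

def accelerated_gossip(W, xi, iters):
    # θ_t = ω W θ_{t−1} + (1 − ω) θ_{t−2}: second-order (momentum) gossip recursion
    lam2 = np.sort(np.linalg.eigvalsh(W))[-2]        # second largest eigenvalue of W
    omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))   # constant heavy-ball-type weight
    theta_prev, theta = xi.copy(), W @ xi
    for _ in range(iters):
        theta, theta_prev = omega * (W @ theta) + (1 - omega) * theta_prev, theta
    return theta    # contracts at roughly (1 − √γ) per step instead of (1 − γ)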

Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of model defined by θ on dataset indexed by i

– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

Algorithm                        gradient steps     communication
Single machine algorithm         nm + √(nmκ)        0
MSDA (Scaman et al., 2017)       m√κ                √(κ/γ)

• MSDA (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– "Optimal", but still not adapted to machine learning

– Huge slow down when going from 1 to 2 machines

– Only synchronous

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps

– Adapted to functions of the type f_i(θ) = (1/m) ∑_{j=1}^m ℓ(y_ij, θ⊤Φ(x_ij))

– Allows for partial asynchrony

• n computing nodes with m observations each

Algorithm                        gradient steps     communication
Single machine algorithm         nm + √(nmκ)        0
MSDA (Scaman et al., 2017)       m√κ                √(κ/γ)
ADFS (Hendrikx et al., 2019)     m + √(mκ)          √(κ/γ)

ADFS - Algorithm principle

• Minimizing ∑_{i=1}^n ( ∑_{j=1}^m f_ij(θ) + (σ_i/2)‖θ‖² )

– Create an equivalent graph

– Dual randomized coordinate ascent (with non-uniform sampling)

– Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

– Logistic regression with m = 10^4 observations per node in R^28

– Two-dimensional grid network

(figures: n = 4 and n = 100)

• Running times on an actual cluster

– Logistic regression with nm ≈ 10^5 observations in R^47236

– Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms

– n machines and m observations per machine

– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps

– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems

– Matching running time complexity lower bounds

– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 29: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 30: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Smoothness and (strong) convexity

bull A twice differentiable function g Rd rarr R is micro-strongly convex

if and only if

forallθ isin Rd eigenvalues

[

gprimeprime(θ)]

gt micro

bull Strong convexity in machine learning

ndash With g(θ) = 1n

sumni=1 ℓ(yi h(xi θ))

ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)

ndash Invertible covariance matrix 1n

sumni=1Φ(xi)Φ(xi)

⊤ rArr n gt d

ndash Even when micro gt 0 micro may be arbitrarily small

bull Adding regularization by micro2θ2

ndash creates additional bias unless micro is small but reduces variance

ndash Typically Lradicn gt micro gt Ln rArr κ isin [

radicn n]

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (with line search)
  – g(θ_t) − g(θ*) ≤ O(1/t) for convex functions
  – g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex, with κ = L/µ
    ⇔ O(κ log(1/ε)) iterations ⇔ complexity = O(nd · κ log(1/ε))

• Acceleration (Nesterov, 1983): second-order recursion
  θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})
  – Good choice of momentum term δ_t ∈ [0, 1)
  – g(θ_t) − g(θ*) ≤ O(1/t²), and O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ 2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations
    ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
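As an illustration of the two recursions above, the following sketch compares plain gradient descent with the Nesterov recursion on a synthetic strongly convex quadratic, using the classical constant momentum δ = (√κ − 1)/(√κ + 1) (a standard choice for strongly convex problems, not specified on the slides):

    import numpy as np

    # Strongly convex quadratic g(theta) = 1/2 theta^T A theta - b^T theta (synthetic).
    rng = np.random.default_rng(0)
    d = 30
    M = rng.standard_normal((d, d))
    A = M.T @ M / d + 0.05 * np.eye(d)   # smallest eigenvalue at least 0.05
    b = rng.standard_normal(d)

    eigs = np.linalg.eigvalsh(A)
    L, mu = eigs.max(), eigs.min()
    theta_star = np.linalg.solve(A, b)

    def grad(theta):
        return A @ theta - b

    # Plain gradient descent with step 1/L.
    theta = np.zeros(d)
    for _ in range(200):
        theta = theta - grad(theta) / L

    # Nesterov acceleration: theta_t = eta_{t-1} - gamma grad(eta_{t-1}),
    # eta_t = theta_t + delta (theta_t - theta_{t-1}).
    kappa = L / mu
    delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    theta_acc, theta_prev, eta = np.zeros(d), np.zeros(d), np.zeros(d)
    for _ in range(200):
        theta_prev, theta_acc = theta_acc, eta - grad(eta) / L
        eta = theta_acc + delta * (theta_acc - theta_prev)

    print("GD error         :", np.linalg.norm(theta - theta_star))
    print("Accelerated error:", np.linalg.norm(theta_acc - theta_star))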

Stochastic gradient descent (SGD) for finite sums

  min_{θ∈ℝ^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, ..., n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex and L-smooth and g is µ-strongly convex:
  E g(θ̄_t) − g(θ*) ≤ O(1/√t) if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)
  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
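A minimal sketch of this iteration on synthetic logistic-regression data, with step size γ_t = 1/(µt) and the Polyak-Ruppert average kept as a running mean; everything here (data, dimensions, regularization) is illustrative rather than the slides' setup:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, mu = 1000, 20, 1e-1
    X = rng.standard_normal((n, d))
    y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

    def grad_i(theta, i):
        # Gradient of f_i(theta) = log(1 + exp(-y_i x_i^T theta)) + mu/2 ||theta||^2
        margin = y[i] * (X[i] @ theta)
        coef = np.exp(-np.logaddexp(0.0, margin))   # 1 / (1 + exp(margin)), stable
        return -y[i] * X[i] * coef + mu * theta

    def full_loss(theta):
        margins = y * (X @ theta)
        return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * theta @ theta

    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for t in range(1, 20 * n + 1):
        i = rng.integers(n)                      # sampling with replacement
        theta -= grad_i(theta, i) / (mu * t)     # step size gamma_t = 1/(mu t)
        theta_avg += (theta - theta_avg) / t     # running Polyak-Ruppert average

    print("g(0) =", full_loss(np.zeros(d)), " g(theta_avg) =", full_loss(theta_avg))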

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, ..., n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n

Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost and a simple choice of step size

[Figure: log(excess cost) vs. time, comparing the deterministic, stochastic, and new (variance-reduced) methods.]

Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function; this update is spelled out in the sketch below)

• Running time to reach precision ε (with κ = condition number)

  Stochastic gradient descent : d × κ       × 1/ε
  Gradient descent            : d × nκ      × log(1/ε)
  Variance reduction          : d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
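A SAGA-style sketch of the update above on a synthetic ridge-regression finite sum; the table of stored gradients plays the role of the y_i^t, and the step size 1/(3L) is a common heuristic choice rather than something prescribed by the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 500, 20, 1e-1
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.1 * rng.standard_normal(n)

    def grad_i(theta, i):
        # Gradient of f_i(theta) = 1/2 (x_i^T theta - y_i)^2 + lam/2 ||theta||^2
        return (X[i] @ theta - y[i]) * X[i] + lam * theta

    L = np.max(np.sum(X**2, axis=1)) + lam   # crude per-function smoothness bound
    gamma = 1.0 / (3 * L)

    theta = np.zeros(d)
    table = np.array([grad_i(theta, i) for i in range(n)])   # stored gradients y_i
    table_mean = table.mean(axis=0)

    for t in range(20 * n):
        i = rng.integers(n)
        g_new = grad_i(theta, i)
        # Variance-reduced direction: new gradient minus stored one, plus table average.
        theta -= gamma * (g_new - table[i] + table_mean)
        table_mean += (g_new - table[i]) / n
        table[i] = g_new

    theta_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    print("distance to ridge solution:", np.linalg.norm(theta - theta_star))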

Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:
  min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if node i holds m_i observations

• Each dataset/function f_i is only accessible by node i in a graph
  – Massive datasets, multiple machines/cores
  – Communication / legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
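A sketch of this local setting: each node i only sees its own data (X_i, y_i) and can evaluate f_i and ∇f_i; the global gradient of g is then the network average of local gradients, which is exactly what distributed averaging has to provide. Data, loss, and dimensions below are illustrative:

    import numpy as np

    # Each node i holds (X_i, y_i) and can only evaluate its own f_i and its gradient.
    rng = np.random.default_rng(0)
    n_nodes, m, d = 9, 100, 10
    local_data = [(rng.standard_normal((m, d)),
                   np.sign(rng.standard_normal(m))) for _ in range(n_nodes)]

    def local_grad(i, theta):
        # Gradient of f_i(theta) = (1/m) sum_j log(1 + exp(-y_ij theta^T x_ij)).
        X, y = local_data[i]
        margins = y * (X @ theta)
        coefs = np.exp(-np.logaddexp(0.0, margins))   # 1 / (1 + exp(margin)), stable
        return -(X * (y * coefs)[:, None]).mean(axis=0)

    theta = np.zeros(d)
    # The gradient of g = (1/n) sum_i f_i is the average of the local gradients.
    full_grad = np.mean([local_grad(i, theta) for i in range(n_nodes)], axis=0)
    print("||grad g(0)|| =", np.linalg.norm(full_grad))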

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ ℝ, compute
  θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness?
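One possible implementation of the master/slave scheme (illustrative 7-node graph, node 0 taken as master, not the slides' 9-node example): build a BFS spanning tree, aggregate partial sums towards the root, then broadcast the exact average back; the number of communication rounds scales with the tree depth:

    import numpy as np
    from collections import deque

    edges = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (2, 6)]
    n = 7
    neighbors = {i: [] for i in range(n)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # BFS from the master (node 0) gives a spanning tree as parent pointers.
    parent = {0: None}
    order = []
    queue = deque([0])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in neighbors[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)

    rng = np.random.default_rng(0)
    xi = rng.standard_normal(n)            # one observation per node

    # Up-sweep: each node sends (sum, count) to its parent, deepest nodes first.
    sums, counts = xi.copy(), np.ones(n)
    for u in reversed(order[1:]):
        sums[parent[u]] += sums[u]
        counts[parent[u]] += counts[u]

    # Down-sweep: the master broadcasts the exact average back along the tree.
    average = sums[0] / counts[0]
    values = np.full(n, average)
    print(average, "vs", xi.mean())        # identical, after ~2 x depth rounds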

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ ℝ, compute
  θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors, Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
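A synchronous gossip sketch on a ring of n nodes, with a hand-built symmetric doubly stochastic W (weight 1/2 on the node itself and 1/4 on each neighbour; this particular W is only an example):

    import numpy as np

    n = 20
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25

    rng = np.random.default_rng(0)
    xi = rng.standard_normal(n)
    theta = xi.copy()
    for t in range(200):
        theta = W @ theta              # theta_t = W theta_{t-1} = W^t xi

    print("max deviation from the mean:", np.abs(theta - xi.mean()).max())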

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
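The eigengap and the resulting iteration estimate can be computed directly from W; the sketch below rebuilds the ring matrix of the previous snippet (again an illustrative choice) so that it is self-contained:

    import numpy as np

    n, eps = 20, 1e-6
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

    eigs = np.sort(np.linalg.eigvalsh(W))[::-1]   # eigenvalues in [-1, 1], descending
    gamma = eigs[0] - eigs[1]                     # 1 - lambda_2(W)
    print("eigengap gamma =", gamma)
    print("predicted iterations ~", int(np.log(1 / eps) / gamma))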

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization
  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013);
    Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016);
    Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip is not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈ℝ^d} Σ_{i=1}^n f_i(θ)
    = min_{θ^(1),...,θ^(n)∈ℝ^d} Σ_{i=1}^n f_i(θ^(i))  such that  ∀ i∼j, θ^(i) = θ^(j)
    = min_{θ^(1),...,θ^(n)∈ℝ^d} max_{∀i∼j, λ_ij∈ℝ^d} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i∼j} λ_ij^⊤ (θ^(i) − θ^(j))
    = max_{∀i∼j, λ_ij∈ℝ^d} min_{θ^(1),...,θ^(n)∈ℝ^d} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i=1}^n [θ^(i)]^⊤ linear_i(λ)
    = max_{∀i∼j, λ_ij∈ℝ^d} Σ_{i=1}^n function_i(λ) = max_λ function(λ)

  (the node-wise regrouping in the third equality is checked numerically in the sketch below)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip
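A quick numerical check, on an arbitrary small graph, of the rearrangement used above: the edge term Σ_{i∼j} λ_ij^⊤(θ^(i) − θ^(j)) regroups node-wise into Σ_i [θ^(i)]^⊤ linear_i(λ), where linear_i(λ) is the signed sum of the multipliers on edges incident to i (all names and the graph are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 3
    edges = [(0, 1), (1, 2), (1, 3), (3, 4)]        # i ~ j, arbitrarily oriented
    n = 5
    theta = rng.standard_normal((n, d))
    lam = {e: rng.standard_normal(d) for e in edges}

    # Edge-wise form: sum over edges of lambda_ij^T (theta_i - theta_j).
    lhs = sum(lam[(i, j)] @ (theta[i] - theta[j]) for (i, j) in edges)

    # Node-wise form: collect +lambda_ij at node i and -lambda_ij at node j.
    linear = np.zeros((n, d))
    for (i, j) in edges:
        linear[i] += lam[(i, j)]
        linear[j] -= lam[(i, j)]
    rhs = sum(theta[i] @ linear[i] for i in range(n))

    print(np.isclose(lhs, rhs))   # True: the dual term decouples across nodes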

Accelerated gossip

• Regular gossip: iterations θ_t = W^t θ_0

• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – The optimal polynomial P_t is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g.,
    θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2} (see the sketch below)
  – Replaces γ^{−1} by γ^{−1/2} in the rates

Illustration of accelerated gossip

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
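A sketch of the second-order gossip recursion on the ring example, using a constant weight ω = 2/(1 + √(1 − λ_2(W)²)) instead of the time-varying Chebyshev weights ω_t (a simplification; this fixed choice already improves the dependence from roughly γ^{−1} to γ^{−1/2}):

    import numpy as np

    n = 20
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

    lam2 = np.sort(np.linalg.eigvalsh(W))[-2]          # second largest eigenvalue
    omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))     # fixed acceleration weight

    rng = np.random.default_rng(0)
    xi = rng.standard_normal(n)
    plain = xi.copy()
    prev, curr = xi.copy(), W @ xi                     # theta_0, theta_1
    for t in range(2, 60):
        plain = W @ plain                              # regular gossip
        prev, curr = curr, omega * (W @ curr) + (1.0 - omega) * prev

    print("plain gossip error      :", np.abs(plain - xi.mean()).max())
    print("accelerated gossip error:", np.abs(curr - xi.mean()).max())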

Distribution in machine learning (and beyond)

• Machine learning through optimization:
  min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if node i holds m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                    | gradient steps | communication
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slowdown when going from 1 to 2 machines
  – Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                    | gradient steps | communication
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019) | m + √(mκ)      | √(κ/γ)
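Plugging illustrative values of (n, m, κ) into the gradient-step column of this table shows the regime where ADFS-style rates pay off; the numbers below are only order-of-magnitude comparisons, not measured running times:

    import math

    def single_machine(n, m, kappa):
        return n * m + math.sqrt(n * m * kappa)

    def msda(n, m, kappa):
        return m * math.sqrt(kappa)

    def adfs(n, m, kappa):
        return m + math.sqrt(m * kappa)

    n, m, kappa = 100, 10_000, 1_000   # illustrative problem sizes
    for name, fn in [("single machine", single_machine),
                     ("MSDA", msda), ("ADFS", adfs)]:
        print(f"{name:>14}: ~{fn(n, m, kappa):,.0f} gradient steps")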

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_ij(θ) + (σ_i/2)‖θ‖² ]
  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10^4 observations per node in ℝ^28, two-dimensional grid network
    [figures for n = 4 and n = 100 nodes]
  – Logistic regression with nm ≈ 10^5 observations in ℝ^47236, two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.


Page 32: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

(small κ = Lmicro) (large κ = Lmicro)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

  min_{θ∈R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

  – Sampling with replacement: i(t) random element of {1, . . . , n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex and L-smooth and g is µ-strongly convex:

  E g(θ̄_t) − g(θ∗) ≤ O(1/√t)             if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ∗) ≤ O(L/(µt)) = O(κ/t)  if γ_t = 1/(µt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
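A minimal sketch of this iteration (illustrative, not from the slides): regularized least-squares components f_i, sampling with replacement, step size γ_t = 1/(µ(t + t0)) where the shift t0 is only there to keep the very first steps numerically stable, and a running Polyak-Ruppert average.

```python
import numpy as np

np.random.seed(0)
n, d, mu = 1000, 5, 0.1                        # number of functions f_i, dimension, strong convexity
X = np.random.randn(n, d)
y = X @ np.ones(d) + 0.5 * np.random.randn(n)

# f_i(theta) = 0.5 * (x_i^T theta - y_i)^2 + mu/2 * ||theta||^2  (each f_i convex and smooth)
def grad_fi(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

theta = np.zeros(d)
theta_bar = np.zeros(d)                        # Polyak-Ruppert average of the iterates
t0 = 50                                        # shift for numerical stability (illustrative choice)
T = 20 * n                                     # a few passes over the data
for t in range(1, T + 1):
    i = np.random.randint(n)                   # sampling with replacement
    gamma = 1.0 / (mu * (t + t0))              # step size ~ 1/(mu t)
    theta -= gamma * grad_fi(theta, i)
    theta_bar += (theta - theta_bar) / (t + 1) # running average of all iterates

theta_star = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
print("||theta_bar - theta_star|| =", np.linalg.norm(theta_bar - theta_star))
print("||theta     - theta_star|| =", np.linalg.norm(theta - theta_star))
```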

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λ Ω(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})

  – Exponential convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})

  – Sampling with replacement: i(t) random element of {1, . . . , n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n

Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost

  – Simple choice of step size

  [Figure: log(excess cost) vs. time for deterministic, stochastic, and new methods]

Recent progress in single machine optimization

• Variance reduction

  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number)

  Stochastic gradient descent :  d × κ       × 1/ε
  Gradient descent            :  d × nκ      × log(1/ε)
  Variance reduction          :  d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
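A minimal SAGA-style sketch of the variance-reduced update above (illustrative, not from the slides): one gradient y_i is stored per function and the stored entry of the sampled index is refreshed at every step; the data, step size and number of passes are arbitrary choices.

```python
import numpy as np

np.random.seed(0)
n, d, lam = 500, 10, 0.1
X = np.random.randn(n, d)
y = X @ np.random.randn(d) + 0.1 * np.random.randn(n)

# f_i(theta) = 0.5 * (x_i^T theta - y_i)^2 + lam/2 * ||theta||^2
def grad_fi(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L = np.max(np.sum(X**2, axis=1)) + lam         # smoothness constant of the individual f_i
gamma = 1.0 / (3 * L)                          # a standard SAGA step size choice

theta = np.zeros(d)
table = np.array([grad_fi(theta, i) for i in range(n)])   # stored gradients y_i
table_mean = table.mean(axis=0)

for t in range(20 * n):                        # about 20 passes over the data
    i = np.random.randint(n)
    g = grad_fi(theta, i)
    # variance-reduced direction: new gradient minus stored one, plus the stored average
    theta -= gamma * (g - table[i] + table_mean)
    table_mean += (g - table[i]) / n           # keep the average of stored gradients up to date
    table[i] = g

theta_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print("distance to optimum:", np.linalg.norm(theta - theta_star))
```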

Outline

1. Parametric supervised learning on a single machine

   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks

   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if node i holds m_i observations

• Each dataset / function f_i only accessible by node i in a graph

  [Figure: network of 9 computing nodes]

  – Massive datasets, multiple machines / cores
  – Communication / legal constraints

• Goal: minimize communication and local computation costs
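This setting can be written down in a few lines; the sketch below (all names and data are illustrative) builds a local objective f_i from the data held by node i and evaluates the global objective g as their average, which is exactly what the algorithms of this part minimize without centralizing the data.

```python
import numpy as np

np.random.seed(0)
n_nodes, m, d = 9, 100, 5                     # nodes, observations per node, dimension

# local datasets (x_ij, y_ij), j = 1..m, held by node i only
local_X = [np.random.randn(m, d) for _ in range(n_nodes)]
local_y = [np.sign(Xi @ np.ones(d) + np.random.randn(m)) for Xi in local_X]

def f_i(theta, i, lam=0.1):
    # local objective of node i: (1/m) sum_j log(1 + exp(-y_ij theta^T x_ij)) + lam/2 ||theta||^2
    margins = local_y[i] * (local_X[i] @ theta)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * theta @ theta

def g(theta):
    # global objective: average of the local objectives (never formed on a single node in practice)
    return np.mean([f_i(theta, i) for i in range(n_nodes)])

theta = np.zeros(d)
print("g(0) =", g(theta))
```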

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distribute a simple single machine algorithm?

  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, . . . , ξ_n ∈ R,

  compute θ∗ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)^2

• Centralized algorithms

  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization

  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness?
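A minimal sketch of the master/slave averaging idea on a spanning tree (the tree, the values and the function names are illustrative): partial sums are aggregated towards the root and the exact average is broadcast back, so the number of communication rounds is proportional to the tree depth.

```python
# Spanning tree given as parent pointers; node 0 is the master (root).
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3, 8: 3}
children = {i: [j for j, p in parent.items() if p == i] for i in parent}
xi = {i: float(i) for i in parent}            # one scalar observation per node

def subtree_sum_and_count(i):
    # upward pass: each node sends (sum, count) of its subtree to its parent
    s, c = xi[i], 1
    for j in children[i]:
        sj, cj = subtree_sum_and_count(j)
        s, c = s + sj, c + cj
    return s, c

def broadcast(i, value, out):
    # downward pass: the master sends the average back down the tree
    out[i] = value
    for j in children[i]:
        broadcast(j, value, out)

total, count = subtree_sum_and_count(0)
averages = {}
broadcast(0, total / count, averages)
print(averages)   # every node ends up with the exact average (here 4.0), with no error
```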

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, . . . , ξ_n ∈ R,

  compute θ∗ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)^2

• Decentralized algorithms - gossip (Boyd et al., 2006)

  – Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)

  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap: γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip
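The original slide shows a figure; as a stand-in, here is a small simulation of synchronous gossip on a ring of 20 nodes with a symmetric doubly stochastic W (the graph and weights are illustrative choices). The distance to the mean contracts at a rate governed by the eigengap 1 − λ_2(W).

```python
import numpy as np

np.random.seed(0)
n = 20
xi = np.random.randn(n)                          # one observation per node

# Symmetric doubly stochastic gossip matrix on a ring (1/2 self-weight, 1/4 per neighbor)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
print("eigengap 1 - lambda_2(W):", eigs[0] - eigs[1])

theta = xi.copy()
for t in range(1, 201):
    theta = W @ theta                            # synchronous gossip step: theta_t = W theta_{t-1}
    if t % 50 == 0:
        print(f"t={t:3d}  ||theta_t - mean(xi)|| = {np.linalg.norm(theta - xi.mean()):.4f}")
```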

Decentralized optimization

• Mixing gossip and optimization

  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013);
    Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016);
    Colin et al. (2016); Scaman et al. (2017), etc.

• Lower bound on complexity (Scaman et al., 2017)

  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip is not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈R^d} Σ_{i=1}^n f_i(θ)

    = min_{θ(1),...,θ(n)∈R^d} Σ_{i=1}^n f_i(θ(i))  such that  ∀ i ∼ j, θ(i) = θ(j)

    = min_{θ(1),...,θ(n)∈R^d} max_{∀ i∼j, λ_ij∈R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i∼j} λ_ij^⊤ (θ(i) − θ(j))

    = max_{∀ i∼j, λ_ij∈R^d} min_{θ(1),...,θ(n)∈R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i=1}^n [θ(i)]^⊤ linear_i(λ)

    = max_{∀ i∼j, λ_ij∈R^d} Σ_{i=1}^n function_i(λ) = max_{∀ i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent on the dual (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step

  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip

Accelerated gossip

• Regular gossip

  – Iterations: θ_t = W^t θ_0

• Accelerated gossip

  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – Optimal polynomial is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g., with the second-order recursion
    θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
  – Replaces γ^{−1} by γ^{−1/2} in the rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)

Illustration of accelerated gossip
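The original slide shows a figure; as a stand-in, here is a sketch of Chebyshev-accelerated gossip compared with plain gossip on the same illustrative ring network as before. It assumes all eigenvalues of W other than the leading one lie in [−λ_2, λ_2] (true here), and uses the scaled Chebyshev polynomials P_t(x) = T_t(x/λ_2)/T_t(1/λ_2); this is one standard parameterization, not the exact one of any specific paper.

```python
import numpy as np

np.random.seed(0)
n = 20
xi = np.random.randn(n)                          # one value per node

# Symmetric doubly stochastic gossip matrix on a ring (1/2 self-weight, 1/4 per neighbor)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

lam2 = np.sort(np.linalg.eigvalsh(W))[-2]        # bound on |eigenvalues| except the top one (all >= 0 here)

# Chebyshev-accelerated gossip: theta_t = P_t(W) xi, computed online with one multiplication
# by W per step (same per-step cost as plain gossip), via the three-term Chebyshev recurrence.
a_prev, a_curr = 1.0, 1.0 / lam2                 # a_t = T_t(1/lam2)
theta_prev, theta_curr = xi.copy(), W @ xi       # P_0(W) xi = xi, P_1(W) xi = W xi
plain = W @ xi                                   # plain gossip iterate, for comparison

for t in range(2, 51):
    a_next = (2.0 / lam2) * a_curr - a_prev
    omega = 2.0 * a_curr / (lam2 * a_next)
    theta_next = omega * (W @ theta_curr) + (1.0 - omega) * theta_prev
    theta_prev, theta_curr = theta_curr, theta_next
    a_prev, a_curr = a_curr, a_next
    plain = W @ plain
    if t % 10 == 0:
        err_acc = np.linalg.norm(theta_curr - xi.mean())
        err_plain = np.linalg.norm(plain - xi.mean())
        print(f"t={t:2d}  accelerated={err_acc:.2e}  plain={err_plain:.2e}")
```

On this ring, the accelerated iterates reach a given precision in roughly the square root of the number of steps needed by plain gossip, matching the γ^{−1} → γ^{−1/2} improvement stated above.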

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if node i holds m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                      | gradient steps | communication
  Single machine algorithm       | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)     | m√κ            | √(κ/γ)

• MSDA (Scaman et al., 2017)

  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slow-down when going from 1 to 2 machines
  – Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps

  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                      | gradient steps | communication
  Single machine algorithm       | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)     | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019)   | m + √(mκ)      | √(κ/γ)

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖^2 ]

  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

  – Logistic regression with m = 10^4 observations per node in R^28, on a two-dimensional grid
    network [Figure: running times for n = 4 and n = 100 nodes]
  – Logistic regression with nm ≈ 10^5 observations in R^47236, on a two-dimensional grid network
    with n = 100 nodes [Figure: running times]

Conclusions

• Distributed decentralized machine learning

  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions

  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 33: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 34: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Iterative methods for minimizing smooth functions

bull Assumption g convex and L-smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

g(θt)minus g(θlowast) 6 O(1t)

g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex

bull Acceleration (Nesterov 1983) second-order recursion

θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)

ndash Good choice of momentum term δt isin [0 1)

g(θt)minus g(θlowast) 6 O(1t2)

g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust

radicκ) if micro-strongly convex

ndash Optimal rates after t = O(d) iterations (Nesterov 2004)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost
Simple choice of step size

[Figure: log(excess cost) vs. time, comparing the deterministic, stochastic, and new methods]

Recent progress in single machine optimization

• Variance reduction
– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t stored value at time t of the gradient of the i-th function)
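To make the update concrete, here is a minimal SAGA-style sketch (an illustration under assumed data and step size, not the authors' reference implementation) for least-squares, maintaining the table of stored gradients y_i:

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

L = np.max(np.sum(X**2, axis=1))   # per-function smoothness
gamma = 1.0 / (3 * L)              # constant step size (a typical SAGA choice)

theta = np.zeros(d)
stored = np.zeros((n, d))          # y_i: last gradient seen for each f_i
stored_mean = stored.mean(axis=0)  # (1/n) Σ_i y_i, kept up to date

for t in range(20 * n):
    i = rng.integers(n)
    grad_i = (X[i] @ theta - y[i]) * X[i]          # ∇f_{i(t)}(θ_{t−1})
    theta -= gamma * (grad_i + stored_mean - stored[i])
    stored_mean += (grad_i - stored[i]) / n        # update running mean of the table
    stored[i] = grad_i                             # overwrite stored gradient

print(np.linalg.norm(theta - theta_true))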

Recent progress in single machine optimization

• Variance reduction
– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

• Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent    d ×    κ     × 1/ε
Gradient descent               d ×    nκ    × log(1/ε)
Variance reduction             d × (n + κ)  × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity

Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization
min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
– f_i(θ): error of model defined by θ on dataset indexed by i
– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if m_i observations

• Each dataset / function f_i only accessible by node i in a graph
– Massive datasets, multiple machines/cores
– Communication / legal constraints

• Goal: minimize communication and local computation costs
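As an illustration of this finite-sum-of-finite-sums structure (a sketch with assumed synthetic data, not from the slides), each node i can hold its own block (X_i, y_i) and expose only its local objective and gradient, here for the logistic loss with Φ taken as the identity:

import numpy as np

class Node:
    """Holds a local dataset (X_i, y_i) and exposes f_i and its gradient."""
    def __init__(self, X, y):
        self.X, self.y, self.m = X, y, X.shape[0]

    def f(self, theta):
        # f_i(θ) = (1/m_i) Σ_j log(1 + exp(−y_ij θ^⊤ x_ij))
        return np.mean(np.log1p(np.exp(-self.y * (self.X @ theta))))

    def grad(self, theta):
        s = 1.0 / (1.0 + np.exp(self.y * (self.X @ theta)))
        return -(self.X.T @ (self.y * s)) / self.m

rng = np.random.default_rng(0)
d, n_nodes, m = 5, 9, 100
nodes = [Node(rng.standard_normal((m, d)), rng.choice([-1.0, 1.0], m)) for _ in range(n_nodes)]
theta = np.zeros(d)
g = np.mean([node.f(theta) for node in nodes])   # global objective g(θ) = (1/n) Σ_i f_i(θ)
print(g)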

Distribution in machine learning (and beyond)

• Machine learning through optimization
min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
– f_i(θ): error of model defined by θ on dataset indexed by i

• Why not simply distributing a simple single machine algorithm?
– (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
– Requires √κ log(1/ε) full gradient computations to reach precision ε
– Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R,
compute θ_* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness?
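A minimal sketch of the master/slave pattern (illustrative only, with an assumed chain used as the spanning tree): partial sums are gathered up to the root, which computes the exact average to be broadcast back down.

import numpy as np

def tree_average(tree, root, xi):
    """Master/slave averaging on a spanning tree: gather (sum, count) up to the
    root; the root then holds the exact mean, to be broadcast back down."""
    def gather(node, parent):
        total, count = xi[node], 1
        for child in tree[node]:
            if child != parent:
                s, c = gather(child, node)
                total += s
                count += c
        return total, count
    total, count = gather(root, None)
    return total / count

# Assumed example: 9 nodes on a chain used as the spanning tree (adjacency lists)
tree = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9] for i in range(9)}
xi = np.arange(9.0)                        # ξ_i held at node i
print(tree_average(tree, root=4, xi=xi))   # exact average after ∆ up-messages and ∆ down-messages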

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R,
compute θ_* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: Eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain

[Figure: network graph with nine nodes]

– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
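A small sketch of synchronous gossip on an assumed ring network, using a lazy random-walk matrix W (symmetric, doubly stochastic) and its eigengap to set the number of iterations:

import numpy as np

n = 9
# Assumed topology: ring graph; W = I/2 + A/4 is symmetric and doubly stochastic
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
W = 0.5 * np.eye(n) + 0.25 * A          # each node keeps 1/2, gives 1/4 to each neighbor

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigs[0] - eigs[1]               # eigengap γ = 1 − λ_2(W)

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)
theta = xi.copy()
for t in range(int(np.ceil(np.log(1e6) / gamma))):   # ≈ (1/γ) log(1/ε) iterations
    theta = W @ theta                   # θ_t = W θ_{t−1}

print(np.max(np.abs(theta - xi.mean())))   # all nodes are close to the average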

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization
– Nedic and Ozdaglar (2009), Duchi et al. (2012), Wei and Ozdaglar (2012), Iutzeler et al. (2013), Shi et al. (2015), Jakovetic et al. (2015), Nedich et al. (2016), Mokhtari et al. (2016), Colin et al. (2016), Scaman et al. (2017), etc.

Decentralized optimization

• Mixing gossip and optimization

• Lower bound on complexity (Scaman et al., 2017)
– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?
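To show what "mixing gossip and optimization" looks like in its plainest form (a sketch, not the optimal scheme discussed below), each node can average its parameters with its neighbors and then take a local gradient step, in the spirit of Nedic and Ozdaglar (2009); the synthetic local datasets, ring topology, and step size are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 9, 5, 50
theta_true = rng.standard_normal(d)
X = rng.standard_normal((n, m, d))                     # one local dataset per node
y = X @ theta_true + 0.1 * rng.standard_normal((n, m))

# Ring network and gossip matrix, as in the previous sketch
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
W = 0.5 * np.eye(n) + 0.25 * A

def local_grad(i, theta_i):
    # ∇f_i for the local least-squares objective f_i(θ) = (1/2m) ||X_i θ − y_i||²
    return X[i].T @ (X[i] @ theta_i - y[i]) / m

theta = np.zeros((n, d))                               # one parameter vector per node
gamma = 0.05
for t in range(500):
    theta = W @ theta                                  # gossip step: mix with neighbors
    theta -= gamma * np.stack([local_grad(i, theta[i]) for i in range(n)])

print(np.max(np.linalg.norm(theta - theta_true, axis=1)))   # nodes reach near-consensus close to θ_true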

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} Σ_{i=1}^n f_i(θ)
= min_{θ(1),...,θ(n)∈R^d} Σ_{i=1}^n f_i(θ(i)) such that ∀ i∼j, θ(i) = θ(j)
= min_{θ(1),...,θ(n)∈R^d} max_{∀i∼j, λ_ij∈R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i∼j} λ_ij^⊤ (θ(i) − θ(j))
= max_{∀i∼j, λ_ij∈R^d} min_{θ(1),...,θ(n)∈R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i=1}^n [θ(i)]^⊤ linear_i(λ)
= max_{∀i∼j, λ_ij∈R^d} Σ_{i=1}^n function_i(λ) = max_{∀i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)
⇔ alternating local gradient computations and a gossip step
– √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Not optimal ⇒ need accelerated gossip

Illustration of accelerated gossip

Accelerated gossip

• Regular gossip
– Iterations: θ_t = W^t θ_0

• Accelerated gossip
– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
– Shift-register gossip (Cao et al., 2006)
– Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
– Optimal polynomial is the Chebyshev polynomial
– Can be computed online with same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
– Replace γ^{−1} by γ^{−1/2} in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
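A sketch of accelerated (shift-register / Chebyshev-style) gossip with a two-term recursion over the previous two iterates; using a constant weight ω derived from λ_2(W) is one standard simplification and an assumption of this example, rather than the time-varying ω_t of exact Chebyshev acceleration.

import numpy as np

n = 9
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
W = 0.5 * np.eye(n) + 0.25 * A                 # symmetric doubly stochastic gossip matrix

lam2 = np.sort(np.linalg.eigvalsh(W))[-2]      # second-largest eigenvalue
omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2)) # constant Chebyshev-type weight

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)
theta_prev, theta = xi.copy(), W @ xi
for t in range(60):
    # θ_{t+1} = ω W θ_t + (1 − ω) θ_{t−1}: same per-step cost as regular gossip
    theta, theta_prev = omega * (W @ theta) + (1.0 - omega) * theta_prev, theta

print(np.max(np.abs(theta - xi.mean())))       # reaches precision ε in ≈ γ^{−1/2} log(1/ε) steps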

Distribution in machine learning (and beyond)

• Machine learning through optimization
min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
– f_i(θ): error of model defined by θ on dataset indexed by i
– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

Algorithm                       gradient steps     communication
Single machine algorithm        nm + √(nmκ)        0
MSDA (Scaman et al., 2017)      m√κ                √(κ/γ)

Distribution in machine learning (and beyond)

• Machine learning through optimization
min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
– f_i(θ): error of model defined by θ on dataset indexed by i
– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if m_i observations

• MSDA (Scaman et al., 2017)
– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– "Optimal", but still not adapted to machine learning
– Huge slow down when going from 1 to 2 machines
– Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps
– Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤Φ(x_ij))
– Allows for partial asynchrony

• n computing nodes with m observations each

Algorithm                       gradient steps     communication
Single machine algorithm        nm + √(nmκ)        0
MSDA (Scaman et al., 2017)      m√κ                √(κ/γ)
ADFS (Hendrikx et al., 2019)    m + √(mκ)          √(κ/γ)

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_{ij}(θ) + (σ_i/2) ‖θ‖² ]
– Create an equivalent graph
– Dual randomized coordinate ascent (with non-uniform sampling)
– Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster
– Logistic regression with m = 10^4 observations per node in R^28
– Two-dimensional grid network

[Figure: running-time comparison for n = 4 and n = 100 nodes]

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster
– Logistic regression with mn ≈ 10^5 observations in R^47236
– Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning
– Distributing the fastest single machine algorithms
– n machines and m observations per machine
– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
– Linear speed-ups for well-conditioned problems

• Extensions
– Beyond convex problems
– Matching running time complexity lower bounds
– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 35: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 36: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr O(log log 1ε) iterations

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ ℝ

– Compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆

– Master/slave algorithm: ∆ communication steps + no error (a tree-averaging sketch follows this slide)

• Application to centralized distributed optimization

– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps

– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
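A minimal sketch of master/slave averaging over a spanning tree, assuming messages are simulated by recursion (an up-sweep of partial sums followed by a broadcast of the mean); the function name and the star-graph example are illustrative.

```python
def tree_average(values, children, root=0):
    """Average n local values over a spanning tree rooted at `root`.

    Up-sweep: each node sends (partial sum, count) to its parent, so the root
    obtains the exact mean after at most `diameter` communication rounds.
    Down-sweep (not simulated explicitly): the root broadcasts the mean back.
    """
    def up(node):
        total, count = values[node], 1
        for child in children[node]:
            s, c = up(child)            # message from child to parent
            total, count = total + s, count + c
        return total, count

    total, count = up(root)
    return total / count                # value the root would broadcast down

# Example: star graph, node 0 as master
values = [1.0, 2.0, 3.0, 4.0]
children = {0: [1, 2, 3], 1: [], 2: [], 3: []}
print(tree_average(values, children))   # 2.5, with no averaging error
```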

Classical algorithms for distributed averaging

• Goal: Given n observations ξ_1, …, ξ_n ∈ ℝ

– Compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006)

– Replace θ_i by a weighted average of its neighbors, Σ_{j=1}^n W_ij θ_j

– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ

– Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ

– Typical assumption: W symmetric doubly stochastic matrix

– Consequence: eigenvalues(W) ∈ [−1, 1]

– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)

– γ^{−1} = mixing time of the associated Markov chain

– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging); a small simulation sketch follows
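A minimal simulation of synchronous gossip on a ring graph, with the simple choice W_ii = W_{i,i±1} = 1/3 (symmetric and doubly stochastic); the 1/3 weights and the ring topology are illustrative choices, not from the slides.

```python
import numpy as np

def ring_gossip_matrix(n):
    """Gossip matrix of a ring: weight 1/3 on each node itself and on its two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

n = 20
rng = np.random.default_rng(0)
xi = rng.normal(size=n)                          # local observations xi_i
W = ring_gossip_matrix(n)

eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]   # eigenvalues in decreasing order
gamma = eigvals[0] - eigvals[1]                  # eigengap 1 - lambda_2(W)

theta = xi.copy()
for t in range(200):
    theta = W @ theta                            # synchronous gossip step
print(gamma, np.max(np.abs(theta - xi.mean())))  # consensus error shrinks roughly like (1 - gamma)^t
```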

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization

– Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

Decentralized optimization

• Mixing gossip and optimization

• Lower bound on complexity (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈ℝ^d} Σ_{i=1}^n f_i(θ)
  = min_{θ^(1),…,θ^(n)∈ℝ^d} Σ_{i=1}^n f_i(θ^(i)) such that ∀ i ∼ j, θ^(i) = θ^(j)
  = min_{θ^(1),…,θ^(n)∈ℝ^d} max_{λ_ij∈ℝ^d, ∀ i∼j} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i∼j} λ_ij⊤ (θ^(i) − θ^(j))
  = max_{λ_ij∈ℝ^d, ∀ i∼j} min_{θ^(1),…,θ^(n)∈ℝ^d} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i=1}^n [θ^(i)]⊤ linear_i(λ)
  = max_{λ_ij∈ℝ^d, ∀ i∼j} Σ_{i=1}^n function_i(λ) = max_{λ_ij∈ℝ^d, ∀ i∼j} function(λ)

(an explicit form of linear_i and function_i is spelled out below)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step

– √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– Not optimal ⇒ need accelerated gossip
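Spelling out the grouped terms (this is just a rewriting of the chain above; the Fenchel conjugate f_i^* is standard notation introduced here, not on the slide):

```latex
% Grouping the edge terms by node:
\sum_{i \sim j} \lambda_{ij}^\top \big(\theta^{(i)} - \theta^{(j)}\big)
  \;=\; \sum_{i=1}^{n} \big[\theta^{(i)}\big]^\top \mathrm{linear}_i(\lambda),
\qquad
\mathrm{linear}_i(\lambda) \;=\; \sum_{j \,:\, i \sim j} \pm\,\lambda_{ij}
\;\;\text{($+$ if $i$ is the first endpoint of the edge, $-$ otherwise)}.

% The inner minimization is separable across nodes, giving the dual terms
\mathrm{function}_i(\lambda)
  \;=\; \min_{\theta^{(i)} \in \mathbb{R}^d}
        \Big\{ f_i\big(\theta^{(i)}\big) + \big[\theta^{(i)}\big]^\top \mathrm{linear}_i(\lambda) \Big\}
  \;=\; -\, f_i^{*}\!\big(-\mathrm{linear}_i(\lambda)\big).
```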

Accelerated gossip

• Regular gossip

– Iterations: θ_t = W^t θ_0

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)

– Shift-register gossip (Cao et al., 2006)

– Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ

– Optimal polynomial is the Chebyshev polynomial

– Can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}

– Replace γ^{−1} by γ^{−1/2} in rates (a small numerical sketch follows)
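A minimal numerical sketch of this recursion, reusing W and ξ from the ring-gossip sketch above; the two-step form (with θ_{t−2}) and the choice μ = max_{i ≥ 2} |λ_i(W)| follow the standard Chebyshev semi-iterative scheme and are assumptions here rather than details given on the slide.

```python
import numpy as np

def chebyshev_gossip(W, xi, n_iters):
    """Accelerated gossip: theta_t = omega_t W theta_{t-1} + (1 - omega_t) theta_{t-2},
    with omega_t from the Chebyshev recurrence (W symmetric doubly stochastic,
    mu bounding |eigenvalues| on the space orthogonal to consensus)."""
    eigvals = np.sort(np.linalg.eigvalsh(W))
    mu = max(abs(eigvals[0]), abs(eigvals[-2]))      # max |lambda_i(W)| over i >= 2
    t_prev, t_curr = 1.0, 1.0 / mu                   # T_0(1/mu), T_1(1/mu)
    x_prev, x_curr = xi.copy(), W @ xi               # theta_0, theta_1 (one plain gossip step)
    for _ in range(n_iters - 1):
        t_next = (2.0 / mu) * t_curr - t_prev        # Chebyshev three-term recurrence
        omega = 2.0 * t_curr / (mu * t_next)
        x_prev, x_curr = x_curr, omega * (W @ x_curr) + (1.0 - omega) * x_prev
        t_prev, t_curr = t_curr, t_next
    return x_curr

# Reusing W and xi from the synchronous-gossip sketch:
# consensus is reached in about 1/sqrt(gamma) iterations instead of 1/gamma.
# print(np.max(np.abs(chebyshev_gossip(W, xi, 50) - xi.mean())))
```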

Illustration of accelerated gossip

Accelerated gossip

• Regular gossip: θ_t = W^t θ_0

• Accelerated (Chebyshev) gossip: replace γ^{−1} by γ^{−1/2} in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)

Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of model defined by θ on dataset indexed by i

– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

Algorithm                    | gradient steps  | communication
Single machine algorithm     | nm + √(nm κ)    | 0
MSDA (Scaman et al., 2017)   | m √κ            | √(κ/γ)

Distribution in machine learning (and beyond)

• MSDA (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps

– "Optimal", but still not adapted to machine learning

– Huge slow-down when going from 1 to 2 machines

– Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps

– Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ⊤Φ(x_ij))

– Allows for partial asynchrony

• n computing nodes with m observations each

Algorithm                      | gradient steps  | communication
Single machine algorithm       | nm + √(nm κ)    | 0
MSDA (Scaman et al., 2017)     | m √κ            | √(κ/γ)
ADFS (Hendrikx et al., 2019)   | m + √(m κ)      | √(κ/γ)

(A rough running-time comparison sketch follows.)
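One way to read this table is through a crude non-overlapping cost model: total time ≈ (gradient steps) × τ_grad + (communication steps) × τ_comm. The constants and problem sizes below are purely illustrative and do not come from the slides.

```python
import math

def running_time(grad_steps, comm_steps, tau_grad=1.0, tau_comm=5.0):
    """Total time if gradient and communication steps do not overlap."""
    return grad_steps * tau_grad + comm_steps * tau_comm

# Illustrative values only: n nodes, m observations per node, condition number kappa, eigengap gamma
n, m, kappa, gamma = 100, 10_000, 1_000.0, 0.01

step_counts = {
    "single machine": (n * m + math.sqrt(n * m * kappa), 0.0),
    "MSDA":           (m * math.sqrt(kappa),             math.sqrt(kappa / gamma)),
    "ADFS":           (m + math.sqrt(m * kappa),         math.sqrt(kappa / gamma)),
}
for name, (g_steps, c_steps) in step_counts.items():
    print(f"{name:15s} total time = {running_time(g_steps, c_steps):12.0f}")
```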

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² ]

– Create an equivalent graph

– Dual randomized coordinate ascent (with non-uniform sampling)

– Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

– Logistic regression with m = 10^4 observations per node in ℝ^28

– Two-dimensional grid network

[Figure: running-time curves for n = 4 and n = 100 nodes]

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

– Logistic regression with mn ≈ 10^5 observations in ℝ^47236

– Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms

– n machines and m observations per machine

– From nm + √(nm κ) (single machine) to m + √(m κ) gradient steps

– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems

– Matching running time complexity lower bounds

– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 38: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Iterative methods for minimizing smooth functions

bull Assumption g convex and smooth on Rd

bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)

ndash O(1t) convergence rate for convex functions

ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)

bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)

ndash O(

eminusρ2t)

quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)

bull Key insights for machine learning (Bottou and Bousquet 2008)

1 No need to optimize below statistical error

2 Cost functions are averages

3 Testing error is more important than training error

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

• Regular gossip

– Iterations: θ_t = W^t θ_0

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
– Shift-register gossip (Cao et al., 2006)
– Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
– Optimal polynomial is the Chebyshev polynomial
– Can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
– Replace γ^{−1} by γ^{−1/2} in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)

Illustration of accelerated gossip (figure; a small numerical sketch follows)
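As a toy numerical illustration (not from the lecture), the sketch below compares plain gossip θ_t = Wθ_{t−1} with the two-term recursion above on a ring graph; the graph, the horizon T, and the fixed weight ω = 2/(1 + √(1 − λ₂(W)²)) are all assumptions (a standard tuning from the accelerated-averaging literature, not necessarily the lecture's exact scheme):

    import numpy as np

    def ring_gossip_matrix(n):
        """Symmetric, doubly stochastic gossip matrix W for a ring graph."""
        W = np.zeros((n, n))
        for i in range(n):
            W[i, i] = 0.5
            W[i, (i - 1) % n] += 0.25
            W[i, (i + 1) % n] += 0.25
        return W

    n, T = 20, 60
    rng = np.random.default_rng(0)
    xi = rng.normal(size=n)                  # one initial value per node
    target = np.full(n, xi.mean())           # consensus vector (the average)

    W = ring_gossip_matrix(n)
    lam2 = np.sort(np.linalg.eigvalsh(W))[-2]        # second-largest eigenvalue (all are >= 0 here)
    omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))   # fixed acceleration weight (assumed tuning)

    # Plain gossip: theta_t = W theta_{t-1} = W^t xi
    theta = xi.copy()
    for _ in range(T):
        theta = W @ theta
    print("plain gossip error      :", np.linalg.norm(theta - target))

    # Accelerated gossip: theta_t = omega * W theta_{t-1} + (1 - omega) * theta_{t-2}
    prev, curr = xi.copy(), W @ xi
    for _ in range(T - 1):
        prev, curr = curr, omega * (W @ curr) + (1.0 - omega) * prev
    print("accelerated gossip error:", np.linalg.norm(curr - target))

With these settings the two-term recursion typically ends up several orders of magnitude closer to consensus after the same number of communication rounds, illustrating the γ^{−1} → γ^{−1/2} improvement mentioned above.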

Distribution in machine learning (and beyond)

• Machine learning through optimization

min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i
– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤ Φ(x_ij)) if m_i observations (a concrete logistic-loss instance is sketched after this slide)

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                      gradient steps    communication
  Single machine algorithm       nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)     m√κ               √(κ/γ)

• MSDA (Scaman et al., 2017)

– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– "Optimal", but still not adapted to machine learning
– Huge slow-down when going from 1 to 2 machines
– Only synchronous
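To make the example f_i(θ) = (1/m_i) Σ_j ℓ(y_ij, θ^⊤Φ(x_ij)) concrete, here is a minimal sketch of one node's local objective and gradient for the logistic loss ℓ(y, z) = log(1 + e^{−yz}); the toy data, the shapes, and the optional ℓ2 term are assumptions for illustration only:

    import numpy as np

    def local_objective_and_grad(theta, Phi, y, lam=0.0):
        """f_i(theta) = (1/m_i) * sum_j log(1 + exp(-y_ij * theta^T Phi(x_ij))) + (lam/2) * ||theta||^2."""
        m = Phi.shape[0]
        margins = y * (Phi @ theta)                          # y_ij * theta^T Phi(x_ij)
        loss = np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * np.dot(theta, theta)
        grad = -(Phi.T @ (y / (1.0 + np.exp(margins)))) / m + lam * theta
        return loss, grad

    # Toy local dataset held by one node (m_i = 100 observations, d = 5 features).
    rng = np.random.default_rng(1)
    Phi = rng.normal(size=(100, 5))
    y = np.sign(Phi @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100))
    loss, grad = local_objective_and_grad(np.zeros(5), Phi, y, lam=0.1)
    print(loss, np.linalg.norm(grad))

This kind of local loss and gradient evaluation is what a node can do on its own data; everything involving other nodes' data has to go through communication steps.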

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps

– Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤ Φ(x_ij))
– Allows for partial asynchrony

• n computing nodes with m observations each (a back-of-the-envelope comparison follows the table)

  Algorithm                      gradient steps    communication
  Single machine algorithm       nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)     m√κ               √(κ/γ)
  ADFS (Hendrikx et al., 2019)   m + √(mκ)         √(κ/γ)
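For a rough feel of what this table means, a small back-of-the-envelope computation (all numeric values below are assumptions for illustration, not figures from the papers; constants and log(1/ε) factors are dropped):

    import math

    # Assumed problem size: n nodes, m observations per node, condition number kappa,
    # eigengap gamma ~ 1/n for a two-dimensional grid.
    n, m, kappa, gamma = 100, 10_000, 1_000.0, 0.01

    single_machine = n * m + math.sqrt(n * m * kappa)   # nm + sqrt(nm * kappa)
    msda_gradients = m * math.sqrt(kappa)               # m * sqrt(kappa), per node
    adfs_gradients = m + math.sqrt(m * kappa)           # m + sqrt(m * kappa), per node
    communications = math.sqrt(kappa / gamma)           # same order for MSDA and ADFS

    print(f"single machine : {single_machine:,.0f} gradient steps")
    print(f"MSDA           : {msda_gradients:,.0f} gradient steps, {communications:,.0f} communication steps")
    print(f"ADFS           : {adfs_gradients:,.0f} gradient steps, {communications:,.0f} communication steps")

With these numbers the single-machine count is about 10^6, MSDA about 3·10^5 per node, and ADFS about 1.3·10^4 per node, which is the kind of linear speed-up for well-conditioned problems claimed in the conclusions.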

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² ]

– Create an equivalent graph
– Dual randomized coordinate ascent (with non-uniform sampling)
– Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

– Logistic regression with m = 10^4 observations per node in R^28, on a two-dimensional grid network (plots for n = 4 and n = 100 not reproduced in this transcript)
– Logistic regression with mn ≈ 10^5 observations in R^47236, on a two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms
– n machines and m observations per machine
– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems
– Matching running time complexity lower bounds
– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 41: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Stochastic gradient descent (SGD) for finite sums

$\min_{\theta \in \mathbb{R}^d} \ g(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$

• Iteration: $\theta_t = \theta_{t-1} - \gamma_t\, f'_{i(t)}(\theta_{t-1})$

– Sampling with replacement: $i(t)$ random element of $\{1, \dots, n\}$
– Polyak-Ruppert averaging: $\bar{\theta}_t = \frac{1}{t+1} \sum_{u=0}^{t} \theta_u$

• Convergence rate if each $f_i$ is convex $L$-smooth and $g$ is $\mu$-strongly convex:

$\mathbb{E}\, g(\bar{\theta}_t) - g(\theta_*) \ \leq \ \begin{cases} O(1/\sqrt{t}) & \text{if } \gamma_t = 1/(L\sqrt{t}) \\ O(L/(\mu t)) = O(\kappa/t) & \text{if } \gamma_t = 1/(\mu t) \end{cases}$

– No adaptivity to strong convexity in general
– Running-time complexity: $O(d \cdot \kappa / \varepsilon)$

• NB: a single pass leads to bounds on the testing error
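To make the iteration concrete, here is a minimal NumPy sketch of SGD with Polyak-Ruppert averaging on a synthetic regularized least-squares finite sum; the data, the ridge parameter and the $\gamma_t = 1/(L\sqrt{t})$ step-size regime are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Synthetic finite sum: f_i(theta) = 0.5 * (x_i^T theta - y_i)^2 + lam/2 * ||theta||^2
rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
lam = 1e-2                                   # ridge term makes g strongly convex

def grad_fi(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L = np.max(np.sum(X**2, axis=1)) + lam       # smoothness constant of the f_i
theta = np.zeros(d)
theta_avg = np.zeros(d)                      # Polyak-Ruppert average of the iterates
for t in range(1, 50001):
    i = rng.integers(n)                      # sampling with replacement
    gamma = 1.0 / (L * np.sqrt(t))           # decaying step size (O(1/sqrt(t)) regime)
    theta -= gamma * grad_fi(theta, i)
    theta_avg += (theta - theta_avg) / (t + 1)

# Compare with the exact minimizer of g
theta_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
g = lambda th: 0.5 * np.mean((X @ th - y) ** 2) + 0.5 * lam * th @ th
print("excess cost:", g(theta_avg) - g(theta_star))
```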

Stochastic vs deterministic methods

• Minimizing $g(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$ with $f_i(\theta) = \ell(y_i, h(x_i, \theta)) + \lambda \Omega(\theta)$

• Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma \nabla g(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma}{n} \sum_{i=1}^{n} \nabla f_i(\theta_{t-1})$

– Exponential convergence rate in $O(e^{-t/\kappa})$ for convex problems
– Can be accelerated to $O(e^{-t/\sqrt{\kappa}})$ (Nesterov, 1983)
– Iteration complexity is linear in $n$

• Stochastic gradient descent: $\theta_t = \theta_{t-1} - \gamma_t \nabla f_{i(t)}(\theta_{t-1})$

– Sampling with replacement: $i(t)$ random element of $\{1, \dots, n\}$
– Convergence rate in $O(\kappa/t)$
– Iteration complexity is independent of $n$
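As a complement, a small sketch contrasting the two updates on the same kind of synthetic least-squares objective: one batch step touches all $n$ observations ($O(nd)$ per iteration, linear convergence with step $1/L$), while one stochastic step touches a single observation ($O(d)$ per iteration). The data and constants are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 0.1

def full_grad(theta):
    # batch gradient: averages all n per-observation gradients, cost O(n d)
    return X.T @ (X @ theta - y) / n + lam * theta

def stoch_grad(theta, i):
    # stochastic gradient: a single observation, cost O(d), independent of n
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L = np.linalg.eigvalsh(X.T @ X / n).max() + lam   # smoothness constant of g
theta_gd = np.zeros(d)
for t in range(300):
    theta_gd -= (1.0 / L) * full_grad(theta_gd)    # linear rate O(exp(-t/kappa))

theta_sgd = np.zeros(d)
for t in range(1, 300 * n + 1):                    # same total gradient cost as 300 batch steps
    i = rng.integers(n)
    theta_sgd -= stoch_grad(theta_sgd, i) / (lam * t + L)   # decaying step size
```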

Stochastic vs deterministic methods

• Goal = best of both worlds: linear rate with $O(d)$ iteration cost

• Simple choice of step size

[figure: log(excess cost) vs. time for the deterministic, stochastic, and new methods]

Recent progress in single machine optimization

• Variance reduction

– Exponential convergence with $O(d)$ iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

$\theta_t = \theta_{t-1} - \gamma \Big[ \nabla f_{i(t)}(\theta_{t-1}) + \frac{1}{n} \sum_{i=1}^{n} y_i^{t-1} - y_{i(t)}^{t-1} \Big]$

(with $y_i^t$ the stored value at time $t$ of the gradient of the $i$-th function)

• Running time to reach precision $\varepsilon$ (with $\kappa$ = condition number):

Stochastic gradient descent:  $d \times \kappa \times \frac{1}{\varepsilon}$
Gradient descent:             $d \times n\kappa \times \log\frac{1}{\varepsilon}$
Variance reduction:           $d \times (n + \kappa) \times \log\frac{1}{\varepsilon}$

– Can be accelerated (e.g., Lan, 2015): $n + \kappa \Rightarrow n + \sqrt{n\kappa}$
– Matching upper and lower bounds of complexity
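Below is a minimal sketch of the variance-reduced update displayed above, in the SAGA flavour (one stored gradient per function, updated on the fly); the synthetic ridge-regression data and the step size $\gamma = 1/(3L)$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 15
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 0.1

def grad_fi(theta, i):
    # gradient of f_i(theta) = 0.5 * (x_i^T theta - y_i)^2 + lam/2 * ||theta||^2
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L = np.max(np.sum(X**2, axis=1)) + lam      # per-function smoothness constant
gamma = 1.0 / (3 * L)
theta = np.zeros(d)
table = np.stack([grad_fi(theta, i) for i in range(n)])   # stored gradients y_i
table_mean = table.mean(axis=0)

for t in range(30 * n):                     # about 30 effective passes over the data
    i = rng.integers(n)
    g_new = grad_fi(theta, i)
    # update direction: fresh gradient + average of stored gradients - stored gradient i
    theta -= gamma * (g_new + table_mean - table[i])
    table_mean += (g_new - table[i]) / n    # keep the average up to date in O(d)
    table[i] = g_new
```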

Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} f_i(\theta) = g(\theta)$

– $f_i(\theta)$: error of the model defined by $\theta$ on the dataset indexed by $i$
– Example: $f_i(\theta) = \frac{1}{m_i} \sum_{j=1}^{m_i} \ell(y_{ij}, \theta^\top \Phi(x_{ij}))$ if $m_i$ observations

• Each dataset / function $f_i$ only accessible by node $i$ in a graph

[figure: network of nodes 1-9]

– Massive datasets, multiple machines/cores
– Communication / legal constraints

• Goal: minimize communication and local computation costs
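To fix ideas, a sketch of what node $i$ holds and can compute locally, with the logistic loss $\ell(y, a) = \log(1 + e^{-ya})$ used in the experiments later in the talk; the synthetic local dataset ($m_i$, $d$, random features) is an illustrative assumption. In a decentralized method, only node $i$ can evaluate these two functions.

```python
import numpy as np

# Hypothetical local dataset of node i: m_i pairs (Phi(x_ij), y_ij), with y_ij in {-1, +1}
rng = np.random.default_rng(0)
m_i, d = 200, 30
Phi_i = rng.standard_normal((m_i, d))
y_i = rng.choice([-1.0, 1.0], size=m_i)

def f_i(theta):
    # local empirical risk: (1/m_i) * sum_j log(1 + exp(-y_ij * theta^T Phi(x_ij)))
    margins = y_i * (Phi_i @ theta)
    return np.mean(np.logaddexp(0.0, -margins))

def grad_f_i(theta):
    margins = y_i * (Phi_i @ theta)
    sigma = np.exp(-np.logaddexp(0.0, margins))     # sigmoid(-margins), computed stably
    return -(Phi_i.T @ (y_i * sigma)) / m_i
```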

Distribution in machine learning (and beyond)

• Machine learning through optimization:

$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} f_i(\theta) = g(\theta)$

– $f_i(\theta)$: error of the model defined by $\theta$ on the dataset indexed by $i$

• Why not simply distribute a simple single machine algorithm?

– (Accelerated) gradient descent (see, e.g., Nesterov, 2004): $\theta_t = \theta_{t-1} - \gamma \nabla g(\theta_{t-1})$
– Requires $\sqrt{\kappa} \log\frac{1}{\varepsilon}$ full gradient computations to reach precision $\varepsilon$
– Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

• Goal: given $n$ observations $\xi_1, \dots, \xi_n \in \mathbb{R}$,

– Compute $\theta_* = \frac{1}{n} \sum_{i=1}^{n} \xi_i = \arg\min_{\theta \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} (\theta - \xi_i)^2$

• Centralized algorithms

– Compute a spanning tree with diameter $\leq 2\Delta$
– Master/slave algorithm: $\Delta$ communication steps + no error

[figure: network of nodes 1-9 and a spanning tree extracted from it]

• Application to centralized distributed optimization

– $\sqrt{\kappa} \log\frac{1}{\varepsilon}$ gradient steps and $\sqrt{\kappa}\, \Delta \log\frac{1}{\varepsilon}$ communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness
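A toy sketch of the master/slave averaging scheme: gather partial sums up a spanning tree to the master, then broadcast the mean back down. The star-shaped tree (so $\Delta = 1$) and the recursive simulation of message passing are illustrative assumptions; on a real network each level of the tree costs one communication step.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 9
xi = rng.standard_normal(n)                 # local observations xi_1, ..., xi_n
children = {0: list(range(1, n))}           # hypothetical spanning tree: node 0 is the master

def up_sweep(node):
    # each node forwards the (sum, count) of its subtree to its parent
    total, count = xi[node], 1
    for c in children.get(node, []):
        s, k = up_sweep(c)
        total += s
        count += k
    return total, count

total, count = up_sweep(0)
theta_star = total / count                  # exact average, then broadcast down the tree
assert np.isclose(theta_star, xi.mean())    # no error, after ~2 * depth communication steps
```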

Classical algorithms for distributed averaging

• Goal: given $n$ observations $\xi_1, \dots, \xi_n \in \mathbb{R}$,

– Compute $\theta_* = \frac{1}{n} \sum_{i=1}^{n} \xi_i = \arg\min_{\theta \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} (\theta - \xi_i)^2$

• Decentralized algorithms - gossip (Boyd et al., 2006)

– Replace $\theta_i$ by a weighted average of its neighbors, $\sum_{j=1}^{n} W_{ij} \theta_j$
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)

– Main iteration: $\theta_t = W \theta_{t-1} = W^t \theta_0 = W^t \xi$
– Typical assumption: $W$ symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: $\theta_t = W \theta_{t-1} = W^t \theta_0 = W^t \xi$
– Typical assumption: $W$ symmetric doubly stochastic matrix
– Consequence: eigenvalues$(W) \in [-1, 1]$
– Eigengap $\gamma = \lambda_1(W) - \lambda_2(W) = 1 - \lambda_2(W)$
– $\gamma^{-1}$ = mixing time of the associated Markov chain
– Need $\frac{1}{\gamma} \log\frac{1}{\varepsilon}$ iterations to reach precision $\varepsilon$ (for classical averaging)
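A minimal sketch of synchronous gossip on a ring of $n$ nodes (the ring topology and the particular weights are illustrative assumptions): each iteration multiplies by a symmetric doubly stochastic $W$, and the eigengap $\gamma = 1 - \lambda_2(W)$ read off from $W$ governs how many iterations are needed.

```python
import numpy as np

n = 9
W = np.zeros((n, n))                      # symmetric doubly stochastic gossip matrix on a ring
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = 0.25
    W[i, (i - 1) % n] = 0.25

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)               # one observation per node
theta = xi.copy()
for t in range(200):
    theta = W @ theta                     # each node averages itself with its two neighbours

eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
gap = eigvals[0] - eigvals[1]             # eigengap gamma = 1 - lambda_2(W)
print("max error:", np.abs(theta - xi.mean()).max(), " eigengap:", gap)
```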

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization

– Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)

– $\sqrt{\kappa} \log\frac{1}{\varepsilon}$ gradient steps and $\sqrt{\kappa/\gamma} \log\frac{1}{\varepsilon}$ communication steps
– Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

$\min_{\theta \in \mathbb{R}^d} \sum_{i=1}^{n} f_i(\theta) = \min_{\theta^{(1)}, \dots, \theta^{(n)} \in \mathbb{R}^d} \sum_{i=1}^{n} f_i(\theta^{(i)}) \quad \text{such that } \forall i \sim j,\ \theta^{(i)} = \theta^{(j)}$

$= \min_{\theta^{(1)}, \dots, \theta^{(n)} \in \mathbb{R}^d} \ \max_{\forall i \sim j,\ \lambda_{ij} \in \mathbb{R}^d} \ \sum_{i=1}^{n} f_i(\theta^{(i)}) + \sum_{i \sim j} \lambda_{ij}^\top \big(\theta^{(i)} - \theta^{(j)}\big)$

$= \max_{\forall i \sim j,\ \lambda_{ij} \in \mathbb{R}^d} \ \min_{\theta^{(1)}, \dots, \theta^{(n)} \in \mathbb{R}^d} \ \sum_{i=1}^{n} f_i(\theta^{(i)}) + \sum_{i=1}^{n} [\theta^{(i)}]^\top \mathrm{linear}_i(\lambda)$

$= \max_{\forall i \sim j,\ \lambda_{ij} \in \mathbb{R}^d} \ \sum_{i=1}^{n} \mathrm{function}_i(\lambda) \ = \ \max_{\forall i \sim j,\ \lambda_{ij} \in \mathbb{R}^d} \ \mathrm{function}(\lambda)$

• Accelerated gradient descent (Scaman et al., 2017) ⇔ alternating local gradient computations and a gossip step

– $\sqrt{\kappa/\gamma} \log\frac{1}{\varepsilon}$ gradient steps and $\sqrt{\kappa/\gamma} \log\frac{1}{\varepsilon}$ communication steps
– Not optimal ⇒ need accelerated gossip

Accelerated gossip

• Regular gossip

– Iterations: $\theta_t = W^t \theta_0$

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
– Shift-register gossip (Cao et al., 2006)
– Linear combinations ⇔ $\eta_t = \sum_{k=0}^{t} \alpha_k \theta_k = \sum_{k=0}^{t} \alpha_k W^k \xi = P_t(W)\, \xi$
– Optimal polynomial is the Chebyshev polynomial
– Can be computed online with the same cost as regular gossip, e.g., $\theta_t = \omega_t W \theta_{t-1} + (1 - \omega_t)\, \theta_{t-2}$
– Replace $\gamma^{-1}$ by $\gamma^{-1/2}$ in rates

Illustration of accelerated gossip

Accelerated gossip

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
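A sketch comparing plain gossip with a Chebyshev-type accelerated recursion $\theta_t = \omega_t W \theta_{t-1} + (1 - \omega_t)\theta_{t-2}$ on the same ring; the weight recursion below is the classical Chebyshev semi-iterative one, used here as an illustration rather than the exact parameterization of Scaman et al. (2017).

```python
import numpy as np

n = 30
W = np.zeros((n, n))                      # same ring gossip matrix as before
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = 0.25
    W[i, (i - 1) % n] = 0.25

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)
mean = xi.mean()
rho = np.sort(np.linalg.eigvalsh(W))[-2]  # lambda_2(W): contraction factor outside consensus

plain = xi.copy()
prev, curr = xi.copy(), W @ xi            # theta_0 and theta_1 for the accelerated recursion
omega = 2.0
for t in range(60):
    plain = W @ plain                                     # plain gossip
    omega = 1.0 / (1.0 - 0.25 * rho**2 * omega)           # Chebyshev weight recursion
    prev, curr = curr, omega * (W @ curr) + (1.0 - omega) * prev

print("plain:", np.abs(plain - mean).max(), " accelerated:", np.abs(curr - mean).max())
```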

Distribution in machine learning (and beyond)

• Machine learning through optimization:

$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} f_i(\theta) = g(\theta)$

– $f_i(\theta)$: error of the model defined by $\theta$ on the dataset indexed by $i$
– Example: $f_i(\theta) = \frac{1}{m_i} \sum_{j=1}^{m_i} \ell(y_{ij}, \theta^\top \Phi(x_{ij}))$ if $m_i$ observations

• Single machine vs. "optimal" decentralized algorithm:

Algorithm                      | gradient steps           | communication
Single machine algorithm       | $nm + \sqrt{nm\kappa}$   | $0$
MSDA (Scaman et al., 2017)     | $m\sqrt{\kappa}$         | $\sqrt{\kappa/\gamma}$

• MSDA (Scaman et al., 2017)

– $\sqrt{\kappa} \log\frac{1}{\varepsilon}$ gradient steps and $\sqrt{\kappa/\gamma} \log\frac{1}{\varepsilon}$ communication steps
– "Optimal", but still not adapted to machine learning
– Huge slow-down when going from 1 to 2 machines
– Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps

– Adapted to functions of the type $f_i(\theta) = \frac{1}{m} \sum_{j=1}^{m} \ell(y_{ij}, \theta^\top \Phi(x_{ij}))$
– Allows for partial asynchrony

• $n$ computing nodes with $m$ observations each:

Algorithm                      | gradient steps           | communication
Single machine algorithm       | $nm + \sqrt{nm\kappa}$   | $0$
MSDA (Scaman et al., 2017)     | $m\sqrt{\kappa}$         | $\sqrt{\kappa/\gamma}$
ADFS (Hendrikx et al., 2019)   | $m + \sqrt{m\kappa}$     | $\sqrt{\kappa/\gamma}$

ADFS - Algorithm principle

• Minimizing $\sum_{i=1}^{n} \sum_{j=1}^{m} f_{ij}(\theta) + \frac{\sigma_i}{2} \|\theta\|^2$

– Create an equivalent graph
– Dual randomized coordinate ascent (with non-uniform sampling)
– Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

– Logistic regression with $m = 10^4$ observations per node in $\mathbb{R}^{28}$
– Two-dimensional grid network

[figure: running-time comparison for n = 4 and n = 100 nodes]

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

– Logistic regression with $mn \approx 10^5$ observations in $\mathbb{R}^{47236}$
– Two-dimensional grid network with $n = 100$ nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms
– $n$ machines and $m$ observations per machine
– From $nm + \sqrt{nm\kappa}$ (single machine) to $m + \sqrt{m\kappa}$ gradient steps
– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems
– Matching running-time complexity lower bounds
– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 42: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 43: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Stochastic gradient descent (SGD) for finite sums

minθisinRd

g(θ) =1

n

nsum

i=1

fi(θ)

bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =

1t+1

sumtu=0 θu

bull Convergence rate if each fi is convex L-smooth and g micro-strongly-

convex

Eg(θt)minus g(θlowast) 6

O(1radict) if γt = 1(L

radict)

O(L(microt)) = O(κt) if γt = 1(microt)

ndash No adaptivity to strong-convexity in general

ndash Running-time complexity O(d middot κε)

bull NB single pass leads to bounds on testing error

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization
  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?
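As a simple concrete instance of "mixing gossip and optimization" in the spirit of the first references above (decentralized gradient descent, not the optimal scheme discussed next), each node alternates a gossip averaging step with a local gradient step. The Python sketch below is illustrative only: the graph, the quadratic local functions and the constant step size are made up, and with a constant step the nodes only reach a neighbourhood of the optimum.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 9, 3                                   # n nodes, parameter dimension d

# Ring graph with Metropolis weights (symmetric, doubly stochastic), as in the gossip sketch.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1 / 3
W += np.diag(1 - W.sum(axis=1))

# Toy local objectives f_i(theta) = 0.5 * (a_i^T theta - b_i)^2, known only to node i.
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad = lambda i, th: (A[i] @ th - b[i]) * A[i]

theta = np.zeros((n, d))                      # one local iterate theta^(i) per node (rows)
eta = 0.02                                    # constant step size (assumed small enough)
for t in range(5000):
    theta = W @ theta                         # communication: average with neighbours
    theta = theta - eta * np.stack([grad(i, theta[i]) for i in range(n)])  # local gradients

theta_star = np.linalg.lstsq(A, b, rcond=None)[0]   # centralized minimizer of sum_i f_i
print("max node error:", np.abs(theta - theta_star).max())
```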

Dual reformulation (Jakovetic et al., 2015)

min_{θ∈R^d} Σ_{i=1}^n f_i(θ)
  = min_{θ(1),...,θ(n)∈R^d} Σ_{i=1}^n f_i(θ(i))  such that  ∀ i ∼ j, θ(i) = θ(j)
  = min_{θ(1),...,θ(n)∈R^d} max_{∀ i∼j, λ_ij∈R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i∼j} λ_ij^⊤ (θ(i) − θ(j))
  = max_{∀ i∼j, λ_ij∈R^d} min_{θ(1),...,θ(n)∈R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i=1}^n [θ(i)]^⊤ linear_i(λ)
  = max_{∀ i∼j, λ_ij∈R^d} Σ_{i=1}^n function_i(λ) = max_{∀ i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip
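As a worked instance of the dual reformulation above (added for illustration, assuming the quadratic local functions of the averaging problem, f_i(θ) = ½(θ − ξ_i)², scalar case), the inner minimization is explicit and the dual gradient only involves differences between neighbours, so (accelerated) gradient ascent on the dual is implementable with one exchange per edge per iteration:

```latex
% Illustration only: quadratic local functions f_i(\theta) = \tfrac12 (\theta - \xi_i)^2, scalar case.
\begin{aligned}
  \theta^{(i)}(\lambda) &= \xi_i - \sum_{j \,:\, i \sim j} \pm\,\lambda_{ij}
  && \text{(inner minimizer; the sign depends on the edge orientation)}\\
  \frac{\partial\,\mathrm{function}(\lambda)}{\partial \lambda_{ij}}
  &= \theta^{(i)}(\lambda) - \theta^{(j)}(\lambda)
  && \text{(dual gradient = consensus violation on edge } (i,j)\text{)}\\
  \lambda_{ij} &\leftarrow \lambda_{ij} + \eta\,\big(\theta^{(i)}(\lambda) - \theta^{(j)}(\lambda)\big)
  && \text{(dual gradient ascent: one local exchange per edge)}
\end{aligned}
```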

Accelerated gossip

• Regular gossip
  – Iterations: θ_t = W^t θ_0

• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – Optimal polynomial is the Chebyshev polynomial
  – Can be computed online with same cost as regular gossip, e.g.,
    θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
  – Replace γ^{−1} by γ^{−1/2} in rates

Illustration of accelerated gossip


• ⇒ optimal complexity for optimization (Scaman et al., 2017)
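The second-order recursion above can be simulated directly. The sketch below (illustration only, reusing the made-up ring network) compares plain gossip with a Chebyshev-scaled recursion θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}, where ω_t is chosen so that θ_t = P_t(W)ξ for the rescaled Chebyshev polynomial P_t; the spectral bound λ_2 is assumed known here.

```python
import numpy as np

# Made-up ring of n = 9 nodes with Metropolis weights (as in the earlier sketches).
n = 9
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1 / 3
W += np.diag(1 - W.sum(axis=1))

rng = np.random.default_rng(0)
xi = rng.normal(size=n)
avg = xi.mean()
lam2 = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]   # largest non-unit |eigenvalue| (assumed known)

# Plain gossip: theta_t = W^t xi.
plain = xi.copy()
for t in range(30):
    plain = W @ plain

# Chebyshev-accelerated gossip: theta_t = omega_t W theta_{t-1} + (1 - omega_t) theta_{t-2},
# with a_t = T_t(1/lam2) (Chebyshev polynomials) and omega_t = 2 a_{t-1} / (lam2 * a_t).
a_prev, a_curr = 1.0, 1.0 / lam2
th_prev, th_curr = xi.copy(), W @ xi
for t in range(2, 31):
    a_next = 2 * a_curr / lam2 - a_prev
    omega = 2 * a_curr / (lam2 * a_next)
    th_prev, th_curr = th_curr, omega * (W @ th_curr) + (1 - omega) * th_prev
    a_prev, a_curr = a_curr, a_next

print("plain gossip error after 30 steps      :", np.abs(plain - avg).max())
print("accelerated gossip error after 30 steps:", np.abs(th_curr - avg).max())
```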

Distribution in machine learning (and beyond)

• Machine learning through optimization
  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of model defined by θ on dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤ Φ(x_ij)) if m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                      | gradient steps | communication
  Single machine algorithm       | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)     | m√κ            | √(κ/γ)

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slow down when going from 1 to 2 machines
  – Only synchronous
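To fix ideas on the finite-sum structure exploited below, here is a small Python sketch (made-up data, illustration only) of one node's local function f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤ Φ(x_ij)) with the logistic loss, together with its gradient, which is the basic local quantity combined with communication steps in all the algorithms of this table.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 5                                 # m local observations, dimension d (toy sizes)
Phi = rng.normal(size=(m, d))                 # features Phi(x_ij) held by node i
y = rng.choice([-1.0, 1.0], size=m)           # labels y_ij in {-1, +1}

def f_i(theta):
    """Local logistic objective: (1/m) sum_j log(1 + exp(-y_ij * theta^T Phi(x_ij)))."""
    margins = y * (Phi @ theta)
    return np.mean(np.log1p(np.exp(-margins)))

def grad_f_i(theta):
    """Gradient of f_i with respect to theta."""
    margins = y * (Phi @ theta)
    weights = -y / (1 + np.exp(margins))      # per-example coefficient: -y_ij * sigmoid(-margin_ij)
    return Phi.T @ weights / m

theta = np.zeros(d)
print(f_i(theta))                             # equals log(2) at theta = 0
print(np.linalg.norm(grad_f_i(theta)))
```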

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤ Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                       | gradient steps | communication
  Single machine algorithm        | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)      | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019)    | m + √(mκ)      | √(κ/γ)

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² ]
  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10^4 observations per node in R^28
  – Two-dimensional grid network
  (figure: running-time curves for n = 4 and n = 100)

• Running times on an actual cluster
  – Logistic regression with nm ≈ 10^5 observations in R^47236
  – Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.



Page 45: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 46: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number)

  Stochastic gradient descent:   d × κ × 1/ε
  Gradient descent:              d × nκ × log(1/ε)
  Variance reduction:            d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
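A minimal sketch of this variance-reduced update, in the spirit of SAGA (same synthetic least-squares setup and hypothetical names as in the previous sketch; the 1/(3L) step size is the usual rule of thumb and is an assumption, not taken from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 1e-2
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L_max = (X ** 2).sum(axis=1).max() + lam      # bound on the individual smoothness constants
gamma = 1.0 / (3.0 * L_max)

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])   # stored gradients y_i
table_mean = table.mean(axis=0)

for t in range(20 * n):
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    # update direction = new gradient + average of stored gradients - stored gradient of example i
    theta -= gamma * (g_new + table_mean - table[i])
    table_mean += (g_new - table[i]) / n      # keep the average of the table up to date
    table[i] = g_new

print("objective:", 0.5 * np.mean((X @ theta - y) ** 2) + 0.5 * lam * theta @ theta)
```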

Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if node i holds m_i observations

• Each dataset / function f_i is only accessible by node i in a graph

  [Figure: network of 9 nodes, each holding one local function f_i]

  – Massive datasets, multiple machines / cores
  – Communication / legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Each full gradient requires distributed averaging over the network (see the sketch below)
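A toy sketch of this last point (hypothetical names, synthetic local datasets; in practice the averaging of the local gradients would go through MPI, a parameter server, or the spanning-tree scheme of the next slide rather than a Python list):

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes, m, d = 9, 50, 10
# node i holds its own local dataset (X_i, y_i), which is never shared directly
local_data = [(rng.standard_normal((m, d)), rng.standard_normal(m)) for _ in range(n_nodes)]

def local_gradient(theta, i):
    Xi, yi = local_data[i]
    return Xi.T @ (Xi @ theta - yi) / m          # gradient of the local function f_i

theta = np.zeros(d)
gamma = 0.01
for t in range(200):
    grads = [local_gradient(theta, i) for i in range(n_nodes)]   # computed in parallel in practice
    theta -= gamma * np.mean(grads, axis=0)      # every step needs one average over the network
```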

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, . . . , ξ_n ∈ R, compute

  θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

  [Figure: 9-node network and a spanning tree extracted from it]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)
  – Robustness?

• Decentralized algorithms - gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors, Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip
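A small simulation of synchronous gossip (a sketch only; the ring graph and its weights are an assumption, chosen so that W is symmetric and doubly stochastic):

```python
import numpy as np

n = 9
W = np.zeros((n, n))                  # ring graph: node i averages with its two neighbours
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25          # W is symmetric and doubly stochastic

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)           # local observations xi_1, ..., xi_n
theta = xi.copy()

eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigvals[0] - eigvals[1]       # eigengap = 1 - lambda_2(W)
print("eigengap gamma:", gamma)

for t in range(300):
    theta = W @ theta                 # one synchronous gossip step
print("max deviation from the true mean:", np.max(np.abs(theta - xi.mean())))
```

On such a ring the eigengap is small (of order 1/n²), so many iterations are needed, which is exactly the 1/γ dependence stated above.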

Decentralized optimization

• Mixing gossip and optimization
  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip is not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

  min_{θ∈R^d} Σ_{i=1}^n f_i(θ)
    = min_{θ^(1),...,θ^(n)∈R^d} Σ_{i=1}^n f_i(θ^(i)) such that ∀ i ∼ j, θ^(i) = θ^(j)
    = min_{θ^(1),...,θ^(n)∈R^d} max_{∀ i∼j, λ_ij∈R^d} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i∼j} λ_ij^⊤ (θ^(i) − θ^(j))
    = max_{∀ i∼j, λ_ij∈R^d} min_{θ^(1),...,θ^(n)∈R^d} Σ_{i=1}^n f_i(θ^(i)) + Σ_{i=1}^n [θ^(i)]^⊤ linear_i(λ)
    = max_{∀ i∼j, λ_ij∈R^d} Σ_{i=1}^n function_i(λ) = max_{∀ i∼j, λ_ij∈R^d} function(λ)

• Accelerated gradient descent on the dual (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip (a sketch of the gossip/gradient alternation follows)
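The gossip/gradient alternation can be illustrated with a much simpler, non-accelerated primal scheme in which each node takes a local gradient step and then gossips its iterate. This is only a sketch of the alternation, not the dual method of Scaman et al. (2017): with a constant step size such a plain scheme only reaches a neighbourhood of the optimum, which is one reason exact decentralized or dual methods are needed. All names and the local quadratics are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 9, 5
A = rng.standard_normal((n, d, d))    # local functions f_i(theta) = 0.5 * ||A_i theta - b_i||^2
b = rng.standard_normal((n, d))

W = np.zeros((n, n))                  # gossip matrix on a ring (symmetric, doubly stochastic)
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

Theta = np.zeros((n, d))              # one local iterate theta^(i) per node (one row per node)
gamma = 0.01
for t in range(1000):
    grads = np.stack([A[i].T @ (A[i] @ Theta[i] - b[i]) for i in range(n)])
    Theta = W @ (Theta - gamma * grads)        # local gradient step, then one gossip step

print("disagreement across nodes:", np.max(np.std(Theta, axis=0)))
```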

Accelerated gossip

• Regular gossip
  – Iterations: θ_t = W^t θ_0

• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – The optimal polynomial P_t is the Chebyshev polynomial
  – It can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
  – Replaces γ^{−1} by γ^{−1/2} in the rates

Illustration of accelerated gossip

• ⇒ Optimal complexity for optimization (Scaman et al., 2017), when combined with the dual approach above
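A sketch of accelerated gossip (an assumption-laden variant: instead of the time-varying ω_t of the Chebyshev recurrence, it uses the stationary shift-register recursion with a fixed relaxation parameter computed from λ_2(W); the ring graph is again hypothetical). It shows the square-root speed-up over plain gossip empirically.

```python
import numpy as np

n = 30
W = np.zeros((n, n))                       # same kind of ring graph as before
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)
mean = xi.mean()

lam2 = np.sort(np.linalg.eigvalsh(W))[-2]        # second-largest eigenvalue of W
omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))   # fixed relaxation parameter (assumption:
# stationary second-order recursion rather than the time-varying Chebyshev weights)

theta = W @ xi                             # plain gossip iterate
eta_prev, eta = xi.copy(), W @ xi          # accelerated gossip keeps two past iterates
for t in range(100):
    theta = W @ theta
    eta, eta_prev = omega * (W @ eta) + (1.0 - omega) * eta_prev, eta

print("plain gossip error      :", np.max(np.abs(theta - mean)))
print("accelerated gossip error:", np.max(np.abs(eta - mean)))
```

On this ring, 100 plain gossip steps barely reduce the error, while the accelerated recursion is already several orders of magnitude closer to the mean, matching the γ^{−1} vs. γ^{−1/2} comparison above.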

Distribution in machine learning (and beyond)

• Machine learning through optimization: min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if node i holds m_i observations

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                        gradient steps    communication steps
  Single machine algorithm         nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)       m√κ               √(κ/γ)

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slowdown when going from 1 to 2 machines
  – Only synchronous

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ⊤Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                        gradient steps    communication steps
  Single machine algorithm         nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)       m√κ               √(κ/γ)
  ADFS (Hendrikx et al., 2019)     m + √(mκ)         √(κ/γ)

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n ( Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² )
  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning: running times on an actual cluster

  – Logistic regression with m = 10^4 observations per node in R^28, on a two-dimensional grid network (n = 4 and n = 100 nodes)
  – Logistic regression with nm ≈ 10^5 observations in R^47236, on a two-dimensional grid network with n = 100 nodes

  [Figures: running-time comparisons on the cluster for both settings]

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 47: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

ndash Exponential convergence rate in O(eminustκ) for convex problems

ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)

ndash Iteration complexity is linear in n

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)

ndash Iteration complexity is independent of n

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 48: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Stochastic vs deterministic methods

bull Minimizing g(θ) =1

n

nsum

i=1

fi(θ) with fi(θ) = ℓ(

yi h(xi θ))

+ λΩ(θ)

bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ

n

nsum

i=1

nablafi(θtminus1)

bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

Stochastic vs deterministic methods

bull Goal = best of both worlds Linear rate with O(d) iteration cost

Simple choice of step size

time

log(excess

cost)

deterministic

stochastic

new

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

θt = θtminus1 minus γ[

nablafi(t)(θtminus1)+1

n

nsum

i=1

ytminus1i minus ytminus1

i(t)

]

(with yti stored value at time t of gradient of the i-th function)

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps

  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^{m} ℓ(y_ij, θ^⊤Φ(x_ij))

  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                       gradient steps    communication
  Single machine algorithm        nm + √(nmκ)       0
  MSDA (Scaman et al., 2017)      m√κ               √(κ/γ)
  ADFS (Hendrikx et al., 2019)    m + √(mκ)         √(κ/γ)
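As a quick sanity check on the table, the sketch below plugs illustrative values of n, m, and κ into the leading-order gradient-step counts (constants and log(1/ε) factors dropped); the numbers are assumptions for illustration, not measurements from the paper.

```python
import math

def gradient_step_counts(n, m, kappa):
    # Leading terms from the table above (log(1/eps) factors and constants dropped).
    single = n * m + math.sqrt(n * m * kappa)      # best single-machine incremental method
    msda = m * math.sqrt(kappa)                    # MSDA: full local gradients at each step
    adfs = m + math.sqrt(m * kappa)                # ADFS: stochastic local steps
    return single, msda, adfs

if __name__ == "__main__":
    n, m, kappa = 100, 10_000, 1_000               # illustrative values only
    single, msda, adfs = gradient_step_counts(n, m, kappa)
    print(f"single machine: {single:12,.0f} gradient steps")
    print(f"MSDA per node : {msda:12,.0f}")
    print(f"ADFS per node : {adfs:12,.0f}  (vs single machine: ~{single / adfs:.0f}x fewer)")
```

When κ ≤ m (well-conditioned at the scale of one node), m + √(mκ) is within a factor of two of m, so splitting nm observations over n nodes divides the per-node gradient-step count by roughly n, which is the linear speed-up stated in the conclusions.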

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^{n} [ Σ_{j=1}^{m} f_ij(θ) + (σ_i/2) ‖θ‖² ]

  – Create an equivalent graph

  – Dual randomized coordinate ascent (with non-uniform sampling)

  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

  – Logistic regression with m = 10^4 observations per node in R^28

  – Two-dimensional grid network

  [figures: running times for n = 4 and n = 100]

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

  – Logistic regression with mn ≈ 10^5 observations in R^47236

  – Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

  – Distributing the fastest single machine algorithms

  – n machines and m observations per machine

  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps

  – Linear speed-ups for well-conditioned problems

• Extensions

  – Beyond convex problems

  – Matching running-time complexity lower bounds

  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 53: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Recent progress in single machine optimization

bull Variance reduction

ndash Exponential convergence with O(d) iteration cost

ndash SAG (Le Roux Schmidt and Bach 2012)

ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)

ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc

bull Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent dtimes∣

∣κ times 1

ε

Gradient descent dtimes∣

∣nκ times log 1

ε

Variance reduction dtimes∣

∣(n+ κ) times log 1

ε

ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ

ndash Matching upper and lower bounds of complexity

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 54: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Outline

1 Parametric supervised learning on a single machine

ndash Machine learning asymp optimization of finite sums

ndash From batch to stochastic gradient methods

ndash Linearly-convergent stochastic methods for convex problems

2 Machine learning over networks

minus Centralized and decentralized methods

minus From network averaging to optimization

minus Distributing the fastest single machine algorithms

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

1 3

24

57

6 9

8

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Each dataset function fi only accessible by node i in a graph

ndash Massive datasets multiple machines cores

ndash Communication legal constraints

bull Goal Minimize communication and local computation costs

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

bull Why not simply distributing a simple single machine algorithm

ndash (accelerated) gradient descent (see eg Nesterov 2004)

θt = θtminus1 minus γnablag(θtminus1)

ndash Requiresradicκ log 1

ε full gradient computations to reach precision ε

ndash Need to perform distributed averaging over a network

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

– Logistic regression with m = 10^4 observations per node in R^28

– Two-dimensional grid network

[Figure: running times for n = 4 and n = 100 nodes]

– Logistic regression with mn ≈ 10^5 observations in R^47236

– Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms

– n machines and m observations per machine

– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps

– Linear speed-ups for well-conditioned problems
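
As a purely illustrative numerical check of the speed-up claim (values not from the talk): with n = 100 machines, m = 10^4 observations per machine and κ = 10^2, the two gradient-step counts above compare as follows.

```python
import math

n, m, kappa = 100, 1e4, 1e2                      # illustrative values
single = n * m + math.sqrt(n * m * kappa)        # nm + sqrt(nm * kappa)
distributed = m + math.sqrt(m * kappa)           # m  + sqrt(m * kappa)
print(single, distributed, single / distributed) # speed-up close to n when kappa <= m
```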


• Extensions

– Beyond convex problems

– Matching running-time complexity lower bounds

– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.





Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ ∈ ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distribute a simple single-machine algorithm?

  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network (see the sketch below)
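To make the reliance on averaging concrete, here is a minimal NumPy sketch (not from the slides; the quadratic local losses, sizes, and step size are illustrative assumptions) of gradient descent on g(θ) = (1/n) Σ_i f_i(θ), where every iteration needs the average of the n local gradients, i.e., exactly the distributed-averaging primitive discussed next.

import numpy as np

rng = np.random.default_rng(0)
d, n_nodes = 5, 8

# Hypothetical local losses f_i(θ) = (1/2m) ||A_i θ − b_i||²  (illustrative only)
A = [rng.standard_normal((20, d)) for _ in range(n_nodes)]
b = [rng.standard_normal(20) for _ in range(n_nodes)]

def local_grad(i, theta):
    # ∇f_i(θ), computed from node i's data only
    return A[i].T @ (A[i] @ theta - b[i]) / len(b[i])

theta = np.zeros(d)
gamma = 0.05  # step size, assumed small enough for these random problems
for t in range(200):
    # each step requires averaging the n local gradients over the network
    g = sum(local_grad(i, theta) for i in range(n_nodes)) / n_nodes
    theta = theta - gamma * g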

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ

  – Compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error (see the sketch after this slide)

  [figure: 9-node example network and its spanning tree]

• Application to centralized distributed optimization

  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness
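As an illustration of the master/slave scheme on a spanning tree, a hedged Python sketch (the tree, rooted at node 1, and the local values are made up): partial sums are aggregated from the leaves to the root and the exact average is broadcast back, so with synchronized rounds the number of communication steps is of the order of the tree depth, at most ∆ in each direction, and there is no error.

# Exact averaging over a spanning tree (illustrative 9-node tree, node -> parent).
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 3, 6: 3, 7: 5, 8: 6, 9: 6}
values = {v: float(v) for v in parent}   # hypothetical local observations ξ_i

children = {v: [] for v in parent}
for v, p in parent.items():
    if p is not None:
        children[p].append(v)

def up_sweep(v):
    """Return (sum, count) of the subtree rooted at v (leaves-to-root pass)."""
    s, c = values[v], 1
    for w in children[v]:
        sw, cw = up_sweep(w)
        s, c = s + sw, c + cw
    return s, c

root = next(v for v, p in parent.items() if p is None)
total, count = up_sweep(root)
average = total / count                   # exact mean θ*, no error

consensus = {v: average for v in parent}  # down sweep: the root broadcasts θ*
print(average, sum(values.values()) / len(values))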

Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ

  – Compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006)

  – Replace θ_i by a weighted average of its neighbors, Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

  [figure: 9-node example network]

• Synchronous gossip (all nodes simultaneously)

  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging); a numerical sketch follows

  [figure: 9-node example network]
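For concreteness, a small NumPy sketch of synchronous gossip (the 9-node ring, the weights W = I − L/(d_max + 1), and the initial values are illustrative assumptions, not the deck's example): it reports the eigengap γ = 1 − λ_2(W) and runs roughly (1/γ) log(1/ε) iterations of θ_t = W θ_{t−1}.

import numpy as np

n = 9                                            # illustrative ring of 9 nodes
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0  # undirected ring edges
deg = A.sum(axis=1)
L = np.diag(deg) - A                             # graph Laplacian
W = np.eye(n) - L / (deg.max() + 1.0)            # symmetric, doubly stochastic

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigs[0] - eigs[1]                        # eigengap 1 - λ2(W)

xi = np.random.default_rng(0).standard_normal(n) # local observations ξ
theta = xi.copy()
for t in range(int(np.ceil(np.log(1e2) / gamma))):  # ≈ (1/γ) log(1/ε), ε = 1e-2
    theta = W @ theta                            # θ_t = W θ_{t-1}

print(gamma, np.max(np.abs(theta - xi.mean())))  # all coordinates close to the mean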

Illustration of synchronous gossip

Decentralized optimization

• Mixing gossip and optimization

  – Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012); Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015); Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016); Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)

  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Plain gossip not optimal (one would need to gossip gradients with increasing precision)

• Is this lower bound achievable?

Dual reformulation (Jakovetic et al., 2015)

min_{θ ∈ ℝ^d} Σ_{i=1}^n f_i(θ)
  = min_{θ(1), …, θ(n) ∈ ℝ^d} Σ_{i=1}^n f_i(θ(i))  such that  ∀ i ∼ j, θ(i) = θ(j)
  = min_{θ(1), …, θ(n) ∈ ℝ^d}  max_{∀ i∼j, λ_ij ∈ ℝ^d}  Σ_{i=1}^n f_i(θ(i)) + Σ_{i∼j} λ_ij^⊤ (θ(i) − θ(j))
  = max_{∀ i∼j, λ_ij ∈ ℝ^d}  min_{θ(1), …, θ(n) ∈ ℝ^d}  Σ_{i=1}^n f_i(θ(i)) + Σ_{i=1}^n [θ(i)]^⊤ linear_i(λ)
  = max_{∀ i∼j, λ_ij ∈ ℝ^d}  Σ_{i=1}^n function_i(λ)  =  max_{∀ i∼j, λ_ij ∈ ℝ^d}  function(λ)

(the linear_i / function_i shorthands are spelled out after this slide)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step

  – √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – Not optimal ⇒ need accelerated gossip
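For readers reconstructing the derivation, the shorthands above can be spelled out with Fenchel conjugates (a side note in my own notation; the edge-orientation sign convention is an assumption):

\[
\mathrm{linear}_i(\lambda) = \sum_{j : i \sim j} \pm \lambda_{ij},
\qquad
\mathrm{function}_i(\lambda) = \min_{\theta^{(i)} \in \mathbb{R}^d} \Big( f_i(\theta^{(i)}) + [\theta^{(i)}]^\top \mathrm{linear}_i(\lambda) \Big) = -f_i^*\big(-\mathrm{linear}_i(\lambda)\big),
\]

where f_i^* is the Fenchel conjugate of f_i and the sign of λ_ij is + on the "i side" of the edge and − on its "j side"; the resulting function(λ) is concave, and smooth when the f_i are strongly convex, which is what makes accelerated gradient methods applicable in the dual.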

Accelerated gossip

• Regular gossip

  – Iterations: θ_t = W^t θ_0

• Accelerated gossip

  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – Optimal polynomial is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g., with the two-step recursion θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
  – Replace γ^{−1} by γ^{−1/2} in the rates

• ⇒ Optimal complexity for optimization (Scaman et al., 2017)

Illustration of accelerated gossip
[figure: plain vs. accelerated gossip; a numerical sketch follows]
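A hedged NumPy sketch comparing plain gossip with the two-step (shift-register) recursion above, on the same illustrative ring as before; the constant weight ω = 2/(1 + √(1 − λ_2(W)²)) is a standard choice and an assumption here, not the deck's exact constants. It preserves the mean exactly and contracts the disagreement roughly like (1 − √γ)^t instead of (1 − γ)^t.

import numpy as np

n = 9                                            # same illustrative 9-node ring
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
deg = A.sum(axis=1)
W = np.eye(n) - (np.diag(deg) - A) / (deg.max() + 1.0)

lam2 = np.sort(np.linalg.eigvalsh(W))[::-1][1]   # second-largest eigenvalue
omega = 2.0 / (1.0 + np.sqrt(1.0 - lam2 ** 2))   # assumed shift-register weight

xi = np.random.default_rng(0).standard_normal(n)
plain = xi.copy()
prev, curr = xi.copy(), xi.copy()                # θ_{-1} = θ_0 = ξ
for t in range(40):
    plain = W @ plain                                           # regular gossip
    prev, curr = curr, omega * (W @ curr) + (1 - omega) * prev  # accelerated

def disagreement(x):
    return np.max(np.abs(x - xi.mean()))

print(disagreement(plain), disagreement(curr))   # accelerated is much closer to the mean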

Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ ∈ ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤ Φ(x_ij)) if m_i observations (a toy instance is sketched after this slide)

• Single machine vs. "optimal" decentralized algorithm

  Algorithm                      | gradient steps | communication
  Single machine algorithm       | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)     | m√κ            | √(κ/γ)

• MSDA (Scaman et al., 2017)

  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slowdown when going from 1 to 2 machines
  – Only synchronous
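To make the per-node objective f_i concrete, a hedged NumPy sketch of the local logistic loss and its gradient (data shapes and the feature map are illustrative; ℓ(y, z) = log(1 + exp(−yz)) with labels y ∈ {−1, +1}):

import numpy as np

rng = np.random.default_rng(0)
m_i, d = 100, 28                        # illustrative sizes (cf. the R^28 experiments)
Phi_x = rng.standard_normal((m_i, d))   # rows play the role of Φ(x_ij)
y = rng.choice([-1.0, 1.0], size=m_i)

def f_i(theta):
    # (1/m_i) Σ_j log(1 + exp(-y_ij θ^T Φ(x_ij)))
    margins = y * (Phi_x @ theta)
    return np.mean(np.logaddexp(0.0, -margins))

def grad_f_i(theta):
    margins = y * (Phi_x @ theta)
    s = -y / (1.0 + np.exp(margins))    # dℓ/dz evaluated at z = θ^T Φ(x_ij)
    return Phi_x.T @ s / m_i

theta0 = np.zeros(d)
print(f_i(theta0), np.linalg.norm(grad_f_i(theta0)))  # f_i(0) = log(2) ≈ 0.693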

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps

  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤ Φ(x_ij))
  – Allows for partial asynchrony

• n computing nodes with m observations each

  Algorithm                      | gradient steps | communication
  Single machine algorithm       | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)     | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019)   | m + √(mκ)      | √(κ/γ)

ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n [ Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖² ]

  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

  – Logistic regression with m = 10^4 observations per node in ℝ^28
  – Two-dimensional grid network

  [figures: running times for n = 4 and n = 100]

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

  – Logistic regression with mn ≈ 10^5 observations in ℝ^47236
  – Two-dimensional grid network with n = 100 nodes

  [figure: running times]

Conclusions

• Distributed decentralized machine learning

  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions

  – Beyond convex problems
  – Matching running time complexity lower bounds
  – Experiments on large-scale clouds


Page 59: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 60: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 61: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

1 3

24

57

6 9

81 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

• Regular gossip
  – Iterations: θ_t = W θ_{t−1} = W^t θ_0

• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ \eta_t = \sum_{k=0}^{t} \alpha_k \theta_k = \sum_{k=0}^{t} \alpha_k W^k \xi = P_t(W)\,\xi
  – The optimal polynomial P_t is the Chebyshev polynomial
  – It can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2} (see the numerical sketch below)
  – Effect: replace γ^{−1} by γ^{−1/2} in the rates

[Illustration of accelerated gossip: figure omitted]

• ⇒ optimal complexity for optimization (Scaman et al., 2017)
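As a numerical illustration, here is a minimal, self-contained sketch comparing plain gossip with the two-term recursion above, using Chebyshev weights ω_t (Python/NumPy); the ring graph, the particular gossip matrix, and taking ρ as the spectral radius of W orthogonally to the consensus direction are illustrative assumptions, not the exact setup of the slides:

import numpy as np

def ring_gossip_matrix(n):
    """Doubly stochastic gossip matrix of a ring: each node averages with its two neighbours."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] = 1.0 / 3.0
        W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def plain_gossip(W, xi, T):
    theta = xi.copy()
    for _ in range(T):
        theta = W @ theta
    return theta

def chebyshev_gossip(W, xi, T, rho):
    """Accelerated gossip theta_t = omega_t W theta_{t-1} + (1 - omega_t) theta_{t-2},
    with Chebyshev weights; rho bounds the spectrum of W orthogonally to the consensus direction."""
    theta_prev, theta = xi.copy(), W @ xi          # iterates at t = 0 and t = 1
    c_prev, c_curr = 1.0, 1.0 / rho                # Chebyshev values T_0(1/rho), T_1(1/rho)
    for _ in range(2, T + 1):
        c_next = (2.0 / rho) * c_curr - c_prev     # Chebyshev recurrence for T_t(1/rho)
        omega = 2.0 * c_curr / (rho * c_next)
        theta, theta_prev = omega * (W @ theta) + (1.0 - omega) * theta_prev, theta
        c_prev, c_curr = c_curr, c_next
    return theta

if __name__ == "__main__":
    n = 50
    rng = np.random.default_rng(0)
    xi = rng.normal(size=n)
    W = ring_gossip_matrix(n)
    eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
    rho = max(abs(eigs[1]), abs(eigs[-1]))         # spectral radius off the consensus direction
    avg = xi.mean()
    for T in [10, 50, 100]:
        e_plain = np.max(np.abs(plain_gossip(W, xi, T) - avg))
        e_acc = np.max(np.abs(chebyshev_gossip(W, xi, T, rho) - avg))
        print(f"T={T:4d}  plain gossip: {e_plain:.2e}   accelerated: {e_acc:.2e}")

On this ring, the accelerated recursion reaches a given precision in roughly the square root of the number of plain-gossip iterations, which is the γ^{−1} → γ^{−1/2} improvement stated above.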

Distribution in machine learning (and beyond)

• Machine learning through optimization:
  \min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} f_i(\theta) = g(\theta)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(\theta) = \frac{1}{m_i} \sum_{j=1}^{m_i} \ell\big(y_{ij}, \theta^\top \Phi(x_{ij})\big) if node i holds m_i observations (a small sketch of such a local objective follows below)

• Single machine vs. "optimal" decentralized algorithm:

  Algorithm                    | gradient steps | communication steps
  Single machine algorithm     | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)   | m√κ            | √(κ/γ)

• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slow-down when going from 1 to 2 machines
  – Only synchronous
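For instance, with the logistic loss ℓ(y, a) = log(1 + e^{−ya}), the local objective f_i and its gradient can be sketched as follows; the array layout (rows of X_i holding Φ(x_ij)) and labels in {−1, +1} are illustrative assumptions (Python/NumPy):

import numpy as np

def local_logistic_objective(theta, X_i, y_i):
    """f_i(theta) = (1/m_i) * sum_j log(1 + exp(-y_ij * theta^T Phi(x_ij))) and its gradient.
    X_i: (m_i, d) array whose j-th row is Phi(x_ij); y_i: (m_i,) array of labels in {-1, +1}."""
    margins = y_i * (X_i @ theta)
    loss = np.mean(np.logaddexp(0.0, -margins))              # stable log(1 + exp(-margin))
    # derivative of log(1 + exp(-m)) is -1 / (1 + exp(m)) = -0.5 * (1 - tanh(m / 2))
    weights = -y_i * 0.5 * (1.0 - np.tanh(margins / 2.0))
    grad = X_i.T @ weights / X_i.shape[0]
    return loss, grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_i = rng.normal(size=(100, 5))
    y_i = np.sign(rng.normal(size=100))
    print(local_logistic_objective(np.zeros(5), X_i, y_i)[0])  # equals log(2) at theta = 0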

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(\theta) = \frac{1}{m} \sum_{j=1}^{m} \ell\big(y_{ij}, \theta^\top \Phi(x_{ij})\big)
  – Allows for partial asynchrony

• n computing nodes with m observations each:

  Algorithm                      | gradient steps | communication steps
  Single machine algorithm       | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)     | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019)   | m + √(mκ)      | √(κ/γ)

ADFS - Algorithm principle

• Minimizing \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} f_{ij}(\theta) + \frac{\sigma_i}{2} \|\theta\|^2 \Big)
  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling); a generic sketch of this primitive is given below
  – Decoupling of data and gossip steps
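To fix ideas, here is a minimal, generic sketch of randomized coordinate descent with non-uniform (smoothness-proportional) sampling on a strongly convex quadratic; it only illustrates the primitive that ADFS applies to its dual problem, not ADFS itself, and the quadratic objective and sampling rule are illustrative assumptions (Python/NumPy):

import numpy as np

def nonuniform_coordinate_descent(A, b, iters, rng):
    """Minimize F(x) = 0.5 * x^T A x - b^T x (A symmetric positive definite)
    by updating one coordinate at a time, sampled with probability proportional
    to its coordinate-wise smoothness A[k, k]."""
    d = b.shape[0]
    probs = np.diag(A) / np.trace(A)
    x = np.zeros(d)
    for _ in range(iters):
        k = rng.choice(d, p=probs)
        grad_k = A[k] @ x - b[k]            # partial derivative along coordinate k
        x[k] -= grad_k / A[k, k]            # exact minimization along coordinate k
        # (in ADFS, a "coordinate" corresponds either to one data point or to one
        #  edge of the augmented graph, which decouples data steps from gossip steps)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.normal(size=(10, 10))
    A = M @ M.T + np.eye(10)                # random symmetric positive definite matrix
    b = rng.normal(size=10)
    x = nonuniform_coordinate_descent(A, b, iters=2000, rng=rng)
    print(np.linalg.norm(A @ x - b))        # residual near 0: x approaches the minimizer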

Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10^4 observations per node in R^28
  – Two-dimensional grid network
  [Figure: running-time comparison for n = 4 and n = 100 nodes]

• Running times on an actual cluster
  – Logistic regression with mn ≈ 10^5 observations in R^47236
  – Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems

• Extensions
  – Beyond convex problems
  – Matching running-time complexity lower bounds
  – Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.


Page 63: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 64: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Centralized algorithms

ndash Compute a spanning tree with diameter 6 2∆

ndash Masterslave algorithm ∆ communication steps + no error

bull Application to centralized distributed optimization

ndashradicκ log 1

ε gradient steps andradicκ∆log 1

ε communication steps

ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)

bull Robustness

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 65: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

• Regular gossip

– Iterations: $\theta_t = W^t \theta_0$

• Accelerated gossip

– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)

– Shift-register gossip (Cao et al., 2006)

– Linear combinations ⇔ $\eta_t = \sum_{k=0}^{t} \alpha_k \theta_k = \sum_{k=0}^{t} \alpha_k W^k \xi = P_t(W)\xi$

– Optimal polynomial is the Chebyshev polynomial

– Can be computed online with the same cost as regular gossip, e.g., $\theta_t = \omega_t W \theta_{t-1} + (1 - \omega_t)\theta_{t-2}$

– Replace $\gamma^{-1}$ by $\gamma^{-1/2}$ in rates

• ⇒ optimal complexity for optimization (Scaman et al., 2017)

Illustration of accelerated gossip
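
A minimal sketch comparing plain gossip with the accelerated two-step recursion above. The ring network, the choice of the interval parameter $\sigma$, and the Chebyshev weights $\omega_t$ below are assumptions made for illustration; they are one standard instance of Chebyshev acceleration, not the exact scheme of Scaman et al. (2017).

```python
import numpy as np

n = 9                                     # assumption: ring network with 9 nodes
xi = np.random.randn(n)
target = xi.mean()

# Symmetric doubly stochastic gossip matrix on a ring (as in the previous sketch).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
sigma = max(abs(eigs[1]), abs(eigs[-1]))  # spectral radius of W off the consensus direction

# Plain gossip: theta_t = W^t xi
theta = xi.copy()

# Chebyshev-accelerated gossip: eta_t = P_t(W) xi with P_t(x) = T_t(x/sigma) / T_t(1/sigma),
# computed online via eta_t = omega_t W eta_{t-1} + (1 - omega_t) eta_{t-2}.
eta_prev, eta = xi.copy(), W @ xi         # P_0(W) xi and P_1(W) xi
a_prev, a = 1.0, 1.0 / sigma              # a_t = T_t(1/sigma), Chebyshev three-term recurrence

for t in range(2, 31):
    theta = W @ theta
    a_next = (2.0 / sigma) * a - a_prev
    omega = 2.0 * a / (sigma * a_next)
    eta, eta_prev = omega * (W @ eta) + (1.0 - omega) * eta_prev, eta
    a_prev, a = a, a_next
    if t % 10 == 0:
        print(f"t = {t:2d}  plain: {np.max(np.abs(theta - target)):.2e}"
              f"  accelerated: {np.max(np.abs(eta - target)):.2e}")
```

The accelerated iterates contract at a rate governed by $\gamma^{-1/2}$ rather than $\gamma^{-1}$, which is exactly the improvement quoted in the rates above.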

Distribution in machine learning (and beyond)

• Machine learning through optimization:
$$
\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n f_i(\theta) = g(\theta)
$$

– $f_i(\theta)$: error of the model defined by $\theta$ on the dataset indexed by $i$

– Example: $f_i(\theta) = \frac{1}{m_i} \sum_{j=1}^{m_i} \ell\bigl(y_{ij}, \theta^\top \Phi(x_{ij})\bigr)$ if $m_i$ observations

• Single machine vs. "optimal" decentralized algorithm

Algorithm                    | gradient steps           | communication
Single machine algorithm     | $nm + \sqrt{nm\kappa}$   | 0
MSDA (Scaman et al., 2017)   | $m\sqrt{\kappa}$         | $\sqrt{\kappa/\gamma}$

• MSDA (Scaman et al., 2017)

– $\sqrt{\kappa}\log\frac{1}{\varepsilon}$ gradient steps and $\sqrt{\kappa/\gamma}\log\frac{1}{\varepsilon}$ communication steps

– "Optimal", but still not adapted to machine learning

– Huge slowdown when going from 1 to 2 machines

– Only synchronous

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Trade-offs between gradient and communication steps

– Adapted to functions of the type $f_i(\theta) = \frac{1}{m} \sum_{j=1}^{m} \ell\bigl(y_{ij}, \theta^\top \Phi(x_{ij})\bigr)$

– Allows for partial asynchrony

• n computing nodes with m observations each

Algorithm                     | gradient steps           | communication
Single machine algorithm      | $nm + \sqrt{nm\kappa}$   | 0
MSDA (Scaman et al., 2017)    | $m\sqrt{\kappa}$         | $\sqrt{\kappa/\gamma}$
ADFS (Hendrikx et al., 2019)  | $m + \sqrt{m\kappa}$     | $\sqrt{\kappa/\gamma}$
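
To get a feel for the orders of magnitude in this table, a quick back-of-envelope sketch; the values of $n$, $m$, $\kappa$, $\gamma$ below are made-up assumptions, not taken from the slides or experiments.

```python
import math

# Hypothetical problem sizes (assumptions, for illustration only):
n, m = 100, 10_000          # computing nodes and observations per node
kappa = 1_000               # condition number of the objective
gamma = 0.01                # eigengap of the gossip matrix (grid-like network)

single_machine = n * m + math.sqrt(n * m * kappa)   # gradient steps, no communication
msda_grad      = m * math.sqrt(kappa)               # MSDA (Scaman et al., 2017)
adfs_grad      = m + math.sqrt(m * kappa)           # ADFS (Hendrikx et al., 2019)
communication  = math.sqrt(kappa / gamma)           # same order for MSDA and ADFS

print(f"single machine : {single_machine:12.0f} gradient steps")
print(f"MSDA           : {msda_grad:12.0f} gradient steps + {communication:.0f} communications")
print(f"ADFS           : {adfs_grad:12.0f} gradient steps + {communication:.0f} communications")
print(f"speed-up of ADFS over single machine: x{single_machine / adfs_grad:.0f} (n = {n} nodes)")
```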

ADFS - Algorithm principle

• Minimizing $\displaystyle \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{m} f_{ij}(\theta) + \frac{\sigma_i}{2}\|\theta\|^2 \Bigr)$

– Create an equivalent graph

– Dual randomized coordinate ascent (with non-uniform sampling)

– Decoupling of data and gossip steps

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulie, 2019)

• Running times on an actual cluster

– Logistic regression with $m = 10^4$ observations per node in $\mathbb{R}^{28}$

– Two-dimensional grid network

[Figure: running-time curves, panels n = 4 and n = 100]

• Running times on an actual cluster

– Logistic regression with $mn \approx 10^5$ observations in $\mathbb{R}^{47236}$

– Two-dimensional grid network with n = 100 nodes

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms

– n machines and m observations per machine

– From $nm + \sqrt{nm\kappa}$ (single machine) to $m + \sqrt{m\kappa}$ gradient steps

– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems

– Matching running time complexity lower bounds

– Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 66: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

1 3

24

57

6 9

8

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 67: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Classical algorithms for distributed averaging

bull Goal Given n observations ξ1 ξn isin R

ndash Compute θlowast =1

n

nsum

i=1

ξi = argminθisinR

1

n

nsum

i=1

(θ minus ξi)2

bull Decentralized algorithms - gossip (Boyd et al 2006)

ndash Replace θi by a weighted average of its neighborssumn

j=1Wijθjndash Potential asynchrony changing network

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 68: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 69: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Convergence of synchronous gossip

bull Synchronous gossip (all nodes simultaneously)

ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ

ndash Typical assumption W symmetric doubly stochastic matrix

ndash Consequence Eigenvalues(W ) isin [minus1 1]

ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )

ndash γminus1 = mixing time of the associated Markov chain

1 3

24

57

6 9

8

ndash Need 1γ log

1ε iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 71: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Decentralized optimization

bull Mixing gossip and optimization

ndash Nedic and Ozdaglar (2009) Duchi et al (2012) Wei and Ozdaglar

(2012) Iutzeler et al (2013) Shi et al (2015) Jakovetic et al

(2015) Nedich et al (2016) Mokhtari et al (2016) Colin et al

(2016) Scaman et al (2017) etc

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 72: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 73: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Decentralized optimization

bull Mixing gossip and optimization

bull Lower bound on complexity (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash Plain gossip not optimal

(need to gossip gradients with increasing precision)

bull Is this lower bound achievable

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 74: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

Dual reformulation (Jakovetic et al 2015)

minθisinRd

nsum

i=1

fi(θ) = minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) such that foralli sim j θ(i) = θ(j)

= minθ(1)θ(n)isinRd

maxforallisimjλijisinRd

nsum

i=1

fi(θ(i)) +

sum

isimj

λ⊤ij(θ

(i)minusθ(j))

= maxforallisimjλijisinRd

minθ(1)θ(n)isinRd

nsum

i=1

fi(θ(i)) +

nsum

i=1

[θ(i)]⊤lineari(λ)

= maxforallisimjλijisinRd

nsum

i=1

functioni(λ) = maxforallisimjλijisinRd

function(λ)

bull Accelerated gradient descent (Scaman et al 2017)

hArr alternating local gradient computations and a gossip step

ndashradic

κγ log 1ε gradient steps and

radic

κγ log 1ε communication steps

ndash Not optimal rArr need accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2008.

S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

M. Cao, D. A. Spielman, and E. M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

I. Colin, A. Bellet, J. Salmon, and S. Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

H. Hendrikx, F. Bach, and L. Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Annual Conference on Decision and Control (CDC), pages 3671–3676. IEEE, 2013.

D. Jakovetić, J. M. F. Moura, and J. Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

E. Wei and A. Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 80: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dusan Jakovetic, Jose M. F. Moura, and Joao Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

Page 81: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Illustration of accelerated gossip

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 82: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Accelerated gossip

bull Regular gossip

ndash Iterations θt = W tθ0

bull Accelerated gossip

ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)

ndash Shift-register gossip (Cao et al 2006)

ndash Linear combinations hArr ηt =t

sum

k=0

αkθk =t

sum

k=0

αkWkξ = Pt(W )ξ

ndash Optimal polynomial is the Chebyshev polynomial

ndash Can be computed online with same cost as regular gossip eg

θt = ωtWθtminus1 + (1minus ωt)θtminus1

ndash Replace γminus1 by γminus12 in rates

bull rArr optimal complexity for optimization (Scaman et al 2017)

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 83: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 84: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull Single machine vs ldquooptimalrdquo decentralized algorithm

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 85: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Distribution in machine learning (and beyond)

bull Machine learning through optimization

minθisinRd

1

n

nsum

i=1

fi(θ) = g(θ)

ndash fi(θ) error of model defined by θ on dataset indexed by i

ndash Example fi(θ) =1

mi

misum

j=1

ℓ(yij θ⊤Φ(xij)) if mi observations

bull MSDA (Scaman et al 2017)

ndashradicκ log 1

ε gradient steps andradic

κγ log 1ε communication steps

ndash ldquoOptimalrdquo but still not adapted to machine learning

ndash Huge slow down when going from 1 to 2 machines

ndash Only synchronous

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 86: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Trade-offs between gradient and communication steps

ndash Adapted to functions of the type fi(θ) =1

m

msum

j=1

ℓ(yij θ⊤Φ(xij))

ndash Allows for partial asynchrony

bull n computing nodes with m observations each

Algorithm gradient steps communication

Single machine algorithm nm+radicnmκ 0

MSDA (Scaman et al 2017) mradicκ

radic

κγ

ADFS (Hendrikx et al 2019) m+radicmκ

radic

κγ

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 87: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

ADFS - Algorithm principle

bull Minimizingnsum

i=1

msum

j=1

fij(θ) +σi

2θ2

ndash Create an equivalent graph

ndash Dual randomized coordinate ascent (with non uniform sampling)

ndash Decoupling of data and gossip steps

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011

L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008

Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms

IEEE transactions on information theory 52(6)2508ndash2530 2006

Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed

computation In 44th Annual Allerton Conference on Communication Control and Computation

pages 952ndash959 2006

Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for

decentralized optimization of pairwise functions In International Conference on Machine Learning

pages 1388ndash1396 2016

Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method

with support for non-strongly convex composite objectives In Advances in Neural Information

Processing Systems 2014

John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization

Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606

2012

Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic

gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019

Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed

optimization using a randomized alternating direction method of multipliers In Decision and

Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013

Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed

augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015

Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance

reduction In Advances in Neural Information Processing Systems 2013

G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv

2015

N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence

rate for strongly-convex optimization with finite training sets In Advances in Neural Information

Processing Systems (NIPS) 2012

A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact

linear convergence rate for consensus optimization IEEE Transactions on Signal and Information

Processing over Networks 2(4)507ndash522 2016

Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization

IEEE Transactions on Automatic Control 54(1)48ndash61 2009

A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization

over time-varying graphs ArXiv e-prints 2016

Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)

Soviet Math Doklady 269(3)543ndash547 1983

Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004

Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal

algorithms for smooth and strongly convex distributed optimization in networks In International

Conference on Machine Learning pages 3027ndash3036 2017

Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized

consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015

Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st

Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012

L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of

full gradients In Advances in Neural Information Processing Systems 2013

Page 88: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with m = 104 observations per node in R28

ndash Two-dimensional grid network

n = 4 n = 100

Decentralized algorithms for machine learning

(Hendrikx Bach and Massoulie 2019)

bull Running times on an actual cluster

ndash Logistic regression with mn asymp 105 observations in R47236

ndash Two-dimensional grid network with n = 100 nodes

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

Conclusions

bull Distributed decentralized machine learning

ndash Distributing the fastest single machine algorithms

ndash n machines and m observations per machine

ndash From nm+radicnmκ (single machine) to m+

radicmκ gradient steps

ndash Linear speed-ups for well-conditioned problems

bull Extensions

ndash Beyond convex problems

ndash Matching running time complexity lower bounds

ndash Experiments on large-scale clouds

References

M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)

591ndash608 2014

Decentralized algorithms for machine learning
(Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster

– Logistic regression with mn ≈ 10^5 observations in R^47236

– Two-dimensional grid network with n = 100 nodes (see the sketch below)
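For concreteness, here is a minimal sketch (not the experiment's actual code) of one way to build a gossip matrix for such a two-dimensional grid of n = 100 nodes. The helper name grid_gossip_matrix and the particular choice W = I − L/(d_max + 1) are illustrative assumptions; the printed spectral gap 1 − λ₂(W) is the quantity that governs how fast gossip averaging mixes on this topology.

import numpy as np

def grid_gossip_matrix(side=10):
    """Doubly stochastic gossip matrix for a side x side grid (one standard choice)."""
    n = side * side
    A = np.zeros((n, n))                      # adjacency matrix of the grid
    for i in range(side):
        for j in range(side):
            u = i * side + j
            if i + 1 < side:                  # vertical neighbour
                v = (i + 1) * side + j
                A[u, v] = A[v, u] = 1.0
            if j + 1 < side:                  # horizontal neighbour
                v = i * side + (j + 1)
                A[u, v] = A[v, u] = 1.0
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                      # graph Laplacian
    W = np.eye(n) - L / (deg.max() + 1.0)     # symmetric, row sums 1, entries >= 0
    return W

W = grid_gossip_matrix(10)                    # n = 100 nodes
eigs = np.linalg.eigvalsh(W)                  # ascending eigenvalues, largest is 1
print("spectral gap 1 - lambda_2:", 1.0 - eigs[-2])   # scales as 1/side^2 on a 2-D grid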

Conclusions

• Distributed decentralized machine learning

– Distributing the fastest single machine algorithms
– n machines and m observations per machine
– From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps (see the arithmetic sketch after this list)
– Linear speed-ups for well-conditioned problems

• Extensions

– Beyond convex problems
– Matching running time complexity lower bounds
– Experiments on large-scale clouds
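As an illustration of what this complexity reduction buys, the following back-of-the-envelope computation plugs values of n, m and κ into the two gradient-step counts above. The numbers are hypothetical, chosen only to make the formulas concrete, not figures from the talk.

import math

# Hypothetical values: machines, observations per machine, condition number.
n, m, kappa = 100, 1_000, 1_000

single_machine = n * m + math.sqrt(n * m * kappa)   # nm + sqrt(nm * kappa)
per_machine    = m + math.sqrt(m * kappa)           # m  + sqrt(m * kappa)

print(f"single machine : {single_machine:,.0f} gradient steps")
print(f"per machine    : {per_machine:,.0f} gradient steps")
print(f"speed-up       : {single_machine / per_machine:.1f}x (n = {n} machines)")

With these values the speed-up is about 55×; taking κ much smaller than m pushes the ratio towards the ideal factor n, which is the "linear speed-ups for well-conditioned problems" regime.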

References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.

W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.

Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.

Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.

Dušan Jakovetić, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.

Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.

Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.

Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.
