
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

Anastasia Koloskova (EPFL)    Sebastian U. Stich (EPFL)    Martin Jaggi (EPFL)
{anastasia.koloskova, sebastian.stich, martin.jaggi}@epfl.ch

arXiv:1902.00340v1 [cs.LG] 1 Feb 2019

Abstract

We consider decentralized stochastic optimization with the objective function (e.g. data samples for a machine learning task) being distributed over n machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by ω ≤ 1 (ω = 1 meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm, Choco-SGD, that converges at rate O(1/(nT) + 1/(Tδ²ω)²) for strongly convex objectives, where T denotes the number of iterations and δ the eigengap of the connectivity matrix. Despite compression quality and network connectivity affecting the higher-order terms, the first term in the rate, O(1/(nT)), is the same as for the centralized baseline with exact communication. We (ii) present a novel gossip algorithm, Choco-Gossip, for the average consensus problem that converges in time O(1/(δ²ω) log(1/ε)) for accuracy ε > 0. This is (to our knowledge) the first gossip algorithm that supports arbitrary compressed messages for ω > 0 and still exhibits linear convergence. We (iii) show in experiments that both of our algorithms outperform the respective state-of-the-art baselines and Choco-SGD can reduce communication by at least two orders of magnitude.

1 Introduction

Decentralized machine learning methods are becoming core aspects of many important applications, both in view of scalability to larger datasets and systems, and from the perspective of data locality, ownership and privacy. In this work we address the general data-parallel setting where the data is distributed across different compute devices, and consider decentralized optimization methods that do not rely on a central coordinator (e.g. parameter server) but instead only require on-device computation and local communication with neighboring devices. This covers for instance the classic setting of training machine learning models in large data-centers, but also emerging applications where the computations are executed directly on the consumer devices, which keep their part of the data private at all times.¹

Formally, we consider optimization problems distributed across n devices or nodes of the form

f⋆ := min_{x ∈ R^d} [ f(x) := (1/n) ∑_{i=1}^n f_i(x) ] ,   (1)

where f_i : R^d → R for i ∈ [n] := {1, . . . , n} are the objectives defined by the local data available on each node. We also allow each local objective f_i to have stochastic optimization (or sum) structure, covering the important case of empirical risk minimization in distributed machine learning and deep learning applications.

¹Note the optimization process itself (as for instance the computed result) might leak information about the data of other nodes. We do not focus on quantifying notions of privacy in this work.



Decentralized Communication. We model the network topology as a graph G = ([n], E) with edges {i, j} ∈ E if and only if nodes i and j are connected by a communication link, meaning that these nodes can directly exchange messages (for instance computed model updates). The decentralized setting is motivated by centralized topologies (corresponding to a star graph) often not being possible, and otherwise often posing a significant bottleneck on the central node in terms of communication latency, bandwidth and fault tolerance. Decentralized topologies avoid these bottlenecks and thereby offer hugely improved potential in scalability. For example, while the master node in the centralized setting receives (and sends) messages from all workers in each round, Θ(n) in total², in decentralized topologies the maximal degree of the network is often constant (e.g. ring or torus) or a slowly growing function in n (e.g. scale-free networks).

Decentralized Optimization. For the case of deterministic (full-gradient) optimization, recent seminal theoretical advances show that the network topology only affects higher-order terms of the convergence rate of decentralized optimization algorithms on convex problems (Scaman et al., 2017, 2018). We prove the first analogous result for the important case of decentralized stochastic gradient descent (SGD), proving convergence at rate O(1/(nT)) (ignoring for now higher-order terms) on strongly convex functions, where T denotes the number of iterations.

This result is significant since stochastic methods are highly preferred over deterministic gradient methods for their efficiency in machine learning applications. Our algorithm, Choco-SGD, is as efficient in terms of iterations as centralized mini-batch SGD (and consequently also achieves a speedup of factor n compared to the serial setting on a single node) but avoids the communication bottleneck that centralized algorithms suffer from.

Communication Compression. In distributed training, model updates (or gradient vectors) have to be exchanged between the worker nodes. To reduce the amount of data that has to be sent, gradient compression has become a popular strategy, for instance by quantization (Alistarh et al., 2017; Wen et al., 2017; Lin et al., 2018) or sparsification (Wangni et al., 2018; Stich et al., 2018).

These ideas have recently been introduced to the decentralized setting as well, by Tang et al. (2018a). However, their analysis only covers unbiased compression operators with very (unreasonably) high accuracy constraints. Here we propose the first method that supports arbitrarily low accuracy and even biased compression operators, such as in (Alistarh et al., 2018; Lin et al., 2018; Stich et al., 2018).

Contributions. Our contributions can be summarized as follows:

• We show that the proposed Choco-SGD converges at rate O(1/(nT) + 1/(Tδ²ω)²), where T denotes the number of iterations, n the number of workers, δ the eigengap of the gossip (connectivity) matrix and ω ≤ 1 the compression quality factor (ω = 1 meaning no compression). We show that the decentralized method achieves the same speedup as centralized mini-batch SGD when the number n of workers grows. The network topology and the compression only mildly affect the convergence rate. This is verified experimentally on the ring topology and by reducing the communication by a factor of 100 (ω = 1/100).

• We present the first provably converging gossip algorithm with communication compression for the distributed average consensus problem. Our algorithm, Choco-Gossip, converges linearly at rate O(1/(δ²ω) log(1/ε)) for accuracy ε > 0, and allows arbitrary communication compression operators (including biased and unbiased ones). In contrast, previous work required very high-precision quantization ω ≈ 1 and could only show convergence towards a neighborhood of the optimal solution.

• Choco-SGD significantly outperforms state-of-the-art methods for decentralized optimization with gradient compression, such as ECD-SGD and DCD-SGD introduced in (Tang et al., 2018a), in all our experiments.

²For better-connected topologies, more efficient all-reduce and broadcast implementations are sometimes available.


2 Related Work

Stochastic gradient descent (SGD) (Robbins & Monro, 1951; Bottou, 2010) and variants thereof are the standard algorithms for machine learning problems of the form (1), though SGD is an inherently serial algorithm that does not take the distributed setting into account. Mini-batch SGD (Dekel et al., 2012) is the natural parallelization of SGD for (1) in the centralized setting, i.e. when a master node collects the updates from all worker nodes, and serves as a baseline here.

Decentralized Optimization. The study of decentralized optimization algorithms can be traced back at least to the 1980s (Tsitsiklis, 1984). Decentralized algorithms are sometimes referred to as gossip algorithms (Kempe et al., 2003; Xiao & Boyd, 2004; Boyd et al., 2006) as the information is not broadcast by a central entity, but spreads, similar to gossip, along the edges specified by the communication graph. The most popular algorithms are based on (sub)gradient descent (Nedic & Ozdaglar, 2009; Johansson et al., 2010), the alternating direction method of multipliers (ADMM) (Wei & Ozdaglar, 2012; Iutzeler et al., 2013) or dual averaging (Duchi et al., 2012; Nedic et al., 2015). He et al. (2018) address the more specific problem class of generalized linear models. For the deterministic (non-stochastic) convex version of (1), a recent line of work developed optimal algorithms based on acceleration (Jakovetic et al., 2014; Scaman et al., 2017, 2018; Uribe et al., 2018). Rates for the stochastic setting are derived in (Shamir & Srebro, 2014; Rabbat, 2015), under the assumption that the distributions on all nodes are equal. This is a strong restriction which prohibits most distributed machine learning applications. Our algorithm Choco-SGD avoids any such assumption. Also, (Rabbat, 2015) requires multiple communication rounds per stochastic gradient computation and is therefore not suited for sparse communication, as the required number of communication rounds would increase proportionally to the sparsity. Lan et al. (2018) applied gradient sliding techniques that allow skipping some of the communication rounds. Lian et al. (2017); Tang et al. (2018b,a); Assran et al. (2018) consider the non-convex setting, with Tang et al. (2018a) also applying gradient quantization techniques to reduce the communication cost. However, their algorithms require very high-precision quantization, a constraint we can overcome here.

Gradient Compression. Instead of transmitting a full-dimensional (gradient) vector g ∈ R^d, methods with gradient compression transmit a compressed vector Q(g), where Q : R^d → R^d is a (random) operator chosen such that Q(g) can be represented more efficiently, for instance by using a limited bit representation (quantization) or by enforcing sparsity. A very common class of quantization operators is based on random dithering (Goodall, 1951; Roberts, 1962), which is in addition unbiased, E_ξ Q(x) = x, ∀x ∈ R^d; see (Alistarh et al., 2017; Wen et al., 2017; Zhang et al., 2017). Much sparser vectors can be obtained by random sparsification techniques that randomly mask the input vectors and only preserve a constant number of coordinates (Wangni et al., 2018; Konecny & Richtarik, 2018; Stich et al., 2018). Techniques that do not directly quantize gradients, but instead maintain additional states, are known to perform better in theory and practice (Seide et al., 2014; Lin et al., 2018; Stich et al., 2018), an approach that we pick up here. Our analysis also covers deterministic and biased compression operators, such as in (Alistarh et al., 2018; Stich et al., 2018). We will not further distinguish between sparsification and quantization approaches, and refer to both of them as compression operators in the following.

Distributed Average Consensus. In the decentralized setting, the average consensus problem consists in finding the average vector of n local vectors (see (2) below for a formal definition). The problem is an important sub-routine of many decentralized algorithms. It is well known that gossip-type algorithms converge linearly for average consensus (Kempe et al., 2003; Xiao & Boyd, 2004; Olfati-Saber & Murray, 2004; Boyd et al., 2006). However, for consensus algorithms with compressed communication it has been remarked that the standard gossip algorithm does not converge to the correct solution (Xiao et al., 2005). The schemes proposed in (Carli et al., 2007; Nedic et al., 2008; Aysal et al., 2008; Carli et al., 2010b; Yuan et al., 2012) only converge to a neighborhood of the solution (whose size depends on the compression accuracy). In order to converge, adaptive schemes (with varying compression accuracy) have been proposed (Carli et al., 2010a; Fang & Li, 2010; Li et al., 2011; Thanou et al., 2013). However, these approaches fall back to full (uncompressed) communication to reach high accuracy. In contrast, our method converges linearly to the true solution, even for arbitrarily compressed communication, without requiring adaptive accuracy. We are not aware of a method in the literature with similar guarantees.

3 Average Consensus with Communication Compression

In this section we present Choco-Gossip, a novel gossip algorithm for distributed average consensus with compressed communication. As mentioned, the average consensus problem is an important special case of type (1), and formalized as

x̄ := (1/n) ∑_{i=1}^n x_i ,   (2)

for vectors x_i ∈ R^d distributed on n nodes (consider f_i(x) = ½ ‖x − x_i‖² in (1)). Our proposed algorithm will later serve as a crucial primitive in our optimization algorithm for the general optimization problem (1), but is of independent interest for any average consensus problem with communication constraints.

In Sections 3.1–3.3 below we first review existing schemes that we later consider as baselines for the numerical comparison. The novel algorithm follows in Section 3.4.

3.1 Gossip algorithms

The classic decentralized algorithms for the average consensus problem are gossip-type algorithms (see e.g. (Xiao & Boyd, 2004)) that generate sequences {x_i^{(t)}}_{t≥0} on every node i ∈ [n] by iterations of the form

x_i^{(t+1)} := x_i^{(t)} + γ ∑_{j=1}^n w_ij Δ_ij^{(t)} .   (3)

Here γ ∈ (0, 1] denotes a stepsize parameter, w_ij ∈ [0, 1] are averaging weights, and Δ_ij^{(t)} ∈ R^d denotes a vector that is sent from node j to node i in iteration t. Note that no communication is required if w_ij = 0. If we assume symmetry, w_ij = w_ji, the weights naturally define the communication graph G = ([n], E) with edges {i, j} ∈ E if w_ij > 0 and self-loops {i, i} ∈ E for i ∈ [n]. The convergence rate of scheme (3) crucially depends on the connectivity matrix W ∈ R^{n×n} of the network, defined as (W)_ij = w_ij, also called the interaction or gossip matrix.

Definition 1 (Gossip matrix). We assume that W ∈ [0, 1]^{n×n} is a symmetric (W = W^T) doubly stochastic (W1 = 1, 1^T W = 1^T) matrix with eigenvalues 1 = |λ_1(W)| > |λ_2(W)| ≥ · · · ≥ |λ_n(W)| and spectral gap

δ := 1 − |λ_2(W)| ∈ (0, 1] .   (4)

It will also be convenient to define

ρ := 1 − δ ,  and  β := ‖I − W‖_2 ∈ [0, 2] .   (5)

Table 1 gives a few values of the spectral gap for commonly used network topologies (with uniform averaging between the nodes). It is well known that simple matrices W with δ > 0 exist for every connected graph.


graph/topology     δ⁻¹      node degree
ring               O(n²)    2
2d-torus           O(n)     4
fully connected    O(1)     n − 1

Table 1: Spectral gap δ for some important network topologies on n nodes (see e.g. (Aldous & Fill, 2002, p. 169)) for uniformly averaging W, i.e. w_ij = 1/deg(i) = 1/deg(j) for {i, j} ∈ E.
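To make Definition 1 and Table 1 concrete, the following sketch (ours, not part of the original paper; it uses the self-loop weight 1/3 as one common uniformly averaging choice for the ring) builds a mixing matrix and evaluates δ and β numerically.

import numpy as np

def ring_gossip_matrix(n):
    # Symmetric, doubly stochastic mixing matrix for a ring where every node
    # averages uniformly over itself and its two neighbors (weights 1/3).
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

W = ring_gossip_matrix(25)
abs_eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
delta = 1.0 - abs_eigs[1]                 # spectral gap, eq. (4); Theta(1/n^2) on the ring
beta = np.linalg.norm(np.eye(25) - W, 2)  # beta = ||I - W||_2, eq. (5)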

3.2 Gossip with Exact Communication

For a fixed gossip matrix W, the classical algorithm analyzed in (Xiao & Boyd, 2004) corresponds to the choice

γ := 1,   Δ_ij^{(t)} := x_j^{(t)} − x_i^{(t)} ,   (E-G)

in (3), with (E-G) standing for exact gossip. This scheme can also conveniently be written in matrix notation as

X^{(t+1)} := X^{(t)} + γ X^{(t)} (W − I) ,   (6)

for iterates X^{(t)} := [x_1^{(t)}, . . . , x_n^{(t)}] ∈ R^{d×n}.
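As an illustration of the matrix form (6), a minimal exact-gossip loop looks as follows (our sketch; X holds the local vectors x_i as columns). Theorem 1 below quantifies how fast the consensus error contracts.

import numpy as np

def exact_gossip(X, W, gamma=1.0, num_iters=100):
    # Iterate (6): X <- X + gamma * X (W - I). The column average of X is
    # preserved in every step and all columns converge to it.
    n = W.shape[0]
    x_bar = X.mean(axis=1, keepdims=True)
    for _ in range(num_iters):
        X = X + gamma * X @ (W - np.eye(n))
    consensus_error = np.sum((X - x_bar) ** 2)  # decays as (1 - gamma*delta)^{2t}
    return X, consensus_error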

Theorem 1. Let γ ∈ (0, 1] and δ be the spectral gap of W. Then the iterates of (E-G) converge linearly to the average x̄ = (1/n) ∑_{i=1}^n x_i^{(0)} with the rate

∑_{i=1}^n ‖x_i^{(t)} − x̄‖² ≤ (1 − γδ)^{2t} ∑_{i=1}^n ‖x_i^{(0)} − x̄‖² .

For γ = 1 this corresponds to the classic result in e.g. (Xiao & Boyd, 2004); here we slightly extend the analysis to arbitrary stepsizes. The short proof shows the elegance of the matrix notation (that we will later also adapt for the proofs that follow).

Proof for γ = 1. Let X̄ := [x̄, . . . , x̄] ∈ R^{d×n}. Then for γ = 1 the theorem follows from the observation

‖X^{(t+1)} − X̄‖²_F = ‖(X^{(t)} − X̄)W‖²_F = ‖(X^{(t)} − X̄)(W − (1/n)11^T)‖²_F ≤ ‖W − (1/n)11^T‖²_2 ‖X^{(t)} − X̄‖²_F = ρ² ‖X^{(t)} − X̄‖²_F ,

where the first equality uses (6). Here on the second line we used the crucial identity X^{(t)}((1/n)11^T) = X̄, i.e. the algorithm preserves the average over all iterations. This can be seen from (6):

X^{(t+1)}((1/n)11^T) = X^{(t)} W ((1/n)11^T) = X^{(t)}((1/n)11^T) = X̄ ,

by Definition 1. The proof for arbitrary γ follows the same lines and is given in the appendix.


3.3 Gossip with Quantized Communication

In every round of scheme (E-G) a full-dimensional vector g ∈ R^d is exchanged between two neighboring nodes for every link on the communication graph (node j sends g = x_j^{(t)} to all its neighbors i, {i, j} ∈ E). A natural way to reduce the communication is to compress g before sending it, denoted as Q(g), for a (potentially random) compression Q : R^d → R^d. Informally, we can think of Q as either a sparsification operator (that enforces sparsity of Q(g)) or a quantization operator that reduces the number of bits required to represent Q(g), for instance by random rounding to less precise floating point numbers or to integers.

Aysal et al. (2008) propose the quantized gossip (Q1-G),

γ := 1,   Δ_ij^{(t)} := Q(x_j^{(t)}) − x_i^{(t)} ,   (Q1-G)

in scheme (3), i.e. to apply the compression operator directly to the message that is sent out from node j to node i. However, this algorithm does not preserve the average of the iterates over the iterations, (1/n) ∑_{i=1}^n x_i^{(0)} ≠ (1/n) ∑_{i=1}^n x_i^{(t)} for t ≥ 1, and as a consequence does not converge to the optimal solution x̄ of (2) (though in practice it often converges to a close neighborhood). An alternative proposal by Carli et al. (2007) alleviates this drawback. The scheme

γ := 1,   Δ_ij^{(t)} := Q(x_j^{(t)}) − Q(x_i^{(t)}) ,   (Q2-G)

preserves the average of the iterates over the iterations. However, this scheme also fails to converge to arbitrary precision. If x̄ ≠ 0, the noise introduced by the compression, ‖Q(x_j^{(t)})‖, does not vanish for t → ∞. As a consequence, the iterates oscillate around x̄ once the compression error becomes larger than the suboptimality ‖x_i^{(t)} − x̄‖.

Both these schemes have been theoretically studied in (Carli et al., 2010b) under the assumption of unbiasedness, i.e. assuming E_Q Q(x) = x for all x ∈ R^d (and we will later also adopt this theoretically understood setting in our experiments).

3.4 Proposed Method for Compressed Communication

We propose the novel compressed gossip scheme Choco-Gossip that supports a much larger class of compression operators, beyond the unbiased quantization used in the schemes above. The algorithm can be summarized as

x̂_j^{(t+1)} := x̂_j^{(t)} + Q(x_j^{(t)} − x̂_j^{(t)}) ,
Δ_ij^{(t)} := x̂_j^{(t+1)} − x̂_i^{(t+1)} ,   (Choco-G)

for a stepsize γ < 1 depending on the specific compression operator Q (this will be detailed below). Here x̂_i^{(t)} ∈ R^d denote additional variables that are stored³ by all neighbors j of node i, {i, j} ∈ E, as well as on node i itself.

We will show in Theorem 2 below that this scheme (i) preserves the average of the iterates x_i^{(t)}, i ∈ [n], over the iterations t ≥ 0. Moreover, (ii) the noise introduced by the compression operator vanishes as t → ∞. Precisely, we will show that (x_i^{(t)}, x̂_i^{(t)}) → (x̄, x̄) for t → ∞ for every i ∈ [n]. Consequently, the argument of Q in (Choco-G) goes to zero, and the noise introduced by Q can be controlled.

The proposed scheme is summarized in Algorithm 1. Every worker i ∈ [n] stores and updates its own local variable x_i as well as the variables x̂_j for all neighbors (including itself) j : {i, j} ∈ E. Algorithm 1 seems to require each machine to store deg(i) + 2 vectors. This is not necessary: the algorithm can be rewritten in a way that every node stores only three vectors, x_i, x̂_i and s_i = ∑_{j:{i,j}∈E} w_ij x̂_j. For simplicity, we omit this technical modification here and refer to Appendix E for the exact form of the memory-efficient algorithm.

³A closer look reveals that actually only 2 additional vectors have to be stored per node (refer to Appendix E).

6

Page 7: fanastasia.koloskova, sebastian.stich, martin.jaggig@epfl ... · Sebastian U. Stich EPFL Martin Jaggi EPFL fanastasia.koloskova, sebastian.stich, martin.jaggig@epfl.ch Abstract We

Algorithm 1 Choco-Gossip

input: Initial values x_i^{(0)} ∈ R^d on each node i ∈ [n], stepsize γ, communication graph G = ([n], E) and mixing matrix W; initialize x̂_i^{(0)} := 0 ∀i
1: for t in 0 . . . T − 1 do in parallel for all workers i ∈ [n]
2:   q_i^{(t)} := Q(x_i^{(t)} − x̂_i^{(t)})
3:   for neighbors j : {i, j} ∈ E (including {i, i} ∈ E) do
4:     Send q_i^{(t)} and receive q_j^{(t)}
5:     x̂_j^{(t+1)} := x̂_j^{(t)} + q_j^{(t)}
6:   end for
7:   x_i^{(t+1)} := x_i^{(t)} + γ ∑_{j:{i,j}∈E} w_ij ( x̂_j^{(t+1)} − x̂_i^{(t+1)} )
8: end for
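For illustration, a compact matrix-form sketch of Algorithm 1 (ours; the compression callback Q acts column-wise, e.g. one of the operators sketched in Section 3.5 below):

import numpy as np

def choco_gossip(X, W, Q, gamma, num_iters):
    # X[:, i] is the local value x_i; X_hat[:, i] is the publicly known
    # estimate hat-x_i shared with the neighbors (initialized to zero).
    n = W.shape[0]
    X_hat = np.zeros_like(X)
    for _ in range(num_iters):
        X_hat = X_hat + Q(X - X_hat)             # lines 2-6: compress and update estimates
        X = X + gamma * X_hat @ (W - np.eye(n))  # line 7: consensus step on the estimates
    return X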

3.5 Convergence Analysis for Choco-Gossip

We analyze Algorithm 1 under the following general quality notion for the compression operator Q.

Assumption 1 (Compression operator). We assume that the compression operator Q : R^d → R^d satisfies

E_Q ‖Q(x) − x‖² ≤ (1 − ω) ‖x‖² ,  ∀x ∈ R^d ,   (7)

for a parameter ω > 0. Here E_Q denotes the expectation over the internal randomness of the operator Q.

Example operators that satisfy (7) include the following (a code sketch of these operators is given after this list):

• sparsification: Randomly selecting k out of d coordinates (rand_k), or the k coordinates with highest magnitude values (top_k), gives ω = k/d (Stich et al., 2018, Lemma A.1).

• randomized gossip: Setting Q(x) = x with probability p ∈ (0, 1] and Q(x) = 0 otherwise gives ω = p.

• rescaled unbiased estimators: Suppose E_Q Q(x) = x, ∀x ∈ R^d, and E_Q ‖Q(x)‖² ≤ τ ‖x‖²; then Q′(x) := (1/τ) Q(x) satisfies (7) with ω = 1/τ.

• random quantization: For precision (levels) s ∈ N₊ and τ = (1 + min{d/s², √d/s}), the quantization operator

  qsgd_s(x) = (sign(x) · ‖x‖) / (sτ) · ⌊ s|x|/‖x‖ + ξ ⌋ ,

  for a random variable ξ ∼ u.a.r. [0, 1]^d, satisfies (7) with ω = 1/τ (Alistarh et al., 2017, Lemma 3.1).
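The following vector-valued sketches of the operators above are ours (illustrative only; for matrix arguments they would be applied column-wise):

import numpy as np

def rand_k(x, k):
    # keep k uniformly random coordinates; satisfies (7) with omega = k/d
    out = np.zeros_like(x)
    idx = np.random.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

def top_k(x, k):
    # keep the k largest-magnitude coordinates (biased); omega = k/d
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def qsgd(x, s):
    # rescaled random quantization with s levels; satisfies (7) with omega = 1/tau
    d, norm = x.size, np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    tau = 1.0 + min(d / s**2, np.sqrt(d) / s)
    xi = np.random.rand(d)
    return np.sign(x) * norm / (s * tau) * np.floor(s * np.abs(x) / norm + xi)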

Theorem 2. Choco-Gossip (Algorithm 1) converges linearly for average consensus:

e_t ≤ (1 − δ²ω/82)^t e_0 ,

when using the stepsize γ := δ²ω / (16δ + δ² + 4β² + 2δβ² − 8δω), where ω is the compression factor as in Assumption 1, and e_t := E_Q ∑_{i=1}^n ( ‖x_i^{(t)} − x̄‖² + ‖x_i^{(t)} − x̂_i^{(t+1)}‖² ).

For the proof we refer to the appendix, where we used matrix notation to simplify derivations. For theexact communication case ω = 1 we recover the rate from Theorem 1 for stepsize γ < 1 up to constantfactors (which seems to be a small artifact of our proof technique). The theorem shows convergence forarbitrary ω > 0, showing the superiority of scheme (Choco-G) over (Q1-G) and (Q2-G).
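For concreteness, the theoretical stepsize of Theorem 2 is a simple closed-form expression in δ, β and ω; a small helper sketch (ours; in the experiments γ is tuned instead, cf. Table 3):

def choco_consensus_stepsize(delta, beta, omega):
    # gamma := delta^2 * omega / (16*delta + delta^2 + 4*beta^2 + 2*delta*beta^2 - 8*delta*omega)
    return (delta**2 * omega) / (
        16 * delta + delta**2 + 4 * beta**2 + 2 * delta * beta**2 - 8 * delta * omega
    )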


4 Decentralized Stochastic Optimization

In this section we leverage our proposed average consensus Algorithm 1 to achieve consensus among the compute nodes in a decentralized optimization setting with communication restrictions.

In the decentralized optimization setting (1), not only does every node have a different local objective f_i, but we also allow each f_i to have stochastic optimization (or sum) structure, that is,

f_i(x) := E_{ξ_i ∼ D_i} F_i(x, ξ_i) ,   (8)

for a loss function F_i : R^d × Ω → R and distributions D_1, . . . , D_n which can be different on every node. Our framework therefore covers both stochastic optimization (e.g. when all D_i are identical) and empirical risk minimization (as in machine learning and deep learning applications) when the D_i's are discrete with disjoint support.

4.1 Proposed Scheme for Decentralized Optimization

Our proposed method Choco-SGD (Communication-Compressed Decentralized SGD) is stated in Algorithm 2.

Algorithm 2 Choco-SGD

input: Initial values x_i^{(0)} ∈ R^d on each node i ∈ [n], consensus stepsize γ, SGD stepsizes {η_t}_{t≥0}, communication graph G = ([n], E) and mixing matrix W; initialize x̂_i^{(0)} := 0 ∀i
1: for t in 0 . . . T − 1 do in parallel for all workers i ∈ [n]
2:   Sample ξ_i^{(t)}, compute gradient g_i^{(t)} := ∇F_i(x_i^{(t)}, ξ_i^{(t)})
3:   x_i^{(t+1/2)} := x_i^{(t)} − η_t g_i^{(t)}
4:   q_i^{(t)} := Q(x_i^{(t+1/2)} − x̂_i^{(t)})
5:   for neighbors j : {i, j} ∈ E (including {i, i} ∈ E) do
6:     Send q_i^{(t)} and receive q_j^{(t)}
7:     x̂_j^{(t+1)} := x̂_j^{(t)} + q_j^{(t)}
8:   end for
9:   x_i^{(t+1)} := x_i^{(t+1/2)} + γ ∑_{j:{i,j}∈E} w_ij ( x̂_j^{(t+1)} − x̂_i^{(t+1)} )
10: end for

The algorithm consists of four parts: the stochastic gradient step in line 3, the application of the compression operator in line 4, the (Choco-G) local communication in lines 5–8, and the final iterate update in line 9.
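A matrix-form sketch of one Choco-SGD iteration (ours; grad_fn returns the stochastic gradients ∇F_i(x_i, ξ_i) as columns, and Q is a compression operator as in Section 3.5):

import numpy as np

def choco_sgd_step(X, X_hat, W, Q, grad_fn, eta, gamma):
    # One iteration of Algorithm 2: local SGD step (line 3), compressed
    # exchange of the updates (lines 4-8), consensus step (line 9).
    n = W.shape[0]
    X_half = X - eta * grad_fn(X)
    X_hat = X_hat + Q(X_half - X_hat)
    X = X_half + gamma * X_hat @ (W - np.eye(n))
    return X, X_hat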

Remark 3. As a special case without any communication compression, and for consensus stepsize γ = 1 as in exact gossip (E-G), Choco-SGD (Algorithm 2) recovers the following standard variant of decentralized SGD with gossip (similar e.g. to (Sirb & Ye, 2016; Lian et al., 2017)), stated for illustration in Algorithm 3.

Algorithm 3 Plain Decentralized SGD

1: for t in 0 . . . T − 1 do in parallel for all workers i ∈ [n]
2:   Sample ξ_i^{(t)}, compute gradient g_i^{(t)} := ∇F_i(x_i^{(t)}, ξ_i^{(t)})
3:   x_i^{(t+1/2)} := x_i^{(t)} − η_t g_i^{(t)}
4:   Send x_i^{(t+1/2)} to neighbors
5:   x_i^{(t+1)} := ∑_{j=1}^n w_ij x_j^{(t+1/2)}
6: end for


Figure 1: Ring topology (left) and Torus topology (right).

4.2 Convergence Analysis for Choco-SGD

Assumption 2. We assume that each function f_i : R^d → R for i ∈ [n] is L-smooth and µ-strongly convex and that the variance on each worker is bounded:

E_{ξ_i} ‖∇F_i(x, ξ_i) − ∇f_i(x)‖² ≤ σ_i² ,  ∀x ∈ R^d, i ∈ [n] ,
E_{ξ_i} ‖∇F_i(x, ξ_i)‖² ≤ G² ,  ∀x ∈ R^d, i ∈ [n] ,

where E_{ξ_i}[·] denotes the expectation over ξ_i ∼ D_i. It will also be convenient to denote

σ² := (1/n) ∑_{i=1}^n σ_i² .

For the (standard) definitions of smoothness and strong convexity we refer to Appendix A.1. These assumptions could be relaxed to only hold for x ∈ {x_i^{(t)}}_{t=1}^T, the set of iterates of Algorithm 2.

Theorem 4. Under Assumption 2, Algorithm 2 with SGD stepsizes η_t := 4/(µ(a + t)) for a parameter a ≥ max{410/(δ²ω), 16κ}, with condition number κ = L/µ, and consensus stepsize γ := γ(δ, ω) chosen as in Theorem 2, converges at the rate

E Υ(T) = O( σ²/(µnT) ) + O( κG²/(µω²δ⁴T²) ) + O( G²/(µω³δ⁶T³) ) ,

where Υ(T) := f(x_avg^{(T)}) − f⋆ for the averaged iterate x_avg^{(T)} = (1/S_T) ∑_{t=0}^{T−1} w_t x̄^{(t)} with weights w_t = (a + t)² and S_T = ∑_{t=0}^{T−1} w_t. As a reminder, δ denotes the eigengap of W and ω the compression ratio.

For the proof we refer to the appendix. When T and σ are sufficiently large, the last two terms become negligible compared to O(σ²/(µnT)), and we recover the convergence rate of mini-batch SGD in the centralized setting with exact communication. This is because topology (parameter δ) and compression (parameter ω) only affect the higher-order terms of the rate. We also see that in this setting we obtain a speedup of factor n compared to the serial implementation of SGD on a single worker.
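For reference, the stepsize schedule required by Theorem 4 can be instantiated as follows (our sketch; the experiments in Section 5 instead use the tuned schedule η_t = ma/(t + b) of Table 4):

def theoretical_sgd_stepsize(t, mu, L, delta, omega):
    # eta_t = 4 / (mu * (a + t)) with a >= max{410/(delta^2 * omega), 16*kappa}, kappa = L/mu
    kappa = L / mu
    a = max(410.0 / (delta**2 * omega), 16.0 * kappa)
    return 4.0 / (mu * (a + t))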

5 Experiments

In this section we first compare Choco-Gossip to the gossip baselines (Section 5.2) and then compare Choco-SGD to state-of-the-art decentralized stochastic optimization schemes that also support compressed communication (Section 5.3).


Figure 2: Average consensus on the ring topology with n = 25 nodes, d = 2000 coordinates and (qsgd256) compression. Error vs. iterations (left) and vs. number of transmitted bits (right) for (E-G) and for (Q1-G), (Q2-G) and Choco-Gossip with 8-bit qsgd.

Figure 3: Average consensus on the ring topology with n = 25 nodes, d = 2000 coordinates and (rand1%) and (top1%) compression. Error vs. iterations (left) and vs. number of transmitted bits (right) for (E-G), for (Q1-G) and (Q2-G) with rand1%, and for Choco-Gossip with rand1% and top1%.

5.1 Shared Experimental Setup

For our experiments we always report the number of iterations of the respective scheme, as well as the number of transmitted bits. These quantities are independent of the system architecture and network bandwidth.

Datasets. In the experiments we rely on the epsilon (Sonnenburg et al., 2008) and rcv1 (Lewis et al., 2004) datasets (cf. Table 2).

Compression operators. We use the (rand_k), (top_k) and (qsgd_s) compression operators as described in Section 3.5, where we choose k to be 1% of all coordinates and s ∈ {2⁴, 2⁸}, requiring only 4 and 8 bits, respectively, to represent a coordinate.

Note that, in contrast to Choco-Gossip, the earlier schemes (Q1-G) and (Q2-G) were both analyzed in (Carli et al., 2010b) for unbiased compression operators. In order to reflect this theoretically understood setting, we use the rescaled operators (d/k · rand_k) and (τ · qsgd_s) in combination with those schemes.

5.2 Average Consensus

We compare the performance of the gossip schemes (E-G) (exact communication), (Q1-G) and (Q2-G) (both with unbiased compression), and our scheme (Choco-G), in Figure 2 for the (qsgd256) compression scheme and in Figure 3 for the random (rand1%) compression scheme. In addition, we also depict the performance of Choco-Gossip with biased (top1%) compression. We use the ring topology with the uniformly averaging mixing matrix W as in Figure 1 (left). The stepsizes γ used for Choco-Gossip are listed in Table 3.


dataset   m        d       density
epsilon   400000   2000    100%
rcv1      20242    47236   0.15%

Table 2: Size (m, d) and density of the datasets.

experiment          γ
Choco, (qsgd256)    1
Choco, (rand1%)     0.011
Choco, (top1%)      0.046

Table 3: Tuned stepsizes γ for averaging in Figs. 2–3.

                    epsilon              rcv1
algorithm           a        b   γ       a        b   γ
Plain               0.1      d   -       1        1   -
Choco, (qsgd16)     0.1      d   0.34    1        1   0.078
Choco, (rand1%)     0.1      d   0.01    1        1   0.016
Choco, (top1%)      0.1      d   0.04    1        1   0.04
DCD, (rand1%)       10⁻¹⁵    d   -       10⁻¹⁰    d   -
DCD, (qsgd16)       0.01     d   -       10⁻¹⁰    d   -
ECD, (rand1%)       10⁻¹⁰    d   -       10⁻¹⁰    d   -
ECD, (qsgd16)       10⁻¹²    d   -       10⁻¹⁰    d   -

Table 4: Parameters for the SGD learning rate η_t = ma/(t + b) and consensus stepsize γ used in the experiments in Figs. 5–6. Parameters were tuned separately for each algorithm. Tuning details can be found in Appendix F. The ECD and DCD stepsizes are small because the algorithms were observed to diverge for larger choices.

We consider here the consensus problem (2), where the data x_i^{(0)} ∈ R^d on the i-th machine is chosen to be the i-th vector of the epsilon dataset. We depict the errors (1/n) ∑_{i=1}^n ‖x_i^{(t)} − x̄‖².

The proposed scheme (Choco-G) with 8-bit quantization (qsgd256) converges at the same rate as (E-G), which uses exact communication (Fig. 2, left), while requiring much less data to be transmitted (Fig. 2, right). The schemes (Q1-G) and (Q2-G) do not converge and reach only accuracies of 10⁻⁴–10⁻⁵. The scheme (Q1-G) even starts to diverge, because the quantization error becomes larger than the optimization error.

With sparsified communication (rand1%), i.e. transmitting only 1% of all coordinates, the scheme (Q1-G) quickly zeros out all the coordinates, and (Q2-G) diverges because the quantization error is too large already from the first step (Fig. 3). Choco-Gossip proves to be more robust and converges. The observed rate matches the theoretical findings, as we expect the scheme with a 100× compression factor to be 100× slower than (E-G) without compression. In terms of total data transmitted, both schemes converge at the same speed (Fig. 3, right). We also see that (rand1%) sparsification can give additional gains and comes out as the most data-efficient method in these experiments.

5.3 Decentralized SGD

We assess the performance of Choco-SGD on logistic regression, defined as

(1/m) ∑_{j=1}^m log(1 + exp(−b_j a_j^T x)) + 1/(2m) ‖x‖² ,

where a_j ∈ R^d and b_j ∈ {−1, +1} are the data samples and m denotes the number of samples in the dataset. We distribute the m data samples evenly among the n workers and consider two settings: (i) randomly shuffled, where datapoints are randomly assigned to workers, and the more difficult (ii) sorted setting, where each worker only gets data samples from one class (with the possible exception of one worker that gets two labels assigned). Moreover, we try to make the setting as difficult as possible, meaning that e.g. on the ring topology the machines with the same label form two connected clusters. We repeat each experiment three times and depict the mean curve and the area corresponding to one standard deviation. We plot suboptimality, i.e. f(x^{(t)}) − f⋆ (with f⋆ obtained by the LogisticSGD optimizer from scikit-learn (Pedregosa et al., 2011)), versus the number of iterations and the number of bits transmitted between workers, which is proportional to the actual running time if communication is a bottleneck.
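For completeness, a small sketch of this objective and of a single-sample stochastic gradient (ours; A ∈ R^{m×d} holds the rows a_j^T and b ∈ {−1, +1}^m the labels):

import numpy as np

def logistic_loss(x, A, b):
    # f(x) = (1/m) sum_j log(1 + exp(-b_j a_j^T x)) + 1/(2m) ||x||^2
    m = A.shape[0]
    return np.mean(np.logaddexp(0.0, -b * (A @ x))) + np.dot(x, x) / (2 * m)

def stochastic_gradient(x, A, b):
    # gradient of the loss of one uniformly sampled datapoint plus the regularizer
    m = A.shape[0]
    j = np.random.randint(m)
    s = 1.0 / (1.0 + np.exp(b[j] * np.dot(A[j], x)))  # sigmoid(-b_j a_j^T x)
    return -b[j] * s * A[j] + x / m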


Figure 4: Performance of Algorithm 3 on ring, torus and fully connected topologies for n ∈ {9, 25, 64} nodes. Here we consider the sorted setting, whilst the performance for randomly shuffled data is depicted in Appendix G.

Figure 5: Comparison of Algorithm 3 (plain), ECD-SGD, DCD-SGD and Choco-SGD with (rand1%) sparsification (in addition (top1%) for Choco-SGD), for epsilon (top) and rcv1 (bottom), in terms of iterations (left) and communication cost (right), n = 9.

Algorithms. As baselines we consider Algorithm 3 with exact communication (denoted as 'plain') and the communication-efficient state-of-the-art optimization schemes DCD-SGD and ECD-SGD recently proposed in (Tang et al., 2018a) (for unbiased quantization operators), and compare them to Choco-SGD. We use the decaying stepsize η_t = ma/(t + b), where the parameters a, b are individually tuned for each algorithm and compression scheme, with values given in Table 4.

Impact of Topology. In Figure 4 we depict the performance of the baseline Algorithm 3 with exact communication on different topologies (ring, torus and fully connected; Fig. 1) with the uniformly averaging mixing matrix W. Note that Algorithm 3 on the fully connected graph corresponds to mini-batch SGD. Increasing the number of workers from n = 9 to n = 25 and n = 64 shows the mild effect of the network topology on the convergence. We observe that the sorted setting is more difficult than the randomly shuffled setting (see Fig. 7 in Appendix G), where the convergence behavior remains almost unaffected. In the following we focus on the hardest case, i.e. the ring topology.

Figure 6: Comparison of Algorithm 3 (plain), ECD-SGD, DCD-SGD and Choco-SGD with (qsgd16) quantization, for epsilon (top) and rcv1 (bottom), in terms of iterations (left) and communication cost (right), on n = 9 nodes on a ring topology.

Comparison to Baselines. Figures 5 and 6 depict the performance of these algorithms on the ring topology with n = 9 nodes for sorted data of the epsilon and rcv1 datasets. Choco-SGD performs almost as well as the exact Algorithm 3 in all situations, while using 100× less communication with (rand1%) sparsification (Fig. 5, right) and approximately 15× less communication with (qsgd16) quantization. The (top1%) variant performs slightly better than (rand1%) sparsification.

Choco-SGD consistently outperforms DCD-SGD in all settings. We also observed that DCD-SGD starts to perform better for a larger number of levels s in the (qsgd_s) quantization operator (increasing the communication cost). This is consistent with the reporting in (Tang et al., 2018a), which assumed high-precision quantization. As a surprise to us, ECD-SGD, which was proposed in (Tang et al., 2018a) as the preferred alternative over DCD-SGD for less precise quantization operators, always performs worse than DCD-SGD, and often diverges.

Figures for randomly shuffled data can be found in Appendix G. In that case Choco-SGD performs exactly as well as the exact Algorithm 3 in all situations.

Conclusion. The experiments verify our theoretical findings: Choco-Gossip is the first linearly convergent gossip algorithm with quantized communication, and Choco-SGD consistently outperforms the baselines for decentralized optimization, reaching almost the same performance as the exact algorithm without communication restrictions while significantly reducing the communication cost. In view of the striking popularity of SGD, as opposed to full-gradient methods, for deep learning, the application of Choco-SGD to decentralized deep learning, an instance of problem (1), is a promising direction.


Acknowledgments. We acknowledge funding from SNSF grant 200021_175796, as well as a Google Focused Research Award.

References

Aldous, D. and Fill, J. A. Reversible Markov chains and random walks on graphs, 2002. Unfinished monograph, recompiled 2014, available at http://www.stat.berkeley.edu/~aldous/RWG/book.html.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradientquantization and encoding. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan,S., and Garnett, R. (eds.), NIPS - Advances in Neural Information Processing Systems 30, pp. 1709–1720. Cur-ran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6768-qsgd-communication-efficient-sgd-via-gradient-quantization-and-encoding.pdf.

Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., and Renggli, C. The convergence of sparsifiedgradient methods. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R.(eds.), Advances in Neural Information Processing Systems 31, pp. 5977–5987. Curran Associates, Inc., 2018. URLhttp://papers.nips.cc/paper/7837-the-convergence-of-sparsified-gradient-methods.pdf.

Assran, M., Loizou, N., Ballas, N., and Rabbat, M. Stochastic Gradient Push for Distributed Deep Learning. arXiv,November 2018.

Aysal, T. C., Coates, M. J., and Rabbat, M. G. Distributed average consensus with dithered quantization. IEEETransactions on Signal Processing, 56(10):4905–4918, Oct 2008. ISSN 1053-587X. doi: 10.1109/TSP.2008.927071.

Bottou, L. Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y. and Saporta, G. (eds.),Proceedings of COMPSTAT’2010, pp. 177–186, Heidelberg, 2010. Physica-Verlag HD. ISBN 978-3-7908-2604-3.

Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. Randomized gossip algorithms. IEEE/ACM Trans. Netw., 14(SI):2508–2530, June 2006. ISSN 1063-6692. doi: 10.1109/TIT.2006.874516. URL https://doi.org/10.1109/TIT.2006.874516.

Carli, R., Fagnani, F., Frasca, P., Taylor, T., and Zampieri, S. Average consensus on networks with transmissionnoise or quantization. In 2007 European Control Conference (ECC), pp. 1852–1857, July 2007. doi: 10.23919/ECC.2007.7068829.

Carli, R., Bullo, F., and Zampieri, S. Quantized average consensus via dynamic coding/decoding schemes. Interna-tional Journal of Robust and Nonlinear Control, 20:156–175, 2010a. ISSN 1049-8923.

Carli, R., Frasca, P., Fagnani, F., and Zampieri, S. Gossip consensus algorithms via quantized communication.Automatica, 46:70–80, 2010b. ISSN 0005-1098.

Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches.J. Mach. Learn. Res., 13(1):165–202, January 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2503308.2188391.

Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysisand network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, March 2012. ISSN 0018-9286. doi:10.1109/TAC.2011.2161027.

Fang, J. and Li, H. Distributed estimation of gauss - markov random fields with one-bit quantized data. IEEE SignalProcessing Letters, 17(5):449–452, May 2010. ISSN 1070-9908. doi: 10.1109/LSP.2010.2043157.

Goodall, W. M. Television by pulse code modulation. The Bell System Technical Journal, 30(1):33–49, Jan 1951.ISSN 0005-8580. doi: 10.1002/j.1538-7305.1951.tb01365.x.

He, L., Bian, A., and Jaggi, M. Cola: Decentralized linear learning. In Bengio, S., Wallach, H., Larochelle, H.,Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31,pp. 4541–4551. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7705-cola-decentralized-linear-learning.pdf.

Iutzeler, F., Bianchi, P., Ciblat, P., and Hachem, W. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Proceedings of the 52nd IEEE Conference on Decision and Control, CDC 2013, December 10-13, 2013, Firenze, Italy, pp. 3671–3676. IEEE, 2013. doi: 10.1109/CDC.2013.6760448. URL https://doi.org/10.1109/CDC.2013.6760448.

Jakovetic, D., Xavier, J., and Moura, J. M. F. Fast distributed gradient methods. IEEE Transactions on AutomaticControl, 59(5):1131–1146, May 2014. ISSN 0018-9286. doi: 10.1109/TAC.2014.2298712.

Johansson, B., Rabi, M., and Johansson, M. A randomized incremental subgradient method for distributed opti-mization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2010. doi: 10.1137/08073038X.URL https://doi.org/10.1137/08073038X.

Kempe, D., Dobra, A., and Gehrke, J. Gossip-based computation of aggregate information. In Proceedings of the44th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’03, pp. 482–, Washington, DC, USA,2003. IEEE Computer Society. ISBN 0-7695-2040-5. URL http://dl.acm.org/citation.cfm?id=946243.946317.

Konecny, J. and Richtarik, P. Randomized Distributed Mean Estimation: Accuracy vs. Communication. Frontiersin Applied Mathematics and Statistics, 4:1502, December 2018.

Lan, G., Lee, S., and Zhou, Y. Communication-efficient algorithms for decentralized and stochastic optimization.Mathematical Programming, Dec 2018. ISSN 1436-4646. doi: 10.1007/s10107-018-1355-4. URL https://doi.org/10.1007/s10107-018-1355-4.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. Rcv1: A new benchmark collection for text categorization research.J. Mach. Learn. Res., 5:361–397, December 2004. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1005332.1005345.

Li, T., Fu, M., Xie, L., and Zhang, J. Distributed consensus with limited communication data rate. IEEE Transactionson Automatic Control, 56(2):279–292, Feb 2011. ISSN 0018-9286. doi: 10.1109/TAC.2010.2052384.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms out-perform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. InGuyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R.(eds.), Advances in Neural Information Processing Systems 30, pp. 5330–5340. Curran Associates, Inc.,2017. URL http://papers.nips.cc/paper/7117-can-decentralized-algorithms-outperform-centralized-algorithms-a-case-study-for-decentralized-parallel-stochastic-gradient-descent.pdf.

Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, B. Deep gradient compression: Reducing the communicationbandwidth for distributed training. In ICLR 2018 - International Conference on Learning Representations, 2018.URL https://openreview.net/forum?id=SkhQHMW0W.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions onAutomatic Control, 54(1):48–61, Jan 2009. ISSN 0018-9286. doi: 10.1109/TAC.2008.2009515.

Nedic, A., Olshevsky, A., Ozdaglar, A., and Tsitsiklis, J. N. Distributed subgradient methods and quantizationeffects. In Proceedings of the 47th IEEE Conference on Decision and Control, CDC 2008, pp. 4177–4184, 2008.ISBN 9781424431243. doi: 10.1109/CDC.2008.4738860.

Nedic, A., Lee, S., and Raginsky, M. Decentralized online optimization with global objectives and local communica-tion. In 2015 American Control Conference (ACC), pp. 4497–4503, July 2015. doi: 10.1109/ACC.2015.7172037.

Olfati-Saber, R. and Murray, R. M. Consensus problems in networks of agents with switching topology and time-delays. IEEE Transactions on Automatic Control, 49(9):1520–1533, Sep. 2004. ISSN 0018-9286. doi: 10.1109/TAC.2004.834113.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E.Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Rabbat, M. Multi-agent mirror descent for decentralized stochastic optimization. In 2015 IEEE 6th InternationalWorkshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 517–520, Dec 2015.doi: 10.1109/CAMSAP.2015.7383850.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic op-timization. In Proceedings of the 29th International Coference on International Conference on Machine Learn-ing, ICML’12, pp. 1571–1578, USA, 2012. Omnipress. ISBN 978-1-4503-1285-1. URL http://dl.acm.org/citation.cfm?id=3042573.3042774.


Robbins, H. and Monro, S. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, September 1951.

Roberts, L. Picture coding using pseudo-random noise. IRE Transactions on Information Theory, 8(2):145–154,February 1962. ISSN 0096-1000. doi: 10.1109/TIT.1962.1057702.

Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., and Massoulie, L. Optimal algorithms for smooth and strongly convexdistributed optimization in networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th InternationalConference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3027–3036, Inter-national Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/scaman17a.html.

Scaman, K., Bach, F., Bubeck, S., Massoulie, L., and Lee, Y. T. Optimal algorithms for non-smooth distributed opti-mization in networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R.(eds.), Advances in Neural Information Processing Systems 31, pp. 2745–2754. Curran Associates, Inc., 2018.URL http://papers.nips.cc/paper/7539-optimal-algorithms-for-non-smooth-distributed-optimization-in-networks.pdf.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application todata-parallel distributed training of speech DNNs. In Li, H., Meng, H. M., Ma, B., Chng, E., and Xie, L.(eds.), INTERSPEECH, pp. 1058–1062. ISCA, 2014. URL http://dblp.uni-trier.de/db/conf/interspeech/interspeech2014.html#SeideFDLY14.

Shamir, O. and Srebro, N. Distributed stochastic optimization and learning. 2014 52nd Annual Allerton Conferenceon Communication, Control, and Computing (Allerton), pp. 850–857, 2014.

Sirb, B. and Ye, X. Consensus optimization with delayed and stochastic gradients on decentralized networks. In 2016IEEE International Conference on Big Data (Big Data), pp. 76–85, Dec 2016. doi: 10.1109/BigData.2016.7840591.

Sonnenburg, S., Franc, V., Yom-Tov, E., and Sebag, M. Pascal large scale learning challenge. 25th InternationalConference on Machine Learning (ICML2008) Workshop. J. Mach. Learn. Res, 10:1937–1953, 01 2008.

Stich, S. U. Local SGD Converges Fast and Communicates Little. arXiv e-prints, art. arXiv:1805.09767, May 2018.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Bengio, S., Wallach, H., Larochelle,H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems31, pp. 4452–4463. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7697-sparsified-sgd-with-memory.pdf.

Tang, H., Gan, S., Zhang, C., Zhang, T., and Liu, J. Communication compression for decentralized training. InBengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in NeuralInformation Processing Systems 31, pp. 7663–7673. Curran Associates, Inc., 2018a. URL http://papers.nips.cc/paper/7992-communication-compression-for-decentralized-training.pdf.

Tang, H., Lian, X., Yan, M., Zhang, C., and Liu, J. D2: Decentralized training over decentralized data. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4848–4856, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018b. PMLR. URL http://proceedings.mlr.press/v80/tang18a.html.

Thanou, D., Kokiopoulou, E., Pu, Y., and Frossard, P. Distributed average consensus with quantization refine-ment. IEEE Transactions on Signal Processing, 61(1):194–205, Jan 2013. ISSN 1053-587X. doi: 10.1109/TSP.2012.2223692.

Tsitsiklis, J. N. Problems in decentralized decision making and computation. PhD thesis, Massachusetts Institute ofTechnology, 1984.

Uribe, C. A., Lee, S., and Gasnikov, A. A Dual Approach for Optimal Algorithms in Distributed Optimization overNetworks. arXiv, September 2018.

Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributedoptimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett,R. (eds.), Advances in Neural Information Processing Systems 31, pp. 1306–1316. Curran Associates, Inc.,2018. URL http://papers.nips.cc/paper/7405-gradient-sparsification-for-communication-efficient-distributed-optimization.pdf.


Wei, E. and Ozdaglar, A. Distributed alternating direction method of multipliers. In 2012 IEEE 51st IEEE Conferenceon Decision and Control (CDC), pp. 5445–5450, Dec 2012. doi: 10.1109/CDC.2012.6425904.

Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. Terngrad: Ternary gradients to reducecommunication in distributed deep learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R.,Vishwanathan, S., and Garnett, R. (eds.), NIPS - Advances in Neural Information Processing Systems 30, pp. 1509–1519. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6749-terngrad-ternary-gradients-to-reduce-communication-in-distributed-deep-learning.pdf.

Xiao, L. and Boyd, S. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65–78,2004. ISSN 0167-6911. doi: https://doi.org/10.1016/j.sysconle.2004.02.022. URL http://www.sciencedirect.com/science/article/pii/S0167691104000398.

Xiao, L., Boyd, S., and Lall, S. A scheme for robust distributed sensor fusion based on average consensus. In IPSN2005. Fourth International Symposium on Information Processing in Sensor Networks, 2005., pp. 63–70, April2005. doi: 10.1109/IPSN.2005.1440896.

Yuan, D., Xu, S., Zhao, H., and Rong, L. Distributed dual averaging method for multi-agent optimization with quan-tized communication. Systems & Control Letters, 61(11):1053 – 1061, 2012. ISSN 0167-6911. doi: https://doi.org/10.1016/j.sysconle.2012.06.004. URL http://www.sciencedirect.com/science/article/pii/S0167691112001193.

Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., and Zhang, C. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4035–4043, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/zhang17e.html.


A Basic Identities and Inequalities

A.1 Smooth and Strongly Convex Functions

Definition 2. A differentiable function $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth for parameter $L \geq 0$ if
$$f(y) \leq f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d. \qquad (9)$$

Definition 3. A differentiable function f : Rd → R is µ-strongly convex for parameter µ ≥ 0 if

$$f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d. \qquad (10)$$

Remark 5. If $f$ is $L$-smooth with minimizer $x^\star$, s.t. $\nabla f(x^\star) = 0$, then
$$\|\nabla f(x)\|^2 = \|\nabla f(x) - \nabla f(x^\star)\|^2 \leq 2L\big(f(x) - f(x^\star)\big). \qquad (11)$$

A.2 Vector and Matrix Inequalities

Remark 6. For $A \in \mathbb{R}^{d\times n}$, $B \in \mathbb{R}^{n\times n}$,
$$\|AB\|_F \leq \|A\|_F\,\|B\|_2. \qquad (12)$$

Remark 7. For an arbitrary set of $n$ vectors $\{a_i\}_{i=1}^n$, $a_i \in \mathbb{R}^d$,
$$\Big\|\sum_{i=1}^n a_i\Big\|^2 \leq n\sum_{i=1}^n \|a_i\|^2. \qquad (13)$$

Remark 8. For any two vectors $a, b \in \mathbb{R}^d$,
$$2\langle a, b\rangle \leq \gamma\|a\|^2 + \gamma^{-1}\|b\|^2, \quad \forall \gamma > 0. \qquad (14)$$

Remark 9. For any two vectors $a, b \in \mathbb{R}^d$,
$$\|a + b\|^2 \leq (1+\alpha)\|a\|^2 + (1+\alpha^{-1})\|b\|^2, \quad \forall \alpha > 0. \qquad (15)$$
This inequality also holds for the sum of two matrices $A, B \in \mathbb{R}^{n\times d}$ in the Frobenius norm.

A.3 Implications of the Bounded Gradient and Bounded Variance Assumptions

Remark 10. If $F_i \colon \mathbb{R}^d \times \Omega \to \mathbb{R}$, $i = 1, \dots, n$, are convex functions with $\mathbb{E}_\xi\|\nabla F_i(x, \xi)\|^2 \leq G^2$, and $\partial F(X, \xi) = [\nabla F_1(x_1, \xi_1), \dots, \nabla F_n(x_n, \xi_n)]$ for $X = [x_1, \dots, x_n]$, then
$$\mathbb{E}_{\xi_1, \dots, \xi_n}\|\partial F(X, \xi)\|_F^2 \leq nG^2, \quad \forall X.$$

Remark 11 (Mini-batch variance). If for the functions $f_i, F_i$ defined in (8) it holds that $\mathbb{E}_\xi\|\nabla F_i(x, \xi) - \nabla f_i(x)\|^2 \leq \sigma_i^2$, $i \in [n]$, then
$$\mathbb{E}_{\xi_1^{(t)}, \dots, \xi_n^{(t)}}\Big\|\frac{1}{n}\sum_{j=1}^n\big(\nabla f_j(x_j^{(t)}) - \nabla F_j(x_j^{(t)}, \xi_j^{(t)})\big)\Big\|^2 \leq \frac{\sigma^2}{n}, \qquad \text{where } \sigma^2 = \frac{\sum_{i=1}^n \sigma_i^2}{n}.$$

Proof. This follows from
$$\mathbb{E}\Big\|\frac{1}{n}\sum_{j=1}^n Y_j\Big\|^2 = \frac{1}{n^2}\Big(\sum_{j=1}^n \mathbb{E}\|Y_j\|^2 + \sum_{i \neq j}\mathbb{E}\langle Y_i, Y_j\rangle\Big) = \frac{1}{n^2}\sum_{j=1}^n \mathbb{E}\|Y_j\|^2 \leq \frac{1}{n^2}\sum_{j=1}^n \sigma_j^2 = \frac{\sigma^2}{n}$$
for $Y_j = \nabla f_j(x_j^{(t)}) - \nabla F_j(x_j^{(t)}, \xi_j^{(t)})$. The expectation of each cross term $\langle Y_i, Y_j\rangle$ is zero because $\xi_i$ is independent of $\xi_j$ for $i \neq j$.
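As a quick numerical illustration of Remark 11 (not part of the original derivation; the Gaussian noise model and all constants below are our own illustrative assumptions), the following Python snippet estimates the left-hand side by Monte-Carlo sampling and compares it to $\sigma^2/n$:

import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 16, 10, 20000
sigma_i = rng.uniform(0.5, 2.0, size=n)          # per-worker noise levels sigma_i
sigma2 = np.mean(sigma_i ** 2)                   # sigma^2 = (1/n) * sum_i sigma_i^2

# Y_j = grad f_j(x_j) - grad F_j(x_j, xi_j): independent zero-mean noise with E||Y_j||^2 = sigma_j^2
noise = rng.normal(size=(trials, n, d)) * (sigma_i / np.sqrt(d))[None, :, None]
avg = noise.mean(axis=1)                         # (1/n) * sum_j Y_j
estimate = np.mean(np.sum(avg ** 2, axis=-1))    # Monte-Carlo estimate of E|| (1/n) sum_j Y_j ||^2

print(f"estimate: {estimate:.4f}   sigma^2 / n: {sigma2 / n:.4f}")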


B Consensus in Matrix Notation

In the proofs in the next section we will use the matrix notation, as already introduced in the main text. We define
$$X^{(t)} := \big[x_1^{(t)}, \dots, x_n^{(t)}\big] \in \mathbb{R}^{d\times n}, \quad Q^{(t)} := \big[q_1^{(t)}, \dots, q_n^{(t)}\big] \in \mathbb{R}^{d\times n}, \quad \hat X^{(t)} := \big[\hat x_1^{(t)}, \dots, \hat x_n^{(t)}\big] \in \mathbb{R}^{d\times n}. \qquad (16)$$

Then using matrix notation we can rewrite Algorithm 1 as

Algorithm 1 Choco-Gossip in matrix notation

input: X^(0), γ, W.
1: Initialize: X̂^(0) := 0
2: for t in 0 . . . T − 1 do
3:   Q^(t) := Q(X^(t) − X̂^(t))
4:   X̂^(t+1) := X̂^(t) + Q^(t)
5:   X^(t+1) := X^(t) + γ X̂^(t+1)(W − I)
6: end for
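For illustration, the matrix form above can be simulated directly in NumPy. The sketch below is not the released implementation: the ring mixing matrix, the top-k compression operator (which satisfies the compression assumption with ω = k/d) and the use of the stepsize γ⋆ from (20) are our own illustrative choices.

import numpy as np

def top_k(M, k):
    # top-k sparsification per column, a compression operator with omega = k/d
    out = np.zeros_like(M)
    idx = np.argsort(-np.abs(M), axis=0)[:k]
    np.put_along_axis(out, idx, np.take_along_axis(M, idx, axis=0), axis=0)
    return out

n, d, k, T = 8, 20, 10, 5000
W = 0.5 * np.eye(n)                               # ring: w_ii = 1/2, w_{i,i+-1} = 1/4
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

eigs = np.linalg.eigvalsh(W)
delta = 1 - np.sort(np.abs(eigs))[-2]             # eigengap, Definition 1
beta = np.max(1 - eigs)                           # beta = max_i (1 - lambda_i(W))
omega = k / d
gamma = delta * omega / (16 * delta + delta**2 + 4 * beta**2 + 2 * delta * beta**2 - 8 * delta * omega)  # eq. (20)

rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))                       # X^(0), one column per node
X_hat = np.zeros_like(X)                          # \hat X^(0) = 0
X_bar = X.mean(axis=1, keepdims=True)             # the average, preserved by the iteration (Remark 15)

print("initial consensus error:", np.linalg.norm(X - X_bar))
for t in range(T):
    Q_t = top_k(X - X_hat, k)                     # line 3
    X_hat = X_hat + Q_t                           # line 4
    X = X + gamma * X_hat @ (W - np.eye(n))       # line 5
print("final consensus error:  ", np.linalg.norm(X - X_bar))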

Remark 12. Note that since every worker i stores x̂_j for each of its neighbors j : {i, j} ∈ E, the proper notation for x̂ would be x̂_ij instead. We simplified it using the property that if x̂_ij^(0) = x̂_kj^(0), ∀i, k : {i, j} ∈ E and {k, j} ∈ E, then they remain equal at all timesteps: x̂_ij^(t) = x̂_kj^(t), ∀t ≥ 0.

Remark 13. The results of Theorems 4 and 19 also hold for arbitrarily initialized X̂^(0), with the constraint that for every node j all neighbors of j are initialized with the same x̂_j, i.e., using the extended notation, x̂_ij^(0) = x̂_kj^(0), ∀i, k : {i, j} ∈ E and {k, j} ∈ E.

B.1 Useful Facts

Remark 14. Let $X^{(t)} = \big[x_1^{(t)}, \dots, x_n^{(t)}\big] \in \mathbb{R}^{d\times n}$ and $\bar X^{(t)} = \big[\bar x^{(t)}, \dots, \bar x^{(t)}\big] \in \mathbb{R}^{d\times n}$, for $\bar x^{(t)} = \frac{1}{n}\sum_{i=1}^n x_i^{(t)}$. Then, because $W$ is doubly stochastic,
$$\bar X^{(t)} = X^{(t)}\tfrac{1}{n}\mathbf{1}\mathbf{1}^\top, \qquad \bar X^{(t)}W = \bar X^{(t)}. \qquad (17)$$

Remark 15. The average $\bar X^{(t)} = \big[\bar x^{(t)}, \dots, \bar x^{(t)}\big] \in \mathbb{R}^{d\times n}$ is preserved during the iterates of Algorithm 1, i.e.
$$\bar X^{(t)} = \bar X^{(0)}, \quad \forall t. \qquad (18)$$

Proof.
$$\bar X^{(t+1)} = X^{(t+1)}\frac{\mathbf{1}\mathbf{1}^\top}{n} = \bar X^{(t)} + \gamma\hat X^{(t+1)}(W - I)\frac{\mathbf{1}\mathbf{1}^\top}{n} = \bar X^{(t)},$$
because $(W - I)\frac{\mathbf{1}\mathbf{1}^\top}{n} = 0$, as $W\frac{\mathbf{1}\mathbf{1}^\top}{n} = \frac{\mathbf{1}\mathbf{1}^\top}{n}$ since $W$ is doubly stochastic.

Lemma 16. For $W$ satisfying Definition 1, i.e. $W$ is a symmetric doubly stochastic matrix with second largest eigenvalue magnitude $1 - \delta = |\lambda_2(W)| < 1$,
$$\Big\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\Big\|_2 \leq (1-\delta)^k. \qquad (19)$$


Proof. Let $U\Lambda U^\top$ be the eigenvalue decomposition of the symmetric matrix $W$; then $W^k = U\Lambda^k U^\top$. Because of the stochasticity of $W$, its eigenvector for the eigenvalue $1$ is $u_1 = \frac{1}{\sqrt{n}}\mathbf{1}$, and hence
$$U\,\mathrm{diag}(1, 0, \dots, 0)\,U^\top = u_1 u_1^\top = \frac{1}{n}\mathbf{1}\mathbf{1}^\top.$$
Therefore,
$$\Big\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\Big\|_2 = \big\|U\Lambda^k U^\top - U\,\mathrm{diag}(1, 0, \dots, 0)\,U^\top\big\|_2 = \big\|\Lambda^k - \mathrm{diag}(1, 0, \dots, 0)\big\|_2 = (1-\delta)^k.$$
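As a small numerical illustration of Lemma 16 (with a ring mixing matrix of our own choosing, not one used in the paper), the following snippet evaluates both sides of (19):

import numpy as np

n, k = 12, 15
W = 0.5 * np.eye(n)                       # ring: w_ii = 1/2, w_{i,i+-1} = 1/4
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

delta = 1 - np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]       # 1 - |lambda_2(W)|
avg = np.ones((n, n)) / n
lhs = np.linalg.norm(np.linalg.matrix_power(W, k) - avg, 2)  # spectral norm
print(f"||W^k - (1/n) 1 1^T||_2 = {lhs:.3e}   (1 - delta)^k = {(1 - delta) ** k:.3e}")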

C Proof of Theorem 2—Convergence of Choco-Gossip

Lemma 17. Let $X^{(t)}, \hat X^{(t)} \in \mathbb{R}^{d\times n}$, $\bar X = [\bar x, \dots, \bar x]$ for the average $\bar x = \frac{1}{n}X^{(t)}\mathbf{1} \in \mathbb{R}^d$, and let $X^{(t+1)} = X^{(t)} + \gamma\hat X^{(t+1)}(W - I) \in \mathbb{R}^{d\times n}$ be defined as in Algorithm 1 with stepsize $\gamma \geq 0$ and mixing matrix $W \in \mathbb{R}^{n\times n}$ as in Definition 1. Then
$$\big\|X^{(t+1)} - \bar X\big\|_F^2 \leq (1-\delta\gamma)^2(1+\alpha_1)\big\|X^{(t)} - \bar X\big\|_F^2 + \gamma^2(1+\alpha_1^{-1})\beta^2\big\|\hat X^{(t+1)} - X^{(t)}\big\|_F^2, \quad \forall \alpha_1 > 0.$$
Here $\alpha_1 > 0$ is a parameter whose value will be chosen later, $\delta = 1 - |\lambda_2(W)|$ and $\beta = \max_i\{1 - \lambda_i(W)\}$ as defined above.

Proof. By the definition of $X^{(t+1)}$ and the observation $\bar X(W - I) = 0$, we can write
\begin{align*}
\big\|X^{(t+1)} - \bar X\big\|_F^2 &= \big\|X^{(t)} - \bar X + \gamma\hat X^{(t+1)}(W - I)\big\|_F^2\\
&= \big\|X^{(t)} - \bar X + \gamma\big(X^{(t)} - \bar X\big)(W - I) + \gamma\big(\hat X^{(t+1)} - X^{(t)}\big)(W - I)\big\|_F^2\\
&= \big\|\big(X^{(t)} - \bar X\big)\big((1-\gamma)I + \gamma W\big) + \gamma\big(\hat X^{(t+1)} - X^{(t)}\big)(W - I)\big\|_F^2\\
&\overset{(15)}{\leq} (1+\alpha_1)\big\|\big(X^{(t)} - \bar X\big)\big((1-\gamma)I + \gamma W\big)\big\|_F^2 + (1+\alpha_1^{-1})\big\|\gamma\big(\hat X^{(t+1)} - X^{(t)}\big)(W - I)\big\|_F^2\\
&\overset{(12)}{\leq} (1+\alpha_1)\big\|\big(X^{(t)} - \bar X\big)\big((1-\gamma)I + \gamma W\big)\big\|_F^2 + (1+\alpha_1^{-1})\gamma^2\|W - I\|_2^2\cdot\big\|\hat X^{(t+1)} - X^{(t)}\big\|_F^2.
\end{align*}
Let us estimate the first term:
\begin{align*}
\big\|\big(X^{(t)} - \bar X\big)\big((1-\gamma)I + \gamma W\big)\big\|_F &\leq (1-\gamma)\big\|X^{(t)} - \bar X\big\|_F + \gamma\big\|\big(X^{(t)} - \bar X\big)W\big\|_F\\
&\overset{(17)}{=} (1-\gamma)\big\|X^{(t)} - \bar X\big\|_F + \gamma\big\|\big(X^{(t)} - \bar X\big)\big(W - \mathbf{1}\mathbf{1}^\top/n\big)\big\|_F\\
&\overset{(19),(12)}{\leq} (1-\gamma\delta)\big\|X^{(t)} - \bar X\big\|_F,
\end{align*}
where we used $\big(X^{(t)} - \bar X\big)\mathbf{1}\mathbf{1}^\top/n = 0$, by the definition of $\bar X$, in the second line. Putting this together gives the statement of the lemma.


Lemma 18. Let $X^{(t)}, \hat X^{(t)} \in \mathbb{R}^{d\times n}$, $\bar X = [\bar x, \dots, \bar x]$ for the average $\bar x = \frac{1}{n}X^{(t)}\mathbf{1} \in \mathbb{R}^d$, and let $X^{(t+1)} \in \mathbb{R}^{d\times n}$ and $\hat X^{(t+2)} \in \mathbb{R}^{d\times n}$ be defined as in Algorithm 1 with stepsize $\gamma \geq 0$, mixing matrix $W \in \mathbb{R}^{n\times n}$ as in Definition 1 and quantization as in Assumption 1. Then
$$\mathbb{E}_Q\big\|X^{(t+1)} - \hat X^{(t+2)}\big\|_F^2 \leq (1-\omega)(1+\gamma\beta)^2(1+\alpha_2)\big\|X^{(t)} - \hat X^{(t+1)}\big\|_F^2 + (1-\omega)\gamma^2\beta^2(1+\alpha_2^{-1})\big\|X^{(t)} - \bar X\big\|_F^2, \quad \forall \alpha_2 > 0.$$
Here $\alpha_2 > 0$ is a parameter whose value will be chosen later, $\beta = \max_i\{1 - \lambda_i(W)\}$ as defined above, and $\omega > 0$ is the compression ratio.

Proof. By the definitions of $X^{(t+1)}$ and $\hat X^{(t+2)}$ we can write
\begin{align*}
\mathbb{E}_Q\big\|X^{(t+1)} - \hat X^{(t+2)}\big\|_F^2 &= \mathbb{E}_Q\big\|X^{(t+1)} - \hat X^{(t+1)} - Q\big(X^{(t+1)} - \hat X^{(t+1)}\big)\big\|_F^2\\
&\overset{(7)}{\leq} (1-\omega)\big\|X^{(t+1)} - \hat X^{(t+1)}\big\|_F^2\\
&= (1-\omega)\big\|X^{(t)} + \gamma\hat X^{(t+1)}(W - I) - \hat X^{(t+1)}\big\|_F^2\\
&\overset{(17)}{=} (1-\omega)\big\|\big(X^{(t)} - \hat X^{(t+1)}\big)\big((1+\gamma)I - \gamma W\big) + \gamma\big(X^{(t)} - \bar X\big)(W - I)\big\|_F^2\\
&\overset{(15)}{\leq} (1-\omega)(1+\alpha_2)\big\|\big(X^{(t)} - \hat X^{(t+1)}\big)\big((1+\gamma)I - \gamma W\big)\big\|_F^2 + (1-\omega)(1+\alpha_2^{-1})\big\|\gamma\big(X^{(t)} - \bar X\big)(W - I)\big\|_F^2\\
&\overset{(12)}{\leq} (1-\omega)(1+\gamma\beta)^2(1+\alpha_2)\big\|X^{(t)} - \hat X^{(t+1)}\big\|_F^2 + (1-\omega)\gamma^2\beta^2(1+\alpha_2^{-1})\big\|X^{(t)} - \bar X\big\|_F^2,
\end{align*}
where we used $\|(1+\gamma)I - \gamma W\|_2 = \|I + \gamma(I - W)\|_2 = 1 + \gamma\|I - W\|_2 = 1 + \gamma\beta$, because the eigenvalues of $\gamma(I - W)$ are non-negative.

Proof of Theorem 2. As observed in Remark 14, the average of the iterates is preserved, i.e. $\bar X \equiv X^{(t)}\frac{1}{n}\mathbf{1}\mathbf{1}^\top$ for all $t \geq 0$. By applying Lemmas 17 and 18 from above we obtain, with $e_t := \big\|X^{(t)} - \bar X\big\|_F^2 + \big\|X^{(t)} - \hat X^{(t+1)}\big\|_F^2$,
$$\mathbb{E}_Q\,e_{t+1} \leq \eta_1(\gamma)\big\|X^{(t)} - \bar X\big\|_F^2 + \xi_1(\gamma)\big\|X^{(t)} - \hat X^{(t+1)}\big\|_F^2 \leq \max\{\eta_1(\gamma), \xi_1(\gamma)\}\cdot e_t,$$
where
\begin{align*}
\eta_1(\gamma) &:= (1-\delta\gamma)^2(1+\alpha_1) + (1-\omega)\gamma^2\beta^2(1+\alpha_2^{-1}),\\
\xi_1(\gamma) &:= \gamma^2\beta^2(1+\alpha_1^{-1}) + (1-\omega)(1+\gamma\beta)^2(1+\alpha_2).
\end{align*}
Now we need to choose the parameters $\alpha_1, \alpha_2$ and the stepsize $\gamma$ so as to minimize the factor $\max\{\eta_1(\gamma), \xi_1(\gamma)\}$. Whilst the optimal parameter settings could for instance be obtained using specialized optimization software, we here proceed by showing that for the (suboptimal) choice
$$\alpha_1 := \frac{\gamma\delta}{2}, \qquad \alpha_2 := \frac{\omega}{2}, \qquad \gamma^\star := \frac{\delta\omega}{16\delta + \delta^2 + 4\beta^2 + 2\delta\beta^2 - 8\delta\omega} \qquad (20)$$
it holds that
$$\max\{\eta_1(\gamma^\star), \xi_1(\gamma^\star)\} \leq 1 - \frac{\delta^2\omega}{2(16\delta + \delta^2 + 4\beta^2 + 2\delta\beta^2 - 8\delta\omega)}. \qquad (21)$$


The claim of the theorem then follows by observing
$$1 - \frac{\delta^2\omega}{2(16\delta + \delta^2 + 4\beta^2 + 2\delta\beta^2 - 8\delta\omega)} \leq 1 - \frac{\delta^2\omega}{82}, \qquad (22)$$
using the crude estimates $0 \leq \delta \leq 1$, $\beta \leq 2$, $\omega \geq 0$. We now proceed to show that (21) holds. Observe that for $\alpha_1, \alpha_2$ as in (20),
$$\eta_1(\gamma) \leq (1-\gamma\delta)\Big[(1-\gamma\delta)\Big(1 + \frac{\gamma\delta}{2}\Big)\Big] + \gamma^2\beta^2(1-\omega)\Big(1 + \frac{2}{\omega}\Big) \leq \Big(1 - \frac{\gamma\delta}{2}\Big)^2 + \frac{2}{\omega}\gamma^2\beta^2 =: \eta_2(\gamma),$$
where we used the inequalities $(1-x)(1+\frac{x}{2}) \leq (1-\frac{x}{2})$ and $(1-\omega)(1+\frac{2}{\omega}) \leq \frac{2}{\omega}$ for $\omega > 0$. The quadratic function $\eta_2(\gamma)$ is minimized for $\gamma' = \frac{2\delta\omega}{8\beta^2 + \delta^2\omega}$ with value $\eta_2(\gamma') = \frac{8\beta^2}{8\beta^2 + \delta^2\omega} < 1$. Thus by Jensen's inequality (convexity of $\eta_2$),
$$\eta_2(\lambda\gamma') \leq (1-\lambda)\eta_2(0) + \lambda\eta_2(\gamma') = 1 - \lambda\frac{\delta^2\omega}{8\beta^2 + \delta^2\omega} \qquad (23)$$
for $0 \leq \lambda \leq 1$, and especially for the choice $\lambda' = \frac{8\beta^2 + \delta^2\omega}{2(16\delta + \delta^2 + 4\beta^2 + 2\delta\beta^2 - 8\delta\omega)}$ we have
$$\eta_1(\gamma^\star) \leq \eta_2(\lambda'\gamma') \overset{(23)}{\leq} 1 - \frac{\delta^2\omega}{2(16\delta + \delta^2 + 4\beta^2 + 2\delta\beta^2 - 8\delta\omega)}, \qquad (24)$$
as $\gamma^\star = \lambda'\gamma'$. Now we proceed to estimate $\xi_1(\gamma^\star)$. Observe
$$\xi_1(\gamma) \leq \gamma^2\beta^2\Big(1 + \frac{2}{\gamma\delta}\Big) + (1+\gamma\beta)^2(1-\omega)\Big(1 + \frac{\omega}{2}\Big) \leq \gamma^2\beta^2\Big(1 + \frac{2}{\gamma\delta}\Big) + (1+\gamma\beta)^2\Big(1 - \frac{\omega}{2}\Big), \qquad (25)$$
again from $(1-x)(1+\frac{x}{2}) \leq (1-\frac{x}{2})$ for $x > 0$. As $\beta \leq 2$ we can estimate $(1+\gamma\beta)^2 \leq 1 + 8\gamma$ for any $0 \leq \gamma \leq 1$. Furthermore $\gamma^2 \leq \gamma$ for $0 \leq \gamma \leq 1$. Thus
$$\xi_1(\gamma^\star) \leq \beta^2\Big(\gamma^\star + \frac{2\gamma^\star}{\delta}\Big) + \Big(1 - \frac{\omega}{2}\Big)(1 + 8\gamma^\star) = 1 - \frac{\delta^2\omega}{2(16\delta + \delta^2 + 4\beta^2 + 2\delta\beta^2 - 8\delta\omega)}, \qquad (26)$$
as a quick calculation shows.
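As an arithmetic sanity check of the parameter choice (20) and of the bounds (21) and (22), with δ, β, ω set to arbitrary admissible example values of our own choosing, one can evaluate η₁ and ξ₁ directly in Python:

delta, beta, omega = 0.25, 1.0, 0.5     # example values with 0 <= delta <= 1, beta <= 2, 0 < omega <= 1
D = 16 * delta + delta**2 + 4 * beta**2 + 2 * delta * beta**2 - 8 * delta * omega
gamma = delta * omega / D               # gamma* as in (20)
a1, a2 = gamma * delta / 2, omega / 2   # alpha_1, alpha_2 as in (20)

eta1 = (1 - delta * gamma)**2 * (1 + a1) + (1 - omega) * gamma**2 * beta**2 * (1 + 1 / a2)
xi1 = gamma**2 * beta**2 * (1 + 1 / a1) + (1 - omega) * (1 + gamma * beta)**2 * (1 + a2)

print("max(eta1, xi1):   ", max(eta1, xi1))
print("bound (21):       ", 1 - delta**2 * omega / (2 * D))
print("cruder bound (22):", 1 - delta**2 * omega / 82)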

D Proof of Theorem 4—Convergence of Choco-SGD

Recall that $\{x_i^{(t)}\}_{t=0}^T$ denote the iterates of Algorithm 2 on worker $i \in [n]$. We define
$$\bar x^{(t)} := \frac{1}{n}\sum_{i=1}^n x_i^{(t)}, \qquad (27)$$
the average over all workers. Note that this quantity is not available to the workers at any given time, but it will be convenient to use in the proofs. In this section we use both vector and matrix notation, whichever is more convenient, and define
$$X^{(t)} := \big[x_1^{(t)}, \dots, x_n^{(t)}\big] \in \mathbb{R}^{d\times n}, \qquad \bar X^{(t)} := \big[\bar x^{(t)}, \dots, \bar x^{(t)}\big] \in \mathbb{R}^{d\times n},$$
$$\partial F(X^{(t)}, \xi^{(t)}) := \big[\nabla F_1(x_1^{(t)}, \xi_1^{(t)}), \dots, \nabla F_n(x_n^{(t)}, \xi_n^{(t)})\big] \in \mathbb{R}^{d\times n}. \qquad (28)$$


Instead of proving Theorem 4 directly, we prove a slightly more general statement in this section. Algorithm 2 relies on the (compressed) consensus Algorithm 1. However, we can also show convergence of Algorithm 2 for more general averaging schemes. In Algorithm 4 below, the function $h \colon \mathbb{R}^{d\times n} \times \mathbb{R}^{d\times n} \to \mathbb{R}^{d\times n} \times \mathbb{R}^{d\times n}$ denotes a blackbox averaging scheme. Note that $h$ could be random.

Algorithm 4 Decentralized SGD with arbitrary averaging scheme

input: X^(0), stepsizes {η_t}_{t=0}^{T−1}, averaging function h : R^{d×n} × R^{d×n} → R^{d×n} × R^{d×n}
1: In parallel (task for worker i, i ∈ [n])
2: for t in 0 . . . T − 1 do
3:   X^(t+1/2) = X^(t) − η_t ∂F(X^(t), ξ^(t))        ▷ stochastic gradient updates
4:   (X^(t+1), Y^(t+1)) = h(X^(t+1/2), Y^(t))         ▷ blackbox averaging/gossip
5: end for

In this work we focus in particular on two choices of $h$, the averaging operator $h \colon (X^{(t)}, Y^{(t)}) \mapsto (X^{(t+1)}, Y^{(t+1)})$:

• Setting $X^{(t+1)} = X^{(t)}W$ and $Y^{(t+1)} = X^{(t+1)}$ corresponds to standard (exact) averaging with mixing matrix $W$, as in algorithm (E-G).

• Setting $X^{(t+1)} = X^{(t)} + \gamma Y^{(t)}(W - I)$ and $Y^{(t+1)} = Y^{(t)} + Q(X^{(t+1)} - Y^{(t)})$, for $Y^{(t)} = \hat X^{(t+1)}$, we get the compressed consensus algorithm (Choco-G), leading to Algorithm 2, as introduced in the main text (a schematic implementation of both choices is sketched below).
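The following Python sketch makes the blackbox interface concrete by plugging both averaging operators into the loop of Algorithm 4. It is an illustration only: the quadratic objectives f_i(x) = ½‖x − c_i‖², the mixing matrix, the rand-k compression and the stepsizes are all assumptions of ours, not the settings used in the experiments; the compressed operator follows the update order of Algorithms 2/6 (equivalent up to an index shift in Y).

import numpy as np

rng = np.random.default_rng(0)
n, d, T, k = 8, 20, 500, 10
W = np.full((n, n), 0.5 / n) + 0.5 * np.eye(n)     # symmetric, doubly stochastic mixing matrix
C = rng.normal(size=(d, n))                        # f_i(x) = 0.5 * ||x - c_i||^2, so x* = mean_i c_i
x_star = C.mean(axis=1, keepdims=True)
gamma = 0.03                                       # consensus stepsize, roughly gamma* from (20) for this W and omega = k/d

def rand_k(M):
    # rand-k sparsification per column, a compression operator with omega = k/d
    out = np.zeros_like(M)
    for j in range(n):
        idx = rng.choice(d, size=k, replace=False)
        out[idx, j] = M[idx, j]
    return out

def h_exact(X, Y):
    Xn = X @ W                                     # exact averaging: X+ = XW, Y+ = X+
    return Xn, Xn

def h_choco(X, Y):
    Yn = Y + rand_k(X - Y)                         # Y plays the role of the public copies \hat X
    Xn = X + gamma * Yn @ (W - np.eye(n))          # gossip step on the compressed copies
    return Xn, Yn

for name, h in (("exact averaging", h_exact), ("Choco-G", h_choco)):
    X, Y = np.zeros((d, n)), np.zeros((d, n))
    for t in range(T):
        eta = 2.0 / (t + 10)                       # decaying stepsize (illustrative)
        X_half = X - eta * (X - C)                 # line 3 of Algorithm 4 (exact gradients here)
        X, Y = h(X_half, Y)                        # line 4 of Algorithm 4
    x_bar = X.mean(axis=1, keepdims=True)
    print(f"{name:16s} ||x_bar - x*|| = {np.linalg.norm(x_bar - x_star):.2e}")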

Assumption 3. For an averaging scheme $h \colon \mathbb{R}^{d\times n} \times \mathbb{R}^{d\times n} \to \mathbb{R}^{d\times n} \times \mathbb{R}^{d\times n}$, let $(X^+, Y^+) := h(X, Y)$ for $X, Y \in \mathbb{R}^{d\times n}$. Assume that $h$ preserves the average of the first iterate over all iterations:
$$X^+\frac{\mathbf{1}\mathbf{1}^\top}{n} = X\frac{\mathbf{1}\mathbf{1}^\top}{n}, \quad \forall X, Y \in \mathbb{R}^{d\times n},$$
and that it converges with linear rate for a parameter $0 < p \leq 1$,
$$\mathbb{E}_h\,\Psi(X^+, Y^+) \leq (1-p)\Psi(X, Y), \quad \forall X, Y \in \mathbb{R}^{d\times n},$$
with Lyapunov function $\Psi(X, Y) := \|X - \bar X\|_F^2 + \|X - Y\|_F^2$, where $\bar X := \frac{1}{n}X\mathbf{1}\mathbf{1}^\top$ and $\mathbb{E}_h$ denotes the expectation over the internal randomness of the averaging scheme $h$.

This assumption holds for exact averaging as in (E-G) with parameter $p = \gamma\delta$ (as shown in Theorem 1). For the proposed compressed consensus algorithm (Choco-G) the assumption holds with parameter $p = \frac{\omega\delta^2}{82}$ (as shown in Theorem 2). Here $\omega$ denotes the compression ratio and $\delta$ the eigengap of the mixing matrix $W$. We can now state the more general theorem (that generalizes Theorem 4):

Theorem 19. Under Assumption 3 with $p > 0$, Algorithm 4 with stepsize $\eta_t = \frac{4}{\mu(a+t)}$, for parameter $a \geq \max\big\{\frac{5}{p}, 16\kappa\big\}$, $\kappa = \frac{L}{\mu}$, converges at the rate
$$f\big(\bar x_{\mathrm{avg}}^{(T)}\big) - f^\star \leq \frac{\mu a^3}{8 S_T}\big\|\bar x^{(0)} - x^\star\big\|^2 + \frac{4T(T+2a)}{\mu S_T}\frac{\sigma^2}{n} + \frac{64T}{\mu^2 S_T}(2L+\mu)\frac{40}{p^2}G^2,$$
where $\bar x_{\mathrm{avg}}^{(T)} = \frac{1}{S_T}\sum_{t=0}^{T-1}w_t\bar x^{(t)}$ for weights $w_t = (a+t)^2$, and $S_T = \sum_{t=0}^{T-1}w_t \geq \frac{1}{3}T^3$.

Proof of Theorem 4. The proof follows from Theorem 19 using the consensus averaging Algorithm 1 (giving $p = \frac{\delta^2\omega}{82}$ by Theorem 2) and the inequality $\mu\big\|\bar x^{(0)} - x^\star\big\| \leq 2G$ derived in (Rakhlin et al., 2012, Lemma 2) to upper bound the first term.


D.1 Proof of Theorem 19

The proof below uses techniques from both (Stich et al., 2018) and (Stich, 2018).

Lemma 20. The averages $\bar x^{(t)}$ of the iterates of Algorithm 4 satisfy
\begin{align*}
\mathbb{E}_{\xi_1^{(t)}, \dots, \xi_n^{(t)}}\big\|\bar x^{(t+1)} - x^\star\big\|^2 \leq{}& \Big(1 - \frac{\eta_t\mu}{2}\Big)\big\|\bar x^{(t)} - x^\star\big\|^2 + \frac{\eta_t^2\sigma^2}{n} - 2\eta_t(1 - 2L\eta_t)\big(f(\bar x^{(t)}) - f^\star\big)\\
&+ \eta_t\frac{2\eta_t L^2 + L + \mu}{n}\sum_{i=1}^n\big\|\bar x^{(t)} - x_i^{(t)}\big\|^2,
\end{align*}
where $\sigma^2 = \frac{1}{n}\sum_{i=1}^n\sigma_i^2$.

Proof. Because the blackbox averaging function $h$ preserves the average (Assumption 3), we have
\begin{align*}
\big\|\bar x^{(t+1)} - x^\star\big\|^2 &= \Big\|\bar x^{(t)} - \frac{\eta_t}{n}\sum_{j=1}^n\nabla F_j(x_j^{(t)}, \xi_j^{(t)}) - x^\star\Big\|^2\\
&= \Big\|\bar x^{(t)} - x^\star - \frac{\eta_t}{n}\sum_{i=1}^n\nabla f_i(x_i^{(t)}) + \frac{\eta_t}{n}\sum_{i=1}^n\nabla f_i(x_i^{(t)}) - \frac{\eta_t}{n}\sum_{j=1}^n\nabla F_j(x_j^{(t)}, \xi_j^{(t)})\Big\|^2\\
&= \Big\|\bar x^{(t)} - x^\star - \frac{\eta_t}{n}\sum_{i=1}^n\nabla f_i(x_i^{(t)})\Big\|^2 + \eta_t^2\Big\|\frac{1}{n}\sum_{i=1}^n\nabla f_i(x_i^{(t)}) - \frac{1}{n}\sum_{j=1}^n\nabla F_j(x_j^{(t)}, \xi_j^{(t)})\Big\|^2\\
&\quad + \frac{2\eta_t}{n}\Big\langle\bar x^{(t)} - x^\star - \frac{\eta_t}{n}\sum_{i=1}^n\nabla f_i(x_i^{(t)}),\ \sum_{i=1}^n\nabla f_i(x_i^{(t)}) - \sum_{j=1}^n\nabla F_j(x_j^{(t)}, \xi_j^{(t)})\Big\rangle.
\end{align*}
The last term is zero in expectation, as $\mathbb{E}_{\xi_i^{(t)}}\nabla F_i(x_i^{(t)}, \xi_i^{(t)}) = \nabla f_i(x_i^{(t)})$. The second term is at most $\frac{\eta_t^2\sigma^2}{n}$ (Remark 11). The first term can be written as
$$\Big\|\bar x^{(t)} - x^\star - \frac{\eta_t}{n}\sum_{i=1}^n\nabla f_i(x_i^{(t)})\Big\|^2 = \big\|\bar x^{(t)} - x^\star\big\|^2 + \eta_t^2\underbrace{\Big\|\frac{1}{n}\sum_{i=1}^n\nabla f_i(x_i^{(t)})\Big\|^2}_{=:T_1} - \underbrace{2\eta_t\Big\langle\bar x^{(t)} - x^\star,\ \frac{1}{n}\sum_{i=1}^n\nabla f_i(x_i^{(t)})\Big\rangle}_{=:T_2}.$$
We can estimate
\begin{align*}
T_1 &= \Big\|\frac{1}{n}\sum_{i=1}^n\big(\nabla f_i(x_i^{(t)}) - \nabla f_i(\bar x^{(t)}) + \nabla f_i(\bar x^{(t)}) - \nabla f_i(x^\star)\big)\Big\|^2\\
&\overset{(13)}{\leq} \frac{2}{n}\sum_{i=1}^n\big\|\nabla f_i(x_i^{(t)}) - \nabla f_i(\bar x^{(t)})\big\|^2 + 2\Big\|\frac{1}{n}\sum_{i=1}^n\nabla f_i(\bar x^{(t)}) - \frac{1}{n}\sum_{i=1}^n\nabla f_i(x^\star)\Big\|^2\\
&\overset{(9),(11)}{\leq} \frac{2L^2}{n}\sum_{i=1}^n\big\|x_i^{(t)} - \bar x^{(t)}\big\|^2 + \frac{4L}{n}\sum_{i=1}^n\big(f_i(\bar x^{(t)}) - f_i(x^\star)\big) = \frac{2L^2}{n}\sum_{i=1}^n\big\|x_i^{(t)} - \bar x^{(t)}\big\|^2 + 4L\big(f(\bar x^{(t)}) - f^\star\big).
\end{align*}


And for the remaining term $T_2$:
\begin{align*}
-\frac{1}{\eta_t}T_2 &= -\frac{2}{n}\sum_{i=1}^n\Big[\big\langle\bar x^{(t)} - x_i^{(t)},\ \nabla f_i(x_i^{(t)})\big\rangle + \big\langle x_i^{(t)} - x^\star,\ \nabla f_i(x_i^{(t)})\big\rangle\Big]\\
&\overset{(9),(10)}{\leq} -\frac{2}{n}\sum_{i=1}^n\Big[f_i(\bar x^{(t)}) - f_i(x_i^{(t)}) - \frac{L}{2}\big\|\bar x^{(t)} - x_i^{(t)}\big\|^2 + f_i(x_i^{(t)}) - f_i(x^\star) + \frac{\mu}{2}\big\|x_i^{(t)} - x^\star\big\|^2\Big]\\
&\overset{(13)}{\leq} -2\big(f(\bar x^{(t)}) - f(x^\star)\big) + \frac{L+\mu}{n}\sum_{i=1}^n\big\|\bar x^{(t)} - x_i^{(t)}\big\|^2 - \frac{\mu}{2}\big\|\bar x^{(t)} - x^\star\big\|^2.
\end{align*}
Putting everything together gives the statement of the lemma.

Lemma 21. The iterates $\{X^{(t)}\}_{t\geq 0}$ of Algorithm 4 with stepsizes $\eta_t = \frac{b}{t+a}$, for parameters $a \geq \frac{5}{p}$ and $b > 0$, satisfy
$$\mathbb{E}\big\|X^{(t+1)} - \bar X^{(t+1)}\big\|_F^2 \leq 40\eta_t^2\frac{1}{p^2}nG^2.$$
Here $0 < p \leq 1$ denotes the convergence rate of the blackbox averaging algorithm as in Assumption 3.

Proof. Using the linear convergence of the blackbox averaging algorithm given in Assumption 3, we can write, for $\Xi := \mathbb{E}\big\|X^{(t+1)} - \bar X^{(t+1)}\big\|_F^2 + \mathbb{E}\big\|X^{(t+1)} - \hat X^{(t+2)}\big\|_F^2$,
\begin{align*}
\Xi &\leq (1-p)\,\mathbb{E}\big\|X^{(t+\frac{1}{2})} - \bar X^{(t+\frac{1}{2})}\big\|_F^2 + (1-p)\,\mathbb{E}\big\|\hat X^{(t+1)} - X^{(t+\frac{1}{2})}\big\|_F^2\\
&= (1-p)\,\mathbb{E}\Big\|X^{(t)} - \bar X^{(t)} + \eta_t\,\partial F(X^{(t)}, \xi^{(t)})\Big(\frac{\mathbf{1}\mathbf{1}^\top}{n} - I\Big)\Big\|_F^2 + (1-p)\,\mathbb{E}\big\|\hat X^{(t+1)} - X^{(t)} + \eta_t\,\partial F(X^{(t)}, \xi^{(t)})\big\|_F^2\\
&\overset{(15)}{\leq} (1-p)(1+\alpha_3^{-1})\,\mathbb{E}\Big(\big\|X^{(t)} - \bar X^{(t)}\big\|_F^2 + \big\|\hat X^{(t+1)} - X^{(t)}\big\|_F^2\Big)\\
&\qquad + (1-p)(1+\alpha_3)\eta_t^2\,\mathbb{E}\Big(\Big\|\partial F(X^{(t)}, \xi^{(t)})\Big(\frac{\mathbf{1}\mathbf{1}^\top}{n} - I\Big)\Big\|_F^2 + \big\|\partial F(X^{(t)}, \xi^{(t)})\big\|_F^2\Big)\\
&\leq (1-p)\Big((1+\alpha_3^{-1})\,\mathbb{E}\Big(\big\|X^{(t)} - \bar X^{(t)}\big\|_F^2 + \big\|\hat X^{(t+1)} - X^{(t)}\big\|_F^2\Big) + 2n(1+\alpha_3)\eta_t^2 G^2\Big)\\
&\overset{\alpha_3 = \frac{2}{p}}{\leq} \Big(1 - \frac{p}{2}\Big)\,\mathbb{E}\Big(\big\|X^{(t)} - \bar X^{(t)}\big\|_F^2 + \big\|\hat X^{(t+1)} - X^{(t)}\big\|_F^2\Big) + \frac{4n}{p}\eta_t^2 G^2.
\end{align*}
The statement now follows from Lemma 22 (applied to $r_t := \mathbb{E}\big[\|X^{(t)} - \bar X^{(t)}\|_F^2 + \|X^{(t)} - \hat X^{(t+1)}\|_F^2\big]$ with $A = 2nG^2$) and the inequality
$$\mathbb{E}\big\|X^{(t+1)} - \bar X^{(t+1)}\big\|_F^2 \leq \Xi = \mathbb{E}\big\|X^{(t+1)} - \hat X^{(t+2)}\big\|_F^2 + \mathbb{E}\big\|X^{(t+1)} - \bar X^{(t+1)}\big\|_F^2.$$

Lemma 22. Let $\{r_t\}_{t\geq 0}$ denote a sequence of positive real values satisfying $r_0 = 0$ and
$$r_{t+1} \leq \Big(1 - \frac{p}{2}\Big)r_t + \frac{2}{p}\eta_t^2 A, \quad \forall t \geq 0,$$
for a parameter $p > 0$ and stepsizes $\eta_t = \frac{b}{t+a}$, with parameter $a \geq \frac{5}{p}$ and arbitrary $b > 0$. Then $r_t$ is bounded as
$$r_t \leq 20\eta_t^2\frac{1}{p^2}A, \quad \forall t \geq 0.$$


Proof. We proceed by induction. For $t = 0$ the statement holds by the assumption $r_0 = 0$. Suppose that the statement holds at timestep $t$; then for timestep $t+1$,
$$r_{t+1} \leq \Big(1 - \frac{p}{2}\Big)r_t + \frac{2}{p}\eta_t^2 A \leq \Big(1 - \frac{p}{2}\Big)20\eta_t^2\frac{1}{p^2}A + \frac{2}{p}\eta_t^2 A = A\eta_t^2\frac{1}{p^2}(-8p + 20).$$
Now we show $\eta_t^2(-8p + 20) \leq 20\eta_{t+1}^2$, which proves the claim. By assumption $p \geq \frac{5}{a}$, hence
$$\eta_t^2(-8p + 20) \leq 20\eta_t^2\Big(1 - \frac{2}{a}\Big) \leq 20\eta_{t+1}^2,$$
where the second inequality follows from
$$(a+t+1)^2\Big(1 - \frac{2}{a}\Big) = (a+t)^2 + 2(a+t) + 1 - \Big(\frac{2(a+t)^2}{a} + \frac{4(a+t)}{a} + \frac{2}{a}\Big) \leq (a+t)^2 + 2(a+t) + 1 - \big(2(a+t) + 4\big) \leq (a+t)^2.$$
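A direct numerical check of Lemma 22 (with arbitrary example values for p, a, b and A chosen by us, satisfying a ≥ 5/p) simulates the recursion and records the largest violation of the claimed bound, which stays non-positive:

p, a, b, A = 0.2, 30.0, 1.0, 3.0          # requires a >= 5/p = 25
r, worst = 0.0, 0.0                       # r_0 = 0
for t in range(100000):
    eta = b / (t + a)                     # eta_t = b / (t + a)
    worst = max(worst, r - 20 * eta * eta * A / p**2)   # r_t - 20 eta_t^2 A / p^2
    r = (1 - p / 2) * r + (2 / p) * eta * eta * A       # recursion of Lemma 22
print("largest value of r_t - 20 eta_t^2 A / p^2:", worst)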

Lemma 23 (Stich (2018)). Let $\{a_t\}_{t\geq 0}$, $a_t \geq 0$, and $\{e_t\}_{t\geq 0}$, $e_t \geq 0$, be sequences satisfying
$$a_{t+1} \leq (1 - \mu\eta_t)a_t - \eta_t e_t A + \eta_t^2 B + \eta_t^3 C,$$
for stepsizes $\eta_t = \frac{4}{\mu(a+t)}$ and constants $A > 0$, $B, C \geq 0$, $\mu > 0$, $a > 1$. Then
$$\frac{A}{S_T}\sum_{t=0}^{T-1}w_t e_t \leq \frac{\mu a^3}{4 S_T}a_0 + \frac{2T(T+2a)}{\mu S_T}B + \frac{16T}{\mu^2 S_T}C,$$
for $w_t = (a+t)^2$ and $S_T := \sum_{t=0}^{T-1}w_t = \frac{T}{6}(2T^2 + 6aT - 3T + 6a^2 - 6a + 1) \geq \frac{1}{3}T^3$.

Proof of Theorem 19. Substituting the result of Lemma 21 into the bound provided in Lemma 20 (here we use $a \geq \frac{5}{p}$), we get, with $e_t := \mathbb{E}\big[f(\bar x^{(t)}) - f^\star\big]$,
$$\mathbb{E}\big\|\bar x^{(t+1)} - x^\star\big\|^2 \leq \Big(1 - \frac{\eta_t\mu}{2}\Big)\mathbb{E}\big\|\bar x^{(t)} - x^\star\big\|^2 - 2\eta_t(1 - 2L\eta_t)e_t + \frac{\eta_t^2\sigma^2}{n} + (2\eta_t L^2 + L + \mu)40\eta_t^3\frac{1}{p^2}G^2.$$
For $\eta_t \leq \frac{1}{4L}$ (which holds, as $a \geq 16\kappa$) we have $2L\eta_t - 1 \leq -\frac{1}{2}$ and $(2\eta_t L^2 + L + \mu) < (2L + \mu)$, hence
$$\mathbb{E}\big\|\bar x^{(t+1)} - x^\star\big\|^2 \leq \Big(1 - \frac{\eta_t\mu}{2}\Big)\mathbb{E}\big\|\bar x^{(t)} - x^\star\big\|^2 + \frac{\eta_t^2\sigma^2}{n} - \eta_t e_t + (2L+\mu)40\eta_t^3\frac{1}{p^2}G^2.$$
From Lemma 23 we get
$$\frac{1}{S_T}\sum_{t=0}^{T-1}w_t e_t \leq \frac{\mu a^3}{8 S_T}\big\|\bar x^{(0)} - x^\star\big\|^2 + \frac{4T(T+2a)}{\mu S_T}\frac{\sigma^2}{n} + \frac{64T}{\mu^2 S_T}(2L+\mu)40\frac{1}{p^2}G^2,$$
for weights $w_t = (a+t)^2$ and $S_T := \sum_{t=0}^{T-1}w_t = \frac{T}{6}(2T^2 + 6aT - 3T + 6a^2 - 6a + 1) \geq \frac{1}{3}T^3$, where $p$ is the convergence rate of the averaging scheme. The theorem statement then follows from the convexity of $f$.

E Efficient Implementation of the Algorithms

In this section we present memory-efficient implementations of the Choco-Gossip and Choco-SGD algorithms, which require each node i to store only three vectors: x_i, x̂_i and s_i := Σ_{j=1}^n w_ij x̂_j.


Algorithm 5 Memory-efficient Choco-Gossip

input: Initial values x_i^(0) ∈ R^d on each node i ∈ [n], stepsize γ, communication graph G = ([n], E) and mixing matrix W; initialize x̂_i^(0) := 0, s_i^(0) := 0, ∀i
1: for t in 0 . . . T − 1 do in parallel for all workers i ∈ [n]
2:   q_i^(t) := Q(x_i^(t) − x̂_i^(t))
3:   for neighbors j : {i, j} ∈ E (including {i} ∈ E) do
4:     Send q_i^(t) and receive q_j^(t)
5:   end for
6:   x̂_i^(t+1) := x̂_i^(t) + q_i^(t)
7:   s_i^(t+1) := s_i^(t) + Σ_{j=1}^n w_ij q_j^(t)
8:   x_i^(t+1) := x_i^(t) + γ (s_i^(t+1) − x̂_i^(t+1))
9: end for
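To complement the listing, the following Python sketch simulates Algorithm 5 node by node (each simulated node touches only x_i, x̂_i, s_i and the received messages q_j) and checks that it produces the same iterates as the matrix form of Algorithm 1, up to floating-point error. The ring topology, the deterministic top-k compression and the stepsize are our own illustrative choices.

import numpy as np

n, d, k, T, gamma = 6, 8, 3, 50, 0.05
W = 0.5 * np.eye(n)                       # ring: w_ii = 1/2, w_{i,i+-1} = 1/4
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

def top_k(v):
    # deterministic top-k compression of a single vector
    out = np.zeros_like(v)
    idx = np.argsort(-np.abs(v))[:k]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(4)
X0 = rng.normal(size=(d, n))

# matrix form (Algorithm 1)
X, X_hat = X0.copy(), np.zeros((d, n))
for t in range(T):
    Q = np.stack([top_k(X[:, i] - X_hat[:, i]) for i in range(n)], axis=1)
    X_hat = X_hat + Q
    X = X + gamma * X_hat @ (W - np.eye(n))

# per-node form (Algorithm 5): every node only keeps x_i, x_hat_i, s_i
x = [X0[:, i].copy() for i in range(n)]
x_hat = [np.zeros(d) for _ in range(n)]
s = [np.zeros(d) for _ in range(n)]
for t in range(T):
    q = [top_k(x[i] - x_hat[i]) for i in range(n)]              # line 2 (the exchanged messages)
    for i in range(n):
        x_hat[i] = x_hat[i] + q[i]                              # line 6
        s[i] = s[i] + sum(W[i, j] * q[j] for j in (i, (i - 1) % n, (i + 1) % n))  # line 7 (neighbors only)
        x[i] = x[i] + gamma * (s[i] - x_hat[i])                 # line 8

print("max difference between the two forms:", np.max(np.abs(X - np.stack(x, axis=1))))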

Algorithm 6 Memory-efficient Choco-SGD

input: Initial values x_i^(0) ∈ R^d on each node i ∈ [n], consensus stepsize γ, SGD stepsizes {η_t}_{t=0}^{T−1}, communication graph G = ([n], E) and mixing matrix W; initialize x̂_i^(0) := 0, s_i^(0) := 0, ∀i
1: for t in 0 . . . T − 1 do in parallel for all workers i ∈ [n]
2:   Sample ξ_i^(t), compute gradient g_i^(t) := ∇F_i(x_i^(t), ξ_i^(t))
3:   x_i^(t+1/2) := x_i^(t) − η_t g_i^(t)
4:   q_i^(t) := Q(x_i^(t+1/2) − x̂_i^(t))
5:   for neighbors j : {i, j} ∈ E (including {i} ∈ E) do
6:     Send q_i^(t) and receive q_j^(t)
7:   end for
8:   x̂_i^(t+1) := x̂_i^(t) + q_i^(t)
9:   s_i^(t+1) := s_i^(t) + Σ_{j=1}^n w_ij q_j^(t)
10:  x_i^(t+1) := x_i^(t+1/2) + γ (s_i^(t+1) − x̂_i^(t+1))
11: end for
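Analogously, a per-worker simulation of Algorithm 6 can be sketched as follows; the quadratic objectives, the ring topology, the top-k compression and the stepsizes are illustrative assumptions of ours, not the configurations from the experiments (the consensus stepsize is chosen of the order of the provably safe γ⋆ from (20) for this setting).

import numpy as np

n, d, k, T, gamma = 8, 10, 5, 400, 0.01   # gamma ~ gamma* from (20) for this ring and omega = k/d
W = np.eye(n) / 3                         # ring with w_ii = w_{i,i+-1} = 1/3
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

rng = np.random.default_rng(5)
C = rng.normal(size=(n, d))               # f_i(x) = 0.5 * ||x - c_i||^2  =>  x* = mean_i c_i
x_star = C.mean(axis=0)

def f(z):
    return 0.5 * np.mean([np.sum((z - C[i]) ** 2) for i in range(n)])

def top_k(v):
    out = np.zeros_like(v)
    idx = np.argsort(-np.abs(v))[:k]
    out[idx] = v[idx]
    return out

def grad(i, z):
    # stochastic gradient of f_i at z: exact gradient plus Gaussian noise
    return z - C[i] + 0.1 * rng.normal(size=d)

x = [np.zeros(d) for _ in range(n)]       # x_i^(0)
x_hat = [np.zeros(d) for _ in range(n)]   # hat x_i^(0)
s = [np.zeros(d) for _ in range(n)]       # s_i^(0)

for t in range(T):
    eta = 2.0 / (t + 10)                                          # decaying SGD stepsize (illustrative)
    x_half = [x[i] - eta * grad(i, x[i]) for i in range(n)]       # lines 2-3
    q = [top_k(x_half[i] - x_hat[i]) for i in range(n)]           # line 4 (the exchanged messages)
    for i in range(n):
        x_hat[i] = x_hat[i] + q[i]                                # line 8
        s[i] = s[i] + sum(W[i, j] * q[j] for j in (i, (i - 1) % n, (i + 1) % n))  # line 9
        x[i] = x_half[i] + gamma * (s[i] - x_hat[i])              # line 10

print("f(x_bar) - f(x*):", f(np.mean(x, axis=0)) - f(x_star))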

F Parameter Search Details of the SGD Experiments

For each optimization problem, we first tuned γ on a separate average consensus problem with the same configuration (topology, number of nodes, quantization, dimension). The parameters a and b were later tuned separately for each algorithm by running the algorithm for 10 epochs. To find a and b we performed a grid search independently for each algorithm and each quantization function. For the values of a we used a logarithmic grid of powers of 10. We searched for the values of b in the set {1, 0.1d, d, 10d, 100d}.
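For concreteness, the grid described above can be generated and swept as in the following schematic sketch; run_for_10_epochs is a hypothetical placeholder (not part of the released code) standing for training a given algorithm and quantization configuration for 10 epochs and returning its loss, and the exponent range for a is an assumption since the text only specifies powers of 10.

def tune(d, algorithm, quantization, run_for_10_epochs):
    # schematic grid search over the SGD parameters a and b described above
    a_grid = [10.0 ** e for e in range(-15, 2)]              # powers of 10 (assumed range)
    b_grid = [1.0, 0.1 * d, float(d), 10.0 * d, 100.0 * d]   # b in {1, 0.1d, d, 10d, 100d}
    best = None
    for a in a_grid:
        for b in b_grid:
            loss = run_for_10_epochs(algorithm, quantization, a=a, b=b)
            if best is None or loss < best[0]:
                best = (loss, a, b)
    return best   # (best loss, best a, best b)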

G Additional Experiments


                         Epsilon                      RCV1
experiment               a         τ      γ          a         τ       γ
Plain                    0.1       d      -          1         1       -
Choco, (qsgd16)          0.1       d      0.34       1         1       0.078
Choco, (rand1%)          0.1       d      0.01       1         0.1d    0.016
Choco, (top1%)           0.1       d      0.04       1         1       0.04
DCD, (rand1%)            10^-15    d      -          10^-15    d       -
DCD, (qsgd16)            0.01      d      -          10^-15    d       -
ECD, (rand1%)            10^-6     d      -          10^-4     10d     -
ECD, (qsgd16)            10^-6     d      -          10^-15    d       -

Table 5: Values of the initial learning rate and consensus learning rate used in the SGD experiments of Fig. 5 and 6. The parameter γ was found separately by tuning average consensus with the same configuration (topology, number of nodes, quantization, dimension). The parameters a, τ were found by tuning. The ECD and DCD stepsizes are small because these methods diverge for larger choices.

Figure 7: Performance of Algorithm 3 on ring, torus and fully connected topologies for n ∈ {9, 25, 64} nodes. Data is randomly shuffled between workers.


Figure 8: Comparison of Algorithm 3 (plain), ECD-SGD, DCD-SGD and Choco-SGD with (rand1%) sparsification (in addition (top1%) for Choco-SGD), for epsilon (top) and rcv1 (bottom) in terms of iterations (left) and communication cost (right). Data is randomly shuffled between workers.


Figure 9: Comparison of Algorithm 3 (plain), ECD-SGD, DCD-SGD and Choco-SGD with (qsgd16) quantization, for epsilon (top) and rcv1 (bottom) in terms of iterations (left) and communication cost (right). Data is randomly shuffled between workers.
