Byzantine-robust decentralized optimization for Machine Learning

William Cappelletti

Master Thesis Project

under the supervision of Professor M. Jaggi¹ and Professor E. Abbe²

advised by S.P.R. Karimireddy¹ and L. He¹

Autumn 2020

¹ Machine Learning and Optimization lab
² Mathematical Data Analysis lab

Introduction

This essay analyses a Byzantine-resilient variant of Decentralized Stochastic Gradient Descent.

We motivate and define the decentralized Stochastic Gradient Descent algorithm, in which a network of computers aims to jointly minimize a parametrized stochastic function. Then, we introduce Byzantine adversaries, whose goal is to impair the optimization process by sending arbitrary messages to regular workers.

We review many variants of the decentralized SGD algorithm which have been designed to withstand such Byzantine attacks. We present their assumptions and discuss their limitations.

We carry out a general theoretical analysis of the behavior of Robust Decentralized SGD, providing convergence rates for a Byzantine-free setting. We prove sublinear convergence in the number of iterations $T$, and a dependence on the number of good nodes $N$ and on the connectivity of the graph, represented by the spectral gap $\rho$.

We perform a series of experiments on the MNIST handwritten-digit classification task. We test different communication networks and variants of Robust De-SGD against two different kinds of Byzantine attacks.

Finally, we point out the lack of an unambiguous definition of Byzantine-robustness, and the consequent confusion about the meaning of failure in this setting. We propose a convergence-based approach, which discusses the performance by comparing it to the learning rate of local SGD, which is always a solution in the iid setting that we consider.


Contents

1 Decentralized learning
    1.1 The setting
    1.2 Stochastic Gradient Descent
    1.3 Distributed learning
    1.4 Decentralized learning
    1.5 Byzantine adversaries

2 Byzantine-Robust SGD
    2.1 General algorithm
    2.2 Existing algorithms
        2.2.1 Krum
        2.2.2 Bulyan
        2.2.3 ByRDiE
        2.2.4 BRIDGE
        2.2.5 Zeno
        2.2.6 ByGARS
        2.2.7 Mozi
        2.2.8 Total variation
    2.3 Comments on the state of the art

3 Convergence analysis
    3.1 Proof plan
    3.2 Definitions and preliminary lemmas
        3.2.1 Strong convexity and smoothness
        3.2.2 Spectral graph theory and linear algebra
    3.3 Modelling decentralized SGD
        3.3.1 Assumptions on the loss
        3.3.2 Modelling the aggregation function
        3.3.3 Unrolling the recursion
    3.4 Average convergence without Byzantines
        3.4.1 Strongly convex case
        3.4.2 Non-convex case
    3.5 Discussion on Byzantine setting

4 Experimental analysis
    4.1 Fully connected graph
    4.2 Reducing connectivity
    4.3 Byzantine influence
        4.3.1 Coordinate-wise median
        4.3.2 Krum
        4.3.3 BRIDGE

5 Limitations
    5.1 Theoretical assumptions
    5.2 On local convergence

6 Conclusion
    6.1 Final comments
    6.2 Future work

Chapter 1

Decentralized learning

1.1 The setting

We discuss algorithms for the minimization of parametrized stochastic functions.

Consider some random vector $\xi$ drawn from a hidden distribution $\mathcal{D}$ with support $\mathcal{X}$. For instance, we could be sampling $\xi$ from a Gaussian distribution with unknown mean and variance. Or, to get a more interesting problem, which we analyze experimentally in Chapter 4, we could suppose that $\xi$ encodes a pair (input, target), where input is a 28 × 28 grayscale image of a handwritten digit and target is the value of the represented digit, whose distribution among all possible 28 × 28 images we cannot even conceive.

This random vector $\xi$ is fed to a parametrized function $f : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$, which is known as the loss function. We are interested in finding some optimal parameters $x^* \in \mathbb{R}^d$ that minimize the average loss value with respect to $\xi$:

$$F(x) = \mathbb{E}_{\xi \sim \mathcal{D}}\left[f(x, \xi)\right]. \tag{1.1.1}$$

This quantity $F(x)$, the expected risk with respect to $x$, cannot be computed, since the generating distribution $\mathcal{D}$ is usually inaccessible. Thus, we resort to its empirical version. Given an independent sample $\xi_1, \dots, \xi_n$ from $\mathcal{D}$, we can estimate $F(x)$ through

$$F(x) = \frac{1}{n} \sum_{i=1}^{n} f(x, \xi_i). \tag{1.1.2}$$

This random sample is referred to as the dataset and, in particular, when we use it to estimate the optimal parameters, we call it a training dataset.


If we go back to our Gaussian example, we could be interested in finding the mean of the distribution. We know by the Gauss-Markov Theorem that the point that minimizes the mean squared distance to the samples is the best linear unbiased estimator of the Gaussian mean. Thus, our loss function would be

$$f(x, \xi) = \|\xi - x\|^2,$$

where $x$ is the only parameter. The empirical risk for a sample $\xi_1, \dots, \xi_n$ is thus

$$F(x) = \frac{1}{n} \sum_{i=1}^{n} \|\xi_i - x\|^2,$$

which is minimized by

$$x^* = \frac{\sum_{i=1}^{n} \xi_i}{n}.$$

This example is indeed trivial, and there exist many different derivations of this same estimator for the mean of the distribution. Nonetheless, it introduces us to a wide class of problems, for which the minimizer may not have a closed-form solution, or may not even be unique. This is, for instance, what happens with artificial neural networks [4], where the optimization landscape presents many local minima.
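As a quick numerical check of the Gaussian example, the following minimal sketch draws a scalar sample and verifies that moving away from the sample mean can only increase the empirical risk. The mean 3 and standard deviation 2 are arbitrary illustrative values, and the helper names `empirical_risk` and `sample_mean` are ours, not part of any library.

```python
import random


def empirical_risk(x, samples):
    """Mean squared distance between the scalar parameter x and the samples."""
    return sum((xi - x) ** 2 for xi in samples) / len(samples)


def sample_mean(samples):
    """Closed-form minimizer of the empirical risk above."""
    return sum(samples) / len(samples)


rng = random.Random(0)
# Hypothetical data: 1000 draws from a Gaussian with mean 3 and std 2.
samples = [rng.gauss(3.0, 2.0) for _ in range(1000)]
x_star = sample_mean(samples)

# Perturbing the minimizer in either direction increases the risk.
for delta in (-0.5, -0.01, 0.01, 0.5):
    assert empirical_risk(x_star + delta, samples) > empirical_risk(x_star, samples)
```

For this quadratic loss the risk at $x^* + \delta$ exceeds the minimum by exactly $\delta^2$, which is why the check holds for perturbations of any size.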

1.2 Stochastic Gradient Descent

As we pointed out, even if a function can be minimized, it does not necessarily have a closed-form solution. This is the case for many models used in Machine Learning, such as logistic regression and Support Vector Machines [3], which are differentiable, but do not have explicit minimizers. Since we do not have a closed-form expression to compute the desired solution $x^*$, we resort to numerical minimization.

To numerically minimize a function, the general approach is always similar. We start from some initial guess $x^{(0)}$ and we iteratively improve our estimate $x^{(t)}$ by exploring the space with some heuristic. A large family of algorithms uses the gradient at the current estimate to identify the steepest descent direction and takes a step along it. They are known as Gradient Descent algorithms.

Assumption 1. The function $f : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$ is continuously differentiable with respect to the first variable $x$, for all $\xi \in \mathcal{X}$.


Algorithm 1: SGD

Input: $x^{(0)}$ initial guess, $T$ number of iterations, $\{\eta_t\}_{t<T}$ learning rates.

for $t = 0, \dots, T-1$ do
    Sample $\xi_t \sim \mathcal{D}$;
    $g^{(t)} \leftarrow \nabla f(x^{(t)}, \xi_t)$;
    $x^{(t+1)} \leftarrow x^{(t)} - \eta_t g^{(t)}$;
end
return $x^{(T)}$

Figure 1.1: Representation of a step of SGD. From the current estimate $x^{(t)}$, we compute a stochastic gradient $g^{(t)}$ and we take a small negative step in its direction. Different samples $\xi$ give different gradients.

Under Assumption 1, our minimization strategy will be Stochastic Gradient Descent, which is described in Algorithm 1. At every iteration $t$, we sample $\xi_t \sim \mathcal{D}$ and we use it to compute an estimate of the gradient $g^{(t)} := \nabla f(x^{(t)}, \xi_t)$; we then take a step in its negative direction, scaled by a learning rate $\eta_t$. Since $\nabla f$ is a function of the random vector $\xi$, it is itself random, hence the name “stochastic”.

SGD is proven to converge as $O(1/T)$ under certain assumptions [9]. This rate can be improved by increasing the accuracy of the gradient estimate $g$ through the average of more samples, which is known as minibatch SGD. Nonetheless, the method becomes more computationally expensive the more parameters we have and the more samples we use in our batch. In fact, the main cost comes from the gradient estimation.
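To make Algorithm 1 concrete, here is a minimal sketch of (minibatch) SGD on the scalar Gaussian-mean example of Section 1.1. The function name `sgd`, the step-size schedule and the numerical values are illustrative assumptions of ours; plain Python is used to keep the sketch dependency-free.

```python
import random


def sgd(grad, sample, x0, num_iters, lr, batch_size=1):
    """Run (minibatch) SGD on a scalar parameter.

    grad(x, xi): stochastic gradient of the loss at x for the sample xi.
    sample():    draws one sample xi from the data distribution.
    lr(t):       learning rate eta_t used at iteration t.
    """
    x = x0
    for t in range(num_iters):
        batch = [sample() for _ in range(batch_size)]
        # Averaging over the batch reduces the variance of the estimate.
        g = sum(grad(x, xi) for xi in batch) / batch_size
        x = x - lr(t) * g
    return x


rng = random.Random(0)
true_mean = 3.0  # hypothetical value, unknown to the optimizer

x_hat = sgd(
    grad=lambda x, xi: 2.0 * (x - xi),         # gradient of (xi - x)^2
    sample=lambda: rng.gauss(true_mean, 1.0),
    x0=0.0,
    num_iters=2000,
    lr=lambda t: 1.0 / (2.0 * (t + 1)),        # decaying step size
)
```

With the schedule $\eta_t = 1/(2(t+1))$ and this quadratic loss, each step reduces to a running average of the samples seen so far, so `x_hat` coincides with the sample mean of the 2000 draws up to floating-point rounding. Passing `batch_size > 1` gives the minibatch variant.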

Furthermore, this method requires that all data be accessible at the same location, which creates privacy and security issues when dealing with sensitive information.

To address these problems, novel methods have been designed. In this work, we focus on distributed and decentralized algorithms.

1.3 Distributed learning

In the distributed learning setting, we have a set of workers $i \in \mathcal{V}$, with $|\mathcal{V}| = M$, connected to a central parameter server PS. Each worker $i$ can access local samples $\xi_i \sim \mathcal{D}$ and uses them to compute a local objective $f_i(x) := f(x, \xi_i)$ and an estimate of its gradient $\nabla f_i$.


Algorithm 2: Federated SGD

Input: Initial guess $x^{(0)} \in \mathbb{R}^d$, set of workers $\mathcal{V}$, parameter server PS, number of iterations $T$, learning rates $\{\eta_t\}_{t<T}$.

for $t = 0, \dots, T-1$ do
    PS broadcasts $x^{(t)}$ to each $i \in \mathcal{V}$;
    foreach $i \in \mathcal{V}$ do in parallel
        Sample $\xi_i \sim \mathcal{D}_i$;
        $g_i^{(t)} \leftarrow \nabla f_i(x^{(t)})$;
        Send $g_i^{(t)}$ to PS;
    end
    PS gathers $H_t := \{g_i^{(t)}\}_{i \in \mathcal{V}}$;
    $g^{(t)} \leftarrow \mathrm{Avg}\, H_t := \frac{1}{M} \sum_{i=1}^{M} g_i^{(t)}$;
    $x^{(t+1)} \leftarrow x^{(t)} - \eta_t g^{(t)}$;
end
return $x^{(T)}$

The server tries to minimize the global objective $F(x) = \mathbb{E}_{\xi}[f(x, \xi)]$ using a variant of SGD. It starts from an initial guess $x^{(0)}$, which is shared with all client workers. Then, at each iteration $t \ge 0$, the workers compute gradients $g_i^{(t)} = \nabla f_i(x^{(t)})$ and send them to the central server. The latter averages them, performs one step of SGD, and proposes a new $x^{(t+1)}$, which is broadcast to the clients.

There are two main paradigms in this setting, which differ in how data is handled. The first one supposes that the central server has access to the whole dataset and, at each iteration, dispatches samples across workers. In the second paradigm, known as federated, each worker samples $\xi_i^t$ directly from the distribution $\mathcal{D}$, so that the only information that is communicated is the current parameter vector $x^{(t)}$, from the PS to the workers, and the gradients $g_i^{(t)}$, from each worker to the server. Algorithm 2 shows the federated variant of SGD. This approach allows for better data privacy, because each part of the data never leaves the device that holds it.

This distributed variant of SGD, if well implemented, can speed up the learning procedure, as the intensive workload of computing gradients is spread out to the clients and the aggregation of the gradients improves the estimate of the true gradient.
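The exchange between the parameter server and the workers can be simulated sequentially. The sketch below follows the structure of Algorithm 2 for a scalar parameter; the name `federated_sgd`, the eight hypothetical workers and the quadratic loss of Section 1.1 are assumptions made for illustration, and real deployments would run the inner loop in parallel.

```python
import random


def federated_sgd(grad, samplers, x0, num_iters, lr):
    """Sequential simulation of federated SGD for a scalar parameter.

    samplers: one sampling function per worker, standing for its local data.
    """
    x = x0
    for t in range(num_iters):
        # Each worker receives x and returns one local stochastic gradient.
        local_grads = [grad(x, draw()) for draw in samplers]
        # The parameter server averages the gradients and takes one step.
        g = sum(local_grads) / len(local_grads)
        x = x - lr(t) * g
    return x


rng = random.Random(1)
# Hypothetical setup: 8 workers sampling from the same Gaussian (mean 3).
workers = [lambda: rng.gauss(3.0, 1.0) for _ in range(8)]

x_fed = federated_sgd(
    grad=lambda x, xi: 2.0 * (x - xi),
    samplers=workers,
    x0=0.0,
    num_iters=500,
    lr=lambda t: 1.0 / (2.0 * (t + 1)),
)
```

Only the parameter `x` and the gradients cross the boundary between server and workers; the samples stay inside each `draw()` call, which mirrors the privacy argument made for the federated paradigm.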

Nonetheless, this setting requires each worker to be idle while the parameter server averages the gradients, and it is very sensitive to central failures. For this reason, we move our attention to decentralized learning.

Figure 1.2: Representation of the federated learning setting. A central parameter server is connected to each worker, which has access to its local data. The PS broadcasts the current parameters to the workers, which send back the local gradients.

1.4 Decentralized learning

In the decentralized learning paradigm, we have again a set of computers $\mathcal{V}$. They define the nodes of a communication graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where an edge $e_{i,j} = (i, j)$ is in $\mathcal{E}$ if and only if node $i$ can communicate to node $j$. We denote by $\mathcal{N}_i$ the set of $i$’s neighbors, i.e. the nodes which can communicate to $i$. Note that, in principle, the graph does not need to be symmetric, which means that a node $i$ could send messages to $j$, but not the other way around.

Each computer keeps a local parameter vector, or state, $x_i$ and can access local samples $\xi_i \sim \mathcal{D}$. The final objective is now a bit different, since we should enforce that all nodes learn the same optimal parameters $x^*$. In fact, we have to solve

$$\min_{x_i \in \mathbb{R}^d} \sum_{i \in \mathcal{V}} \mathbb{E}_{\xi}\left[f(x_i, \xi)\right] \quad \text{s.t.} \quad x_i = x_j \;\; \forall i, j \in \mathcal{V}. \tag{1.4.1}$$

This differs from (1.1.1) as we would like to have consensus among the nodes on the optimal parameters. To phrase it differently, we want all nodes to agree on the same minimizer $x^*$.

To empirically minimize (1.4.1), we want to generalize the already introduced SGD variants to this setting. The easiest approach is what is known as Gossip SGD, in which each node iteratively averages its state $x_i$ with those of its neighbors and locally performs a step of SGD.


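Algorithm 3: Gossip SGD

Input: $x^{(0)}$ initial guess, max $T$, $\{\eta_t\}_{t<T}$ learning rates, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ comm. graph

1 Init $x_i^{(0)} \leftarrow x^{(0)}$ for all $i \in \mathcal{V}$;
2 for $t = 0, \dots, T-1$ do   // in parallel for each $i \in \mathcal{V}$
3     Collect $X_i^{(t)} := \{x_j^{(t)} : j \in \mathcal{N}_i\}$;
4     $x_i^{(t)} \leftarrow \frac{1}{|\mathcal{N}_i| + 1} \left( x_i^{(t)} + \sum_{j \in \mathcal{N}_i} x_j^{(t)} \right)$;
5     Sample $\xi_i^t \sim \mathcal{D}$;
6     $g_i^{(t)} \leftarrow \nabla f(x_i^{(t)}, \xi_i^t)$;
7     Broadcast $x_i^{(t+1)} \leftarrow x_i^{(t)} - \eta_t g_i^{(t)}$;
8 end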

We present Gossip SGD in detail in Algorithm 3. At each iteration, a worker $i \in \mathcal{V}$ receives the current states of its neighbors $\{x_j^{(t)}\}_{j \in \mathcal{N}_i}$ and averages them with its own $x_i^{(t)}$ to obtain the intermediate state $x_i^{(t)}$. The latter is then used to perform a local SGD update with a sample $\xi_i^t \sim \mathcal{D}$.

As we will prove in Chapter 3, this algorithm grants a significant speed-up in the convergence rate, compared to local SGD. In fact, the rate can go from $O(1/T)$, with $T$ the number of iterations, to as much as $O(1/(TN))$, with a complete graph on $N$ nodes.
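A sequential simulation helps to see the averaging and the local step of Gossip SGD at work. The sketch below uses scalar states; the name `gossip_sgd`, the 5-node ring and the Gaussian-mean loss are illustrative assumptions, and the synchronous loop over nodes stands in for the parallel execution.

```python
import random


def gossip_sgd(grad, sample, neighbors, x0, num_iters, lr):
    """Sequential simulation of Gossip SGD with scalar states.

    neighbors[i] lists the nodes whose states node i receives.
    Returns the list of final local states, one per node.
    """
    n = len(neighbors)
    x = [x0] * n
    for t in range(num_iters):
        new_x = []
        for i in range(n):
            # Average the local state with the received ones.
            received = [x[j] for j in neighbors[i]]
            avg = (x[i] + sum(received)) / (len(received) + 1)
            # Local SGD step taken from the averaged state.
            new_x.append(avg - lr(t) * grad(avg, sample()))
        x = new_x  # synchronous broadcast of the updated states
    return x


rng = random.Random(2)
ring = [[(i - 1) % 5, (i + 1) % 5] for i in range(5)]  # hypothetical ring graph

final_states = gossip_sgd(
    grad=lambda x, xi: 2.0 * (x - xi),
    sample=lambda: rng.gauss(3.0, 1.0),
    neighbors=ring,
    x0=0.0,
    num_iters=500,
    lr=lambda t: 1.0 / (2.0 * (t + 1)),
)
```

In this run every node ends close to the true mean and close to its neighbors, which is the consensus behavior required by (1.4.1).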

Still, this setting opens up new problems. Again, we can wonder about the consequences of the failure of one or more nodes in the network. What would happen if they start sending wrong messages? Even worse, what would be the effect of a malicious entity infiltrating the network?

To answer these questions, we introduce a very flexible framework for adversarial agents, which has long been studied in the decentralized algorithms literature.

1.5 Byzantine adversaries

In this section we introduce an agent which encompasses most of the possible problems in a decentralized learning setting; then we explore what actions it should perform to have an adversarial behavior.

Figure 1.3: Representation of one step $t$ of Decentralized SGD. (1) The focus is on a single node and its neighbors. (2) The worker gathers parameter vectors from neighboring nodes. (3) Then it aggregates them and performs an SGD step with a local sample. (4) Finally, it broadcasts its updated parameters.

Definition 1.5.1. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a communication graph. A node $i$ is a Byzantine agent if

1. It has complete knowledge of the states of all nodes in the network, at each moment;

2. It can send arbitrary messages $x_{i,j}^{(t)}$ to each neighboring node $j$.

We contrast these Byzantine agents with what we call good nodes, which are the regular workers that follow the predefined protocol. This is indeed a very powerful adversary, which is useful for worst-case analysis of general decentralized algorithms [10].

In practical examples, it may be unreasonable to suppose that an adversary has complete knowledge of the system. Nonetheless, this framework encompasses the most interesting attacks. For a start, it can be easily proven that, if no countermeasure is taken by the regular workers, such a Byzantine agent can make Gossip SGD converge to any arbitrary wrong point. Since good nodes update their states using the arithmetic mean of the received parameters, a single Byzantine agent can completely disrupt (or indefinitely slow down) the optimization progress, by continuously sending an erroneous parameter vector with huge magnitude.
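The fragility of the arithmetic mean is easy to reproduce. In the sketch below, with scalar states and hypothetical values, one Byzantine neighbor first sends a huge value, and then uses its knowledge of the other inputs to force the victim's average onto a target of its choice. The helper names `gossip_average` and `byzantine_message` are ours.

```python
def gossip_average(own_state, received):
    """Arithmetic mean used by a good node in Gossip SGD."""
    return (own_state + sum(received)) / (len(received) + 1)


def byzantine_message(target, own_state, honest_received):
    """Message that forces the victim's average to equal `target`.

    It relies on the Byzantine agent knowing every other input of the mean.
    """
    n_inputs = len(honest_received) + 2  # victim + honest neighbors + Byzantine
    return n_inputs * target - own_state - sum(honest_received)


own = 1.00
honest = [1.02, 0.97, 1.01, 0.99]  # states of good neighbors, all close to 1

clean = gossip_average(own, honest)

# The last neighbor turns Byzantine and sends a huge value instead.
attacked = gossip_average(own, honest[:-1] + [1e6])

# With full knowledge, it can instead steer the average to any target.
crafted = byzantine_message(-7.0, own, honest[:-1])
steered = gossip_average(own, honest[:-1] + [crafted])
```

Here `clean` stays near 1, `attacked` is of order $10^5$, and `steered` equals the attacker's target $-7$ up to rounding, even though four of the five inputs are honest.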

Moreover, many attacks have been designed that can do great damage while requiring little information about the problem. For instance, in [1] the authors manage to break most of the well-established algorithms, with the only assumption that Byzantine agents can sample from the distribution as regular workers do. We discuss and experiment with this attack in Chapter 4.

Now that we have a general definition for the adversarial agents, we should clarify what the objective of the “good workers” is, as we need to define a game in order to have adversaries. We have already said that the objective of a regular node is to minimize the global expected risk; thus, by contrast, we could just say that the objective of an adversary is to break the optimization procedure. This definition, although simple, is not very interesting. In fact, a regular node could completely ignore the information it receives from its neighbors and it would learn the objective without any interference, as this strategy amounts to reverting to Algorithm 1, i.e. local SGD.

The problem becomes more interesting if we suppose that the good nodes do not only want to find a minimizer, but want to do it “quickly”. To define this formally, we can start by recalling that regular SGD has a convergence rate of $O(1/T)$, with $T$ the number of iterations. On top of that, if we forget about adversaries for a moment and apply Theorem 3.4.3, the Gossip SGD algorithm can be as fast as $O(1/(TN))$, in the optimal case of a complete graph on $N$ nodes with a strongly convex objective.

Therefore, we can say that the objective of a worker is to learn the objective with a convergence rate of $O(1/(TN'))$, with $N' > 1$ a function of the number of good and adversarial neighbors.

On the other hand, the objective of an adversary would be to slow down the convergence as much as possible, with the complete disruption of convergence viewed as an infinite slowdown. This adversarial objective is quite broad and allows for many different strategies, which can be tailored to the defences implemented by the workers.

Chapter 2

Byzantine-Robust SGD

2.1 General algorithm

In this section we design a decentralized SGD variant to defend against Byzantine attacks. The weakness of Gossip SGD lies in the averaging step at line 4 of Algorithm 3. In fact, we saw that introducing a single wrong vector with large magnitude is enough to disrupt the optimization.

The starting point to get a robust learning procedure is the introduction of an aggregation function, which, as the name says, synthesizes the incoming messages to propose a valid parameter vector. More precisely, this function takes as input the local parameters and those received from neighbors, and outputs a new parameter vector $x$, which should contain the relevant information from the input. Ideally, this output should be a robust estimate of the average of the states of the good workers.

Function Aggr

Input: A set of vectors $\{v_1, \dots, v_M\} \subset \mathbb{R}^d$
Output: A vector $v$, robust estimate of the mean of the input vectors

A direct example of an aggregation function is given by the coordinate-wise median of the input vectors. Assuming that fewer than half of the neighbors are Byzantine, each entry of the median vector lies between the minimum and maximum values proposed by the regular nodes.
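The coordinate-wise median takes only a few lines. In the sketch below, vectors are plain Python lists and the input values are hypothetical: three good vectors and two adversarial ones with huge entries of opposite signs.

```python
import statistics


def coordinate_median(vectors):
    """Coordinate-wise median of a list of equal-length vectors."""
    return [statistics.median(column) for column in zip(*vectors)]


good = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]
byzantine = [[1e6, -1e6], [-1e6, 1e6]]  # 2 of the 5 inputs are adversarial

aggregated = coordinate_median(good + byzantine)

# Each entry stays within the range spanned by the good vectors.
for k, value in enumerate(aggregated):
    good_values = [v[k] for v in good]
    assert min(good_values) <= value <= max(good_values)
```

Since the good vectors are a majority, the adversarial entries can only occupy the extremes of each sorted coordinate, and the middle element is always bracketed by good values.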

Algorithm 4 describes the abstract Robust De-SGD algorithm. In the next section we explore some aggregation functions and some variants of this algorithm which have been proposed for Byzantine-robust optimization.


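Algorithm 4: Robust De-SGD

Input: $x^{(0)}$ initial guess, max $T$, $\{\eta_t\}_{t<T}$ learning rates, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ comm. graph

Init $x_i^{(0)} \leftarrow x^{(0)}$ for all $i \in \mathcal{V}$;
for $t = 0, \dots, T-1$ do   // in parallel for each $i \in \mathcal{V}$
    Collect $X_i^{(t)} := \{x_j^{(t)} : j \in \mathcal{N}_i\}$;
    $x_i^{(t)} \leftarrow \mathrm{Aggr}\left(x_i^{(t)}, X_i^{(t)}\right)$;
    Sample $\xi_i^t \sim \mathcal{D}$;
    $g_i^{(t)} \leftarrow \nabla f(x_i^{(t)}, \xi_i^t)$;
    Broadcast $x_i^{(t+1)} \leftarrow x_i^{(t)} - \eta_t g_i^{(t)}$;
end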
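One synchronous iteration of the abstract algorithm can be sketched with the aggregation rule passed as an argument. Everything below is illustrative: the name `robust_desgd_step`, the way Byzantine nodes are modelled (a dictionary of messages they broadcast to everyone) and the toy values are assumptions of this sketch, not part of the algorithm's definition.

```python
import statistics


def coordinate_median(vectors):
    """Coordinate-wise median, used here as the Aggr function."""
    return [statistics.median(column) for column in zip(*vectors)]


def robust_desgd_step(states, neighbors, aggr, grad, sample, eta, byzantine=None):
    """One synchronous iteration of Robust De-SGD on vector states.

    byzantine: optional dict {node: message}; these nodes broadcast the
    given vector instead of following the protocol.
    """
    byzantine = byzantine or {}
    sent = [byzantine.get(j, states[j]) for j in range(len(states))]
    new_states = []
    for i, x_i in enumerate(states):
        if i in byzantine:
            new_states.append(x_i)  # its true state is irrelevant to good nodes
            continue
        received = [sent[j] for j in neighbors[i]]
        x_agg = aggr([x_i] + received)   # robust aggregation step
        g = grad(x_agg, sample())        # local stochastic gradient
        new_states.append([a - eta * b for a, b in zip(x_agg, g)])
    return new_states


# Toy run: 3 good nodes and 1 Byzantine node on a complete graph.
complete = [[j for j in range(4) if j != i] for i in range(4)]
states = [[0.0], [1.0], [2.0], [0.0]]
zero_grad = lambda x, xi: [0.0] * len(x)  # isolates the aggregation effect
mean_aggr = lambda vs: [sum(c) / len(c) for c in zip(*vs)]

robust = robust_desgd_step(states, complete, coordinate_median, zero_grad,
                           lambda: None, 0.1, byzantine={3: [1e9]})
naive = robust_desgd_step(states, complete, mean_aggr, zero_grad,
                          lambda: None, 0.1, byzantine={3: [1e9]})
```

With the median, the good nodes land inside the range of good states, whereas the plain mean (Gossip SGD) is dragged to the order of $10^8$ by the single Byzantine message.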

2.2 Existing algorithms

In this section we review some existing algorithms. The focus is on their assumptions, both algorithmic and theoretical, and on the motivating problems that different authors consider.

We present algorithms explicitly designed to deal with Byzantine agents in the decentralized setting, which in most cases follow the abstract Algorithm 4, along with algorithms designed for the parameter server setting.

Distributed algorithms are still of interest, as they can be adapted to be decentralized. For such a conversion to the decentralized setting, we instruct each node to behave like a parameter server whose client workers are its neighbors. Furthermore, we ask it to treat the local information as coming from an additional client. Finally, when not explicitly required, instead of communicating gradients as in distributed SGD, each node broadcasts its parameters, in compliance with Algorithm 4.

2.2.1 Krum

Krum is the first aggregation rule proposed to specifically address Byzantine agents in the distributed SGD context. It was presented in [2] for the federated learning setting, where gradients are exchanged. Krum aims to identify a “central” direction to perform the parameter update. Assuming a known upper bound $b$ on the number of Byzantine nodes (among the neighbors of each node), the idea is to choose the vector which is closest to its $M - b - 2$ nearest neighboring vectors in Euclidean distance. In other words, it tries to identify a point which would be central even after removing some of the workers.

Function Krum

Input: $b$ upper bound on # Byzantine, set $\{x_1, \dots, x_M\} \subset \mathbb{R}^d$

foreach $i \in [M]$ do   // compute score $s_i$
    Identify the $M - b - 2$ closest vectors to $x_i$ in $\|\cdot\|_2$ and denote them by $j \approx i$;
    $s_i \leftarrow \sum_{j \approx i} \|x_i - x_j\|^2$;
end
return $x_{i^*}$, with $i^* = \arg\min_{i \in [M]} \{s_i\}$

The authors proved almost sure asymptotic convergence in the parameter server setting under very strong assumptions on the gradient noise.

In the same paper, Blanchard, Guerraoui, Stainer, et al. proposed a $k$-multi-Krum variant, in which they choose the $k$ best vectors according to the Krum score $s_i = \sum_{j \approx i} \|x_i - x_j\|^2$ and average them. This addresses the limitation that single Krum discards most of the information at each step, as the aggregated value is just one of the input vectors.

It is straightforward to apply this aggregation function in the decentralized setting, by using parameter vectors instead of gradients. Nonetheless, we encounter new issues, along with the existing ones. In particular, each node needs at least $2b$ neighbors, which is much more limiting than in the federated setting. Furthermore, there is no use of local information and, in fact, the local parameter vector could be discarded by Krum.

In Chapter 4, we experiment with this aggregation rule, to investigatesuch issues.
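The Krum rule and its multi-Krum variant described in this section can be sketched as follows. Vectors are plain Python lists, the score uses squared Euclidean distances, and the input values are hypothetical; the helper names are ours.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - c) ** 2 for a, c in zip(u, v))


def krum_scores(vectors, b):
    """Score of each vector: sum of squared distances to its
    M - b - 2 closest other vectors."""
    m = len(vectors)
    k = m - b - 2
    if k < 1:
        raise ValueError("Krum needs M - b - 2 >= 1")
    scores = []
    for i, x_i in enumerate(vectors):
        dists = sorted(sq_dist(x_i, x_j)
                       for j, x_j in enumerate(vectors) if j != i)
        scores.append(sum(dists[:k]))
    return scores


def krum(vectors, b):
    """Return the input vector with the lowest Krum score."""
    scores = krum_scores(vectors, b)
    best = min(range(len(vectors)), key=scores.__getitem__)
    return vectors[best]


def multi_krum(vectors, b, k):
    """Average the k input vectors with the lowest Krum scores."""
    scores = krum_scores(vectors, b)
    chosen = sorted(range(len(vectors)), key=scores.__getitem__)[:k]
    dim = len(vectors[0])
    return [sum(vectors[i][c] for i in chosen) / k for c in range(dim)]


# Four good vectors and one outlier, with at most b = 1 Byzantine input.
inputs = [[0.0], [0.1], [0.2], [0.15], [100.0]]
selected = krum(inputs, b=1)
averaged = multi_krum(inputs, b=1, k=3)
```

In this example Krum returns the most central good vector, and multi-Krum averages the three best-scored ones, so the outlier never enters the result. Note that the output of `krum` is always one of the inputs, which is the property Bulyan relies on.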

2.2.2 Bulyan

In [12], Mhamdi, Guerraoui, and Rouault introduce a meta-aggregation method called Bulyan, which builds upon another aggregation function.

If an Aggr function outputs one of the input vectors, it can be used to iteratively build a selection set. To do so, we keep track of the selected vectors and apply Aggr to the remaining ones. A valid aggregator choice satisfying this property is Krum.

Bulyan builds the selection set to discard $2b$ vectors, where $b$ is an upper bound on the number of Byzantine agents, which they suppose to know. Then, it outputs the TrimmedMean of the selection set. The trimmed mean is a coordinate-wise function which discards the lowest and highest $b$ values and then averages the remaining ones, assuming that extreme values are probably Byzantine.

Function TrimmedMean

Input: Set $\{x_1, \dots, x_M\} \subset \mathbb{R}^d$, $b$ upper bound on # Byzantine

Init an empty $x$;
foreach $k \in [d]$ do
    Sort $\{[x_1]_k, \dots, [x_M]_k\}$ as $\{x_{(1)}, \dots, x_{(M)}\}$;
    // Average, discarding the lowest and highest $b$:
    $[x]_k \leftarrow \frac{1}{M - 2b} \sum_{l=b+1}^{M-b} x_{(l)}$;
end
return $x$

Function Bulyan

Input: Set $X = \{x_1, \dots, x_M\} \subset \mathbb{R}^d$, $b$ upper bound on # Byzantine

Select $\leftarrow \emptyset$;
while $|\text{Select}| < M - 2b$ do
    $x_s \leftarrow \mathrm{Aggr}(X \setminus \text{Select})$;
    Select $\leftarrow$ Select $\cup \{x_s\}$;
end
return $x \leftarrow \mathrm{TrimmedMean}(\text{Select}, b)$

In the original paper they introduce this method to address an intrinsic weakness of aggregation functions based on $\ell_p$ norms, which is known as the curse of dimensionality. They argue that, for a Byzantine agent to bypass the aggregator and disrupt training, it is sufficient to introduce a ‘big’ error in only one coordinate. At the expense of requiring more than $4b + 3$ workers around the parameter server, they say that Bulyan significantly decreases the leeway of Byzantine agents, i.e. the magnitude of the error that can be introduced.
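The two functions of this section translate almost line by line into code. The sketch below follows the pseudocode of TrimmedMean and Bulyan as stated here, with Krum as the inner aggregator. One detail is an assumption of ours: the Krum neighborhood size is clamped to at least 1, so that the rule can still be applied to the small sets left at the end of the selection loop. The input values are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - c) ** 2 for a, c in zip(u, v))


def krum(vectors, b):
    """Krum selection; neighborhood size clamped to >= 1 (sketch assumption)."""
    m = len(vectors)
    k = max(1, m - b - 2)

    def score(i):
        dists = sorted(sq_dist(vectors[i], vectors[j])
                       for j in range(m) if j != i)
        return sum(dists[:k])

    return vectors[min(range(m), key=score)]


def trimmed_mean(vectors, b):
    """Coordinate-wise mean after dropping the b lowest and b highest values."""
    m = len(vectors)
    if m - 2 * b < 1:
        raise ValueError("need more than 2b vectors")
    out = []
    for column in zip(*vectors):
        kept = sorted(column)[b:m - b]
        out.append(sum(kept) / len(kept))
    return out


def bulyan(vectors, b, aggr):
    """Select M - 2b vectors with `aggr`, then return their trimmed mean."""
    remaining = list(vectors)
    selected = []
    while len(selected) < len(vectors) - 2 * b:
        chosen = aggr(remaining, b)  # must return one of its inputs
        selected.append(chosen)
        remaining.remove(chosen)
    return trimmed_mean(selected, b)


# Six good vectors and one outlier, with at most b = 1 Byzantine input.
inputs = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [100.0]]
output = bulyan(inputs, b=1, aggr=krum)
```

In this example the outlier is never selected, and even if it were, the final trimming would remove it; the output therefore lies within the range of the good values.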


2.2.3 ByRDiE

Yang and Bajwa propose an approach natively designed for decentralized optimization, which they call ByRDiE [20]. They test it experimentally and derive a proof of convergence in the infinite time limit.

They make the following assumptions:

1. Loss function: smooth, bounded over training samples, and with Lipschitz continuous derivative.

2. “All reduced graphs $G_r$ generated from $G$ contain a source component of cardinality at least $(b + 1)$.” I.e. there is a collection of at least $(b + 1)$ nodes such that each node in it has a directed path to every other node in the reduced graph $G_r$.

This algorithm is based on nested loops. The outer loop can be considered as the usual state update, since at each iteration every node updates its parameter vector. Then, iteratively for each coordinate of the parameter vector $x$, there is an inner loop that performs a Byzantine-robust line search based on SGD. The algorithm computes the gradient of the loss function with respect to the full vector $x$ from the previous iterate. Then, there is the coordinate update, which first aggregates all nodes using TrimmedMean and then uses the negative $k$-th component of the gradient for a one-dimensional SGD step.

The line search is the actual decentralized part of the algorithm, as in each of those inner iterations the nodes broadcast and receive the current states. The authors use the TrimmedMean to aggregate each coordinate.

They prove theorems about both statistical and algorithmic convergence. They prove consensus when the number of steps in the outer loop and/or in the line search goes to infinity. Statistical convergence is proven with high probability as the number of samples and the line-search iterations both go to $\infty$; first coordinate-wise, then iterate-wise.

Notably, they prove that the learning rate is $O(1/\sqrt{nN})$ in the optimal distributed learning setting without adversaries, with $n$ training samples for each of the $N$ workers. Their algorithm converges somewhere between such $O(1/\sqrt{Nn})$ and the learning rate of local SGD, i.e. $O(1/\sqrt{n})$.

Nonetheless, we should remark that these results are stated for an infinitely long line search, which grants consensus at the end of each inner loop. For the single-step line search, they only state asymptotic convergence but do not give rates.


2.2.4 BRIDGE

The authors of ByRDiE designed in [19] a much simpler algorithm, which relies on a very intuitive idea: discard extreme values from neighboring nodes and average the rest, always including the local estimate. They present this algorithm under the name of BRIDGE.

Their algorithm fits our definition of Robust De-SGD: broadcast and receive states, aggregate, and perform one step of SGD. In the original paper the last two steps are entangled, so that the gradient is computed before aggregation and is subtracted after.

The aggregation method is a variation of the coordinate-wise TrimmedMean; in fact, the local parameters are always included in the trimmed set. They assume a known upper bound $b$ on the number of Byzantine agents and they compute the aggregated point, coordinate-wise, by averaging over $\mathcal{N}_i^b(t) \cup \{x^{(t)}\}$, the trimmed subset of neighbors plus the local state:

$$x_i^{(t)} \leftarrow \frac{x_i^{(t)} + \sum_{j \in \mathcal{N}_i^b(t)} x_j^{(t)}}{|\mathcal{N}_i| - 2b + 1},$$

where $\mathcal{N}_i$ is the set of $i$’s neighbors.

The authors prove convergence and consensus under the following assumptions:

1. The risk function $f$ is bounded almost surely, $\mu$-strongly convex and its gradient $\nabla f$ is $L$-Lipschitz.

2. All reduced graphs contain a source component of size at least $b + 1$.

The result they prove is stated as follows.

Theorem 2.2.1. Under the aforementioned assumptions, BRIDGE achieves consensus on all nonfaulty nodes as the number of iterations $T \to \infty$. Further, as $N \to \infty$, the output of BRIDGE converges sublinearly in $t$ to the minimum of the global statistical risk at each nonfaulty node with probability going to 1.

They prove consensus by writing the coordinate-wise update in matrix form involving only nonfaulty nodes, through upper/lower bounds. They obtain a consensus convergence rate of $O(\sqrt{d}\,\eta)$, where $\eta$ is the SGD learning rate and $d$ the size of the parameter vector.

They test the algorithm numerically on MNIST, first with a linear classifier trained with hinge loss (which satisfies their assumptions) and then with a convolutional net. They report positive results.

In Chapter 4, we test this aggregation function against two significantly different attacks, and we compare it to other methods.
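The trimming rule of BRIDGE described in this section can be sketched in a few lines. The function name `bridge_aggregate` and the toy values are ours; for each coordinate, the $b$ lowest and $b$ highest neighbor values are dropped, and the rest is averaged together with the local value.

```python
def bridge_aggregate(own, received, b):
    """Coordinate-wise trimmed mean over the neighbors, plus the local state.

    own:      local parameter vector of the node.
    received: parameter vectors received from its neighbors.
    b:        upper bound on the number of Byzantine neighbors.
    """
    n = len(received)
    if n - 2 * b < 0:
        raise ValueError("need at least 2b neighbors")
    out = []
    for k, own_k in enumerate(own):
        values = sorted(v[k] for v in received)
        kept = values[b:n - b]  # drop the b lowest and the b highest values
        out.append((own_k + sum(kept)) / (n - 2 * b + 1))
    return out


# Five neighbors, two of which send extreme values; b = 1.
local_state = [1.0]
neighbor_states = [[0.0], [2.0], [1.5], [-50.0], [50.0]]
aggregated = bridge_aggregate(local_state, neighbor_states, b=1)
```

Unlike Krum, the local state can never be discarded here, since it is added after the trimming.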

2.2.5 Zeno

In [17, 18], the authors propose an algorithm which locally validates the information received from each worker. They design it for the parameter server setting, where workers share gradient estimates. The algorithm, called Zeno, relies on a Stochastic Descent Score computed by the central server on a validation set. Workers with a higher score are considered more trustworthy, and the $b$ gradients with the lowest scores are discarded as suspected Byzantine. The score of a received gradient $g_i$ is defined as:

$$\mathrm{Score}_{\eta,\lambda}(g_i, x) = f_S(x) - f_S(x - \eta g_i) - \lambda \|g_i\|^2, \tag{2.2.1}$$

where $f_S(x) = \frac{1}{|S|} \sum_{\xi \in S} f(x; \xi)$ is the loss on a validation set $S$, $\eta$ is the learning rate and $\lambda > 0$ is a constant weight.

The authors also propose a variant of Zeno, named Zeno++, which is asynchronous and approximates (2.2.1) only requiring $\nabla f_S(x)$:

$$\mathrm{Score}_{\eta,\lambda}(g_i, x) \approx \eta \langle \nabla f_S(x), g_i \rangle - \lambda \|g_i\|^2. \tag{2.2.2}$$

In the decentralized setting, each agent can be seen as an autonomous parameter server and $S$ is already available as the training batch for a given iteration. In fact, Robust De-SGD already computes $\nabla f_S(x)$ at each step.

The conceptual difference in the decentralized setting would be that $g$ is not computed at the same point. Nonetheless, the inventors of Zeno(++) did not provide any theoretical analysis. Thus, we could conjecture that the model parameters at each non-faulty node stay close during training. By requiring Lipschitz gradients, the distance between parameters would also bound the distance between gradients.
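Both scores are cheap to compute once a validation loss is available. The sketch below implements (2.2.1), the first-order approximation (2.2.2), and the discard-then-average rule; the quadratic validation loss and all numerical values are hypothetical, and the function names are ours.

```python
def zeno_score(loss, x, g, eta, lam):
    """Stochastic Descent Score (2.2.1): loss decrease minus a norm penalty."""
    moved = [a - eta * c for a, c in zip(x, g)]
    return loss(x) - loss(moved) - lam * sum(c * c for c in g)


def zeno_pp_score(val_grad, g, eta, lam):
    """First-order approximation (2.2.2), using only the validation gradient."""
    inner = sum(a * c for a, c in zip(val_grad, g))
    return eta * inner - lam * sum(c * c for c in g)


def zeno_aggregate(loss, x, grads, eta, lam, b):
    """Average the gradients left after discarding the b lowest scores."""
    ranked = sorted(grads, key=lambda g: zeno_score(loss, x, g, eta, lam),
                    reverse=True)
    kept = ranked[:len(grads) - b]
    return [sum(column) / len(kept) for column in zip(*kept)]


# Hypothetical validation loss f_S(x) = ||x||^2, whose gradient at x is 2x.
val_loss = lambda x: sum(v * v for v in x)
x = [1.0]
grads = [[2.0], [1.8], [-20.0]]  # the last one points the wrong way

scores = [zeno_score(val_loss, x, g, eta=0.1, lam=0.01) for g in grads]
aggregated = zeno_aggregate(val_loss, x, grads, eta=0.1, lam=0.01, b=1)
```

The third gradient increases the validation loss and has a large norm, so its score is strongly negative and it is the one discarded.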

2.2.6 ByGARS

ByGARS [14] is another algorithm for federated learning based on scores.Regatti and Gupta design a reputation score system, which keeps track, overtime, of the behavior of the workers.

As in Zeno, they use a validation set S in the central server to computean estimate of the loss fS(x(t)) on (current) parameters x(t). The recievedgradients are scored against that of the validation loss through inner prod-uct, to solve the following optimization problem on reputation scores q(t+1):

q(t+1)∗ = arg min

q∈RmfS

(x(t) − ηtG(t)q

). (2.2.3)


G(t) is the set of gradients recieved by the server and ηt is the learning rateparameter.

They propose two different meta-updates, which give rise to the two algorithms.

(ByGARS):

$$x^{(t+1)} \leftarrow x^{(t)} - \eta_t G^{(t)} q^{(t)} \quad \text{(pseudo update)}$$

$$q^{(t+1)} \leftarrow q^{(t)} + \alpha_t \eta_t G^{(t)\top} \nabla f_S(x^{(t+1)}) \quad \text{(meta update)}$$

$$x^{(t+1)} \leftarrow x^{(t)} - \eta_t G^{(t)} q^{(t+1)} \quad \text{(actual update)}$$

(ByGARS++):

$$q^{(t+1)} \leftarrow (1 - \alpha_t) q^{(t)} + \alpha_t G^{(t)\top} \nabla f_S(x^{(t)})$$

In both algorithms $\alpha_t$ is a step size parameter. Note that in ByGARS the pseudo and meta updates can be repeated several times to improve the estimate of $q^{(t+1)}$, at the expense of more computation.

They prove convergence and consensus, in the infinite-time limit, under the following assumptions:

1. The population expected loss $F$ is $\mu$-strongly convex, and its gradient $\nabla F$ is locally Lipschitz and bounded.

2. Byzantine adversaries corrupt the gradients using multiplicative noise. That is, each Byzantine node computes its gradient $g_i^{(t)}$ as a regular node would, but sends out $\tilde{g}_i^{(t)} := \kappa_i^t g_i^{(t)}$, where $\kappa_i^t$ is i.i.d. multiplicative noise with mean $\kappa_i$ and finite second moment.
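As an illustration of the ByGARS++ reputation update under the multiplicative-noise assumption, consider the following sketch. It is not the implementation of [14]: the quadratic loss, the constant multipliers `kappa`, and the choice of applying the freshly updated scores in the parameter step are assumptions made for this example.

```python
import numpy as np

def bygars_pp_step(x, q, G, grad_val, lr, alpha):
    """One ByGARS++-style step. G has shape (d, m): one received gradient per column."""
    q_new = (1.0 - alpha) * q + alpha * (G.T @ grad_val(x))  # reputation update
    x_new = x - lr * (G @ q_new)                             # assumed: use updated scores
    return x_new, q_new

grad_val = lambda v: v                 # validation gradient of f_S(x) = 0.5 * ||x||^2
kappa = np.array([1.0, 1.0, -2.0])     # third worker flips and scales its gradient
x0 = np.array([1.0, -2.0])
x, q = x0.copy(), np.zeros(3)
for _ in range(10):
    G = np.outer(grad_val(x), kappa)   # column i is kappa_i times the true gradient
    x, q = bygars_pp_step(x, q, G, grad_val, lr=0.1, alpha=0.5)
```

Because the inner product of a sign-flipped gradient with the validation gradient is negative, the Byzantine worker ends up with a negative reputation, so its contribution is flipped back into a descent direction.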

2.2.7 Mozi

The authors of [5] propose a natively decentralized algorithm, MOZI, that leverages both distance-based and performance-based defences.

They make the following assumptions:

1. The induced subgraph of benign nodes is fully connected. Between each pair of nodes there is a path of length at most $\tau$.

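Algorithm 5 details MOZI, whose iterations consist of the following steps:

I. Broadcast and receive neighbors' estimates.

II. Compute the local stochastic gradient $\nabla f(x_i^{(t)}, \xi_i^t)$.

III. Aggregate neighbors' estimates into $\mathrm{Aggr}\big(\{x_j^{(t)}\}_{j \in \mathcal{N}_i}\big)$ as follows:

  i. Select a pool of possible benign candidates by picking the closest estimates in Euclidean distance. (Stage 1)

  ii. Evaluate the loss of each candidate in the pool (only a few remain after the previous pooling) and select for aggregation (mean) only those with a lower loss than the current estimate. (Stage 2)

IV. Update the local estimate:

$$x_i^{(t+1)} \leftarrow \alpha x_i^{(t)} + (1 - \alpha)\,\mathrm{Aggr}\big(\{x_j^{(t)}\}_{j \in \mathcal{N}_i}\big) - \eta \nabla f(x_i^{(t)}, \xi_i^t), \tag{2.2.4}$$

with $\alpha$ a hyperparameter and $\eta$ the learning rate.

Algorithm 5: MOZI

Input: $x^{(0)}$, max $T$, $\{\eta_t\}_{t<T}$, number of Byzantine neighbors $b_i$, tolerance $\varepsilon$

for $t \in [T-1]$ do (in parallel for each $i \in \mathcal{V}$)
    Collect $x_j^{(t)}$ for $j \in \mathcal{N}_i$
    $l_i^t \leftarrow f(x_i^{(t)}, \xi_i^t)$
    $g_i^{(t)} \leftarrow \nabla f(x_i^{(t)}, \xi_i^t)$
    for $j \in \mathcal{N}_i$ do
        $d_{i,j} \leftarrow \|x_i^{(t)} - x_j^{(t)}\|$
    end
    $\mathrm{Close} \leftarrow \arg\min_{\mathcal{N}^* \subseteq \mathcal{N}_i,\ |\mathcal{N}^*| = M - b_i} \sum_{j \in \mathcal{N}^*} d_{i,j}$
    $\mathrm{Sel} \leftarrow \emptyset$
    for $j \in \mathrm{Close}$ do
        $l_j^t \leftarrow f(x_j^{(t)}, \xi_i^t)$
        if $l_i^t - l_j^t \ge \varepsilon$ then
            $\mathrm{Sel} \leftarrow \mathrm{Sel} \cup \{j\}$
        end
    end
    if $\mathrm{Sel}$ is $\emptyset$ then
        $\mathrm{Sel} \leftarrow \{\arg\min_{j \in \mathrm{Close}} l_j^t\}$
    end
    $\bar{x}_i^{(t)} \leftarrow \frac{1}{|\mathrm{Sel}|} \sum_{j \in \mathrm{Sel}} x_j^{(t)}$
    Broadcast $x_i^{(t+1)} \leftarrow \alpha x_i^{(t)} + (1 - \alpha)\bar{x}_i^{(t)} - \eta_t g_i^{(t)}$
end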
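The two-stage aggregation and the update (2.2.4) can be sketched for a single node as follows. This is an illustrative reading of the steps above, not the code of [5]; the deterministic quadratic loss and the parameter name `n_keep` (the size of the Stage 1 pool) are assumptions.

```python
import numpy as np

def mozi_aggregate(x_i, neighbor_xs, loss, n_keep, eps):
    """Stage 1: distance-based pool. Stage 2: performance-based selection."""
    dists = [np.linalg.norm(x_i - x_j) for x_j in neighbor_xs]
    close = [int(j) for j in np.argsort(dists)[:n_keep]]   # closest estimates
    l_i = loss(x_i)
    losses = {j: loss(neighbor_xs[j]) for j in close}
    selected = [j for j in close if l_i - losses[j] >= eps]
    if not selected:                                       # fall back to the best candidate
        selected = [min(close, key=lambda j: losses[j])]
    return np.mean([neighbor_xs[j] for j in selected], axis=0)

def mozi_update(x_i, neighbor_xs, loss, grad, n_keep, eps, alpha, lr):
    """Local update (2.2.4): mix own estimate, aggregate, and gradient step."""
    agg = mozi_aggregate(x_i, neighbor_xs, loss, n_keep, eps)
    return alpha * x_i + (1.0 - alpha) * agg - lr * grad(x_i)

loss = lambda v: 0.5 * np.dot(v, v)
grad = lambda v: v
x_i = np.array([1.0, 1.0])
neighbors = [np.array([0.9, 0.9]), np.array([0.8, 1.0]), np.array([10.0, 10.0])]
agg = mozi_aggregate(x_i, neighbors, loss, n_keep=2, eps=0.0)
x_next = mozi_update(x_i, neighbors, loss, grad, n_keep=2, eps=0.0, alpha=0.5, lr=0.1)
```

Here the far-away third estimate is removed in Stage 1, and both remaining candidates pass Stage 2 because their loss is lower than the local one.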

Note that in the first aggregation stage they assume that the ratio $\rho_i$ of benign nodes among the neighbors of $i$ is known, for each node $i$.

The authors show a computational complexity of "$O(|\mathcal{N}_i| d)$ for each node at each iteration", with $d$ the size of the parameter vector $x$. They compare it to the complexity of median-based methods and BRIDGE, which is also $O(|\mathcal{N}_i| d)$, and to that of Krum and Bulyan, $O(|\mathcal{N}_i|^2 d)$.

They add the following assumptions for convergence analysis:

2. All worker nodes are initialized with the same x(0).

3. (Monotonic Performance) The average (in expectation) performance of benign estimates is higher than that of the Byzantine estimates on the predefined training distribution $\mathcal{D}$ at each iteration.

4. (Bounded Variance) The difference between all estimates generated by benign nodes converges to zero.

5. The loss $f$ is convex and Lipschitz continuous.

Note that assumption 4 is very strong and is generally part of the convergence proof itself. In fact, it is one of the most crucial points and, as we will see in Chapter 3, it requires careful derivation.

They prove uniform convergence in the reduced graph and convergence in the Byzantine setting. They also prove that the losses of benign nodes are uniformly bounded.

They show experimentally that MOZI performs better than other methods (BRIDGE, Krum and others) on MNIST and CIFAR-10.

2.2.8 Total variation

In [13], Peng and Ling approach the decentralized optimization problem by adding a penalization term, the Total Variation norm, to the loss, in order to force the local models of regular agents to be close.

They do not require the data to be i.i.d., but they suppose that:

1. The network consisting of all regular agents $i \in \mathcal{R}$, denoted $(\mathcal{R}, \mathcal{E}_{\mathcal{R}})$, is bidirectionally connected.

They define the decentralized optimization problem as

$$x^* = \arg\min_{x \in \mathbb{R}^p} \sum_{i \in \mathcal{R}} \Big( \mathbb{E}[f(x, \xi_i)] + \Lambda(x) \Big), \tag{2.2.5}$$

where $f(x, \xi_i)$ is the smooth cost function depending on the random variable $\xi_i \sim \mathcal{D}_i$ and $\Lambda(x)$ is a smooth regularization term.

Then, they propose to solve a Total-Variation-norm-penalized approximation of (2.2.5). This is defined as

$$x^* = \arg\min_{x := [x_i]} \sum_{i \in \mathcal{R}} \Big( \mathbb{E}[f(x_i, \xi_i)] + \frac{\lambda}{2} \sum_{j \in \mathcal{R}_i} \|x_i - x_j\|_1 + \Lambda(x_i) \Big), \tag{2.2.6}$$

where $\lambda$ is a penalty parameter. With (2.2.6), for every pair of regular neighbors $(i, j)$, $x_i$ and $x_j$ are forced to be close by the TV norm penalty $\sum_{i \in \mathcal{R}} \sum_{j \in \mathcal{R}_i} \|x_i - x_j\|_1$. The new problem is solved by the stochastic subgradient method. Since the subgradient of the penalty is a sign function, the influence of each neighbor's message, and in particular of the faulty ones, is restricted to the range $[-1, 1]$ in every coordinate.


Interestingly, the algorithm designed by Peng and Ling can deal with time-varying networks. In this case, the assumption on the graph connectivity becomes

1. The average network consisting of all regular agents $i \in \mathcal{R}$, denoted as $(\mathcal{R}, \bar{\mathcal{E}}_{\mathcal{R}})$, is bidirectionally connected.

The average network $(\mathcal{R}, \bar{\mathcal{E}}_{\mathcal{R}})$ consists of the edges whose frequency of appearance, as time goes to infinity, is nonzero.

In this case, they give a new formulation of (2.2.6):

$$x^* = \arg\min_{x := [x_i]} \sum_{i \in \mathcal{R}} \Big( \mathbb{E}[f(x_i, \xi_i)] + \frac{\lambda}{2} \mathbb{E}_{\zeta_i}\Big[ \sum_{j \in \mathcal{R}_i(\zeta_i)} \|x_i - x_j\|_1 \Big] + \Lambda(x_i) \Big), \tag{2.2.7}$$

where $\mathcal{R}_i(\zeta_i)$ is a realization of a random graph whose edges appear with probability given by the frequency of the same edge in the time-varying graph.

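This is again solved by the stochastic subgradient method. At time $t$, given $\mathcal{R}_i^t$ the regular neighbors of $i$ and $\mathcal{B}_i^t$ the Byzantine ones, which send arbitrary messages $z_j^{(t)}$, the model of node $i$ is updated as

$$x_i^{(t+1)} = x_i^{(t)} - \eta_t \Big( \nabla f(x_i^{(t)}, \xi_i^t) + \lambda \sum_{j \in \mathcal{R}_i^t} \mathrm{sign}\big(x_i^{(t)} - x_j^{(t)}\big) + \lambda \sum_{j \in \mathcal{B}_i^t} \mathrm{sign}\big(x_i^{(t)} - z_j^{(t)}\big) + \nabla \Lambda(x_i^{(t)}) \Big). \tag{2.2.8}$$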

The optimization algorithm is independent of the formulation of the problem, and can be summarized as follows:

I. Broadcast the current model $x_i^{(t)}$ to all neighbors;

II. Receive $x_j^{(t)}$ from regular neighbors and $z_j^{(t)}$ from Byzantine ones;

III. Update the local model $x_i^{(t+1)}$ according to (2.2.8).
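A minimal sketch of one local step of (2.2.8) shows why arbitrary messages have bounded influence. The function name, the zero gradient and regularizer, and the toy messages are assumptions made for the illustration.

```python
import numpy as np

def tv_update(x_i, regular_msgs, byzantine_msgs, grad, reg_grad, lr, lam):
    """One step of (2.2.8): subgradient step with a sign-based pull toward each message."""
    pull = np.zeros_like(x_i)
    for z in list(regular_msgs) + list(byzantine_msgs):
        pull += np.sign(x_i - z)   # each message contributes at most 1 per coordinate
    return x_i - lr * (grad(x_i) + lam * pull + reg_grad(x_i))

zero = lambda v: np.zeros_like(v)  # no loss gradient, no regularizer in this toy case
x_i = np.array([0.0, 0.0])
regular = [np.array([1.0, 1.0])]
byzantine = [np.array([1e6, -1e6])]  # an arbitrarily large malicious message
x_next = tv_update(x_i, regular, byzantine, zero, zero, lr=0.1, lam=1.0)
```

Whatever the magnitude of the Byzantine message, it moves each coordinate by at most `lr * lam`, the same amount as a regular neighbor.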

The authors carried out a theoretical performance analysis, by adding the following assumptions:

2. Strong convexity, Lipschitz continuous gradients, bounded variance of the gradient norm.

3. The penalty parameter $\lambda$ is "sufficiently large".

They prove sublinear convergence to a neighborhood of the optimal solution $x^*$.

Looking closely at their results, we see that the upper bound on the error $\mathbb{E}\|x^{(t+1)} - x^*\|^2$ always depends on the term

$$\Lambda_2 = \sum_{i \in \mathcal{R}} \frac{\lambda^2 |\mathcal{B}_i|^2 d}{\varepsilon},$$

where $\lambda$ is the penalty parameter and $d$ the model dimension. This means that, although the theorem grants convergence to a neighborhood of $x^*$, the gap scales quadratically in the penalty parameter and in the number of Byzantine neighbors. Therefore, the more we push the models to be close, the worse the solution gets. The same holds for the time-varying case, where $\Lambda_2$ exceeds a term that depends on the average graph instead of the fixed one.

Experimentally, the authors showed that this method performs better than ByRDiE and BRIDGE. Nonetheless, the final parameters of the nodes have a high variance. Although the theoretical results require high values of $\lambda$, they use low values in the experiments, so each model learns almost independently.

2.3 Comments on the state of the art

In the previous section, we introduced many algorithms designed by different authors to address the problem of Byzantine adversaries. Many others exist and new papers keep appearing to this day, bringing new ideas for robust estimators and for new attacks that can break the existing ones.

The first thing that we notice is the great heterogeneity in the assumptions and the definitions. Notably, there is no agreement on what it means for an algorithm to be Byzantine-robust. Furthermore, nobody seems to agree on what are reasonable assumptions about the graph structure and about the number of Byzantine agents.

Part of this heterogeneity is due to the relatively new interest in serverless gradient descent. Many of the algorithms that we presented, and that are used as benchmarks in comparative studies [21], have been designed to deal with Byzantine agents in a parameter-server setting. Even though we argued that such methods can be generalized to the decentralized setting, it is evident that the proofs no longer hold.

Interesting theoretical questions arise when looking closely at decentralized SGD and its robust variants. One can look at the whole network and wonder about the convergence guarantees of the average parameters, as in [8]. Or one could wonder what would be the average across nodes of the losses computed with the local parameters, or some other linear combination of those functions [15].

Already in a Byzantine-free setting, the main interest is in the influence on the convergence rates of a few factors that define the topology of the communication graph. In particular, how much speedup can one gain by adding more nodes, and thus more computational power, to the network? How many connections are required to improve the convergence rates? Previous works [8] try to answer those questions, which we study theoretically in Chapter 3.

To approach the analysis of Byzantine-resilient optimization, it is crucial to have a well-defined problem. Although the papers introducing many of the cited algorithms include convergence and optimality proofs [2, 20, 19, 14, 5, 13], they cannot be compared. As we anticipated, the assumptions are too different. Some of those algorithms have been analyzed for the federated learning setting [2, 14, 13], but already in the original papers there were doubts and limitations about the proofs. Even though some of the authors claim convergence, they prove it asymptotically and do not report general convergence rates [14, 20, 19].

Notably, to the best of our knowledge, no one has ever tried to study the local behavior of single nodes in decentralized optimization, with or without Byzantine agents. We find the convergence rate of the local parameters $x_i^{(t)}$ of major interest, as we discuss in Section 5.2.

Chapter 3

Convergence analysis

This chapter focuses on the theoretical study of finite-time convergence rates for Decentralized SGD and its robust generalizations.

3.1 Proof plan

We argue that, in order to prove convergence and consensus rates for a finite number of iterations, we should reach the following milestones:

1. Approximate Byzantine-resilient DeSGD by a Byzantine-free decentralized Gossip SGD algorithm, where the approximation will depend on the aggregation rule.

2. Compute finite-time consensus and convergence rates for Gossip SGD.

The key aspect is the formulation of the problem. To study the evolution of the system, we express the aggregation function and the subsequent SGD updates in matrix form, as has been done in previous works [20, 19, 8]. It would be reasonable to assume that the output of the aggregation function lies somewhere in the convex hull of its inputs. For this reason, with $X^{(t)}$ the matrix of the parameters broadcast by all nodes, both regular and Byzantine, we would like to approximate the matrix of the aggregated parameters at the regular nodes $\mathcal{R}$ as

$$\hat{X}^{(t)}_{\mathcal{R}} := \mathrm{Aggr}(X^{(t)}) \approx X^{(t)}_{\mathcal{R}} M_t, \tag{3.1.1}$$

with a mixing matrix $M_t \in \mathbb{R}^{N \times N}$, $N = |\mathcal{R}|$. We discuss the properties of, and the assumptions on, $M_t$ in Section 3.3.


With (3.1.1), we could express the Robust DeSGD update at iteration $t$, for the regular nodes $\mathcal{R}$, as

$$X^{(t+1)}_{\mathcal{R}} = X^{(t)}_{\mathcal{R}} M_t - \eta_t G^{(t)},$$

where the matrix $G^{(t)}$ contains the estimates of the local gradients at the regular nodes. The proof proceeds by studying how the average of the parameters at the good nodes, $\bar{x}^{(t)} = \frac{1}{N} \sum_{i \in \mathcal{R}} x_i^{(t)}$, behaves. In particular, we focus on the squared distance to an optimal point $x^*$: $\|\bar{x}^{(t)} - x^*\|^2$.

In Section 3.4, we derive the convergence rates for strongly convex and for smooth loss functions $f$. In a Byzantine-free setting we prove that the local parameters get closer and closer to their arithmetic mean, which converges to the optimum. This convergence depends on some properties of the graph, on how the information from the neighbors is aggregated and, of course, on properties of the loss function $f$.

The generalization to the Byzantine setting is not trivial. Asymptotic proofs exist under very specific assumptions [20, 19] but, to the best of our knowledge, no one has proven finite-time rates. Byzantine attacks can be very elaborate by definition, and characterizing the output of aggregation functions is a delicate matter. In Section 3.5, we discuss the characterization of robust aggregation rules and the critical parts of the proofs that need careful evaluation to include Byzantine agents.

In the next section, we give some definitions from calculus and spectral graph theory which we will apply in the convergence proofs.

3.2 Definitions and preliminary lemmas

3.2.1 Strong convexity and smoothness

We recall some properties of strongly convex functions and some of their implications.

Definition 3.2.1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable function. We say that $f$ is $\mu$-strongly convex, for some $\mu > 0$, if for all $x, y \in \mathbb{R}^d$ the following holds:

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2. \tag{3.2.1}$$

There are some equivalent properties to 3.2.1.


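Lemma 3.2.2 ([22]). The definition of $\mu$-strong convexity given in 3.2.1 is equivalent to

$$\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \mu \|x - y\|^2. \tag{3.2.2}$$

And if $x^*$ is a minimum, we have

$$\langle \nabla f(x), x - x^* \rangle \ge \mu \|x - x^*\|^2, \tag{3.2.3}$$

$$\langle \nabla f(x), x - x^* \rangle \ge \frac{\mu}{2} \|x - x^*\|^2 + \big(f(x) - f(x^*)\big). \tag{3.2.4}$$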

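Definition 3.2.3. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable function. We say that $f$ is $L$-smooth (has $L$-Lipschitz gradient), for some $L > 0$, if for all $x, y \in \mathbb{R}^d$ the following holds:

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|. \tag{3.2.5}$$

Lemma 3.2.4 ([22]). If a function $f$ is $L$-smooth, then for all $x, y \in \mathbb{R}^d$ it holds that

$$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2} \|x - y\|^2. \tag{3.2.6}$$

Lemma 3.2.5 ([22, 7]). If a function $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and $\mu$-convex, then for all $x, y \in \mathbb{R}^d$ it holds that

$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|^2, \tag{3.2.7}$$

and if $x^*$ is a minimum

$$\|\nabla f(x)\|^2 \le 2L \big(f(x) - f(x^*)\big). \tag{3.2.8}$$

Furthermore, for all $x, y, z \in \mathbb{R}^d$ the following holds:

$$\langle \nabla f(x), z - y \rangle \ge f(z) - f(y) + \frac{\mu}{4} \|y - z\|^2 - L \|z - x\|^2. \tag{3.2.9}$$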

3.2.2 Spectral graph theory and linear algebra

Let $G = (\mathcal{V}, \mathcal{E})$ be a directed graph with $N$ vertices $\mathcal{V}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. Since the naming of the vertices has no influence on the graph properties, we identify vertices with numerals, i.e. $\mathcal{V} := [N]$. Note that we allow self-loops, i.e. edges from one vertex to itself. We say that a graph is weighted if we have a map $w : \mathcal{E} \to \mathbb{R}$.


Definition 3.2.6. We define the mixing matrix $M$ of a graph $G$ as the $N \times N$ matrix whose entries are given, for each $(i, j) \in \mathcal{V} \times \mathcal{V}$, by

$$[M]_{i,j} := \begin{cases} w((i, j)) & \text{if } (i, j) \in \mathcal{E}, \\ 0 & \text{otherwise.} \end{cases} \tag{3.2.10}$$

Thanks to the mixing matrix, we can express operations on the graph relying only on linear algebra. In fact, we can assign to each node $i \in [N]$ a vector $x_i \in \mathbb{R}^d$ and, using the mixing matrix $M$, we can express the weighted average of the neighboring vectors as a matrix multiplication:

$$\hat{x}_i = \sum_{j \in \mathcal{N}_i} w((j, i))\, x_j = \sum_{j=1}^{N} [M]_{j,i}\, x_j = [XM]_i,$$

where $X$ is a $d \times N$ matrix whose columns are given by the node vectors and $[XM]_i$ denotes the $i$-th column of $XM$.

Definition 3.2.7. A real square matrix $M$ is

• right stochastic if the values in each row sum to 1, i.e. $M \mathbb{1} = \mathbb{1}$.

• left stochastic if the values in each column sum to 1, i.e. $\mathbb{1}^\top M = \mathbb{1}^\top$.

• doubly stochastic if both its rows and its columns sum to 1.

Definition 3.2.8. Let $G$ be a graph on $N$ nodes. Supposing that each vertex $i \in [N]$ has a vector $x_i \in \mathbb{R}^d$, we say that we have consensus if $x_i = x_j$ for all $i, j \in [N]$.

We can rephrase Definition 3.2.7 in a more intuitive way: a right stochastic matrix preserves the averaging operator, while a left stochastic one preserves consensus.

Since the arithmetic mean can be expressed in matrix notation as multiplication by $\frac{\mathbb{1}}{N}$, for a right stochastic matrix $M$ and any $X \in \mathbb{R}^{d \times N}$ it holds that $XM \frac{\mathbb{1}}{N} = X \frac{\mathbb{1}}{N}$, which means that the average of the mixed vectors is the same as the mean of the original vectors.

On the other hand, if we suppose that we have consensus in the graph, we repeat the same vector $x$ in each column, so we can write the node-vectors matrix as $X = x \mathbb{1}^\top$. Thus, if $M$ is a left stochastic matrix, we have that $XM = x \mathbb{1}^\top M = x \mathbb{1}^\top = X$, i.e. consensus is preserved.
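Both properties are easy to check numerically. The sketch below uses an arbitrary $4 \times 4$ example matrix, chosen for illustration only, whose rows sum to one but whose columns do not.

```python
import numpy as np

N, d = 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(d, N))              # one parameter vector per column
ones = np.ones(N)

# Right stochastic (rows sum to 1), but not left stochastic.
M_right = np.array([[0.50, 0.50, 0.00, 0.00],
                    [0.00, 0.50, 0.50, 0.00],
                    [0.00, 0.00, 0.50, 0.50],
                    [0.25, 0.25, 0.25, 0.25]])
mean_before = X @ ones / N
mean_after = (X @ M_right) @ ones / N    # unchanged, because M_right @ ones == ones

# Left stochastic (columns sum to 1): consensus is preserved.
M_left = M_right.T
x = rng.normal(size=d)
X_consensus = np.outer(x, ones)          # X = x 1^T
X_mixed = X_consensus @ M_left           # unchanged, because ones @ M_left == ones
```

A doubly stochastic matrix enjoys both properties at once, which is what the convergence analysis relies on.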

We now recall some linear algebra inequalities which we will use later in the convergence proofs.

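Lemma 3.2.9. For any two vectors $a, b \in \mathbb{R}^d$ and any $\gamma > 0$, we have

$$2 \langle a, b \rangle \le \gamma \|a\|^2 + \gamma^{-1} \|b\|^2. \tag{3.2.11}$$

Lemma 3.2.10. For any set of vectors $\{a_i\}_{i \in [m]} \subset \mathbb{R}^d$, the following holds:

$$\Big\| \sum_{i=1}^{m} a_i \Big\|^2 \le m \sum_{i=1}^{m} \|a_i\|^2. \tag{3.2.12}$$

Also, for any two vectors $a, b$ and every $\alpha > 0$,

$$\|a + b\|^2 \le (1 + \alpha) \|a\|^2 + (1 + \alpha^{-1}) \|b\|^2. \tag{3.2.13}$$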

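Definition 3.2.11 (Matrix norms). Let $A \in \mathbb{R}^{d \times n}$. We define the matrix norm induced by the vector norm $\|\cdot\|$ as

$$\|A\| = \sup_{x \in \mathbb{R}^n : \|x\| = 1} \|Ax\|. \tag{3.2.14}$$

We also define the Frobenius norm of $X \in \mathbb{R}^{d \times n}$, with columns $x_i$ and entries $x_{j,i}$, as

$$\|X\|_F := \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{d} x_{j,i}^2} = \sqrt{\sum_{i=1}^{n} \|x_i\|^2}. \tag{3.2.15}$$

Lemma 3.2.12. For $A \in \mathbb{R}^{d \times n}$ and $B \in \mathbb{R}^{n \times n}$, with $\|\cdot\|_F$ the Frobenius norm and $\|\cdot\|$ the induced matrix norm, we have

$$\|AB\|_F \le \|A\|_F \|B\|. \tag{3.2.16}$$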

Lemma 3.2.13 (Consensus rates). Let $G$ be a strongly connected graph on $[N]$ with mixing matrix $M$. Let $X = [x_i : i \in [N]] \in \mathbb{R}^{d \times N}$ and $\hat{x}_i = [XM]_i$ be the nodes' initial and mixed vectors, respectively. Define $\bar{x} = X \frac{\mathbb{1}}{N}$ to be the arithmetic mean across all nodes. Then, if $M$ is symmetric and doubly stochastic, the following holds for any $y \in \mathbb{R}^d$:

$$\sum_{i=1}^{N} \|\hat{x}_i - \bar{x}\|^2 \le \lambda_2^2 \sum_{i=1}^{N} \|x_i - y\|^2, \tag{3.2.17}$$

where $\lambda_2 < 1$ is the second largest eigenvalue of $M$.

In particular, the right-hand side of (3.2.17) is minimized for $y = \bar{x}$.


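Proof. Let $y \in \mathbb{R}^d$ and define $Y := y \mathbb{1}^\top$. By left stochasticity of $M$, $YM = Y$ and thus $Y \big(M - \frac{\mathbb{1}\mathbb{1}^\top}{N}\big) = 0$.

We observe that

$$\begin{aligned}
\sum_{i=1}^{N} \|\hat{x}_i - \bar{x}\|^2
&= \sum_{i=1}^{N} \Big\| \Big(XM - X \frac{\mathbb{1}\mathbb{1}^\top}{N}\Big) e_i \Big\|^2
= \sum_{i=1}^{N} \Big\| X \Big(M - \frac{\mathbb{1}\mathbb{1}^\top}{N}\Big) e_i \Big\|^2 \\
&= \sum_{i=1}^{N} \Big\| (X - Y) \Big(M - \frac{\mathbb{1}\mathbb{1}^\top}{N}\Big) e_i \Big\|^2
= \Big\| (X - Y) \Big(M - \frac{\mathbb{1}\mathbb{1}^\top}{N}\Big) \Big\|_F^2 \\
&\overset{(3.2.16)}{\le} \|X - Y\|_F^2 \, \Big\| M - \frac{\mathbb{1}\mathbb{1}^\top}{N} \Big\|^2
= \Big\| M - \frac{\mathbb{1}\mathbb{1}^\top}{N} \Big\|^2 \sum_{i=1}^{N} \|x_i - y\|^2.
\end{aligned}$$

Since $M$ is symmetric, it is diagonalizable and $\|Mx\|^2 = \langle M^2 x, x \rangle$. Since the graph is connected, by the Perron-Frobenius theorem, $1$ is the strictly largest eigenvalue of $M$, with normalized eigenvector $\frac{\mathbb{1}}{\sqrt{N}}$.

Therefore, by the Min-Max Theorem we have that

$$\Big\| M - \frac{\mathbb{1}\mathbb{1}^\top}{N} \Big\|^2 = \sup_{x \perp \mathbb{1} : \|x\| = 1} \langle M^2 x, x \rangle = \lambda_2^2,$$

where $\lambda_2 < 1$ is the second largest eigenvalue of $M$.

This concludes the proof.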
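The bound of Lemma 3.2.13 can be checked numerically on a small example. The sketch below assumes a ring graph with self-loops and uniform weights $1/3$ (a choice made for illustration), and takes the second largest eigenvalue in absolute value, so that it matches the norm of $M - \mathbb{1}\mathbb{1}^\top / N$ even when some eigenvalues are negative.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic matrix of a ring with self-loops (weights 1/3)."""
    M = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            M[i, j % n] = 1.0 / 3.0
    return M

N, d = 8, 5
rng = np.random.default_rng(0)
M = ring_mixing_matrix(N)
X = rng.normal(size=(d, N))                  # initial vectors, one per column
x_bar = X.mean(axis=1, keepdims=True)        # arithmetic mean across nodes

eig_abs = np.sort(np.abs(np.linalg.eigvalsh(M)))[::-1]
lambda_2 = eig_abs[1]                        # second largest eigenvalue (absolute value)

lhs = np.sum((X @ M - x_bar) ** 2)           # sum_i ||mixed x_i - mean||^2
rhs_mean = lambda_2 ** 2 * np.sum((X - x_bar) ** 2)   # right-hand side with y = mean
y = rng.normal(size=(d, 1))
rhs_other = lambda_2 ** 2 * np.sum((X - y) ** 2)      # right-hand side with another y
```

The comparison between `rhs_mean` and `rhs_other` illustrates the final remark of the lemma: the bound is tightest when $y$ is the mean.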

3.3 Modelling decentralized SGD

In this section, we present a reformulation of Algorithm 4, i.e. Robust De-SGD, which allows us to perform a theoretical analysis.

3.3.1 Assumptions on the loss

Before we discuss the network and its properties, we focus on the empirical loss function $f$ and its gradient with respect to $x$; recall that we ask for $f$ to be differentiable in Assumption 1.

Let $g(x) := \nabla f(x, \xi)$ be the stochastic estimate of the gradient with respect to $x$ of the risk function on a random sample $\xi \sim \mathcal{D}$. Let $\bar{g}(x) = \mathbb{E}_\xi \nabla f(x, \xi)$ be the mean gradient at $x$.

Assumption 2 (Bounded gradients). The stochastic component $\delta(x) = g(x) - \bar{g}(x)$ follows a distribution with mean $0$ and covariance $\Sigma_x$, and has bounded expected squared norm. That is, for all $x \in \mathbb{R}^d$,

$$\mathbb{E} \|\delta(x)\|^2 \le \sigma^2. \tag{3.3.1}$$

We drop the specification on $x$ when the context is clear. Thus, since we always evaluate $f$ at the aggregated parameters $\hat{x}_i^{(t)}$, we write $g_i^{(t)} := g(\hat{x}_i^{(t)})$, $\bar{g}_i^{(t)}$ for its mean value, and $\delta_i^t$ for its stochastic component.

Note that the average gradients $\bar{g}_i^{(t)}$ are deterministic conditioned on the output of previous iterations. The stochastic components $\delta_i^t$ are independent across nodes $i$; we require them to be centered at zero and to have some covariance matrix $\Sigma_t$, which is fixed conditioning on previous iterations. We are only interested in the expected squared norm $\mathbb{E} \|\delta_i^t\|^2$ of those random vectors.

By basic linear algebra and probability theory, we have that

$$\mathbb{E} \|\delta_i^t\|^2 = \mathrm{Tr}(\Sigma_t) := \sigma_t^2. \tag{3.3.2}$$

We note from (3.3.2) that the expected squared norm depends on the dimension $d$ of the parameter space.
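For completeness, the identity (3.3.2) follows from the cyclic property of the trace together with the zero mean of the noise; the index $k$ below is introduced only for this derivation.

```latex
\mathbb{E}\|\delta_i^t\|^2
  = \mathbb{E}\big[(\delta_i^t)^\top \delta_i^t\big]
  = \mathbb{E}\big[\operatorname{Tr}\big(\delta_i^t (\delta_i^t)^\top\big)\big]
  = \operatorname{Tr}\big(\mathbb{E}\big[\delta_i^t (\delta_i^t)^\top\big]\big)
  = \operatorname{Tr}(\Sigma_t)
  = \sum_{k=1}^{d} [\Sigma_t]_{k,k}.
```

The last expression is a sum of $d$ coordinate-wise variances, which makes the dependence on the dimension explicit: for isotropic noise with per-coordinate variance $s^2$, it equals $d\,s^2$.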

3.3.2 Modelling the aggregation function

Now, we move to the communication network. As explained in Section 1.4, we define it as a graph in which each node is a computer. An edge $(i, j)$ is in $\mathcal{E}$ if node $i$ can send a message to node $j$. We define $\mathcal{R}$ to be the regular nodes and $\mathcal{B}$ to be the Byzantine agents, so that $\mathcal{V} = \mathcal{R} \cup \mathcal{B}$. We let $N = |\mathcal{R}|$, $b = |\mathcal{B}|$ and $N' = N + b$.

As we see in Algorithm 4, the only part involving incoming messages is the aggregation step. In Chapter 2 we anticipated that the goal of the aggregation function Aggr is to robustly estimate the average of its inputs. We can reasonably expect the output to lie in the convex hull of the input vectors, and express it in matrix form. Furthermore, we can include in the formula the states of every node in the graph, by multiplying by 0 those coming from a non-neighboring node.

With this in mind, we can express the output of Aggr at a node $i$ as a linear combination of the state vectors of the whole system:

$$\begin{aligned}
\mathrm{Aggr}\big(x_i^{(t)}, X_{\mathcal{N}_i}^{(t)}\big) &:= a_{(i,i)} x_i^{(t)} + \sum_{j \in \mathcal{N}_i} a_{(j,i)} x_j^{(t)} \\
&= \sum_{j \in [N]} a_{(j,i)} x_j^{(t)} \\
&= X^{(t)} a_i,
\end{aligned} \tag{3.3.3}$$

where $X^{(t)} = \big[x_1^{(t)} \cdots x_N^{(t)}\big] \in \mathbb{R}^{d \times N}$ is the matrix built by stacking the parameter vectors of each node, and $a_i$ is a column vector whose entries are the "scores" that node $i$ assigns to the information from each node (zero if the node is not a neighbor). Since the aggregated vector lies in the convex hull of $X^{(t)}$, we know that

$$\sum_{j \in [N]} a_{(j,i)} = \mathbb{1}^\top a_i = 1. \tag{3.3.4}$$

We can use the linear formulation in the local update of Robust De-SGD to obtain, for a regular node $i \in \mathcal{R}$,

$$\begin{aligned}
x_i^{(t+1)} &\leftarrow \hat{x}_i^{(t)} - \eta_t\, g(\hat{x}_i^{(t)}) \\
&= X^{(t)} a_i - \eta_t \nabla f(X^{(t)} a_i, \xi_i).
\end{aligned} \tag{3.3.5}$$

Using (3.3.5), we can write the global (synchronous) update, for the regular nodes $\mathcal{R}$, in matrix form as

$$X_{\mathcal{R}}^{(t+1)} = X^{(t)} A_t - \eta_t G^{(t)}. \tag{3.3.6}$$

We recognize $A_t$ as a mixing matrix for the communication graph. We specify the iterate index $t$ because the aggregation rule can produce different weights based on the current parameters or on some other heuristic.

We remark that we abuse the matrix notation since, by definition, the messages sent by Byzantine agents can be different for every receiving node. We do not engage in designing a more accurate notation, since we try to exclude Byzantine agents from our formulation.

Since, by Assumption 2, the noise in the gradient estimate matrix $G^{(t)} = \big[g_1^{(t)} \cdots g_N^{(t)}\big]$ is centered, we can decompose it as

$$\begin{aligned}
G^{(t)} &= \big[g_1^{(t)} \cdots g_N^{(t)}\big] \\
&= \big[\bar{g}_1^{(t)} \cdots \bar{g}_N^{(t)}\big] - \big[\delta_1^t \cdots \delta_N^t\big] \\
&= \bar{G}^{(t)} - \Delta_t.
\end{aligned} \tag{3.3.7}$$

Ideally, we would like to approximate $A_t$ by a mixing matrix $M_{t,\mathcal{R}} \in \mathbb{R}^{N \times N}$, so that we could express the recursion as a function of the regular nodes alone:

$$X^{(t)} A_t \approx X_{\mathcal{R}}^{(t)} M_{t,\mathcal{R}}
\;\Rightarrow\;
X_{\mathcal{R}}^{(t+1)} \approx X_{\mathcal{R}}^{(t)} M_{t,\mathcal{R}} - \eta_t \bar{G}^{(t)} + \eta_t \Delta_t. \tag{3.3.8}$$

We derive convergence bounds for the Byzantine-free recursion (3.3.8) in Section 3.4. Then, in Section 3.5, we discuss the extension of such a proof to a Byzantine-prone setting.
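The Byzantine-free recursion (3.3.8) is straightforward to simulate. The following sketch assumes a ring graph with uniform weights and a quadratic risk $F(x) = \frac{1}{2}\|x - x^\star\|^2$; both choices, as well as the names `desgd_step` and `x_star`, are illustrative assumptions rather than part of the analysis.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic matrix of a ring with self-loops (weights 1/3)."""
    M = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            M[i, j % n] = 1.0 / 3.0
    return M

def desgd_step(X, M, mean_grad, lr, sigma, rng):
    """One step of (3.3.8): mix, then apply mean gradient and noise at the mixed point."""
    X_mixed = X @ M                              # aggregation at every node
    G_bar = mean_grad(X_mixed)                   # column i: mean gradient at node i
    Delta = sigma * rng.normal(size=X.shape)     # zero-mean noise, independent across nodes
    return X_mixed - lr * G_bar + lr * Delta

d, N = 3, 6
rng = np.random.default_rng(0)
x_star = np.array([1.0, -2.0, 0.5])
mean_grad = lambda X: X - x_star[:, None]        # gradient of 0.5 * ||x - x_star||^2
M = ring_mixing_matrix(N)

X = rng.normal(size=(d, N))                      # nodes start at different points
for _ in range(300):
    X = desgd_step(X, M, mean_grad, lr=0.1, sigma=0.0, rng=rng)
consensus_gap = np.linalg.norm(X - X.mean(axis=1, keepdims=True))
error = np.linalg.norm(X.mean(axis=1) - x_star)
```

With `sigma=0.0` the run is deterministic and both the consensus gap and the error of the average vanish; setting `sigma > 0` leaves a noise floor that scales with the learning rate.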

3.3.3 Unrolling the recursion

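As a final preliminary comment, let us suppose that there are no Byzantine agents. We can unroll the recursion (3.3.6) to obtain the output after $T + 1$ iterations:

$$\begin{aligned}
X^{(T+1)} &= X^{(T)} M_T - \eta_T G^{(T)} \\
&= \Big( \dots \Big( \big( X^{(0)} M_0 - \eta_0 G^{(0)} \big) M_1 - \eta_1 G^{(1)} \Big) \dots \Big) M_T - \eta_T G^{(T)} \\
&= \underbrace{X^{(0)} \prod_{t=0}^{T} M_t}_{A}
 \; - \; \underbrace{\Big( \sum_{t=0}^{T-1} \eta_t \bar{G}^{(t)} \prod_{k=t+1}^{T} M_k + \eta_T \bar{G}^{(T)} \Big)}_{B}
 \; + \; \underbrace{\sum_{t=0}^{T-1} \eta_t \Delta_t \prod_{k=t+1}^{T} M_k + \eta_T \Delta_T}_{C}.
\end{aligned} \tag{3.3.9}$$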

It is interesting to observe that each quantity on the right-hand side of (3.3.9) keeps getting mixed after its first appearance. Furthermore, we note that $A = X^{(0)} \prod_{t=0}^{T} M_t = X^{(0)}$, since each node starts at the same point $x^{(0)}$, thus $X^{(0)} = x^{(0)} \mathbb{1}^\top$, and $\mathbb{1}$ is a left eigenvector of all mixing matrices $M_t$.

Terms $B$ and $C$ involve, respectively, the gradient estimates and the stochastic noise. Ignoring the mixing matrices, they have the same structure as in local SGD. The product of the $M_t$'s tells us that the earlier in the iterations a gradient, or noise, appears, the more it gets mixed.
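The algebra behind (3.3.9) can be verified numerically. In the sketch below the matrices playing the roles of $M_t$, $\bar{G}^{(t)}$ and $\Delta_t$ are arbitrary random placeholders (an assumption made to test the identity only; in the algorithm the gradients depend on the iterates), and `tail_product` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T = 3, 4, 5
etas = rng.uniform(0.01, 0.1, size=T + 1)
Ms = [rng.uniform(size=(N, N)) for _ in range(T + 1)]   # placeholders for M_t
Gs = [rng.normal(size=(d, N)) for _ in range(T + 1)]    # placeholders for mean gradients
Ds = [rng.normal(size=(d, N)) for _ in range(T + 1)]    # placeholders for noise
X0 = rng.normal(size=(d, N))

# Iterative form: X <- X M_t - eta_t G_t + eta_t Delta_t, for t = 0, ..., T.
X = X0.copy()
for t in range(T + 1):
    X = X @ Ms[t] - etas[t] * Gs[t] + etas[t] * Ds[t]

def tail_product(t):
    """M_{t+1} M_{t+2} ... M_T (the identity when t == T)."""
    P = np.eye(N)
    for k in range(t + 1, T + 1):
        P = P @ Ms[k]
    return P

A = X0 @ Ms[0] @ tail_product(0)                                   # X0 times all M_t
B = sum((etas[t] * Gs[t]) @ tail_product(t) for t in range(T + 1)) # mixed gradients
C = sum((etas[t] * Ds[t]) @ tail_product(t) for t in range(T + 1)) # mixed noise
X_unrolled = A - B + C
```

The term with index $t$ is multiplied by $T - t$ mixing matrices, which is the sense in which earlier gradients and noise get mixed more.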

3.4 Average convergence without Byzantines

In this section, we study the convergence rates of the average parameter vector $\bar{x}^{(t)}$. We propose proofs for two different settings, first for a strongly convex loss function, then for a smooth one.

The proof technique is similar to [8], but we have stronger assumptions and thus the convergence rates are a bit tighter. We prove some lemmas on the average of the parameters across all nodes, then we combine them to obtain a bound on the mean error. The idea is to obtain a recursion on a mean error term $r_T$ (error on the parameters in Lemma 3.4.1 and on the function value in Lemma 3.4.4) and on the consensus variation $\Xi_T$ (in Lemmas 3.4.2 and 3.4.9). The recursions on $r_T$ and on $\Xi_T$ depend on the same terms, so that, with some manipulations, we prove bounds on the function suboptimality in Theorems 3.4.3 and 3.4.6.

For the whole section, we accept Assumption 2 on bounded norms of the stochastic gradients. More precisely, we suppose that we can decompose the gradient estimates as $g_i^{(t)} = \bar{g}_i^{(t)} + \delta_i^t$, with $\bar{g}_i^{(t)} := \mathbb{E}_{\xi_i^t}[\nabla f(\hat{x}_i^{(t)})]$ deterministic conditioned on the previous iterates and $\delta_i^t \sim (0, \Sigma_t)$ a random vector, independent across nodes $i$. We also suppose that the mean squared norm of the noise is bounded, i.e. $\mathbb{E} \|\delta_i^t\|^2 \le \sigma^2$, for all $i \in [N]$ and $t \ge 0$.

We recall that, by definition, the aggregation rule can be expressed as a left-stochastic matrix $M_t$ at each iteration $t$. In this section, we only consider the case in which there are no Byzantine agents. Therefore, we deal only with regular agents, with no need for approximations.

Assumption 3 (Byzantine-free). All of the $N$ agents in the graph are regular workers following Algorithm 4.

Since we are interested in the average of the parameters, the proof becomes more tractable if we suppose that the mixing matrices preserve the average operator. In other words, they should be right stochastic.

Assumption 4 (Symmetric mixing). We suppose that the mixing matrices $M_t$ are symmetric, and thus doubly stochastic, for all $t \ge 0$.

Assumption 5 (Nonnull spectral gap). The second eigenvalue $\lambda_{2,t}$ of $M_t$ is strictly smaller than 1 for all $t \ge 0$.

Note that since $\lambda_{2,t} < 1$, then $\lambda_{2,t}^2 < \lambda_{2,t}$. This implies that the spectral gap $\rho_t$ of $M_t^2$, defined as the difference between the first two eigenvalues, is always greater than zero.

Also, there exists a positive lower bound $\rho = \inf_t \{\rho_t\}$ on the spectral gaps.

Note that this assumption is satisfied if the graph described by the mixing matrix is strongly connected, thanks to the Perron-Frobenius theorem.

Assumption 6 (Smoothness). The empirical risk function $f$ is $L$-smooth, as defined in (3.2.5), with respect to the parameter vector $x$, for any random vector $\xi$.

3.4.1 Strongly convex case

In this subsection, we suppose that, in addition to $L$-smoothness, we have strong convexity for the risk function.

Assumption 7 (Strong convexity). The risk function $f$ is $\mu$-strongly convex, as defined in (3.2.1), with respect to the parameter vector $x$, for all random vectors $\xi \in \mathcal{X}$.

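Throughout this section, we use the following notation:

$$\Xi_T = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E} \big\| \hat{x}_i^{(T)} - \bar{x}^{(T)} \big\|^2, \qquad
r_T = \mathbb{E} \big\| \bar{x}^{(T)} - x^* \big\|^2, \qquad
e_T = \mathbb{E} \big[ f(\bar{x}^{(T)}) - f(x^*) \big] = F(\bar{x}^{(T)}) - F(x^*),$$

where $\hat{x}_i^{(T)}$ denotes the aggregated parameters at node $i$ and $\bar{x}^{(T)}$ the average of the parameters across nodes.

Lemma 3.4.1 (Error recursion – $\mu$-convex). Under assumptions 2, 3, 4, 5, 6, 7, the following holds:

$$r_{T+1} \le \Big(1 - \frac{\eta_T \mu}{2}\Big) r_T + 2 \eta_T (2 L \eta_T - 1)\, e_T + 2 \eta_T L (\eta_T L + 1)\, \Xi_T + \frac{\eta_T^2}{N} \sigma^2. \tag{3.4.1}$$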


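Proof. We consider the average parameters at time $T + 1$, with $X^* := x^* \mathbb{1}^\top$:

$$\begin{aligned}
\mathbb{E} \big\| \bar{x}^{(T+1)} - x^* \big\|^2
&= \mathbb{E} \Big\| \big( X^{(T+1)} - X^* \big) \frac{\mathbb{1}}{N} \Big\|^2 \\
&= \mathbb{E} \Big\| \big( X^{(T)} M_T - \eta_T \bar{G}^{(T)} + \eta_T \Delta_T - X^* \big) \frac{\mathbb{1}}{N} \Big\|^2 \\
&\le \underbrace{\mathbb{E} \Big\| \big( X^{(T)} M_T - \eta_T \bar{G}^{(T)} - X^* \big) \frac{\mathbb{1}}{N} \Big\|^2}_{\mathrm{Err}_T} + \frac{\eta_T^2}{N} \sigma^2,
\end{aligned} \tag{3.4.2}$$

where the last step uses the independence and zero mean of the $\delta_i$.

Then, we observe that the first term on the right-hand side ($\mathrm{Err}_T$) is deterministic conditioned on previous steps. We decompose its norm to leverage smoothness and strong convexity:

$$\begin{aligned}
\mathrm{Err}_T &= \mathbb{E} \Big\| \big( X^{(T)} M_T - \eta_T \bar{G}^{(T)} - X^* \big) \frac{\mathbb{1}}{N} \Big\|^2 \\
&= \mathbb{E} \big\| \bar{x}^{(T)} - x^* \big\|^2 - 2 \eta_T \mathbb{E} \Big\langle \bar{G}^{(T)} \frac{\mathbb{1}}{N}, \bar{x}^{(T)} - x^* \Big\rangle + \eta_T^2 \mathbb{E} \Big\| \bar{G}^{(T)} \frac{\mathbb{1}}{N} \Big\|^2.
\end{aligned}$$

We now focus on the second and third terms on the right-hand side.

Starting with the scalar product, we use Lemma 3.2.5 to bound

$$\begin{aligned}
\mathbb{E} \Big\langle \bar{G}^{(T)} \frac{\mathbb{1}}{N}, \bar{x}^{(T)} - x^* \Big\rangle
&= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E} \big\langle \bar{g}_i^{(T)}, \bar{x}^{(T)} - x^* \big\rangle \\
&\ge \frac{1}{N} \sum_{i=1}^{N} \Big( F(\bar{x}^{(T)}) - F(x^*) + \frac{\mu}{4} \mathbb{E} \big\| \bar{x}^{(T)} - x^* \big\|^2 - L\, \mathbb{E} \big\| \hat{x}_i^{(T)} - \bar{x}^{(T)} \big\|^2 \Big) \\
&= \frac{\mu}{4} \mathbb{E} \big\| \bar{x}^{(T)} - x^* \big\|^2 + F(\bar{x}^{(T)}) - F(x^*) - \frac{L}{N} \sum_{i=1}^{N} \mathbb{E} \big\| \hat{x}_i^{(T)} - \bar{x}^{(T)} \big\|^2.
\end{aligned} \tag{3.4.3}$$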

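We only miss the third term of $\mathrm{Err}_T$, which we bound using Lemma 3.2.4 for $L$-smooth functions. Since $f(x, \xi)$ is $L$-smooth in $x$ for all $\xi$, we obtain

$$\begin{aligned}
\mathbb{E} \Big\| \bar{G}^{(T)} \frac{\mathbb{1}}{N} \Big\|^2
&= \mathbb{E} \Big\| \frac{1}{N} \sum_{i=1}^{N} \bar{g}_i^{(T)} \Big\|^2 \\
&= \mathbb{E} \Big\| \frac{1}{N} \sum_{i=1}^{N} \big( \bar{g}_i^{(T)} \pm \mathbb{E} \nabla f_i(\bar{x}^{(T)}) \big) \Big\|^2 \\
&\le \frac{2}{N} \sum_{i=1}^{N} \mathbb{E} \big\| \bar{g}_i^{(T)} - \mathbb{E} \nabla f_i(\bar{x}^{(T)}) \big\|^2 + \frac{2}{N} \sum_{i=1}^{N} \big\| \mathbb{E} \nabla f_i(\bar{x}^{(T)}) \big\|^2 \\
&\le \frac{2 L^2}{N} \sum_{i=1}^{N} \mathbb{E} \big\| \hat{x}_i^{(T)} - \bar{x}^{(T)} \big\|^2 + \frac{4 L}{N} \sum_{i=1}^{N} \big( \mathbb{E} f_i(\bar{x}^{(T)}) - \mathbb{E} f_i(x^*) \big) \\
&= \frac{2 L^2}{N} \sum_{i=1}^{N} \mathbb{E} \big\| \hat{x}_i^{(T)} - \bar{x}^{(T)} \big\|^2 + 4 L \big( F(\bar{x}^{(T)}) - F(x^*) \big).
\end{aligned} \tag{3.4.4}$$

Finally, we can combine (3.4.3) and (3.4.4) to bound $\mathrm{Err}_T$ in (3.4.2), yielding

$$\begin{aligned}
\mathbb{E} \big\| \bar{x}^{(T+1)} - x^* \big\|^2
&\le \mathrm{Err}_T + \frac{\eta_T^2}{N} \sigma^2 \\
&\le \Big(1 - \frac{\eta_T \mu}{2}\Big) \mathbb{E} \big\| \bar{x}^{(T)} - x^* \big\|^2 + \frac{\eta_T^2}{N} \sigma^2
 + 2 \eta_T (2 L \eta_T - 1) \big( F(\bar{x}^{(T)}) - F(x^*) \big) \\
&\quad + 2 \eta_T \frac{L}{N} (\eta_T L + 1) \sum_{i=1}^{N} \mathbb{E} \big\| \hat{x}_i^{(T)} - \bar{x}^{(T)} \big\|^2.
\end{aligned}$$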

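Lemma 3.4.2 (Consensus recursion – $\mu$-convex). Under assumptions 2, 3, 4, 5, 6, 7, with $\rho_T = 1 - \lambda_{2,T}^2$ the spectral gap of the squared mixing matrix $M_T^2$, we have

$$\Xi_T \le \Big(1 - \frac{\rho_T}{2} + \frac{6 L^2}{\rho_T} \eta_{T-1}^2\Big) \Xi_{T-1} + \frac{12 L}{\rho_T} \eta_{T-1}^2 e_{T-1} + (1 - \rho_T)\, \eta_{T-1}^2 \sigma^2. \tag{3.4.5}$$

Furthermore, if we use a fixed learning rate $\eta \le \frac{\rho}{2 \sqrt{6} L}$ and we define a series of weights $\{w_t\}_{t \ge 0} \subset \mathbb{R}_+$ such that $w_{t+1} \le w_t \big(1 + \frac{\rho}{8}\big)$, we can bound

$$\sum_{t=0}^{T} w_t \Xi_t \le \frac{96 L}{\rho^2} \eta^2 \sum_{t=0}^{T} w_t e_t + 4 W_T \Big(\frac{1}{\rho} - 1\Big) \eta^2 \sigma^2.$$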

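Proof. Using Lemma 3.2.13 with $y = \bar{x}^{(T-1)}$, and writing $\bar{X}^{(t)} := \bar{x}^{(t)} \mathbb{1}^\top$, we have that

$$N \Xi_T = \sum_{i=1}^{N} \mathbb{E} \big\| \hat{x}_i^{(T)} - \bar{x}^{(T)} \big\|^2 \le \lambda_{2,T}^2 \sum_{i=1}^{N} \mathbb{E} \big\| x_i^{(T)} - \bar{x}^{(T-1)} \big\|^2 = \lambda_{2,T}^2\, \mathbb{E} \big\| X^{(T)} - \bar{X}^{(T-1)} \big\|_F^2.$$

We observe that for any $t \ge 0$

$$\begin{aligned}
\mathbb{E} \big\| X^{(t+1)} - \bar{X}^{(t)} \big\|_F^2
&= \mathbb{E} \big\| X^{(t)} M_t - \bar{X}^{(t)} - \eta_t \bar{G}^{(t)} + \eta_t \Delta_t \big\|_F^2 \\
&\le \mathbb{E} \big\| X^{(t)} M_t - \bar{X}^{(t)} - \eta_t \bar{G}^{(t)} \big\|_F^2 + \eta_t^2 N \sigma^2 \\
&\overset{(3.2.13)}{\le} (1 + \alpha)\, \mathbb{E} \big\| X^{(t)} M_t - \bar{X}^{(t)} \big\|_F^2 + (1 + \alpha^{-1})\, \eta_t^2\, \mathbb{E} \big\| \bar{G}^{(t)} \big\|_F^2 + \eta_t^2 N \sigma^2.
\end{aligned}$$

Then, let $\rho_T = 1 - \lambda_{2,T}^2$ and fix $\alpha = \rho_T / 2$. Note that $\rho_T \le 1$ and thus $1 + \frac{2}{\rho_T} \le \frac{3}{\rho_T}$. Using the same inequalities as in (3.4.4) to bound $\|\bar{G}^{(t)}\|_F^2$, we get

$$\begin{aligned}
\mathbb{E} \big\| X^{(T)} - \bar{X}^{(T-1)} \big\|_F^2
&\le \Big(1 + \frac{\rho_T}{2} + \frac{6 L^2}{\rho_T} \eta_{T-1}^2\Big) \mathbb{E} \big\| X^{(T-1)} M_{T-1} - \bar{X}^{(T-1)} \big\|_F^2 \\
&\quad + \frac{12 L N}{\rho_T} \eta_{T-1}^2 \big( F(\bar{x}^{(T-1)}) - F(x^*) \big) + N \eta_{T-1}^2 \sigma^2.
\end{aligned}$$

We obtain (3.4.5) by multiplying by $(1 - \rho_T)/N$ and noting that $(1 - \rho_T)/\rho_T \le 1/\rho_T$, since $\rho_T > 0$, and that $(1 - \rho_T)(1 + \rho_T/2) \le 1 - \rho_T/2$.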

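For the second statement, note that $\Xi_0 = \mathbb{E} \|x_0 - \bar{x}_0\|^2 = 0$ since all nodes start at the same point. Using a fixed learning rate $\eta \le \frac{\rho}{2 \sqrt{6} L}$ and supposing we have a lower bound $\rho$ for all spectral gaps, we can unroll the consensus recursion (3.4.5) into

$$\begin{aligned}
\Xi_T &\le \Big(1 - \frac{\rho}{2} + \frac{6 L^2}{\rho} \eta^2\Big) \Xi_{T-1} + \frac{12 L}{\rho} \eta^2 e_{T-1} + (1 - \rho) \eta^2 \sigma^2 \\
&\le \Big(1 - \frac{\rho}{2} + \frac{6 L^2}{\rho} \eta^2\Big)^{T} \Xi_0 + \frac{12 L}{\rho} \eta^2 \sum_{t=1}^{T} \Big(1 - \frac{\rho}{2} + \frac{6 L^2}{\rho} \eta^2\Big)^{T-t} e_{t-1} \\
&\quad + (1 - \rho) \eta^2 \sigma^2 \sum_{t=1}^{T} \Big(1 - \frac{\rho}{2} + \frac{6 L^2}{\rho} \eta^2\Big)^{T-t} \\
&\le \frac{12 L}{\rho} \eta^2 \sum_{t=1}^{T} \Big(1 - \frac{\rho}{4}\Big)^{T-t} e_t + (1 - \rho) \eta^2 \sigma^2 \sum_{t=1}^{T} \Big(1 - \frac{\rho}{4}\Big)^{T-t} \\
&= \frac{12 L}{\rho} \eta^2 \sum_{t=1}^{T} \Big(1 - \frac{\rho}{4}\Big)^{T-t} e_t + 4 \eta^2 \sigma^2 \Big(\frac{1}{\rho} - 1\Big) \Big(1 - \Big(1 - \frac{\rho}{4}\Big)^{T}\Big) \\
&\le \frac{12 L}{\rho} \eta^2 \sum_{t=1}^{T} \Big(1 - \frac{\rho}{4}\Big)^{T-t} e_t + 4 \eta^2 \sigma^2 \Big(\frac{1}{\rho} - 1\Big),
\end{aligned}$$

where the third inequality uses $\eta \le \frac{\rho}{2 \sqrt{6} L}$, which gives $\frac{6 L^2}{\rho} \eta^2 \le \frac{\rho}{4}$.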

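Now, letting $\{w_t\}_{t \ge 0}$ be as in the statement, we have $w_t \le w_j \big(1 + \frac{\rho}{8}\big)^{t-j}$. Defining $W_T = \sum_{t=0}^{T} w_t$, we obtain

$$\begin{aligned}
\sum_{t=0}^{T} w_t \Xi_t
&\le \sum_{t=0}^{T} w_t \Big( \frac{12 L}{\rho} \eta^2 \sum_{k=1}^{t} \Big(1 - \frac{\rho}{4}\Big)^{t-k} e_k + 4 \Big(\frac{1}{\rho} - 1\Big) \eta^2 \sigma^2 \Big) \\
&= \underbrace{\frac{12 L}{\rho} \eta^2 \sum_{t=0}^{T} \sum_{k=1}^{t} \Big(1 - \frac{\rho}{4}\Big)^{t-k} w_t e_k}_{\mathcal{T}} + 4 W_T \Big(\frac{1}{\rho} - 1\Big) \eta^2 \sigma^2.
\end{aligned}$$

Focusing on the first term $\mathcal{T}$, we get

$$\begin{aligned}
\mathcal{T} &\le \frac{12 L}{\rho} \eta^2 \sum_{t=0}^{T} \sum_{k=1}^{t} \Big(1 - \frac{\rho}{8}\Big)^{t-k} w_k e_k
= \frac{12 L}{\rho} \eta^2 \sum_{k=0}^{T} w_k e_k \sum_{t=k+1}^{T} \Big(1 - \frac{\rho}{8}\Big)^{t-k} \\
&\le \frac{12 L}{\rho} \eta^2 \sum_{k=0}^{T} w_k e_k \sum_{t=0}^{\infty} \Big(1 - \frac{\rho}{8}\Big)^{t}
= \frac{96 L}{\rho^2} \eta^2 \sum_{k=0}^{T} w_k e_k,
\end{aligned}$$

which proves the desired result.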

Theorem 3.4.3 (Average convergence rate – µ-convex). Let assumptions 2, 3, 4, 5, 6, 7 hold. Denote by $\bar{x}^{(t)}$ the average of the parameters across all nodes at iteration $t$ and let $x^{(0)}$ be the common starting point. Then, taking a fixed learning rate $\eta_t = \eta \le \frac{\rho}{8\sqrt{6}L}$, we have the following convergence rate:

\[
\frac{1}{W_T}\sum_{t=0}^{T} w_t e_t + \frac{\mu}{2} r_{T+1}
\le \frac{r_0}{\eta}\exp\left\{-(T+1)\frac{\eta\mu}{2}\right\} + \eta\left(\frac{1-\rho}{2} + \frac{1}{N}\right)\sigma^2 , \tag{3.4.6}
\]

where $w_t = \left(1 - \frac{\eta\mu}{2}\right)^{-(t+1)}$ is a sequence of weights.

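Corollary. Under Theorem 3.4.3 conditions, with a fixed learning rate $\eta \le \min\left\{\frac{2\ln(T^2)}{\mu T}, \frac{\rho}{8\sqrt{6}L}\right\}$, we have

\[
\frac{1}{W_T}\sum_{t=0}^{T} w_t e_t + \frac{\mu}{2} r_{T+1}
\le \mathcal{O}\left(\frac{r_0}{T} + \left(\frac{1-\rho}{2} + \frac{1}{N}\right)\frac{\sigma^2}{T}\right), \tag{3.4.7}
\]

where the $\mathcal{O}$ notation hides polylogarithmic terms.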
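To see where the $1/T$ rate comes from, the following short computation plugs $\eta = \frac{2\ln(T^2)}{\mu T}$ into (3.4.6). It is a sketch under two assumptions that are not spelled out in the statement: $T \ge 2$, and $T$ large enough that this value of $\eta$ does not exceed $\frac{\rho}{8\sqrt{6}L}$. The constant $\mu$ is treated as fixed.

\[
\begin{aligned}
\exp\left\{-(T+1)\frac{\eta\mu}{2}\right\} &= \exp\left\{-\frac{T+1}{T}\ln(T^2)\right\} \le \exp\left\{-\ln(T^2)\right\} = \frac{1}{T^2}, \\
\frac{r_0}{\eta}\exp\left\{-(T+1)\frac{\eta\mu}{2}\right\} &\le \frac{\mu T\, r_0}{2\ln(T^2)}\cdot\frac{1}{T^2} = \frac{\mu\, r_0}{4\,T\ln T}, \\
\eta\left(\frac{1-\rho}{2} + \frac{1}{N}\right)\sigma^2 &= \frac{4\ln T}{\mu T}\left(\frac{1-\rho}{2} + \frac{1}{N}\right)\sigma^2 .
\end{aligned}
\]

Both terms decay as $1/T$ up to factors of $\ln T$, which is what the $\mathcal{O}$ notation in (3.4.7) hides.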

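Proof of 3.4.3. We start by rearranging (3.4.1), with fixed $\eta$, to

\[
\begin{aligned}
r_{T+1} &\le \left(1 - \frac{\eta\mu}{2}\right) r_T + 2\eta(2L\eta - 1)e_T + 2\eta L(\eta L + 1)\Xi_T + \frac{\eta^2}{N}\sigma^2 \\
\Rightarrow\quad e_T &\le \frac{1 - \frac{\eta\mu}{2}}{2\eta} r_T - \frac{1}{2\eta} r_{T+1} + 2L\eta e_T + (\eta L^2 + L)\Xi_T + \frac{\eta}{2N}\sigma^2 \\
\Rightarrow\quad w_T e_T &\le \frac{1 - \frac{\eta\mu}{2}}{2\eta} w_T r_T - \frac{w_T}{2\eta} r_{T+1} + 2L\eta w_T e_T + (\eta L^2 + L)w_T\Xi_T + \frac{\eta}{2N} w_T\sigma^2 .
\end{aligned}
\]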

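Note that, with $w_t = \left(1 - \frac{\eta\mu}{2}\right)^{-(t+1)}$ and $W_T = \sum_{t=0}^{T} w_t$, we obtain a telescoping sum by summing the above inequality over $t$. If we also divide by $W_T$, we obtain

\[
\begin{aligned}
\frac{1}{W_T}\sum_{t=0}^{T} w_t e_t
&\le \frac{1}{2\eta W_T}\left(r_0 - w_T r_{T+1}\right) + \frac{2L\eta}{W_T}\sum_{t=0}^{T} w_t e_t + \frac{\eta L^2 + L}{W_T}\sum_{t=0}^{T} w_t\Xi_t + \frac{\eta}{2N}\sigma^2 \\
&\overset{(3.4.5)}{\le} \frac{1}{2\eta W_T}\left(r_0 - w_T r_{T+1}\right) + \frac{2L\eta}{W_T}\sum_{t=0}^{T} w_t e_t + \frac{\eta}{2N}\sigma^2 \\
&\quad + \frac{\eta L^2 + L}{W_T}\left(\frac{96L}{\rho^2}\eta^2\sum_{t=0}^{T} w_t e_t + 4\eta^2\sigma^2\left(\frac{1}{\rho} - 1\right)W_T\right) \\
&\le \frac{1}{2\eta W_T}\left(r_0 - w_T r_{T+1}\right) + \frac{2L\eta}{W_T}\sum_{t=0}^{T} w_t e_t + \frac{\eta}{2N}\sigma^2 \\
&\quad + \frac{96\eta^3L^3 + 96\eta^2L^2}{W_T\rho^2}\sum_{t=0}^{T} w_t e_t + 4\left(\eta^3L^2 + \eta^2L\right)\left(\frac{1}{\rho} - 1\right)\sigma^2 .
\end{aligned}
\]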

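Imposing $\eta \le \frac{\rho}{8\sqrt{6}L}$, we see that the coefficients in front of $\sum_{t=0}^{T} w_t e_t$ sum to less than $\frac{1}{2W_T}$ and we can further bound by

\[
\frac{1}{W_T}\sum_{t=0}^{T} w_t e_t
\le \frac{1}{2\eta W_T}\left(r_0 - w_T r_{T+1}\right) + \frac{1}{2W_T}\sum_{t=0}^{T} w_t e_t + \frac{\eta}{2}\left(\frac{(\rho+16)(1-\rho)}{48} + \frac{1}{N}\right)\sigma^2 ,
\]

which implies

\[
\begin{aligned}
\frac{1}{W_T}\sum_{t=0}^{T} w_t e_t + \frac{w_T}{\eta W_T} r_{T+1}
&\le \frac{1}{\eta W_T} r_0 + \eta\left(\frac{(\rho+16)(1-\rho)}{48} + \frac{1}{N}\right)\sigma^2 \\
&\le \frac{1}{\eta W_T} r_0 + \eta\left(\frac{1-\rho}{2} + \frac{1}{N}\right)\sigma^2 .
\end{aligned}
\]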

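Noting that $W_T = \sum_{t=0}^{T} w_t \le \frac{2w_T}{\eta\mu}$ and $W_T \ge w_T = (1 - \eta\mu/2)^{-(T+1)}$, we get

\[
\begin{aligned}
\frac{1}{W_T}\sum_{t=0}^{T} w_t e_t + \frac{\mu}{2} r_{T+1}
&\le \frac{\left(1 - \frac{\eta\mu}{2}\right)^{T+1}}{\eta} r_0 + \eta\left(\frac{1-\rho}{2} + \frac{1}{N}\right)\sigma^2 \\
&\le \frac{r_0}{\eta}\exp\left\{-(T+1)\frac{\eta\mu}{2}\right\} + \eta\left(\frac{1-\rho}{2} + \frac{1}{N}\right)\sigma^2 .
\end{aligned}
\]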

3.4.2 Non-convex case

Throughout this section, we use the following notation:

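\[
\begin{aligned}
\Xi_T &= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left\|x_i^{(T)} - \bar{x}^{(T)}\right\|^2 , \\
r_T &= \mathbb{E}\left[f(\bar{x}^{(T)}) - f(x^*)\right] = F(\bar{x}^{(T)}) - F(x^*), \\
e_T &= \left\|\nabla F(\bar{x}^{(T)}, \xi)\right\|^2 ,
\end{aligned}
\]

where we recall that $f(x) = f(x, \xi)$ is a shorthand for a stochastic function on the random vector $\xi$ and $F(x) = \mathbb{E}_\xi f(x, \xi)$ is its mean. In particular $f_i(x) = f(x, \xi_i)$ with $\xi_i \perp \xi_j$ for $i \ne j$.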

Lemma 3.4.4 (Error recursion - Non-convex). Let assumptions 2, 3, 4, 5, 6 hold. The average of the parameters at iteration $T$ produced by De-SGD with constant learning rate $\eta$ satisfies

\[
r_{T+1} \le r_T + \left(L\eta^2 - \frac{\eta}{2}\right)e_T + \frac{L^2\eta + 2L^3\eta^2}{2}\Xi_T + \frac{L\eta^2}{2N}\sigma^2 . \tag{3.4.8}
\]


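Proof. We expand the update from (3.3.6), with fixed $\eta$, and observe

\[
\begin{aligned}
F(\bar{x}^{(T+1)}) &= \mathbb{E}_\xi\left[f\left(\left(X^{(T)}M_T - \eta G^{(T)} + \eta\Delta^{T}\right)\frac{\mathbf{1}}{N}, \xi\right)\right] \\
\text{(by double stoch.)}\quad &= \mathbb{E}_\xi\left[f\left(\bar{x}^{(T)} - \eta G^{(T)}\frac{\mathbf{1}}{N} + \eta\delta^{T}, \xi\right)\right] \\
&\overset{(3.2.6)}{\le} \mathbb{E}_\xi\left[f(\bar{x}^{(T)}, \xi) + \eta\left\langle\nabla f(\bar{x}^{(T)}, \xi), -G^{(T)}\frac{\mathbf{1}}{N} + \delta^{T}\right\rangle + \frac{L\eta^2}{2}\left\|-G^{(T)}\frac{\mathbf{1}}{N} + \delta^{T}\right\|^2\right] \\
&= F(\bar{x}^{(T)}) + \eta\underbrace{\left\langle\mathbb{E}_\xi\nabla f(\bar{x}^{(T)}, \xi), -G^{(T)}\frac{\mathbf{1}}{N}\right\rangle}_{T_1} + \frac{L\eta^2}{2}\underbrace{\mathbb{E}\left\|-G^{(T)}\frac{\mathbf{1}}{N} + \delta^{T}\right\|^2}_{T_2} ,
\end{aligned}
\]

where in the last equality we used that $\mathbb{E}[\delta^{T}_i] = 0$. From now on, we will write $\mathbb{E}\nabla f(x)$ instead of $\mathbb{E}_\xi\nabla f(x, \xi)$, to simplify the notation and to emphasize that we are not interested in the stochasticity of the function $f$, but in that of its gradient.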

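We add and subtract $\mathbb{E}\nabla f(\bar{x}^{(T)})$ to the second term $T_1$ and we obtain

\[
\begin{aligned}
T_1 &= \mathbb{E}\left\langle\mathbb{E}\nabla f(\bar{x}^{(T)}), -G^{(T)}\frac{\mathbf{1}}{N} + \delta^{T} + \mathbb{E}\nabla f(\bar{x}^{(T)}) - \mathbb{E}\nabla f(\bar{x}^{(T)})\right\rangle \\
&= -\left\|\mathbb{E}\nabla f(\bar{x}^{(T)})\right\|^2 + \left\langle\mathbb{E}\nabla f(\bar{x}^{(T)}), \mathbb{E}\nabla f(\bar{x}^{(T)}) - \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\nabla f(x_i^{(T)})\right\rangle \\
&\overset{(3.2.11)}{\le} -\frac{1}{2}\left\|\mathbb{E}\nabla f(\bar{x}^{(T)})\right\|^2 + \frac{1}{2N^2}\left\|\sum_{i=1}^{N}\left(\mathbb{E}\nabla f(\bar{x}^{(T)}) - \mathbb{E}\nabla f(x_i^{(T)})\right)\right\|^2 \\
&\le -\frac{1}{2}\left\|\mathbb{E}\nabla f(\bar{x}^{(T)})\right\|^2 + \frac{L^2}{2N}\sum_{i=1}^{N}\left\|\bar{x}^{(T)} - x_i^{(T)}\right\|^2 ,
\end{aligned}
\]

where in the last inequality we use (3.2.12) and $L$-smoothness of $F$.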


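For the term $T_2$, we observe

\[
\begin{aligned}
T_2 &= \mathbb{E}\left\|\delta^{T}\right\|^2 + \mathbb{E}\left\|G^{(T)}\frac{\mathbf{1}}{N}\right\|^2
\le \frac{\sigma^2}{N} + \left\|\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\nabla f(x_i^{(T)}) \pm \mathbb{E}\nabla f(\bar{x}^{(T)})\right\|^2 \\
&\overset{(3.2.12),\,(3.2.13)}{\le} \frac{\sigma^2}{N} + 2\left\|\mathbb{E}\nabla f(\bar{x}^{(T)})\right\|^2 + \frac{2}{N}\sum_{i=1}^{N}\left\|\mathbb{E}\nabla f(x_i^{(T)}) - \mathbb{E}\nabla f(\bar{x}^{(T)})\right\|^2 \\
&\overset{(3.2.5)}{\le} \frac{\sigma^2}{N} + 2\left\|\mathbb{E}\nabla f(\bar{x}^{(T)})\right\|^2 + \frac{2L^2}{N}\sum_{i=1}^{N}\mathbb{E}\left\|\bar{x}^{(T)} - x_i^{(T)}\right\|^2 .
\end{aligned}
\]

We combine the two bounds and conclude the proof by subtracting $F(x^*)$ from both sides of the inequality

\[
F(\bar{x}^{(T+1)}) \le F(\bar{x}^{(T)}) + \left(L\eta^2 - \frac{\eta}{2}\right)\left\|\mathbb{E}\nabla f(\bar{x}^{(T)})\right\|^2 + \frac{L^2\eta + 2L^3\eta^2}{2N}\sum_{i=1}^{N}\left\|\bar{x}^{(T)} - x_i^{(T)}\right\|^2 + \frac{L\eta^2}{2N}\sigma^2 .
\]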

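Lemma 3.4.5 (Consensus convergence - Non-convex). Let assumptions 2, 3, 4, 5, 6 hold. With $\rho_T = 1 - \lambda_{T,2}^2$ the spectral gap of the squared mixing matrix $M_T^2$, we have

\[
\Xi_T \le \left(1 - \frac{\rho_T}{2} + \frac{6L^2\eta^2}{\rho_T}\right)\Xi_{T-1} + \frac{6\eta^2}{\rho_T} e_{T-1} + (1 - \rho_T)\eta^2\sigma^2 . \tag{3.4.9}
\]

Furthermore, if we use a fixed learning rate $\eta \le \frac{\rho}{2\sqrt{6}L}$, with $\rho$ a lower bound on the spectral gaps, and we define a series of weights $\{w_t\}_{t\ge 0} \subset \mathbb{R}_+$ such that $w_{t+1} \le w_t\left(1 + \frac{\rho}{8}\right)$, we can bound

\[
\sum_{t=0}^{T} w_t\Xi_t \le \frac{48L}{\rho^2}\eta^2\sum_{t=0}^{T} w_t e_t + 4\eta^2\sigma^2\left(\frac{1}{\rho} - 1\right)W_T . \tag{3.4.10}
\]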

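Proof. As in the proof of Lemma 3.4.2, we rewrite

\[
N\Xi_T = (1 - \rho_T)\,\mathbb{E}\left\|X^{(T)} - \bar{X}^{(T-1)}\right\|_F^2
\overset{(3.2.13),\,\alpha = \frac{\rho_T}{2}}{\le} \left(1 - \frac{\rho_T}{2}\right)N\Xi_{T-1} + \frac{3}{\rho_T}\eta^2\,\mathbb{E}\left\|G^{(T-1)}\right\|_F^2 + (1 - \rho_T)\eta^2 N\sigma^2 .
\]

Then, we add and subtract $\mathbb{E}\nabla f(\bar{x}^{(T-1)})$ to the second term and compute

\[
\begin{aligned}
\mathbb{E}\left\|G^{(T-1)}\right\|_F^2 &= \sum_{i=1}^{N}\mathbb{E}\left\|\mathbb{E}\nabla f(x_i^{(T-1)})\right\|^2 \\
&\le 2\sum_{i=1}^{N}\left\|\mathbb{E}\nabla f(\bar{x}^{(T-1)})\right\|^2 + 2\sum_{i=1}^{N}\mathbb{E}\left\|\nabla f(x_i^{(T-1)}) - \nabla f(\bar{x}^{(T-1)})\right\|^2 \\
&\overset{(3.2.5)}{\le} 2N\left\|\mathbb{E}\nabla f(\bar{x}^{(T-1)})\right\|^2 + 2L^2\sum_{i=1}^{N}\mathbb{E}\left\|x_i^{(T-1)} - \bar{x}^{(T-1)}\right\|^2 .
\end{aligned}
\]

Combining those two inequalities and dividing by $N$ we prove (3.4.9).

The proof of (3.4.10) is identical to the second part of the proof of Lemma 3.4.2, except for a factor of 2 in front of $e_t$.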

Theorem 3.4.6 (Average parameters recursion – Non-convex). Let assumptions 2, 3, 4, 5, 6 hold. Denote by $\bar{x}^{(t)}$ the average of the parameters across all nodes at iteration $t$ and let $x^{(0)}$ be the common starting point. Then, taking a fixed learning rate $\eta \le \min\left\{\frac{\rho}{12\sqrt{2}L}, \frac{\rho}{12\sqrt{2}L^{1.5}}\right\}$, we have the following convergence rate:

\[
\frac{1}{T+1}\sum_{t=0}^{T} e_t \le \frac{4r_0}{(T+1)\eta} + 2L\eta\left(\frac{1}{N} + \frac{1-\rho}{3}\right)\sigma^2 . \tag{3.4.11}
\]

Corollary. Under Theorem 3.4.6 conditions, with fixed learning rate $\eta = \frac{1}{\sqrt{T+1}}$, we have

\[
\frac{1}{T+1}\sum_{t=0}^{T} e_t \le \mathcal{O}\left(\frac{r_0}{\sqrt{T+1}} + \left(\frac{1}{N} + \frac{1-\rho}{3}\right)\frac{\sigma^2}{\sqrt{T+1}}\right), \tag{3.4.12}
\]

where the $\mathcal{O}$ notation hides polylogarithmic terms.

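Proof. Let us rearrange (3.4.8) as follows:

\[
\begin{aligned}
r_{T+1} &\le r_T + \left(L\eta^2 - \frac{\eta}{2}\right)e_T + \frac{L^2\eta + 2L^3\eta^2}{2}\Xi_T + \frac{L\eta^2}{2N}\sigma^2 \\
\Rightarrow\quad e_T &\le \frac{2}{\eta}\left(r_T - r_{T+1}\right) + 2L\eta e_T + \left(L^2 + 2L^3\eta\right)\Xi_T + \frac{L\eta}{N}\sigma^2 .
\end{aligned}
\]

We sum over $t$ and divide by $2(T+1)$ to observe

\[
\begin{aligned}
\frac{1}{2(T+1)}\sum_{t=0}^{T} e_t
&\le \frac{r_0 - r_{T+1}}{(T+1)\eta} + \frac{L\eta}{T+1}\sum_{t=0}^{T} e_t + \frac{L^2 + 2L^3\eta}{2(T+1)}\sum_{t=0}^{T}\Xi_t + \frac{L\eta}{2N}\sigma^2 \\
&\overset{(3.4.10)}{\le} \frac{r_0 - r_{T+1}}{(T+1)\eta} + \frac{L\eta}{T+1}\sum_{t=0}^{T} e_t + \frac{L\eta}{2N}\sigma^2 \\
&\quad + \frac{L^2 + 2L^3\eta}{2(T+1)}\left(\frac{48L}{\rho^2}\eta^2\sum_{t=0}^{T} e_t + 4\eta^2\sigma^2\left(\frac{1}{\rho} - 1\right)(T+1)\right) \\
&\le \frac{r_0 - r_{T+1}}{(T+1)\eta} + \frac{L\eta}{T+1}\sum_{t=0}^{T} e_t + \frac{L\eta}{2N}\sigma^2 \\
&\quad + \frac{24L^3\eta^2 + 48L^4\eta^3}{(T+1)\rho^2}\sum_{t=0}^{T} e_t + 2\left(L^2\eta^2 + 2L^3\eta^3\right)\left(\frac{1}{\rho} - 1\right)\sigma^2 \\
&\le \frac{r_0}{(T+1)\eta} + \frac{L\eta\rho^2 + 24L^3\eta^2 + 48L^4\eta^3}{\rho^2(T+1)}\sum_{t=0}^{T} e_t \\
&\quad + L\eta\left(\frac{1}{2N} + 2\left(L\eta + 2L^2\eta^2\right)\left(\frac{1}{\rho} - 1\right)\right)\sigma^2 .
\end{aligned}
\]

With $\eta \le \min\left\{\frac{\rho}{12\sqrt{2}L}, \frac{\rho}{12\sqrt{2}L^{1.5}}\right\}$ the coefficient in front of $\sum_{t=0}^{T} e_t$ on the right-hand side is smaller than $\frac{1}{4(T+1)}$ and thus

\[
\begin{aligned}
\frac{1}{4(T+1)}\sum_{t=0}^{T} e_t
&\le \frac{r_0}{(T+1)\eta} + L\eta\left(\frac{1}{2N} + \frac{\rho + 12}{72}(1-\rho)\right)\sigma^2 \\
&\le \frac{r_0}{(T+1)\eta} + L\eta\left(\frac{1}{2N} + \frac{1-\rho}{5}\right)\sigma^2 .
\end{aligned}
\]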

We multiply on both sides by 4 and, noticing that 2/5 < 1/3, we conclude.

3.5 Discussion on Byzantine setting

In the previous section, we derived linear and sublinear convergence bounds for the average parameters in a Byzantine-free setting, with respectively a strongly convex and a smooth objective. Ideally, we would like to integrate Byzantine agents and robustness directly into the above proof. However, as we shall see, this is highly non-trivial. In fact, no existing characterization of robustness fits in this setting.

We discuss the consensus variation $\Xi_T$ in the strongly convex setting, characterized in Lemma 3.4.2. It generalizes as

\[
\Xi_T = \frac{1}{N}\sum_{i\in R}\left\|x_i^{(T)} - \bar{x}^{(T)}\right\|^2 = \frac{1}{N}\left\|\mathrm{Aggr}(X^{(T)}) - \bar{X}^{(T)}\right\|_F^2 ,
\]

where we would like to approximate Aggr with a stochastic matrix $M_T$.

If we were able to multiplicatively bound this Frobenius norm with a term only depending on a linear combination of the good nodes, the proof of Lemma 3.4.2 would still hold, with the only difference being an additional constant factor on the right-hand side of (3.4.5).

With the previous remark, we would give the following definition of robustness, which is a decentralized version of the one presented in [6].

Definition 3.5.1 (Robust aggregation - Frobenius). Let $\mathrm{Aggr} : \mathbb{R}^{d\times N'} \to \mathbb{R}^{d\times N}$ be an aggregation function to be used in Robust DeSGD. We say that Aggr is $(C, \delta)$-robust if there exists a constant $C \ge 0$ such that

\[
\mathbb{E}\left\|\mathrm{Aggr}(X^{(t)}) - X_R^{(t)}\frac{\mathbf{1}\mathbf{1}^T}{N}\right\|_F^2 \le C\delta_{b,N}\,\mathbb{E}\left\|X_R^{(t)}M_t - X_R^{(t)}\frac{\mathbf{1}\mathbf{1}^T}{N}\right\|_F^2 , \tag{3.5.1}
\]

for a stochastic matrix $M_t$, depending on $X^{(t)}$, and a constant $\delta_{b,N}$, depending on the number of Byzantine agents $b$ and regular nodes $N$.

To state it differently, Aggr is robust if the sum of the squared distances (Frobenius norm) between the aggregated parameters and the average of the regular points is upper bounded by a constant $C\delta_{b,N}$ times the squared distances between the same average and a linear combination of the regular ones.

Definition 3.5.1 is an intuitive characterization of a robust aggregation function, but is not yet compatible with our proof strategy. In the proof of Lemma 3.4.1 (Error recursion – µ-convex), we have another problem with Aggr.

In (3.4.2), which characterizes $\mathbb{E}\left\|\bar{x}^{(T+1)} - x^*\right\|^2$, we use double stochasticity of $M_T$ to preserve the arithmetic mean, so that only the average parameter vector from the previous iteration, $\bar{x}^{(T)}$, appears. In the Byzantine setting, we should therefore add and subtract this average term, to have it together with $\mathrm{Aggr}(X^{(T)})$.

If we were to use (3.2.13) to split the norm, this would introduce a leading $(1+\alpha)\Xi_T$ addend in the RHS of (3.4.1). This term would not be scaled by the learning rate, and thus the recursion would diverge in $T$, since $\alpha > 0$.

A different definition could require that, when taking the distance with a fixed point, our aggregation function is "well-behaved". This seems intuitive, but a multiplicative bound would be too restrictive. If we were to ask that there exists a constant $S$ such that, for any $X \in \mathbb{R}^{d\times N}$ and $v \in \mathbb{R}^d$, we had a matrix $M_X$ satisfying

\[
\left\|\mathrm{Aggr}(X^{(t)})\frac{\mathbf{1}}{N} - v\right\|^2 \le S\left\|X_R^{(t)}M_X\frac{\mathbf{1}}{N} - v\right\|^2 , \tag{3.5.2}
\]

it would be equivalent to asking that the average of Aggr is exactly the average of the linear combination. To prove it, we just need to choose $v = X_R^{(t)}M_X\frac{\mathbf{1}}{N}$ and see that the RHS is zero.

Still, even if we fixed these two lemmas, the general recursion that yields the two theorems would not work. In fact, unrolling the recursion, the newly introduced constants would grow exponentially in $T$.

The design of a new proof technique is out of the scope of this work, so we limit ourselves to pointing out the limitations of the current approach.

Chapter 4

Experimental analysis

In this chapter, we present some experiments to illustrate the convergence results that we proved in Chapter 3, and we discuss the influence of Byzantine agents. In particular, we show the effect on the convergence rates of different properties of the communication network. Then, we analyze how different algorithms perform against various adversarial agents.

We start with some experiments without any Byzantine agent. We test a fully connected graph on an increasing number of nodes, to show the convergence speedup. Then, we fix the number of workers and test graphs with lower connectivity, to discuss local parameter variability and convergence rates.

Afterwards, we introduce Byzantine agents and test them against some of the previously introduced algorithms. The focus is on the behavior of the learning curve, which we compare to single agents trained on their own.

We choose as our test case the MNIST handwritten-digit classification problem [11]. This use case has been extensively used as a benchmark to test learning algorithms and, as we reported in the review from Chapter 2, it appears in almost every paper on Byzantine-resilient optimization.

The task consists in identifying the digit, from 0 to 9, represented in a 28 × 28 grayscale picture. The dataset consists of 60,000 training samples and 10,000 test samples. Figure 4.1 illustrates a sample from the training dataset.

Our model consists of a Convolutional Neural Network [4] with three convolutional layers of kernel size 3 × 3, increasing the channels from 1 to 32 to 64 to 128, and two fully connected layers, with respective sizes 2048 × 128 and 128 × 10. We use the Rectified Linear Unit (ReLU) between every layer as activation function and, after the final layer, we use softmax to express the outputs as class probabilities. The design is loosely inspired by "LeNet-5", which appears in the original MNIST paper [11].

Figure 4.1: Sample of images from the MNIST dataset, along with their corresponding labels.

The parameters of the network are trained over 300 iterations to minimize the Negative Log-Likelihood, also known as Cross Entropy, using decentralized SGD and robust variants. We use a fixed learning rate η = 0.2 for the first 100 steps, then we halve it to η = 0.1.

At each iteration, every node computes the loss gradient over a minibatch of 32 random samples from the training set. The whole training dataset is randomly split among the nodes, so that no two agents share common images.
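The training setup above can be sketched in a few lines. This is a minimal illustration and not the code used for the experiments: the function names `learning_rate` and `split_dataset`, the zero-based step counter and the seed are assumptions.

```python
import random

def learning_rate(step):
    """Step schedule from the text: 0.2 for the first 100 steps, then 0.1.

    Steps are assumed to be counted from 0.
    """
    return 0.2 if step < 100 else 0.1

def split_dataset(num_samples, num_nodes, seed=0):
    """Randomly partition sample indices into disjoint shards, one per node."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    # Strided slicing gives disjoint shards that together cover every index.
    return [indices[k::num_nodes] for k in range(num_nodes)]

shards = split_dataset(num_samples=60000, num_nodes=20)
```

With 60,000 training images and 20 nodes, each shard holds 3,000 indices, from which a node would draw its minibatches of 32 samples.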

For reference, during training we compute the loss on a batch of 1600 samples from the test set and, after the 300 iterations of Robust DeSGD, we evaluate the models from every node on the full test set and compute their average accuracy. We will report, when relevant, the test loss evaluated with the average parameter vector $\bar{x}^{(t)}$, or the average of the test losses computed locally, $\frac{1}{N}\sum_{i=1}^{N} f(x_i^{(t)}, \xi_i^{\text{test}})$.

In every experiment, the optimization process is initialized with the same parameters $x^{(0)}$, so that the learning curves are comparable across different attacks and defences.

4.1 Fully connected graph

In this section, we present the results that we obtain in a Byzantine-free network, with a fully connected communication graph on $N$ nodes. At each step, the mixing is made through the full gossip matrix $M = \frac{\mathbf{1}\mathbf{1}^T}{N}$. In this case, the spectral gap of the matrix is $\rho = 1$, since it is a rank 1 matrix.

Note that this setting of Decentralized SGD is equivalent to Algorithm 2 (Federated SGD), for the full mixing enforces that every node computes the gradients on the same aggregated parameter vector $x_i^{(t)} = \bar{x}^{(t)}$.

[Figure 4.2, four panels plotting loss value against iteration (0 to 300); legend: Local, Average Test, Test at $\bar{x}^{(t)}$. (a) No communication: 20 independent nodes. (b) Fully connected on 5 nodes. (c) Fully connected on 10 nodes. (d) Fully connected on 20 nodes.]

Figure 4.2: Loss evolution for fully connected networks of different sizes. Blue lines represent the local loss evolution at every node; the orange line is the test loss evaluated at the consensus vector $\bar{x}^{(t)}$; and the green line is the average of the local test losses $\frac{1}{N}\sum_{i=1}^{N} f(x_i^{(t)}, \xi_i)$. The models score the following average test accuracies: (a) 79%; (b) 85.6%; (c) 87.7%; (d) 86.6%.
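A small sketch can make the effect of full gossip concrete: one mixing step with the matrix whose entries are all $1/N$ sends every node to the average of all parameter vectors. The helper names `full_gossip_matrix` and `mix`, and the toy vectors, are illustrative assumptions.

```python
def full_gossip_matrix(n):
    """Return the n x n mixing matrix with every entry equal to 1/n."""
    return [[1.0 / n] * n for _ in range(n)]

def mix(params, M):
    """Apply one mixing step: node i receives sum_j M[j][i] * x_j.

    params is a list of N parameter vectors (lists of floats).
    """
    n, d = len(params), len(params[0])
    return [
        [sum(M[j][i] * params[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]

params = [[1.0, 0.0], [3.0, 4.0], [5.0, 2.0]]
mixed = mix(params, full_gossip_matrix(3))  # every row is close to [3.0, 2.0]
```

After this step all nodes hold the same vector (up to floating-point rounding), so their next gradients are evaluated at a common point, as in the federated setting described above.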

Plugging our parameters in Theorem 3.4.6, we expect the convergence rate to scale as $\mathcal{O}\left(\frac{r_0}{\sqrt{T+1}} + \frac{\sigma^2}{N\sqrt{T+1}}\right)$. Figure 4.2 shows the loss evolution over the iterations for different graph sizes. We start by observing the behavior of a single agent learning without communication, illustrated in panel (a). We note that convergence is slow, and seems almost linear in $t$. Then, we observe the shift in the learning curve when passing to a fully connected graph on 5, 10 and 20 nodes, respectively in panels (b), (c) and (d).

There is an evident improvement passing from the local training to the collaboration with four other agents. A slight further improvement can be seen when increasing from five to ten and, then, to twenty agents.

[Figure 4.3, four panels plotting loss value against iteration (0 to 300). (a) Degree = 16. (b) Degree = 8. (c) Degree = 4. (d) Cycle graph, degree = 2.]

Figure 4.3: Loss evolution on regular graphs on 20 nodes with different degrees. Blue lines represent the local loss evolution at every node; the orange line is the test loss evaluated at the consensus vector $\bar{x}^{(t)}$; and the green line is the average of the local test losses $\frac{1}{N}\sum_{i=1}^{N} f(x_i^{(t)}, \xi_i)$. Average test accuracies: (a) 87%; (b) 86.6%; (c) 85.9%; (d) 85.7%.

Figure 4.2 also reports the evaluation of the test loss on the consensus vector $\bar{x}^{(t)}$ and the average of the local test losses. Those lines tell us that we should be careful in handling the convergence theorems from Chapter 3. We observe that the average local performance is always worse than the performance of the model evaluated with averaged parameters.

4.2 Reducing connectivity

In this section, we focus on a graph with twenty nodes and we present a scenario in which we gradually reduce the connectivity of the graph.

[Figure 4.4, two panels. (a) Communication graph. (b) Loss evolution: loss value against iteration (0 to 300); legend: Local, Average Test.]

Figure 4.4: Byzantine-free (15, .4)-ER graph trained with Gossip SGD. We have a regular learning curve. Average test accuracies: 85.8%.

We study regular graphs, which are graphs whose nodes all have the same degree, i.e. the same number of neighbors. We sample random regular graphs with degrees of 16, 8, 4, and 2. Note that the last case, i.e. with degree two, is a cycle. The fully connected graph on twenty nodes, considered in the previous section, is a particular case of a regular graph as well.

Figure 4.3 shows the loss evolution of these gradually less connected graphs. We see that convergence slows down slightly as the graph degree decreases. This is theoretically in line with Theorem 3.4.6. Even though the spectral gap of a $d$-regular random graph is not fully characterized, it has been proven that $1 - \rho$ behaves as $o(1/\sqrt{d})$ with high probability [16]. Therefore, the second addend on the convergence bound (3.4.11) is bounded by $\mathcal{O}\left(\left(\frac{1}{N} + \frac{1}{\sqrt{d}}\right)\frac{1}{\sqrt{T+1}}\right)$.

As in the previous section, Figure 4.3 shows that the average local performance is always worse than the performance of the model evaluated with averaged parameters.

4.3 Byzantine influence

In this section we investigate the influence of Byzantine agents on the learning process, with different aggregation rules.

Definition 4.3.1. An Erdos-Renyi graph G(n, p) is a graph on n nodes, constructed by connecting nodes randomly. Each edge is included in the graph with probability p, independently from every other edge.


For this experiment, we sample independent G(15, 0.4) graphs. In expectation, each node has (15 − 1) × 0.4 = 5.6 neighbors, i.e. about 6. Figure 4.4 shows a sample communication graph from such a network in panel (a). Panel (b) shows the result of running Gossip SGD on such a network without any Byzantine agent. Every regular node follows Algorithm 4 (Robust De-SGD), with a relevant aggregation function Aggr.
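Definition 4.3.1 translates directly into a sampling routine. The sketch below is illustrative only; the function names `sample_erdos_renyi` and `expected_degree`, the adjacency-set representation and the seed are assumptions, not the code used in the experiments.

```python
import random

def sample_erdos_renyi(n, p, seed=None):
    """Sample an undirected G(n, p) graph as a dict of neighbor sets."""
    rng = random.Random(seed)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # Each of the n(n-1)/2 possible edges is drawn independently.
            if rng.random() < p:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors

def expected_degree(n, p):
    """Expected number of neighbors of a node in G(n, p)."""
    return (n - 1) * p

graph = sample_erdos_renyi(15, 0.4, seed=1)
```

For n = 15 and p = 0.4, `expected_degree` gives 5.6, the expected number of regular neighbors used in the discussion of the attacks.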

Byzantine agents are added on top of this graph and are allowed to communicate with every node. Therefore, when discussing the influence of one Byzantine attacker, we deal with a graph on 16 nodes, which contains a G(15, 0.4) graph, whose nodes are all connected to the adversary. In this way, the number of expected good neighbors is unchanged, and we can focus only on the ratio of Byzantine over regular neighbors, which is roughly b/6 in expectation.

We propose two different attacks. Firstly, we use a "lazy" attack, where the Byzantine agents send a random sample from a multivariate standard Gaussian distribution. This Gaussian attack is not really adversarial, as it does not leverage any information on the state of the nodes, nor on the graph connectivity. Nonetheless, it is easily implemented and is a valid model for random failures in the system. Furthermore, if no countermeasure is taken, it is highly disruptive, to the point that it numerically breaks the optimization.

The second attack has been adapted from the paper "A Little Is Enough: Circumventing Defenses For Distributed Learning" [1] to fit in the decentralized setting. The idea is to estimate the mean and variance of the vectors shared by good workers, in order to send an arbitrary erroneous message which would go undetected by distance-based aggregators. In opposition to the previous attack, this one, which we name LittleIsEnough, has almost no effect against regular Gossip SGD, although it can damage robust optimization by taking the place of good workers in aggregation rules such as Krum or BRIDGE.
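The two attacks can be sketched as message generators. This is a hedged illustration rather than the implementation used in the experiments: the function names are hypothetical, and for LittleIsEnough both the shift factor `z` and the choice to subtract the standard deviation are assumptions; the text only states that the message is built from the estimated mean and variance of the good vectors.

```python
import random
from statistics import mean, pstdev

def gaussian_attack(dim, rng):
    """'Lazy' attack: a sample from a standard Gaussian in dimension dim."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def little_is_enough(honest_vectors, z=1.0):
    """Sketch of a LittleIsEnough-style message.

    Each coordinate is the honest mean shifted by z standard deviations,
    so the message stays close to the honest vectors. The value of z and
    the direction of the shift are illustrative assumptions.
    """
    return [mean(c) - z * pstdev(c) for c in zip(*honest_vectors)]

rng = random.Random(0)
noise_message = gaussian_attack(4, rng)
crafted_message = little_is_enough([[0.0, 2.0], [2.0, 2.0]], z=1.0)
```

The Gaussian message ignores the state of the network entirely, while the crafted one is computed from the honest vectors, which is why distance-based rules have a harder time rejecting it.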

4.3.1 Coordinate-wise median

The first aggregation function we discuss is also the simplest and consists in returning the coordinate-wise median of the candidate vectors:

\[
[\mathrm{Aggr}(X)]_l = \mathrm{median}\left([x_1]_l, \ldots, [x_{N'}]_l\right), \quad \forall l \in [d].
\]
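The rule above fits in a few lines of code. The sketch below is a minimal illustration with a hypothetical function name and toy vectors; it also shows why the rule needs an honest majority among the candidates.

```python
from statistics import median

def coordinate_wise_median(vectors):
    """Return the per-coordinate median of a list of equal-length vectors."""
    return [median(coords) for coords in zip(*vectors)]

honest = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]
outlier = [100.0, -100.0]

# Three honest vectors against two outliers: the median is an honest value.
with_majority = coordinate_wise_median(honest + [outlier, outlier])

# Two honest vectors against three outliers: the median is the outlier.
without_majority = coordinate_wise_median(honest[:2] + [outlier] * 3)
```

In the first call each coordinate of the result is taken from an honest vector; in the second the outliers form the majority and the result equals the outlier.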

Figure 4.5 shows the effect on this strategy of 3 and 5 Byzantine agents implementing the Gaussian attack. The median manages to resist three adversaries, but is shattered when it starts missing an honest majority. With 5 Byzantine agents, some nodes have, by random sampling, more adversarial neighbors than good ones, so that, even counting their own estimate, the majority is composed of Byzantine agents. In such conditions the median is not robust anymore, since it does not give priority to local information. Therefore, as we see in Figure 4.5(b), some nodes diverge and the overall optimization is compromised.

[Figure 4.5, two panels plotting loss value against iteration (0 to 300); legend: Local, Average Test. (a) 3 Byzantines vs 15 regular nodes. (b) 5 Byzantines vs 15 regular nodes.]

Figure 4.5: Robust De-SGD with coordinate-wise median aggregation, against Gaussian attack on a G(15, 0.4) graph. Average test accuracies: (a) 84.9%; (b) 73%.

[Figure 4.6, two panels (a) and (b) plotting loss value against iteration (0 to 300); legend: Local, Average Test.]

Figure 4.6: Robust De-SGD with coordinate-wise median aggregation, against LittleIsEnough attack on a G(15, 0.4) graph. Average test accuracies: (a) 88%; (b) 84.3%.

The LittleIsEnough attack manages, as expected, to deal more damage against the median. Figure 4.6 shows that two Byzantine nodes slightly slow down convergence, and a greater impairment is achieved with four. Even though the regular nodes do not diverge, the learning is slowed to the point that it looks like local SGD from Figure 4.2(a).

[Figure 4.7, two panels (a) and (b) plotting loss value against iteration (0 to 300); legend: Local, Average Test.]

Figure 4.7: Robust De-SGD with DKrum aggregation, against Gaussian attack on a G(15, 0.4) graph. Average test accuracies: (a) 83.7%; (b) 84.2%.

[Figure 4.8, two panels (a) and (b) plotting loss value against iteration (0 to 300); legend: Local, Average Test.]

Figure 4.8: Robust De-SGD with DKrum aggregation, against LittleIsEnough attack on a G(15, 0.4) graph. Average test accuracies: (a) 87.8%; (b) 74.4%.

4.3.2 Krum

Let us now consider Krum, the pioneering Byzantine-robust aggregation method introduced in [2]. This function assumes knowledge of an upper bound b on the number of Byzantine agents and, in its original design, requires that among the neighbors there are at least b + 2 good nodes. We slightly modify the implementation, so that it can deal with any number of Byzantine agents, and we impose that if the original condition is not satisfied, then a good node only works with its own estimate, thus performing local SGD. We call this variant DKrum.

This aggregation rule is very robust to the Gaussian attack, as we see in Figure 4.7. On the other hand, the LittleIsEnough attack has a strong effect on DKrum. Figure 4.8 shows that three adversaries manage to significantly slow down convergence, even more than 7 Gaussian attackers. With four Byzantine agents the algorithm becomes equivalent to, if not slightly worse than, local SGD, and five nodes manage to overflow the gradient computation. With more, the good nodes stop trusting external information and revert to local SGD.
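The following sketch shows one way the DKrum rule described above could look. It is not the thesis implementation: the function names are hypothetical, the candidate set is assumed to contain the node's own estimate, the fallback test is a literal reading of the "at least b + 2 good nodes" condition, and the score is the standard Krum score (sum of squared distances to the closest n − b − 2 other candidates).

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dkrum(own, received, b):
    """Krum-style selection with a local fallback (illustrative sketch).

    own: the node's current estimate; received: vectors from neighbors;
    b: assumed upper bound on the number of Byzantine neighbors.
    """
    candidates = [own] + list(received)
    n = len(candidates)
    # Fallback: fewer than b + 2 candidates can be good, so keep the local estimate.
    if n - b < b + 2:
        return own
    closest = n - b - 2

    def score(i):
        dists = sorted(
            squared_distance(candidates[i], candidates[j])
            for j in range(n) if j != i
        )
        return sum(dists[:closest])

    # Select the candidate with the smallest score.
    return candidates[min(range(n), key=score)]

own = [0.0, 0.0]
received = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [50.0, 50.0]]
selected = dkrum(own, received, b=1)
```

In this toy call the far-away vector has by far the largest score and is never selected; with a single received vector and b = 1 the function returns the local estimate, which corresponds to a step of local SGD.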

4.3.3 BRIDGE

As we mentioned in Chapter 2, BRIDGE is one of the first Byzantine-robust algorithms designed explicitly for a decentralized setting. As with DKrum, it assumes knowledge of an upper bound on the number of adversaries in the graph, and uses it to exclude extreme vectors.

It requires at least 2b + 1 neighbors to use external information, otherwise all incoming vectors are discarded and the node just performs local SGD. For this reason, and since the expected number of good neighbors for a regular node is 5.6, the algorithm probably reverts to local SGD with more than three adversaries in the network.
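A sketch of this screening rule is given below. It assumes that "excluding extreme vectors" is done by a coordinate-wise trimmed mean, dropping the b largest and b smallest received values in each coordinate and averaging the rest with the local value; this reading, like the function name, is an assumption and not a statement about the exact implementation used here.

```python
def bridge_trimmed_mean(own, received, b):
    """Coordinate-wise trimmed mean with a local fallback (illustrative sketch).

    own: the node's current estimate; received: vectors from neighbors;
    b: assumed upper bound on the number of Byzantine neighbors.
    """
    # Fewer than 2b + 1 neighbors: discard everything and keep the local estimate.
    if len(received) < 2 * b + 1:
        return own
    aggregated = []
    for l, own_value in enumerate(own):
        values = sorted(v[l] for v in received)
        kept = values[b:len(values) - b]  # drop the b smallest and b largest values
        kept.append(own_value)
        aggregated.append(sum(kept) / len(kept))
    return aggregated

own = [1.0]
received = [[0.0], [2.0], [1.0], [1000.0], [-1000.0]]
aggregated = bridge_trimmed_mean(own, received, b=1)
```

In the toy call the two extreme values are trimmed and the result is the average of 0, 1, 2 and the local value 1. With only two neighbors and b = 1 the function returns the local estimate, which is the isolation behavior discussed in this section.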

Figure 4.9 shows that the algorithm is over-restrictive in our setting. Even though the average convergence against three Gaussian adversaries is better than with Krum, against four Byzantine agents the strategy partially reverts to local SGD, as panel 4.9(b) shows that many nodes isolate themselves. BRIDGE performance is thus worse than DKrum in this setting.

The algorithm is even more impaired by the LittleIsEnough attack. Figure 4.10 shows that with one fewer adversarial agent we bring as much disruption as the Gaussian attack.

[Figure 4.9, two panels plotting loss value against iteration (0 to 300); legend: Local, Average Test. (b) 4 Byzantines vs 15 regular nodes. Regular nodes isolate themselves and revert to local SGD.]

Figure 4.9: Robust De-SGD with BRIDGE aggregation, against Gaussian attack on a G(15, 0.4) graph. Average test accuracies: (a) 85.5%; (b) 78.3%.

[Figure 4.10, two panels (a) and (b) plotting loss value against iteration (0 to 300); legend: Local, Average Test.]

Figure 4.10: Robust De-SGD with BRIDGE aggregation, against LittleIsEnough attack on a G(15, 0.4) graph. Average test accuracies: (a) 83.2%; (b) 79.4%.

Chapter 5

Limitations

In this chapter, we discuss the limitations of our analysis. In particular, we examine some of the assumptions that we forced into the theoretical study, without which we could not have proven convergence.

Then, we open the debate on the importance of understanding the local behavior of a decentralized optimization algorithm. We point out the generally overlooked tradeoff between security and computational speedup, which we encounter when we choose to have computers collaborating in a Byzantine-prone setting.

5.1 Theoretical assumptions

Symmetric mixing

Recall Assumption 4 on the mixing matrix, which requires that $M_t$ is symmetric for every $t \ge 0$. This assumption is needed by Lemma 3.2.13, to bound the distance of the mixed points to the average with respect to the original vectors and any other point. We also apply double stochasticity in the error and recursion proofs, to preserve the average operator.

Even though this assumption is required in previous analyses [8], it is hardly satisfied in practice. For instance, the adversary-free Gossip SGD has a nonsymmetric mixing matrix for all communication graphs which are not regular. In fact, considering two nodes $i$ and $j$, with $|N_i| = m$, $|N_j| = n$ and $m \ne n$, the mixing weight for the edge $(i, j)$ is $1/n$, while $(j, i)$ has a weight of $1/m$. The matrix $M_t$ is not symmetric, as $[M_t]_{ij} \ne [M_t]_{ji}$.

The assumption would be satisfied by fixing the weights for every $(i, j)$, $i \ne j$, to be $w = 1/\max_i\{|N_i|\}$, i.e. the inverse of the maximum degree, and choosing the self-weights so that the rows and columns sum to one. Nonetheless, this would be very limiting for graphs with much heterogeneity in their degrees. If a node $i$ had few neighbors, say three, while the most well connected had many, for instance 300, the weight given by $i$ to the incoming information would sum to 1/100, thus making the learning almost autonomous on node $i$. This scenario corresponds, for instance, to a cycle graph on 300 nodes, which are all additionally connected to the same node outside the cycle. It has a spectral gap of ≈ 1/300. In such a case, the nodes on the cycle perform an almost local SGD, while the single external node almost completely discards its own information.
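The max-degree construction and the cycle-plus-hub example can be reproduced with a short sketch. The helper names `max_degree_mixing` and `cycle_with_hub` and the list-of-lists matrix representation are illustrative assumptions.

```python
def max_degree_mixing(neighbors):
    """Build a symmetric mixing matrix with off-diagonal weight 1 / max degree.

    neighbors maps each node index 0..n-1 to the set of its neighbors.
    Self-weights are chosen so that every row sums to one.
    """
    n = len(neighbors)
    w = 1.0 / max(len(nb) for nb in neighbors.values())
    M = [[0.0] * n for _ in range(n)]
    for i, nb in neighbors.items():
        for j in nb:
            M[i][j] = w
        M[i][i] = 1.0 - w * len(nb)
    return M

def cycle_with_hub(m):
    """Cycle on nodes 0..m-1 (m >= 3), all also connected to hub node m."""
    neighbors = {i: {(i - 1) % m, (i + 1) % m, m} for i in range(m)}
    neighbors[m] = set(range(m))
    return neighbors

M = max_degree_mixing(cycle_with_hub(300))
incoming_weight_on_cycle = 1.0 - M[0][0]  # total weight given to neighbors
hub_self_weight = M[300][300]
```

On this graph a cycle node gives a total weight of 3/300 = 1/100 to incoming information, while the self-weight of the hub is zero (up to rounding), which matches the behavior described above.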

In the Byzantine setting, we should restrict ourselves to the subgraph generated by regular nodes, as it makes no sense to discuss incoming weights to the Byzantine nodes. In such a case, we should not forget about the weight that is given to Byzantine agents, which implies that the mixing matrix is not stochastic when restricted to a subgraph. We should therefore resort to approximations by symmetric matrices, as discussed in Section 3.3, but the accuracy of such approximations is out of the scope of this work.

Finally, it would be contradictory to require symmetric mixing matrices with directed graphs, which are therefore excluded, a priori, from our analysis.

Spectral gap

Assumption 5 on the positive spectral gap requires a separate discussion.

Focusing on undirected graphs, this assumption is always granted if the graph is connected, thanks to the Perron-Frobenius theorem. If the regular graph is disconnected, we can treat each connected component separately, and the theorems from Chapter 3 would still hold, even though the number of nodes $N$ would now be given by the nodes in the component.

If the function has many local minima, the average behavior has the same formulation as before, although it could happen that separate components converge to different points. This would not necessarily be a problem, if the function suboptimality at the local minima is comparable, as often happens in Deep Learning models [4]. Nonetheless, it is a concept that should be kept in mind, and the importance given to it should be reflected by the optimization algorithm.

For instance, in [13], the authors do not care about consensus over the optimal point, but only focus on performance. They claim positive results on connected networks whose nodes converged to very different models.

Figure 5.1: Example of a mixing graph in which the distance from the consensus point increases for a local node. The points are vectors in the space and edges represent nonzero mixing weights in $M$ between the corresponding nodes. The green point has nonzero mixing weights for its two neighbors, so the mixing brings it closer to them and, thus, further from the global average (red cross). It takes many iterations before the green point starts moving towards the mean, and even more for it to get closer to the average than its initial state.

5.2 On local convergence

It would be of great interest to obtain some finite time bounds on thedistances between the local parameters xi from the regular nodes and theoptimal ones. As discussed in Chapter 2, most of the literature focuses onasymptotic consensus, and ignore the importance of local solutions.

Even worse, no one seems to compare the quality of its strategy withthe one that only uses local information. Nonetheless, since the latter isequivalent to running different copies of Stochastic Gradient Descent, it hasthe same convergence guarantees, in spite of how many Byzantine agentsare in the graph and independently from any attack they implement.

We can consider a motivating example, in which we train a face recog-nition model in a decentralized fashion, on a network of smartphones. Thesetting is ideal, as people have plenty of portraits on their phones, thoughthey are not always happy about sharing them. In such a setting, we couldargue that we can grant convergence for the average of the parameters,though, it could be very harmful if the model was to fail very badly in justa few phones. Here is where the comparison with local SGD gets interest-ing: how can we evaluate the tradeoff between security and computationalefficiency?

We would like to discuss the local behavior in each regular node. How can we bound the suboptimality $\|x_i^{(T)} - x^*\|^2$, or $\|\nabla f(x_i^{(T)})\|^2$? How can we bound the local consensus error $\|x_i^{(T)} - \bar{x}^{(T)}\|^2$?
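As a rough guide to how these quantities relate (a standard inequality, not a result of this thesis), and assuming that $\bar{x}^{(T)}$ denotes the average of the regular nodes' parameters at iteration $T$, the local suboptimality can be split into a consensus term and an average-iterate term:

```latex
\[
\|x_i^{(T)} - x^*\|^2
\;\le\; 2\,\|x_i^{(T)} - \bar{x}^{(T)}\|^2 + 2\,\|\bar{x}^{(T)} - x^*\|^2 .
\]
```

This follows from $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$. A bound on the average iterate alone therefore says nothing about node $i$ unless its local consensus error is also controlled.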

To the best of our knowledge, there is no catch-all answer to the above questions. Unfortunately, we cannot apply the proof strategy that we used in Section 3.4. To be precise, we cannot separate the error recursion from the consensus recursion without running into contradictions.

This is evident if we look at the simple counterexample illustrated in Figure 5.1. Suppose we have a node whose parameters are already close to the mean, but whose neighbors are further away: it cannot lower its consensus error by mixing. This implies that the local estimate can worsen during the mixing process, thus invalidating the proof strategy of Chapter 3, and in particular the lemmas on the consensus recursion.
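The effect can be reproduced numerically with a minimal sketch. The 5-node ring, the scalar parameter values, and the uniform 1/3 mixing weights below are illustrative assumptions, not the configuration drawn in Figure 5.1. Node 0 starts exactly at the global mean, while both of its neighbors sit on the same side of it.

```python
# Hypothetical toy setup: ring 0-1-3-4-2-0 with scalar parameters.
# Every node averages itself and its two neighbors with weight 1/3,
# which gives a symmetric, doubly stochastic mixing matrix W.
n = 5
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1, 4], 4: [2, 3]}

W = [[0.0] * n for _ in range(n)]
for i, nbrs in neighbors.items():
    W[i][i] = 1.0 / 3.0
    for j in nbrs:
        W[i][j] = 1.0 / 3.0


def gossip_step(W, x):
    """One mixing round: each node takes a weighted average of its neighborhood."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]


# Node 0 sits at the mean; its neighbors (nodes 1 and 2) are both at +4.
x = [0.0, 4.0, 4.0, -4.0, -4.0]
mean = sum(x) / n

x_new = gossip_step(W, x)

# Local consensus error of node 0, before and after mixing.
local_before = abs(x[0] - mean)
local_after = abs(x_new[0] - mean)

# Total squared consensus error over all nodes, before and after mixing.
total_before = sum((v - mean) ** 2 for v in x)
total_after = sum((v - mean) ** 2 for v in x_new)

print("node 0 distance to mean:", local_before, "->", local_after)
print("total consensus error:  ", total_before, "->", total_after)
```

In this toy instance the total consensus error shrinks while node 0 moves away from the mean, so a per-node contraction cannot hold at every step even though the aggregate one does.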

Nonetheless, this problem could be solved through a careful analysis of the setting. The proof should take into account that the nodes all start at the same point and that the stochastic gradients estimate the same quantity.

The design of a new proof strategy is left as a future line of research.

Chapter 6

Conclusion

6.1 Final comments

We motivate and define a decentralized version of Stochastic Gradient Descent, where a network of computers aims to jointly minimize a parametrized stochastic function. We introduce Byzantine adversaries, which have complete knowledge of the network state and try to fool the regular nodes by sending arbitrary messages during the optimization process.

We review many variants of the Decentralized SGD algorithm that have been designed to withstand such Byzantine attacks, and we experimentally test some of them.

We carry out a theoretical analysis of the behavior of Robust Decentralized SGD, and provide convergence rates for a Byzantine-free setting. We prove sublinear convergence in the number of iterations T, and a dependence on the number of good nodes N and on the connectivity of the graph.

An experimental analysis on the MNIST handwritten-digit classification task shows that the theoretical results capture the learning behavior even when some of the assumptions are not satisfied.

The experiments on variants of Robust DeSGD against two different kinds of Byzantine attacks show that existing strategies can deal with simple attacks, but fail on more elaborate ones. However, there is no agreement in the machine learning community on the meaning of failure in a Byzantine setting. Furthermore, there is no unique definition of robustness.

Different authors focus on different settings and, often, their methods are not comparable, because the assumptions are contradictory. We focus on a convergence-based approach and discuss the performance of various algorithms by comparing them to the learning rate of local SGD, which is always a solution for the iid setting that we consider. For this reason, we point out the importance of understanding the local behavior of decentralized algorithms, which has not been treated in the literature.

6.2 Future work

We analyze the convergence guarantees of variants of Byzantine-robust decentralized SGD and, for every answer we provide, new questions arise. Many points still need to be clarified, and future research should try to dissipate the mist.

There is no univocal definition of Byzantine robustness, neither in the federated setting nor in the decentralized one. The lack of such a characterization prevents the comparison of different algorithms. We should find an intuitive definition, maybe inspired by Definition 3.5.1, to open the door to precise theoretical analysis. In this context, a future line of research should work on bounding the error introduced by the linear approximation to the robust aggregation from Section 3.3.

In parallel, we should find a different proof approach, to open up to non-symmetric mixing matrices and, maybe, to more general aggregation functions.

We should also properly characterize the finite-time local behavior of the nodes in the graph. This would be useful to assess the robustness of an algorithm, and to quantify the tradeoff between security and computational efficiency.

Many other questions are left unsolved, and an insightful reader could glimpse other directions to explore, which we may have overlooked.

Bibliography

[1] Moran Baruch, Gilad Baruch, and Yoav Goldberg. A Little Is Enough: Circumventing Defenses For Distributed Learning. 2019. arXiv: 1902.06156 [cs.LG].

[2] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. “Machine learning with adversaries: Byzantine tolerant gradient descent”. In: Advances in Neural Information Processing Systems. 2017, pp. 119–129.

[3] Corinna Cortes and Vladimir Vapnik. “Support-vector networks”. In: Machine learning 20.3 (1995), pp. 273–297.

[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[5] Shangwei Guo et al. Towards Byzantine-resilient Learning in Decentralized Systems. 2020. arXiv: 2002.08569 [cs.LG].

[6] Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Learning from History for Byzantine Robust Optimization. 2020. arXiv: 2012.10333 [cs.LG].

[7] Sai Praneeth Karimireddy et al. “SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated Learning”. In: CoRR abs/1910.06378 (2019). arXiv: 1910.06378. url: http://arxiv.org/abs/1910.06378.

[8] Anastasia Koloskova et al. “A Unified Theory of Decentralized SGD with Changing Topology and Local Updates”. In: (2020). arXiv: 2003.10422 [cs.LG].

[9] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. 2012. arXiv: 1212.2002 [cs.LG].

[10] Leslie Lamport, Robert Shostak, and Marshall Pease. “The Byzantine Generals Problem”. In: ACM Transactions on Programming Languages and Systems 4.3 (1982), pp. 382–401.

[11] Yann LeCun et al. “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.

[12] El Mahdi El Mhamdi, Rachid Guerraoui, and Sebastien Rouault. The Hidden Vulnerability of Distributed Learning in Byzantium. 2018. arXiv: 1802.07927 [stat.ML].

[13] J. Peng and Q. Ling. “Byzantine-Robust Decentralized Stochastic Optimization”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020, pp. 5935–5939.

[14] Jayanth Regatti and Abhishek Gupta. Befriending The Byzantines Through Reputation Scores. 2020. arXiv: 2006.13421 [cs.LG].

[15] L. Su and N. H. Vaidya. “Byzantine-Resilient Multi-Agent Optimization”. In: IEEE Transactions on Automatic Control (2020), pp. 1–1.

[16] Konstantin Tikhomirov and Pierre Youssef. The spectral gap of dense random regular graphs. 2016. arXiv: 1610.01765 [math.PR].

[17] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. “Zeno: Byzantine-suspicious stochastic gradient descent”. In: arXiv preprint (2018). arXiv: 1805.10032.

[18] Cong Xie, Sanmi Koyejo, and Indranil Gupta. “Zeno++: Robust Fully Asynchronous SGD”. In: (2019). arXiv: 1903.07020 [cs.LG].

[19] Zhixiong Yang and Waheed U Bajwa. “BRIDGE: Byzantine-resilient decentralized gradient descent”. In: arXiv preprint (2019). arXiv: 1908.08098.

[20] Zhixiong Yang and Waheed U Bajwa. “ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning”. In: IEEE Transactions on Signal and Information Processing over Networks 5.4 (2019), pp. 611–627.

[21] Zhixiong Yang, Arpita Gang, and Waheed U Bajwa. “Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model”. In: IEEE Signal Processing Magazine 37.3 (2020), pp. 146–159.

[22] Xingyu Zhou. On the Fenchel Duality between Strong Convexity and Lipschitz Continuous Gradient. 2018. arXiv: 1803.06573 [math.OC].








