
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 67, NO. 14, JULY 15, 2019

Distributed Solution of Large-Scale Linear Systems via Accelerated Projection-Based Consensus

Navid Azizan-Ruhi, Student Member, IEEE, Farshad Lahouti, Senior Member, IEEE, Amir Salman Avestimehr, Senior Member, IEEE, and Babak Hassibi, Member, IEEE

Abstract—Solving a large-scale system of linear equations is a key step at the heart of many algorithms in scientific computing, machine learning, and beyond. When the problem dimension is large, computational and/or memory constraints make it desirable, or even necessary, to perform the task in a distributed fashion. In this paper, we consider a common scenario in which a taskmaster intends to solve a large-scale system of linear equations by distributing subsets of the equations among a number of computing machines/cores. We propose a new algorithm called Accelerated Projection-based Consensus, in which at each iteration every machine updates its solution by adding a scaled version of the projection of an error signal onto the nullspace of its system of equations, and the taskmaster conducts an averaging over the solutions with momentum. The convergence behavior of the proposed algorithm is analyzed in detail and analytically shown to compare favorably with the convergence rate of alternative distributed methods, namely distributed gradient descent, distributed versions of Nesterov's accelerated gradient descent and heavy-ball method, the block Cimmino method, and the Alternating Direction Method of Multipliers. On randomly chosen linear systems, as well as on real-world data sets, the proposed method offers significant speed-up relative to all the aforementioned methods. Finally, our analysis suggests a novel variation of the distributed heavy-ball method, which employs a particular distributed preconditioning and achieves the same theoretical convergence rate as that of the proposed consensus-based method.

Index Terms—System of linear equations, distributed computing, big data, consensus, optimization.

Manuscript received June 25, 2018; revised December 14, 2018 and April 2, 2019; accepted April 26, 2019. Date of publication June 4, 2019; date of current version June 21, 2019. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Laura Cottatellucci. This work was supported in part by the National Science Foundation under Grants CCF-1423663, CCF-1409204, and ECCS-1509977, in part by a grant from Qualcomm Inc., in part by NASA's Jet Propulsion Laboratory through the President and Director's Fund, and in part by fellowships from Amazon Web Services, Inc., and PIMCO, LLC. This paper was presented in part at the IEEE International Conference on Acoustics, Speech, and Signal Processing, Calgary, AB, Canada, April 2018 [1]. (Corresponding author: Navid Azizan-Ruhi.)

N. Azizan-Ruhi is with the Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125 USA (e-mail: [email protected]).

F. Lahouti and B. Hassibi are with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125 USA (e-mail: [email protected]; [email protected]).

A. S. Avestimehr is with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90007 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSP.2019.2917855

I. INTRODUCTION

With the advent of big data, many analytical tasks of interest rely on distributed computations over multiple processing cores or machines. This is either due to the inherent complexity of the problem, in terms of computation and/or memory, or due to the nature of the data sets themselves, which may already be dispersed across machines. Most algorithms in the literature have been designed to run in a sequential fashion, as a result of which in many cases their distributed counterparts have yet to be devised. In order to devise efficient distributed algorithms, one has to address a number of key questions, such as: (a) What computation should each worker carry out? (b) What is the communication architecture and what messages should be communicated between the processors? (c) How does the distributed implementation fare in terms of computational complexity? (d) What is the rate of convergence in the case of iterative algorithms?

In this paper, we focus on solving a large-scale system of linear equations, which is one of the most fundamental problems in numerical computation, and lies at the heart of many algorithms in engineering and the sciences. In particular, we consider the setting in which a taskmaster intends to solve a large-scale system of equations in a distributed way with the help of a set of computing machines/cores (Figure 1). This is a common setting in many computing applications, and the task is mainly distributed because of high computational and/or memory requirements (rather than physical location as in sensor networks).

This problem can in general be cast as an optimization problem, with a cost function that is separable in the data¹ (but not in the variables). Hence, there are general approaches to construct distributed algorithms for this problem, such as distributed versions of gradient descent [2]–[4] and its variants (e.g., Nesterov's accelerated gradient [5] and the heavy-ball method [6]), as well as the so-called Alternating Direction Method of Multipliers (ADMM) [7] and its variants. ADMM has been widely used [8]–[10] for solving various convex optimization problems in a distributed way, and in particular for consensus optimization [11]–[13], which is the relevant one for the type of separation that we have here. In addition to the optimization-based methods, there are a few distributed algorithms designed specifically for solving systems of linear equations. The most famous of these is what is known as the block Cimmino method [14]–[16], which is a block row-projection method [17], and is in a way a distributed implementation of the Kaczmarz method [18]. Another algorithm has been recently proposed in [19], [20], where a consensus-based scheme is used to solve a system of linear equations over a network of autonomous agents. Our algorithm bears some resemblance to all of these methods, but as will be explained in detail, it has much faster convergence than any of them.

¹Solving a system of linear equations, Ax = b, can be set up as the optimization problem $\min_x \|Ax - b\|^2 = \min_x \sum_i \|(Ax)_i - b_i\|^2$.

Fig. 1. Schematic representation of the taskmaster and the m machines/cores. Each machine i has only a subset of the equations, i.e., [A_i, b_i].

Our main contribution is the design and analysis of a new algorithm for the distributed solution of large-scale systems of linear equations, which is significantly faster than all the existing methods. In our methodology, the taskmaster assigns a subset of equations to each of the machines and invokes a distributed consensus-based algorithm to obtain the solution to the original problem in an iterative manner. At each iteration, each machine updates its solution by adding a scaled version of the projection of an error signal onto the nullspace of its system of equations, and the taskmaster conducts an averaging over the solutions with momentum. The incorporation of a momentum term in both projection and averaging steps results in accelerated convergence of our method, compared to the other projection-based methods. For this reason, we refer to this method as Accelerated Projection-based Consensus (APC). We provide a complete analysis of the convergence rate of APC (Section III), as well as a detailed comparison with all the other distributed methods mentioned above (Section IV). Also, by empirical evaluations over both randomly chosen linear systems and real-world data sets, we demonstrate the significant speed-ups from the proposed algorithm, relative to the other distributed methods (Section VI). Finally, as a further implication of our results, we propose a novel distributed preconditioning method (Section VII), which can be used to improve the convergence rate of distributed gradient-based methods.

II. THE SETUP

We consider the problem of solving a large-scale system of linear equations

Ax = b, (1)

where A ∈ R^{N×n}, x ∈ R^n, and b ∈ R^N. While we will generally take N ≥ n, we will assume that the system has a unique solution. For this reason, we will most often consider the square case (N = n). The case where N < n and there are multiple (infinitely many) solutions is discussed in Section V.

As mentioned before, for large-scale problems (when N, n ≫ 1), it is highly desirable, or even necessary, to solve the problem in a distributed fashion. Assuming we have m machines (as in Figure 1), the equations can be partitioned so that each machine gets a disjoint subset of them. In other words, we can write (1) as

$$\begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{bmatrix} x = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix},$$

where each machine i receives [A_i, b_i]. In some applications, the data may already be stored on different machines in such a fashion. For the sake of simplicity, we assume that m divides N, and that the equations are distributed evenly among the machines, so that each machine gets p = N/m equations. Therefore A_i ∈ R^{p×n} and b_i ∈ R^p for every i = 1, ..., m. It is helpful to think of p as being relatively small compared to n. In fact, each machine has a system of equations which is highly under-determined.
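As a concrete illustration of this partitioning, the following short Python/NumPy sketch (our own minimal example, not part of the paper; it assumes m divides N, and the names are hypothetical) splits a system row-wise into the m blocks [A_i, b_i] that would be assigned to the machines.

import numpy as np

def partition_system(A, b, m):
    """Split (A, b) row-wise into m equal blocks [A_i, b_i]."""
    N = A.shape[0]
    assert N % m == 0, "for simplicity, m is assumed to divide N"
    p = N // m
    return [(A[i * p:(i + 1) * p, :], b[i * p:(i + 1) * p]) for i in range(m)]

# Example: N = n = 1000 equations/unknowns, m = 4 machines (p = 250 each).
N, n, m = 1000, 1000, 4
A = np.random.randn(N, n)
b = A @ np.random.randn(n)          # consistent right-hand side
machines = partition_system(A, b, m)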

III. ACCELERATED PROJECTION-BASED CONSENSUS

A. The Algorithm

Each machine i can certainly find a solution (among infinitely many) to its own highly under-determined system of equations A_i x = b_i, with simply O(p³) computations. We denote this initial solution by x_i(0). Clearly, adding any vector in the right nullspace of A_i to x_i(0) will yield another viable solution. The challenge is to find vectors in the nullspaces of each of the A_i's in such a way that all the solutions for different machines coincide.

At each iteration t, the master provides the machines with an estimate of the solution, denoted by x(t). Each machine then updates its value x_i(t) by projecting its difference from the estimate onto the nullspace, and taking a weighted step in that direction (which behaves as a "momentum"). Mathematically,

$$x_i(t+1) = x_i(t) + \gamma P_i\big(x(t) - x_i(t)\big),$$

where

$$P_i = I - A_i^T (A_i A_i^T)^{-1} A_i \tag{2}$$

is the projection matrix onto the nullspace of A_i (it is easy to check that A_i P_i = 0 and P_i² = P_i).

Although this might bear some resemblance to the block Cimmino method because of the projection matrices, APC has a much faster convergence rate than the block Cimmino method (i.e., convergence time smaller by a square root), as will be shown in Section IV. Moreover, it turns out that the block Cimmino method is in fact a special case of APC for γ = 1 (Section IV-E).

The update rule of x_i(t+1) described above can also be thought of as the solution to an optimization problem with two terms: the distance from the global estimate x(t), and the distance from the previous solution x_i(t). In other words, one can show that

$$x_i(t+1) = \arg\min_{x_i} \;\|x_i - x(t)\|^2 + \frac{1-\gamma}{\gamma}\|x_i - x_i(t)\|^2 \quad \text{s.t. } A_i x_i = b_i.$$

The second term in the objective is what distinguishes this method from the block Cimmino method. If one sets γ equal to 1 (which is the reduction to the block Cimmino method), the second term disappears altogether, and the update no longer depends on x_i(t). As we will show, this can have a dramatic impact on the convergence rate.

After each iteration, the master collects the updated values x_i(t+1) to form a new estimate x(t+1). A plausible choice for this is to simply take the average of the values as the new estimate, i.e., $x(t+1) = \frac{1}{m}\sum_{i=1}^{m} x_i(t+1)$. This update works, and is what appears both in ADMM and in the consensus method of [19], [20]. But it turns out that it is extremely slow. Instead, we take an affine combination of the average and the previous estimate as

$$x(t+1) = \frac{\eta}{m}\sum_{i=1}^{m} x_i(t+1) + (1-\eta)x(t),$$

which introduces a one-step memory, and again behaves as a momentum.

The resulting update rule is therefore

$$x_i(t+1) = x_i(t) + \gamma P_i\big(x(t) - x_i(t)\big), \quad i \in [m], \tag{3a}$$

$$x(t+1) = \frac{\eta}{m}\sum_{i=1}^{m} x_i(t+1) + (1-\eta)x(t), \tag{3b}$$

which leads to Algorithm 1.

Algorithm 1: APC: Accelerated Projection-Based Consensus (for solving Ax = b distributedly).
  Input: data [A_i, b_i] on each machine i = 1, ..., m; parameters η, γ.
  Initialization: on each machine i, find a solution x_i(0) (among infinitely many) to A_i x = b_i; at the master, compute x(0) ← (1/m) Σ_{i=1}^{m} x_i(0).
  for t = 1 to T do
    for each machine i in parallel do
      x_i(t) ← x_i(t−1) + γ P_i (x(t−1) − x_i(t−1))
    end for
    at the master: x(t) ← (η/m) Σ_{i=1}^{m} x_i(t) + (1−η) x(t−1)
  end for
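To make the steps of Algorithm 1 concrete, here is a minimal single-process Python/NumPy simulation of APC (our own sketch under simplifying assumptions, not the authors' code): it forms each P_i explicitly for clarity, initializes x_i(0) via the minimum-norm solution A_i⁺b_i (one valid choice among infinitely many), and runs updates (3a)–(3b) for given γ and η.

import numpy as np

def apc(A_blocks, b_blocks, gamma, eta, T):
    """Simulate Accelerated Projection-based Consensus (Algorithm 1)."""
    m = len(A_blocks)
    n = A_blocks[0].shape[1]
    # Per-machine setup: pseudoinverse A_i^+, projector P_i, initial solution x_i(0).
    pinvs = [Ai.T @ np.linalg.inv(Ai @ Ai.T) for Ai in A_blocks]
    projs = [np.eye(n) - pinvs[i] @ A_blocks[i] for i in range(m)]
    x_i = [pinvs[i] @ b_blocks[i] for i in range(m)]   # a solution of A_i x = b_i
    x_bar = sum(x_i) / m                               # master's initial estimate
    for _ in range(T):
        # Each machine updates its local solution (done in parallel in practice).
        x_i = [x_i[i] + gamma * (projs[i] @ (x_bar - x_i[i])) for i in range(m)]
        # The master averages with momentum.
        x_bar = (eta / m) * sum(x_i) + (1 - eta) * x_bar
    return x_bar

With (γ, η) chosen as in Theorem 1 below (or any pair in the set S), x_bar converges linearly to the solution of Ax = b.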

B. Convergence Analysis

We analyze the convergence of the proposed algorithm and prove that it has linear convergence (i.e., the error decays exponentially), with no additional assumption imposed. We also derive the rate of convergence explicitly.

Let us define the matrix X ∈ R^{n×n} as

$$X \triangleq \frac{1}{m}\sum_{i=1}^{m} A_i^T (A_i A_i^T)^{-1} A_i. \tag{4}$$

As will become clear soon, the condition number of this matrix predicts the behavior of the algorithm. Note that since the eigenvalues of the projection matrix P_i are all 0 and 1, for every i, the eigenvalues of X are all between 0 and 1. Denoting the eigenvalues of X by μ_i, we have 0 ≤ μ_min ≜ μ_n ≤ ⋯ ≤ μ_1 ≜ μ_max ≤ 1. Let us define complex quadratic polynomials p_i(λ), characterized by γ and η, as

$$p_i(\lambda;\gamma,\eta) \triangleq \lambda^2 + \big(-\eta\gamma(1-\mu_i) + \gamma - 1 + \eta - 1\big)\lambda + (\gamma-1)(\eta-1) \tag{5}$$

for i = 1, ..., n. Further, define the set S as the collection of pairs γ ∈ [0, 2] and η ∈ R for which the largest-magnitude root of p_i(λ) = 0, over every i, is less than 1, i.e.,

$$S = \big\{(\gamma,\eta)\in[0,2]\times\mathbb{R}\;:\;\text{the roots of } p_i \text{ have magnitude less than 1 for all } i\big\}. \tag{6}$$

The following result summarizes the convergence behavior of the proposed algorithm.

Theorem 1: Algorithm 1 converges to the true solution as fast as ρ^t converges to 0, as t → ∞, for some ρ ∈ (0, 1), if and only if (γ, η) ∈ S. Furthermore, the optimal rate of convergence is

$$\rho = \frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(X)}}, \tag{7}$$

where κ(X) = μ_max/μ_min is the condition number of X, and the optimal parameters (γ*, η*) are the solution to the following equations:

$$\begin{cases} \mu_{\max}\,\eta\gamma = \big(1+\sqrt{(\gamma-1)(\eta-1)}\big)^2,\\[2pt] \mu_{\min}\,\eta\gamma = \big(1-\sqrt{(\gamma-1)(\eta-1)}\big)^2.\end{cases}$$

Proof: Let x* be the solution of Ax = b. To make the analysis easier, we define error vectors with respect to x* as e_i(t) = x_i(t) − x* for all i = 1, ..., m, and e(t) = x(t) − x*, and work with these vectors. Using this notation, Eq. (3a) can be rewritten as

$$e_i(t+1) = e_i(t) + \gamma P_i\big(e(t) - e_i(t)\big), \quad i = 1,\dots,m.$$

Note that both x* and x_i(t) are solutions to A_i x = b_i. Therefore, their difference, which is e_i(t), is in the nullspace of A_i, and it remains unchanged under projection onto the nullspace. As a result, P_i e_i(t) = e_i(t), and we have

$$e_i(t+1) = (1-\gamma)e_i(t) + \gamma P_i e(t), \quad i = 1,\dots,m. \tag{8}$$

Similarly, the recursion (3b) can be expressed as

$$e(t+1) = \frac{\eta}{m}\sum_{i=1}^{m} e_i(t+1) + (1-\eta)e(t),$$

which using (8) becomes

$$\begin{aligned} e(t+1) &= \frac{\eta}{m}\sum_{i=1}^{m}\big((1-\gamma)e_i(t) + \gamma P_i e(t)\big) + (1-\eta)e(t)\\ &= \frac{\eta(1-\gamma)}{m}\sum_{i=1}^{m} e_i(t) + \left(\frac{\eta\gamma}{m}\sum_{i=1}^{m}P_i + (1-\eta)I_n\right)e(t). \end{aligned} \tag{9}$$

It is relatively easy to check that in the steady state, the recursions (8), (9) become

$$\begin{cases} P_i e(\infty) = e_i(\infty), & i = 1,\dots,m,\\ e(\infty) = \frac{1}{m}\sum_{i=1}^{m} P_i e(\infty), \end{cases}$$

which, because $\frac{1}{m}\sum_{i=1}^{m}P_i = I - \frac{1}{m}\sum_{i=1}^{m}A_i^T(A_iA_i^T)^{-1}A_i = I - X$, implies $e(\infty) = e_1(\infty) = \dots = e_m(\infty) = 0$ if μ_min ≠ 0.

Now let us stack up all the m vectors e_i along with the average e together, as a vector $\mathbf{e}(t)^T = [e_1(t)^T, e_2(t)^T, \dots, e_m(t)^T, e(t)^T] \in \mathbb{R}^{(m+1)n}$. The update rule can be expressed as

$$\begin{bmatrix} e_1(t+1)\\ \vdots\\ e_m(t+1)\\ e(t+1) \end{bmatrix} = \begin{bmatrix} (1-\gamma)I_{mn} & \gamma\begin{bmatrix}P_1\\ \vdots\\ P_m\end{bmatrix}\\[4pt] \frac{\eta(1-\gamma)}{m}\begin{bmatrix}I_n & \dots & I_n\end{bmatrix} & M \end{bmatrix} \begin{bmatrix} e_1(t)\\ \vdots\\ e_m(t)\\ e(t) \end{bmatrix}, \tag{10}$$

where $M = \frac{\eta\gamma}{m}\sum_{i=1}^{m}P_i + (1-\eta)I_n$.

The convergence rate of the algorithm is determined by the spectral radius (largest-magnitude eigenvalue) of the (m+1)n × (m+1)n block matrix in (10). The eigenvalues λ of this matrix are indeed the solutions to the following characteristic equation:

$$\det\begin{bmatrix} (1-\gamma-\lambda)I_{mn} & \gamma\begin{bmatrix}P_1\\ \vdots\\ P_m\end{bmatrix}\\[4pt] \frac{\eta(1-\gamma)}{m}\begin{bmatrix}I_n & \dots & I_n\end{bmatrix} & \frac{\eta\gamma}{m}\sum_{i=1}^{m}P_i + (1-\eta-\lambda)I_n \end{bmatrix} = 0.$$

Using the Schur complement and properties of the determinant, the characteristic equation can be simplified as follows:

$$\begin{aligned} 0 &= (1-\gamma-\lambda)^{mn}\,\det\!\left(\frac{\eta\gamma}{m}\sum_{i=1}^{m}P_i + (1-\eta-\lambda)I_n - \frac{\eta(1-\gamma)\gamma}{(1-\gamma-\lambda)m}\sum_{i=1}^{m}P_i\right)\\ &= (1-\gamma-\lambda)^{mn}\,\det\!\left(\frac{\eta\gamma}{m}\left(1-\frac{1-\gamma}{1-\gamma-\lambda}\right)\sum_{i=1}^{m}P_i + (1-\eta-\lambda)I_n\right)\\ &= (1-\gamma-\lambda)^{mn}\,\det\!\left(\frac{-\eta\gamma\lambda}{(1-\gamma-\lambda)m}\sum_{i=1}^{m}P_i + (1-\eta-\lambda)I_n\right)\\ &= (1-\gamma-\lambda)^{(m-1)n}\,\det\!\left(\frac{-\eta\gamma\lambda\sum_{i=1}^{m}P_i}{m} + (1-\gamma-\lambda)(1-\eta-\lambda)I_n\right). \end{aligned}$$

Therefore, there are (m−1)n eigenvalues equal to 1−γ, and the remaining 2n eigenvalues are the solutions to

$$0 = \det\big(-\eta\gamma\lambda(I-X) + (1-\gamma-\lambda)(1-\eta-\lambda)I\big) = \det\big(\eta\gamma\lambda X + \big((1-\gamma-\lambda)(1-\eta-\lambda)-\eta\gamma\lambda\big)I\big).$$

Whenever we have dropped the subscript of the identity matrix, it is of size n.

Recall that the eigenvalues of X are denoted by μ_i, i = 1, ..., n. Therefore, the eigenvalues of $\eta\gamma\lambda X + \big((1-\gamma-\lambda)(1-\eta-\lambda)-\eta\gamma\lambda\big)I$ are $\eta\gamma\lambda\mu_i + (1-\gamma-\lambda)(1-\eta-\lambda)-\eta\gamma\lambda$, i = 1, ..., n. The above determinant can then be written as the product of the eigenvalues of the matrix inside it, as

$$0 = \prod_{i=1}^{n}\Big(\eta\gamma\lambda\mu_i + (1-\gamma-\lambda)(1-\eta-\lambda) - \eta\gamma\lambda\Big).$$

Therefore, there are two eigenvalues λ_{i,1}, λ_{i,2} given by the solutions of the quadratic equation

$$\lambda^2 + \big(-\eta\gamma(1-\mu_i) + \gamma - 1 + \eta - 1\big)\lambda + (\gamma-1)(\eta-1) = 0$$

for every i = 1, ..., n, which together constitute the 2n eigenvalues. When all these eigenvalues, along with 1−γ, have magnitude less than 1, the error converges to zero as ρ^t, with ρ being the largest-magnitude eigenvalue (spectral radius). Therefore, Algorithm 1 converges to the true solution x* as fast as ρ^t converges to 0, as t → ∞, if and only if (γ, η) ∈ S.

The optimal rate of convergence is achieved when the spectral radius is minimized. For that to happen, all the above eigenvalues should be complex and have magnitude $|\lambda_{i,1}| = |\lambda_{i,2}| = \sqrt{(\gamma-1)(\eta-1)} = \rho$. This implies that we should have

$$\big(\gamma + \eta - \eta\gamma(1-\mu_i) - 2\big)^2 \le 4(\gamma-1)(\eta-1), \quad \forall i,$$

or equivalently

$$-2\sqrt{(\gamma-1)(\eta-1)} \;\le\; \gamma + \eta - \eta\gamma(1-\mu_i) - 2 \;\le\; 2\sqrt{(\gamma-1)(\eta-1)}$$

for all i. The expression in the middle is an increasing function of μ_i, and therefore for the above bounds to hold, it is enough for the lower bound to hold for μ_min and the upper bound to hold for μ_max, i.e.,

$$\begin{cases} \gamma + \eta - \eta\gamma(1-\mu_{\max}) - 2 = 2\sqrt{(\gamma-1)(\eta-1)},\\[2pt] 2 + \eta\gamma(1-\mu_{\min}) - \gamma - \eta = 2\sqrt{(\gamma-1)(\eta-1)}, \end{cases}$$

which can be massaged and expressed as

$$\begin{cases} \mu_{\max}\,\eta\gamma = \big(1+\sqrt{(\gamma-1)(\eta-1)}\big)^2 = (1+\rho)^2,\\[2pt] \mu_{\min}\,\eta\gamma = \big(1-\sqrt{(\gamma-1)(\eta-1)}\big)^2 = (1-\rho)^2. \end{cases}$$

Dividing the above two equations implies $\kappa(X) = \frac{(1+\rho)^2}{(1-\rho)^2}$, which results in the optimal rate of convergence being

$$\rho = \frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1},$$

and that concludes the proof. ∎

We should remark that while in theory the optimal values of γ and η depend on the values of the smallest and largest eigenvalues of X, in practice, one will almost never compute these eigenvalues. Rather, one will use surrogate heuristics (such as using the eigenvalues of an appropriate-size random matrix) to choose the step size. (In fact, the other methods, such as distributed gradient descent and its variants, have the same issue as well.)
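For readers who do have (estimates of) μ_min and μ_max, the optimal pair (γ*, η*) of Theorem 1 can be computed numerically from the two equations above. The sketch below is our own illustration (function and variable names are ours): it uses the facts that ηγ = (1+ρ)²/μ_max and (γ−1)(η−1) = ρ², so that γ* and η* are the two roots of a simple quadratic; the equations are symmetric in γ and η, and the set S additionally requires γ ∈ [0, 2].

import numpy as np

def apc_optimal_parameters(mu_min, mu_max):
    """Solve Theorem 1's equations for (gamma*, eta*), 0 < mu_min <= mu_max <= 1."""
    kappa = mu_max / mu_min
    rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # optimal rate, Eq. (7)
    s = (1 + rho) ** 2 / mu_max                         # s = eta * gamma
    q = s + 1 - rho ** 2                                # q = eta + gamma
    disc = max(q ** 2 - 4 * s, 0.0)                     # guard against round-off
    r1, r2 = (q - np.sqrt(disc)) / 2, (q + np.sqrt(disc)) / 2
    return r1, r2, rho    # the two roots give gamma*, eta* (in some order)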

C. Computation and Communication Complexity

In addition to the convergence rate, or equivalently the number of iterations until convergence, one needs to consider the computational complexity per iteration. At each iteration, since $P_i = I_n - A_i^T(A_iA_i^T)^{-1}A_i$ and A_i is p × n, each machine has to do the following two matrix-vector multiplications: (1) $A_i(x_i(t) - x(t))$, which takes pn scalar multiplications, and (2) $A_i^T(A_iA_i^T)^{-1}$ times the vector from the previous step, which takes another np operations (the pseudoinverse $A_i^T(A_iA_i^T)^{-1}$ is computed only once). Thus the overall computational complexity of each iteration is 2pn.

We should remark that the computation done at each machine during each iteration is essentially a projection, which has condition number one and is as numerically stable as a matrix-vector multiplication can be.

Finally, the communication cost of the algorithm, per iteration, is as follows. After computing the update, each of the m machines sends an n-dimensional vector to the master, and receives back another n-dimensional vector, which is the new average.

As we will see, the per-iteration computation and communication complexity of the other algorithms are similar to APC; however, APC requires fewer iterations, because of its faster rate of convergence.
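The 2pn count corresponds to applying P_i implicitly rather than forming it. As a small illustration (ours, with hypothetical names; A_i⁺ = A_i^T(A_iA_i^T)^{-1} is assumed precomputed once), one worker's per-iteration update can be written as two p-by-n matrix-vector products:

import numpy as np

def apc_local_step(Ai, Ai_pinv, xi, x_bar, gamma):
    """One APC worker update via two matrix-vector products (2pn operations)."""
    d = x_bar - xi
    r = Ai @ d                # first matvec: p*n multiplications
    correction = Ai_pinv @ r  # second matvec: n*p multiplications
    # P_i d = d - A_i^+ (A_i d), so no n x n projector is ever formed.
    return xi + gamma * (d - correction)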

IV. COMPARISON WITH RELATED METHODS

A. Distributed Gradient Descent (DGD)

As mentioned earlier, (1) can also be viewed as an optimization problem of the form

$$\underset{x}{\text{minimize}}\ \|Ax - b\|^2,$$

and since the objective is separable in the data, i.e., $\|Ax-b\|^2 = \sum_{i=1}^{m}\|A_i x - b_i\|^2$, generic distributed optimization methods such as distributed gradient descent apply well to the problem.

The regular or full gradient descent has the update rule $x(t+1) = x(t) - \alpha A^T(Ax(t) - b)$, where α > 0 is the step size or learning rate. The distributed version of gradient descent is one in which each machine i has only a subset of the equations [A_i, b_i], and computes its own part of the gradient, which is $A_i^T(A_i x(t) - b_i)$. The updates are then collectively done as

$$x(t+1) = x(t) - \alpha\sum_{i=1}^{m} A_i^T\big(A_i x(t) - b_i\big). \tag{11}$$

One can show that this also has linear convergence, and the rate of convergence is

$$\rho_{\mathrm{GD}} = \frac{\kappa(A^TA)-1}{\kappa(A^TA)+1} \approx 1 - \frac{2}{\kappa(A^TA)}. \tag{12}$$

We should mention that since each machine needs to compute $A_i^T(A_i x(t) - b_i)$ at each iteration t, the computational complexity per iteration is 2pn, which is identical to that of APC.

B. Distributed Nesterov's Accelerated Gradient Descent (D-NAG)

A popular variant of gradient descent is Nesterov's accelerated gradient descent [5], which has a memory term and works as follows:

$$y(t+1) = x(t) - \alpha\sum_{i=1}^{m} A_i^T\big(A_i x(t) - b_i\big), \tag{13a}$$

$$x(t+1) = (1+\beta)y(t+1) - \beta y(t). \tag{13b}$$

One can show [21] that the optimal convergence rate of this method is

$$\rho_{\mathrm{NAG}} = 1 - \frac{2}{\sqrt{3\kappa(A^TA)+1}}, \tag{14}$$

which is improved over the regular distributed gradient descent (one can check that $\frac{\kappa(A^TA)-1}{\kappa(A^TA)+1} \ge 1 - \frac{2}{\sqrt{3\kappa(A^TA)+1}}$).

C. Distributed Heavy-Ball Method (D-HBM)

The heavy-ball method [6], otherwise known as gradient descent with momentum, is another accelerated variant of gradient descent, as follows:

$$z(t+1) = \beta z(t) + \sum_{i=1}^{m} A_i^T\big(A_i x(t) - b_i\big), \tag{15a}$$

$$x(t+1) = x(t) - \alpha z(t+1). \tag{15b}$$

It can be shown [21] that the optimal rate of convergence of this method is

$$\rho_{\mathrm{HBM}} = \frac{\sqrt{\kappa(A^TA)}-1}{\sqrt{\kappa(A^TA)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(A^TA)}}, \tag{16}$$

which is further improved over DGD and D-NAG ($\frac{\kappa(A^TA)-1}{\kappa(A^TA)+1} \ge 1 - \frac{2}{\sqrt{3\kappa(A^TA)+1}} \ge \frac{\sqrt{\kappa(A^TA)}-1}{\sqrt{\kappa(A^TA)}+1}$). This is similar to, but not the same as, the rate of convergence of APC. The difference is that the condition number of $A^TA = \sum_{i=1}^{m}A_i^TA_i$ is replaced with the condition number of $X = \frac{1}{m}\sum_{i=1}^{m}A_i^T(A_iA_i^T)^{-1}A_i$ in APC. Given its structure as the sum of projection matrices, one may speculate that X has a much better condition number than A^TA.

Indeed, our experiments with random, as well as real, data sets suggest that this is the case and that the condition number of X is often significantly better (see Table II).

TABLE I: A summary of the convergence rates of different methods. DGD: distributed gradient descent, D-NAG: distributed Nesterov's accelerated gradient descent, D-HBM: distributed heavy-ball method, Mou et al.: consensus algorithm of [20], B-Cimmino: block Cimmino method, APC: accelerated projection-based consensus. The smaller the convergence rate, the faster the method. Note that ρ_GD ≥ ρ_NAG ≥ ρ_HBM and ρ_Mou ≥ ρ_Cim ≥ ρ_APC.

TABLE II: A comparison between the condition numbers of A^TA and X for some examples. m is the number of machines/partitions. The condition number of X is typically much smaller (better). Remarkably, the difference is even more pronounced when A has non-zero mean.

D. Alternating Direction Method of Multipliers (ADMM)

The Alternating Direction Method of Multipliers (more specifically, consensus ADMM [7], [12]) is another generic method for solving optimization problems with separable cost function $f(x) = \sum_{i=1}^{m} f_i(x)$ distributedly, by defining additional local variables. Each machine i holds local variables $x_i(t) \in \mathbb{R}^n$ and $y_i(t) \in \mathbb{R}^n$, and the master's value is $x(t) \in \mathbb{R}^n$, for any time t. For $f_i(x) = \frac{1}{2}\|A_i x - b_i\|^2$, the update rule of ADMM simplifies to

$$x_i(t+1) = (A_i^TA_i + \xi I_n)^{-1}\big(A_i^Tb_i - y_i(t) + \xi x(t)\big), \quad i \in [m], \tag{17a}$$

$$x(t+1) = \frac{1}{m}\sum_{i=1}^{m} x_i(t+1), \tag{17b}$$

$$y_i(t+1) = y_i(t) + \xi\big(x_i(t+1) - x(t+1)\big), \quad i \in [m]. \tag{17c}$$

It turns out that this method is very slow (and often unstable) in its native form for the application at hand. One can check that when system (1) has a solution, all the y_i variables converge to zero in steady state. Therefore, setting the y_i's to zero can speed up the convergence significantly. We use this modified version in Section VI, to compare with.

We should also note that the computational complexity of ADMM is O(pn) per iteration (the inverse is computed using the matrix inversion lemma), which is again the same as that of the gradient-type methods and APC.
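For concreteness, a minimal NumPy sketch of one round of the consensus-ADMM updates (17a)–(17c) is given below (our own illustration, not the authors' implementation; ξ is the penalty parameter, and the direct n×n solve stands in for the matrix-inversion-lemma implementation mentioned above).

import numpy as np

def consensus_admm_step(A_blocks, b_blocks, x_i, y_i, x_bar, xi):
    """One round of updates (17a)-(17c) for f_i(x) = 0.5*||A_i x - b_i||^2."""
    m = len(A_blocks)
    n = A_blocks[0].shape[1]
    for i in range(m):                                   # local x-updates (17a)
        rhs = A_blocks[i].T @ b_blocks[i] - y_i[i] + xi * x_bar
        x_i[i] = np.linalg.solve(A_blocks[i].T @ A_blocks[i] + xi * np.eye(n), rhs)
    x_bar = sum(x_i) / m                                 # master average (17b)
    for i in range(m):                                   # dual updates (17c)
        y_i[i] = y_i[i] + xi * (x_i[i] - x_bar)
    return x_i, y_i, x_bar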

E. Block Cimmino Method

The block Cimmino method [14]–[16], which is a parallel method specifically for solving linear systems of equations, is perhaps the closest algorithm in spirit to APC. It is, in a way, a distributed implementation of the so-called Kaczmarz method [18]. The convergence of the Cimmino method is slower by an order in comparison with APC (its convergence time is the square of that of APC), and it turns out that APC includes this method as a special case when γ = 1.

The block Cimmino method is the following:

$$r_i(t) = A_i^+\big(b_i - A_i x(t)\big), \quad i \in [m], \tag{18a}$$

$$x(t+1) = x(t) + \nu\sum_{i=1}^{m} r_i(t), \tag{18b}$$

where $A_i^+ = A_i^T(A_iA_i^T)^{-1}$ is the pseudoinverse of A_i.

Proposition 2: The APC method (Algorithm 1) includes the block Cimmino method as a special case for γ = 1.

Proof: When γ = 1, Eq. (3a) becomes

$$\begin{aligned} x_i(t+1) &= x_i(t) - P_i\big(x_i(t) - x(t)\big)\\ &= x_i(t) - \big(I - A_i^T(A_iA_i^T)^{-1}A_i\big)\big(x_i(t) - x(t)\big)\\ &= x(t) + A_i^T(A_iA_i^T)^{-1}A_i\big(x_i(t) - x(t)\big)\\ &= x(t) + A_i^T(A_iA_i^T)^{-1}\big(b_i - A_i x(t)\big). \end{aligned}$$

In the last equation, we used the fact that x_i(t) is always a solution to A_i x = b_i. Notice that the above equation is no longer an "update" in the usual sense, i.e., x_i(t+1) does not depend on x_i(t) directly. This can be further simplified using the pseudoinverse of A_i, $A_i^+ = A_i^T(A_iA_i^T)^{-1}$, as

$$x_i(t+1) = x(t) + A_i^+\big(b_i - A_i x(t)\big).$$

It is then easy to see from Cimmino's equation (18a) that

$$r_i(t) = x_i(t+1) - x(t).$$

Therefore, the update (18b) can be expressed as

$$\begin{aligned} x(t+1) &= x(t) + \nu\sum_{i=1}^{m} r_i(t)\\ &= x(t) + \nu\sum_{i=1}^{m}\big(x_i(t+1) - x(t)\big)\\ &= (1 - m\nu)x(t) + \nu\sum_{i=1}^{m} x_i(t+1), \end{aligned}$$

which is nothing but the same update rule as in (3b) with η = mν. ∎

It is not hard to show that the optimal rate of convergence of the Cimmino method is

$$\rho_{\mathrm{Cim}} = \frac{\kappa(X)-1}{\kappa(X)+1} \approx 1 - \frac{2}{\kappa(X)}, \tag{19}$$

which is slower (by an order) than that of APC ($\frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(X)}}$).

F. Consensus Algorithm of Mou et al.

As mentioned earlier, a projection-based consensus algorithm for solving linear systems over a network was recently proposed by Mou et al. [19], [20]. For the master-worker setting studied in this paper, the corresponding network would be a clique, and the algorithm reduces to

$$x_i(t+1) = x_i(t) + P_i\left(\frac{1}{m}\sum_{j=1}^{m} x_j(t) - x_i(t)\right), \quad i \in [m], \tag{20}$$

which is transparently equivalent to APC with γ = η = 1:

$$x_i(t+1) = x_i(t) + P_i\big(x(t) - x_i(t)\big), \quad i \in [m],$$

$$x(t+1) = \frac{1}{m}\sum_{i=1}^{m} x_i(t+1).$$

It is straightforward to show that the rate of convergence in this case is

$$\rho_{\mathrm{Mou}} = 1 - \mu_{\min}(X), \tag{21}$$

which is much slower than the block Cimmino method and APC. One can easily check that

$$1 - \mu_{\min}(X) \ge \frac{\kappa(X)-1}{\kappa(X)+1} \ge \frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1}.$$

Even though this algorithm is slow, it is useful for applications where a fully-distributed (networked) solution is desired. Another networked algorithm for solving a least-squares problem has been recently proposed in [22].

A summary of the convergence rates of all the related methods discussed in this section is provided in Table I.
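The closed-form rates collected in Table I can be compared numerically for a given problem instance. The short sketch below is our own (assuming NumPy, and that κ(A^TA) and the extreme eigenvalues of X are available); it simply evaluates each formula from this section, so that smaller values indicate faster methods.

import numpy as np

def convergence_rates(kappa_AtA, mu_min_X, mu_max_X):
    """Closed-form convergence rates from Section IV; smaller is faster."""
    kappa_X = mu_max_X / mu_min_X
    return {
        "DGD":        (kappa_AtA - 1) / (kappa_AtA + 1),                    # (12)
        "D-NAG":      1 - 2 / np.sqrt(3 * kappa_AtA + 1),                   # (14)
        "D-HBM":      (np.sqrt(kappa_AtA) - 1) / (np.sqrt(kappa_AtA) + 1),  # (16)
        "Mou et al.": 1 - mu_min_X,                                         # (21)
        "B-Cimmino":  (kappa_X - 1) / (kappa_X + 1),                        # (19)
        "APC":        (np.sqrt(kappa_X) - 1) / (np.sqrt(kappa_X) + 1),      # (7)
    }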

V. UNDERDETERMINED SYSTEM

In this section, we consider the case when N < n and rank(A) = N, i.e., the system is underdetermined and there are infinitely many solutions. We prove that in this case, each machine still converges to "a" (global) solution, and further, all the machines converge to the same solution. The convergence is again linear (i.e., the error decays exponentially fast), and the rate of convergence is similar to the previous case.

Recall that the matrix X ∈ R^{n×n} defined earlier can be written as

$$X = \frac{1}{m}\sum_{i=1}^{m} A_i^T(A_iA_i^T)^{-1}A_i = \frac{1}{m}A^T\begin{bmatrix} (A_1A_1^T)^{-1} & & \\ & \ddots & \\ & & (A_mA_m^T)^{-1} \end{bmatrix}A,$$

which is singular in this case. We define a new matrix Y ∈ R^{N×N} as

$$Y \triangleq \frac{1}{m}AA^T\begin{bmatrix} (A_1A_1^T)^{-1} & & \\ & \ddots & \\ & & (A_mA_m^T)^{-1} \end{bmatrix}, \tag{22}$$

which has the same nonzero eigenvalues as X.

Theorem 3: Suppose N < n and rank(A) = N. Each one of x_1(t), ..., x_m(t), x(t) in Algorithm 1 converges to a solution as fast as ρ^t converges to 0, as t → ∞, for some ρ ∈ (0, 1), if and only if (γ, η) ∈ S. Furthermore, the solutions converged to are the same. The optimal rate of convergence is

$$\rho = \frac{\sqrt{\kappa(Y)}-1}{\sqrt{\kappa(Y)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(Y)}}, \tag{23}$$

where κ(Y) = μ_max/μ_min is the condition number of Y, and the optimal parameters (γ*, η*) are the solution to the following equations:

$$\begin{cases} \mu_{\max}\,\eta\gamma = \big(1+\sqrt{(\gamma-1)(\eta-1)}\big)^2,\\[2pt] \mu_{\min}\,\eta\gamma = \big(1-\sqrt{(\gamma-1)(\eta-1)}\big)^2. \end{cases}$$

Proof: Let x* be a solution to Ax = b. We define error vectors e_i(t) = x_i(t) − x* for all i = 1, ..., m, and e(t) = x(t) − x*, as before, but this time show that Ae_i(t) → 0 and Ae(t) → 0. Recursion (3a) can be rewritten as

$$e_i(t+1) = e_i(t) + \gamma P_i\big(e(t) - e_i(t)\big), \quad i = 1,\dots,m,$$

as before. Since both x* and x_i(t) are solutions to A_i x = b_i, their difference e_i(t) is in the nullspace of A_i, and it remains unchanged under projection onto the nullspace. As a result, P_i e_i(t) = e_i(t), and we have

$$e_i(t+1) = (1-\gamma)e_i(t) + \gamma P_i e(t), \quad i = 1,\dots,m. \tag{24}$$

Similarly, the recursion (3b) can be expressed as

$$\begin{aligned} e(t+1) &= \frac{\eta}{m}\sum_{i=1}^{m} e_i(t+1) + (1-\eta)e(t) = \frac{\eta}{m}\sum_{i=1}^{m}\big((1-\gamma)e_i(t) + \gamma P_i e(t)\big) + (1-\eta)e(t)\\ &= \frac{\eta(1-\gamma)}{m}\sum_{i=1}^{m} e_i(t) + \left(\frac{\eta\gamma}{m}\sum_{i=1}^{m}P_i + (1-\eta)I_n\right)e(t), \end{aligned}$$

as before. Multiplying the recursions by A, we have

$$Ae_i(t+1) = (1-\gamma)Ae_i(t) + \gamma AP_i e(t), \quad i = 1,\dots,m,$$

and

$$Ae(t+1) = \frac{\eta(1-\gamma)}{m}\sum_{i=1}^{m} Ae_i(t) + \left(\frac{\eta\gamma}{m}\sum_{i=1}^{m}AP_i + (1-\eta)A\right)e(t).$$

Note that $P_i = I_n - A_i^T(A_iA_i^T)^{-1}A_i$, and we can express A_i as $A_i = \begin{bmatrix} 0_{p\times p} & \dots & I_p & \dots & 0_{p\times p}\end{bmatrix}A = E_iA$, where E_i is a p × N matrix with an identity at its i-th block and zeros everywhere else. Therefore, we have $AP_i = A - AA_i^T(A_iA_i^T)^{-1}E_iA = (I_N - AA_i^T(A_iA_i^T)^{-1}E_i)A$, and the recursions become

$$Ae_i(t+1) = (1-\gamma)Ae_i(t) + \gamma\big(I_N - AA_i^T(A_iA_i^T)^{-1}E_i\big)Ae(t),$$

for i = 1, ..., m, and

$$Ae(t+1) = \frac{\eta(1-\gamma)}{m}\sum_{i=1}^{m} Ae_i(t) + \left(\frac{\eta\gamma}{m}\sum_{i=1}^{m}\big(I_N - AA_i^T(A_iA_i^T)^{-1}E_i\big) + (1-\eta)I_N\right)Ae(t).$$

Stacking up all the m vectors Ae_i along with Ae, as an (m+1)N-dimensional vector, results in

$$\begin{bmatrix} Ae_1(t+1)\\ \vdots\\ Ae_m(t+1)\\ Ae(t+1) \end{bmatrix} = \begin{bmatrix} (1-\gamma)I_{mN} & \gamma\begin{bmatrix}P'_1\\ \vdots\\ P'_m\end{bmatrix}\\[4pt] \frac{\eta(1-\gamma)}{m}\begin{bmatrix}I_N & \dots & I_N\end{bmatrix} & M' \end{bmatrix} \begin{bmatrix} Ae_1(t)\\ \vdots\\ Ae_m(t)\\ Ae(t) \end{bmatrix},$$

where $M' = \frac{\eta\gamma}{m}\sum_{i=1}^{m}P'_i + (1-\eta)I_N$ and $P'_i = I_N - AA_i^T(A_iA_i^T)^{-1}E_i$.

The convergence rate of the algorithm is determined by the spectral radius (largest-magnitude eigenvalue) of this (m+1)N × (m+1)N matrix. The eigenvalues λ of this matrix are the solutions to the following characteristic equation:

$$\det\begin{bmatrix} (1-\gamma-\lambda)I_{mN} & \gamma\begin{bmatrix}P'_1\\ \vdots\\ P'_m\end{bmatrix}\\[4pt] \frac{\eta(1-\gamma)}{m}\begin{bmatrix}I_N & \dots & I_N\end{bmatrix} & \frac{\eta\gamma}{m}\sum_{i=1}^{m}P'_i + (1-\eta-\lambda)I_N \end{bmatrix} = 0.$$

As in the proof of Theorem 1, using the Schur complement and properties of the determinant, the characteristic equation can be simplified as follows:

$$\begin{aligned} 0 &= (1-\gamma-\lambda)^{mN}\,\det\!\left(\frac{\eta\gamma}{m}\sum_{i=1}^{m}P'_i + (1-\eta-\lambda)I_N - \frac{\eta(1-\gamma)\gamma}{(1-\gamma-\lambda)m}\sum_{i=1}^{m}P'_i\right)\\ &= (1-\gamma-\lambda)^{mN}\,\det\!\left(\frac{\eta\gamma}{m}\left(1-\frac{1-\gamma}{1-\gamma-\lambda}\right)\sum_{i=1}^{m}P'_i + (1-\eta-\lambda)I_N\right)\\ &= (1-\gamma-\lambda)^{mN}\,\det\!\left(\frac{-\eta\gamma\lambda}{(1-\gamma-\lambda)m}\sum_{i=1}^{m}P'_i + (1-\eta-\lambda)I_N\right)\\ &= (1-\gamma-\lambda)^{(m-1)N}\,\det\!\left(\frac{-\eta\gamma\lambda\sum_{i=1}^{m}P'_i}{m} + (1-\gamma-\lambda)(1-\eta-\lambda)I_N\right). \end{aligned}$$

Note that $\frac{1}{m}\sum_{i=1}^{m}P'_i = I_N - \frac{1}{m}\sum_{i=1}^{m}AA_i^T(A_iA_i^T)^{-1}E_i = I_N - \frac{1}{m}\big[AA_1^T(A_1A_1^T)^{-1},\dots,AA_m^T(A_mA_m^T)^{-1}\big] = I_N - Y$. There are (m−1)N eigenvalues equal to 1−γ, and the remaining 2N eigenvalues are the solutions to

$$0 = \det\big(-\eta\gamma\lambda(I-Y) + (1-\gamma-\lambda)(1-\eta-\lambda)I\big) = \det\big(\eta\gamma\lambda Y + \big((1-\gamma-\lambda)(1-\eta-\lambda)-\eta\gamma\lambda\big)I\big).$$

Notice that this is exactly the same as the one in the proof of Theorem 1, with X replaced by Y.

It follows that Ae_1(t), ..., Ae_m(t), Ae(t) converge to zero as fast as ρ^t if and only if (γ, η) ∈ S, and the optimal rate of convergence is

$$\rho = \frac{\sqrt{\kappa(Y)}-1}{\sqrt{\kappa(Y)}+1}.$$

Convergence of Ae_1(t), ..., Ae_m(t), Ae(t) to zero means that each machine and the master converge to a solution, but the solutions reached may not be the same. What remains to show is that the only steady state is the "consensus steady state." From (3), it is easy to see that the steady state x_1(∞), ..., x_m(∞), x(∞) satisfies the following equation:

$$\begin{cases} P_i\big(x(\infty) - x_i(\infty)\big) = 0, & i \in [m],\\ x(\infty) = \frac{1}{m}\sum_{i=1}^{m} x_i(\infty), \end{cases} \tag{25}$$

which can be written in matrix form as

$$\begin{bmatrix} P_1 & & & -P_1\\ & \ddots & & \vdots\\ & & P_m & -P_m\\ -\frac{I_n}{m} & \dots & -\frac{I_n}{m} & I_n \end{bmatrix} \begin{bmatrix} x_1(\infty)\\ \vdots\\ x_m(\infty)\\ x(\infty) \end{bmatrix} = 0. \tag{26}$$

Notice that for any $v \in \mathbb{R}^n$, the vector $\begin{bmatrix} v^T & \dots & v^T & v^T \end{bmatrix}^T$ is a solution to this equation, which corresponds to a consensus steady state $x_1(\infty) = \dots = x_m(\infty) = x(\infty) = v$. Therefore, the nullspace of the above matrix is at least n-dimensional, or in other words, it has at least n zero eigenvalues. We will argue that this matrix has only n zero eigenvalues, and therefore any steady-state solution must be a consensus. To find the eigenvalues λ we have to solve the following characteristic equation:

$$\det\begin{bmatrix} P_1 - \lambda I & & & -P_1\\ & \ddots & & \vdots\\ & & P_m - \lambda I & -P_m\\ -\frac{I}{m} & \dots & -\frac{I}{m} & (1-\lambda)I \end{bmatrix} = 0.$$

Once again, using the Schur complement, we have

$$0 = \left(\prod_{i=1}^{m}\det(P_i - \lambda I)\right)\det\!\left((1-\lambda)I - \frac{1}{m}\sum_{i=1}^{m}(P_i - \lambda I)^{-1}P_i\right).$$

Note that $(P_i - \lambda I)^{-1}P_i = \frac{1}{1-\lambda}P_i$. Therefore, we can write

$$\begin{aligned} 0 &= \left(\prod_{i=1}^{m}\det(P_i - \lambda I)\right)\det\!\left((1-\lambda)I - \frac{1}{m}\sum_{i=1}^{m}\frac{1}{1-\lambda}P_i\right)\\ &= \left(\prod_{i=1}^{m}(-\lambda)^p(1-\lambda)^{n-p}\right)\det\!\left((1-\lambda)I - \frac{1}{m}\sum_{i=1}^{m}\frac{1}{1-\lambda}P_i\right)\\ &= (-\lambda)^N(1-\lambda)^{mn-N}\det\!\left((1-\lambda)I - \frac{1}{m}\sum_{i=1}^{m}\frac{1}{1-\lambda}P_i\right), \end{aligned}$$

because P_i has p zero eigenvalues and n−p eigenvalues equal to one. Using the properties of the determinant, we further have

$$\begin{aligned} 0 &= (-\lambda)^N(1-\lambda)^{(m-1)n-N}\det\!\left((1-\lambda)^2 I - \frac{1}{m}\sum_{i=1}^{m}P_i\right)\\ &= (-\lambda)^N(1-\lambda)^{(m-1)n-N}\det\big((1-\lambda)^2 I - (I - X)\big)\\ &= (-\lambda)^N(1-\lambda)^{(m-1)n-N}\det\big((\lambda^2 - 2\lambda)I + X\big)\\ &= (-\lambda)^N(1-\lambda)^{(m-1)n-N}\prod_{i=1}^{n}\big(\lambda^2 - 2\lambda + \mu_i\big). \end{aligned}$$

Therefore, the (m+1)n eigenvalues are as follows: N zero eigenvalues, (m−1)n − N eigenvalues at 1, and the remaining 2n are $1 \pm \sqrt{1-\mu_i}$. Note that in the underdetermined case, X has n−N zero eigenvalues. Therefore, n−N of the $1 \pm \sqrt{1-\mu_i}$ are zero. As a result, the overall number of zero eigenvalues is N + (n−N) = n. This implies that the nullity of the matrix in (26) is n and any steady-state solution must be a consensus solution, which completes the proof. ∎
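As a quick numerical sanity check of the role of Y in Theorem 3 (our own sketch, assuming NumPy; the sizes are arbitrary), one can build X and Y for a small random underdetermined system and verify that their nonzero eigenvalues, and hence the κ(Y) that governs the rate, coincide.

import numpy as np

np.random.seed(0)
N, n, m, p = 40, 60, 4, 10           # underdetermined: N < n, p = N/m
A = np.random.randn(N, n)
blocks = [A[i * p:(i + 1) * p, :] for i in range(m)]

X = sum(Ai.T @ np.linalg.inv(Ai @ Ai.T) @ Ai for Ai in blocks) / m   # Eq. (4)
D = np.zeros((N, N))                 # block-diagonal of (A_i A_i^T)^{-1}
for i, Ai in enumerate(blocks):
    D[i * p:(i + 1) * p, i * p:(i + 1) * p] = np.linalg.inv(Ai @ Ai.T)
Y = A @ A.T @ D / m                  # Eq. (22)

eig_X = np.sort(np.linalg.eigvals(X).real)[-N:]   # the N nonzero eigenvalues of X
eig_Y = np.sort(np.linalg.eigvals(Y).real)
print(np.allclose(eig_X, eig_Y))     # True: same nonzero spectrum, same kappa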

VI. EXPERIMENTAL RESULTS

In this section, we evaluate the proposed method (APC) by comparing it with the other distributed methods discussed throughout the paper, namely DGD, D-NAG, D-HBM, modified ADMM, and the block Cimmino method. We use randomly-generated problems as well as real-world ones from the National Institute of Standards and Technology (NIST) repository, Matrix Market [23].

We first compare the rates of convergence ρ of the algorithms, where ρ is the spectral radius of the iteration matrix. To distinguish the differences, it is easier to compare the convergence times, defined as $T = \frac{1}{-\log\rho}\ (\approx \frac{1}{1-\rho})$. We tune the parameters in all of the methods to their optimal values, to make the comparison between the methods fair. Also, as mentioned before, all the algorithms have the same per-iteration computation and communication complexity. Table III shows the values of the convergence times for a number of synthetic and real-world problems with different sizes. It can be seen that APC has a much faster convergence, often by orders of magnitude. As expected from the analysis, APC's closest competitor is the distributed heavy-ball method. Notably, in randomly-generated problems, when the mean is not zero, the gap is much larger.

To further verify the performance of the proposed algorithm, we also run all the algorithms on multiple problems and observe the actual decay of the error. Fig. 2 shows the relative error (the distance from the true solution, divided by the true solution, in ℓ2 norm) for all the methods, on two examples from the repository. Again, to make the comparison fair, all the methods have been tuned to their optimal parameters. As one can see, APC outperforms the other methods by a wide margin, which is consistent with the order-of-magnitude differences in the convergence times of Table III. We should also remark that initialization does not seem to affect the convergence behavior of our algorithm. Lastly, we should mention that our experiments on cases where there are missing updates ("straggler" machines) indicate that APC is at least as robust as the other algorithms to these effects, and the convergence curves look qualitatively the same as in Fig. 2.

TABLE III: A comparison between the optimal convergence time T (= 1/(−log ρ)) of different methods on real and synthetic examples. Boldface values show the smallest convergence time. QC324: model of H2+ in an electromagnetic field. ORSIRR 1: oil reservoir simulation. ASH608: original Harwell sparse matrix test collection.

Fig. 2. The decay of the error for different distributed algorithms, on two real problems from Matrix Market [23] (QC324: model of H2+ in an electromagnetic field, and ORSIRR 1: oil reservoir simulation). n = # of variables, N = # of equations, m = # of workers, p = # of equations per worker.

VII. A DISTRIBUTED PRECONDITIONING TO IMPROVE GRADIENT-BASED METHODS

The noticeable similarity between the optimal convergence rate of APC ($\frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1}$) and that of D-HBM ($\frac{\sqrt{\kappa(A^TA)}-1}{\sqrt{\kappa(A^TA)}+1}$) suggests that there might be a connection between the two. It turns out that there is, and we propose a distributed preconditioning for D-HBM which makes it achieve the same convergence rate as APC. The algorithm works as follows.

Prior to starting the iterative process, each machine i can premultiply its own set of equations $A_i x = b_i$ by $(A_iA_i^T)^{-1/2}$, which can be done in parallel (locally) with O(p²n) operations. This transforms the global system of equations Ax = b into a new one, Cx = d, where

$$C = \begin{bmatrix} (A_1A_1^T)^{-1/2}A_1\\ \vdots\\ (A_mA_m^T)^{-1/2}A_m \end{bmatrix} \quad\text{and}\quad d = \begin{bmatrix} (A_1A_1^T)^{-1/2}b_1\\ \vdots\\ (A_mA_m^T)^{-1/2}b_m \end{bmatrix}.$$

The new system can then be solved using the distributed heavy-ball method, which will achieve the same rate of convergence as APC, i.e., $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, where $\kappa = \kappa(C^TC) = \kappa(X)$.
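A sketch of this local preconditioning step is given below (ours, assuming NumPy; the inverse square root is taken via an eigendecomposition of the p×p Gram matrix, which is one of several equivalent ways to compute it). Each machine would apply this transformation to its own block before running D-HBM on the transformed system.

import numpy as np

def precondition_block(Ai, bi):
    """Premultiply [A_i, b_i] by (A_i A_i^T)^{-1/2}, locally, in O(p^2 n) time."""
    G = Ai @ Ai.T                                    # p x p Gram matrix
    w, V = np.linalg.eigh(G)                         # G = V diag(w) V^T
    G_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return G_inv_sqrt @ Ai, G_inv_sqrt @ bi          # machine i's rows of C and d

# Running D-HBM on the stacked (C, d) then attains the APC rate,
# since kappa(C^T C) = kappa(X).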

VIII. CONCLUSION

We considered the problem of solving a large-scale system of linear equations by a taskmaster with the help of a number of computing machines/cores, in a distributed way. We proposed an accelerated projection-based consensus algorithm for this problem, and fully analyzed its convergence rate. Analytical and experimental comparisons with the other known distributed methods confirm significantly faster convergence of the proposed scheme. Finally, our analysis suggested a novel distributed preconditioning for improving the convergence of the distributed heavy-ball method to achieve the same theoretical performance as the proposed consensus-based method.

We should finally remark that while the setting studied in this paper was a master-workers one, the same algorithm can be implemented in a networked setting where there is no central collector/master, using a "distributed averaging" approach ([24], [25]).

REFERENCES

[1] N. Azizan-Ruhi, F. Lahouti, S. Avestimehr, and B. Hassibi, "Distributed solution of large-scale linear systems via accelerated projection-based consensus," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2018, pp. 6358–6362.

[2] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, "Parallelized stochastic gradient descent," in Proc. Advances Neural Inf. Process. Syst., 2010, pp. 2595–2603.

[3] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 693–701.

[4] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," SIAM J. Optim., vol. 26, no. 3, pp. 1835–1854, 2016.

[5] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k²)," Sov. Math. Doklady, vol. 27, no. 2, pp. 372–376, 1983.

[6] B. T. Polyak, "Some methods of speeding up the convergence of iteration methods," USSR Comput. Math. Math. Phys., vol. 4, no. 5, pp. 1–17, 1964.

[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.

[8] B. He and X. Yuan, "On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method," SIAM J. Numer. Anal., vol. 50, no. 2, pp. 700–709, 2012.

[9] W. Deng and W. Yin, "On the global and linear convergence of the generalized alternating direction method of multipliers," J. Scientif. Comput., vol. 66, no. 3, pp. 889–916, 2016.

[10] R. Zhang and J. T. Kwok, "Asynchronous distributed ADMM for consensus optimization," in Proc. Int. Conf. Mach. Learn., 2014, pp. 1701–1709.

[11] J. F. Mota, J. M. Xavier, P. M. Aguiar, and M. Puschel, "D-ADMM: A communication-efficient distributed algorithm for separable optimization," IEEE Trans. Signal Process., vol. 61, no. 10, pp. 2718–2723, May 2013.

[12] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Trans. Signal Process., vol. 62, no. 7, pp. 1750–1761, Apr. 2014.

[13] L. Majzoubi and F. Lahouti, "Analysis of distributed ADMM algorithm for consensus optimization in presence of error," in Proc. IEEE Conf. Audio, Speech, Signal Process., Mar. 2016, pp. 4831–4835.

[14] I. S. Duff, R. Guivarch, D. Ruiz, and M. Zenadi, "The augmented block Cimmino distributed method," SIAM J. Scientif. Comput., vol. 37, no. 3, pp. A1248–A1269, 2015.

[15] F. Sloboda, "A projection method of the Cimmino type for linear algebraic systems," Parallel Comput., vol. 17, no. 4/5, pp. 435–442, 1991.

[16] M. Arioli, I. Duff, J. Noailles, and D. Ruiz, "A block projection method for sparse matrices," SIAM J. Scientif. Statist. Comput., vol. 13, no. 1, pp. 47–70, 1992.

[17] R. Bramley and A. Sameh, "Row projection methods for large nonsymmetric linear systems," SIAM J. Scientif. Statist. Comput., vol. 13, no. 1, pp. 168–193, 1992.

[18] S. Kaczmarz, "Angenäherte Auflösung von Systemen linearer Gleichungen," Bulletin Int. de l'Academie Polonaise des Sci. et des Lettres, vol. 35, pp. 355–357, 1937.

[19] J. Liu, S. Mou, and A. S. Morse, "An asynchronous distributed algorithm for solving a linear algebraic equation," in Proc. 52nd IEEE Annu. Conf. Decis. Control, 2013, pp. 5409–5414.

[20] S. Mou, J. Liu, and A. S. Morse, "A distributed algorithm for solving a linear algebraic equation," IEEE Trans. Autom. Control, vol. 60, no. 11, pp. 2863–2878, Nov. 2015.

[21] L. Lessard, B. Recht, and A. Packard, "Analysis and design of optimization algorithms via integral quadratic constraints," SIAM J. Optim., vol. 26, no. 1, pp. 57–95, 2016.

[22] X. Wang, J. Zhou, S. Mou, and M. J. Corless, "A distributed algorithm for least squares solutions," IEEE Trans. Autom. Control, 2019, doi: 10.1109/TAC.2019.2894588.

[23] "Matrix Market," 2007. [Online]. Available: http://math.nist.gov/MatrixMarket/. Accessed: May 2017.

[24] J. N. Tsitsiklis, "Problems in decentralized decision making and computation," Ph.D. dissertation, Laboratory Inf. Decis. Syst., Massachusetts Inst. Technol., Cambridge, MA, USA, 1984.

[25] L. Xiao and S. Boyd, "Fast linear iterations for distributed averaging," Syst. Control Lett., vol. 53, no. 1, pp. 65–78, 2004.

Navid Azizan-Ruhi (S'15) received the B.S. degree from Sharif University of Technology, Tehran, Iran, and the M.S. degree from the University of Southern California, Los Angeles, CA, USA, in 2013 and 2015, respectively, both in electrical engineering. He is currently working toward the Ph.D. degree in computing and mathematical sciences with the California Institute of Technology, Pasadena, CA, USA. His research interests span optimization, machine learning, complex networks, and distributed systems.

He was the first-place winner and a gold medalist at the 21st National Physics Olympiad in Iran. His work has been recognized with several awards, including the 2016 ACM GREENMETRICS Best Student Paper Award, the Amazon Fellowship in Artificial Intelligence, the PIMCO Graduate Fellowship, the Computing and Mathematical Sciences Fellowship, the Annenberg Graduate Fellowship, and the Sharif University of Technology President's Award.

Farshad Lahouti (SM'13) received the B.Sc. degree in electrical engineering from the University of Tehran, Tehran, Iran, in 1997, and the Ph.D. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 2002. In 2005, he joined the School of Electrical and Computer Engineering, University of Tehran, as a Faculty Member, where he founded the Center for Wireless Multimedia Communications. He was the Head of the Communications Engineering Department from 2008 to 2012. He is currently a Visiting Professor of Electrical Engineering with the California Institute of Technology, Pasadena, CA, USA, where he initiated the Digital Ventures Design Program. His current research interests are coding and information theory, signal processing, and communication theory with applications to wireless networks, cyber-physical systems and man–machine symbiosis, and biological and neuronal networks. He received the Distinguished Scientist Award from the Iran National Academy of Sciences in 2014.


Amir Salman Avestimehr (SM'17) received the B.S. degree in electrical engineering from the Sharif University of Technology, Tehran, Iran, in 2003, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of California at Berkeley, Berkeley, CA, USA, in 2005 and 2008, respectively. He was a Postdoctoral Scholar with the Center for the Mathematics of Information, California Institute of Technology, in 2008. He is currently an Associate Professor with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA. His research interests include information theory, communications, distributed computing, and data analytics. He has received a number of awards, including the Communications Society and Information Theory Society Joint Paper Award, the Presidential Early Career Award for Scientists and Engineers for pushing the frontiers of information theory through its extension to complex wireless information networks, the Young Investigator Program Award from the U.S. Air Force Office of Scientific Research, the National Science Foundation CAREER Award, and the David J. Sakrison Memorial Prize for outstanding research. He was a recipient (as Advisor) of the Qualcomm Innovation Award and the Best Student Paper Award at the IEEE International Symposium on Information Theory. He has been a Guest Associate Editor of the IEEE TRANSACTIONS ON INFORMATION THEORY Special Issue on Interference Networks, and a General Co-Chair of the 2012 North America Information Theory Summer School, the 2012 Workshop on Interference Networks, and the 2020 IEEE International Symposium on Information Theory. He is currently an Associate Editor of the IEEE TRANSACTIONS ON INFORMATION THEORY.

Babak Hassibi (M'08) was born in Tehran, Iran, in 1967. He received the B.S. degree from the University of Tehran, Tehran, Iran, in 1989, and the M.S. and Ph.D. degrees from Stanford University, Stanford, CA, USA, in 1993 and 1996, respectively, all in electrical engineering.

Since January 2001, he has been with the California Institute of Technology, Pasadena, CA, USA, where he is currently the Mose and Lilian S. Bohn Professor of Electrical Engineering. From 2013 to 2016, he was the Gordon M. Binder/Amgen Professor of Electrical Engineering, and from 2008 to 2015, he was an Executive Officer of Electrical Engineering, as well as the Associate Director of Information Science and Technology. From October 1996 to October 1998, he was a Research Associate with the Information Systems Laboratory, Stanford University, and from November 1998 to December 2000, he was a Member of the Technical Staff with the Mathematical Sciences Research Center, Bell Laboratories, Murray Hill, NJ, USA. He has also held short-term appointments at Ricoh California Research Center, the Indian Institute of Science, and Linkoping University, Sweden. He is the coauthor of the books (both with A. H. Sayed and T. Kailath) Indefinite Quadratic Estimation and Control: A Unified Approach to H2 and H∞ Theories (SIAM, 1999) and Linear Estimation (Prentice Hall, 2000). His research interests include communications and information theory, control and network science, and signal processing and machine learning. He is a recipient of an Alborz Foundation Fellowship, the 1999 O. Hugo Schuck Best Paper Award of the American Automatic Control Council (with H. Hindi and S. P. Boyd), the 2002 National Science Foundation Career Award, the 2002 Okawa Foundation Research Grant for Information and Telecommunications, the 2003 David and Lucille Packard Fellowship for Science and Engineering, the 2003 Presidential Early Career Award for Scientists and Engineers (PECASE), and the 2009 Al-Marai Award for Innovative Research in Communications, and was a participant in the 2004 National Academy of Engineering "Frontiers in Engineering" program.

He has been a Guest Editor for the IEEE TRANSACTIONS ON INFORMATION THEORY Special Issue on "Space-time transmission, reception, coding and signal processing," was an Associate Editor for Communications of the IEEE TRANSACTIONS ON INFORMATION THEORY during 2004–2006, and is currently an Editor for the journal Foundations and Trends in Information and Communication and for the IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING. He is an IEEE Information Theory Society Distinguished Lecturer for 2016–2017.