arXiv:2103.12840v1 [cs.RO] 23 Mar 2021



A Survey of Distributed Optimization Methods for Multi-Robot Systems

Trevor Halsted1, Ola Shorinwa1, Javier Yu2, Mac Schwager2

Abstract—Distributed optimization consists of multiple computation nodes working together to minimize a common objective function through local computation iterations and network-constrained communication steps. In the context of robotics, distributed optimization algorithms can enable multi-robot systems to accomplish tasks in the absence of centralized coordination. We present a general framework for applying distributed optimization as a module in a robotics pipeline. We survey several classes of distributed optimization algorithms and assess their practical suitability for multi-robot applications. We further compare the performance of different classes of algorithms in simulations for three prototypical multi-robot problem scenarios. The Consensus Alternating Direction Method of Multipliers (C-ADMM) emerges as a particularly attractive and versatile distributed optimization method for multi-robot systems.

Index Terms—distributed optimization, multi-robot systems

I. INTRODUCTION

Distributed optimization is the problem of minimizing a joint objective function consisting of a sum of several local objective functions, each corresponding to a computational node. While distributed optimization has been a longstanding topic of research in the optimization community (e.g., [1], [2]), its usage in robotics is limited to only a handful of examples. However, distributed optimization techniques have important implications for multi-robot systems, as we can cast many of the key tasks in this area, including cooperative estimation [3], multi-agent learning [4], and collaborative motion planning [5], as distributed optimization problems. The distributed optimization formulation offers a flexible and powerful paradigm for deriving efficient and distributed algorithms for many multi-robot problems. In this survey, we provide an overview of the distributed optimization literature, contextualize it for applications in robotics, discuss best practices for applying distributed optimization to robotic systems, and evaluate several algorithms in simulations for fundamental robotics problems.

We specifically consider optimization problems over real-valued decision variables (we do not consider discrete optimization, i.e., integer programs or mixed integer programs), and we assume that the robots communicate over a mesh network without central coordination. Oftentimes, in the context of robotics, optimization is performed on-line within

*This project was funded in part by DARPA YFA award D18AP00064, and NSF NRI awards 1830402 and 1925030. The first author was supported on an NDSEG Fellowship, and the third author was supported on an NSF Graduate Research Fellowship.

** The first three authors contributed equally.
1 Department of Mechanical Engineering, Stanford University, Stanford, CA 94305, USA, {halsted, shorinwa}@stanford.edu
2 Department of Aeronautics and Astronautics, Stanford University, Stanford, CA 94305, USA, {javieryu, schwager}@stanford.edu

(a) Optimization by one robot yields the solution given only that robot’s observations.

(b) Using distributed optimization, each robot obtains the optimal solution resulting from all robots’ observations.

Fig. 1. A motivation for distributed optimization: consider an estimation scenario in which a robot seeks to localize a target given sensor measurements. The robot can compute an optimal solution given only its observations, as represented in (a). By using distributed optimization techniques, each robot in a networked system of robots can compute the optimal solution given all robots’ observations without actually sharing individual sensor models or measurements with one another, as represented in (b).

another iterative algorithm, e.g. for planning and control—as in Model Predictive Control (MPC), or perception—as in on-line Simultaneous Localization and Mapping (SLAM). Hence one can understand the algorithms in this paper as being run within a single time step of another iterative algorithm.

In evaluating the usefulness of distributed optimization algorithms for multi-robot applications, we focus on methods that permit robots to use local robot-to-robot communication to compute solutions that are of the same quality as those that would have been obtained were all the robots’ observations available on one central computer. For convex problems, the robots all obtain a common globally optimal solution using distributed optimization. More generally, for non-convex problems, all robots obtain a common solution which may be locally optimal, but still of the same quality as that which would have been obtained by a single computer with all problem data.

In this survey, we provide a taxonomy of the range of different algorithms for performing distributed optimization based on their defining mathematical characteristics, and categorize the distributed optimization algorithms into three classes: distributed gradient descent, distributed sequential convex programming, and distributed extensions to the alternating direction method of multipliers (ADMM). We do not discuss zeroth-order methods for distributed optimization [6], [7], [8].

Distributed Gradient Descent: The most common class of distributed optimization methods is based on the idea of averaging local gradients computed by each computational node to perform an approximate gradient descent [9], and in this work we refer to them as distributed gradient descent (DGD) methods. Accelerated DGD algorithms, like [10], offer significantly faster convergence rates over basic DGD. Furthermore, DGD methods have been shown to converge to the optimal solution on non-differentiable convex functions with subgradients [11] and with a push-sum averaging scheme [12], making them well suited for a broad range of applications.

Distributed Sequential Convex Programming: Sequential convex optimization is a common technique in centralized optimization that involves minimizing a sequence of convex approximations to the original (usually non-convex) problem. Under certain conditions, the sequence of sub-problems converges to a local optimum of the original problem. Newton’s method and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method are common examples. The same concepts are used by a number of distributed optimization algorithms, and we refer to these algorithms as distributed sequential convex programming methods. Generally, these methods use consensus techniques to construct the convex approximations of the joint objective function. One example is the Network Newton method [13], which uses consensus to approximate the inverse Hessian of the objective to construct a quadratic approximation of the joint problem. The NEXT family of algorithms [14] provides a flexible framework which can utilize a variety of convex surrogate functions to approximate the joint problem, and is specifically designed to optimize non-convex objective functions.
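To ground the idea of minimizing a sequence of convex approximations, the sketch below runs (centralized) Newton's method, where each iterate minimizes the local quadratic surrogate built from the gradient and Hessian. The objective f(x) = e^x + x^2 is an illustrative choice, not an example from the survey.

```python
import math

# Newton's method viewed as sequential convex programming: at the current
# iterate x_k we minimize the quadratic surrogate
#   m_k(x) = f(x_k) + f'(x_k)(x - x_k) + 0.5 * f''(x_k) * (x - x_k)^2,
# whose minimizer is x_k - f'(x_k)/f''(x_k).

def f(x):
    return math.exp(x) + x ** 2

def grad(x):
    return math.exp(x) + 2.0 * x

def hess(x):
    return math.exp(x) + 2.0

def newton_scp(x0, iters=20):
    x = x0
    for _ in range(iters):
        x -= grad(x) / hess(x)  # minimizer of the quadratic surrogate at x
    return x

x_star = newton_scp(2.0)
```

For non-convex objectives the surrogate Hessian must be kept positive definite (e.g., by damping), which is the role played by the consensus-based curvature approximations in the distributed methods just described.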

Alternating Direction Method of Multipliers (ADMM): The last class of algorithms covered in this paper are based on the alternating direction method of multipliers (ADMM) [1]. ADMM works by minimizing the augmented Lagrangian of the optimization problem using alternating updates to the primal and dual variables [15]. The method is naturally amenable to constrained problems. The original method is distributed, but not in the sense we consider in this survey. Specifically, the original ADMM requires a central computation hub to collect all local primal computations from the nodes to perform a centralized dual update step. ADMM was first modified to remove this requirement for a central node in [16], where it was used for distributed signal processing. The algorithm from [16] has since become known as Consensus ADMM (C-ADMM), although that paper does not introduce this terminology. C-ADMM was later shown to have linear convergence rates on all strongly convex distributed optimization problems [17]. We find C-ADMM outperforms the other optimization algorithms that we tested in our three problem scenarios in terms of convergence speed, as well as computational and communication efficiency. In addition, we find that it is less sensitive to the choice of hyper-parameters than the other methods we tested. Therefore, C-ADMM emerges as an attractive option for problems in multi-robot systems.
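As a minimal illustration of the consensus variant, the sketch below applies C-ADMM-style alternating primal and dual updates to the scalar problem min_x Σ_i ½(x − a_i)², whose optimum is the average of the a_i. The graph, data, and penalty weight ρ are illustrative assumptions, and the closed-form primal step is specific to the quadratic local costs.

```python
# Each robot i keeps a primal estimate x_i and a dual variable y_i and
# alternates: (1) a primal step that minimizes its augmented local cost
#   0.5*(x - a_i)^2 + y_i*x + rho * sum_j (x - (x_i + x_j)/2)^2
# (closed form for a quadratic), then (2) a dual ascent step on the
# disagreement with its neighbors.

neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}    # complete graph on 3 robots
a = {0: 0.0, 1: 2.0, 2: 7.0}                     # local data; optimum = 3.0
rho = 1.0                                        # penalty weight (assumed)

x = {i: 0.0 for i in neighbors}                  # primal variables
y = {i: 0.0 for i in neighbors}                  # dual variables

for _ in range(200):
    x_prev = dict(x)
    for i, nbrs in neighbors.items():
        # closed-form primal argmin for the quadratic local cost
        s = sum(x_prev[i] + x_prev[j] for j in nbrs)
        x[i] = (a[i] - y[i] + rho * s) / (1.0 + 2.0 * rho * len(nbrs))
    for i, nbrs in neighbors.items():
        # dual ascent on the disagreement with neighbors
        y[i] += rho * sum(x[i] - x[j] for j in nbrs)
```

All three x_i converge to the minimizer of the joint cost without any robot ever seeing another robot's local data a_j, only its neighbors' iterates.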

Many problem features affect the convergence rates of each of these methods, and convergence guarantees are often qualified by the convexity of the underlying joint optimization problem. Essential to applying these algorithms is understanding their limitations and performance trade-offs. For instance, some methods are better suited for handling problems with constraints, but have a higher computational complexity or communication overhead.

Numerical results in this survey implement select algorithms from each of our taxonomic classes, and use them to solve three fundamental problems in multi-robot systems: cooperative target tracking, planning for coordinated package delivery, and cooperative multi-robot range-only mapping. Typically, distributed optimization algorithms are designed to solve convex problems, but often the optimization problems that arise in robotics applications are non-convex. The goal of these numerical simulations is to compare the performance of these algorithms not only in convex optimization, where their convergence is often guaranteed, but also in non-convex problems, where it is not.

A. Relevance to Robotics

While the field of distributed optimization is well-developed, its application to robotics problems remains nascent. As a result, many existing distributed optimization methods do not cater to the specific challenges that arise in robotics problems. Generally, robotics problems involve constrained non-convex optimization problems, an area not explored by many distributed optimization methods. Further, robots typically have limited access to significant computation and communication resources, placing greater importance on efficient optimization methods with low overhead. Many distributed optimization methods do not consider these issues, with the assumption that agents have access to sufficient and reliable computation and communication infrastructure. For relevance to robotics problems, we specifically note methods that work on constrained problems and quantify the relative computation and communication costs incurred by distributed optimization methods.

In much of the literature, distributed optimization algorithms are compared on a convergence per iteration basis (per update to the decision variable). However, this scheme often obfuscates critical information for applications to robotics like local computation time, parameter sensitivity, and communication overhead. As part of the numerical results, this survey presents a more structured approach to analyzing the strengths and weaknesses of these algorithms in terms of metrics that matter for robotics research.

B. Existing Surveys

A number of other recent surveys on distributed optimization exist, and provide useful background when working with the algorithms covered in this survey. The survey [18] covers applications of distributed optimization in distributed power systems, and [19] focuses on the application of predominantly first-order methods to solving problems in multi-agent control. The article [20] broadly addresses


communication-computation trade-offs in distributed optimization, again focused mainly on first-order methods. Another survey [21] covers exclusively non-convex optimization in both batch and data-streaming contexts, but again only analyzes first-order methods. Finally, [22] covers a wide breadth of distributed optimization algorithms with a variety of assumptions, focusing exclusively on convex optimization tasks. This survey differs from all of these in that it specifically targets optimization applications in robotics, and provides a condensed taxonomic overview of useful methods for these applications.

Other useful background material can be found on distributed computation in [23], [24], and on multi-robot systems in [25], [26].

C. Contributions

This survey paper has four primary objectives:
1) Provide a unifying formulation for the general distributed optimization problem.
2) Develop a taxonomy for the different classes of distributed optimization algorithms.
3) Compare the performance of distributed optimization algorithms for applications in robotics with varying levels of difficulty.
4) Propose open research problems in distributed optimization for robotics.

D. Organization

In Section II, we present the general formulation for the distributed optimization problem, and give insight into the basic assumptions typically made while developing distributed optimization algorithms. Sections III–V provide greater detail on our algorithm classifications, and include details for representative algorithms. Finally, Section VII gives performance comparisons for select algorithms on a range of different optimization problems, specifically highlighting communication versus computation trade-offs. In Sec. VIII we discuss open research problems in distributed optimization applied to multi-robot systems and robotics in general, and we offer concluding remarks in Sec. IX.

II. PROBLEM FORMULATION

In distributed optimization in multi-robot systems, robots perform communication and computation steps to minimize some global cost function. We focus on problems in which the robots’ exchange of information must respect the topology of an underlying distributed communication graph. This communication graph, denoted as G = (V, E), consists of vertices V = {1, . . . , n} and edges E ⊆ V × V over which pairwise communication can occur. In general, we assume that the communication graph is connected (there exists some path of edges from any robot i to any other robot j) and undirected (if i can send information to j, then j can send information to i), but place no other assumptions on its structure.

In the general distributed optimization formulation, each robot i ∈ V can compute its local cost function f_i(x) but cannot directly compute the local cost functions of the other

Algorithm 1 General Distributed Optimization Framework

1: function DISTOPT(f_i, P_i^(0), Q_i^(0), R_i^(0) ∀i ∈ V)
2:   k ← 0
3:   while stopping criterion is not satisfied do
4:     for i ∈ V do (in parallel)
5:       Communicate Q_i^(k) to all j ∈ N_i
6:       Receive Q_j^(k) from all j ∈ N_i
7:       Compute P_i^(k+1), Q_i^(k+1), R_i^(k+1)
8:     end for
9:     k ← k + 1
10:  end while
11: end function

robots. The robots’ collective objective is to minimize the global cost function
\[
f(x) = \sum_{i \in V} f_i(x) \tag{1}
\]
despite the limitations on the local information at each robot. The problem in (1) is separable in that it consists of the sum of the local cost functions.

In addition to local knowledge of the local objective functions, we assume that constraints on the optimization variable are only known locally as well. Therefore, the general form of the global problem is
\[
\begin{aligned}
\min_{x} \quad & \sum_{i \in V} f_i(x) \\
\text{subject to} \quad & g_i(x) = 0 \quad \forall i \in V \\
& h_i(x) \leq 0 \quad \forall i \in V.
\end{aligned} \tag{2}
\]

In this paper, we evaluate distributed algorithms on several classes of the separable optimization problem, including unconstrained linear least-squares, constrained convex, and constrained non-convex optimization problems.

Before describing the specific algorithms that solve distributed optimization problems, we first consider the general framework that all of these approaches share. Each algorithm progresses over discrete iterations k = 0, 1, . . . until convergence. Besides assuming that each robot has the sole capability of evaluating its local cost function f_i, we also distinguish between the “private” variables P_i^(k) that the robot computes at each iteration k and the “public” variables Q_i^(k) that the robot communicates to its neighbors. Each algorithm also involves parameters R_i^(k), which generally require coordination among all of the robots but can typically be assigned before deployment of the system. Algorithm 1 describes the general framework for distributed optimization, in which each iteration k involves a communication step and a computation step.
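The loop in Algorithm 1 maps directly onto a few lines of code. In the sketch below, dist_opt is the generic communicate-then-compute round structure; the neighborhood-averaging compute step is a hypothetical placeholder used only to make the skeleton runnable, not one of the surveyed methods.

```python
# Generic synchronous framework: each robot i exposes public variables Q_i,
# receives its neighbors' Q_j, and applies a local compute step.

def dist_opt(compute, Q0, neighbors, max_iters=50):
    Q = dict(Q0)
    for _ in range(max_iters):
        # communication step: robot i receives Q_j from every j in N_i
        received = {i: [Q[j] for j in neighbors[i]] for i in Q}
        # computation step: all robots update in parallel
        Q = {i: compute(i, Q[i], received[i]) for i in Q}
    return Q

def averaging_step(i, q_i, q_neighbors):
    # placeholder compute step: move to the neighborhood average
    vals = [q_i] + q_neighbors
    return sum(vals) / len(vals)

line_graph = {0: [1], 1: [0, 2], 2: [1]}
result = dist_opt(averaging_step, {0: 0.0, 1: 3.0, 2: 9.0}, line_graph)
```

After enough rounds the public variables agree across robots; a full instantiation would also carry the private variables P_i and parameters R_i through the compute step.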

In order for each robot to find the joint solution to (2) in a distributed manner, the update steps in Algorithm 1 must allow the robots to cooperatively balance the costs accrued in each local cost function. From the perspective of a single robot, the update equations represent a trade-off between optimality of its individual solution based on its local cost function and agreement with its neighbors. In this paper, we distinguish


between two distinct perspectives on how this agreement is achieved.

In the following sections, we discuss three broad classes of distributed optimization methods, including distributed gradient descent, distributed sequential convex programming, and the alternating direction method of multipliers. In distributed gradient descent methods, the robots update their local variables using their local gradients and obtain the same solution of the global problem through consensus on their local variables. Distributed sequential convex programming methods take a similar approach but involve the problem Hessian, which provides information on the function curvature when updating the local variables.

The alternating direction method of multipliers takes a different approach by explicitly enforcing agreement between the robots’ local variables before relaxing these constraints using the problem’s Lagrangian. In this approach, we consider the reformulation of (2) to distinguish between each robot’s estimate of the decision variable x_i:
\[
\begin{aligned}
\min_{x_i, \forall i \in V} \quad & \sum_{i \in V} f_i(x_i) \\
\text{subject to} \quad & x_i = x_j \quad \forall (i, j) \in E \\
& g_i(x_i) = 0 \quad \forall i \in V \\
& h_i(x_i) \leq 0 \quad \forall i \in V
\end{aligned} \tag{3}
\]
where the agreement constraints are only enforced between neighboring robots. We discuss distributed gradient descent before proceeding with a discussion on distributed sequential convex programming and the alternating direction method of multipliers.

III. DISTRIBUTED GRADIENT DESCENT

The optimization problem in (2) (in its unconstrained form) can be solved through gradient descent, where the optimization variable is updated using
\[
x^{(k+1)} = x^{(k)} - \alpha^{(k)} \nabla f(x^{(k)}) \tag{4}
\]
with \nabla f(x^{(k)}) denoting the gradient of the objective function, \nabla f(x) = \sum_{i \in V} \nabla f_i(x), given some scheduled step-size \alpha^{(k)}. Inherently, computation of \nabla f(x^{(k)}) requires knowledge of the local objective functions or gradients by all robots in the network, which is infeasible in many problems.

Distributed Gradient Descent (DGD) methods extend the centralized gradient scheme to the distributed setting, where robots communicate locally without necessarily having knowledge of the local objective functions or gradients of all robots. In DGD methods, each robot updates its local variable using a weighted combination of the local variables or gradients of its neighbors according to the weights specified by a weighting matrix W, allowing for the dispersion of information on the objective function or its gradient through the network. In general, the weighting matrix W reflects the topology of the communication network, with non-zero weights existing between pairs of neighboring robots. The weighting matrix exerts a significant influence on the convergence rates of DGD methods, and thus an appropriate choice of these weights is required for convergence of DGD methods.

Algorithm 2 Distributed Gradient Descent (DGD)

Private variables: P_i = ∅
Public variables: Q_i^(k) = x_i^(k)
Parameters: R_i^(k) = (α^(k), w_i)
Update equations:
\[
x_i^{(k+1)} = w_{ii} x_i^{(k)} + \sum_{j \in N_i} w_{ij} x_j^{(k)} - \alpha^{(k)} \nabla f_i(x_i^{(k)})
\]
\[
\alpha^{(k+1)} = \frac{\alpha^{(0)}}{\sqrt{k}}
\]

Many DGD methods use a doubly stochastic matrix W, a row-stochastic matrix A [27], or a column-stochastic matrix B, depending on the model of the communication network considered, while other methods use a push-sum approach. In addition, many methods further require symmetry of the doubly stochastic weighting matrix, with W^⊤ 1 = 1 and W = W^⊤.

The use of doubly stochastic matrices is supported by a rich body of work on the convergence of consensus algorithms of this form. (As observed in several works including [28], the standard average consensus problem is equivalent to using (5) to solve the distributed optimization problem (2) when the local cost functions f_i are constant.)
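One standard recipe for constructing such a doubly stochastic W on an undirected graph is the Metropolis–Hastings rule, w_ij = 1/(1 + max(d_i, d_j)) for each edge (i, j), with the self-weight w_ii absorbing the remainder; it requires only local degree information. The 3-robot path graph below is an illustrative assumption.

```python
# Metropolis-Hastings weights: symmetric, rows sum to one, and therefore
# doubly stochastic; each robot needs only its own and its neighbors' degrees.

def metropolis_weights(neighbors):
    deg = {i: len(nbrs) for i, nbrs in neighbors.items()}
    W = {i: {} for i in neighbors}
    for i, nbrs in neighbors.items():
        for j in nbrs:
            W[i][j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i][i] = 1.0 - sum(W[i][j] for j in nbrs)  # self-weight remainder
    return W

W = metropolis_weights({0: [1], 1: [0, 2], 2: [1]})  # path graph 0 - 1 - 2
```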

DGD methods generally require decreasing step-sizes for convergence to an optimal solution of the problem, which negatively impacts their convergence rates; however, these methods have been modified to develop gradient descent methods with fixed step-sizes. Next, we discuss these distributed gradient descent variants.

A. Decreasing Step-Size DGD

Tsitsiklis introduced a model for DGD in the 1980s in [2] and [9] (see also [23]). The works of Nedic and Ozdaglar in [11] revisit the problem, marking the beginning of interest in consensus-based frameworks for distributed optimization over the past decade. This basic model of DGD consists of an update term that involves consensus on the optimization variable as well as a step in the direction of the local gradient for each node:
\[
x_i(k+1) = w_{ii} x_i(k) + \sum_{j \in N_i} w_{ij} x_j(k) - \alpha_i(k) \nabla f_i(x_i(k)) \tag{5}
\]
where robot i updates its variable using a weighted combination of its neighbors’ variables determined by the weights w_{ij}, with \alpha_i(k) denoting its local step-size at iteration k.

For convergence to the optimal joint solution, these methods require the step-size to asymptotically decay to zero. As proven in [28], scheduling the step-size according to the rules \sum_{k=1}^{\infty} \alpha_i(k) = \infty and \sum_{k=1}^{\infty} \alpha_i(k)^2 < \infty, with all robots’ step-sizes equal, guarantees the asymptotic convergence of the robots’ optimization variables to the optimal joint solution, given the standard assumptions of a connected network, properly chosen weights, and bounded (sub)gradients. Alternatively, the choice of a constant step-size for all timesteps only guarantees convergence of each robot’s iterates to a neighborhood of the optimal joint solution.
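A toy run of the decreasing step-size DGD update (5) illustrates this behavior on the scalar problem min_x Σ_i ½(x − a_i)² over a 3-robot path graph, whose joint optimum is the average of the a_i. The Metropolis weights, the data a_i, and the schedule α(k) = α(0)/√k are illustrative assumptions.

```python
import math

# Each robot mixes its neighbors' iterates with doubly stochastic weights
# and takes a local gradient step with a decaying step-size.

neighbors = {0: [1], 1: [0, 2], 2: [1]}          # path graph 0 - 1 - 2
a = {0: 0.0, 1: 2.0, 2: 7.0}                     # joint optimum: 3.0
W = {0: {0: 2/3, 1: 1/3},
     1: {0: 1/3, 1: 1/3, 2: 1/3},
     2: {1: 1/3, 2: 2/3}}                        # Metropolis weights
alpha0 = 0.5

x = {i: 0.0 for i in neighbors}
for k in range(1, 5001):
    alpha = alpha0 / math.sqrt(k)                # decaying step-size
    g = {i: x[i] - a[i] for i in x}              # local gradients
    x = {i: sum(W[i][j] * x[j] for j in W[i]) - alpha * g[i] for i in x}
```

The iterates approach the joint optimum only slowly; with a constant step-size they would instead settle in a neighborhood of it, which motivates the fixed step-size variants discussed next.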


Algorithm 3 EXTRA

Private variables: P_i^(k) = ∅
Public variables: Q_i^(k) = (x_i^(k), x_i^(k−1))
Parameters: R_i^(k) = (α, w_i)
Update equations:
\[
x_i^{(k+1)} = \sum_{j \in N_i \cup \{i\}} w_{ij} \left( x_j^{(k)} - \frac{1}{2} x_j^{(k-1)} \right) - \alpha \left[ \nabla f_i\left(x_i^{(k)}\right) - \nabla f_i\left(x_i^{(k-1)}\right) \right]
\]

Algorithm 2 summarizes the update step for the decreasing step-size gradient descent method in Algorithm 1, with the step-size \alpha^{(k+1)} = \alpha^{(0)}/\sqrt{k}.

An alternative approach to the consensus term in (5) uses push-sum consensus, introduced in the context of gossip-based distributed algorithms [29]. This approach involves two parallel consensus steps to circumvent the need for a doubly stochastic matrix, which is replaced by a row-stochastic matrix A. The push-sum technique is also explored in [30], [31], [12], [32] in distributed gradient methods. In general, using a row-stochastic weighting matrix A in place of a doubly stochastic W would result in the consensus step becoming weighted according to the relative degrees of each robot in the communication graph. The push-sum approach introduces a second variable by which the robots keep track of their relative degrees and “unweight” the consensus term. The extra communication cost of this second variable comes with the benefit of extending consensus behavior to networks with asynchronous or directed communication.
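The push-sum mechanism can be sketched for plain average consensus on a directed ring: each robot carries a value s_i and a weight w_i, both mixed with the same column-stochastic split, and estimates the average by the ratio s_i/w_i. The graph and initial values are illustrative assumptions.

```python
# Push-sum (ratio) consensus: mass is split equally between a robot and its
# successor, so the implied weighting matrix is column-stochastic (each
# robot's outgoing mass sums to one), and s_i / w_i -> network average.

succ = {0: 1, 1: 2, 2: 0}                        # directed ring 0 -> 1 -> 2 -> 0
s = {0: 0.0, 1: 2.0, 2: 7.0}                     # network average: 3.0
w = {0: 1.0, 1: 1.0, 2: 1.0}

for _ in range(100):
    s_new = {i: 0.0 for i in s}
    w_new = {i: 0.0 for i in w}
    for i in s:
        for dest in (i, succ[i]):                # keep half, push half forward
            s_new[dest] += 0.5 * s[i]
            w_new[dest] += 0.5 * w[i]
    s, w = s_new, w_new

estimates = {i: s[i] / w[i] for i in s}
```

The weight variable w_i is exactly the second variable described above: it tracks each robot's accumulated relative degree so the ratio "unweights" the biased mixing.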

Noting the sub-linear convergence rates of decreasing step-size DGD, some approaches have applied accelerated gradient schemes for improved convergence speed [33].

B. Fixed Step-Size DGD

Although decreasing step-size DGD methods converge to an optimal joint solution, the requirement of a decaying step-size reduces the convergence speed of these methods. Fixed step-size methods address this limitation by eliminating the need for decreasing step-sizes while retaining convergence to the optimal joint solution. The EXTRA algorithm introduced by Shi et al. in [10] uses a fixed step-size while still achieving exact convergence. EXTRA replaces the gradient term with the difference in the gradients of the previous two iterates. Because the contribution of this gradient difference term decays as the iterates converge to the optimal joint solution, EXTRA does not require the step-size to decay in order to settle at the exact global solution. Algorithm 3 describes the update steps for the local variables of each robot in EXTRA. EXTRA achieves linear convergence [34], and a variety of DGD algorithms have since offered improvements on its linear rate [35]. Many other works with fixed step-sizes involve variations on the variables updated using consensus and the order of the update steps, including DIGing [36], [37], NIDS [38], Exact Diffusion [39], [40], and [41], [42]. These approaches

Algorithm 4 Distributed Dual Averaging (DDA)

Private variables: P_i = z_i^(k)
Public variables: Q_i^(k) = x_i^(k)
Parameters: R_i^(k) = (α^(k), w_i, ψ(·))
Update equations:
\[
z_i^{(k+1)} = \sum_{j \in N_i} w_{ij} z_j^{(k)} + \nabla f_i\left(x_i^{(k)}\right)
\]
\[
x_i^{(k+1)} = \underset{x \in X}{\text{argmin}} \left\{ x^\top z_i^{(k+1)} + \frac{1}{\alpha^{(k)}} \psi(x) \right\}
\]

which generally require the use of doubly stochastic weighting matrices have been extended to problems with row-stochastic or column-stochastic matrices [43], [44], [45], [46] and push-sum consensus [47] for distributed optimization in directed networks. Several works offer a synthesis of various fixed step-size DGD methods, noting the similarities between them. Under the canonical form proposed in [48], these algorithms and others differ only in the choice of several constant parameters. We provide this unifying template for fixed step-size DGD methods in Algorithm 9, included in the Appendix X-A. Jakovetic also provides a unified form for various fixed step-size DGD algorithms in [49]. Some other works consider accelerated variants using Nesterov gradient descent [50], [51], [52].
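A toy run of the EXTRA update (Algorithm 3) on the scalar problem min_x Σ_i ½(x − a_i)² over a 3-robot path graph shows the fixed step-size behavior. The weights, data, and step-size α are illustrative assumptions; here the mixing weights in the sum are taken as the entries of I + W, a plain consensus-and-gradient step initializes the recursion, and the iterates settle at the exact average, unlike decreasing step-size DGD.

```python
# EXTRA: mix (x^k - 0.5 * x^{k-1}) with the entries of (I + W) and correct
# with the difference of successive local gradients.

neighbors = {0: [1], 1: [0, 2], 2: [1]}          # path graph 0 - 1 - 2
a = {0: 0.0, 1: 2.0, 2: 7.0}                     # joint optimum: 3.0
W = {0: {0: 2/3, 1: 1/3},
     1: {0: 1/3, 1: 1/3, 2: 1/3},
     2: {1: 1/3, 2: 2/3}}                        # doubly stochastic
alpha = 0.3                                      # fixed step-size (assumed)

def grad(i, xi):
    return xi - a[i]

x_prev = {i: 0.0 for i in neighbors}
# first iterate: one plain consensus-and-gradient step
x = {i: sum(W[i][j] * x_prev[j] for j in W[i]) - alpha * grad(i, x_prev[i])
     for i in neighbors}

for _ in range(2000):
    x_next = {}
    for i in neighbors:
        mix = sum(W[i][j] * (x[j] - 0.5 * x_prev[j]) for j in W[i])
        mix += x[i] - 0.5 * x_prev[i]            # identity part of I + W
        x_next[i] = mix - alpha * (grad(i, x[i]) - grad(i, x_prev[i]))
    x_prev, x = x, x_next
```

Note how the gradient-difference term vanishes at consensus, so a fixed α suffices for exact convergence.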

In general, DGD methods address unconstrained distributed convex optimization problems, but these methods have been extended to non-convex problems [53] and constrained problems using projected gradient descent [54], [55], [56].

Some other methods perform dual-ascent on the dual problem of (2) [57], [58], [59], [60], where the robots compute their local primal variables from the related minimization problem using their dual variables. These methods require doubly stochastic weighting matrices but allow for time-varying communication networks. In [61], the robots perform a subsequent proximal projection step to obtain solutions which satisfy the problem constraints.

C. Distributed Dual Averaging

Dual averaging, first posed in [62] and extended in [63], takes a similar approach to gradient descent methods in solving the optimization problem in (2), with the added benefit of providing a mechanism for handling problem constraints through a projection step, in like manner as projected gradient descent methods. However, the original formulation of the dual averaging method requires knowledge of all components of the objective function or its gradient, which is not available to any single robot. The Distributed Dual Averaging method (DDA) circumvents this limitation by modifying the update equations using a doubly stochastic weighting matrix to allow for updates of each robot's variable using its local gradients and a weighted combination of the variables of its neighbors [64].

Similar to decreasing step-size DGD methods, distributed dual averaging requires a sequence of decreasing step-sizes to converge to the optimal solution. Algorithm 4 provides the update equations of the DDA method, along with the projection step, which involves a proximal function φ(x), often defined as (1/2)‖x‖²₂. After the projection step, each robot's variable satisfies the problem constraints described by the constraint set X.

Some of the same extensions made to DGD have been studied for DDA, including analysis of the algorithm under communication time delays [65] and replacement of the doubly stochastic weighting matrix with push-sum consensus [66].
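A minimal sketch of Algorithm 4 under the common choice φ(x) = (1/2)‖x‖²₂, for which the projected argmin reduces to a clipped scaling of z. The toy costs, box constraint X = [-1, 1], graph, and step-size schedule are illustrative assumptions:

```python
import numpy as np

# Toy DDA instance: 4 robots, f_i(x) = 0.5*(a_i*x - b_i)^2, with the box
# constraint X = [-1, 1] handled by the projection step. The unconstrained
# joint minimizer (~1.13) lies outside X, so the constrained solution is
# pinned to the boundary x = 1.
a = np.array([1.0, 2.0, 0.5, 1.5])
b = np.array([2.0, 1.0, 3.0, 2.0])
W = np.array([[0.50, 0.25, 0.00, 0.25],    # doubly stochastic weights
              [0.25, 0.50, 0.25, 0.00],    # for a 4-cycle graph
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

z = np.zeros(4)          # dual variables (mixed, accumulated gradients)
x = np.zeros(4)          # primal iterates, always kept inside X
x_avg = np.zeros(4)      # running averages, for which DDA's guarantees hold
for k in range(1, 20001):
    z = W @ z + a * (a * x - b)        # consensus on duals + local gradient
    alpha_k = 1.0 / np.sqrt(k)         # decreasing step-size
    # With phi(x) = 0.5*x^2, the projected argmin is clip(-alpha_k*z, X).
    x = np.clip(-alpha_k * z, -1.0, 1.0)
    x_avg += (x - x_avg) / k           # incremental running average

print(x_avg)   # all entries near the constrained optimum x = 1
```

Note that the constraint handling costs nothing beyond the projection, which is the practical appeal of DDA over plain DGD when X is simple.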

Applications of DGD

Distributed gradient descent methods have found notable applications in robot localization from relative measurements [67], [68], including in networks with asynchronous communication [69]. DGD has also been applied to optimization problems on manifolds, including SE(3) localization [70], [71], [72], [73], synchronization problems [74], and formation control in SO(3) [75], [76]. Other works [77] employ DGD along with a distributed simplex method [78] to obtain an optimal assignment of the robots to a desired target formation. In pose graph optimization, DGD has been employed through majorization-minimization schemes which minimize an upper bound of the objective function [79], through gradient descent on Riemannian manifolds [80], [81], and through block-coordinate descent [82].

Online problems are a field of particular interest for distributed optimization algorithms, and a number of works have adapted DDA for online scenarios [83], [84], with several implemented in scenarios with time-varying communication topology [85], [86]. The push-sum variant of dual averaging has also been used for distributed training of deep learning models, and has been shown to be useful in avoiding pitfalls of other distributed training frameworks, including communication deadlocks and asynchronous update steps [87].

IV. DISTRIBUTED SEQUENTIAL CONVEX PROGRAMMING

A. Approximate Newton Methods

Newton's method and its variants are commonly used for solving convex optimization problems, and they provide significant improvements in convergence rate when second-order function information is available [88]. While the distributed gradient descent methods exploit only information on the gradients of the objective function, Newton's method uses the Hessian of the objective function, providing additional information on the function's curvature which can improve convergence. To apply Newton's method to the distributed optimization problem in (2), the Network Newton-K (NN-K) algorithm [13] takes a penalty-based approach which introduces consensus between the robots' variables as components of the objective function. The NN-K method reformulates the constrained form of the distributed problem in (2) as the following unconstrained optimization problem:

min_{x_i, ∀i∈V}  ∑_{i∈V} ( α f_i(x_i) + x_iᵀ ∑_{j∈N_i∪{i}} w̄_ij x_j )   (6)

where w̄_ij denotes the entries of W̄ = I − W, and α is a weighting hyperparameter. However, the Newton descent step requires computing the inverse of the joint problem's Hessian, which cannot be directly

Algorithm 5 Network Newton-K (NN-K)

Private variables: P_i^(k) = (g_i^(k), D_i^(k))
Public variables: Q_i = (x_i^(k), d_i^(k+1))
Parameters: R_i = (α, ε, K, w_i)
Outer update equations:
  D_i^(k+1) = α ∇²f_i(x_i^(k)) + 2 w̄_ii I
  g_i^(k+1) = α ∇f_i(x_i^(k)) + ∑_{j∈N_i∪{i}} w̄_ij x_j^(k)
  d_i^(0) = −(D_i^(k+1))⁻¹ g_i^(k+1)
  ▷ compute d_i^(k+1) via K inner updates
  x_i^(k+1) = x_i^(k) + ε d_i^(k+1)
Inner update equations (Hessian approximation):
  d_i^(p+1) = (D_i^(k))⁻¹ ( w̄_ii d_i^(p) − g_i^(k+1) − ∑_{j∈N_i} w̄_ij d_j^(p) )

computed in a distributed manner, as the inverse is dense. To allow for distributed computation of the Hessian inverse, NN-K uses the first K terms of the Taylor series expansion (I − X)⁻¹ = ∑_{j=0}^∞ Xʲ to compute the approximate Hessian inverse, as introduced in [89]. Approximation of the Hessian inverse comes at an additional communication cost, requiring an additional K communication rounds per update of the primal variable. Algorithm 5 summarizes the update procedures of the NN-K method, in which ε denotes the step-size for the Newton step. As presented in Algorithm 5, NN-K proceeds through two sets of update equations: an outer set of updates that initializes the Hessian approximation and computes the decision variable update, and an inner set of Hessian approximation updates; a communication round precedes the execution of either set of update equations. Increasing K, the number of intermediary communication rounds, improves the accuracy of the approximated Hessian inverse at the cost of increasing the communication cost per primal variable update.

A follow-up work optimizes a quadratic approximation of the augmented Lagrangian of the general distributed optimization problem (2), where the primal variable update involves computing a P-approximate Hessian inverse to perform a Newton descent step, and the dual variable update uses gradient ascent [90]. The resulting algorithm, the Exact Second-Order Method (ESOM), provides a faster convergence rate than NN-K at the cost of one additional round of communication for the dual ascent step. Notably, replacing the augmented Lagrangian in the ESOM formulation with its linear approximation results in the EXTRA update equations, showing the relationship between the two approaches.

In some cases, computation of the Hessian is impossible because second-order information is not available. Quasi-Newton methods like the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm approximate the Hessian when it cannot be directly computed. The distributed BFGS (D-BFGS) algorithm [91] replaces the second-order information in the primal update of ESOM with a BFGS approximation (i.e., it replaces D_i^(k) in a call to the Hessian approximation equations in Algorithm 5 with an approximation), resulting in an essentially "doubly" approximate Hessian inverse. In [92] the D-BFGS method is extended so that the dual update also uses a distributed quasi-Newton update scheme, rather than gradient ascent. The resulting primal-dual quasi-Newton method requires two consecutive iterative rounds of communication, doubling the communication overhead per primal variable update compared to its predecessors (NN-K, ESOM, and D-BFGS). However, the authors show that the resulting algorithm still converges faster in terms of required communication.

B. Convex Surrogate Methods

While the approximate Newton methods in [90], [91], [92] optimize a quadratic approximation of the augmented Lagrangian of (6), other distributed methods allow for more general and direct convex approximations of the distributed optimization problem. These convex approximations generally require the gradient of the joint objective function, which is inaccessible to any single robot. In the NEXT family of algorithms [14], dynamic consensus is used to allow each robot to approximate the global gradient, and that gradient is then used to compute a convex approximation of the joint cost function locally. A variety of surrogate functions U(·) are proposed, including linear, quadratic, and block-convex surrogates, which allows for greater flexibility in tailoring the algorithm to individual applications. Using its surrogate of the joint cost function, each robot updates its local variables iteratively by solving its surrogate problem and then taking a weighted combination of the resulting solution with the solutions of its neighbors. To ensure convergence, NEXT algorithms require a series of decreasing step-sizes, resulting in generally slower convergence rates as well as additional hyperparameter tuning.
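The dynamic consensus step at the heart of NEXT, by which each y_i^(k) tracks the network-average gradient, can be sketched in isolation. Pairing it with a plain descent step gives a gradient-tracking scheme in the spirit of DIGing rather than NEXT itself (which uses a surrogate minimization and decreasing step-sizes); all problem data below are illustrative:

```python
import numpy as np

# Minimal gradient-tracking sketch: y_i uses dynamic average consensus to
# track the network-average gradient (its role in NEXT), and each robot
# then descends along its tracked gradient.
a = np.array([1.0, 2.0, 0.5, 1.5])
b = np.array([1.0, -1.0, 2.0, 0.5])
grad = lambda x: a * (a * x - b)         # stacked local gradients
x_star = (a @ b) / (a @ a)               # joint minimizer
W = np.array([[0.50, 0.25, 0.00, 0.25],  # doubly stochastic weights
              [0.25, 0.50, 0.25, 0.00],  # for a 4-cycle graph
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

x = np.zeros(4)
y = grad(x)                              # y^(0) initialized to local gradients
alpha = 0.05                             # fixed step-size (NEXT would decrease it)
for _ in range(1000):
    x_next = W @ x - alpha * y           # consensus + step along tracked gradient
    y = W @ y + grad(x_next) - grad(x)   # dynamic average consensus on gradients
    x = x_next

print(np.max(np.abs(x - x_star)))        # robots agree on the joint minimizer
```

The tracking update preserves the invariant that the y_i sum to the sum of the current local gradients, which is why the y_i converge to the (vanishing) average gradient at the optimum.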

The SONATA algorithm [93] extends the surrogate function principles of NEXT and proposes a variety of non-doubly stochastic weighting schemes that can be used to perform gradient averaging, similar to push-sum protocols. The authors of SONATA also show that several configurations of the algorithm recover previously proposed distributed optimization algorithms, including Aug-DGM, Push-DIG, and ADD-OPT.

Applications of Distributed Sequential Convex Programming

Distributed sequential convex programming methods have been applied to pose graph optimization problems [94], using a quadratic approximation of the objective function along with Gauss-Seidel updates to enable distributed local computations among the robots. The NEXT family of algorithms has been applied to a number of learning problems where data is distributed, including semi-supervised support vector machines [95], neural network training [96], and clustering [97].

Algorithm 6 NEXT

Private variables: P_i = (x_i^(k), x̃_i^(k), π_i^(k))
Public variables: Q_i^(k) = (z_i^(k), y_i^(k))
Parameters: R_i^(k) = (α^(k), w_i, U(·), 𝒦)
Update equations:
  x_i^(k+1) = ∑_{j∈N_i} w_ij z_j^(k)
  y_i^(k+1) = ∑_{j∈N_i} w_ij y_j^(k) + [∇f_i(x_i^(k+1)) − ∇f_i(x_i^(k))]
  π_i^(k+1) = N · y_i^(k+1) − ∇f_i(x_i^(k+1))
  x̃_i^(k+1) = argmin_{x∈𝒦} U(x; x_i^(k+1), π_i^(k+1))
  z_i^(k+1) = x_i^(k+1) + α^(k) ( x̃_i^(k+1) − x_i^(k+1) )

V. ALTERNATING DIRECTION METHOD OF MULTIPLIERS

Considering the optimization problem in (3) with only agreement constraints, we have

min_{x_i, ∀i∈V}  ∑_{i∈V} f_i(x_i)   (7)
subject to  x_i = x_j  ∀(i, j) ∈ E.   (8)

The method of multipliers solves this problem by alternating between minimizing the augmented Lagrangian of the optimization problem with respect to the primal variables x_1, ..., x_n (the "primal update") and taking a gradient step to maximize the augmented Lagrangian with respect to the dual variables (the "dual update"). In the alternating direction method of multipliers (ADMM), given the separability of the global cost function, the primal update is executed as successive minimizations over each primal variable (i.e., choose the minimizing x_1 with all other variables fixed, then choose the minimizing x_2, and so on). Most ADMM-based approaches do not satisfy our definition of distributed, in that either the primal updates take place sequentially rather than in parallel or the dual update requires centralized computation [98], [99], [100], [101]. However, the consensus alternating direction method of multipliers (C-ADMM) provides an ADMM-based optimization method that is fully distributed: the nodes alternate between updating their primal and dual variables and communicating with neighboring nodes [16].

In order to achieve a distributed update of the primal and dual variables, C-ADMM alters the agreement constraints between agents with an existing communication link by introducing auxiliary primal variables to (3): instead of the constraint x_i = x_j, we have two constraints, x_i = z_ij and x_j = z_ij. Considering the optimization steps across the entire network, C-ADMM proceeds by optimizing the auxiliary primal variables, then the original primal variables, and then the dual variables, as in the original formulation of ADMM. We can perform minimization with respect to the primal variables and gradient ascent with respect to the dual variables on an augmented Lagrangian that is fully distributed among the robots:

L_a = ∑_{i∈V} [ f_i(x_i) + y_iᵀ x_i + (ρ/2) ∑_{j∈N_i} ‖x_i − z_ij‖²₂ ],   (9)

where y_i represents the dual variable that enforces agreement between robot i and its communication neighbors. The parameter ρ that weights the quadratic terms in L_a is also the step-size in the gradient ascent of the dual variable. Furthermore, we can simplify the algorithm by noting that the auxiliary primal variable update can be performed implicitly (z*_ij = (1/2)(x_i + x_j)).

Algorithm 7 C-ADMM

Private variables: P_i^(k) = y_i^(k)
Public variables: Q_i^(k) = x_i^(k)
Parameters: R_i^(k) = ρ
Update equations:
  y_i^(k+1) = y_i^(k) + ρ ∑_{j∈N_i} ( x_i^(k) − x_j^(k) )
  x_i^(k+1) = argmin_{x_i} { f_i(x_i) + x_iᵀ y_i^(k+1) + ρ ∑_{j∈N_i} ‖x_i − (1/2)(x_i^(k) + x_j^(k))‖²₂ }

Algorithm 7 summarizes the update procedures for the local primal and dual variables of each agent. The update procedure for x_i^(k+1) requires solving an optimization problem, which might be computationally intensive for certain objective functions. To simplify the update, the optimization can be solved inexactly using a linear approximation of the objective function, as in DLM [102] and [103], [104], or a quadratic approximation using the Hessian, as in DQM [105].
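For quadratic local costs, the x-update in Algorithm 7 has a closed form, which makes C-ADMM easy to sketch end-to-end. The costs, 4-cycle graph, and value of ρ below are illustrative choices:

```python
import numpy as np

# C-ADMM sketch on a toy problem: 4 robots on a cycle with scalar
# quadratic costs f_i(x) = 0.5*(a_i*x - b_i)^2, for which the x-update
# can be solved in closed form.
a = np.array([1.0, 2.0, 0.5, 1.5])
b = np.array([1.0, -1.0, 2.0, 0.5])
x_star = (a @ b) / (a @ a)                    # joint minimizer
nbrs = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

rho = 1.0                                     # any rho > 0 converges
x = np.zeros(4)
y = np.zeros(4)                               # duals enforcing agreement
for _ in range(200):
    # dual ascent on the disagreement with neighbors
    y = y + rho * np.array([sum(x[i] - x[j] for j in nbrs[i])
                            for i in range(4)])
    # closed-form x-update: minimize f_i + x*y_i + rho*sum_j (x - m_ij)^2
    # with midpoints m_ij = (x_i + x_j)/2 evaluated at the old iterates
    x_new = np.empty(4)
    for i in range(4):
        s = sum(0.5 * (x[i] + x[j]) for j in nbrs[i])
        x_new[i] = (a[i] * b[i] - y[i] + 2 * rho * s) \
                   / (a[i] ** 2 + 2 * rho * len(nbrs[i]))
    x = x_new

print(np.max(np.abs(x - x_star)))             # robots agree on x_star
```

Each iteration needs exactly one exchange of the public variables x_i with neighbors, which is the communication pattern that makes C-ADMM fully distributed.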

C-ADMM as presented in Algorithm 7 requires each robot to optimize over a local copy of the global decision variable x. However, many robotic problems have a fundamental structure that makes maintaining global knowledge at every individual robot unnecessary: each robot's data relate only to a subset of the global optimization variables, and each robot requires only a subset of the optimization variable for its role. For instance, in distributed SLAM, a memory-efficient solution would require a robot to optimize only over its local map and to communicate with other robots only messages of shared interest. The SOVA method [106] leverages the separability of the optimization variable to achieve orders-of-magnitude improvements in convergence rate, computation, and communication complexity over C-ADMM methods.

In SOVA, each agent only optimizes over the variables relevant to its data or role, enabling robotic applications in which agents have minimal access to computation and communication resources. SOVA introduces consistency constraints between each agent's local optimization variable and those of its neighbors, mapping the elements of the local optimization variables, given by

Φ_ij x_i = Φ_ji x_j  ∀j ∈ N_i, ∀i ∈ V,

where Φ_ij and Φ_ji map elements of x_i and x_j to a common space. C-ADMM represents the special case of SOVA in which every Φ_ij is the identity matrix. The update procedures for each agent reduce to the equations given in Algorithm 8.

Algorithm 8 SOVA

Private variables: P_i^(k) = y_i^(k)
Public variables: Q_i^(k) = x_i^(k)
Parameters: R_i^(k) = (ρ, Φ)
Update equations:
  y_i^(k+1) = y_i^(k) + ρ ∑_{j∈N_i} Φ_ijᵀ ( Φ_ij x_i^(k) − Φ_ji x_j^(k) )
  x_i^(k+1) = argmin_{x_i} { f_i(x_i) + x_iᵀ y_i^(k+1) + ρ ∑_{j∈N_i} ‖Φ_ij x_i − (1/2)(Φ_ij x_i^(k) + Φ_ji x_j^(k))‖²₂ }
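A minimal two-agent sketch of Algorithm 8, in which a 3-dimensional global variable is split so that the agents share only its middle coordinate. The quadratic costs and selection matrices Φ are illustrative, and the closed-form x-updates are possible only because the f_i are quadratic:

```python
import numpy as np

# Two agents share one coordinate of a global variable z = (z0, z1, z2):
# agent 0 holds x0 = (z0, z1), agent 1 holds x1 = (z1, z2).
A0, b0 = np.diag([2.0, 1.0]), np.array([2.0, 1.0])  # f0 = 0.5||A0 x0 - b0||^2
A1, b1 = np.diag([1.0, 2.0]), np.array([3.0, 4.0])  # f1 = 0.5||A1 x1 - b1||^2
Phi01 = np.array([[0.0, 1.0]])   # selects z1 from x0
Phi10 = np.array([[1.0, 0.0]])   # selects z1 from x1

rho = 1.0
x0, x1 = np.zeros(2), np.zeros(2)
y0, y1 = np.zeros(2), np.zeros(2)
H0 = A0.T @ A0 + 2 * rho * Phi01.T @ Phi01
H1 = A1.T @ A1 + 2 * rho * Phi10.T @ Phi10
for _ in range(300):
    # dual ascent on the mapped agreement constraint Phi01 x0 = Phi10 x1
    y0 = y0 + rho * Phi01.T @ (Phi01 @ x0 - Phi10 @ x1)
    y1 = y1 + rho * Phi10.T @ (Phi10 @ x1 - Phi01 @ x0)
    # closed-form primal updates using the midpoint in the shared space
    m = 0.5 * (Phi01 @ x0 + Phi10 @ x1)
    x0 = np.linalg.solve(H0, A0.T @ b0 - y0 + 2 * rho * Phi01.T @ m)
    x1 = np.linalg.solve(H1, A1.T @ b1 - y1 + 2 * rho * Phi10.T @ m)

# x0 -> (1, 2) and x1 -> (2, 2): the agents agree on z1 = 2 without
# ever exchanging the unshared coordinates z0 and z2.
```

In this instance the joint optimum is z = (1, 2, 2), so the shared coordinate settles at 2 while each agent keeps its private coordinate entirely local, which is the communication saving SOVA targets.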

Applications of C-ADMM

ADMM has been applied to bundle adjustment and pose graph optimization problems, which involve the recovery of the 3D positions and orientations of a map and cameras [107], [108], [109], and to informative path planning [110]. However, these works require a central node for the dual variable updates. Other works employ the consensus ADMM variant without a central node [111], with other notable applications in target tracking [3], signal estimation [16], task assignment [112], motion planning [5], online learning [113], and parameter estimation in global navigation satellite systems [114]. Further applications of C-ADMM arise in trajectory tracking problems involving teams of robots using nonlinear model predictive control [115] and in cooperative localization [116]. Applications of SOVA include collaborative manipulation [117]. C-ADMM has also been adapted for online learning problems with streaming data [113].

VI. PRACTICAL NOTES

In Section VII, we compare the performance of several distributed optimization methods on three benchmark problems that approximate the computational demands of typical robotics applications. However, there are several significant considerations in translating an algorithm into a fast and efficient distributed solver. Among the most important practical considerations (aside from choosing the most suitable algorithm) are parameter tuning, initialization, and communication graph topology.

A. Communication Topology

As a general rule, a more highly connected graph facilitates faster convergence per communication iteration, regardless of the algorithm. Just as the Fiedler value of a graph determines the rate of convergence in a pure consensus problem, the connectivity also has a direct effect on convergence in other distributed optimization problems. Fully connected communication enables fast optimization in a multi-robot network, and even makes centralized algorithms viable alternatives to distributed optimization, depending on computational constraints and the amount of local information possessed by each robot. However, we are primarily concerned with the local, range-limited communication models that arise in practical large-scale robotics problems.

B. Parameter Tuning

The performance of each distributed algorithm that we consider is sensitive to the choice of parameters. For instance, in DGD (Algorithm 2), choosing α too large leads to divergence of the individual variables, while too small a value of α causes slow convergence. Similarly, C-ADMM (Algorithm 7) has a convergence rate that is highly sensitive to the choice of ρ, though convergence is guaranteed for all ρ > 0. We study the sensitivity of the convergence rate to parameter choice in each simulation in Section VII. However, the optimal parameter choice for a particular example is not prescriptive for the tuning of other implementations. Furthermore, while analytical results for optimal parameter selection are available for many of these algorithms, a practical parameter-tuning procedure is useful when an implementation does not exactly adhere to the assumptions in the literature. Appendix X-B describes the Golden Section Search (GSS) algorithm, which provides a general (centralized) procedure for parameter tuning prior to deployment of a robotic system.
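A generic golden section search of the kind referenced here can be sketched as follows; it assumes only that the tuning objective (e.g., iterations to reach a target MSE as a function of the step-size) is unimodal over the bracketing interval, and the interval and tolerance would be chosen per application:

```python
import math

def golden_section_search(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f over [lo, hi] (centralized tuning)."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0          # 1/golden ratio ~ 0.618
    c = hi - invphi * (hi - lo)                    # interior probe points
    d = lo + invphi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:                                # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - invphi * (hi - lo)
            fc = f(c)
        else:                                      # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + invphi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

# Hypothetical usage: pick the rho minimizing iterations-to-convergence,
# where iters_to_converge is a user-supplied benchmark run:
# best_rho = golden_section_search(iters_to_converge, 1e-2, 10.0)
```

Each iteration shrinks the bracket by a factor of about 0.618 while reusing one of the two previous function evaluations, so only one (possibly expensive) benchmark run is needed per iteration.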

C. Consensus Weights

Several distributed optimization methods, including DGD, EXTRA, DDA, NN-K, and NEXT, depend not only on step-size parameters but also on the consensus weights (w in Algorithm 2) by which each robot incorporates its neighbors' variables into its update equations. In most cases, the weights are assumed to be doubly stochastic (a robot's weights summed over its neighborhood equal one, as do the neighborhood's weights for the robot). The specific choice of weights may vary depending on the assumptions made about what global knowledge of the network is available to the robots, as discussed thoroughly in [118]. For example, [118] shows that finding pairwise symmetric weights that achieve the fastest possible consensus can be posed as a semidefinite program, which a computer with global knowledge of the network can solve efficiently. However, we cannot always assume that global knowledge of the network is available, especially in the case of a time-varying topology. In most cases, Metropolis weights facilitate fast mixing without requiring global knowledge: each robot can generate its own weight vector after a single communication round with its neighbors. In fact, Metropolis weights perform only slightly sub-optimally compared to centralized optimization-based methods [119]:

w_ij = 1 / max{|N_i|, |N_j|}   if j ∈ N_i,
w_ii = 1 − ∑_{j'∈N_i} w_ij',
w_ij = 0   otherwise.   (10)

Simulation results for algorithms with doubly stochastic mixing matrices use Metropolis weights.
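Equation (10) translates directly into a local weight-construction routine, since each robot needs only its own degree and its neighbors' degrees; the small example graph below is illustrative:

```python
import numpy as np

def metropolis_weights(nbrs, n):
    """Build the Metropolis weight matrix of eq. (10) for an undirected
    graph given as {robot: list of neighbors}. Each row can be computed
    locally after one communication round exchanging degrees."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in nbrs[i]:
            W[i, j] = 1.0 / max(len(nbrs[i]), len(nbrs[j]))
        W[i, i] = 1.0 - W[i].sum()      # self-weight absorbs the remainder
    return W

# A small 4-robot graph with edges (0,1), (0,2), (2,3):
nbrs = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
W = metropolis_weights(nbrs, 4)
# W is symmetric with nonnegative entries and unit row sums,
# hence doubly stochastic, as the consensus-based methods require.
```

The symmetry of the construction (w_ij = w_ji by the max in the denominator) is what upgrades row stochasticity to double stochasticity.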

VII. PERFORMANCE COMPARISONS IN CASE STUDIES

Many robotics problems have a distributed structure, although this structure might not be immediately apparent. In many cases, applying distributed optimization methods requires reformulating the original problem into a separable form that allows for distributed computation of the problem variables locally by each robot. This reformulation often involves the introduction of additional problem variables local to each robot, with an associated set of constraints relating the local variables between the robots. We provide three examples of distributed optimization applied to notable robotics problems: multi-drone vehicle target tracking, drone-robot coordinated package delivery, and multi-robot cooperative mapping.

Our principal motivation in evaluating the distributed optimization methods described in Sections III-V is to assess the suitability of each method for distributed robotics applications. We measure the performance of a representative sample of the distributed optimization methods on three problem types: unconstrained least-squares, constrained convex optimization, and non-convex optimization. These classes of optimization problems are representative of a broad swath of robotics problems in estimation, localization, planning, and control.

For a given distributed optimization method, important considerations for its use include the generality of the algorithm (i.e., the sensitivity of its performance across different scenarios to its parameters), the computational requirements (i.e., each robot's computation time), and the communication requirements (i.e., how many messages the robots must pass). We use two important metrics in evaluating performance. The first is the normalized Mean Square Error (MSE), which is computed with respect to the optimal joint solution. When evaluating distributed optimization algorithms, comparisons on the basis of MSE reduction per iteration can often skew results and do not accurately reflect the total execution time of these algorithms. For example, methods like DIGing and NEXT require multiple communication rounds per iteration, the execution time of which is not accounted for in a per-iteration comparison. Similarly, methods like C-ADMM require local optimization steps at each iteration, which can significantly increase computation time depending on the cost function. In robotics applications, both communication time and local computation time contribute to the overall execution time of these algorithms. Several works [120], [121] propose application-specific cost metrics that assess both communication time and computation time. In contrast to these works, we capture the trade-off between communication and computation with a single-parameter metric called the Resource-Weighted Cost (RWC), which represents the total time required for a distributed optimization algorithm to converge below an acceptable MSE given some arbitrary communication and computation capability:

RWC_λ = (t_cp + λ t_cm) / (1 + λ).   (11)

Here, t_cp is the total time spent by all robots performing local computations, computed by summing the processor time required for the local update steps across the network, and t_cm is the time spent by all robots communicating with their neighbors, approximated by counting the number of floating-point values passed during the optimization. Finally, λ is a scaling factor that weights communication capability relative to computation capability. As an example, a network of robots connected via a 5G cellular network would require significantly less time to communicate the same amount of data than the same network of robots connected via a low-power radio network. Sweeping across λ corresponds to a pass over the range of possible network configurations and computational capabilities.
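Equation (11) reduces to a one-line helper; the timing numbers below are hypothetical, chosen only to show the two regimes of λ:

```python
def resource_weighted_cost(t_cp, t_cm, lam):
    """RWC of eq. (11): blends total computation time t_cp and total
    communication time t_cm. lam -> 0 recovers the pure computation
    cost; large lam approaches the pure communication cost."""
    return (t_cp + lam * t_cm) / (1 + lam)

# Hypothetical profile: heavy local solves (12 s), few messages (0.5 s).
slow_radio = resource_weighted_cost(12.0, 0.5, lam=10.0)  # comm-dominated network
fast_5g = resource_weighted_cost(12.0, 0.5, lam=0.1)      # compute-dominated
```

For this profile the compute-dominated regime (small λ) penalizes the method far more than the communication-dominated one, matching the intuition that expensive local solves matter most when communication is cheap.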

A. Distributed Multi-drone Vehicle Tracking

Linear least-squares optimization, a convex unconstrained optimization problem, arises in a variety of robotics problems such as parameter estimation, localization, and trajectory planning. When extended to a multi-robot scenario, each of these problems can be reformulated as a distributed linear least-squares optimization. In this simulation, we consider a distributed multi-drone vehicle target tracking problem in which robots connected by a communication graph G = (V, E) each record range-limited linear measurements of a moving target and seek to collectively estimate the target's entire trajectory. We assume that each drone can communicate locally with nearby drones over G. The drones all share a model of the target's dynamics:

x_{t+1} = A_t x_t + w_t,   (12)

where x_t ∈ R⁴ represents the position and velocity of the target in some global frame at time t, A_t is a linear model of the target's dynamics, and w_t ∼ N(0, Q_t) represents process noise (including the unknown control inputs to the target). At every timestep at which the target is sufficiently close to a drone i (which we denote by t ∈ T_i), that drone collects an observation according to the measurement model

y_{i,t} = C_{i,t} x_t + v_{i,t},   (13)

where y_{i,t} ∈ R² is a positional measurement, C_{i,t} is the measurement model of drone i, and v_{i,t} ∼ N(0, R_{i,t}) is measurement noise. All of the drones share the same model of the prior distribution of the initial state of the target, N(x̄_0, P_0). The global cost function is of the form

f(x) = ‖x_0 − x̄_0‖²_{P_0^−1} + ∑_{t=1}^{T−1} ‖x_{t+1} − A_t x_t‖²_{Q_t^−1} + ∑_{i∈V} ∑_{t∈T_i} ‖y_{i,t} − C_{i,t} x_t‖²_{R^−1},   (14)

Fig. 2. A visualization of the distributed multi-drone vehicle target tracking problem. Each robot (colored quadrotor) records noisy observations when the target (flagged ground vehicle) is within its measurement range. The robots communicate over a network (dashed lines) to converge to the globally optimal trajectory estimate.

while the local cost function for drone i is

f_i(x) = (1/N) ‖x_0 − x̄_0‖²_{P_0^−1} + ∑_{t=1}^{T−1} (1/N) ‖x_{t+1} − A_t x_t‖²_{Q_t^−1} + ∑_{t∈T_i} ‖y_{i,t} − C_{i,t} x_t‖²_{R^−1}.   (15)

In our results, we consider only a batch solution to the problem (finding the full trajectory of the target given each robot's full set of measurements). Methods for performing the estimation in real time through filtering and smoothing steps have been well studied, in both the centralized and distributed cases [122]. An extended version of this multi-robot tracking problem is solved with distributed optimization in [3]. A rendering of a representative instance of this multi-robot tracking problem is shown in Figure 2.

In Figures 3 and 4, several distributed optimization algorithms are compared on an instance of the distributed multi-drone vehicle tracking problem. For this problem instance, 10 simulated drones seek to estimate the target's trajectory over 16 time steps, resulting in a decision variable dimension of n = 64. We compare four distributed optimization methods which we consider representative of the taxonomic classes outlined in the sections above: C-ADMM, EXTRA, DIGing, and NEXT-Q. Figure 3 shows that C-ADMM and EXTRA have similarly fast convergence rates per iteration, while DIGing and NEXT-Q are 4 and 15 times slower, respectively, to converge below an MSE of 10⁻⁶. The step-size hyperparameters for each method are computed by GSS (for NEXT-Q, which uses a two-parameter decreasing step-size, we fix one parameter according to the values recommended in [14]).

As mentioned in Section VI, tuning is essential for achieving robust and efficient convergence with most distributed optimization algorithms. Figure 4 shows the sensitivity of these methods to variation in step-size and highlights that three of the methods (all except C-ADMM) diverge for certain subsets of the tested hyperparameter space.

Fig. 3. MSE per iteration on a distributed multi-drone vehicle target tracking problem with N = 10 and n = 64.

Fig. 4. Step-size hyperparameter sensitivity sweep on a distributed multi-drone vehicle target tracking problem for N = 10 and n = 64. The dashed lines are thresholds for divergence (top) and convergence (bottom) in terms of MSE after 10⁴ decision variable updates.

While MSE reduction per iteration and hyperparameter sensitivity are useful for understanding the performance of these methods on this specific problem instance, they do not provide information about the scalability of these methods. Specifically, how do these algorithms scale on real networks where both communication and computation overhead are important? To answer this question, we look to the RWC (11), which provides a more comprehensive analysis. In Figure 5 we show how the two fastest-converging algorithms from the single-instance analysis, C-ADMM and EXTRA, perform in a sweep across both problem size and the RWC parameter λ.

The RWC surfaces in Figure 5 show that C-ADMM has a lower RWC than EXTRA at all points in the sweep, except where both λ and n are small, where the algorithms have roughly the same RWC. This regime corresponds to scenarios with relatively small decision variables and a network configuration in which computation is costly and communication is cheap.

Fig. 5. Resource-weighted cost comparison of C-ADMM and EXTRA on the distributed multi-drone vehicle tracking problem. GSS was used to find the best step-sizes for each algorithm in each problem instance. Computation time was measured in MATLAB on a single CPU core.

Fig. 6. Coordinated package delivery by aerial and ground robots. The aerial robots pick up packages from the ground robots for delivery to areas inaccessible to the ground robots, which remain within the inner rectangular region.

B. Multi Drone-Robot Coordinated Package Delivery

Many robotics scenarios, especially in planning and control, involve constrained optimization problems, in which the decision variable must satisfy constraints such as control limits, state bounds, or load limits. In general, these problems are of the form

minimize_x  ∑_{i=1}^{N} f_i(x)
subject to  g(x) ≤ 0,
            h(x) = 0.   (16)

We consider the case in which the constraints g(x) ≤ 0 and h(x) = 0 are convex and each local cost function f_i(x) is convex. Note that if g(x) is a convex function, then g(x) ≤ 0 is a convex constraint, and h(x) = 0 is a convex constraint if and only if h(x) is an affine function.

The distributed subgradient methods ([11], [10], [12], [37], [33]) do not handle constrained optimization problems, and extending these methods is beyond the scope of this paper. Consequently, we compare C-ADMM to dual averaging with push-sum (PS-DA) [65] and to the distributed convex approximation method with consensus on the network gradients (NEXT-Q), in which we take a quadratic approximation of the objective function using the problem Hessian.

Consider N aerial robots delivering packages within a remote area. Each aerial robot meets with a limited number of the M ground robots at given times to pick up packages for delivery to specified locations. In addition, the ground robots are constrained to move within certain zones. Depending on the size of the ground robots, these constraints keep the robots on the roads (for larger robots) or on sidewalks (for smaller robots). We want to compute a trajectory for each robot that minimizes its energy consumption while satisfying the package pick-up, control, and state constraints. The resulting optimization problem is

minimize_{x_a, u_a, x_g, u_g}  ∑_{i=1}^{N} u_{a,i}ᵀ Q_{a,i} u_{a,i} + ∑_{j=1}^{M} u_{g,j}ᵀ Q_{g,j} u_{g,j}
subject to  x_{a,i} = x_{g,j}  ∀t ∈ T_{i,j}, ∀i, ∀j,
            f_a(x_{a,i}, u_{a,i}) = 0  ∀i,
            f_g(x_{g,j}, u_{g,j}) = 0  ∀j,
            h(x_{g,j}, u_{g,j}) ≤ 0  ∀j,   (17)

where x_a and u_a denote the states and control inputs of the aerial robots, while x_g and u_g denote the states and control inputs of the ground robots over the duration of the package delivery problem. Q_a and Q_g are positive-definite weight matrices on the control inputs of the aerial and ground robots. The first constraint in (17) ensures that the aerial robots meet with the ground robots at the times specified in T_{i,j}. The robots' states follow the dynamics represented by f_a(·) and f_g(·). Closed-form solutions for the constrained convex optimization problem (17) do not exist.

With x_a ∈ R⁶ and x_g ∈ R⁴, we impose convex state constraints on the positions and velocities of the ground robots, representing constraints on the zones which the robots can occupy. In addition, we constrain the control inputs of the ground robots and assume affine dynamics for the ground and aerial robots. The robots' delivery assignments begin and end at specified stations, which we include as constraints in the optimization problem.

Representing the trajectory of each robot $i$, including its states and control inputs, as $Z_i$, we define the mean square error (MSE) of the robots' solutions with respect to the joint solution as

$$
\mathrm{MSE}(Z_1, \cdots, Z_N) = \frac{1}{N} \sum_{i=1}^{N} \left\| Z_i - Z^{\star} \right\|_2^2 \tag{18}
$$

where $Z^{\star}$ represents the joint solution of the problem. We examine the performance of C-ADMM, PS-DA, and

NEXT-Q on the coordinated package delivery problem using the MSE. We show the joint solution in Figure 6, where the aerial robots pick up packages from the ground robots at locations indicated by the colored packages for delivery next to the homes inaccessible to the ground robots. The aerial robots conclude each delivery assignment at the marked locations indicated by the cross-hairs next to the homes. Similar to gradient descent methods, PS-DA shows notable sensitivity to

Fig. 7. Convergence error of C-ADMM, NEXT-Q, and PS-DA for the package delivery problem on range-limited graphs. PS-DA converges slowly compared to C-ADMM and NEXT-Q.

the step-size, and its error becomes unbounded for some values of the step-size. We selected the step-size that provided the fastest convergence. From Figure 7, C-ADMM converges to the joint solution faster than PS-DA and NEXT-Q; NEXT-Q converges quickly once the agents obtain a good estimate of the problem gradients through consensus, while PS-DA converges more slowly than the other methods.
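The error metric in (18) is straightforward to compute. A minimal sketch with toy one-dimensional trajectories (the joint optimum $Z^{\star}$ here is an assumed value, not data from the paper):

```python
# Mean-square error of the robots' local solutions to the joint solution,
# as in (18). Toy values for illustration only.

def mse(Z_list, Z_star):
    # Average squared Euclidean distance of each Z_i to Z_star
    n = len(Z_list)
    return sum(sum((zi - zs) ** 2 for zi, zs in zip(Z, Z_star))
               for Z in Z_list) / n

Z_star = [0.0, 0.0]
Z_list = [[0.1, 0.0], [0.0, -0.1], [0.0, 0.0]]
print(mse(Z_list, Z_star))  # (0.01 + 0.01 + 0) / 3
```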

In Figure 8, we show the relative trade-off between the computation and communication overhead required by C-ADMM and NEXT-Q as the number of meeting constraints between the aerial and ground robots increases. We omit the cost of PS-DA from this figure, as PS-DA converges significantly more slowly than the other methods. The constrained problem becomes increasingly difficult as the number of meeting constraints grows, as reflected in Figure 8. Likewise, solving the constrained problem requires significant computation effort; thus, reducing the weight on the contribution of computation effort to the resource-weighted cost results in a lower total cost. From Figure 8, C-ADMM achieves a lower resource-weighted cost than NEXT-Q.

C. Cooperative Multi-robot Mapping

In our third evaluation of distributed methods for robotics, we consider distributed nonlinear, nonconvex optimization. Many problems in robotics require optimization over nonconvex objectives. For example, several recent papers have addressed distributed approaches to SLAM with problem-specific algorithms. In this work, we consider a milder (though still nonconvex) problem: distributed cooperative multi-robot mapping of labeled landmarks using range-only measurements. The distributed cooperative multi-robot mapping problem, visualized in Figure 9, consists of n robots navigating around an environment while taking measurements of the distance between their known positions and the unknown positions of m landmarks. The separable cost function with agreement constraints is given as:


Fig. 8. Resource-weighted cost for the package delivery problem on range-limited graphs. C-ADMM attains a lower total cost considering the computation and communication required by the robots.

Fig. 9. A visualization of the range-only distributed cooperative multi-robot mapping problem. Heterogeneous robots record noisy range measurements to landmarks and cooperatively estimate the landmarks' positions.

$$
\begin{aligned}
\min_{x_{11}, \ldots, x_{nm}} \quad & \sum_{i=1}^{n} \sum_{k=1}^{m} \sum_{t \in T_{ik}} \frac{w_{ik}^{2}}{2} \left( \left\| p_i - x_{ik} \right\| - d_{ik}(t) \right)^{2} \\
\text{subject to} \quad & x_{ik} = x_{jk} \quad \forall k,\ \forall j \in \mathcal{N}_i,
\end{aligned} \tag{19}
$$

where $x_{ik}$ denotes the ith robot's estimate of the location of the kth landmark, $p_i$ denotes the ith robot's known position, $d_{ik}(t)$ denotes the range measurement collected at time $t \in T_{ik}$, and $w_{ik}$ is a measurement weight.
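To make the cost in (19) concrete, the sketch below evaluates one robot's term for a single landmark. The positions, weight, and range measurements are toy values, not the survey's data.

```python
import math

# Illustrative evaluation of one robot's share of the cost in (19):
# sum over measurements of (w^2 / 2) * (||p_i - x_ik|| - d_ik(t))^2.

def range_cost(p, x, measurements, w=1.0):
    dist = math.hypot(p[0] - x[0], p[1] - x[1])  # distance from robot to estimate
    return sum(0.5 * w**2 * (dist - d) ** 2 for d in measurements)

p_robot = (0.0, 0.0)        # known robot position
x_landmark = (3.0, 4.0)     # current landmark estimate (true range 5)
print(range_cost(p_robot, x_landmark, [5.0, 5.2, 4.9]))
```

The cost is zero only when the estimated landmark position is consistent with every range measurement, which noisy measurements prevent; minimizing the sum trades them off in a least-squares sense.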

Although the distributed methods that we consider do not have convergence guarantees for non-convex optimization problems, we evaluate each algorithm based on the most immediate interpretation of the algorithm. In particular, we consider EXTRA and C-ADMM for this problem. The application of DGD methods to differentiable but non-convex problems such as (19) is immediate, although there is no guarantee of convergence to the global minimum.
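A DGD-style iteration applied to a problem like (19) has a simple shape: each robot averages its neighbors' estimates under a mixing matrix, then takes a local gradient step. The sketch below shows one such iteration on scalar estimates; the mixing matrix, step size, and gradient values are illustrative assumptions, not quantities from the paper.

```python
# One decentralized-gradient-descent-style iteration:
# x_i <- sum_j W[i][j] * x_j - alpha * grad_i.

def dgd_step(X, W, grads, alpha):
    n = len(X)
    return [sum(W[i][j] * X[j] for j in range(n)) - alpha * grads[i]
            for i in range(n)]

X = [0.0, 1.0, 2.0]            # scalar estimates of one landmark coordinate
W = [[0.5, 0.5, 0.0],          # doubly stochastic mixing matrix on a path graph
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
grads = [-0.2, 0.1, 0.3]       # toy local gradients evaluated at X
print(dgd_step(X, W, grads, alpha=0.1))
```

Repeating this step drives the robots toward agreement (through W) while the gradient terms pull the shared estimate toward a stationary point of the sum of local costs.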

An important consequence of applying C-ADMM to a nonlinear least-squares problem is that the primal update step of the algorithm no longer consists of a single linear update. In this implementation, each robot solves the minimization step to completion (up to a specified error tolerance) before communicating with its neighbors, incurring additional computational cost per iteration. As a result, the C-ADMM update equations between communication iterations can require significantly

Fig. 10. Mean-square-error of robots' estimates as a function of iteration for C-ADMM and EXTRA for a fixed graph containing 20 nodes and optimal choices of step-size parameters.

Fig. 11. Resource-weighted cost of distributed optimization algorithms on a distributed cooperative multi-robot mapping problem.

more computation than the EXTRA update equations. Despite the higher computation cost per iteration, empirical results suggest that C-ADMM still maintains a distinct advantage over EXTRA.
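The iterative primal update described above can be sketched as an inner loop run to a tolerance before each communication round. The scalar augmented-objective gradient below is a toy stand-in of our own, not the paper's exact update.

```python
# Sketch of a C-ADMM-style primal step for a nonlinear objective: the local
# minimization is itself iterative, run to tolerance before communicating.

def primal_step(grad_aug, x0, step=0.1, tol=1e-8, max_iters=10_000):
    x = x0
    for _ in range(max_iters):
        g = grad_aug(x)
        x = x - step * g          # inner gradient-descent iteration
        if abs(g) < tol:          # stop once the local problem is solved
            break
    return x

# Toy augmented gradient: d/dx [ (x^2 - 2)^2 / 4 + (rho/2)(x - z)^2 ],
# where rho and z stand in for the penalty weight and consensus target.
rho, z = 1.0, 1.5
grad = lambda x: (x**2 - 2) * x + rho * (x - z)
x_new = primal_step(grad, x0=1.0)
print(x_new)  # approximate stationary point of the local augmented objective
```

The many inner iterations per communication round are exactly the extra per-iteration computation noted above; linearized or quadratically approximated ADMM variants (e.g., DLM [102], DQM [105]) trade solution quality per round for cheaper primal steps.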

VIII. OPEN PROBLEMS IN DISTRIBUTED OPTIMIZATION FOR ROBOTICS

Distributed optimization methods have primarily focused on solving unconstrained convex optimization problems, which constitute a notably limited subset of robotics problems. Generally, robotics problems involve non-convex objectives and constraints, which render these problems not directly amenable to many existing distributed optimization methods. For example, problems in multi-robot motion planning, SLAM, learning, distributed manipulation, and target tracking are often non-convex and/or constrained.


Both DGD methods and C-ADMM methods can be modified for non-convex and constrained problems; however, few examples of practical algorithms or rigorous analyses of the performance of such modified algorithms exist in the literature. Specifically, while C-ADMM is naturally amenable to constrained optimization, there are many possible ways to adapt C-ADMM to non-convex objectives, which have yet to be explored. One way to implement C-ADMM for non-convex problems is for each primal step to itself solve a non-convex optimization (e.g., through a quasi-Newton method or an interior point method). Another option is to perform successive quadratic approximations in an outer loop, and use C-ADMM to solve each resulting quadratic problem in an inner loop. The trade-off between these two options has not yet been explored in the literature, especially in the context of non-convex problems in robotics.
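The second option, successive quadratic approximation with an exact inner solve, can be sketched on a scalar nonconvex toy objective. This is an assumed illustration of the idea, not an algorithm from the survey; in particular, the convexity guard is our own addition.

```python
# Outer loop: quadraticize a nonconvex objective at the current iterate and
# solve each quadratic subproblem exactly (the role C-ADMM would play in a
# distributed setting). Scalar toy problem for illustration.

def successive_quadratic(f_grad, f_hess, x0, outer_iters=50):
    x = x0
    for _ in range(outer_iters):
        g, h = f_grad(x), f_hess(x)
        if h <= 0:          # guard: damp the model when the Hessian is not convex
            h = 1.0
        x = x - g / h       # exact minimizer of the local quadratic model
    return x

# Nonconvex toy objective f(x) = x^4 - 3x^2 + x
grad = lambda x: 4 * x**3 - 6 * x + 1
hess = lambda x: 12 * x**2 - 6
x_star = successive_quadratic(grad, hess, x0=1.0)
print(x_star)  # a stationary point: grad(x_star) ≈ 0
```

As with any local scheme on a nonconvex objective, the iterate converges only to a stationary point that depends on the initialization, which mirrors the guarantees gap discussed above.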

Likewise, many distributed optimization methods do not treat communication between agents as an expensive resource, given that many of these methods were developed for problems with reliable communication infrastructure (e.g., multi-core computing, or computing in a hard-wired cluster). However, communication takes on greater prominence in robotics problems, as robots often operate in regions with limited communication infrastructure. The absence of reliable communication infrastructure also leads to communication delays and dropped message packets. This highlights the need for research analyzing the robustness of distributed optimization methods to unreliable communication networks, and for the development of new algorithms robust to such real-world communication faults.

Another valuable direction for future research is developing algorithms specifically for computationally limited robotic platforms, on which the timeliness of the solution is as important as the solution quality. In general, many distributed optimization methods involve computationally challenging procedures, especially distributed methods for constrained problems. These methods ignore the significance of computation time, assuming that agents have access to significant computational power; this assumption often does not hold in robotics problems. Typically, robotics problems unfold over successive time periods with an associated optimization phase at each step of the problem. As such, agents must compute each solution fast enough to proceed to computing a reasonable solution for the next problem, which requires efficient distributed optimization methods. Developing such algorithms specifically for multi-robot systems is an interesting topic for future work.

Finally, there are very few examples of distributed optimization algorithms implemented and running on multi-robot hardware. This leaves a critical gap in the existing literature, as the ability of these algorithms to run efficiently and robustly on robots has still not been thoroughly demonstrated. In general, the opportunities for research in distributed optimization for multi-robot systems are plentiful. Distributed optimization provides an appealing unifying framework from which to synthesize solutions for a large variety of problems in multi-robot systems.

IX. CONCLUSION

The field of distributed optimization provides a variety of algorithms that can address important problems for multi-robot systems. We have categorized distributed optimization methods into three broad classes: distributed gradient descent, distributed sequential convex programming, and the alternating direction method of multipliers (ADMM). We have presented a general framework for applying these methods to multi-robot scenarios, with examples in distributed multi-drone target tracking, multi-robot coordinated package delivery, and multi-robot cooperative mapping. Our empirical simulation results suggest that C-ADMM provides an especially attractive algorithm for distributed optimization in robotics problems. While distributed optimization techniques can immediately apply to several fundamental applications in robotics, important challenges remain in developing distributed algorithms for constrained, non-convex robotics problems, and algorithms tailored to the limited computation and communication resources of robot platforms.

REFERENCES

[1] R. T. Rockafellar, “Monotone operators and the proximal point algorithm,” SIAM Journal on Control and Optimization, vol. 14, no. 5, pp. 877–898, 1976.

[2] J. N. Tsitsiklis, “Problems in decentralized decision making and computation,” Massachusetts Inst. of Tech. Cambridge Lab for Information and Decision Systems, Tech. Rep., 1984.

[3] O. Shorinwa, J. Yu, T. Halsted, A. Koufos, and M. Schwager, “Distributed multi-target tracking for autonomous vehicle fleets,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 3495–3501.

[4] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, “Multi-agent reinforcement learning via double averaging primal-dual optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 9649–9660.

[5] J. Bento, N. Derbinsky, J. Alonso-Mora, and J. S. Yedidia, “A message-passing algorithm for multi-agent trajectory planning,” in Advances in Neural Information Processing Systems, 2013, pp. 521–529.

[6] D. Hajinezhad, M. Hong, and A. Garcia, “Zeroth order nonconvex multi-agent optimization over networks,” arXiv preprint arXiv:1710.09997, 2017.

[7] ——, “ZONE: Zeroth-order nonconvex multiagent optimization over networks,” IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 3995–4010, 2019.

[8] D. Hajinezhad and M. Hong, “Perturbed proximal primal–dual algorithm for nonconvex nonsmooth optimization,” Mathematical Programming, vol. 176, no. 1-2, pp. 207–245, 2019.

[9] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.

[10] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.

[11] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.

[12] A. Nedic and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 601–615, 2014.

[13] A. Mokhtari, Q. Ling, and A. Ribeiro, “Network Newton,” Conference Record - Asilomar Conference on Signals, Systems and Computers, vol. 2015-April, pp. 1621–1625, 2015.

[14] P. Di Lorenzo and G. Scutari, “NEXT: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.

[15] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.


[16] G. Mateos, J. A. Bazerque, and G. B. Giannakis, “Distributed sparse linear regression,” IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5262–5276, 2010.

[17] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014.

[18] D. K. Molzahn, F. Dorfler, H. Sandberg, S. H. Low, S. Chakrabarti, R. Baldick, and J. Lavaei, “A survey of distributed optimization and control algorithms for electric power systems,” IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2941–2962, 2017.

[19] A. Nedic and J. Liu, “Distributed optimization for control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 77–103, 2018.

[20] A. Nedic, A. Olshevsky, and M. G. Rabbat, “Network topology and communication-computation tradeoffs in decentralized optimization,” Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.

[21] T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, “Distributed learning in the nonconvex world: From batch data to streaming and beyond,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 26–38, 2020.

[22] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, “A survey of distributed optimization,” Annual Reviews in Control, 2019.

[23] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989, vol. 23.

[24] N. A. Lynch, Distributed Algorithms. Elsevier, 1996.

[25] F. Bullo, J. Cortes, and S. Martínez, Distributed Control of Robotic Networks, ser. Applied Mathematics Series. Princeton University Press, 2009, electronically available at http://coordinationbook.info.

[26] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010, vol. 33.

[27] V. S. Mai and E. H. Abed, “Distributed optimization over directed graphs with row stochasticity and constraint regularity,” Automatica, vol. 102, pp. 94–104, 2019.

[28] I. Lobel and A. Ozdaglar, “Distributed subgradient methods for convex optimization over random networks,” IEEE Transactions on Automatic Control, vol. 56, no. 6, pp. 1291–1306, 2010.

[29] D. Kempe, A. Dobra, and J. Gehrke, “Gossip-based computation of aggregate information,” in 44th Annual IEEE Symposium on Foundations of Computer Science. IEEE, 2003, pp. 482–491.

[30] A. Olshevsky and J. N. Tsitsiklis, “Convergence speed in distributed consensus and averaging,” SIAM Journal on Control and Optimization, vol. 48, no. 1, pp. 33–55, 2009.

[31] A. Olshevsky, I. C. Paschalidis, and A. Spiridonoff, “Robust asynchronous stochastic gradient-push: asymptotically optimal and network-independent performance for strongly convex functions,” arXiv preprint arXiv:1811.03982, 2018.

[32] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli, “Weighted gossip: Distributed averaging using non-doubly stochastic matrices,” in 2010 IEEE International Symposium on Information Theory. IEEE, 2010, pp. 1753–1757.

[33] D. Jakovetic, J. Xavier, and J. M. Moura, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.

[34] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.

[35] A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of distributed gradient algorithms,” arXiv preprint arXiv:1809.08694, 2018.

[36] A. Nedic, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.

[37] ——, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.

[38] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4494–4506, 2019.

[39] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning—Part I: Algorithm development,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2018.

[40] ——, “Exact diffusion for distributed optimization and learning—Part II: Convergence analysis,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 724–739, 2018.

[41] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2017.

[42] R. Xin and U. A. Khan, “Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking,” IEEE Transactions on Automatic Control, 2019.

[43] F. Saadatniaki, R. Xin, and U. A. Khan, “Optimization over time-varying directed graphs with row and column-stochastic matrices,” arXiv preprint arXiv:1810.07393, 2018.

[44] C. Xi, R. Xin, and U. A. Khan, “ADD-OPT: Accelerated distributed directed optimization,” IEEE Transactions on Automatic Control, vol. 63, no. 5, pp. 1329–1339, 2017.

[45] R. Xin and U. A. Khan, “A linear algorithm for optimization over directed graphs with geometric convergence,” IEEE Control Systems Letters, vol. 2, no. 3, pp. 315–320, 2018.

[46] C. Xi, V. S. Mai, R. Xin, E. H. Abed, and U. A. Khan, “Linear convergence in optimization over directed graphs with row-stochastic matrices,” IEEE Transactions on Automatic Control, vol. 63, no. 10, pp. 3558–3565, 2018.

[47] J. Zeng and W. Yin, “ExtraPush for convex smooth decentralized optimization over directed networks,” Journal of Computational Mathematics, vol. 35, no. 4, pp. 383–396, 2017.

[48] A. Sundararajan, B. Van Scoy, and L. Lessard, “A canonical form for first-order distributed optimization algorithms,” in American Control Conference. IEEE, 2019, pp. 4075–4080.

[49] D. Jakovetic, “A unification and generalization of exact distributed first-order methods,” IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 1, pp. 31–46, 2018.

[50] G. Qu and N. Li, “Accelerated distributed Nesterov gradient descent,” IEEE Transactions on Automatic Control, 2019.

[51] R. Xin, D. Jakovetic, and U. A. Khan, “Distributed Nesterov gradient methods over arbitrary graphs,” IEEE Signal Processing Letters, vol. 26, no. 8, pp. 1247–1251, 2019.

[52] Q. Lu, X. Liao, H. Li, and T. Huang, “A Nesterov-like gradient tracking algorithm for distributed optimization over directed networks,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020.

[53] T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3744–3757, 2017.

[54] S. S. Ram, A. Nedic, and V. V. Veeravalli, “Distributed stochastic subgradient projection algorithms for convex optimization,” Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.

[55] P. Bianchi and J. Jakubowicz, “Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization,” IEEE Transactions on Automatic Control, vol. 58, no. 2, pp. 391–405, 2012.

[56] B. Johansson, M. Rabi, and M. Johansson, “A randomized incremental subgradient method for distributed optimization in networked systems,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1157–1170, 2010.

[57] M. Maros and J. Jalden, “PANDA: A dual linearly converging method for distributed optimization over time-varying undirected graphs,” in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 6520–6525.

[58] ——, “A geometrically converging dual method for distributed optimization over time-varying graphs,” IEEE Transactions on Automatic Control, 2020.

[59] ——, “ECO-PANDA: A computationally economic, geometrically converging dual optimization method on time-varying undirected graphs,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5257–5261.

[60] K. Seaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulie, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 3027–3036.

[61] G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic optimization,” Mathematical Programming, vol. 180, no. 1, pp. 237–284, 2020.

[62] Y. Nesterov, “Primal-dual subgradient methods for convex problems,” Mathematical Programming, vol. 120, no. 1, pp. 221–259, 2009.

[63] L. Xiao, “Dual averaging methods for regularized stochastic learning and online optimization,” Journal of Machine Learning Research, vol. 11, no. Oct, pp. 2543–2596, 2010.

[64] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2011.

[65] K. I. Tsianos and M. G. Rabbat, “Distributed consensus and optimization under communication delays,” in 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2011, pp. 974–982.

[66] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, “Push-sum distributed dual averaging for convex optimization,” in 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). IEEE, 2012, pp. 5453–5458.

[67] V.-L. Dang, B.-S. Le, T.-T. Bui, H.-T. Huynh, and C.-K. Pham, “A decentralized localization scheme for swarm robotics based on coordinate geometry and distributed gradient descent,” in MATEC Web of Conferences, vol. 54. EDP Sciences, 2016, p. 02002.

[68] N. A. Alwan and A. S. Mahmood, “Distributed gradient descent localization in wireless sensor networks,” Arabian Journal for Science and Engineering, vol. 40, no. 3, pp. 893–899, 2015.

[69] M. Todescato, A. Carron, R. Carli, and L. Schenato, “Distributed localization from relative noisy measurements: A robust gradient based approach,” in 2015 European Control Conference (ECC). IEEE, 2015, pp. 1914–1919.

[70] R. Tron and R. Vidal, “Distributed image-based 3-D localization of camera sensor networks,” in Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference. IEEE, 2009, pp. 901–908.

[71] ——, “Distributed computer vision algorithms,” IEEE Signal Processing Magazine, vol. 28, no. 3, pp. 32–45, 2011.

[72] R. Tron, Distributed Optimization on Manifolds for Consensus Algorithms and Camera Network Localization. The Johns Hopkins University, 2012.

[73] R. Tron and R. Vidal, “Distributed 3-D localization of camera sensor networks from 2-D image measurements,” IEEE Transactions on Automatic Control, vol. 59, no. 12, pp. 3325–3340, 2014.

[74] A. Sarlette and R. Sepulchre, “Consensus optimization on manifolds,” SIAM Journal on Control and Optimization, vol. 48, no. 1, pp. 56–76, 2009.

[75] K.-K. Oh and H.-S. Ahn, “Formation control and network localization via orientation alignment,” IEEE Transactions on Automatic Control, vol. 59, no. 2, pp. 540–545, 2013.

[76] ——, “Distributed formation control based on orientation alignment and position estimation,” International Journal of Control, Automation and Systems, vol. 16, no. 3, pp. 1112–1119, 2018.

[77] E. Montijano and A. R. Mosteo, “Efficient multi-robot formations using distributed optimization,” in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 6167–6172.

[78] M. Burger, G. Notarstefano, F. Bullo, and F. Allgower, “A distributed simplex algorithm for degenerate linear programs and multi-agent assignments,” Automatica, vol. 48, no. 9, pp. 2298–2304, 2012.

[79] T. Fan and T. Murphey, “Majorization minimization methods for distributed pose graph optimization with convergence guarantees,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 5058–5065.

[80] Y. Tian, A. Koppel, A. S. Bedi, and J. P. How, “Asynchronous and parallel distributed pose graph optimization,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5819–5826, 2020.

[81] J. Knuth and P. Barooah, “Collaborative localization with heterogeneous inter-robot measurements by Riemannian optimization,” in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 1534–1539.

[82] Y. Tian, K. Khosoussi, and J. P. How, “Block-coordinate descent on the Riemannian staircase for certifiably correct distributed rotation and pose synchronization,” arXiv preprint arXiv:1911.03721, 2019.

[83] S. Hosseini, A. Chapman, and M. Mesbahi, “Online distributed optimization via dual averaging,” in 52nd IEEE Conference on Decision and Control. IEEE, 2013, pp. 1484–1489.

[84] S. Shahrampour and A. Jadbabaie, “Exponentially fast parameter estimation in networks using distributed dual averaging,” in 52nd IEEE Conference on Decision and Control. IEEE, 2013, pp. 6196–6201.

[85] S. Hosseini, A. Chapman, and M. Mesbahi, “Online distributed convex optimization on dynamic networks,” IEEE Transactions on Automatic Control, vol. 61, no. 11, pp. 3545–3550, 2016.

[86] S. Lee, A. Nedic, and M. Raginsky, “Stochastic dual averaging for decentralized online optimization on time-varying communication graphs,” IEEE Transactions on Automatic Control, vol. 62, no. 12, pp. 6407–6414, 2017.

[87] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, “Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2012, pp. 1543–1550.

[88] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[89] M. Zargham, A. Ribeiro, A. Ozdaglar, and A. Jadbabaie, “Accelerated dual descent for network flow optimization,” IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 905–920, 2013.

[90] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “A decentralized second-order method with exact linear convergence rate for consensus optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 507–522, 2016.

[91] M. Eisen, A. Mokhtari, A. Ribeiro, and A. We, “Decentralized quasi-Newton methods,” IEEE Transactions on Signal Processing, vol. 65, no. 10, pp. 2613–2628, 2017.

[92] M. Eisen, A. Mokhtari, and A. Ribeiro, “A primal-dual quasi-Newton method for exact consensus optimization,” IEEE Transactions on Signal Processing, vol. 67, no. 23, pp. 5983–5997, 2019.

[93] Y. Sun and G. Scutari, “Distributed nonconvex optimization for sparse representation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4044–4048.

[94] S. Choudhary, L. Carlone, C. Nieto, J. Rogers, H. I. Christensen, and F. Dellaert, “Distributed mapping with privacy and communication constraints: Lightweight algorithms and object-based models,” The International Journal of Robotics Research, vol. 36, no. 12, pp. 1286–1311, 2017.

[95] S. Scardapane, R. Fierimonte, P. Di Lorenzo, M. Panella, and A. Uncini, “Distributed semi-supervised support vector machines,” Neural Networks, vol. 80, pp. 43–52, 2016.

[96] S. Scardapane and P. Di Lorenzo, “A framework for parallel and distributed training of neural networks,” Neural Networks, vol. 91, pp. 42–54, 2017.

[97] R. Altilio, P. Di Lorenzo, and M. Panella, “Distributed data clustering over networks,” Pattern Recognition, vol. 93, pp. 603–620, 2019.

[98] H. Terelius, U. Topcu, and R. M. Murray, “Decentralized multi-agent optimization via dual decomposition,” IFAC Proceedings Volumes, vol. 44, no. 1, pp. 11245–11251, 2011.

[99] B. Houska, J. Frasch, and M. Diehl, “An augmented Lagrangian based algorithm for distributed nonconvex optimization,” SIAM Journal on Optimization, vol. 26, no. 2, pp. 1101–1127, 2016.

[100] N. Chatzipanagiotis, D. Dentcheva, and M. M. Zavlanos, “An augmented Lagrangian method for distributed optimization,” Mathematical Programming, vol. 152, no. 1-2, pp. 405–434, 2015.

[101] F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem, “Asynchronous distributed optimization using a randomized alternating direction method of multipliers,” in 52nd IEEE Conference on Decision and Control. IEEE, 2013, pp. 3671–3676.

[102] Q. Ling, W. Shi, G. Wu, and A. Ribeiro, “DLM: Decentralized linearized alternating direction method of multipliers,” IEEE Transactions on Signal Processing, vol. 63, no. 15, pp. 4051–4064, 2015.

[103] T.-H. Chang, M. Hong, and X. Wang, “Multi-agent distributed optimization via inexact consensus ADMM,” IEEE Transactions on Signal Processing, vol. 63, no. 2, pp. 482–497, 2014.

[104] F. Farina, A. Garulli, A. Giannitrapani, and G. Notarstefano, “A distributed asynchronous method of multipliers for constrained nonconvex optimization,” Automatica, vol. 103, pp. 243–253, 2019.

[105] A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, “DQM: Decentralized quadratically approximated alternating direction method of multipliers,” IEEE Transactions on Signal Processing, vol. 64, no. 19, pp. 5158–5173, 2016.

[106] O. Shorinwa, T. Halsted, and M. Schwager, “Scalable distributed optimization with separable variables in multi-agent networks,” in 2020 American Control Conference (ACC). IEEE, 2020, pp. 3619–3626.

[107] R. Zhang, S. Zhu, T. Fang, and L. Quan, “Distributed very large scale bundle adjustment by global camera consensus,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 29–38.

[108] A. Eriksson, J. Bastian, T.-J. Chin, and M. Isaksson, “A consensus-based framework for distributed bundle adjustment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1754–1762.

[109] S. Choudhary, L. Carlone, H. I. Christensen, and F. Dellaert, “Exactly sparse memory efficient SLAM using the multi-block alternating direction method of multipliers,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 1349–1356.

[110] S.-S. Park, Y. Min, J.-S. Ha, D.-H. Cho, and H.-L. Choi, “A distributed ADMM approach to non-myopic path planning for multi-target tracking,” IEEE Access, vol. 7, pp. 163589–163603, 2019.


[111] K. Natesan Ramamurthy, C.-C. Lin, A. Aravkin, S. Pankanti, and R. Viguier, “Distributed bundle adjustment,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 2146–2154.

[112] R. Haksar, O. Shorinwa, P. Washington, and M. Schwager, “Consensus-based ADMM for task assignment in multi-robot teams,” in International Symposium on Robotics Research, 2019.

[113] H. F. Xu, Q. Ling, and A. Ribeiro, “Online learning over a decentralized network through ADMM,” Journal of the Operations Research Society of China, vol. 3, no. 4, pp. 537–562, 2015.

[114] A. Khodabandeh and P. Teunissen, “Distributed least-squares estimation applied to GNSS networks,” Measurement Science and Technology, vol. 30, no. 4, p. 044005, 2019.

[115] L. Ferranti, R. R. Negenborn, T. Keviczky, and J. Alonso-Mora, “Coordination of multiple vessels via distributed nonlinear model predictive control,” in 2018 European Control Conference (ECC). IEEE, 2018, pp. 2523–2528.

[116] S. Kumar, R. Jain, and K. Rajawat, “Asynchronous optimization over heterogeneous networks via consensus ADMM,” IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 1, pp. 114–129, 2016.

[117] O. Shorinwa and M. Schwager, “Scalable collaborative manipulation with distributed trajectory planning,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 1. IEEE, 2020, pp. 9108–9115.

[118] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.

[119] S. Jafarizadeh and A. Jamalipour, “Weight optimization for distributed average consensus algorithm in symmetric, CCS & KCS star networks,” arXiv preprint arXiv:1001.4278, 2010.

[120] A. S. Berahas, R. Bollapragada, N. S. Keskar, and E. Wei, “Balancing communication and computation in distributed optimization,” IEEE Transactions on Automatic Control, vol. 64, no. 8, pp. 3141–3155, 2018.

[121] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2021–2031.

[122] R. Olfati-Saber, “Distributed Kalman filtering for sensor networks,” in 2007 46th IEEE Conference on Decision and Control. IEEE, 2007, pp. 5492–5498.

[123] W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling et al., “Numerical recipes,” 1989.

X. APPENDIX

A. Fixed Step-size DGD General Form

As demonstrated in [48], a range of fixed-step-size DGD methods, including EXTRA [10], NIDS [38], Exact Diffusion [39], and DIGing [36], can be alternatively represented in a canonical form. In this canonical form, weights define the inclusion of several terms common to fixed-step-size DGD, including the local gradient and the local decision variable history. Algorithm 9 provides a unifying template for fixed-step-size distributed gradient descent (DGD) methods, with the selection of the parameters α, ζ0, ζ1, ζ2, and ζ3 defining a specific DGD method. For an illustrative example, we obtain EXTRA (refer to Algorithm 3) from the parameters ζ0 = 1/2, ζ1 = 1, ζ2 = 0, and ζ3 = 0.
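As a concrete illustration, the canonical update of Algorithm 9 can be simulated centrally in a few lines of Python. The sketch below is our own rendering, not code from the survey: the stacked-vector form, the quadratic consensus objective f_i(x) = (x − b_i)²/2, and all variable names are illustrative choices. Setting (ζ0, ζ1, ζ2, ζ3) = (1/2, 1, 0, 0) recovers the EXTRA update as described above.

```python
import numpy as np

def dgd_canonical_step(x, z, W, grad, alpha, z0, z1, z2, z3):
    """One synchronous iteration of the canonical fixed-step-size DGD
    template (Algorithm 9), simulated centrally.

    x, z  : (n,) stacked local decision variables and history states
    W     : (n, n) doubly stochastic weight matrix (w_ij = 0 for non-neighbors)
    grad  : maps stacked points (n,) -> stacked local gradients (n,)
    """
    Wx, Wz = W @ x, W @ z                   # neighborhood weighted averages
    eval_pt = (1.0 - z3) * x + z3 * Wx      # gradient evaluation point
    x_next = (x + z0 * z
              - z1 * (x - Wx)
              + z2 * (z - Wz)
              - alpha * grad(eval_pt))
    z_next = z - x + Wx
    return x_next, z_next

# Illustrative consensus problem: f_i(x) = (x - b_i)^2 / 2, minimizer mean(b) = 2.
b = np.array([1.0, 2.0, 3.0])
W = np.full((3, 3), 1.0 / 3.0)              # complete graph, uniform weights
x, z = b.copy(), np.zeros(3)
for _ in range(500):
    # EXTRA parameters: (z0, z1, z2, z3) = (1/2, 1, 0, 0)
    x, z = dgd_canonical_step(x, z, W, lambda y: y - b, 0.1, 0.5, 1.0, 0.0, 0.0)
```

After the loop, all entries of `x` agree on the global minimizer, illustrating the exact-convergence property that distinguishes EXTRA-type methods from vanilla DGD with a fixed step size.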

B. Golden Section Search

We assume that rather than tuning parameters online using a distributed algorithm, the roboticist selects suitable parameters for an implementation before deploying a system, either using analytical results or simulation. In this case, the most general (centralized) procedure for parameter tuning involves comparing the convergence performance of the system on a known problem for different parameter values. However, while

Algorithm 9 Fixed Step-Size Distributed Gradient Descent
Private variables: P_i = ∅
Public variables: Q_i^(k) = (x_i^(k), z_i^(k))
Parameters: R_i^(k) = (α, ζ0, ζ1, ζ2, ζ3)
Update equations:

x_i^(k+1) = x_i^(k) + ζ0 z_i^(k)
            − ζ1 ( x_i^(k) − ∑_{j∈N_i∪{i}} w_ij x_j^(k) )
            + ζ2 ( z_i^(k) − ∑_{j∈N_i∪{i}} w_ij z_j^(k) )
            − α ∇f_i( (1 − ζ3) x_i^(k) + ζ3 ∑_{j∈N_i∪{i}} w_ij x_j^(k) )

z_i^(k+1) = z_i^(k) − x_i^(k) + ∑_{j∈N_i∪{i}} w_ij x_j^(k)

Algorithm 10 Golden Section Search
1: function GSS(a0, a3, f(·))
2:     h ← a3 − a0
3:     (a1, a2) ← (a0 + ϕ²h, a0 + ϕh)
4:     while stopping criterion is not satisfied do
5:         if f(a1) < f(a2) then
6:             (a3, a2) ← (a2, a1)
7:             h ← ϕh
8:             a1 ← a0 + ϕ²h
9:         else
10:            (a0, a1) ← (a1, a2)
11:            h ← ϕh
12:            a2 ← a0 + ϕh
13:        end if
14:    end while
15: end function

a uniform sweep of the parameter space may be effective for small problems or parameter-insensitive methods, it is not computationally efficient. Given the convergence rate of a distributed method at particular choices of parameter, bracketing methods select parameter candidates that more efficiently locate the convergence-rate-minimizing parameter. For instance, Golden Section Search (GSS) provides a versatile approach for tuning a scalar parameter [123]. Assuming that the dependence of the convergence rate on the given parameter is strictly unimodal (as demonstrated in Section VII), GSS maintains four parameter candidates and refines its list of candidates based on the performance of the two interior candidates. Algorithm 10 outlines the basic procedure for parameter search. We present the performance results in Section VII according to the best parameter choice found using GSS on log10(a) for each parameter a (typically requiring a total of 10–15 problem evaluations).
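The bracketing logic of Algorithm 10 translates directly into code. The sketch below is our own minimal Python rendering; the width-based stopping criterion, the caching of the two interior function values (so each iteration costs one new evaluation), and the final example objective are illustrative choices not specified in the pseudocode:

```python
import math

PHI = (math.sqrt(5) - 1) / 2   # the ratio phi of Algorithm 10; phi^2 + phi = 1

def gss(a0, a3, f, tol=1e-6):
    """Golden Section Search: minimize a strictly unimodal f over [a0, a3]."""
    h = a3 - a0
    a1, a2 = a0 + PHI**2 * h, a0 + PHI * h
    f1, f2 = f(a1), f(a2)
    while h > tol:                 # stopping criterion: bracket width
        h *= PHI                   # each step shrinks the bracket by phi
        if f1 < f2:                # minimum lies in [a0, a2]
            a3, a2, f2 = a2, a1, f1
            a1 = a0 + PHI**2 * h
            f1 = f(a1)
        else:                      # minimum lies in [a1, a3]
            a0, a1, f1 = a1, a2, f2
            a2 = a0 + PHI * h
            f2 = f(a2)
    return 0.5 * (a0 + a3)         # midpoint of the final bracket

# Tuning a hypothetical scalar parameter in log space, as in the text:
# minimize a surrogate convergence-rate curve over exponents e = log10(a).
best_log_a = gss(-4.0, 1.0, lambda e: (10**e - 0.05)**2)
```

Because the golden-ratio spacing makes one interior candidate of the old bracket reusable as an interior candidate of the new one, only a single fresh evaluation of the (possibly expensive) convergence-rate experiment is needed per iteration.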