
Distributed Zero-Order Algorithms for Nonconvex Multi-Agent Optimization

Yujie Tang^1, Junshan Zhang^2, and Na Li^1

^1 School of Engineering and Applied Sciences, Harvard University
^2 School of Electrical, Computer and Energy Engineering, Arizona State University

Abstract

Distributed multi-agent optimization finds many applications in distributed learning, control, estimation, etc. Most existing algorithms assume knowledge of first-order information of the objective and have been analyzed for convex problems. However, there are situations where the objective is nonconvex, and one can only evaluate the function values at finitely many points. In this paper we consider derivative-free distributed algorithms for nonconvex multi-agent optimization, based on recent progress in zero-order optimization. We develop two algorithms for different settings, provide detailed analysis of their convergence behavior, and compare them with existing centralized zero-order algorithms and gradient-based distributed algorithms.

Keywords: Distributed optimization, nonconvex optimization, zero-order information

1 Introduction

Consider a set of n agents connected over a network, each of which is associated with a smooth local objective function f_i that can be nonconvex. The goal is to solve the optimization problem

    min_{x∈R^d} f(x) := (1/n) ∑_{i=1}^n f_i(x)

with the restriction that f_i is only known to agent i, and each agent can exchange information only with its neighbors in the network during the optimization procedure. We focus on the situation where only zero-order information of f_i is available to agent i.

Distributed multi-agent optimization lies at the core of a wide range of applications, and a large body of literature has been devoted to distributed multi-agent optimization algorithms. One line of research combines (sub)gradient-based methods with a consensus/averaging scheme, where each iteration of a local agent consists of one or multiple consensus steps and a local gradient evaluation step. It has been shown that, for convex functions, the convergence rates of distributed gradient-based algorithms can match or nearly match those of centralized gradient-based algorithms. Specifically, [9, 2] proposed and analyzed consensus-based decentralized gradient descent (DGD) algorithms with O(log t/√t) convergence for nonsmooth convex functions; [10, 7, 11] employed the gradient tracking scheme and showed that DGD with gradient tracking achieves O(1/t) convergence for smooth convex functions and linear convergence for strongly convex functions; [12] employed Nesterov's gradient descent method and showed O(1/t^{1.4−ε}) convergence for smooth convex functions and improved linear convergence for strongly convex functions, where ε is an arbitrarily small positive number.

Table 1: Comparison of different algorithms for distributed optimization and zero-order optimization.

|                                                   | smooth                                               | gradient dominated                                   |
|---------------------------------------------------|------------------------------------------------------|------------------------------------------------------|
| distributed zero-order (nonconvex)                |                                                      |                                                      |
| Alg. 1, this paper (2-point + DGD)                | O(√(d/m) log m)                                      | O(d/m)                                               |
| Alg. 2, this paper (2d-point + gradient tracking) | O(d/m)                                               | O([1 − c(1−ρ²)²(μ/L)^{4/3}]^{m/d})                   |
| ZONE [1]                                          | O(γ(d)/√M)                                           | —                                                    |
| distributed first-order                           |                                                      |                                                      |
| DGD                                               | O(log t/√t) [2, 3] (convex); O(1/√T) [5] (nonconvex) | O(1/t) [4] (strongly convex)                         |
| gradient tracking                                 | O(1/t) [6] (nonconvex)                               | O([1 − c(1−ρ)²(μ/L)^{3/2}]^t) [7] (strongly convex)  |
| centralized zero-order                            |                                                      |                                                      |
| [8] (2-point estimator)                           | O(d/m) (nonconvex)                                   | O([1 − (c/d)(μ/L)]^m) (strongly convex)              |

Note: The table summarizes the best known convergence rates for deterministic nonconvex unconstrained optimization with 1) smooth and 2) gradient dominated objectives. The convex counterparts are listed if results for the nonconvex cases have not been established. m denotes the number of function value queries, t the number of iterations, and d the dimension of the decision variable; the c's represent numerical constants that can differ across algorithms. M denotes the total number of function value queries and T the total number of iterations provided before the optimization procedure; the rates in [1] and [5] assume constant step sizes chosen based on M or T. The listed convergence rates are the ergodic rates of ‖∇f‖² for the smooth case, and the objective error rates for the gradient dominated case, respectively. The rates provided in [1] do not include explicit dependence on d; we use γ(d) to denote this dependence. The cited results in this table may apply to more general settings (e.g., stochastic gradients [5, 4]). We do not include algorithms with Nesterov-type acceleration in this comparison.

Besides convergence rates, some works have additional focuses, such as time-varying/directed graphs [13], uncoordinated step sizes [14], stochastic (sub)gradients [15], etc.

While distributed convex optimization has broad applicability, nonconvex problems also appear in important applications such as distributed learning [16], robotic networks [17], operation of wind farms [18], etc. Several works have considered nonconvex multi-agent optimization and developed various distributed gradient-based methods that converge to stationary points, with convergence rate analysis, e.g., [19, 5, 3, 6]. We notice that for smooth functions, either convex or nonconvex, DGD with gradient tracking in general converges faster than the method without gradient tracking, and its convergence rate has the same big-O dependence on the number of iterations as the centralized vanilla gradient descent method (see Table 1).

Further, there has been increasing interest in zero-order optimization, where one does not have access to the gradient of the objective. Such situations can occur, for example, when only black-box procedures are available for computing the values of the functional characteristics of the problem, or when resource limitations restrict the use of fast or automatic differentiation techniques. Many existing works [20, 21, 22, 8, 23] on zero-order optimization are based on constructing gradient estimators from finitely many function evaluations, e.g., the gradient estimator based on the Kiefer–Wolfowitz scheme [20] that uses 2d-point function evaluations, where d is the dimension of the problem.

However, this estimator does not scale well to high-dimensional problems. [21] proposed and analyzed a single-point gradient estimator, and [22] further studied the convergence rate for highly smooth objectives. [8] proposed two-point gradient estimators and showed that the convergence rates of the resulting algorithms are comparable to their first-order counterparts (see Table 1). For instance, gradient descent with two-point gradient estimators converges at a rate of O(d/m), where m denotes the number of function value queries. [23] and [24] showed that two-point gradient estimators achieve the optimal rate O(√(d/m)) of stochastic zero-order convex optimization.

Some recent works have started to combine zero-order and distributed optimization methods [1, 25, 26]. For example, [1] proposed the ZONE algorithm for stochastic nonconvex problems based on the method of multipliers. [25] proposed a distributed zero-order algorithm over random networks and established its convergence for strongly convex objectives. [26] considered distributed zero-order methods for constrained convex optimization. However, many questions remain to be studied in distributed zero-order optimization. In particular, how do zero-order and distributed methods affect each other's performance, and can their fundamental structural properties be kept when combining the two? For instance, it would be ideal if we could combine 2-point zero-order methods with DGD with gradient tracking and maintain the nice properties of both methods, leading to an "optimal" distributed zero-order algorithm if possible. This is unclear a priori, and indeed, as we shall show later, the 2-point gradient estimator and DGD with gradient tracking do not reconcile with each other well.

Contributions. Motivated by the above observations, we propose two distributed zero-order algorithms: Algorithm 1 is based on the 2-point estimator and DGD; Algorithm 2 is based on the 2d-point gradient estimator and DGD with gradient tracking. We analyze the performance of the two algorithms for deterministic nonconvex optimization, and compare their convergence rates with their distributed first-order and centralized zero-order counterparts. The convergence rates of the two algorithms are summarized in Table 1. Specifically, it can be seen that the rates of Algorithm 1 are comparable with first-order decentralized gradient descent but inferior to the centralized zero-order method; the rates of Algorithm 2 are comparable with the centralized zero-order method and first-order DGD with gradient tracking. On the other hand, Algorithm 1 uses the 2-point gradient estimator, which requires only 2 function value queries per iteration, while Algorithm 2 employs the 2d-point gradient estimator, whose computation involves 2d function value queries, indicating that Algorithm 1 could be favored for high-dimensional problems even though its convergence is asymptotically slower, while Algorithm 2 could handle problems of relatively low dimension better, with faster convergence. These results shed light on how zero-order evaluations affect distributed optimization and how the presence of network structure affects zero-order algorithms. Different problems and different computation requirements would favor different integrations of zero-order methods and distributed methods.

Compared to the existing literature on distributed zero-order optimization, our Algorithm 1 is similar to the algorithms proposed in [25, 26], but our analysis assumes nonconvex objectives and also considers gradient dominated functions. While [1] analyzed the performance of the ZONE algorithm for unconstrained nonconvex problems, we shall see that our Algorithm 1 achieves convergence behavior comparable with ZONE-M, and Algorithm 2 converges faster than ZONE-M in the deterministic setting due to the use of the gradient tracking technique. A more detailed comparison is given in Section 3.4.

Notation. We denote the ℓ2-norm of vectors and matrices by ‖·‖. The standard basis of R^d is denoted by {e_k}_{k=1}^d. We let 1_n ∈ R^n denote the vector of all ones. We let B^d denote the closed unit ball in R^d, and let S^{d−1} := {x ∈ R^d : ‖x‖ = 1} denote the unit sphere. The uniform distributions over B^d and S^{d−1} will be denoted by U(B^d) and U(S^{d−1}).

I_d denotes the d × d identity matrix. For two matrices A = [a_ij] ∈ R^{p×q} and B = [b_ij] ∈ R^{r×s}, their tensor product A ⊗ B is

    A ⊗ B = [ a_11 B  ⋯  a_1q B
              ⋮        ⋱  ⋮
              a_p1 B  ⋯  a_pq B ] ∈ R^{pr×qs}.

2 Formulation and Algorithms

2.1 Problem Formulation

Let N = {1, 2, …, n} be the set of agents. Suppose the agents are connected by a communication network whose topology is represented by an undirected, connected graph G = (N, E), where the edges in E represent communication links.

Each agent i is associated with a local objective function f_i : R^d → R. The goal of the agents is to collaboratively solve the optimization problem

    min_{x∈R^d} f(x) := (1/n) ∑_{i=1}^n f_i(x).    (1)

We assume that at each time step, agent i can only query the function values of f_i at finitely many points, and can only communicate with its neighbors. Similar to [8] and other works on zero-order optimization, we assume a deterministic setting where the queries of the function values are noise-free and error-free. The analysis of the deterministic setting will provide a baseline for extension to stochastic optimization, which we leave as future work.

The following definitions will be useful later in the paper.

Definition 1.
1. A function f : R^d → R is said to be L-smooth if f is continuously differentiable and satisfies
       ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d.
2. A function f : R^d → R is said to be G-Lipschitz if
       |f(x) − f(y)| ≤ G‖x − y‖ for all x, y ∈ R^d.
3. A function f : R^d → R is said to be μ-gradient dominated if f is differentiable, has a global minimizer x*, and
       2μ(f(x) − f(x*)) ≤ ‖∇f(x)‖² for all x ∈ R^d.

The notion of gradient domination is also known as the Polyak–Łojasiewicz (PL) inequality, first introduced by [27] and [28]. It can be viewed as a nonconvex analogue of strong convexity, as centralized vanilla gradient descent achieves linear convergence for gradient dominated objective functions. The gradient domination condition has been frequently discussed in nonconvex optimization [27, 29]. Moreover, nonconvex but gradient dominated objective functions appear in many applications, e.g., linear quadratic control problems [30] and deep linear neural networks [31].

2.2 Preliminaries on Zero-Order and Distributed Optimization

We present some preliminaries to motivate our algorithm development.

Zero-order optimization based on gradient estimation. In zero-order optimization, one tries to minimize a function with the limitation that only function values at finitely many points may be obtained. One basic approach to designing zero-order optimization algorithms is to construct gradient estimators from zero-order information and substitute them for the true gradients. Here we introduce two types of zero-order gradient estimators for the noiseless setting.

i) The 2d-point gradient estimator is given by

    G_f^{(2d)}(x; u) = ∑_{k=1}^d [f(x + u e_k) − f(x − u e_k)] / (2u) · e_k,    (2)

where u is some given positive number. Basically, it approximates the gradient ∇f(x) by taking finite differences along d orthogonal directions, and can be viewed as a noise-free version of the classical Kiefer–Wolfowitz type method [20]. Given an L-smooth function f : R^d → R, it can be shown that

    ‖G_f^{(2d)}(x; u) − ∇f(x)‖ ≤ (1/2) u L √d

for any x ∈ R^d. The right-hand side decreases to zero as u → 0. In other words, G_f^{(2d)}(x; u) can be arbitrarily close to ∇f(x) (as long as the finite differences can be evaluated accurately). One drawback of this estimator is that it requires 2d zero-order queries per evaluation, which may not be computationally efficient for high-dimensional problems.
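To make the construction concrete, the following is a minimal NumPy sketch of the estimator (2); the objective `f`, the query point `x`, and the radius `u` are placeholder names introduced only for this illustration.

```python
import numpy as np

def grad_est_2d_point(f, x, u):
    """2d-point gradient estimator (2): central differences along
    the d coordinate directions e_1, ..., e_d with radius u > 0."""
    d = x.shape[0]
    g = np.zeros(d)
    for k in range(d):
        e_k = np.zeros(d)
        e_k[k] = 1.0
        # 2 zero-order queries per coordinate, hence 2d queries in total.
        g[k] = (f(x + u * e_k) - f(x - u * e_k)) / (2.0 * u)
    return g
```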

ii) The 2-point gradient estimator is given by

    G_f^{(2)}(x; u, z) := d · [f(x + u z) − f(x − u z)] / (2u) · z,    (3)

where z ∈ R^d is a random vector sampled from the distribution U(S^{d−1}), and u > 0 is a given positive number. The following proposition indicates that when z is uniformly sampled from the sphere S^{d−1}, the expectation of G_f^{(2)}(x; u, z) is the gradient of a "locally averaged" version of f.

Proposition 1 ([21]). Suppose f : R^d → R is L-smooth. Then for any u > 0 and x ∈ R^d,

    E_{z∼U(S^{d−1})}[G_f^{(2)}(x; u, z)] = ∇f_u(x),

where f_u(x) := E_{y∼U(B^d)}[f(x + u y)].
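For illustration, here is a sketch of the estimator (3) together with a Monte Carlo check of Proposition 1 on a toy quadratic (for which ∇f_u = ∇f exactly, so the empirical mean should approach the true gradient); the function names, sample size, and test problem are arbitrary choices made for the demonstration.

```python
import numpy as np

def grad_est_2_point(f, x, u, z):
    """2-point gradient estimator (3); z is drawn from U(S^{d-1})."""
    d = x.shape[0]
    return d * (f(x + u * z) - f(x - u * z)) / (2.0 * u) * z

def sample_sphere(d, rng):
    """Draw z ~ U(S^{d-1}) by normalizing a standard Gaussian vector."""
    z = rng.standard_normal(d)
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
d, u = 8, 1e-3
A = rng.standard_normal((d, d)); A = A.T @ A   # f(x) = x^T A x / 2
f = lambda x: 0.5 * x @ A @ x
x = rng.standard_normal(d)
est = np.mean([grad_est_2_point(f, x, u, sample_sphere(d, rng))
               for _ in range(100000)], axis=0)
print(np.linalg.norm(est - A @ x))  # small, and shrinks as samples grow
```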

It has been shown in [8] that if we substitute G_f^{(2)}(x; u, z) for the gradient in the gradient descent algorithm, we have

    (1/t) ∑_{τ=0}^{t−1} ‖∇f(x_τ)‖² = O(d/m)

for nonconvex smooth objectives, and

    f(x_τ) − f* = O([1 − (c/d)(μ/L)]^m)

for smooth and strongly convex objectives, where x_τ denotes the τ'th iterate and m denotes the number of zero-order queries in t iterations (see Table 1). These rates are comparable to those of the (centralized) vanilla gradient descent method, i.e., O(1/t) for nonconvex smooth objectives and linear convergence for smooth and strongly convex objectives.
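A minimal sketch of the resulting centralized scheme, i.e., gradient descent with (3) substituted for the gradient; the constant step size and radius below are placeholders rather than the tuned schedules of [8].

```python
import numpy as np

def zo_gradient_descent(f, x0, eta, u, T, rng):
    """Centralized gradient descent driven by the 2-point estimator (3);
    each iteration consumes exactly 2 function value queries."""
    x = x0.copy()
    d = x.shape[0]
    for _ in range(T):
        z = rng.standard_normal(d)
        z /= np.linalg.norm(z)                       # z ~ U(S^{d-1})
        g = d * (f(x + u * z) - f(x - u * z)) / (2.0 * u) * z
        x = x - eta * g
    return x
```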


Distributed optimization. In this paper, we mainly focus on consensus-based algorithms for distributed optimization, where each agent maintains a local copy of the global variable and weighs its neighbors' information to update its own local variable. Specifically, for a time-invariant and bidirectional communication network, we introduce a consensus matrix W = [W_ij] ∈ R^{n×n} that satisfies the following assumption:

Assumption 1.
1. W is a doubly stochastic matrix.
2. W_ii > 0 for all i ∈ N, and for two distinct agents i and j, W_ij > 0 if and only if (i, j) ∈ E.

When Assumption 1 is satisfied, we have [11]

    ρ := ‖W − n^{−1} 1_n 1_n^⊤‖ < 1.    (4)
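As a concrete instance, the Metropolis–Hastings weights used later in Section 4 satisfy Assumption 1 for any connected undirected graph; the sketch below builds such a W and evaluates ρ in (4). The adjacency-list input format is an assumption made for the illustration.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis-Hastings consensus matrix: W_ij = 1/(1 + max(deg_i, deg_j))
    for neighbors, W_ii = 1 - sum_j W_ij; doubly stochastic with W_ii > 0
    and W_ij > 0 iff (i, j) is an edge, as required by Assumption 1."""
    n = len(adj)
    W = np.zeros((n, n))
    for i in range(n):
        for j in adj[i]:
            W[i, j] = 1.0 / (1 + max(len(adj[i]), len(adj[j])))
        W[i, i] = 1.0 - W[i].sum()
    return W

adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}        # a path graph on 4 agents
W = metropolis_weights(adj)
n = len(adj)
rho = np.linalg.norm(W - np.ones((n, n)) / n, 2)    # spectral norm in (4)
print(rho)                                          # strictly less than 1
```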

We present two consensus-based algorithms that will serve as the basis for designing distributed zero-order algorithms.

i) The decentralized gradient descent (DGD) algorithm [9, 2] is given by the iterations

    x_i(t) = ∑_{j=1}^n W_ij x_j(t−1) − η_t ∇f_i(x_i(t−1)),    (5)

where x_i(t) ∈ R^d denotes the local copy of the decision variable for the i'th agent, and η_t is the step size. It has been shown that DGD in general converges more slowly than the centralized gradient descent algorithm [2, 11] for smooth functions. This is because the local gradient ∇f_i does not vanish at the stationary point, so a diminishing step size η_t is necessary, which slows down the convergence.
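A sketch of the DGD iteration (5), assuming hypothetical local gradient oracles `grads[i]` and a consensus matrix W satisfying Assumption 1; the rows of `x` stack the local copies x_i(t), and 1/√t is one typical diminishing step-size choice.

```python
import numpy as np

def dgd(grads, W, x0, eta0, T):
    """Decentralized gradient descent (5):
    x_i(t) = sum_j W_ij x_j(t-1) - eta_t * grad f_i(x_i(t-1))."""
    n, d = x0.shape
    x = x0.copy()
    for t in range(1, T + 1):
        eta = eta0 / np.sqrt(t)   # diminishing step size
        g = np.stack([grads[i](x[i]) for i in range(n)])
        x = W @ x - eta * g       # consensus averaging + local gradient step
    return x
```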

ii) The DGD gradient tracking method incorporates additional local variables s_i(t) to track the global gradient ∇f = (1/n) ∑_i ∇f_i:

    s_i(t) = ∑_{j=1}^n W_ij s_j(t−1) + ∇f_i(x_i(t−1)) − ∇f_i(x_i(t−2)),
    x_i(t) = ∑_{j=1}^n W_ij x_j(t−1) − η_t s_i(t),

where we set s_i(0) = ∇f_i(x_i(0)) for each i. Since gradient tracking was proposed, it has attracted much attention and inspired many recent studies [14, 19, 7, 11, 6], as it can accelerate convergence for smooth objectives compared to DGD. Here we provide a high-level explanation of how gradient tracking works: for smooth functions, when the x_i(t) approach consensus, ∇f_i(x_i(t)) will not change much because of the smoothness, and therefore the local variables s_i(t) will eventually reach a consensus; on the other hand, by induction it can be shown that

    (1/n) ∑_{i=1}^n s_i(t) = (1/n) ∑_{i=1}^n ∇f_i(x_i(t)).

Therefore, the sequence (s_i(t))_{t∈N} will eventually converge to the global gradient, and a constant step size η_t = η is allowed, leading to convergence rates comparable to centralized gradient methods. See [11, Section III and Section IV.B] for more discussion.
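A corresponding sketch of DGD with gradient tracking, again with hypothetical gradient oracles; it uses one common ordering of the updates (x before s), slightly different from but analogous to the indexing displayed above.

```python
import numpy as np

def gradient_tracking(grads, W, x0, eta, T):
    """DGD with gradient tracking; the rows of s track the global gradient.
    Averaging the tracking update shows that (1/n) sum_i s_i always equals
    (1/n) sum_i grad f_i(x_i), which is the induction noted in the text."""
    n, d = x0.shape
    x = x0.copy()
    g = np.stack([grads[i](x[i]) for i in range(n)])
    s = g.copy()                      # s_i(0) = grad f_i(x_i(0))
    for _ in range(T):
        x = W @ x - eta * s           # a constant step size is allowed here
        g_new = np.stack([grads[i](x[i]) for i in range(n)])
        s = W @ s + g_new - g         # tracking update
        g = g_new
    return x
```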


2.3 Our Algorithms

Following the previous discussion, it would be ideal if we could combine the 2-point gradient estimator and DGD with gradient tracking, and maintain a convergence rate comparable to the centralized vanilla gradient descent method. However, it turns out that such a combination does not lead to the desired convergence rate. This is mainly because gradient tracking requires increasingly accurate local gradient information as one approaches the stationary point in order to achieve faster convergence than DGD, whereas the 2-point gradient estimator can produce a variance that does not decrease to zero even if the radius u decreases to zero; a more detailed explanation is provided in Section 3.3.

We propose the following two distributed zero-order algorithms for problem (1).¹

Algorithm 1: 2-point gradient estimator without global gradient tracking

    for t = 1, 2, 3, … do
        foreach i ∈ N do
            1. Generate z_i(t) ∼ U(S^{d−1}), independently of (z_i(τ))_{τ=1}^{t−1} and of (z_j(τ))_{τ=1}^{t} for j ≠ i.
            2. Update x_i(t) by
                   g_i(t) = G_{f_i}^{(2)}(x_i(t−1); u_t, z_i(t)),    (6)
                   x_i(t) = ∑_{j=1}^n W_ij (x_j(t−1) − η_t g_j(t)).    (7)
        end
    end

Algorithm 2: 2d-point gradient estimator with global gradient tracking

    Set s_i(0) = g_i(0) = 0 for each i ∈ N.
    for t = 1, 2, 3, … do
        foreach i ∈ N do
            1. Update s_i(t) by
                   g_i(t) = G_{f_i}^{(2d)}(x_i(t−1); u_t),    (8)
                   s_i(t) = ∑_{j=1}^n W_ij (s_j(t−1) + g_j(t) − g_j(t−1)).    (9)
            2. Update x_i(t) by
                   x_i(t) = ∑_{j=1}^n W_ij (x_j(t−1) − η s_j(t)).    (10)
        end
    end

1. Algorithm 1 employs the 2-point gradient estimator (3), and adopts the consensus procedure of the DGD algorithm, which only involves averaging over the local decision variables.

2. Algorithm 2 employs the 2d-point gradient estimator (2), and adopts the consensus procedure of the gradient tracking method, where the auxiliary variable s_i(t) is introduced to track the global gradient ∇f = (1/n) ∑_i ∇f_i. We shall see in Theorems 3 and 4 that s_i(t) converges to the gradient of the global objective function as t → ∞ under mild conditions. A sketch of both algorithms is given after these remarks.

¹ For both algorithms we employ the adapt-then-combine (ATC) strategy [32], a commonly used variant for consensus optimization that is slightly different from the combine-then-adapt (CTA) strategy in (5). Both ATC and CTA can be used in our algorithms, and the convergence results will be similar.
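The following is a minimal sketch of both algorithms, assuming zero-order oracles `fs[i]` for the local objectives and schedule callables `eta(t)`, `u(t)`; it reuses the estimators (2) and (3) and is meant to mirror the pseudocode, not to be a tuned implementation.

```python
import numpy as np

def est_2pt(f, x, u, rng):
    """2-point estimator (3) with a fresh z ~ U(S^{d-1})."""
    d = x.shape[0]
    z = rng.standard_normal(d)
    z /= np.linalg.norm(z)
    return d * (f(x + u * z) - f(x - u * z)) / (2.0 * u) * z

def est_2dpt(f, x, u):
    """2d-point estimator (2)."""
    d = x.shape[0]
    g = np.zeros(d)
    for k in range(d):
        e = np.zeros(d); e[k] = 1.0
        g[k] = (f(x + u * e) - f(x - u * e)) / (2.0 * u)
    return g

def algorithm1(fs, W, x0, eta, u, T, rng):
    """Algorithm 1: 2-point estimators + DGD in ATC form, (6)-(7).
    Each iteration costs 2 function value queries per agent."""
    n = len(fs)
    x = x0.copy()
    for t in range(1, T + 1):
        g = np.stack([est_2pt(fs[i], x[i], u(t), rng) for i in range(n)])
        x = W @ (x - eta(t) * g)                       # (7)
    return x

def algorithm2(fs, W, x0, eta, u, T):
    """Algorithm 2: 2d-point estimators + gradient tracking, (8)-(10),
    with s_i(0) = g_i(0) = 0; costs 2d queries per agent per iteration."""
    n, d = x0.shape
    x = x0.copy()
    s = np.zeros((n, d))
    g_prev = np.zeros((n, d))
    for t in range(1, T + 1):
        g = np.stack([est_2dpt(fs[i], x[i], u(t)) for i in range(n)])
        s = W @ (s + g - g_prev)                       # (9)
        x = W @ (x - eta * s)                          # (10)
        g_prev = g
    return x
```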

3 Main Results

In this section we present the convergence results of our algorithms. The proofs are postponed to the Appendix.

3.1 Convergence of Algorithm 1

Let x_i(t) denote the sequence generated by Algorithm 1 with a positive, non-increasing sequence of step sizes η_t. Denote

    x̄(t) := (1/n) ∑_{i=1}^n x_i(t),    R_0 := (1/n) ∑_{i=1}^n ‖x_i(0) − x̄(0)‖².

We first analyze the case with general nonconvex smooth objective functions.

Theorem 1. Assume that each local objective function f_i is uniformly G-Lipschitz and L-smooth for some positive constants G and L, and that f* := inf_{x∈R^d} f(x) > −∞.

1. Suppose η_1L ≤ 1/4, ∑_{t=1}^∞ η_t = +∞, ∑_{t=1}^∞ η_t² < +∞, and ∑_{t=1}^∞ η_tu_t² < +∞. Then almost surely, ‖x_i(t) − x̄(t)‖ converges to zero for all i ∈ N, ∇f(x̄(t)) converges to zero, and lim_{t→∞} f(x̄(t)) exists.

2. Suppose that

    η_t = α_η/(4L√d) · 1/√t,    u_t ≤ α_uG/(L√d) · 1/t^{γ/2−1/4}

with α_η ∈ (0, 1], α_u ≥ 0 and γ > 1. Then almost surely, ‖x_i(t) − x̄(t)‖ converges to zero for all i, and lim inf_{t→∞} ‖∇f(x̄(t))‖ = 0. Furthermore, we have

    [∑_{τ=0}^{t−1} η_{τ+1} E[‖∇f(x̄(τ))‖²]] / [∑_{τ=0}^{t−1} η_{τ+1}]
        ≤ √(d/t) [ α_ηG²/(3n²) · ln(2t+1) + 8L(f(x̄(0)) − f*)/α_η + 6R_0L²/((1−ρ²)√d)
            + 9α_η²κ²ρ²G²/(4(1−ρ²)²√d) + 9α_u²γG²/(4(γ−1)) ] + o(1/√t),    (11)

where κ is some positive numerical constant, and

    (1/n) ∑_{i=1}^n E[‖x_i(t) − x̄(t)‖²] ≤ α_η²κ²ρ²/(4(1−ρ²)²) · G²/(L²t) + o(t⁻¹).    (12)

Remark 1. Note that in (11), we use the squared norm of the gradient to assess the sub-optimality of the iterates, and characterize the convergence by ergodic rates. This type of convergence rate bound is common for local methods for unconstrained nonconvex problems where we do not aim for globally optimal solutions [33, 8].

Remark 2. Each iteration of Algorithm 1 requires 2 queries of function values. Thus the convergence rate (11) can also be interpreted as O(√(d/m) log m), where m denotes the number of function value queries. Characterizing the convergence rate in terms of the number of function value queries m and the dimension d is conventional for zero-order optimization. In scenarios where zero-order methods are applied, the computation of the function values is usually one of the most time-consuming procedures. In addition, it is also of interest to characterize how the convergence scales with the dimension d.

The next theorem shows that for a gradient dominated global objective, a better convergence rate can be achieved.

Theorem 2. Assume that each local objective function f_i is uniformly L-smooth for some L > 0. Furthermore, assume that inf_{x∈R^d} f_i(x) = f_i^* > −∞ for each i, and that the global objective function f is μ-gradient dominated and has a minimum value denoted by f*. Suppose

    η_t = 2α_η/(μ(t + t_0)),    u_t ≤ α_u/√(t + t_0)

for some α_η > 1 and α_u > 0, where

    t_0 ≥ 2α_ηL/(μ(1−ρ²)) · ( 32Ld/(3μ) + 9ρ ) − 1.

Then, using Algorithm 1, we have

    E[f(x̄(t)) − f*] ≤ ( 32α_η²L²Δ/μ² + 6α_ηα_u²L²/μ ) · d/t + o(t⁻¹),    (13)

    (1/n) ∑_{i=1}^n E[‖x_i(t) − x̄(t)‖²] ≤ 32α_η²ρ²LΔ/(μ²(1−ρ²)) · ( 4d/3 + 6ρ²/(1−ρ²) ) · 1/t² + o(t⁻²),    (14)

where Δ := f* − (1/n) ∑_{i=1}^n f_i^*.

Remark 3. The convergence rate (13) can also be described as E[f(x̄(t)) − f*] = O(d/m), where m is the number of function value queries.

Table 1 shows that, while Algorithm 1 employs a randomized 2-point zero-order estimator of ∇f_i, its convergence rates are comparable with those of the decentralized gradient descent (DGD) algorithm [5, 34]. However, its convergence rates are inferior to those of its centralized zero-order counterpart in [8].

3.2 Convergence of Algorithm 2

Let (x_i(t), s_i(t)) denote the sequence generated by Algorithm 2 with a constant step size η. Denote

    x̄(t) := (1/n) ∑_{i=1}^n x_i(t),
    R_0 := (1/n) ∑_{i=1}^n ( ηρ²/(2L) · ‖∇f_i(x_i(0))‖² + ‖x_i(0) − x̄(0)‖² ) + ηρ²u_1²Ld/4.

We first analyze the case where the local objectives are nonconvex and smooth.

Theorem 3. Assume that each local objective function f_i is uniformly L-smooth for some positive constant L, and that f* := inf_{x∈R^d} f(x) > −∞. Suppose

    ηL ≤ min{ 1/6, (1−ρ²)²/(4ρ²(3+4ρ²)) },    R_u := d ∑_{t=1}^∞ u_t² < +∞,

and that u_t is non-increasing. Then lim_{t→∞} f(x̄(t)) exists,

    (1/t) ∑_{τ=0}^{t−1} ‖∇f(x̄(τ))‖² ≤ (1/t) [ 3.2(f(x̄(0)) − f*)/η + 12.8L²R_0/(1−ρ²) + 2.4R_uL² ],    (15)

and

    (1/t) ∑_{τ=0}^{t−1} (1/n) ∑_{i=1}^n ‖x_i(τ) − x̄(τ)‖² ≤ (1/t) [ 1.6η(f(x̄(0)) − f*) + 3.2R_0/(1−ρ²) + 0.35R_u ],    (16)

    (1/t) ∑_{τ=1}^t (1/n) ∑_{i=1}^n ‖s_i(τ) − ∇f(x̄(τ−1))‖² ≤ (1/t) [ 9.6L(f(x̄(0)) − f*) + 19.2LR_0/(η(1−ρ²)) + (2.35/η)LR_u ].    (17)

Remark 4. Theorem 3 shows that Algorithm 2 achieves a convergence rate of O(1/t) in terms of the averaged squared norm of ∇f(x̄(t)), and has a consensus rate of O(1/t) for the averages of the squared consensus error ‖x_i(t) − x̄(t)‖² and the squared gradient tracking error ‖s_i(t) − ∇f(x̄(t−1))‖². These match the rates for distributed nonconvex optimization with gradient tracking [6]. On the other hand, since each iteration requires 2d queries of function values, we get an O(d/m) rate in terms of the number of function value queries m. This matches the convergence rate of centralized zero-order algorithms without Nesterov-type acceleration [8].

Now we proceed to the situation with a gradient dominated global objective.

Theorem 4. Assume that each local objective function f_i is uniformly L-smooth for some positive constant L, and that the global objective function f is μ-gradient dominated and achieves its global minimum at x*. Suppose the step size η satisfies

    ηL = α · (μ/L)^{1/3} (1−ρ²)²/14    (18)

for some α ∈ (0, 1], and (u_t)_{t≥1} is non-increasing. Let

    λ := 1 − α ((1−ρ²)/5)² (μ/L)^{4/3}.

Then

    f(x̄(t)) − f(x*) ≤ O(λ^t) + 5(1−ρ²)Ld ∑_{τ=0}^{t−1} λ^τ u_{t−τ}²,    (19)

    (1/n) ∑_{i=1}^n ‖x_i(t) − x̄(t)‖² ≤ O(λ^t) + 3ηLd/(1−ρ²) ∑_{τ=0}^{t−1} λ^τ u_{t−τ}²,    (20)

    (1/n) ∑_{i=1}^n ‖s_i(t) − ∇f(x̄(t−1))‖² ≤ O(λ^t) + 18L²d/(1−ρ²) ∑_{τ=0}^{t−1} λ^τ u_{t−τ}².    (21)

Remark 5. If we use an exponentially decreasing sequence u_t ∝ λ̃^{t/2} with λ̃ < λ, then the objective error f(x̄(t)) − f(x*) and the consensus errors ‖x_i(t) − x̄(t)‖² and ‖s_i(t) − ∇f(x̄(t−1))‖² all achieve a linear convergence rate O(λ^t), or O(λ^{m/d}) in terms of the number of function value queries. In addition, we notice that the decaying factor λ given by Theorem 4 has a better dependence on μ/L than in [7] for convex problems. We point out that this is not a result of using zero-order techniques, but rather of a more refined analysis of the gradient tracking procedure.

Remark 6. Note that the conditions on the step sizes in Theorems 2, 3 and 4 depend on ρ, a measure of the connectivity of the network. In order to choose step sizes satisfying these conditions in the distributed setting, one possible approach is as follows: assuming that each agent knows an upper bound n on the total number of agents, by [35, Lemma 2], if one chooses W to be the lazy Metropolis matrix, then ρ ≤ 1 − 1/(71n²), based on which the agents can then derive their step sizes according to the conditions in the theorems. We also note that some existing works (e.g., [36]) attempt to remove the dependence of the step sizes on the graph topology; whether those techniques can be applied in our setting is beyond the scope of this paper but is an interesting future direction.


3.3 Comparison of the Two Algorithms

We see from the above results that, in theory, Algorithm 2 converges faster than Algorithm 1 asymptotically as m → ∞. However, each iteration of Algorithm 2 makes progress only after 2d queries of function values, which could be an issue if d is very large. On the contrary, each iteration of Algorithm 1 requires only 2 function value queries, meaning that progress can be made relatively immediately, without exploring all d dimensions. This observation suggests that, when neglecting communication delays, Algorithm 1 is more favorable for high-dimensional problems, whereas Algorithm 2 could handle problems of relatively low dimension better, with faster convergence.

We emphasize that there still exists a trade-off between the convergence rate and the ability to handle high-dimensional problems even if one combines the 2-point gradient estimator (3) with the gradient tracking method as

    g_i(t) = G_{f_i}^{(2)}(x_i(t−1); u_t, z_i(t)),    z_i(t) ∼ U(S^{d−1}),
    s_i(t) = ∑_{j=1}^n W_ij ( s_j(t−1) + g_j(t) − g_j(t−1) ),
    x_i(t) = ∑_{j=1}^n W_ij ( x_j(t−1) − η s_j(t) ).    (22)

Theoretical analysis suggests that, in order for the s_i(t) to reach consensus in the sense that E[‖s_i(t) − s_j(t)‖²] converges to 0, we need

    lim_{t→∞} E[‖g_i(t) − g_i(t−1)‖²] = 0.

On the other hand, we have the following lemma regarding the variance of the 2-point gradient estimator G_f^{(2)}(x; u, z).

Lemma 1. Let f : R^d → R be an arbitrary L-smooth function. Then

    lim_{u→0+} E_z[‖G_f^{(2)}(x; u, z) − ∇f_u(x)‖²] = (d−1)‖∇f(x)‖²,

where z ∼ U(S^{d−1}).

Proof. Notice that for any z ∈ S^{d−1} and u ∈ (0, 1], we have

    | (f(x+uz) − f(x−uz))/(2u) | ≤ sup_{y∈B^d} ‖∇f(x + y)‖.

Therefore

    lim_{u→0} E_z[‖G_f^{(2)}(x; u, z)‖²] = d² E_z[ lim_{u→0} |(f(x+uz) − f(x−uz))/(2u)|² ]
        = d² E_z[|∇f(x)^⊤z|²] = d² ∇f(x)^⊤ E_z[zz^⊤] ∇f(x) = d‖∇f(x)‖²,

where in the second step we exchanged the order of limit and expectation by the bounded convergence theorem, and in the last step we used d E_z[zz^⊤] = I_d for z ∼ U(S^{d−1}). Then, noticing that ∇f_u(x) → ∇f(x) as u → 0, we get

    lim_{u→0} E_z[‖G_f^{(2)}(x; u, z) − ∇f_u(x)‖²] = lim_{u→0} ( E_z[‖G_f^{(2)}(x; u, z)‖²] − ‖∇f_u(x)‖² ) = (d−1)‖∇f(x)‖²,

which completes the proof.
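A quick Monte Carlo check of Lemma 1 on a toy quadratic (for which ∇f_u = ∇f exactly), with arbitrary dimension, radius, and sample size: the empirical variance stays near (d−1)‖∇f(x)‖² even for a very small u.

```python
import numpy as np

rng = np.random.default_rng(1)
d, u, N = 16, 1e-4, 100000
A = rng.standard_normal((d, d)); A = A.T @ A
f = lambda x: 0.5 * x @ A @ x
x = rng.standard_normal(d)
grad = A @ x

second_moment = 0.0
for _ in range(N):
    z = rng.standard_normal(d); z /= np.linalg.norm(z)
    g = d * (f(x + u * z) - f(x - u * z)) / (2.0 * u) * z
    second_moment += (g @ g) / N
variance = second_moment - grad @ grad      # E||G||^2 - ||grad f_u(x)||^2
print(variance, (d - 1) * (grad @ grad))    # the two values are close
```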


Lemma 1 suggests that each gradient estimator G_{f_i}^{(2)}(x_i(t−1); u_t, z_i(t)) in (22) will produce a non-vanishing variance approximately equal to (d−1)E[‖∇f_i(x_i(t−1))‖²], even if we let u → 0 as x_i(t) approaches a stationary point. Consequently, E[‖g_i(t) − g_i(t−1)‖²] is not guaranteed to converge to zero as t → ∞. The non-vanishing variance will then be reflected in the s_i(t) that track the global gradient, and consequently the overall convergence will be slowed down. We refer to [7, 11, 37] for related analysis, and to Section 4 for a numerical example.

3.4 Comparison with Existing Algorithms

In this subsection, we provide a detailed comparison with the existing literature on distributed zero-order optimization, specifically [1, 25, 26].

1. References [25, 26] discuss convex problems, while [1] and our work focus on nonconvex problems.

2. In terms of the assumptions on the function queries, [26] and our work consider a noise-free case. [1] considers stochastic queries but assumes that two function values can be obtained for a single random sample. [25] assumes independent additive noise on each function value query. We expect that our Algorithm 1 can be generalized to the setting adopted in [1] with heavier mathematics. Extensions to general stochastic cases remain ongoing work.

3. In terms of the approach to reaching consensus among agents, our algorithms are similar to [25, 26], where a weighted average of the neighbors' local variables is utilized, while [1] uses the method of multipliers to design its algorithm. We also mention that our Algorithm 2 employs the gradient tracking technique, which, to the best of our knowledge, has not yet been discussed in the existing literature on distributed zero-order optimization.

4. Regarding the convergence rates for nonconvex optimization, [1] proved that its proposed ZONE algorithm achieves an O(1/T) rate if each iteration also employs O(T) function value queries, where T is the number of iterations planned in advance. Therefore, in terms of the number of function value queries M, its convergence rate is in fact O(1/√M), which is roughly comparable with Algorithm 1 and slower than Algorithm 2 in our paper. Also, [1] did not discuss the dependence on the problem dimension d. Moreover, our algorithms require only a constant number (2 or 2d) of function value queries per iteration, which is more appealing for practical implementation when T is set to be very large to achieve sufficiently accurate solutions.

4 Numerical Examples

We consider a multi-agent nonconvex optimization problem formulated as

    min_{x∈R^d} (1/n) ∑_{i=1}^n f_i(x),
    f_i(x) = a_i/(1 + exp(−ξ_i^⊤x − ν_i)) + b_i ln(1 + ‖x‖²),    (23)

where a_i, b_i, ν_i ∈ R and ξ_i ∈ R^d for each i = 1, …, n.

For the numerical example, we set the dimension to d = 64 and the number of agents to n = 50. The parameters a_i, ν_i and each entry of ξ_i are randomly generated from the standard Gaussian distribution, and (b_1, …, b_n) is generated from the distribution N(1_n, I_n − (1/n)1_n1_n^⊤), so that (1/n) ∑_i b_i = 1. The graph G = (N, E) is generated by uniformly randomly sampling n points on S², and then connecting pairs of points with spherical distance less than π/4. The Metropolis–Hastings weights [38] are employed for constructing W.
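A sketch of this setup (the objectives in (23), the random geometric graph on S², and the Metropolis–Hastings weights); the random seed and the handling of possibly disconnected samples are implementation details not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 50

# Parameters of the local objectives f_i in (23).
a, nu = rng.standard_normal(n), rng.standard_normal(n)
xi = rng.standard_normal((n, d))
# (b_1, ..., b_n) ~ N(1_n, I_n - (1/n) 1 1^T), so (1/n) sum_i b_i = 1.
b = rng.multivariate_normal(np.ones(n), np.eye(n) - np.ones((n, n)) / n)

def f_local(i, x):
    return a[i] / (1 + np.exp(-xi[i] @ x - nu[i])) + b[i] * np.log(1 + x @ x)

# Graph: n uniform points on S^2; connect pairs at spherical distance < pi/4.
p = rng.standard_normal((n, 3))
p /= np.linalg.norm(p, axis=1, keepdims=True)
dist = np.arccos(np.clip(p @ p.T, -1.0, 1.0))
A_graph = (dist < np.pi / 4) & ~np.eye(n, dtype=bool)
deg = A_graph.sum(axis=1)
# Metropolis-Hastings weights [38] for the consensus matrix W.
W = np.where(A_graph, 1.0 / (1 + np.maximum.outer(deg, deg)), 0.0)
W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)
```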


Figure 1: Convergence of Algorithm 1 and Algorithm 2. For Algorithm 1, the light blue shaded areas represent the results for 50 random instances, and the dark blue curves represent their average.

We compare the following algorithms on problem (23):

1. Algorithm 1 with η_t = 0.02/√t and u_t = 4/√t;
2. Algorithm 2 with η = 0.02 and u_t = 4/t^{3/4};
3. ZONE-M [1], for which we test the two setups J = 1, ρ_t = 4√t, u_t = 4/√t and J = 100, ρ_t = 0.4√t, u_t = 4/√t;
4. the 2-point gradient estimator combined with gradient tracking [see (22)] with η = 2×10⁻⁴ and u_t = 4/t^{3/4}.

All algorithms start from the same initial points, which are randomly generated from the distribution N(0, (25/d) I_d) for each agent.

4.1 Comparison of Algorithm 1 and Algorithm 2

Figure 1 shows the convergence behavior of Algorithm 1 and Algorithm 2, where the top panel illustrates the squared norm of the gradient at x̄(t), and the bottom panel illustrates the consensus error (1/n) ∑_{i=1}^n ‖x_i(t) − x̄(t)‖². The horizontal axis has been normalized to the number of function value queries m. We can see that Algorithm 1 converges faster during the initial stage, but then slows down and converges at a relatively stable sublinear rate. The convergence of Algorithm 2 is relatively slow initially, but becomes faster as m ≳ 0.5×10⁴, and when m ≳ 2×10⁴, Algorithm 2 achieves a smaller squared gradient norm and consensus error than Algorithm 1; its convergence slows down as m exceeds 2.5×10⁴ but is still faster than that of Algorithm 1. Further investigation of the simulation results suggests that the speed-up of Algorithm 2 within 0.5×10⁴ ≲ m ≲ 2.5×10⁴ is due to x̄(t) becoming sufficiently close to a local optimum, around which the objective function is locally strongly convex; the slow-down after m exceeds 2.5×10⁴ is caused by the zero-order gradient estimation error becoming dominant, and can be postponed or avoided if we let u_t decrease more aggressively.


Figure 2: Convergence of Algorithm 1 and ZONE-M with J = 1 and J = 100. For each algorithm, the light shaded areas represent the results for 50 random instances, and the dark curves represent their average.

From these results, it can be seen that, if the total number of function value queries is limited to, say, m ≲ 1.5×10⁴, then Algorithm 1 might be preferable to Algorithm 2 despite its slower asymptotic convergence rate, while if more function value queries are allowed, then Algorithm 2 could be favored. This observation is related to the discussion in Section 3.3.

4.2 Comparison with Other Algorithms

Figure 2 compares the convergence of Algorithm 1 and the two setups of ZONE-M, including the curves for the squared norm of the gradient ‖∇f(x̄(t))‖² and the consensus error (1/n) ∑_{i=1}^n ‖x_i(t) − x̄(t)‖². The horizontal axis has been normalized to the number of function value queries m. It can be seen that Algorithm 1 and ZONE-M with ρ_t ∝ √t, J = 1 have similar convergence behavior. For ZONE-M with ρ_t ∝ √t and J = 100, while the convergence of ‖∇f(x̄(t))‖² is comparable with Algorithm 1 and ZONE-M with J = 1, the consensus error decreases much more slowly, as ZONE-M with J = 100 conducts far fewer consensus averaging steps per function value query than Algorithm 1 and ZONE-M with J = 1.

Figure 3 compares the convergence of Algorithm 2 and the 2-point estimator combined with gradient tracking (22), including the curves for the squared norm of the gradient ‖∇f(x̄(t))‖², the consensus error (1/n) ∑_{i=1}^n ‖x_i(t) − x̄(t)‖², and the gradient tracking error (1/n) ∑_{i=1}^n ‖s_i(t) − ∇f(x̄(t−1))‖². It is straightforward to see that Algorithm 2 has better asymptotic convergence behavior than the 2-point estimator combined with gradient tracking. Moreover, for the 2-point estimator combined with gradient tracking, the gradient tracking error does not converge to zero but remains at a constant level, indicating that the gradient tracking technique is ineffective in this case. These observations are in accordance with our theoretical discussion in Section 3.3.

Figure 3: Convergence of Algorithm 2 and the 2-point estimator combined with gradient tracking. For the 2-point estimator combined with gradient tracking, the light pink shaded areas represent the results for 50 random instances, and the dark purple curves represent their average.


5 Conclusion

We proposed two distributed zero-order algorithms for nonconvex multi-agent optimization, established theoretical results on their convergence rates, and showed that they achieve performance comparable to their distributed gradient-based or centralized zero-order counterparts. We also provided a brief discussion of how the dimension of the problem affects their performance in practice. There are many lines of future work, such as 1) introducing noise or errors in the evaluation of f_i(x), 2) investigating how to escape saddle points with distributed zero-order methods, 3) extensions to nonsmooth problems, 4) investigating whether the step sizes can be made independent of the network topology, 5) studying time-varying graphs, and 6) investigating the fundamental gap between centralized and distributed methods, especially for high-dimensional problems.

A Auxiliary Results for Convergence Analysis

Recall that W ∈ R^{n×n} is a consensus matrix that satisfies Assumption 1 in the main text, and

    ρ := ‖W − n^{−1} 1_n 1_n^⊤‖ < 1.

The following lemma is a standard result in consensus optimization.

Lemma 2. For any x_1, …, x_n ∈ R^d, we have

    ‖(W ⊗ I_d)(x − 1_n ⊗ x̄)‖ ≤ ρ‖x − 1_n ⊗ x̄‖,

where we denote

    x = (x_1^⊤, …, x_n^⊤)^⊤,    x̄ = (1/n) ∑_{i=1}^n x_i.

The following lemma provides a useful property of smooth functions.

Lemma 3. Suppose f : R^p → R is L-smooth and inf_{x∈R^p} f(x) = f* > −∞. Then

    ‖∇f(x)‖² ≤ 2L(f(x) − f*).

Proof. The L-smoothness of f implies

    f* ≤ f(x − L^{−1}∇f(x)) ≤ f(x) − (1/(2L))‖∇f(x)‖².

For a μ-gradient dominated and L-smooth function, we can see from Lemma 3 that μ ≤ L.

The following lemma will be used to establish convergence of the proposed algorithms.

Lemma 4 ([39]). Let (Ω, F, P) be a probability space and (F_t)_{t∈N} a filtration. Let U(t), ξ(t) and ζ(t) be nonnegative F_t-measurable random variables for t ∈ N such that

    E[U(t+1) | F_t] ≤ U(t) + ξ(t) − ζ(t),    for all t = 0, 1, 2, …

Then, almost surely on the event {∑_{t=0}^∞ ξ(t) < +∞}, U(t) converges to a random variable and ∑_{t=0}^∞ ζ(t) < +∞.

As a special case, let U_t, ξ_t and ζ_t be (deterministic) nonnegative sequences for t ∈ N such that

    U_{t+1} ≤ U_t + ξ_t − ζ_t,

with ∑_{t=0}^∞ ξ_t < +∞. Then U_t converges and ∑_{t=0}^∞ ζ_t < +∞.

We will extensively use the following properties of the distribution U(S^{d−1}):

    E_{z∼U(S^{d−1})}[d·⟨g, z⟩z] = g,    E_{z∼U(S^{d−1})}[d·⟨g, z⟩²] = ‖g‖²    (24)

for any (deterministic) g ∈ R^d.

The following results discuss the bias and the second moment of the 2-point gradient estimator.

Lemma 5.
1. Let u > 0 be arbitrary, and suppose f : R^d → R is differentiable. Then

    E_{z∼U(S^{d−1})}[G_f^{(2)}(x; u, z)] = ∇f_u(x),    (25)

where f_u(x) := E_{y∼U(B^d)}[f(x + u y)]. Moreover, if f is L-smooth, then f_u is also L-smooth.

2. Suppose f : R^d → R is L-smooth, and let u be positive. Then for any x ∈ R^d and h ∈ R^d, we have

    | (f(x+uh) − f(x−uh))/(2u) − ⟨∇f(x), h⟩ | ≤ (1/2) u L ‖h‖².    (26)

In addition,

    ‖∇f(x) − ∇f_u(x)‖ ≤ uL.    (27)

3. [24, Lemma 10] Suppose f : R^d → R is G-Lipschitz. Then for any x ∈ R^d and u > 0,

    E_{z∼U(S^{d−1})}[‖G_f^{(2)}(x; u, z)‖²] ≤ κ²G²d,    (28)

where κ > 0 is some numerical constant.

4. Suppose f : R^d → R is L-smooth. Then for any x ∈ R^d and u > 0,

    E_{z∼U(S^{d−1})}[‖G_f^{(2)}(x; u, z)‖²] ≤ (4d/3)‖∇f(x)‖² + u²L²d².    (29)

Proof. 1. The equality (25) follows from [21, Lemma 1] and the fact that the distribution U(S^{d−1}) has zero mean. When f is L-smooth, we have

    ‖∇f_u(x_1) − ∇f_u(x_2)‖ = ‖ (1/∫_{B^d} dy) ∫_{B^d} (∇f(x_1 + uy) − ∇f(x_2 + uy)) dy ‖
        ≤ (1/∫_{B^d} dy) ∫_{B^d} ‖∇f(x_1 + uy) − ∇f(x_2 + uy)‖ dy ≤ L‖x_1 − x_2‖

for any x_1, x_2 ∈ R^d.

2. We have

    | (f(x+uh) − f(x−uh))/(2u) − ⟨∇f(x), h⟩ |
        = | (1/(2u)) ∫_{−1}^1 ⟨∇f(x + ush), uh⟩ ds − ⟨∇f(x), h⟩ |
        = | (1/2) ∫_{−1}^1 ⟨∇f(x + ush) − ∇f(x), h⟩ ds |
        ≤ (1/2) ∫_{−1}^1 Lu|s|‖h‖² ds = (1/2) uL‖h‖²,

and

    ‖∇f(x) − ∇f_u(x)‖ = ‖ (1/∫_{B^d} dy) ∫_{B^d} (∇f(x) − ∇f(x + uy)) dy ‖
        ≤ (uL/∫_{B^d} dy) ∫_{B^d} ‖y‖ dy ≤ uL.

4. We have

    E_{z∼U(S^{d−1})}[‖G_f^{(2)}(x; u, z)‖²]
        = E_{z∼U(S^{d−1})}[ ‖ d( (f(x+uz) − f(x−uz))/(2u) − ⟨∇f(x), z⟩ ) z + d⟨∇f(x), z⟩z ‖² ]
        ≤ (1+3) E_{z∼U(S^{d−1})}[ d² | (f(x+uz) − f(x−uz))/(2u) − ⟨∇f(x), z⟩ |² ‖z‖² ]
            + (1 + 1/3) E_{z∼U(S^{d−1})}[ ‖d⟨∇f(x), z⟩z‖² ]
        ≤ 4d² · (1/4)u²L² + (4d/3)‖∇f(x)‖² = (4d/3)‖∇f(x)‖² + u²L²d².

We will also use the following inequalities:

    ∑_{t=t_1}^{t_2} 1/t^ε ≥ ∫_{t_1}^{t_2+1} ds/s^ε = ((t_2+1)^{1−ε} − t_1^{1−ε})/(1−ε),    (30)

and

    ∑_{t=t_1}^{t_2} 1/t^ε ≤ 1 + ∫_{3/2}^{t_2+1/2} ds/s^ε = 1 + ((t_2+1/2)^{1−ε} − (3/2)^{1−ε})/(1−ε),    t_1 = 1,
    ∑_{t=t_1}^{t_2} 1/t^ε ≤ ∫_{t_1−1/2}^{t_2+1/2} ds/s^ε = ((t_2+1/2)^{1−ε} − (t_1−1/2)^{1−ε})/(1−ε),    t_1 > 1,    (31)

where ε > 0 and ε ≠ 1, and

    ln((t_2+1)/t_1) = ∫_{t_1}^{t_2+1} ds/s ≤ ∑_{t=t_1}^{t_2} 1/t ≤ ∫_{t_1−1/2}^{t_2+1/2} ds/s = ln((2t_2+1)/(2t_1−1)).    (32)

In particular, when ε > 1, we have

    ∑_{t=1}^∞ 1/t^ε ≤ 1 + ∫_{3/2}^∞ ds/s^ε = 1 + 1/((ε−1)(3/2)^{ε−1}) ≤ ε/(ε−1).    (33)

Finally, we note that

    ∑_{τ=0}^{t−1} λ^τ/(t−τ)^ε = 1/((1−λ)t^ε) + o(t^{−ε})    (34)

for any λ ∈ (0, 1) and ε > 0.

B Proof of Theorem 1

Let (F_t)_{t∈N} be a filtration such that (z_i(t), x_i(t)) is F_t-measurable for each t ≥ 1. We denote

    x(t) = (x_1(t)^⊤, …, x_n(t)^⊤)^⊤,    g(t) = (g_1(t)^⊤, …, g_n(t)^⊤)^⊤,
    x̄(t) = (1/n) ∑_{i=1}^n x_i(t),    ḡ(t) = (1/n) ∑_{i=1}^n g_i(t),

and δ(t) = f(x̄(t)) − f*, e_c(t) = E[‖x(t) − 1_n ⊗ x̄(t)‖²]. The iterations of Algorithm 1 can be equivalently written as

    x(t) = (W ⊗ I_d)(x(t−1) − η_t g(t)),    x̄(t) = x̄(t−1) − η_t ḡ(t).

We recall that each f_i is assumed to be G-Lipschitz and L-smooth.

First, we analyze how the objective value at the averaged iterate, f(x̄(t)), evolves as the iterations proceed.

Lemma 6. We have

    E[f(x̄(t)) | F_{t−1}] ≤ f(x̄(t−1)) − (η_t/2)‖∇f(x̄(t−1))‖² + (η_tL²/n)‖x(t−1) − 1_n ⊗ x̄(t−1)‖²
        + (η_t²L/2) E[‖ḡ(t)‖² | F_{t−1}] + η_tu_t²L².    (35)

Proof. Since x̄(t) = x̄(t−1) − η_tḡ(t), by the L-smoothness of the function f, we get

    f(x̄(t)) ≤ f(x̄(t−1)) − η_t⟨∇f(x̄(t−1)), ḡ(t)⟩ + (η_t²L/2)‖ḡ(t)‖².

Note that by (25) of Lemma 5, we have

    E[ḡ(t) | F_{t−1}] = (1/n) ∑_{i=1}^n ∇f_i^{u_t}(x_i(t−1)).

By taking the expectation conditioned on F_{t−1}, we get

    E[f(x̄(t)) | F_{t−1}] ≤ f(x̄(t−1)) − η_t‖∇f(x̄(t−1))‖² + (η_t²L/2)E[‖ḡ(t)‖² | F_{t−1}]
        − η_t ⟨∇f(x̄(t−1)), (1/n) ∑_{i=1}^n (∇f_i^{u_t}(x_i(t−1)) − ∇f_i^{u_t}(x̄(t−1)))⟩
        − η_t ⟨∇f(x̄(t−1)), ∇f^{u_t}(x̄(t−1)) − ∇f(x̄(t−1))⟩.

Since each f_i^{u_t} is L-smooth (see Part 1 of Lemma 5), we have

    −⟨∇f(x̄(t−1)), (1/n) ∑_{i=1}^n (∇f_i^{u_t}(x_i(t−1)) − ∇f_i^{u_t}(x̄(t−1)))⟩
        ≤ (1/2) [ (1/2)‖∇f(x̄(t−1))‖² + 2‖(1/n) ∑_{i=1}^n (∇f_i^{u_t}(x_i(t−1)) − ∇f_i^{u_t}(x̄(t−1)))‖² ]
        ≤ (1/4)‖∇f(x̄(t−1))‖² + ( (1/n) ∑_{i=1}^n L‖x_i(t−1) − x̄(t−1)‖ )²
        ≤ (1/4)‖∇f(x̄(t−1))‖² + (L²/n)‖x(t−1) − 1_n ⊗ x̄(t−1)‖²,

and by (27), we have

    −⟨∇f(x̄(t−1)), ∇f^{u_t}(x̄(t−1)) − ∇f(x̄(t−1))⟩
        ≤ (1/2) [ (1/2)‖∇f(x̄(t−1))‖² + 2‖∇f^{u_t}(x̄(t−1)) − ∇f(x̄(t−1))‖² ]
        ≤ (1/4)‖∇f(x̄(t−1))‖² + u_t²L².

Therefore

    E[f(x̄(t)) | F_{t−1}] ≤ f(x̄(t−1)) − (η_t/2)‖∇f(x̄(t−1))‖² + (η_t²L/2)E[‖ḡ(t)‖² | F_{t−1}]
        + (η_tL²/n)‖x(t−1) − 1_n ⊗ x̄(t−1)‖² + η_tu_t²L².

Lemma 6 suggests that we further need to bound two terms: the second moment of ḡ(t) and the expected consensus error e_c(t−1).

Lemma 7. We have

    E[‖ḡ(t)‖² | F_{t−1}] ≤ 4G²d/(3n²) + 2‖∇f(x̄(t−1))‖² + (4L²/n)‖x(t−1) − 1_n ⊗ x̄(t−1)‖² + u_t²L²d².

Proof. Since

    ḡ(t) = (d/n) ∑_{i=1}^n [ ⟨∇f_i(x_i(t−1)), z_i(t)⟩z_i(t)
        + ( (f_i(x_i(t−1)+u_tz_i(t)) − f_i(x_i(t−1)−u_tz_i(t)))/(2u_t) − ⟨∇f_i(x_i(t−1)), z_i(t)⟩ ) z_i(t) ],

by (26) of Lemma 5, we see that

    E[‖ḡ(t)‖² | F_{t−1}]
        ≤ E[ (1 + 1/3) ‖(d/n) ∑_{i=1}^n ⟨∇f_i(x_i(t−1)), z_i(t)⟩z_i(t)‖² + (1+3) ( (d/n) ∑_{i=1}^n (1/2)u_tL )² | F_{t−1} ]
        = (4/3) [ (d/n²) ∑_{i=1}^n ‖∇f_i(x_i(t−1))‖² + (1/n²) ∑_{i≠j} ⟨∇f_i(x_i(t−1)), ∇f_j(x_j(t−1))⟩ ] + u_t²L²d²,

where we used (24) and the fact that ⟨∇f_i(x_i(t−1)), z_i(t)⟩z_i(t) and ⟨∇f_j(x_j(t−1)), z_j(t)⟩z_j(t) are independent for j ≠ i conditioned on F_{t−1}. Then, since

    (d/n²) ∑_{i=1}^n ‖∇f_i(x_i(t−1))‖² + (1/n²) ∑_{i≠j} ⟨∇f_i(x_i(t−1)), ∇f_j(x_j(t−1))⟩
        = ((d−1)/n²) ∑_{i=1}^n ‖∇f_i(x_i(t−1))‖² + ‖(1/n) ∑_{i=1}^n ∇f_i(x_i(t−1))‖²,

and

    ‖(1/n) ∑_{i=1}^n ∇f_i(x_i(t−1))‖²
        ≤ (1 + 1/2) ‖(1/n) ∑_{i=1}^n ∇f_i(x̄(t−1))‖² + (1+2) ‖(1/n) ∑_{i=1}^n (∇f_i(x_i(t−1)) − ∇f_i(x̄(t−1)))‖²
        ≤ (3/2)‖∇f(x̄(t−1))‖² + 3 · (1/n) ∑_{i=1}^n ‖∇f_i(x_i(t−1)) − ∇f_i(x̄(t−1))‖²
        ≤ (3/2)‖∇f(x̄(t−1))‖² + (3L²/n)‖x(t−1) − 1_n ⊗ x̄(t−1)‖²,

we get

    E[‖ḡ(t)‖² | F_{t−1}] ≤ (4(d−1)/(3n²)) ∑_{i=1}^n ‖∇f_i(x_i(t−1))‖² + (4/3) ‖(1/n) ∑_{i=1}^n ∇f_i(x_i(t−1))‖² + u_t²L²d²
        ≤ 4G²d/(3n²) + 2‖∇f(x̄(t−1))‖² + (4L²/n)‖x(t−1) − 1_n ⊗ x̄(t−1)‖² + u_t²L²d².

Lemma 8. For each t ≥ 1, we have

    ‖x(t) − 1_n ⊗ x̄(t)‖² ≤ ((1+ρ²)/2)^t ‖x(0) − 1_n ⊗ x̄(0)‖² + (2nρ²/(1−ρ²)) G²d² ∑_{τ=0}^{t−1} ((1+ρ²)/2)^τ η_{t−τ}²    (36)

almost surely, and

    e_c(t) ≤ ((1+ρ²)/2)^t e_c(0) + (2nρ²κ²/(1−ρ²)) G²d ∑_{τ=0}^{t−1} ((1+ρ²)/2)^τ η_{t−τ}².    (37)

Proof. We have

    x(t) − 1_n ⊗ x̄(t) = (W ⊗ I_d)( x(t−1) − 1_n ⊗ x̄(t−1) − η_t(g(t) − 1_n ⊗ ḡ(t)) ),

and therefore

    ‖x(t) − 1_n ⊗ x̄(t)‖² ≤ ρ²( ‖x(t−1) − 1_n ⊗ x̄(t−1)‖² + η_t²‖g(t) − 1_n ⊗ ḡ(t)‖²
            + 2η_t‖x(t−1) − 1_n ⊗ x̄(t−1)‖·‖g(t) − 1_n ⊗ ḡ(t)‖ )
        ≤ ρ²( ‖x(t−1) − 1_n ⊗ x̄(t−1)‖² + η_t²‖g(t) − 1_n ⊗ ḡ(t)‖² )
            + ((1−ρ²)/(2ρ²)) · ρ²‖x(t−1) − 1_n ⊗ x̄(t−1)‖² + (2ρ²/(1−ρ²)) · η_t²ρ²‖g(t) − 1_n ⊗ ḡ(t)‖²
        = ((1+ρ²)/2)‖x(t−1) − 1_n ⊗ x̄(t−1)‖² + η_t² (ρ²(1+ρ²)/(1−ρ²))‖g(t) − 1_n ⊗ ḡ(t)‖²,    (38)

where we used Lemma 2 in the first inequality. Since each f_i is G-Lipschitz, we have ‖g_i(t)‖ ≤ Gd, and therefore

    ‖g(t) − 1_n ⊗ ḡ(t)‖² = ∑_{i=1}^n ‖g_i(t) − (1/n) ∑_{j=1}^n g_j(t)‖²
        = ∑_{i=1}^n ‖g_i(t)‖² − (1/n)‖∑_{j=1}^n g_j(t)‖² ≤ ∑_{i=1}^n ‖g_i(t)‖² ≤ nG²d²,

and by (28) of Lemma 5, we have

    E[‖g(t) − 1_n ⊗ ḡ(t)‖² | F_{t−1}] ≤ E[ ∑_{i=1}^n ‖g_i(t)‖² | F_{t−1} ] ≤ nκ²G²d.

Plugging these bounds into (38) and noting that ρ < 1, we obtain (36) and (37) by mathematical induction.

Corollary 1.
1. Let η_t be a non-increasing sequence that converges to zero. Then

    lim_{t→∞} ‖x(t) − 1_n ⊗ x̄(t)‖² = 0.

Furthermore, if ∑_{t=1}^∞ η_t³ < +∞, then

    ∑_{t=1}^∞ η_t‖x(t−1) − 1_n ⊗ x̄(t−1)‖² < +∞

almost surely.

2. Suppose η_t = η_1/t^β for β > 1/3. Then

    ∑_{t=1}^∞ η_t e_c(t−1) ≤ 2η_1e_c(0)/(1−ρ²) + η_1³ · 12βnκ²ρ²/((3β−1)(1−ρ²)²) · G²d.    (39)

  • Proof. 1. By the monotonicity of ηt and ((1 + ρ2)/2)t, we have

    t−1∑

    τ=0

    (

    1 + ρ2

    2

    η2t−τ =t∑

    τ=1

    (

    1 + ρ2

    2

    )t−τη2t ≤

    t∑

    τ=1

    (

    1 + ρ2

    2

    )t−τ· 1t

    t∑

    τ=1

    η2t −→ 0

    as t → ∞.For the summability of ηt‖x(t− 1)− 1n ⊗ x̄(t− 1)‖2, we have

    ∞∑

    t=2

    ηt‖x(t− 1)− 1n ⊗ x̄(t− 1)‖2

    ≤ ‖x(0)− 1n ⊗ x̄(0)‖2∞∑

    t=2

    ηt

    (

    1 + ρ2

    2

    )t−1+

    2nρ2G2d2

    1− ρ2∞∑

    t=2

    t−2∑

    τ=0

    ηt

    (

    1 + ρ2

    2

    η2t−1−τ

    The first term on the right-hand side obviously converges. For the second term, we have

    ∞∑

    t=2

    t−2∑

    τ=0

    ηt

    (

    1 + ρ2

    2

    η2t−1−τ ≤∞∑

    t=2

    t−2∑

    τ=0

    (

    1 + ρ2

    2

    η3t−1−τ =∞∑

    t=2

    t∑

    τ=2

    (

    1 + ρ2

    2

    )t−τη3τ−1

    =

    ∞∑

    τ=2

    η3τ−1

    ∞∑

    t=τ

    (

    1 + ρ2

    2

    )t−τ=

    2

    1− ρ2∞∑

    τ=2

    η3τ−1 < +∞.

    Therefore we can conclude that ηt‖x(t− 1)− 1n ⊗ x̄(t− 1)‖2 is summable almost surely.2. We have

    ∞∑

    t=1

    ηtE[‖x(t− 1)− 1n ⊗ x̄(t− 1)‖2]

    ≤ η1‖x(0)− 1n ⊗ x̄(0)‖2∞∑

    t=1

    (

    1 + ρ2

    2

    )t−1+ η31

    2nρ2κ2

    1− ρ2 G2d

    ∞∑

    t=2

    t−2∑

    τ=0

    1

    tβ(t− 1− τ)2β(

    1 + ρ2

    2

    ≤ 2η1‖x(0)− 1n ⊗ x̄(0)‖2

    1− ρ2 + η31

    2nρ2κ2

    1− ρ2 G2d

    ∞∑

    t=2

    t−2∑

    τ=0

    1

    (t− 1− τ)3β(

    1 + ρ2

    2

    .

    Then since∞∑

    t=2

    t−2∑

    τ=0

    1

    (t− 1− τ)3β(

    1 + ρ2

    2

    =

    ∞∑

    t=2

    t∑

    τ=2

    1

    (τ − 1)3β(

    1 + ρ2

    2

    )t−τ

    =

    ∞∑

    τ=2

    1

    (τ − 1)3β∞∑

    t=τ

    (

    1 + ρ2

    2

    )t−τ=

    2

    1− ρ2∞∑

    τ=2

    1

    (τ − 1)3β

    ≤ 6β(3β − 1)(1 − ρ2) ,

    we get the inequality (39).

Now we are ready to prove Theorem 1 in the main text.

Proof of Theorem 1. Recall that $\delta(t)$ denotes $f(\bar x(t))-f^*$. By plugging the bound of Lemma 7 into (35) and noticing that $\eta_tL\le 1/4$, we get
\[
\mathbb E[\delta(t)\,|\,\mathcal F_{t-1}]\le\delta(t-1)-\frac{\eta_t}{4}\|\nabla f(\bar x(t-1))\|^2+\frac{3\eta_tL^2}{2n}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+\frac{2\eta_t^2LG^2d}{3n^2}+\eta_tu_t^2L^2\Big(1+\frac 12d^2\eta_tL\Big).\tag{40}
\]

1. Consider the case where $\eta_t$ is non-increasing and $\sum_{t=1}^\infty\eta_t=+\infty$, $\sum_{t=1}^\infty\eta_t^2<+\infty$, and $\sum_{t=1}^\infty\eta_tu_t^2<+\infty$. The convergence of $x_i(t)$ to $\bar x(t)$ is already shown by Corollary 1. Moreover, the random variable
\[
\frac{3\eta_tL^2}{2n}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+\frac{2\eta_t^2LG^2d}{3n^2}+\eta_tu_t^2L^2\Big(1+\frac 12d^2\eta_tL\Big)
\]
is summable almost surely by Corollary 1 and the assumptions on $\eta_t$ and $u_t$. Then Lemma 4 guarantees that $f(\bar x(t))$ converges and
\[
\sum_{t=1}^\infty\eta_t\|\nabla f(\bar x(t-1))\|^2<+\infty
\]
almost surely, which implies that $\liminf_{t\to\infty}\|\nabla f(\bar x(t))\|=0$.

Now let $\delta>0$ be arbitrary, and consider the event
\[
\mathcal A_\delta:=\Big\{\limsup_{t\to\infty}\|\nabla f(\bar x(t))\|\ge\delta\Big\}.
\]
On the event $\mathcal A_\delta$, we can always find a (random) subsequence of $\|\nabla f(\bar x(t))\|$, which we denote by $(\|\nabla f(\bar x(t_k))\|)_{k\in\mathbb N}$, such that $\|\nabla f(\bar x(t_k))\|\ge\frac{2\delta}{3}$ for all $k$. It is not hard to verify that
\[
M:=\sup_{t\ge 1}\|\bar g(t)\|<+\infty.
\]
Then for any $s\ge 1$, we have
\[
\|\nabla f(\bar x(t_k+s))\|\ge\|\nabla f(\bar x(t_k))\|-\sum_{\tau=1}^s\|\nabla f(\bar x(t_k+\tau))-\nabla f(\bar x(t_k+\tau-1))\|\ge\frac{2\delta}{3}-\sum_{\tau=1}^sL\,\eta_{t_k+\tau}M.
\]
Let $\hat s(k)$ be the smallest positive integer such that
\[
\frac{2\delta}{3}-\sum_{\tau=1}^{\hat s(k)+1}L\,\eta_{t_k+\tau}M<\frac{\delta}{3}
\]
(such $\hat s(k)$ exists as $\sum_{t=1}^\infty\eta_t=+\infty$). We then see that
\[
\sum_{\tau=1}^{\hat s(k)+1}\eta_{t_k+\tau}>\frac{\delta}{3LM}\qquad\text{and}\qquad\|\nabla f(\bar x(t_k+s))\|\ge\frac{\delta}{3}
\]
for all $s=0,\ldots,\hat s(k)$. Therefore
\[
\sum_{\tau=1}^{\hat s(k)+1}\eta_{t_k+\tau}\|\nabla f(\bar x(t_k+\tau-1))\|^2\ge\sum_{\tau=1}^{\hat s(k)+1}\eta_{t_k+\tau}\,\frac{\delta^2}{9}\ge\frac{\delta^3}{27LM}.
\]
Since $t_k\to\infty$ as $k\to\infty$, we can find a subsequence $(t_{k_p})_{p\in\mathbb N}$ satisfying $t_{k_{p+1}}-t_{k_p}>\hat s(k_p)$ by induction, and then
\[
\sum_{t=1}^\infty\eta_t\|\nabla f(\bar x(t-1))\|^2\ge\sum_{p=0}^\infty\frac{\delta^3}{27LM}=+\infty.
\]

In other words, on $\mathcal A_\delta$ the series $\sum_{t=1}^\infty\eta_t\|\nabla f(\bar x(t-1))\|^2$ diverges. Since this series converges almost surely, we have $\mathbb P(\mathcal A_\delta)=0$, and consequently
\[
\mathbb P\Big(\limsup_{t\to\infty}\|\nabla f(\bar x(t))\|>0\Big)=\mathbb P\bigg(\bigcup_{k=1}^\infty\mathcal A_{1/k}\bigg)=\lim_{k\to\infty}\mathbb P(\mathcal A_{1/k})=0,
\]
and we see that $\|\nabla f(\bar x(t))\|$ converges to zero almost surely.

2. When $\eta_t=\eta_1/\sqrt t$ and $u_t=u_1/t^{\gamma/2-1/4}$, by (40) we have

\[
\mathbb E\bigg[\frac{\delta(t)}{(t+1)^\epsilon}\,\bigg|\,\mathcal F_{t-1}\bigg]
\le\frac{1}{t^\epsilon}\delta(t-1)-\frac{\eta_1}{4t^{1/2+\epsilon}}\|\nabla f(\bar x(t-1))\|^2
+\frac{3\eta_tL^2}{2nt^\epsilon}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2
+\frac{2\eta_1^2LG^2d}{3n^2t^{1+\epsilon}}
+\eta_tu_t^2L^2\Big(1+\frac 12d^2\eta_tL\Big),
\]
where $\epsilon>0$ is arbitrary. Since
\[
\frac{3\eta_tL^2}{2nt^\epsilon}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+\frac{2\eta_1^2LG^2d}{3n^2t^{1+\epsilon}}+\eta_tu_t^2L^2\Big(1+\frac 12d^2\eta_tL\Big)
\]
is summable, we see that
\[
\sum_{t=1}^\infty\frac{\eta_1}{t^{1/2+\epsilon}}\|\nabla f(\bar x(t-1))\|^2<+\infty,
\]
which implies that
\[
\liminf_{t\to\infty}\|\nabla f(\bar x(t))\|=0.
\]
Now by taking the telescoping sum of (40) and noting that $\delta(t)\ge 0$, we get
\[
\sum_{\tau=1}^t\eta_\tau\,\mathbb E\big[\|\nabla f(\bar x(\tau-1))\|^2\big]
\le 4\delta(0)+\frac{6L^2}{n}\sum_{\tau=1}^t\eta_\tau e_c(\tau-1)+\frac{8LG^2d}{3n^2}\sum_{\tau=1}^t\eta_\tau^2
+4L^2\sum_{\tau=1}^t\Big(\eta_\tau u_\tau^2+\frac 12d^2L\eta_\tau^2u_\tau^2\Big).\tag{41}
\]

Since $\eta_t=\eta_1/\sqrt t=\alpha_\eta/(4L\sqrt{dt})\le 1/(4L\sqrt d)$ and $u_t\le\alpha_uG/(L\sqrt d\,t^{\gamma/2-1/4})$ with $\alpha_\eta\le 1$ and $\gamma>1$, we have
\[
\sum_{\tau=1}^t\eta_\tau=\eta_1\sum_{\tau=1}^t\frac{1}{\sqrt\tau}\ge 2\eta_1(\sqrt{t+1}-1),\qquad
\sum_{\tau=1}^t\eta_\tau^2=\eta_1^2\sum_{\tau=1}^t\frac 1\tau\le\eta_1^2\ln(2t+1),
\]
\[
\sum_{\tau=1}^t\Big(\eta_\tau u_\tau^2+\frac 12d^2L\eta_\tau^2u_\tau^2\Big)
\le\sum_{\tau=1}^t\Big(1+\frac{d^{3/2}}{8}\Big)\eta_\tau u_\tau^2
\le\frac{9d^{3/2}}{8}\cdot\eta_1\frac{\alpha_u^2G^2}{L^2d}\sum_{\tau=1}^t\frac{1}{\tau^\gamma}
\le\eta_1\,\frac{9\alpha_u^2G^2\sqrt d}{8L^2}\,\frac{\gamma}{\gamma-1},
\]
where we used (30), (32) and (33). In addition, by the second part of Corollary 1, we get
\[
\sum_{\tau=1}^t\eta_\tau\,\frac{e_c(\tau-1)}{n}\le\frac{2\eta_1e_c(0)}{n(1-\rho^2)}+\eta_1^3\,\frac{12\kappa^2\rho^2}{(1-\rho^2)^2}G^2d.
\]

By plugging these bounds into (41), we get
\[
\frac{\sum_{\tau=1}^t\eta_\tau\,\mathbb E\big[\|\nabla f(\bar x(\tau-1))\|^2\big]}{\sum_{\tau=1}^t\eta_\tau}
\le\frac{1}{\sqrt{t+1}-1}\bigg(\frac{8\sqrt dL\delta(0)}{\alpha_\eta}+\frac{6L^2e_c(0)}{n(1-\rho^2)}+\frac{9\alpha_u^2\gamma}{4(\gamma-1)}G^2\sqrt d\bigg)
+\frac{\alpha_\eta G^2\sqrt d}{3n^2}\,\frac{\ln(2t+1)}{\sqrt{t+1}-1}
+\frac{9\kappa^2\rho^2}{4(1-\rho^2)^2}\,\frac{\alpha_\eta^2G^2}{\sqrt{t+1}-1}.
\]
We now get the convergence rate of $\mathbb E\big[\|\nabla f(\bar x(t))\|^2\big]$ stated in the theorem. The convergence rate of the consensus error follows from Lemma 8 and (34).

C Proof of Theorem 2

We still denote
\[
x(t)=\begin{bmatrix}x_1(t)\\\vdots\\x_n(t)\end{bmatrix},\qquad
g(t)=\begin{bmatrix}g_1(t)\\\vdots\\g_n(t)\end{bmatrix},\qquad
\bar x(t)=\frac 1n\sum_{i=1}^nx_i(t),\qquad
\bar g(t)=\frac 1n\sum_{i=1}^ng_i(t),
\]
and $\delta(t)=f(\bar x(t))-f^*$, $e_c(t)=\mathbb E\big[\|x(t)-\mathbf 1_n\otimes\bar x(t)\|^2\big]$, and $(\mathcal F_t)_{t\in\mathbb N}$ will be a filtration such that $(z_i(t),x_i(t))$ is $\mathcal F_t$-measurable for each $t\ge 1$. We recall that the iterations of Algorithm 1 can be equivalently written as
\[
x(t)=(W\otimes I_d)(x(t-1)-\eta_tg(t)),\qquad\bar x(t)=\bar x(t-1)-\eta_t\bar g(t),
\]
and that each $f_i$ is assumed to be $L$-smooth with $f_i^*=\inf_{x\in\mathbb R^d}f_i(x)>-\infty$. In addition, $\Delta$ is defined as
\[
\Delta:=f^*-\frac 1n\sum_{i=1}^nf_i^*.
\]

Note that Lemma 6 still applies here. On the other hand, since each $f_i$ is no longer assumed to be uniformly Lipschitz continuous over $\mathbb R^d$, we need new lemmas characterizing the consensus procedure and the second moment of $\bar g(t)$.
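For intuition, the following NumPy sketch simulates this stacked recursion on a toy problem with the diminishing step sizes of Theorem 1. The quadratic local objectives, the ring weight matrix, the fixed smoothing radius, and the concrete sphere-sampled form of the two-point estimator $G^{(2)}_{f_i}$ are illustrative assumptions only, not part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 5, 3, 2000
eta1, u = 0.05, 1e-3

# Hypothetical local objectives: f_i(x) = 0.5 ||x - c_i||^2.
C = rng.standard_normal((n, d))
f = [lambda x, c=c: 0.5 * np.sum((x - c) ** 2) for c in C]

# Hypothetical doubly stochastic W: lazy Metropolis weights on a ring.
W = np.eye(n) / 2
for i in range(n):
    W[i, (i + 1) % n] += 0.25
    W[i, (i - 1) % n] += 0.25

x = rng.standard_normal((n, d))            # stacked iterates x_i(t-1)
for t in range(1, T + 1):
    g = np.zeros((n, d))
    for i in range(n):
        z = rng.standard_normal(d)
        z /= np.linalg.norm(z)             # z_i(t) uniform on the unit sphere
        # assumed two-point estimator: (d / 2u)(f_i(x+uz) - f_i(x-uz)) z
        g[i] = d * (f[i](x[i] + u * z) - f[i](x[i] - u * z)) / (2 * u) * z
    x = W @ (x - eta1 / np.sqrt(t) * g)    # x(t) = (W (x) I_d)(x(t-1) - eta_t g(t))

xbar = x.mean(axis=0)
print("consensus error :", np.linalg.norm(x - xbar))
print("||grad f(xbar)||:", np.linalg.norm(xbar - C.mean(axis=0)))
```

Since $W$ is doubly stochastic, averaging the update over the agents recovers exactly the recursion $\bar x(t)=\bar x(t-1)-\eta_t\bar g(t)$ used throughout the analysis.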

Lemma 9. Suppose
\[
\eta_t^2L^2\le\frac{1-\rho^2}{12\rho^2\big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\big)}.\tag{42}
\]
Then for each $t\ge 1$,
\[
\begin{aligned}
\frac{e_c(t)}{n}\le{}&\frac{1+\rho^2}{2}\,\frac{e_c(t-1)}{n}
+4\eta_t^2\rho^2L\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\mathbb E[\delta(t-1)]\\
&+4\eta_t^2\rho^2L\Delta\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)
+\eta_t^2\rho^2u_t^2L^2\Big(d^2+\frac{6\rho^2}{1-\rho^2}\Big).
\end{aligned}\tag{43}
\]
Consequently,
\[
\begin{aligned}
\frac{e_c(t)}{n}\le{}&\frac{16\alpha_\eta^2\rho^2L}{\mu^2}\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\sum_{\tau=1}^t\frac{\mathbb E[\delta(\tau-1)]}{(\tau+t_0)^2}\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau}\\
&+\frac{32\alpha_\eta^2\rho^2L\Delta}{\mu^2(1-\rho^2)}\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\frac{1}{t^2}+o(t^{-2}).
\end{aligned}\tag{44}
\]

Proof. From $x(t)-\mathbf 1_n\otimes\bar x(t)=(W\otimes I_d)\big(x(t-1)-\mathbf 1_n\otimes\bar x(t-1)-\eta_t(g(t)-\mathbf 1_n\otimes\bar g(t))\big)$, we get
\[
\begin{aligned}
\|x(t)-\mathbf 1_n\otimes\bar x(t)\|^2
&\le\rho^2\,\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)-\eta_t(g(t)-\mathbf 1_n\otimes\bar g(t))\|^2\\
&\le\rho^2\,\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+\rho^2\eta_t^2\,\|g(t)-\mathbf 1_n\otimes\bar g(t)\|^2\\
&\quad-2\rho^2\eta_t\,\langle x(t-1)-\mathbf 1_n\otimes\bar x(t-1),\,g(t)-\mathbf 1_n\otimes\bar g(t)\rangle.
\end{aligned}
\]
By $\|g(t)-\mathbf 1_n\otimes\bar g(t)\|\le\|g(t)\|$ and the bound (29) of Lemma 5, we get
\[
\mathbb E\big[\|g(t)-\mathbf 1_n\otimes\bar g(t)\|^2\,\big|\,\mathcal F_{t-1}\big]
\le\sum_{i=1}^n\mathbb E\big[\|g_i(t)\|^2\,\big|\,\mathcal F_{t-1}\big]
\le\frac{4d}{3}\sum_{i=1}^n\|\nabla f_i(x_i(t-1))\|^2+nu_t^2L^2d^2.
\]
On the other hand, we have
\[
\begin{aligned}
&-2\eta_t\,\mathbb E\big[\langle x(t-1)-\mathbf 1_n\otimes\bar x(t-1),\,g(t)-\mathbf 1_n\otimes\bar g(t)\rangle\,\big|\,\mathcal F_{t-1}\big]\\
&\qquad\le\frac{1-\rho^2}{3\rho^2}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2
+\frac{3\rho^2}{1-\rho^2}\eta_t^2\sum_{i=1}^n\bigg\|\nabla f_i^{u_t}(x_i(t-1))-\frac 1n\sum_{j=1}^n\nabla f_j^{u_t}(x_j(t-1))\bigg\|^2\\
&\qquad\le\frac{1-\rho^2}{3\rho^2}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2
+\frac{3\rho^2}{1-\rho^2}\eta_t^2\sum_{i=1}^n\|\nabla f_i^{u_t}(x_i(t-1))\|^2\\
&\qquad\le\frac{1-\rho^2}{3\rho^2}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2
+\frac{6\rho^2}{1-\rho^2}\eta_t^2\sum_{i=1}^n\|\nabla f_i(x_i(t-1))\|^2
+\frac{6\rho^2}{1-\rho^2}\eta_t^2nu_t^2L^2.
\end{aligned}
\]
Then, we notice that
\[
\begin{aligned}
\sum_{i=1}^n\|\nabla f_i(x_i(t-1))\|^2
&\le 2\sum_{i=1}^n\big(\|\nabla f_i(x_i(t-1))-\nabla f_i(\bar x(t-1))\|^2+\|\nabla f_i(\bar x(t-1))\|^2\big)\\
&\le 2L^2\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+4L\sum_{i=1}^n\big(f_i(\bar x(t-1))-f_i^*\big)\\
&\le 2L^2\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+4Ln\big(f(\bar x(t-1))-f^*\big)+4Ln\Delta,
\end{aligned}
\]
where we used the $L$-smoothness of each $f_i$ and Lemma 3. Therefore
\[
\begin{aligned}
\mathbb E\big[\|x(t)-\mathbf 1_n\otimes\bar x(t)\|^2\,\big|\,\mathcal F_{t-1}\big]
\le{}&\frac{1+2\rho^2}{3}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2
+\eta_t^2\rho^2\bigg(\frac{4d}{3}\sum_{i=1}^n\|\nabla f_i(x_i(t-1))\|^2+nu_t^2L^2d^2\bigg)\\
&+\frac{6\rho^4}{1-\rho^2}\eta_t^2\sum_{i=1}^n\|\nabla f_i(x_i(t-1))\|^2+\frac{6\rho^4}{1-\rho^2}\eta_t^2nu_t^2L^2\\
\le{}&\frac{1+2\rho^2}{3}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2
+\eta_t^2\rho^2\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\sum_{i=1}^n\|\nabla f_i(x_i(t-1))\|^2\\
&+\eta_t^2\rho^2nu_t^2L^2\Big(d^2+\frac{6\rho^2}{1-\rho^2}\Big)\\
\le{}&\bigg[\frac{1+2\rho^2}{3}+2\eta_t^2L^2\rho^2\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\bigg]\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2\\
&+\eta_t^2\rho^2\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\cdot 4Ln\big(f(\bar x(t-1))-f^*\big)
+\eta_t^2\rho^2\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\cdot 4Ln\Delta\\
&+\eta_t^2\rho^2nu_t^2L^2\Big(d^2+\frac{6\rho^2}{1-\rho^2}\Big).
\end{aligned}
\]

Since the condition (42) implies
\[
\frac{1+2\rho^2}{3}+2\eta_t^2L^2\rho^2\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\le\frac{1+\rho^2}{2},
\]
by taking the total expectation of the bound on $\mathbb E\big[\|x(t)-\mathbf 1_n\otimes\bar x(t)\|^2\,\big|\,\mathcal F_{t-1}\big]$, we get (43). Finally, by induction, we get
\[
\begin{aligned}
\frac{e_c(t)}{n}\le{}&\Big(\frac{1+\rho^2}{2}\Big)^t\frac{e_c(0)}{n}
+4\rho^2L\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\sum_{\tau=1}^t\eta_\tau^2\,\mathbb E[\delta(\tau-1)]\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau}\\
&+4\rho^2L\Delta\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\sum_{\tau=1}^t\eta_\tau^2\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau}
+\rho^2L^2\Big(d^2+\frac{6\rho^2}{1-\rho^2}\Big)\sum_{\tau=1}^t\eta_\tau^2u_\tau^2\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau}\\
={}&\frac{16\alpha_\eta^2\rho^2L}{\mu^2}\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\sum_{\tau=1}^t\frac{\mathbb E[\delta(\tau-1)]}{(\tau+t_0)^2}\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau}
+\frac{32\alpha_\eta^2\rho^2L\Delta}{\mu^2(1-\rho^2)}\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\frac{1}{t^2}+o(t^{-2}),
\end{aligned}
\]
which completes the proof.

Lemma 10. We have
\[
\mathbb E\big[\|\bar g(t)\|^2\big]\le\frac{8L^2d}{n}e_c(t-1)+\frac{32Ld}{3}\mathbb E[\delta(t-1)]+\frac{32Ld\Delta}{3}+u_t^2L^2d^2.\tag{45}
\]

Proof. We have
\[
\begin{aligned}
\mathbb E\big[\|\bar g(t)\|^2\,\big|\,\mathcal F_{t-1}\big]
&=\mathbb E\bigg[\bigg\|\frac 1n\sum_{i=1}^nG^{(2)}_{f_i}(x_i(t-1);u_t,z_i(t))\bigg\|^2\,\bigg|\,\mathcal F_{t-1}\bigg]
\le\frac 1n\sum_{i=1}^n\mathbb E\Big[\big\|G^{(2)}_{f_i}(x_i(t-1);u_t,z_i(t))\big\|^2\,\Big|\,\mathcal F_{t-1}\Big]\\
&\le\frac{4d}{3n}\sum_{i=1}^n\|\nabla f_i(x_i(t-1))\|^2+u_t^2L^2d^2\\
&\le\frac{8d}{3n}\sum_{i=1}^n\|\nabla f_i(\bar x(t-1))\|^2+\frac{8L^2d}{3n}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+u_t^2L^2d^2\\
&\le\frac{8L^2d}{n}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+\frac{32Ld}{3}\big(f(\bar x(t-1))-f^*\big)+\frac{32Ld\Delta}{3}+u_t^2L^2d^2,
\end{aligned}
\]
where the third step follows from (29) of Lemma 5, and the fourth step utilizes the $L$-smoothness of $f_i$. Taking the total expectation completes the proof.

By plugging (45) into (35) of Lemma 6 and using the fact that $f$ is $\mu$-gradient dominated, we can prove the following result.

Lemma 11. Suppose
\[
\eta_tL\le\frac{3\mu}{32Ld}.
\]
Then for each $t\ge 1$,
\[
\mathbb E[\delta(t)]\le\Big(1-\frac{\eta_t\mu}{2}\Big)\mathbb E[\delta(t-1)]+\frac{3\eta_tL^2}{2n}e_c(t-1)+\frac{16\eta_t^2L^2d\Delta}{3}+2\eta_tu_t^2L^2d.\tag{46}
\]

Proof. Plugging (45) into (35) and taking the total expectation yield
\[
\begin{aligned}
\mathbb E[\delta(t)]\le{}&\mathbb E[\delta(t-1)]-\frac{\eta_t}{2}\mathbb E\big[\|\nabla f(\bar x(t-1))\|^2\big]+\frac{\eta_tL^2}{n}(1+4\eta_tLd)\,e_c(t-1)\\
&+\frac{16\eta_t^2L^2d}{3}\mathbb E[\delta(t-1)]+\frac{16\eta_t^2L^2d\Delta}{3}+\eta_tu_t^2L^2+\frac{\eta_t^2L}{2}\cdot u_t^2L^2d^2\\
\le{}&\Big(1+\frac{16\eta_t^2L^2d}{3}\Big)\mathbb E[\delta(t-1)]-\frac{\eta_t}{2}\mathbb E\big[\|\nabla f(\bar x(t-1))\|^2\big]+\frac{3\eta_tL^2}{2n}e_c(t-1)+\frac{16\eta_t^2L^2d\Delta}{3}+2\eta_tu_t^2L^2d,
\end{aligned}
\]
where we used $\eta_t\le\frac{3\mu}{32L^2d}\le\frac{1}{8Ld}$. The bound (46) then follows by using $2\mu\,\delta(t-1)\le\|\nabla f(\bar x(t-1))\|^2$ and again $\eta_t\le\frac{3\mu}{32L^2d}$.

The following lemma gives a coarse estimate of the convergence rate of $\mathbb E[\delta(t)]$, which will be refined later.

Lemma 12. Suppose $\eta_t$ satisfies the conditions of Theorem 2. Then $\mathbb E[\delta(t)]=O(t^{-1/2})$.

Proof. It can be checked that the conditions of both Lemma 9 and Lemma 11 are satisfied by the choice of $\eta_t$ in Theorem 2. By (43) and (46), we have
\[
\begin{bmatrix}e_c(t)/n\\\mathbb E[\delta(t)]\end{bmatrix}
\preceq
\begin{bmatrix}
\dfrac{1+\rho^2}{2} & 4\rho^2L\Big(\dfrac{4d}{3}+\dfrac{6\rho^2}{1-\rho^2}\Big)\eta_t^2\\[4pt]
\dfrac{3L^2}{2}\eta_t & 1-\dfrac{\eta_t\mu}{2}
\end{bmatrix}
\begin{bmatrix}e_c(t-1)/n\\\mathbb E[\delta(t-1)]\end{bmatrix}
+\upsilon_t,
\]
where
\[
\upsilon_t=
\begin{bmatrix}
4\eta_t^2\rho^2L\Delta\Big(\dfrac{4d}{3}+\dfrac{6\rho^2}{1-\rho^2}\Big)+\eta_t^2u_t^2\rho^2L^2\Big(d^2+\dfrac{6\rho^2}{1-\rho^2}\Big)\\[4pt]
\dfrac{16\eta_t^2L^2\Delta d}{3}+2\eta_tu_t^2L^2d
\end{bmatrix}.
\]

By using $\eta_t=\frac{2\alpha_\eta}{\mu(t+t_0)}$ and $u_t=O(1/\sqrt t)$, it can be shown by straightforward calculation that
\[
\left\|\begin{bmatrix}
\frac{1+\rho^2}{2} & 4\rho^2L\big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\big)\eta_t^2\\
\frac{3L^2}{2}\eta_t & 1-\frac{\eta_t\mu}{2}
\end{bmatrix}\right\|
=1-\frac{\alpha_\eta}{t}+O(t^{-2}).
\]
Therefore there exists $T\ge 1$ such that
\[
\left\|\begin{bmatrix}
\frac{1+\rho^2}{2} & 4\rho^2L\big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\big)\eta_t^2\\
\frac{3L^2}{2}\eta_t & 1-\frac{\eta_t\mu}{2}
\end{bmatrix}\right\|
\le 1-\frac{\alpha_\eta}{2t}
\]
for all $t\ge T$. Therefore
\[
\left\|\begin{bmatrix}e_c(t)/n\\\mathbb E[\delta(t)]\end{bmatrix}\right\|
\le\Big(1-\frac{\alpha_\eta}{2t}\Big)\left\|\begin{bmatrix}e_c(t-1)/n\\\mathbb E[\delta(t-1)]\end{bmatrix}\right\|+\|\upsilon_t\|,\qquad\forall t\ge T,
\]
and by induction, we see that
\[
\left\|\begin{bmatrix}e_c(t)/n\\\mathbb E[\delta(t)]\end{bmatrix}\right\|
\le\left\|\begin{bmatrix}e_c(T-1)/n\\\mathbb E[\delta(T-1)]\end{bmatrix}\right\|\cdot\prod_{\tau=T}^t\Big(1-\frac{\alpha_\eta}{2\tau}\Big)
+\sum_{\tau=T}^t\|\upsilon_\tau\|\prod_{s=\tau+1}^t\Big(1-\frac{\alpha_\eta}{2s}\Big).
\]

Since for any $T\le t_1\le t_2+1$, we have
\[
\prod_{s=t_1}^{t_2}\Big(1-\frac{\alpha_\eta}{2s}\Big)
\le\exp\bigg(-\sum_{s=t_1}^{t_2}\frac{\alpha_\eta}{2s}\bigg)
\le\exp\Big(-\frac{\alpha_\eta}{2}\big(\ln(t_2+1)-\ln t_1\big)\Big)
=\Big(\frac{t_1}{t_2+1}\Big)^{\alpha_\eta/2},
\]
we can see that
\[
\left\|\begin{bmatrix}e_c(t)/n\\\mathbb E[\delta(t)]\end{bmatrix}\right\|
\le\left\|\begin{bmatrix}e_c(T-1)/n\\\mathbb E[\delta(T-1)]\end{bmatrix}\right\|\Big(\frac{T}{t+1}\Big)^{\alpha_\eta/2}
+\sum_{\tau=T}^t\|\upsilon_\tau\|\Big(\frac{\tau+1}{t+1}\Big)^{\alpha_\eta/2}.\tag{47}
\]
Finally, by noticing that $\alpha_\eta>1$, $\|\upsilon_t\|=O(1/t^2)$ and that
\[
\sum_{\tau=T}^t\frac{1}{\tau^2}\Big(\frac{\tau+1}{t+1}\Big)^{\alpha_\eta/2}
\le\frac{1}{(t+1)^{\alpha_\eta/2}}\cdot\frac{(T+1)^2}{T^2}\sum_{\tau=T}^t(\tau+1)^{\alpha_\eta/2-2}
=\begin{cases}
O(1/t), & \alpha_\eta>2,\\
O(\ln t/t), & \alpha_\eta=2,\\
O(t^{-\alpha_\eta/2}), & 1<\alpha_\eta<2
\end{cases}
\;=O(t^{-1/2}),
\]
we can see that $\mathbb E[\delta(t)]=O(t^{-1/2})$.
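The elementary estimate $\prod_{s=t_1}^{t_2}\big(1-\frac{\alpha_\eta}{2s}\big)\le(t_1/(t_2+1))^{\alpha_\eta/2}$ used above (and again, with exponent $\alpha_\eta$, in the proof of Theorem 2 below) is easy to check numerically; in the following sketch the parameter values are arbitrary illustrations.

```python
import numpy as np

alpha, t1, t2 = 1.5, 7, 300   # arbitrary illustrative values with alpha > 1
s = np.arange(t1, t2 + 1)
prod = np.prod(1.0 - alpha / (2.0 * s))
bound = (t1 / (t2 + 1.0)) ** (alpha / 2.0)
# Follows from 1 - x <= exp(-x) and sum_{s=t1}^{t2} 1/s >= ln((t2+1)/t1).
print(prod, "<=", bound)
```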

We are now ready to prove Theorem 2.

Proof of Theorem 2. By Lemma 12 and (34), we can see that
\[
\sum_{\tau=1}^t\frac{\mathbb E[\delta(\tau-1)]}{(\tau+t_0)^2}\Big(\frac{1+\rho^2}{2}\Big)^{t-\tau}=O\Big(\frac{1}{t^{2+1/2}}\Big).
\]
Therefore from (44) of Lemma 9, we see that
\[
\frac{e_c(t)}{n}\le\frac{32\alpha_\eta^2\rho^2L\Delta}{\mu^2(1-\rho^2)}\Big(\frac{4d}{3}+\frac{6\rho^2}{1-\rho^2}\Big)\frac{1}{t^2}+o(t^{-2}),\tag{48}
\]
which is just the bound (14). On the other hand, by Lemma 11 and mathematical induction, we get
\[
\begin{aligned}
\mathbb E[\delta(t)]\le{}&\delta(0)\prod_{\tau=1}^t\Big(1-\frac{\eta_\tau\mu}{2}\Big)
+\frac{3L^2}{2}\sum_{\tau=1}^t\eta_\tau\frac{e_c(\tau-1)}{n}\prod_{s=\tau+1}^t\Big(1-\frac{\eta_s\mu}{2}\Big)\\
&+\frac{16L^2\Delta d}{3}\sum_{\tau=1}^t\eta_\tau^2\prod_{s=\tau+1}^t\Big(1-\frac{\eta_s\mu}{2}\Big)
+2L^2d\sum_{\tau=1}^t\eta_\tau u_\tau^2\prod_{s=\tau+1}^t\Big(1-\frac{\eta_s\mu}{2}\Big).
\end{aligned}
\]

Since for any $t_1\le t_2+1$, we have
\[
\prod_{s=t_1}^{t_2}\Big(1-\frac{\eta_s\mu}{2}\Big)
\le\exp\bigg(-\sum_{s=t_1}^{t_2}\frac{\eta_s\mu}{2}\bigg)
=\exp\bigg(-\alpha_\eta\sum_{s=t_1}^{t_2}\frac{1}{s+t_0}\bigg)
\le\exp\big(-\alpha_\eta\big(\ln(t_2+t_0+1)-\ln(t_1+t_0)\big)\big)
=\Big(\frac{t_1+t_0}{t_2+t_0+1}\Big)^{\alpha_\eta},
\]
by plugging in the conditions on $\eta_t$ and $u_t$, we get
\[
\begin{aligned}
\mathbb E[\delta(t)]\le{}&\delta(0)\Big(\frac{t_0+1}{t+t_0+1}\Big)^{\alpha_\eta}
+\frac{3\alpha_\eta L^2}{\mu}\sum_{\tau=1}^t\frac{e_c(\tau-1)}{n(\tau+t_0)}\Big(\frac{\tau+t_0+1}{t+t_0+1}\Big)^{\alpha_\eta}\\
&+\bigg(\frac{64\alpha_\eta^2L^2\Delta d}{3\mu^2}+\frac{4\alpha_\eta\alpha_u^2L^2d}{\mu}\bigg)\sum_{\tau=1}^t\frac{1}{(\tau+t_0)^2}\Big(\frac{\tau+t_0+1}{t+t_0+1}\Big)^{\alpha_\eta}.
\end{aligned}
\]

By (48), we see that
\[
\sum_{\tau=1}^t\frac{e_c(\tau-1)}{n(\tau+t_0)}\Big(\frac{\tau+t_0+1}{t+t_0+1}\Big)^{\alpha_\eta}
\le C_1\sum_{\tau=1}^t\frac{1}{\tau^2(\tau+t_0)}\Big(\frac{\tau+t_0+1}{t+t_0+1}\Big)^{\alpha_\eta}
=\frac{C_1}{(t+t_0+1)^{\alpha_\eta}}\cdot
\begin{cases}
O(t^{\alpha_\eta-2}), & \alpha_\eta>2,\\
O(\ln t), & \alpha_\eta=2,\\
C_2, & 1<\alpha_\eta<2
\end{cases}
\;=o(t^{-1}),
\]
where $C_1$ and $C_2$ are some positive constants. On the other hand,
\[
\sum_{\tau=1}^t\frac{1}{(\tau+t_0)^2}\Big(\frac{\tau+t_0+1}{t+t_0+1}\Big)^{\alpha_\eta}
\le\frac{1}{(t+t_0+1)^{\alpha_\eta}}\Big(\frac{t_0+2}{t_0+1}\Big)^2\sum_{\tau=1}^t(\tau+t_0+1)^{\alpha_\eta-2}
\le\Big(\frac{t_0+2}{t_0+1}\Big)^2\cdot\frac 1t+o(t^{-1})
\le\frac 32\cdot\frac 1t+o(t^{-1}),
\]
where we used the fact that
\[
\Big(\frac{t_0+2}{t_0+1}\Big)^2\le\Big(1+\frac{3\mu^2}{64\alpha_\eta L^2d}\Big)^2\le\Big(1+\frac{3}{64}\Big)^2\le\frac 32.
\]
Therefore we obtain
\[
\mathbb E[\delta(t)]\le\bigg(\frac{32\alpha_\eta^2L^2\Delta d}{\mu^2}+\frac{6\alpha_\eta\alpha_u^2L^2d}{\mu}\bigg)\frac 1t+o(t^{-1}).\tag{49}
\]

D Proof of Theorem 3

We first bound the error of the $2d$-point gradient estimator.

Lemma 13. Let $f:\mathbb R^d\to\mathbb R$ be $L$-smooth. Then for any $x\in\mathbb R^d$,
\[
\bigg\|\sum_{k=1}^d\frac{f(x+ue_k)-f(x-ue_k)}{2u}e_k-\nabla f(x)\bigg\|\le\frac 12uL\sqrt d.
\]

Proof. We have
\[
\begin{aligned}
\bigg\|\sum_{k=1}^d\frac{f(x+ue_k)-f(x-ue_k)}{2u}e_k-\nabla f(x)\bigg\|
&=\bigg\|\sum_{k=1}^d\Big(\frac{f(x+ue_k)-f(x-ue_k)}{2u}-\langle\nabla f(x),e_k\rangle\Big)e_k\bigg\|\\
&=\bigg(\sum_{k=1}^d\Big|\frac{f(x+ue_k)-f(x-ue_k)}{2u}-\langle\nabla f(x),e_k\rangle\Big|^2\bigg)^{1/2}\\
&\le\bigg(\sum_{k=1}^d\Big(\frac 12uL\Big)^2\bigg)^{1/2}=\frac 12uL\sqrt d,
\end{aligned}
\]
where we used (26) of Lemma 5.
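As a quick numerical sanity check of Lemma 13, the snippet below compares the $2d$-point estimator with the true gradient for a hypothetical $L$-smooth test function; the function and the value $L=2$ are illustrative assumptions.

```python
import numpy as np

d, u, L = 10, 1e-2, 2.0   # dimension, smoothing radius, smoothness constant
rng = np.random.default_rng(1)

# Hypothetical test function with Hessian I - diag(cos x), so 0 <= Hessian <= 2I:
#   f(x) = 0.5 ||x||^2 + sum_k cos(x_k),  grad f(x) = x - sin(x)
f = lambda x: 0.5 * x @ x + np.sum(np.cos(x))
grad = lambda x: x - np.sin(x)

x = rng.standard_normal(d)
I = np.eye(d)
# 2d-point estimator: sum_k (f(x + u e_k) - f(x - u e_k)) / (2u) * e_k
est = np.array([(f(x + u * I[k]) - f(x - u * I[k])) / (2 * u) for k in range(d)])

print(np.linalg.norm(est - grad(x)), "<=", 0.5 * u * L * np.sqrt(d))  # Lemma 13
```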

We shall use the notations
\[
x(t)=\begin{bmatrix}x_1(t)\\\vdots\\x_n(t)\end{bmatrix},\quad
g(t)=\begin{bmatrix}g_1(t)\\\vdots\\g_n(t)\end{bmatrix},\quad
s(t)=\begin{bmatrix}s_1(t)\\\vdots\\s_n(t)\end{bmatrix},\quad
\bar x(t)=\frac 1n\sum_{i=1}^nx_i(t),\quad
\bar g(t)=\frac 1n\sum_{i=1}^ng_i(t),
\]
and $\delta(t)=f(\bar x(t))-f^*$, $e_c(t)=\|x(t)-\mathbf 1_n\otimes\bar x(t)\|^2$, $e_g(t)=\|s(t)-\mathbf 1_n\otimes\bar g(t)\|^2$. It is not hard to see that the iterations of Algorithm 2 can be equivalently written as
\[
\begin{aligned}
s(t)&=(W\otimes I_d)\big(s(t-1)+g(t)-g(t-1)\big),\\
x(t)&=(W\otimes I_d)\big(x(t-1)-\eta s(t)\big).
\end{aligned}
\]
We also have
\[
\frac 1n\sum_{i=1}^ns_i(t)=\bar g(t),\qquad\bar x(t)=\bar x(t-1)-\eta\bar g(t).
\]
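A minimal NumPy sketch of this gradient-tracking recursion follows. The local objectives and weight matrix are the same hypothetical toy setup as before, the estimator is the $2d$-point scheme of Lemma 13 with a fixed smoothing radius, and the constant step size respects $\eta L\le 1/6$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, T = 4, 3, 400
eta, u = 0.1, 1e-4            # eta * L = 0.1 <= 1/6 for these quadratics (L = 1)

C = rng.standard_normal((n, d))   # hypothetical f_i(x) = 0.5 ||x - c_i||^2
f = [lambda x, c=c: 0.5 * np.sum((x - c) ** 2) for c in C]

W = np.eye(n) / 2                 # lazy Metropolis weights on a ring
for i in range(n):
    W[i, (i + 1) % n] += 0.25
    W[i, (i - 1) % n] += 0.25

def estimate(i, xi):
    """2d-point gradient estimator of Lemma 13 applied to f_i."""
    I = np.eye(d)
    return np.array([(f[i](xi + u * I[k]) - f[i](xi - u * I[k])) / (2 * u)
                     for k in range(d)])

x = rng.standard_normal((n, d))
s = np.zeros((n, d))              # s(0) = 0
g_prev = np.zeros((n, d))         # g(0) = 0
for t in range(1, T + 1):
    g = np.vstack([estimate(i, x[i]) for i in range(n)])
    s = W @ (s + g - g_prev)      # s(t) = (W (x) I_d)(s(t-1) + g(t) - g(t-1))
    x = W @ (x - eta * s)         # x(t) = (W (x) I_d)(x(t-1) - eta s(t))
    g_prev = g

print("distance to minimizer:", np.linalg.norm(x.mean(axis=0) - C.mean(axis=0)))
```

Because $W$ is doubly stochastic and $s(0)=g(0)=0$, the iterates satisfy $\frac 1n\sum_i s_i(t)=\bar g(t)$ at every step, matching the identity displayed above.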

Lemma 14. Suppose $\eta L\le 1/6$. Then
\[
\delta(t)\le\delta(t-1)-\frac{\eta}{3}\|\nabla f(\bar x(t-1))\|^2+\frac{4\eta L^2}{3n}e_c(t-1)+\frac{\eta u_t^2L^2d}{3}.\tag{50}
\]

Proof. By $\bar x(t)=\bar x(t-1)-\eta\bar g(t)$ and the $L$-smoothness of the function $f$, we have
\[
\begin{aligned}
f(\bar x(t))&\le f(\bar x(t-1))-\eta\langle\nabla f(\bar x(t-1)),\bar g(t)\rangle+\frac{\eta^2L}{2}\|\bar g(t)\|^2\\
&=f(\bar x(t-1))-\eta\|\nabla f(\bar x(t-1))\|^2+\frac{\eta^2L}{2}\|\bar g(t)\|^2
-\eta\bigg\langle\nabla f(\bar x(t-1)),\,\frac 1n\sum_{i=1}^n\big(g_i(t)-\nabla f_i(\bar x(t-1))\big)\bigg\rangle\\
&\le f(\bar x(t-1))-\frac\eta2\|\nabla f(\bar x(t-1))\|^2+\frac{\eta^2L}{2}\|\bar g(t)\|^2
+\frac\eta2\bigg\|\frac 1n\sum_{i=1}^n\big(g_i(t)-\nabla f_i(\bar x(t-1))\big)\bigg\|^2.
\end{aligned}
\]
Then, by Lemma 13,
\[
\begin{aligned}
\bigg\|\frac 1n\sum_{i=1}^n\big(g_i(t)-\nabla f_i(\bar x(t-1))\big)\bigg\|^2
&\le 2\bigg\|\frac 1n\sum_{i=1}^n\big(\nabla f_i(x_i(t-1))-\nabla f_i(\bar x(t-1))\big)\bigg\|^2
+2\bigg(\frac 1n\sum_{i=1}^n\|g_i(t)-\nabla f_i(x_i(t-1))\|\bigg)^2\\
&\le 2\bigg(\frac 1n\sum_{i=1}^nL\|x_i(t-1)-\bar x(t-1)\|\bigg)^2+\frac 12u_t^2L^2d\\
&\le\frac{2L^2}{n}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+\frac 12u_t^2L^2d,
\end{aligned}\tag{51}
\]
and we see that
\[
f(\bar x(t))\le f(\bar x(t-1))-\frac\eta2\|\nabla f(\bar x(t-1))\|^2+\frac{\eta^2L}{2}\|\bar g(t)\|^2
+\frac{\eta L^2}{n}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+\frac{\eta u_t^2L^2d}{4}.
\]

Next, we bound the term $\|\bar g(t)\|^2$:
\[
\begin{aligned}
\|\bar g(t)\|^2=\bigg\|\frac 1n\sum_{i=1}^ng_i(t)\bigg\|^2
&\le 2\|\nabla f(\bar x(t-1))\|^2+2\bigg\|\frac 1n\sum_{i=1}^n\big(g_i(t)-\nabla f_i(\bar x(t-1))\big)\bigg\|^2\\
&\le 2\|\nabla f(\bar x(t-1))\|^2+\frac{4L^2}{n}\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+u_t^2L^2d.
\end{aligned}
\]
Then we see that
\[
\begin{aligned}
f(\bar x(t))\le{}&f(\bar x(t-1))-\frac\eta2(1-2\eta L)\|\nabla f(\bar x(t-1))\|^2\\
&+\frac{\eta L^2}{n}(1+2\eta L)\|x(t-1)-\mathbf 1_n\otimes\bar x(t-1)\|^2+\frac{\eta u_t^2L^2d}{4}(1+2\eta L).
\end{aligned}\tag{52}
\]
Finally, by using $\eta L\le 1/6$, we get the desired result.

Lemma 15. We have
\[
e_g(1)=\|s(1)-\mathbf 1_n\otimes\bar g(1)\|^2\le\rho^2\bigg(\frac 32\sum_{i=1}^n\|\nabla f_i(x_i(0))\|^2+\frac 34nu_1^2L^2d\bigg).
\]

Proof. Since $s(0)=g(0)=0$, we have
\[
\|s(1)-\mathbf 1_n\otimes\bar g(1)\|^2=\|(W\otimes I_d)(g(1)-\mathbf 1_n\otimes\bar g(1))\|^2\le\rho^2\|g(1)-\mathbf 1_n\otimes\bar g(1)\|^2.
\]
Then since
\[
\|g(1)-\mathbf 1_n\otimes\bar g(1)\|^2
=\|g(1)\|^2+n\|\bar g(1)\|^2-2\sum_{i=1}^n\bigg\langle g_i(1),\frac 1n\sum_{j=1}^ng_j(1)\bigg\rangle
=\|g(1)\|^2-n\|\bar g(1)\|^2\le\|g(1)\|^2,
\]
and by Lemma 13,
\[
\|g(1)\|^2\le\sum_{i=1}^n\Big[\frac 32\|\nabla f_i(x_i(0))\|^2+3\|g_i(1)-\nabla f_i(x_i(0))\|^2\Big]
\le\frac 32\sum_{i=1}^n\|\nabla f_i(x_i(0))\|^2+3\sum_{i=1}^n\Big(\frac 12u_1L\sqrt d\Big)^2
=\frac 32\sum_{i=1}^n\|\nabla f_i(x_i(0))\|^2+\frac 34nu_1^2L^2d,
\]
we get the desired result.

The following lemma characterizes the consensus procedure.

Lemma 16. Suppose $\eta L\le 1/6$. Then we have the following component-wise inequality:
\[
\begin{bmatrix}\dfrac{2}{\sqrt{57}L}e_g(t)\\[4pt]e_c(t-1)\end{bmatrix}
\preceq
A\begin{bmatrix}\dfrac{2}{\sqrt{57}L}e_g(t-1)\\[4pt]e_c(t-2)\end{bmatrix}
+\frac{2n\eta^3L\rho^2(1+2\rho^2)}{3(1-\rho^2)}
\begin{bmatrix}2\|\nabla f(\bar x(t-2))\|^2+\dfrac 54\dfrac{u_{t-1}^2d}{\eta^2}\\[4pt]0\end{bmatrix},\tag{53}
\]
where
\[
A:=\begin{bmatrix}
\dfrac{1+2\rho^2}{3}+\dfrac{18\rho^4(1+2\rho^2)}{1-\rho^2}\eta^2L^2 & \dfrac{2\sqrt{57}\rho^2(1+2\rho^2)}{5(1-\rho^2)}\eta L\\[6pt]
\dfrac{2\sqrt{57}\rho^2(1+2\rho^2)}{5(1-\rho^2)}\eta L & \dfrac{1+2\rho^2}{3}
\end{bmatrix}.\tag{54}
\]

Proof. We have
\[
s(t)-\mathbf 1_n\otimes\bar g(t)=(W\otimes I_d)\big(s(t-1)-\mathbf 1_n\otimes\bar g(t-1)+g(t)-g(t-1)-\mathbf 1_n\otimes\bar g(t)+\mathbf 1_n\otimes\bar g(t-1)\big).
\]
Then since
\[
\begin{aligned}
\|g(t)-g(t-1)-\mathbf 1_n\otimes\bar g(t)+\mathbf 1_n\otimes\bar g(t-1)\|^2
&=\|g(t)-g(t-1)\|^2+n\|\bar g(t)-\bar g(t-1)\|^2-2\sum_{i=1}^n\langle g_i(t)-g_i(t-1),\,\bar g(t)-\bar g(t-1)\rangle\\
&=\|g(t)-g(t-1)\|^2-n\|\bar g(t)-\bar g(t-1)\|^2\le\|g(t)-g(t-1)\|^2,
\end{aligned}
\]
we have
\[
\begin{aligned}
e_g(t)=\|s(t)-\mathbf 1_n\otimes\bar g(t)\|^2
&\le\rho^2\big(\|s(t-1)-\mathbf 1_n\otimes\bar g(t-1)\|+\|g(t)-g(t-1)\|\big)^2\\
&\le\rho^2\cdot\bigg[\Big(1+\frac{1-\rho^2}{3\rho^2}\Big)\|s(t-1)-\mathbf 1_n\otimes\bar g(t-1)\|^2+\Big(1+\frac{3\rho^2}{1-\rho^2}\Big)\|g(t)-g(t-1)\|^2\bigg]\\
&=\frac{1+2\rho^2}{3}\|s(t-1)-\mathbf 1_n\otimes\bar g(t-1)\|^2+\frac{\rho^2(1+2\rho^2)}{1-\rho^2}\sum_{i=1}^n\|g_i(t)-g_i(t-1)\|^2.
\end{aligned}
\]
Now since
\[
\begin{aligned}
\|g_i(t)-g_i(t-1)\|^2
&\le 2\|\nabla f_i(x_i(t-1))-\nabla f_i(x_i(t-2))\|^2+2\|g_i(t)-\nabla f_i(x_i(t-1))-g_i(t-1)+\nabla f_i(x_i(t-2))\|^2\\
&\le 2\|\nabla f_i(x_i(t-1))-\nabla f_i(x_i(t-2))\|^2+2\Big(\frac{u_t+u_{t-1}}{2}L\sqrt d\Big)^2\\
&\le 2L^2\|x_i(t-1)-x_i(t-2)\|^2+2u_{t-1}^2L^2d,
\end{aligned}
\]
we get
\[
e_g(t)\le
\]