Suhail M. Shah, arXiv:1711.11196v2 [math.OC], 25 May 2018


Distributed Optimization on Riemannian Manifolds

SUHAIL MOHMAD SHAH
Department of Electrical Engineering, Indian Institute of Technology Bombay,
Powai, Mumbai 400076, India. ([email protected]).

Abstract. We consider the problem of consensual distributed optimization on Riemannian manifolds. Specifically, we study the minimization of a sum of functions subject to non-linear constraints, where each function in the sum is defined on a manifold and is accessible only at one node of a network. This generalizes the framework of distributed optimization to a much larger class of problems, which subsumes the Euclidean case. Leveraging concepts from Riemannian optimization and consensus algorithms, we propose an algorithm to efficiently solve this problem. A detailed convergence analysis is carried out for (geodesically) convex optimization problems. The empirical performance of the algorithm is demonstrated on standard problems which fit the Riemannian optimization framework.

Key words: Riemannian Optimization, Distributed Optimization, Manifolds, PCA.


1 Introduction

In the last few decades there has been a major effort to develop algorithms for distributed optimization problems and consensus protocols. The majority of these aim at solving problems arising in wireless and sensor networks, machine learning, multi-vehicle coordination and internet transmission protocols. In such scenarios, the networks are typically spatially distributed over a large area and potentially have a large number of nodes. The absence of a central node with access to the complete system information lends them the name “decentralized” or “distributed” algorithms. Such a distinction is made because the lack of a central entity makes conventional centralized optimization techniques inapplicable. This has initiated the development of distributed computational models and algorithms to support efficient operations over such networks. We refer the interested reader to [1] for a succinct account of the theory of distributed optimization algorithms and the recent developments therein.

The problem of consensual distributed optimization for the Euclidean case can be formally stated as:

    minimize   (1/n) ∑_{i=1}^n f^i(x^i),
    subject to ∑_{j=1}^n q_{ij} ‖x^i − x^j‖² = 0   ∀ i ∈ {1, ..., n},

where x^i ∈ ℝ^n, f^i : ℝ^n → ℝ, and the scalars q_{ij} satisfy q_{ij} > 0 if nodes i and j can communicate and q_{ij} = 0 otherwise. There are usually some additional assumptions on the network which we do not state here. As a motivating example, one can consider machine learning optimization, where the sum-of-functions form can arise from considering the expected risk:

    L(w) = E[ℓ(h(x; w), y)],

where ℓ(·, ·) is the loss function, so that given the input-output pair (x, y), the loss incurred is ℓ(h(x; w), y), with h(x; w) and y the predicted and true outputs, respectively. Lacking access to the actual distribution of the data, we can instead consider the minimization of the empirical risk on the available data:

    L(w) = (1/n) ∑_{i=1}^n ℓ(h(x_i; w), y_i).    (1)

Setting f^i(w) := ℓ(h(x_i; w), y_i), we recover the sum-of-functions objective. In solving modern machine learning problems, one frequently has to deal with distributed data sets. This happens when the data set {x_i, y_i}_{i=1}^n is located in disjoint subsets at different nodes of a network with only local communication. A notable example where this occurs is federated learning [2]. Thus, a learner may lack access to the entire dataset, which necessitates the deployment of distributed optimization algorithms to obtain a global solution.

Another subject that has received a lot of attention during the last decade is Riemannian optimization. Broadly speaking, Riemannian optimization deals with optimization problems wherein the constraint set is a Riemannian manifold. Before the advent of techniques from this field, the usual practice when dealing with such constraints was to alternate between optimizing in the ambient Euclidean space and projecting onto the manifold. Some classic examples include the power iteration and Oja's algorithm [3]; both can be studied within the framework of projected gradient algorithms. The potential pitfall of these schemes is that for certain manifolds (e.g., positive definite matrices), projections can be too expensive to compute. To address this issue, there has been a lot of effort towards developing algorithms that are projection-free. Riemannian optimization provides a viable alternative by operating directly on the manifold and exploiting the geometric structure inherent in the problem. This allows Riemannian optimization to turn the constrained optimization problem into an unconstrained one defined on the manifold.

Having discussed the relevance of distributed optimization and Riemannian optimization, one is led to consider developing distributed optimization techniques for Riemannian manifolds. Given the plethora of practical examples which can be posed within the Riemannian framework, such as regression on manifolds, principal component analysis (PCA), matrix completion, and computing the leading eigenvector of distributed data, such an undertaking would have both practical and theoretical implications. Unfortunately, there exists a significant gap in the existing literature with regard to understanding distributed computing on Riemannian manifolds. We remark that this is a much harder problem than its Euclidean counterpart, since one has to account for curvature effects. This paper aims at addressing this gap and beginning the program of analyzing distributed optimization on Riemannian manifolds.

Formally, the problem of consensual distributed optimization for Riemannian manifolds can be posed as:

    minimize   (1/n) ∑_{i=1}^n f^i(x^i),
    subject to x^i ∈ M,  ∑_{j=1}^n q_{ij} d²(x^i, x^j) = 0   ∀ i ∈ {1, ..., n},    (2)

where M is a Riemannian manifold, f^i : M → ℝ, and d²(x^i, x^j) is the squared Riemannian distance between x^i and x^j (see Section 2 for the definition of distance on a Riemannian manifold).

Overview of the present work: We generalize the distributed optimization framework to Riemannian manifolds and propose an algorithm for which we provide convergence guarantees. The present program of extending distributed optimization to Riemannian manifolds faces two (surmountable) hurdles:

(i) The Riemannian distance function, which is the analog of the Euclidean distance for manifolds, is not convex. This introduces local minima in the cost function that is minimized in order to achieve consensus. These local minima constitute non-consensus configurations, by which we mean that the limiting values at the nodes need not all be equal.

(ii) The gradient vectors of the cost function evaluated at different nodes are tangent vectors lying in different tangent spaces. This renders the usual vector operations on them meaningless. This issue is addressed using the concept of parallel transport.

The convergence properties of the proposed algorithm are studied and convergence is established under certain assumptions.

Relevant literature: [4] traces the key ideas of gradient descent algorithms with line search on manifolds to [5]. As computationally viable alternatives to calculating geodesics became available, the literature dealing with optimization on Riemannian manifolds grew significantly. In the last few decades, these algorithms have increasingly challenged mainstream algorithms for the state of the art in many popular applications. As a result, there has been considerable interest in optimization and related algorithms on Riemannian manifolds, particularly on matrix manifolds. [4] and [6] serve as standard references for the topic, along with [7] for convex optimization on Riemannian manifolds. The iteration complexity analysis for first-order methods on Alexandrov spaces (manifolds with lower-bounded curvature) was carried out in [8]. We refer the reader to the "notes and references" section in [4] for a wonderful account of the history and development of this field.

The literature on distributed optimization is vast, building upon the works of [9] for the unconstrained case and [10] for the constrained case. [11] was one of the earliest works to address the issue of achieving a consensus solution to an optimization problem in a network of computational agents. In [10], the same problem was considered subject to constraints on the solution. The works [12] and [13] extended this framework and studied different variants and extensions (asynchrony, noisy links, varying communication graphs, etc.). The non-convex version was considered in [14].

Finally, we briefly mention some of the works which study distributed consensus algorithms on Riemannian manifolds. Direct applications include distributed pose estimation and camera sensor network utilization, among others. [15] lists the earliest works on Riemannian consensus algorithms as [16] and [17]; the former considers the spherical manifold while the latter deals with the N-torus. [18] considers a more general class of compact homogeneous manifolds. However, these works deal only with embedded sub-manifolds of Euclidean space; in particular, they depend on the specific embedding used and hence are extrinsic. Some other relevant works which deal with coordination on Lie groups include [19] and [20]. The most important work for our purposes is [15], in which the authors pose the consensus problem as an optimization problem and employ standard Riemannian gradient descent to solve it. A primary advantage of this approach is that it is intrinsic and hence independent of the embedding map used.

The paper is organized as follows: In Section 2 we provide basic definitions and recall concepts of Riemannian manifolds which we will be using here; a quick review of Riemannian consensus and distributed optimization on Euclidean spaces is also provided. Section 3 presents the algorithm and a brief discussion of the steps involved. Section 4 details the convergence analysis. In Section 6 we test the proposed algorithm on several different examples.

Some notation: Since we are dealing with distributed computation, we use a stacked vector notation. In particular, a boldface w_k = (w_k^1, ···, w_k^n) ∈ Mⁿ, where w_k^i is the value stored at node i during iteration k, with Mⁿ := M × ··· × M denoting the n-fold Cartesian product of M with itself. Additionally, for any two tangent vectors a, b ∈ T_w Mⁿ, we use the usual product metric ⟨a, b⟩ := ∑_{i=1}^n ⟨a^i, b^i⟩ under the natural identification T_w Mⁿ = T_{w^1} M ⊕ ··· ⊕ T_{w^n} M, which metrizes the product topology ('⊕' denotes the Whitney sum here). This implies that geodesics, exponential maps, and gradients on Mⁿ can be obtained by applying the respective definitions on each copy of M in Mⁿ.

2 Additional Background

In this section we briefly recall the basic definitions and concepts regarding Riemannian manifolds, while also establishing notation. For a detailed treatment we refer the reader to [4], [21] or [22]. We also cover the basics of Riemannian consensus and Euclidean distributed optimization.

2.1 Basic Definitions

(i) Riemannian Manifolds: Throughout this paper we let M denote a connected Riemannian manifold. A smooth n-dimensional manifold is a pair (M, A), where M is a Hausdorff, second countable topological space and A is a collection of charts {U_α, ψ_α} of the set M (for more details see [15]). A Riemannian manifold is a manifold whose tangent spaces are endowed with a smoothly varying inner product ⟨·,·⟩_x called the Riemannian metric. The tangent space at any point x ∈ M is denoted by T_x M (for a definition see [4] or [21]). We recall that a tangent space admits the structure of a vector space and, for a Riemannian manifold, it is a normed vector space. The tangent bundle TM is defined to be the disjoint union ∪_{x∈M} {x} × T_x M. The normal space at the point x, denoted by N_M(x), is the set of all vectors orthogonal (w.r.t. ⟨·,·⟩_x) to the tangent space at x. Using the norm, one can also define the arc length of a curve γ : [a, b] → M as

    L(γ) = ∫_a^b √(⟨γ̇(t), γ̇(t)⟩_{γ(t)}) dt.

We let d(·, ·) denote the Riemannian distance between any two points x, y ∈ M, i.e.,

    d : M × M → ℝ,  d(x, y) = inf_{γ∈Γ} L(γ),

where Γ is the set of all curves in M joining x and y. We recall that d(·, ·) defines a metric on M. We also assume that the sectional curvature of the manifold is bounded above by κ_max and below by κ_min.

(ii) Geodesics and Exponential Map: A geodesic on M is a curve that locally minimizes the arc length (equivalently, these are the curves satisfying γ̈(t) ∈ N_M(γ(t)) for all t). The exponential of a tangent vector u at x, denoted by Exp_x(u), is defined to be Γ(1, x, u), where t ↦ Γ(t, x, u) is the geodesic that satisfies

    Γ(0, x, u) = x  and  (d/dt) Γ(t, x, u) |_{t=0} = u.

Throughout the paper we assume a geodesically complete manifold, i.e., there is always a minimal-length geodesic between any two points on the manifold. Let B(x, r) denote the open geodesic ball centered at x of radius r, i.e.,

    B(x, r) = {y ∈ M : d(x, y) < r}.
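On the unit sphere S² ⊂ ℝ³, the exponential and logarithm maps have simple closed forms, which makes the definitions above easy to check numerically. The following sketch is our own illustration and not part of the paper's development:

```python
import numpy as np

def sphere_exp(x, u):
    """Exponential map on the unit sphere: follow the great circle from x
    in the tangent direction u for an arc length of ||u||."""
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return x.copy()
    return np.cos(nu) * x + np.sin(nu) * (u / nu)

def sphere_log(x, y):
    """Logarithm (inverse exponential) map: the tangent vector at x
    pointing toward y, with norm d(x, y) = arccos(<x, y>)."""
    c = np.clip(x @ y, -1.0, 1.0)
    p = y - c * x                      # component of y tangent at x
    norm_p = np.linalg.norm(p)
    if norm_p < 1e-12:
        return np.zeros_like(x)
    return np.arccos(c) * (p / norm_p)

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
u = sphere_log(x, y)
# d(x, y) = pi/2, and Exp_x(Exp_x^{-1}(y)) recovers y
assert np.isclose(np.linalg.norm(u), np.pi / 2)
assert np.allclose(sphere_exp(x, u), y)
```

Here y lies inside the geodesic ball B(x, π), i.e., within the injectivity radius of the sphere, so the logarithm is well defined.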


Let I_x denote the maximal open set in T_x M on which Exp_x is a diffeomorphism. On the set Exp_x(I_x), the exponential map is invertible; this inverse is called the logarithm map and we denote it by Exp_x^{-1} y for y ∈ Exp_x(I_x).

(iii) Injectivity Radius: The radius of the maximal geodesic ball centered at x entirely contained in Exp_x(I_x) is called the injectivity radius at x and is denoted inj_x M. Also,

    inj M = inf_x {inj_x M}.

(iv) Convexity Radius: The convexity radius r_c > 0 is defined as

    r_c = (1/2) min{ inj M, π/√κ_max },

where, if κ_max ≤ 0, we set π/√κ_max = +∞. This quantity plays an important role in studying convergence properties, since any open ball with radius r ≤ r_c is convex. A subset X of M is said to be geodesically convex if, given any two points in X, there is a unique geodesic contained within X that joins those two points. Additionally, the function x ↦ d²(y, x), for any fixed y, is (strongly) convex when restricted to B(y, r_c) (Lemma 2.9, page 153, [23]).

(v) Riemannian Gradient and Hessian: Let f : M → ℝ be a smooth function and v ∈ T_x M a tangent vector. We can define the directional derivative of f along v as (d/dt) f(γ(t)) |_{t=0}, using a smooth parametrized curve γ(·) such that γ(0) = x and γ̇(0) = v. The Riemannian gradient of f is then defined as the unique element of T_x M that satisfies

    Df(x)[v] = ⟨grad f(x), v⟩.

The Hessian of f at x, denoted Hess f(x), is defined as the self-adjoint linear operator which satisfies

    (d²/dt²) f(γ(t)) |_{t=0} = ⟨v, Hess f(x) v⟩.

(vi) Parallel/Vector Transport: A parallel transport T_x^y : T_x M → T_y M translates a vector ξ_x ∈ T_x M along the geodesic joining x and y to T_x^y(ξ_x) ∈ T_y M, while preserving the norm and, in some sense, the direction. An important property of parallel transport is that it preserves inner products.
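To make the definition concrete, the following sketch (our own, again on the unit sphere, with the logarithm map re-implemented locally) parallel-transports a tangent vector along a minimal geodesic and checks that norms and inner products are preserved:

```python
import numpy as np

def sphere_log(x, y):
    """Logarithm map on the unit sphere (tangent vector at x toward y)."""
    c = np.clip(x @ y, -1.0, 1.0)
    p = y - c * x
    n = np.linalg.norm(p)
    return np.zeros_like(x) if n < 1e-12 else np.arccos(c) * p / n

def sphere_transport(x, y, xi):
    """Parallel transport of xi in T_x S^2 to T_y S^2 along the minimal
    geodesic: the component along the geodesic direction is rotated,
    the orthogonal component is left unchanged."""
    u = sphere_log(x, y)
    d = np.linalg.norm(u)
    if d < 1e-12:
        return xi.copy()
    e = u / d                                  # unit direction of travel
    a = e @ xi
    return xi - a * e + a * (np.cos(d) * e - np.sin(d) * x)

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
xi1 = np.array([0.0, 1.0, 0.0])    # tangent at x, along the geodesic
xi2 = np.array([0.0, 0.0, 1.0])    # tangent at x, orthogonal to it
t1, t2 = sphere_transport(x, y, xi1), sphere_transport(x, y, xi2)
assert np.allclose(t1, [-1.0, 0.0, 0.0])       # rotated by pi/2
assert np.allclose(t2, xi2)                    # orthogonal part unchanged
assert np.isclose(t1 @ t2, xi1 @ xi2)          # inner products preserved
```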

2.2 Euclidean Distributed Optimization

Suppose we have a network of n nodes/agents indexed by 1, ..., n. We associate with each node i a function f^i : ℝ^d → ℝ. Let f : ℝ^d → ℝ denote

    f(·) := (1/n) ∑_{i=1}^n f^i(·).    (3)

Let the communication network be modelled by a static undirected connected graph G = {V, E}, where V = {1, ..., n} is the node set and E ⊂ V × V is the set of links (i, j) indicating that node j can send information to node i. We associate with the network a non-negative weight matrix Q = [q_{ij}]_{i,j∈V} such that

    q_{ij} > 0 ⟺ (i, j) ∈ E.

We let N_i denote the neighbourhood of node i, i.e., all the nodes j such that q_{ij} > 0. In addition, for the rest of this work, we make the following standard assumptions on the matrix Q:

(i) (Double stochasticity) 1ᵀQ = 1ᵀ and Q1 = 1.

(ii) (Irreducibility and aperiodicity) We assume that the underlying graph is irreducible, i.e., there is a directed path from any node to any other node, and aperiodic, i.e., the g.c.d. of the lengths of all paths from a node to itself is one. It is known that the choice of node in this definition is immaterial. This property can be guaranteed, e.g., by making q_{ii} > 0 for some i.
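A weight matrix satisfying (i) and (ii) can be constructed from any connected undirected graph, for instance with Metropolis–Hastings weights; this particular construction is our illustrative choice and is not prescribed above:

```python
import numpy as np

def metropolis_weights(edges, n):
    """Symmetric doubly stochastic Q from an undirected edge list:
    q_ij = 1/(1 + max(deg_i, deg_j)) for (i, j) in E, q_ii = remainder."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    Q = np.zeros((n, n))
    for i, j in edges:
        Q[i, j] = Q[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(Q, 1.0 - Q.sum(axis=1))   # q_ii > 0 gives aperiodicity
    return Q

# path graph 0-1-2-3: connected, so the chain is irreducible
Q = metropolis_weights([(0, 1), (1, 2), (2, 3)], 4)
assert np.allclose(Q.sum(axis=0), 1.0) and np.allclose(Q.sum(axis=1), 1.0)
assert (np.diag(Q) > 0).all()
```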


Let σ₁(Q) ≥ σ₂(Q) ≥ ··· ≥ σ_n(Q) ≥ 0 denote the singular values of the matrix Q. Let 1_n denote the vector of all ones and Δ_n the unit simplex. Then, for any x ∈ Δ_n and t ≥ 1, we have the following inequality (e.g., eq. (24), [24]):

    ‖Qᵗ x − 1_n/n‖₁ ≤ √n ‖Qᵗ x − 1_n/n‖₂ ≤ √n σ₂ᵗ(Q).

In particular, if q_{ij}^t denotes the ij'th entry of Qᵗ, we have, taking x equal to the i'th coordinate vector e_i,

    |q_{ij}^t − 1/n| ≤ ‖Qᵗ e_i − 1_n/n‖₁ ≤ √n σ₂ᵗ(Q).    (4)

The objective of distributed optimization is to minimize (3) while simultaneously achieving consensus, i.e.,

    minimize   (1/n) ∑_{i=1}^n f^i(x^i)
    subject to ∑_{j∈N_i} q_{ij} ‖x^i − x^j‖² = 0   ∀ i.

One of the earliest works which studied the above problem and proposed a solution was [11]. More recently, [10] considered the constrained case. Broadly speaking, any distributed optimization algorithm will involve the following two steps:

(i) (Consensus step) This step involves local averaging at each node and is aimed at achieving consensus:

    v_k^i = ∑_{j∈N_i} q_{ij} x_k^j.

(ii) (Gradient descent step) This step is the gradient descent part, aimed at minimizing f^i at each node:

    x_{k+1}^i = v_k^i − a_k ∇f^i(v_k^i),

where a_k is a positive scalar decaying to zero at a suitable rate.

The convergence properties of the above algorithm have been extremely well studied for the convex as well as the non-convex case. In Section 3 we modify the above scheme to give an analogous version for Riemannian manifolds.
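The two steps above can be sketched on a toy problem where node i privately holds f^i(x) = ‖x − c_i‖²/2, so the global minimizer of (3) is the mean of the c_i. The ring topology, weights and step sizes below are our own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
c = rng.standard_normal((n, d))            # node i's private data c_i

# ring graph with uniform, doubly stochastic weights
Q = np.zeros((n, n))
for i in range(n):
    Q[i, i] = 0.5
    Q[i, (i + 1) % n] = Q[i, (i - 1) % n] = 0.25

x = rng.standard_normal((n, d))            # one iterate per node
for k in range(1, 2001):
    v = Q @ x                              # (i) consensus: local averaging
    grad = v - c                           # grad f^i(v^i) = v^i - c_i
    x = v - grad / (k + 1)                 # (ii) descent with a_k = 1/(k+1)

# every node approaches the centralized minimizer, the mean of the c_i
assert np.allclose(x, c.mean(axis=0), atol=1e-2)
```

Because Q is doubly stochastic, the network average of the iterates follows a centralized gradient step, while the consensus step shrinks the disagreement between nodes.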

2.3 Consensus on Riemannian Manifolds

The behaviour of consensus algorithms on Riemannian manifolds is, as expected, quite different from the Euclidean case owing to curvature effects. To begin with, only local convergence can be claimed, i.e., a set of measurements located at different nodes can converge to a consensus value only if they lie within a small enough convex neighbourhood. Such a restriction does not exist for Euclidean spaces. Towards proposing a formal procedure to achieve consensus, one can follow suit from the Euclidean case and minimize the following potential function (Section III, [15]):

    ϕ(w) = (1/2) ∑_{{i,j}∈E} d²(w^i, w^j),

where the w^i, w^j are the measurements in question. A Riemannian gradient descent for the above function involves the following iteration:

    w_{k+1}^i = Exp_{w_k^i}(−ε grad_{w_k^i} ϕ(w_k)),    (5)

where ε is an admissible step size depending upon the upper bound μ_max on the Hessian of the function ϕ(·), and grad_{w^i} ϕ(w_k) = −∑_{j∈N_i} Exp_{w^i}^{-1} w^j. The failure to achieve consensus with the above iteration occurs because ϕ has multiple critical points globally, some of which may be local minima corresponding to a non-consensual configuration. This can happen even in the simple case of a circle, where convergence to a consensus configuration can take place only if the measurements lie within a semicircle [25].

By definition, for undirected connected graphs, the global minima of ϕ(·) can be seen to belong to the consensus sub-manifold D:

    D := {(w^1, ..., w^n) ∈ Mⁿ : w^i = w^j ∀ i, j}.

This set is the diagonal space of Mⁿ and it represents all possible consensus configurations of the network. Let us define the following set:

    S = {x ∈ Mⁿ : ∃ y ∈ M s.t. d(y, x^i) < r_c/2 ∀ i},

where r_c is the convexity radius. We note that this implies

    d²(x^i, x^j) ≤ 2d²(x^i, y) + 2d²(y, x^j) ≤ r_c²,

so that all the x^i lie within a convex neighbourhood. On the set S, ϕ(·) has only global minima (Theorem 5, [15]). Thus, one can reasonably expect gradient descent to converge to one, provided the iterates w_k^i generated by (5) stay in S. Assuming the latter, one can confirm the convergence by performing a second-order Taylor expansion of the smooth map ϕ ∘ Exp(·), using the expression w_{k+1}^i = Exp_{w_k^i}(−ε grad_{w_k^i} ϕ(w_k)):¹

    ϕ(w_{k+1}) ≤ ϕ(w_k) + ⟨grad ϕ(w_k), Exp_{w_k}^{-1}(w_{k+1})⟩ + (μ_max/2) ‖grad ϕ(w_k)‖² ε²
              ≤ ϕ(w_k) − ε ‖grad ϕ(w_k)‖² + (μ_max/2) ‖grad ϕ(w_k)‖² ε²
              ≤ ϕ(w_k) − ε (1 − μ_max ε/2) ‖grad ϕ(w_k)‖²,

where μ_max is an upper bound on the Hessian of ϕ (see Prop. 9, [15] for a bound on this quantity). Thus ϕ(w_{k+1}) < ϕ(w_k) as long as ε ∈ (0, 2μ_max^{-1}).
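Iteration (5) can be sketched on the unit sphere, where Exp and Exp⁻¹ have closed forms; the measurements below start inside a small geodesic ball, consistent with the local nature of the guarantee. The step size and complete-graph topology are our own choices:

```python
import numpy as np

def exp_map(x, u):
    nu = np.linalg.norm(u)
    return x.copy() if nu < 1e-12 else np.cos(nu) * x + np.sin(nu) * u / nu

def log_map(x, y):
    c = np.clip(x @ y, -1.0, 1.0)
    p = y - c * x
    n = np.linalg.norm(p)
    return np.zeros_like(x) if n < 1e-12 else np.arccos(c) * p / n

rng = np.random.default_rng(1)
n = 6
base = np.array([0.0, 0.0, 1.0])
# n measurements in a small cap around `base` (inside a convex ball)
w = [exp_map(base, 0.2 * np.append(rng.standard_normal(2), 0.0))
     for _ in range(n)]

eps = 0.1                                  # admissible step, eps < 2/mu_max
for _ in range(300):
    # complete graph: grad_{w_i} phi = -sum_j Exp^{-1}_{w_i} w_j
    grads = [-sum(log_map(w[i], w[j]) for j in range(n) if j != i)
             for i in range(n)]
    w = [exp_map(w[i], -eps * grads[i]) for i in range(n)]

spread = max(np.arccos(np.clip(w[i] @ w[j], -1.0, 1.0))
             for i in range(n) for j in range(n))
assert spread < 1e-6                       # consensus configuration reached
```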

3 A Distributed Optimization Algorithm on Riemannian Manifolds

The pseudo-code of the proposed algorithm to solve (2) is detailed in Algorithm 1. We discuss the steps in the following:

(S1) This constitutes the consensus part of the algorithm. At each node i, the gradient of the function ϕ(·) is evaluated with respect to w_k^i. The estimate at node i moves from the current estimate w_k^i to a new estimate, designated v_k^i, along the exponential map in the direction of the negative gradient with an admissible step size ε.

(S2) The quantity g_k^i helps keep track of the average (Riemannian) gradient ∇f(·) via a 'consensus in the tangent space'. For each node i, this step essentially involves gathering the parallel-transported estimates {g_{k−1}^j}_{j∈N_i} of the neighbours and performing a weighted average of them according to the network weights q_{ij}. To this weighted average, we add the difference of the current gradient ∇f^i(v_k^i) and the previous gradient ∇f^i(v_{k−1}^i). This technique is sometimes referred to as gradient tracking (see e.g. [26], [27], [28]). We note that this step is fully distributed and involves only local computations.

(S3) This step constitutes the optimization part of the algorithm. It is the standard Riemannian gradient descent, performed using the exponential map with a suitably fast decaying step size a_k. The decaying step size is necessary here because of the need to achieve consensus and to provide exact convergence guarantees; for a constant step size, one can only establish convergence to within a neighbourhood of the optimum.

¹ The Taylor expansion is for ϕ(·) : Mⁿ → ℝ, so grad ϕ(w) belongs to T_w Mⁿ.


Algorithm 1: Distributed Optimization on Riemannian Manifolds

Input: manifold M; cost functions f^i(·) for all i; exponential map Exp(·) or retraction R(·); parallel transport T(·); step sizes ε and a_k; network graph G = (V, E) with weight matrix Q.

Initial conditions: initialize w_0 with w_0^i = w_0^j ∀ i, j, and g_0^i := ∇f^i(v_0^i) ∀ i.

For k = 0, 1, 2, ... do:
    At each node i ∈ {1, ..., n} do:

    (S1) [Riemannian consensus step]:
        v_k^i = Exp_{w_k^i}(−ε grad_{w_k^i} ϕ(w_k)).

    (S2) [Gradient update step]:
        g_k^i = ∑_{j=1}^n q_{ij} T_{v_{k−1}^j}^{v_k^i} g_{k−1}^j + ∇f^i(v_k^i) − T_{v_{k−1}^i}^{v_k^i} ∇f^i(v_{k−1}^i).

    (S3) [Gradient descent step]:
        w_{k+1}^i = Exp_{v_k^i}(−a_k g_k^i).

    k ← k + 1
end
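As an end-to-end illustration, the sketch below runs the three steps on the sphere for a distributed leading-eigenvector (PCA-type) problem, with f^i(x) = −xᵀA_i x/2 held at node i. The sphere maps, the q_ij-weighted form of the consensus gradient, the data, the step sizes and the initialization near the optimum (the guarantees are local) are all our own illustrative choices:

```python
import numpy as np

def exp_m(x, u):
    nu = np.linalg.norm(u)
    return x.copy() if nu < 1e-12 else np.cos(nu) * x + np.sin(nu) * u / nu

def log_m(x, y):
    c = np.clip(x @ y, -1.0, 1.0)
    p = y - c * x
    n = np.linalg.norm(p)
    return np.zeros_like(x) if n < 1e-12 else np.arccos(c) * p / n

def transport(x, y, xi):
    """Parallel transport along the minimal geodesic from x to y."""
    u = log_m(x, y)
    d = np.linalg.norm(u)
    if d < 1e-12:
        return xi.copy()
    e = u / d
    a = e @ xi
    return xi - a * e + a * (np.cos(d) * e - np.sin(d) * x)

def rgrad(A, x):
    """Riemannian gradient of f(x) = -x^T A x / 2 on the sphere."""
    g = -A @ x
    return g - (x @ g) * x

rng = np.random.default_rng(2)
n, dim = 4, 5
data = rng.standard_normal((40, dim)) * np.array([3.0, 1, 1, 1, 1])
shards = np.array_split(data, n)              # node i only sees shard i
A = [s.T @ s / len(s) for s in shards]
lead = np.linalg.eigh(sum(A))[1][:, -1]       # centralized answer, for checking

Q = np.zeros((n, n))                          # doubly stochastic ring
for i in range(n):
    Q[i, i] = 0.5
    Q[i, (i + 1) % n] = Q[i, (i - 1) % n] = 0.25

x0 = lead + 0.3 * rng.standard_normal(dim)    # local initialization
x0 /= np.linalg.norm(x0)
w = [x0.copy() for _ in range(n)]             # w_0^i equal across nodes
v = [x.copy() for x in w]
g = [rgrad(A[i], v[i]) for i in range(n)]     # g_0^i = grad f^i(v_0^i)
eps = 0.5
for k in range(1, 501):
    # (S1) consensus step (q_ij-weighted version of grad phi)
    v_new = [exp_m(w[i], eps * sum(Q[i, j] * log_m(w[i], w[j])
                                   for j in range(n) if j != i))
             for i in range(n)]
    # (S2) gradient tracking in the tangent spaces
    g = [sum(Q[i, j] * transport(v[j], v_new[i], g[j]) for j in range(n))
         + rgrad(A[i], v_new[i]) - transport(v[i], v_new[i], rgrad(A[i], v[i]))
         for i in range(n)]
    v = v_new
    # (S3) descent with a decaying step a_k
    w = [exp_m(v[i], -(0.1 / k) * g[i]) for i in range(n)]

# all nodes align (up to sign) with the leading eigenvector of sum_i A_i
assert all(abs(w[i] @ lead) > 0.95 for i in range(n))
```

Note that f^i here is not geodesically convex on the whole sphere, so this sketch only illustrates the mechanics of the three steps, initialized in a neighbourhood where the problem is well behaved.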

Alternatively, one could use a (second-order) retraction map instead of Exp(·) in steps (S1) and (S3) of Algorithm 1, with essentially no change in the behaviour of the algorithm. The main advantage of a retraction is a lower computational overhead. A retraction on M is a smooth mapping R : TM → M, where TM is the tangent bundle, with the following properties:

i) R_x(0_x) = x, where R_x is the restriction of the retraction to T_x M and 0_x denotes the zero element of T_x M.

ii) With the canonical identification T_{0_x} T_x M ≃ T_x M, R_x satisfies

    DR_x(0_x) = id_{T_x M},

where id_{T_x M} denotes the identity mapping on T_x M.

The fact that a retraction mapping does not affect the behaviour of the algorithm hinges on two properties. The first of these is

    d(x, R_x(ξ_x)) ≤ c ‖ξ_x‖,

which obviously holds for the exponential map, with equality for c = 1. The second is

    d(R_x(t ξ_x), Exp_x(t ξ_x)) = O(t²),

i.e., the retraction is a second-order one. In a related notion, generalizing parallel transport, a vector transport associated with a retraction R is defined as a smooth mapping (⊕ denotes the Whitney sum; we refer the reader to page 169, [4] for the exact definition)

    TM ⊕ TM → TM : (η_x, ξ_x) ↦ T_{η_x}(ξ_x) ∈ TM

satisfying:

i) if π(·) denotes the foot of a tangent vector, then π(R_x(η_x)) = π(T_{η_x}(ξ_x));
ii) T_{η_x} : T_x M → T_{R(η_x)} M is a linear map;
iii) T_{0_x}(ξ_x) = ξ_x.

Vector transports are in general more computationally appealing than parallel transport. The vector transport is well defined as long as we stay within the injectivity radius defined with respect to the retraction.


4 Convergence Analysis

In this section, we analyze the convergence properties of Algorithm 1 for convex optimization on Riemannian manifolds. This class of problems has found applications in a wide variety of fields (see e.g., Chapter 4, [7]; Chapter 7, [29]). To be more precise, we consider problems where each f^i(·) (and hence f(·), see Theorem 3.3, [7]) is geodesically convex and x ∈ X ⊂ M, with X a compact convex set. A function f : M → ℝ is said to be geodesically convex if for any x, y ∈ M, a geodesic γ such that γ(0) = x and γ(1) = y, and t ∈ [0, 1], the following holds:

    f(γ(t)) ≤ (1 − t) f(x) + t f(y).

An equivalent definition of geodesic convexity is the existence of a vector ∂f(x) ∈ T_x M such that

    f(y) ≥ f(x) + ⟨∂f(x), Exp_x^{-1} y⟩.    (6)

We note that for a differentiable f : M → ℝ, we have ∂f(x) = ∇f(x), where ∇f is the Riemannian gradient. We remark that if a retraction is used, the concept of convexity has to be defined accordingly (see Definition 3.1, [30]).

Throughout this section we will assume that f is geodesically L_f-smooth, by which we mean that for any x, y ∈ M,

    f(y) ≤ f(x) + ⟨∇f(x), Exp_x^{-1}(y)⟩ + (L_f/2) ‖Exp_x^{-1}(y)‖².    (7)

This can be shown to imply

    ‖∇f(x) − T_y^x ∇f(y)‖ ≤ L_f d(x, y).    (8)

Since X is compact, we can assume that the exponential map is invertible (Hopf–Rinow theorem). We remark that these conditions are standard assumptions in Riemannian optimization, and most practical problems of interest satisfy them.

Using the stacked vector notation, steps (S1) and (S3) of Algorithm 1 can be written as:

    v_k = Exp_{w_k}(−ε grad ϕ(w_k)),
    w_{k+1} = Exp_{v_k}(−a_k g_k),

where

    grad ϕ(w_k) = (grad_{w_k^1} ϕ(w_k), ···, grad_{w_k^n} ϕ(w_k)) ∈ T_{w_k} Mⁿ

and

    Exp_{w_k}(−ε grad ϕ(w_k)) := (Exp_{w_k^1}(−ε grad_{w_k^1} ϕ(w_k)), ···, Exp_{w_k^n}(−ε grad_{w_k^n} ϕ(w_k))).

Since we assume X is compact, we have from the smoothness of f:

    ‖∇f(v_k)‖ ≤ C    (9)

for some positive constant C < ∞. We also take ‖g_k^i‖ ≤ C. This again is a reasonable assumption given the boundedness of ∇f(v_k); see Lemma 3 for a bound on the difference between g_k^i and ∇f(·). We first prove that Algorithm 1 achieves consensus.

4.1 Consensus

The following lemma shows that Algorithm 1 asymptotically achieves consensus if the step sizes are properly chosen.

Lemma 1. Suppose ε ∈ (0, 2μ_max^{-1}), ∑_{k≥1} a_k = ∞ and ∑_{k≥0} a_k² < ∞. Then we have

    lim_k ‖grad ϕ(w_k)‖² = 0,    (10)

so that

    w_k → D as k → ∞,

i.e.,

    d(w_k^i, w_k^j) → 0 as k → ∞, ∀ i, j.


Proof. We begin by using the smoothness property of ϕ(·) within X:

    ϕ(v_k) ≤ ϕ(w_k) − ε ‖grad ϕ(w_k)‖² + (μ_max/2) ε² ‖grad ϕ(w_k)‖²
           ≤ ϕ(w_k) − ε (1 − μ_max ε/2) ‖grad ϕ(w_k)‖²,    (11)

where we used the fact that Exp_{w_k}^{-1}(v_k) = −ε grad ϕ(w_k). Similarly, using w_{k+1} = Exp_{v_k}(−a_k g_k) gives:

    ϕ(w_{k+1}) ≤ ϕ(v_k) − a_k ⟨grad ϕ(v_k), g_k⟩ + (μ_max/2) a_k² ‖g_k‖²
              ≤ ϕ(v_k) + (η/2) ‖grad ϕ(v_k)‖² + (1/(2η)) a_k² ‖g_k‖² + (μ_max/2) a_k² ‖g_k‖²,    (12)

where we have used the fact that 2⟨a, b⟩ ≤ η ‖a‖² + ‖b‖²/η for any η > 0. We have

    d²(w_k, v_k) = d²(w_k, Exp_{w_k}(−ε grad ϕ(w_k))) ≤ ε² ‖grad ϕ(w_k)‖².    (13)

Let T_w^v grad ϕ(w) := (T_{w^1}^{v^1} grad_{w^1} ϕ(w), ..., T_{w^n}^{v^n} grad_{w^n} ϕ(w)). The smoothness of ϕ(·) implies

    ‖grad ϕ(v_k) − T_{w_k}^{v_k} grad ϕ(w_k)‖ ≤ L_ϕ d(v_k, w_k),    (14)

where L_ϕ is the Lipschitz constant of grad ϕ(·). We remark that the above inequality uses the fact that the Hessian of ϕ has a bounded operator norm μ_max, which implies Lipschitz continuity of grad ϕ with Lipschitz constant L_ϕ ≥ μ_max (Corollary 10.41, [29]). Combining (13) and (14), we have the bound

    ‖grad ϕ(v_k) − T_{w_k}^{v_k} grad ϕ(w_k)‖² ≤ ε² L_ϕ² ‖grad ϕ(w_k)‖².    (15)

We use (15) in (12) as follows:

    ϕ(w_{k+1}) ≤ ϕ(v_k) + (η/2) ‖(grad ϕ(v_k) − T_{w_k}^{v_k} grad ϕ(w_k)) + T_{w_k}^{v_k} grad ϕ(w_k)‖² + (1/(2η) + μ_max/2) C² a_k²
              ≤ ϕ(v_k) + η ‖grad ϕ(v_k) − T_{w_k}^{v_k} grad ϕ(w_k)‖² + η ‖grad ϕ(w_k)‖² + (1/(2η) + μ_max/2) C² a_k²
              ≤ ϕ(v_k) + η (1 + ε² L_ϕ²) ‖grad ϕ(w_k)‖² + (1/(2η) + μ_max/2) C² a_k²
              ≤ ϕ(w_k) − ε (1 − μ_max ε/2) ‖grad ϕ(w_k)‖² + η (1 + ε² L_ϕ²) ‖grad ϕ(w_k)‖² + (1/(2η) + μ_max/2) C² a_k²,

where the second inequality uses ‖a + b‖² ≤ 2‖a‖² + 2‖b‖² together with the fact that parallel transport preserves norms, and the last inequality uses (11). Setting η = ε/(2(1 + ε² L_ϕ²)), we get

    ϕ(w_{k+1}) ≤ ϕ(w_k) − (ε/2) (1 − μ_max ε/2) ‖grad ϕ(w_k)‖² + ((1 + ε² L_ϕ²)/ε + μ_max/2) C² a_k².

Re-arranging gives

    (ε/2) (1 − μ_max ε/2) ‖grad ϕ(w_k)‖² ≤ (ϕ(w_k) − ϕ(w_{k+1})) + ((1 + ε² L_ϕ²)/ε + μ_max/2) C² a_k².    (16)

Summing the above from k = 0 to ℓ gives:

    (ε/2) (1 − μ_max ε/2) ∑_{k=0}^ℓ ‖grad ϕ(w_k)‖² ≤ (ϕ(w_0) − ϕ(w_{ℓ+1})) + ((1 + ε² L_ϕ²)/ε + μ_max/2) C² ∑_{k=0}^ℓ a_k²
        ≤ ((1 + ε² L_ϕ²)/ε + μ_max/2) C² ∑_{k=0}^ℓ a_k²,

where we used the facts that ϕ(w_0) = 0 and ϕ(w_{ℓ+1}) ≥ 0 to obtain the last inequality. Taking ℓ → ∞ gives:

    (ε/2) (1 − μ_max ε/2) ∑_{k=0}^∞ ‖grad ϕ(w_k)‖² ≤ ((1 + ε² L_ϕ²)/ε + μ_max/2) C² ∑_{k=0}^∞ a_k² < ∞,

which implies lim_k ‖grad ϕ(w_k)‖² = 0, since ε < 2μ_max^{-1}. This proves the result, since lim_k ‖grad ϕ(w_k)‖ = 0 implies that any limit point of w_k is a (global) minimum of ϕ. ∎

4.2 Convergence Analysis for Geodesically Convex Functions

To prove the main convergence result, we first establish some preliminary results.

Lemma 2. The sequence {grad ϕ(w_k)}_{k≥0} satisfies the following relation:

    ε ‖grad ϕ(w_k)‖² ≤ (2/(1 − μ_max ε/2)) (ϕ(w_k) − ϕ(w_{k+1}))
        + (2/(1 − μ_max ε/2)) ((1 + ε² L_ϕ²)/ε + μ_max/2) C² a_k².    (17)

Proof. This is just equation (16). ∎

We recall the notation ∇f(x) := (1/n) ∑_{i=1}^n ∇f^i(x). In the next lemma, we examine the sequence {g_k^i}_{k≥0}. To prove it, we assume the following mild condition on the step sizes:

    a_k / a_{k+1} ≤ a

for some constant a. As an example, for a_k = 1/k, a can be taken equal to 2.

Lemma 3. For all i, the sequence {g_k^i}_{k≥0} satisfies the following relation:

    ∑_{m=0}^k a_m ‖g_m^i − ∇f(v_m^i)‖ ≤ (√n M/(1 − σ₂(Q))) ∑_{m=0}^k a_m²,

where

    M := 1 + aC + (2ε (1 + (9/2) ε²)/(1 − μ_max ε/2)) ((1 + ε² L_ϕ²)/ε + μ_max/2) C².    (18)

Proof. The proof is provided in Appendix 5.2. ∎

The next result establishes the local convexity of ϕ(·).

Lemma 4. Let x, y ∈ Mⁿ be any two points with d(x^i, y^j) < r* for all components x^i, y^j, i, j = 1, ..., n. Then we have

    ϕ(y) ≥ ϕ(x) + ⟨grad_x ϕ(x), Exp_x^{-1}(y)⟩

for any r* ∈ (0, ∞) if the upper bound on the sectional curvature satisfies κ_max ≤ 0, and for any r* ∈ (0, π/(2√κ_max)) if κ_max > 0.

Proof. The proof is provided in Appendix II. ∎

Before stating the main result, we recall the following lemma (see e.g. Lemma 5, [8] or [31]), which states the law of cosines for non-Euclidean geometry:

Lemma 5. If a, b, c are the side lengths of a geodesic triangle in a Riemannian manifold and A is the angle between sides b and c (defined through the inverse exponential map and the inner product in the tangent space), then

    a² ≤ ξ b² − 2bc cos(A) + c²,    (19)

where

    ξ := √|κ_min| D / tanh(√|κ_min| D),    (20)

with max_{x,y∈M} d(x, y) ≤ D and κ_min a lower bound on the sectional curvature.

We are now in a position to prove the main result, which we can use to prove the convergence of Algorithm 1. We use the notation f(v) = n^{-1}Σ_{i=1}^{n} f^i(v) for v ∈ M and f(\mathbf{v}) := n^{-1}Σ_{i=1}^{n} f^i(v^i) for \mathbf{v} ∈ M^n.

Theorem 6. Suppose ε ∈ (0, 2µ_max^{-1}), Σ_{k≥0} a_k = ∞ and Σ_{k≥0} a_k² < ∞. Let w* denote an optimal point of (2) belonging to X, where X ⊂ B(w*, r*), with r* as in Lemma 4. Then the objective value sequence {f(\mathbf{v}_k)}_{k≥0} satisfies the bound:

\[
\sum_{m=0}^{k} 2a_m\big(f(\mathbf{v}_m)-f(w^*)\big) \le d^2(\mathbf{w}_0,\mathbf{w}^*) + \left\{\frac{2\sqrt{n}MD}{1-\sigma_2(Q)}+\xi C^2+\frac{2}{1-\frac{\mu_{\max}\varepsilon}{2}}\Big(\frac{1+\varepsilon^2L_\varphi^2}{\varepsilon}+\frac{\mu_{\max}}{2}\Big)\varepsilon\xi C^2\right\}\sum_{m=0}^{k}a_m^2. \tag{21}
\]

Proof. Consider the geodesic triangle determined by w^i_{k+1}, v^i_k and w*. Since w^i_{k+1} = Exp_{v^i_k}(−a_k g^i_k), the angle ∠w^i_{k+1}v^i_k w* satisfies:

\[ \cos(\angle w^i_{k+1}v^i_kw^*)\,\|\mathrm{Exp}^{-1}_{v^i_k}(w^i_{k+1})\|\,d(v^i_k,w^*) = \langle -a_kg^i_k,\ \mathrm{Exp}^{-1}_{v^i_k}(w^*)\rangle. \]

Then, using Lemma 5 with a = d(w^i_{k+1}, w*), b = ‖Exp^{-1}_{v^i_k}(w^i_{k+1})‖, c = d(v^i_k, w*) and A = ∠w^i_{k+1}v^i_k w*, we have

\begin{align*}
d^2(w^i_{k+1},w^*) &\le d^2(v^i_k,w^*) + 2a_k\langle g^i_k,\ \mathrm{Exp}^{-1}_{v^i_k}(w^*)\rangle + \xi\|\mathrm{Exp}^{-1}_{v^i_k}(w^i_{k+1})\|^2\\
&\le d^2(v^i_k,w^*) + 2a_k\langle\nabla f(v^i_k),\ \mathrm{Exp}^{-1}_{v^i_k}(w^*)\rangle + 2a_k\langle g^i_k-\nabla f(v^i_k),\ \mathrm{Exp}^{-1}_{v^i_k}(w^*)\rangle + \xi a_k^2\|g^i_k\|^2\\
&\le d^2(v^i_k,w^*) + 2a_k\big(f(w^*)-f(v^i_k)\big) + 2a_k\langle g^i_k-\nabla f(v^i_k),\ \mathrm{Exp}^{-1}_{v^i_k}(w^*)\rangle + \xi a_k^2\|g^i_k\|^2,
\end{align*}

which gives

\[
2a_k\big(f(v^i_k)-f(w^*)\big) \le d^2(v^i_k,w^*) - d^2(w^i_{k+1},w^*) + 2a_kD\|g^i_k-\nabla f(v^i_k)\| + \xi C^2a_k^2.
\]

Summing over i gives:

\[
2a_k\big(f(\mathbf{v}_k)-f(w^*)\big) \le d^2(\mathbf{v}_k,\mathbf{w}^*) - d^2(\mathbf{w}_{k+1},\mathbf{w}^*) + 2a_kn^{-1}D\sum_{i=1}^{n}\|g^i_k-\nabla f(v^i_k)\| + \xi C^2a_k^2, \tag{22}
\]

where f(\mathbf{v}_k) := n^{-1}Σ_{i=1}^{n} f^i(v^i_k). Similarly, with \mathbf{v}_k = Exp_{\mathbf{w}_k}(−ε grad_{\mathbf{w}_k}ϕ(\mathbf{w}_k)), we have:

\begin{align*}
d^2(\mathbf{v}_k,\mathbf{w}^*) &\le d^2(\mathbf{w}_k,\mathbf{w}^*) + \varepsilon\langle\mathrm{grad}\,\varphi(\mathbf{w}_k),\ \mathrm{Exp}^{-1}_{\mathbf{w}_k}(\mathbf{w}^*)\rangle + \xi\|\mathrm{Exp}^{-1}_{\mathbf{w}_k}(\mathbf{v}_k)\|^2\\
&\le d^2(\mathbf{w}_k,\mathbf{w}^*) + \varepsilon\big(\varphi(\mathbf{w}^*)-\varphi(\mathbf{w}_k)\big) + \xi\|\mathrm{Exp}^{-1}_{\mathbf{w}_k}(\mathbf{v}_k)\|^2\\
&\le d^2(\mathbf{w}_k,\mathbf{w}^*) + \xi\varepsilon^2\|\mathrm{grad}\,\varphi(\mathbf{w}_k)\|^2,
\end{align*}

where we have used Lemma 4 in the second inequality, and the fact that ϕ(w*) = 0 with w* := (w*, ⋯, w*) and ϕ(w_k) ≥ 0 for the third inequality. Using Lemma 2, we have

\[
d^2(\mathbf{v}_k,\mathbf{w}^*) \le d^2(\mathbf{w}_k,\mathbf{w}^*) + \frac{2\xi\varepsilon}{1-\frac{\mu_{\max}\varepsilon}{2}}\big(\varphi(\mathbf{w}_k)-\varphi(\mathbf{w}_{k+1})\big) + \frac{2}{1-\frac{\mu_{\max}\varepsilon}{2}}\Big(\frac{1+\varepsilon^2L_\varphi^2}{\varepsilon}+\frac{\mu_{\max}}{2}\Big)\xi\varepsilon C^2a_k^2.
\]

Plugging this expression for d²(\mathbf{v}_k, \mathbf{w}*) in (22) gives:

\begin{align*}
2a_k\big(f(\mathbf{v}_k)-f(w^*)\big) \le{}& d^2(\mathbf{w}_k,\mathbf{w}^*) - d^2(\mathbf{w}_{k+1},\mathbf{w}^*) + 2a_kn^{-1}D\sum_{i=1}^{n}\|g^i_k-\nabla f(v^i_k)\| + \xi C^2a_k^2\\
&+ \frac{2\xi\varepsilon}{1-\frac{\mu_{\max}\varepsilon}{2}}\big(\varphi(\mathbf{w}_k)-\varphi(\mathbf{w}_{k+1})\big) + \frac{2}{1-\frac{\mu_{\max}\varepsilon}{2}}\Big(\frac{1+\varepsilon^2L_\varphi^2}{\varepsilon}+\frac{\mu_{\max}}{2}\Big)\xi\varepsilon C^2a_k^2.
\end{align*}

Summing from m = 0 to k and using Lemma 3 gives:

\[
\sum_{m=0}^{k}2a_m\big(f(\mathbf{v}_m)-f(w^*)\big) \le d^2(\mathbf{w}_0,\mathbf{w}^*) + \left\{\frac{2\sqrt{n}MD}{1-\sigma_2(Q)}+\xi C^2+\frac{2}{1-\frac{\mu_{\max}\varepsilon}{2}}\Big(\frac{1+\varepsilon^2L_\varphi^2}{\varepsilon}+\frac{\mu_{\max}}{2}\Big)\varepsilon\xi C^2\right\}\sum_{m=0}^{k}a_m^2.
\]

Using (21) and Lemma 1, we can get an upper bound on the objective values f(v^i_k). We begin by noting that |f(v^i_k) − f(v^j_k)| ≤ L d(v^i_k, v^j_k) for some constant L, so that

\begin{align*}
\sum_{m=0}^{k}2a_m\big(f(v^i_m)-f(w^*)\big) \le{}& 2\sum_{m=0}^{k}a_mL\sum_{j=1}^{n}d(v^i_m,v^j_m) + d^2(\mathbf{w}_0,\mathbf{w}^*)\\
&+ \left\{\frac{2\sqrt{n}MD}{1-\sigma_2(Q)}+\xi C^2+\frac{2}{1-\frac{\mu_{\max}\varepsilon}{2}}\Big(\frac{1+\varepsilon^2L_\varphi^2}{\varepsilon}+\frac{\mu_{\max}}{2}\Big)\varepsilon\xi C^2\right\}\sum_{m=0}^{k}a_m^2, \tag{23}
\end{align*}

which gives

\[
\inf_{0\le m\le k} f(v^i_m) \le f(w^*) + L\sup_{0\le m\le k}\sum_{j=1}^{n}d(v^i_m,v^j_m) + \frac{1}{2\sum_{m=0}^{k}a_m}\left(d^2(\mathbf{w}_0,\mathbf{w}^*) + \left\{\frac{2\sqrt{n}MD}{1-\sigma_2(Q)}+\xi C^2+\frac{2}{1-\frac{\mu_{\max}\varepsilon}{2}}\Big(\frac{1+\varepsilon^2L_\varphi^2}{\varepsilon}+\frac{\mu_{\max}}{2}\Big)\varepsilon\xi C^2\right\}\sum_{m=0}^{k}a_m^2\right).
\]

This implies, since Σ_{k≥0} a_k² < ∞, Σ_{k≥0} a_k = ∞ and d(v^i_k, v^j_k) → 0, that

\[ \liminf_{k\to\infty} f(v^i_k) = f(w^*) \quad \forall i. \]

4.3 Stochastic Gradient Optimization

In this part, we analyze the algorithm for stochastic optimization. This means that we do not have access to the exact gradient, but only to an unbiased sample of it. An important motivating example of this scenario is the case when f^i at each node is itself a sum of functions,

\[ f^i(x) = \frac{1}{n_i}\sum_{p=1}^{n_i} f_p(x), \]


and to calculate the gradient, one may sample a function f_ω uniformly at random from the set {f_p}_{p=1}^{n_i}. This implies:

\[ \mathbb{E}[\nabla f_\omega(x)\,|\,x] = \mathbb{E}_\omega[\nabla f_\omega(x)] = \frac{1}{n_i}\sum_{p=1}^{n_i}\nabla f_p(x), \]

where E[·|x] denotes the conditional expectation given x. The above situation is quite common in machine learning with large distributed datasets, where stochastic optimization is preferable, or in online optimization, where the samples arrive in real time. Accordingly, in this section we assume that we have unbiased gradients with finite variance, i.e., for all i, at any instant k we have access to an unbiased sample f^{k,i} such that

\[ \mathbb{E}_k\nabla f^{k,i}(v^i_k) = \nabla f^i(v^i_k) \quad\text{and}\quad \mathbb{E}_k\|\nabla f^{k,i}(v^i_k)-\nabla f^i(v^i_k)\|^2 \le \sigma^2, \tag{24} \]

where E_k is the conditional expectation given the sigma algebra σ(\mathbf{v}_0, ⋯, \mathbf{v}_k) generated by all the past information of the algorithm. We assume that {f^{k,i}}_{k≥0} are conditionally independent, without any loss of generality; e.g., for the case of sampling from a dataset, this would amount to sampling f_ω independently at each node for all k. We note that the gradient tracking step (S2) in Algorithm 1 gets modified to:

\[ g^i_k = \sum_{j=1}^{n} q_{ij}\,T^{v^i_k}_{v^j_{k-1}} g^j_{k-1} + \nabla f^{k,i}(v^i_k) - T^{v^i_k}_{v^i_{k-1}}\nabla f^{k-1,i}(v^i_{k-1}). \tag{25} \]

Lemma 7. Let ĝ^i_k denote the sequence generated by the tracking recursion (26) driven by the exact gradients. For all i, we have the following relation:

\[ \mathbb{E}_k\,g^i_k = \sum_{j=1}^{n}q_{ij}T^{v^i_k}_{v^j_{k-1}}g^j_{k-1} + \nabla f^i(v^i_k) - T^{v^i_k}_{v^i_{k-1}}\nabla f^i(v^i_{k-1}) \quad\text{and}\quad \mathbb{E}\|g^i_k-\hat g^i_k\|^2 \le \Big(\frac{3n+1}{1-q^2_{\max}}\Big)\sigma^2, \]

where q_max = max_{i,j} q_{ij} < 1.

Proof. Taking conditional expectations in (25), one can see from an easy induction argument that:

\[ \mathbb{E}_k\,g^i_k = \sum_{j=1}^{n}q_{ij}T^{v^i_k}_{v^j_{k-1}}g^j_{k-1} + \nabla f^i(v^i_k) - T^{v^i_k}_{v^i_{k-1}}\nabla f^i(v^i_{k-1}). \]

Let ξ^{k,i}(v^i_k) := ∇f^{k,i}(v^i_k) − ∇f^i(v^i_k). We note, using (24), that the ξ^{k,i} are zero mean and conditionally independent, with conditional variance Var(ξ^{k,i}(v^i_k)) ≤ σ². Then, performing a recursion in (25) and subtracting (27) (see Appendix I), we have

\begin{align*}
g^i_k - \hat g^i_k ={}& \sum_{j=1}^{n} q^{k}_{ij}T^{v^i_k}_{v^j_0}\xi^{0,j} + \sum_{p=1}^{k-1}\sum_{j=1}^{n} q^{k-p}_{ij}\big(T^{v^i_k}_{v^j_p}\xi^{p,j} - T^{v^i_k}_{v^j_{p-1}}\xi^{p-1,j}\big) + \xi^{k,i} - T^{v^i_k}_{v^i_{k-1}}\xi^{k-1,i}\\
={}& \sum_{p=0}^{k-2}\sum_{j=1}^{n}\big(q^{k-p}_{ij} - q^{k-p-1}_{ij}\big)T^{v^i_k}_{v^j_p}\xi^{p,j} + \sum_{j=1}^{n}q_{ij}T^{v^i_k}_{v^j_{k-1}}\xi^{k-1,j} + \xi^{k,i} - T^{v^i_k}_{v^i_{k-1}}\xi^{k-1,i}.
\end{align*}

Recalling the fact that Var(c₁X₁ + c₂X₂) = c₁²Var(X₁) + c₂²Var(X₂) for independent X₁, X₂, we have

\[
\mathbb{E}_k\|g^i_k-\hat g^i_k\|^2 \le n\sum_{p=1}^{k-1}\big(q^{2(k-p)}_{ij}+q^{2(k-p-1)}_{ij}\big)\sigma^2 + (n+1)\sigma^2 \le \frac{2n\sigma^2}{1-q^2_{\max}} + (n+1)\sigma^2,
\]

which proves the result.

We note that the previous lemma implies:

\[ \mathbb{E}\|g^i_k\|^2 \le R^2 := 2\Big(C^2 + \Big(\frac{3n+1}{1-q^2_{\max}}\Big)\sigma^2\Big), \]

since ‖ĝ^i_k‖² ≤ C². We next prove the stochastic analog of Lemma 2:


Lemma 8. We have the following descent relation:

\[
\varepsilon\,\mathbb{E}\|\mathrm{grad}\,\varphi(\mathbf{w}_k)\|^2 \;\le\; \frac{2}{1-\frac{\mu_{\max}\varepsilon}{2}}\left\{\big(\mathbb{E}\varphi(\mathbf{w}_k)-\mathbb{E}\varphi(\mathbf{w}_{k+1})\big)+\Big(\frac{1+\varepsilon^2L_\varphi^2}{\varepsilon}+\frac{\mu_{\max}}{2}\Big)R^2a_k^2\right\}.
\]

Proof. We note that

\[ \varphi(\mathbf{w}_{k+1}) \le \varphi(\mathbf{v}_k) + a_k\langle\mathrm{grad}_{\mathbf{v}_k}\varphi(\mathbf{v}_k),\ \mathbf{g}_k\rangle + \frac{\mu_{\max}\|\mathbf{g}_k\|^2}{2}a_k^2. \]

Taking conditional expectations and using Lemma 7, we get

\[ \mathbb{E}_k\varphi(\mathbf{w}_{k+1}) \le \varphi(\mathbf{v}_k) + a_k\langle\mathrm{grad}_{\mathbf{v}_k}\varphi(\mathbf{v}_k),\ \mathbb{E}_k\mathbf{g}_k\rangle + \frac{\mu_{\max}R^2}{2}a_k^2. \]

Recalling the Taylor series expansion of ϕ(\mathbf{v}_k), we have

\[ \varphi(\mathbf{v}_k) \le \varphi(\mathbf{w}_k) - \varepsilon\Big(1-\frac{\mu_{\max}\varepsilon}{2}\Big)\|\mathrm{grad}_{\mathbf{w}_k}\varphi(\mathbf{w}_k)\|^2. \]

Thus, combining the previous two inequalities, we get the following:

\[ \mathbb{E}_k\varphi(\mathbf{w}_{k+1}) \le \varphi(\mathbf{w}_k) - \varepsilon\Big(1-\frac{\mu_{\max}\varepsilon}{2}\Big)\|\mathrm{grad}_{\mathbf{w}_k}\varphi(\mathbf{w}_k)\|^2 + a_k\langle\mathrm{grad}_{\mathbf{v}_k}\varphi(\mathbf{v}_k),\ \mathbb{E}_k\mathbf{g}_k\rangle + \frac{\mu_{\max}R^2}{2}a_k^2. \]

Taking full expectations in the above gives:

\[ \mathbb{E}\varphi(\mathbf{w}_{k+1}) \le \mathbb{E}\varphi(\mathbf{w}_k) - \varepsilon\Big(1-\frac{\mu_{\max}\varepsilon}{2}\Big)\mathbb{E}\|\mathrm{grad}_{\mathbf{w}_k}\varphi(\mathbf{w}_k)\|^2 + a_k\,\mathbb{E}\langle\mathrm{grad}_{\mathbf{v}_k}\varphi(\mathbf{v}_k),\ \mathbf{g}_k\rangle + \frac{\mu_{\max}R^2}{2}a_k^2.
\]

The rest of the proof is exactly the same as that of Lemma 2.

We are now in a position to state the main result. The convergence analysis is along the same lines as in the previous section, with some important modifications.

Theorem 9. Assume condition (24), ε ∈ (0, 2µ_max^{-1}), Σ_{k≥0} a_k = ∞ and Σ_{k≥0} a_k² < ∞. Then, with probability 1,

\[ \liminf_{k}\, f(v^i_k) = f(w^*). \]

Proof. We begin by taking conditional expectations in the bound

\[ d^2(w^i_{k+1},w^*) \le d^2(v^i_k,w^*) + 2a_k\langle g^i_k,\ \mathrm{Exp}^{-1}_{v^i_k}(w^*)\rangle + \xi\|\mathrm{Exp}^{-1}_{v^i_k}(w^i_{k+1})\|^2, \]

which gives

\[ \mathbb{E}_k\,d^2(w^i_{k+1},w^*) \le d^2(v^i_k,w^*) + 2a_k\langle \mathbb{E}_k g^i_k,\ \mathrm{Exp}^{-1}_{v^i_k}(w^*)\rangle + \xi a_k^2\,\mathbb{E}_k\|g^i_k\|^2. \]

Taking full expectations and using the geodesic convexity of f(·), we have

\[ \mathbb{E}\,d^2(\mathbf{w}_{k+1},\mathbf{w}^*) \le \mathbb{E}\,d^2(\mathbf{v}_k,\mathbf{w}^*) + 2a_k\,\mathbb{E}\big(f(w^*)-f(\mathbf{v}_k)\big) + 2a_kn^{-1}D\sum_{i=1}^{n}\mathbb{E}\|\hat g^i_k-\nabla f(v^i_k)\| + \xi R^2a_k^2. \]

The analogous expression for \mathbf{v}_k gives

\[ \mathbb{E}\,d^2(\mathbf{v}_k,\mathbf{w}^*) \le \mathbb{E}\,d^2(\mathbf{w}_k,\mathbf{w}^*) + \xi\varepsilon^2\,\mathbb{E}\|\mathrm{grad}_{\mathbf{w}_k}\varphi(\mathbf{w}_k)\|^2, \]

where we use the convexity of ϕ(·) and the fact that ϕ(w*) = 0 with w* := (w*, ⋯, w*) and Eϕ(w_k) ≥ 0. Combining the previous two expressions, we have, after performing the same manipulations as in Theorem 6:

\[
\sum_{m=0}^{k}2a_m\big(\mathbb{E}f(v^i_m)-f(w^*)\big) \le 2\sum_{m=0}^{k}a_mL\sum_{j=1}^{n}\mathbb{E}\,d(v^i_m,v^j_m) + d^2(\mathbf{w}_0,\mathbf{w}^*) + \left\{\frac{2\sqrt{n}MD}{1-\sigma_2(Q)}+\xi R^2+\frac{2}{1-\frac{\mu_{\max}\varepsilon}{2}}\Big(\frac{1+\varepsilon^2L_\varphi^2}{\varepsilon}+\frac{\mu_{\max}}{2}\Big)\varepsilon\xi R^2\right\}\sum_{m=0}^{k}a_m^2.
\]


We have E d(v^i_k, v^j_k) → 0, so that dividing the previous bound by 2Σ_{m=0}^{k} a_m and letting k → ∞ gives

\[ \liminf_{k}\,\mathbb{E}f(v^i_k) - f(w^*) = 0. \]

Since f(v^i_k) ≥ f(w*), we can apply Fatou's lemma to conclude

\[ 0 \le \mathbb{E}\Big[\liminf_{k}f(v^i_k)-f(w^*)\Big] \le \liminf_{k}\,\mathbb{E}f(v^i_k)-f(w^*) = 0, \]

which implies lim inf_k f(v^i_k) − f(w*) = 0 with probability 1.

5 Numerical Experiments

In this section, we demonstrate the algorithm on two important problems which can be solved within the Riemannian framework. In particular, these examples demonstrate the performance of the proposed algorithm on the sphere and the Grassmann manifold. The algorithm is simulated in a MATLAB environment using the Manopt toolbox [32].

5.1 Computing the leading eigenvector

The problem entails computing the leading eigenvector of a data matrix whose entries are stored at different nodes of a communication network. The problem can be stated precisely as

\[
\text{minimize}\quad \Big\{-w^TAw \overset{\Delta}{=} -w^T\Big(\sum_{i=1}^{n}z_iz_i^T\Big)w\Big\}, \qquad \text{subject to } w \in \mathcal{S}^p,
\]

where z_i ∈ R^{p+1} is the data entry located at the i'th node and S^p is the p-dimensional sphere defined as

\[ \mathcal{S}^p = \{w\in\mathbb{R}^{p+1} : w^Tw = 1\}. \]

We note that the distributed version of the problem can be posed in the form (2) by letting f^i(w) := −w^T z_i z_i^T w. The tangent space T_w S^p at a point w is given by

\[ T_w\mathcal{S}^p = \{x\in\mathbb{R}^{p+1} : x^Tw = 0\}. \]

The Riemannian metric is the usual inner product inherited from R^{p+1}. The sectional curvature is constant, with κ_min = κ_max = 1. Also, inj S^p = π and r_c = π/2. The Riemannian gradient grad f(·) of any function f(·) can be obtained by the orthogonal projection P_{T_w}(·) of the Euclidean gradient ∇f(w) onto the tangent space:

\[ \mathrm{grad}_w f(w) = P_{T_w}(\nabla f(w)) = (I_{p+1} - ww^T)\nabla f(w), \]

where I_{p+1} is the (p+1)×(p+1) identity matrix. The geodesic t ↦ x(t), expressed as a function of x_0 ∈ S^p and ẋ_0 ∈ T_{x_0}S^p, is given by (Example 5.4.1, [4]):

\[ x(t) = x_0\cos(\|\dot x_0\|t) + \frac{\dot x_0}{\|\dot x_0\|}\sin(\|\dot x_0\|t). \]

An alternative to this is provided by approximating the geodesic with a retraction, which here just involves normalizing (i.e., dividing by the norm) the iterate at each step. The Riemannian distance between any two points is given by

\[ d(x,y) = \cos^{-1}(x^Ty). \]

The vector transport along the geodesic is given by the following expression:

\[ T^{x(t)}_{x_0}(\xi_x) = \big\{u\cos(\|\dot x_0\|t)\,u^T - x_0\sin(\|\dot x_0\|t)\,u^T + (I_{p+1}-uu^T)\big\}\xi_x, \]

where u = ẋ_0/‖ẋ_0‖.


Figure 1: (a) Plot of the distance between the various agents vs. the number of iterations: we fix an agent i and plot the value d²(w_i, w_j) for all j ∈ V. (b) The function value f(w_i), for all i, against the iteration count. (c)-(d) The same results for Example 5.2.


The matrix Q here is generated using Metropolis weights, and the number of nodes is set to n = 30. We let p = 3. To generate the initial conditions, a random point w_0 ∈ S³ is selected and the initial condition at each node is set to w^i_0 = Exp_{w_0}(v_i), where {v_i}_{i=1}^{n} are tangent vectors drawn from an isotropic Gaussian distribution with standard deviation σ = 0.2. The algorithm is run for k = 10³ iterations (although convergence is achieved quickly due to the small matrix size) with step sizes ε = 0.1 and a_k = 1/k. The results are plotted in Figure 1. The agents achieve consensus relatively early and stay in that configuration for the rest of the run. We remark that the number of iterations required typically increases with the dimension of the problem. Figure 1b shows the function value f(w_i), for all i, against the iteration count. Note that as the agents achieve consensus, the values decrease together for all agents.

5.2 Principal Component Analysis

We next consider a generalization of the previous problem, known as Principal Component Analysis (PCA). PCA entails the computation of the r principal eigenvectors of a p×p covariance matrix A given by

\[ A = \mathbb{E}[z_kz_k^T], \]

with z_1, ..., z_k, ... being a stream of uniformly bounded p-dimensional data vectors. The cost function for PCA is:

\[ C(W) = -\frac{1}{2}\mathbb{E}[z^TWW^Tz] = -\frac{1}{2}\mathrm{Tr}(W^TAW), \]

where Tr(·) denotes the trace of a matrix and W belongs to the Stiefel manifold S_{p,r} defined as:

\[ \mathcal{S}_{p,r} = \{X\in\mathbb{R}^{p\times r} : X^TX = I_r\}. \]

The cost function is invariant under the transformation W ↦ WO, where O ∈ O(r), the orthogonal group. Thus the state space can be identified with the Grassmann manifold

\[ \mathcal{G}(p,r) = \mathcal{S}_{p,r}/\mathcal{O}(r), \]

so that any [G] ∈ G(p, r) is the equivalence class

\[ [G] = \{WO : O\in\mathcal{O}(r)\}. \]

For details on the geometry of this manifold, we refer the reader to [33]. The Riemannian gradient of C(W) under the sample z is given by

\[ \mathrm{grad}_W f(W) = (I_p - WW^T)zz^TW. \]

A retraction which can be used here is given by (Example 4.1.3, [4]):

\[ R_W(aH) = \mathrm{qf}(W + aH), \]

where qf(·) gives the orthogonal factor in the QR decomposition of its argument. The formula for parallel transport on Grassmann manifolds is given in (Example 8.1.3, [4]) and for vector transport in (Example 8.1.10, [4]).

We consider a synthetic dataset of d = 10⁴ measurements, the dimension of each measurement being p = 10³. The measurements are assumed to be distributed throughout a network of 10 nodes. The connectivity graph is generated in a similar manner as in the previous example, and the results are plotted in Figure 1(c)-(d). A point to note here is that consensus, as in the previous case, is achieved quite early, around k = 100, while the function value evaluated at the estimates keeps decreasing until k = 150.

Appendix I: Proof of Lemma 3

We recall the iteration for computing g^i_k:

\[ g^i_k = \sum_{j=1}^{n} q_{ij}T^{v^i_k}_{v^j_{k-1}}g^j_{k-1} + \nabla f^i(v^i_k) - T^{v^i_k}_{v^i_{k-1}}\nabla f^i(v^i_{k-1}), \tag{26} \]


where g^i_0 = ∇f^i(v^i_0). Let q^m_{ij} denote the ij'th entry of the matrix power Q^m := Q × ⋯ × Q. Doing a recursion on the expression for g^i_k gives

\[
g^i_k = \sum_{j=1}^{n}q^k_{ij}T^{v^i_k}_{v^j_0}g^j_0 + \sum_{p=1}^{k-1}\sum_{j=1}^{n}q^{k-p}_{ij}T^{v^i_k}_{v^j_p}\big(\nabla f^j(v^j_p) - T^{v^j_p}_{v^j_{p-1}}\nabla f^j(v^j_{p-1})\big) + \nabla f^i(v^i_k) - T^{v^i_k}_{v^i_{k-1}}\nabla f^i(v^i_{k-1}), \tag{27}
\]

where we have used the associativity of parallel transport, namely T^x_y T^y_z = T^x_z. We show that the auxiliary sequence {ḡ^i_k} defined below tracks the average gradient closely enough at each node.

Next, we let ḡ^i_k := (1/n)Σ_{j=1}^{n} T^{v^i_k}_{v^j_k} g^j_k. We can write the expression for ḡ^i_k using the definition of g^j_k as

\begin{align*}
\bar g^i_k &= \frac{1}{n}\sum_{j=1}^{n}T^{v^i_k}_{v^j_k}\Big(\sum_{p=1}^{n}q_{jp}T^{v^j_k}_{v^p_{k-1}}g^p_{k-1} + \nabla f^j(v^j_k) - T^{v^j_k}_{v^j_{k-1}}\nabla f^j(v^j_{k-1})\Big)\\
&= \frac{1}{n}\sum_{j=1}^{n}\sum_{p=1}^{n}q_{jp}T^{v^i_k}_{v^p_{k-1}}g^p_{k-1} + \frac{1}{n}\sum_{j=1}^{n}T^{v^i_k}_{v^j_k}\big(\nabla f^j(v^j_k) - T^{v^j_k}_{v^j_{k-1}}\nabla f^j(v^j_{k-1})\big)\\
&= \frac{1}{n}\sum_{p=1}^{n}T^{v^i_k}_{v^p_{k-1}}g^p_{k-1} + \frac{1}{n}\sum_{j=1}^{n}T^{v^i_k}_{v^j_k}\big(\nabla f^j(v^j_k) - T^{v^j_k}_{v^j_{k-1}}\nabla f^j(v^j_{k-1})\big)\\
&= T^{v^i_k}_{v^i_{k-1}}\bar g^i_{k-1} + \frac{1}{n}\sum_{j=1}^{n}T^{v^i_k}_{v^j_k}\big(\nabla f^j(v^j_k) - T^{v^j_k}_{v^j_{k-1}}\nabla f^j(v^j_{k-1})\big),
\end{align*}

where we use the fact that Σ_j q_{ij} = Σ_i q_{ij} = 1 in the third equality. Since ḡ^i_0 = (1/n)Σ_{p=1}^{n} T^{v^i_0}_{v^p_0}∇f^p(v^p_0), one can see by induction from the previous equation that ḡ^i_k = (1/n)Σ_{p=1}^{n} T^{v^i_k}_{v^p_k}∇f^p(v^p_k) for any k ≥ 1. Doing a recursion for ḡ^i_{k-1} in the previous equation, we have

\[
\bar g^i_k = T^{v^i_k}_{v^i_0}\bar g^i_0 + \frac{1}{n}\sum_{p=1}^{k}\sum_{j=1}^{n}T^{v^i_k}_{v^j_p}\big(\nabla f^j(v^j_p) - T^{v^j_p}_{v^j_{p-1}}\nabla f^j(v^j_{p-1})\big). \tag{28}
\]

Subtracting (28) from (27), we get

\begin{align*}
g^i_k - \bar g^i_k ={}& \sum_{j=1}^{n}q^k_{ij}T^{v^i_k}_{v^j_0}g^j_0 - T^{v^i_k}_{v^i_0}\bar g^i_0 + \sum_{p=1}^{k-1}\sum_{j=1}^{n}\Big(q^{k-p}_{ij}-\frac{1}{n}\Big)T^{v^i_k}_{v^j_p}\big(\nabla f^j(v^j_p) - T^{v^j_p}_{v^j_{p-1}}\nabla f^j(v^j_{p-1})\big)\\
&+ \nabla f^i(v^i_k) - T^{v^i_k}_{v^i_{k-1}}\nabla f^i(v^i_{k-1}) - \frac{1}{n}\sum_{j=1}^{n}T^{v^i_k}_{v^j_k}\big(\nabla f^j(v^j_k) - T^{v^j_k}_{v^j_{k-1}}\nabla f^j(v^j_{k-1})\big). \tag{29}
\end{align*}

From the mixing bound (4), we have

\begin{align*}
\|g^i_k - \bar g^i_k\| \le{}& \sigma^k_2(Q)C\sqrt{n} + \sum_{p=1}^{k-1}\sigma^{k-p}_2(Q)\sum_{j=1}^{n}\big\|\nabla f^j(v^j_p) - T^{v^j_p}_{v^j_{p-1}}\nabla f^j(v^j_{p-1})\big\|\\
&+ \Big\|\nabla f^i(v^i_k) - T^{v^i_k}_{v^i_{k-1}}\nabla f^i(v^i_{k-1}) - \frac{1}{n}\sum_{j=1}^{n}T^{v^i_k}_{v^j_k}\big(\nabla f^j(v^j_k) - T^{v^j_k}_{v^j_{k-1}}\nabla f^j(v^j_{k-1})\big)\Big\|. \tag{30}
\end{align*}

We next consider bounding ‖∇f^j(v^j_p) − T^{v^j_p}_{v^j_{p-1}}∇f^j(v^j_{p-1})‖ for 1 ≤ p ≤ k. We begin by noting that for any i = 1, ⋯, n, we have

\[ \big\|\nabla f^i(v^i_k) - T^{v^i_k}_{v^i_{k-1}}\nabla f^i(v^i_{k-1})\big\| \le d(v^i_k, v^i_{k-1}). \tag{31} \]

By the triangle inequality we have

\[ d(v^i_k, v^i_{k-1}) \le d(v^i_k, w^i_k) + d(w^i_k, v^i_{k-1}) \le \varepsilon\|\mathrm{grad}_{w^i_k}\varphi(\mathbf{w}_k)\| + a_{k-1}\|g^i_{k-1}\|.
\]


Thus, we have

\begin{align*}
a_k\big\|\nabla f^i(v^i_k) - T^{v^i_k}_{v^i_{k-1}}\nabla f^i(v^i_{k-1})\big\| &\le a_kd(v^i_k, v^i_{k-1})\\
&\le a_k\big(\varepsilon\|\mathrm{grad}_{w^i_k}\varphi(\mathbf{w}_k)\| + a_{k-1}\|g^i_{k-1}\|\big)\\
&\le \frac{a_k^2}{2} + \frac{\varepsilon^2\|\mathrm{grad}_{w^i_k}\varphi(\mathbf{w}_k)\|^2}{2} + aCa_k^2,
\end{align*}

where we have used a_{k-1} ≤ a a_k and 2xy ≤ x² + y². Using the above in (30),

\begin{align*}
a_k\|g^i_k - \bar g^i_k\| \le{}& \sqrt{n}\sum_{m=0}^{k-1}\sigma^{k-m}_2(Q)\Big(\frac{a_m^2}{2} + \frac{\varepsilon^2\|\mathrm{grad}\,\varphi(\mathbf{w}_m)\|^2}{2} + aCa_m^2\Big)\\
&+ \Big(1+\frac{1}{n}\Big)\Big(\frac{a_k^2}{2} + \frac{\varepsilon^2\|\mathrm{grad}\,\varphi(\mathbf{w}_k)\|^2}{2} + aCa_k^2\Big) \tag{32}\\
\le{}& \sqrt{n}\sum_{m=1}^{k}\sigma^{k-m}_2(Q)\Big(\frac{a_m^2}{2} + \frac{\varepsilon^2\|\mathrm{grad}\,\varphi(\mathbf{w}_m)\|^2}{2} + aCa_m^2\Big), \tag{33}
\end{align*}

where we assume, without loss of generality, that 1 ≤ a a_0² to absorb the first term in (30) into the summation, and use √n ≥ 1 + n^{-1} to absorb the last term into the summation above. We also have

\[ a_k\|g^i_k - \nabla f(v^i_k)\| \le a_k\|g^i_k - \bar g^i_k\| + a_k\|\bar g^i_k - \nabla f(v^i_k)\|. \tag{34} \]

The second term in the previous equation can be bounded as

\begin{align*}
a_k\|\bar g^i_k - \nabla f(v^i_k)\| &\le \frac{a_k^2}{2} + \frac{\|\bar g^i_k - \nabla f(v^i_k)\|^2}{2}\\
&= \frac{a_k^2}{2} + \frac{1}{2}\Big\|\frac{1}{n}\sum_{p=1}^{n}T^{v^i_k}_{v^p_k}\nabla f^p(v^p_k) - \nabla f(v^i_k)\Big\|^2\\
&\le \frac{a_k^2}{2} + \frac{1}{2n}\sum_{j=1}^{n}d^2(v^i_k, v^j_k)\\
&\le \frac{a_k^2}{2} + \frac{3}{2n}\sum_{j=1}^{n}\big(d^2(v^i_k, w^i_k) + d^2(w^i_k, w^j_k) + d^2(w^j_k, v^j_k)\big)\\
&\le \frac{a_k^2}{2} + \frac{9}{2}\|\mathrm{grad}\,\varphi(\mathbf{w}_k)\|^2,
\end{align*}

where we have used the fact that d²(w^i_k, w^j_k) ≤ ‖grad ϕ(w_k)‖² by definition. Using the above and (32) in (34),

\[
a_k\|g^i_k - \nabla f(v^i_k)\| \le \sqrt{n}\sum_{m=0}^{k}\sigma^{k-m}_2(Q)\Big(a_m^2 + \varepsilon^2\Big(1+\frac{9}{2\varepsilon^2}\Big)\|\mathrm{grad}\,\varphi(\mathbf{w}_m)\|^2 + aCa_m^2\Big),
\]

and using Lemma 2 to substitute for ‖grad ϕ(w_m)‖² gives

\begin{align*}
a_k\|g^i_k - \nabla f(v^i_k)\| \le{}& \sqrt{n}\left\{1 + aC + \frac{2\varepsilon\big(1+\frac{9}{2\varepsilon^2}\big)}{1-\frac{\mu_{\max}\varepsilon}{2}}\Big(\frac{1+\varepsilon^2L_\varphi^2}{\varepsilon}+\frac{\mu_{\max}}{2}\Big)C^2\right\}\sum_{m=0}^{k}\sigma^{k-m}_2(Q)a_m^2\\
&+ \frac{2\sqrt{n}\,\varepsilon\big(1+\frac{9}{2\varepsilon^2}\big)}{1-\frac{\mu_{\max}\varepsilon}{2}}\sum_{m=0}^{k}\sigma^{k-m}_2(Q)\big(\varphi(\mathbf{w}_m)-\varphi(\mathbf{w}_{m+1})\big).
\end{align*}

Set M equal to the constant inside the curly brackets in the above equation, and note that, upon summing over k, the telescoping ϕ-terms contribute at most ϕ(w_0)/(1 − σ₂(Q)) = 0. This gives

\[ \sum_{p=0}^{k}a_p\|g^i_p - \nabla f(v^i_p)\| \le \sqrt{n}M\sum_{p=0}^{k}\sum_{m=0}^{p}\sigma^{p-m}_2(Q)a_m^2. \]


We note that the summation on the right can be bounded as:

\[ \sum_{p=0}^{k}\sum_{m=0}^{p}\sigma^{p-m}_2(Q)a_m^2 \le \sum_{m=0}^{k}a_m^2\sum_{p=0}^{k-m}\sigma^{p}_2(Q) \le \frac{1}{1-\sigma_2(Q)}\sum_{m=0}^{k}a_m^2. \]

Plugging this bound into the previous inequality proves the result.
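The tracking property established above is easiest to see in the Euclidean special case, where all transports T reduce to the identity. A toy check (hypothetical code; the iterates v^i are frozen so that the gradient-difference term vanishes and only the mixing dynamics remain):

```python
import numpy as np

# Euclidean gradient tracking: g_k = Q g_{k-1} + (new grads) - (old grads).
# With frozen iterates the difference term is zero, each row of g converges
# to the network-average gradient, and the row-average is preserved exactly
# at every step because Q is doubly stochastic.
n, dim = 5, 3
rng = np.random.default_rng(2)
Q = 0.5 * np.full((n, n), 1.0 / n) + 0.5 * np.eye(n)   # doubly stochastic
grads = rng.standard_normal((n, dim))                  # grad f^i(v^i), fixed
g = grads.copy()                                       # g^i_0 = grad f^i(v^i_0)
for _ in range(100):
    g = Q @ g + grads - grads                          # difference term = 0

avg = grads.mean(axis=0)
assert np.allclose(g.mean(axis=0), avg)     # average preserved at every step
assert np.allclose(g, avg, atol=1e-10)      # every node tracks the average
```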

Appendix II: Proof of Lemma 4

The proof uses techniques similar to Theorem 27, [15]. We prove the result for a constant sectional curvature Δ; the result for a varying (but bounded) curvature can be established along similar lines using the Rauch comparison theorems (see e.g. Theorems IX.2.2 and IX.2.3, [34]). We consider the function ϕ_{i,j} : M×M → R defined by ϕ_{i,j}(x_i, x_j) := (1/2)d²(x_i, x_j). Since ϕ(\mathbf{x}) = Σ_{(i,j)∈E} ϕ_{i,j}(x_i, x_j), it is enough to prove the result for ϕ_{i,j}. We drop the indices i, j in the proof to avoid notational clutter.

Let (x, y) ∈ M². Define the geodesic γ : [0, 1] → M² with γ(0) = x and γ(1) = y. The second-order Taylor expansion (see e.g. eq. (5.26), Section 5.3, [29]) of ϕ gives:

\[ \varphi(y) = \varphi(x) + \langle\mathrm{grad}\,\varphi(x),\ \mathrm{Exp}^{-1}_x(y)\rangle + \frac{1}{2}\langle\mathrm{Hess}\,\varphi(\gamma(t))[\dot\gamma(t)],\ \dot\gamma(t)\rangle + \frac{1}{2}\langle\mathrm{grad}\,\varphi(\gamma(t)),\ \ddot\gamma(t)\rangle. \tag{35} \]

The last term can be ignored, since the acceleration of a geodesic is zero. We show that the third term is positive for any point γ(t) = (x, y) and γ̇(t) = (v₁, v₂) with t ∈ (0, 1).

Accordingly, consider any point (x, y) ∈ M² with d(x, y) ≤ r < r*. Define the geodesics γ_i : (−ε, ε) → M with γ₁(0) = x, γ̇₁(0) = v₁ and γ₂(0) = y, γ̇₂(0) = v₂. Let γ_{12,t}(s) denote the geodesic joining the point γ₁(t) to γ₂(t). We note that since we restrict ourselves to radii less than the convexity radius, such a construction is possible without encountering any conjugate point pairs on γ_{12,t}. Also, let φ(t) := L(γ_{12,t}), where L denotes the length of the geodesic segment, so that φ(t) = d(γ₁(t), γ₂(t)) and φ(0) = r. Thus, we need to calculate the second-order derivative of φ²(t) at t₀ = 0.

To proceed, we define the geodesic variation α : [0, 1] × [a, b] → M, so that the map s ↦ α(t₀, s) traces the (normalized) geodesic γ_{12,t₀}(s) with ‖γ̇_{12,t}‖ = 1. Define the vector field X(t) := ∂α/∂t |_{t=t₀}. We note that X(t) is a Jacobi field (page 36, [23]) along γ_{12,t}. The second derivative of φ²(·) can be calculated by a straightforward application of the second variation of arc length formula (Theorem II.4.3, [34]) and is given by (Theorem 27, [15]):

\[ \frac{d^2}{dt^2}\varphi^2(t)\Big|_{t=t_0} = r\,\big\langle\nabla X^{\perp}(s),\ X^{\perp}(s)\big\rangle\Big|^b_a + \Big(\Big\langle X(s),\ \frac{\dot\gamma_{12,t_0}(s)}{\|\dot\gamma_{12,t_0}(s)\|}\Big\rangle\Big|^b_a\Big)^2, \tag{36} \]

where X^⊥(s) := X(s) − ⟨X(s), γ̇_{12,t}⟩γ̇_{12,t} is the part of X orthogonal to γ̇_{12,t}. We consider the first term and show that it is non-negative to establish the result. Since the initial condition X(0) is not zero, to use the standard Jacobi field theory we split X(t) as X(t) = X₁(t) + X₂(t), where

\[ X_1(0) = 0;\quad X_2(0) = X^{\perp}(0);\quad X_1(r) = X^{\perp}(r);\quad X_2(r) = 0. \tag{37} \]

Such a (unique) decomposition is possible (see e.g. Lemma 2.4, [23]). Then, the solution to the Jacobi equation with the above boundary conditions, for a manifold with constant curvature, is given by (see Section 2.3, Chapter 5, [21] or Section 3, [34]):

\[ X_i(t) = S_\Delta(t)E_i(t) \quad\text{and}\quad \nabla X_i(t) = C_\Delta(t)E_i(t), \quad i = 1, 2, \tag{38} \]

where E_i(t) is a field parallel along γ_{12,t} with ‖E_i(t)‖ = 1, and

\[
C_\Delta(t) := \cos(\sqrt{\Delta}\,t),\quad S_\Delta(t) := \frac{1}{\sqrt{\Delta}}\sin(\sqrt{\Delta}\,t),\ \text{if } \Delta > 0;\qquad C_\Delta(t) := 1,\quad S_\Delta(t) := t,\ \text{if } \Delta = 0;
\]
\[
C_\Delta(t) := \cosh(\sqrt{|\Delta|}\,t),\quad S_\Delta(t) := \frac{1}{\sqrt{|\Delta|}}\sinh(\sqrt{|\Delta|}\,t),\ \text{if } \Delta < 0.
\]


We also have the easily verifiable property (see e.g. Theorem 22, [15]):

\[ \frac{\langle\nabla X(t),\ X(t)\rangle}{\langle X(t),\ X(t)\rangle} = \frac{C_\Delta(t)}{S_\Delta(t)} \quad\text{and}\quad \|\nabla X(0)\| \le \frac{\|X(t)\|}{S_\Delta(t)}. \tag{39} \]

We have from (36):

\[ \frac{1}{r}\frac{d^2}{dt^2}\frac{\varphi^2(t)}{2} \ge \langle\nabla X_1(r), X_1(r)\rangle + \langle\nabla X_2(r), X_1(r)\rangle - \langle\nabla X_1(0), X_2(0)\rangle - \langle\nabla X_2(0), X_2(0)\rangle. \]

We remark that since the boundary conditions for X₂(t) are reversed (i.e. X₂(r) = 0 and X₂(0) = X^⊥(0)), when considering ∇X₂(0) we have to parametrize γ_{12,t₀}(s) as s' = r − s. This gives ∇X₂(r) = −∇X₂(0). We first consider the case Δ ≤ 0. Using (39) in the previous equation (along with −∇X₂(0) = ∇X₂(r)), we have:

\begin{align*}
\frac{1}{r}\frac{d^2}{dt^2}\frac{\varphi^2(t)}{2} &\ge \frac{C_\Delta(r)}{S_\Delta(r)}\|X_1(r)\|^2 - \frac{1}{S_\Delta(r)}\|X_2(0)\|\|X_1(r)\| - \frac{1}{S_\Delta(r)}\|X_1(r)\|\|X_2(0)\| + \frac{C_\Delta(r)}{S_\Delta(r)}\|X_2(0)\|^2\\
&\ge \frac{C_\Delta(r)}{S_\Delta(r)}\|X_1(r)\|^2 + \frac{C_\Delta(r)}{S_\Delta(r)}\|X_2(0)\|^2 - \frac{1}{S_\Delta(r)}\big(\|X_1(r)\|^2 + \|X_2(0)\|^2\big)\\
&= \frac{1}{S_\Delta(r)}\big(C_\Delta(r) - 1\big)\big(\|X^{\perp}(0)\|^2 + \|X^{\perp}(r)\|^2\big),
\end{align*}

after using the boundary conditions (37). The right-hand side can be seen to be non-negative for all r > 0 using the definition of C_Δ.

We next consider Δ > 0. Set a := ‖X^⊥(0)‖ and b := ‖X^⊥(r)‖. We have:

\begin{align*}
\frac{1}{r}\frac{d^2}{dt^2}\frac{\varphi^2(t)}{2} &\ge \frac{C_\Delta(r)}{S_\Delta(r)}\|X_1(r)\|^2 - C_\Delta(0)\|E_2(0)\|\|X_1(r)\| - C_\Delta(0)\|E_1(0)\|\|X_2(0)\| + \frac{C_\Delta(r)}{S_\Delta(r)}\|X_2(0)\|^2\\
&\ge \frac{C_\Delta(r)}{S_\Delta(r)}\|X^{\perp}(r)\|^2 - \|X^{\perp}(r)\| - \|X^{\perp}(0)\| + \frac{C_\Delta(r)}{S_\Delta(r)}\|X^{\perp}(0)\|^2\\
&= (a^2+b^2)\frac{C_\Delta(r)}{S_\Delta(r)}\Big(1 - \frac{\tan(\sqrt{\Delta}\,r)(a+b)}{\sqrt{\Delta}(a^2+b^2)}\Big),
\end{align*}

where we have used (38) and (39) in the first inequality; the second inequality uses the fact that ‖E_i(t)‖ = 1. Consider

\[ r^* = \min\Big(\frac{\pi}{2\sqrt{\Delta}},\ \frac{1}{\sqrt{\Delta}}\tan^{-1}\Big(\frac{\sqrt{\Delta}(a^2+b^2)}{a+b}\Big)\Big). \]

One can easily verify that for r < r*, the right-hand side of the above inequality is positive.

References

[1] A. Nedich et al., "Convergence rate of distributed averaging dynamics and optimization in networks," Foundations and Trends in Systems and Control, vol. 2, no. 1, pp. 1–100, 2015.

[2] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.

[3] E. Oja, "Principal components, minor components, and linear neural networks," Neural Networks, vol. 5, no. 6, pp. 927–935, 1992.

[4] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

[5] D. G. Luenberger, "The gradient projection method along geodesics," Management Science, vol. 18, no. 11, pp. 620–631, 1972.

[6] U. Helmke and J. B. Moore, Optimization and Dynamical Systems. Springer Science & Business Media, 2012.

[7] C. Udriste, Convex Functions and Optimization Methods on Riemannian Manifolds. Springer Science & Business Media, 1994, vol. 297.

[8] H. Zhang and S. Sra, "First-order methods for geodesically convex optimization," in Conference on Learning Theory, 2016, pp. 1617–1638.

[9] J. N. Tsitsiklis, "Problems in decentralized decision making and computation," Massachusetts Institute of Technology, Cambridge, Lab for Information and Decision Systems, Tech. Rep., 1984.

[10] A. Nedic, A. Ozdaglar, and P. A. Parrilo, "Constrained consensus and optimization in multi-agent networks," IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.

[11] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803–812, 1986.

[12] S. S. Ram, A. Nedić, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.

[13] K. Srivastava and A. Nedic, "Distributed asynchronous constrained stochastic optimization," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 772–790, 2011.

[14] P. Bianchi and J. Jakubowicz, "Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization," arXiv preprint arXiv:1107.2526, 2011.

[15] R. Tron, B. Afsari, and R. Vidal, "Riemannian consensus for manifolds with bounded curvature," IEEE Transactions on Automatic Control, vol. 58, no. 4, pp. 921–934, 2013.

[16] R. Olfati-Saber, "Swarms on sphere: A programmable swarm with synchronous behaviors like oscillator networks," in Proc. 45th IEEE Conference on Decision and Control, 2006, pp. 5060–5066.

[17] L. Scardovi, A. Sarlette, and R. Sepulchre, "Synchronization and balancing on the N-torus," Systems & Control Letters, vol. 56, no. 5, pp. 335–341, 2007.

[18] A. Sarlette and R. Sepulchre, "Consensus optimization on manifolds," SIAM Journal on Control and Optimization, vol. 48, no. 1, pp. 56–76, 2009.

[19] Y. Igarashi, T. Hatanaka, M. Fujita, and M. W. Spong, "Passivity-based attitude synchronization in SE(3)," IEEE Transactions on Control Systems Technology, vol. 17, no. 5, pp. 1119–1134, 2009.

[20] A. Sarlette, S. Bonnabel, and R. Sepulchre, "Coordinated motion design on Lie groups," IEEE Transactions on Automatic Control, vol. 55, no. 5, pp. 1047–1058, 2010.

[21] M. P. do Carmo, Riemannian Geometry. Birkhäuser, 1992.

[22] S. Gallot, D. Hulin, and J. Lafontaine, Riemannian Geometry. Springer, 1990, vol. 3.

[23] T. Sakai, Riemannian Geometry. American Mathematical Society, 1996, vol. 149.

[24] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: Convergence analysis and network scaling," IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2011.

[25] A. Sarlette, S. E. Tuna, V. D. Blondel, and R. Sepulchre, "Global synchronization on the circle," IFAC Proceedings Volumes, vol. 41, no. 2, pp. 9045–9050, 2008.

[26] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2017.

[27] P. Di Lorenzo and G. Scutari, "NEXT: In-network nonconvex optimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.

[28] A. Nedić, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.

[29] N. Boumal, "An introduction to optimization on smooth manifolds," available online, Aug. 2020. [Online]. Available: http://www.nicolasboumal.net/book

[30] W. Huang, K. A. Gallivan, and P.-A. Absil, "A Broyden class of quasi-Newton methods for Riemannian optimization," SIAM Journal on Optimization, vol. 25, no. 3, pp. 1660–1685, 2015.

[31] S. Bonnabel, "Stochastic gradient descent on Riemannian manifolds," IEEE Transactions on Automatic Control, vol. 58, no. 9, pp. 2217–2229, 2013.

[32] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre, "Manopt, a Matlab toolbox for optimization on manifolds," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1455–1459, 2014.

[33] A. Edelman, T. A. Arias, and S. T. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.

[34] I. Chavel, Riemannian Geometry: A Modern Introduction. Cambridge University Press, 2006, vol. 98.