
Riemannian Optimization via Frank-Wolfe Methods

Melanie Weber [email protected]

Princeton University

Suvrit Sra [email protected]

Laboratory for Information and Decision Systems, MIT

Abstract

We study projection-free methods for constrained Riemannian optimization. In particular, we propose the Riemannian Frank-Wolfe (Rfw) method. We analyze non-asymptotic convergence rates of Rfw to an optimum for (geodesically) convex problems, and to a critical point for nonconvex objectives. We also present a practical setting under which Rfw can attain a linear convergence rate. As a concrete example, we specialize Rfw to the manifold of positive definite matrices and apply it to two tasks: (i) computing the matrix geometric mean (Riemannian centroid); and (ii) computing the Bures-Wasserstein barycenter. Both tasks involve geodesically convex interval constraints, for which we show that the Riemannian "linear oracle" required by Rfw admits a closed form solution; this result may be of independent interest. We further specialize Rfw to the special orthogonal group, and show that here too, the Riemannian "linear oracle" can be solved in closed form. Here, we describe an application to the synchronization of data matrices (Procrustes problem). We complement our theoretical results with an empirical comparison of Rfw against state-of-the-art Riemannian optimization methods, and observe that Rfw performs competitively on the task of computing Riemannian centroids.

1 Introduction

We study the following constrained optimization problem

min_{x ∈ X ⊆ M} φ(x),    (1.1)

where φ : M → R is a differentiable function and X is a compact g-convex subset of a Riemannian manifold M. The objective φ may be geodesically convex (henceforth, g-convex) or nonconvex. When the constraint set X is "simple" one may solve (1.1) via Riemannian projected-gradient. But in many cases, projection onto X can be expensive to compute, motivating us to seek projection-free methods.

Euclidean (M ≡ R^n) projection-free methods based on the Frank-Wolfe (Fw) scheme [Frank and Wolfe, 1956] have recently witnessed a surge of interest in machine learning and related fields [Jaggi, 2013, Lacoste-Julien and Jaggi, 2015]. Instead of projection, such Fw methods rely on access to a "linear oracle," that is, a subroutine that solves the problem

min_{z ∈ X} 〈z, ∇φ(x)〉,    (1.2)

that can sometimes be much simpler than projection onto X. The attractive property of Fw methods has been exploited in a variety of settings, including convex [Jaggi, 2013, Bach, 2015], nonconvex [Lacoste-Julien, 2016], submodular [Fujishige and Isotani, 2011, Calinescu et al., 2011] and stochastic [Hazan and Luo, 2016, Reddi et al., 2016]; among numerous others.

But as far as we are aware, Fw methods have not been studied for Riemannian manifolds. Our work fills this gap in the literature by developing, analyzing, and experimenting with Riemannian Frank-Wolfe (Rfw) methods. In addition to adapting Fw to the Riemannian setting, there is one more challenge that we must overcome: Rfw requires access to a Riemannian analog of the linear oracle (1.2), which can be hard even for g-convex problems.

Therefore, to complement our theoretical analysis of Rfw, we discuss in detail practical settings that admit efficient Riemannian linear oracles. Specifically, we discuss problems where M = P_d, the manifold of (Hermitian) positive definite matrices, and X is a g-convex semidefinite interval; then problem (1.1) assumes the form

min_{X ∈ X ⊆ P_d} φ(X),  where X := {X ∈ P_d | L ⪯ X ⪯ U},    (1.3)

where L and U are positive definite matrices. An important instance of (1.3) is the following g-convex optimization problem (see §4 for details and notation):

min_{X ∈ P_d} ∑_{i=1}^{n} w_i δ_R²(X, A_i),  where w ∈ Δ_n, A_1, . . . , A_n ∈ P_d,    (1.4)

which computes the Riemannian centroid of a set of positive definite matrices (also known as the "matrix geometric mean" and the "Karcher mean") [Bhatia, 2007, Karcher, 1977, Lawson and Lim, 2014]. We will show that Rfw offers a simple approach to solve (1.4) that empirically outperforms recently published state-of-the-art approaches. As a second application, we show that Rfw allows for an efficient computation of Bures-Wasserstein barycenters on the Gaussian density manifold.

Furthermore, to offer an example on a matrix manifold other than positive definite matrices, we show that the Riemannian linear oracle can be solved in closed form for the special orthogonal group too. Specifically, we use this oracle to obtain an efficient Rfw-based approach for the task of synchronizing data matrices, a classic problem better known as the Procrustes problem.

Summary of results. The key contributions of this paper are:

• It introduces Riemannian Frank-Wolfe (Rfw) for addressing constrained g-convex optimization on Riemannian manifolds. We show that Rfw attains a non-asymptotic O(1/k) rate of convergence (Theorem 3.4) to the optimum, k being the number of iterations. Furthermore, under additional assumptions on the objective function and the constraints, we show that Rfw can even attain linear convergence rates (Theorem 3.5). In the nonconvex case, Rfw attains a non-asymptotic O(1/√k) rate of convergence to first-order critical points (Theorem 3.6).

• It specializes Rfw for g-convex problems of the form (1.3) on the manifold of Hermitian positive definite (HPD) matrices. Most importantly, for this problem it develops a closed-form solution to the Riemannian "linear oracle," which involves solving a nonconvex semi-definite program (SDP), see Theorem 4.1.

• It applies Rfw to computing the Riemannian mean of HPD matrices. In comparison with state-of-the-art methods, we observe that Rfw performs competitively. Furthermore, we implement Rfw for the computation of Bures-Wasserstein barycenters on the Gaussian density manifold.

• It specializes Rfw for optimization on the manifold of special orthogonal matrices, where the Riemannian linear oracle can be solved in closed form too. Notably, this problem is an instance of Rfw being applicable to a setting with nonconvex constraints. We present an application to the classical Procrustes problem.

We believe the closed-form solution for the Riemannian linear oracle, which involves a nonconvex SDP, should be of wider interest too. A similar approach can be used to solve the Euclidean linear oracle, a convex SDP, in closed form (Appendix A). These oracles lie at the heart of the competitive performance of Rfw and Fw in comparison with state-of-the-art methods for computing the Riemannian centroid of HPD matrices (Section 6). More broadly, we hope that our results encourage others to study Rfw as well as other examples of problems with efficient Riemannian linear oracles.

Related work. Riemannian optimization has a venerable history. The books [Udriste, 1994, Absil et al., 2009] provide a historical perspective as well as basic theory. The focus of these books and of numerous older works on Riemannian optimization, e.g., [Edelman et al., 1998, Helmke et al., 2007, Ring and Wirth, 2012], is almost exclusively on asymptotic analysis. More recently, non-asymptotic convergence analysis quantifying iteration complexity of Riemannian optimization algorithms has begun to be pursued [Zhang and Sra, 2016, Boumal et al., 2016, Bento et al., 2017]. However, to our knowledge, all these works either focus on unconstrained Riemannian optimization, or handle constraints via projections. In contrast, we explore constrained g-convex optimization within an abstract Rfw framework, by assuming access to a Riemannian "linear oracle." Several applications of Riemannian optimization are known, including to matrix factorization on fixed-rank manifolds [Vandereycken, 2013, Tan et al., 2014], dictionary learning [Cherian and Sra, 2015, Sun et al., 2015], classical optimization under orthogonality constraints [Edelman et al., 1998], averages of rotation matrices [Moakher, 2002], elliptical distributions in statistics [Zhang et al., 2013, Sra and Hosseini, 2013], and Gaussian mixture models [Hosseini and Sra, 2015]. Explicit theory of g-convexity on HPD matrices is studied in [Sra and Hosseini, 2015]. Additional related work corresponding to the Riemannian mean of HPD matrices is discussed in Section 4.

2 Background

We begin by noting some useful background and notation from Riemannian geometry. For a deeper treatment we refer the reader to [Jost, 2011, Chavel, 2006].

A smooth manifold M is a locally Euclidean space equipped with a differential structure. At any point x ∈ M, the set of tangent vectors forms the tangent space T_xM. Our focus is on Riemannian manifolds, i.e., smooth manifolds with a smoothly varying inner product 〈ξ, η〉_x defined on T_xM at each point x ∈ M. We write ‖ξ‖_x := √〈ξ, ξ〉_x for ξ ∈ T_xM; for brevity, we will drop the subscript on the norm whenever the associated tangent space is clear from context. Furthermore, we assume M to be complete.

The exponential map provides a mapping from T_xM to M; it is defined by Exp_x : T_xM → M such that y = Exp_x(g_x) ∈ M along a geodesic γ : [0, 1] → M with γ(0) = x, γ(1) = y and γ̇(0) = g_x ∈ T_xM. We can define an inverse exponential map Exp_x^{-1} : M → T_xM as a diffeomorphism from the neighborhood of x ∈ M onto the neighborhood of 0 ∈ T_xM with Exp_x^{-1}(x) = 0. Note that the completeness of M ensures that both maps are well-defined.

Since tangent spaces are local notions, one cannot directly compare vectors lying in different tangent spaces. To tackle this issue, we use the concept of parallel transport: the idea is to transport (map) a tangent vector along a geodesic to the respective other tangent space. More precisely, let x, y ∈ M with x ≠ y. We transport g_x ∈ T_xM along a geodesic γ (where γ(0) = x and γ(1) = y) to the tangent space T_yM; we denote this by Γ_x^y g_x. Importantly, the inner product on the tangent spaces is preserved under parallel transport, so that 〈ξ_x, η_x〉_x = 〈Γ_x^y ξ_x, Γ_x^y η_x〉_y, where ξ_x, η_x ∈ T_xM, while 〈·, ·〉_x and 〈·, ·〉_y are the respective inner products.

2.1 Gradients, convexity, smoothness

Recall that the Riemannian gradient grad φ(x) is the unique vector in T_xM such that the directional derivative satisfies

Dφ(x)[v] = 〈grad φ(x), v〉_x,  ∀ v ∈ T_xM.

When optimizing functions using gradients, it is useful to impose some added structure. The two main properties we require are sufficiently smooth gradients and geodesic convexity. We say φ : M → R is L-smooth, or that it has L-Lipschitz gradients, if

‖grad φ(y) − Γ_x^y grad φ(x)‖ ≤ L d(x, y),  ∀ x, y ∈ M, g_x ∈ T_xM,    (2.1)

where d(x, y) is the geodesic distance between x and y; equivalently,

φ(y) ≤ φ(x) + 〈g_x, Exp_x^{-1}(y)〉_x + (L/2) d²(x, y),  ∀ x, y ∈ M, g_x ∈ T_xM.    (2.2)

We say φ : M → R is geodesically convex (g-convex) if

φ(y) ≥ φ(x) + 〈g_x, Exp_x^{-1}(y)〉_x,  ∀ x, y ∈ M, g_x ∈ T_xM,    (2.3)


and call it µ-strongly g-convex (µ ≥ 0) if

φ(y) ≥ φ(x) + 〈g_x, Exp_x^{-1}(y)〉_x + (µ/2) d²(x, y),  ∀ x, y ∈ M, g_x ∈ T_xM.    (2.4)

The following observation underscores the reason why g-convexity is a valuable geometric property for optimization.

Proposition 2.1 (Optimality). Let x* ∈ X ⊂ M be a local optimum for (1.1). Then, x* is globally optimal, and 〈grad φ(x*), Exp_{x*}^{-1}(y)〉_{x*} ≥ 0 for all y ∈ X.

2.2 Projection-free vs. Projection-based methods

The growing body of literature on Riemannian optimization considers mostly projection-based methods, such as Riemannian Gradient Descent (RGD) or Riemannian Steepest Descent (RSD) [Absil et al., 2009]. Such methods and their convergence guarantees typically require Lipschitz assumptions. However, the objectives of many classic optimization and machine learning tasks are not Lipschitz on the whole manifold. In such cases, an additional compactness argument is required. However, in projection-based methods, the retraction back onto the manifold is typically not guaranteed to land in this compact set. Thus, additional and potentially expensive work (e.g., an additional projection step) is needed to ensure that the update remains in the compact region where the gradient is Lipschitz. On the other hand, projection-free methods, such as Fw, bypass this issue, because their update is guaranteed to stay within the compact feasible region. Importantly, in some problems, the Riemannian linear oracle at the heart of Fw can be much less expensive than computing a projection back onto the compact set. This is especially significant for the applications highlighted in this paper, where the linear oracle even admits a closed form solution (see Sections 4.1 and 5.1).

3 Riemannian Frank-Wolfe

The condition 〈grad φ(x*), Exp_{x*}^{-1}(y)〉_{x*} ≥ 0 for all y ∈ X in Proposition 2.1 lies at the heart of Frank-Wolfe (also known as "conditional gradient") methods. In particular, if this condition is not satisfied, then there must be a feasible descent direction; Fw schemes seek such a direction and update their iterates [Frank and Wolfe, 1956, Jaggi, 2013]. This high-level idea is equally valid in the Riemannian setting. Algorithm 1 recalls the basic (Euclidean) Fw method which solves min_{x∈X} φ(x), and Algorithm 2 introduces its Riemannian version Rfw, obtained by simply replacing Euclidean objects with their Riemannian counterparts. In the following, 〈·, ·〉 will denote 〈·, ·〉_{x_k} unless otherwise specified.

Algorithm 1 Euclidean Frank-Wolfe without line-search

1: Initialize with a feasible point x_0 ∈ X ⊂ R^n
2: for k = 0, 1, . . . do
3:   Compute z_k ← argmin_{z∈X} 〈∇φ(x_k), z − x_k〉
4:   Let s_k ← 2/(k+2)
5:   Update x_{k+1} ← (1 − s_k) x_k + s_k z_k
6: end for
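To make the update rule concrete, here is a minimal NumPy sketch of Algorithm 1 on the probability simplex, where the linear oracle reduces to picking a vertex. The function name fw_simplex and the toy objective are our own illustrative choices, not part of the paper's implementation.

```python
import numpy as np

def fw_simplex(grad, x0, num_iters=100):
    """Euclidean Frank-Wolfe (Algorithm 1) on the probability simplex.

    The linear oracle argmin_{z in simplex} <grad(x), z> is attained at a
    vertex, i.e., the coordinate with the smallest partial derivative.
    """
    x = x0.copy()
    for k in range(num_iters):
        g = grad(x)
        z = np.zeros_like(x)
        z[np.argmin(g)] = 1.0        # step 3: linear oracle (best vertex)
        s = 2.0 / (k + 2.0)          # step 4: diminishing step size
        x = (1 - s) * x + s * z      # step 5: convex combination stays feasible
    return x

# Toy usage: minimize ||x - b||^2 over the simplex.
b = np.array([0.1, 0.7, 0.2])
x_star = fw_simplex(lambda x: 2 * (x - b), np.ones(3) / 3)
```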

Notice that to implement Algorithm 1, X must be compact. Convexity ensures that after the update in Step 5, x_{k+1} remains feasible, while compactness ensures that the linear oracle in Step 3 has a solution. To obtain Rfw, we first replace the linear oracle (Step 3 in Algorithm 1) with the Riemannian "linear oracle":

max_{z ∈ X} 〈grad φ(x_k), Exp_{x_k}^{-1}(z)〉,    (3.1)

where now X is assumed to be a compact g-convex set. Similarly, observe that Step 5 of Algorithm 1 updates the current iterate x_k along a straight line joining x_k with z_k. Thus, by analogy, we replace this step by moving x_k along a geodesic joining x_k with z_k. The resulting Rfw algorithm is presented as Algorithm 2.

Algorithm 2 Riemannian Frank-Wolfe (Rfw) for g-convex optimization

1: Initialize x_0 ∈ X ⊆ M; assume access to the geodesic map γ : [0, 1] → M
2: for k = 0, 1, . . . do
3:   z_k ← argmax_{z∈X} 〈grad φ(x_k), Exp_{x_k}^{-1}(z)〉
4:   Let s_k ← 2/(k+2)
5:   x_{k+1} ← γ(s_k), where γ(0) = x_k and γ(1) = z_k
6: end for
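The Riemannian loop has exactly the same skeleton as Algorithm 1; the following sketch (our own, with grad, linear_oracle and geodesic passed in as placeholder callables for the manifold-specific operations developed in Sections 4 and 5) only illustrates the structure of Algorithm 2.

```python
def rfw(grad, linear_oracle, geodesic, x0, num_iters=100):
    """Abstract Riemannian Frank-Wolfe (Algorithm 2), as a sketch.

    grad(x) returns the Riemannian gradient; linear_oracle(x, g) returns the
    output of the Riemannian "linear oracle" (3.1) over the constraint set;
    geodesic(x, z, t) evaluates the geodesic from x to z at time t.
    """
    x = x0
    for k in range(num_iters):
        g = grad(x)
        z = linear_oracle(x, g)      # step 3: Riemannian linear oracle
        s = 2.0 / (k + 2.0)          # step 4: step size
        x = geodesic(x, z, s)        # step 5: move along the geodesic
    return x
```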

While we obtained Algorithm 2 purely by analogy, we must still show that this analogy results in a valid algorithm. In particular, we need to show that Algorithm 2 converges to a solution of (1.1). We will in fact prove a stronger result, namely that Rfw converges globally at the rate O(1/k), i.e., φ(x_k) − φ(x*) = O(1/k), which matches the rate of the usual Euclidean Fw method.

3.1 Convergence analysis

We make the following smoothness assumption:

Assumption 1 (Smoothness). The objective φ has a locally Lipschitz continuous gradient on X, that is, there exists a constant L such that for all x, y ∈ X we have

‖grad φ(x) − Γ_y^x grad φ(y)‖ ≤ L d(x, y).    (3.2)

Next, we introduce a quantity that will play a central role in the convergence rate of Rfw, namely the curvature constant

M_φ := sup_{x,y,z∈X} (2/η²) [φ(y) − φ(x) − 〈grad φ(x), Exp_x^{-1}(y)〉].    (3.3)

Here and in the following, y = γ(η) for some η ∈ [0, 1] and a geodesic map γ : [0, 1] → M with γ(0) = x and γ(1) = z. Lemma 3.1 relates the curvature constant (3.3) to the Lipschitz constant L.

Lemma 3.1. Let φ : M → R be L-smooth on X, and let diam(X) := sup_{x,y∈X} d(x, y). Then, the curvature constant M_φ satisfies the bound M_φ ≤ L diam(X)².

Proof. Let x, z ∈ X and η ∈ (0, 1); let y = γ_xz(η) be a point on the geodesic joining x with z. This implies (1/η²) d(x, y)² = d(x, z)². From (2.2) we know that |φ(y) − φ(x) − 〈grad φ(x), Exp_x^{-1}(y)〉| ≤ (L/2) d(x, y)², whereupon using the definition of the curvature constant we obtain

M_φ ≤ sup (2/η²) · (L/2) d(x, y)² = sup L d(x, z)² ≤ L · diam(X)².    (3.4)

We note below an analog of the Lipschitz inequality (2.2) using the constant Mφ.

Lemma 3.2 (Lipschitz). Let x, y, z ∈ X and η ∈ [0, 1] with y = γ_xz(η). Then,

φ(y) ≤ φ(x) + η〈grad φ(x), Exp_x^{-1}(z)〉 + (1/2) η² M_φ.


Proof. From definition (3.3) of the constant M_φ we see that

M_φ ≥ (2/η²) (φ(y) − φ(x) − 〈grad φ(x), Exp_x^{-1}(y)〉),

which we can rewrite as

φ(y) ≤ φ(x) + 〈grad φ(x), Exp_x^{-1}(y)〉 + (1/2) η² M_φ.    (3.5)

Furthermore, since y = γ_xz(η), we have Exp_x^{-1}(y) = η Exp_x^{-1}(z), and therefore

〈grad φ(x), Exp_x^{-1}(y)〉 = 〈grad φ(x), η Exp_x^{-1}(z)〉 = η〈grad φ(x), Exp_x^{-1}(z)〉.

Plugging this equation into (3.5), the claim follows.

We need one more technical lemma (easily verified by a quick induction).

Lemma 3.3 (Stepsize for Rfw). Let (a_k)_{k∈I} be a nonnegative sequence that fulfills

a_{k+1} ≤ (1 − s_k) a_k + (1/2) s_k² M_φ.    (3.6)

If s_k = 2/(k+2), then a_k ≤ 2M_φ/(k+2).
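For completeness, the quick induction behind Lemma 3.3 can be sketched as follows (reading the bound for k ≥ 1 and assuming s_k = 2/(k+2)); this derivation is our own filling-in of the omitted argument.

```latex
% Base case: a_1 \le (1 - s_0) a_0 + \tfrac{1}{2} s_0^2 M_\phi = \tfrac{1}{2} M_\phi \le \tfrac{2 M_\phi}{3}.
% Inductive step: assume a_k \le \tfrac{2 M_\phi}{k+2}; then
\begin{align*}
a_{k+1}
  &\le \Bigl(1 - \tfrac{2}{k+2}\Bigr)\frac{2 M_\phi}{k+2} + \frac{2 M_\phi}{(k+2)^2}
   = \frac{2 M_\phi (k+1)}{(k+2)^2}
   \le \frac{2 M_\phi}{k+3},
\end{align*}
% where the last inequality uses (k+1)(k+3) \le (k+2)^2.
```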

We are now ready to state our first main convergence result, Theorem 3.4, which establishes a global iteration complexity for Rfw.

Theorem 3.4 (Rate). Let s_k = 2/(k+2), and let X* be a minimum of φ. Then, the sequence of iterates X_k generated by Algorithm 2 satisfies φ(X_k) − φ(X*) = O(1/k).

Proof. The proof of this claim is relatively straightforward; indeed,

φ(X_{k+1}) − φ(X*)
  ≤ φ(X_k) − φ(X*) + s_k 〈grad φ(X_k), Exp_{X_k}^{-1}(Z_k)〉 + (1/2) s_k² M_φ
  ≤ φ(X_k) − φ(X*) + s_k 〈grad φ(X_k), Exp_{X_k}^{-1}(X*)〉 + (1/2) s_k² M_φ
  ≤ φ(X_k) − φ(X*) − s_k (φ(X_k) − φ(X*)) + (1/2) s_k² M_φ
  = (1 − s_k)(φ(X_k) − φ(X*)) + (1/2) s_k² M_φ,

where the first inequality follows from Lemma 3.2, while the second one follows from Z_k being the solution of the linear oracle in Step 3 of the algorithm. The third inequality follows from g-convexity of φ. Setting a_k = φ(X_k) − φ(X*) in Lemma 3.3 we immediately obtain

φ(X_k) − φ(X*) ≤ 2M_φ/(k + 2),  k ≥ 0,

which is the desired O(1/k) convergence rate.

Theorem 3.4 provides a global sublinear convergence rate for Rfw. Typically, Fw methods trade off their simplicity for this slower convergence rate, and even for smooth strongly convex objectives they do not attain linear convergence rates [Jaggi, 2013]. We study in Section 3.2 a setting that permits Rfw to attain a linear rate of convergence.


3.2 Linear convergence of Rfw

Assuming that the solution lies in the strict interior of the constraint set, Euclidean Fw is known to converge at a linear rate [GueLat and Marcotte, 1986, Garber and Hazan, 2015]. Remarkably, under a similar assumption, Rfw also displays global linear convergence. For the special case of the geometric matrix mean that we analyze in the next section, this strict interiority assumption will always be valid (as long as not all the matrices are the same).

We will make use of a Riemannian extension of the well-known Polyak-Łojasiewicz (PL) inequality [Polyak, 1963, Lojasiewicz, 1963], which we define below. Consider the minimization

min_{x ∈ M} f(x),

and let f* be the optimal function value. We say that f satisfies the PL inequality if for some µ > 0,

(1/2) ‖grad f(x)‖² ≥ µ (f(x) − f*),  ∀ x ∈ M.    (3.7)

Inequality (3.7) is weaker than strong convexity (and is in fact implied by it). It has been widely used for establishing linear convergence rates of gradient-based methods; see [Karimi et al., 2016] for several (Euclidean) examples, and [Zhang et al., 2016] for a Riemannian example. We will make use of inequality (3.7) for obtaining linear convergence of Rfw, by combining it with a strict interiority condition on the minimum.

Theorem 3.5 (Linear convergence of Rfw). Suppose that φ is strongly g-convex with constant µ and that its minimum lies in a ball of radius r that lies strictly inside the constraint set X. Define ∆_k := φ(X_k) − φ(X*) and let the step-size s_k = r√(µ∆_k)/(√2 M_φ). Then, Rfw converges linearly, since it satisfies the bound

∆_{k+1} ≤ (1 − r²µ/(4M_φ)) ∆_k.

Proof. Let B_r(X*) ⊂ X be a ball of radius r containing the optimum. Let

W_k := argmax_{ξ∈T_{X_k}, ‖ξ‖≤1} 〈ξ, grad φ(X_k)〉

be the direction of steepest descent in the tangent space T_{X_k}. The point P_k = Exp_{X_k}(rW_k) lies in X. Consider now the following inequality

〈−Exp_{X_k}^{-1}(P_k), grad φ(X_k)〉 = −〈grad φ(X_k), rW_k〉 ≤ −r‖grad φ(X_k)‖,    (3.8)

which follows upon using the definition of W_k. Thus, we have the bound

∆_{k+1} ≤ ∆_k + s_k 〈grad φ(X_k), Exp_{X_k}^{-1}(X_{k+1})〉 + (1/2) s_k² M_φ
       ≤ ∆_k − s_k r ‖grad φ(X_k)‖ + (1/2) s_k² M_φ
       ≤ ∆_k − s_k r √(2µ) √∆_k + (1/2) s_k² M_φ,

where the first inequality follows from the Lipschitz bound (Lemma 3.2), the second one from (3.8), and the third one from the PL inequality (which, in turn, holds due to the µ-strong g-convexity of φ). Now setting the step size s_k = r√(µ∆_k)/(√2 M_φ), we obtain

∆_{k+1} ≤ (1 − r²µ/(4M_φ)) ∆_k,

which delivers the claimed linear convergence rate.


Theorem 3.5 provides a setting where Rfw can converge fast; however, it uses step sizes s_k that require knowing φ(X*) (this step size choice is reminiscent of the so-called "Polyak stepsizes" used in the convergence analysis of subgradient methods [Polyak, 1987]). In case this value is not available, we can use a worse value, which will still yield the desired inequality.

3.3 Rfw for nonconvex problems

Finally, we want to consider the case where φ in Eq. 1.1 may be nonconvex. We cannot hope to find the global minimum with first-order methods, such as Rfw. However, we can compute a first-order critical point via Rfw. For the setup and analysis, we closely follow S. Lacoste-Julien's analysis of the Euclidean case [Lacoste-Julien, 2016].

We first introduce the Frank-Wolfe gap as a criterion for evaluating convergence rates. For X ∈ X, we write

G(X) := max_{Z∈X} 〈Exp_X^{-1}(Z), −grad φ(X)〉.    (3.9)

With this, we can show the following sublinear convergence guarantee:

Theorem 3.6 (Rate (nonconvex case)). Let G_K := min_{0≤k≤K} G(X_k) (where G(X_k) denotes the Frank-Wolfe gap at X_k). After K iterations of Algorithm 2, we have G_K ≤ max{2h_0, M_φ}/√(K+1).

Proof. We follow the proof of Thm. 3.4 to get a bound on the update and then substitute the FW gap in the directional term:

φ(X_{k+1}) ≤ φ(X_k) + s_k 〈grad φ(X_k), Exp_{X_k}^{-1}(Z_k)〉 + (s_k² M_φ)/2,  where 〈grad φ(X_k), Exp_{X_k}^{-1}(Z_k)〉 = −G(X_k),    (3.10)

           ≤ φ(X_k) − s_k G(X_k) + (s_k² M_φ)/2.    (3.11)

The step size that minimizes the sk-terms on the right hand side is given by

s_k* = min{ G(X_k)/M_φ, 1 },    (3.12)

which gives

φ(X_{k+1}) ≤ φ(X_k) − min{ G(X_k)²/(2M_φ), G(X_k) − (M_φ/2)·1_{G(X_k)>M_φ} }.    (3.13)

Iterating over this scheme gives after K steps:

φ(X_{K+1}) ≤ φ(X_0) − ∑_{k=0}^{K} min{ G(X_k)²/(2M_φ), G(X_k) − (M_φ/2)·1_{G(X_k)>M_φ} }.    (3.14)

Let GK = min0≤k≤K G(Xk) be the minimum FW gap up to iteration step K. Then,

φ(X_{K+1}) ≤ φ(X_0) − (K + 1) · min{ G_K²/(2M_φ), G_K − (M_φ/2)·1_{G_K>M_φ} }.    (3.15)

The minimum can attain two different values, i.e., we have to consider two different cases. For the following analysis, we introduce the initial global suboptimality

h_0 := φ(X_0) − min_{Z∈X} φ(Z).    (3.16)

Note that h_0 ≥ φ(X_0) − φ(X_{K+1}).


1. G_K ≤ M_φ:

φ(X_{K+1}) ≤ φ(X_0) − (K + 1) G_K²/(2M_φ),    (3.17)

φ(X_0) − φ(X_{K+1}) ≥ (K + 1) G_K²/(2M_φ),  where φ(X_0) − φ(X_{K+1}) ≤ h_0,    (3.18)

√(2h_0 M_φ/(K + 1)) ≥ G_K.    (3.19)

2. G_K > M_φ:

φ(X_{K+1}) ≤ φ(X_0) − (K + 1)(G_K − M_φ/2),    (3.20)

h_0 ≥ (K + 1)(G_K − M_φ/2),    (3.21)

h_0/(K + 1) + M_φ/2 ≥ G_K.    (3.22)

The bound on the second case can be refined with the observation that G_K > M_φ iff h_0 > M_φ/2: Assuming G_K > M_φ, we have

h_0/(K + 1) + M_φ/2 ≥ G_K > M_φ    (3.23)
⇒ h_0/(K + 1) > M_φ/2    (3.24)
⇒ 2h_0/M_φ > K + 1.    (3.25)

However, then h_0 ≤ M_φ/2 would imply 1 > K + 1, contradicting K > 0; i.e., G_K > M_φ iff h_0 > M_φ/2. With this we can rewrite the bound for the second case:

h_0/(K + 1) + M_φ/2 = (h_0/√(K + 1)) ( 1/√(K + 1) + M_φ √(K + 1)/(2h_0) )    (3.26)
  ≤† (h_0/√(K + 1)) ( 1/√(K + 1) + √(M_φ/(2h_0)) )    (3.27)
  ≤‡ (h_0/√(K + 1)) ( 1/√(K + 1) + 1 )    (3.28)
  ≤ 2h_0/√(K + 1),    (3.29)

where the last step uses 1/√(K + 1) + 1 ≤ 2. Here, (†) holds due to Eq. (3.25) and (‡) follows from h_0 > M_φ/2.

In summary, we now have

G_K ≤ 2h_0/√(K+1)  if h_0 > M_φ/2,  and  G_K ≤ √(2h_0 M_φ/(K+1))  otherwise.    (3.30)

The upper bounds on the FW gap given by Eq. (3.30) imply convergence rates of O(1/√K), completing the proof.

4 Specializing RFW for HPD matrices

In this section we study a concrete setting for Rfw, namely, a class of g-convex optimization problems with Hermitian positive definite (HPD) matrices. The most notable aspect of the concrete class of problems that we study is that the Riemannian linear oracle (3.1) will be shown to admit an efficient solution, thereby allowing an efficient implementation of Algorithm 2. The concrete class of problems that we consider is the following:

min_{X ∈ X ⊆ P_d} φ(X),  where X := {X ∈ P_d | L ⪯ X ⪯ U},    (4.1)

where φ is a g-convex function and X is a "positive definite interval" (which is easily seen to be a g-convex set). Note that the set X actually does not admit an easy projection for matrices (contrary to the scalar case). Problem (4.1) captures several g-convex optimization problems, of which perhaps the best known is the task of computing the matrix geometric mean (also known as the Riemannian centroid or Karcher mean); see Section 4.2.1.

We briefly recall some facts about the Riemannian geometry of HPD matrices below. We denote by H_d the set of d × d Hermitian matrices. The most common (nonlinear) Riemannian geometry on P_d is induced by

〈A, B〉_X := tr(X^{-1} A X^{-1} B),  where A, B ∈ H_d.    (4.2)

This metric induces the geodesic γ : [0, 1] → P_d between X, Y ∈ P_d given by

γ(t) := X^{1/2} (X^{-1/2} Y X^{-1/2})^t X^{1/2},  t ∈ [0, 1].    (4.3)

The corresponding Riemannian distance is

d(X, Y) := ‖log(X^{-1/2} Y X^{-1/2})‖_F,  X, Y ∈ P_d.    (4.4)

The Riemannian gradient grad φ is obtained from the Euclidean one ∇φ as follows:

grad φ(X) = (1/2) X [∇φ(X) + (∇φ(X))*] X =: X ∇^H φ(X) X;    (4.5)

here we use ∇^H φ to denote (Hermitian) symmetrization. The exponential map and its inverse at a point X ∈ P_d are respectively given by

Exp_X(A) = X^{1/2} exp(X^{-1/2} A X^{-1/2}) X^{1/2},  A ∈ T_X P_d ≡ H_d,
Exp_X^{-1}(Y) = X^{1/2} log(X^{-1/2} Y X^{-1/2}) X^{1/2},  X, Y ∈ P_d,

where exp(·) and log(·) denote the matrix exponential and logarithm, respectively. Observe that using (4.5) we obtain the identity

〈grad φ(X), Exp_X^{-1}(Y)〉_X = 〈X^{1/2} ∇^H φ(X) X^{1/2}, log(X^{-1/2} Y X^{-1/2})〉.    (4.6)
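For concreteness, these primitives translate directly into code. The sketch below is our own NumPy illustration of Eqs. (4.3)-(4.5) and the Exp/Exp^{-1} maps; the helper names (spd_dist, spd_geodesic, spd_exp, spd_log, spd_rgrad) are hypothetical and are not the Matlab implementation used in Section 6.

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm

def _sym(A):
    return 0.5 * (A + A.conj().T)

def spd_dist(X, Y):
    """Riemannian distance d(X, Y) = ||log(X^{-1/2} Y X^{-1/2})||_F, Eq. (4.4)."""
    Xih = np.linalg.inv(sqrtm(X))
    return np.linalg.norm(logm(Xih @ Y @ Xih), 'fro')

def spd_geodesic(X, Y, t):
    """Geodesic gamma(t) = X^{1/2} (X^{-1/2} Y X^{-1/2})^t X^{1/2}, Eq. (4.3)."""
    Xh = sqrtm(X); Xih = np.linalg.inv(Xh)
    w, V = np.linalg.eigh(_sym(Xih @ Y @ Xih))
    return Xh @ (V @ np.diag(w ** t) @ V.conj().T) @ Xh

def spd_exp(X, A):
    """Exp_X(A) = X^{1/2} exp(X^{-1/2} A X^{-1/2}) X^{1/2}."""
    Xh = sqrtm(X); Xih = np.linalg.inv(Xh)
    return Xh @ expm(Xih @ A @ Xih) @ Xh

def spd_log(X, Y):
    """Exp_X^{-1}(Y) = X^{1/2} log(X^{-1/2} Y X^{-1/2}) X^{1/2}."""
    Xh = sqrtm(X); Xih = np.linalg.inv(Xh)
    return Xh @ logm(Xih @ Y @ Xih) @ Xh

def spd_rgrad(X, egrad):
    """Riemannian gradient grad phi(X) = X * sym(egrad) * X, Eq. (4.5)."""
    return X @ _sym(egrad) @ X
```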

With these details, Algorithm 2 can almost be applied to (4.1); the most crucial remaining component is the Riemannian linear oracle, which we now describe.

4.1 Solving the Riemannian linear oracle

For solving (4.1), the Riemannian linear oracle (see (3.1)) requires solving

max_{L ⪯ Z ⪯ U} 〈X_k^{1/2} ∇^H φ(X_k) X_k^{1/2}, log(X_k^{-1/2} Z X_k^{-1/2})〉.    (4.7)


Problem (4.7) is a non-convex optimization problem over HPD matrices. However, remarkably, it turns out to have a closed form solution. Theorem 4.1 presents this solution and is our main technical result for Section 4.

Theorem 4.1. Let L, U ∈ P_d such that L ≺ U. Let S ∈ H_d and X ∈ P_d be arbitrary. Then, the solution to the optimization problem

max_{L ⪯ Z ⪯ U} tr(S log(X Z X)),    (4.8)

is given by Z = X^{-1} Q (P*[−sgn(D)]_+ P + L̂) Q* X^{-1}, where S = Q D Q* is a diagonalization of S, and Û − L̂ = P*P with L̂ = Q* X L X Q and Û = Q* X U X Q.

For the proof of Theorem 4.1, we need two fundamental lemmas about eigenvalues of Hermitian matrices (Lemmas 4.2, 4.3).

Lemma 4.2 (Problem III.6.14, [Bhatia, 1997]). Let X, Y be HPD matrices. Then

λ↓(X) · λ↑(Y) ≺w λ(XY) ≺w λ↓(X) · λ↓(Y),

where λ↓(X) (λ↑) denotes eigenvalues of X arranged in decreasing (increasing) order, ≺w denotes the weak-majorization order, and · denotes elementwise product.

Lemma 4.3 (Corollary III.2.2, [Bhatia, 1997]). Let A, B ∈ H_d. Then, for all i = 1, . . . , d:

λ_i^↓(A) + λ_d^↓(B) ≤ λ_i^↓(A + B) ≤ λ_i^↓(A) + λ_1^↓(B).

Proof (Theorem 4.1). First, introduce the variable Y = XZX; then (4.8) becomes

max_{L′ ⪯ Y ⪯ U′} tr(S log Y),    (4.9)

where the constraints have also been modified to L′ ⪯ Y ⪯ U′, where L′ = XLX and U′ = XUX. Diagonalizing S as S = QDQ*, we see that tr(S log Y) = tr(D log W), where W = Q*YQ. Thus, instead of (4.9) it suffices to solve

max_{L′′ ⪯ W ⪯ U′′} tr(D log W),    (4.10)

where L′′ = Q*L′Q and U′′ = Q*U′Q. We have the strict inequality U′′ ≻ L′′; thus, our constraints are 0 ≺ W − L′′ ⪯ U′′ − L′′, which we may rewrite as 0 ≺ R ⪯ I, where R = (U′′ − L′′)^{-1/2} (W − L′′) (U′′ − L′′)^{-1/2}; notice that

W = (U′′ − L′′)^{1/2} R (U′′ − L′′)^{1/2} + L′′.

Thus, problem (4.10) now turns into

max_{0 ≺ R ⪯ I} tr(D log(P*RP + L′′)),    (4.11)

where U′′ − L′′ = P*P. To maximize the trace, we have to maximize the sum of the eigenvalues of the matrix term. Using Lemma 4.2 we see that

λ(D log(P*RP + L′′)) ≻_w λ^↑(D) · λ^↓(log(P*RP + L′′)).

Noticing that D is diagonal, we now have to consider two cases:

(i) If d_ii > 0, the corresponding element of λ^↓(P*RP + L′′) should be minimized.

(ii) If d_ii ≤ 0, the corresponding element of λ^↓(P*RP + L′′) should be maximized.


Notice now the elementary fact that P*RP + L′′ is operator monotone in R. Let

r_ii = 0 if d_ii ≥ 0,  and  r_ii = 1 if d_ii < 0.    (4.12)

We show that (4.12) provides the best possible choice for R, i.e., achieves the respective minimum and maximum in cases (i) and (ii). For (ii) with d_ii ≤ 0 we have

λ_i^↓(P*RP + L′′) [linear in R]  ≤  λ_i^↓(P*RP) [max. contribution] + λ_1^↓(L′′) [constant].    (4.13)

For (i) with d_ii > 0 we have

λ_i^↓(P*RP + L′′) [linear in R]  ≥  λ_i^↓(P*RP) [min. contribution] + λ_d^↓(L′′) [constant].    (4.14)

λ_i^↓(P*RP) is maximized (minimized) if λ_i^↓(R) is maximized (minimized). With 0 ≺ R ⪯ I this gives r_ii = 1 for d_ii ≤ 0 (case (ii)) and r_ii = 0 for d_ii > 0 (case (i)). Therefore, (4.12) is the optimal choice.

Thus, we see that Y = Q(P*RP + L′′)Q* = Q(P*[−sgn(D)]_+ P + L′′)Q*, and we immediately obtain the optimal Z = X^{-1} Y X^{-1}.
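For illustration, the closed-form expression of Theorem 4.1 can be transcribed directly. The following NumPy sketch (function names linear_oracle_psd_interval and _psd_sqrt are ours, not part of the paper's Matlab implementation) mirrors the construction Z = X^{-1} Q (P*[−sgn(D)]_+ P + L̂) Q* X^{-1}.

```python
import numpy as np

def _psd_sqrt(M):
    """Hermitian square root of a positive semidefinite matrix via eigh."""
    w, V = np.linalg.eigh(0.5 * (M + M.conj().T))
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def linear_oracle_psd_interval(S, X, L, U):
    """Closed form of Theorem 4.1 for the oracle tr(S log(X Z X)) over L <= Z <= U
    (Loewner order), with S Hermitian and X, L, U positive definite, L < U."""
    d, Q = np.linalg.eigh(0.5 * (S + S.conj().T))   # S = Q diag(d) Q*
    Lhat = Q.conj().T @ X @ L @ X @ Q               # transformed lower bound
    Uhat = Q.conj().T @ X @ U @ X @ Q               # transformed upper bound
    P = _psd_sqrt(Uhat - Lhat)                      # Uhat - Lhat = P* P (P Hermitian)
    R = np.diag((d < 0).astype(float))              # R = [-sgn(D)]_+ as in (4.12)
    Y = Q @ (P.conj().T @ R @ P + Lhat) @ Q.conj().T
    Xinv = np.linalg.inv(X)
    return Xinv @ Y @ Xinv                          # Z = X^{-1} Y X^{-1}
```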

Remark 4.4. Computing the optimal direction Z_k takes one Cholesky factorization, two matrix square roots (Schur method), eight matrix multiplications and one eigenvalue decomposition. This gives a complexity of O(N³). On our machines, we report ≈ (1/3)N³ + 2 × 28.3N³ + 8 × 2N³ + 20N³ ≈ 93N³.

4.2 Application to the Riemannian mean

4.2.1 Computing the Riemannian mean

Statistical inference frequently involves computing averages of input quantities. Typically encountered data lies in Euclidean spaces where arithmetic means are the "natural" notions of averaging. However, the Euclidean setting is not always the most natural or efficient way to represent data. Many applications involve non-Euclidean data such as graphs, strings, or matrices [Le Bihan et al., 2001, Nielsen and Bhatia, 2013]. In such applications, it is often beneficial to represent the data in its natural space and adapt classic tools to the specific setting. In other cases, a problem might be very hard to solve in Euclidean space, but may become more accessible when viewed through a different geometric lens.

This section considers one of the latter cases, namely the problem of determining the geometric matrix mean (Karcher mean problem). While there exists an intuitive notion for the geometric mean of sets of positive real numbers, this notion does not immediately generalize to sets of positive definite matrices due to the lack of commutativity on matrix spaces. Over a collection of Hermitian, positive definite matrices the geometric mean can be viewed as the geometric optimization problem

G := argmin_{X ≻ 0} [ φ(X) = ∑_{i=1}^{m} w_i δ_R²(X, A_i) ].    (4.15)

In a Euclidean setting, the problem is non-convex. However, one can view Hermitian, positive matrices as points on a Riemannian manifold and compute the geometric mean as the Riemannian centroid. The corresponding optimization problem Eq. 4.15 is geodesically convex. In this section we look at the problem through both geometric lenses and provide efficient algorithmic solutions while illustrating the benefits of switching geometric lenses in geometric optimization problems.

There exists a large body of work on the problem of computing the geometric matrix mean [Jeuris et al., 2012]. Classic algorithms like Newton's method or Gradient Descent (GD) have been successfully applied to the problem. Standard toolboxes implement efficient variations of GD like Steepest Descent or Conjugate Gradient (Manopt [Boumal et al., 2014]) or Richardson-like linear gradient descent (Matrix Means Toolbox [Bini and Iannazzo, 2013]). Recent work by Yuan et al. [Yuan et al., 2017] analyzes condition numbers of Hessians in Riemannian and Euclidean steepest-descent approaches that provide theoretical arguments for the good performance of Riemannian approaches.

Recently, T. Zhang developed a majorization-minimization approach with asymptotic linear convergence [Zhang, 2017]. In this section we apply the above-introduced Riemannian version of the classic conditional gradient method by M. Frank and P. Wolfe [Frank and Wolfe, 1956] (Rfw) for computing the geometric matrix mean in Riemannian settings with linear convergence. Here, we exploit the geodesic convexity of the problem. To complement this analysis, we show that recent advances in non-convex analysis [Lacoste-Julien, 2016] can be used to develop a Frank-Wolfe scheme (Efw) for the non-convex Euclidean case (see Appendix A).

In the simple case of two PSD matrices X and Y one can view the geometric mean as their metric midpoint, computed by Kubo and Ando [1979] as

G(X, Y) = X #_t Y = X^{1/2} (X^{-1/2} Y X^{-1/2})^t X^{1/2}.    (4.16)

More generally, for a collection of M matrices, the geometric mean can be seen as a minimization problem over the sum of squared distances [Bhatia and Holbrook, 2006],

G(A_1, ..., A_M) = argmin_{X ≻ 0} ∑_{i=1}^{M} δ_R²(X, A_i),    (4.17)

with the Riemannian distance function

d(X, Y) = ‖log(X^{-1/2} Y X^{-1/2})‖.    (4.18)

Here we consider the more general weighted geometric mean:

G(A_1, ..., A_M) = argmin_{X ≻ 0} ∑_{i=1}^{M} w_i δ_R²(X, A_i).    (4.19)

E. Cartan showed in a Riemannian setting that a global minimum exists, which led to the term Cartan mean frequently used in the literature. In this setting, one can view the collection of matrices as points on a Riemannian manifold. H. Karcher associated the minimization problem with that of finding centers of masses on these manifolds [Karcher, 1977], hence motivating a second term to describe the geometric matrix mean (Karcher mean).

The geometric matrix mean enjoys several key properties. We list below the ones of crucial importance to our paper and refer the reader to Lawson and Lim [2014], Lim and Palfia [2012] for a more extensive list. To state these results, we recall the general form of the two other basic means of operators, the (weighted) harmonic and arithmetic means, denoted by H and A respectively:

H := (∑_{i=1}^{M} w_i A_i^{-1})^{-1},  A := ∑_{i=1}^{M} w_i A_i.

Then, one can show the following well-known operator inequality that relates H, G, and A:

Lemma 4.5 (Means Inequality, Bhatia [2007]). Let A_1, . . . , A_M ≻ 0, and let H, G, and A denote their (weighted) harmonic, geometric, and arithmetic means, respectively. Then,

H ⪯ G ⪯ A.    (4.20)

The key computational burden of all our algorithms lies in computing the gradient of the objective function (4.15). A short calculation (see, e.g., [Bhatia, 2007, Ch. 6]) shows that if f(X) = δ_R²(X, A), then ∇f(X) = X^{-1} log(XA^{-1}). Thus, we immediately obtain

∇φ(X) = ∑_i w_i X^{-1} log(X A_i^{-1}).


4.2.2 Implementation

We compute the geometric matrix mean with Algorithm 2. For the PSD manifold, line 3 can be written as

Z_k ← argmax_{H ⪯ Z ⪯ A} 〈X_k^{1/2} ∇φ(X_k) X_k^{1/2}, log(X_k^{-1/2} Z X_k^{-1/2})〉.    (4.21)

Note that the operator inequality H ⪯ G ⪯ A given by Lem. 4.5 plays a crucial role: It shows that the optimal solution lies in a compact set, so we may as well impose this compact set as a constraint in the optimization problem (i.e., we set X = {Z : H ⪯ Z ⪯ A}). We implement the linear oracles as discussed above: In the Euclidean case, a closed-form solution is given by

Z = H + P* Q [−sgn(Λ)]_+ Q* P,    (4.22)

where A − H = P*P and P ∇φ(X_k) P* = QΛQ*. Analogously, for the Riemannian case, the linear oracle

max_{H ⪯ Z ⪯ A} 〈∇φ(X_k), log(Z)〉,    (4.23)

is well defined and solved by

Z = Q (P*[−sgn(Λ)]_+ P + H̄) Q*,    (4.24)

with Ā − H̄ = P*P, Ā = Q*AQ, H̄ = Q*HQ and ∇φ(X_k) = QΛQ* (see Prop. 4.21). The resulting Frank-Wolfe method for the geometric matrix mean is given by Algorithm 3.

Algorithm 3 FW for fast Geometric mean (GM) / Wasserstein mean (WM)

1: (A_1, . . . , A_N), w ∈ R_+^N
2: X ≈ argmin_{X ≻ 0} ∑_i w_i δ_R²(X, A_i)
3: β = min_{1≤i≤N} λ_min(A_i)
4: for k = 0, 1, . . . do
5:   Compute gradient.
6:     GM: ∇φ(X_k) = X_k^{-1} (∑_i w_i log(X_k A_i^{-1}))
7:     WM: ∇φ(X_k) = ∑_i w_i (I − (A_i X_k)^{-1/2} A_i)
8:   Compute Z_k:
9:     GM: Z_k ← argmax_{H ⪯ Z ⪯ A} 〈X_k^{1/2} ∇φ(X_k) X_k^{1/2}, log(X_k^{-1/2} Z X_k^{-1/2})〉
10:    WM: Z_k ← argmax_{βI ⪯ Z ⪯ A} 〈X_k^{1/2} ∇φ(X_k) X_k^{1/2}, log(X_k^{-1/2} Z X_k^{-1/2})〉
11:   Let α_k ← 2/(k+2).
12:   Update X:
13:     X_{k+1} ← X_k #_{α_k} Z_k
14: end for
15: return X = X_k
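A compact scaffold of the GM branch of Algorithm 3 might look as follows. This is only a sketch: the closed-form oracle of Theorem 4.1 and the geodesic (4.3) are passed in as assumed helpers (for instance, the hypothetical linear_oracle_psd_interval and spd_geodesic sketched earlier), and this is not the paper's Matlab code.

```python
import numpy as np
from scipy.linalg import logm

def rfw_matrix_mean(As, w, oracle, geodesic, num_iters=50):
    """Scaffold of Algorithm 3 (GM branch).

    oracle(X, egrad, H, A) stands in for the Riemannian linear oracle over the
    interval H <= Z <= A (Theorem 4.1); geodesic(X, Z, t) evaluates X #_t Z.
    """
    H = np.linalg.inv(sum(wi * np.linalg.inv(Ai) for wi, Ai in zip(w, As)))  # harmonic mean
    A = sum(wi * Ai for wi, Ai in zip(w, As))                                # arithmetic mean
    X = H.copy()                                   # feasible start: H <= G <= A (Lemma 4.5)
    for k in range(num_iters):
        Xinv = np.linalg.inv(X)
        egrad = sum(wi * Xinv @ logm(X @ np.linalg.inv(Ai))
                    for wi, Ai in zip(w, As))      # step 6: Euclidean gradient
        Z = oracle(X, egrad, H, A)                 # steps 8-9: linear oracle
        alpha = 2.0 / (k + 2.0)                    # step 11: step size
        X = geodesic(X, Z, alpha)                  # step 13: X_{k+1} = X_k #_alpha Z_k
    return X
```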

4.3 Application to Bures-Wasserstein barycenters

4.3.1 Computing Bures-Wasserstein barycenters on the Gaussian density manifold

As a second application of Rfw, we consider the computation of Bures-Wasserstein barycenters of multivariate (centered) Gaussians. This application is motivated by optimal transport theory; in particular, the Bures-Wasserstein barycenter is the solution to the multi-marginal transport problem [Malago et al., 2018, Bhatia et al., 2018b]: Let {A_i}_{i=1}^{m} ⊂ P_d and let w = (w_1, ..., w_m) be a vector of weights (∑_i w_i = 1; w_i > 0 ∀i). The minimization task

min_{X ∈ P_d} ∑_{i=1}^{m} w_i d_W²(X, A_i)    (4.25)

is called the multi-marginal transport problem. Its unique minimizer

Ω(w; A_i) = argmin_{X ≻ 0} ∑_{i=1}^{m} w_i d_W²(X, A_i) ∈ P_d

is the Bures-Wasserstein barycenter. Here, d_W denotes the Wasserstein distance

d_W(X, Y) = [tr(X + Y) − 2 tr((X^{1/2} Y X^{1/2})^{1/2})]^{1/2}.    (4.26)

To see the connection to optimal transport, note that Eq. 4.25 corresponds to a least squares optimization over a set of multivariate centered Gaussians with respect to the Wasserstein distance. The Gaussian density manifold is isomorphic to the manifold of symmetric positive definite matrices, which allows for a direct application of Rfw and the setup in the previous section to Eq. 4.25, albeit with a different set of constraints. In the following, we discuss a suitable set of constraints and adapt Rfw for the computation of the Bures-Wasserstein barycenter (short: Wasserstein mean).

First, note that as for the Karcher mean, one can show that the Wasserstein mean of two matrices is given in closed form, namely as

X ♦_t Y = (1 − t)² X + t² Y + t(1 − t)((XY)^{1/2} + (YX)^{1/2}).    (4.27)

However, the computation of the barycenter of m matrices (m > 2) requires solving a quadratic optimization problem. Note that Eq. 4.27 defines a geodesic map from X to Y. Unfortunately, an analogue of the means inequality does not hold in the Wasserstein case. However, one can show that the arithmetic matrix mean always gives an upper bound, providing one-sided constraints [Bhatia et al., 2018a]:

A_i ♦_t A_j ⪯ (A_i + A_j)/2,

and similarly for the arithmetic mean A(w; A_i) of m matrices. Moreover, one can show that αI ⪯ X [Bhatia et al., 2018b], where α is the minimal eigenvalue over {A_i}_{i=1}^{m}, i.e.,

α = min_{1≤j≤m} λ_min(A_j).

In summary, this gives the following constraints on the Wasserstein mean:

αI ⪯ Ω(w; A_i) ⪯ A(w; A_i).    (4.28)

Next, we will derive the gradient of the objective function. According to Eq. 4.25, the objective is given as

φ(X) := ∑_j w_j [tr(A_j) + tr(X) − 2 tr((A_j^{1/2} X A_j^{1/2})^{1/2})].    (4.29)

Two expressions for the gradient of (4.29) are derived in the following.

Lemma 4.6. Let φ be given by (4.29). Then, its gradient is

∇φ(X) = ∑_j w_j (I − (A_j X)^{-1/2} A_j).

Proof. We have

∇φ(X) = (d/dX) ∑_j w_j tr(A_j) + (d/dX) ∑_j w_j (tr(X) − 2 tr((A_j^{1/2} X A_j^{1/2})^{1/2}))
       = ∑_j w_j (d/dX) tr(X) − ∑_j w_j · 2 (d/dX) tr((A_j^{1/2} X A_j^{1/2})^{1/2}),

where the first sum vanishes. Note that (d/dX) tr(X) = I. Consider

2 (d/dX) tr((A^{1/2} X A^{1/2})^{1/2}) = 2 (d/dX) [((A^{1/2} X A^{1/2})^{1/2})* ((A^{1/2} X A^{1/2})^{1/2})]^{1/2}
  = 2 (d/dX) (A^{1/2} X A^{1/2})^{1/2}
  = A^{1/2} (A^{1/2} X A^{1/2})^{-1/2} A^{1/2}
  = A^{1/2} (A^{1/2} X A^{1/2})^{-1/2} A^{-1/2} A^{1/2} A^{1/2}
  = (AX)^{-1/2} A,

where the second equality follows from A, X being symmetric and positive definite: If A and X are symmetric and positive definite, then A has a unique positive definite root A^{1/2}, and therefore

(A^{1/2} X A^{1/2})* = A^{1/2*} X* A^{1/2*} = A^{1/2} X A^{1/2}.

This gives the desired expression.

Corollary 4.7. The gradient can also be written as

∇φ = ∑_j w_j (I − A_j # X^{-1}).    (4.30)

Proof.

∇φ = ∑_j w_j (I − (A_j X)^{-1/2} A_j)
   = ∑_j w_j (I − A_j^{1/2} (A_j^{-1/2} X^{-1} A_j^{-1/2})^{1/2} A_j^{1/2})
   = ∑_j w_j (I − A_j # X^{-1}),

where the last step uses the definition A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2} of the matrix geometric mean.
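As an illustration, the Wasserstein distance (4.26) and the gradient in the form of Corollary 4.7 can be sketched in NumPy as follows (the function names bw_dist and bw_grad are ours and serve only as an illustration, not as the paper's implementation).

```python
import numpy as np
from scipy.linalg import sqrtm

def bw_dist(X, Y):
    """Wasserstein (Bures) distance (4.26):
    [tr(X + Y) - 2 tr((X^{1/2} Y X^{1/2})^{1/2})]^{1/2}."""
    Xh = sqrtm(X)
    cross = sqrtm(Xh @ Y @ Xh)
    return np.sqrt(max(np.trace(X + Y).real - 2 * np.trace(cross).real, 0.0))

def bw_grad(X, As, w):
    """Euclidean gradient of the barycenter objective, Corollary 4.7:
    grad = sum_j w_j (I - A_j # X^{-1}), with A # B = A^{1/2}(A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    n = X.shape[0]
    Xinv = np.linalg.inv(X)
    G = np.zeros((n, n))
    for wj, Aj in zip(w, As):
        Ah = sqrtm(Aj)
        Aih = np.linalg.inv(Ah)
        gm = Ah @ sqrtm(Aih @ Xinv @ Aih) @ Ah     # A_j # X^{-1}
        G += wj * (np.eye(n) - gm.real)
    return G
```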

4.3.2 Implementation

Given a set of constraints and an expression for the gradient, we can solve Eq. 4.25 with Rfw using the following setup: We write the log-linear oracle as

Z_k ∈ argmax_{αI ⪯ Z ⪯ A} 〈grad φ(X), Exp_X^{-1}(Z)〉,    (4.31)

where grad φ(X) is the Riemannian gradient of φ and Exp_X^{-1}(Z) the inverse exponential map with respect to the Riemannian metric. The constraints are given by Eq. 4.28. Since the constraint set is given by an interval of the form L ⪯ Z ⪯ U, the oracle can be solved in closed form using Theorem 4.1. The resulting algorithm is given in Alg. 3.

5 Specializing RFW to SO(n)

We have seen in the previous section that Rfw can be implemented efficiently whenever the Riemannian linear oracle can be solved in closed form. In this section we want to show that this is not limited to the case of positive definite matrices. In particular, we specialize Rfw to the special orthogonal group

SO(n) = {X ∈ R^{n×n} : X*X = I_n, det(X) = 1}

and show that here too, we can implement Rfw efficiently.

5.1 Geometry of SO(n)

We first recall some facts on the geometry of SO(n); for a more comprehensive overview see, e.g., [Edelman et al., 1998]. With tangent vectors represented as skew-symmetric matrices (see below), the exponential map and its inverse on SO(n) are given as

Exp_X(H) = X exp(H),    Exp_X^{-1}(Y) = log(X^{-1}Y).

Note that the tangent space of the manifold is given by the space of skew-symmetric matrices

T_X SO(n) = {Ω ∈ R^{n×n} : Ω = −Ω*}.

The standard Riemannian metric on the special orthogonal group (and more generally on the orthogonal group O(n)) is given by

g_c(∆, ∆) = tr[∆* (I − (1/2) Y*Y) ∆],

where ∆ = YA + Y_⊥B ∈ T_Y SO(n). With respect to this metric, a geodesic from Y ∈ SO(n) in direction H ∈ T_Y O(n) is given by [Edelman et al., 1998]

γ(Y, H; t) = Y M(t) + Q N(t),    (5.1)

where QR = (I − Y*Y)H can be computed via QR-decomposition, and M(t), N(t) are obtained from the block-matrix exponential

[ M(t) ]          ( [ A  −R* ] ) [ I_n ]
[ N(t) ]  =  exp( t [ R   0  ] ) [  0  ].

5.2 Solving the Riemannian linear oracle

The Riemannian linear oracle (line 3 in Alg. 2) for the special orthogonal group is given by

max_{Z∈SO(n)} 〈grad φ(X), log(X^{-1}Z)〉.    (5.2)

We can rewrite this as the trace maximization

max_{Z*Z=I_n, det(Z)=1} tr(G log(X^{-1}Z)),

where G denotes the Riemannian gradient grad φ(X). With a similar argument as in Thm. 4.1, this can be solved in closed form:

Theorem 5.1. The solution to the optimization problem

max_{Z*Z=I_n, det(Z)=1} tr(G log(X^{-1}Z))

is given by Z = XVU*, where G = UDV* is a singular value decomposition of the gradient.

Proof. Consider a singular value decomposition G = UDV* with U, V unitary. Furthermore, introduce the shorthand Y = X^{-1}Z. We can then rewrite the oracle as follows:

tr(G log Y) = tr(UDV* log(Y)) = tr(DV* log(Y)U) =† tr(D log(V*YU)),

where (†) follows from V, U being unitary. Let W = V*YU. Note that

tr(D log(W)) = 〈D, log(W)〉 ≤ ‖D‖₂ · ‖log(W)‖₂.

Since ‖X‖₂ = √(tr X*X) = (∑_i σ_i(X)²)^{1/2}, the right hand side is maximized if the singular values of W are maximal. Since W ∈ O(n), its singular values are σ_i = 1. The maximum is therefore achieved by setting W = I. This gives

I = W = V*YU = V*(X^{-1}Z)U.

Solving for Z gives a solution maximizing the original problem:

Z = XVU*.
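The closed form of Theorem 5.1 is essentially a one-liner in code; the following sketch (the function name so_linear_oracle is ours) simply transcribes Z = XVU* from the SVD of the Riemannian gradient, for real matrices.

```python
import numpy as np

def so_linear_oracle(X, G):
    """Closed-form oracle on SO(n) (Theorem 5.1): given the Riemannian gradient G,
    return Z = X V U^T, where G = U D V^T is a singular value decomposition."""
    U, _, Vt = np.linalg.svd(G)
    return X @ Vt.T @ U.T
```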

Remark 5.2 (Nonconvex constraints). Note that in this setting, the constraint set X is not g-convex. However, we can still apply Rfw, since the geodesic update (5.1) remains in the feasible region. To see this, note that SO(n) is a compact submanifold of O(n). In particular, the manifold of orthogonal matrices has two compact connected components (corresponding to matrices with det(X) = 1, i.e., SO(n), and det(X) = −1, respectively). Hence, when initialized at some X ∈ SO(n), the geodesic update (5.1) will always remain within SO(n).

5.3 Application to the Procrustes problem

The Procrustes problem is a classic optimization task that seeks to compute a synchronization between two matrices A, B ∈ R^{n×k}. In particular, the goal is to find a rotation matrix X ∈ SO(n) that achieves the best possible map between data matrices A and B. The corresponding objective function is

f(X) = (1/2)‖XB − A‖² = (1/2)‖A‖² + (1/2)‖B‖² − tr(B^T X^T A).

The optimal rotation X can then be found by solving

min_{X∈SO(n)} [φ(X) := −tr(X^T A B^T)].

The Euclidean gradient of the objective is easily derived as

∇φ(X) = −AB^T,

while the Riemannian gradient is given by

grad φ(X) = X · skew(X^T ∇φ(X)).

Recall that skew(X) = (1/2)(X − X^T). The resulting algorithm is given in Alg. 4.


Algorithm 4 RFW for the Procrustes problem

1: Input: A, B ∈ R^{n×k}.
2: Assume access to the geodesic map γ(X, Z; α) (Eq. 5.1).
3: X ≈ argmin_{X∈SO(n)} −tr(X^T A B^T)
4: Precompute ∇φ(X) = −AB^T.
5: for k = 0, 1, . . . do
6:   Compute gradient: grad φ(X_k) = X_k · skew(X_k^T ∇φ(X_k)).
7:   Compute Z_k via the Riemannian linear oracle:
8:     Z_k ← argmax_{Z∈SO(n)} 〈grad φ(X_k), log(X_k^{-1} Z)〉
9:   Let α_k ← 2/(k+2).
10:  Update X_k as
11:    X_{k+1} ← γ(X_k, Z_k; α_k).
12: end for
13: return X = X_k.

6 Computational Experiments

In this section, we will make some remarks on the implementation of Alg. 3 and show numerical results for computing the geometric matrix mean for different parameter choices. To evaluate the efficiency of our method, we compare its performance against a selection of state-of-the-art methods. Additionally, we use Alg. 3 to compute Wasserstein barycenters of positive definite matrices. All computational experiments are performed using Matlab.

6.1 Computational Considerations

When implementing the algorithm we can take advantage of the positive definiteness of the input matrices. For example, if using Matlab, rather than computing X^{-1} log(X A_i^{-1}), it is preferable to compute

X^{-1/2} log(X^{1/2} A_i^{-1} X^{1/2}) X^{1/2},

because both X^{-1/2} and log(X^{1/2} A_i^{-1} X^{1/2}) can be computed after suitable eigendecompositions. In contrast, computing log(X A_i^{-1}) invokes the matrix logarithm (logm in Matlab), which can be much slower.
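The eigendecomposition trick alluded to here can be packaged, for illustration, as a small helper (spd_fun is our own name; the Matlab implementation is not reproduced here): a scalar function applied to an SPD matrix reduces to one symmetric eigendecomposition.

```python
import numpy as np

def spd_fun(M, f):
    """Apply a scalar function to a symmetric positive definite matrix via its
    eigendecomposition, avoiding the slower general-purpose logm/sqrtm routines."""
    w, V = np.linalg.eigh(0.5 * (M + M.T))
    return V @ np.diag(f(w)) @ V.T

# e.g. X^{-1/2} = spd_fun(X, lambda w: w ** -0.5), and the inner logarithm
# log(X^{1/2} A^{-1} X^{1/2}) = spd_fun(X^{1/2} A^{-1} X^{1/2}, np.log).
```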

6.1.1 Stepsize selection

To save on computation time, we prefer to use a diminishing scalar as the stepsize in Alg. 3. In principle, this simple stepsize selection could be replaced by a more sophisticated Armijo-like line-search or even exact minimization by solving

α_k ← argmin_{α∈[0,1]} φ(X_k + α(Z_k − X_k)).    (6.1)

This stepsize tuning may accelerate the convergence speed of the algorithm, but it must be combined with a more computationally intensive strategy of "away" steps [GueLat and Marcotte, 1986] to obtain a geometric rate of convergence. However, we prefer Algorithm 3 for its simplicity and efficiency.

6.1.2 Accelerating the convergence

Theorems 3.4 and 3.6 show that Algorithm 2 converges at the global (non-asymptotic) rates O(1/ε) (g-convex Rfw) and O(1/ε²) (nonconvex Rfw). However, by further exploiting the simple structure of the constraint set and the "curvature" of the objective function, we might obtain a stronger convergence result.

6.2 Numerical Results

We present numerical results for computing the Karcher and Wasserstein means for sets of positive definite matrices. We test our methods on both well- and ill-conditioned matrices; the generation of sample matrices is described in the appendix. In addition, we present numerical results for solving the Procrustes problem.

6.2.1 Computing the Riemannian Mean

To test the efficiency of our method, we implemented Rfw (Alg. 3) and its Euclidean counterpart Efw (Alg. 5) in Matlab and compared their performance on computing the geometric matrix mean against related state-of-the-art methods:

1. Riemannian L-BFGS (LBFGS) is a quasi-Newton method that iteratively approximates the Hessian for evaluating second-order information [Yuan et al., 2016].

2. Riemannian Barzilai–Borwein (BB) is a first-order gradient-based method for constrained and unconstrained optimization. It evaluates second-order information from an approximated Hessian to choose the stepsize [Iannazzo and Porcelli, 2018]. We use the Manopt version of the RBB.

3. Matrix Means Toolbox (MM) [Bini and Iannazzo, 2013] is an efficient Matlab toolbox for matrix optimization. Its implementation of the geometric mean problem uses a Richardson-like iteration of the form

   X_{k+1} ← X_k − α X_k ∑_{i=1}^{n} log(A_i^{-1} X_k),

   with a suitable α > 0.

4. Zhang [Zhang, 2017] is a recently published majorization-minimization method for computing the geometric matrix mean.

In addition, we compare against classic gradient-based optimization methods (Steepest Descent and Conjugate Gradient), see Appendix C.

Note that this selection reflects a broad spectrum of commonly used methods for Riemannian optimization. It ranges from highly specialized approaches that are targeted to the Karcher mean (MM), to more general and versatile methods (e.g., R-LBFGS). A careful evaluation should take these implementation differences into account, by comparing not only CPU time, but also the loss with respect to the number of iterations. In addition to reporting the loss with respect to CPU time, we include results with respect to the number of iterations in Appendix C. One should further compare algorithmic features such as the number of calls to oracles and loss functions. We perform and further discuss such an evaluation below.

We generate a set A of M positive definite matrices of size N and compute the geometric mean with Efw and Rfw as specified in Algorithm 3. To evaluate the performance of the algorithm, we compute the cost function

f(X, A) = ∑_{i=1}^{M} ‖log(X^{-1/2} A_i X^{-1/2})‖_F²,    (6.2)

after each iteration. We further include an experiment where we report the gradient norm ‖grad f‖_F. Figure 1 shows performance comparisons of all methods for different parameter choices, initializations and condition numbers. In a second experiment (Fig. 2 and Fig. 3) we compared three different initializations: the harmonic mean and a matrix A_i ∈ A, as well as the arithmetic-harmonic mean ½(AM + HM). The latter is known to be a good approximation to the geometric mean and therefore gives an initialization close to the solution. We observe that Efw/Rfw outperform BB for all parameter choices and initializations. Furthermore, Efw/Rfw perform very competitively in comparison with R-LBFGS, Zhang's method and MM. In particular, if the initialization is not very close to the final solution (e.g., the arithmetic-harmonic mean), Efw/Rfw improve on the state-of-the-art methods. In a third experiment we compared the accuracy reached by Efw and Rfw with R-LBFGS as the (in our experiments) most accurate state-of-the-art method (Fig. 4). We observe that Efw and Rfw reach a medium accuracy fast; however, ultimately R-LBFGS reaches a higher accuracy.

To account for the above-discussed inaccuracies in CPU time as the performance measure, we complement our performance analysis by reporting the loss with respect to the number of iterations (see Appendix C). In addition, we include a comparison of the number of internal calls to cost and gradient functions (Figure 10) for all methods. Note that this comparison evaluates algorithmic features and is therefore machine-independent. The reported numbers were obtained by averaging counts over ten experiments with identical parameter choices. We observe that Rfw and Efw avoid internal calls to the gradient and cost function, requiring only a single gradient evaluation per iteration. This results in the acceleration observed in the plots.

(a) Well-conditioned (b) Ill-conditioned

Figure 1: Performance of Efw and Rfw in comparison with state-of-the-art methods for well-conditioned and ill-conditioned inputs of different sizes (N: size of matrices, M: number of matrices, K: maximum number of iterations). All tasks are initialized with the harmonic mean x0 = HM.


(a) x0 = A1 (b) x0 = ½(AM + HM) (c) x0 = HM

Figure 2: Performance of Efw and Rfw in comparison with state-of-the-art methods for well-conditioned inputs and different initializations.

(a) x0 = A1 (b) x0 = ½(AM + HM) (c) x0 = HM

Figure 3: Performance of Efw and Rfw in comparison with state-of-the-art methods for ill-conditioned inputs and different initializations.

Figure 4: Gradient norms at each iteration for Efw and Rfw in comparison with R-LBFGS.


6.2.2 Computing Wasserstein Barycenters

To demonstrate the versatility of our approach, we use Rfw to compute Wasserstein barycenters. For this, we adapt the above-described setup to implement Alg. 3. Fig. 5 shows the performance of Rfw on both well- and ill-conditioned matrices. We compare three different initializations in each experiment: In (1), we choose an arbitrary element of A as starting point (X0 ∼ U(A)). The other experiments start at the lower and upper bounds of the constraint set, i.e., (2) X0 is set to the arithmetic mean of A and (3) X0 = αI, where α is the smallest eigenvalue over A. Our results suggest that Rfw performs well when initialized from any point in the feasible region, even on sets of ill-conditioned matrices.

(a) Well-conditioned (b) Ill-conditioned

Figure 5: Performance of Rfw for computing the Wasserstein mean of M well- and M ill-conditioned matrices of size N for K iterations and three initializations: (1) X0 is set to one of the matrices in the set A, i.e., Ai ∈ A; (2) X0 is initialized as the arithmetic mean of A, i.e., the upper bound of the constraint set; and (3) X0 is set to the lower bound αI of the constraint, where α is the minimal eigenvalue over A.

6.2.3 Procrustes problem

We implement Rfw for the special orthogonal group, focusing on the Procrustes problem (see Alg. 4). Fig. 6 shows the performance of Rfw for different input sizes and initializations. The loss function is chosen as the optimality gap, where we compare against the analytically known optimal solution

X∗ = UV^T,

where AB^T = UΣV^T is a singular value decomposition. For initialization, we compare a random rotation matrix (i.e., X0 ∼ U(SO(n))) with the identity matrix (i.e., X0 = In). In our experiments, initialization with the identity achieved superior performance.
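For reference, this closed-form optimum is easy to evaluate directly; the sketch below (plain NumPy, helper name ours) forms AB^T, computes its SVD and returns UV^T, which is the quantity we compare against when reporting the optimality gap.

```python
import numpy as np

def procrustes_optimum(A, B):
    """Closed-form solution X* = U V^T, where A B^T = U Sigma V^T (SVD)."""
    U, _, Vt = np.linalg.svd(A @ B.T)   # Vt is V^T as returned by NumPy
    return U @ Vt

# Hypothetical usage: optimality gap of an iterate X.
# gap = np.linalg.norm(X - procrustes_optimum(A, B), "fro")
```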

7 Discussion

We presented a Riemannian version of the classic Frank-Wolfe method that enables constrained optimization on Riemannian (more precisely, Hadamard) manifolds. Similar to the Euclidean case,


Figure 6: Performance of Rfw for computing a solution of the Procrustes problem. We consider two initializations, (i) X0 ∼ U(SO(n)) and (ii) X0 = In, for input matrices A, B ∈ Rn×k. MaxIt denotes the number of iterations.

we recover sublinear convergence rates for Riemannian Frank-Wolfe for both geodesically convex and nonconvex objectives. Under the stricter assumption of µ-strongly g-convex objectives and a strict interiority condition on the constraint set, we show that even linear convergence rates can be attained by Riemannian Frank-Wolfe. To our knowledge, this work represents the first extension of Frank-Wolfe methods to a manifold setting.

In addition to the general results, we present an efficient algorithm for optimization on Hermitian positive definite matrices. The key highlight of this specialization is a closed-form solution to the Riemannian linear oracle needed by Frank-Wolfe (this oracle involves solving a non-convex problem). While we focus on the specific problem of computing the Karcher mean (also known as the Riemannian centroid or geometric matrix mean), the derived closed-form solutions apply to more general objective functions, and should be of wider interest for related non-convex and g-convex problems. To demonstrate this versatility, we also included an application of Rfw to the computation of Wasserstein barycenters on the Gaussian density manifold. Furthermore, we show that the Riemannian linear oracle can be solved in closed form also for the special orthogonal group. In future work, we hope to explore other constraint sets and matrix manifolds that admit an efficient solution to the Riemannian linear oracle.

Our algorithm is shown to be very competitive against a variety of established and recently proposed approaches [Zhang, 2017, Yuan et al., 2016, Iannazzo and Porcelli, 2018], providing evidence for its applicability to large-scale statistics and machine learning problems. In follow-up work [Weber and Sra, 2019], we show that Rfw extends to nonconvex stochastic settings, further increasing the efficiency and versatility of Riemannian Frank-Wolfe methods. Exploring the use of this class of algorithms in large-scale machine learning applications is a promising avenue for future research.

Acknowledgements

SS acknowledges support from NSF-IIS-1409802. This work was done during a visit of MW at MIT, supported by a Dean's grant from the Princeton University Graduate School.

References

P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.


Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 25(1):115–129, 2015.

Glaydston C Bento, Orizon P Ferreira, and Jefferson G Melo. Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. Journal of Optimization Theory and Applications, 173(2):548–562, 2017.

R. Bhatia. Matrix Analysis. Springer, 1997.

R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.

R. Bhatia and J. Holbrook. Riemannian geometry and matrix geometric means. Linear Algebra Appl., 413:594–618, 2006.

Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018a. ISSN 0723-0869. doi: https://doi.org/10.1016/j.exmath.2018.01.002.

Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. Strong convexity of sandwiched entropies and relatedoptimization problems. Reviews in Mathematical Physics, 30(09):1850014, 2018b.

D. A. Bini and B. Iannazzo. Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 438(4):1700–10, Oct. 2013.

N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455–1459, 2014. URL http://www.manopt.org.

Nicolas Boumal, P-A Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. arXiv preprint arXiv:1605.08101, 2016.

G. Calinescu, C. Chekuri, M. Pal, and J. Vondrak. Maximizing a submodular set function subject to a matroid constraint. SIAM J. Computing, 40(6), 2011.

Isaac Chavel. Riemannian Geometry: A modern introduction, volume 98. Cambridge University Press, 2006.

Anoop Cherian and Suvrit Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. arXiv:1507.02772, 2015.

A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Analysis and Applications (SIMAX), 20(2):303–353, 1998.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.

S. Fujishige and S. Isotani. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7:3–17, 2011.

D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In International Conference on Machine Learning, pages 541–549, 2015.

J. Guélat and P. Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35(1):110–119, May 1986. ISSN 1436-4646. doi: 10.1007/BF01589445. URL https://doi.org/10.1007/BF01589445.

Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.


Uwe Helmke, Knut Hüper, Pei Yean Lee, and John Moore. Essential matrix estimation using Gauss-Newton iterations on a manifold. International Journal of Computer Vision, 74(2):117–136, 2007.

Reshad Hosseini and Suvrit Sra. Matrix manifold optimization for Gaussian mixtures. In NIPS, 2015.

Bruno Iannazzo and Margherita Porcelli. The Riemannian Barzilai–Borwein method with nonmonotone line search and the matrix geometric mean computation. IMA Journal of Numerical Analysis, 38(1):495–517, 2018. doi: 10.1093/imanum/drx015. URL http://dx.doi.org/10.1093/imanum/drx015.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning (ICML), pages 427–435, 2013.

B. Jeuris, R. Vandebril, and B. Vandereycken. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379–402, 2012.

J. Jost. Riemannian Geometry and Geometric Analysis. Springer, 2011.

H. Karcher. Riemannian center of mass and mollifier smoothing. Comm. Pure Appl. Math., 30(5):509–541, 1977.

H. Karimi, J. Nutini, and M. W. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. CoRR, abs/1608.04636, 2016.

F. Kubo and T. Ando. Means of positive linear operators. Math. Ann., 246:205–224, 1979.

S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.

Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 496–504, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969239.2969295.

J. Lawson and Y. Lim. Karcher means and Karcher equations of positive definite operators. Trans. Amer. Math. Soc. Ser. B, 1:1–22, 2014.

Denis Le Bihan, Jean-François Mangin, Cyril Poupon, Chris A Clark, Sabina Pappata, Nicolas Molko, and Hughes Chabriat. Diffusion Tensor Imaging: Concepts and Applications. Journal of Magnetic Resonance Imaging, 13(4):534–546, 2001.

Yongdo Lim and Miklos Palfia. Matrix power means and the Karcher mean. Journal of Functional Analysis, 262(4):1498–1514, 2012.

S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117:87–89, 1963.

Luigi Malagò, Luigi Montrucchio, and Giovanni Pistone. Wasserstein Riemannian geometry of positive-definite matrices. 2018.

Maher Moakher. Means and averaging in the group of rotations. SIAM Journal on Matrix Analysis and Applications, 24(1):1–16, 2002.

F. Nielsen and R. Bhatia, editors. Matrix Information Geometry. Springer, 2013.

B. T. Polyak. Gradient methods for minimizing functionals (in Russian). Zh. Vychisl. Mat. Mat. Fiz., pages 643–653, 1963.

B. T. Polyak. Introduction to Optimization. Optimization Software Inc., 1987. Nov 2010 revision.


Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 1244–1251. IEEE, 2016.

Wolfgang Ring and Benedikt Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012.

S. Sra and R. Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713–739, 2015.

Suvrit Sra and Reshad Hosseini. Geometric optimisation on positive definite matrices for elliptically contoured distributions. In Advances in Neural Information Processing Systems, pages 2562–2570, 2013.

Ju Sun, Qing Qu, and John Wright. Complete Dictionary Recovery over the Sphere II: Recovery by Riemannian Trust-region Method. arXiv:1511.04777, 2015.

Mingkui Tan, Ivor W Tsang, Li Wang, Bart Vandereycken, and Sinno J Pan. Riemannian pursuit for big matrix recovery. In International Conference on Machine Learning (ICML-14), pages 1539–1547, 2014.

Constantin Udriste. Convex functions and optimization methods on Riemannian manifolds, volume 297. Springer Science & Business Media, 1994.

Bart Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.

Melanie Weber and Suvrit Sra. Nonconvex stochastic optimization on manifolds via Riemannian Frank-Wolfe methods. arXiv:1910.04194, 2019.

Xinru Yuan, Wen Huang, P.-A. Absil, and K.A. Gallivan. A Riemannian limited-memory BFGS algorithm for computing the matrix geometric mean. Procedia Computer Science, 80:2147–2157, 2016. ISSN 1877-0509. doi: https://doi.org/10.1016/j.procs.2016.05.534. International Conference on Computational Science 2016, ICCS 2016, 6–8 June 2016, San Diego, California, USA.

Xinru Yuan, Wen Huang, P.-A. Absil, and K. A. Gallivan. A Riemannian quasi-Newton method for computing the Karcher mean of symmetric positive definite matrices. Florida State University, (FSU17-02), 2017.

H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory (COLT), Jun. 2016.

H. Zhang, S. Reddi, and S. Sra. Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems (NIPS), 2016.

T. Zhang. A majorization-minimization algorithm for computing the Karcher mean of positive definite matrices. SIAM Journal on Matrix Analysis and Applications, 38(2):387–400, 2017. doi: 10.1137/15M1024482.

Teng Zhang, Ami Wiesel, and Maria S Greco. Multivariate generalized Gaussian distribution: Convexity and graphical models. Signal Processing, IEEE Transactions on, 61(16):4141–4148, 2013.

A Non-convex Euclidean Frank-Wolfe

We make a short digression here to mention non-convex Euclidean Frank-Wolfe (Efw) as a potential alternative approach to solving (4.1). Indeed, the constraint set therein is not only g-convex, it is also convex in the usual sense. Thus, one can also apply an Efw scheme to solve (4.1), albeit with a slower


convergence rate. In general, a g-convex set X need not be Euclidean convex, so this observation does not always apply.

Efw was recently analyzed in [Lacoste-Julien, 2016]; the convergence rate reported below adapts one of its main results. The key difference, however, is that due to g-convexity, we can translate the local result of [Lacoste-Julien, 2016] into a global one for problem (4.1).

Theorem A.1 (Convergence of the Fw-gap [Lacoste-Julien, 2016]). Define G_K := min_{0≤k≤K} G(X_k), where G(X_k) = max_{Z_k∈X} 〈Z_k − X_k, −∇φ(X_k)〉 is the Fw-gap (i.e., the measure of convergence) at X_k. Define the curvature constant

    M_φ := sup_{X,Y,Z∈X, Y=X+s(Z−X)} (2/s²) [φ(Y) − φ(X) − 〈∇φ(X), Y − X〉].

Then, after K iterations, Efw satisfies G_K ≤ max{2h_0, M_φ} / √(K+1).

The proof (a simple adaptation of [Lacoste-Julien, 2016]) is similar to that of Thm. 3.6; therefore, we omit it here. Finally, to implement Efw, we also need to implement its linear oracle efficiently. Thm. A.2 below shows how; the proof is similar to that of Thm. 4.1. It is important to note that this linear oracle amounts to a simple SDP, but it would be unreasonable to invoke an SDP solver at each iteration. Thm. A.2 thus proves to be crucial, because it yields an easily computed closed-form solution.

Theorem A.2. Let L, U ∈ P_d such that L ≺ U, and let S ∈ H_d be arbitrary. Let U − L = P∗P and PSP∗ = QΛQ∗. Then, the solution to

    min_{L ⪯ Z ⪯ U} tr(SZ),     (A.1)

is given by Z = L + P∗Q[− sgn(Λ)]_+ Q∗P.

Proof. First, shift the constraint to 0 ⪯ Z − L ⪯ U − L; then factorize U − L = P∗P and introduce a new variable Y = Z − L. Therewith, problem (A.1) becomes

    min_{0 ⪯ Y ⪯ U−L} tr(S(Y + L)).

If L = U, then clearly P = 0 and Z = L is the solution. Assume thus L ≺ U, so that P is invertible. The above problem then further simplifies to

    min_{0 ⪯ (P∗)^{−1}YP^{−1} ⪯ I} tr(SY).

Introduce another variable W = (P∗)^{−1}YP^{−1}, so that Y = P∗WP, and use circularity of the trace to write

    min_{0 ⪯ W ⪯ I} tr(PSP∗W).

To obtain the optimal W, first write the eigenvalue decomposition

    PSP∗ = QΛQ∗.     (A.2)

Lemma 4.3 implies that the optimum is attained when the eigenvectors of PSP∗ and W align and their eigenvalues are matched accordingly. Since 0 ⪯ W ⪯ I, we therefore see that W = QDQ∗ is an optimal solution, where D is diagonal with entries

    d_ii = 1 if λ_i < 0,  and  d_ii = 0 if λ_i ≥ 0   (λ_i being the i-th diagonal entry of Λ),   i.e., D = [− sgn(Λ)]_+.     (A.3)

Undoing the variable substitutions, we obtain Z = L + Y = L + P∗WP = L + P∗Q[− sgn(Λ)]_+ Q∗P, as desired.


Remark A.3. Computing the optimal Z requires one Cholesky factorization, five matrix multiplications and one eigenvector decomposition. The theoretical complexity of the Euclidean linear oracle can therefore be estimated as O(N³). On our machine, an eigenvector decomposition is approximately 8–12 times slower than a matrix multiplication, so the total flop count is approximately (1/3)N³ + 5 × 2N³ + 20N³ ≈ 33N³.
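For concreteness, a minimal NumPy sketch of this closed-form oracle follows; it applies Theorem A.2 directly (Cholesky factor of U − L, eigendecomposition of PSP∗, then the indicator of negative eigenvalues). The function name euclidean_lin_oracle is ours, and we assume L ≺ U so that the Cholesky factor is invertible.

```python
import numpy as np

def euclidean_lin_oracle(S, L, U):
    """Closed-form solution of min_{L <= Z <= U} tr(S Z), cf. Theorem A.2.

    S must be Hermitian; L, U positive definite with L strictly below U in
    the Loewner order, so that the Cholesky factor of U - L is invertible.
    """
    # U - L = P* P with P = C^H, where C is the lower Cholesky factor.
    C = np.linalg.cholesky(U - L)
    # Eigendecomposition of P S P* = C^H S C.
    lam, Q = np.linalg.eigh(C.conj().T @ S @ C)
    # D = [-sgn(Lambda)]_+ : ones exactly on the negative eigendirections.
    D = np.diag((lam < 0).astype(float))
    # Z = L + P* Q D Q* P = L + C Q D Q^H C^H.
    return L + C @ Q @ D @ Q.conj().T @ C.conj().T
```

A quick sanity check of such a sketch is to compare tr(S Z) against a generic SDP solver on small random instances; in exact arithmetic the two should agree.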

An implementation of Efw for the computation of Riemannian centroids is shown in Alg. 5. Experimental results for Efw in comparison with Rfw and state-of-the-art Riemannian optimization methods can be found in the main text (see Sec. 6.2.1).

Algorithm 5 Efw for fast geometric mean

1: Input: A_1, . . . , A_N ∈ P_d, weights w ∈ R^N_+
2: Output: X ≈ argmin_{X≻0} Σ_i w_i δ²_R(X, A_i)
3: β ← min_{1≤i≤N} λ_min(A_i)
4: for k = 0, 1, . . . do
5:    Compute gradient: ∇φ(X_k) = X_k^{−1} (Σ_i w_i log(X_k A_i^{−1}))
6:    Compute Z_k: Z_k ← argmin_{H ⪯ Z ⪯ A} 〈∇φ(X_k), Z − X_k〉
7:    Let α_k ← 2/(k+2)
8:    Update X: X_{k+1} ← X_k + α_k (Z_k − X_k)
9: end for
10: return X = X_k
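To make the control flow concrete, here is a compact Python rendering of Alg. 5. It is a sketch under our own choices: the oracle of Theorem A.2 is restated for real symmetric inputs to stay self-contained, the interval endpoints H and A are taken to be the (weighted) harmonic and arithmetic means (an assumption consistent with the notation above), and function and variable names are ours.

```python
import numpy as np
from scipy.linalg import logm, inv

def euclidean_lin_oracle(S, L, U):
    # Closed-form solution of min_{L <= Z <= U} tr(S Z), cf. Theorem A.2.
    C = np.linalg.cholesky(U - L)
    lam, Q = np.linalg.eigh(C.T @ S @ C)
    return L + C @ Q @ np.diag((lam < 0).astype(float)) @ Q.T @ C.T

def efw_geometric_mean(As, w, num_iter=100):
    """Euclidean Frank-Wolfe sketch for the weighted geometric matrix mean."""
    w = np.asarray(w, dtype=float) / np.sum(w)
    H = inv(sum(wi * inv(A) for wi, A in zip(w, As)))   # harmonic mean (lower bound)
    AM = sum(wi * A for wi, A in zip(w, As))            # arithmetic mean (upper bound)
    X = H                                               # feasible starting point (HM, as in the experiments)
    for k in range(num_iter):
        grad = inv(X) @ sum(wi * logm(X @ inv(A)) for wi, A in zip(w, As))
        grad = (grad + grad.T) / 2                      # symmetrize against round-off
        Z = euclidean_lin_oracle(grad, H, AM)           # Fw linear oracle over [H, AM]
        alpha = 2.0 / (k + 2)
        X = X + alpha * (Z - X)                         # convex-combination update
    return X
```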

B Generating matrices

B.1 Positive Definite Matrices

For testing our methods in the well-conditioned regime, we generate matrices {A_i}_{i=1}^{n} ∈ P_d by sampling real matrices of dimension d uniformly at random, A_i ∼ U(R^{d×d}), and multiplying each with its transpose,

    A_i ← A_i A_i^T.

This gives well-conditioned positive definite matrices. Furthermore, we sample m matrices U_i ∼ U(R^{d×d}) with a rank deficit, i.e., rank(U_i) < d. Then, setting

    B_i ← δI + U_i U_i^T

with δ small yields ill-conditioned matrices.
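The sketch below generates both families in Python/NumPy. Drawing entries uniformly from [0, 1) is one concrete reading of "uniformly at random" and an assumption of this sketch, as are the function name and the choice of rank d/2 for the deficient factor.

```python
import numpy as np

def generate_spd_matrices(n, d, ill_conditioned=False, delta=1e-6, seed=0):
    """Generate n synthetic positive definite test matrices of size d x d.

    Well-conditioned: A_i = G G^T for a random dense G.
    Ill-conditioned:  B_i = delta*I + U U^T with U rank-deficient.
    """
    rng = np.random.default_rng(seed)
    mats = []
    for _ in range(n):
        if not ill_conditioned:
            G = rng.random((d, d))
            mats.append(G @ G.T)
        else:
            U = rng.random((d, d // 2))      # rank deficit: rank(U) <= d/2 < d
            mats.append(delta * np.eye(d) + U @ U.T)
    return mats
```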

B.2 Special Orthogonal Group

First, we use a publicly available implementation by Ofek Shilon² to sample uniformly from the manifold of orthogonal matrices, i.e., {A_i}_{i=1}^{n} ∈ O(d). The method generates uniformly distributed orthogonal matrices by applying the Gram-Schmidt process to random matrices. All matrices in our sample have determinant det(A_i) = ±1. To map our sample to the special orthogonal group (with determinant 1), we multiply the first row of all matrices with negative determinant by −1: let a_{i,1∗} be the first row of A_i. If det(A_i) = −1, set

    a_{i,1∗} ← −a_{i,1∗}.

²https://www.mathworks.com/matlabcentral/fileexchange/11783-randorthmat
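A NumPy stand-in for the MATLAB routine referenced above is sketched below; it is an assumption of ours and not the code used in the experiments. A QR factorization of a Gaussian matrix (numerically equivalent to Gram-Schmidt) yields an orthogonal matrix, which becomes uniformly distributed once the signs of R's diagonal are fixed; matrices with determinant −1 are then mapped to SO(d) by flipping their first row, as described in the text.

```python
import numpy as np

def sample_special_orthogonal(n, d, seed=0):
    """Sample n matrices from SO(d) via QR of Gaussian matrices."""
    rng = np.random.default_rng(seed)
    mats = []
    for _ in range(n):
        Q, R = np.linalg.qr(rng.standard_normal((d, d)))
        Q = Q * np.sign(np.diag(R))       # fix column signs to make the factorization unique
        if np.linalg.det(Q) < 0:
            Q[0, :] = -Q[0, :]            # flip first row: determinant becomes +1
        mats.append(Q)
    return mats
```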


C Additional Experiments

C.1 Comparison against classic gradient-based methods

In addition to the comparison with state-of-the-art methods in the main text, we compared against two classic gradient-based methods:

1. Steepest Descent (SD) is an iterative, first-order minimization algorithm with line search that has risen to great popularity in machine learning. In each iteration, the method evaluates the gradient of the objective and takes a step proportional to the negative of the gradient (see the sketch after this list). We use an implementation provided by the toolbox Manopt [Boumal et al., 2014].

2. Conjugate Gradient (CG) is an iterative optimization algorithm with plane search, i.e., instead of stepping along the negative gradient alone, one steps along a linear combination of the current gradient and the descent direction of the previous iteration. We again use an implementation provided by Manopt.
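For intuition, the following sketch shows one Riemannian steepest-descent step for the Karcher mean on P_d under the affine-invariant metric, written directly in NumPy/SciPy. It illustrates the type of update such methods perform; it is not the Manopt implementation (which additionally performs a line search), and the fixed step size alpha is our own choice.

```python
import numpy as np
from scipy.linalg import sqrtm, expm, logm, inv

def karcher_sd_step(X, As, w, alpha=0.5):
    """One Riemannian steepest-descent step for phi(X) = sum_i w_i d_R(X, A_i)^2.

    Under the affine-invariant metric, the Riemannian gradient of phi at X is
    -2 * sum_i w_i Exp_X^{-1}(A_i); following the geodesic in the negative
    gradient direction (with the factor 2 absorbed into alpha) gives
        X <- X^{1/2} expm(alpha * sum_i w_i logm(X^{-1/2} A_i X^{-1/2})) X^{1/2}.
    """
    Xh = sqrtm(X)          # X^{1/2}
    Xih = inv(Xh)          # X^{-1/2}
    # Tangent-space average of the log-maps of the data points at X.
    T = sum(wi * logm(Xih @ A @ Xih) for wi, A in zip(w, As))
    T = (T + T.T) / 2      # symmetrize against round-off
    return Xh @ expm(alpha * T) @ Xh
```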

Figure 7: Performance of Efw and Rfw in comparison with state-of-the-art methods for well-conditioned inputs of different sizes (N: size of matrices, M: number of matrices, K: maximum number of iterations). All tasks are initialized with the harmonic mean x0 = HM.

In Fig. 7, we report the loss both with respect to CPU time and with respect to the number of iterations. This allows for a more objective comparison with implementations from the toolbox Manopt (see the discussion in the main text).

C.2 Additional comparison with state-of-the-art methods

For a more objective comparison, we also include here the loss with respect to the number of iterations for (i) different parameter choices (Fig. 8) and (ii) different initializations (Fig. 9).


Figure 8: Performance of Efw and Rfw in comparison with state-of-the-art methods for well-conditioned (left) and ill-conditioned (right) inputs of different sizes (N: size of matrices, M: number of matrices, K: maximum number of iterations). All tasks are initialized with the harmonic mean x0 = HM.


Figure 9: Performance of Efw and Rfw in comparison with state-of-the-art methods for well-conditioned (left) and ill-conditioned (right) inputs with different initializations: x0 = A1 for A1 ∈ A (left), x0 = ½(AM + HM) (middle) and x0 = HM (right).

Method      # calls to grad    # calls to cost
Efw         30                 0
Rfw         30                 0
RBB         17                 35
R-LBFGS     30                 49
MM          30                 0
Zhang       30                 0

Figure 10: Number of calls to the gradient and cost functions until reaching convergence, averaged over ten experiments with N = 40, M = 10 and K = 30. Note that Efw/Rfw outperform most other methods in this example, which is partially due to avoiding internal calls to the cost function, significantly increasing the efficiency of both methods.
