
  • Riemannian Optimization and the Computation of the Divergences and the Karcher Mean of

    Symmetric Positive Definite Matrices

    Xinru Yuan¹, Wen Huang², Kyle A. Gallivan¹, and P.-A. Absil³

    ¹Florida State University, ²Rice University, ³Université Catholique de Louvain

    May 7, 2018, SIAM Conference on Applied Linear Algebra

  • Symmetric Positive Definite (SPD) Matrix

    Definition

    A symmetric matrix A is called positive definite iff all its eigenvalues are positive.

    [Figure: a 2×2 SPD matrix visualized as an ellipse with semi-axes √λ_u·u and √λ_v·v, and a 3×3 SPD matrix as an ellipsoid with semi-axes √λ_u·u, √λ_v·v, and √λ_w·w, where u, v, w are the eigenvectors and λ_u, λ_v, λ_w the corresponding eigenvalues.]
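    The definition is easy to check numerically. A minimal NumPy sketch (ours, not from the slides; the helper name is_spd is our own):

    ```python
    import numpy as np

    def is_spd(A, tol=0.0):
        """A symmetric matrix is positive definite iff all eigenvalues are positive."""
        if not np.allclose(A, A.T):
            return False
        return np.linalg.eigvalsh(A).min() > tol

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])                      # eigenvalues 1 and 3
    print(is_spd(A))                                # True
    print(is_spd(np.array([[1.0, 2.0],
                           [2.0, 1.0]])))           # False: eigenvalues are -1 and 3

    # The eigen-decomposition also gives the picture above: the unit ball
    # mapped by A^{1/2} is an ellipse with semi-axes sqrt(lam_i) * u_i.
    lam, U = np.linalg.eigh(A)
    print(np.sqrt(lam), U)
    ```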

  • Motivation of Averaging SPD Matrices

    Possible applications of SPD matrices

    - Diffusion tensors in medical imaging [CSV12, FJ07, RTM07]

    - Describing images and video [LWM13, SFD02, ASF+05, TPM06, HWSC15]

    Motivation of averaging SPD matrices

    - Aggregate several noisy measurements of the same object

    - Subtask in interpolation methods, segmentation, and clustering


  • Averaging Schemes: from Scalars to Matrices

    Let A_1, …, A_K be SPD matrices.

    Generalized arithmetic mean: (1/K) Σ_{i=1}^K A_i

    → Not appropriate in many practical applications: averaging swells the determinant, e.g. det A = 50 and det B = 50, but det((A+B)/2) = 267.56

    Generalized geometric mean: (A_1 ⋯ A_K)^{1/K}

    → Not appropriate due to non-commutativity
    → How to define a matrix geometric mean?
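    The determinant swelling is easy to reproduce. A small NumPy experiment (the random_spd helper and the resulting numbers are ours; the Minkowski determinant inequality guarantees det((A+B)/2) ≥ √(det A · det B)):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def random_spd(n):
        M = rng.standard_normal((n, n))
        return M @ M.T + n * np.eye(n)

    A, B = random_spd(3), random_spd(3)

    # The arithmetic midpoint "swells": det((A+B)/2) can greatly exceed
    # the geometric mean of the endpoint determinants.
    print(np.linalg.det(A), np.linalg.det(B))
    print(np.linalg.det((A + B) / 2))
    print(np.sqrt(np.linalg.det(A) * np.linalg.det(B)))  # what a geometric mean preserves
    ```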


  • Desired Properties of a Matrix Geometric Mean

    The desired properties are given in the ALM list¹, some of which are:

    G(A_{π(1)}, …, A_{π(K)}) = G(A_1, …, A_K), with π a permutation of (1, …, K)

    if A_1, …, A_K commute, then G(A_1, …, A_K) = (A_1 ⋯ A_K)^{1/K}

    G(A_1, …, A_K)^{−1} = G(A_1^{−1}, …, A_K^{−1})

    det(G(A_1, …, A_K)) = (det(A_1) ⋯ det(A_K))^{1/K}

    ¹T. Ando, C.-K. Li, and R. Mathias. Geometric means. Linear Algebra and Its Applications, 385:305–334, 2004
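    For K = 2 the matrix geometric mean has the closed form A # B = A^{1/2}(A^{−1/2} B A^{−1/2})^{1/2} A^{1/2}, and the determinant property above can be verified directly. A sketch (the helpers are ours):

    ```python
    import numpy as np

    def _sym_sqrt(S):
        """Principal square root of an SPD matrix via eigen-decomposition."""
        lam, U = np.linalg.eigh(S)
        return (U * np.sqrt(lam)) @ U.T

    rng = np.random.default_rng(1)

    def random_spd(n):
        M = rng.standard_normal((n, n))
        return M @ M.T + n * np.eye(n)

    A, B = random_spd(4), random_spd(4)
    Ah = _sym_sqrt(A)
    Ahi = np.linalg.inv(Ah)
    G = Ah @ _sym_sqrt(Ahi @ B @ Ahi) @ Ah      # two-matrix geometric mean A # B

    # Determinant property: det(G) = (det A * det B)^{1/2}
    print(np.linalg.det(G), np.sqrt(np.linalg.det(A) * np.linalg.det(B)))
    ```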

  • Geometric Mean of SPD Matrices

    A well-known mean on the manifold of SPD matrices is the Karcher mean [Kar77]:

      G(A_1, …, A_K) = arg min_{X ∈ S^n_{++}} (1/(2K)) Σ_{i=1}^K δ²(X, A_i),   (1)

    where δ(X, Y) = ‖log(X^{−1/2} Y X^{−1/2})‖_F is the geodesic distance under the affine-invariant metric

      g(η_X, ξ_X) = trace(η_X X^{−1} ξ_X X^{−1})

    The Karcher mean defined in (1) satisfies all the geometric properties in the ALM list [LL11]
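    The geodesic distance only needs the eigenvalues of X^{−1/2} Y X^{−1/2} (equivalently, of X^{−1} Y). A sketch of δ in NumPy (function names are ours):

    ```python
    import numpy as np

    def _sym_pow(S, p):
        """Matrix power of an SPD matrix via eigen-decomposition."""
        lam, U = np.linalg.eigh(S)
        return (U * lam**p) @ U.T

    def delta_R(X, Y):
        """Affine-invariant geodesic distance ||log(X^{-1/2} Y X^{-1/2})||_F,
        i.e. sqrt(sum_i log^2 lam_i) over the eigenvalues lam_i."""
        Xm = _sym_pow(X, -0.5)
        lam = np.linalg.eigvalsh(Xm @ Y @ Xm)
        return np.linalg.norm(np.log(lam))
    ```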

  • Algorithms

    G(A_1, …, A_K) = arg min_{X ∈ S^n_{++}} (1/(2K)) Σ_{i=1}^K δ²(X, A_i)

    Riemannian steepest descent [RA11, Ren13]

    Riemannian Barzilai-Borwein method [IP15]

    Riemannian Newton method [RA11]

    Richardson-like iteration [BI13]

    Riemannian steepest descent, conjugate gradient, BFGS, and trust-region Newton methods [JVV12]

    Limited-memory Riemannian BFGS method [YHAG16]
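    As a concrete instance of the simplest method above, the following sketch implements Riemannian steepest descent with the exponential retraction (our own illustration, with our own choices of stepsize and starting point, not the tuned implementations compared later). A fixed point of the iteration is exactly a zero of the Riemannian gradient of (1):

    ```python
    import numpy as np

    def _sym_fun(S, f):
        """Apply a scalar function to a symmetric matrix via eigen-decomposition."""
        lam, U = np.linalg.eigh(S)
        return (U * f(lam)) @ U.T

    def karcher_mean_rsd(As, alpha=1.0, iters=100, tol=1e-10):
        """Riemannian steepest descent for (1) under the affine-invariant
        metric; the stepsize alpha may need tuning for spread-out data."""
        X = sum(As) / len(As)                     # arithmetic mean as initial guess
        for _ in range(iters):
            Xh = _sym_fun(X, np.sqrt)
            Xhi = _sym_fun(X, lambda l: 1.0 / np.sqrt(l))
            # S = (1/K) sum_i log(X^{-1/2} A_i X^{-1/2}); its Frobenius norm
            # equals the Riemannian norm of grad f(X) under the metric above.
            S = sum(_sym_fun(Xhi @ A @ Xhi, np.log) for A in As) / len(As)
            if np.linalg.norm(S) < tol:
                break
            X = Xh @ _sym_fun(alpha * S, np.exp) @ Xh   # Exp_X(-alpha * grad f(X))
        return X
    ```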

  • Conditioning of the Objective Function

    [Figure: the hemstitching phenomenon for steepest descent, contrasting a well-conditioned Hessian with an ill-conditioned Hessian.]

    Small condition number ⇒ fast convergence

    Large condition number ⇒ slow convergence


  • Conditioning of the Karcher Mean Objective Function

    Riemannian metric: g_X(ξ, η) = trace(ξ X^{−1} η X^{−1})

    Euclidean metric: g_X(ξ, η) = trace(ξ η)

    Condition number κ of the Hessian at the minimizer µ:

    Hessian under the Riemannian metric:

    - κ(H_R) ≤ 1 + ln(max κ_i)/2, where κ_i = κ(µ^{−1/2} A_i µ^{−1/2})
    - κ(H_R) ≤ 20 if max κ_i = 10¹⁶

    Hessian under the Euclidean metric:

    - κ²(µ)/κ(H_R) ≤ κ(H_E) ≤ κ(H_R) κ²(µ)
    - κ(H_E) ≥ κ²(µ)/20

  • BFGS Quasi-Newton Algorithm: from Euclidean to Riemannian

    Euclidean BFGS:

    Update formula: x_{k+1} = x_k + α_k η_k  →  on a manifold, replaced by x_{k+1} = R_{x_k}(α_k η_k)

    Search direction: η_k = −B_k^{−1} grad f(x_k)

    B_k update:

      B_{k+1} = B_k − (B_k s_k s_k^T B_k) / (s_k^T B_k s_k) + (y_k y_k^T) / (y_k^T s_k),

    where s_k = x_{k+1} − x_k (replaced by R_{x_k}^{−1}(x_{k+1})) and y_k = grad f(x_{k+1}) − grad f(x_k), whose terms live on different tangent spaces
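    For reference, the Euclidean secant update is a one-liner; the Riemannian variants below replace the differences by retractions and transports (the helper name is ours):

    ```python
    import numpy as np

    def bfgs_update(B, s, y):
        """One classical BFGS secant update of the Hessian approximation B.
        The Riemannian version applies the same formula after transporting
        B, s_k, and y_k into the tangent space at the new iterate."""
        Bs = B @ s
        return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
    ```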

  • Riemannian BFGS (RBFGS) Algorithm

    Update formula:

      x_{k+1} = R_{x_k}(α_k η_k) with η_k = −B_k^{−1} grad f(x_k)

    B_k update [HGA15]:

      B_{k+1} = B̃_k − (B̃_k s_k (B̃_k s_k)^♭) / ((B̃_k s_k)^♭ s_k) + (y_k y_k^♭) / (y_k^♭ s_k),

    where s_k = T_{α_k η_k} α_k η_k, y_k = β_k^{−1} grad f(x_{k+1}) − T_{α_k η_k} grad f(x_k), and B̃_k = T_{α_k η_k} ∘ B_k ∘ T_{α_k η_k}^{−1}.

    Stores and transports B_k^{−1} as a dense matrix

    Requires excessive computation time and storage space for large-scale problems

  • Limited-memory RBFGS (LRBFGS)

    Riemannian BFGS:

      B_{k+1} = B̃_k − (B̃_k s_k (B̃_k s_k)^♭) / ((B̃_k s_k)^♭ s_k) + (y_k y_k^♭) / (y_k^♭ s_k),

    where s_k = T_{α_k η_k} α_k η_k, y_k = β_k^{−1} grad f(x_{k+1}) − T_{α_k η_k} grad f(x_k), and B̃_k = T_{α_k η_k} ∘ B_k ∘ T_{α_k η_k}^{−1}

    Limited-memory Riemannian BFGS:

    - Stores only the m most recent pairs s_k and y_k
    - Transports those vectors to the new tangent space rather than the entire matrix B_k^{−1}
    - Computational and storage complexity depends upon m (see the two-loop sketch below)
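    In practice the limited-memory update is applied through the standard two-loop recursion. A Euclidean sketch (ours); the Riemannian version first transports the stored pairs into the current tangent space and evaluates the inner products in the metric:

    ```python
    import numpy as np

    def two_loop(grad, S, Y):
        """Apply the L-BFGS inverse-Hessian approximation, built from the m
        most recent (s, y) pairs (oldest first in S, Y), to the gradient.
        The search direction is then -two_loop(grad, S, Y)."""
        q = grad.copy()
        rhos = [1.0 / (y @ s) for s, y in zip(S, Y)]
        alphas = []
        for s, y, rho in reversed(list(zip(S, Y, rhos))):
            a = rho * (s @ q)
            alphas.append(a)
            q -= a * y
        # Initial Hessian scaling gamma_k = (s^T y)/(y^T y) from the newest pair.
        gamma = (S[-1] @ Y[-1]) / (Y[-1] @ Y[-1]) if S else 1.0
        r = gamma * q
        for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):
            b = rho * (y @ r)
            r += (a - b) * s
        return r
    ```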

  • Implementations

    Representations of tangent vectors: T_X S^n_{++} = {S ∈ R^{n×n} | S = S^T}

    - Extrinsic representation: n²-dimensional vector
    - Intrinsic representation: d-dimensional vector, where d = n(n+1)/2

    Retraction

    - Exponential mapping: Exp_X(ξ) = X^{1/2} exp(X^{−1/2} ξ X^{−1/2}) X^{1/2}
    - Second-order approximation retraction [JVV12]: R_X(ξ) = X + ξ + (1/2) ξ X^{−1} ξ

    Vector transport

    - Parallel translation: T_{p_η}(ξ) = Q ξ Q^T, with Q = X^{1/2} exp(X^{−1/2} η X^{−1/2} / 2) X^{−1/2}
    - Vector transport by parallelization [HAG15]: essentially an identity

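    A sketch of these maps in NumPy (ours). Note that the vech coordinates below are orthonormal for the Frobenius inner product; the talk's intrinsic representation instead uses a basis orthonormal with respect to the affine-invariant metric at X, which is what lets that metric reduce to a Euclidean dot product:

    ```python
    import numpy as np

    def _sym_fun(S, f):
        lam, U = np.linalg.eigh(S)
        return (U * f(lam)) @ U.T

    def exp_map(X, xi):
        """Exponential mapping Exp_X(xi) = X^{1/2} exp(X^{-1/2} xi X^{-1/2}) X^{1/2}."""
        Xh = _sym_fun(X, np.sqrt)
        Xhi = _sym_fun(X, lambda l: 1.0 / np.sqrt(l))
        return Xh @ _sym_fun(Xhi @ xi @ Xhi, np.exp) @ Xh

    def retraction(X, xi):
        """Cheaper second-order approximation R_X(xi) = X + xi + (1/2) xi X^{-1} xi."""
        return X + xi + 0.5 * xi @ np.linalg.solve(X, xi)

    def vech(S):
        """d = n(n+1)/2 coordinates of a symmetric matrix: the diagonal plus
        the sqrt(2)-scaled strict upper triangle, so the ordinary dot product
        of coordinates reproduces the Frobenius inner product trace(S1 S2)."""
        iu = np.triu_indices(S.shape[0], k=1)
        return np.concatenate([np.diag(S), np.sqrt(2) * S[iu]])
    ```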

  • Complexity Comparison for LRBFGS

    Per-iteration cost, extrinsic vs. intrinsic representation:

    Function and Riemannian gradient
    - Identical in both approaches (the f + ∇f cost)

    Retraction
    - Extrinsic: evaluate R_X(η)
    - Intrinsic: first recover η from its intrinsic coordinates η̃_d, then evaluate R_X(η); intrinsic cost = extrinsic cost + 2n³ + o(n³)

    Riemannian metric
    - Extrinsic: 6n³ + o(n³)
    - Intrinsic: reduces to the Euclidean metric, n² + o(n²)

    Vector transport
    - Extrinsic: 2m vector transports per iteration
    - Intrinsic: no explicit vector transport

    Total per-iteration complexity:
    - Extrinsic: f + ∇f + 27n³ + 12mn² + 2m × (vector transport cost)
    - Intrinsic: f + ∇f + 22n³/3 + 4mn²

  • Numerical Results: Comparison of Different Algorithms

    K = 100, size = 3×3, d = 6

    [Figure: evolution of the averaged distance dist(µ, X_t) between the current iterate and the exact Karcher mean, with respect to iterations and time. Top row: 1 ≤ κ(A_i) ≤ 200; bottom row: 10³ ≤ κ(A_i) ≤ 10⁷. Methods: RL iteration, RSD-QR, RBB, LRBFGS (m = 2 and 4), RBFGS.]

  • Numerical Results: Comparison of Different Algorithms

    K = 30, size = 100×100, d = 5050

    [Figure: evolution of the averaged distance dist(µ, X_t) between the current iterate and the exact Karcher mean, with respect to iterations and time. Top row: 1 ≤ κ(A_i) ≤ 20; bottom row: 10⁴ ≤ κ(A_i) ≤ 10⁷. Methods: RL iteration, RSD-QR, RBB, LRBFGS (m = 2 and 8), RBFGS.]

  • Numerical Results: Riemannian vs. Euclidean Metrics

    [Figure: evolution of the averaged distance dist(µ, X_t) between the current iterate and the exact Karcher mean, with respect to iterations and time. Top row: K = 100, n = 3, 1 ≤ κ(A_i) ≤ 10⁶; bottom row: K = 30, n = 100, 1 ≤ κ(A_i) ≤ 10⁵. Methods: RSD, LRBFGS (m = 0 and 2), and RBFGS, each under the Riemannian (Rie) and the Euclidean (Euc) metric.]

  • Outline

    Karcher mean computation on Sn++

    Divergence-based means on Sn++

    L1-norm median computation on Sn++

    Application: Structure tensor image denoising

    Summary


  • Motivations

    Karcher mean

    K(A_1, …, A_K) = arg min_{X ∈ S^n_{++}} (1/(2K)) Σ_{i=1}^K δ²(X, A_i),   (1)

    where δ(X, Y) = ‖log(X^{−1/2} Y X^{−1/2})‖_F

      pros: satisfies the desired geometric properties
      cons: high computational cost

    Use divergences as alternatives to the geodesic distance due to their computational and empirical benefits

    A divergence is like a distance except that it lacks

      the triangle inequality
      symmetry

  • LogDet α-divergence and Associated Mean

    The LogDet α-divergence mean is defined as

      G(A_1, …, A_K) = arg min_{X ∈ S^n_{++}} (1/(2K)) Σ_{i=1}^K δ²_{LD,α}(A_i, X),   (2)

    where the LogDet α-divergence on S^n_{++} is given by

      δ²_{LD,α}(X, Y) = (4/(1 − α²)) log [ det((1−α)/2 · X + (1+α)/2 · Y) / (det(X)^{(1−α)/2} det(Y)^{(1+α)/2}) ]

    The LogDet α-divergence is asymmetric in general, except for α = 0

    (2) defines the right mean. The left mean can be defined in a similar way.
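    The divergence itself is cheap: three log-determinants. A sketch using slogdet for numerical stability (the function name is ours):

    ```python
    import numpy as np

    def logdet_alpha_div(X, Y, alpha):
        """Squared LogDet alpha-divergence of SPD matrices X, Y, alpha in (-1, 1).
        Nonnegative by concavity of log det, and zero iff X = Y."""
        a, b = (1 - alpha) / 2, (1 + alpha) / 2
        _, ld_mix = np.linalg.slogdet(a * X + b * Y)
        _, ld_X = np.linalg.slogdet(X)
        _, ld_Y = np.linalg.slogdet(Y)
        return 4.0 / (1 - alpha**2) * (ld_mix - a * ld_X - b * ld_Y)
    ```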

  • Karcher Mean vs. LogDet α-divergence Mean

    Complexity comparison for problem-related operations:

      Mean             function   gradient   total
      LD α-div. mean   2Kn³/3     3Kn³       11Kn³/3
      Karcher mean     18Kn³      5Kn³       23Kn³

    Invariance properties:

      Mean             scaling   rotation   congruence   inversion
      LD α-div. mean   ✓         ✓          ✓            ✗
      Karcher mean     ✓         ✓          ✓            ✓

  • Numerical Experiment: Comparisons of Different Algorithms

    K = 100, n = 3, and 10 ≤ κ(A_i) ≤ 10⁶; fixed-point iteration² with stepsize (1 − α)/(2K)

    [Figure: evolution of dist(µ, X_t) with respect to iterations and time, for α = 0.9, α = 0.5, and α = 0. Methods: FP, RSD, RBB, LRBFGS, RBFGS, RNewton.]

    ²Z. Chebbi and M. Moakher. Means of Hermitian positive-definite matrices based on the log-determinant α-divergence function. Linear Algebra and its Applications, 436(7):1872–1889, 2012

  • Numerical Experiment: Comparisons of Different Algorithms

    K = 100, n = 3, and 10 ≤ κ(A_i) ≤ 10⁶

    [Figure: evolution of dist(µ, X_t) over the first 50 iterations and 0.01 s, for α = 0.9, α = 0.5, and α = 0. Methods: FP, RSD, RBB, LRBFGS, RBFGS, RNewton.]

  • Outline

    Karcher mean computation on Sn++

    Divergence-based means on Sn++

    L1-norm median computation on Sn++

    Application: Structure tensor image denoising

    Conclusion


  • Motivations

    The mean of a set of points is sensitive to outliers

    The median is robust to outliers

    Figure: The geometric mean and median in R² (legend: points, outliers, mean, median).

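    In the Euclidean setting of this figure, the geometric median can be computed with the classical Weiszfeld iteration. A small demonstration of mean vs. median robustness (the data and constants are ours):

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    pts = np.vstack([rng.normal([2, 2], 0.5, (50, 2)),   # inliers
                     rng.normal([9, 9], 0.3, (5, 2))])   # outliers

    mean = pts.mean(axis=0)

    # Weiszfeld iteration for the Euclidean geometric median.
    med = mean.copy()
    for _ in range(200):
        d = np.linalg.norm(pts - med, axis=1)
        d = np.maximum(d, 1e-12)          # guard against division by zero
        med = (pts / d[:, None]).sum(axis=0) / (1.0 / d).sum()

    print("mean:", mean)    # dragged toward the outliers
    print("median:", med)   # stays near the inlier cloud
    ```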

  • Riemannian Median of SPD Matrices

    The Riemannian median of a set of SPD matrices is defined as:

    M(A_1, …, A_K) = arg min_{X ∈ S^n_{++}} (1/(2K)) Σ_{i=1}^K δ(A_i, X),

    where δ(X, Y) is a distance or the square root of a divergence function.

    The cost function is non-smooth at X = A_i

    There may exist multiple local minima for some choices of δ

  • Numerical Experiment

    Original data (K = 50) with added outliers

    [Table: averaged Riemannian distance δ_R for the median and the mean as the number of outliers grows: 0, 1, 5, 10, 25, 50.]

  • Outline

    Karcher mean computation on Sn++

    Divergence-based means on Sn++

    Riemannian L1-norm median computation on Sn++

    Application: Structure tensor image denoising

    Conclusion


  • Application: Structure Tensor Image Denoising

    A structure tensor image is a spatially structured matrix field

      I : Ω ⊂ Z² → S^n_{++}

    Noisy tensor images are simulated by replacing pixel values with an outlier tensor with a given probability Pr

    Denoising is done by averaging the matrices in the neighborhood of each pixel

    Mean Riemannian Error:

      MRE = (1/#Ω) Σ_{(i,j)∈Ω} δ_R(I_{i,j}, Ĩ_{i,j})

    [Figures: the original tensor image and a noisy tensor image.]
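    Given the geodesic distance δ_R from earlier, the MRE is a simple average over the pixel grid. A sketch (ours, with the tensor field stored as a dict over Ω):

    ```python
    import numpy as np

    def _sym_pow(S, p):
        lam, U = np.linalg.eigh(S)
        return (U * lam**p) @ U.T

    def delta_R(X, Y):
        Xm = _sym_pow(X, -0.5)
        return np.linalg.norm(np.log(np.linalg.eigvalsh(Xm @ Y @ Xm)))

    def mre(I, I_tilde):
        """Mean Riemannian Error between the original and denoised tensor
        fields, each stored as {(i, j): SPD matrix} over the same grid."""
        assert I.keys() == I_tilde.keys()
        return sum(delta_R(I[p], I_tilde[p]) for p in I) / len(I)
    ```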

  • Structure Tensor Image Denoising: Pr = 0.02

    [Figure: denoising results for the different means and medians at Pr = 0.02.]

  • Structure Tensor Image Denoising: MRE and Time

    MRE comparison (Pr = 0.02):

      Mean:     A-m     LE-m    J-m     K-m     R-median   α-m     α-median
      MRE:      1.089   0.263   0.616   0.264   0.026      0.125   0.029

    Time comparison (Pr = 0.02):

      Mean:     A-m     LE-m    J-m     K-m     R-median   α-m     α-median
      Time (s): 0.05    0.40    0.30    7.12    16.53      3.62    11.76

  • Structure Tensor Image Denoising: Pr = 0.1

    [Figure: denoising results for the different means and medians at Pr = 0.1.]

  • Structure Tensor Image Denoising: MRE and Time

    MRE comparison (Pr = 0.1):

      Mean:     A-m     LE-m    J-m     K-m     R-median   α-m     α-median
      MRE:      3.959   0.947   2.071   0.948   0.078      0.407   0.055

    Time comparison (Pr = 0.1):

      Mean:     A-m     LE-m    J-m     K-m     R-median   α-m     α-median
      Time (s): 0.04    0.40    0.31    6.47    14.31      4.66    11.32

  • Summary

    Karcher mean for SPD matrices

    - Analyze the conditioning of the Hessian of the Karcher mean cost function
    - Apply a limited-memory Riemannian BFGS method, with efficient implementations, to computing the SPD Karcher mean
    - Recommend using LRBFGS as the default method for the SPD Karcher mean computation

    Other averaging techniques for SPD matrices

    - Investigate divergence-based means and Riemannian L1-norm medians on S^n_{++}
    - Use recent developments in Riemannian optimization to develop efficient and robust algorithms on S^n_{++}
    - Evaluate the performance of different averaging techniques in applications

  • Thank you!


  • References I

    Ognjen Arandjelovic, Gregory Shakhnarovich, John Fisher, Roberto Cipolla, and Trevor Darrell. Face recognition with image sets using manifold density divergence. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 581–588. IEEE, 2005.

    D. A. Bini and B. Iannazzo. Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 438(4):1700–1710, 2013.

    Guang Cheng, Hesamoddin Salehian, and Baba Vemuri. Efficient recursive algorithms for computing the mean diffusion tensor and applications to DTI segmentation. In Computer Vision – ECCV 2012, pages 390–401, 2012.

  • References II

    P. T. Fletcher and S. Joshi. Riemannian geometry for the statistical analysis of diffusion tensor data. Signal Processing, 87(2):250–262, 2007.

    Wen Huang, P.-A. Absil, and Kyle A. Gallivan. A Riemannian symmetric rank-one trust-region method. Mathematical Programming, 150(2):179–216, 2015.

    Wen Huang, Kyle A. Gallivan, and P.-A. Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization, 25(3):1660–1685, 2015.

    Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Face recognition on large-scale video in the wild with hybrid Euclidean-and-Riemannian metric learning. Pattern Recognition, 48(10):3113–3124, 2015.

  • References III

    Bruno Iannazzo and Margherita Porcelli. The Riemannian Barzilai-Borwein method with nonmonotone line-search and the Karcher mean computation. Optimization Online, December 2015.

    B. Jeuris, R. Vandebril, and B. Vandereycken. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379–402, 2012.

    H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 1977.

    J. Lawson and Y. Lim. Monotonic properties of the least squares mean. Mathematische Annalen, 351(2):267–279, 2011.

  • References IV

    Jiwen Lu, Gang Wang, and Pierre Moulin. Image set classification using holistic multiple order statistics features and localized multi-kernel metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 329–336, 2013.

    Q. Rentmeesters and P.-A. Absil. Algorithm comparison for Karcher mean computation of rotation matrices and diffusion tensors. In 19th European Signal Processing Conference, pages 2229–2233, August 2011.

    Q. Rentmeesters. Algorithms for data fitting on some common homogeneous spaces. PhD thesis, Université catholique de Louvain, 2013.

  • References V

    Y. Rathi, A. Tannenbaum, and O. Michailovich. Segmenting images on the tensor manifold. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, June 2007.

    Gregory Shakhnarovich, John W. Fisher, and Trevor Darrell. Face recognition from long-term observations. In European Conference on Computer Vision, pages 851–865. Springer, 2002.

    Oncel Tuzel, Fatih Porikli, and Peter Meer. Region covariance: a fast descriptor for detection and classification. In European Conference on Computer Vision, pages 589–600. Springer, 2006.

    Xinru Yuan, Wen Huang, P.-A. Absil, and Kyle A. Gallivan. A Riemannian limited-memory BFGS algorithm for computing the matrix geometric mean. Procedia Computer Science, 80:2147–2157, 2016.