
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 8, AUGUST 2011 5455

The Burbea-Rao and Bhattacharyya Centroids

Frank Nielsen, Senior Member, IEEE, and Sylvain Boltz

Abstract—We study the centroid with respect to the class of information-theoretic Burbea-Rao divergences that generalize the celebrated Jensen-Shannon divergence by measuring the non-negative Jensen difference induced by a strictly convex and differentiable function. Although those Burbea-Rao divergences are symmetric by construction, they are not metric since they fail to satisfy the triangle inequality. We first explain how a particular symmetrization of Bregman divergences called Jensen-Bregman distances yields exactly those Burbea-Rao divergences. We then proceed by defining skew Burbea-Rao divergences, and show that skew Burbea-Rao divergences amount in limit cases to compute Bregman divergences. We then prove that Burbea-Rao centroids can be arbitrarily finely approximated by a generic iterative concave-convex optimization algorithm with guaranteed convergence property. In the second part of the paper, we consider the Bhattacharyya distance that is commonly used to measure the overlapping degree of probability distributions. We show that Bhattacharyya distances on members of the same statistical exponential family amount to calculate a Burbea-Rao divergence in disguise. Thus we get an efficient algorithm for computing the Bhattacharyya centroid of a set of parametric distributions belonging to the same exponential families, improving over former specialized methods found in the literature that were limited to univariate or "diagonal" multivariate Gaussians. To illustrate the performance of our Bhattacharyya/Burbea-Rao centroid algorithm, we present experimental performance results for k-means and hierarchical clustering methods of Gaussian mixture models.

Index Terms—Bhattacharyya divergence, Bregman divergences, Burbea-Rao divergence, centroid, exponential families, information geometry, Jensen-Shannon divergence, Kullback-Leibler divergence.

I. INTRODUCTION

A. Means and Centroids

IN Euclidean geometry, the centroid $c$ of a point set $\mathcal{P} = \{p_1, \dots, p_n\}$ is defined as the center of mass $c = \frac{1}{n}\sum_{i=1}^n p_i$, also characterized as the center point that minimizes the average squared Euclidean distances: $c = \arg\min_x \frac{1}{n}\sum_{i=1}^n \|x - p_i\|^2$. This basic notion of Euclidean centroid can be extended to denote a mean point representing the centrality of a given

Manuscript received April 28, 2010; revised January 11, 2011; accepted Feb-ruary 28, 2011. Date of current version July 29, 2011. This work was supportedby the French agency DIGITEO (GAS 2008-16D), the French National Re-search Agency (ANR GAIA 07-BLAN-0328-01), and Sony Computer ScienceLaboratories, Inc.

F. Nielsen is with the Department of Fundamental Research, Sony ComputerScience Laboratories, Inc., Tokyo, Japan, and the Computer Science Department(LIX), École Polytechnique, Palaiseau, France (e-mail: [email protected]).

S. Boltz is with the Computer Science Department (LIX), École Polytech-nique, Palaiseau, France (e-mail: [email protected]).

Communicated by A. Krzyzak, Associate Editor for Pattern Recognition, Sta-tistical Learning, and Inference.

Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIT.2011.2159046

point set $\mathcal{P}$. There are basically two complementary approaches to define mean values of numbers: 1) by axiomatization or 2) by optimization, summarized concisely as follows.

• By axiomatization. This approach was first historically pioneered by the independent work of Kolmogorov [1] and Nagumo [2] in 1930, and simplified and refined later by Aczél [3]. Without loss of generality, we consider the mean of two non-negative numbers $x$ and $y$, and postulate the following expected behaviors of a mean function $M(x, y)$ as axioms (common sense):

— Reflexivity. $M(x, x) = x$,
— Symmetry. $M(x, y) = M(y, x)$,
— Continuity and strict monotonicity. $M(\cdot, \cdot)$ continuous and $M(x, y) < M(x', y)$ for $x < x'$, and
— Anonymity. $M(M(x, y), M(x', y')) = M(M(x, x'), M(y, y'))$ (also called bisymmetry, expressing the fact that the mean can be computed as a mean on the row means or equivalently as a mean on the column means).

Then one can show that the mean function is necessarily written as

$M(x, y) = f^{-1}\left(\frac{f(x) + f(y)}{2}\right)$  (1)

for a strictly increasing function $f$. The arithmetic $\frac{x+y}{2}$, geometric $\sqrt{xy}$ and harmonic $\frac{2xy}{x+y}$ means are instances of such generalized means obtained for $f(x) = x$, $f(x) = \log x$ and $f(x) = \frac{1}{x}$, respectively. Those generalized means are also called quasi-arithmetic means, since they can be interpreted as the arithmetic mean on the sequence $f(x_1), \dots, f(x_n)$, the $f$-representation of numbers. To get geometric centroids, we simply consider means on each coordinate axis independently. The Euclidean centroid is thus interpreted as the Euclidean arithmetic mean. Barycenters (weighted centroids) are similarly obtained using non-negative weights $w_1, \dots, w_n$ (normalized so that $\sum_{i=1}^n w_i = 1$):

$M_f(x_1, \dots, x_n; w_1, \dots, w_n) = f^{-1}\left(\sum_{i=1}^n w_i f(x_i)\right)$  (2)

Those generalized means satisfy the inequality property

$M_f(x_1, \dots, x_n; w_1, \dots, w_n) \leq M_g(x_1, \dots, x_n; w_1, \dots, w_n)$  (3)

if and only if function $g$ dominates $f$: That is, $g \circ f^{-1}$ is convex. Therefore, the arithmetic mean dominates the geometric mean which in turn dominates the harmonic mean: $\frac{x+y}{2} \geq \sqrt{xy} \geq \frac{2xy}{x+y}$. Note that it is not a strict inequality in (3) as the means coincide for all identical elements: if all $x_i$ are equal to



$x$, then $M(x, \dots, x; w_1, \dots, w_n) = x$. All those quasi-arithmetic means further satisfy the "interness" property

$\min(x_1, \dots, x_n) \leq M(x_1, \dots, x_n; w_1, \dots, w_n) \leq \max(x_1, \dots, x_n)$  (4)

derived from limit cases of power means1 $M_\delta(x, y) = \left(\frac{x^\delta + y^\delta}{2}\right)^{\frac{1}{\delta}}$, obtained for $f(x) = x^\delta$ with $\delta$ a nonzero real number.
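To make the construction of quasi-arithmetic means concrete, here is a small Python sketch (our illustration, not part of the original paper) evaluating the weighted means of (2) for the arithmetic, geometric and harmonic generators; function names are ours.

```python
import math

def quasi_arithmetic_mean(xs, ws, f, f_inv):
    """Weighted quasi-arithmetic mean f^{-1}(sum_i w_i f(x_i)), cf. Eq. (2)."""
    assert abs(sum(ws) - 1.0) < 1e-12, "weights must be normalized"
    return f_inv(sum(w * f(x) for x, w in zip(xs, ws)))

xs = [1.0, 2.0, 4.0]
ws = [1 / 3, 1 / 3, 1 / 3]

arithmetic = quasi_arithmetic_mean(xs, ws, lambda t: t,       lambda t: t)
geometric  = quasi_arithmetic_mean(xs, ws, math.log,          math.exp)
harmonic   = quasi_arithmetic_mean(xs, ws, lambda t: 1.0 / t, lambda t: 1.0 / t)

# Dominance inequality (3): harmonic <= geometric <= arithmetic.
print(harmonic, geometric, arithmetic)
```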

• By optimization. In this second alternative approach, the barycenter $c$ is defined according to a distance function $d(\cdot, \cdot)$ as the optimal solution of a minimization problem

(OPT)  $c = \arg\min_x \sum_{i=1}^n w_i\, d(x, p_i)$  (5)

where the nonnegative weights $w_i$ denote the multiplicity or relative importance of points (by default, the centroid is defined by fixing all $w_i = \frac{1}{n}$). Ben-Tal et al. [4] considered an information-theoretic class of distances called $f$-divergences [5], [6]:

$I_f(p : q) = \sum_{j=1}^d q^{(j)} f\left(\frac{p^{(j)}}{q^{(j)}}\right)$  (6)

for a strictly convex differentiable function $f$ satisfying $f(1) = 0$ and $f'(1) = 0$. Although those $f$-divergences were primarily investigated for probability measures,2 we can extend the $f$-divergence to positive measures. Since program (OPT) is strictly convex in $x$, it admits a unique minimizer $c = \arg\min_x \sum_i w_i I_f(x : p_i)$, termed the entropic mean by Ben-Tal et al. [4]. Interestingly, those entropic means are linear scale-invariant:3

$M(\lambda p_1, \dots, \lambda p_n; w_1, \dots, w_n) = \lambda\, M(p_1, \dots, p_n; w_1, \dots, w_n), \quad \lambda > 0$  (7)

Nielsen and Nock [7] considered another class of information-theoretic distortion measures called Bregman divergences [8], [9]

$B_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle$  (8)

for a strictly convex differentiable function $F$. It follows that (OPT) is convex, and admits a unique minimizer $c = (\nabla F)^{-1}\left(\sum_{i=1}^n w_i \nabla F(p_i)\right)$, a quasi-arithmetic mean for the strictly increasing and continuous function $\nabla F$, the gradient of $F$. Observe that information-theoretic distances may be asymmetric (i.e., $d(p, q) \neq d(q, p)$), and therefore one may also define a right-sided centroid as the minimizer of

(OPT')  $c' = \arg\min_x \sum_{i=1}^n w_i\, d(p_i, x)$  (9)

1Besides the min/max operators interpreted as extremal power means, the geometric mean itself can also be interpreted as a power mean $\left(\frac{x^\delta + y^\delta}{2}\right)^{\frac{1}{\delta}}$ in the limit case $\delta \to 0$.

2In that context, a $d$-dimensional point is interpreted as a discrete and finite probability measure lying in the $(d-1)$-dimensional unit simplex.

3That is, means of homogeneous degree 1.

It turns out that for $f$-divergences, we have

$I_f(p : q) = I_{f^*}(q : p)$  (10)

for $f^*(u) = u f\left(\frac{1}{u}\right)$, so that (OPT') is solved as a (OPT) problem for the conjugate function $f^*$. In the same spirit, we have

$B_F(p : q) = B_{F^*}(\nabla F(q) : \nabla F(p))$  (11)

for Bregman divergences, where $F^*$ denotes the Legendre convex conjugate [8], [9].4 Surprisingly, although (OPT') may not be convex in $x$ for Bregman divergences (e.g., $F(x) = -\log x$), (OPT') admits nevertheless a unique minimizer, independent of the generator function $F$: the center of mass $c' = \sum_{i=1}^n w_i p_i$. Bregman means are not homogeneous except for the power generators $F_\alpha$ which yield entropic means, i.e., means that can also be interpreted5 as minimizers of average $f$-divergences [4]. Amari [11] further studied those power means (known as $\alpha$-means in information geometry [12]), and showed that they are linear-scale free means obtained as minimizers of $\alpha$-divergences, a proper subclass of $f$-divergences. Nielsen and Nock [13] reported an alternative simpler proof of $\alpha$-means by showing that the $\alpha$-divergences are Bregman divergences in disguise (namely, representational Bregman divergences for positive measures, but not for normalized distribution measures [10]). To get geometric centroids, we simply consider multivariate extensions of the optimization task (OPT). In particular, one may consider separable divergences, that is divergences that can be assembled coordinate-wise

$d(p, q) = \sum_{j=1}^d d\left(p^{(j)}, q^{(j)}\right)$  (12)

with $x^{(j)}$ denoting the $j$th coordinate. A typical nonseparable divergence is the squared Mahalanobis distance [14]

$d_Q(p, q) = (p - q)^\top Q (p - q)$  (13)

a Bregman divergence called generalized quadratic distance, defined for the generator $F(x) = x^\top Q x$, where $Q$ is a positive-definite matrix ($Q \succ 0$). For separable distances, the optimization problem (OPT) may then be reinterpreted as the task of finding the projection [15] of a point $\hat{p}$ (of dimension $nd$) to the upper line

$l = \left\{\hat{x} = (\underbrace{x, \dots, x}_{n\ \text{times}}) \mid x \in \mathbb{R}^d\right\}$  (14)

with $\hat{p} = (p_1, \dots, p_n)$, the $nd$-dimensional point obtained by stacking the coordinates of each of the $n$ points.
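As an illustration of the two sided Bregman centroids discussed above, the following Python sketch (ours, not code from the paper) evaluates the Bregman divergence of (8) and the left/right sided centroids for the separable generator $F(x) = \sum_j x^{(j)} \log x^{(j)}$; helper names are ours.

```python
import numpy as np

def bregman_divergence(F, gradF, p, q):
    """B_F(p : q) = F(p) - F(q) - <p - q, grad F(q)>, cf. Eq. (8)."""
    return F(p) - F(q) - np.dot(p - q, gradF(q))

# Negative Shannon entropy generator (on positive vectors).
F      = lambda x: np.sum(x * np.log(x))
gradF  = lambda x: np.log(x) + 1.0
gradFi = lambda y: np.exp(y - 1.0)        # inverse gradient (Legendre dual gradient)

points  = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]])
weights = np.array([1 / 3, 1 / 3, 1 / 3])

# Right-sided centroid min_x sum_i w_i B_F(p_i : x): the center of mass.
right_centroid = np.average(points, axis=0, weights=weights)

# Left-sided centroid min_x sum_i w_i B_F(x : p_i): quasi-arithmetic mean for grad F.
left_centroid = gradFi(np.average(gradF(points), axis=0, weights=weights))

print(right_centroid, left_centroid,
      bregman_divergence(F, gradF, points[0], points[1]))
```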

4Legendre dual convex conjugates $F$ and $F^*$ have necessarily reciprocal gradients: $\nabla F^* = (\nabla F)^{-1}$. See [7].

5In fact, Amari [10] proved that the intersection of the class of $f$-divergences with the class of Bregman divergences is the class of $\alpha$-divergences.


In geometry, means (centroids) play a crucial role in center-based clustering (i.e., $k$-means [16] for vector quantization applications). Indeed, the mean of a cluster allows one to aggregate data into a single center datum. Thus the notion of means is encapsulated into the broader theory of mathematical aggregators [17].

Results on geometric means can be easily transferred to the field of Statistics [4] by generalizing the optimization problem task to a random variable $X$ with distribution $P$ as

$\min_x E_P[d(x, X)]$  (15)

where $E_P$ denotes the expectation defined with respect to the Lebesgue-Stieltjes integral. Although this approach is discussed in [4] and important for defining various notions of centrality in statistics, we shall not cover this extended framework here, for sake of brevity.

B. Burbea-Rao Divergences

In this paper, we focus on the optimization approach (OPT) for defining other (geometric) means using the class of information-theoretic distances obtained by the Jensen difference for a strictly convex and differentiable function $F$:

$BR_F(p; q) = \frac{F(p) + F(q)}{2} - F\left(\frac{p + q}{2}\right)$  (16)

Since the underlying differential geometry implied by those Jensen difference distances has been seminally studied in papers of Burbea and Rao [18], [19], we shall term them Burbea-Rao divergences, and denote them by $BR_F$. In the remainder, we consider separable Burbea-Rao divergences. That is, for $d$-dimensional points $p$ and $q$, we define

$BR_F(p; q) = \sum_{j=1}^d BR_F\left(p^{(j)}; q^{(j)}\right)$  (17)

and study the Burbea-Rao centroids (and barycenters) as the minimizers of the average Burbea-Rao divergences. Those Burbea-Rao divergences generalize the celebrated Jensen-Shannon divergence [20]

$JS(p; q) = \frac{1}{2} KL\left(p : \frac{p+q}{2}\right) + \frac{1}{2} KL\left(q : \frac{p+q}{2}\right)$  (18)

by choosing $F(x) = \sum_{j=1}^d x^{(j)} \log x^{(j)}$, the negative Shannon entropy. Generators $F$ of parametric distances are convex functions representing entropies $-F$, which are concave functions. Burbea-Rao divergences contain all generalized quadratic distances ($F(x) = x^\top Q x$ for a positive definite matrix $Q \succ 0$, also called squared Mahalanobis distances).

Although the square root of the Jensen-Shannon divergence yields a metric (a Hilbertian metric), it is not true in general for Burbea-Rao divergences. The closest work to our paper is a 1-page symposium6 paper [21] discussing Ali-Silvey-Csiszár $f$-divergences [5], [6] and Bregman divergences [8], [22] (two entropy-based divergence classes). Those information-theoretic distortion classes are compared using quadratic differential metrics, mean values and projections. The notion of skew Jensen differences intervenes in the discussion.
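The Jensen difference of (16)-(17) is straightforward to evaluate numerically. The short Python sketch below (our illustration, not code from the paper) computes the separable Burbea-Rao divergence for an arbitrary scalar generator and recovers the Jensen-Shannon divergence of (18) for the negative Shannon entropy.

```python
import math

def burbea_rao(p, q, F):
    """Separable Burbea-Rao divergence, Eqs. (16)-(17)."""
    return sum((F(a) + F(b)) / 2.0 - F((a + b) / 2.0) for a, b in zip(p, q))

neg_shannon = lambda t: t * math.log(t)   # F(x) = x log x (negative Shannon entropy)
quadratic   = lambda t: t * t             # F(x) = x^2 yields (1/4) squared Euclidean

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

js = burbea_rao(p, q, neg_shannon)        # Jensen-Shannon divergence, Eq. (18)
se = 4.0 * burbea_rao(p, q, quadratic)    # squared Euclidean distance ||p - q||^2
print(js, se)
```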

C. Contributions and Paper Organization

The paper is articulated into two parts: The first part studies the Burbea-Rao centroids, and the second part shows some applications in Statistics. We summarize our contributions as follows.

• We define the parametric class of (skew) Burbea-Rao divergences, and show that those divergences naturally arise when generalizing the principle of the Jensen-Shannon divergence [20] to Jensen-Bregman divergences. In the limit cases, we further prove that those skew Burbea-Rao divergences yield asymptotically Bregman divergences.

• We describe the centroids with respect to the (skew) Burbea-Rao divergences. Besides centroids for special cases of Burbea-Rao divergences (including the squared Euclidean distances), those centroids are not available in closed-form equations. However, we show that any Burbea-Rao centroid can be estimated efficiently using an iterative convex-concave optimization procedure. As a by-product, we find Bregman sided centroids [7] in closed-form in the extremal skew cases.

We then consider applications of Burbea-Rao centroids in Statistics, and show the link with Bhattacharyya distances. A wide class of statistical parametric models can be handled in a unified manner as exponential families [23]. The classes of exponential families contain many of the standard parametric models including the Poisson, Gaussian, multinomial, and Gamma/Beta distributions, just to name a few prominent members. However, only a few closed-form formulas for the statistical Bhattacharyya distances between those densities are reported in the literature.7

For the second part, our contributions are reviewed as follows:

• We show that the (skew) Bhattacharyya distances calculated for distributions belonging to the same exponential family in statistics are equivalent to (skew) Burbea-Rao divergences. We mention the corresponding closed-form formulas for computing Chernoff coefficients and $\alpha$-divergences of exponential families. In the limit case, we obtain an alternative proof showing that the Kullback-Leibler divergence of members of the same exponential family is equivalent to a Bregman divergence calculated on the natural parameters [14].

6In the nineties, the IEEE International Symposium on Information Theory (ISIT) published only 1-page papers. We are grateful to Prof. Michèle Basseville for sending us the corresponding slides.

7For instance, the Bhattacharyya distance between multivariate normal distributions is given in [24].


• We approximate iteratively the Bhattacharyya centroid of any set of distributions of the same exponential family (including multivariate Gaussians) using the Burbea-Rao centroid algorithm. For the case of multivariate Gaussians, we design yet another tailored iterative scheme based on matrix differentials, generalizing the former univariate study of Rigazio et al. [25]. Thus we get either the generic way or the tailored way for computing the Bhattacharyya centroids of arbitrary Gaussians.

• As a field application, we show how to simplify Gaussian mixture models using hierarchical clustering, and show experimentally that the results obtained with the Bhattacharyya centroids compare favorably with former results obtained for Bregman centroids [26]. Our numerical experiments show that the generic method outperforms the alternative tailored method for multivariate Gaussians.

The paper is organized as follows: In Section II, we introduce Burbea-Rao divergences as a natural extension of the Jensen-Shannon divergence using the framework of Bregman divergences. It is followed by Section III, which considers the general case of skew divergences, and reveals the asymptotic behavior of extreme skew Burbea-Rao divergences as Bregman divergences. Section IV defines the (skew) Burbea-Rao centroids, and presents a simple iterative algorithm with guaranteed convergence. We then consider applications in Statistics in Section V: After briefly recalling exponential distributions in Section V-A, we show that Bhattacharyya distances and Chernoff/Amari $\alpha$-divergences are available in closed-form equations as Burbea-Rao divergences for distributions of the same exponential families. Section V-C presents an alternative iterative algorithm tailored to compute the Bhattacharyya centroid of multivariate Gaussians, generalizing the former specialized work of Rigazio et al. [25]. In Section V-D, we use those Bhattacharyya/Burbea-Rao centroids to simplify hierarchically Gaussian mixture models, and comment both qualitatively and quantitatively on our experiments on a color image segmentation application. Finally, Section VI concludes this paper by describing further perspectives and hinting at some information geometrical aspects of this work.

II. BURBEA-RAO DIVERGENCES FROM SYMMETRIZATION

OF BREGMAN DIVERGENCES

Let $\mathbb{R}^+$ denote the set of non-negative reals. For a strictly convex (and differentiable) generator $F$, we define the Burbea-Rao divergence as the following non-negative function:

$BR_F(p; q) = \frac{F(p) + F(q)}{2} - F\left(\frac{p + q}{2}\right) \geq 0$

The non-negative property of those divergences follows straightforwardly from Jensen's inequality. Although Burbea-Rao distances are symmetric ($BR_F(p; q) = BR_F(q; p)$), they are not metrics since they fail to satisfy the triangle inequality. A geometric interpretation of those divergences is given in Fig. 1. Note that $F$ is defined up to an affine term $\langle a, x\rangle + b$ (adding such a term to $F$ leaves $BR_F$ unchanged).

Fig. 1. Interpreting the Burbea-Rao divergence $BR_F(p; q)$ as the vertical distance between the midpoint of the segment $[(p, F(p)), (q, F(q))]$ and the point of the graph plot of $F$ at the midpoint $\frac{p+q}{2}$.

Fig. 2. Interpreting the Bregman divergence $B_F(p : q)$ as the vertical distance between the tangent plane at $q$ and its translate passing through $(p, F(p))$ (with identical slope $\nabla F(q)$).

We show that Burbea-Rao divergences extend the Jensen-Shannon divergence using the broader concept of Bregman divergences instead of the Kullback-Leibler divergence. A Bregman divergence [8], [9], [22] is defined as the positive tail of the first-order Taylor expansion of a strictly convex and differentiable function $F$:

$B_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle$  (19)

where $\nabla F$ denotes the gradient of $F$ (the vector of partial derivatives $\left[\frac{\partial F}{\partial x^{(1)}}, \dots, \frac{\partial F}{\partial x^{(d)}}\right]^\top$), and $\langle \cdot, \cdot\rangle$ the inner product (dot product for vectors). A Bregman divergence is interpreted geometrically [14] as the vertical distance between the tangent plane at $q$ of the graph plot $\mathcal{F} = \{(x, F(x))\}$ and its translate passing through $(p, F(p))$. Fig. 2 depicts graphically the geometric interpretation of the Bregman divergence (to be compared with the Burbea-Rao divergence in Fig. 1).

Bregman divergences are never metrics, and are symmetric only for the generalized quadratic distances [14] obtained by choosing $F(x) = x^\top Q x$, for some positive definite matrix $Q \succ 0$. Bregman divergences allow one to encapsulate both statistical distances and geometric distances:

• the Kullback-Leibler divergence, obtained for $F(x) = \sum_{j=1}^d x^{(j)} \log x^{(j)}$:

$KL(p : q) = \sum_{j=1}^d p^{(j)} \log\frac{p^{(j)}}{q^{(j)}}$  (20)

• the squared Euclidean distance, obtained for $F(x) = \langle x, x\rangle$:

$B_F(p : q) = \|p - q\|^2$  (21)


Basically, there are two ways to symmetrize Bregman divergences (see also work on Bregman metrization [27], [28]):

• Jeffreys-Bregman divergences. We consider half of the double-sided divergences:

$S_F(p; q) = \frac{B_F(p : q) + B_F(q : p)}{2}$  (22)

$= \frac{1}{2}\langle p - q, \nabla F(p) - \nabla F(q)\rangle$  (23)

Except for the generalized quadratic distances, this symmetric distance cannot be interpreted as a Bregman divergence [14].

• Jensen-Bregman divergences. We consider the Jeffreys-Bregman divergences from the source parameters to the average parameter $\frac{p+q}{2}$ as follows:

$JB_F(p; q) = \frac{1}{2}\left(B_F\left(p : \frac{p+q}{2}\right) + B_F\left(q : \frac{p+q}{2}\right)\right) = BR_F(p; q)$  (24)

Note that even for the negative Shannon entropy $F(x) = \sum_j x^{(j)} \log x^{(j)}$ (extended to positive measures), those two symmetrizations yield different divergences: While $S_F$ uses the gradient $\nabla F$, $JB_F$ relies only on the generator $F$. Both $S_F$ and $JB_F$ have always finite values.8 The first symmetrization approach was historically studied by Jeffreys [29].

The second way to symmetrize Bregman divergences generalizes the spirit of the Jensen-Shannon divergence [20]

$JS(p; q) = \frac{1}{2} KL\left(p : \frac{p+q}{2}\right) + \frac{1}{2} KL\left(q : \frac{p+q}{2}\right)$  (25)

$= H\left(\frac{p+q}{2}\right) - \frac{H(p) + H(q)}{2}$  (26)

with non-negativity that can be derived from Jensen's inequality, hence its name ($H$ denotes the Shannon entropy). The Jensen-Shannon divergence is also called the total divergence to the average, a generalized measure of diversity from the population distributions $p$ and $q$ to the average population $\frac{p+q}{2}$. Those Jensen difference-type divergences are by definition Burbea-Rao divergences. For the Shannon entropy, those two different information divergence symmetrizations (Jensen-Shannon divergence and Jeffreys divergence) satisfy the following inequality:

$JS(p; q) \leq \frac{1}{2} S_F(p; q)$  (27)

Nielsen and Nock [7] investigated the centroids with respect to Jeffreys-Bregman divergences (the symmetrized Kullback-Leibler divergence).
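A quick numerical sanity check of identity (24), namely that the Jensen-Bregman symmetrization coincides with the Jensen difference, can be written as follows (our sketch, assuming the separable generator $F(x) = \sum_j x^{(j)} \log x^{(j)}$):

```python
import numpy as np

F     = lambda x: float(np.sum(x * np.log(x)))
gradF = lambda x: np.log(x) + 1.0

def bregman(p, q):
    return F(p) - F(q) - float(np.dot(p - q, gradF(q)))

def jensen_bregman(p, q):   # Eq. (24): average Bregman divergence to the midpoint
    m = (p + q) / 2.0
    return 0.5 * (bregman(p, m) + bregman(q, m))

def burbea_rao(p, q):       # Eq. (16): Jensen difference
    return (F(p) + F(q)) / 2.0 - F((p + q) / 2.0)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.6, 0.3, 0.1])
print(jensen_bregman(p, q), burbea_rao(p, q))   # identical up to rounding
```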

III. SKEW BURBEA-RAO DIVERGENCES

We further generalize Burbea-Rao divergences by introducing a positive weight $\alpha \in (0, 1)$ when averaging the source parameters $p$ and $q$ as follows:

$BR_F^{(\alpha)}(p; q) = \alpha F(p) + (1 - \alpha) F(q) - F(\alpha p + (1 - \alpha) q)$

8This may not be the case of Bregman/Kullback-Leibler divergences that canpotentially be unbounded.

We consider the open interval $(0, 1)$ since otherwise the divergence has no discriminatory power (indeed, $BR_F^{(0)}(p; q) = BR_F^{(1)}(p; q) = 0$ for all $p, q$). Although skewed divergences are asymmetric ($BR_F^{(\alpha)}(p; q) \neq BR_F^{(\alpha)}(q; p)$ in general), we can swap arguments by replacing $\alpha$ by $1 - \alpha$:

$BR_F^{(\alpha)}(q; p) = BR_F^{(1-\alpha)}(p; q)$  (28)

Those skew Burbea-Rao divergences are similarly found using a skew Jensen-Bregman counterpart (the gradient terms $\nabla F(\alpha p + (1-\alpha) q)$ perfectly cancel in the sum of skew Bregman divergences):

$\alpha\, B_F(p : \alpha p + (1-\alpha) q) + (1-\alpha)\, B_F(q : \alpha p + (1-\alpha) q) = BR_F^{(\alpha)}(p; q)$

In the limit cases, $\alpha \to 0$ or $\alpha \to 1$, we have $BR_F^{(\alpha)}(p; q) \to 0$. That is, those divergences lose their discriminatory power at the extremities. However, we show that those skew Burbea-Rao divergences tend asymptotically to Bregman divergences:

$\lim_{\alpha \to 0} \frac{1}{\alpha(1-\alpha)} BR_F^{(\alpha)}(p; q) = B_F(p : q)$  (29)

$\lim_{\alpha \to 1} \frac{1}{\alpha(1-\alpha)} BR_F^{(\alpha)}(p; q) = B_F(q : p)$  (30)

The limit in the right-hand-side of (30) can be expressed alternatively as the following one-sided limit (writing $\beta = 1 - \alpha$):

$\lim_{\beta \downarrow 0} \frac{(1-\beta) F(p) + \beta F(q) - F(p + \beta(q - p))}{\beta} = F(q) - F(p) - \lim_{\beta \downarrow 0} \frac{F(p + \beta(q - p)) - F(p)}{\beta}$  (31)

where the arrows $\downarrow$ and $\uparrow$ denote the limit from the right and the limit from the left, respectively (see [30] for notations). The right derivative of a function $g$ at $x$ is defined as $g'_+(x) = \lim_{h \downarrow 0} \frac{g(x+h) - g(x)}{h}$. Since $F$ is convex, it follows that the right-hand-side limit of (31) is the right derivative (see Theorem 1 of [30] that gives a generalized Taylor expansion of convex functions) of the map

$g: \beta \mapsto F(p + \beta(q - p))$  (32)

taken at $\beta = 0$. Thus, we have

$\lim_{\alpha \to 1} \frac{1}{\alpha(1-\alpha)} BR_F^{(\alpha)}(p; q) = F(q) - F(p) - g'_+(0)$  (33)

with

$g'_+(0) = \langle q - p, \nabla F(p)\rangle$  (34)

so that

$F(q) - F(p) - \langle q - p, \nabla F(p)\rangle = B_F(q : p)$  (35)

Lemma 1: Skew Burbea-Rao divergences tend asymptotically to Bregman divergences $B_F(p : q)$ or reverse Bregman divergences $B_F(q : p)$.
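Lemma 1 is easy to check numerically; the following sketch (ours, not from the paper) evaluates the rescaled skew Jensen difference of (29)-(30) for a univariate generator and compares it to the corresponding Bregman divergences as $\alpha$ approaches 0 and 1:

```python
import math

F  = lambda t: t * math.log(t)        # strictly convex generator
dF = lambda t: math.log(t) + 1.0      # its derivative

def skew_br(p, q, a):                  # alpha F(p) + (1-alpha) F(q) - F(alpha p + (1-alpha) q)
    return a * F(p) + (1 - a) * F(q) - F(a * p + (1 - a) * q)

def bregman(p, q):                     # B_F(p : q)
    return F(p) - F(q) - (p - q) * dF(q)

p, q = 2.0, 5.0
for a in (1e-3, 1e-6):
    print(skew_br(p, q, a) / (a * (1 - a)), bregman(p, q))       # Eq. (29): tends to B_F(p : q)
    print(skew_br(p, q, 1 - a) / (a * (1 - a)), bregman(q, p))   # Eq. (30): tends to B_F(q : p)
```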


Thus (by Lemma 1), we may scale skew Burbea-Rao divergences so that Bregman divergences belong to the family of skew Burbea-Rao divergences

$sBR_F^{(\alpha)}(p; q) = \frac{1}{\alpha(1-\alpha)}\left(\alpha F(p) + (1-\alpha) F(q) - F(\alpha p + (1-\alpha) q)\right)$  (36)

Moreover, $\alpha$ is now not anymore restricted to $(0, 1)$ but can range over the full real line ($\alpha \in \mathbb{R} \setminus \{0, 1\}$), as also noticed in [31]. Setting the skew factor accordingly, we get (37), as shown at the bottom of the page.

IV. BURBEA-RAO CENTROIDS

Let $\mathcal{P} = \{p_1, \dots, p_n\}$ denote a $d$-dimensional point set. To each point $p_i$, let us further associate a positive weight $w_i$ (accounting for arbitrary multiplicity) and a positive scalar $\alpha_i \in (0, 1)$ to define an anchored distance $BR_F^{(\alpha_i)}(x; p_i)$. Define the skew Burbea-Rao barycenter (or centroid) as the minimizer of the following optimization task:

(OPT)  $c = \arg\min_x \sum_{i=1}^n w_i\, BR_F^{(\alpha_i)}(x; p_i)$  (38)

Without loss of generality, we consider argument $x$ on the left argument position (otherwise, we change all $\alpha_i$ to $1 - \alpha_i$ to get the right-sided Burbea-Rao centroid). Removing all terms independent of $x$, the minimization program (OPT) amounts to minimize equivalently the following energy function:

$E(x) = \sum_{i=1}^n w_i\left(\alpha_i F(x) - F(\alpha_i x + (1 - \alpha_i) p_i)\right)$  (39)

Observe that the energy function is decomposable into the sum of a convex function $E_{\mathrm{convex}}(x) = \sum_i w_i \alpha_i F(x)$ with a concave function $E_{\mathrm{concave}}(x) = -\sum_i w_i F(\alpha_i x + (1 - \alpha_i) p_i)$ (since the sum of concave functions is concave). We can thus solve iteratively this optimization problem using the Convex-ConCave Procedure [32], [33] (CCCP), by starting from an initial position (say, the barycenter $c_0 = \sum_i w_i p_i / \sum_i w_i$), and iteratively updating the barycenter as follows:

$\nabla E_{\mathrm{convex}}(c_{t+1}) = -\nabla E_{\mathrm{concave}}(c_t)$  (40)

$c_{t+1} = (\nabla F)^{-1}\left(\frac{\sum_{i=1}^n w_i \alpha_i \nabla F(\alpha_i c_t + (1 - \alpha_i) p_i)}{\sum_{i=1}^n w_i \alpha_i}\right)$  (41)

Since $F$ is convex, the second-order derivative $\nabla^2 F$ is always positive definite, and $\nabla F$ is strictly monotone increasing. Thus we can interpret (41) as a fixed-point equation by considering the $\nabla F$-representation. Each iteration is interpreted as a quasi-arithmetic mean. This proves that the Burbea-Rao centroid is always well-defined: there is (at most) a unique fixed point of (41), with $\nabla F$ a strictly monotone increasing function.

In some cases, like the squared Euclidean distance (or squared Mahalanobis distances), we find closed-form solutions for the Burbea-Rao barycenters. For example, consider the (negative) quadratic entropy $F(x) = \langle x, x\rangle$ with weights $w_i = \frac{1}{n}$ and all $\alpha_i = \frac{1}{2}$ (non-skew, symmetric Burbea-Rao divergences). We have

$E(x) = \frac{1}{n} \sum_{i=1}^n \left(\frac{1}{2}\langle x, x\rangle - \left\langle \frac{x + p_i}{2}, \frac{x + p_i}{2}\right\rangle\right)$  (42)

The minimum is obtained when the gradient $\nabla E(x) = \frac{1}{2}\left(x - \frac{1}{n}\sum_i p_i\right)$ vanishes, that is when $x = \frac{1}{n}\sum_{i=1}^n p_i$, the barycenter of the point set $\mathcal{P}$. For most Burbea-Rao divergences, however, the minimization can only be solved numerically; a generic iteration is sketched below.
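The CCCP update (41) takes only a few lines of code. Here is a minimal Python sketch (ours, not the authors' implementation) for a separable generator, using the negative Shannon entropy as an example; helper names are illustrative:

```python
import numpy as np

gradF     = lambda x: np.log(x) + 1.0   # grad F for F(x) = sum x log x
gradF_inv = lambda y: np.exp(y - 1.0)   # (grad F)^{-1}

def burbea_rao_centroid(points, weights, alphas, iters=100, tol=1e-12):
    """Skew Burbea-Rao centroid by the CCCP fixed-point iteration, Eq. (41)."""
    c = np.average(points, axis=0, weights=weights)   # initialize with the barycenter
    wa = weights * alphas
    for _ in range(iters):
        mix = alphas[:, None] * c[None, :] + (1.0 - alphas)[:, None] * points
        c_next = gradF_inv(np.average(gradF(mix), axis=0, weights=wa))
        if np.max(np.abs(c_next - c)) < tol:
            return c_next
        c = c_next
    return c

points  = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]])
weights = np.array([1 / 3, 1 / 3, 1 / 3])
alphas  = np.full(3, 0.5)               # symmetric (non-skew) Burbea-Rao centroid
print(burbea_rao_centroid(points, weights, alphas))
```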

Observe that for extremal skew cases ($\alpha \to 0$ or $\alpha \to 1$), we obtain the Bregman centroids in closed-form solutions [see (30)]. Thus, skew Burbea-Rao centroids allow one to get a smooth transition from the right-sided Bregman centroid (the center of mass) to the left-sided Bregman centroid (a quasi-arithmetic mean obtained for $f = \nabla F$, a continuous and strictly increasing function).

Theorem 1: Skew Burbea-Rao centroids can be estimated iteratively using the CCCP iterative algorithm. In extremal skew cases, the Burbea-Rao centroids tend to Bregman left/right sided centroids, and have closed-form equations in the limit cases.

To describe the orbit of Burbea-Rao centroids linking the left to the right sided Bregman centroids, we compute for $\alpha \in (0, 1)$ the skew Burbea-Rao centroids with the following update scheme:

$c_{t+1} = (\nabla F)^{-1}\left(\frac{\sum_{i=1}^n w_i \nabla F(\alpha c_t + (1 - \alpha) p_i)}{\sum_{i=1}^n w_i}\right)$  (43)

We may further consider various convex generators $F_i$, one for each point $p_i$, and consider the corresponding updating scheme obtained by using the respective gradients $\nabla F_i$ in the CCCP update.

(37)


A. Burbea-Rao Divergences of a Population

Consider now the Burbea-Rao divergence of a popula-tion with respective positive normalized weights

. The Burbea-Rao divergence is defined by

(44)

This family of diversity measures includes the Jensen-Rényidivergences [34], [35] for , where

is the Rényi entropy of order . (Rényi en-tropy is concave for and tend to Shannon entropy for

.)

V. BHATTACHARYYA DISTANCES AS BURBEA-RAO DISTANCES

We first briefly recall the versatile class of exponential family distributions in Section V-A. Then we show in Section V-B that the statistical Bhattacharyya/Chernoff distances between exponential family distributions amount to compute a Burbea-Rao divergence.

A. Exponential Family Distribution in Statistics

Many usual statistical parametric distributions (e.g., Gaussian, Poisson, Bernoulli/multinomial, Gamma/Beta, etc.) share common properties arising from their common canonical decomposition of the probability distribution [9]:

$p_F(x; \theta) = \exp\left(\langle t(x), \theta\rangle - F(\theta) + k(x)\right)$  (45)

Those distributions9 are said to belong to the exponential families (see [23] for a tutorial). An exponential family is characterized by its log-normalizer $F(\theta)$, and a distribution in that family by its natural parameter $\theta$ belonging to the natural space $\Theta$. The log-normalizer $F$ is strictly convex and $C^\infty$, and can also be expressed using the source coordinate system $\lambda$ using the 1-to-1 map $\tau: \lambda \mapsto \theta$ that converts parameters from the source coordinate system to the natural coordinate system

$F_\lambda(\lambda) = (F \circ \tau)(\lambda) = F(\theta)$  (46)

where $F_\lambda = F \circ \tau$ denotes the log-normalizer function expressed using the $\lambda$-coordinates instead of the natural $\theta$-coordinates.

The vector $t(x)$ denotes the sufficient statistics, that is the set of linearly independent functions that allows one to concentrate, without any loss, all information about the parameter $\theta$ carried in the iid. observations $x_1, x_2, \dots$. The inner product $\langle \cdot, \cdot\rangle$ is defined according to the primitive type of $\theta$. Namely, it is a multiplication $\langle x, y\rangle = x y$ for scalars, a dot product $\langle x, y\rangle = x^\top y$ for vectors, a matrix trace $\langle X, Y\rangle = \mathrm{tr}(X^\top Y)$ for matrices, etc. For composite types such as $\theta$ being defined by both a vector part and a matrix part, the composite inner product is defined as the sum of the inner products on the primitive types. Finally, $k(x)$ represents the carrier measure according to the counting or Lebesgue measures. Decompositions for most common exponential family distributions are given in [23]. An exponential family $E_F = \{p_F(x; \theta) \mid \theta \in \Theta\}$ is the set of probability distributions obtained for the same log-normalizer function $F$. Information geometry considers $E_F$ as a manifold entity, and studies its differential geometric properties [12].

9The distributions can either be discrete or continuous. We do not introduce the unifying framework of probability measures in order to not burden the paper.

For example, consider the family of Poisson distributions with mass function

$p(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$  (47)

for $x$ a non-negative integer. Poisson distributions are univariate exponential families of order 1 (parameter $\lambda$). The canonical decomposition yields:
• the sufficient statistic $t(x) = x$;
• $\theta = \log\lambda$, the natural parameter;
• $F(\theta) = \exp\theta$, the log-normalizer;
• and $k(x) = -\log x!$ the carrier measure (with respect to the counting measure).
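The Poisson decomposition above can be checked directly; the small Python sketch below (our illustration) reassembles the mass function of (47) from the canonical terms $t(x) = x$, $\theta = \log\lambda$, $F(\theta) = e^\theta$ and $k(x) = -\log x!$:

```python
import math

def poisson_pmf(x, lam):                  # Eq. (47)
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_pmf_canonical(x, lam):        # exp(<t(x), theta> - F(theta) + k(x)), Eq. (45)
    theta = math.log(lam)                 # natural parameter
    F = math.exp(theta)                   # log-normalizer F(theta) = exp(theta) = lam
    k = -math.log(math.factorial(x))      # carrier measure term
    return math.exp(x * theta - F + k)

for x in range(5):
    print(poisson_pmf(x, 3.2), poisson_pmf_canonical(x, 3.2))   # identical values
```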

Since we deal with applications using multivariate normals in the following, we also report explicitly the canonical decomposition for the multivariate Gaussian family $\{\mathcal{N}(\mu, \Sigma)\}$. We rewrite the usual Gaussian density of mean $\mu$ and variance-covariance matrix $\Sigma$:

$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{|\Sigma|}}\exp\left(-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right)$  (48)

$= \exp\left(x^\top\Sigma^{-1}\mu - \frac{1}{2}x^\top\Sigma^{-1}x - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\log|\Sigma| - \frac{d}{2}\log 2\pi\right)$  (49)

in the canonical form of (45) with
• $\theta = (\theta_v, \theta_M) = \left(\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}\right)$, with $\theta_v \in \mathbb{R}^d$ and $\theta_M \in \mathcal{C}_d$, where $\mathcal{C}_d$ denotes the cone of positive definite $d \times d$ matrices;
• $t(x) = (x, -x x^\top)$;
• $F(\theta) = \frac{1}{4}\theta_v^\top\theta_M^{-1}\theta_v - \frac{1}{2}\log|\theta_M| + \frac{d}{2}\log\pi$;
• $k(x) = 0$.

In this case, the inner product is composite and is calculated as the sum of a dot product and a matrix trace as follows:

$\langle \theta, \theta'\rangle = \theta_v^\top\theta'_v + \mathrm{tr}\left(\theta_M^\top\theta'_M\right)$  (50)

The coordinate transformation $\tau: \lambda \mapsto \theta$ is given for $\lambda = (\mu, \Sigma)$ by

$\tau(\lambda) = \left(\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}\right)$  (51)

and its inverse mapping $\tau^{-1}: \theta \mapsto \lambda$ by

$\tau^{-1}(\theta) = \left(\frac{1}{2}\theta_M^{-1}\theta_v, \frac{1}{2}\theta_M^{-1}\right)$  (52)
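For later use, the source-to-natural conversions (51)-(52) and the log-normalizer $F(\theta)$ of the Gaussian family can be coded directly; the following Python sketch is our illustration of those formulas (function names are ours):

```python
import numpy as np

def gaussian_to_natural(mu, Sigma):
    """lambda = (mu, Sigma) -> theta = (theta_v, theta_M), Eq. (51)."""
    Sinv = np.linalg.inv(Sigma)
    return Sinv @ mu, 0.5 * Sinv

def natural_to_gaussian(theta_v, theta_M):
    """theta -> lambda, Eq. (52)."""
    Minv = np.linalg.inv(theta_M)
    return 0.5 * Minv @ theta_v, 0.5 * Minv

def gaussian_log_normalizer(theta_v, theta_M):
    """F(theta) = (1/4) theta_v^T theta_M^{-1} theta_v - (1/2) log|theta_M| + (d/2) log pi."""
    d = theta_v.shape[0]
    Minv = np.linalg.inv(theta_M)
    return (0.25 * theta_v @ Minv @ theta_v
            - 0.5 * np.log(np.linalg.det(theta_M))
            + 0.5 * d * np.log(np.pi))

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
theta_v, theta_M = gaussian_to_natural(mu, Sigma)
print(natural_to_gaussian(theta_v, theta_M))      # recovers (mu, Sigma)
print(gaussian_log_normalizer(theta_v, theta_M))  # = 0.5 mu^T Sigma^{-1} mu + 0.5 log|Sigma| + (d/2) log(2 pi)
```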


B. Bhattacharyya/Chernoff Coefficients and -Divergences asSkew Burbea-Rao Divergences

For arbitrary probability distributions $p(x)$ and $q(x)$ (parametric or not), we measure the amount of overlap between those distributions using the Bhattacharyya coefficient [36]:

$\rho(p, q) = \int \sqrt{p(x)\, q(x)}\,\mathrm{d}x$  (53)

Clearly, the Bhattacharyya coefficient (measuring the affinity between distributions [37]) falls in the unit range:

$0 \leq \rho(p, q) \leq 1$  (54)

In fact, we may interpret this coefficient geometrically by considering $\sqrt{p(x)}$ and $\sqrt{q(x)}$ as unit vectors. The Bhattacharyya coefficient is then their dot product, representing the cosine of the angle made by the two unit vectors. The Bhattacharyya distance $B(p, q)$ is derived from its coefficient [36] as

$B(p, q) = -\ln\rho(p, q)$  (55)

The Bhattacharyya distance allows one to get both upper and lower bounds on Bayes' classification error [38], [39], while there are no such results for the symmetric Kullback-Leibler divergence. Both the Bhattacharyya distance and the symmetric Kullback-Leibler divergence agree with the Fisher information at the infinitesimal level. Although the Bhattacharyya distance is symmetric, it is not a metric. Nevertheless, it can be metrized by transforming it into the following Hellinger metric [40]:

$H(p, q) = \sqrt{\frac{1}{2}\int\left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2\mathrm{d}x}$  (56)

such that $0 \leq H(p, q) \leq 1$. It follows that

$H(p, q) = \sqrt{1 - \rho(p, q)}$  (57)

The Hellinger metric is also called the Matusita metric [37] in the literature. The thesis of Hellinger was emphasized in the work of Kakutani [41].
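For intuition, the Bhattacharyya coefficient (53), the distance (55) and the Hellinger metric (56)-(57) can be approximated by numerical integration for any pair of univariate densities; the sketch below (ours) does this for two Gaussians on a grid:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-15.0, 15.0, 200001)
p = gauss_pdf(x, 0.0, 1.0)
q = gauss_pdf(x, 2.0, 1.5)

rho = np.trapz(np.sqrt(p * q), x)       # Bhattacharyya coefficient, Eq. (53)
B   = -np.log(rho)                       # Bhattacharyya distance, Eq. (55)
H   = np.sqrt(0.5 * np.trapz((np.sqrt(p) - np.sqrt(q)) ** 2, x))   # Hellinger, Eq. (56)

print(rho, B, H, np.sqrt(1.0 - rho))     # the last two agree, Eq. (57)
```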

We consider a direct generalization of the Bhattacharyya coefficients and divergences called Chernoff divergences10

$C_\alpha(p, q) = -\ln\int p(x)^\alpha q(x)^{1-\alpha}\,\mathrm{d}x$  (58)

$= -\ln E_q\left[L^\alpha(x)\right]$  (59)

$= -\ln c_\alpha(p, q)$  (60)

defined for some $\alpha \in (0, 1)$ (the Bhattacharyya divergence is obtained for $\alpha = \frac{1}{2}$), where $E_q$ denotes the expectation with respect to $q$, and $L(x) = \frac{p(x)}{q(x)}$ the likelihood ratio. The term $c_\alpha(p, q) = \int p(x)^\alpha q(x)^{1-\alpha}\,\mathrm{d}x$ is called the Chernoff coefficient. The Bhattacharyya/Chernoff distance of members of the same exponential family yields a weighted asymmetric Burbea-Rao divergence (namely, a skew Burbea-Rao divergence):

$C_\alpha(p_F(x; \theta_1), p_F(x; \theta_2)) = BR_F^{(\alpha)}(\theta_1; \theta_2)$  (61)

with

$BR_F^{(\alpha)}(\theta_1; \theta_2) = \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F(\alpha\theta_1 + (1-\alpha)\theta_2)$  (62)

Chernoff coefficients are also related to $\alpha$-divergences, the canonical divergences in $\alpha$-flat spaces in information geometry [12] (p. 57):

$D_\alpha(p : q) = \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{\frac{1-\alpha}{2}} q(x)^{\frac{1+\alpha}{2}}\,\mathrm{d}x\right), \quad \alpha \neq \pm 1$  (63)

The class of $\alpha$-divergences satisfies the following reference duality: $D_\alpha(p : q) = D_{-\alpha}(q : p)$. Remapping $\alpha' = \frac{1-\alpha}{2}$, we transform Amari $\alpha$-divergences to Chernoff $\alpha'$-divergences:11

$D_\alpha(p : q) = \frac{1}{\alpha'(1-\alpha')}\left(1 - c_{\alpha'}(p, q)\right)$  (64)

10In the literature, the Chernoff information is also defined as the optimal value over the skew parameter, $C(p, q) = \max_{\alpha\in(0,1)} C_\alpha(p, q)$. Similarly, optimal Chernoff coefficients are defined by the corresponding extremum of $c_\alpha(p, q)$ over $\alpha \in (0, 1)$.

Theorem 2: The Chernoff $\alpha$-divergence $C_\alpha(p_F(x; \theta_1), p_F(x; \theta_2))$ of distributions belonging to the same exponential family is given in closed-form by means of a skew Burbea-Rao divergence as $C_\alpha(p_F(x; \theta_1), p_F(x; \theta_2)) = BR_F^{(\alpha)}(\theta_1; \theta_2)$, with $BR_F^{(\alpha)}(\theta_1; \theta_2) = \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F(\alpha\theta_1 + (1-\alpha)\theta_2)$. The Amari $\alpha$-divergence for members of the same exponential family amounts to compute

$D_\alpha(p_F(x; \theta_1) : p_F(x; \theta_2)) = \frac{4}{1-\alpha^2}\left(1 - \exp\left(-BR_F^{\left(\frac{1-\alpha}{2}\right)}(\theta_1; \theta_2)\right)\right)$

Let us compute the Chernoff coefficient for distributions belonging to the same exponential family. Without loss of generality, let us consider the reduced canonical form of exponential

11Chernoff coefficients are also related to the Rényi $\alpha$-divergence generalizing the Kullback-Leibler divergence, $R_\alpha(p : q) = \frac{1}{\alpha-1}\ln\int p(x)^\alpha q(x)^{1-\alpha}\,\mathrm{d}x$, built on the Rényi entropy $R_\alpha(p) = \frac{1}{1-\alpha}\ln\sum_j\left(p^{(j)}\right)^\alpha$. The Tsallis entropy $T_\alpha(p) = \frac{1}{1-\alpha}\left(\sum_j\left(p^{(j)}\right)^\alpha - 1\right)$ can also be obtained from the Rényi entropy (and vice-versa) via the mappings $T_\alpha = \frac{e^{(1-\alpha)R_\alpha} - 1}{1-\alpha}$ and $R_\alpha = \frac{1}{1-\alpha}\ln\left(1 + (1-\alpha)T_\alpha\right)$.


TABLE ICLOSED-FORM BHATTACHARYYA DISTANCES FOR SOME CLASSES OF EXPONENTIAL FAMILIES (EXPRESSED IN SOURCE PARAMETERS FOR EASE OF USE)

families $p_F(x; \theta) = \exp(\langle x, \theta\rangle - F(\theta))$. The Chernoff coefficient of members $p_F(x; \theta_1)$ and $p_F(x; \theta_2)$ of the same exponential family $E_F$ is then obtained as

$c_\alpha(p_F(x; \theta_1), p_F(x; \theta_2)) = \int p_F(x; \theta_1)^\alpha\, p_F(x; \theta_2)^{1-\alpha}\,\mathrm{d}x = \exp\left(F(\alpha\theta_1 + (1-\alpha)\theta_2) - \left(\alpha F(\theta_1) + (1-\alpha)F(\theta_2)\right)\right) = \exp\left(-BR_F^{(\alpha)}(\theta_1; \theta_2)\right)$

We get the following theorem for Bhattacharyya/Chernoff distances:

Theorem 3: The skew Bhattacharyya divergence $B_\alpha(p_F(x; \theta_1), p_F(x; \theta_2)) = -\ln c_\alpha(p_F(x; \theta_1), p_F(x; \theta_2))$ is equivalent to the skew Burbea-Rao divergence for members of the same exponential family:

$B_\alpha(p_F(x; \theta_1), p_F(x; \theta_2)) = BR_F^{(\alpha)}(\theta_1; \theta_2)$

In particular, the Kullback-Leibler divergence of those exponential family distributions amounts to compute a Bregman divergence [14] (by taking the limit of the rescaled skew divergence as $\alpha \to 0$ or $\alpha \to 1$).

Corollary 1: In the limit case $\alpha \to -1$, the $\alpha$-divergences amount to compute a Kullback-Leibler divergence, which is equivalent to compute a Bregman divergence for the log-normalizer on the swapped natural parameters: $\lim_{\alpha\to-1} D_\alpha(p_F(x; \theta_1) : p_F(x; \theta_2)) = KL(p_F(x; \theta_1) : p_F(x; \theta_2)) = B_F(\theta_2 : \theta_1)$.

Proof: The proof relies on the equivalence of Burbea-Rao divergences to Bregman divergences for extremal values of the skew factor. Setting $\alpha' = \frac{1-\alpha}{2}$ (so that $\alpha \to -1$ corresponds to $\alpha' \to 1$), we have

$\lim_{\alpha\to-1} D_\alpha(p_F(x; \theta_1) : p_F(x; \theta_2)) = \lim_{\alpha'\to 1}\frac{1}{\alpha'(1-\alpha')}\left(1 - c_{\alpha'}(p_F(x; \theta_1), p_F(x; \theta_2))\right)$  (65)

$= \lim_{\alpha'\to 1}\frac{1}{\alpha'(1-\alpha')}\left(1 - e^{-BR_F^{(\alpha')}(\theta_1; \theta_2)}\right)$  (66)

$= \lim_{\alpha'\to 1}\frac{1}{\alpha'(1-\alpha')} BR_F^{(\alpha')}(\theta_1; \theta_2)$  (67)

(since $BR_F^{(\alpha')}(\theta_1; \theta_2) \to 0$ and $1 - e^{-x} \sim x$ as $x \to 0$)

$= B_F(\theta_2 : \theta_1) = KL(p_F(x; \theta_1) : p_F(x; \theta_2))$  (68)

Similarly, we have $\lim_{\alpha\to 1} D_\alpha(p_F(x; \theta_1) : p_F(x; \theta_2)) = B_F(\theta_1 : \theta_2) = KL(p_F(x; \theta_2) : p_F(x; \theta_1))$.

Table I reports the Bhattacharyya distances for members ofthe same exponential families.
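As an example of Theorem 3, the sketch below (ours) evaluates the Bhattacharyya distance between two multivariate Gaussians both by the well-known closed-form formula for normal distributions [24] and as the Burbea-Rao divergence $BR_F^{(1/2)}$ on the natural parameters with the Gaussian log-normalizer of Section V-A; the two values coincide.

```python
import numpy as np

def bhattacharyya_gaussians(mu1, S1, mu2, S2):
    """Classical closed form for two multivariate normals [24]."""
    S = 0.5 * (S1 + S2)
    d = mu1 - mu2
    return (0.125 * d @ np.linalg.solve(S, d)
            + 0.5 * np.log(np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))

def F(theta_v, theta_M):          # Gaussian log-normalizer in natural coordinates
    dim = theta_v.shape[0]
    return (0.25 * theta_v @ np.linalg.solve(theta_M, theta_v)
            - 0.5 * np.log(np.linalg.det(theta_M)) + 0.5 * dim * np.log(np.pi))

def to_natural(mu, S):
    Sinv = np.linalg.inv(S)
    return Sinv @ mu, 0.5 * Sinv

def burbea_rao_half(t1, t2):      # BR_F^{(1/2)}(theta_1; theta_2), Theorem 3 with alpha = 1/2
    mid = (0.5 * (t1[0] + t2[0]), 0.5 * (t1[1] + t2[1]))
    return 0.5 * F(*t1) + 0.5 * F(*t2) - F(*mid)

mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 2.0]])
mu2, S2 = np.array([1.0, -1.0]), np.array([[1.5, -0.1], [-0.1, 0.5]])
print(bhattacharyya_gaussians(mu1, S1, mu2, S2),
      burbea_rao_half(to_natural(mu1, S1), to_natural(mu2, S2)))
```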

C. Direct Method for Calculating the Bhattacharyya Centroids of Multivariate Normals

To the best of our knowledge, the Bhattacharyya centroid has only been studied for univariate Gaussian or diagonal multivariate Gaussian distributions [42] in the context of speech recognition, where it is reported that it can be estimated using an iterative algorithm (no convergence guarantees are reported in [42]).

In order to compare this scheme on multivariate data with our generic Burbea-Rao scheme, we extend the approach of


Rigazio et al. [42] to multivariate Gaussians. Plugging the Bhattacharyya distance between Gaussians into the energy function of the optimization problem (OPT), we get

$c = \arg\min_{(\mu, \Sigma)} \sum_{i=1}^n w_i\, B\left(\mathcal{N}(\mu, \Sigma), \mathcal{N}(\mu_i, \Sigma_i)\right)$  (69)

with $B\left(\mathcal{N}(\mu, \Sigma), \mathcal{N}(\mu_i, \Sigma_i)\right) = \frac{1}{8}(\mu - \mu_i)^\top\bar\Sigma_i^{-1}(\mu - \mu_i) + \frac{1}{2}\ln\frac{|\bar\Sigma_i|}{\sqrt{|\Sigma||\Sigma_i|}}$ and $\bar\Sigma_i = \frac{\Sigma + \Sigma_i}{2}$. This is equivalent to minimizing the following energy (dropping the terms independent of $(\mu, \Sigma)$):

$L(\mu, \Sigma) = \sum_{i=1}^n w_i\left(\frac{1}{8}(\mu - \mu_i)^\top\bar\Sigma_i^{-1}(\mu - \mu_i) + \frac{1}{2}\ln|\bar\Sigma_i| - \frac{1}{4}\ln|\Sigma|\right)$  (70)

In order to minimize $L$, let us first differentiate with respect to $\mu$. Let $\Delta_i$ denote $\mu - \mu_i$. Using matrix differentials [43, p. 10, Eq. (73)], we get

$\frac{\partial L}{\partial \mu} = \sum_{i=1}^n w_i \frac{1}{4}\bar\Sigma_i^{-1}(\mu - \mu_i)$  (71)

Then one can estimate $\mu$ iteratively, since $\bar\Sigma_i$ depends on $\Sigma$ which is unknown. We update $\mu$ as follows:

$\mu \leftarrow \left(\sum_{i=1}^n w_i\bar\Sigma_i^{-1}\right)^{-1}\sum_{i=1}^n w_i\bar\Sigma_i^{-1}\mu_i$  (72)

Now let us estimate $\Sigma$. We use matrix differentials [43] ((55) p. 9 for the first term, and (51) p. 8 for the two others):

$\frac{\partial L}{\partial \Sigma} = \sum_{i=1}^n w_i\left(-\frac{1}{16}\bar\Sigma_i^{-1}\Delta_i\Delta_i^\top\bar\Sigma_i^{-1} + \frac{1}{4}\bar\Sigma_i^{-1} - \frac{1}{4}\Sigma^{-1}\right)$  (73)

Taking into account the fact that $\Sigma$ is symmetric, differential calculus on symmetric matrices can be simply handled by symmetrizing the matrix gradient

$\frac{\partial L}{\partial \Sigma} + \left(\frac{\partial L}{\partial \Sigma}\right)^\top - \mathrm{diag}\left(\frac{\partial L}{\partial \Sigma}\right)$  (74)

which leaves the stationarity condition unchanged here since each term of (73) is already symmetric. Thus, if one notes

$M_i = \bar\Sigma_i^{-1} - \frac{1}{4}\bar\Sigma_i^{-1}\Delta_i\Delta_i^\top\bar\Sigma_i^{-1}$  (75)

and recalling that $\Sigma$ is symmetric, one has to solve

$\sum_{i=1}^n w_i M_i = \left(\sum_{i=1}^n w_i\right)\Sigma^{-1}$  (76)

Let

$S = \frac{\sum_{i=1}^n w_i M_i}{\sum_{i=1}^n w_i}$  (77)

Then one can estimate $\Sigma$ iteratively as follows:

$\Sigma \leftarrow S^{-1}$  (78)
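A minimal Python sketch of this tailored scheme (our rendering of updates (72) and (78) as reconstructed above, not the authors' code) alternates the two fixed-point updates until convergence:

```python
import numpy as np

def bhattacharyya_centroid_gaussians(mus, Sigmas, ws, iters=100, tol=1e-10):
    """Iterative Bhattacharyya centroid of Gaussians via updates (72) and (78)."""
    ws = np.asarray(ws, dtype=float)
    ws = ws / ws.sum()
    mu = np.average(mus, axis=0, weights=ws)        # initialize with arithmetic averages
    Sigma = np.average(Sigmas, axis=0, weights=ws)
    for _ in range(iters):
        Sb_inv = [np.linalg.inv(0.5 * (Sigma + Si)) for Si in Sigmas]   # \bar\Sigma_i^{-1}
        A = sum(w * P for w, P in zip(ws, Sb_inv))
        b = sum(w * P @ mi for w, P, mi in zip(ws, Sb_inv, mus))
        mu_new = np.linalg.solve(A, b)                                  # update (72)
        S = np.zeros_like(Sigma)
        for w, P, mi in zip(ws, Sb_inv, mus):
            d = mu_new - mi
            S += w * (P - 0.25 * P @ np.outer(d, d) @ P)                # M_i of (75)
        Sigma_new = np.linalg.inv(S)                                    # update (78)
        if (np.max(np.abs(mu_new - mu)) < tol and
                np.max(np.abs(Sigma_new - Sigma)) < tol):
            return mu_new, Sigma_new
        mu, Sigma = mu_new, Sigma_new
    return mu, Sigma

mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]]), np.array([[1.0, -0.2], [-0.2, 0.5]])]
print(bhattacharyya_centroid_gaussians(mus, Sigmas, [1 / 3, 1 / 3, 1 / 3]))
```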

Let us now compare the generic Burbea-Rao scheme and the tailored Gaussian scheme for computing the Bhattacharyya centroids of multivariate Gaussians.

D. Applications to Mixture Simplification in Statistics

Simplifying Gaussian mixtures is important in many applications arising in signal processing [26]. Mixture simplification is also a crucial step when one wants to study the Riemannian geometry induced by the Rao distance with respect to the Fisher metric: The set of mixture models needs to have the same number of components, so that we simplify source mixtures to get a set of Gaussian mixtures with prescribed size. We adapt the hierarchical clustering algorithm of Garcia et al. [26] by replacing the symmetrized Bregman centroid (namely, the Jeffreys-Bregman centroid) by the Bhattacharyya centroid. We consider the task of color image segmentation by learning a Gaussian mixture model for each image. Each image is represented as a set of points (color and position, i.e., 5D points).

The first experimental results depicted in Fig. 3 demonstrate the qualitative stability of the clustering performance. In particular, the hierarchical clustering with respect to the Bhattacharyya distance performs qualitatively much better on the last colormap image.12

The second experiment focuses on characterizing the numerical convergence of the generic Burbea-Rao method compared to the tailored Gaussian method. Since we presented two novel different schemes to compute the Bhattacharyya centroids of multivariate Gaussians, one wants to compare them, both in terms of stability and accuracy. Whenever the Bhattacharyya distance energy functions of the two estimated centroids differ by more than 1%, we consider that one of the two estimation methods is beaten (namely, the method that gives the highest Bhattacharyya distance energy). Among the 760 centroids computed to generate Fig. 3, 100% were correct with the Burbea-Rao approach, while only 87% were correct with the tailored multivariate Gaussian matrix optimization method. The average number of iterations to reach the 1% accuracy is 4.1 for the Burbea-Rao estimation algorithm, and 5.2 for the alternative method.

Thus, we experimentally checked that the generic CCCP iterative Burbea-Rao algorithm described for computing the Bhattacharyya centroids always converges, and moreover beats another ad-hoc iterative method tailored for multivariate Gaussians.

VI. CONCLUDING REMARKS

In this paper, we have shown that the Bhattacharyya distance for distributions of the same statistical exponential family can be computed equivalently as a Burbea-Rao divergence on the corresponding natural parameters. Those results extend to skew Chernoff coefficients (and Amari $\alpha$-divergences) and skew

12See reference images and segmentation using Bregman centroids at http://www.informationgeometry.org/MEF/


Fig. 3. Color image segmentation results: (a) source images, (b) and (c) segmentations with two different numbers of 5D Gaussians.

Bhattacharyya distances using the notion of skew Burbea-Rao divergences. We proved that (skew) Burbea-Rao centroids can be efficiently estimated using an iterative concave-convex procedure with guaranteed convergence. We have shown that extremally skewed Burbea-Rao divergences amount asymptotically to evaluate Bregman divergences. This work emphasizes the attractiveness of exponential families in Statistics. Indeed, it turns out that for many statistical distances, one can evaluate them in closed-form. For sake of brevity, we have not mentioned the recent beta- and gamma-divergences [44], although their distances on exponential families are again available in closed-form.

The differential Riemannian geometry induced by the class of such Jensen difference measures was studied by Burbea and Rao [18], [19], who built quadratic differential metrics on probability spaces using Jensen differences. The Jensen-Shannon divergence is also an instance of a broad class of divergences called the $f$-divergences. An $f$-divergence is a statistical measure of dissimilarity defined by the functional $I_f(p : q) = \int q(x) f\left(\frac{p(x)}{q(x)}\right)\mathrm{d}x$. It turns out that the Jensen-Shannon divergence is an $f$-divergence for the generator

$f(u) = \frac{u}{2}\log u - \frac{1+u}{2}\log\frac{1+u}{2}$  (79)

$f$-divergences preserve the information monotonicity [44], and their differential geometry was studied by Vos [45]. However, this Jensen-Shannon divergence is a very particular case of Burbea-Rao divergences, since the squared Euclidean distance (another Burbea-Rao divergence) does not belong to the class of $f$-divergences.

VII. SOURCE CODE

The generic Burbea-Rao barycenter estimation algorithmshall be released in the jMEF open source library:

http://www.informationgeometry.org/MEF/An applet visualizing the skew Burbea-Rao centroids ranging

from the right-sided to left-sided Bregman centroids is availableat:

http://www.informationgeometry.org/BurbeaRao/

ACKNOWLEDGMENT

We are very grateful to the reviewers for their thorough andthoughtful comments and suggestions. In particular, we arethankful to the anonymous Referee that pointed out a rigorousproof of Lemma 1.

REFERENCES

[1] A. N. Kolmogorov, “Sur la notion de la moyenne,” Accad. Naz. LinceiMem. Cl. Sci. Fis. Mat. Natur. Sez., vol. 12, pp. 388–391, 1933.

[2] M. Nagumo, “Über eine klasse der mittelwerte,” Jap. J. Mathemat.,vol. 7, pp. 71–79, 1930.

[3] J. D. Aczél, “On mean values,” Bull. Amer. Mathemat. Soc., vol. 54,no. 4, pp. 392–400, 1948.

[4] A. Ben-Tal, A. Charnes, and M. Teboulle, “Entropic means,” J. Math-emat. Anal. and Applic., vol. 139, no. 2, pp. 537–551, 1989.

[5] S. M. Ali and S. D. Silvey, “A general class of coefficients of diver-gence of one distribution from another,” J. Roy. Statist. Soc., Ser. B,vol. 28, pp. 131–142, 1966.

[6] I. Csiszár, “Information-type measures of difference of probability dis-tributions and indirect observation,” Studia Scient. Mathemat. Hun-garica, vol. 2, pp. 229–318, 1967.

[7] F. Nielsen and R. Nock, “Sided and symmetrized Bregman centroids,”IEEE Trans. Inf. Theory, vol. 55, no. 6, pp. 2048–2059, Jun. 2009.


[8] Y. Censor and S. A. Zenios, Parallel Optimization: Theory, Algorithms,and Applications. Oxford, U.K.: Oxford University Press, 1997.

[9] M. J. Wainwright and M. I. Jordan, “Graphical models, exponentialfamilies, and variational inference,” Found. Trends in Mach. Learn.,vol. 1, pp. 1–305, Jan. 2008.

[10] S.-I. Amari, "α-divergence is unique, belonging to both f-divergence and Bregman divergence classes," IEEE Trans. Inf. Theory, vol. 55, no. 11, pp. 4925–4931, Nov. 2009.

[11] S.-I. Amari, "Integration of stochastic models by minimizing α-divergence," Neural Comput., vol. 19, no. 10, pp. 2780–2796, 2007.

[12] S. Amari and H. Nagaoka, Methods of Information Geometry, A. M.Society, Ed. Oxford, U.K.: Oxford University Press, 2000.

[13] F. Nielsen and R. Nock, “The dual voronoi diagrams with respect torepresentational bregman divergences,” in Proc. Int. Symp. Voronoi Di-agrams (ISVD), Copenhagen, Denmark, Jun. 2009.

[14] J.-D. Boissonnat, F. Nielsen, and R. Nock, “Bregman Voronoi dia-grams,” Discrete & Comput. Geom., 2010.

[15] I. Csiszár, “Generalized projections for non-negative functions,” ActaMathematica Hungarica, vol. 68, no. 1–2, pp. 161–185, 1995.

[16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering withBregman divergences,” J. Mach. Learn. Res., vol. 6, pp. 1705–1749,2005.

[17] M. Detyniecki, “Mathematical Aggregation Operators and Their Ap-plication to Video Querying,” Ph. D. dissertation, Université Pierre etMarie Curie, Paris, France, 2000.

[18] J. Burbea and C. R. Rao, “On the convexity of some divergence mea-sures based on entropy functions,” IEEE Trans. Inf. Theory, vol. IT-28,no. 3, pp. 489–495, 1982.

[19] J. Burbea and C. R. Rao, “On the convexity of higher order Jensendifferences based on entropy functions,” IEEE Trans. Inf. Theory, vol.IT-28, no. 6, p. 961, 1982.

[20] J. Lin, “Divergence measures based on the Shannon entropy,” IEEETrans. Inform. Theory, vol. 37, no. 1, pp. 145–151, Jan. 1991.

[21] M. Basseville and J.-F. Cardoso, “On entropies, divergences and meanvalues,” in Proc. IEEE Workshop on Information Theory, Whistler, BC,Canada, 1995, p. 330.

[22] L. M. Bregman, “The relaxation method of finding the common pointof convex sets and its application to the solution of problems in convexprogramming,” USSR Comput. Mathemat. and Mathemat. Phys., vol.7, pp. 200–217, 1967.

[23] F. Nielsen and V. Garcia, “Statistical exponential families: A digestwith flash cards,” ArXiv.org:0911.4863, 2009.

[24] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed.New York: Academic Press, 1990.

[25] L. Rigazio, B. Tsakam, and J. C. Junqua, “An optimal bhattacharyyacentroid algorithm for gaussian clustering with applications in auto-matic speech recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech,and Signal Processing, 2000, vol. 3, pp. 1599–1602 [Online]. Avail-able: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=861998

[26] V. Garcia, F. Nielsen, and R. Nock, “Hierarchical gaussian mixturemodel,” in Proc. Int. Conf. Acoustics, Speech and Signal Processing(ICASSP), Mar. 2010.

[27] P. Chen, Y. Chen, and M. Rao, "Metrics defined by Bregman divergences: Part I," Commun. Math. Sci., vol. 6, pp. 915–926, 2008.

[28] P. Chen, Y. Chen, and M. Rao, “Metrics defined by Bregman diver-gences: Part II,” Commun. Math. Sci., vol. 6, pp. 927–948, 2008.

[29] H. Jeffreys, “An invariant form for the prior probability in estimationproblems,” Proc. Roy. Soc. London, vol. 186, no. 1007, pp. 453–461,Mar. 1946.

[30] F. Liese and I. Vajda, “On divergences and informations in statisticsand information theory,” IEEE Trans. Inf. Theory, vol. 52, no. 10, pp.4394–4412, Oct. 2006.

[31] J. Zhang, “Divergence function, duality, and convex analysis,” Neur.Comput., vol. 16, no. 1, pp. 159–195, 2004.

[32] A. Yuille and A. Rangarajan, “The concave-convex procedure,” Neur.Comput., vol. 15, no. 4, pp. 915–936, 2003.

[33] B. Sriperumbudur and G. Lanckriet, “On the convergence of the con-cave-convex procedure,” Neur. Inf. Process. Syst., 2009.

[34] Y. He, A. B. Hamza, and H. Krim, “An information divergencemeasure for ISAR image registration,” in Autom. Target Recognit. XI(SPIE), 2001, vol. 4379, pp. 199–208.

[35] A. O. Hero, B. Ma, O. Michel, and J. D. Gorman, Alpha-divergencefor classification, indexing and retrieval Comm. and Sig. Proc. Lab.(CSPL), Dept. EECS, University of Michigan, Ann Arbor, MI, 2001,Tech. Rep. 328.

[36] A. Bhattacharyya, “On a measure of divergence between two statisticalpopulations defined by their probability distributions,” Bull. CalcuttaMathemat. Soc., vol. 35, pp. 99–110, 1943.

[37] K. Matusita, “Decision rules based on the distance, for problems of fit,two samples, and estimation,” Ann. Mathemat. and Statist., vol. 26, pp.631–640, 1955.

[38] T. Kailath, “The divergence and Bhattacharyya distance measures insignal selection,” IEEE Trans. Commun. Technol., vol. CT-15, no. 1,pp. 52–60, Jan. 1967.

[39] F. Aherne, N. Thacker, and P. Rockett, “The Bhattacharyya metric asan absolute similarity measure for frequency coded data,” Kybernetika,vol. 34, no. 4, pp. 363–368, 1998.

[40] E. D. Hellinger, “Die Orthogonalinvarianten Quadratischer Formenvon Unendlich Vielen Variablen,” , Unov. Göttingen, Göttingen,Germany, 1907.

[41] S. Kakutani, "On equivalence of infinite product measures," Ann. Mathemat., vol. 49, pp. 214–224, 1948.

[42] L. Rigazio, B. Tsakam, and J. Junqua, “Optimal Bhattacharyya cen-troid algorithm for Gaussian clustering with applications in automaticspeech recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech, andSignal Processing, 2000, vol. 3, pp. 1599–1602.

[43] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook TechnicalUniv. Denmark, oct. 2008 [Online]. Available: http://www2.imm.dtu.dk/pubdb/p.php?3274

[44] A. Cichocki and S.-I. Amari, "Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities," Entropy, 2010.

[45] P. Vos, "Geometry of f-divergence," Ann. Inst. Statist. Mathemat., vol. 43, no. 3, pp. 515–537, 1991.

Frank Nielsen (SM’10) received the B.Sc. (1992) and M.Sc. (1994) degreesfrom Ecole Normale Superieure (ENS), Lyon, France. He prepared the Ph.D.on adaptive computational geometry at INRIA Sophia-Antipolis (France) anddefended it in 1996.

As a civil servant of the University of Nice (France), he gave lectures at theengineering schools ESSI and ISIA (Ecole des Mines). In 1997, he served inthe army as a scientific member in the computer science laboratory of EcolePolytechnique. In 1998, he joined Sony Computer Science Laboratories Inc.,Tokyo, Japan, where he is a senior researcher. He became a professor of theCS Dept. of Ecole Polytechnique in 2008. His current research interests includegeometry, vision, graphics, learning, and optimization. He is a senior ACM andsenior IEEE member.

Sylvain Boltz received the M.S. and Ph.D. degrees in computer vision from theUniversity of Nice-Sophia, Antipolis, France, in 2004 and 2008, respectively.

Since then, he has been a postdoctoral fellow at the VisionLab, Universityof California, Los Angeles, and a LIX-Qualcomm postdoctoral fellow in EcolePolytechnique, France. His research spans computer vision and image, videoprocessing with a particular interest in applications of information theory andcompressed sensing to these areas.