Institute of Mathematical Statistics
LECTURE NOTES-MONOGRAPH SERIES
Shanti S. Gupta, Series Editor
Volume 10
Differential Geometry
in
Statistical Inference
S.-I. Amari, O. E. Barndorff-Nielsen,
R. E. Kass, S. L. Lauritzen, and C. R. Rao
Institute of Mathematical Statistics
Hayward, California
Institute of Mathematical Statistics
Lecture Notes-Monograph Series
Series Editor, Shanti S. Gupta, Purdue University
The production of the IMS Lecture Notes-Monograph Series is
managed by the IMS Business Office: Nicholas P. Jewell, IMS
Treasurer, and Jose L. Gonzalez, IMS Business Manager.
Library of Congress Catalog Card Number: 87-82603
International Standard Book Number 0-940600-12-9
Copyright © 1987 Institute of Mathematical Statistics
All rights reserved
Printed in the United States of America
TABLE OF CONTENTS
CHAPTER 1. Introduction
Robert E. Kass . 1
CHAPTER 2. Differential Geometrical Theory of Statistics
Shun-ichi Amari . 19
CHAPTER 3. Differential and Integral Geometry in Statistical Inference
O. E. Barndorff-Nielsen . 95
CHAPTER 4. Statistical Manifolds
Steffen L. Lauritzen . 163
CHAPTER 5. Differential Metrics in Probability Spaces
C. R. Rao . 217
CHAPTER 1. INTRODUCTION
Robert E. Kass*
Geometrical analyses of parametric inference problems have developed
from two appealing ideas: that a local measure of distance between members of a
family of distributions could be based on Fisher information, and that the
special place of exponential families in statistical theory could be understood
as being intimately connected with their loglinear structure. The first led
Jeffreys (1946) and Rao (1945) to introduce a Riemannian metric defined by
Fisher information, while the second led Efron (1975) to quantify departures
from exponentiality by defining the curvature of a statistical model. The
papers collected in this volume summarize subsequent research carried out by
Professors Amari, Barndorff-Nielsen, Lauritzen, and Rao together with their
coworkers, and by other authors as well, which has substantially extended both
the applicability of differential geometry and our understanding of the role it
plays in statistical theory.**
The most basic success of the geometrical method remains its concise
summary of information loss, Fisher's fundamental quantification of departure
from sufficiency, and information recovery, his justification for conditioning.
Fisher claimed, but never showed, that the MLE minimized the loss of information
among efficient estimators, and that successive portions of the loss could be
* Department of Statistics, Carnegie-Mellon University, Pittsburgh, PA.
** These papers were presented at the NATO Advanced Workshop on Differential
Geometry in Statistical Inference at Imperial College, April, 1984.
recovered by conditioning on the second and higher derivatives of the log-
likelihood function, evaluated at the MLE. Concerning information loss, recall
that according to the Koopman-Darmois theorem, under regularity conditions, the
families of continuous distributions with fixed support that admit finite-
dimensional sufficient reductions of i.i.d. sequences are precisely the exponen-
tial families. It is thus intuitive that (for such regular families) departures
from sufficiency, that is, information loss, should correspond to deviations
from exponentiality. The remarkable reality is that the correspondence takes a
beautifully simple form. The most transparent case, especially for the untrain-
ed eye, occurs for a one-parameter subfamily of a two-dimensional exponential
family. There, the relative information loss, in Fisher's sense, from using a
statistic T in place of the whole sample is

$$\lim_{n\to\infty}\; \frac{1}{i(\theta)}\,\bigl[\, n\, i(\theta) - i_T(\theta) \,\bigr] \;=\; \gamma^2 \;+\; \tfrac{1}{2}\,\beta^2 \qquad (1)$$

where n i(θ) is the Fisher information in the whole sample, i_T(θ) is the Fisher
information calculated from the distribution of T, γ is the statistical curva-
ture of the family and β is the mixture curvature of the "ancillary family"
associated with the estimator T. When the estimator T is the MLE, β vanishes;
this substantiates Fisher's first claim.
In his 1975 paper, Efron derived the two-term expression for infor-
mation loss (in his equation (10.25)), discussed the geometrical interpretation
of the first term, and noted that the second term is zero for the MLE. He
defined γ to be the curvature of the curve in the natural parameter space that
describes the subfamily, with the inner product defined by Fisher information
replacing the usual Euclidean inner product. The definition of β is exactly
analogous to that of γ, with the mean value parameter space used instead of the
natural parameter space, but Efron did not recognize this and so did not
identify the mixture curvature. He did stress the role of the ancillary family
associated with the estimator T (see his Remark 3 of Section 9 and his reply to
discussants, p. 1240), and he also noticed a special case of (1) (in his reply,
p. 1241). The final simplicity of the complete geometrical version of (1)
appeared in Amari's 1982 Annals paper. There it was derived in the multi-
parameter case; see equation (4.8) of Amari's paper in this volume.
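For reference, Efron's definition can be written out explicitly; the display
below is a sketch in notation of our own choosing (η(θ) for the curve in the
natural parameter space, Σ_θ for the Fisher information of the full family at
η(θ)), not a formula quoted from this volume:

$$\gamma_\theta^2 \;=\; \frac{\langle\dot\eta,\dot\eta\rangle\,\langle\ddot\eta,\ddot\eta\rangle \;-\; \langle\dot\eta,\ddot\eta\rangle^2}{\langle\dot\eta,\dot\eta\rangle^{3}}\,, \qquad \langle u,v\rangle \;=\; u^{\mathsf T}\,\Sigma_\theta\, v,$$

with dots denoting derivatives in θ; β is given by the same expression when the
subfamily is described in the mean value parametrization instead.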
Prior to Efron's paper, Rao (1961) had introduced definitions of
efficiency and second-order efficiency that were intended to classify estimators
just as Fisher's definitions did, but using more tractable expressions. This
led to the same measure of minimum information loss used by Fisher (correspond-
ing to γ² in equation (1)). Rao (1962) computed the information loss in the
case of the multinomial distribution for several different methods of estimation.
Rao (1963) then went on to provide a decision-theoretic definition of second-
order efficiency of an estimator T, measuring it according to the magnitude of
the second-order term in the asymptotic expansion of the bias-corrected version
of T. Efron's analysis clarified the relationship between Fisher's definition
and Rao's first definition. Efron then provided a decomposition of the second-
order variance term in which the right-hand side of (1) appeared, together with
a parameterization-dependent third term. The extension to the multiparameter
case was derived by Madsen (1979) following the outline of Reeds (1975). It
appears here in Amari's paper as Theorem 3.4.
An analytically and conceptually important first step of Efron's
analysis was to begin by considering smooth subfamilies of regular exponential
families, which he called curved exponential families. Analytically, this made
possible rigorous derivations of results, and for this reason such families
were analyzed concurrently by Ghosh and Subramaniam (1974). Conceptually, it
allowed specification of the ancillary families associated with an estimator:
the ancillary family associated with T at t is the set of points y in the sample
space of the full exponential family - equivalently, the mean value parameter
space for the family - for which T(y) = t. The terminology and subsequent
detailed analysis is due to Amari but, as noted above, the importance of the
ancillary family, at once emphasized and obscured by Fisher, was apparent from
Efron's presentation.
The ancillary family is also important in understanding information
recovery, which is the reason Amari has chosen to use the modifier "ancillary."
In the discussion of Efron's paper, Pierce (1975) noted another interpretation
of statistical curvature: it furnishes the asymptotic standard deviation of
observed information. More precisely, it is the asymptotic standard deviation
of the asymptotically ancillary statistic

$$n^{-1/2}\, i(\theta)^{-1}\,\bigl[\, I(\hat\theta) - n\, i(\theta) \,\bigr],$$

where n i(θ) is expected information and I(θ̂) is observed information; the one-
parameter statement appears in Efron and Hinkley (1978), and the multiparameter
version is in Skovgaard (1985). When fitting a curved exponential family by the
method of maximum likelihood, this statistic becomes a normalized component of
the residual (in the direction normal to the model within the plane spanned by
the first two derivatives of the natural parameter for the full exponential
family). Furthermore, conditioning on this statistic recovers (in Fisher's
sense) the information lost by the MLE, at least approximately. When this
conditional distribution is used, the asymptotic variance of the MLE may be
estimated by the inverse of observed rather than expected information; in some
problems observed information is clearly superior.
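The following simulation is an illustrative sketch of this argument, not code
from any of the papers; the Cauchy location model, sample size, and seed are
arbitrary choices. It forms the standardized difference between observed and
expected information and estimates its standard deviation, which should
approximate the statistical curvature γ of the model:

```python
# Illustrative sketch: for i.i.d. Cauchy location data, form Pierce's
# standardized difference between observed and expected information,
#     a_n = (I(theta_hat) - n*i(theta)) / (sqrt(n) * i(theta)),
# whose asymptotic standard deviation is the statistical curvature gamma.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, reps, theta0 = 50, 2000, 0.0
i_unit = 0.5  # Fisher information per Cauchy observation

def neg_loglik(theta, x):
    # Cauchy(theta, 1) negative loglikelihood, constants dropped
    return np.sum(np.log1p((x - theta) ** 2))

a_vals = []
for _ in range(reps):
    x = theta0 + rng.standard_cauchy(n)
    # a crude bounded search; a careful global maximization is skipped here
    theta_hat = minimize_scalar(neg_loglik, args=(x,),
                                bounds=(-10.0, 10.0), method="bounded").x
    r = x - theta_hat
    # observed information: minus the second derivative of the loglikelihood
    obs_info = np.sum(2.0 * (1.0 - r ** 2) / (1.0 + r ** 2) ** 2)
    a_vals.append((obs_info - n * i_unit) / (np.sqrt(n) * i_unit))

print("empirical sd of a_n (approximates gamma):", np.std(a_vals))
```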
This argument, sketched by Pierce and presented in more detail by
Efron and Hinkley, represented an attempt to make sense of some of Fisher's
remarks on conditioning. In Section 4 of his paper in this volume, Amari
presents a comprehensive approach to information recovery as measured by Fisher
information. He begins by defining a statistic T to be asymptotically suffi-
cient of order q when

$$n\, i(\theta) - i_T(\theta) \;=\; O(n^{-q+1})$$

and asymptotically ancillary of order q when

$$i_T(\theta) \;=\; O(n^{-q}) .$$

These definitions differ from some used by other authors, such as Cox (1980),
McCullagh (1984a), and Skovgaard (1985). They are, however, clearly in the
spirit of Fisher's apparent feeling that i_T(θ) is an appropriate measure of
information. To analyze Fisher's suggestion that higher derivatives of the
loglikelihood function could be used to create successive higher-order
approximate ancillary statistics, Amari defines an explicit sequence of
combinations of the derivatives: he takes successive components of the residual
in spaces spanned by the first p derivatives - of the natural parameter for the
ambient exponential family - but perpendicular to the space spanned by the first
p-1, then normalizes by higher-order curvatures. In Theorems 4.1 and 4.2
Amari achieves a complete decomposition of the information. He thereby makes
specific, justifies, and provides a geometrical interpretation for Fisher's
second claim. In Amari's decomposition the p-th term is attributable to the
p-th statistic in his sequence and has magnitude equal to n^{-p} times the
square of the p-th order curvature. (Actually, Amari's treatment is more
general than the rough description here would imply since he allows for the use
of efficient estimators other than the MLE.)
As far as the basic issue of observed versus expected information is
concerned, Amari (1982b) used an Edgeworth expansion involving geometrically
interpretable terms (as in Amari and Kumon, 1983) to provide a general motiva-
tion for using the inverse of observed information as the estimate of the
conditional variance of the MLE. See Section 4.4 of the paper here. (In truth,
the result is not as strong as it may appear. When we have an approximation v
to a variance V satisfying v(θ) = V(θ){1 + O(n^{-1})}, and we use it to estimate
V(θ), we substitute v(θ̂), where θ̂ is some estimator of θ, and then all we
usually get is v(θ̂) = V(θ){1 + O_p(n^{-1/2})}. For essentially this reason,
observed information does not in general provide an approximation to the con-
ditional variance of the MLE based on the underlying true value θ, having
relative error O_p(n^{-1}) - although it does do so whenever expected information is
constant, as it is for a location parameter. Similarly, as Skovgaard, 1985,
points out in his careful consideration of the role of observed information in
inference, when estimated cumulants are used in an Edgeworth expansion, the
expansion loses its higher-order accuracy in approximating the underlying
density at the true value.
This practical limitation of asymptotics does not affect Bayesian inference, in
which observed information furnishes a better approximation to the posterior
variance than does expected information for all regular families.)
For curved exponential families, then, the results summarized in the
first few sections of Amari's paper provide a thorough geometrical interpreta-
tion of the Fisherian concepts of information loss and recovery and also Rao's
concept of second-order efficiency. In addition, in section 3.4 Amari discusses
the geometry of testing, as had Efron, providing comparisons of several commonly-
used test procedures with the locally most powerful test. Curved exponential
families were introduced, however, for their mathematical and conceptual
simplicity rather than their applicability. To extend his one-parameter
results, Efron, in his 1975 paper, did two things: he noted that any smooth
family could be locally approximated by a curved exponential family, and he
provided an explicit formula for statistical curvature in the general case.
In Section 5 of his paper, Amari shows how results established for curved
exponential families may be extended by constructing an appropriate Hilbert
bundle, about which I will say a bit more below. With the Hilbert bundle,
Amari provides a geometrical foundation, and generalization, for Efron's sugges-
tion. From it, necessary formulas can be derived.
One reason that the role of the mixture curvature in (1) and in the
variance decomposition went unnoticed in Efron's paper was that he had not
made the underlying geometrical structure explicit: to calculate statistical
curvature at a given value θ₀ of a single parameter θ in a curved exponential
family, Efron used the natural parameter space with the inner product defined
by Fisher information at the natural parameter point corresponding to θ₀. In
order to calculate the curvature at a new point θ₁, another copy of the natural
parameter space with a different inner product (namely, that defined by Fisher
information at the natural parameter point corresponding to θ₁) would have to be
used. The appropriate gluing together of these spaces into a single structure
involves three basic elements: a manifold, a Riemannian metric, and an affine
connection. Riemannian geometry involves the study of geometry determined by
the metric and its uniquely associated Riemannian connection. In his discussion
to Efron's paper, Dawid (1975) pointed out that Efron had used the Riemannian
metric defined by Fisher information, but that he had effectively used a non-
Riemannian affine connection, now called the exponential connection, in cal-
culating statistical curvature. Although Dawid did not identify the role of the
mixture curvature in (1), he did draw attention to the mixture connection as an
alternative to the exponential connection. (Geodesics with respect to the
exponential connection form exponential families, while geodesics with respect
to the mixture connection form families of mixtures; thus, the terminology.)
Amari, who had much earlier researched the Riemannian geometry of Fisher infor-
mation, picked up on Dawid's observation, specified the framework, and provided
the results outlined above.
The manifold with the associated linear spaces is structured in what
is usually called a tangent bundle, the elements of the linear spaces being
tangent vectors. For curved exponential families, the linear spaces are finite-
dimensional, but to analyze general families this does not suffice so Amari
uses Hilbert spaces. When these are appropriately glued together, the result
is a Hilbert bundle. The idea stems from Dawid's remark that the tangent
vectors can be identified with score functions, and these in turn are functions
having zero expectation. As his Hilbert space at a distribution P, Amari takes
the subspace of the usual L2(P) Hilbert space consisting of functions that have
zero expectation with respect to P. This clearly furnishes the extension of
the information metric, and has been used by other authors as well, e.g.,
Beran (1977). Amari then defines the exponential and mixture connections and
notes that these make the Hilbert bundle flat, and that the inherited connec-
tions on the usual tangent bundles agree with those already defined there. He
then decomposes each Hilbert space into tangential and normal components,
which is exactly what is needed to define statistical curvature in the general
setting. Amari goes on to construct an "exponential bundle" by associating
with each distribution a finite-dimensional linear space containing vectors
defined by higher derivatives of the loglikelihood function, and using structure
inherited from the Hilbert bundle. With this he obtains a satisfactory version
of the local approximation by a curved exponential family that Efron had
suggested.
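In symbols, the fibre just described can be sketched as follows (notation
ours): at each distribution P, Amari's Hilbert space is

$$\mathcal{H}_P \;=\; \{\, f \in L^2(P) \;:\; E_P[f] = 0 \,\}, \qquad \langle f, g \rangle_P \;=\; E_P[f\, g],$$

and the ordinary tangent space of a parametric family sits inside this space
via ∂_i ↦ ∂_i ℓ(·, θ), under which ⟨∂_i ℓ, ∂_j ℓ⟩ = g_ij(θ) recovers the
information metric.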
This pretty construction allows results derived for curved exponen-
tial families to be extended to more general regular families, yet it is not
quite the all-encompassing structure one might hope for: the underlying
manifold is still a particular parametric family of densities rather than the
collection of all possible densities on the given sample space. Constructions
for the latter have so far proved too difficult.
In his Annals paper, Amari also noted an interesting relationship
between the exponential and mixture connections: they are, in a sense he
defined, mutually dual. Furthermore, a one-parameter family of connections,
which Amari called the α-connections, may be defined in such a way that for each
α the α-connection and the −α-connection are mutually dual, while α = 1 and −1
correspond to the exponential and mixture connections. See Amari's Theorem 2.1.
This family coincides with that introduced by Centsov (1971) for multinomial
distributions. When the family of densities on which these connections are
defined is an exponential family, the space is flat with respect to the exponen-
tial and mixture connections, and the natural parametrization and mean-value
parameterization play special roles: they become affine coordinate systems for
the two respective connections and are related by a Legendre transformation.
The duality in this case can incorporate the convex duality theory of exponen-
tial families (see Barndorff-Nielsen, 1978, and also Section 2 of his paper in
this volume). In Theorem 2.2 Amari points out that such a pair of coordinate
systems exists whenever a space is flat with respect to an α-connection (with
α ≠ 0). For such spaces, Amari defines α-divergence, a quasi-distance between
two members of the family based on the relationship provided by the Legendre
transformation. In Theorem 2.4 he shows that the element of a curved exponential
family that minimizes the α-divergence from a point in the exponential family
parameter space may be found by following the α-geodesic that contains the
given point and is perpendicular to the curved family. This generates a new
class of minimum α-divergence estimators, the MLE being the minimum
−1-divergence estimator, an interpretation also discussed by Efron (1978).
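For orientation, the α-divergences can be written explicitly; the following is
a sketch using the standard formulas (the orientation convention varies from
author to author):

$$D_\alpha(p, q) \;=\; \frac{4}{1-\alpha^2}\Bigl(1 - \int p^{(1-\alpha)/2}\, q^{(1+\alpha)/2}\, d\mu \Bigr), \qquad \alpha \neq \pm 1,$$

with the two Kullback–Leibler divergences ∫ p log(p/q) dμ and ∫ q log(q/p) dμ
arising as the limits α → −1 and α → 1. The MLE statement is then the familiar
fact that maximizing likelihood minimizes the Kullback–Leibler divergence from
the empirical distribution to the model.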
As applications of his general methods based on α-connections on
Hilbert bundles, Amari treats the problems of combining independent samples (at
the end of section 5), making inferences when the number of nuisance parameters
increases with the sample size (in section 6), and performing spectral estima-
tion in Gaussian time series (in section 7).
As soon as the α-connections are constructed, a mathematical question
arises. On one hand, the α-connections may be considered objects of differen-
tial geometry without special reference to their statistical origin. On the
other hand, they are not at all arbitrary. They are the simplest one-parameter
family of connections based on the first three moments of the score function.
What is it about their special form that leads to the many special properties
of α-connections (outlined by Amari in Section 2)?
Lauritzen has posed this question and has provided a substantial
part of the answer. Given any Riemannian manifold M with metric g there is a
unique Riemannian connection ∇. Given a covariant 3-tensor D that is symmetric
in its first two arguments and a nonzero number c, a new (symmetric) connection
is defined by

$$\tilde\nabla \;=\; \nabla + c \cdot D \qquad (2)$$

which means that, given vector fields X and Y,

$$\tilde\nabla_X Y \;=\; \nabla_X Y + c \cdot D(X, Y)$$

where

$$g(D(X,Y), Z) \;\equiv\; D(X, Y, Z)$$

for all vector fields Z. Now, when M is a family of densities and g and D are
defined, in terms of an arbitrary parameterization, as

$$g(\partial_i, \partial_j) \;=\; E(\partial_i \ell\; \partial_j \ell), \qquad D(\partial_i, \partial_j, \partial_k) \;=\; E(\partial_i \ell\; \partial_j \ell\; \partial_k \ell),$$
where ℓ is the loglikelihood function, and if c = −α/2, then (2) defines the
α-connection.
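As a concrete check on these definitions, the following sketch (our own
example, not from the text) computes g and D symbolically for the
one-parameter exponential model p(x, θ) = θ e^{−θx}, x > 0:

```python
# Symbolic check of the pair (g, D) for the exponential model
# p(x, theta) = theta * exp(-theta * x), x > 0 (an illustrative example).
import sympy as sp

x = sp.Symbol("x", positive=True)
theta = sp.Symbol("theta", positive=True)

p = theta * sp.exp(-theta * x)
l = sp.log(p)                 # loglikelihood
dl = sp.diff(l, theta)        # score: 1/theta - x

E = lambda f: sp.integrate(f * p, (x, 0, sp.oo))   # expectation under p

g = sp.simplify(E(dl ** 2))   # metric: 1/theta**2
D = sp.simplify(E(dl ** 3))   # skewness tensor: -2/theta**3
print(g, D)
```

With c = −α/2, equation (2) then produces every α-connection of this
one-dimensional manifold from the pair (g, D).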
In this statistical case, D is not only symmetric in its first two
arguments, as it must be in (2), it is symmetric in all three. Lauritzen
therefore defines an abstract statistical manifold to be a triple (M,g,D) in
which M is a smooth m-dimensional manifold, g is a Riemannian metric, and D is
a completely symmetric covariant 3-tensor. With this additional symmetry
constraint alone, he then proceeds to establish a large number of basic proper-
ties, especially those relating to the duality structure Amari described. The
treatment is "fully geometrical" or "coordinate-free." This is aesthetically
appealing, especially to those who learned linear models in the coordinate-free
setting. Lauritzen's primary purpose is to show that the appropriate mathemat-
ical object of study is one that is not part of the standard differential
geometry, but does have many special features arising from an apparently simple
structure. He not only presents the abstract generalities about α-connections
on statistical manifolds, he also examines five examples in full detail. The
first is the univariate Gaussian model, the second is the inverse Gaussian
model, the third is the two-parameter gamma model, and the last two are
specially constructed models that display interesting possibilities of the non-
standard geometries of α-connections. In particular, the latter two statistical
manifolds are not what Lauritzen calls "conjugate symmetric" and so the
sectional curvatures do not determine the Riemann tensor (as they do in
Riemannian geometry). He also discusses the construction of geodesic folia-
tions, which are decompositions of the manifold and are important because they
generate potentially interesting decompositions of the sample space. At the
end of his paper, Lauritzen calls attention to several outstanding problems.
Amari's α-connections, based on the first three moments of the
score function, do not furnish the only examples of statistical manifolds. In
his contribution to this volume, Barndorff-Nielsen presents another class of
examples based instead on certain "observed" rather than expected derivatives
of the loglikelihood.
Although the idea of using observed derivatives might occur to
any casual listener on being told of Amari's use of expectations, it is not
obvious how to implement it. First of all, in order to define an observed
information Riemannian metric, one needs a definition of observed information
at each point of the parameter space. Apparently one would want to treat each
θ as if it were an MLE and then use I(θ). However, I(θ) depends on the whole
sample y rather than on θ alone, so this scheme does not yet provide an explicit
definition. Barndorff-Nielsen's solution is natural in the context of his
research on conditionality: he replaces the sample y with a sufficient pair
(θ̂, a) where a is the observed value of an asymptotically ancillary statistic A.
This is always possible for curved exponential families, and in more general
models A could at least be taken so that (θ̂, A) is asymptotically sufficient.
With this replacement, the second component may be held fixed at A = a while θ̂
varies. Writing I(θ̂) = I_{(θ̂,a)}(θ̂) thus allows the definition I(θ) ≡ I_{(θ,a)}(θ)
to be made at each point θ in the parameter space. Using this definition of
the Riemannian metric, Barndorff-Nielsen derives the coefficients that deter-
mine the Riemannian connection. From the transformation properties of tensors,
he then succeeds in finding an analogue of the exponential connection based on
a certain mixed third derivative of the loglikelihood function (two derivatives
being taken with respect to θ as parameter, one with respect to θ̂ as MLE). In
so doing, he defines the tensor D in the statistical manifold and thus arrives
at his "observed conditional geometry."
Barndorff-Nielsen's interest in this geometry lies not with
analogues of statistical curvature and other expected-geometry constructs, but
rather with an alternative derivation, interpretation, and extension of an
approximation to the conditional density of the MLE, which had been obtained
earlier (in Barndorff-Nielsen and Cox, 1979). In several papers, Barndorff-
Nielsen (1980, 1983) has discussed generalizations and approximate versions of
Fisher's fundamental density-likelihood formula for location models
$$p(\hat\theta \mid a, \theta) \;=\; c \cdot L(\theta)/L(\hat\theta) \qquad (3)$$

where θ̂ is the MLE, a is an ancillary statistic, p is the conditional density
of the MLE, and L is the likelihood function. (This is discussed in Efron and
Hinkley, 1978; Fisher actually treated the location-scale case.) The formula
is of great importance both practically, since it provides a way of computing
the conditional density, and philosophically, since it entails the formal
agreement of conditional inference and Bayesian inference using an invariant
prior. Inspection of the derivation indicates that the formula results from
the transformational nature of the location problem, and Barndorff-Nielsen has
shown that a version of it (with an additional factor for the volume element)
holds for very general transformation models. He has also shown that for non-
transformation models, a version of the right-hand side of (3) while not
exactly equal to the left-hand side, remains a good asymptotic approximation for
it. (See also Hinkley, 1980, and McCullagh, 1984a.) In his paper in this
volume, Barndorff-Nielsen reviews these results, shows how the various observed
conditional geometrical quantities are calculated, and then derives his desired
expansion (of a version of the right-hand side of (3)) in terms of the geo-
metrical quantities that correspond to those used by Amari in his expected
geometry expansions. Barndorff-Nielsen devotes substantial attention to trans-
formation models, which may be treated within his framework of observed
conditional geometry. In this context, the models become Lie groups, for which
there is a rich mathematical theory.
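A sketch of why (3) is exact in the pure location case f(x − θ) may help fix
ideas (a standard argument, condensed here, not taken verbatim from the paper):
the configuration a = (x₁ − θ̂, ..., x_n − θ̂) is ancillary, and the likelihood
depends on the data only through (θ̂, a),

$$L(\theta) \;=\; \prod_{i=1}^{n} f(a_i + \hat\theta - \theta), \qquad L(\hat\theta) \;=\; \prod_{i=1}^{n} f(a_i).$$

For fixed a the joint density of (θ̂, a) is therefore proportional, as a
function of θ̂, to L(θ); since the distribution of a is free of θ, normalizing
gives p(θ̂ | a, θ) = c(a) L(θ)/L(θ̂), which is (3).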
In the fourth paper in this volume, Professor Rao returns to the
characterization of the information metric that originally led him (and also
Jeffreys) to introduce it: it is an infinitesimal measure of divergence based
on what is now called Shannon entropy. Rao considers here a more general class
of divergence measures, which he has found useful in the study of genetic
diversity, leading to a wide variety of metrics. He derives the quadratic and
cubic terms in Taylor series expansions of these measures and shows how, in the
case of Shannon entropy, the cubic term is related to the α-connections.
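For the Shannon-entropy (Kullback–Leibler) case the expansion can be sketched
directly: using E_θ[∂_i ℓ] = 0 and g_ij = −E_θ[∂_i ∂_j ℓ],

$$D(\theta \,\|\, \theta + d\theta) \;=\; \tfrac{1}{2}\, g_{ij}\, d\theta^i d\theta^j \;-\; \tfrac{1}{6}\, E_\theta[\partial_i \partial_j \partial_k \ell]\, d\theta^i d\theta^j d\theta^k \;+\; O(\|d\theta\|^4),$$

so the quadratic term is the information metric, while the cubic coefficient is
expressible through ∂_k g_ij and the quantities entering the α-connections.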
The papers here collectively show that geometrical structures of
statistical models can provide both conceptual simplifications and new methods
of analysis for problems of statistical inference. There is interesting
mathematics involved, but does the interesting mathematics lead to interesting
statistics?
The question arises because geometry has provided new techniques,
and its formalism produces convenient summaries for complicated multivariate
expressions in asymptotic expansions (as in Amari and Kumon, 1983, and
McCullagh, 1984b), but it has not yet created new methodology with clearly
important practical applications. Thus, it is already apparent from (1) that
there exists a wide class of estimators that minimize information loss (and are
second-order efficient): it consists of those having zero mixture curvature
for their associated ancillary families. It is interesting that the MLE is only
one member of this class, and it is nice to have Eguchi's (1983) derivation that
certain minimum contrast estimators are other members, but it seems unlikely -
though admittedly possible - that any competitor will replace maximum likelihood
estimation as the primary method of choice in practice. Similarly, there is
not yet any reason to think that alternative minimum a-divergence estimators or
their observed conditional geometry counterparts will be considered superior to
the MLE.
On the other hand, as I indicated at the outset, geometry does
give a definitive description of information loss and recovery. Since Fisher
remains our wisest yet most enigmatic sage, it is worth our while to try to
understand his pronouncements.** Together with the triumvirate of consistency,
sufficiency, and efficiency, information loss and recovery form the core of
Fisher's theory of estimation. On the basis of the geometrical results, it is
fair to say that we now know what Fisher was talking about, and that what he
said was true. Here, as in other problems (such as inference with nuisance
parameters, discussed in Amari's section 5, or in nonlinear regression, e.g.,
Bates and Watts, 1980, Cook and Tsai, 1985, Kass, 1984, McCullagh and Cox, 1986),
the geometrical formulation tends to shift the burden of derivation of results
away from proofs, toward definitions. Thus, once the statement of a proposition
is understood, its truth is easier to see and in this there is great simplifica-
tion. One could make this argument about much abstract mathematical develop-
ment, but it is particularly appropriate here.

** Since Rao's work on second order efficiency arose in an attempt to understand
Fisher's computation of information loss in estimation, it might appear that
Efron's investigation also began as an attempt to understand Fisher. He has
informed me, however, that he set out to define the curvature of a statistical
model and came later to its use in information loss and second-order efficiency.
Furthermore, there are reasons to think that future work in this
area could lead to useful results that would otherwise be difficult to obtain.
One important problem that structural research might solve is that of determin-
ing useful conditions under which a particular root of the likelihood equation
will actually maximize the likelihood. Global results on foliations might be
very helpful, as might be formulas relating computable characteristics of
statistical manifolds to the behavior of geodesies. The results in these papers
could turn out to play a central role in the solution of this or some other
practical problem of statistical theory. We will have to wait and see. Until
then, readers may enjoy the papers as informative excursions into an intriguing
realm of mathematical statistics.
Acknowledgements
I thank O. E. Barndorff-Nielsen, D. R. Cox, and C. R. Rao for their
comments on an earlier draft. This paper was prepared with support from the
National Science Foundation under Grant No. NSF/DMS - 8503019.
REFERENCES
Amari, S. (1982a). Differential geometry of curved exponential families -
curvatures and information loss. Ann. Statist. 10, 357-387.
Amari, S. (1982b). Geometrical theory of asymptotic ancillarity and conditional
inference. Biometrika 69, 1-17.
Amari, S. and Kumon, M. (1983). Differential geometry of Edgeworth expansions
in curved exponential family. Ann. Inst. Statist. Math. 35A, 1-24.
Barndorff-Nielsen, O. E. (1978). Information and Exponential Families,
New York: Wiley.
Barndorff-Nielsen, O. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, O. E. (1983). On a formula for the distribution of the
maximum likelihood estimator. Biometrika 70, 343-365.
Barndorff-Nielsen, O. E. and Cox, D. R. (1979). Edgeworth and Saddlepoint
approximations with statistical applications, (with Discussion).
J. R. Statist. Soc. B41, 279-312.
Bates, D. M. and Watts, D. G. (1980). Relative curvature measures of non-
linearity. J. R. Statist. Soc. B42, 1-25.
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models.
Ann. Statist. 5, 445-463.
Centsov, N. N. (1971). Statistical Decision Rules and Optimal Inference (in
Russian). Translated into English (1982), AMS, Rhode Island.
Cook, R. D. and Tsai, C.-L. (1985). Residuals in nonlinear regression.
Biometrika 72, 23-29.
Cox, D. R. (1980). Local ancillarity. Biometrika 67, 279-286.
Dawid, A. P. (1975). Discussion to Efron's paper. Ann. Statist. 3, 1231-1234.
Efron, B. (1975). Defining the curvature of a statistical problem (with
applications to second-order efficiency), (with Discussion).
Ann. Statist. 3, 1189-1242.
Efron, B. (1978). The geometry of exponential families. Ann. Statist. 6,
362-376.
Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the maximum
likelihood estimator: Observed versus expected Fisher information,
(with discussion). Biometrika 65, 457-487.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in
a curved exponential family. Ann. Statist. 11, 793-803.
Fisher, R. A. (1925). Theory of statistical estimation. Proc. Camb. Phil. Soc.
22, 700-725.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc.
R. Soc. A144, 285-307.
Ghosh, J. K. and Subramaniam, K. (1974). Second order efficiency of maximum
likelihood estimators. Sankhya 36A, 325-358.
Hinkley, D. V. (1980). Likelihood as approximate pivotal distribution.
Biometrika 67, 287-292.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation
problems. Proc. Roy. Soc. A186, 453-461.
Kass, R. E. (1984). Canonical parametrizations and zero parameter-effects
curvature. J. Roy. Statist. Soc. B46, 1, 86-92.
Madsen, L. T. (1979). The geometry of statistical model - a generalization of
curvature. Res. Report 79-1. Statist. Res. Unit, Danish Medical
Res. Council.
McCullagh, P. (1984a). On local sufficiency. Biometrika 71, 233-244.
McCullagh, P. (1984b). Tensor notation and cumulants of polynomials.
Biometrika 71, 461-476.
McCullagh, P. and Cox, D. R. (1986). Invariants and likelihood ratio statistics.
Ann. Statist. 14, 1419-1430.
Pierce, D. A. (1975). Discussion to Efron's paper. Ann. Statist. 3, 1219-1221.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of
statistical parameters. Bull. Calcutta Math. Soc. 37, 81-89.
Rao, C. R. (1961). Asymptotic efficiency and limiting information. Proc.
Fourth Berkeley Symp. Math. Statist. Prob., Edited by J. Neyman,
1, 531-545.
Rao, C. R. (1962). Efficient estimates and optimum inference procedures in
large samples (with discussion). J. Roy. Statist. Soc. B24, 46-72.
Rao, C. R. (1963). Criteria of estimation in large samples. Sankhya 25, 189-
206.
Reeds, J. (1975). Discussion to Efron's paper. Ann. Statist. 3, 1234-1238.
Skovgaard, I. (1985). A second-order investigation of asymptotic ancillarity.
Ann. Statist. 13, 534-551.
DIFFERENTIAL GEOMETRICAL THEORY OF STATISTICS
Shun-ichi Amari*
1. Introduction. 21
2. Geometrical Structure of Statistical Models . 25
3. Higher-Order Asymptotic Theory of Statistical Inference in
Curved Exponential Family . 38
4. Information, Sufficiency and Ancillarity - Higher Order Theory . 52
5. Fibre-Bundle Theory of Statistical Models . 59
6. Estimation of Structural Parameter in the Presence of Infinitely
Many Nuisance Parameters . 73
7. Parametric Models of Stationary Gaussian Time Series . 83
8. References. 91
* Department of Mathematical Engineering and Instrumentation Physics, University
of Tokyo, Tokyo, JAPAN
1. INTRODUCTION
Statistics is a science which studies methods of inference, from
observed data, concerning the probabilistic structure underlying such data.
The class of all the possible probability distributions is usually too wide to
consider all its elements as candidates for the true probability distribution
from which the data were derived. Statisticians often assume a statistical
model which is a subset of the set of all the possible probability distribu-
tions, and evaluate procedures of statistical inference assuming that the model
is faithful, i.e., it includes the true distribution. It should, however, be
remarked that a model is not necessarily faithful but is approximately so. In
either case, it is very important to know the shape of a statistical
model in the whole set of probability distributions. This is the geometry of a
statistical model. A statistical model often forms a geometrical manifold, so
that the geometry of manifolds should play an important role. Considering that
properties of specific types of probability distributions, for example, of
Gaussian distributions, of Wiener processes, and so on, have so far been studied
in detail, it seems rather strange that only a few theories have been proposed
concerning properties of a family of distributions itself. Here, by the proper-
ties of a family we mean such geometric relations as mutual distances, flatness
or curvature of the family, etc. Obviously it is not a trivial task to define
such geometric structures in a natural, useful and invariant manner.
Only local properties of a statistical model are responsible for the
asymptotic theory of statistical inference. Local properties are represented
by the geometry of the tangent spaces of the manifold. The tangent space has a
natural Riemannian metric given by the Fisher information matrix in the regular
case. It represents only a local property of the model, because the tangent
space is nothing but local linearization of the model manifold. In order to
obtain larger-scale properties, one needs to define mutual relations of the two
different tangent spaces at two neighboring points in the model. This can be
done by defining a one-to-one affine correspondence between two tangent spaces,
which is called an affine connection in differential geometry. By an affine
connection, one can consider local properties around each point beyond the
linear approximation. The curvature of a model can be obtained by the use of
this connection. It is clear that such a differential-geometrical concept pro-
vides a tool convenient for studying higher-order asymptotic properties of
inference. However, by connecting local tangent spaces further, one can obtain
global relations. Hence, the validity of the differential-geometrical method is
not limited to the framework of asymptotic theory.
It was Rao (1945) who first pointed out the importance of the
differential-geometrical approach. He introduced the Riemannian metric by using
the Fisher information matrix. Although a number of studies have been
carried out along this Riemannian line (see, e.g., Amari (1968), Atkinson and
Mitchell (1981), Dawid (1977), James (1973), Kass (1980), Skovgaard (1984),
Yoshizawa (1971), etc.), they did not have a large impact on statistics. Some
additional concepts are necessary to improve its usefulness. A new idea was
developed by Chentsov (1972) in his Russian book (and in some papers prior to
the book). He introduced a family of affine connections and proved their unique-
ness from the point of view of categorical invariance. Although his theory was
deep and fundamental, he did not discuss the curvature of a statistical model.
Efron (1975, 1978), independently of Chentsov's work, provided a new idea by
pointing out that the statistical curvature plays an important role in higher-
order properties of statistical inference. Dawid (1975) pointed out further
possibilities. Efron's idea was generalized by Madsen (1979) (see also Reeds
(1975)). Amari (1980, 1982a) constructed a differential-geometrical method in
statistics by introducing a family of affine connections, which however turned
out to be equivalent to Chentsov's. He further defined α-curvatures, and point-
ed out the fundamental roles played by the exponential and mixture curvatures in
statistical inference. The theory has been developed further by a number of
papers (Amari (1982b, 1983a, b), Amari and Kumon (1983), Kumon and Amari (1983,
1984, 1985), Nagaoka and Amari (1982), Eguchi (1983), Kass (1984)). The new
developments were also shown in the NATO Research Workshop on Differential Geo-
metry in Statistical Inference (see Barndorff-Nielsen (1985) and Lauritzen
(1985)). They together seem to prove the usefulness of differential geometry as
a fundamental method in statistics. (See also Csiszár (1975), Burbea and Rao
(1982), Pfanzagl (1982), Beale (1960), Bates and Watts (1980), etc., for other
geometrical work.)
The present article gives not only a compact review of various
achievements up to now by the differential geometrical method, most of which have
already been published in various journals and in Amari (1985), but also a pre-
view of new results and half-baked ideas in new directions, most of which have
not yet been published. Chapter 2 provides an introduction to the geometrical
method, and elucidates fundamental geometrical properties of statistical mani-
folds. Chapter 3 is devoted to the higher-order asymptotic theory of statisti-
cal inference, summarizing higher-order characteristics of various estimators
and tests in geometrical terms. Chapter 4 discusses a higher-order theory of
asymptotic sufficiency and ancillarity from the Fisher information point of
view. Refer to Amari (1985) for more detailed explanations in these chapters;
Lauritzen (1985) gives a good introduction to modern differential geometry. The
remaining Chapters 5, 6, and 7 treat new ideas and developments which are just
under construction. Chapter 5 introduces a fibre bundle approach, which
is necessary in order to study properties of statistical inference in a general
statistical model other than a curved exponential family. A Hilbert bundle and
a jet bundle are treated in a geometrical framework of statistical inference.
Chapter 6 gives a summary of a theory of estimation of a structural parameter
in the presence of nuisance parameters whose number increases in proportion to
the number of observations. Here, the Hilbert bundle theory plays an essential
role. Chapter 7 elucidates geometrical structures of parametric and non-para-
metric models of stationary Gaussian time series. The present approach is use-
ful not only for constructing a higher-order theory of statistical inference on
time series models, but also for constructing differential geometrical theory of
systems and information theory (Amari, 1983 c). These three chapters are
original and only sketches are given in the present paper. More detailed theo-
retical treatments and their applications will appear as separate papers in the
near future.
2. GEOMETRICAL STRUCTURE OF STATISTICAL MODELS
2.1 Metric and α-connection
Let S = {p(x,θ)} be a statistical model consisting of probability
density functions p(x,θ) of random variable x ∈ X with respect to a measure P on
X such that every distribution is uniquely parametrized by an n-dimensional
vector parameter θ = (θ^i) = (θ^1, ..., θ^n). Since the set {p(x)} of all the den-
sity functions on X is a subset of the L₁ space of functions in x, S is consid-
ered to be a subset of the L₁ space. A statistical model S is said to be geo-
metrically regular when it satisfies the following regularity conditions A₁ to
A₆, and S is regarded as an n-dimensional manifold with a coordinate system θ.

A₁. The domain Θ of the parameter θ is homeomorphic to an n-dimen-
sional Euclidean space R^n.

A₂. The topology of S induced from R^n is compatible with the
relative topology of S in the L₁ space.

A₃. The support of p(x,θ) is common for all θ ∈ Θ, so that the p(x,θ)
are mutually absolutely continuous.

A₄. Every density function p(x,θ) is a smooth function in θ
uniformly in x, and the partial derivative ∂/∂θ^i and integration of log p(x,θ)
with respect to the measure P(x) are always commutative.

A₅. The moments of the score function (∂/∂θ^i) log p(x,θ) exist up to
the third order and are smooth in θ.

A₆. The Fisher information matrix is positive definite.

Condition A₁ implies that S itself is homeomorphic to R^n. It is
[Figure 1]
possible to weaken Condition A₁. However, only local properties are treated
here so that we assume it for the sake of simplicity. In a later section, we
assume one more condition which guarantees the validity of Edgeworth expansions.
Let us denote by ∂_i = ∂/∂θ^i the tangent vector e_i of the i-th
coordinate curve θ^i (Fig. 1) at point θ. Then, n such tangent vectors e_i = ∂_i,
i = 1, ..., n, span the tangent space T_θ at point θ of the manifold S. Any tan-
gent vector A ∈ T_θ is a linear combination of the basis vectors ∂_i,

$$A = A^i \partial_i,$$

where the A^i are the components of vector A and Einstein's summation convention
is assumed throughout the paper, so that the summation Σ is automatically taken
for those indices which appear twice in one term, once as a subscript and once as
a superscript. The tangent space T_θ is a linearized version of a small neigh-
borhood at θ of S, and an infinitesimal vector dθ = dθ^i ∂_i denotes the vector
connecting two neighboring points θ and θ + dθ, or two neighboring distributions
p(x,θ) and p(x, θ + dθ).

Let us introduce a metric in the tangent space T_θ. It can be done
by defining the inner product g_ij(θ) = <∂_i, ∂_j> of two basis vectors ∂_i and ∂_j
at θ. To this end, we represent a vector ∂_i ∈ T_θ by a function ∂_i ℓ(x,θ) in x,
where ℓ(x,θ) = log p(x,θ) and ∂_i (in ∂_i ℓ) is the partial derivative ∂/∂θ^i.
Then, it is natural to define the inner product by

$$g_{ij}(\theta) \;=\; \langle \partial_i, \partial_j \rangle \;=\; E_\theta[\partial_i \ell(x,\theta)\; \partial_j \ell(x,\theta)], \qquad (2.1)$$
where E_θ denotes the expectation with respect to p(x,θ). This g_ij is the
Fisher information matrix. Two vectors A and B are orthogonal when

$$\langle A, B \rangle \;=\; \langle A^i \partial_i, B^j \partial_j \rangle \;=\; A^i B^j g_{ij} \;=\; 0.$$
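As a small illustration of definition (2.1) (our own example, not from the
text), the following computes g_ij symbolically for the normal model N(μ, σ²)
in the coordinates θ = (μ, σ):

```python
# Symbolic Fisher information matrix for N(mu, sigma^2), an illustrative
# instance of definition (2.1).
import sympy as sp

x = sp.Symbol("x", real=True)
mu = sp.Symbol("mu", real=True)
sigma = sp.Symbol("sigma", positive=True)

p = sp.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sp.sqrt(2 * sp.pi) * sigma)
l = sp.log(p)
coords = (mu, sigma)

E = lambda f: sp.integrate(sp.expand(f * p), (x, -sp.oo, sp.oo))

g = sp.Matrix(2, 2, lambda i, j:
              sp.simplify(E(sp.diff(l, coords[i]) * sp.diff(l, coords[j]))))
print(g)  # expected: Matrix([[1/sigma**2, 0], [0, 2/sigma**2]])
```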
It is sometimes necessary to compare a vector A ∈ T_θ of the tangent
space T_θ at one point θ with a vector B ∈ T_θ' belonging to the tangent space T_θ'
at another point θ'. This can be done by comparing the basis vectors ∂_i at T_θ
with the basis vectors ∂'_i at T_θ'. Since T_θ and T_θ' are two different vector
spaces, the two vectors ∂_i and ∂'_i are not directly comparable, and we need some
way of identifying T_θ with T_θ' in order to compare the vectors in them. This
can be accomplished by introducing an affine connection, which maps a tangent
space T_{θ+dθ} at θ + dθ to the tangent space T_θ at θ. The mapping should reduce
to the identity map as dθ → 0. Let m(∂'_j) be the image of ∂'_j ∈ T_{θ+dθ} mapped
to T_θ. It is slightly different from ∂_j ∈ T_θ. The vector

$$\nabla_{\partial_i} \partial_j \;=\; \lim_{d\theta \to 0} \frac{1}{d\theta^i}\, \{\, m(\partial'_j) - \partial_j \,\}$$

represents the rate at which the j-th basis vector ∂_j ∈ T_θ "intrinsically"
changes as the point θ moves from θ to θ + dθ (Fig. 2) in the direction ∂_i. We
call ∇_{∂_i} ∂_j the covariant derivative of the basis vector ∂_j in the direction
∂_i. Since it is a vector of T_θ, its components are given by

$$\Gamma_{ijk} \;=\; \langle \nabla_{\partial_i} \partial_j, \partial_k \rangle. \qquad (2.2)$$
[Figure 2]
and

$$\nabla_{\partial_i} \partial_j \;=\; \Gamma_{ij}^{\;\;k}\, \partial_k,$$

where Γ_ij^k = Γ_ijm g^{mk}. We call Γ_ijk the components of the affine connection.
An affine connection is specified by defining ∇_{∂_i} ∂_j or Γ_ijk. Let A(θ) be a
vector field, which assigns to every point θ ∈ S a vector A(θ) = A^i(θ) ∂_i ∈ T_θ.
The intrinsic change of the vector A(θ) as the position θ moves is now given by
the covariant derivative in the direction ∂_i of A(θ) = A^j(θ) ∂_j, defined by

$$\nabla_{\partial_i} A \;=\; (\partial_i A^j)\, \partial_j + A^j (\nabla_{\partial_i} \partial_j) \;=\; (\partial_i A^j + \Gamma_{ik}^{\;\;j} A^k)\, \partial_j,$$

in which the change in the basis vectors as well as that in the components
A^j(θ) is taken into account. The covariant derivative in the direction B =
B^i ∂_i is given by

$$\nabla_B A \;=\; B^i\, \nabla_{\partial_i} A.$$

We have defined the covariant derivative by the use of the basis
vectors ∂_i which are associated with the coordinate system or the parametriza-
tion θ. However, the covariant derivative ∇_B A is invariant under any parametri-
zation, giving the same result in any coordinate system. This yields the trans-
formation law for the components Γ_ijk of a connection. When another coordinate
system (parametrization) θ' = θ'(θ) is used, the basis vectors change from
{∂_i} to {∂'_{i'}}, where

$$\partial'_{i'} \;=\; B^i_{\;i'}\, \partial_i,$$

and B^i_{i'} = ∂θ^i/∂θ'^{i'} is the inverse of the Jacobian matrix of the coor-
dinate transformation. Since the components Γ'_{i'j'k'} of the connection are
written as

$$\Gamma'_{i'j'k'} \;=\; \langle \nabla_{\partial'_{i'}} \partial'_{j'},\; \partial'_{k'} \rangle$$

in this new coordinate system, we easily have the transformation law

$$\Gamma'_{i'j'k'} \;=\; B^i_{\;i'} B^j_{\;j'} B^k_{\;k'}\, \Gamma_{ijk} \;+\; B^k_{\;k'}\, g_{jk}\, (\partial'_{i'} B^j_{\;j'}).$$
We introduce the α-connection, where α is a real parameter, in the
statistical manifold S by the formula

$$\Gamma^{(\alpha)}_{ijk} \;=\; E_\theta\Bigl[\Bigl\{\partial_i \partial_j \ell(x,\theta) + \frac{1-\alpha}{2}\, \partial_i \ell(x,\theta)\, \partial_j \ell(x,\theta)\Bigr\}\, \partial_k \ell(x,\theta)\Bigr]. \qquad (2.3)$$

It is easily checked that the connection defined by (2.3) satisfies the trans-
formation law. In particular, the 1-connection is called the exponential con-
nection, and the −1-connection is called the mixture connection.
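A computation worth recording here (standard, and a sketch rather than a
quotation from the text): in an exponential family p(x,θ) = exp{θ^i x_i − ψ(θ)},
written in its natural coordinates, ∂_i ∂_j ℓ = −∂_i ∂_j ψ(θ) is non-random, so
by (2.3)

$$\Gamma^{(1)}_{ijk} \;=\; E_\theta[\partial_i \partial_j \ell\; \partial_k \ell] \;=\; -\,\partial_i \partial_j \psi(\theta)\, E_\theta[\partial_k \ell] \;=\; 0.$$

Thus the natural parameters are affine coordinates for the exponential (α = 1)
connection; dually, the −1-connection coefficients vanish in the mean value
coordinates η_i = E_θ[x_i].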
2.2 Imbedding and α-curvature

Let us consider an m-dimensional regular statistical model M =
{q(x,u)}, which is imbedded in S = {p(x,θ)} by

$$q(x,u) \;=\; p\{x, \theta(u)\}.$$

Here, u = (u^a) = (u^1, ..., u^m) is a vector parameter specifying distributions of
M, and defines a coordinate system of M. We assume that θ = θ(u) is smooth and
its Jacobian matrix has full rank. Moreover, it is assumed that M forms an
m-dimensional submanifold in S. We identify a point u ∈ M with the point
θ = θ(u) imbedded in S. The tangent space T_u(M) at u of M is spanned by m
vectors ∂_a, a = 1, ..., m, where ∂_a = ∂/∂u^a denotes the tangent vector of the
coordinate curve u^a in M. The basis ∂_a can be represented by a function
∂_a ℓ(x,u) in x as before, where ℓ(x,u) = log q(x,u). Since M is imbedded in S,
the tangent space T_u(M) of M is regarded as a subspace of the tangent space
T_{θ(u)}(S) of S at θ = θ(u). The basis vector ∂_a ∈ T_u(M) is written as a linear
combination of the ∂_i,

$$\partial_a \;=\; B^i_{\;a}(u)\, \partial_i,$$

where B^i_a = ∂θ^i(u)/∂u^a. This can be understood from the relation

$$\partial_a \ell(x,u) \;=\; B^i_{\;a}\, \partial_i \ell\{x, \theta(u)\}.$$

Hence, the tangential directions of M at u are represented by m vectors ∂_a
(a = 1, ..., m), or B_a = (B^i_a) in the component form with respect to the basis ∂_i
of T_{θ(u)}(S).

It is convenient to define n − m vectors ∂_κ, κ = m + 1, ..., n, in
T_{θ(u)}(S) such that the n vectors {∂_a, ∂_κ}, a = 1, ..., m; κ = m + 1, ..., n,
together form a basis of T_{θ(u)}(S) and moreover the ∂_κ's are orthogonal to the
∂_a's (Fig. 3),

$$g_{a\kappa}(u) \;=\; \langle \partial_a, \partial_\kappa \rangle \;=\; 0.$$
The vectors ∂_κ span the orthogonal complement of T_u(M) in T_{θ(u)}(S). We denote
the components of ∂_κ with respect to the basis ∂_i by ∂_κ = B^i_κ(u) ∂_i. The inner
products of any two basis vectors in {∂_a, ∂_κ} are given by

$$g_{ab}(u) \;=\; \langle \partial_a, \partial_b \rangle \;=\; B^i_{\;a} B^j_{\;b}\, g_{ij},$$

$$g_{\kappa\lambda}(u) \;=\; \langle \partial_\kappa, \partial_\lambda \rangle \;=\; B^i_{\;\kappa} B^j_{\;\lambda}\, g_{ij}.$$

The basis vector ∂_a may change in its direction as point u moves in
M. The change is measured by the α-covariant derivative ∇^{(α)}_{∂_b} ∂_a of ∂_a in
the direction ∂_b, where the notion of a connection is necessary, because we need
to compare two vectors ∂_a and ∂'_a belonging to different tangent spaces T_{θ(u)}(S)
and T_{θ(u')}(S). The α-covariant derivative ∇^{(α)}_{∂_b} ∂_a is calculated in S as

$$\nabla^{(\alpha)}_{\partial_b} \partial_a \;=\; (\partial_b B^j_{\;a} + B^i_{\;b} B^k_{\;a}\, \Gamma^{(\alpha)\;j}_{ik})\, \partial_j.$$

When the directions of the tangent space T_u(M) of M do not change as point u
moves in M, the manifold M is said to be α-flat in S, where the tangent direc-
tions are compared by the α-connection. Otherwise, M is curved in the sense of
the α-connection. The α-covariant derivative ∇^{(α)}_{∂_b} ∂_a is decomposed into the
tangential component belonging to ? (M) and the normal component perpendicular
to ? (M). The former component represents the way 3 changes within ? (M), u au
while the latter represents the change of 3^ in the directions perpendicular to a
? (M), as u moves in M. The normal component is measured by
H?b^-a(a)^b'9-=(^bBi +
BbBakia)j)BXr <2-5>
a
which is a tensor called the a-curvature of submanifold M in S. It is usually
called the imbedding curvature or Euler-Shouten curvature. This tensor repre-
sents how M is curved in S. A tensor is a mu? ti-li near mapping from a number of
tangent vectors to the real set. In the present case, for A = Aa3 e? (M)
? = ? 3. e? (?) and C = CK3 belonging to the orthogonal complement of ? (M), we
(a) have the multi-linear mapping Hx ,
H(a)(A,B,C) = h[iI
AaBbCK.
(ai (a) This Hx ' is the a-curvature tensor, and H^.; are its components. The sub-
fa) manifold M is a-flat in S when H\ ' - 0 holds. The m ? m matrix
auK
ru(a)-|2 _ ?(a)?(a) ?? Cd LHM Jab
" HacKHbdxg g
represents the square of the a-curvature of M, where gK and gc are the inverse
matrix of g and g ,, respectively. Efron called the scalar
2 . G?(1)?2 ab ? *
[HM ]ab g
the statistical curvature in a one-dimensional model M, which is the trace of
the square of the exponential- or 1-curvature of M in our terminology.
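To make these definitions concrete, consider a standard example (chosen here for illustration; it is not worked out in the text). Let S be the family of bivariate normal distributions N(μ, I) with known unit covariance, for which g_{ij} = δ_{ij} and T_{ijk} = 0 in the natural coordinates θ = μ, so that every α-connection vanishes. For the circle model θ(u) = (cos u, sin u) we have B_u = (−sin u, cos u) and g_{uu} = 1, while ∂_u B_u = (−cos u, −sin u) is a unit normal vector. Hence

γ² = [H_M^{(1)2}]_{uu} g^{uu} = 1,

the statistical curvature coinciding with the ordinary Euclidean curvature of the unit circle.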
Let θ = θ(t) be a curve in S parametrized by a scalar t. The curve c: θ = θ(t) forms a one-dimensional submanifold in S. The tangent vector ∂_t of the curve is represented in the component form as

∂_t = θ̇^i(t) ∂_i,

or shortly by θ̇, where the dot denotes d/dt. When the direction of the tangent vector ∂_t = θ̇ does not change along the curve in the sense of the α-connection, the curve is called an α-geodesic. By choosing an appropriate parameter, an α-geodesic θ(t) satisfies the geodesic equation

∇^{(α)}_{θ̇} θ̇ = 0,

or in the component form

θ̈^i + Γ^{(α)i}_{jk} θ̇^j θ̇^k = 0.   (2.6)
2.3 Duality in α-flat manifolds
Once an affine connection is defined in S, we can compare two tangent vectors A ∈ T_θ and A' ∈ T_{θ'} belonging to different tangent spaces T_θ and T_{θ'} by the following parallel displacement of a vector. Let c: θ = θ(t) be a curve connecting the two points θ and θ'. Let us consider a vector field A(t) = A^i(t) ∂_i ∈ T_{θ(t)} defined at each point θ(t) on the curve. If the vector A(t) does not change along the curve, i.e., the covariant derivative of A(t) in the direction θ̇ vanishes identically,

∇_{θ̇} A(t) = {Ȧ^i(t) + Γ^i_{jk} A^k(t) θ̇^j} ∂_i = 0,

the field A(t) is said to be a parallel vector field on c. Moreover, A(t') ∈ T_{θ(t')} at θ(t') is said to be a parallel displacement of A(t) ∈ T_{θ(t)} at θ(t). We can thus displace in parallel a vector A ∈ T_θ at θ to another point θ' along a curve θ(t) connecting θ and θ', by making a vector field A(t) which satisfies the differential equation ∇_{θ̇} A(t) = 0, with the boundary conditions θ = θ(0), θ' = θ(1), and A(0) = A ∈ T_θ. The vector A' = A(1) ∈ T_{θ'} at θ' = θ(1) is the parallel displacement of A from θ to θ' along the curve c: θ = θ(t). We denote it by A' = π_c A. When the α-connection is used, we denote the α-parallel displacement operator by π^{(α)}. The parallel displacement of A from θ to θ' in general depends on the path c: θ(t) connecting θ and θ'. When it does not depend on the path, the manifold is said to be flat. It is known that a manifold is flat when, and only when, the Riemann-Christoffel curvature vanishes identically (see textbooks of differential geometry). A statistical manifold S is said to be α-flat when it is flat under the α-connection.

The parallel displacement does not in general preserve the inner product, i.e., ⟨π_c A, π_c B⟩ = ⟨A, B⟩ does not necessarily hold. When a manifold has two affine connections with corresponding parallel displacement operators π and π*, and moreover

⟨π_c A, π*_c B⟩ = ⟨A, B⟩   (2.7)

holds, the two connections are said to be mutually dual. The two operators π and π* are then mutually adjoint. We have the following theorem in this regard (Nagaoka and Amari (1982)).

Theorem 2.1. The α-connection and the −α-connection are mutually dual. When S is α-flat, it is also −α-flat.
When a manifold S is α-flat, there exists a coordinate system (θ^i) such that

∇^{(α)}_{∂_i} ∂_j = 0, or Γ^{(α)}_{ijk}(θ) = 0,

holds identically. In this case, a basis vector ∂_i is the same at any point θ, in the sense that ∂_i ∈ T_θ is mapped to ∂_i ∈ T_{θ'} by the α-parallel displacement irrespective of the path connecting θ and θ'. Since all the coordinate curves θ^i are α-geodesics in this case, θ is called an α-affine coordinate system. A linear transformation of an α-affine coordinate system is also α-affine.

We give an example of a 1-flat (i.e., α = 1) manifold S. The density functions of an exponential family S = {p(x,θ)} can be written as

p(x,θ) = exp{θ^i x_i − ψ(θ)}

with respect to an appropriate measure, where θ = (θ^i) is called the natural or canonical parameter. From

∂_i ℓ(x,θ) = x_i − ∂_i ψ(θ),  ∂_i ∂_j ℓ(x,θ) = −∂_i ∂_j ψ(θ),

we easily have

g_{ij}(θ) = ∂_i ∂_j ψ(θ),  Γ^{(α)}_{ijk}(θ) = (1−α)/2 ∂_i ∂_j ∂_k ψ(θ).

Hence, the 1-connection Γ^{(1)}_{ijk} vanishes identically in the natural parameter, showing that θ gives a 1-affine coordinate system. A curve θ^i(t) = a^i t + b^i, which is linear in the θ-coordinates, is a 1-geodesic, and conversely.
Since an α-flat manifold is also −α-flat, there exists a −α-affine coordinate system η = (η_i) = (η_1,...,η_n) in an α-flat manifold S. Let ∂^i = ∂/∂η_i be the tangent vector of the coordinate curve η_i in the new coordinate system η. The vectors {∂^i} form a basis of the tangent space T_θ (i.e., at θ = θ(η)) of S. When the two bases {∂_i} and {∂^j} of the tangent space T_θ satisfy

⟨∂_i, ∂^j⟩ = δ_i^j

at every point θ (or η), where δ_i^j is the Kronecker delta (denoting the unit matrix), the two coordinate systems θ and η are said to be mutually dual (Nagaoka and Amari (1982)).
Theorem 2.2. When S is α-flat, there exists a pair of coordinate systems θ = (θ^i) and η = (η_i) such that i) θ is α-affine and η is −α-affine, ii) θ and η are mutually dual, iii) there exist potential functions ψ(θ) and φ(η) such that the metric tensors are derived by differentiation as

g_{ij}(θ) = ⟨∂_i, ∂_j⟩ = ∂_i ∂_j ψ(θ),
g^{ij}(η) = ⟨∂^i, ∂^j⟩ = ∂^i ∂^j φ(η),

where g_{ij} and g^{ij} are mutually inverse matrices, so that

∂_i = g_{ij} ∂^j,  ∂^i = g^{ij} ∂_j

holds, and iv) the coordinates are connected by the Legendre transformation

η_i = ∂_i ψ(θ),  θ^i = ∂^i φ(η),   (2.8)

where the potentials satisfy the identity

ψ(θ) + φ(η) − θ·η = 0,   (2.9)

in which θ·η = θ^i η_i.

In the case of an exponential family S, ψ becomes the cumulant generating function, the expectation parameter η = (η_i),

η_i = E_θ[x_i] = ∂_i ψ(θ),

is −1-affine, θ and η are mutually dual, and the dual potential φ(η) is given by the negative entropy,

φ(η) = E[log p],

where the expectation is taken with respect to the distribution specified by η.
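As a concrete illustration (the Poisson family, a standard instance not treated at this point in the text), take p(x,θ) = exp{θx − e^θ}/x!, x = 0, 1, 2, .... Then

ψ(θ) = e^θ,  η = ∂ψ/∂θ = e^θ = E_θ[x],  φ(η) = η log η − η,

and (2.8)-(2.9) are verified directly:

θ = ∂φ/∂η = log η,  ψ(θ) + φ(η) − θη = η + (η log η − η) − η log η = 0,

while g(θ) = ∂²ψ/∂θ² = η is the Fisher information in the natural parameter.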
2.4 α-divergence and α-projection
We can introduce the notion of the α-divergence D_α(θ,θ') in an α-flat manifold S, which represents the degree of divergence from the distribution p(x,θ) to p(x,θ'). It is defined by

D_α(θ,θ') = ψ(θ) + φ(η') − θ·η',   (2.10)

where η' = η(θ') are the η-coordinates of the point θ', i.e., the −α-coordinates of the distribution p(x,θ'). The α-divergence satisfies D_α(θ,θ') ≥ 0 with equality when and only when θ = θ'. The −α-divergence satisfies D_{−α}(θ,θ') = D_α(θ',θ). When S is an exponential family, the −1-divergence is the Kullback-Leibler information,

D_{−1}(θ,θ') = I[p(x,θ') : p(x,θ)] = ∫ p(x,θ) log {p(x,θ)/p(x,θ')} dP.

As a preview of later discussion, we may also note that, when S = {p(x)} is the function space of a non-parametric statistical model, the α-divergence is written as

D_α{p(x), q(x)} = 4/(1 − α²) {1 − ∫ p(x)^{(1−α)/2} q(x)^{(1+α)/2} dP}

when α ≠ ±1, and is the Kullback information or its dual when α = −1 or 1.
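For orientation (a special case we single out here): at α = 0 the formula gives

D_0{p, q} = 4{1 − ∫ (p q)^{1/2} dP} = 2 ∫ (√p − √q)² dP,

twice the squared Hellinger distance, and D_0 is symmetric in p and q.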
When θ and θ' = θ + dθ are infinitesimally close,

D_α(θ, θ + dθ) = ½ g_{ij}(θ) dθ^i dθ^j   (2.11)

holds, so that D_α can be regarded as a generalization of half of the square of the Riemannian distance, although neither symmetry nor the triangle inequality holds for D_α. However, the following Pythagorean theorem holds (Efron (1978) in an exponential family, Nagaoka and Amari (1982) in the general case).

Theorem 2.3. Let c be an α-geodesic connecting two points θ and θ', and let c' be a −α-geodesic connecting two points θ' and θ'' in an α-flat S. When the two curves c and c' intersect at θ' with a right angle such that θ, θ' and θ'' form a right triangle, the following Pythagorean relation holds,
D_α(θ,θ') + D_α(θ',θ'') = D_α(θ,θ'').   (2.12)
Let M = {q(x,u)} be an m-dimensional submanifold imbedded in an α-flat n-dimensional manifold S = {p(x,θ)} by θ = θ(u). For a distribution p(x,θ_0) ∈ S, we search for the distribution q(x,û) ∈ M which is the closest distribution in M to p(x,θ_0) in the sense of the α-divergence (Fig. 4a),

min_{u∈M} D_α{θ_0, θ(u)} = D_α{θ_0, θ(û)}.

We call the resulting û(θ_0) the α-approximation of p(x,θ_0) in M, assuming such exists uniquely. It is important in many statistical problems to obtain the α-approximation, especially the −1-approximation. Let c(u) be the α-geodesic connecting a point θ(u) ∈ M and θ_0,

c(u): θ = θ(t;u),  θ(u) = θ(0;u),  θ_0 = θ(1;u)

(Fig. 4b). When the α-geodesic c(u) is orthogonal to M at θ(u), i.e.,

⟨θ̇(0;u), ∂_a⟩ = 0,

where ∂_a = ∂/∂u^a are the basis vectors of T_u(M), we call this u the α-projection of θ_0 on M. The existence and the uniqueness of the α-approximation and the α-projection are in general guaranteed only locally. The following theorem was first given by Amari (1982a) and by Nagaoka and Amari (1982) in more general form.
Figure 4
Theorem 2.4. The α-approximation û(θ_0) of θ_0 in M is given by the α-projection u(θ_0) of θ_0 on M.
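To see the most important instance (the link to maximum likelihood, made explicit in Section 3 but worth recording here): let θ_0 = θ̂ be the point of S whose η-coordinates are an observed mean x̄, so that E_{θ̂}[x_i] = x̄_i. Then

D_{−1}{θ̂, θ(u)} = ∫ p(x,θ̂) log {p(x,θ̂)/q(x,u)} dP = const − {θ^i(u) x̄_i − ψ(θ(u))},

and minimizing over u maximizes the log-likelihood: the −1-approximation (equivalently, by Theorem 2.4, the −1-projection) of θ̂ in M is the maximum likelihood estimate.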
3. HIGHER-ORDER ASYMPTOTIC THEORY OF STATISTICAL INFERENCE IN CURVED EXPONENTIAL FAMILY

3.1 Ancillary family
Let S be an n-dimensional exponential family parametrized by the natural parameter θ = (θ^i), and let M = {q(x,u)} be an m-dimensional family parametrized by u = (u^a), a = 1,...,m. M is said to be an (n,m)-curved exponential family imbedded in S = {p(x,θ)} by θ = θ(u), when q(x,u) is written as

q(x,u) = exp[θ^i(u) x_i − ψ{θ(u)}].

The geometrical structures of S and M can easily be calculated as follows. The quantities in S in the θ-coordinate system are

g_{ij}(θ) = ∂_i ∂_j ψ(θ),
Γ^{(α)}_{ijk} = (1−α)/2 T_{ijk},
T_{ijk} = ∂_i ∂_j ∂_k ψ(θ).

The quantities in M are

g_{ab}(u) = ⟨∂_a, ∂_b⟩ = B_a^i B_b^j g_{ij},
Γ^{(α)}_{abc} = ⟨∇^{(α)}_{∂_a} ∂_b, ∂_c⟩ = (∂_a B_b^i) B_c^j g_{ij} + (1−α)/2 T_{abc},
T_{abc} = B_a^i B_b^j B_c^k T_{ijk},

where B_a^i = ∂_a θ^i(u). Here, the basis vector ∂_a of T_u(M) is the vector

∂_a = B_a^i ∂_i

in T_{θ(u)}(S). If we use the expectation coordinate system η in S, M is represented by η = η(u). The components of the tangent vector ∂_a are then given by
B_{ai} = ∂_a η_i(u) = B_a^j g_{ji},

where ∂_a = B_{ai} ∂^i and ∂^i = ∂/∂η_i.
Let x_{(1)}, x_{(2)},...,x_{(N)} be N independent observations from a distribution q(x,u) ∈ M. Then, their arithmetic mean

x̄ = (1/N) Σ_j x_{(j)}

is a minimal sufficient statistic. Since the joint distribution q(x_{(1)},...,x_{(N)}; u) can be written as

∏_{j=1}^N q(x_{(j)}, u) = exp[N{θ^i(u) x̄_i − ψ{θ(u)}}],

the geometrical structure of M based on N observations is the same as that based on one observation except for the constant factor N. We treat statistical inference based on x̄. Since a point x̄ in the sample space X can be identified with a point η = x̄ in S by using the expectation parameter η, the observed sufficient statistic x̄ defines a point θ̂ in S whose η-coordinates are given by x̄, η(θ̂) = x̄. In other words, we regard x̄ as the point (distribution) θ̂ in S whose expectation parameter is just equal to x̄. Indeed, this θ̂ is the maximum likelihood estimator in the exponential family S.
Let us attach an (n−m)-dimensional submanifold A(u) of S to each point u ∈ M, such that all the A(u)'s are disjoint (at least in some neighborhood of M, which is called a tubular neighborhood) and the union of the A(u)'s covers S (at least the tubular neighborhood of M). This is called a (local) foliation of S. Let v = (v^κ), κ = m+1,...,n, be a coordinate system in A(u). We assume that the pair (u,v) can be used as a coordinate system of the entire S (at least in a neighborhood of M). Indeed, a pair (u,v) specifies a point in S such that it is included in the A(u) attached to u and its position in A(u) is given by v (see Fig. 5). Let η = η(u,v) be the η-coordinates of the point specified by (u,v). This is the coordinate transformation of S from w = (u,v) to η, where w = (u,v) = (w^β) is an n-dimensional variable, β = 1,...,n, such that its first m components are u = (u^a) and the last n−m components are v = (v^κ).
Figure 5
Any point η (in some neighborhood of M) in S can be represented uniquely by w = (u,v). We assume that A(u) includes the point η = η(u) on M and that the origin v = 0 of A(u) is put at the point u ∈ M. This implies that η(u,0) is the point η(u) ∈ M. We call A = {A(u)} an ancillary family of the model M.

In order to analyze the properties of a statistical inference method, it is helpful to use the ancillary family which is naturally determined by the inference method. For example, an estimator û can be regarded as a mapping from S to M such that it maps the observed point η = x̄ in S, determined by the sufficient statistic x̄, to a point û(x̄) ∈ M. Its inverse image û^{-1}(u) defines an (n−m)-dimensional subspace A(u) attached to u ∈ M,

A(u) = û^{-1}(u) = {η ∈ S | û(η) = u}.

Obviously, the estimator û takes the value u when and only when the observed η is included in A(u). These A(u)'s form a family A = {A(u)} which we will call the ancillary family associated with the estimator û. As will be shown soon, large-sample properties of an estimator û are determined by the geometrical features of the associated ancillary submanifolds A(u). Similarly, a test T can be regarded as a mapping from S to the binary set {r, r̄}, where r and r̄ imply, respectively, rejection and acceptance of a null hypothesis. The inverse image T^{-1}(r) ⊂ S is called the critical region, and the hypothesis is rejected when and only when the observed point η = x̄ ∈ S is in T^{-1}(r). In order to analyze the characteristics of a test, it is convenient to use an ancillary family A = {A(u)} such that the critical region is composed of some of the A(u)'s and the acceptance region is composed of the other A(u)'s. Such an ancillary family is said to be associated with the test T.
In order to analyze the geometrical features of ancillary submanifolds, let us use the new coordinate system w = (u,v). The tangent of the coordinate curve w^β is given by ∂_β = ∂/∂w^β. The tangent space T_w(S) at the point η = η(w) of S is spanned by {∂_β}, β = 1,...,n. They are decomposed into two parts, {∂_β} = {∂_a, ∂_κ}, β = 1,...,n; a = 1,...,m; κ = m+1,...,n. The former part ∂_a = ∂/∂u^a spans the tangent space T_u(M) of M at u, and the latter ∂_κ = ∂/∂v^κ spans the tangent space T_v(A) of A(u). Their components are given by B_{βi} = ∂_β η_i(w) in the basis ∂^i. They are decomposed as

∂_a = B_{ai} ∂^i,  ∂_κ = B_{κi} ∂^i,

with B_{ai} = ∂_a η_i(u,v), B_{κi} = ∂_κ η_i(u,v). The metric tensor in the w-coordinate system is given by

g_{βγ}(w) = ⟨∂_β, ∂_γ⟩ = B_{βi} B_{γj} g^{ij},   (3.1)

where

B_β^i = g^{ij} B_{βj} = ∂θ^i(u,v)/∂w^β.

The metric tensor is decomposed into three parts:

g_{ab}(w) = ⟨∂_a, ∂_b⟩ = B_{ai} B_{bj} g^{ij}   (3.2)

is the metric tensor of M,

g_{κλ}(w) = ⟨∂_κ, ∂_λ⟩ = B_{κi} B_{λj} g^{ij}   (3.3)

is the metric tensor of A(u), and

g_{aκ}(w) = ⟨∂_a, ∂_κ⟩ = B_{ai} B_{κj} g^{ij}   (3.4)

represents the angles between the tangent spaces of M and A(u). When g_{aκ}(u,0) = 0, M and A(u) are orthogonal to each other at M. The ancillary family A = {A(u)} is said to be orthogonal when g_{aκ}(u) = 0, where f(u) is the abbreviation of f(u,0) when a quantity f(u,v) is evaluated on M, i.e., at v = 0. We may also treat an ancillary family A_N which depends on the number N of observations. In this case g_{aκ} also depends on N. When g_{aκ} = ⟨∂_a, ∂_κ⟩ is a quantity of order N^{-1/2}, converging to 0 as N tends to infinity, the ancillary family is said to be asymptotically orthogonal.
The α-connection in the w-coordinate system is given by

Γ^{(α)}_{βγδ} = ⟨∇^{(α)}_{∂_β} ∂_γ, ∂_δ⟩ = (∂_β B_{γi}) B_δ^i − (1+α)/2 T_{βγδ}
             = (∂_β B_γ^i) B_{δi} + (1−α)/2 T_{βγδ},   (3.5)

where T_{βγδ} = B_β^i B_γ^j B_δ^k T_{ijk}. The M-part Γ^{(α)}_{abc} gives the components of the α-connection of M, and the A-part Γ^{(α)}_{κλμ} gives those of the α-connection of A(u). When A is orthogonal, the α-curvatures of M and A(u) are given respectively by

H^{(α)}_{abκ} = Γ^{(α)}_{abκ},  H^{(α)}_{κλa} = Γ^{(α)}_{κλa}.   (3.6)

The quantities g_{aκ}(u), H^{(α)}_{abκ} and H^{(α)}_{κλa} are fundamental in evaluating asymptotic properties of statistical inference procedures. When α = 1, the 1-connection is called the exponential connection, and we use the suffix (e) instead of (1). When α = −1, the −1-connection is called the mixture connection, and we use the suffix (m) instead of (−1).
3.2 Edgeworth expansion
We study higher-order asymptotic properties of various statistics with the help of Edgeworth expansions. To this end, let us express the point η = x̄ defined by the observed sufficient statistic in the w-coordinate system. The w-coordinates ŵ = (û, v̂) are obtained by solving

x̄ = η(ŵ) = η(û, v̂).   (3.7)

The sufficient statistic x̄ is thus decomposed into two parts (û, v̂) which together are also sufficient. When the ancillary family A is associated with an estimator or a test, û gives the estimated value or the test statistic, respectively. We calculate the Edgeworth expansion of the joint distribution of (û, v̂) in geometrical terms. Here, it is necessary further to assume a condition which guarantees the Edgeworth expansion. We assume that Cramér's condition is satisfied; see, for example, Bhattacharya and Ghosh (1978).
When u_0 is the true parameter of the distribution, x̄ converges to η(u_0, 0) in probability as the number N of observations tends to infinity, so that the random variable ŵ also converges to w_0 = (u_0, 0). Let us put

x̃ = √N {x̄ − η(u_0, 0)},  w̃ = √N (ŵ − w_0),
ũ = √N (û − u_0),  ṽ = √N v̂.   (3.8)

Then, by expanding (3.7), we can express w̃ in a power series of x̃, and thereby obtain the Edgeworth expansion of the distribution p(w̃; u_0) of w̃ = (ũ, ṽ). However, it is simpler to obtain the distribution of the one-step bias-corrected version w̃* of w̃ defined by

w̃* = w̃ − E[w̃],

where E denotes the expectation with respect to p(x, w). The distribution of w̃ is obtained easily from that of w̃*. (See Amari and Kumon (1983).)
Theorem 3.1. The Edgeworth expansion of the probability density p(w̃*; u_0) of w̃*, where q(x, u_0) is the underlying true distribution, is given by

p(w̃*; u_0) = n(w̃*; g_{αβ}) {1 + 1/(6√N) K_{αβγ} h^{αβγ}(w̃*) + N^{-1} A_N(w̃*) + O(N^{-3/2})},   (3.9)

A_N(w̃*) = ¼ C_{2αβ} h^{αβ} + 1/24 K_{αβγδ} h^{αβγδ} + 1/72 K_{αβγ} K_{δεζ} h^{αβγδεζ},

where n(w̃*; g_{αβ}) is the multivariate normal density with mean 0 and covariance g^{αβ} = (g_{αβ})^{-1}; h^{αβγ}, etc., are the tensorial Hermite polynomials in w̃*; K_{αβγ} and K_{αβγδ} are the third- and fourth-order cumulant tensors of w̃; and

C_{2αβ} = Γ^{(m)}_{γδα} Γ^{(m)}_{εσβ} g^{γε} g^{δσ}, etc.

The tensorial Hermite polynomials in w̃ with metric g_{αβ} are defined by

h^{α_1...α_k}(w̃) = (−1)^k {D^{α_1} ⋯ D^{α_k} n(w̃; g_{αβ})} / n(w̃; g_{αβ}),

where D^α = g^{αβ}(∂/∂w̃^β); cf. Amari and Kumon (1983), McCullagh (1984). Hence,

h = 1,  h^α = w̃^α,  h^{αβ} = w̃^α w̃^β − g^{αβ},
h^{αβγ} = w̃^α w̃^β w̃^γ − g^{αβ} w̃^γ − g^{αγ} w̃^β − g^{βγ} w̃^α,  etc.
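In one dimension with g = 1 these reduce to the classical Hermite polynomials; for instance,

h³(w̃) = w̃³ − 3w̃,  h⁴(w̃) = w̃⁴ − 6w̃² + 3,

obtained by differentiating the standard normal density three and four times.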
Theorem 3.1 shows the Edgeworth expansion up to order N^{-1} of the joint distribution of ũ* and ṽ*, which together carry the full Fisher information. The marginal distribution can easily be obtained by integration.

Theorem 3.2. When the ancillary family is orthogonal, i.e., g_{aκ}(u) = 0, the distribution p(ũ*; u_0) of ũ* is given by

p(ũ*; u_0) = n(ũ*; g_{ab}) {1 + 1/(6√N) K_{abc} h^{abc} + N^{-1} A_N(ũ*)} + O(N^{-3/2}),   (3.10)

where K_{abc} is the third-order cumulant tensor of ũ, expressible in terms of T_{abc} and the mixture connection Γ^{(m)}_{abc} of M,

A_N(ũ*) = ¼ C_{2ab} h^{ab} + terms common to all the orthogonal ancillary families,

C_{2ab} = (Γ_M^{(m)2})_{ab} + 2 (H_M^{(e)2})_{ab} + (H_A^{(m)2})_{ab},   (3.11)

(Γ_M^{(m)2})_{ab} = Γ^{(m)}_{cda} Γ^{(m)}_{efb} g^{ce} g^{df},
(H_M^{(e)2})_{ab} = H^{(e)}_{acκ} H^{(e)}_{bdλ} g^{cd} g^{κλ},
(H_A^{(m)2})_{ab} = H^{(m)}_{κμa} H^{(m)}_{λνb} g^{κλ} g^{μν}.
3.3 Higher-order efficiency of estimation
Given an estimator û: S → M which maps the observed point η = x̄ ∈ S to û(x̄) ∈ M, we can construct the ancillary family A = {A(u)} by

A(u) = û^{-1}(u) = {η ∈ S | û(η) = u}.

The A(u) includes the point η(u) = η(u,0) when and only when the estimator is consistent. (We may treat the case when A(u) depends on N, denoting the ancillary family by A_N(u). In this case, an estimator is consistent if lim A_N(u) ∋ η(u,0).)

Let us expand the covariance of the estimation error ũ = √N(û − u_0) as

Cov[ũ^a, ũ^b] = g_1^{ab} + g_2^{ab} N^{-1/2} + g_3^{ab} N^{-1} + O(N^{-3/2}).

A consistent estimator is said to be first-order efficient, or simply efficient, when its first-order term g_1^{ab}(u) is minimal among all the consistent estimators at any u, where the minimality is in the sense of positive semidefiniteness of matrices. The second- and third-order efficiency is defined similarly.

Since the first-order term g_1^{ab} is given from (3.9) by

g_1^{ab} = (g_{ab} − g_{aκ} g_{bλ} g^{κλ})^{-1},

the minimality is attained when and only when g_{aκ} = 0, i.e., the associated ancillary family is orthogonal. From this and Theorem 3.2, we have the following results.

Theorem 3.3. A consistent estimator is first-order efficient iff the associated ancillary family is orthogonal. An efficient estimator is always second-order efficient, because g_2 = 0.
There exist no third-order efficient estimators in the sense that g_3^{ab}(u) is minimal at all u. This can be checked from the fact that g_3 includes a term linear in the derivative of the mixture curvature of A(u); see Amari (1985). However, if we calculate the covariance of the bias-corrected version û* = û − b(û) of an efficient estimator û, where b(u) = E_u[û] − u is the bias, we see that there exists a third-order efficient estimator among the class of all the bias-corrected efficient estimators. To state the result, let g_{3ab} = g_3^{cd} g_{ca} g_{db} be the lower-index version of g_3^{ab}.

Theorem 3.4. The third-order term g_{3ab} of the covariance of a bias-corrected efficient estimator û* is given by the sum of the three non-negative geometric quantities

g_{3ab} = ½ (Γ_M^{(m)2})_{ab} + (H_M^{(e)2})_{ab} + ½ (H_A^{(m)2})_{ab}.   (3.12)
The first is the square of the mixture connection components of M, and depends on the parametrization of M but is common to all the estimators. The second is the square of the exponential curvature of M, which does not depend on the estimator. The third is the square of the mixture curvature of the ancillary submanifold A(u) at η(u), which depends on the estimator. An efficient estimator is third-order efficient when and only when the associated ancillary family is mixture-flat at η(u). The m.l.e. is third-order efficient, because it is given by the mixture-projection of θ̂ to M.

The Edgeworth expansion (3.10) tells more about the characteristics of an efficient estimator û*. When H^{(m)}_{κλa} vanishes, an estimator is shown to be mostly concentrated around the true parameter u and is third-order optimal under a symmetric unimodal loss function. The effect of the manner of parametrizing M is also clear from (3.10). The α-normal coordinate system (parameter), in which the components of the α-connection become zero at a fixed point, is very important (cf. Hougaard, 1983; Kass, 1984).
3.4 Higher-order efficiency of tests
Let us consider a test T of a null hypothesis H_0: u ∈ D against the alternative H_1: u ∉ D in an (n,m)-curved exponential family, where D is a region or a submanifold in M. Let R be the critical region of test T, such that the hypothesis H_0 is rejected when and only when the observed point η = x̄ belongs to R. When T has a test statistic λ(x̄), the equation λ(x̄) = const. gives the boundary of the critical region R. The power function P_T(u) of the test T at the point u is given by

P_T(u) = ∫_R p(x̄; u) dx̄,

where p(x̄; u) is the density function of x̄ when the true parameter is u.

Given a test T, we can compose an ancillary family A = {A(u)} such that the critical region R is given by the union of some of the A(u)'s, i.e., it can be written as

R = ∪_{u ∈ R_M} A(u),
where R_M is a subset of M. Then, when we decompose the observed statistic η = x̄ into (û, v̂) by x̄ = η(û, v̂) in terms of the related w-coordinates, the hypothesis H_0 is rejected when and only when û ∈ R_M. Hence, the test statistic λ(x̄) is a function of û only. Since we have already obtained the Edgeworth expansion of the joint distribution of (û, v̂), or of (ũ*, ṽ*), we can analyze the characteristics of a test in terms of the geometry of the associated A(u)'s.
We first consider the case where M = {q(x,u)} is one-dimensional, so that u = (u^a) is a scalar parameter, the indices a, b, etc. being equal to 1. We test the null hypothesis H_0: u = u_0 against the alternative H_1: u ≠ u_0. Let u_t be a point which approaches u_0 as N tends to infinity by

u_t = u_0 + t (Ng)^{-1/2},   (3.13)

i.e., the point whose Riemannian distance from u_0 is approximately t N^{-1/2}, where g = g_{ab}(u_0). The power P_T(u_t, N) of a test T at u_t is expanded as

P_T(u_t, N) = P_{T1}(t) + P_{T2}(t) N^{-1/2} + P_{T3}(t) N^{-1} + O(N^{-3/2}).
A test T is said to be first-order uniformly efficient or, simply, efficient, if the first-order term P_{T1}(t) satisfies P_{T1}(t) ≥ P_{T'1}(t) at all t, compared with any other test T' of the same level. The second- and third-order uniform efficiency is defined similarly. Let P̄(u_t, N) be the envelope power function of the P_T(u_t, N)'s, defined by

P̄(u_t, N) = sup_T P_T(u_t, N).   (3.14)

Let us expand it as

P̄(u_t, N) = P̄_1(t) + P̄_2(t) N^{-1/2} + P̄_3(t) N^{-1} + O(N^{-3/2}).

It is clear that a test T is i-th order uniformly efficient iff

P_{Tk}(t) = P̄_k(t)

holds at any t for k = 1,...,i.
An ancillary family A = {A(u)} in this case consists of (n−1)-dimensional submanifolds A(u) attached to each u, or η(u) ∈ M. The critical region R is bounded by one of the ancillary submanifolds, say A(u_+), in the one-sided case, and by two submanifolds A(u_+) and A(u_−) in the two-sided unbiased case. The asymptotic behavior of a test T is determined by the geometric features of the boundary ∂R, i.e., A(u_+) [and A(u_−)]. In particular, the angle between M and A(u) is important. The angle is given by the inner product g_{aκ}(u) = ⟨∂_a, ∂_κ⟩ of the tangent ∂_a of M and the tangents ∂_κ of A(u). When g_{aκ}(u) = 0 for all u, A is orthogonal. In the case of a test, the critical region, and hence the associated ancillary A and g_{aκ}(u), depend on N. An ancillary family is said to be asymptotically orthogonal when g_{aκ}(u) is of order N^{-1/2}. We can assume g_{aκ}(u_0) = 0, and g_{aκ}(u_t) can be expanded as

g_{aκ}(u_t) = t Q_{aκ} (Ng)^{-1/2},   (3.15)

where Q_{aκ} = ∂_a g_{aκ}(u_0). The quantity Q_{aκ} represents the direction and the magnitude of the inclination of A(u) from being exactly orthogonal to M. We can now state the asymptotic properties of a test in geometrical terms (Kumon and Amari (1983), (1985)).
Theorem 3.5. A test T is first-order uniformly efficient iff the associated ancillary family A is asymptotically orthogonal. A first-order uniformly efficient test is second-order uniformly efficient.
Unfortunately, there exist no third-order uniformly efficient tests (unless the model M is an exponential family). An efficient test T is said to be third-order t_0-efficient when its third-order power P_{T3}(t) attains the envelope at t_0, i.e., when P_{T3}(t_0) = P̄_3(t_0), and when there exists no test T' satisfying P_{T'3}(t) > P_{T3}(t) for all t. An efficient test is third-order admissible when it is t_0-efficient at some t_0. We define the third-order power loss function (deficiency function) ΔP_T(t) of an efficient test T by

ΔP_T(t) = lim_{N→∞} N {P̄(u_t, N) − P_T(u_t, N)} = P̄_3(t) − P_{T3}(t).   (3.16)

It characterizes the behavior of an efficient test T. The power loss function can be explicitly given in geometrical terms of the associated ancillary A (Kumon and Amari (1983), Amari (1983a)).
Theorem 3.6. An efficient test T is third-order admissible only when the mixture curvature of A(u) vanishes as N → ∞ and the A(u) is not exactly orthogonal to M but asymptotically orthogonal, so as to compensate the exponential curvature H^{(e)} of the model M in such a way that

Q_{aκ} = c H^{(e)}_{abκ}   (3.17)

holds for some constant c. The third-order power loss function is then given by

ΔP_T(t) = a_i(t,α) {c − J_i(t,α)}² γ²,   (3.18)

where a_i(t,α) is some fixed function of t and α (involving the standard normal density φ), α being the level of the test,

γ² = H^{(e)}_{abκ} H^{(e)}_{cdλ} g^{κλ} g^{ac} g^{bd}   (3.19)

is the square of the exponential curvature (Efron's statistical curvature) of M, and

J_1(t,α) = 1 − t/{2u_1(α)},
J_2(t,α) = 1 − t/[2u_2(α) tanh{t u_2(α)}],

with i = 1 for the one-sided case and i = 2 for the two-sided case, u_1(α) and u_2(α) being the one-sided and two-sided 100α% points of the standard normal distribution, respectively.
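Two consequences can be read off from (3.18) by simple arithmetic (we record them here; the figures below draw the same picture). A third-order admissible test with constant c suffers no third-order power loss exactly where J_i(t,α) = c. Thus, in the one-sided case, the c = 0 test is loss-free at t = 2u_1(α), since J_1(2u_1(α),α) = 0, while the c = 1/2 test is loss-free at t = u_1(α), where J_1 = 1/2.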
The theorem shows that a third-order admissible test is characterized by its c value. It is interesting that the third-order power loss function (3.18) depends on the model M only through the statistical curvature γ², so that ΔP_T(t)/γ² gives a universal power loss curve common to all statistical models. It depends only on the value of c. Various widely used tests will next be shown to be third-order admissible, so that they are characterized by c values as follows.

Theorem 3.7. The test based on the maximum likelihood estimator (e.g., the Wald test) is characterized by c = 0. The likelihood ratio test is characterized by c = 1/2. The locally most powerful test is characterized by c = 1 in the one-sided case and by c = 1 − 1/{2u_2²(α)} in the two-sided case. The conditional test conditioned on the approximate ancillary statistic a = H^{(e)}_{abκ} ṽ^κ is characterized also by c = 1/2. The efficient-score test is characterized by c = 1, and is inadmissible in the two-sided case.
We show the universal third-order power loss functions of various tests in Fig. 6 for the two-sided case and in Fig. 7 for the one-sided case, where α = 0.05 (from Amari (1983a)). It is seen that the likelihood ratio test has fairly good performance throughout a wide range of t, while the locally most powerful test behaves badly when t > 2. The m.l.e. test is good at around t = 3.

We can generalize the present theory to the multi-parameter case with and without nuisance parameters. It is interesting that none of the above tests is third-order admissible in the multi-parameter case. However, it is easy to modify a test to get a third-order t_0-efficient test by the use of the asymptotic ancillary statistic a (Kumon and Amari, 1985). We can also design third-order t_0-most-powerful confidence region estimators and third-order minimal-size confidence region estimators.

It is also possible to extend the present results of estimation and testing to a statistical model with a nuisance parameter ξ. In this case, a set M(u_0) of distributions, in which the parameter of interest takes a fixed value u_0 but ξ takes arbitrary values, forms a submanifold. The mixture curvature and the exponential twister curvature of M(u_0) are responsible for the higher-order characteristics of statistical inference. The third-order admissibility of the likelihood ratio test and others is again proved. See Amari (1985).
ΔP_T(t)/γ², α = 0.05, two-sided tests: efficient score test, locally most powerful test, m.l.e. test, likelihood ratio test.

Figure 6

ΔP_T(t)/γ², α = 0.05, one-sided tests: efficient score test (locally most powerful test), m.l.e. test, likelihood ratio test.

Figure 7
4. INFORMATION, SUFFICIENCY AND ANCILLARITY: HIGHER-ORDER THEORY

4.1 Information and conditional information
Given a statistical model M = {p(x,u)}, u = (u^a), we can follow Fisher and define the amount g_{ab}(T) of information included in a statistic t = t(x) by

g_{ab}(T) = E[∂_a ℓ(t,u) ∂_b ℓ(t,u)],   (4.1)

where ℓ(t,u) is the logarithm of the density function of t when the true parameter is u. The information g_{ab}(T) is a positive-semidefinite matrix depending on u. Obviously, for the statistic X, g_{ab}(X) is the Fisher information matrix. Let T(X) and S(X) be two statistics. We similarly define, by using the joint distribution of T and S, the amount g_{ab}(T,S) of information which T and S together carry. The additivity

g_{ab}(T,S) = g_{ab}(T) + g_{ab}(S)
does not hold except when T and S are independent. We define the amount of conditional information carried by T when S is known by

g_{ab}(T|S) = E_S E_{T|S}[∂_a ℓ(t|s,u) ∂_b ℓ(t|s,u)],   (4.2)

where ℓ(t|s,u) is the logarithm of the conditional density function of T conditioned on S. Then, the following relation holds,

g_{ab}(T,S) = g_{ab}(T) + g_{ab}(S|T) = g_{ab}(S) + g_{ab}(T|S).

From g_{ab}(S|T) = g_{ab}(T,S) − g_{ab}(T), we see that the conditional information denotes the amount of loss of information when we discard s from a pair of statistics s and t, keeping only t. Especially,
Δg_{ab}(T) = g_{ab}(X) − g_{ab}(T) = g_{ab}(X|T)   (4.3)

is the amount of loss of information when we keep only t(x) instead of keeping the original x. The following relations are useful for calculation,

Δg_{ab}(T) = E_T Cov[∂_a ℓ(x,u), ∂_b ℓ(x,u) | t],   (4.4)
g_{ab}(S|T) = g_{ab}(T,S) − g_{ab}(T),   (4.5)

where Cov[·|t] is the conditional covariance.
A statistic S is sufficient when g_{ab}(S) = g_{ab}(X), or Δg_{ab}(S) = 0. When S is sufficient, g_{ab}(T|S) = 0 holds for any statistic T. A statistic A is ancillary when g_{ab}(A) = 0. When A is ancillary, g_{ab}(T,A) = g_{ab}(T|A) for any T. It is interesting that, although A itself has no information, A together with another statistic T recovers the amount

g_{ab}(A|T) = g_{ab}(T,A) − g_{ab}(T)

of information. An ancillary statistic carries some information in this sense, and this is the reason why ancillarity is important in statistical inference. We call g_{ab}(A|T) the amount of information of the ancillary A relative to the statistic T.
When N independent observations x_1,...,x_N are available, the Fisher information g_{ab}(X^N) is N g_{ab}(X), N times that of one observation. When M is a curved exponential family, x̄ = Σ x_j / N is a sufficient statistic, keeping the whole information, g_{ab}(X̄) = N g_{ab}(X). Let t(x̄) be a statistic which is a function of x̄. It is said to be asymptotically sufficient of order q, when

Δg_{ab}(T) = g_{ab}(X̄) − g_{ab}(T) = O(N^{−q+1}).   (4.6)

Similarly, a statistic t(x̄) is said to be asymptotically ancillary of order q, when

g_{ab}(T) = O(N^{−q})   (4.7)

holds. (The definition of the order in the present article is different from that by Cox (1980) etc.)
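For orientation (two instances implicit in what follows): the sufficient statistic x̄ itself has Δg_{ab} = 0 and is trivially asymptotically sufficient of every order, while an efficient estimator û, whose information loss is Δg_{ab}(Û) = O(1) by (4.8) below, is asymptotically sufficient of order q = 1 in the sense of (4.6).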
4.2 Asymptotic efficiency and ancillarity
Given a consistent estimator û(x̄) in an (n,m)-curved exponential family M, we can construct the associated ancillary family A. By introducing an adequate coordinate system v in each A(u), the sufficient statistic x̄ is decomposed into two statistics (û, v̂) by x̄ = η(û, v̂). The amount Δg_{ab}(Û) of information loss of the estimator û is calculated from (4.4), by using the stochastic expansion of ∂_a ℓ(x,u), as

Δg_{ab}(Û) = N g_{aκ} g_{bλ} g^{κλ} + O(1).

Hence, when and only when A is orthogonal, i.e., g_{aκ}(u) = 0, û is first-order sufficient. In this case, û is (first-order) efficient. The loss of information of an efficient estimator û is calculated as

Δg_{ab}(Û) = (H_M^{(e)2})_{ab} + ½ (H_A^{(m)2})_{ab} + O(N^{-1}),   (4.8)

where (H_M^{(e)2}) is the square of the exponential curvature of the model M and (H_A^{(m)2}) is the square of the mixture curvature of the associated ancillary family A at v = 0. Hence, the loss of information is minimized uniformly in u iff the mixture curvature of the associated ancillary family A(u) vanishes at v = 0 for all u. In this case, the estimator û is third-order efficient in the sense of the covariance in §3. The m.l.e. is such a higher-order efficient estimator.
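In the one-dimensional case this recovers the classical Fisher-Efron information loss of the m.l.e. (we state it as a consequence of (4.8), using γ² of §2.2): since the ancillary family of the m.l.e. is mixture-flat,

Δg(Û_{m.l.e.}) = (H_M^{(e)2})_{11} + O(N^{-1}) = γ² g + O(N^{-1}).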
Among all third-order efficient estimators, does there exist one whose loss of information is minimal at all u up to the term of order N^{-1}? Is the m.l.e. such a one? This problem is related to the asymptotic efficiency of estimators of order higher than three. By using the Edgeworth expansion (3.9) and the stochastic expansion of ∂_a ℓ(x,u), we can calculate the terms, which depend on the estimator, of the information loss of order N^{-1} in geometrical terms of the related ancillary family. The loss of order N^{-1} includes a term related to the derivatives of the mixture curvature H^{(m)}_{κλa} of A in the directions of ∂_κ and ∂_a (unpublished note). From this formula, one can conclude that there exist no estimators whose loss Δg_{ab}(Û) of information is minimal up to the term of order N^{-1} at all u among all other estimators. Hence, the loss of information of the m.l.e. is not uniformly minimal at all u, when the loss is evaluated up to the term of order N^{-1}.
We have already obtained the Edgeworth expansion up to order N^{-1} of the joint distribution of (û, v̂), or equivalently of (ũ*, ṽ*), in (3.9). By integration, we have the distribution of ṽ*,

p(ṽ*; u) = n(ṽ*; g_{κλ}) {1 + 1/(6√N) K_{κλμ} h^{κλμ} + O(N^{-1})},   (4.9)

where g_{κλ}(u) and K_{κλμ}(u) depend on the coordinate system v introduced in each A(u). The information g_{ab}(V*) of ṽ* can be calculated from this. It depends on the coordinate system v, too. It is always possible to choose a coordinate system v in each A(u) such that {∂_κ} is an orthonormal system at v = 0, i.e., g_{κλ}(u) = δ_{κλ}. Then, ṽ* is first-order ancillary. It is always possible to choose a coordinate system such that K_{κλμ}(u) = 0 further holds at v = 0 in every A(u). This coordinate system is indeed given by the (α = −1/3)-normal coordinate system at v = 0. The ṽ* is second-order ancillary in this coordinate system. By evaluating the term of order N^{-1} in (4.9), we can prove that there exists in general no third-order ancillary v̂.

However, Skovgaard (1985), by using the method of Chernoff (1949), showed that one can always construct an ancillary v̂ of order q for any q by modifying v̂ successively. The q-th order ancillary v̂ is then a function of x̄ depending on N. Hence, our previous result implies only that one cannot in general construct a third-order ancillary by using a function of x̄ not depending on N, or by relying on an ancillary family A = {A(u)} not depending on N. There is no reason to stick to an ancillary family not depending on N, as Skovgaard argued.
4.3 Decomposition of information
Since (û, v̂) together are sufficient, the information lost by summarizing x̄ into û is recovered by knowing the ancillary v̂. The amount of recovered information g_{ab}(V|U) is equal to Δg_{ab}(Û). Obviously, the amount of information of v̂ relative to û does not depend on the coordinate system of A(u). In order to recover the information of order 1 in Δg_{ab}(Û), not all the components of v̂ are necessary. Some functions of v̂ can recover the full information of order 1; some other functions of v̂ will recover the information of order N^{-1}, and some others further the information of order N^{-2}. We can decompose the whole ancillary v̂ into parts according to the order of magnitude of the amount of relative information.
The tangent space T_u(A) of the ancillary subspace A(u) associated with an efficient estimator û is spanned by the n − m vectors ∂_κ. The ancillary ṽ can be regarded as a vector ṽ = ṽ^κ ∂_κ belonging to T_u(A). Now we decompose T_u(A) as follows. Let us define

H_{a_1...a_p} = ∇^{(e)}_{a_1} ⋯ ∇^{(e)}_{a_{p-1}} ∂_{a_p},  p ≥ 2,   (4.10)

which is a tensor representing the higher-order exponential curvature of the model. When p = 2, it is nothing but the exponential curvature H^{(e)}_{abκ}, and when p = 3, H_{a_1a_2a_3} represents the rate of change in the curvature H^{(e)}, and so on. For fixed indices a_1,...,a_p, H_{a_1...a_p} is a vector in T_u(S), and its projection to T_u(A) has the components

H_{a_1...a_p κ} = ⟨H_{a_1...a_p}, ∂_κ⟩.

Let T_u(A)_p (p ≥ 2) be the subspace of T_u(A) spanned by the projections of H_{a_1a_2},..., H_{a_1...a_p} onto T_u(A), and let P_p be the orthogonal projection from T_u(A) to T_u(A)_p. We call

H^{(e)}_{a_1...a_p} = (I − P_{p−1}) H_{a_1...a_p}   (4.11)

the p-th order exponential curvature tensor of the model M, where I is the identity operator. The square of the p-th order curvature is defined by

(H_M^{(e)p,2})_{ab} = ⟨H^{(e)}_{a a_2...a_p}, H^{(e)}_{b b_2...b_p}⟩ g^{a_2b_2} ⋯ g^{a_pb_p}.   (4.12)

There exists a finite p_0 such that H^{(e)}_{a_1...a_p} vanishes for p > p_0.

Now let us consider the following sequence of statistics,

T_1 = {û},  T_2 = H^{(e)}_{a_1a_2κ} ṽ^κ, ...,  T_p = H^{(e)}_{a_1...a_pκ} ṽ^κ.

Moreover, let t_a = ∂_a ℓ(x̄,û), which vanishes if û is the m.l.e. Obviously, the sequence T_1, T_2, ... gives a decomposition of the ancillary statistic ṽ = (ṽ^κ) into the higher-order curvature directions of M. Let

t^1 = T_1,  t^2 = {t, T_1, T_2}, ...,  t^p = {t^{p−1}, T_p}.
Then, we have the following theorems (see Amari (1985)).
Theorem 4.1. The set of statistics t^p is asymptotically sufficient of order p. The statistic T_p carries information of order N^{−p+2} relative to t^{p−1},

g_{ab}(T_p | t^{p−1}) = N^{−p+2} (H_M^{(e)p,2})_{ab}.   (4.13)

Theorem 4.2. The Fisher information g_{ab}(X̄) = N g_{ab}(X) is decomposed into

g_{ab}(X̄) = Σ_{p≥1} g_{ab}(T_p | t^{p−1}) = g_{ab}(Û) + Σ_{p≥2} N^{−p+2} (H_M^{(e)p,2})_{ab}.   (4.14)
The theorems imply the following. An efficient estimator û carries all the information of order N. The ancillary ṽ, which together with û carries the remaining smaller-order information, is decomposed into the sum of the p-th order curvature-direction components a_p = H^{(e)}_{a_1...a_pκ} ṽ^κ, where a_p carries all the missing information of order N^{−p+2} relative to t^{p−1}. The proof is obtained by expanding ∂_a ℓ(x̄,û), where ũ = û − u, as

∂_a ℓ(x̄,û) = ∂_a ℓ(x̄,u) + Σ_p (1/p!) ∂_{a_1} ⋯ ∂_{a_p} ∂_a ℓ(x̄,u) ũ^{a_1} ⋯ ũ^{a_p},

and by calculating g_{ab}(T_p | t^{p−1}). The information carried by ∂_{a_1} ⋯ ∂_{a_p} ∂_a ℓ(x̄,u) is equivalent to that of H^{(e)}_{a a_1...a_p κ} ṽ^κ relative to t^{p−1}, up to the necessary order.
4.4 Conditional inference
When there exists an exact ancillary statistic a, the conditionality principle requires that statistical inference should be done by conditioning on a. However, there exist no non-trivial exact ancillary statistics in many problems. Instead, there exists an asymptotically ancillary statistic v̂, which can be refined to be higher-order ancillary. The asymptotic ancillary statistic carries information of order 1, and is very useful in improving higher-order characteristics of statistical inference. For example, the conditional covariance of an efficient estimator is evaluated by

N Cov[û^a, û^b | ṽ] = (g_{ab} + H^{(e)}_{abκ} ṽ^κ)^{-1} + higher-order terms,

where g_{ab} + H^{(e)}_{abκ} ṽ^κ = −∂_a ∂_b ℓ(x̄,û) is the observed Fisher information. When two groups of independent observations are obtained, we cannot get a third-order efficient estimator for the entire set of observations by combining only the two third-order efficient estimators û_1 and û_2 for the respective samples. If we can use the asymptotic ancillaries H^{(e)}_{abκ} ṽ_1^κ and H^{(e)}_{abκ} ṽ_2^κ, we can calculate the third-order efficient estimator (see Chapter 5). Moreover, the ancillary H^{(e)}_{abκ} ṽ^κ can be used to change the characteristics of an efficient test and of an efficient interval estimator. We can obtain the third-order t_0-efficient test or interval estimator by using the ancillary for any given t_0. It is interesting that the conditional test conditioned on the asymptotic ancillary ṽ is third-order admissible and its characteristic (deficiency curve) is the same as that of the likelihood-ratio test (Kumon and Amari (1983)).

In the above discussion, it is not necessary to refine v̂ to be a higher-order asymptotic ancillary. The curvature-direction components H^{(e)}_{abκ} ṽ^κ are important, and the other components play no role. Hence, we may say that v̂ is useful not because it is (higher-order) ancillary but because it recovers necessary information. It seems that we need a more fundamental study of the invariant structures of a model to elucidate the conditionality principle and ancillarity (see Kariya (1983), Barndorff-Nielsen (1987)). There are many interesting discussions in Efron and Hinkley (1978), Hinkley (1980), Cox (1980), Barndorff-Nielsen (1980). See also Amari (1985).
5. FIBRE-BUNDLE THEORY OF STATISTICAL MODELS

5.1 Hilbert bundle of a statistical model
In order to treat general statistical models other than curved exponential families, we need the notion of the fibre bundle of a statistical model. Let M = {q(x,u)} be a general regular m-dimensional statistical model parametrized by u = (u^a). To each point u ∈ M, we associate a linear space H_u consisting of functions r(x) in x defined by

H_u = {r(x) | E_u[r(x)] = 0, E_u[r²(x)] < ∞},   (5.1)

where E_u denotes the expectation with respect to the distribution q(x,u). Intuitively, each element r(x) ∈ H_u denotes a direction of deviation of the distribution q(x,u), as follows. Let ε q̃(x) be a small disturbance of q(x,u), where ε is a small constant, yielding another distribution q(x,u) + ε q̃(x), which does not necessarily belong to M. Here, ∫ q̃(x) dP = 0 should be satisfied. The logarithm is written as

log{q(x,u) + ε q̃(x)} ≈ ℓ(x,u) + ε q̃(x)/q(x,u),

where ℓ(x,u) = log q(x,u). If we put

r(x) = q̃(x)/q(x,u),

it satisfies E_u[r(x)] = 0. Hence, r(x) ∈ H_u denotes the deviation of q(x,u) in the direction q̃(x) = r(x) q(x,u). The condition E_u[r²] < ∞ implies that we consider only deviations having a second moment. (Note that, given r(x) ∈ H_u, the function

q(x,u) + ε r(x) q(x,u)
does not necessarily represent a probability density function, because the positivity condition

q(x,u) + ε r(x) q(x,u) > 0

might be broken for some x even when ε is an infinitesimally small constant.)
We can introduce an inner product in the linear space H_u by

⟨r(x), s(x)⟩_u = E_u[r(x) s(x)]

for r(x), s(x) ∈ H_u. Thus, H_u is a Hilbert space. Since the tangent vectors ∂_a ℓ(x,u), which span T_u(M), satisfy E[∂_a ℓ] = 0 and E[(∂_a ℓ)²] = g_{aa}(u) < ∞, they belong to H_u. Indeed, the tangent space T_u(M) of M at u is a linear subspace of H_u, and the inner product defined in T_u(M) is compatible with that in H_u. Let N_u be the orthogonal complement of T_u in H_u. Then, H_u is decomposed into the direct sum

H_u = T_u + N_u.
The aggregate of all the H_u's attached to every u ∈ M, with a suitable topology,

H(M) = ∪_{u∈M} H_u,   (5.2)

is called the fibre bundle with base space M and fibre space H_u. Since the fibre space is a Hilbert space, it is called a Hilbert bundle of M. It should be noted that H_u and H_{u'} are different Hilbert spaces when u ≠ u'. Hence, it is convenient to establish a one-to-one correspondence between H_u and H_{u'} when u and u' are neighboring points in M. When the correspondence is affine, it is called an affine connection. Let us assume that a vector r(x) ∈ H_u at u corresponds to r(x) + dr(x) ∈ H_{u+du} at a neighboring point u + du, where d denotes an infinitesimally small change. From

E_{u+du}[r(x) + dr(x)] = ∫ {q(x,u) + dq(x,u)}{r(x) + dr(x)} dP
                      = E_u[r] + E_u[dr(x) + ∂_a ℓ(x,u) r(x) du^a] = 0

and E_u[r] = 0, we see that dr(x) must satisfy

E_u[dr] = − E[∂_a ℓ r] du^a,

where we neglected higher-order terms. This leads us to the following definition of the α-connection: when dr(x) is given by

dr(x) = − (1+α)/2 E[∂_a ℓ r] du^a − (1−α)/2 ∂_a ℓ r du^a,   (5.3)

the correspondence is called the α-connection. More formally, the α-connection is given by the following α-covariant derivative ∇^{(α)}. Let r(x,u) be a vector field, which attaches a vector r(x,u) to every point u ∈ M. Then, the rate of the intrinsic change of the vector r(x,u), as u changes in the direction ∂_a, is given by the α-covariant derivative

∇_a^{(α)} r = ∂_a r(x,u) − (1+α)/2 E_u[∂_a r] + (1−α)/2 ∂_a ℓ r,   (5.4)

where E[∂_a ℓ r] = − E[∂_a r] is used. The α-covariant derivative in the direction A = A^a ∂_a ∈ T_u(M) is given by

∇_A^{(α)} r = A^a ∇_a^{(α)} r.

The 1-connection is called the exponential connection, and the −1-connection is called the mixture connection.
When we attach the tangent space T_u(M) to each point u ∈ M instead of attaching the Hilbert space H_u, we have a smaller aggregate

T(M) = ∪_{u∈M} T_u(M),

which is a subset of H(M), called the tangent bundle of M. We can define an affine connection in T(M) by introducing an affine correspondence between neighboring T_u and T_{u'}. When an affine connection is given in H(M) such that r ∈ H_u corresponds to r + dr ∈ H_{u+du}, it naturally induces an affine connection in T(M) such that r ∈ T_u(M) ⊂ H_u corresponds to the orthogonal projection of r + dr ∈ H_{u+du} to T_{u+du}(M). It can easily be shown that the geometry of M is indeed that of T(M), so that the α-connection of T(M) or M, which we have defined in Chapter 2, is exactly the one which the present α-connection of H(M) naturally induces. Hence, the α-geometry of H(M) is a natural extension of that of M.

Let u = u(t) be a curve in M. A vector field r(x,t) ∈ H_{u(t)} defined along the curve is said to be α-parallel when

∇_t^{(α)} r = ṙ − (1+α)/2 E_u[ṙ] + (1−α)/2 ℓ̇ r = 0   (5.5)
is satisfied, where ṙ denotes ∂r/∂t, etc. A vector r_1(x) ∈ H_{u_1} is the α-parallel shift of r_0(x) ∈ H_{u_0} along a curve u(t) connecting u_0 = u(t_0) and u_1 = u(t_1), when r_0(x) = r(x,t_0) and r_1(x) = r(x,t_1) in the solution r(x,t) of (5.5).

The parallel shift of a vector r(x) from u to u' in general depends on the curve u(t) along which the parallel shift takes place. When and only when the curvature of the connection vanishes, the shift is defined independently of the curve connecting u and u'. We can prove that the curvature of H(M) always vanishes for the α = ±1 connections, so that the e-parallel shift (α = 1) and the m-parallel shift (α = −1) can be performed from a point u to another point u' independently of the curve. Let ^{(e)}π_u^{u'} and ^{(m)}π_u^{u'} be the e- and m-parallel shift operators from u to u'. Then, we can prove the following important theorem.

Theorem 5.1. The exponential and mixture connections of H(M) are curvature-free. Their parallel shift operators are given, respectively, by

^{(e)}π_u^{u'} r(x) = r(x) − E_{u'}[r(x)],   (5.6)
^{(m)}π_u^{u'} r(x) = {q(x,u)/q(x,u')} r(x).   (5.7)

The e- and m-connections are dual in the sense of

⟨r, s⟩_u = ⟨^{(e)}π_u^{u'} r, ^{(m)}π_u^{u'} s⟩_{u'},

where ⟨·,·⟩_u is the inner product at u.
Proof. Let c: u(t) be a curve connecting two points u = u(0) and u' = u(1). Let r^{(α)}(x,t) be an α-parallel vector field defined along the curve c. Then, it satisfies (5.5). When α = 1, it reduces to

ṙ^{(e)}(x,t) = E_{u(t)}[ṙ^{(e)}(x,t)].

Since the right-hand side does not depend on x, the solution of this equation with the initial condition r(x) = r^{(e)}(x,0) is given by

r^{(e)}(x,t) = r(x) + a(t),

where a(t) is determined from

E_{u(t)}[r^{(e)}(x,t)] = 0

as

a(t) = − E_{u(t)}[r(x)].

This yields (5.6), where we put u(1) = u'. Since E_{u'}[r(x)] does not depend on the path connecting u and u', the exponential connection is curvature-free. Similarly, when α = −1, (5.5) reduces to

ṙ^{(m)}(x,t) + r^{(m)}(x,t) ℓ̇(x,u(t)) = 0.

The solution is

r^{(m)}(x,t) q(x,u(t)) = a(x),

which yields (5.7). This shows that the mixture connection is also curvature-free. The duality relation is directly checked from (5.6) and (5.7).
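The duality check, written out in one line (supplied here for completeness): by (5.6) and (5.7), and using E_u[s] = 0,

⟨^{(e)}π_u^{u'} r, ^{(m)}π_u^{u'} s⟩_{u'} = ∫ q(x,u') {r − E_{u'}[r]} {q(x,u)/q(x,u')} s dP = ∫ q(x,u) r s dP − E_{u'}[r] E_u[s] = ⟨r, s⟩_u.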
We have defined the imbedding α-curvature H^{(α)}_{abκ} of a curved exponential family. The concept of the imbedding curvature (which is sometimes called the relative or Euler-Schouten curvature) can be defined for a general M as follows. Let P_N be the projection operator of H_u to N_u, the orthogonal complement of T_u(M) in H_u. Then, the imbedding α-curvature of M is a function in x defined by

H^{(α)}_{ab}(x,u) = P_N ∇_a^{(α)} ∂_b ℓ(x,u),

which is an element of N_u ⊂ H_u. The square of the α-curvature is given by

(H_M^{(α)2})_{ab} = ⟨H^{(α)}_{ac}, H^{(α)}_{bd}⟩ g^{cd}.   (5.8)

The scalar γ² = g^{ab} (H_M^{(1)2})_{ab} is the statistical curvature defined by Efron in the one-dimensional case.
5.2 Exponential bundle
Given a statistical model M = {q(x,u)}, we define the following elements in H_u,

X_{1a} = ∂_a ℓ(x,u),
X_{2ab} = ∇_a^{(α)} X_{1b},
...
X_{k a_1 a_2...a_k} = ∇_{a_1}^{(α)} X_{k−1, a_2...a_k},

and attach to each point u ∈ M the vector space T_u^{(α,k)} spanned by these vectors, where we assume that they are linearly independent. The aggregate

T^{(α,k)}(M) = ∪_{u∈M} T_u^{(α,k)}   (5.9)
with suitable topology is then called the α-tangent bundle of degree k of M. All the α-tangent bundles of degree 1 are the same, being merely the tangent bundle T(M) of M. In the present paper, we treat only the exponential (i.e., α = 1) tangent bundle of degree 2, which we call the local exponential bundle of degree 2, although it is immediate to generalize our results to the general α-bundle of degree k. Note that when we replace the covariant derivative ∇^{(α)} by the partial derivative ∂, we have the so-called jet bundle. Its structures are the same as those of the exponential bundle, because ∇^{(e)} reduces to ∂ in the logarithmic expression ∂_a ℓ(x,u) of tangent vectors.
The space T_u^{(1,2)}, which we will also more briefly denote by T_u^{(2)}, is spanned by the vectors X_1 and X_2, where X_1 consists of the m vectors

X_a(x,u) = ∂_a ℓ(x,u),  a = 1,...,m,

and X_2 consists of the m(m+1)/2 vectors

X_{ab}(x,u) = ∇_a^{(e)} ∂_b ℓ = ∂_a ∂_b ℓ(x,u) + g_{ab}(u),  a, b = 1,...,m.
(See Fig. 8.) We often omit the indices a or a, b in the notation X_a or X_{ab}, briefly writing X_1 or X_2. Since the space T_u^{(2)} consists of all the linear combinations of X_1 and X_2, it is written as

T_u^{(2)} = {θ^i X_i(x,u)},

where the coefficients θ = (θ^1, θ^2) consist of θ^1 = (θ^a), θ^2 = (θ^{ab}), and

θ^i X_i = θ^1 X_1 + θ^2 X_2 = θ^a X_a + θ^{ab} X_{ab}.

The set X_i forms a basis of the linear space T_u^{(2)}. The metric tensor of T_u^{(2)} is then given by

g_{ij} = ⟨X_i, X_j⟩ = E_u[X_i(x,u) X_j(x,u)].

Here, g_{11} denotes the m × m matrix

g_{11} = ⟨X_a, X_b⟩ = E_u[X_a X_b] = g_{ab},
Figure 8
which is the metric tensor of the tangent space T_u(M) of M. The component g_{21} = g_{12} represents

g_{21} = g_{abc} = ⟨X_{ab}, X_c⟩ = Γ^{(e)}_{abc}.

Similarly, g_{22} is a quantity having four indices,

g_{22} = ⟨X_{ab}, X_{cd}⟩.
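A quick check of the construction (immediate, though not made explicitly in the text): if M is itself an exponential family in a natural parameter u, then ∂_a ∂_b ℓ(x,u) = −∂_a ∂_b ψ(u) = −g_{ab}(u), so that X_{ab} = 0 identically; the degree-2 part of the bundle measures precisely the departure from exponentiality.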
The exponential connection can be introduced naturally in the local exponential fibre bundle T^{(2)}(M) of degree 2 by the following principle:

1) The origin of T_{u+du}^{(2)} corresponds to the point X_1 du = X_a(x,u) du^a ∈ T_u^{(2)}.

2) The basis vector X_i(x, u+du) ∈ T_{u+du}^{(2)} is mapped to T_u^{(2)} by 1-parallelly shifting it in the Hilbert bundle H(M) and then projecting it to T_u^{(2)}.

We thus have the affine correspondence of elements in T_{u+du}^{(2)} and T_u^{(2)},

X_i(u + du) ↦ X_i(u) + dX_i = X_i(u) + Γ_{ai}^j X_j(u) du^a,

where the Γ_{ai}^j are the coefficients of the exponential affine connection in T^{(2)}(M). The coefficients are given from the above principle 2) by

Γ_{a1}^1 = 0,  Γ_{a1}^2 = δ,  Γ_{a2}^i = g^{ij} E[X_j ∂_a ∂_b ∂_c ℓ(x,u)],   (5.10)

where δ denotes the Kronecker symbol identifying ∇_a^{(e)} X_b with X_{ab}. We remark again that the index i = 1 stands for a single index b, for example, and i = 2 stands for a pair of indices, for example b, c.
Let $\theta(u) = \theta^i(u)X_i(x,u) \in T_u^{(2)}$ be a point in $T_u^{(2)}$. We can shift the point $\theta(u) \in T_u^{(2)}$ to a point $\theta(u') \in T_{u'}^{(2)}$ belonging to another point u' along a curve u = u(t). Since the point $\theta^i(u)X_i(u) \in T_u^{(2)}$ corresponds to the point $\theta^i(u+du)(X_i + dX_i) + X_a\,du^a \in T^{(2)}_{u+du}$, where $dX_i$ is determined from the affine connection and the last term $X_a\,du^a$ corresponds to the change in the origin, we have the following equation
$$\dot\theta^i + \Gamma^i_{aj}\theta^j\dot u^a + \delta^i_a\dot u^a = 0\,, \qquad (5.11)$$
whose solution θ(t) represents the corresponding point in $T^{(2)}_{u(t)}$, where $\dot\theta^i = \frac{d}{dt}\theta^i(t)$. Note that we are here talking about the parallel shift of a point in affine spaces, and not about the parallel shift of a vector in linear spaces, where in the latter case the origin is always fixed.
Let u' be a point close to u. Let θ(u';u) be the point in $T_u^{(2)}$ corresponding to the origin θ(u') = 0 of the affine space $T_{u'}^{(2)}$. The map depends in general on the curve connecting u and u'. However, when |u' − u| is small, the point θ(u';u) is given by
$$\theta^i(u';u) = \theta^i_1(u'-u) + \tfrac12\,\theta^i_2(u'-u)^2 + O(|u'-u|^3)\,.$$
Hence, if we neglect the term of order $|u'-u|^3$, the map does not depend on the route. In component form,
$$\theta_1(u';u) = \theta^a(u';u) = u'^a - u^a\,, \qquad \theta_2(u';u) = \theta^{bc}(u';u) = \tfrac12(u'^b - u^b)(u'^c - u^c)\,, \qquad (5.12)$$
where we neglected the term of order $|u'-u|^3$. Since the origin θ(u') = 0 of $T_{u'}^{(2)}$ can be identified with the point u' (the distribution q(x,u')) in the model M, this shows that, in the neighborhood of u, the model M is approximately represented in $T_u^{(2)}$ as a paraboloid given by (5.12).
Let us consider the exponential family $E_u = \{p(x,\theta;u)\}$ depending on u, whose density function is given by
$$p(x,\theta;u) = q(x,u)\exp\{\theta^i X_i(x,u) - \psi(\theta)\}\,, \qquad (5.13)$$
where θ is the natural parameter. We can identify the affine space $T_u^{(2)}$ with the exponential family $E_u$ by letting the point $\theta = \theta^i X_i \in T_u^{(2)}$ represent the distribution $p(x,\theta;u) \in E_u$ specified by θ.

Figure 9

We call $E_u$ the local exponential
family approximating M at u. The aggregate
$$E^{(2)}(M) = \bigcup_{u\in M} E_u$$
with suitable topology is called the fibre bundle of local exponential families of degree 2 of M. The metric and connection may be defined from the resulting identification of $E^{(2)}(M)$ with $T^{(2)}(M)$. The distribution q(x,u) exactly corresponds to the distribution p(x,0;u) in $E_u$, i.e., to the origin θ = 0 of $E_u$ or $T_u^{(2)}$. Hence, the point θ = θ(u';u), which is the parallel shift of θ(u') = 0 at $E_{u'}$, is the counterpart in $E_u$ of $q(x,u') \in M$; i.e., the distribution $p\{x,\theta(u';u);u\} \in E_u$ is an approximation in $E_u$ of $q(x,u') \in M$. For a fixed u, the distributions
$$\hat M_u = \{q(x,u';u)\}\,, \qquad q(x,u';u) = p\{x,\theta(u';u);u\}$$
form an m-dimensional curved exponential family imbedded in $E_u$ (Fig. 9). The point of this construction is that M is approximated by the curved exponential family $\hat M_u$ in the neighborhood of u. The tangent spaces $T_u(M)$ of M and $T_u(\hat M_u)$ of $\hat M_u$ exactly correspond at u, so that their metric structures are the same at u. Moreover, the squares of the imbedding curvatures are the same for both M and $\hat M_u$ at u, because the curvature is obtained from the second covariant
derivative of $X_a = \partial_a \ell$. This suggests that we can solve statistical inference problems in the curved exponential family $\hat M_u$ instead of in M, provided u is sufficiently close to the true parameter $u_0$.
5.3. Statistical inference in a local exponential family
Given N independent observations $x_{(1)},\dots,x_{(N)}$, we can define the observed point $\hat X(u) \in E_u$, for each u, by
$$\hat X_i(u) = \bar X_i(u) = \frac1N \sum_{j=1}^{N} X_i(x_{(j)},u)\,. \qquad (5.14)$$
We consider estimators based on the statistics $\hat X(u)$. We temporarily fix a point u, and approximate the model M by $\hat M_u$, which is a curved exponential family imbedded in $E_u$. Let e be a mapping from $E_u$ to $\hat M_u$ that maps the observed $\hat X(u) \in E_u$ to the estimated value e(u) in $\hat M_u$ when u is fixed; we denote this by
$$e(u) = e\{\hat X(u);u\}\,.$$
The estimated value depends on the point u at which M is approximated by $\hat M_u$. The estimator e defines the associated ancillary family $A = \{A_u(u')\}$, $u' \in \hat M_u$, for every u, where
$$A_u(u') = e^{-1}(u';u) = \{\eta \in E_u \mid e(\eta;u) = u'\}\,.$$
When the fixed u is equal to the true parameter $u_0$, $\hat M_{u_0}$ approximates M very well in the neighborhood of $u_0$. However, we do not know $u_0$. To get an estimator $\hat u$ from e, let us consider the equation
$$e\{\hat X(u);u\} = u\,.$$
The solution $\hat u$ of this equation is a statistic. It implies that, when M is approximated at $\hat u$, the value of the estimator e at $E_{\hat u}$ is exactly equal to $\hat u$. The characteristics of the estimator $\hat u$ associated with the estimator e in $\hat M_u$ are given by the following geometrical theorems, which are direct extensions of the theorems in the curved exponential family.
Theorem 5.2. An estimator $\hat u$ derived from e is first-order efficient when the associated ancillary family A is orthogonal to $\hat M_u$. A first-order efficient estimator is second-order efficient.
Theorem 5.3. The third-order term of the covariance of a bias-corrected efficient estimator is given by
$$g_{3ab} = \tfrac12\,(\Gamma^{(m)})^2_{ab} + (H^{(e)}_M)^2_{ab} + \tfrac12\,(H^{(m)}_A)^2_{ab}\,.$$
The bias-corrected maximum likelihood estimator is third-order efficient, because the associated ancillary family has vanishing mixture curvature.
The proof is obtained along the lines sketched in the following. The true distribution $q(x,u_0)$ is identical with the distribution $q(x,u_0;u_0)$ at $u_0$ of the curved exponential family $\hat M_{u_0}$. Moreover, when we expand q(x,u) and $q(x,u;u_0)$ at $u_0$ in Taylor series, they exactly coincide up to the terms of $u-u_0$ and $(u-u_0)^2$, because $E_u$ is composed of $X_1$ and $X_2$. Hence, if the estimation is performed in $E_{u_0}$, we can easily prove that Theorems 5.2 and 5.3 hold, because the Edgeworth expansion of the distribution of $\hat u$ is determined from the expansion of $\ell(x,u)$ up to the second order if the bias correction is used. However, we do not know the true $u_0$, so that the estimation is performed in $E_{\hat u}$. In order to evaluate the estimator $\hat u$, we can map $E_{\hat u}$ (and $\hat M_{\hat u}$) to $E_{u_0}$ by the exponential connection. In estimating the true parameter, we first summarize the N observations into $\hat X(u)$, which is a vector function of u, and then decompose it into the statistics $\hat X(\hat u) = \{\hat X_1(\hat u),\hat X_2(\hat u)\}$, where $e(\hat X(\hat u);\hat u) = \hat u$. The $\hat X_2(\hat u)$ becomes an asymptotic ancillary. When the estimator is the m.l.e., we have $\hat X_1(\hat u) = 0$ and $\hat X_{2ab}(\hat u) = H^{(e)}_{ab\kappa}\hat v^\kappa$ in $\hat M_{\hat u}$. The theorems can be proved by calculating the Edgeworth expansion of the joint distribution of $\hat X(\hat u)$ or $(\hat u,\hat v)$. The result is the same as before.
We have assumed that our estimator e is based on $\hat X(u)$. When a general estimator
$$u' = f(x_{(1)},\dots,x_{(N)})$$
is given, we can construct the related estimator given by the solution of $e_f(\hat X;u) = u$, where
$$e_f(\hat X;u) = E_u[f(x_{(1)},\dots,x_{(N)}) \mid \hat X(u) = \hat X]\,.$$
Obviously, $e_f(\hat X;u)$ is the conditional expectation of u' given $\hat X(u) = \hat X$. By virtue of the asymptotic version of the Rao-Blackwell theorem, the behavior of $e_f$ is equal to or better than that of u' up to the third order. This guarantees the
validity of the present theory.
The problem of testing the null hypothesis $H_0: u = u_0$ against $H_1: u \ne u_0$ can be solved immediately in the local exponential family $E_{u_0}$. When $H_0$ is not simple, we can also construct a similar theory by the use of the statistics $\hat u$ and $\hat X(\hat u)$. It is possible to evaluate the behaviors of various third-order efficient tests. The result is again the same as before.
We finally treat the problem of getting a better estimator $\hat u$ by gathering asymptotically sufficient statistics $\hat X(u)$ from a number of independent samples which are subject to the same distribution $q(x,u_0)$ in the same model. To be specific, let $x_{(1)1},\dots,x_{(1)N}$ and $x_{(2)1},\dots,x_{(2)N}$ be two independent samples, each consisting of N independent observations. Let $\hat u_1$ and $\hat u_2$ be the m.l.e.'s based on the respective samples. Let $\hat X_{(i)}(\hat u_i)$ be the observed point in $E_{\hat u_i}$, i = 1, 2. The statistic $\hat X_{(i)}$ consists of the two components $\hat X_{(i)1} = (\hat X_{(i)a})$ and $\hat X_{(i)2} = (\hat X_{(i)ab})$. Since $\hat u_i$ is the m.l.e.,
$$\hat X_{(i)1}(\hat u_i) = 0$$
is satisfied. The statistic $\hat u_i$ carries the whole information of order N included in the sample, and the statistic $\hat X_2(\hat u_i)$, which is asymptotically ancillary, carries the whole information of order 1 together with $\hat u_i$. Obviously $\hat X_{(i)2}$ is the curvature-direction component statistic, $\hat X_{(i)2ab} = H^{(e)}_{ab\kappa}\hat v^\kappa$, in the curved exponential family $E_{\hat u_i}$.

Given the two sets of statistics $(\hat u_i, \hat X_{(i)2}(\hat u_i))$, i = 1, 2, which summarize the original data, the problem is to obtain an estimator $\hat u$ which is third-order efficient for the 2N observations. Since the two statistics $\hat X(\hat u_i)$ give points $\hat X_{(i)} = \hat X(\hat u_i)$ in the different $E_{\hat u_i}$, in order to summarize them it is necessary to shift these points in parallel to a common $E_{u'}$. Then, we can average the two observed points in the common $E_{u'}$ and get an estimator $\hat u$ in this $E_{u'}$. The parallel affine shift of a point in $E_u$ to a different $E_{u'}$ has already been given by (5.11) in the θ-coordinate system. This can be rewritten in the η-coordinate system. In particular, when du = u − u' is of order $N^{-1/2}$ and η(u) is also of order $N^{-1/2}$, the parallel affine shift of $\eta(u) \in E_u$ to $E_{u'}$ is
given in the following expanded form for $\eta = (\eta_1,\eta_2)$, $\eta_1 = (\eta_a)$ and $\eta_2 = (\eta_{ab})$:
$$\eta_a(u') = \eta_a(u) + g_{ab}\,du^b - \eta_{ab}(u)\,du^b + \tfrac12\,\Gamma^{(m)}_{abc}\,du^b du^c + O(N^{-3/2})\,,$$
$$\eta_{ab}(u') = \eta_{ab}(u) + O(N^{-1})\,.$$
Now, we shift the two observed points $\hat X_{(i)}(\hat u_i)$ to a common $E_{u'}$, where u' may be any point between $\hat u_1$ and $\hat u_2$, because the same estimator $\hat u$ is obtained up to the necessary order by using any $E_{u'}$. Here, we simply put
$$u' = (\hat u_1 + \hat u_2)/2\,,$$
and let d be
$$d = (\hat u_1 - \hat u_2)/2\,.$$
Then, the point $\hat X_{(1)}(\hat u_1)$ is shifted to $\tilde X_{(1)}(u')$ of $E_{u'}$ as
$$\tilde X_{(1)a} = \hat X_{(1)a} + g_{ab}d^b - \hat X_{(1)ab}d^b + \tfrac12\,\Gamma^{(m)}_{abc}d^b d^c + O(N^{-3/2})\,,$$
and we get similar expressions for $\tilde X_{(2)}$ by changing d to −d. Since $\hat u_i$ is the m.l.e., $\hat X_{(i)1} = 0$. The average of $\tilde X_{(1)}$ and $\tilde X_{(2)}$ in the common $E_{u'}$ gives the estimated observed point $\tilde X(u') = (\tilde X_1, \tilde X_2)$ from the pooled statistics $(\hat u_i, \hat X_{(i)2}(\hat u_i))$:
$$\tilde X_{1a} = \tfrac12\,(\hat X_{(2)ab} - \hat X_{(1)ab})d^b + \tfrac12\,\Gamma^{(m)}_{abc}d^b d^c\,,$$
$$\tilde X_{2ab} = \tfrac12\,(\hat X_{(2)ab} + \hat X_{(1)ab})\,.$$
By taking the m.l.e. in $E_{u'}$ based on $(\tilde X_1, \tilde X_2)$, we have the estimator
$$\hat u^a = u'^a + \tfrac12\,g^{ab}(\hat X_{(2)bc} - \hat X_{(1)bc})d^c + \tfrac12\,g^{ab}\Gamma^{(m)}_{cdb}d^c d^d\,,$$
which indeed coincides with that obtained from the equation $e(\hat u) = \hat u$ up to the third order. Therefore, the estimator $\hat u$ is third-order efficient, so that it coincides with the m.l.e. based on all 2N observations up to the necessary order.

The above result can be generalized to the situation where k asymptotically sufficient statistics $(\hat u_i, \hat X_{(i)2}(\hat u_i))$ are given in $E_{\hat u_i}$, i = 1,...,k, $\hat u_i$ being the m.l.e. from $N_i$ independent observations. Let
$$u' = \sum_i N_i\hat u_i \Big/ \sum_i N_i\,.$$
Moreover, we define the following matrices:
$$G_{iab} = N_i\{g_{ab}(u') + \Gamma^{(m)}_{abc}(\hat u_i^c - u'^c) - \hat X_{(i)ab}\}\,, \qquad G_{ab} = \sum_i G_{iab}\,, \qquad (G^{ab}) = (G_{ba})^{-1}\,.$$
Then, we have the following theorem.
Theorem 5.4. The bias-corrected version of the estimator defined by the weighted average
$$\hat u^a = G^{ab}\sum_i G_{ibc}\,\hat u_i^c$$
is third-order efficient.
This theorem shows that the best estimator is given by the weighted average of the estimators from the partial samples, where the weights are given by $G_{iab}$. It is interesting that $G_{iab}$ is different from the observed Fisher information matrix
$$J_{iab} = -\sum \partial_a\partial_b\,\ell(x_{(i)},u')\,.$$
They are related by
$$G_{iab} = J_{iab} + \tfrac12\,N_i\,\Gamma^{(m)}_{abc}(\hat u_i^c - u'^c)\,.$$
See Akahira and Takeuchi [1981] and Amari [1985].
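As a concrete illustration of this pooling scheme in its simplest, first-order form, the following Python sketch combines the m.l.e.'s of several independent exponential samples by an observed-information-weighted average and compares the result with the m.l.e. computed from all observations at once. This is a hypothetical numerical check, not part of the original development: the curvature correction $\Gamma^{(m)}$ entering $G_{iab}$ is model-specific and is omitted here, so the weights reduce to the observed informations $J_i$.

    import numpy as np

    rng = np.random.default_rng(0)
    theta0 = 2.0                      # true rate of the Exp(theta) model
    samples = [rng.exponential(1.0/theta0, size=n) for n in (50, 80, 120)]

    # Per-sample m.l.e. and observed information J_i = n_i / theta_i^2
    # (from l_i(theta) = n_i log theta - theta * sum(x)).
    theta_i = np.array([len(x)/x.sum() for x in samples])
    J_i = np.array([len(x)/t**2 for x, t in zip(samples, theta_i)])

    # Information-weighted average of the partial-sample estimators.
    theta_pooled = (J_i * theta_i).sum() / J_i.sum()

    # m.l.e. from the pooled raw data, for comparison.
    all_x = np.concatenate(samples)
    theta_full = len(all_x) / all_x.sum()
    print(theta_pooled, theta_full)   # agree to the expected order

The point of the design is that only the summary statistics $(\hat u_i, J_i)$, not the raw samples, are needed to reconstruct an efficient estimator.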
6. ESTIMATION OF STRUCTURAL PARAMETER IN THE PRESENCE
OF INFINITELY MANY NUISANCE PARAMETERS
6.1. Estimating function and asymptotic variance
Let M = {p(x;θ,ξ)} be a family of probability density functions of a (vector) random variable x specified by two scalar parameters θ and ξ. Let $x_1, x_2,\dots,x_N$ be a sequence of independent observations such that the i-th observation $x_i$ is a realization from the distribution $p(x;\theta,\xi_i)$, where both θ and $\xi_i$ are unknown. In other words, the distributions of the $x_i$ are assumed to be specified by the common fixed but unknown parameter θ and also by the unknown parameter $\xi_i$ whose value changes from observation to observation. We call θ the structural parameter and ξ the incidental or nuisance parameter. The problem is to find the asymptotically best estimator $\hat\theta_N = \hat\theta_N(x_1,x_2,\dots,x_N)$ of the structural parameter θ when the number N of observations is large. The asymptotic variance of a consistent estimator is defined by
$$AV(\hat\theta,\Xi) = \lim_{N\to\infty} V[\sqrt N\,(\hat\theta_N - \theta)]\,, \qquad (6.1)$$
where V denotes the variance and Ξ denotes an infinite sequence Ξ = (ξ₁,ξ₂,...) of the nuisance parameters. An estimator $\hat\theta$ is said to be best in a class C of estimators when its asymptotic variance satisfies, at any θ,
$$AV[\hat\theta,\Xi] \le AV[\hat\theta',\Xi]$$
for all allowable Ξ and for any estimator $\hat\theta' \in C$. Obviously, there does not necessarily exist a best estimator in a given class C.
Now we restrict our attention to some classes of estimators. An estimator $\hat\theta$ is said to belong to the class $C_0$ when it is given by the solution of the equation
$$\sum_{i=1}^{N} y(x_i,\theta) = 0\,,$$
where y(x,θ) is a function of x and θ only, i.e., it does not depend on ξ. The function y is called the estimating function. Let $C_1$ be the subclass of $C_0$ consisting of all the consistent estimators in $C_0$. The following theorem is well known (see, e.g., Kumon and Amari [1984]).
Theorem 6.1. An estimator $\hat\theta \in C_0$ is consistent if and only if its estimating function y satisfies
$$E_{\theta,\xi}[y(x,\theta)] = 0\,, \qquad E_{\theta,\xi}[\partial_\theta y(x,\theta)] \ne 0\,,$$
where $E_{\theta,\xi}$ denotes the expectation with respect to p(x;θ,ξ) and $\partial_\theta = \partial/\partial\theta$. The asymptotic variance of an estimator $\hat\theta \in C_1$ is given by
$$AV(\hat\theta,\Xi) = \lim_{N\to\infty}\,\frac1N\sum_i V[y(x_i,\theta)]\Big/\Big\{\frac1N\sum_i E[\partial_\theta y(x_i,\theta)]\Big\}^2\,,$$
where $\frac1N\sum_i\partial_\theta y(x_i,\theta)$ is assumed to converge to a constant depending on θ and Ξ.
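The content of Theorem 6.1 is easy to check by simulation. The following minimal Python sketch (with an assumed exponential model, so that the score y(x,θ) = 1/θ − x is an estimating function) solves the estimating equation numerically and compares the Monte Carlo variance of $\sqrt N(\hat\theta_N - \theta)$ with the ratio of the two limits appearing in the theorem; for this model both equal θ².

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(1)
    theta0, N, reps = 2.0, 400, 2000

    def y(x, th):                 # estimating function: score of Exp(theta)
        return 1.0/th - x

    est = []
    for _ in range(reps):
        x = rng.exponential(1/theta0, size=N)
        # solve sum_i y(x_i, theta) = 0, i.e. 1/theta = mean(x)
        est.append(brentq(lambda th: y(x, th).sum(), 1e-3, 100.0))
    est = np.asarray(est)

    av_mc = N * est.var()         # Monte Carlo asymptotic variance
    # Theorem 6.1: AV = V[y] / {E[d_theta y]}^2
    #            = (1/theta^2) / (1/theta^2)^2 = theta^2 here
    print(av_mc, theta0**2)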
Let $H_{\theta,\xi}(M)$ be the Hilbert space attached to a point (θ,ξ) ∈ M,
$$H_{\theta,\xi}(M) = \{a(x) \mid E_{\theta,\xi}[a] = 0\,,\ E_{\theta,\xi}[a^2] < \infty\}\,.$$
The tangent space $T_{\theta,\xi}(M) \subset H_{\theta,\xi}(M)$ is spanned by $u(x;\theta,\xi) = \partial_\theta\,\ell(x;\theta,\xi)$ and $v(x;\theta,\xi) = \partial_\xi\,\ell(x;\theta,\xi)$. Let w be
$$w(x;\theta,\xi) = u - \frac{\langle u,v\rangle}{\langle v^2\rangle}\,v\,,$$
where $\langle v^2\rangle = \langle v,v\rangle$. Then, the partial information $g_{\theta\theta\cdot\xi}$ is given by
$$g_{\theta\theta\cdot\xi} = g_{\theta\theta} - g_{\theta\xi}^2/g_{\xi\xi} = \langle w^2\rangle\,,$$
where $g_{\theta\theta} = \langle u^2\rangle$, $g_{\xi\xi} = \langle v^2\rangle$, $g_{\theta\xi} = \langle u,v\rangle$ are the components of the Fisher information matrix. The theorem shows that the estimating function y(x,θ) of a consistent estimator belongs to $H_{\theta,\xi}$ for any ξ. Hence, it can be decomposed as
$$y(x,\theta) = \alpha(\theta,\xi)\,u(x;\theta,\xi) + \beta(\theta,\xi)\,v(x;\theta,\xi) + n(x;\theta,\xi)\,,$$
where n belongs to the orthogonal complement of $T_{\theta,\xi}$ in $H_{\theta,\xi}$, i.e.,
$$\langle u,n\rangle = \langle v,n\rangle = 0\,.$$
The class $C_1$ is often too large to guarantee the existence of the best estimator. A consistent estimator is said to be uniformly informative
(Kumon and Amari, 1984) when its estimating function y(x,θ) can be decomposed as
$$y(x,\theta) = w(x;\theta,\xi) + n(x;\theta,\xi)\,.$$
The class of the uniformly informative estimators is denoted by $C_{UI}$. A uniformly informative estimator satisfies
$$\langle y,w\rangle_{\theta,\xi} = \langle w^2\rangle_{\theta,\xi} = g_{\theta\theta\cdot\xi}(\theta,\xi)\,.$$
Let $C_{IU}$ be the class of the information unbiased estimators introduced by Lindsay [1982], which satisfy a similar relation,
$$\langle y,w\rangle_{\theta,\xi} = \langle y^2\rangle_{\theta,\xi}\,.$$
Note that $\langle y,w\rangle = \langle y,u\rangle$ holds.
Let us define the two quantities
$$\bar g_y(\Xi) = \lim_{N\to\infty}\frac1N\sum_i\langle y(x;\theta,\xi_i)^2\rangle\,,$$
which depends on the estimating function y(x,θ), and
$$\bar g(\Xi) = \lim_{N\to\infty}\frac1N\sum_i g_{\theta\theta\cdot\xi}(\theta,\xi_i)\,,$$
which latter is common to all the estimators. Then, the following theorem gives a new bound for the asymptotic variance in the class $C_{IU}$ (see Kumon and Amari (1984)).
Theorem 6.2. For an information unbiased estimator $\hat\theta$,
$$AV[\hat\theta;\Xi] = \bar g^{-1} + \bar g^{-2}\,\bar g_n\,,$$
where $\bar g_n$ is defined as $\bar g_y$ with the orthogonal part n of y in place of y.
We go beyond this theory by the use of Hilbert bundle theory.
6.2. Information, nuisance and orthogonal subspaces
We have already defined the exponential and mixture covariant derivatives $\nabla^{(e)}$ and $\nabla^{(m)}$ in the Hilbert bundle $H(M) = \bigcup_{\theta,\xi} H_{\theta,\xi}(M)$. A field $r(x;\theta,\xi) \in H_{\theta,\xi}(M)$ defined at all (θ,ξ) is said to be e-invariant when $\nabla^{(e)}_\xi r = 0$ holds. A field $r(x;\theta,\xi)$ is said to be strongly e-invariant (se-invariant) when r does not depend on ξ. An se-invariant field is e-invariant. An estimating function y(x,θ) belonging to $C_1$ is an se-invariant field, and conversely, an se-invariant y(x,θ) gives a consistent estimator, provided $\langle u,y\rangle \ne 0$. Hence, the problem of the existence of a consistent estimator in $C_0$ reduces to
the problem of the existence of an se-invariant field in the Hilbert bundle
H(M).
We next define the subspace $H^T_{\theta,\xi}$ of $H_{\theta,\xi}$ by
$$H^T_{\theta,\xi} = \{(\Pi^{(m)})_{\xi'\to\xi}\,a(x) \mid a(x) \in T_{\theta,\xi'}\}\,,$$
i.e., the subspace composed of all the m-parallel shifts to (θ,ξ) of the vectors belonging to the tangent spaces $T_{\theta,\xi'}$ at all (θ,ξ') with common θ. Then, $H_{\theta,\xi}$ is decomposed into the direct sum
$$H_{\theta,\xi} = H^T_{\theta,\xi} \oplus H^O_{\theta,\xi}\,,$$
where $H^O_{\theta,\xi}$ is the orthogonal complement of $H^T_{\theta,\xi}$. We call $H^O_{\theta,\xi}$ the orthogonal subspace at (θ,ξ). We next define the nuisance subspace $H^N_{\theta,\xi}$ at (θ,ξ), spanned by the m-parallel shifts $(\Pi^{(m)})_{\xi'\to\xi}\,v$ from (θ,ξ') to (θ,ξ) of the ξ-score vectors $v(x;\theta,\xi') = \partial_\xi\,\ell$ for all ξ'. It is a subspace of $H^T_{\theta,\xi}$, so that we have the decomposition
$$H^T_{\theta,\xi} = H^I_{\theta,\xi} \oplus H^N_{\theta,\xi}\,,$$
where $H^I_{\theta,\xi}$ is the orthogonal complement of $H^N_{\theta,\xi}$ in $H^T_{\theta,\xi}$. It is called the information subspace at (θ,ξ). Hence, any vector $r(x;\theta,\xi) \in H_{\theta,\xi}$ can uniquely be decomposed into the sum
$$r(x;\theta,\xi) = r^I(x;\theta,\xi) + r^N(x;\theta,\xi) + r^O(x;\theta,\xi)\,, \qquad (6.2)$$
where $r^I \in H^I_{\theta,\xi}$, $r^N \in H^N_{\theta,\xi}$ and $r^O \in H^O_{\theta,\xi}$ are called respectively the I-, N- and O-parts of r.
We now define some important vectors. Let us first decompose the θ-score vector $u = \partial_\theta\,\ell \in H_{\theta,\xi}$ into the three components. Let $u^I(x;\theta,\xi) \in H^I_{\theta,\xi}$ be the I-part of the θ-score $u \in T_{\theta,\xi}$. We next define the vector
$$\bar u(x;\theta,\xi;\xi') = (\Pi^{(m)})_{\xi'\to\xi}\,u(x;\theta,\xi') \qquad (6.3)$$
in $H_{\theta,\xi}$, which is the m-shift of the θ-score vector $u \in T_{\theta,\xi'}$ from (θ,ξ') to (θ,ξ). Let $\bar u^I$ be its I-part. The vectors $\bar u^I(x;\theta,\xi;\xi')$ in $H^I_{\theta,\xi}$, where (θ,ξ) is fixed, form a curve parametrized by ξ' in the information subspace $H^I_{\theta,\xi}$. When all of the $g^{\theta\theta}(\xi')\,\bar u^I(x;\theta,\xi;\xi') \in H^I_{\theta,\xi}$ lie in a hyperplane in $H^I_{\theta,\xi}$ for all ξ', we say that the $\bar u^I$ are coplanar. In this case, there exists a vector $w^I \in H^I_{\theta,\xi}$ for which
$$\langle w^I,\ \bar u^I(x;\theta,\xi;\xi')\rangle = g_{\theta\theta\cdot\xi}(\xi') \qquad (6.4)$$
holds for any ξ'. The vector $w^I(x;\theta,\xi) \in H^I_{\theta,\xi}$ is called the information vector. When it exists, it is unique.
6.3. Existence theorems and optimality theorems
It is easy to show that a field $r(x;\theta,\xi)$ is se-invariant if its nuisance part $r^N$ vanishes identically. Hence, any estimating function $y(x,\theta) \in C_1$ is decomposed into the sum
$$y(x,\theta) = y^I(x;\theta,\xi) + y^O(x;\theta,\xi)\,.$$
We can prove the following existence theorems.
Theorem 6.3. The class $C_1$ of the consistent estimators is nonempty if the information subspace $H^I_{\theta,\xi}$ includes a non-zero vector.
Theorem 6.4. The class $C_{UI}$ of the uniformly informative estimators in $C_1$ is nonempty if the $\bar u^I(x;\theta,\xi;\xi')$ are coplanar. All the uniformly informative estimators have the identical I-part $y^I(x;\theta,\xi)$, which is equal to the information vector $w^I(x;\theta,\xi)$.
Outline of proof of Theorem 6.3. When the class $C_1$ is nonempty, there exists an estimating function y(x,θ) in $C_1$. It is decomposed as
$$y(x,\theta) = y^I(x;\theta,\xi) + y^O(x;\theta,\xi)\,.$$
Since $y^O$ is orthogonal to the tangent space $T_{\theta,\xi}$, we have
$$\langle y^O, u\rangle = 0\,.$$
By differentiating $\langle y(x,\theta)\rangle = 0$ with respect to θ, we have
$$0 = \langle\partial_\theta y\rangle + \langle y,u\rangle = \langle\partial_\theta y\rangle + \langle y^I,u\rangle\,.$$
Since $\langle\partial_\theta y\rangle \ne 0$, we see that $y^I(x;\theta,\xi) \ne 0$, proving that $H^I_{\theta,\xi}$ includes a non-zero vector. Conversely, assume that there exists a non-zero vector a(x;θ,ξ) in $H^I_{\theta,\xi}$ for some ξ. Then, we define a vector
$$y(x;\theta,\xi') = (\Pi^{(e)})_{\xi\to\xi'}\,a(x,\theta) = a(x,\theta) - E_{\theta,\xi'}[a]$$
in each $H_{\theta,\xi'}$, by shifting a(x,θ) in parallel in the sense of the exponential connection. By differentiating $\langle a\rangle_{\theta,\xi} = E_{\theta,\xi}[a]$ with respect to ξ, we have
$$\partial_\xi\langle a\rangle = \langle\partial_\xi a\rangle + \langle a,v\rangle = 0\,,$$
because a does not include ξ and a is orthogonal to $H^N_{\theta,\xi}$. This proves
$$E_{\theta,\xi'}[a] = 0\,.$$
Hence, the above y(x;θ,ξ') does not depend on ξ', so that it is an estimating function belonging to $C_1$. Hence $C_1$ is nonempty, proving Theorem 6.3.
Outline of proof of Theorem 6.4. Assume that there exists an estimating function y(x,θ) belonging to $C_{UI}$. Then, we have
$$\langle y,\ u(x;\theta,\xi')\rangle_{\theta,\xi'} = g_{\theta\theta\cdot\xi}(\xi')\,,$$
because of $\langle y,v\rangle = 0$. Hence, when we shift y in exponential parallel and u in mixture parallel along the ξ-axis, the duality $\langle(\Pi^{(e)})y,\ (\Pi^{(m)})u\rangle = \langle y,u\rangle$ yields
$$\langle y^I(x;\theta,\xi),\ \bar u^I(x;\theta,\xi;\xi')\rangle = g_{\theta\theta\cdot\xi}(\xi')\,.$$
This shows that the $\bar u^I$ are coplanar, and the information vector $w^I$ is given by projecting y to $H^I_{\theta,\xi}$. Conversely, when the $\bar u^I$ are coplanar, there exists the information vector $w^I \in H^I_{\theta,\xi}$. We can extend it to any ξ' by shifting it in exponential parallel,
$$y(x,\theta) = (\Pi^{(e)})_{\xi\to\xi'}\,w^I\,,$$
which yields an estimating function belonging to $C_{UI}$.
The classes $C_1$ and $C_{UI}$ are sometimes empty. We will give an example later. Even when they are nonempty, the best estimators do not necessarily exist in $C_1$ and in $C_{UI}$. The following are the main theorems concerning best estimators. (See Lindsay (1982) and Begun et al. (1983) for other approaches to this problem.)
Theorem 6.5. A best estimator exists in $C_1$ iff the vector field $u^I(x;\theta,\xi)$, which is the I-part of the θ-score u, is e-invariant. The best estimating function y(x,θ) is given by the e-invariant $u^I$, which in this case is se-invariant.
Theorem 6.6. A best estimator exists in $C_{UI}$ iff the information vector $w^I(x;\theta,\xi)$ is e-invariant. The best estimating function y is given by the e-invariant $w^I$, which in this case is se-invariant.
Outline of proofs. Let $\hat\theta$ be an estimator in $C_1$ whose estimating function is y(x,θ). It is decomposed into the following sum,
$$y(x,\theta) = c(\theta,\xi)\,u^I + a^I(x;\theta,\xi) + y^O(x;\theta,\xi)\,,$$
where $u^I(\theta,\xi)$ is the projection of u(x;θ,ξ) to $H^I_{\theta,\xi}$, c(θ,ξ) is a scalar, and $a^I \in H^I_{\theta,\xi}$ is orthogonal to $u^I$ in $H^I_{\theta,\xi}$. The asymptotic variance of $\hat\theta$ is calculated as
$$AV[\hat\theta;\Xi] = \lim_{N\to\infty}\,\frac1N\sum_i(c_i^2 A_i + B_i)\Big/\Big\{\frac1N\sum_i c_i A_i\Big\}^2\,,$$
where Ξ = (ξ₁,ξ₂,...), $c_i = c(\theta,\xi_i)$, and
$$A_i = \langle u^I, u^I\rangle_i\,, \qquad B_i = \langle(a^I(x))^2\rangle + \langle(y^O)^2\rangle\,.$$
From this, we can prove that, when and only when $B_i = 0$, the estimator is uniformly best for all sequences Ξ. The best estimating function is $u^I(x;\theta,\xi)$ for Ξ = (ξ,ξ,ξ,...). Hence it is required that $u^I$ be se-invariant. This proves Theorem 6.5. The proof of Theorem 6.6 is obtained in a similar manner by using $w^I$ instead of $u^I$.
6.4. Some typical examples: nuisance exponential family
The following family of distributions,
$$p(x;\theta,\xi) = \exp\{s(x,\theta)\,\xi + r(x,\theta) - \psi(\theta,\xi)\}\,, \qquad (6.5)$$
is used frequently in the literature treating the present problem. When θ is fixed, it is an exponential family with the natural parameter ξ, admitting a minimal sufficient statistic s(x,θ) for ξ. We call this an n-exponential family. We can elucidate the geometrical structures of the present theory by applying it to this family. The tangent vectors are given by
$$u = \xi\,\partial_\theta s + \partial_\theta r - \partial_\theta\psi\,, \qquad v = s - \partial_\xi\psi\,.$$
The m-parallel shift of a(x) from (θ,ξ') to (θ,ξ) is
$$(\Pi^{(m)})_{\xi'\to\xi}\,a(x) = a(x)\exp\{(\xi' - \xi)s - \psi(\theta,\xi') + \psi(\theta,\xi)\}\,.$$
From this follows a useful Lemma.
Lemma. The nuisance subspace $H^N_{\theta,\xi}$ is composed of the random variables of the following form,
$$r^N = f[s(x,\theta)] - \bar f(\theta,\xi)\,,$$
where f is an arbitrary function and $\bar f(\theta,\xi) = E_{\theta,\xi}[f(s)]$. The I-part $a^I$ of a(x) is explicitly given as
$$a^I(x) = a(x) - E_{\theta,\xi}[a(x) \mid s(x,\theta)]\,, \qquad (6.6)$$
by the use of the conditional expectation E[a|s]. The information subspace $H^I_{\theta,\xi}$ is given by
$$H^I_{\theta,\xi} = \{h(s;\theta,\xi)(\partial_\theta s)^I + f(s;\theta,\xi)(\partial_\theta r)^I\}$$
for any f, where $h = \partial_\theta f + \xi f$.
We first show the existence of consistent estimators in $C_1$ by applying Theorem 6.3.
Theorem 6.7. The class $C_1$ of consistent estimators is nonempty in an n-exponential family, unless both $\partial_\theta s$ and $\partial_\theta r$ are functionally dependent on s, i.e., unless
$$(\partial_\theta s)^I = (\partial_\theta r)^I = 0\,.$$
On the other hand, a consistent estimator does not necessarily exist in general. We give a simple example. Let $x = (x_1,x_2)$ be a pair of random variables taking on the two values 0 and 1 with probabilities
$$P(x_1 = 0) = 1/(1 + \exp\{\theta + \xi\})\,, \qquad P(x_2 = 0) = 1/(1 + \exp\{k(\theta) + \xi\})\,,$$
where k is a known nonlinear function. The family M is of n-exponential type only when k is a linear function. We can prove that $H^I_{\theta,\xi} = \{0\}$ unless k is linear. This proves that there are no consistent estimators in this problem.
Now we can obtain the best estimator, when it exists, for the n-exponential family. The I-part of the θ-score u is given by
$$u^I(x;\theta,\xi) = \xi\,(\partial_\theta s)^I + (\partial_\theta r)^I\,.$$
It is e-invariant when and only when $(\partial_\theta s)^I = 0$.
Theorem 6.8. The optimal estimator exists in $C_1$ when and only when $(\partial_\theta s)^I = 0$, i.e., $\partial_\theta s(x,\theta)$ is functionally dependent on s. The optimal estimating function is given in this case by the conditional score $u^I = (\partial_\theta r)^I = \partial_\theta r - E[\partial_\theta r \mid s]$, and moreover the optimal estimator is information unbiased in this case.
According to Theorem 6.4, in order to guarantee the existence of uniformly informative estimators, it is sufficient to show the coplanarity of $\bar u^I(x;\theta,\xi;\xi')$, which guarantees the existence of the information vector $w^I(x;\theta,\xi) \in H^I_{\theta,\xi}$. By putting $w^I = h(s)(\partial_\theta s)^I + f(s)(\partial_\theta r)^I$, this reduces to the integro-differential equation in f,
$$\langle w^I,\ \xi'(\partial_\theta s)^I + (\partial_\theta r)^I\rangle_{\xi'} = g_{\theta\theta\cdot\xi}(\xi')\,. \qquad (6.7)$$
When the above equation has a solution f(s;θ,ξ), the $\bar u^I$ are coplanar and the information vector $w^I$ exists. Moreover, we can prove that when $(\partial_\theta r)^I = 0$, the information vector $w^I$ is e-invariant.
Theorem 6.9. The best uniformly informative estimator exists when $(\partial_\theta r)^I = 0$. The best estimating function is given by solving
$$E_{\theta,\xi'}[h(s)\,V[\partial_\theta s \mid s]] = g_{\theta\theta\cdot\xi}(\xi')/\xi'\,, \qquad (6.8)$$
where h(s;θ) does not depend on ξ' and $V[\partial_\theta s \mid s]$ is the conditional covariance.
We give another example to help understanding. Let $x = (x_1,x_2)$ be a pair of independent normal random variables, $x_1 \sim N(\xi,1)$, $x_2 \sim N(\theta\xi,1)$. Then, the logarithm of their joint density is
$$\ell(x;\theta,\xi) = -\tfrac12[(x_1 - \xi)^2 + (x_2 - \theta\xi)^2] - \log(2\pi) = \xi\,s(x,\theta) + r(x,\theta) - \psi(\theta,\xi)\,,$$
where $s(x,\theta) = x_1 + \theta x_2$, $r(x,\theta) = -(x_1^2 + x_2^2)/2$, $\psi(\theta,\xi) = \xi^2(1 + \theta^2)/2 + \log(2\pi)$. From $\partial_\theta s = x_2$, $\partial_\theta r = 0$, we have
$$(\partial_\theta s)^I = (x_2 - \theta x_1)/(1 + \theta^2)\,, \qquad (\partial_\theta r)^I = 0\,.$$
Hence, from Theorems 6.7 and 6.8, the class $C_1$ is nonempty, but the best estimator does not exist in $C_1$. Indeed, we have
$$u^I(x;\theta,\xi) = \xi(x_2 - \theta x_1)/(1 + \theta^2)\,,$$
which depends on ξ, so that it is not e-invariant. Since any vector $w^I$ in $H^I_{\theta,\xi}$ can be written as
$$w^I = h(s)(\partial_\theta s)^I$$
for some h(s;θ,ξ), the information vector $w^I(x;\theta,\xi) \in H^I_{\theta,\xi}$ can be obtained by solving (6.4) or (6.7), which reduces in the present case to
$$E_{\theta,\xi'}[h(s)(x_2 - \theta x_1)^2] = \xi'(1 + \theta^2)\,.$$
Hence, we have
$$h(s) = s/(1 + \theta^2)\,,$$
which does not depend on ξ. Therefore, there exists a best uniformly informative estimator whose estimating function is given by
$$y(x,\theta) = w^I(x,\theta) = h(s)(\partial_\theta s)^I = (x_2 - \theta x_1)(x_1 + \theta x_2)/(1 + \theta^2)^2\,,$$
or equivalently by $(x_2 - \theta x_1)(x_1 + \theta x_2)$. This coincides with the m.l.e.; it is not information unbiased.
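To see this estimating function at work, here is a hypothetical Monte Carlo sketch (a numerical check, not part of the original text): each observation carries its own nuisance value $\xi_i$, yet solving $\sum_i y(x_i,\theta) = 0$ with $y = (x_2 - \theta x_1)(x_1 + \theta x_2)$ recovers the structural parameter consistently. The expected estimating equation is proportional to $(\theta_0 - \theta)(1 + \theta\theta_0)$, so besides the consistent root there is a spurious root near $-1/\theta_0$; we bracket the positive one.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(2)
    theta0, N = 0.7, 5000
    xi = rng.uniform(0.5, 3.0, size=N)          # incidental nuisance parameters

    x1 = xi + rng.standard_normal(N)            # x1 ~ N(xi_i, 1)
    x2 = theta0 * xi + rng.standard_normal(N)   # x2 ~ N(theta*xi_i, 1)

    def psi(th):
        # estimating equation: sum of y(x_i, theta) over the sample
        return np.sum((x2 - th * x1) * (x1 + th * x2))

    theta_hat = brentq(psi, 0.0, 5.0)           # bracket the consistent root
    print(theta_hat)   # close to theta0 = 0.7 despite N nuisance parameters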
7. PARAMETRIC MODELS OF STATIONARY GAUSSIAN TIME SERIES
7.1. α-representation of the spectrum
Let $\tilde M$ be the set of all power spectrum functions S(ω) of zero-mean discrete-time stationary regular Gaussian time series, S(ω) satisfying the Paley-Wiener condition
$$\int_{-\pi}^{\pi}\log S(\omega)\,d\omega > -\infty\,.$$
The stochastic properties of a stationary Gaussian time series $\{x_t\}$, t = ..., −1, 0, 1, 2, ..., are indeed specified by its power spectrum S(ω), which is connected with the autocovariance coefficients $c_t$ by
$$c_t = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\cos\omega t\,d\omega\,, \qquad (7.1)$$
$$S(\omega) = c_0 + 2\sum_{t=1}^{\infty} c_t\cos\omega t\,, \qquad (7.2)$$
where
$$c_t = E[x_r\,x_{r+t}]$$
for any r. A power spectrum S(ω) specifies a probability measure on the sample space X = {x_t} of the stochastic processes. We study the geometrical structure of the manifold $\tilde M$ of the probability measures given by S(ω). A specific parametric model, such as the AR model $M_n^{AR}$ of order n, is treated as a submanifold imbedded in $\tilde M$.
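Relations (7.1) and (7.2) are straightforward to verify numerically. The following hypothetical Python sketch recovers $c_t$ from S(ω) by quadrature for the AR(1) spectrum, for which the autocovariances $c_t = \varphi^t/(1-\varphi^2)$ are known in closed form (the choice of model here is only for illustration).

    import numpy as np

    phi = 0.6                                   # AR(1) coefficient, |phi| < 1
    omega = np.linspace(-np.pi, np.pi, 20001)
    S = 1.0 / np.abs(1.0 - phi * np.exp(1j * omega))**2  # unit innovation variance

    for t in range(4):
        # c_t = (1/2pi) * integral of S(w) cos(wt) dw, cf. (7.1)
        c_t = np.trapz(S * np.cos(omega * t), omega) / (2 * np.pi)
        exact = phi**t / (1 - phi**2)           # known AR(1) autocovariance
        print(t, c_t, exact)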
Let us define the α-representation $\ell^{(\alpha)}(\omega)$ of the power spectrum S(ω) by
$$\ell^{(\alpha)}(\omega) = \begin{cases} -\dfrac1\alpha\,\{S(\omega)\}^{-\alpha}\,, & \alpha \ne 0\,,\\[1mm] \log S(\omega)\,, & \alpha = 0\,.\end{cases} \qquad (7.3)$$
(Remark: It would be better to define the α-representation by $-\frac1\alpha[\{S(\omega)\}^{-\alpha} - 1]$. However, calculations are easier in the former definition, although the following discussions are the same for both representations.) We impose on the members of $\tilde M$ the regularity condition that $\ell^{(\alpha)}$ can be expanded into the Fourier series, for any α, as
$$\ell^{(\alpha)}(\omega) = \theta_0^{(\alpha)} + 2\sum_{t=1}^{\infty}\theta_t^{(\alpha)}\cos\omega t\,, \qquad (7.4)$$
where
$$\theta_t^{(\alpha)} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\ell^{(\alpha)}(\omega)\cos\omega t\,d\omega\,, \qquad t = 0, 1, 2, \dots$$
We may denote the $\ell^{(\alpha)}(\omega)$ specified by $\theta^{(\alpha)} = \{\theta_t^{(\alpha)}\}$ by $\ell^{(\alpha)}(\omega;\theta^{(\alpha)})$. The infinite number of parameters $\{\theta_t^{(\alpha)}\}$ together specify a power spectrum by
$$S(\omega;\theta^{(\alpha)}) = \begin{cases}\{-\alpha\,\ell^{(\alpha)}(\omega;\theta^{(\alpha)})\}^{-1/\alpha}\,, & \alpha \ne 0\,,\\[1mm] \exp\{\ell^{(0)}(\omega;\theta^{(0)})\}\,, & \alpha = 0\,.\end{cases} \qquad (7.5)$$
Therefore, they are regarded as defining an infinite-dimensional coordinate system in $\tilde M$. We call $\theta^{(\alpha)}$ the α-coordinate system of $\tilde M$. Obviously, the −1-coordinates are given by the autocovariances, $\theta_t^{(-1)} = c_t$. The negatives of the 1-coordinates $\theta_t^{(1)}$, i.e., the Fourier coefficients of $S^{-1}(\omega)$, are denoted by $\tilde c_t$ and are called the inverse autocovariances.
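In code, passing between S(ω) and its α-coordinates is just a cosine-series transform of $\ell^{(\alpha)}$. The hypothetical sketch below (helper names are ours, assuming the definitions (7.3)-(7.5) with the $1/2\pi$ normalization above) computes a truncated set of α-coordinates and checks that the −1-coordinates reproduce the autocovariances, while the negated 1-coordinates give the inverse autocovariances.

    import numpy as np

    omega = np.linspace(-np.pi, np.pi, 20001)

    def l_alpha(S, alpha):
        """alpha-representation (7.3) of a spectrum sampled on `omega`."""
        return np.log(S) if alpha == 0 else -(S ** -alpha) / alpha

    def alpha_coords(S, alpha, tmax):
        """theta_t^{(alpha)} for t = 0..tmax, by numerical cosine transform."""
        la = l_alpha(S, alpha)
        return np.array([np.trapz(la * np.cos(omega * t), omega) / (2 * np.pi)
                         for t in range(tmax + 1)])

    phi = 0.6
    S = 1.0 / np.abs(1.0 - phi * np.exp(1j * omega))**2

    print(alpha_coords(S, -1, 3))   # the autocovariances c_t
    print(-alpha_coords(S, 1, 3))   # inverse autocovariances: 1+phi^2, -phi, 0, 0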
7.2. Geometry of parametric and non-parametric time-series models
Let M be a set of power spectra S(ω;u) which are smoothly specified by an n-dimensional parameter u = (u^a), a = 1, 2, ..., n, such that M becomes a submanifold of $\tilde M$; e.g., M could be an autoregressive model. This M is called a parametric time-series model. However, any member of $\tilde M$ can be specified by an infinite-dimensional parameter, e.g., by the α-coordinates $\theta^{(\alpha)} = \{\theta_t^{(\alpha)}\}$, t = 0, 1, ..., in the form $S(\omega;\theta^{(\alpha)})$. The following discussions are hence common to both the parametric and non-parametric models, irrespective of the dimension n of the parameter space.

We can introduce a geometrical structure in M or $\tilde M$ in the same manner as we introduced before in a family of probability distributions on
a sample space X, except that X = {x_t} is infinite-dimensional in the present time-series case (see Amari, 1983c). Let $p_T(x_1,\dots,x_T;u)$ be the joint probability density of T consecutive observations $x_1,\dots,x_T$ of a time series specified by u. Let
$$\ell_T(x_1,\dots,x_T;u) = \log p_T(x_1,\dots,x_T;u)\,.$$
Then, we can introduce in M or $\tilde M$ the following geometrical structures as before,
$$g_{ab}(u) = \lim_{T\to\infty}\frac1T\,E[\partial_a\ell_T\,\partial_b\ell_T]\,,$$
$$\Gamma^{(\alpha)}_{abc} = \lim_{T\to\infty}\frac1T\,E\Big[\Big\{\partial_a\partial_b\ell_T + \frac{1-\alpha}{2}\,\partial_a\ell_T\,\partial_b\ell_T\Big\}\partial_c\ell_T\Big]\,.$$
However, the limiting process is tedious, and we define the geometrical structure in terms of the spectral density S(ω) in the following.
Let us consider the tangent space $T_u$ at u of M or $\tilde M$, which is spanned by a finite or infinite number of basis vectors $\partial_a = \partial/\partial u^a$ associated with the coordinate system u. The α-representation of $\partial_a$ is the following function in ω,
$$\partial_a^{(\alpha)} = (\partial/\partial u^a)\,\ell^{(\alpha)}(\omega;u)\,.$$
Hence, in $\tilde M$, the basis $\partial_t^{(\alpha)}$ associated with the α-coordinates $\theta^{(\alpha)}$ is
$$\partial_t^{(\alpha)} = \begin{cases}1\,, & t = 0\,,\\ 2\cos\omega t\,, & t \ne 0\,.\end{cases}$$
Let us introduce the inner product $g_{ab}$ of $\partial_a$ and $\partial_b$ in $T_u$ by
$$g_{ab}(u) = \langle\partial_a,\partial_b\rangle = E_\alpha[\partial_a\ell^{(\alpha)}(\omega;u)\,\partial_b\ell^{(\alpha)}(\omega;u)]\,,$$
where $E_\alpha$ is the operator defined at u by
$$E_\alpha[a(\omega)] = \frac{1}{4\pi}\int\{S(\omega;u)\}^{2\alpha}\,a(\omega)\,d\omega\,.$$
The above inner product does not depend on α, and is written as
$$\langle\partial_a,\partial_b\rangle = \frac{1}{4\pi}\int\partial_a[\log S(\omega,u)]\,\partial_b[\log S(\omega,u)]\,d\omega\,. \qquad (7.6)$$
We next define the α-covariant derivative $\nabla_a^{(\alpha)}\partial_b$ of $\partial_b$ in the
direction of $\partial_a$ by the projection of $\partial_a\partial_b\,\ell^{(\alpha)}$ to $T_u$. Then, the components of the α-connection are given by
$$\Gamma^{(\alpha)}_{abc} = \langle\nabla_a^{(\alpha)}\partial_b,\ \partial_c\rangle = E_\alpha[\partial_a\partial_b\ell^{(\alpha)}\,\partial_c\ell^{(\alpha)}]\,. \qquad (7.7)$$
If we use the 0-representation, it is given by
$$\Gamma^{(\alpha)}_{abc} = \frac{1}{4\pi}\int(\partial_a\partial_b\log S - \alpha\,\partial_a\log S\,\partial_b\log S)\,\partial_c\log S\,d\omega\,.$$
From (7.4) and (7.7), we easily see that the α-connection vanishes identically in $\tilde M$ if the α-coordinate system $\theta^{(\alpha)}$ is used. Hence, we have
Theorem 7.1. The non-parametric $\tilde M$ is α-flat for any α. The α-affine coordinate system is given by $\theta^{(\alpha)}$. The two coordinate systems $\theta^{(\alpha)}$ and $\theta^{(-\alpha)}$ are mutually dual.
Since $\tilde M$ is α-flat, we can define the α-divergence from $S_1(\omega)$ to $S_2(\omega)$ in $\tilde M$. It is calculated as follows.
Theorem 7.2. The α-divergence from $S_1$ to $S_2$ is given by
$$D_\alpha(S_1,S_2) = \begin{cases}\dfrac{1}{\alpha^2}\displaystyle\int\big\{[S_2(\omega)/S_1(\omega)]^{\alpha} - 1 - \alpha\log[S_2/S_1]\big\}\,d\omega\,, & \alpha \ne 0\,,\\[2mm] \dfrac12\displaystyle\int[\log S_1(\omega) - \log S_2(\omega)]^2\,d\omega\,, & \alpha = 0\,.\end{cases}$$
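The two branches fit together continuously: with $l = \log(S_2/S_1)$, the α ≠ 0 integrand $\frac{1}{\alpha^2}(e^{\alpha l} - 1 - \alpha l)$ tends to $\frac12 l^2$ as α → 0. The hypothetical sketch below evaluates $D_\alpha$ for two AR(1) spectra by numerical quadrature, using the constants exactly as printed above (any fixed overall normalization would serve equally well for projection purposes).

    import numpy as np

    omega = np.linspace(-np.pi, np.pi, 20001)

    def ar1_spectrum(phi):
        return 1.0 / np.abs(1.0 - phi * np.exp(1j * omega))**2

    def D(alpha, S1, S2):
        """alpha-divergence of Theorem 7.2, by numerical quadrature."""
        r = S2 / S1
        if alpha == 0:
            return 0.5 * np.trapz(np.log(S1 / S2)**2, omega)
        return np.trapz(r**alpha - 1.0 - alpha * np.log(r), omega) / alpha**2

    S1, S2 = ar1_spectrum(0.3), ar1_spectrum(0.6)
    print(D(0.0, S1, S2))                     # limiting case
    print(D(1e-4, S1, S2))                    # approaches the alpha = 0 value
    print(D(-1.0, S1, S2), D(1.0, S1, S2))    # the dual divergences differ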
7.3. α-flat models
An α-model $M_n^{(\alpha)}$ of order n is a parametric model such that the α-representation of the power spectrum of a member in $M_n^{(\alpha)}$ is specified by n + 1 parameters $u = (u_k)$, k = 0, 1,...,n, as
$$\ell^{(\alpha)}(\omega;u) = u_0 + 2\sum_{k=1}^{n} u_k\cos k\omega\,.$$
Obviously, $M_n^{(\alpha)}$ is α-flat (and hence −α-flat), and u is its α-affine coordinate system.
The AR-model $M_n^{AR}$ of order n consists of the stochastic processes defined recursively by
$$\sum_{k=0}^{n} a_k\,x_{t-k} = \varepsilon_t\,,$$
where $\{\varepsilon_t\}$ is a white noise Gaussian process with unit variance and $a = (a_0, a_1,\dots,a_n)$ is the (n+1)-dimensional parameter specifying the members of $M_n^{AR}$.
Hence, it is an (n+1)-dimensional submanifold of $\tilde M$. The power spectrum S(ω;a) of the process specified by a is given by
$$S(\omega;a) = \Big|\sum_{k=0}^{n} a_k e^{ik\omega}\Big|^{-2}\,.$$
We can calculate the geometric quantities of $M_n^{AR}$ in terms of the AR-coordinate system a from the above expression.
Similarly, the MA-model $M_n^{MA}$ of order n is defined by the processes
$$x_t = \sum_{k=0}^{n} b_k\,\varepsilon_{t-k}\,,$$
where $b = (b_0, b_1,\dots,b_n)$ is the MA-parameter. The power spectrum S(ω;b) of the process specified by b is
$$S(\omega;b) = \Big|\sum_{k=0}^{n} b_k e^{ik\omega}\Big|^{2}\,.$$
The exponential model $M_n^{EXP}$ of order n introduced by Bloomfield (1973) is composed of the power spectra S(ω;e) parameterized by $e = (e_0, e_1,\dots,e_n)$ and given by
$$S(\omega;e) = \exp\Big\{e_0 + 2\sum_{k=1}^{n} e_k\cos k\omega\Big\}\,.$$
It is easy to show that the 1-representation of S(ω;a) in $M_n^{AR}$ is
$$\ell^{(1)}(\omega;a) = -S^{-1}(\omega;a) = -\Big|\sum_{k=0}^{n} a_k e^{ik\omega}\Big|^{2}\,,$$
so that the inverse autocovariances are
$$\tilde c_k = \sum_t a_t\,a_{t+k}\,, \quad k = 0, 1,\dots,n\,, \qquad \tilde c_k = 0\,, \quad k > n\,.$$
This shows that $M_n^{AR}$ is a submanifold specified by $\tilde c_k = 0$ (k > n) in $\tilde M$. Hence, it coincides exactly with the α-model $M_n^{(1)}$, although the coordinate system a is not 1-affine but curved. Similar discussions hold for $M_n^{MA}$.
Theorem 7.3. The AR-model $M_n^{AR}$ coincides with $M_n^{(1)}$, and hence is ±1-flat. The MA-model $M_n^{MA}$ coincides with $M_n^{(-1)}$, and hence is also ±1-flat. The exponential model $M_n^{EXP}$ coincides with $M_n^{(0)}$, and is 0-flat. Since it is self-dual, it is an (n+1)-dimensional Euclidean space with an orthogonal Cartesian coordinate system e.
7.4. α-approximation and α-projection
Given a parametric model $M_n = \{S(\omega;u)\}$, it is sometimes necessary to approximate a spectrum S(ω) by one belonging to $M_n$. For example, given finite observations $x_1,\dots,x_T$ of $\{x_t\}$, one tries to estimate u in the parametric model $M_n$ by obtaining first a non-parametric estimate $\hat S(\omega)$ based on $x_1,\dots,x_T$ and then approximating it by $S(\omega;u) \in M_n$. The α-approximation of $\hat S$ is the one that minimizes the α-divergence $D_\alpha[\hat S(\omega), S(\omega;u)]$, $u \in M_n$. It is well known that the −1-approximation is related to the maximum likelihood principle. As we have shown in §2, the α-approximation is given by the α-projection of $\hat S(\omega)$ to $M_n$. We now discuss the accuracy of the α-approximation. To this end, we consider a family of nested models $\{M_n\}$ such that $M_0 \subset M_1 \subset M_2 \subset \cdots \subset M_\infty = \tilde M$. The families $\{M_n^{AR}\}$, $\{M_n^{MA}\}$ and $\{M_n^{EXP}\}$ are nested models, in which $M_0$ is composed of the white noises of various powers.
Let $\{M_n^{(\alpha)}\}$ be a family of α-flat nested models, and let $S_n(\omega;u_n) \in M_n$ be the −α-approximation of S(ω), where $u_n$ is the (n+1)-dimensional parameter given by
$$\min_{u} D_{-\alpha}[S, S_n(\omega;u)] = D_{-\alpha}[S, S_n(\omega;u_n)]\,.$$
The error of the approximation by $S_n \in M_n$ is measured by the −α-divergence $D_{-\alpha}(S,S_n)$. We define
$$E_n(S) = \min_{S_n\in M_n} D_{-\alpha}(S,S_n) = D_{-\alpha}(S,S_n)\,. \qquad (7.8)$$
It is an interesting problem to find out how $E_n(S)$ decreases as n increases. We can prove the following Pythagorean relation (Fig. 10):
$$D_{-\alpha}(S,S_n) = D_{-\alpha}(S,S_{n+1}) + D_{-\alpha}(S_{n+1},S_n)\,.$$
The following theorem is a direct consequence of this relation.
Theorem 7.4. The approximation error $E_n(S)$ of S is decomposed as
$$E_n(S) = \sum_{k=n}^{\infty} D_{-\alpha}(S_{k+1},S_k)\,. \qquad (7.9)$$
Figure 10
Hence,
$$D_{-\alpha}(S,S_0) = \sum_{n=0}^{\infty} D_{-\alpha}(S_{n+1},S_n)\,.$$
The theorem is proved by the Pythagorean relation for the right triangle $S\,S_{n+1}\,S_n$ composed of the α-geodesic $S_{n+1}S_n$ included in $M_{n+1}$ and the −α-geodesic $S\,S_{n+1}$, intersecting at $S_{n+1}$ perpendicularly. The theorem shows that the approximation error $E_n(S)$ is decomposed into the sum of the −α-divergences of the successive approximations $S_k$, k = n+1,...,∞, where $S_\infty = S$ is assumed. Moreover, we can prove that the −α-approximation of $S_k$ in $M_n$ (n < k) is $S_n$. In other words, the sequence $\{S_n\}$ of the approximations of S has the property that $S_n$ is the best approximation of $S_k$ (k > n), and that the approximation error $E_n(S)$ is decomposed into the sum of the −α-divergences between the further successive approximations. This is proved from the fact that the α-geodesic in $\tilde M$ connecting two points S and S' belonging to $M_n^{(\alpha)}$ is completely included in $M_n^{(\alpha)}$ for an α-model $M_n^{(\alpha)}$.
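For α = 0 this picture collapses to Euclidean geometry: by (7.4) and Theorem 7.2, $D_0$ is (via Parseval's identity) a squared distance between the Fourier coefficients of the log-spectra, and the 0-projection onto $M_n^{EXP}$ is simply truncation of the cosine series of log S. The hypothetical sketch below verifies the Pythagorean relation numerically in this special case.

    import numpy as np

    omega = np.linspace(-np.pi, np.pi, 20001)

    def D0(S1, S2):
        # 0-divergence of Theorem 7.2
        return 0.5 * np.trapz(np.log(S1 / S2)**2, omega)

    def project_exp(S, n):
        # 0-projection onto M_n^EXP: truncate the cosine series of log S
        logS = np.log(S)
        coef = [np.trapz(logS * np.cos(omega * t), omega) / (2 * np.pi)
                for t in range(n + 1)]
        out = coef[0] + sum(2 * c * np.cos(omega * t)
                            for t, c in enumerate(coef) if t > 0)
        return np.exp(out)

    phi = 0.8
    S = 1.0 / np.abs(1.0 - phi * np.exp(1j * omega))**2  # not in any M_n^EXP
    S3, S2 = project_exp(S, 3), project_exp(S, 2)

    # Pythagorean relation: D0(S, S2) = D0(S, S3) + D0(S3, S2)
    print(D0(S, S2), D0(S, S3) + D0(S3, S2))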
Let us consider the family $\{M_n^{AR}\}$ of the AR-models. It coincides with $\{M_n^{(1)}\}$. Let $S_n$ be the −1-approximation of S. Let $c_t(S)$ and $\tilde c_t(S)$ be, respectively, the autocovariances and inverse autocovariances. Since $c_t$ and $\tilde c_t$ are the mutually dual −1-affine and 1-affine coordinate systems, the −1-approximation $S_n$ of S is determined by the following relations:
1) $c_t(S_n) = c_t(S)$, t = 0, 1, ..., n,
2) $\tilde c_t(S_n) = 0$, t = n+1, n+2, ....
This implies that the autocovariances of $S_n$ are the same as those of S up to t = n, and that the inverse autocovariances $\tilde c_t$ of $S_n$ vanish for t > n. Similar relations hold for any other α-flat nested models, where $c_t$ and $\tilde c_t$ are replaced by the dual pair of α- and −α-affine coordinates. Especially, since $\{M_n^{EXP}\}$ are the nested Euclidean submanifolds with the self-dual coordinates $\theta^{(0)}$, their properties are extremely simple.
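Conditions 1) and 2) are exactly the Yule-Walker equations: matching the first n+1 autocovariances within $M_n^{AR}$ determines the −1-approximation. A hypothetical sketch, assuming scipy is available and using an arbitrary numerical autocovariance sequence for illustration:

    import numpy as np
    from scipy.linalg import solve_toeplitz

    # Autocovariances c_0..c_n of the target spectrum; in practice
    # these come from (7.1).
    c = np.array([2.5, 1.4, 0.9, 0.5])        # c_0, c_1, c_2, c_3
    n = 3

    # Yule-Walker: solve sum_k phi_k c_{|t-k|} = c_t for t = 1..n, which
    # enforces condition 1), c_t(S_n) = c_t(S) for t <= n.
    phi = solve_toeplitz(c[:n], c[1:n + 1])
    sigma2 = c[0] - phi @ c[1:n + 1]          # innovation variance

    omega = np.linspace(-np.pi, np.pi, 4001)
    A = 1.0 - sum(p * np.exp(1j * omega * (k + 1)) for k, p in enumerate(phi))
    S_n = sigma2 / np.abs(A)**2               # the -1-approximation in M_n^AR
    print(phi, sigma2)

Condition 2) then holds automatically, since any AR(n) spectrum has $\tilde c_t = 0$ for t > n.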
We have derived some fundamental properties of α-flat nested parametric models. These properties seem to be useful for constructing the theory of estimation and approximation of time series. Although we have not discussed them here, the ARMA-models, which are not α-flat for any α, also have interesting global and local geometrical properties.
Acknowledgements
The author would like to express his sincere gratitude to Dr. M.
Kumon and Mr. H. Nagaoka for their collaboration in developing differential
geometrical theory. Some results of the present paper are due to joint work
with them. The author would like to thank Professor K. Takeuchi for his
encouragement. He also appreciates valuable suggestions and comments from the
referees of the paper.
REFERENCES
Akahira, M. and Takeuchi, K. (1981). On asymptotic deficiency of estimators
in pooled samples. Tech. Rep., Limburgs Univ. Centr., Belgium.
Amari, S. (1968). Theory of information spaces - a geometrical foundation of
the analysis of communication systems. RAAG Memoirs 4, 373-418.
Amari, S. (1980). Theory of information spaces - a differential geometrical
foundation of statistics. POST RAAG Report, No. 106.
Amari, S. (1982a). Differential geometry of curved exponential families -
curvatures and information loss. Ann. Statist. 10, 357-387.
Amari, S. (1982b). Geometrical theory of asymptotic ancillarity and condition-
al inference. Biometrika 69, 1-17.
Amari, S. (1983a). Comparisons of asymptotically efficient tests in terms of
geometry of statistical structures. Bull. Int. Statist. Inst.,
Proc. 44th Session, Book 2, 1190-1206.
Amari, S. (1983b). Differential geometry of statistical inference, Probability
Theory and Mathematical Statistics (ed. Ito, K. and Prokhorov,
J. V.), Springer Lecture Notes in Math 1021, 26-40.
Amari, S. (1983c). A foundation of information geometry. Electronics and
Communication in Japan, 66-A, 1-10.
Amari, S. (1985). Differential-Geometrical Methods in Statistics. Springer
Lecture Notes in Statistics, 28, Springer.
Amari, S. and Kumon, M. (1983). Differential geometry of Edgeworth expansions
in curved exponential family, Ann. Inst. Statist. Math. 35A,
1-24.
Atkinson, C. and Mitchell, A. F. (1981). Rao's distance measure. Sankhya A43,
345-365.
Barndorff-Nielsen, O. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, O. E. (1987). Differential and integral geometry in
statistical inference. IMS Monograph, this volume.
Bates, D. M. and Watts, D. G. (1980). Relative curvature measures of non-
linearity, J. Roy. Statist. Soc. B40, 1-25.
Beale, E. M. L. (1960). Confidence regions in non-linear estimation. J. Roy.
Statist. Soc. B22, 41-88.
Begun, J. M., Hall, W. J., Huang, W.-M. and Wellner, J. A. (1983). Informa-
tion and asymptotic efficiency in parametric-nonparametric models.
Ann. Statist. 11, 432-452.
Bhattacharya, R. N. and Ghosh, J. K. (1978). On the validity of the formal
Edgeworth expansion. Ann. Statist. 6, 434-451.
Bloomfield, P. (1973). An exponential model for the spectrum of a scalar time
series. Biometrika 60, 217-226.
Burbea, J. and Rao. C. R. (1982). Entropy differential metric, distance and
divergence measures in probability spaces: A unified approach.
J. Multivariate Anal. 12, 575-596.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Inference
(in Russian). Nauka, Moscow, translated in English (1982), AMS,
Rhode Island.
Chernoff, H. (1949). Asymptotic studentization in testing of hypotheses,
Ann. Math. Stat. 20, 268-278.
Cox, D. R. (1980). Local ancillarity. Biometrika 67, 279-286.
Csiszár, I. (1975). I-divergence geometry of probability distributions and
minimization problems. Ann. Prob. 3, 146-158.
Dawid, A. P. (1975). Discussions to Efron's paper. Ann. Statist. 3, 1231-
1234.
Dawid, A. P. (1977). Further comments on a paper by Bradley Efron. Ann.
Statist. 5, 1249.
Efron, B. (1975). Defining the curvature of a statistical problem (with
application to second order efficiency) (with Discussion). Ann.
Statist. 3, 1189-1242.
Efron, B. (1978). The geometry of exponential families. Ann. Statist. 6,
362-376.
Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the maximum
likelihood estimator: Observed versus expected Fisher information
(with Discussion). Biometrika 65, 457-487.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in
a curved exponential family. Ann. Statist. 11, 793-803.
Hinkley, D. V. (1980). Likelihood as approximate pivotal distribution.
Biometrika 67, 287-292.
Hougaard, P. (1983). Parametrization of non-linear models. J. R. Statist.
Soc. B44, 244-252.
James, A. T. (1973). The variance information manifold and the functions on it.
Multivariate Analysis (ed. Krishnaiah, P. K.), Academic Press,
157-169.
Kariya, T. (1983). An invariance approach in a curved model. Discussion paper
Ser. 88, Hitotsubashi Univ.
Kass, R. E. (1980). The Riemannian structure of model spaces: A geometrical
approach to inference. Ph.D. Thesis, Univ. of Chicago.
Kass, R. E. (1984). Canonical parametrization and zero parameter effects
curvature. J. Roy. Statist. Soc. B46, 86-92.
Kumon, M. and Amari, S. (1983). Geometrical theory of higher-order asymptotics
of test, interval estimator and conditional inference, Proc. Roy.
Soc. London A387, 429-458.
Kumon, M. and Amari, S. (1984). Estimation of structural parameter in the
presence of a large number of nuisance parameters. Biometrika 71,
445-459.
Kumon, M. and Amari, S. (1985). Differential geometry of testing hypothesis:
a higher order asymptotic theory in multi parameter curved exponen-
tial family, METR 85-2, Univ. Tokyo.
Lauritzen, S. L. (1987). Some differential geometrical notions and their use
in statistical theory. IMS Monograph, this volume.
Lindsay, B. G. (1982). Conditional score functions: Some optimality results.
Biometrika 69, 503-512.
McCullagh, P. (1984). Tensor notation and cumulants of polynomials.
Biometrika 71, 461-476.
Madsen, L. T. (1979). The geometry of statistical models - a generalization
of curvature. Research Report, 79-1, Statist. Res. Unit., Danish
Medical Res. Council.
Nagaoka, H. and Amari, S. (1982). Differential geometry of smooth families of
probability distributions, METR 82-7, Univ. Tokyo.
Pfanzagl, J. (1982). Contributions to General Asymptotic Statistical Theory.
Lecture Notes in Statistics 13, Springer.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of
statistical parameters. Bull. Calcutta Math. Soc. 37, 81-91.
Reeds, J. (1975). Discussions to Efron's paper. Ann. Statist. 3, 1234-1238.
Skovgaard, Ib. (1985). A second-order investigation of asymptotic ancillarity.
Ann. Statist. 13, 534-551.
Skovgaard, L. T. (1984). A Riemannian geometry of the multivariate normal
model. Scand. J. Statist. 11, 211-223.
Yoshizawa, T. (1971). A geometrical interpretation of location and scale
parameters. Memo TYH-2, Harvard Univ.
DIFFERENTIAL AND INTEGRAL GEOMETRY IN STATISTICAL INFERENCE
O. E. Barndorff-Nielsen
1. Introduction. 97
2. Review and Preliminaries . 99
3. Transformation Models . 118
4. Transformation Submodels . 127
5. Maximum Estimation and Transformation Models . 130
6. Observed Geometries . 135
7. Expansion of $c|\hat\jmath|^{1/2}\bar L$ . 147
8. Exponential Transformation Models . 152
9. Appendix 1. 154
10. Appendix 2. 156
11. Appendix 3. 157
12. References. 159
Department of Theoretical Statistics, Institute of Mathematics, University of
Aarhus, Aarhus, Denmark
1. INTRODUCTION
This paper gives an account of some of the recent developments in
statistical inference in which concepts and results from integral and differen-
tial geometry have been instrumental.
A great many important contributions to the field of integral and
differential geometry in statistics are not discussed or even referred to here,
but a rather comprehensive overview of the field can be obtained from the mate-
rial compiled in the present volume and from the survey paper by Barndorff-
Nielsen, Cox and Reid (1986).
Section 2 reviews pertinent parts of statistics and of integral
and differential geometry, and introduces some of the terminology and notation
that will be used in the rest of the paper.
A considerable part of the material in sections 3, 4, 5 and 8 and
in the appendices, which are mainly concerned with the systematic theory of
transformation models and exponential transformation models, has not been pub-
lished elsewhere.
Sections 6 and 7 describe a theory of "observed geometries" and its
relation to an asymptotic expansion of the formula $c|\hat\jmath|^{1/2}\bar L$ for the conditional
distribution of the maximum likelihood estimator; the results there are mostly
taken from Barndorff-Nielsen (1986a). Briefly speaking, the observed geome-
tries on the parameter space of a statistical model consist of a Riemannian
metric and an associated one-parameter family of affine connections, construct-
ed from the observed information matrix and from an auxiliary statistic a cho-
sen such that $(\hat\omega,a)$, where $\hat\omega$ denotes the maximum likelihood estimator of the
parameter of the model, is minimal sufficient. The observed geometries and the
closely related expansion of $c|\hat\jmath|^{1/2}\bar L$ form a parallel to the "expected geometries"
and the associated conditional Edgeworth expansions for curved exponential
families studied primarily by Amari (cf., in particular, Amari 1985, 1986), but
with some essential differences. In particular, the developments in sections 6
and 7 are, in a sense, closer to the actual data and they do not require inte-
grations over the sample space; instead they employ "mixed derivatives of the
log model function." Furthermore, whereas the studies of expected geometries
have been largely concerned with curved exponential families the approach taken
here makes it equally natural to consider other parametric models, and in par-
ticular transformation models. The viewpoint of conditional inference has been
instrumental for the constructions in question. However, the observed geometri-
cal calculus, as discussed in section 6, does not require the employment of
exact or approximate ancillaries.
The observed geometries provide examples of the concept of
statistical manifolds discussed by Lauritzen (1986).
Throughout the paper examples are given to illustrate the general
results.
2. REVIEW AND PRELIMINARIES
We shall consider parametrized statistical models M specified by $(\mathcal X, p(x;\omega), \Omega)$, where $\mathcal X$ is the sample space, Ω is the parameter space and p(x;ω) is the model function, i.e. $p(x;\omega) = dP_\omega/d\mu$ for some dominating measure μ. The dimension of the parameter ω will usually be denoted by d and we write ω in coordinate form as $(\omega^1,\dots,\omega^d)$. Generic coordinates of ω will be indicated as $\omega^r$, $\omega^s$, $\omega^t$, etc.
The present section is organized in a number of subsections and it
serves two purposes: to provide a survey of previous results and to set the
stage for the developments in the following sections.
Combinants. It is useful to have a term for functions which depend on both the observation x and the parameter ω, and we shall call any such function a combinant.
Jacobians. Our vectors are row vectors and we denote transposition of a matrix by an asterisk *. If f is a differentiable transformation of a space Y then the Jacobian matrix ∂f/∂y* of f at y ∈ Y is also denoted by $\underline J_f(y)$, while we write $J_f(y)$ for the Jacobian determinant, i.e. $J_f = |\underline J_f|$. When appropriate we interpret $J_f(y)$ as an absolute value, without explicitly stating this. We shall repeatedly use the fact that for differentiable transformations f and g we have
$$\underline J_{f\circ g}(y) = \underline J_g(y)\,\underline J_f(g(y)) \qquad (2.1)$$
and hence
$$J_{f\circ g}(y) = J_f(g(y))\,J_g(y)\,. \qquad (2.2)$$
Foliations. A partition of a manifold of dimension k into submani-
folds all of dimension m<k is called a foliation and the submanifolds are said
to be the leaves of the foliation.
A dimension-reducing statistical hypothesis may often, in a natural way, be viewed as a leaf of an associated foliation of the parameter space Ω.
Likelihood. We let L = L(ω) = L(ω;x) denote an arbitrary version of the likelihood function for ω and we set l = log L. Furthermore, we write $\partial_r = \partial/\partial\omega^r$, and $l_r = \partial_r l$, $l_{rs} = \partial_r\partial_s l$, etc. The observed information is the matrix
$$j(\omega) = -[l_{rs}] \qquad (2.3)$$
and the expected information is
$$i(\omega) = E_\omega\, j(\omega)\,. \qquad (2.4)$$
The inverse matrices of j and i are referred to as observed and expected formation, respectively.
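The distinction between j and i matters in practice because the observed information depends on the data. A hypothetical numerical illustration for the Cauchy location model, where the expected information per observation is the constant 1/2 while $j(\hat\omega)$ fluctuates from sample to sample:

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(3)

    def negloglik(om, x):            # Cauchy location log likelihood, negated
        return np.sum(np.log1p((x - om) ** 2))

    for _ in range(3):
        x = rng.standard_cauchy(40)
        om_hat = minimize_scalar(negloglik, args=(x,),
                                 bounds=(-5, 5), method="bounded").x
        # observed information j = -l'' at the m.l.e., by central differences
        h = 1e-4
        j = (negloglik(om_hat + h, x) - 2 * negloglik(om_hat, x)
             + negloglik(om_hat - h, x)) / h**2
        print(om_hat, j, len(x) * 0.5)   # j varies; i = n/2 is fixed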
Suppose the minimal sufficient statistic t for M is of dimension k. We then speak of M as a (k,d)-model (d being the dimension of the parameter ω). Let $(\hat\omega,a)$ be a one-to-one transformation of t, where $\hat\omega$ is the maximum likelihood estimator of ω and a, of dimension k−d, is an auxiliary statistic.
In most applications it will be essential to choose a so as to be distribution constant, either exactly or to the relevant asymptotic order. Then a is ancillary, and according to the conditionality principle the conditional model for $\hat\omega$ given a is considered the appropriate basis for inference on ω. However, unless explicitly stated, distribution constancy of a is not assumed in the following.
There will be no loss of generality in viewing the log likelihood l = l(ω) in its dependence on the observation x as being a function of the minimal sufficient $(\hat\omega,a)$ only. Henceforth we shall think of l in this manner and we will indicate this by writing
$$l = l(\omega;\hat\omega,a)\,.$$
Similarly, in the case of observed information we write
$$j = j(\omega;\hat\omega,a)\,,$$
etc. It turns out to be of interest to consider the function
$$\hat l(\omega) = \hat l(\omega;a) = l(\omega;\omega,a)\,, \qquad (2.5)$$
obtained from $l(\omega;\hat\omega,a)$ by substituting ω for $\hat\omega$. Similarly we write
$$\hat\jmath(\omega) = \hat\jmath(\omega;a) = j(\omega;\omega,a)\,. \qquad (2.6)$$
For a general parametric model p(x;ω) and for a general auxiliary a, a conditional probability function $p^*(\hat\omega;\omega\mid a)$ for $\hat\omega$ given a may be defined by
$$p^*(\hat\omega;\omega\mid a) = c\,|\hat\jmath|^{1/2}\,\bar L\,, \qquad (2.7)$$
where $\bar L$ is the normed likelihood function, i.e.
$$\bar L = p(x;\omega)/p(x;\hat\omega)\,,$$
and where $c = c(\omega,a)$ is a norming constant determined so as to make the integral of (2.7) with respect to $\hat\omega$ equal to 1.
Suppose now that a is approximately or exactly distribution constant. Then the probability function $p^*(\hat\omega;\omega\mid a)$, given by (2.7), is to be considered as an approximation to the conditional probability function $p(\hat\omega;\omega\mid a)$ of the maximum likelihood estimator $\hat\omega$ given a, cf. Barndorff-Nielsen (1980, 1983). In general, $p^*(\hat\omega;\omega\mid a)$ is simple to calculate since it only requires knowledge of standard likelihood quantities plus an integration over the sample space to determine the norming constant c. Moreover, to sufficient accuracy this norming constant can often be approximated by $(2\pi)^{-d/2}$, where d is the dimension of ω; and a more refined approximation to c solely in terms of mixed derivatives of the log model function is also available, cf. the next subsection and section 7. In a great number of cases, including virtually all transformation models, $p^*(\hat\omega;\omega\mid a)$ is, in fact, equal to $p(\hat\omega;\omega\mid a)$. Furthermore, outside these exactness cases one often has an asymptotic relation of the form
$$p(\hat\omega;\omega\mid a) = p^*(\hat\omega;\omega\mid a)\{1 + O(n^{-3/2})\} \qquad (2.8)$$
uniformly in $\hat\omega$ for $\sqrt n\,(\hat\omega - \omega)$ bounded, where n denotes sample size. This holds, in particular, for (k,d) exponential models. For more details and further
discussion, see Barndorff-Nielsen (1980, 1983, 1984, 1985, 1986a,b) and
Barndorff-Nielsen and Blaesild (1984).
Expansion of cjjl L in the single-parameter case. Suppose ? is
one-dimensional. From formulas (4.2) and (4.5) of Barndorff-Nielsen and Cox
(1984) we have
cj^C = f(?-?; j){l + CjHl
+ A-j^U-u)))
+ A2(a^U-o)))}
?{1 + 0(n"3/2)}.
(2.9)
Here <i>(w;y) denotes the probability density function of the normal distribution
with mean 0 and variance ?" . Furthermore, C,, A,, and A? are given by
Cl ?
?{-3U4 +
12U3,1 "5U3 +
24U2,1U3 -
24U2,1 "
12U2,2} <2-10>
and
A^u) =
P1(u)U2J +
P2(u)U3
A2(u) =
P3(u)U2j2 +
P4(u)?2tl +
P5(u)U4 +
P6(u)U3>1 +
P7(u)U3
+ P8?U>U2,1U3
where P.(u), i = 1,...,8, are polynomials, the explicit forms of which are
given in Barndorff-Nielsen (1985), and where U = U n and U ? are defined as ^ ' ? v,0 v,s
? / \ ? = 1,2,3,...
,, , ? as(rv;U^,a)} uv,s(u))-.(v+s)/2-
> * s = 0,1,2...
rv' denoting the v-th order derivative of 1 = lU;??>a) with respect to ? and
8S indicating differentiation s times with respect to ?. Note that, in the
repeated sampling situation, U is of order 0(n"^v s" ''
). Hence the ? ,s
quantities C.s ?-i and A2 are of order 0(n" ), 0(n ) and 0(n~ ), respectively.
Integration of (2.7) yields an approximation to the conditional
distribution of the likelihood ratio statistic
w = 2{1(?) - 1(?0) (2.11)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 103
for testing a dimension reducing hypothesis O0 of O. In particular, if O? is
a po^nt hypothesis, ?~ = {?^}, we have
^0 P*(w;Wfi|a) = ce_i~2W / |j|^ (2.12)
?|w,a
as an app^ imation to p(w;u)Q|a). (The leading term of (2.9) together with
(2.12) yields the usual ? approximation for w. For a connection to Bartlett
adjustment factors see Barndorff-Nielsen and Cox (1984)).
Furthermore, (2.9) may be integrated termwise to obtain expansions
for the conditional distribution function for ? and, by inversion, for confi-
-3/2 dence limits for ?, correct to order 0(n ), conditionally as well as uncon-
ditionally, cf. Barndorff-Nielsen (1985). The resulting expressions allow one
to carry out "conditional inference without conditioning and without integra-
tion."
For extensions to the case of multidimensional parameters see
section 7.
Reparametrization. A basic form of invariance is parametrization
invariance of statistical procedures (though parametrization equivariance might
be a more proper term). If we think of an inference frame as consisting of the
data in conjunction with the model and a particular parametrization of the
model, and of a statistical procedure p as a method which leads from the
inference frame to a conclusion formulated in terms of the parametrization of
the inference frame then parametrization invariance may be formally specified
as commutati vity of the diagram
inference reparametrization ^ inference
frame frame
procedure p
procedure
conclusion -? conclusion reparametri zati on
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
104 O. E. Barndorff-Nielsen
In words, the procedure p is parametrization invariant if changing the inference
base by shifting to another parametrization and then applying p yields the same
conclusion as first applying p and then translating the conclusion so as to be
expressed in terms of the new parametrization. (We might describe a parametri-
zation invariant procedure as a 0-th order generalized tensor.) Maximum
likelihood estimation and likelihood ratio testing are instances of parametri-
zation invariant procedures.
Example 2.1. Consider any log-likelihood function 1(?), of a one-
dimensional parameter ?. Define the functions r^ = r*-vJ(w), ? = 1,2,...,
recursively by
*Cl]U) = !(1)U)/iU)^
G[?].*?^f1(?)^ j f ?\w/ , v=2,3,..., ??
and set f*-v-* = r-v-*U). The derivatives rLvJ are parametrization invariant,
i.e. r^ takes the same value whatever the parametrization employed.
While parametrization invariance is clearly a desirable property,
there are a number of useful, and virtually indispensable, statistical methods
which do not have this property. Thus procecures which rely on the asymptotic
normality of the maximum likelihood estimator, such as the Wald test or stan-
dard ways of setting confidence intervals in non-linear regression problems,
are mostly not parametrization invariant. However, in cases of non parametri-
zation invariance particular caution must be exercised, as demonstrated for
instance for the Wald test by Hauck and Donner (1977) and Vaeth (1985).
We shall be interested in how various quantities behave under
reparametrizations of the model M. Let ?, of dimension d, be the parameter of
some parametrization of M, alternative to that indicated by ?. Coordinates of
? will be denoted by ??, ?s, etc. and we write a for 3/3?? and ?
r - a r/,,P r - a2 r/aip..o /? ? /?s ?
etc. Furthermore, we write 1(?) for the log likelihood under the parametriza-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 105
tion by ?, though formally this is in conflict with the notation 1(?), and
correspondingly we let lp
= 3 1 = a 1(?), etc.; similarly for other parameter
dependent quantities. Finally, the symbol ? over such a quantity indicates that
the maximum likelihood estimate has been substituted for the parameter.
Using this notation and adopting the summation convention that if a
suffix occurs repeatedly in a single expression then summation over that suffix
is understood, we have
p r /?
1 = "Lcw/ ?* + 1 ?, (2.13)
?s rs /? /s r /?s ? '
1 = ?^+?, ?7 ?, + ?^?, ?, [3] + 1 ?, (2.14) ?st rst /? /s /t rs /?s /tu
j r /?st ? '
etc., where [3] signifies a sum of three similar terms determined by permutation
of the indices ?,s,t. On substituting ? for ? in (2.13) we obtain the well-
known relation
Ko =
jrs%>
which, now by substitution of ? for ?, may be reexpressed as
ho -
K////0 <2?15>
or, written more explicitly,
Equation (2.15) shows that j is a metric tensor on M, for any given value of the
auxiliary statistic a. Moreover, in wide generality ? will be positive definite
on M, and we assum? henceforth that this is the case. In fact, for any ?eO we
have j? = j, i.e. observed information at the maximum likelihood point, which is
generally positive definite (though counterexamples do exist). r, ... r
Let ?(?) = [? ?(?)] be an array, depending on ? and where
sl "?
sq each of the ? + q indices runs from 1 to d. Then A is said to be a (p,q)
tensor, or a tensor of contravariant rank ? and covariant rank q, if under
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
106 O. E. Barndorff-Nielsen
reparametrization from ? to ? A obeys the transformation law
pr..pn s-, srt p, p^ rr..r
^...o^*)-"/lr-/V/v-^pAv-<w?
Example 2.2. A covariant tensor of rank q is given by
E j al al ?
In particular, the expected information i is a (0,2) tensor.
The inverse [irs] of i = [i ] is a contravariant second order
tensor.
r.r?... t^tp... The (outer) product of two tensors A and ?
sls2??' u1u2... is defined as the array C given by
S,Sp. . .U-^.
. . "~
S-jS?... U-jUp... '
This product is again a tensor, of rank (p' + p", q' + q") if (p',q') and
(p",q") are the ranks of A and B.
Lower rank tensors may be derived from higher rank tensors by con-
traction, i.e. by pairwise identification of upper and lower indices (which
implies a summation).
The parameter space as a manifold. The parameter space O may be
viewed as a (pseudo-) Riemannian manifold with (pseudo-) metric determined by
a metric tensor ?, i.e. ? is a rank 2 covariant, regular and symmetric tensor.
o The associated Riemannian connection ? is determined by the Christoffel symbols
?t rrs
where
?t tu ? r = f r
rs ? rsu
and
?rst =
*<Vst *
Vrs +
Vrt>? {2J6)
If ? is any affine connection with connection symbols r then
these symbols satisfy
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 107
y? a, = r* a. (2.17) ar
s rs t v ;
and the transformation law
G?>) -
[?(?>?,>?? +
?*po]*;t . (2.18)
On the other hand, any set of functions [r ] which satisfy the law (2.18)
constitute the connection symbols of an affine connection on O. It follows that
all affine connections on ? are of the form
t ?t t r = rL + S (2.19) rs rs rs v????*/
where the S are characterized by the transformation law
Sp>) =
S?s<??%%*/t * (2?20)
If, for a given metric tensor f, we define r . and S . by
G j. = ru f. and S . = Su f. rst rs^tu rst rs^tu
then (2.18), (2.19) and (2.20) are equivalent to, respectively,
G (?) = G?,^ + (?)?/ ?/ ?/ + F4...(?)?, ?7 (2.21) ?st?
' rst ' /? /s /t tu /?s /t
rrst ?
?rst +
Srst <2'22>
and
?st rst /? /s /t = S .?, ?# ?, . (2.23)
Thus, in particular, [S J is a tensor.
Suppose ?:3 -> ? is a mapping of full rank from an open subset ? of
a Euclidean space of dimension d? < d into O. Then ? is said to be an immer-
sion of ? in O. We denote coordinates of 3 by 3a,3 , etc. If f is a metric
tensor on ? then the metric tensor on ? induced from ? by ? is defined by
*ab(6) =
*rsU)Wb ? (2?24)
If ?* (?) is a connection on O and if r = r" ?. then the induced connection
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
108 O. E. Barndorff-Nielsen
on ? is defined by ^(3) =
rab(j(eHCd(3) and by
rabc(3) =
rrst(u))w/aw/bw/c +
*tu%bw/c ' (2'25)
Let G be a group acting smoothly on the parameter space. A metric
tensor f is said to be (G-) invariant if
FG5(?) =^??!1fG?5,(9?)^??1_, geG. (2.26) 3? d?a
For a given g let a new parametrization be introduced by ? = go*. From the
transformation law for tensors it follows that F is invariant if and only if
FGd(?) =
FG$(9?), geG. (2.27)
(On the left hand side the tensor is expressed in ? coordinates, on the right
hand side in ? coordinates.) Similarly, a connection r is said to be invariant
1f rJsU)
= r?;s(gu)), g?G. (2.28)
The pseudo-Riemannian connection derived from an invariant metric tensor is
invariant.
In generalization of (2.27) an arbitrary covariant tensor A,
is said to be (G-) invariant if V"rq
A? ? (?) = A (gui), geG. rr..rq rr..rq
If r is a G-invariant connection and if ? and S . are G- ?? G ? G w U
invariant tensors, with ? being a metric tensor, then r defined by
^t t . tU/K r = r + ? S rs rs ? rsu
is a G-invariant connection.
Now, let ? be the information tensor i on O. Then (2.16) takes the
form
?rst ?
E{1rsV +
*ilrVt>.
Obviously,
Trst =
Eilr1slt} (2.29)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 109
satisfies (2.23) and hence, for any real a an affine connection is defined by
"rst=E{1rsV+?E{VsV? <2?30>
These are the a-connections introduced and studied by Chentsov (1972) and
Amari (1982a,b, 1985, 1986).
However, we shall be mainly concerned with another type of connec-
tion, determined from observed information, more specifically from the metric
tensor j-, see sections 6-8. We refer to i and # as expected and observed in-
formation metric on M, respectively.
Suppose, as above, that ?:3 ?+ ? is an immersion of ? in O. The
submodel Mq of M obtained by restricting ? to lie in O = ?(?) has expected
information
iM'WiMW' (2?31)
Thus i(3) equals the Riemannian metric induced from the metric i(?) on O to
the imbedded submanifold ?0? Furthermore, the a-connection of the model M~
equals the connection on ?0 induced from the a-connection on O, by the general
construction (2.25).
The measures on ? defined by
and
|?|\?? (2.32)
???^a? (2.33)
are both geometric measures, relative to expected and observed information
metric, respectively. Note that (2.33) depends on the value of the auxiliary
statistic a. We shall speak of (2.32) and (2.33) as expected and observed
information measure, respectively. It is an important property of these mea-
sures that they are parametrization invariant. This property follows from
the fact that i and ?r are covariant tensors of rank 2. As a consequence we
have that c|j| L (of (2.7)) is parametrization invariant.
Invariant measures. A measure y on ?, is said to be invariant with
respect to a group G acting on X^ if gy = y for all geG.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
110 O. E. Barndorff-Nielsen
Invariant measures, when they exist, may often be constructed from
a quasi-invariant measure, as follows.
A measure ? on X is called quasi-invariant with multiplier
? = x(g?x) "if Qy and y are mutually absolutely continuous for every geG and if
d(g_1y)(x) = x(g,x)dy(x).
Furthermore, define a function m on X to be a modulator with associated
multiplier x(g,x) if m is positive and
m(gx) = x(g,x)m(x).
Then, if yx is quasi-invariant with multiplier x(g,x) and if m is a modulator
with the same multiplier we have that
? ? y = m yA
is an invariant measure on IX.
As quasi-invariance is clearly a very weak property the problem in
constructing invariant measures lies mainly in finding appropriate modulators.
It is usually possible to specify the modulators in terms of Jacobians.
In particular, in applications it is often the case that X^ is an
open subset of a Euclidean space. By the standard theorem on transformation
of integrals, Lebesgue measure ? on X is then quasi-invariant with multiplier
J /a\(x). Under mild conditions an invariant measure on X^ is then given by
dy(x) = <]?(2)(?G?<??(?).
(2.34)
Here J , ? denotes the Jacobian determinant of the mapping y(g) of iX onto itself
determined by geG and (z,u) constitutes an orbital decomposition of x, i.e.
(z,u) is a one-to-one transformation of ? such that ?e_? and u is maximal
invariant while ze& and x=zu. For a more detailed discussion see section 3
and appendix 1.
Transformation models. Let G be a group acting on the sample space
X. If the class ? of probability measures given by the statistical model is
invariant under the induced action of G on the set of all probability measures
on iX then the model is called a composite transformation model and if ?
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 111
consists of a single orbit we use the term transformation model. For a
composite transformation model, G acts on _P and we may, of course, equally
think of G as acting on the parameter space O. A parameter (function) ? which
is maximal invariant under this action is said to be an index parameter.
Virtually all composite transformation models of interest have the property
that after minimal sufficient reduction (and possibly after deletion of a null
set from _X) there exists a sub-group ? of G such that ? is the isotropy group
for a point on every one of the orbits of _X and of O. Each of these orbits is
then isomorphic to the homogeneous space G/K = {gK.^G} of left cosets of K.
For a transformation model the information measures (2.32) and
(2.33) are invariant measures relative to the action of G on O induced from the
action of G on X via the maximum likelihood estimator ?, which is an equivariant
mapping from _X to O. This action is the same as the above-mentioned action of
G on ? ? ? and also the same as the natural action of G on G/K ? ?.
It follows that relative to information measure on O the formula
(2.7) for the conditional distribution of ? is simply cL. From this it may be
shown that, with the auxiliary a as the maximal invariant statistic, ?*(?,?|a)
is exactly equal to ?(?;?|a).
These results are shown in outline in Barndorff-Nielsen (1983). A
more general statement will be derived in section 5.
Exponential models. A (k,d) exponential model has model function of
the form
p(x;u>) = exp{e(u>)-t(x) - ?(?(?)) - h(x)}. (2.35)
Here k is the order of the model (2.35) and is equal to the common dimension
of the vectors ?(?) and t(x), while d denotes the dimension of the parameter ?.
The full exponential model generated by (2.35) has model function
p(x;e) = exp{e-t(x) - ?(?) - h(x)} (2.36)
and ?(?) is the cumulant transform of the canonical statistic t = t(x). From
the viewpoint of inference on ? there is no restriction in assuming ? = t,
since t is minimal sufficient, and we shall often do so. We set t = t(?) = Et,
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
112 O. E. Barndorff-Nielsen
i.e. t is the mean value parameter of (2.36), and we write ? for x(int0)
where T denotes the canonical parameter domain of the full model (2.36).
Let f be a real differentiable function defined on an open subset
k t of R . The Legendre transform f of f is defined by
fT(y) = x-y-f(x)
where
y = (Df)(x) =|f(x) .
The Legendre transform is a useful tool in studying various, dualistic aspects
of exponential models (cf. Barndorff-Nielsen (1978a), Barndorff-Nielsen and
Blaesild (1983a)).
In particular, we may use the Legendre transform to define the -1
dual likelihood function 1 of (2.35) by
-1 1 (?) = ??t(?) - 1(t(?)). (2.37)
Here, and elsewhere, ' as top index indicates maximum likelihood estimation
under the full model. Further, in this connection we take 1 as the sup-log-
likelihood function of (2.36) and then 1 is, in fact, the Legendre transform of
?. Note that for t = t(?) e ? we have 1(t) = ??t - ?(?). An inference
methodology, parallel to that of likelihood inference for exponential families,
may be developed from the dual likelihood (2.37). The estimates, tests and
confidence regions discussed by Amari and others under the name of a = -1 (or
mixture) procedures are, essentially, part of the dual likelihood methodology.
More generally, based on Amari's concepts of a-geometry and a- a
divergence, one may for each ae[-1,1] introduce an "a-likelihood" L by
L(?>) = L(a>;t) = exp{-Da(e,e(?)))> (2.38)
where
Da^> =
W$#? <2?39>
Here ?(?;?) is given by (2.36) and the function f is defined as
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 113
? log ?, a = 1
f (?) = 4 {1.?(?a)/2}> _1<a<1
a ? c 1-a
-log ?, a = -1
(2.40)
a a
Letting 1 = log L we have, in particular,
1 1(?) = 1(?) = -?(?,?) = ?-t - ?(?) - ?(t) (2.41)
and -1
1(?) = -?(?,?) = ??t - ?(t) - ?(?) (2.42)
where I denotes the discrimination information. Furthermore, for -1<a<1,
1(e) ?-^ [e ? 2 2 2
_1L l-a?
Affine subsets of T are simple from the likelihood viewpoint while,
correspondingly, affine subsets of ? are simple in dual likelihood theory. Dual
affine foliations, of T and ? respectively, are therefore of some particular
interest. Such foliations have been studied in Barndorff-Nielsen and Blaesild
(1983a), see also Barndorff-Nielsen and Blaesild (1983b).
Suppose that the auxiliary component a of (?,a) is approximately or
exactly distribution constant, i.e. a is ancillary. For instance, a may be the
affine ancillary or the directed log likelihood ratio statistic, as defined in
Barndorff-Nielsen (1980, 1986b). We may think of the partitions generated,
respectively, by a and ? as foliations of T, to be called the ancillary
foliation and the maximum likelihood foliation. (Amari's ancillary subspaces
are then, in the present terminology and for a = 1, leaves of the maximum like-
lihood foliation.)
Exponential transformation models. A model M which is both trans-
formational and exponential is called an exponential transformation model. For
such models we have the following structure theorem (Barndorff-Nielsen,
Blaesild, Jensen and Jorgensen (1982), Eriksen (1984b)).
Theorem 2.1. Let M be an exponential transformation model with
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
114 O. E. Barndorff-Nielsen
acting group G. Suppose X_ is locally compact and that t is continuous. Fur-
thermore, suppose that G is locally compact and acts continuously on _X.
Then there exists, uniquely, a k-dimensional representation A(g) of
G and k-dimensional vectors B(g) and B(g) such that
t(gx) = t(x)A(g) + B(g) (2.43)
e(g) = eteWg"1)* + 8f(g) (2.44)
where ee& denotes the identity element. Furthermore, the full exponential model
generated by M is invariant under G, and &* = {[A(g" )*,&(g)]: geG} is a group of
affine transformations of R leaving T and into invariant in such a way that
e(gP) = e?PjA?g"1)* + B(g), geG, ?e? .
Dually, G = ?[A(g),B(g)]^G} is a group of affine transformations leaving
C = cl conv t( X_ ) as well as ? = x(inte) invariant. Finally, let 6 be the
function given by
6(g) = ?(Q(e))a(Q(g))-?exp('Q(g)M9)). (2.45)
We then have
a(e(gP)) = a(0(P))o(g)"1exp(-e(gP).B(g)). (2.46)
Exponential transformation models that are full are a rarity.
However, important examples of such models are provided by the family of Wishart
distributions and the transformational submodels of this.
In general, then, an exponential transformation model M is a curved
exponential model. It is seen from the above theorem that the full model M
generated by M is a composite transformation model and that, correspondingly,
M (and, hence T and T) is a foliated manifold with M as a leaf. It seems of
interest to study how the leaves of this foliation are related geometric-
statistically. Exponential transformation models of type (k,d), and in partic-
ular those of type (2,1), have been studied in some detail by Eriksen (1984a,c).
In the first of these papers the Jordan normal form of a matrix is an important
tool.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 115
Many of the classical differentiable manifolds with their associated
acting Lie groups are carriers of interesting exponential transformation models.
Instances of this are compiled in table 2.1.
Analogies between exponential models and transformation models.
There are some intriguing analogies between exponential models and transforma-
tion models.
Example 2.3. Under a d-dimensional location parameter model, with
? as the location parameter and for a fixed value of the (ancillary) configura-
tion statistic, the possible score functions are horizontal translates of each
other.
On the other hand, under a (k,d) exponential model, with ? as a
component of the canonical parameter and provided the complementary part of the
canonical statistic is a cut, the possible score functions are vertical trans-
lates of each other. (For details, see Barndorff-Nielsen (1982)).
Example 2.4. Suppose ? is one-dimensional. If ? is the location
parameter of a location model then the correction term C,, given by (2.10),
takes the simple form
1 ?(4) j(3)2 C1
= - 24 {3 -^-
+ 5 :3 > .
Exactly the same expression is obtained for a (1,1) exponential
model with ? as the canonical parameter.
(This was noted in Barndorff-Nielsen and Cox (1984)).
Maximum estimation. Suppose that for a certain class of models we
have an estimation procedure according to which the estimate ? of ? is obtained
by maximizing a positive function ? = ?(?) = ?(?;?) with respect to ?. Let
m = log M and suppose that
? = -[3rasm](20 (2.47)
is positive definite. We shall then say that we have a maximum estimation pro-
cedure. Maximum likelihood estimation and dual maximum likelihood estimation -1
(where m(u>) = 1(?) = ??t(?) - 1(?), cf. (2.37)) are examples of this. More
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
116 O. E. Barndorff-Nielsen
generally, minimum contrast estimation, as discussed by Eguchi (1983), is of
this type.
Suppose that M depends on ? through the minimal sufficient statis-
tic only and let a be an auxiliary statistic such that (?,a) is minimal suf-
ficient. In generalization of (2.7) we may consider
p*(2f;u)|a) = ?\?\\/?9 (2.48)
as a possible approximation to ?(?;?|?). Here t = iQ) and c is a norming
constant, determined so as to make the integral of the right hand side of
(2.48) with respect t? ? equal to 1.
It will be shown in section 5 that (2.48) is exactly equal to
?(?;?|a) for a considerable range of cases.
Finally, it may be noted that by an argument of analogy it would
seem rather natural to consider the modification of (2.48) in which the func-
tion M is substituted for the likelihood function L. While this approach is
not without interest its general asymptotic degree of accuracy is only 0(n )
-1 -3/2 in comparison with 0(n~ ) or 0(n"
' ) for (2.48). Also, for transformation
models this modification is exact in exceptional cases only.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 117
? ? (O
? F ? O S1 F
? * ?: U ??
4J F U co f S 0)
?H
U ?P r< I
x
o ?a & 33
?S C
en
r-\ O <H ?H
(H 05 U ?H ? ? o?
?
F ? <d ft co u s
u cd ft
o 1 >i W
JJ ? <d u * ? :d f ?*?*
3?1 ?
I
F H
CO PS ?P -H C ? ?? 3
I
f > ??
ss
8 ? ft
F -H 4J en
m
co -P (O
t? o
f O
t? O
?? ?M
O en
o H Cn
? ?H ?? ? ?d a*
U + CQ
O CO
? O
U
?
o
(d ? ?> o
o co
?a ? S
o
r-i I
o co
g.
Il
S
?H ?M ? F ? O
O
f ? co
fi ?? ?
?? ? ? F ? -? ? -? m -?
? F ? -? ? -?
?* ? (0 ?
f F
(0 ? ?? ?? ??
CO
? ?? ?
CQ
?
CM
F|
?-? ?
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
3. TRANSFORMATION MODELS
Transformation models were introduced in section 2. For any ?e?
the set Gx = {gxigeG} of points traversed by ? under the action of 6 is termed
the orbit of x. The sample space )Ms thus partitioned into disjoint orbits,
and if on each orbit we select a point u, to be called the orbit representative,
then any point ? in iX can be determined by specifying the representative u of
Sx and an element zeG such that ? = zu. In this way ? has, as it were, been
expressed in new coordinates (z,u) and we speak of (z,u) as an orbital decompo-
sition of x.
The orbit representative, or any one-to-one transformation thereof,
is a maximal invariant - and hence ancillary - statistic, and inference under
the model proceeds by first conditioning on that statistic.
The action of G on a space _X is said to be transitive if ^consists
of a single orbit and free if for any pair g and h of different elements of G
we have gx j hx for every xeX. Note that after conditioning on a maximal
invariant statistic u we have a transitive action of G on the conditional sample
space. For any ?e_? the set Gx = {g:gx = x) is a subgroup, called the isotropy
group of x. The space X_is said to be of constant orbit type if it is possible
to select the orbit representatives u so that G is the same for all u.
The situation is particularly transparent if the action of G on the
sample space ?X is free. Then for given ? and u there is only one choice of ZeG
such that ? = zu, and X, is thus representable as a product space of the form
U ? G where U is the subset of ^consisting of the orbit representatives u.
Note that u and ? as functions of ? are, respectively, invariant and equivariant
118
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 119
? .e.
u(gx) = u(x), z(gx) = gz(x).
It is o'ten feasible to construct an orbital decomposition by first finding an
equivariant mapping ? from X_ onto G and then defining the orbit representative
u for ? bv
? = z" x.
In particular, the maximum likelihood estimate g of g is equivariant, and may be
used as ? provided g(x) exists uniquely for every ?e_? and g(X) = G. In this
case, G's action on ? must also be free.
However, we shall need to treat more general cases where the actions
of 6 on X and on IP are not necessarily free.
Let ? and ? be subsets of G. We say that these constitute a
factorization of G if G is uniquely factorizable as
G = HK
in the sense that to each element geG there exists a unique pair (???)e??? such
that g = hk. We speak of a left factorization if, in addition, ? is a subgroup
of G, and similarly for right factorization. If a factorization is both left
and right then G is said to be the product of the groups H and K. An important
example of such a product is afforded by the well-known unique factorization of
a regular ? ? ? matrix A into a product UT of an orthogonal matrix U and a
lower triangular matrix with positive diagonal elements, i.e., using standard
notations for matrix groups, GL(n) is the product of 0(n) and T+(n).
A relevant left factorization is often generated in the following
way. Let ? be a member of the family P^ of probability measures for a transform-
ation model M, and let ? be the isotropy group Gp, i.e.
? = {geG:gP = P}.
For each ?e?^ we may select an element h of G such that ? = hP, and letting ? be
the set consisting of these elements we have a (left) factorization G = HK.
(In a more technical wording, the elements h are representatives of the left
cosets of K.) Note that G? =
hGph , and that the action of G on ? is free if
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
120 O. E. Barndorff-Nielsen
and only if ? consists of the identity element alone. The quantity h para-
metrizes f\
Suppose G = HK is a factorization of this kind. For most transform-
ation models of interest, if the action of G on X is not free then there exists
an orbital decomposition (z,u) of ? with ?e? and such that for every u the iso-
tropy group G equals ? and, furthermore, if ? and z% are different elements of
? then zu f z'u.
Example 3.1. Hyperboloid model. This model (Barndorff-Nielsen
(1978b), Jensen (1981)) is analogous to the von Mises-Fisher model but pertains
k-1 k to observations ? on the unit hyperboloid ? of R , i.e.
? k-1
{x:x*x = 1, Xq>0}
where ? = (xq,x,,...,x. ,) and * denotes the non-definite scalar product of
vectors in R which is given by
x*y = x0y0-x1y1-...-xk_1yk_r
The analogue of the orthogonal group 0(k) is the so called pseudo-
orthogonal group 0(1,k-1), which is the subgroup of GL(k) with matrix represent-
ation
0(1,k-1) = {U:U* I U = I}
where ? denotes the k ? k diagonal matrix
1 0
0 -1
0
0 .... -1
For k = 4 this is the Lorentz group of relativistic physics. Topologically,
the group 0(1,k-1) has four connected components, of which one is a subgroup of
0(1,k-1) and is defined by
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 121
SO+(l,k-l) = {lfcO(l,k-l):|U| = 1, uQQ>0}
(the elements of U are denoted by u.., i and j = 0,1,...,k-l). This subgroup ' J k-1
is called the special pseudo-orthogonal group and it acts on H by (U,x) -*xU*
k-1 (vector-matrix multiplication). The points of H can be expressed in hyper-
bolic-spherical coordinates as
Xq = cosh u
x, = sinh ? cos v,
Xp = sinh u sin v-. cos Vp
?. , = sinh u sin v, ... sin v. 2 ,
k-1 + and an invariant measure ? on ? , relative to the action of SO (l,k-l), is
specified by
k-2 k-3 dy = sinh u sin v, ... sin v. - dudv, ... dv. 2- (3.1)
The hyperboloid model function, relative to the invariant measure
(3.1) on Hk"\ is
?(?;?,?) = ak(x)e"x?*x (3.2)
where the parameters ? and ?, called the mean direction and the precision,
k-1 satisfy ?e? and ?>0, and where
ak(x) =
??</2-1/{(2p),</2-12?|</2.1(?)} (3.3)
with K. i2 ? ? Bessel function.
For any fixed ?, the hyperboloid distributions (3.2) constitute a
transformation model under the action of S0f(l,k-1), and the induced action on
the parameter space is (?,?) -> ??* (vector-matrix multiplication). The isotropy
group ? of the element ? = (1,0,...,0) may be identified with SO(k-l). Further-
more, S0f(l,k-1) can be factored as
S0*(l,k-1) = HK = H SO(k-l)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
122 O. E. Barndorff-Nielsen
where the matrix representation of ??e? is
h =
1 + l+xr
xlx2 1+??
Vl
X2X1 1+Xn
Xk-lXl 1+Xn
1 +
Vl
xlxk-l ?+??
X2Xk-l l+xr l+xr
xk-lx2 1+Xn
.k-1
1 + Ak-1 1+Xn
(3.4)
for ? = (xQ,x..,... ,?. , ) varying over ?
" . In relativity theory a Lorentz
transformation of the type (3.4) is termed a "pure Lorentz transformation" or
a "boost." (It may be noted that S0f(l,k-1) can equally be factored as KH with
the same ? and H as above.)
We have already mentioned the concept of equivariance of a mapping
from X_ onto G. More generally, if s is a mapping of X onto a space S and if
s(x) = s(x') implies s(gx) = s(gx') for ?,?'e?^ and all geG then s is said to be
equivariant. In this case we may define an action of G on S by gs = s(gx)
for s = s(x) and for any ?e?., and we speak of this as the action induced by s.
In the applications to be discussed later S is typically the parameter domain
under some parametrization of the model and s is the maximum likelihood estima-
tor, which is automatically equivariant.
We are now ready to state the results which constitute the main
tools of the theory of transformation models.
Subject to mild topological regularity conditions (for details, see
Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982)) we have
Lemma 3.1. Let u be an invariant statistic with range space U =
uOO, let s be an equivariant statistic with range space S = sQO, and assume
that the induced action of G on S is transitive. Furthermore, let y be
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 123
invariant measure on IL Then, we have (s,u)QO = S ? U and
(S,u)y = v x ?
where ? is an invariant measure on S and ? is some measure on U.
Suppose r, s and t are statistics on X^ (in general vector-valued).
The symbol rx s|t is used to indicate that r and s are conditionally indepen-
dent given t.
Theorem 3.1. Let the notations and assumptions be as in lemma 3.1,
and suppose that the transformation model has a model function p(x;g) relative
to an invariant measure ? on X such that p(x) = p(x;e) is of the form
p(x) = q(u)r(s,w) (3.5)
for some functions q and r and some invariant statistic w which is a function
of u.
Then the following conclusions are valid.
(i) The model function p(x;g) is of the form
p(x;g) = q(u)r(g"]s,w), (3.6)
and hence the statistic (s,w) is sufficient.
(ii) We have
s i u|w.
(iii) The invariant statistic u has probability function
p(u) = q(u)/r(s,w)dv(s) <p> (3.7)
(where ? is invariant measure on S).
(iv) The conditional probability function of s given w is
p(s;g|w) = c(w)r(g" s,w) <v> (3.8)
where c(w) is a norming constant.
It should be noted that the theorem covers the case where no suffi-
cient reduction is available (take q constant and w = u) as well as the case
where s - typically the maximum likelihood estimator - is sufficient (take w
degenerate). Note also that theorem 3.1 does not assume that the action of G
is free. If, however, the action is free and if (z,u) is an orbital decompo-
sition of ? then the theorem applies with s = z.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
124 O. E. Barndorff-Nielsen
Example 3.2. Hyperboloid model (continued). Let x-,,...,? be a
sample from the hyperboloid distribution (3.2) and let ? = (?,,...,? ) and
x+ = x,+ ... +x . Considering ? as fixed, theorem 3.1 applies with u as the
maximal invariant statistic, s = x+// x+*x+ and w = / x+*x+ . In particular,
it turns out that the conditional distribution of s given w (or, equivalently,
given u) is again a hyperboloid distribution, with mean direction ? and pre-
cision wx. This is in complete analogy with the von Mises-Fisher situation,
and accordingly s and w are termed the mean direction and the resultant length
of the sample. For details and further results see Jensen (1981) and Barndorff-
Nielsen, Blaesild, Jensen and Jorgensen (1982).
Lemma 3.1 and theorem 3.1 are formulated in terms of invariant
dominating measures on X^ and S. In applications, however, the probability func-
tions are ordinarily expressed relative to Lebesgue measure - or, more general-
ly, relative to geometric measure when the underlying space is a differentiable
manifold. It is therefore important to have a formula which gives the relation
between the two types of dominating measure.
Let ? be an action of G on a space ? and suppose Y_ has constant
orbit type under this action. Then there exists a subgroup ? of G, a subset ?
of G and an orbital decomposition (z,u) of ye? such that G = ? and ?e? for
every y. We assume that ? can be chosen so that HK constitutes a (left)
factorization of G. If ? is a differentiable manifold and if ? acts differen-
ti ably on ? then an invariant measure y on ? can typically be constructed from
geometric measure ? on _Y, by means of Jacobians. In particular, if ?_ is an
open subset of some Euclidean space Rr, so that ? is Lebesgue measure, then
y defined by
dy(y) = Jy{z)(u)']<lx(y)
(3.9)
will be invariant; here J / % denotes the Jacobian determinant of the mapping
y(g) of ? onto itself. A proof of this is sketched in appendix 1.
Example 3.3. Hyperboloid model (continued). We show here how the
k-1 invariant measure (3.1) on the unit hyperboloid H may be derived from
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 125
Lebesgue measure. For simplicity, suppose k = 3. The manifold H2 is in one-
to-one smooth correspondence with R through the mapping
2 2 ?? -> R?
F:
(x0,xrx2) ->
(xrx2)
2 * and we start by finding an invariant measure on R . The action of SO (1,2) on
2 9 ? is given by (U,x) -> xU* and the induced action on R is therefore of the
form (U,y) + f(f~ (y)U*). These actions are transitive, and if we take
u = (0,0) as the orbit representative of R and let ? be the boost
1 + yly2
1+Vn i+y,
y2yl 1 +
0
A
(3.10)
i+yf
y 2 2
1 + y-. + y2? then (u,z) constitutes an orbital decomposition of
2 yeR of the type required for the use of formula (3.9). Letting ? denote the
2 / 2 2~~ action of SO (1,2) on R one finds that J'(z\(u) =^ 1 + Y-i +
Y2 and hence the
measure
dy(y) ?y}?y2
2 is an invariant measure on R . Shifting to hyperbolic-spherical coordinates
(u,v) for (y-j,y?) this measure is transformed to (3.1) with k = 3.
Below and in sections 4 and 5 we shall draw several important con-
clusions from lemma 3.1 and theorem 3.1. Various other applications may be
found in Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982).
Corollary 3.1. Let G = HK be a left factorization of G such that
? is the isotropy group of p. Thus the likelihood function depends on g through
h only. Suppose theorem 3.1 applies with S = H and let L(h) = L(h;x) be any
version of the likelihood function. Then, the conditional probability
function of s given w may be expressed in terms of the likelihood function as
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
126 O. E. Barndorff-Nielsen
p(s;h|w) = c(w)y|j
<v> . (3.11)
In formula (3.11) the likelihood function changes with the value of
s. However, an alternative expression for the conditional probability function
is available which employs only the single observed likelihood function. Sup-
pose for simplicity that ? consists of the identity element alone, so that
S = G. Further, let xQ denote the observed point in X^ and write Lf?(g) for
L(g;xQ). Also, for specificity, let the action of G on S = G be the so called
left action of G on itself, i.e. a geG acts on a point $e$ simply by multiply-
ing s on the left by g, in the group theoretic sense. (Thus, the two possible
interpretations of the symbol gs coincide). The situation here specified
occurs, in particular, if the action of G on X is free and if s is the group
component of an orbital decomposition of x. Setting sQ =
s(xQ) and wQ =
w(xQ),
we are interested in the conditional distribution of s given w = wQ
and by
(3.6) and (3.11) this may be written as
L0(s0rl9) P(s;g|w0)
= c(w0)?T-(i-)-
<a> ,
the invariant measure being denoted here by a, as a standard notation for left
invariant measure on G. This formula, which generalizes a similar
expression for the location-scale model due to Fisher (1934), shows how the
"shape and position" of the conditional distribution of s is simply determined
by the observed likelihood function and the observed sQ, respectively.
Formula (3.11), however, besides being slightly more general, seems
more directly applicable in practice.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
4. TRANSFORMATIONAL SUBMODELS
Let M be a transformation model with acting group G. If Pn is any
of the probability measures in M and if GQ
is a subgroup of G then P^ =
{gP0:9eGn* de'f'1'nes a transformation submodel M~ of M. For a given GQ the col-
lection of such submodels typically constitutes a foliation of M.
Suppose G is a Lie group, as is usually the case. The one-parameter
subgroups of G are then in one-to-one correspondence with TG , the tangent
space of G at the identity element e, and this in turn is in one-to-one corre-
spondence with the Lie algebra ? of left invariant vector fields on G. More
generally, each subalgebra h of the Lie algebra of G determines a connected
subgroup H of G whose Lie algebra is h (cf., for instance, Boothby (1975) chap-
ter 4, theorem 8.7). If ?e?? , the one-parameter subgroup of G determined by
A is of the form {exp(tA)^R}. In general, the subgroup of G determined
by r linearly independent elements A,,...,A. of TG may be represented as
exp?^A^.^exp?t A }.
Example 4.1. Let M be a location-scale model,
? ? ?(??,...,??;?,s)
= s"? ? f(s~ (? -y)). (4.1) 1 ? 1=1 ?
Here G is the affine group with elements l\i9o~\ which may be represented by
2?2 matrices
1 0
? s
the group operation being then ordinary matrix multiplication. The Lie algebra
of G, or equivalently TG , is represented as the set of 2 ? 2 matrices of the
127
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
128 O. E. Barndorff-Nielsen
form
A = 0 0
b a , a^R.
We have
etA = I + tA + ~|- t2A2 +..,
b/a(eta-l) eta
where the last expression is to be interpreted in the limiting sense if a = 0.
There are therefore four different types of submodels. Specifical-
ly, letting Uq^q) denote an arbitrary value of (m9o) and taking PQ as the
corresponding measure (4.1) we have
(i) If a = 0 then ?~ is a pure location model.
(ii) If a f 0, b = 0 and yQ = 0 then Pq is a pure scale model.
(iii) If a j= 0, b = 0 and ?O f 0 then M~ may be characterized as
the submodel of M for which the coefficient of variation y/s is constant and
equal to Uq/oq.
(iv) If both a and b are different from 0 then P~ may be character-
ized as the submodel ?L? of M for which s~ (y+b/a) is constant and equal to
c0 =
s0 (??+^a)' ???# ?^ we ^et c = ^a ^en Mn 1S determined by
s" (y+c) = CQ. (4.2)
on
Letting F denote the distribution function of f we can express (4.2) as the
condition that (y,a) is such that -c is the F(-c0)-quantile of the distributi
o-]f(o-\x-A).
The above example is prototypical in the sense that G is generally
a subgroup of the general linear group GL(m) for some m and TG may be repre-
sented as a linear subset of the set M(m) of all m ? m matrices.
Example 4.2. Hyperboloid model. The model function of the hyper-
boloid model with k = 3 and a known precision parameter ? may be written as
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 129
?(?,?;?,f) = (2^-Vs?nh u e-X{coshx cosh u"s1nhx sinh u cos(v-+)} (43)
where u > ?, ?e[0,2p) and ? > 0, fe[0,2p). The generating group G = S0f(l;2)
may be represented as the subgroup of GL(3) whose elements are of the form
0
COS4
-sind
0
sin4
COSd
coshx sinhx 0
sinhx coshx 0
0 0 1
1+^
? -?
(4.4)
where -??<?<-??. This determines the so called Iwasa decomposition (cf., for
instance, Barut and Raczka (1980) chapter 3) of S0*(l;2) into the product of
three subgroups, the three factors in (4.4) being the generic elements of the
respective subgroups. It follows that TG is the linear subspace of M(3) gen-
erated by the linearly independent elements
Ei -
r
E3 =
0 1
0 1
-1 0
Each of the three subgroups of the Iwasawa decomposition generates
a transformational foliation of the hyperboloid model given by (4.3), as dis-
cussed in general terms above. In particular, the group determined by the
third factor in (4.4) yields, when applied to the distribution (4.3) with
? = F = 0, the following one-parameter submodel of the hyperbolic model:
?(?,?;?)
2 (2 G^? "x(cosh u"^sinh u e"*5^ ^cosh u~sinh u cos V)"2C sinn u sin v>
The general form of the one-parameter subgroups of SO (1;2) is
expit } ,
where a, b, c are fixed real numbers.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
5. MAXIMUM ESTIMATION AND TRANSFORMATION MODELS
We shall be concerned with those situations in which there exists an
invariant measure y on X that dominates P_9 where P^ = {gP^G} is transformation-
al. Letting
?(x) = P(x;g)
and writing p(x) for p(x;e) we have
p(x;g) = p(g" ?) <p>.
In most cases of interest the model has the following additional structure (pos-
sibly after deletion of a null set from _X , cf. also section 3). There exists
a left factorization G = ?? of G, a K-invariant function f on X_> and an orbit-
al decomposition (h,u) of ? such that:
(i) G = ? for all u and, furthermore, Gp = K. Hence, in particu-
lar, ? may be viewed as the parameter space of the model.
(ii) For every ?e_? the function m(h) = f(h" x) has a unique maximum
on ? and the maximum point is h.
(iii) ? may be viewed as an open subset of some Euclidean space R
and for each fixed ?e?^ the function m is twice continuously differentiable on H
and the matrix * = 'K(h) given by
is positive definite.
In these circumstances we have:
Proposition 5.1. The maximum estimator h is an equivariant mapping
130
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 131
of X. onto ? and the action of G on ? induced by ?? coincides with the natural
action of G on H. Furthermore, if the mapping ? -* (h,u) is proper then there
exists an invariant measure ? on H, and for any fixed u such a measure is given
by
dv(h) = |*|^? (5.1)
where dh indicates the differential of Lebesgue measure on H.
(iii).
Here ? is considered as an open subset of R , in accordance with
'Xj Proof. The equi variance of h follows immediately from (ii). Obvi-
ously, there is a one-to-one correspondence between the family of left cosets
G/K = {gK^G} and H. Let ? be the mapping from G/K to ? which establishes this
correspondence. The natural action ? of G on G/K is given by
G ? G/K ^ G/K
f:
(g,gK) -> ggK
and we have to show that when this action is transferred to ? by ? it coincides
with the action ? of G on ? induced by ?V. In other words, we must verify that
for any geG the diagram
G/K-y H
F(9) j [ ?(9) (5.2)
G/K-y H P
commutes. Let ? be the mapping from G to ? that sends a geG into the uniquely
determined ??e? such that g = hk for some keK. For any ft = ft(x) in H we have
that y(g)?i = ft(gx) is determined by
fUfiitgx)}"1 gx) l fin"1 gx), ?e?. (5.3)
Now, by the K-invariance of f,
fin"1 gx) = f((g-\r\) = fOitg'V'x)
and here n(g h) ranges over all of ? when h ranges over H. Hence (5.3) may be
rewritten as
f??n?rt?gx))}"^} * fUr'x), heH,
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
132 O. E. Barndorff-Nielsen
i.e., by (ii),
or, equivalently,
f?(x) = n(g"^(gx))
R(x)k = rtgxjK
and this, precisely, expresses the commutativity of (5.2), since ? (h) = hK.
When the mapping ? -> (??,?) is proper the subgroup ? is compact
because ? = Gu- Hence there exists an invariant measure on H, cf. appendix 1.
That |1t| dh is such a measure follows from (3.9) and formula (5.10) below.
In particular, then, there is only one action of G on H at play,
namely ?, and
y(g)h = n(gh). (5.4)
Now, let h -> ? be an arbitrary reparametri zation of the model and
let ?t?(?) = m(h(u))) and
*(?) =*(?>;u) = - ~* (?;??). (5.5) s?s?
This matrix is a (0,2) tensor on O.
We shall now show that
-fc(h) = -R(h;u) = J (e)~]\(e9u)? (e)"1. (5.6) Y(h) Y(h)
Here the unit element e is to be thought of as a point in H.
We have
m(h) = f(h"]x) = f?h"1^) = f({n(?V"1h)}'1u)
where, again, we have used the K-invariance of f. Thus, with ? as the projec-
tion mapping defined above we obtain
M?hixi {h) = i?teLl (n(firlh)) ??f?? (h) (5.7)
and
a2m(h;x) ,.* _ 3?(?tG1?) ,hx a2m(h;u) . /fr-lhU 3n(f?"1h)* ah ah* (h)-ah*
' (h) ahah* Mh h)) ah ahah* *"' ah*-~ X"J ~a??ah*~~ ^" "" dh <h>
ah {r]{ri n)) ahah* + Mhiui(n(rlh)) .
*2iCh)M - (5.8)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 133
In these expressions we have, since n(ft~ h) = ?(?G )h, that
Mf^1 (h) - J t (h). (5.9)
y(h )
On inserting ft for h in (5.7), (5.8) and (5.9) (whereby (5.7) becomes 0) and
combining with (2.1) we obtain (5.6).
From (5.6) we may draw two important conclusions.
First, taking determinants we have
\*(h,u)\h =
J?{h)(e)~]\ne;u)\h (5.10)
and this, by (3.9) and the tensorial nature of -K, implies that |*(?)| ?? is an
invariant measure on O. In connection with formula (5.10) it may be noted that
Jy(h)(e) =
J6(h)(e)
where d denotes left action of the group G on itself. A proof of this latter
formula is given in appendix 2.
Secondly, the tensor *(?) is found to be G-invariant, whatever the
value of the ancillary. In fact, by (5.4) we have, for any ?0e? and c^G,
Y(y(g)h)h0 = i(g) ?
?(?)?0?
Consequently
-rr^h^(e) = ?
A^{e) ???^ ?(y(g)h) ?(?) -y{g)
and this together with (5.6) and (2.26) establishes the invariance.
In particular, observed information ^determines a G-invariant
Riemannian metric on the parameter space. The expected information metric i
can also be shown to be G-invariant.
From proposition 5.1 and corollary 3.1 we find
Corollary 5.1. The model function p*(or,u)|u) = c|l<| L/t is exactly
equal to ?(?;?|?).
By taking m of (ii) equal to the log likelihood function 1 this
corollary specializes to theorem 4.1 of Barndorff-Nielsen (1983).
Suppose, in particular, that the model is an exponential transform-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
134 O. E. Barndorff-Nielsen
a ation model. Then the above theory applies with ??(?) = 1(?). The essential
a -1 property to check is that 1(?;?(?)) is of the form f(h x). This follows simply
a from the definition of 1 and theorem 2.1.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
6. OBSERVED GEOMETRIES
In section 2 we briefly reviewed how the parameter space of the
model M may be set up as a manifold with expected information i as Riemannian
metric tensor and with an associated family of affine connections, the a-con-
nections (2.30). We shall now discuss a similar type of geometries on the
parameter space, related to observed information and depending on the choice of
the auxiliary statistic a which together with the maximum likelihood estimator
? constitutes a minimal sufficient statistic for M. These latter geometries
are termed observed geometries (Barndorff-Neilsen, 1986a). In applications to
statistical inference questions it will usually be appropriate to take a to
be ancillary but a great part of what we shall discuss does not require dis-
tribution constancy of a and, unless explicitly stated otherwise, the auxil-
iary a is considered arbitrary (except for the implicit smoothness properties).
Let an auxiliary a be chosen. We may now take partial derivatives
of 1 = l(?>;u),a) with respect to the coordinates ? of ? as well as with respect
to ?G. Letting ? = 3/3?G we introduce the notation
1 = a a a a 1 (6.1 ) rr..Vsr..sq rr.. rpsr.. sq
and refer to these quantities as mixed derivatives of the log model function.
The function of ? and a obtained from (6.1) by substituting ? for ? will be
denoted by * . Thus, for instance, rr..rp,sr..sq
*rs;t =
*rs;t(w) =
*rs;tUa) =
?G5;?(?;?'?)?
More generally, for any combinant g of the form g(ar,u),a) we write
135
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
136 O. E. Barndorff-Nielsen
-f =-?K?)-,a) = g(a>;?),a).
This is in consistency with the notation $ introduced by (2.6). The observed
geometries, to be discussed, are expressed in terms of the mixed derivatives
*r r -s s ? (6'2)
rr..rp,sr..sq
So are the terms of an asymptotic expansion of (2.7), cf. section 7.
Given the observed value of a the observed information tensor 3-, of
(2.6), defines the parameter space of M as a Riemannian manifold. The Rieman-
?t ?t man connection determined by ^ has connection symbols $* given by &* =
.tu? a- *rstand
?rst =
*<Vst -
3As +
Vr^
Employing the notation established above we have d.?r = -?+_+ -*-.+> etc.
u G? rit F5)t
so that
1st =
*rs;t -
^Pst +
W3])? ^
As we shall now show, the quantity
*rst =
-(*rst+*rs;t[3]) (6?4)
is a covariant tensor of rank 3, i.e.
? - -f ,.4.^/ ?, ?, . (6.5) ?st rst /? /a /t
First, from (2.14) we have
^ ~ ^+?/ ?/ ?/ + ^?^? ?/ [3]. (6.6) ?st rst /? /s /t rs /oo /t1" J ? '
Further, from (2.13) we obtain, on differentiating with respect to ?t and then
substituting parameter for estimate,
\ . - ^^.4-?, ?, ?, + ?^.+U/ ?, . (6.7) ?s;t rs;t /? /s /t r;t /?s /t ? '
Finally, differentiating the likelihood equation
*r = ?
we find
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 137
or
*rs +
*r;s = ? <6?8)
*r;s =
*rs? <6'9)
Combination of (6.4), (6.6), (6.7) and (6.9) yields (6.5).
It follows from the tensorial nature of ? and from (6.3) and (6.9) a
that for any real a an affine connection ? on M may be defined by
at __ .tu a
prs ' * ?rsu
with
'rst-WT^rsr {6J0)
In particular, we have 1 -1
*rst =
*rs;t ' *rst
= V,rs
^^
where to obtain the latter expression we have used
rst rs;t rt;s r;st
which follows on differentiation of (6.8). It may also be noted that
1-11-1
and
3t*rs "
*rts +
*str "
^str +
^rts
a ,, 1 -, -1 ? = J+2L ? + L?* ? *rst 2 *rst 2 ^rsf
a The connections -f, which we shall refer to as the observed a-con-
a
nections, are analogues of the expected a-connections r given by (2.30). The
a a
analogy between r and jp becomes more apparent by rewriting the skewness tensor
(2.29) as
Vst=-E{1rst +
VtC3^
the validity of which follows on differentiation of the formula
E{lrs +
lry = 0, (6.12)
which, in turn, may be compared to (6.8).
Under the specifications of a of primary statistical interest one
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
138 O. E. Barndorff-Nielsen
has that, in broad generality, the observed geometries converge to the corre-
sponding expected geometries as the sample size tends to infinity.
For (k,k) exponential models
?(?;?) = a(e)b(x)e6't(x) (6.13)
no auxiliary statistic is involved since ? is minimal sufficient, and we find a a
j- = i and F = r, aeR.
Let i,j,k,... be indices for the coordinates of ?, t and t, using
upper indices for ? and lower indices for t and t.
In the case of a curved exponential model (2.35), we have
lr =
(t-x).ejr (6.14)
and, letting ? denote the maximum likelihood estimator of ? under the full model
generated by (2,35), the relation + = j* takes the form r, s rs
W") =
KiJ(6)9/r^/s
Furthermore,
*rstW '
-*1jk<e>e/re/se/t -
Kij<e)e/re/st[3] +
(*-Vjrst> <6-16>
WU) =
KiJ(e)e/rs*/t=4Vst (6?17)
and
^rs-^J^/t^rs-'rsf (6J8)
It is also to be noted that, under mild regularity conditions, the quantities
?r and ^possess asymptotic expansions the first terms of which are given by
and
2- = ? >st rst {^ke/rse/te/x^
+ V/rse/tAC33
+ V/rst6/^*???' (6?20>
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 139
where a , ? = l,...,k-d, are the coordinates of the auxiliary statistic a. For
instance, in the repeated sampling situation and letting aQ denote the affine
ancillary, as defined in Barndorff-Nielsen (1980), we may take a = ? a and
the expansions (6.19) and (6.20) are asymptotic in powers of ? . (For further
comparison with Amari (1982a) it may be noted that the coefficient in the first e e
order correction term of (6.19) may be written as ??.??.?.. = nH where ? ?? /rs /? ij rsA rsA
is Amari's notation for the exponential curvature, or a-curvature with a = 1, of
the curved exponential model viewed as a manifold imbedded in the full (k,k)
model. )
For a transformation model we find
lr(h;x) =
1G,(?(??);?)?(??)^
(cf. the more general formula (5.7)) and hence
+ W<esu>?s +<Kr "?s*
<6-22>
where, for 3 = 3/3hr and a = 3/3hr,
?? =
3snr(h_1h),
so that
while
<? = ?J (e)"1)., (6.23) S "?(h)
rs
Bst ?
V/(fi"lh?
?;t "
^sVr(h_1h)
B;st "
V/(fi"lh'?
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
140 O. E. Barndorff-Nielsen
Furthermore, to write the coefficients of 1 , ,.,(e;u) in (6.21) and (6.22) as
indicated we have used the relation
^/(h^hiL =
-3?G(^)| A . (6.24) s h=h
s h=h
Formula (6.24) is proved in appendix 3.
We now briefly consider four examples. In the first three the
model is transformational and the auxiliary statistic a is taken to be the max-
imal invariant statistic, and thus a is exactly ancillary. In the fourth ex-
ample a is only approximately ancillary. Examples 6.1, 6.3 and 6.4 concern
curved exponential models whereas the model in example 6.2 - the location-scale
model - is exponential only if the error distribution is normal.
Example 6.1. Constant normal fractile. For known ae(?,?) and
ce(-oo,oo)5 let ? denote the class of normal distributions having the real ?a,C
number c as a-fractile, i.e.
? . = {?(?,s2):(?-?)/s = U }, ?a,C a
where u denotes the a-fractile of the standard normal distribution, and let a
x, ,...,x be a sample from a distribution in ? . The model for x = (x,,... ,xi ? ? ?a,c ? ?
thus defined is a (2,1) exponential model, except for u = 0 when it is a (1,1)
model. Henceforth we suppose that u f 0, i.e. a f ^ The model is also a
transformation model relative to the subgroup G of the group of one-dimensional
affine transformations given by
G = {[c(l - ?),?]:?>0},
the group operation being
[c(l - x),x][c(l - ?'),??] = [c(l - ??'),??']
and the action of G on the sample space being
[c(l - x),x](xr...,xn)
= (c(l - ?) + xxr...,c(l
- ?) + ???).
(Note that G is isomorphic to the multiplicative group.)
Letting
a = (x - c)/s\
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 141
where ? = (??, +...+ xn)/n and
s"2 = 1 ? (x, - x)2. ? i=1
t
we have that a is maximal invariant and, parametrizing the model by ? = log s,
that the maximum likelihood estimate is
? = log(bs')
where
b = b(a) = (u /2)a + /l + {(u /2)2 + l}a2. a a
Furthermore, (?,a) is a one-to-one transformation of the minimal sufficient
statistic (x,s*) and a is exactly ancillary.
The log likelihood function may be written as
1(?) = lU;E,a) = ?[? -?- h{b2e2[^] + (ua
+ ab'V^)2}]
from which it is evident that the model for ? given a is a location model.
Indicating differentiation with respect to ? and ? by subscripts ?
and ?, respectively, we find
1 = ?{-1 + ?"2e2(?~?) + ab-1(u + ab~V"^)e^} ? a '
and hence
2r = n{2b"2 + ab"](u + 2ab-1)} a
* = n{4b"2 + ab"](ua
+ 4ab-1)}
? - = -n{4b~2 + ab_1(u + 4ab~])} = *
-9 -1 -1 "? ]
3c - = n{4b ? + ab '(u + 4ab
' )} = -p= --F
?',?? a
and the observed skewness tensor is
Jc = n{8b"2 + 2ab"1(u + 4ab-1)}. a
Note also that a 1
We mention in passing that another normal submodel, that specified
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
142 0. E. Barndorff-Nielsen
by a known coefficient of variation ?/s, has properties similar to those ex-
hibited by example 6.1.
Example 6.2. Location-scale model. Let data ? consist of a sample
x,,...,x from a location-scale model, i.e. the model function is
?(?;?,s) = s n
? fM?) 1=1
s
for some known probability density function f. We assume that {x:f(x)>0} is an
open interval and that g = -log f has a positive and continuous second order
derivative on that interval. This ensures that the maximum likelihood estimate
(?,s) exists uniquely with probability 1 (cf., for instance, Burridge (1981)).
Taking as the auxiliary a Fisher's configuration statistic
a = (a19...,an)
= ( ?^?
? -? ? ),
which is an exact ancillary, we find
3-(?5s) = s
and, in an obvious notation,
-2 Eg" (a ) zag? (a)
za g"(a ) n+sa g"(a )
Jr = ^~3Eg,M(a.) ???? * 1
Jc = -a"3za.g,M(a.) ??,s Ia l'
^?s,? = "s
"WU^^g'"^)}
^?s,s =
-s"3{2^9???) +
S???9,?(???)>
^,y =
-"3{4Eai9''(ai) +
^,M(ai)>
* = -s ss,s
3{2? + Azaria.)
+ zajg,,,(a1)}
* = a~3zg"'(a,)
-3 * =
a-^g"^.) +
S3?.9"'(3.)}
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 143
*?ss =
^""3?4zaig"(ai) + S329"'(?.)}
^sss = s~3{4? +
^F'^) + ^3g"
? (a. )>.
Furthermore,
Jr = 2s~3* ((0,l);a) ??? ???
Jr = -2s~3? ((0,l);a) + 2s~3? ((0,l);a) ??s ??? ??s??
' '
J = -4s"33- ((0,l);a) + 2s"3+ ((0,l);a) ss? ??s ss? '
* = -6s~3^ ((0,l);a) + 2s~3* ((0,l);a). sss ss sss
Example 6.3. Hyperboloid model. Let (u,,?-j),... ,(u ,? ) be a
sample from the hyperboloid distribution (4.3) and suppose the precision ? is
known. The resultant length is
2 2 9 \ a = {(? cosh ???)
- (? sinh u^ cos v..) - (? sinh u. sin v.) }
and a is maximal invariant after minimal sufficient reduction. Furthermore,
the maximum likelihood estimate (?,?) of (?,?) exists uniquely, with probabil-
ity 1, (a,?,f) is minimal sufficient and the conditional distribution of (?,?)
given the ancillary a is again hyperboloidic, as in (4.3) but with u, ? and ?
replaced by ?, ? and ax. It follows that the log likelihood function is
1(?.F) = Hx?<l>;x.?.a) = -ax?coshx coshx - sinhx sinhx cos($-<|>)}
and hence
a a a a -F =-?=?.= -F . . . = 0
??? ??F ?F? FFF
a ? ?? = ax cosh ? sinh ?
?ff
a -f = -ax cosh ? sinh ?,
FF?
whatever the value of a. Thus, in this case, the a-geometries are identical.
We note again that whereas the auxiliary statistic a is taken so
as to be ancillary in the various examples discussed here - exactly distribu-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
144 O. E. Barndorff-Nielsen
ti on constant in the three examples above and asymptotically distribution con-
stant in the one to follow - ancillarity is no prerequisite for the general
theory of observed geometries.
Furthermore, let a be any statistic which depends on the minimal
sufficient statistic t, say, only and suppose that the mapping from t to (?,a)
is defined and one-to-one on some subset T~ of the full range ? of values of t
though not, perhaps, on all of T. We can then endow the model M with observed
geometries, in the manner described above, for values of t in T?. The
next example illustrates this point.
The above considerations allow us to deal with questions of non-
uniqueness and nonexistence of maximum likelihood estimates and nonexistence of
exact ancillaries, especially in asymptotic considerations.
Example 6.4. Inverse Gaussian - Gaussian model. Let x(?) and y(?)
be independent Brownian motions with a common diffusion coefficient s = 1 and
drift coefficients ?>0 and ?, respectively. We observe the process x(?) till it
first hits a level x~>0 and at the time u when this happens we record the value
? = y(u) of the second process. The joint distribution of u and ? is then
given by p(u,v;y,c)
- <2,rVV?-V*?t,2)""V"A*?-*\ ,6.25,
Suppose that (u, ,v, ),... ,(u ,v ) is a sample from the distribution
(6.25) and let t = (u,v) where ? and ? are the arithmetic means of the observa-
tions. Then t is minimal sufficient and follows a distribution similar to
(6.25), specifically ?(?,?;?,?)
, ???? 9 -
?(x2+v2)?-1 -
? ?2?+??? - ??2? =
(2p)"'?0?6 ?
?T2e 2 ?
e 2 2 . (6.26)
Now, assume ? equal to ?. The model (6.26) is then a (2,1) exponential model,
still with t as minimal sufficient statistic. The maximum likelihood estimate
of ? is undefined if t^T^ where
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 145
?f = it =
(?,v):x0 + ? > 0}
whereas for tej^, ? exists uniquely and is given by
-1 ? =
^(x0 + ?) ? . (6.27)
The event t$Tg happens with a probability that decreases exponentially fast with
the sample size ? and may therefore be ignored for most statistical purposes.
Defining, formally, ? to be given by (6.27) even for t$Tg
and let-
ting
a = F~(?;2??2,2 ??2),
where f"(?;?,?) denotes the distribution function of the inverse Gaussian dis-
tribution with density function
F-(?;?,?) = (Zw)-^ e^* x"3/2 e-,,(xx"1+*x) (6.28)
we have that the mapping t -> (?,a) is one-to-one from ? = it = (?,?):?>0> onto
(-??,+??) ? (0,?>) and that a is asymptotically ancillary and has the property
that p*(y ;y|a) =c | j | L approximates the actual conditional density of ? given
a to order 0(n"3/2), cf. Barndorff-Nielsen (1984).
Letting F_(?;?>?) denote the inverse function of F~(?;?>?) we may
write the log likelihood function for ? as
1(?) = l(y;?,a)
- 2 = n{(x0
+ ?)? - ?? }
= ?F (a;2nx2,2n{?2) {2??-?2} (6.29)
From this we find
so that
??
2 ~9 1 = -2?F (a;2n ??>2?? )
?? - U
2 2 ? =
2?F_(a;2??^ ,2?? )
* =0 ???
and
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
14 6 O. E. Barndorff-Nielsen
* : = 8?2?(?" ? F /o")(a;2nx2 2??2) ??,? - ? ?
= S = -h ? ??? ???
where f" denotes the derivative of f"(?;?,?) with respect to ?. By the well-
known result (Shuster (1968))
f-(?;?,?) = F(f? - x\'h) + e2^(-foV + xV*)),
where f is the distribution function of the standard normal distribution, f" ?
could be expressed in terms of F and ? = f'.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
EXPANSION OF c|j(^L
We shall derive an asymptotic expansion of (2.7), by Taylor expan-
sion of c|j| L in ? around ?, for fixed value of the auxiliary a. The various
terms of this expansion are given by mixed derivatives (cf. (6.2)) of the log
model function. It should be noted that for arbitrary choice of the auxiliary
statistic a the quantity c|j|C constitutes a probability (density) function on
the domain of variation of ? and the expansions below are valid. However,
c|j|L furnishes an approximation to the actual conditional distribution of ?
given a, as discussed in section 2, only for suitable ancillary specification
of a.
To expand c|j| L in ? around ? we first write L as exp{l-l} and
expand 1 in ? around ?. By Taylor's formula,
? r r 1-1= S -L (?-?) ^..(?-?) v(d 3 1)(?)
v=2 V?
rl rv
whence, expanding each of the terms (a ...3^ 1)(?) around ?, rl rv
1-1
f ? \v ri r (-1) /A \ 1 t \ ?
? ? ?-?) ...(?-?) v=2
= ? -r
? S ?-(?-?)5?...(<:-?)5? 3S ...3S \ ...f. (7.1) p=0 1 ? 1 ?
Consequently, writing d for ?-? and d ""'
for (?-?) (?-?) ..., we have
147
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
148 O. E. Barndorff-Nielsen
i-?-V%s +
wrst(*rs.t +
?*rst)
+ a?rStU^rs;tu+8i-rst;u
+ 3*rstu)
+ ??? ? <7?2>
L? Next, we wish to expand log{ | j |/ \?r\} in ? around ?. To do this we observe
that if A is a d ? d matrix whose elements a depend on ? then
3tlog|A| =
|A|"1at|A|
sr a
Vrs
rs where a denotes the (r,s)-element of the inverse of A. Furthermore, using
3tars = -arvawVa , t U t vw
which is obtained by differentiating a aus = 6S with respect to ? and solving
for ars, we find
3.3 log IA i = -avrasw3 a 3.a + asr3.3 a . t ? a| ' u vw t rs t ? rs
It follows that
logi|j|/M}*--wVVrst+*rs;t)
-?tu{/s(+rstu++rst.u++rsu.t++rs;tu)
+ irVXst^rs;t)(+vwu+J-vw.u)H...
- (7.3)
By means of (7.2) and (7.3) we therefore find
clJl^L = (2p)?/2?Fa(?-?;3-){1
+ A]
+ A2
+ ...} (7.4)
where F?(?;^) denotes the density function of the d-dimensional normal distribu-
tion with mean 0 and precision (i.e. inverse variance-covariance matrix) ?- and
where
A! =
-wV^rsit +
W +
?"*<+?* +
! W (7"5)
and
A2 =
? [- 36tu{2/s(+rstu +
+rst;u +
*rsu;t +
*rsstu)
+ (2/Vw - ?rVwmrs;t
+ W^w;u ^vwu)i
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 149
+ *rStU{(3*rstu
+ 8*rst;u
+ 6*rs;tu>
-^VXw;u++vwu)^rs;t +
!+rst)>
+ 36rstuvw(* . +4+ J(+ + U )], (7.6) v rs;t 3 rst/v uv;w 3 uvw/J' v-w
A-j and A2 being of order O(rf^) and 0(n" ), respectively, under ordinary repeat-
ed sampling.
By integration of (7.4) with respect to ? we obtain
(2p)a/2? = 1 + C,
+ ... , (7.7)
where C|
is obtained from Ap by changing the sign of A? and making the sub-
stitutions
_rs .rs d + a-
xrstu ^rs.tur^n d + a- a- [3]
rrstuvw .rs.tu.vwricn d + 3r 3- 3- [15],
the 3 and 15 terms in the two latter expressions being obtained by appropriate
permutations of the indices (thus, for example, <srstu -> jTs? u + j- aSU +
.ru.stx dr a- ).
Combination of (7.4) and (7.7) finally yields
c|j|^L = f(?-?;*){1 + A1
+ (A2+C1)
+ ...} (7.8)
-3/2 with an error term which in wide generality is of order 0(n ) under repeated
sampling. In comparison with an Edgeworth expansion it may be noted that the
expansion (7.8) is in terms of mixed derivatives of the log model function,
rather than in terms of cumulants, and that the error of (7.8) is relative,
rather than absolute.
In particular, under repeated sampling and if the auxiliary statis-
tic is (approximately or exactly) ancillary such that
?(?;?|a) = p*(?;u)|a){l + 0(n'3/2)}
(cf. section 2) we generally have
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
150 O. E. Barndorff-Nielsen
?(?;?|?) = F?(?-?;*){1
+ ?]
+ (?2
+ C])
+ 0(?"3/2)}. (7.9)
For one-parameter models, i.e. for d = 1, the expansion (7.8) with
A,, A2 and C, as given above reduces to the expansion (2.9). In Barndorff-
-3/2 Nielsen and Cox (1984) a relation valid to order 0(n
' ) was established, for
general d, between the norming constant c of (2.7) and the Bartlett adjustment
factors for likelihood ratio tests of hypotheses about ?. By means of this rel-
ation such adjustment factors may be simply calculated from the above expression
for Zy
Example 7.1. Suppose M is a (k,k) exponential model with model
function (6.13). Then the expression for C-. takes the form
r _ 1 ,0 rs tu (0 ru sv tw , 0 rs tu vwx, Cl
" 24 {3KrstuK
K " KrstKuvw(2K
? ? + 3? K K )}
where, for 3 = 3/3?G and ?(?) = -log a(e),
Vs... =
Vs ??? ?(?)
and where ?rs is the inverse matrix of ? .
From (7.8) we find the following expansion for the mean value of ?:
?? =? +??+?0+... ? ? c
? 1 ? 9 where ?? is of order 0(?" ), ?? is of order 0(n" ), and
y-, - -W a- +r;st
- -h* 3r -Fstr. (7.10)
Hence, from (7.8) and writing d1 for d-?,,
f'?^? = f?(?
-?- \iy ?r) ? +
(?] -
^?-^?*) + ...}
= F?(?
- ? - ??;?){1
+ ?irst(?' ;?)(*rs;t +
f *rst) + ..?>? (7.?)
-1 rT,,rn where the error term is of order 0(n~ ) and where h (?%&) denotes the
tensorial Hermite polynomial (as defined by Amari and Kumon (1983)), relative
write
-1/3
to the tensor ?r . Using (6.10) we may rewrite the last quantity in (7.11) as
+rs;t+f*rst-^rst+^st (7J2)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
where
Since
we find
Differential and Integral Geometry in Statistical Inference 151
*Let =
k*r.:t -
*(*rt:? + h*..J)? (7.13) rst 3 rs;t v rt;s st;r;
?rs^.t (d1;*) = ?'VVL - ?rVL[3] (7.14)
hrst(6';mrst
and hence (7.11) reduces to . , ? r-t -1/3
c|j|\ = f?(?
- ? - ?-????
- ^G^(d';j-) ^ + ...}, (7.15)
the error term being 0(n" ).
Suppose, in particular, that the model is an exponential (k,d)
model. We may then compare (7.15) with the Edgeworth expansion for an effi-
cient, bias adjusted estimate of ? given an ancillary statistic, provided by
formulas (3.33) and (3.25) in Amari and Kumon (1983). It appears that hrst "1/3 -1/3 abr
(d% \ir) ?? t
of (7.15) is the counterpart of Amari and Kumon's rabch -
e ab \c m a ?-? H.u h h + H , hh . Thus (7.15) offers some simplification over the cor- abK KXa r
responding expression provided by the Amari and Kumon paper.
Note that, again by the symmetry of (7.14), if
-1/3
*rst[3] = 0 (7.16)
for all r,s,t then the first order correction term in (7.15) is 0. Further- a
more, for any one-parameter model M the quantity -F with a = -1/3, can be made
to vanish by choosing that parametrization for which ? is the geodesic coordin-
ate for the -1/3 observed conditional connection. (Note that generally this
parametrization will depend on the value of the ancillary a.) An analogous
result holds for the Edgeworth expansion derived by Amari and Kumon (1983),
referred to above. The parametrization making the a = -1/3 expected connection a r vanish has the interpretation of a skewness reducing parametrization, cf.
Kass (1984).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
8. EXPONENTIAL TRANSFORMATION MODELS
Suppose M is an exponential transformation model and that the full
exponential model M generated by M is regular. By theorem 2.1 the group G acts
affinely on ? = t(?), and Lebesgue measure on ? is quasi-invariant (in fact,
relatively invariant) with multiplier |A(g)|. Assuming, furthermore, that M
and G have the structure discussed in section 3 with {g:|A(g)| = 1} <= ? we find,
since the mapping g -> A(g) is a representation of G, that
|A(h(gx))| = |A(g)||A(h(x))|.
Thus m(x) = |A(fi)| is a modulator and
dv(h) = |A(h) |"????? (8.1)
is an invariant measure on H (cf. appendix 1).
Again by theorem 2.1 the log likelihood function is of the form
1(h) = {?(?)?(?G???* + ?(n_1h)}.w - ?(?(?)?(?~??)* + &(?"*??)) (8.2)
where w = t(u) = h" t.
Some interesting special cases are
(i) ?(?) or Bf(.) or both are 0. Then d(?) of (2.45) is a multi-
plier (i.e. a homomorphism of G into (R+,?))? Furthermore, if &(?) = 0 and if
(2.35) is an exponential representation of M relative to an invariant dominat-
ing measure on X. then b(x) is a modulator.
(ii) The norming constant a(e(g)) does not depend on g. If in
addition B(g) does not depend on g, which implies that B(?) = 0, then the con-
ditional distribution of h given w is, on account of the exactness of (2.7),
152
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 153
p(h;h|w) = c(w)|j|^ ee(h" h)'w
(8.3)
where the norming constant does not depend on h.
Note that the form (8.3) is preserved under repeated sampling, i.e.
the conditional distribution of h is of the same "type" whatever the sample
size.
The von Mises-Fisher model for directional data with fixed precision
has this structure with w equal to the resultant length r, and as is well-
known the conditional model given r is also of this type, irrespective of
sample size. Other examples are provided by the hyperboloid model with fixed
precision and by the class or r-dimensional normal distributions with mean 0
and precision ? such that |d| = 1.
(iii) M is a (k,k-l) model.
For simplicity we now assume that M has all the above-mentioned
properties. There is then little further restriction in supposing that M is of
the form
?(?,?) = bWexp?-axe^h^hr^e^} (8.4)
where ? is the index parameter, a is maximal invariant and e, and e_, are
known nonrandom vectors. For (8.4) the log likelihood function is
1(h) = -axe^?f^e^ (8.5)
- _i* where we have written A for A . Hence
rrs =
ax(3t3u?ij)(e)elie_ljAj(h)^(h) (8.6)
where ? is given by (6.23). In this case, then, the conditional observed
ot geometries (<r(e;x,a),.F(-;A,a)) are all "proportional" for fixed a, with ax as
the proportionality factor. The geometric leaves of the foliation of M, deter-
mined as the partition of M generated by the index parameter x, are thus highly
similar. In this connection see example 6.3.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
APPENDIX 1
Construction of invariant measures
One may usefully generalize the concepts of invariant and relatively
invariant measures as follows. Let a measure y on X be called quasi-invariant
W1'th multiplier ? = x(g,x) if g? and ? are mutually absolutely continuous for
every geG and if
d(g" y)(x) = x(g?x)dy(x).
Furthermore, define a function m on X to be a modulator with associated multi-
plier x(g,x) if m is positive and
m(gx) = x(g,x)m(x). (Al.l)
Then, if ?? is quasi-invariant with multiplier x(g,x) and if m is a modulator
satisfying (Al.l) we have that
? = m~V (Al.2)
is an invariant measure on L?
In particular, to verify that the measure ? defined by (3.9) is
invariant one just has to show that m(y) = J (z\(u) is a modulator with associ-
ated multiplier J /a\(y) because, by the standard theorem on transformation of
integrals, Lebesgue measure x is quasi-invariant with multiplier J / \(y).
Corresponding to the factorization G = HK there are unique factorizations g = hk
and gz = hk and, using repeatedly the assumption that ? = G for every orbit
representative u, we find
m(gy) = Jy(h)(u)
=
JY(g)(y)JY(z)(u)JY(..1)(u)
= JY(g)(y)m(y).
154
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 155
In the last step we have used the fact that
J ,k)(u)
= 1 for every keK. (Al.3)
To see the validity of (Al.3) one needs only note that for fixed u the mapping
k -> J /?^(u) is a multiplier on ? and since ? is compact this must be the
trivial multiplier 1. Actually, (Al.3) is a necessary and sufficient condition
for the existence of an invariant measure on ?. This may be concluded from
Kurita (1959), cf. also Santalo (1979), section 10.3.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
APPENDIX 2
An equality of Jacobians under left factorizations
Lemma. Let G = HK be a left factorization of G (as discussed in
sections 3 and 5), let ? denote the natural action of G on ? and let d denote
left action of G on itself. Then J'(h\(e)
= J?(h\(e)
for all heH.
Proof. Let g = hk denote an arbitrary element of G. Writing g
symbolically as (h,k) and employing the mappings ? and ? defined by
n:g ?> h c:g -> k
we have, for any h'eH,
?(h')g = 6(h')(h,k) = (n(h'h),c(h'hk))
and hence the differential of 6(h')g is
3n(h'h)*
D6(h')(g) =
3h
3c(h'hk)* 3?(h'hk)* 3h 3k
from which we find, using n(h'h) = y(h')h and c(h'k) = k,
J?(h')(e) =
JY(h')(e)!"Sk lk=e
?i(V){eh
156
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
APPENDIX 3
An inversion result
The validity of formula (6.24) is established by the following
Lemma. Let G = HK be a left factorization of the group G with the
associated mapping n:g = hk ?> h (as discussed in sections 3 and 5). Further-
more, let h' denote an arbitrary element of H. Then
3n(h;V)*| = _ 3n(h'"1h)*l (A3J) dh W
ah h=h?
Proof. The mapping h -> n(h" h1) may be composed of the three
mappings h + h'" h, g -> g" and ?, as indicated in the following diagram
,?
H
where i indicates the inversion g -> g" . This diagram of mappings between dif-
ferentiable manifolds induces a corresponding diagram for the associated dif-
ferential mappings between the tangent spaces of the manifolds, namely
157
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
158 O. E. Barndorff-Nielsen
D(h1_1.)
m -> TG , h ^ - h'-\
Di O
Dn
???(??~??)
From this latter diagram and from the well-known relation
(Di)(e) = -I,
where I indicates the identity matrix, formula (A3.1) may be read off immediate-
ly.
Acknowledgements
I am much indebted to Poul Svante Eriksen, Peter Jupp, Steffen L.
Lauritzen, Hans Anton Salomonsen and J?rgen Tornehave for helpful discussions?
and to Lars Smedegaard Andersen for a careful checking of the manuscript.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
REFERENCES
Amari, S.-I. (1982a). Differential geometry of curved exponential families -
curvatures and information loss. Ann. Statist. 10, 357-385.
Amari, S.-I. (1982b). Geometrical theory of asymptotic ancillarity and condi-
tional inference. Biometrika 69, 1-17.
Amari, S.-I. (1935). Differential-Geometric Methods in Statistics. Lecture
Notes in Statistics 28, Springer, New York.
Amari, S.-I. (1986). Differential geometrical theory of statistics - towards
new developments. This volume.
Amari, S.-I. and Kumon, M. (1983). Differential geometry of Edgeworth expansion
in curved exponential family. Ann. Inst. Statist. Math. 35, 1-24.
Barndorff-Nielsen, 0. E. (1978a). Information and Exponential Families.
Wiley, Chichester.
Barndorff-Nielsen, 0. E. (1978b). Hyperbolic distributions and distributions on
hyperbolae. Scand. J. Statist. !5, 151-157.
Barndorff-Nielsen, 0. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, 0. E. (1982). Contribution to the discussion of R. J.
Buehler: Some ancillary statistics and their properties. J. Amer.
Statist. Assoc. 77, 590-591.
Barndorff-Nielsen, 0. E. (1983). On a formula for the distribution of the maxi-
mum likelihood estimator. Biometrika 70, 343-365.
Barndorff-Nielsen, 0. E. (1984). On conditionality resolution and the likeli-
hood ratio for curved exponential families. Scand. J. Statist. ?, 157-
159
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
160 O. E. Barndorff-Nielsen
170. Amendment Scand. J. Statist. 12 (1985).
Barndorff-Nielsen, 0. E. (1985). Confidence limits from c|j| E in the single-
parameter case. Scand. J. Statist. 12, 83-87.
Barndorff-Nielsen, 0. E. (1986a). Likelihood and observed geometries. Ann.
Statist. U, 856-873.
Barndorff-Nielsen, 0. E. (1986b). Inference on full or partial parameters
based on the standardized signed log likelihood ratio. Biometrika 73,
307-322.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983a). Exponential models with
affine dual foliations. Ann. Statist. 11, 753-769.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983b). Reproductive exponential
families. Ann. Statist. 11, 770-782.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1984). Combination of reproductive
models. Research Report 107, Dept. Theor. Statist., Aarhus University.
Barndorff-Nielsen, 0. E., Blaesild, P., Jensen, J. L. and Jorgensen, B. (1982).
Exponential transformation models. Proc. R. Soc. A 379, 41-65.
Barndorff-Nielsen, 0. E. and Cox, D. R. (1984). Bartlett adjustments to the
likelihood ratio statistic and the distribution of the maximum likelihood
estimator. J. R. Statist. Soc. ? 46, 483-495.
Barndorff-Nielsen, 0. E., Cox. D. R. and Reid, N. (1986). The role of differen-
tial geometry in statistical theory. Int. Statist. Review 54, 83-96.
Barut, A. 0. and Raczka, R. (1980). Theory of Group Representations and Appli-
cations. Polish Scientific Publishers, Warszawa.
Boothby, W. M. (1975). An Introduction to Differentiable Manifolds and
Riemannian Geometry. Academic Press, New York.
Burridge, J. (1981). A note on maximum likelihood estimation for regression
models using grouped data. J. R. Statist. Soc. ? 43, 41-45.
Chentsov, ?. N. (1972). Statistical Decision Rules and Optimal Inference.
(In Russian.) Moscow, Nauka. English translation (1982). Translation of
Mathematical Monographs Vol. 53. American Mathematical Society, Providence,
Rhode Island.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference -^l
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a
curved exponential family. Ann. Statist. 11, 793-803.
Eriksen, P. S. (1984a). (k,l) exponential transformation models. Scand. J.
Statist. VL, 129-145.
Eriksen, P. S. (1984b). A note on the structure theorem for exponential trans-
formation models. Research Report 101, Dept. Theor. Statist., Aarhus
University.
Eriksen, P. S. (1984c). Existence and uniqueness of the maximum likelihood
estimator in exponential transformation models. Research Report 103,
Dept. Theor. Statist., Aarhus University.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc.
Roy. Soc. A 144, 285-307.
Hauck, W. W. and Donner, A. (1977). Wald's test as applied to hypotheses in
logit analysis. J. Amer. Statist. Ass. 72, 851-853. Corrigendum:
J. Amer. Statist. Ass. 75 (1980), 482.
Jensen, J. L. (1981). On the hyperboloid distribution. Scand. J. Statist. 8,
193-206.
Kurita, M. (1959). On the volume in homogeneous spaces. Nagoya Math. J. 15,
201-217.
Lauritzen, S. L. (1986). Statistical manifolds. This volume.
Santal ?, L. ?. (1979). Integral Geometry and Geometric Probability. Encyclo-
pedia of Mathematics and Its Applications. Vol. 1, Addison-Wesley, London.
Shuster, J. J. (1968). A note on the inverse Gaussian distribution function.
J. Amer. Statist. Assoc. 63, 1514-1516.
Vaeth, M. (1985). On the use of Wald's test in exponential families. Int.
Statist. Review 53, 199-214.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
STATISTICAL MANIFOLDS
Steffen L. Lauritzen
1. Introduction. 165
2. Some Differential Geometric Background . 167
3. The Differential Geometry of Statistical Models . i77
4. Statistical Manifolds . 179
5. The Univariate Gaussian Manifold . I90
6. The Inverse Gaussian Manifold . 198
7. The Gamma Manifold. 203
8. Two Special Manifolds. 206
9. Discussion and Unsolved Problems . 212
10. References. 215
Institute for Electronic Systems, Aalborg University Center, Aalborg, Denmark
163
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
1. INTRODUCTION
Euclidean geometry has served as the major tool in clarifying the
structural problems in connection with statistical inference in linear normal
models. A similar elegant geometric theory for other statistical problems
does not exist yet.
One could hope that a more general geometric theory could get the
same fundamental role in discussing structural and other problems in more
general statistical models.
In the case of non linear regression it seems clear that the
geometric framework is that of a Riemannian manifold, whereas in more general
cases it seems as if a non-standard differential geometry has yet to be
developed.
The emphasis in the present paper is to clarify the abstract
nature of this differential geometric object.
In section 2 we give a brief introduction to the notions of modern
differential geometry that we need to carry out our study. It is an extract
from Boothby (1975) and Spivak (1970-75) and we are mainly using a coordinate-
free setup.
Section 3 is an ultrashort summary of some previous developments.
The core of the paper is contained in section 4 where we abstract the notion
of a statistical manifold as a triple (M,g,D) where Misa manifold, g is a
metric and D is a symmetric trivalent tensor, called the skewness of the
manifold. Section 4 is fully devoted to a study of this abstract notion.
Sections 5, 6, 7, and 8 are detailed studies of some examples of
165
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
!66 Steffen L. Lauritzen
statistical manifolds of which some (the Gaussian, the inverse Gaussian and
the Gamma) manifolds are of interest because of their leading role in statis-
tical theory, whereas the examples in section 8 are mostly of interest because
they to a large extent produce counterexamples to many optimistic conjectures.
Through the examples we also try to indicate possibilities for discussing
geometric estimation procedures.
In section 9 we have tried to collect some of the questions that
naturally arise in connection with the developments here and in related pieces
of work.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
SOME DIFFERENTIAL GEOMETRIC BACKGROUND
A topological manifold Misa Hausdorff space with a countable
base such that each point ?e? has a neighborhood that is homeomorphic to an
open subset of IRm. m is the dimension of M and is well-defined. A differen-
tiable structure on M is a family
where U is an open subset of M and ? are homeomorphisms from U, onto an open ? ? X
subset of IRm, satisfying the following:
(1) UU, = M
? ?
1 m (2) for any ?^,?^et: ?? ??~ is a C??(IR ) function wherever it is well
defined
(3) if V is open, ?: V -> IRm is a homeomorphism, and ? ? ? ~
, ? ? ?" are ? ?
C?? wherever they are well defined, then (?,?)e?.
The condition (2) is expressed as ?. and ? being compatible. x1 x2
In very simple cases M is itself homeomorphic to an open subset
of IR and the differentiable structure is just given by (?,F?) and all sets
(U.,? ) where U, is an open subset of M and ? ? ? "
is a diffeomorphism. ? ? ? xu
The sets U are called coordinate neighborhoods and ? coordinates. ? ?
The pair (U ,? ) is called a local coordinate system. ? ?
M, equipped with a differentiable structure is called a differenti-
able manifold or a C??-manifo1d.
A differentiable structure can be specified by any system satisfy-
ing (1) and (2). Then there is a unique structure U_ containing the specified
167
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
168 Steffen L. Lauritzen
local coordinate system.
The differentiable structure gives rise to a natural way of defin-
ing a differentiable function. We say that f: M -* IR is in 0??(?) if it is
a usual C??-function when composed with the coordinates:
f e C??(M) <-* f ? f?
"? e C??UX(U)) for all X.
Important is the notion of a regular submanifold ? ? M of M. A subset N. of M
is a regular submanifold if it is a topological manifold with the relative
topology and if it has preferred coordinate neighborhoods, i.e. to each point
?e?_ there is a local coordinate system (U. ,?.) with ?e?. such that XX X
i) F?(?) = (0.....0); F?(??)
= ]-e,e[m
ii) F?(?????) = ?(?1.???.??.0,....0), |?1'|<e}
?^ inherits then in a natural way the differentiable structure from M by
(??.F?) where
?? =
U/1N, \ -
f?|??,
where (U.,??) is a preferred coordinate system. ? ?
All C??(N)-functions can then be obtained by restriction to N^ of
C??(M)-functions.
For ?e?, C??(p) is the set of functions whose restriction to some
open neighborhood U of ? is in C??(U). We here identify f and g e C??(p) if their
restriction to some open neighborhood of ? are identical.
The tangent space ? (M) to M at ? is now defined as the set of all
maps X : C??(p) -* IR satisfying
1) Xp(af+eg)
= aXp(f)+3Xp(g)
a,? e IR
?) Xp(fg)
= xp(f)g(p)+f(p)xp(g)
f,g e cro(p)
One should think of X as a directional derivative. X is called a tangent ? - ? ?2?
vector.
? (M) is in an obvious way a vector-space and one can show that
dim(Tp(M)) = m.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 169
For each particular choice of a coordinate system, there corre-
sponds a canonical basis for ? (M), with basis vectors being
Eip(f)=?7fU"1(x))|x=*(p)
A vector field is a smooth family of tangent vectors X = (? ,?e?) where
? e? (M). To define "smooth11 in the right way, we demand a vector field X to r H
be a map:
i)
ii)
and now we write
X: C??(M) - C"(M)
X(af+?g) = aX(f)+?X(g) a.?elR
x(fg) = x(f)g+fx(g) f,geC"(M)
xp(f) = x(f) (p)
The vector fields on M are denoted as X_(M). X_(M) is a module over C??(M): if
f,geC??(M), ?,?e?(?) then
(fX+gY) (h) = fX(h) + gY(h)
is also in X^(M). X_(M) is a Lie-algebra with the bracket operation defined as
CX,Y](f) = X(Y(f)) - Y(X(f)).
The Lie-bracket [ ] satisfies
[X,[Y.Z]] + [Y,[Z,X]] + [Z,[X,Y]] = 0
[X,Y] = -[Y,X]
[aX^?Xg.Y] =
a[XrY] +
?[X2,Y] a,??IR
[X,aY1+3Y2] =
a[?,??] +
?[?,?2] a,? e IR
Further one can easily show that
[X,fY] = f[X,Y] + (X(f))Y .
The locally defined vector fields ?., representing differentiation w.r.t. local
coordinates, constitute a natural basis for the module XjU), where U is a
coordinate neighborhood.
A covariant tensor D of order k is a C??-k-linear map
(Jacobi identity)
(anticommutati vity)
(bilinearity)
D: X(M)x...*X(M) - C??(M),
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
170 Steffen L. Lauritzen
i.e.
D(Xr...,Xk) e C~(M),
D(X1,...,fXi+gYi,X.+1,...,Xk)
= fD(Xr...,Xk)
+ gD(X1,...,Y.,X.+1,...,Xk).
A tensor is always pointwise defined in the sense that if X . = Y ., then XL- pl pi
D(Xr...,Xk)(p) =
D(Yr...,Yk)(p).
This means that any equations for tensors can be checked locally on a basis
e.g. of the form E.. These satisfy [?.,E.] = 0 and all tensorial equations hold
if they hold for vector fields with mutual Lie-brackets equal to zero. This is
a convenient tool for proving tensorial equations and we shall make use of it
in section 3.
A Riemannian metric g is a positive symmetric tensor of order two:
g(X,X) > 0 g(X,Y) = g(Y,X)
Since tensors are pointwise, it can be thought of as a metric g on each of the
tangent spaces ? (M).
A curve ? = (y(t),te[a,b]) is a C??-map of [a,b] into M. Note that
a curve is more than the set of points on it. It involves effectively the
parametrization and is thus not a purely geometric object.
Let now ? denote any vector field such that
;(f)(y(t)) = ? f (Y(t)) for all te[a,b],feC??(M)
The length of the curve ? is now given as
M = /b i/g(Y,Y)y(t)dt. a
Curve length can be shown to be geometric.
An important notion is that of an affine connection on a manifold.
We define an affine connection as an operator ?
?: X?M) ? ?(?) -> ?(?)
satisfying (where we write v?Y for the value)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 171
i) ??(a?+??) =
a??? +
????, a,? e IR
?) Vx(fY) = X(f)Y +
fvxY
iii) vfx+gYZ
= fvxZ +gvYZ
.
An affine connection can be thought of as a directional derivation of vector
fields, i.e. ???
is the "change" of the vector field Y in X's direction.
An affine connection can be defined in many ways, the basic reason
being, that "change" of Y is not well defined without giving a rule for compar-
ing vectors in ? (M) with vectors in ? (M), since they generally are different P-l
- P2
-
spaces.
An affine connection is exactly defining such a rule via the notion
of parallel transport, to be explained in the following. We first say that a
vector field X is parallel along the curve ? if
??? = 0 on ?, ?
where again ? is any vector field representing tt-.
Now for any vector X rx e ? ,? (_M) there is a unique curve of
vectors
XY(t).te[a,b], Xy(t) cTY(t)(M)
such that ??? = 0 on ?, i.e. such that these are all parallel, and such that
X / ? is equal to the given one. We then write
y(b) ?? y(ar
and say that p defines parallel transport along ?. p is in general an affine
map.
Note that p depends effectively on the curve in general.
An affine connection can be specified by choosing a local basis
for the vector-fields (E. ,i=l,... ,m) and defining the symbols (C??-functions)
k r.?,, i,j,k=l,...,m
by
Vrr!jM=k!/^'
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
172 Steffen L. Lauritzen
where we adopt the summation convention that whenever an index appears in an
expression as upper and lower, we sum over that index. Using the properties of
an affine connection we thus have for an arbitrary pair of vector-fields
X = f^., Y = giEi
vxY =
f1Ei(gJ)Ej+fV4.Ek
A geodesic is a curve with a parallel tangent vector field, i.e. where
??? = 0 on ?. ?
Associated with the notion of a geodesic is the exponential map induced by the
connection.
For all ?e?, ? e ? (M) there is a unique geodesic ?? , such that ?
?? (0) = ? ;? (0) = ? (**) ? ?
This is determined in coordinates by the differential equations below together
with the initial conditions (**)
xk(t) + ^(tl^tlrJ.Wt))
= 0
where ?? (t) = (x (t),...,xm(t)) in coordinates. XP
Defining now for ? e ? (M)
exp?X) = Yy (1)
? xp
we have exp?tX ? = ?? (t). ?
?? The exponential map is in general well defined at least in a neigh-
borhood of zero in ? (M) and can only in special cases be defined globally.
In general, geodesies have no properties of "minimizing" curve
length. However, on any Riemannian manifold, (i.e. a manifold with a metric
tensor g), there is a unique affine connection ? satisfying
i) ??? -
??? - [?,?] ? 0
11) Xg(Y.Z) = g(vxY,z)
+ g(Y,vxz).
This connection is called the Riemannian connection or the Levi-Civita connec-
tion.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 173
Property i) is called torsion-freeness and property ii) means that
the parallel transport p is isometric, which is seen by the argument.
yg(Y.Z) = g(v.Y.Z) + g(Y,v^Z)
= 0 if v^Y
= v^Z
= 0.
We can then write girMf^Z)^^
= g(Y,Z)y(a)
or just g(?^9Ji^l)
= g(Y,Z).
If ? is Riemannian, its geodesies will locally minimize curve length.
To all connections ? there is a torsion free connection ? such that
this has the same geodesies. All connections in the present paper are torsion
free, whereas not all of them are Riemannian.
When the manifold is equipped with a Riemannian metric, it is often
convenient to specify the connection through the symbols (C??-functions) G... , ? j ?
where
rijk "
9(^?G??
Defining the matrix of the metric tensor and its inverse as
gu-gtMj) (g^-ig^r1.
the symbols are related to those previously defined as
The Riemannian connection is given by
^k =
^9jk>+ ??^ -
M^ij??
A connection defines in a canonical way the covariant derivative of
a tensor D as
(vxD)(Xr...,Xk) =
XD(Xr...,Xk) - S
D(X1,...,VxX.,...,Xk).
(???) is again a covariant tensor of order k and the map
S(X,Xr...,Xk) =
(vxD)(Xr...,Xk)
becomes a tensor of order k+1. The fact that the Riemannian connection pre-
serves inner product under parallel translation can then be written as
(vxg)(Y,Z) ? 0.
Similarly, if D is a multilinear map from ?(?)?...??(?) into ?(M) its
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
174 Steffen L. Lauritzen
covariant derivative is defined as
(vxD)(Xr...,Xk) =
vxD(Xr...fXk) - S
D(X1,...,vxX.,...,Xk).
Such multilinear maps are called tensor fields.
An important tensor field associated with a space with an affine
connection is the curvature field, R: X{tt) ? X(M) ? X(M) -> J((M)
r(XjY)Z = ?????
- ?????
- v[XjY]Z.
A manifold with a connection satisfying R ? 0 is said to be flat. If the
connection is torsion free, the curvature satisfies the following identities:
a) R(X,Y)Z = -R(Y,X)Z
b) R(X,Y)Z + R(Y,Z)X + R(Z,X)Y = 0
(Bianchi's 1st identity)
c) (VXR)(Y,Z,W) +
(vyR)(Z,X,W) +
(vzR)(X,Y,W) = 0
(Bianchi's 2nd identity).
Strictly speaking, a) does not need torsion freeness.
On a Riemannian manifold, we also define the curvature tensor R as
R(X,Y,Z,W) = g(R(X,Y)Z,W)
where R is used in two meanings, both referring to the Riemannian connection.
The Riemannian curvature tensor satisfies
1) R(X,Y,Z,W) = -R(Y,X,Z,W)
ii) R(X,Y,Z,W) + R(Y,Z,X,W) + R(Z,X,Y,W) = 0
ili) R(X,Y,Z,W) = -R(X,Y,W,Z)
iv) R(X,Y,Z,W) = R(Z,W,X,Y)
We shall use the symbol R also for the curvature tensor
R(X,Y,Z,W) = g(R(X,Y)Z,W),
when M has a Riemannian metric g and a torsion-free but not necessarily
Riemannian connection v. Then i) and ii) are satisfied, but not necessarily
iii) and iv).
If (E,-D) is a local basis for ? (M), the curvature tensor can be
calculated as
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 175
Rijkm =
R<Ei'EJ>Ek'Em>
? ?El?rjV-Ejtr?k?9sm+(r1nnr?k-rJn.r?k)?
The sectional curvature is given as
?(s ) = 9(*(?.?)?.?) X,Y
g(x,x)g(Y,Y)-g(x,Y)2
and determines in a Riemannian manifold also the curvature. If the curvature
satisfies i) to iv) the sectional curvature also determines R.
Two other contractions of the curvature tensor are of interest:
The Ricci-curvature
ClR(X,X) =
ml] g(R(u.,X)X,u.) ? ?,=] ? ?
= g(x,x)m^ ?(s? )
1=1 ?'???
where (X/g(X,X),u,,...,u ,) is an orthonormal system for ? (M).
Finally the scalar curvature is
S(p) = S c.R(u.,u.) i=l
'
where u-.,...,u is an orthonormal system in ? (M). We then have the identity
S(p) = S ?(s , ). i9j i j
If ? is a regular submanifold of M, the tangent space of ? can in a natural way
be identified with the subspace of _X(M) determined by
? e X(N)5 X(M) *-* [f=g on ? -* X(f) = X(g) on N].
In that way all tensors etc. can be inherited to ? by restriction. If M has a
Riemannian metric, N^ inherits it in an obvious way, and this preserves curve
length, in the sense that the length of a curve in N^w.r.t. the metric inherit-
ed, is equal to that when the curve is considered as a curve in M.
An affine connection is inherited in a more complicated way:
We define
(????)(?) =
??(???)(?)
where ? is the projection w.r.t. g onto the tangent space ? (N)cT (M) of the
vector (???)? which is not necessarily in
Tp(N). In fact we define the
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
176 Steffen L. Lauritzen
embedding curvature of ? relative to M as the tensor field X(N) ? X_(N) -> X^(M)
or,equivalently, as
HN(X,Y) =
??? -
????
hn(x.y,z) =
g(HN(x,Y),z)
where ?,? e X(N), ? e X(NH (or ? e ?(?)). r r
If HjuiO we say that N^ is a totally geodesic submanifold of M. A
totally geodesic submanifold has the property that any curve in ? which is a
geodesic w.r.t. the connection on j^, also is a geodesic in M.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
THE DIFFERENTIAL GEOMETRY OF STATISTICAL MODELS
A family of probability measures ? on a topological space X. inher-
its its topological structure from the weak topology. Most statistical models
are parametrized at least locally by maps (homeomorphisms)
?: U->0clRm
where U is an open subset of P^ and T an open subset of IRm. From this para-
metrization we get P_ equipped with a differentiable structure, provided the
various local parametrizations are compatible. Considering for a while only
local aspects, we can think of ? as {?.?eT}. We let now f(x,e) denote the ?
density of Pa w.r.t. a dominating measure y and assume these to be C??-functions ?
of ?. Under suitable regularity assumptions we can now equip P^with a
Riemannian metric by defining 1(?,?) = log f (?,?) and
9??(?) =
9(?G?.) -le?MDEjO)). ?
The metric is the Fisher information and different parametrizations define the
same metric on P_. Similarly we can define a family of affine connections (the
a-connections) on ^P by the expressions
?ijk =
"rijk -
fTijk' aeIR' where
TlJk(Pe)-?etE1(l)EJ(l)Ek(l)}.and
r... is the Riemannian connection, ? j ?
The Fisher information as a metric was first studied by Rao (1945)
and the a-connections in the case of finite and discrete sample spaces by
Chentsov (1972). Later the a-connections were introduced and investigated
independently and in full generality by Amari (1982).
177
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
178 Steffen L. Lauritzen
For a more fair description of the history of the subject (the
above is indecently short), see e.g. the introduction by Kass in the present
monograph, Amari (1985) and/or Barndorff-Nielsen, Cox and Reid (1986).
Two of these connections play a special role:
The exponential connection (for a=l) and
the mixture connection (for a=-l). 1
The exponential connection has r... ? 0 when expressed in the ? j ?
canonical parameter in an exponential family, and similarly when we express -1 r... (the mixture connection) in the mean value coordinates of an exponential 1 JK
family r. .. ? 0. Further we have the formulae ? j ?
J^M?Mj?DE^DJand
1 ? T. ., = 2(r. ., - r... )
?jk v
?jk ijk'
which often are useful for computations.
These structures are in a certain sense canonical on a statistical
manifold. Chentsov (1972) showed in the case of discrete sample spaces that
the a-connections were the only invariant connections satisfying certain in-
variance properties related to a decision-theoretic approach. Similarly, the
Fisher information metric is the only invariant Riemannian metric. These re-
sults have recently been generalized to exponential families by Picard (1985).
On the other hand, similar geometric structures have recently
appeared such as minimum-contrast geometries (Eguchi, 1983) and the observed
geometries introduced by Barndorff-Nielsen in this monograph.
The common structure that seems to appear again and again in cur-
rent statistical literature is not standard in modern geometry since it involves
study of the interplay between a Riemannian metric and a non-Riemannian connec-
tion or even a whole family of such connections.
It seems thus worthwhile to spend some time on studying this
structure from a purely mathematical point of view. This has already been done
to some extent by Amari (1985). In the following section we shall outline the
mathematical structures.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
STATISTICAL MANIFOLDS
A statistical manifold is a Riemannian manifold with a symmetric
and covariant tensor D or order 3. In other words a triple (M,g,D) where M is
an m-dimensional C??-manifold, g is a metric tensor and D: X(M) ? _X(M) ? X_(M) -*
C??(M) a tri li near map satisfying
D(X,Y,Z) = D(Y,X,Z) = D(Y,Z,X)
(=D(X,Z,Y) = D(Z,X,Y) = D(Z,Y,X))
D is going to play the role T.... in the previous section. We use D to distin- ? j ?
guish the tensor from the torsion field. Dis called the skewness of the
manifold.
Instead of D we shall sometimes consider the tensor field ? defined
as
g(Bf(X,Y),Z) = D(X,Y,Z).
We have here used that the value of a tensor field is fully deter-
mined when the inner product with an arbitrary vector field ? is known for all
Z.
The above defined notion could seem a bit more general than neces-
sary, in the sense that some Riemannian manifolds with a symmetric trivalent
tensor D might not correspond to a particular statistical model.
On the other hand the notion is general enough to cover all known
examples, including the observed geometries studied by Barndorff-Nielsen and
the minimum contrast geometries studied by Eguchi (1983).
Further, all known results of geometric nature for statistical
manifolds as studied by Amari and others can be shown in this generality and
179
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
180 Steffen L. Lauritzen
it seems difficult to restrict the geometric structure further if all known
examples should be covered by the general notion.
a? Let now (M,g,D) (or(M,g,D)) be a statistical manifold. We now
define a family of connections as follows:
??? =
??? -
|D(X,Y) (3.1)
where ? is the Riemannian connection. We then have a
3.1 Proposition ? as defined by (3.1) is a torsion free connection. It is the
unique connection that is torsion free and satisfies
(vxg)(Y,Z) = aD(X,Y,Z) (3.2)
a Proof: That ? is a connection: Linearity in X is obvious. Scalar linearity
in Y as well. We have
vx(fY) =
vx(fY) -
f D(X,fY) = X(f)Y + fvxY.
Torsion freeness follows from symmetry of D:
??? -
??? - [?,?] =
??? -
??? - [?,?]
-f [D(X,Y) - D(Y,X)] = 0.
a That ? satisfies (3.2) follows from
(vxg)(Y,Z) = Xg(Y,Z) -
g(vxY,Z) -
g(Y,?xZ)
= (vxg)(Y,Z)
+ aD(X,Y,Z) = 0 + aD(X,Y,Z).
If ^ is torsion free and satisfies (3.2) we obtain:
i) Xg(Y,Z) = g(vxY,Z)
+ g(Y,vxZ)
+ aD(X,Y,Z)
ii) Zg(X,Y) = g(vxZ,Y)
+ g(vYZ,X)
+ aD(X,Y,Z)
+ g([z,x],Y) + g([z,Y],x)
iii) Yg(Z,X) = g(^YZ,X)
+ g(vxY,Z)
+ aD(X.Y.Z)
- g([x.Y],z)
Calculating now i) - ii) + iii) we get
Xg(Y,Z) - Zg(X,Y) + Yg(Z,X) = aD(X,Y,Z)
-g([z,x],Y) - g([z,Y],x) - g([x,Y],z) + 29(???,?).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 181
Since this equation also is fulfilled for ? we get
g(vxY,Z) =
g(vxY,Z), whereby ? = v.
0 . Obviously ? = v, the Riemannian connection.
To check what happens when we make a parallel translation we first
consider the notion of a conjugate connection (Amari, 1983).
Let (M,g) be a Riemannian manifold and ? an affine connection. The
conjugate connection v* is defined as
g(v*xY,Z) = Xg(Y,Z) -
9(?,???) (3.3)
3.2 Lemma v* is a connection, (v*)* = v.
Proof: Linearity in X is obvious. So is linearity in Y w.r.t. scalars. We
have
g(v*x(fY),Z) = Xg(fY,Z) -
g(fY,vxZ)
= X(f)g(Y,Z) + fXg(Y,Z) - fg(Y,vxZ)
= g(X(f)Y + fv*xY,Z).
And further
g((v*)*xY,Z) = Xg(Y,Z) -
g(v*xZ,Y)
= Xg(Y,Z) - {Xg(Z,Y) - g(vxY,Z)}
= g(vxY,Z).
If we now let p ,p* denote parallel transport along the curve ? we obtain:
3.3 Proposition
9(???,p*?) = g(X,Y)
Proof: Let X be v-parallel along ? and Y v*-parallel. Then we have
yg(x>Y) = g(vO(,Y) + g(x,v*or) = o.
In words Proposition 3.3 says that parallel transport of pairs of vectors w.r.t.
a pair of conjugate connections is "isometric" in the sense that inner product
is preserved.
Finally we have for the a-connections, defined by (3.1):
3.4 Proposition (?)* = ? .
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
182 Steffen L. Lauritzen
Proof:
g(vxY>z) =
g(vxY,z) -
|d(x,y,z)
g(Y>vxz) =
g(Y,vxZ) +
|D(X,Z,Y)
Adding and using the symmetry of D together with the defining property of the
Riemannian connection we get
g(vxY,Z) +
g(Y^vxZ) = Xg(Y,Z) (3.4)
The relation (3.4) is important and was also obtained by Amari (1983). If we
now consider the curvature tensors R and R* corresponding to ? and v* we obtain
the following identity:
3.5 Proposition
R(X,Y,Z,W) = -R*(X,Y,W,Z) (3.5)
Proof: Since we shall show a tensorial identity, we can assume [?,?] = 0 as
discussed in section 1. Then we get
XYg(Z,W) = X(g(vYZ,W)
+ g(Z,v*yW))
= g(vxvyZ,W)
+ g(vyZ,v*W)
+ g(vxZ,v*W)
+ g(Z,v*v*W).
By alternation we obtain
0 = [X,Y]g(Z,W) = XYg(Z,W) - YXg(Z,W)
= R(X,Y,Z,W) + R*(X,Y,W,Z).
Note that the Riemannian connection is self-conjugate which gives the well
known identity for the Riemannian curvature tensor, see section 1.
Consequently we obtain
3.6 Corollary The following conditions are equivalent
i ) R = R*
ii) R(X,Y,Z,W) = -R(X,Y,W,Z)
Proof: It follows directly from (3.5).
And, also as a direct consequence:
3.7 Corollary ? is flat if and only if v* is.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 183
The identity ii) is not without interest and we shall shortly
investigate for which classes of statistical manifolds this is true. Before we
get to that point we shall investigate the relation between a statistical mani-
fold and a manifold with a pair of conjugate connections.
We define the tensor field D,, and the tensor D, in a manifold with
a pair (v,v*) of conjugate connections by
?jiX.Y) =
?*?? -
???
?^?,?,?) =
9(0?(?,?),?).
We then have the following
3.8 Proposition If ? is torsion free, the following are equivalent
i) v* is torsion free
i i) D, is symmetric
iii) ? = ^(v+v*)
Proof: That D, is symmetric in the last two arguments follows from the
calculation
?^?,?,?) = g(v*Y,Z) -
g(vxY,Z)
= Xg(Y,Z) - g(Y,vxZ)
- [Xg(Y,Z)-g(Y,v*Z)]
= D.,(X,Z,Y)
The difference between two connections is always a tensor field, i) ?-> ii)
follows from the calculation
g(v*Y-v*X-[X,Y],Z) = g(vxY-vyX-[X,Y],Z)
+ ?^?,?,?)
- ?^?,?,?).
That iii) ?> i) is obvious since then v*=2v-v.
To show that i) ?> iii) we use the uniqueness of the Riemannian con-
nection. We define
? = ^(v+v*)
and see that this is torsion free, when ? and v* both are. But
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
184 Steffen L. Lauritzen
g(vxY,Z) +
g(Y,vxZ) =
^g(vxY,Z) + ^g(v*Y,Z)
+ ^(?,?*?) + ^(?,???)
= Xg(Y,Z)
showing that ? is Riemannian and thus equal to v.
Suppose now that ? is given with v* being torsion free. We can then
define a family of connections as
??? =
??? -
f ^(?,?)
and we obtain a -a? "1
3.9 Corollary ?* = ?, ?= ?, ? = ?*.
? Proof: It is enough to show ? = v. But
1
V =
^???+???) -
^(???-???) =
V'
We have thus established a one-to-one correspondence between a statistical
manifold (M,g,D) and a Riemannian manifold with a connection ? whose conjugate
v* is torsion free, the relation being given as
D(X,Y) = v*Y - ???
??? =
??? - y)(X,Y).
In some ways it is natural to think of the statistical manifolds as
being induced by the metric (Fisher information) and one connection (v) (the
exponential), but the representation (M,g,D) is practical for mathematical
purposes, because D has simpler transformational properties than v.
By direct calculation we further obtain the following identity for
a statistical manifold and its a-connections
3.10 Proposition
g(vxY,Z) -
g(vxZ,Y) =
g(vxY,Z) -
g(vxZ,Y) (3.6)
Proof: The result follows from
g(vxY,Z) -
g(vxZ,Y) =
g(vxY,Z) -
g(vxZ,Y)
- |D(X,Y,Z)
+ |D(X,Z,Y)
and the symmetry of D.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 185
We shall now return to studying the question of identities for the
curvature tensor of a statistical manifold. We define the tensor
F(X,Y,Z,W) = (vxD)(Y,Z,W)
where D is the skewness of the manifold, and ? is the Riemannian connection. We
then have
3.11 Proposition The following are equivalent a -a
i) R = R for ail aeIR
i i) F is symmetric
Proof: The proof reminds a bit of bookkeeping. We are simply going to estab-
lish the identity
R(X,Y,Z,W) - R(X,Y,Z,W) = a{F(X,Y,Z,W) - F(Y,X,Z,W)} (3.7)
by brute force.
Symmetry of F in the last three variables follows from the symmetry
of D. We have
2aF(X,Y,Z,W) = 2aXD(Y,Z,W)
-2a(D(vxY,Z,W) +
D(Y,VXZ,W) +
D(Y,Z,VXW)) a -a -a a
Since ? = h(v + v) and aD(X,Y,Z) = g(vxY,Z)
- g(vxY,Z)
we further get
2aD(vxY,Z,W) =
2g(vzW,vxY) -
2g(vzW,vxY)
-a a -a -a =
g(vzW,vxY) +
g(vzW,vxY) a a a -a
- g(vzW,vxY)
- g(vzw,vxY),
and similarly for the two other terms. Further we get
2aXD(Y,Z,W) = 2X(g(vYZ,W)
- g(vyZ,W))
= 2g(vxvYZ,W)
- 2g(vxvyZ,W)
-a a a -a +
2g(vyZ,vxW) -
2g(vyZ,vxW)
Collecting terms we get the following table of terms in 2aF(X,Y,Z,W), where
lines 1-3 are from 2aXD(Y,Z,W), 4 and 5 from 2aD(vxY,Z,W)
6 and 7 from
2aD(Y,vxZ,W) and 8 and 9 from 2aD(Y,Z,vxW).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
186 Steffen L. Lauritzen
Table of terms of 2aF(X,Y,Z,W)
with + sign with - sign
!? 2g(vx vyZ,W) 2g(vxvyZ,W) -a a a -a
2. g(vxY,vxw) g(vYZ,vxw)
-a a a -a 3.
g(vxY,vxW) g(vYz,vxW)
4. g(vzw,vxY) g(vzW,vxY)
a -a -a -a
g(vzw,vxY) g(vzw,vxY) 5.
6. g(vyW,vxZ) g(vyW,vxZ)
a -a -a a 7.
g(vyw,vxz) g(vyw,vxz)
a a -a -a
a a
g(vyZ,vxW) g(vyZ,vxW) a -a -a a
g(vyZ,vxW) g(vyZ,vxW)
Lines 4 and _5 disappear by torsion freeness and alternation. Lines 2_ + 9 add up
to zero. Lines 3_ + ? disappear by alternation. Lines 6. + 8 also. What is left
over are only terms from line ]_ whereby
2aF(X,Y,Z,W) - 2aF(Y,X,Z,W)
-a a = 2R(X,Y,Z,W) - 2R(X,Y,Z,W)
and the result and (3.7) follows.
A statistical manifold satisfying this kind of symmetry shall be
called conjugate symmetric. We get then immediately
3.12 Corollary The following is sufficient for a statistical manifold to be
conjugate symmetric a
3 a? j such that R ? 0,
i.e. that the manifold is a^-flat.
As shown e.g. in Amari (1985), exponential families are ?1 - flat
and therefore always conjugate symmetric.
In a conjugate symmetric space, the curvature tensor thus satisfies
all the identities of the Riemannian curvature tensor, i.e. also
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 187
R(X,Y,Z,W) = -R(X,Y,W,Zfi
\ (3?8)
R(X,Y,Z,W) = R(Z,W,X,Y)J
This implies as mentioned earlier that the sectional curvature determines the
curvature tensor.
We shall later see examples of statistical manifolds actually
generated by a statistical model that are not conjugate symmetric.
It also follows that the condition
3 a0 t 0 such that fP= R? (3.9)
is sufficient for conjugate symmetry.
Amari (1985) investigated the case when the statistical manifold was
aQ (and thus -aA flat in detail, showing the existence of local conjugate coor- a
dinates (??) and (?.) such that r... = 0 in the ?-coordinates and its conjugate -a 3
0 r-Mu
= ? in the ?-coordinates. 1 j ?
Further that potential functions ?(?) and f(?) then exist such that
gij(e) =
EiEj^e) 9ij(n) =
????:?(f(?))
and the ?- and ?-coordinates then are related by the Legendre transform:
?1 = ?.(f(?)) ?. = ?.(?(?))
?(?) + f(?) - ?????
= 0.
In a sense aQ-flat families are geometrically equivalent to exponential families..
If N^ is a regular submanifold of (M,g,D), the tensors g and D are
inherited in a simple way (by restriction). The a-connections are inherited by
orthogonal projections on to the space of tangent vectors to _N, i.e. by the
equation
g$xY,Z) =
g(vxY,Z) for ?,?,? e X(N). (3.10)
It follows from (3.10) that the a-connections induced by the restriction of g
and D to ?.(?) are equal to those obtained by projection (3.10). This consis-
tency condition is rather important although it is so easily verified.
A submanifold is totally a-geodesic (or just a-geodesic) if
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
188 Steffen L. Lauritzen
a
??? e ?(?) for all ?,? e ?(?).
If the submanifold is a-geodesic for all a we say that it is geodesic. We then
note the following
3.12 Proposition A regular submanifold ? is geodesic if and only if there
exist a, j a2 such that ? is a,-geodesic and g^-geodesic.
Proof: Let ?,? e X(N) and ? e ? (?)1 ? e ?.
Then IN is a.-geodesic, i=l, 2 iff
g(avj?Y,Z)p =
g(vxY,Z) = 0
for all such ?,?,?. But since
g(vxY,Z) =
g(vxY,Z) -
|D(X,Y,Z)
this happens if and only if D(X,Y,Z) = 0 for all such ?,?,?, whereby ? is geo-
desic iff it is a.-geodesic, i=l,2.
In statistical language, geodesic (a-geodesic) submanifolds will be
called geodesic (a-geodesic) hypotheses. A central issue is the problem of
existence and construction of a-geodesic and geodesic foliations of a statisti-
cal manifold.
A foliation of (M,g,D) is a partitioning
M = U ? - ?e^ -?
of the manifold into submanifolds ? of fixed dimension n(<m). N. are called ?? -?
the leaves of the foliation.
The foliation is said to be geodesic (or a-geodesic) if the leaves
are all geodesic (or a-geodesic).
It follows from Proposition 3.12 that geodesic foliations of full
exponential families (and of a-flat families) are those that are affine both in
the canonical and in the mean value parameters, in other words precisely the
affine dual foliations studied by Barndorff-Nielsen and Blaesild (1983). In
the paper cited it is shown that existence of such foliations are intimately
tied to basic statistical properties related to independence of estimates and
ancillarity. Proposition 3.12 shows that the concept itself is entirely geo-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 189
metric in its nature.
It seems reasonable to believe that the existence (locally as well
as globally) of foliations of statistical models could be quite informative. It
plays at least a role when discussing procedures to obtain estimates and an-
cillary statistics on a geometric basis.
Let H be a submanifold of M and suppose that ?e? is an estimate of
p, obtained assuming the model M. Amari (1982, 1985) discusses the a-estimate
of ? assuming ? as follows.
To each point ? of ? we associate an ancillary manifold A (p)
Aa(p) = exp
(Tp(N)A)
a i where exp is the exponential map associated with the a-connection and ? (?) is
the set of tangent vectors orthogonal to ? at p. In general the exponential map
might not be defined on all ? (N) , but then let it be maximally defined.
? is then an a-estimate of p, assuming ? if
? e ?a(?).
Amari (1985) shows that if M is a-flat and ? is -a-geodesic, then the a-estimate
is uniquely determined and it minimizes a certain divergence function.
This suggest that it might be worthwhile studying procedures that
use the -a-estimate for a-geodesic hypotheses H9 and call such a procedure
geometric estimation. In general it seems that one should study the decomposi-
tion of the tangent spaces at ?e?? as
Tp(M) =
??(?)F??(?)?
and especially the maps of these spaces onto itself induced by a-parallel trans-
port of vectors in ? (?), -a parallel transport of vectors in the complement,
both along closed curves in H.
It should also be possible to define a teststatistic in geometric
terms by a suitable lifting of the manifold N, see also Amari (1985). Things
are especially simple in the case where M has dimension 2 and N^ has dimension 1
and we shall try to play a bit with the above loose ideas in some of the
examples to come.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
5. THE UNIVARIATE GAUSSIAN MANIFOLD
Let us consider the family of normal distributions ?(?,s ), i.e.
the family with densities
? ?G" 1 2 f(x;y,a) = 1/2ps exp{--? (x-?) },?e^,s>0
2s
w.r.t. Lebesgue measure on IR. This manifold has been studied as a Riemannian
manifold by Atkinson and Mitchell (1981), Skovgaard (1984) and, as a statistical
manifold in some detail by Amari (1982, 1985). Working in the (?,s) parametri-
zation we obtain the following expressions for the metric, the a-connections
and the D-tensor (skewness) expressed as T... (cf. Amari, 1985).
??-ve ?) s M) V
a
G1? =
G122 =
G212 =
G221 = ?
a-? a9 a9 a-, ?? =r? =r? =rl = ? iy ?12 ?21 ?22
?
G112 - (1-a)/s3 G^
= (1-a)/(2s)
a a ? a1 a1 G121
= G211
= -(1+a)/s G?2 =
G21 = "(1+a)/c
G222 = "2(1+2a)/s G22
= -(1+2a)/s
?111 "
?122 ~
?212 "
?221 " ?
??2 =
?121 =
?2? = 2/s ?222
= 8/s
The a-curvature tensor is given by
d - p 2W 4 1212
" * ''? *
190
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 191
so the manifold is conjugate symmetric, and the scalar (sectional) curvature by
?a(s]2) =
R1221/(gng22) = -O"?2)/2
For a = 0 (the Riemannian case) we have ?(s,2)
= -1/2 and the manifold is the
space of constant negative curvature (Poincar?*s halfplane or hyperbolic space).
Note that it also has constant a-curvature for all a although nobody knows what
that implies, since such objects have never been studied previously.
To find all a-geodesic submanifolds of dimension 1 we proceed as
follows. Let (e,E) denote the tangent vector fields
e* J- E = -^. d\i do
a-? If we have ? =
?0 constant on _N, _X(N) is spanned by E. Since r22 = 0 we have
a vrE = f E for all a, t a
and thus that the submanifolds
? = {(?,s) |?=??},??e^ -v0
U U
are geodesic submanifolds and the family
(N ,peIR) (4.1)
constitutes a geodesic foliation of the Gaussian manifold.
If ? is non-constant on N, we must be able to parametrize ? locally
as
(t,a(t)), tei SIR.
The tangent space to ? is then spanned by
? = e + s E
d
the manifold by o(x9y):= o(x).
where we have let ?(t) = ^(t) and extended s to a function defined on all of
a a a a ?a
??? =
ve+aE^e+?E^ =
Vee +
2aVeE +
?? + ^? ^'2^
where we have used torsion freeness and the fact that e(a) = s, ?(s) = 0. Using
ak now the expressions for r.., we get
? s 2s s
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
192 Steffen L. Lauritzen
If this again has to be in the direction of N, we must have
1+a 0-2 _ 1-a _,_ ?? l+2a -2 2s = V1 + s
2s s s
which by multiplication with 2s reduces to the differential equation
20a + 2h2 = (a-1)
? o This is most conveniently solved by letting u = s , whereby ii = 2ss + 2s and
the equation becomes as simple as
?j = a_? ^ u(t) = ^(a-l)t2 + Bt + C, (4.3)
such that the a-geodesic submanifolds are either straight lines (a = 1) or
parabolas in the (?,s )-parametrisation.
The special case a = 1, ? = 0 corresponds to the manifolds
\ = {(?,s) |s=s0>, o^IR+
that give a 1-geodesic foliation.
Another special case is the submanifolds of constant variation
coefficient
V^ =
{(?,s)|s=??},?e^+
that we now see are a-geodesic if and only if a = 1+2? by inserting into (4.3).
V are now connected submanifolds but is composed by two non-connected submani- ??
folds V +
and V "
V + = {(?,s)|?>0}?? , V
" = {(?,s) |y>0}f?V .
The (V ,V ") manifolds do not represent a-geodesic foliations since they are
not a-geodesic for the same value of a. For a = 0 we see that the geodesic sub-
2 2 manifolds are parabola's in (?,s ) with coefficient -h to ? , a result also
obtained by Atkinson and Mitchell (1981) and Skovgaard (1984).
Consider now the hypothesis (?,s) e? , i.e. that of constant varia-
tion coefficient. We shall illustrate the idea of geodesic estimation in this
example as described at the end of section 3.
2 V is a=1+2? geodesic. The ancillary manifolds to be considered
are then -a-geodesic manifolds orthogonal to V .
An arbitrary -a-submanifold is the "parabola"
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 193
s = (-(1+?2)?2+??+0?5 ?
which follows from (4.3) with a = -(1+2? ). Its tangent vector is equal to
e+?E = ??: [-2(1+?2)?+?]?+?.
The tangent vector of the hypothesis is
e+??.
They are at right angles at (uo^q)
if and on1y if
1+1 [-2(1+?2)?0+?]=0
~ ?=(1+2?2)?0.
The ancillary manifold intersects at (?0>??0)
if and only if
-(1+?2)?2+(1^)??+0=?2?2 ^ C=0?
2 The -(1+2? )-geodesic ancillary manifolds are thus given as
W = {(?,s (t))|tel }, pcIRxiO}
(Wq =
{(0,a)|a?IR+})
where s 2(t) = -(l+y2)t2 + (l+2Y2)yt and
r 2
I Vl
' 1+2Y?
A
]0, ?%- u[ if U>0
(]-^?-?,0[ Tf ???. V. 1+?
2 The manifolds W , ?e IR actually constitute a -(1+2? ) -foliation of the Gaussian
manifold. To see this, let (x,s) be an arbitrary point in M. If we try to
solve the equation
(x,s2) = (t,-(l+Y2)t2+(l+2Y2)yt)
we obtain exactly one solution ? for xj09 given as
s2
^ (l+Y2)x2+s2 (1+?2)?+?2 ? ~J=
(1+2?2)? (1+2?2) *A (4.4)
s2 i.e. a linear combination of ? and ?jz.
y?X ?, as determined by (4.4) is the geometric estimate of ?, when ?
and s denote the empirical mean and standard deviation of a sample x,,...,x .
It is by construction (see Amari (1982)) consistent and first-order efficient.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
194 Steffen L. Lauritzen
A picture of the situation is given below in three different parametrizations:
2 -2 (??s), (?,s ), and (?,s ):
-2 0 2
Fig. 1: Geometric estimation with constant coefficient of variation, (?,s)-
param.
Fig. 2: Geometric estimation, (?,s )-param.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 195
??
Fig. 3: Geometric estimation, (?,-*-) param.
To obtain a geometric ancillary and test-statistic we proceed as follows:
We take a system of vectors on the hypotheses whose directions are
2 -(1+2? ) -parallel and whose lengths are equal to one. Further they are to be
orthogonal to the hypothesis (and thus tangent to the estimation manifolds).
The directions should thus be given as
? = (vrv2) -e + y- E.
2?
To obtain unit length, we get ||v| ?17^ 1 / 2??-1 _ ? 2?2+1
s V 0 2 "
\9 4 2 ?? ?y ?
when s=??, and our orthogonal field is thus
?(?) = [??(?),?2(?)]
= a[-y,^]
4 2 h where a = (2? /(2? +1)) . To find the exponential map
-(1+2?2) exp a?(?)} = (f(t,v),o(t,p))
we shall solve the equations
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
196 Steffen L. Lauritzen
s2(?,?) = -0+Y2)f(t,y)2 + (l+2y2)f (t ,? )? (4.5)
d fd {?'?)
= -ap and ?(0.?) =
f=2f ? (1+a) *- f = -2/YZf
J- (4.6)
since only the speed of the geodesic has to be determined. (4.6) is easily seen
to be equivalent to
a 2 f = ?s ? for some K+0. (4.7)
Inserting (4.5) into this we obtain
f = K(-(1+Y2)f2 + (l+2Y2)yf)'2Y
and separation of variables yield 2
/J[-(1+Y2)u2 + (1+2?2)??]2? du = Kt+C ?
Substituting v=u/y we get
v4Y2+1G(fit1}il) = Kt+C (4i8)
where G(x) = /J [-(l+Y2)v2+(1+2Y2)v]2Y2dv.
Using the initial condition ?(0,?)=? we get
C = p4y2+1G(1)
and the condition f(0,y) = -ay yields together with (4.7)
? = s4?2(0,?)(-3?) =-3?4?2?4?2+1,
whereby
/Ai ejf?LHi, . .aY4y>2+1t + ?4?2+16(?),
2 4? +1
and dividing by y Y yields thus
^l?hA) = -??4? + G(1)
and therefore f(t,y) = yh(t) where
2 h(t) = G'^-ay^ t + 6(1)).
Inserting this into (4.5) yields
a(t,y) = y /-(HY2)h(t)2+(l+2Y2)h(t)
which is linear in y. If we now interpret points of same "distance" from the
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 197
hypothesis as those where t is fixed and only y varying, we see that s/x is in
one-to-one correspondence with t. We shall therefore say that s/x is the
geometric ancillary and this it also is the geometric test statistic for the
hypothesis a=yy.
It is of course interesting, although not surprising, that this
test statistic (ancillary) is obtained solely by geometric arguments but still
equal to the "natural" when considering the transformation structure of the
model.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
6. THE INVERSE GAUSSIAN MANIFOLD
Consider the family of inverse Gaussian densities
*fv. ,\ - ?L ^ - ^(xx"1"?- ??) -3/2 . ~ t(?;?,?) -
y/g^e ? ? ???>0
w.r.t. Lebesgue measure on IR . We choose to study this manifold in the para-
metrization (?,?), where
? = x_1 ? = ?-
1???
, 2- . 1(?-1+?2?) f(x;n.e) = h(x)n"S
n 2?
The metric tensor and the skewness tensor can now be calculated either by using
their definition directly or by calculating these in the (?,?) coordinates and
using transformation rules of tensors. We get
?
?? /
-3 -1 -2 3 G112=0, ?1?=? ' ?122=? ? ' ?222=~72~"
* ? ?
The Riemannian connection is now determined by
?ijk =
^Wuc^k9?3' such that
arm = -(Ha)/(2n3), r112
= ?211
= ?]21
= 0
a ? a a ? G221
= (1-a)/(2?tG), G122 =
G212 = -(1+a)/(2?? )
G222 = (3a-1)/(2?2?)
Multiplying with the inverse metric we get
198
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 199
a-. a~ a? a,
Ty = -0+a)/? G^
= rj2
= r?1
= O
G22 =
G^ = -(1+a)/(2?) rj2
= (1-a)/?
G22 = (3a-1)/2?.
To find all geodesic submanifolds of dimension one we first notice
a2 that since r,, ? 0, the manifolds
\' <(?.?}|?-?0>
are a-geodesic for all a, i.e. geodesic and they constitute a geodesic foliation
of the inverse Gaussian manifold. Because
f X = ?"1
they correspond to hypotheses of constant expectation.
Consider now a submanifold of the form (n(t),t), i.e. with tangent
? given as
? = ? e + E, Where e = ^
E = ?-
.
We extend ? by letting n(x,y): = n(y)> i.e. such that e(n) = 0, ?(?) = ?. Then
a ?a #ot _ a vMN = ? ? e + 2r>v E + ne + v-E
? e e E
/*' 1+a ?2 . 1-ou , / 1+a ? , 3a-l\r- = (? . __ ? + __)e + (- _ n + ___)?
a We now have V..N = hN iff
?r 1+a ? , 3a-1t _ r?? 1+a ? , 1-a? ?[- ?"?
+ ~2G]
' [? " ~ ? + ~G]
which reduces to the differential equation
3a-l ? a-1 2t " t
'
This is first solved for a = ?:
2 2 n =
--^r+-*n =
-2-logt + C^
n(t) = - |
t log t + C-jt
+ C2.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
200 Steffen L. Lauritzen
1-3a 1 o
For a f 2 we get by letting u = nt that u satisfies the differential
equati0n - l+3a 1^3?
? = (a-i)t 2
^u(t) =fe^-t 2 +
C1
Whereby 3^
n(t) = y^-t
+ Bt 2 + C, a ? ]?
For a=l (the exponential connections) we get the parabolas:
n(t) = Bt2 + C
and for a=-l (the mixture connection) we get the curves:
n(t) = -t + B/t + C.
In the Riemannian case (a=0) we get
n(t) = -2t + B/F + C
that are parabolas in (/?G,?).
The curvature tensor is given by
a a a a a a a , 2
R1212
* <?G21>
- Mll^s
+ (rlr2r21
" r2r2r?l > =
^?
The manifold is thus conjugate symmetric (we already know, since it is an ex-
ponential family) and the sectional curvature is
?a(s]2) =
-R12l2/(gllg22) = "?-a2)/2.
Note that the Riemannian curvature (a=0) is again constant equal to -h9 as in
the Gaussian case. In fact the a-curvature is exactly as in the Gaussian case.
We can map the inverse Gaussian manifold to the Gaussian by letting
? = /2? s2 = ?/2
and this map is a Riemannian isometry. However, it does not preserve the skew-
ness tensor and thus the Gaussian and inverse Gaussian manifolds do not seem to
be isomorphic as statistical manifolds, although they are as Riemannian mani-
folds.
Corresponding to the hypothesis of constant coefficient of vari-
ation, we shall investigate the submanifold corresponding to the exponential
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 201
transformation model /?? = ?, ? fixed, i.e.
h(x)/?V 2? s>0
which in the (n,e)-parametrization is a straight line through the origin (as
const, coeff. of var.)
{? = ??} = V ??
This submanifold is a-geodesic if and only if
2(g-l) _ 2+? ? =
l^T ~ a -
2+37 ?
The tangent space to V is spanned by ye+ E, and the orthogonal -a-geodesic
submanifolds are given by solving the equations
l-3a
2ia^?=.2iI+^.+ B. 2
+c l-3a l+3a
to get the intersecting point and orthogonality at ( ,_I ' #,?) gives
3a+l
(5.1)
? = 8a y 2
l-9a2
Combining this with (5.1) we get C=0, i.e. the estimation manifolds are given as
3a+l l-3a
ncy (t) - i?Mt - SajL2
2 {Z) "
l+3a Z
? Q 2 r
The manifolds W^,, ?>0 again constitute a -a-foliation of the inverse Gaussian ??
manifold as is seen by solving the equations
(?0??0) =
(n#(t),t)
which gives t=eQ, and
- G (3a-?? "? 4a
3a+1 ? 3a-1
?0 +
^G ?0?0
3a+1
. G(3a-1)(a 90 [_
4a
-t 2
?11+ 9a2-1 ?0 8a ?
0 J
3a+1
This again determines a geometric estimate ? of ? from a sample x^,...,x from
the inverse Gaussian distribution, and this is obtained by letting
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
202 Steffen L. Lauritzen
?0 = ]/* ?0
= ? S??
" ?/* '
and inserting a = (2+?)/(2+3?) into the expression given above.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
7. THE GAMMA MANIFOLD
Consider the family of gamma densities
f(x;y,?) = (?/y)? x?"Vr(6) exp{- ^} y>0, ?>0
w.r.t. Lebesgue measure on IR+. The metric tensor is obtained by direct cal-
culation in the (y,3)-parametrization as
0
where ?(?) = D2 log r(?) - 1/$.
The Riemannian connection is now obtained by
fijk =
^3l9jk +
3jgik -
3kgij] t0 be
fin = -?/y3; f112
= - l/(2y2); f121 =
?211 = 1/(2?2)
f222 = V(?)> f221
= r122
= f212
= 0.
1 Similarly we calculate r... by the formula
? J ?
?1Jk -
t?^Ej?DE^D) to be
1 3 ] 2 G??
= -23/? G121 = 1/p
11111
G122 =
G?2 =
G212 =
G222 =
G221 = ?
and the skewness tensor T..k =
2(^1-j|< "
rijk^
Tlll = 2?/^ ??2
= T121
= T211
= ~1/y T222 = f'(?)
T221 =
T122 =
T212 = ?'
203
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
204 Steffen L. Lauritzen
whereby the a-connections are determined to be
aT - (1+?)? "
_ad_ hn
" 3 r112 0 2
? 2?
a a ,, a ?
1121 '211 0 2 !222 2 f m 2?
a a a
G122 =
G212 =
G221 = ?"
Multiplying by the inverse metric we get
? = . J+2. ?2 - a'? 11 y "
2?(3)
aA ?1 . J+a "2 _ha f'(?) M2
" *21 23 *22
" 2 Tf?T
and all other symbols equal to zero.
The curvature is by direct calculation found to be
" _ (a2-1)[f(?)+?f'(?)]
1212 4?F(3)
The space is conjugate symmetric and therefore the curvature tensor is fully
determined by the sectional (scalar) curvature which is
(a) _ 2 ? 22 . l-a2 [f(?)+?F'(?)3 K " Rl2129 9 -"?
?2f(?)
Note that this is even for a=0 different from the two previous examples in that
the curvature is non-constant and truly dependent on the shape parameter 3.
To find all geodesic submanifolds we proceed as follows:
If p=yQ is constant on N^, X,(N) is spanned by the tangent vector E corresponding
to differentiation w.r.t. the second coordinate. Since
h-
a -,
vEE- ?
these submanifolds are geodesic for all values of a and constitute a geodesic
foliation of the gamma manifold.
Considering the manifold given by 3=3q? its tangent space is span-
ned by e and since
"=-^e+ "-1 -
e P 2?2f(?)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 205
these are a-geodesic if and only if g=l.
In general let us consider a hypothesis (submanifold) of the type
(f(t),t). Its tangent vector is
f e + E and e(f) = 0, E(f) = f
we have a . ??a .a .. a v? ? ? c(f e + E) = f ? e = 2fv E + f e + ?G? f e + E ' e e E
= r.f2 Ha + f
1+e + i]e + [f2 _??_ + J^a .?^Lje ? ? 2/f(?)
2 F(?)
If we now let ?=t u=f and multiply the coefficient to E by f we obtain the
equation
-(l+a)^(Ha){+f =
4^+2^'(t)f
which unfortunately does not seem soluble in general. For o=l the solutions are
f(t) = t/(At+B).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
8. TWO SPECIAL MANIFOLDS
In the present section we shall see that things are not always as
simple as the previous examples suggest, but even then we seem to be able to get
some understanding from geometric considerations.
First we should like to notice that when we combine two experiments
independently with the same parameter space, both the Fisher information metric
and the skewness tensors are additive. Let X^PA ?^?? and let ?., ?. denote the DO 11
derivative of the two log-likelihood functions
Ai =
W7 log f (?;?) Bi =
drlog g(y;e)-
Then the skewness tensor is to be calculated as
V =
E^){VBj)(VBk>
? EWk+ EBiBjBk
since all terms containing both A1s and B's vanish due to the independence and
the fact that EA. = EB. = 0.
If we now let ?^?(?,s ), ?^?(s,?) and X and Y independent we get
by adding the information and skewness tensors that in the ^,a)-parametrization
? ? "Is-1- ?2 (?
?
and that, as in the Gaussian manifold, we have
3 3 Tlll
= T122
= T212
= T221
= ? ??2
= 2/s T222 = 8/s '
a Since derivatives of the metric are as in the Gaussian case, so are the r...-
symbols:
206
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 207
a a a a
G1? =
G122 =
G212 =
G221 = ?
a a ? a ^ G121
= G2?
= "(?+a)/s G222=-2(1+2a)/s^.
ak But the a-connections are truly different which is seen by looking at the r..-
* j
symbol s :
a9 ~ a, a?
G^ = (l-a)/(2a+aJ) rj2
= r^
= -(1+a)/s
G22 = -2(1+2a)/(2s+s3)
and all others equal to zero. Considering now the curvature tensor we get
R - (1 ? [2(1+a)+a2(2-a)] _
?
K1212 u a; _4??_2? K211
-a a4(2+a2)
2112
and this is clearly different from R-ioi? wherebY this space is not conjugate
symmetric. The sectional curvature is not determining the curvature tensor be- 1
cause e.g. R-ipi??? but the sPace is not 1 -"Plat since
R - "r - MO C2(l-a)+a2(2+g)] _ ?
R1221 "R1212 'U+a) 4,9^ 2x-R2121 s (2+s J
a From standard properties of the curvature tensor we have
R,?-.?^ = 0, but we
obtain by direct calculation that
a a a a
Rl 211 =
R2in =
R1222 =
R2122 = ?5
such that the above components are the only ones that are not vanishing.
If we try to find the geodesic submanifolds we first observe that
al because r0 = 0 for all a, the submanifolds
? = ?(?,s)|?=??}
are totally geodesic for all a, and thus constitute a geodesic foliation of the
manifold. Following the remarks at the end of section 4, relating geodesic
foliations to the affine dual foliations of Barndorff-Nielsen and Blaesild
(1983), it is of interest to know that also in this example, the maximum likeli-
2 hood estimates of s and ? are independent as expected from the foliation. We
shall now proceed to find the remaining geodesic manifolds.
If we consider manifolds of the type (t,f(t)) with tangent vector
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
208 Steffen L. Lauritzen
e + f E we get
Vf E (e+f E) = V
+ 2fveE
+ f2vEE
+ f E
= .?(1+a)e+(Jr?_.2?Il2aif2)E
2s+s 2s+s
Multiplying the coefficient to e with f and inserting a=f we get the equation
2f2 |1(?+a)
, ? i?lM + ?-? .f T
f(2+r) f(2+r)
Multiplying on both sides with f(2+f ) and collecting terms gives
2f2f2(l+a) + 2ff + ff3 + 2f2 = a-1
and this does not seem to have a particularly nice solution.
Note that f(t) and yt is not a solution since then f=y f=0 and we
obtain the equation for a:
2Y4t2(l+a) + 2?2 = a-1
which can only hold when a = -1 and then we get
2 2y = -2
which is impossible.
In this example the "constant coefficient of variation" does also
not have any simple group transformational properties.
It seems then of interest to see what happens if we consider the
2 model with ?^?(?,s ), Y^N(log s,?) which is related to the example just consid-
ered but where the "constant coefficient of variation" is_ transformational. The
model is also transformational itself (the affine group). By the same argument
as before the skewness tensor becomes identical to that of the univariate Gaus-
sian manifold. The metric, however, becomes
, ? o 2 / 1 0 > g =
-?- j g = s s I 0 3 '
whereby we calculate the Riemannian connection to be
\
G112 = l/s3 f211
= f121
= -1/s3
3 G222
= ~3/s G?1 =
G122 =
G212 =
G221 = 0#
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 209
The a-connections are
a
G122 = (1-a)/s? G121
= G211
= -(1+a)/s 3
a _
a /? ?, 3
a
G222 = (-3"4?)/s3 !",?
= G122
= G212
= G221
= 0,
or in the r. .-symbols: ? j an a?? a-?
G^ = (1-a)/3s G21
= G21
= -(1+a)/s
G22 = "(3+4a)/3s?
The curvature tensor can be calculated to be
2 - (1-a)(3+a) I . (?a)(3-a) K1212 4 K1221
" ' 4
s s
So we do indeed again have a manifold that is not conjugate symmetric. All
other components are again vanishing apart from ^112' R2121*
The sPace is not
flat for any value of a.
Considering the problem of finding all geodesic submanifolds we have
the same situation as earlier in that
_N = {(?,s) |?=??)
??
together constitute a foliation that is geodesic for all values of a, again in
accordance with the independence of ? and s.
Consider now a submanifold of the type [t,f(t)] with tangent
e + f E. We get a a .a .?a ? ^r(e+fE) = ? e + 2fv E + f?vcE + fE
e+fE 'e e E
= - -?- (l+a)e +
bjf-jjT-f + f JE
Multiplying the coefficient to e by f and everything by 3f and reducing, we
obtain the following differential equation:
(3+2a)f2 + 3ff = a-1 .
For a=0 (the Riemannian case), we get
3f2 + 3ff = -1.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
210 Steffen L. Lauritzen
o Letting f = ?G, u = f we obtain as in the Gaussian case the equation
ii = - |
?<-? u = - jt2
+ At + B,
2 i.e. again parabolas in the (?,s ) parametrization but with a different coef-
2 ficient to t .
Note that, in fact, considered as a Riemannian manifold there is no
essential difference between this and the univariate Gaussian manifold, since
we have constant scalar Riemannian curvature equal to
* 4 ? s _ ? " 4
# ?
* s
i.e. again a hyperbolic space.
If a | {1,-2*} the following special parabolas are solutions:
?2 = ?+T2
+ By + ?2 p^t?
B arbitrary?
2 2 s =
aQ is 1-geodesic. For a = -3/2 no parabolas are geodesic. The equation
then reduces to
f f = ? tT 3 '
the general solution to which cannot be obtained in a closed form.
If we consider the transformation submodel of "constant coefficient
of variation" s=?? corresponding to f(t)= t, we get the equation
(3+2a)?2 + 0 = a-1.
Solving this for a we find the following peculiarity:
a = (3?2+1)/(1-2?2) if y2jh
but if ? = ?/2/2, the equation has no solution!! In other words, all "constant
variation coefficient submanifolds" of the manifold studies are a-geodesic for
2 suitably chosen a except one (? = h).
A reasonable explanation for this is at present beyond my imagina-
tion. Is there a missing connection (a=?>)? Have I made a mistake in the cal-
culations? Or is it just due to the fact that the phenomenon is related to how
this model is a submodel of the strange two-dimensional model. In any case,
there is a remarkable disharmony between the group structure and the geometry.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 211
To go a bit further we consider the three-dimensional manifold
2 (^,a,c)-parametrized) obtained from considering X ^ ?(?,s ), ? ^ ?(?,1). The
metric for this becomes
2 s
0 0
and the skewness-tensor and the a-connections are identical to the Gaussian
case when only indices 1 and 2 appear and all involving the third coordinate
are equal to zero. Letting (e,E,F) denote the basis vectors for the tangent
space determined by coordinatewise differentiation, we consider now the "con-
stant coefficient of variation" submanifold:
{(t,Yt, log ? t), t e IR+}
with tangent-vector ? = e + ?? + Ar, and we get
v ?
Wie+YE) + <-
7>F
a a ?a ?. =
vee +
2???? + ? v?E
- -j F
Inserting the expressions for the a-connections we obtain
r, ? ? ^a r&~? . ?+2a\G 1 r ???
= -2t^- (m
+ y-r)E
_7F
= -l[2(l+e)e + (^+Y(H2a))E
+ lF].
If this derivative shall be in N's direction we must have
2(1+a) = 1 -> a=-h9
but also
? = Ir
+ ?(1+2a) -*2?2 = -1?
which is impossible. We conclude thereby that this transformational model is
not a-geodesic for any a, considered as a submodel of the full exponential
model.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
DISCUSSION AND UNSOLVED PROBLEMS
The present paper seems to raise more questions than it answers.
We want to conclude by pointing out some of these, thereby hoping to stimulate
research in the area.
1. How much structure of a statistical model is captured by its
"statistical manifold", the manifold being defined through expected geometries
as by Amari, minimum contrast geometries as by Eguchi or observed geometries as
by Barndorff-Nielsen? On the surface it looks as if only structures up to
third order are there and as if one should include symmetric tensors of higher
order to capture more.
2. Some statistical manifolds (M^g-pD-j)
and (M2,g2,D2) are
"alike", locally as well as globally. Various types of alikeness seems to be of
some interest. Of course the full isomorphism, i.e. maps from M, to M~ that
preserves both the Riemannian metric and the skewness tensor. But also maps
that preserve some structure, but not all could be of interest, in analogy with
the notion of a conformai map in Riemannian geometry (maps that preserve angles,
i.e. the metric up to multiplication with a function). There are several pos-
sibilities here. Isometries that preserve the skewness tensor up to a scalar
or up to a function. Maps that preserve the metric up to scalars and/or func-
tions and do and do not preserve skewness etc. etc.
3. In connection with the above there remains to be done a lot of
work on classification of statistical manifolds in a pure mathematical sense,
i.e. characterize manifolds up to various type of "conformai" equivalence,
"conformai" here taken in the senses described above. A classic result is that
212
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 213
two Riemannian manifolds are locally isomorphic if they have identical curvature
tensors. Do similar things hold for statistical manifolds and their a-curva-
tures? Note that the inverse Gaussian and Gaussian manifolds seem to be alike
but not fully isomorphic. Results of Amari (1985) seem to indicate that a-flat
families are very similar to exponential families. Are they in some sense
equivalent? There might be many interesting things to be seen in this direc-
tion.
4. Some statistical manifolds seem to have special properties. As
mentioned above we have e.g. a-flat families, but also manifolds that are
conjugate symmetric or manifolds with constant a-curvature both for a particular
a and for all a at the same time. Which maps preserve these properties? Can
they in some sense be classified?
5. How does the geometric structures behave when we form marginal
and conditional experiments? Some work has been done on this by Barndorff-
Nielsen and Jupp (1984, 1985).
6. Is there a decomposition theory for statistical manifolds. We
have seen that there might be a connection between the existence of geodesic
foliations and independence of estimates. There might be a de Rham-like theory
to be discovered by studying parallel transports along closed curves in flat
manifolds?
7. Chentsov (1972) showed that the expected geometries were the
only ones that obeyed the axioms of a decision theoretic view of statistics, in
the case of finite sample spaces. It seems of interest to investigate general-
izations of this result, both to more general spaces and to other foundational
frameworks. Picard (1985) has generalized the result to the case of exponential
families and has some results pertaining to the general case.
8. What insight can be gained by studying the difference between
observed and expected geometries?
9. How is the relation between the geometric structure of a Lie-
transformation group and the geometric structure of its transformational statis-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
214 Steffen L. Lauritzen
ti cal models?
Other questions and problems are raised by Barndorff-Nielsen, Cox,
and Reid (1986) and in the book by Amari (1985).
Acknowledgements
The author is grateful to Ole Barndorff-Nielsen, Preben Blaesild,
and Erik Jtfrgensen for discussions relevant to this manuscript at various
stages.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
REFERENCES
Amari, S.-I. (1982). Differential geometry of curved exponential families -
curvatures and information loss. Ann. Statist. 10, 357-385.
Amari, S.-I. (1985). Differential-Geometrical Methods in Statistics. Lecture
Notes in Statistics Vol. 28, Springer Verlag. Berlin, Heidelberg.
Atkinson, C. and Mitchell, A. F. S. (1981). Rao's distance measure. Sankhya
A 43 345-365.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983). Exponential models with
affine dual foliations. Ann. Statist. jj_ 753-769.
Barndorff-Nielsen, 0. E., Cox, D. R. and Reid, N. (1986). The role of differen-
tial geometry in statistical theory. Int. Statist. Rev, (to appear).
Barndorff-Nielsen, 0. E. and Jupp, P. E. (1984). Differential geometry, profile
likelihood and L-sufficiency. Res. Rep. 113. Dept. Theor. Stat., Aarhus
University.
Barndorff-Nielsen, 0. E. and Jupp, P. E. (1985). Profile likelihood, marginal
likelihood and differential geometry of composite transformation models.
Res. Rep. 122. Dept. Theor. Stat., Aarhus University.
Boothby, W. S. (1975). An Introduction to Differentiable Manifolds and Rieman-
nian Geometry, Academic Press.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Conclusions (in
Russian) Nauka, Moscow. Translation in English (1982) by Amer. Math. Soc.
Rhode Island.
Efron, ?. (1975). Defining the curvature of a statistical problem (with discus-
sion). Ann. Statist. 3 1189-1242.
215
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
216 Steffen L. Lauritzen
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a
curved exponential family. Ann. Statist. lj[ 793-303.
Picard, D. (1985). Invariance properties of the Fisher-Rao metric and Chentsov-
Amari connections using le Cam deficiency. Manuscript. Orsay, France.
Rao, C. R. (1945). Information and the accuracy attainable in the estimation of
statistical parameters. Bull. Calcutta Math. Soc. 37 81-91.
Skovgaard, L. T. (1984). A Riemannian geometry of the multivariate normal
model. Scand. J. Statist. 11 211-223.
Spivak, M. (1970-75). Differential Geometry Vol. I-V. Publish or Perish.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
DIFFERENTIAL METRICS IN PROBABILITY SPACES
C. R. Rao*
1. Introduction.219
2. Jensen Difference and Entropy Differential Metric . 222
3. The Quadratic Entropy.226
4. Metrics Based on Divergence Measures . 228
5. Other Divergence Measures . 231
6. Geodesic Distances . 234
7. References.238
Department of Mathematics and Statistics, University of Pittsburgh,
Pittsburgh, PA
217
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
1. INTRODUCTION
In an early paper (Rao, 1945), the author introduced a Riemannian
(quadratic differential) metric over the space of a parametric family of prob-
ability distributions and proposed the geodesic distance induced by the metric
as a measure of dissimilarity between probability distributions. The metric
was based on the Fisher information matrix and it arose in a natural way
through the concepts of statistical discrimination feee also Rao, 1949,1954,1973
pp. 329-332, 1982a). Such a choice of the quadratic differential metric, which
we will refer to as the information metric, has indeed some attractive proper-
ties such as invariance for transformation of the variables as well as the para-
meters. It also seems to provide an appropriate (informative) geometry on the
probability space for studying large sample properties of estimators of para-
meters in terms of simple loss functions as demonstrated by Amari (1982, 1983),
Cencov (1982), Efron (1975, 1982), Eguchi (1983, 1984), Kass (1981) and others.
Kass (1980, Ph.D. thesis) explores the possibility of using differential geo-
metric ideas in statistical inference.
The geodesic distances based on the information metric have been
computed for a number of parametric family of distributions in recent papers by
Atkinson and Mitchell (1981), Burbea (1986), Kass (1981), Mitchell and
Krzanowski (1985), and Oiler and Cuadras (1985).
In two papers, Burbea and Rao (1982a, 1982b) gave some general
methods for constructing quadratic differential metrics on probability spaces,
of which the Fisher information metric belonged to a special class. In view of
the rich variety of possible metrics, it would be useful to lay down some
219
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
220 C. R. Rao
criteria for the choice of an appropriate metric for a given problem. Amari has
stated that a metric should reflect the stochastic and statistical properties
of the family of probability distributions. In particular he emphasized the
invariance of the metric under transformations of the variables as well as the
? parameters. Cencov (1972) shows that the Fisher information metric is unique
under some conditions including invariance. Burbea and Rao (1982a) showed that
the Fisher information metric is the only metric associated with invariant
divergence measures of the type introduced by Cisz?r (1967). However, there
exist other types of invariant metrics as shown in Section 3 of this paper.
The choice of a metric naturally depends on a particular problem
under investigation, and invariance may or may not be relevant. For instance,
consider the space of multinomial distributions, ? = {(?,,...,? ): p. > 0,
S?. = 1}, which is a submanifold of the positive orthant, X = {(x-j,...,x ):
?. > 0} of the Euclidean space Rn. A Riemannian metric on X automatically pro-
vides a metric on the submanifold ?. In a study of linkage and selection of
gametes in a biological population, Shahshahani (1979) considered the metric
? ? ??. 0 ds2=[?Ldx2 (1.1)
1 i
which induces the information metric on ?. This metric provided a convenient
framework for a discussion of certain biological problems. However, Nei (1978)
considered a distance measure associated with the Euclidean metric
ds2 = Sdx2 (1.2)
which he found to be more appropriate for evolutionary studies in biology. The
metric induced on d by (1.2) is not the Fisher information metric. Rao (1982a,
1982b) has shown that a more general type of metric
SS?. .dx.dx. (1.3) IJ I J
called the quadratic entropy is more meaningful in certain sociometric
and biometrie studies.
The object of the present paper is to provide some general methods
of constructing Riemannian metrics on probability spaces, and discuss in
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 221
particular the metric generated by the quadratic entropy which is an ideal
measure of diversity (see Lau, 1985 and Rao, 1982b), and has properties similar
to the information metric, like invariance. We also give a list of geodesic
distances based on the information metric computed by various authors (Atkinson
and Mitchell, 1981; Burbea, 1986; Mitchell and Krzanowski, 1985; Oiler and
Cuadras, 1985 and Rao, 1945).
The basic approach adopted in the paper is first to define a measure
of divergence or dissimilarity between two probability measures, and then to use
it to derive a metric on M, the manifold of parameters, by considering two
distributions defined by two contiguous points in M. We thus provide a method
for the construction of an appropriate geometry or geometries on the parameter
space for discussion of practical problems. Some divergence measures may be
more appropriate for discussing properties of estimators using simple loss
functions while others may be appropriate in the study of population dynamics in
biology. It is not unusual in practice to study a problem under different
models for observed data to examine consistency and robustness of results. The
variety of metrics reported in the paper would be of some use in this direction.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
2. JENSEN DIFFERENCE AND ENTROPY DIFFERENTIAL METRIC
Let ? be a s-finite additive measure defined on a s-algebra of
subsets of a measurable space X9 and P^ be the usual Lebesgue space of ? measur-
able density functions,
? = (p(x): p(x) > 0, ?e?, Lp(x)dv(x) = 1} . (2.1)
We call H: ?->R an entropy (functional) on P^ if
(i) H(p) = 0 when ? is degenerate,
(ii) H(p) is concave on P_.
In such a case, with ? > 0, ? > 0, ?+?=1, Rao (1982a) defined the Jensen
difference between ? and qeP^ as
J(A,y; p,q) = ?(?? + uq) - ??(?) - yH(q) . (2.2)
The function J: P_ ? F^->R is non-negative and vanishes ifp = q(iffp = q when
? is strictly concave). If the entropy function ? is regarded as a measure of
diversity within a population, then the Jensen difference J can be interpreted
as a measure of diversity (or dissimilarity) between two populations. For the
use of Jensen difference in the measurement, apportionment and analysis of di-
versity between populations, the reader is referred to Rao (1982a, 1982b).
Let us now consider a subset of probability densities characterized
by a vector parameter ?
P. = {?(?,?): ?(?,?)e?, ?e?, a manifold in Rn} ?? ?
and assume that ?(?,?) is a smooth function admitting derivatives of a certain
order with respect to ? and differention under the integral sign. For conven-
ience of notation, we write
222
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 223
?(?.?) = ??, ?(?) =
?(??), ?(?,f) = ?(???
+ ??f)
J(e^) = ?(?,f) - ??(?) - ??(f) (2.3)
where ?,fe?. Putting ? = ? + de and denoting the i-th component of a vector
with a subscript i, we consider the formal expansion of J(e,e+de),
J_ 55 3^(?,f=?) . . + JL yyy a3J(e,<F9) de de de + 2!
\\ 3F?3fa. deidej 3!
\\\ 3F?3F?3F|( de1dejdV???
= jr
SS 9ijte)deidej
+ ?G
SSS cijk(e)deid6jdV???
(2'4)
In (2.4), the coefficients of the first order differentials vanish since J(e^)
2 has a minimum at f = ?, and the notation such as 3 ?(?,f=?)/3?.3f. is used for
replacing ? by ? after carrying out the indicated differentiations. u
From the definition of the J function, it follows that the (gin?) is
a non-negative definite matrix and obeys the tensorial law under transformation
of parameters. We define the matrix and the associated differential metric
(gfj) and ??
gj^e-de^. (2.5)
as the ?-entropy information matrix and ?-entropy differential metric respec-
tively. We prove the following theorem which provides an alternative computa-
tion of the ?-information matrix directly from a given entropy H.
Theorem 2.1
H 32?(??O+???)
^3 3??.3( 3 (2.6)
Proof: By definition
IJ 3f^9f4
32?(?,f=?) _ 32?(f=?)
3f^3f? 3F??3f?? (2.7)
Since ?(?,?) attains a minimum at f = ?
3?(?,f=?) _ 3?(?) ?7 ?? 3f.
? 3T. V ' J J
Differentiating both sides of (2.8) with respect to e. we have
32?(?,f=?) 32?(?,f=?) , 92?(?) (2 gx 3?.3f. 3F1?3F1?
3T.3T. V J
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
224 C. R. Rao
which gives (2.6), and the desired result is proved.
Let us consider a general entropy function of the type
H(Pj h(pjdv(x) (2.10)
where h"9 the second derivative of h, is a non-negative function. Then using
(2.6) ??,<?>
? ??jW
? -
^^
3 h(Xp +ypj
? 3 dv(x)
9P* 9PQ
If h(x) = ? log ?, leading to Shannon's entropy, then
'ij 9^(?) = ??
?? 36i 3ej dv(x)
(2.?)
(2.12)
become the elements of Fisher's information matrix. If h(x) = (a-l)~ (xa-x),
a ? 1, we have the a-order entropy of Havrda and Charv?t (1967) and
9?, =
gjj?e) = a??
a logpa a log p.
36i 3T, dv(x) (2.13)
which provide the elements of a-order entropy information matrix, and the
corresponding differential metric given in Burbea and Rao (1982a, 1982b).
We prove Theorem 2.2 which gives alternative expressions for the
coefficients of the third order differentials in the expansion of J(e^).
Theorem 2.2.
H = r 93?(?,f=?) + 33?(?,f=?) + 33?(?,f=?)-, Cljk
" L 3??.3?a?3f|< 3?13f.3f|? 3?^3f?.3f|<
J
Proof: By definition
? , , = 33?(?,f=?) LljkV?;
9f13f3?3f?<
= 33?(?,f=?) 33?(?)
3f13f^.3f|<
" ? 3ei36j30k
(2.14)
(2.15)
From (2.9), writing i = j and j = k we have
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 225
92?(?,f=?) + 32?(?,f=?) = 32?(?)
aej3(|)k 9(i)ja(|)k y
aej"k "
Differentiating with respect to ?.
33?(?,f=?) + 33?(?,f=?) 33?(?,f=?) 33?(?,f=?) , 33?(?)
3??3?.3f|< 3f^3??3f^ 3?..3f.3f^ 3f^3f.3f^ ?
3T^3T.3T^
which gives (2.14) as equivalent to (2.15). This proves Theorem 2.2.
Let ? be Shannon's entropy. Then, an easy computation gives
cijk -
xy([r|J) +
d-x)Tijk] +
[?$ +
(1-?)?.jk] +
[r{??] +
(l-p)T1Jk]} (2.16)
where 2 m 3 log pft 3 log ? 3 log ? 3 log ? 3 log ?
i jk v
3?^3?. 39k ' * i jk
V 3T1
3T. 30k J '
(2.17)
Adopting the notation of Amari for a-connexion
AA . rO) +hT 1 i j k ljk 2 ijk
the expression (2.16) can be written
When ? = ? = 1, (2.18) becomes
c =lrr(0) + r(0) + r(0)l (2 19) cijk 4 Lrijk jki rikjJ
? u*,yj
Remark 1. In the definition of the Jensen difference (2.2), we
used apriori probabilities ? and ? for the two probability distributions ? and
q which have some relevance in population studies. But in problems of statis-
tical inference, a symmetric version may be used by taking ? = ? = j.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
3. THE QUADRATIC ENTROPY
The quadratic entropy was introduced in Rao (1982a) as a general
measure of diversity of a probability distribution over any measurable space.
It is defined as a function Q: P+R+
Q(p) = [ K(x,y)p(x)p(y)dv(x)dv(y) (3.1)
where K(x,y) is symmetric, non-negative and conditionally negative definite,
i.e., nn
? K?x^x^a.aj < 0
for any choice of (x-|,...,x ) and of
(a^,...,a ) such that a,+...+a = 0, with
the further condition K(x,y) = 0 if ? = y. It was shown in Rao (1982b, 1984)
that the quadratic entropy is concave over P_ and its Jensen difference has
nice convexity properties which makes it an ideal measure of diversity. In
view of its usefulness in statistical applications, we give explicit expressions
for the quadratic differential metric and the connection coefficients associated
with the quadratic entropy, in the case of the parametric family P_. ?v
From Theorem 2.1, the (i,j)-th element of the Q-information matrix
(3.2)
is ? n 3^Q(Xp + ?? )
g. .(?) =---*- y!JV?;
3?.3f(].
Observing that
(3(??? +
???) =
j K(x,y)[xp(x,e)+yp(x^)][xp(y,e)+yp(y^)]dv(x)dv(y),
we find the explicit expression for (3.2) as
226
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 227
g?j(e) = -2?? ?K(x,y)3%iIl%iidv(x)3v(y) 3T. 3T.
? J (3.3)
= -2 ?? E[K(x,y) 3 1ogP<x'6> 3
l0g3P(y>6)] . * vi
Using the expression (2.14), we find on carrying out the necessary computations
cV,. = -2??(G... + r... + r,..) "ijk ijk i kj jkiJ
where
rijk J \K(x9y)^^^^Mx)My)
39k 3?.36a- (3.4)
It is of interest to note that the expressions (3.3) and (3.4) are invariant for
transformations of both the parameters and variables.
For further properties of quadratic entropies, the reader is refer-
red to Lau (1984) and Rao (1984).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
4. METRICS BASED ON DIVERGENCE MEASURES
Burbea and Rao (1982a, 1982b), Burbea (1986) and Eguchi (1984)
have considered metrics arising out of a variety of divergence measures between
probability distributions. A typical divergence measure is of the form
F[p(x,e),p(x^)]dv(x) (4.1) DF(V?V jl
where F satisfies the following conditions:
(i) F(?,?) is a C -function of R+ ?
R+,
(ii) F(x,?) is strictly convex on R+ for every xeR+,
(iii) F(x,x) = 0 for every ? e R+,
(iv) aF^x^ = ?) = 0 for every ? e R^. dy +
Let us consider the expansion
VVW =
2T^ij<e>d6id6j +
?c^iejde^ejde^ ... (4.2)
F F and obtain explicit expressions for g.. and c... .
1 j 1J ?
Theorem 4.1. Let
F1(x)y)=%^-,F2(x,y) =
3^)
c - a2F(x,y) ? _ a2F(x,y) F . a2F(x,y) 11
"
3?2 ' rl2 axay
' r22 "
3y2
r = 93F(x,y) F222
3y3 ?
Then
(i) 9^e)
=
?F22[Pe,Pe]^^dv(x)
r 3pfl 3pfi
228
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 229
?) 'ijk 3?? 9?? 9??
?222[??'??] 3?7 3?-3?Ga?(?) ? j ?
F99CP0>P0]h aV 3Pc 3 ?O 3PC 3 Pc 3PC
22LKe'KeJL3ei36. 3ek 3?t3?? 36j -]dv(x)
-j"Jk ""1
The results are established by straight forward computations.
Let us consider the directed divergence measure of Csisz?r (1967),
which plays an important role in problems of statistical inference,
D(Pe.P,)-Jp(x.e)f({^)dv(x)
where f is a convex function. In this case
(4.3)
?3f.
= f"(1) ? J
J
(4.4)
where g.. are the elements of Fisher's information matrix. Thus a wide class * J
of invariant divergence measures provide the same informative geometry on the
parameter manifold. However, the c... coefficients may depend on the particular ? j ?
convex function f chosen as shown below
f cijk(Q)
* 33D
V9k
-f"(l)Cr{]j[*r{l] +
riy] + (f-(l) +
3f-(l))TiJk (4.5)
where t).'. and T... are as defined in (2.17). 1J ? 1 j ?
The results (4.4) and (4.5) have consequences in estimation theory,
specially in the study of second order efficiency. While a large number of
estimation procedures lead to first order efficient estimates (i.e., having the
same asymptotic variance based on the elements of Fisher information matrix),
they are distinguishable by different second order efficiencies of the derived
estimators (see Rao, 1962).
If f is a convex function, then
f*(u) = uf(l)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
230 C. R. Rao
is also convex, and the measure (4.3) associated with f+f* is
0*(??>??) :
?PJ& + Pj(ir)3dv(x) (4.6) ? V f
?
which is symmetric in $\theta$ and $\phi$. However, we may define (4.6) as a symmetric divergence measure without requiring $f$ to be a convex function, but only satisfying the condition that $x f(x^{-1}) + f(x)$ is non-negative on $\mathbb{R}_+$. In such a case
$$g^{f^*}_{ij}(\theta) = 2f''(1)\,g_{ij}(\theta), \qquad c^{f^*}_{ijk}(\theta) = 2f''(1)\left[\Gamma^{(1)}_{ijk} + \Gamma^{(1)}_{ikj} + \Gamma^{(1)}_{jki}\right] + 3f''(1)\,T_{ijk}. \tag{4.7}$$
Remarks on Sections 2, 3 and 4. As pointed out by a referee, a unified treatment of the results in these three sections is possible by considering a general dissimilarity measure $D : P \times P \to [0,\infty)$ satisfying
(a) $D(p_\theta, p_\phi)$ is a $C^\infty$ function of $\theta,\phi$,
(b) $D(p,p) = 0$ for every $p \in P$.
Then, putting
$$D_{i;jk} = \frac{\partial^3 D}{\partial\theta_i\,\partial\phi_j\,\partial\phi_k}\bigg|_{\phi=\theta}\,, \quad \text{etc.},$$
and differentiating the identity $D_{;j}\big|_{\phi=\theta} = 0$ yields
$$D_{i;j} + D_{;ij} = 0, \qquad D_{ik;j} + D_{i;jk} + D_{k;ij} + D_{;ijk} = 0,$$
giving expressions for $g_{ij}$ and $c_{ijk}$ for a general $D$. However, the approach adopted in the paper enabled a discussion of the construction of the distance measures $D$ through more basic functions like quadratic entropy, general entropy, cross entropy, and divergence between probability measures. The results expressed in terms of the basic functions are of some interest.
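The referee's recipe is easy to try numerically; the sketch below (with D taken, for illustration only, as the Kullback-Leibler divergence between N(θ,1) and N(φ,1), for which the Fisher information is 1) recovers the metric both as D_{;ij} and as -D_{i;j}:

```python
# Sketch of the unified recipe: for smooth D with D(p,p) = 0, the metric is
# g_ij = D_{;ij} = -D_{i;j} at phi = theta.  Illustrated with
# D = KL(N(theta,1) || N(phi,1)) = (theta - phi)^2 / 2 (our choice).
import numpy as np

def D(theta, phi):
    return 0.5 * (theta - phi) ** 2

t, h = 0.7, 1e-3
# D_{;11}: second difference in phi at phi = t
D_pp = (D(t, t + h) + D(t, t - h) - 2.0 * D(t, t)) / h**2
# D_{1;1}: mixed theta-phi difference at phi = theta = t
D_tp = (D(t + h, t + h) - D(t + h, t - h)
        - D(t - h, t + h) + D(t - h, t - h)) / (4.0 * h**2)
print(D_pp, -D_tp)        # both equal 1.0, the Fisher information
```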
It is also possible to regard the dissimilarity measures of Sections 3 and 4 as having the common form
$$D(p,q) = \iint_{X \times X} F\big(p(x),\,q(x),\,p(y),\,q(y)\big)\,dv(x,y)$$
where $v$ is a symmetric measure on $X \times X$. However, the expressions for $g_{ij}$ and $c_{ijk}$ are then not simple.
5. OTHER DIVERGENCE MEASURES
In the last section, we considered the f-divergence measure which
led to the Fisher information metric. A special case of this measure is the
city block distance, or the overlap distance (see Rao, 1948, 1982a),
$$D_0(\theta,\phi) = \int \big|p(x,\theta) - p(x,\phi)\big|\,dv(x) \tag{5.1}$$
obtained by choosing $f(x) = 1 - \min(x,1)$, which admits a direct interpretation in terms of errors of classification in discrimination problems. However, this is not a smooth function, and no formula of the type (4.7) is available to determine the coefficients of the differential metric. But in some cases it may turn out that
$$D_0(p_\theta, p_\phi) = D_0(\theta - \phi)$$
is a smooth function of $\theta$ and $\phi$, in which case
$$g_{ij} = \frac{\partial^2 D_0^2(\theta,\phi)}{\partial\phi_i\,\partial\phi_j}\bigg|_{\phi=\theta}\,. \tag{5.2}$$
In the case when $p(x,\mu)$ is a $p$-variate normal density with mean $\mu$ and fixed variance-covariance matrix $\Sigma$, the coefficient (5.2) can easily be computed to be proportional to $\sigma^{ij}$, the $(i,j)$-th element of $\Sigma^{-1}$, which is indeed the $(i,j)$-th element of the Fisher information matrix. The same result holds for any elliptical family, as then $D_0(\theta,\phi)$ is a function of the Mahalanobis distance between $\theta$ and $\phi$ (see Mitchell and Krzanowski, 1985).
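This can be checked numerically; the sketch below compares a Monte Carlo estimate of (5.1) for two bivariate normals with common Σ against the standard closed form 2[2Φ(Δ/2) - 1], Δ being the Mahalanobis distance between the means (the closed form is a known fact about normal densities, and the particular Σ and means are illustrative):

```python
# Sketch: the overlap distance (5.1) between normals with common Sigma
# depends on the means only through the Mahalanobis distance Delta;
# compare Monte Carlo with the known closed form 2(2 Phi(Delta/2) - 1).
import numpy as np
from scipy.stats import multivariate_normal, norm

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # illustrative covariance
mu1 = np.array([0.0, 0.0])
mu2 = np.array([1.0, -0.5])

d = mu1 - mu2
Delta = np.sqrt(d @ np.linalg.solve(Sigma, d))   # Mahalanobis distance

p1 = multivariate_normal(mu1, Sigma)
p2 = multivariate_normal(mu2, Sigma)
x = p1.rvs(size=200_000, random_state=0)
L1_mc = np.mean(np.abs(1.0 - p2.pdf(x) / p1.pdf(x)))   # E_p1 |1 - p2/p1|

L1_exact = 2.0 * (2.0 * norm.cdf(Delta / 2.0) - 1.0)
print(L1_mc, L1_exact)                  # agree to Monte Carlo accuracy
```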
Let $p(x,\theta)$ be the density of a uniform distribution on the interval $[0,\theta]$. Then it is seen that
$$D_0(\theta,\phi) = 2\Big(1 - \frac{\theta}{\phi}\Big) \ \text{ if } \theta \le \phi, \qquad D_0(\theta,\phi) = 2\Big(1 - \frac{\phi}{\theta}\Big) \ \text{ if } \phi \le \theta.$$
Although this is not a differentiable function, it is seen that
$$ds^2 = \frac{4\,d\theta^2}{\theta^2} \tag{5.3}$$
is the associated metric.
Another general divergence measure which has some practical
applications is
$$D_\lambda(p_\theta, p_\phi) = \int \big[\lambda(p_\theta) - \lambda(p_\phi)\big]^2\,dv(x),$$
which is indeed a smooth function if $\lambda$ is so. In this case
$$g^\lambda_{ij}(\theta) = 2\int \big[\lambda'(p_\theta)\big]^2\,\frac{\partial p_\theta}{\partial\theta_i}\,\frac{\partial p_\theta}{\partial\theta_j}\,dv(x),$$
$$c^\lambda_{ijk}(\theta) = 6\int \lambda'(p_\theta)\,\lambda''(p_\theta)\,\frac{\partial p_\theta}{\partial\theta_i}\,\frac{\partial p_\theta}{\partial\theta_j}\,\frac{\partial p_\theta}{\partial\theta_k}\,dv(x) + 2\int \big[\lambda'(p_\theta)\big]^2\left(\frac{\partial^2 p_\theta}{\partial\theta_i\,\partial\theta_j}\,\frac{\partial p_\theta}{\partial\theta_k} + \frac{\partial^2 p_\theta}{\partial\theta_i\,\partial\theta_k}\,\frac{\partial p_\theta}{\partial\theta_j} + \frac{\partial^2 p_\theta}{\partial\theta_j\,\partial\theta_k}\,\frac{\partial p_\theta}{\partial\theta_i}\right) dv(x).$$
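As a sketch, take λ(p) = √p on the Bernoulli model (an illustrative choice, for which D_λ is the squared Hellinger distance); then g^λ should be half the Fisher information:

```python
# Sketch of g^lambda with lambda(p) = sqrt(p) (illustrative choice): D_lambda
# is the squared Hellinger distance and g^lambda = Fisher information / 2.
import numpy as np

def D_lam(theta, phi):
    p = np.array([1.0 - theta, theta])   # Bernoulli(theta)
    q = np.array([1.0 - phi, phi])
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

theta, h = 0.25, 1e-4
g_numeric = (D_lam(theta, theta + h) + D_lam(theta, theta - h)) / h**2
p = np.array([1.0 - theta, theta])
dp = np.array([-1.0, 1.0])
g_formula = 2.0 * np.sum((0.5 / np.sqrt(p)) ** 2 * dp * dp)
print(g_numeric, g_formula, 0.5 / (theta * (1.0 - theta)))  # all agree
```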
Another measure of interest is the cross entropy introduced in Rao and Nayak (1985). If $H$ is any entropy function, then the cross entropy of $p_\phi$ with respect to $p_\theta$ was defined as
$$D(p_\theta, p_\phi) = H(p_\phi) - H(p_\theta) - \lim_{\lambda\to 0}\frac{H[p_\theta + \lambda(p_\phi - p_\theta)] - H(p_\theta)}{\lambda}\,. \tag{5.4}$$
Let
$$H(p) = \int h(p)\,dv(x)$$
as chosen in (2.10). Then (5.4) reduces to
$$D(p_\theta, p_\phi) = \int h(p_\phi)\,dv(x) - \int h(p_\theta)\,dv(x) - \int h'(p_\theta)\,(p_\phi - p_\theta)\,dv(x),$$
and then
$$g_{ij}(\theta) = -\int h''(p_\theta)\,\frac{\partial p_\theta}{\partial\theta_i}\,\frac{\partial p_\theta}{\partial\theta_j}\,dv(x),$$
which is the same as the h-entropy information matrix derived in (2.10), apart from a constant. Similarly,
$$c^h_{ijk} = \Gamma^{(1)}_{ijk} + \Gamma^{(1)}_{ikj} + \Gamma^{(1)}_{jki} + T_{ijk}\,,$$
where
$$\Gamma^{(1)}_{ijk} = -E\left[p_\theta\,h''(p_\theta)\,\frac{\partial^2 \log p_\theta}{\partial\theta_i\,\partial\theta_j}\,\frac{\partial \log p_\theta}{\partial\theta_k}\right],$$
$$T_{ijk} = -E\left[\big(3\,p_\theta\,h''(p_\theta) + p_\theta^2\,h'''(p_\theta)\big)\,\frac{\partial \log p_\theta}{\partial\theta_i}\,\frac{\partial \log p_\theta}{\partial\theta_j}\,\frac{\partial \log p_\theta}{\partial\theta_k}\right].$$
6. GEODESIC DISTANCES
In Rao (1945) it was suggested that the information metric could be
used to obtain the geodesic distances between probability distributions. Given
any quadratic differential metric
$$ds^2 = \sum_{i,j} g_{ij}(\theta)\,d\theta_i\,d\theta_j \tag{6.1}$$
where the matrix $(g_{ij})$ is positive definite, the geodesic curve $\theta = \theta(t)$ can in principle be determined from the Euler-Lagrange equations
$$\sum_i g_{ik}\,\ddot\theta_i + \sum_{i,j} \Gamma_{ijk}\,\dot\theta_i\,\dot\theta_j = 0, \qquad k = 1,\dots,n, \tag{6.2}$$
and from the boundary conditions
$$\theta(t_1) = \theta, \qquad \theta(t_2) = \phi.$$
In (6.2), the quantity
$$\Gamma_{ijk} = \frac{1}{2}\left[\frac{\partial}{\partial\theta_i}\,g_{jk} + \frac{\partial}{\partial\theta_j}\,g_{ki} - \frac{\partial}{\partial\theta_k}\,g_{ij}\right] \tag{6.3}$$
is known as the "Christoffel symbol of the first kind."
By definition of the geodesic curve $\theta = \theta(t)$, its tangent vector $\dot\theta = \dot\theta(t)$ is of constant length with respect to the metric $ds^2$. Thus
$$\sum_{i,j} g_{ij}\,\dot\theta_i\,\dot\theta_j = \text{constant}. \tag{6.4}$$
The constant may be chosen to be of value 1 when the curve parameter $t$ is the arc length parameter $s$, $0 \le s \le s_0$, with $\theta(0) = \theta$, $\theta(s_0) = \phi$; then $s_0 = g(\theta,\phi)$ is the geodesic distance between $\theta$ and $\phi$.
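The Euler-Lagrange route can be carried out numerically. The sketch below integrates (6.2) for the Poisson model, where g(θ) = 1/θ and (6.3) gives Γ₁₁₁ = -1/(2θ²), and recovers as arc length the closed-form distance of result (1) below (the model and endpoints are illustrative):

```python
# Sketch: integrate the geodesic equation (6.2)-(6.3) for the Poisson model
# (g(theta) = 1/theta) and check the arc length against the closed form
# 2|sqrt(theta) - sqrt(phi)| of result (1).
import numpy as np
from scipy.integrate import solve_ivp, trapezoid

def geodesic(t, y):                      # (6.2): theta'' = theta'^2 / (2 theta)
    th, v = y
    return [v, v * v / (2.0 * th)]

theta0, phi = 1.0, 4.0
# sqrt(theta(t)) is linear in t along a geodesic; this initial velocity
# makes theta(1) = phi
v0 = 2.0 * np.sqrt(theta0) * (np.sqrt(phi) - np.sqrt(theta0))
sol = solve_ivp(geodesic, (0.0, 1.0), [theta0, v0],
                dense_output=True, rtol=1e-10, atol=1e-12)

ts = np.linspace(0.0, 1.0, 2001)
th, v = sol.sol(ts)
length = trapezoid(np.abs(v) / np.sqrt(th), ts)  # integral sqrt(g) |theta'| dt
print(sol.sol(1.0)[0], phi)                      # endpoint is reached
print(length, 2.0 * abs(np.sqrt(phi) - np.sqrt(theta0)))  # equals Rao distance
```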
Atkinson and Mitchell (1981) describe two other methods of deriving
geodesic distances starting from a given differential metric. The distances
obtained by these authors in various cases are given below. In each case we give the probability function $p(x,\theta)$ and the associated geodesic distance $g(\theta,\phi)$ based on the Fisher information metric. (A short numerical sketch of several of these closed forms follows the list.)
(1) Poisson distribution
?(?,?) = e"e ??/?!, ? = 0,1,..., ?>0
g(e^) = 21?/?" - /f I
(2) Binomial distribution (n fixed)
$$p(x,\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \qquad x = 0,1,\dots,n, \quad 0 < \theta < 1$$
$$g(\theta,\phi) = 2\sqrt{n}\,\big|\sin^{-1}\sqrt{\theta} - \sin^{-1}\sqrt{\phi}\big| = 2\sqrt{n}\,\cos^{-1}\big[\sqrt{\theta\phi} + \sqrt{(1-\theta)(1-\phi)}\big].$$
(3) Exponential distribution
$$p(x,\theta) = \theta e^{-x\theta}, \qquad x > 0$$
$$g(\theta,\phi) = \big|\log\theta - \log\phi\big|\,.$$
(4) Gamma distribution (n fixed)
$$p(x,\theta) = e^{-\theta x}\,[\Gamma(n)]^{-1}\,\theta^n x^{n-1}, \qquad x > 0$$
$$g(\theta,\phi) = \sqrt{n}\,\big|\log\theta - \log\phi\big|$$
(5) Normal distribution (fixed variance)
$$p(x;\mu) = N(\mu,\sigma_0^2;x), \qquad \sigma_0^2 \text{ fixed}$$
$$g(\mu_1,\mu_2) = |\mu_1 - \mu_2|/\sigma_0$$
(6) Normal distribution (fixed mean)
$$p(x;\sigma^2) = N(\mu_0,\sigma^2;x), \qquad \mu_0 \text{ fixed}$$
$$g(\sigma_1,\sigma_2) = \sqrt{2}\,\big|\log\sigma_1 - \log\sigma_2\big|$$
(7) Normal distribution
$$p(x;\mu,\sigma^2) = N(\mu,\sigma^2;x), \qquad \mu \text{ and } \sigma^2 \text{ both variable.}$$
The information metric in this case is
$$ds^2 = \frac{d\mu^2 + 2\,d\sigma^2}{\sigma^2} \tag{6.5}$$
and the geodesic distance is
$$g = 2\sqrt{2}\,\tanh^{-1}\delta(1,2) \tag{6.6}$$
where $\delta(1,2)$ is the positive square root of
$$\frac{(\mu_1-\mu_2)^2 + 2(\sigma_1-\sigma_2)^2}{(\mu_1-\mu_2)^2 + 2(\sigma_1+\sigma_2)^2}\,.$$
The explicit form (6.6) is given in Burbea and Rao (1982a). From (6.6),
$$g(\mu,\sigma_1;\,\mu,\sigma_2) = \sqrt{2}\,\big|\log\sigma_1 - \log\sigma_2\big|,$$
which agrees with result (6). However, $g(\mu_1,\sigma;\,\mu_2,\sigma)$ does not reduce to result (5), since $\sigma = \text{constant}$ is not a geodesic curve with respect to the metric (6.5).
(8) Multivariate normal distribution
$$p(x) = N_p(\mu,\Sigma;x), \qquad \Sigma \text{ fixed}$$
$$g(\mu_1,\mu_2) = \big[(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big]^{1/2},$$
which is the Mahalanobis distance.
(9) Multivariate normal distribution
$$p(x) = N_p(\mu,\Sigma;x), \qquad \mu \text{ fixed}$$
$$g(\Sigma_1,\Sigma_2) = \Big[\tfrac{1}{2}\sum_{i=1}^p (\log\lambda_i)^2\Big]^{1/2}$$
where $0 < \lambda_1 \le \cdots \le \lambda_p$ are the roots of the determinantal equation $|\Sigma_2 - \lambda\Sigma_1| = 0$. The above explicit form is due to S. T. Jensen, as mentioned in Atkinson and Mitchell (1981).
(10) Negative binomial distribution
$$p(x,\theta) = [x!\,\Gamma(r)]^{-1}\,\Gamma(x+r)\,\theta^x(1-\theta)^r, \qquad r \text{ fixed}$$
$$g(\theta,\phi) = 2\sqrt{r}\,\cosh^{-1}\frac{1 - \sqrt{\theta\phi}}{\sqrt{(1-\theta)(1-\phi)}} = 2\sqrt{r}\,\log\frac{1 - \sqrt{\theta\phi} + \big|\sqrt{\theta} - \sqrt{\phi}\big|}{\sqrt{(1-\theta)(1-\phi)}}\,.$$
This computation is due to Oller and Cuadras (1985).
(11) Multinomial distribution
$$p(n_1,\dots,n_k;\,p_1,\dots,p_k) = \frac{n!}{n_1!\cdots n_k!}\,p_1^{n_1}\cdots p_k^{n_k}, \qquad n \text{ fixed.}$$
Let $p_1 = (p_{11},\dots,p_{k1})$ and $p_2 = (p_{12},\dots,p_{k2})$. Then
$$g(p_1,p_2) = 2\sqrt{n}\,\cos^{-1}\Big(\sum_{i=1}^k \sqrt{p_{i1}\,p_{i2}}\Big).$$
The above computation was originally done by Rao (1945), but an easier method
of derivation is given by Atkinson and Mitchell (1981).
Recently Burbea (1984) obtained geodesic distances in the case of
independent Poisson and Normal distributions which are given below. These
results (12) and (13) follow directly from (1) and (7) respectively as the
squared geodesic distances behave additively under combination of independent
distributions.
(12) Independent Poisson distributions
$$p(x_1,\dots,x_n;\,\theta_1,\dots,\theta_n) = \prod_{i=1}^n \frac{e^{-\theta_i}\theta_i^{x_i}}{x_i!}$$
$$g(\theta_1,\dots,\theta_n;\,\phi_1,\dots,\phi_n) = 2\Big[\sum_{i=1}^n \big(\sqrt{\theta_i} - \sqrt{\phi_i}\big)^2\Big]^{1/2}$$
(13) Independent Normal distributions
$$p = N(x_1;\mu_1,\sigma_1)\cdots N(x_n;\mu_n,\sigma_n)$$
$$g\big[(\mu_{11},\sigma_{11}),\dots,(\mu_{n1},\sigma_{n1});\,(\mu_{12},\sigma_{12}),\dots,(\mu_{n2},\sigma_{n2})\big] = \Big[\,2\sum_{k=1}^n \log^2\frac{1 + \delta_k(1,2)}{1 - \delta_k(1,2)}\Big]^{1/2}$$
where $\delta_k(1,2)$ is the positive square root of
$$\frac{(\mu_{k1}-\mu_{k2})^2 + 2(\sigma_{k1}-\sigma_{k2})^2}{(\mu_{k1}-\mu_{k2})^2 + 2(\sigma_{k1}+\sigma_{k2})^2}\,.$$
(14) Multivariate elliptic distributions
$$p(x\,|\,\mu,\Sigma) = |\Sigma|^{-1/2}\,h\big[(x-\mu)'\Sigma^{-1}(x-\mu)\big]$$
for some function $h$, with $\Sigma$ fixed;
$$g(\mu_1,\mu_2) = c_h\,\big[(\mu_1-\mu_2)'\Sigma^{-1}(\mu_1-\mu_2)\big]^{1/2}$$
where $c_h$ is a constant, so the distance is essentially the Mahalanobis distance. This result is due to Mitchell and Krzanowski (1985).
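Several of the closed forms above are straightforward to implement. The sketch below codes results (1), (7), (11) and (12) and checks, as noted in the text, that (6.6) reduces to result (6) when the means agree and that the squared distances of result (12) add across independent components:

```python
# Sketch implementing a few of the closed-form geodesic distances above.
import numpy as np

def rao_poisson(theta, phi):                     # result (1)
    return 2.0 * abs(np.sqrt(theta) - np.sqrt(phi))

def rao_normal(mu1, s1, mu2, s2):                # result (7), eq. (6.6)
    num = (mu1 - mu2) ** 2 + 2.0 * (s1 - s2) ** 2
    den = (mu1 - mu2) ** 2 + 2.0 * (s1 + s2) ** 2
    return 2.0 * np.sqrt(2.0) * np.arctanh(np.sqrt(num / den))

def rao_multinomial(p1, p2, n=1):                # result (11)
    s = np.sum(np.sqrt(np.asarray(p1) * np.asarray(p2)))
    return 2.0 * np.sqrt(n) * np.arccos(np.clip(s, -1.0, 1.0))

def rao_indep_poisson(thetas, phis):             # result (12)
    return 2.0 * np.sqrt(np.sum((np.sqrt(thetas) - np.sqrt(phis)) ** 2))

# (6.6) with equal means agrees with result (6): sqrt(2)|log s1 - log s2|
print(rao_normal(0.0, 1.0, 0.0, 3.0), np.sqrt(2.0) * np.log(3.0))
# squared distances add across independent Poisson components
th = np.array([1.0, 4.0]); ph = np.array([2.25, 9.0])
print(rao_indep_poisson(th, ph) ** 2,
      rao_poisson(th[0], ph[0]) ** 2 + rao_poisson(th[1], ph[1]) ** 2)
print(rao_multinomial([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))  # zero for equal p
```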
The use of the $c_{ijk}$ coefficients defined in (2.4) and (4.2) in the discussion of statistical problems will be considered in a future communication.
REFERENCES
Amari, S. I. (1982). Differential geometry of curved exponential families - curvature and information loss. Ann. Statist. 10, 357-385.
Amari, S. I. (1983). A foundation of information geometry. Electronics and
Communications in Japan 66-A, 1-10.
Atkinson, C. and Mitchell, A. F. S. (1981). Rao's distance measure. Sankhya A 43, 345-365.
Burbea, J. (1986). Informative geometry in probability spaces. Expo. Math.
4, 347-378.
Burbea, J. and Rao, C. Radhakrishna (1982a). Entropy differential metric,
distance and divergence measures in probability spaces: a unified
approach. J. Multivariate Anal. 12, 575-596.
Burbea, J. and Rao, C. Radhakrishna (1982b). Differential metrics in probabil-
ity spaces. Probability Math. Statist. 3, 115-132.
Cencov, N. N. (1982). Statistical decision rules and optimal inference. Translations of Mathematical Monographs 53, Amer. Math. Soc., Providence.
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica 2, 299-318.
Efron, B. (1975). Defining the curvature of a statistical problem (with
applications to second order efficiency, with discussion). Ann.
Statist. 3, 1189-1217.
Efron, B. (1982). Maximum likelihood decision theory. Ann. Statist. 10, 340-356.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Statist. 11, 793-803.
Eguchi, S. (1984). A differential geometric approach to statistical inference
on the basis of contrast functionals. Tech. Report No. 136,
Hiroshima University, Hiroshima, Japan.
Havrda, J. and Charvát, F. (1967). Quantification method of classification processes: Concept of structural α-entropy. Kybernetika 3, 30-35.
Kass, R. E. (1980). The Riemannian structure of model spaces: a geometrical
approach to inference. Ph.D. thesis, University of Chicago.
Kass, R. E. (1981). The geometry of asymptotic inference. Tech. Rept. 215.
Dept. of Statistics, Carnegie-Mellon University.
Lau, Ka-Sing (1985). Characterization of Rao's quadratic entropy. Sankhya A
47, 295-309.
Mitchell, A. F. S. and Krzanowski, W. J. (1985). The Mahalanobis distance and
elliptic distributions. (To appear in Biometrika).
Nei, M. (1978). The theory of genetic distance and evolution of human races.
Japan J. Human Genet. 23, 341-369.
Oller, J. M. and Cuadras, C. M. (1985). Rao's distance for negative multinomial distributions. Sankhya A 47, 75-83.
Rao, C. Radhakrishna (1945). Information and accuracy attainable in the estima-
tion of statistical parameters. Bull. Calcutta Math. Soc. 37,
81-91.
Rao, C. Radhakrishna (1948). The utilization of multiple measurements in problems of biological classification (with discussion). J. Roy. Statist. Soc. B 10, 159-203.
Rao, C. Radhakrishna (1949). On the distance between two populations. Sankhya
9, 246-248.
Rao, C. Radhakrishna (1954). On the use and interpretation of distance
functions in statistics. Bull. Inst. Inter. Statist. 34, 90-100.
Rao, C. Radhakrishna (1962). Efficient estimates and optimum inference procedures in large samples (with discussion). J. Roy. Statist. Soc. B 24, 46-72.
Rao, C. Radhakrishna (1973). Linear Statistical Inference and its Applications.
(Second edition) Wiley, New York.
Rao, C. Radhakrishna (1982a). Diversity and dissimilarity coefficients: a
unified approach. J. Theoret. Pop. Biology 21, 24-43.
Rao, C. Radhakrishna (1982b). Diversity: its measurement, decomposition,
apportionment and analysis. Sankhya A 44, 1-22.
Rao, C. Radhakrishna (1984). Convexity properties of entropy functions and analysis of diversity. In Inequalities in Statistics and Probability, IMS Lecture Notes-Monograph Series, Vol. 5, 68-77.
Rao, C. Radhakrishna and Nayak, T. K. (1985). Cross entropy, dissimilarity measures and characterizations of quadratic entropy. IEEE Trans. Information Theory IT-31, 589-593.
Shahshahani, S. (1979). A new mathematical framework for the study of linkage
and selection. Memoirs of the American Mathematical Society,
No. 211.