Institute of Mathematical Statistics
LECTURE NOTES-MONOGRAPH SERIES
Shanti S. Gupta, Series Editor
Volume 10
Differential Geometry
in
Statistical Inference
S.-I. Amari, O. E. Barndorff-Nielsen,
R. E. Kass, S. L. Lauritzen, and C. R. Rao
Institute of Mathematical Statistics
Hayward, California
Institute of Mathematical Statistics
Lecture Notes-Monograph Series
Series Editor, Shanti S. Gupta, Purdue University
The production of the IMS Lecture Notes-Monograph Series is
managed by the IMS Business Office: Nicholas P. Jewell, IMS
Treasurer, and Jose L. Gonzalez, IMS Business Manager.
Library of Congress Catalog Card Number: 87-82603
International Standard Book Number 0-940600-12-9
Copyright © 1987 Institute of Mathematical Statistics
All rights reserved
Printed in the United States of America
TABLE OF CONTENTS
CHAPTER 1. Introduction
Robert E. Kass . 1
CHAPTER 2. Differential Geometrical Theory of Statistics
Shun-ichi Amari . 19
CHAPTER 3. Differential and Integral Geometry in Statistical Inference
O. E. Barndorff-Nielsen . 95
CHAPTER 4. Statistical Manifolds
Steffen L. Lauritzen . 163
CHAPTER 5. Differential Metrics in Probability Spaces
C. R. Rao . 217
CHAPTER 1. INTRODUCTION
Robert E. Kass*
Geometrical analyses of parametric inference problems have developed
from two appealing ideas: that a local measure of distance between members of a
family of distributions could be based on Fisher information, and that the
special place of exponential families in statistical theory could be understood
as being intimately connected with their loglinear structure. The first led
Jeffreys (1946) and Rao (1945) to introduce a Riemannian metric defined by
Fisher information, while the second led Efron (1975) to quantify departures
from exponentiality by defining the curvature of a statistical model. The
papers collected in this volume summarize subsequent research carried out by
Professors Amari, Barndorff-Nielsen, Lauritzen, and Rao together with their
coworkers, and by other authors as well, which has substantially extended both
the applicability of differential geometry and our understanding of the role it
plays in statistical theory.**
The most basic success of the geometrical method remains its concise
summary of information loss, Fisher's fundamental quantification of departure
from sufficiency, and information recovery, his justification for conditioning.
Fisher claimed, but never showed, that the MLE minimized the loss of information
among efficient estimators, and that successive portions of the loss could be
* Department of Statistics, Carnegie-Mellon University, Pittsburgh, PA.
** These papers were presented at the NATO Advanced Workshop on Differential
Geometry in Statistical Inference at Imperial College, April, 1984.
recovered by conditioning on the second and higher derivatives of the log-
likelihood function, evaluated at the MLE. Concerning information loss, recall
that according to the Koopman-Darmois theorem, under regularity conditions, the
families of continuous distributions with fixed support that admit finite-
dimensional sufficient reductions of i.i.d. sequences are precisely the exponen-
tial families. It is thus intuitive that (for such regular families) departures
from sufficiency, that is, information loss, should correspond to deviations
from exponentiality. The remarkable reality is that the correspondence takes a
beautifully simple form. The most transparent case, especially for the untrain-
ed eye, occurs for a one-parameter subfamily of a two-dimensional exponential
family. There, the relative information loss, in Fisher's sense, from using a
statistic T in place of the whole sample is

$$\lim_{n\to\infty}\; \frac{1}{i(\theta)}\,\bigl[\, n\, i(\theta) - i_T(\theta) \,\bigr] \;=\; \gamma^2 \;+\; \tfrac{1}{2}\,\beta^2 \qquad (1)$$

where n i(θ) is the Fisher information in the whole sample, i_T(θ) is the Fisher
information calculated from the distribution of T, γ is the statistical curva-
ture of the family and β is the mixture curvature of the "ancillary family"
associated with the estimator T. When the estimator T is the MLE, β vanishes;
this substantiates Fisher's first claim.
In his 1975 paper, Efron derived the two-term expression for infor-
mation loss (in his equation (10.25)), discussed the geometrical interpretation
of the first term, and noted that the second term is zero for the MLE. He
defined γ to be the curvature of the curve in the natural parameter space that
describes the subfamily, with the inner product defined by Fisher information
replacing the usual Euclidean inner product. The definition of β is exactly
analogous to that of γ, with the mean value parameter space used instead of the
natural parameter space, but Efron did not recognize this and so did not
identify the mixture curvature. He did stress the role of the ancillary family
associated with the estimator T (see his Remark 3 of Section 9 and his reply to
discussants, p. 1240), and he also noticed a special case of (1) (in his reply,
p. 1241). The final simplicity of the complete geometrical version of (1)
appeared in Amari's 1982 Annals paper. There it was derived in the multi-
parameter case; see equation (4.8) of Amari's paper in this volume.
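For reference, Efron's definition can be written out explicitly; the display
below is a sketch in notation of our own choosing (η(θ) for the curve in the
natural parameter space, Σ_θ for the Fisher information of the full family at
η(θ)), not a formula quoted from this volume:

$$\gamma_\theta^2 \;=\; \frac{\langle\dot\eta,\dot\eta\rangle\,\langle\ddot\eta,\ddot\eta\rangle \;-\; \langle\dot\eta,\ddot\eta\rangle^2}{\langle\dot\eta,\dot\eta\rangle^{3}}\,, \qquad \langle u,v\rangle \;=\; u^{\mathsf T}\,\Sigma_\theta\, v,$$

with dots denoting derivatives in θ; β is given by the same expression when the
subfamily is described in the mean value parametrization instead.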
Prior to Efron's paper, Rao (1961) had introduced definitions of
efficiency and second-order efficiency that were intended to classify estimators
just as Fisher's definitions did, but using more tractable expressions. This
led to the same measure of minimum information loss used by Fisher (correspond-
ing to γ² in equation (1)). Rao (1962) computed the information loss in the
case of the multinomial distribution for several different methods of estimation.
Rao (1963) then went on to provide a decision-theoretic definition of second-
order efficiency of an estimator T, measuring it according to the magnitude of
the second-order term in the asymptotic expansion of the bias-corrected version
of T. Efron's analysis clarified the relationship between Fisher's definition
and Rao's first definition. Efron then provided a decomposition of the second-
order variance term in which the right-hand side of (1) appeared, together with
a parameterization-dependent third term. The extension to the multiparameter
case was derived by Madsen (1979) following the outline of Reeds (1975). It
appears here in Amari's paper as Theorem 3.4.
An analytically and conceptually important first step of Efron's
analysis was to begin by considering smooth subfamilies of regular exponential
families, which he called curved exponential families. Analytically, this made
possible rigorous derivations of results, and for this reason such families
were analyzed concurrently by Ghosh and Subramaniam (1974). Conceptually, it
allowed specification of the ancillary families associated with an estimator:
the ancillary family associated with T at t is the set of points y in the sample
space of the full exponential family - equivalently, the mean value parameter
space for the family - for which T(y) = t. The terminology and subsequent
detailed analysis is due to Amari but, as noted above, the importance of the
ancillary family, at once emphasized and obscured by Fisher, was apparent from
Efron's presentation.
The ancillary family is also important in understanding information
recovery, which is the reason Amari has chosen to use the modifier "ancillary."
In the discussion of Efron's paper, Pierce (1975) noted another interpretation
of statistical curvature: it furnishes the asymptotic standard deviation of
observed information. More precisely, it is the asymptotic standard deviation
of the asymptotically ancillary statistic

$$n^{-1/2}\, i(\theta)^{-1}\,\bigl[\, I(\hat\theta) - n\, i(\theta) \,\bigr],$$

where n i(θ) is expected information and I(θ̂) is observed information; the one-
parameter statement appears in Efron and Hinkley (1978), and the multiparameter
version is in Skovgaard (1985). When fitting a curved exponential family by the
method of maximum likelihood, this statistic becomes a normalized component of
the residual (in the direction normal to the model within the plane spanned by
the first two derivatives of the natural parameter for the full exponential
family). Furthermore, conditioning on this statistic recovers (in Fisher's
sense) the information lost by the MLE, at least approximately. When this
conditional distribution is used, the asymptotic variance of the MLE may be
estimated by the inverse of observed rather than expected information; in some
problems observed information is clearly superior.
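The following simulation is an illustrative sketch of this argument, not code
from any of the papers; the Cauchy location model, sample size, and seed are
arbitrary choices. It forms the standardized difference between observed and
expected information and estimates its standard deviation, which should
approximate the statistical curvature γ of the model:

```python
# Illustrative sketch: for i.i.d. Cauchy location data, form Pierce's
# standardized difference between observed and expected information,
#     a_n = (I(theta_hat) - n*i(theta)) / (sqrt(n) * i(theta)),
# whose asymptotic standard deviation is the statistical curvature gamma.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, reps, theta0 = 50, 2000, 0.0
i_unit = 0.5  # Fisher information per Cauchy observation

def neg_loglik(theta, x):
    # Cauchy(theta, 1) negative loglikelihood, constants dropped
    return np.sum(np.log1p((x - theta) ** 2))

a_vals = []
for _ in range(reps):
    x = theta0 + rng.standard_cauchy(n)
    # a crude bounded search; a careful global maximization is skipped here
    theta_hat = minimize_scalar(neg_loglik, args=(x,),
                                bounds=(-10.0, 10.0), method="bounded").x
    r = x - theta_hat
    # observed information: minus the second derivative of the loglikelihood
    obs_info = np.sum(2.0 * (1.0 - r ** 2) / (1.0 + r ** 2) ** 2)
    a_vals.append((obs_info - n * i_unit) / (np.sqrt(n) * i_unit))

print("empirical sd of a_n (approximates gamma):", np.std(a_vals))
```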
This argument, sketched by Pierce and presented in more detail by
Efron and Hinkley, represented an attempt to make sense of some of Fisher's
remarks on conditioning. In Section 4 of his paper in this volume, Amari
presents a comprehensive approach to information recovery as measured by Fisher
information. He begins by defining a statistic T to be asymptotically suffi-
cient of order q when

$$n\, i(\theta) - i_T(\theta) \;=\; O(n^{-q+1})$$

and asymptotically ancillary of order q when

$$i_T(\theta) \;=\; O(n^{-q}) .$$

These definitions differ from some used by other authors, such as Cox (1980),
McCullagh (1984a), and Skovgaard (1985). They are, however, clearly in the
spirit of Fisher's apparent feeling that i_T(θ) is an appropriate measure of
information. To analyze Fisher's suggestion that higher derivatives of the
loglikelihood function could be used to create successive higher-order
approximate ancillary statistics, Amari defines an explicit sequence of
combinations of the derivatives: he takes successive components of the residual
in spaces spanned by the first p derivatives - of the natural parameter for the
ambient exponential family - but perpendicular to the space spanned by the first
p-1, then normalizes by higher-order curvatures. In Theorems 4.1 and 4.2
Amari achieves a complete decomposition of the information. He thereby makes
specific, justifies, and provides a geometrical interpretation for Fisher's
second claim. In Amari's decomposition the p-th term is attributable to the
p-th statistic in his sequence and has magnitude equal to n^{-p} times the
square of the p-th order curvature. (Actually, Amari's treatment is more
general than the rough description here would imply since he allows for the use
of efficient estimators other than the MLE.)
As far as the basic issue of observed versus expected information is
concerned, Amari (1982b) used an Edgeworth expansion involving geometrically
interpretable terms (as in Amari and Kumon, 1983) to provide a general motiva-
tion for using the inverse of observed information as the estimate of the
conditional variance of the MLE. See Section 4.4 of the paper here. (In truth,
the result is not as strong as it may appear. When we have an approximation v
to a variance V satisfying v(θ) = V(θ){1 + O(n^{-1})}, and we use it to estimate
V(θ), we substitute v(θ̂), where θ̂ is some estimator of θ, and then all we
usually get is v(θ̂) = V(θ){1 + O_p(n^{-1/2})}. For essentially this reason,
observed information does not in general provide an approximation to the con-
ditional variance of the MLE based on the underlying true value θ, having
relative error O_p(n^{-1}) - although it does do so whenever expected information is
constant, as it is for a location parameter. Similarly, as Skovgaard, 1985,
points out in his careful consideration of the role of observed information in
inference, when estimated cumulants are used in an Edgeworth expansion, the
expansion loses its higher-order accuracy in approximating the underlying
density at the true value.
This practical limitation of asymptotics does not affect Bayesian inference, in
which observed information furnishes a better approximation to the posterior
variance than does expected information for all regular families.)
For curved exponential families, then, the results summarized in the
first few sections of Amari's paper provide a thorough geometrical interpreta-
tion of the Fisherian concepts of information loss and recovery and also Rao's
concept of second-order efficiency. In addition, in section 3.4 Amari discusses
the geometry of testing, as had Efron, providing comparisons of several commonly-
used test procedures with the locally most powerful test. Curved exponential
families were introduced, however, for their mathematical and conceptual
simplicity rather than their applicability. To extend his one-parameter
results, Efron, in his 1975 paper, did two things: he noted that any smooth
family could be locally approximated by a curved exponential family, and he
provided an explicit formula for statistical curvature in the general case.
In Section 5 of his paper, Amari shows how results established for curved
exponential families may be extended by constructing an appropriate Hilbert
bundle, about which I will say a bit more below. With the Hilbert bundle,
Amari provides a geometrical foundation, and generalization, for Efron's sugges-
tion. From it, necessary formulas can be derived.
One reason that the role of the mixture curvature in (1) and in the
variance decomposition went unnoticed in Efron's paper was that he had not
made the underlying geometrical structure explicit: to calculate statistical
curvature at a given value θ₀ of a single parameter θ in a curved exponential
family, Efron used the natural parameter space with the inner product defined
by Fisher information at the natural parameter point corresponding to θ₀. In
order to calculate the curvature at a new point θ₁, another copy of the natural
parameter space with a different inner product (namely, that defined by Fisher
information at the natural parameter point corresponding to θ₁) would have to be
used. The appropriate gluing together of these spaces into a single structure
involves three basic elements: a manifold, a Riemannian metric, and an affine
connection. Riemannian geometry involves the study of geometry determined by
the metric and its uniquely associated Riemannian connection. In his discussion
to Efron's paper, Dawid (1975) pointed out that Efron had used the Riemannian
metric defined by Fisher information, but that he had effectively used a non-
Riemannian affine connection, now called the exponential connection, in cal-
culating statistical curvature. Although Dawid did not identify the role of the
mixture curvature in (1), he did draw attention to the mixture connection as an
alternative to the exponential connection. (Geodesics with respect to the
exponential connection form exponential families, while geodesics with respect
to the mixture connection form families of mixtures; thus, the terminology.)
Amari, who had much earlier researched the Riemannian geometry of Fisher infor-
mation, picked up on Dawid's observation, specified the framework, and provided
the results outlined above.
The manifold with the associated linear spaces is structured in what
is usually called a tangent bundle, the elements of the linear spaces being
tangent vectors. For curved exponential families, the linear spaces are finite-
dimensional, but to analyze general families this does not suffice so Amari
uses Hilbert spaces. When these are appropriately glued together, the result
is a Hilbert bundle. The idea stems from Dawid's remark that the tangent
vectors can be identified with score functions, and these in turn are functions
having zero expectation. As his Hilbert space at a distribution P, Amari takes
the subspace of the usual L2(P) Hilbert space consisting of functions that have
zero expectation with respect to P. This clearly furnishes the extension of
the information metric, and has been used by other authors as well, e.g.,
Beran (1977). Amari then defines the exponential and mixture connections and
notes that these make the Hilbert bundle flat, and that the inherited connec-
tions on the usual tangent bundles agree with those already defined there. He
then decomposes each Hilbert space into tangential and normal components,
which is exactly what is needed to define statistical curvature in the general
setting. Amari goes on to construct an "exponential bundle" by associating
with each distribution a finite-dimensional linear space containing vectors
defined by higher derivatives of the loglikelihood function, and using structure
inherited from the Hilbert bundle. With this he obtains a satisfactory version
of the local approximation by a curved exponential family that Efron had
suggested.
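In symbols, the fibre just described can be sketched as follows (notation
ours): at each distribution P, Amari's Hilbert space is

$$\mathcal{H}_P \;=\; \{\, f \in L^2(P) \;:\; E_P[f] = 0 \,\}, \qquad \langle f, g \rangle_P \;=\; E_P[f\, g],$$

and the ordinary tangent space of a parametric family sits inside this space
via ∂_i ↦ ∂_i ℓ(·, θ), under which ⟨∂_i ℓ, ∂_j ℓ⟩ = g_ij(θ) recovers the
information metric.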
This pretty construction allows results derived for curved exponen-
tial families to be extended to more general regular families, yet it is not
quite the all-encompassing structure one might hope for: the underlying
manifold is still a particular parametric family of densities rather than the
collection of all possible densities on the given sample space. Constructions
for the latter have so far proved too difficult.
In his Annals paper, Amari also noted an interesting relationship
between the exponential and mixture connections: they are, in a sense he
defined, mutually dual. Furthermore, a one-parameter family of connections,
which Amari called the α-connections, may be defined in such a way that for each
α the α-connection and the −α-connection are mutually dual, while α = 1 and −1
correspond to the exponential and mixture connections. See Amari's Theorem 2.1.
This family coincides with that introduced by Centsov (1971) for multinomial
distributions. When the family of densities on which these connections are
defined is an exponential family, the space is flat with respect to the exponen-
tial and mixture connections, and the natural parametrization and mean-value
parameterization play special roles: they become affine coordinate systems for
the two respective connections and are related by a Legendre transformation.
The duality in this case can incorporate the convex duality theory of exponen-
tial families (see Barndorff-Nielsen, 1978, and also Section 2 of his paper in
this volume). In Theorem 2.2 Amari points out that such a pair of coordinate
systems exists whenever a space is flat with respect to an α-connection (with
α ≠ 0). For such spaces, Amari defines α-divergence, a quasi-distance between
two members of the family based on the relationship provided by the Legendre
transformation. In Theorem 2.4 he shows that the element of a curved exponential
family that minimizes the α-divergence from a point in the exponential family
parameter space may be found by following the α-geodesic that contains the
given point and is perpendicular to the curved family. This generates a new
class of minimum α-divergence estimators, the MLE being the minimum
−1-divergence estimator, an interpretation also discussed by Efron (1978).
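For orientation, the α-divergences can be written explicitly; the following is
a sketch using the standard formulas (the orientation convention varies from
author to author):

$$D_\alpha(p, q) \;=\; \frac{4}{1-\alpha^2}\Bigl(1 - \int p^{(1-\alpha)/2}\, q^{(1+\alpha)/2}\, d\mu \Bigr), \qquad \alpha \neq \pm 1,$$

with the two Kullback–Leibler divergences ∫ p log(p/q) dμ and ∫ q log(q/p) dμ
arising as the limits α → −1 and α → 1. The MLE statement is then the familiar
fact that maximizing likelihood minimizes the Kullback–Leibler divergence from
the empirical distribution to the model.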
As applications of his general methods based on α-connections on
Hilbert bundles, Amari treats the problems of combining independent samples (at
the end of section 5), making inferences when the number of nuisance parameters
increases with the sample size (in section 6), and performing spectral estima-
tion in Gaussian time series (in section 7).
As soon as the α-connections are constructed, a mathematical question
arises. On one hand, the α-connections may be considered objects of differen-
tial geometry without special reference to their statistical origin. On the
other hand, they are not at all arbitrary. They are the simplest one-parameter
family of connections based on the first three moments of the score function.
What is it about their special form that leads to the many special properties
of α-connections (outlined by Amari in Section 2)?
Lauritzen has posed this question and has provided a substantial
part of the answer. Given any Riemannian manifold M with metric g there is a
unique Riemannian connection ∇. Given a covariant 3-tensor D that is symmetric
in its first two arguments and a nonzero number c, a new (symmetric) connection
is defined by

$$\tilde\nabla \;=\; \nabla + c \cdot D \qquad (2)$$

which means that, given vector fields X and Y,

$$\tilde\nabla_X Y \;=\; \nabla_X Y + c \cdot D(X, Y)$$

where

$$g(D(X,Y), Z) \;\equiv\; D(X, Y, Z)$$

for all vector fields Z. Now, when M is a family of densities and g and D are
defined, in terms of an arbitrary parameterization, as

$$g(\partial_i, \partial_j) \;=\; E(\partial_i \ell\; \partial_j \ell), \qquad D(\partial_i, \partial_j, \partial_k) \;=\; E(\partial_i \ell\; \partial_j \ell\; \partial_k \ell),$$
where ℓ is the loglikelihood function, and if c = −α/2, then (2) defines the
α-connection.
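As a concrete check on these definitions, the following sketch (our own
example, not from the text) computes g and D symbolically for the
one-parameter exponential model p(x, θ) = θ e^{−θx}, x > 0:

```python
# Symbolic check of the pair (g, D) for the exponential model
# p(x, theta) = theta * exp(-theta * x), x > 0 (an illustrative example).
import sympy as sp

x = sp.Symbol("x", positive=True)
theta = sp.Symbol("theta", positive=True)

p = theta * sp.exp(-theta * x)
l = sp.log(p)                 # loglikelihood
dl = sp.diff(l, theta)        # score: 1/theta - x

E = lambda f: sp.integrate(f * p, (x, 0, sp.oo))   # expectation under p

g = sp.simplify(E(dl ** 2))   # metric: 1/theta**2
D = sp.simplify(E(dl ** 3))   # skewness tensor: -2/theta**3
print(g, D)
```

With c = −α/2, equation (2) then produces every α-connection of this
one-dimensional manifold from the pair (g, D).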
In this statistical case, D is not only symmetric in its first two
arguments, as it must be in (2), it is symmetric in all three. Lauritzen
therefore defines an abstract statistical manifold to be a triple (M,g,D) in
which M is a smooth m-dimensional manifold, g is a Riemannian metric, and D is
a completely symmetric covariant 3-tensor. With this additional symmetry
constraint alone, he then proceeds to establish a large number of basic proper-
ties, especially those relating to the duality structure Amari described. The
treatment is "fully geometrical" or "coordinate-free." This is aesthetically
appealing, especially to those who learned linear models in the coordinate-free
setting. Lauritzen's primary purpose is to show that the appropriate mathemat-
ical object of study is one that is not part of the standard differential
geometry, but does have many special features arising from an apparently simple
structure. He not only presents the abstract generalities about α-connections
on statistical manifolds, he also examines five examples in full detail. The
first is the univariate Gaussian model, the second is the inverse Gaussian
model, the third is the two-parameter gamma model, and the last two are
specially constructed models that display interesting possibilities of the non-
standard geometries of α-connections. In particular, the latter two statistical
manifolds are not what Lauritzen calls "conjugate symmetric" and so the
sectional curvatures do not determine the Riemann tensor (as they do in
Riemannian geometry). He also discusses the construction of geodesic folia-
tions, which are decompositions of the manifold and are important because they
generate potentially interesting decompositions of the sample space. At the
end of his paper, Lauritzen calls attention to several outstanding problems.
Amari's α-connections, based on the first three moments of the
score function, do not furnish the only examples of statistical manifolds. In
his contribution to this volume, Barndorff-Nielsen presents another class of
examples based instead on certain "observed" rather than expected derivatives
of the loglikelihood.
Although the idea of using observed derivatives might occur to
any casual listener on being told of Amari's use of expectations, it is not
obvious how to implement it. First of all, in order to define an observed
information Riemannian metric, one needs a definition of observed information
at each point of the parameter space. Apparently one would want to treat each
θ as if it were an MLE and then use I(θ). However, I(θ) depends on the whole
sample y rather than on θ alone, so this scheme does not yet provide an explicit
definition. Barndorff-Nielsen's solution is natural in the context of his
research on conditionality: he replaces the sample y with a sufficient pair
(θ̂, a) where a is the observed value of an asymptotically ancillary statistic A.
This is always possible for curved exponential families, and in more general
models A could at least be taken so that (θ̂, A) is asymptotically sufficient.
With this replacement, the second component may be held fixed at A = a while θ̂
varies. Writing I(θ̂) = I_{(θ̂,a)}(θ̂) thus allows the definition I(θ) ≡ I_{(θ,a)}(θ)
to be made at each point θ in the parameter space. Using this definition of
the Riemannian metric, Barndorff-Nielsen derives the coefficients that deter-
mine the Riemannian connection. From the transformation properties of tensors,
he then succeeds in finding an analogue of the exponential connection based on
a certain mixed third derivative of the loglikelihood function (two derivatives
being taken with respect to θ as parameter, one with respect to θ̂ as MLE). In
so doing, he defines the tensor D in the statistical manifold and thus arrives
at his "observed conditional geometry."
Barndorff-Nielsen's interest in this geometry lies not with
analogues of statistical curvature and other expected-geometry constructs, but
rather with an alternative derivation, interpretation, and extension of an
approximation to the conditional density of the MLE, which had been obtained
earlier (in Barndorff-Nielsen and Cox, 1979). In several papers, Barndorff-
Nielsen (1980, 1983) has discussed generalizations and approximate versions of
Fisher's fundamental density-likelihood formula for location models
$$p(\hat\theta \mid a, \theta) \;=\; c \cdot L(\theta)/L(\hat\theta) \qquad (3)$$

where θ̂ is the MLE, a is an ancillary statistic, p is the conditional density
of the MLE, and L is the likelihood function. (This is discussed in Efron and
Hinkley, 1978; Fisher actually treated the location-scale case.) The formula
is of great importance both practically, since it provides a way of computing
the conditional density, and philosophically, since it entails the formal
agreement of conditional inference and Bayesian inference using an invariant
prior. Inspection of the derivation indicates that the formula results from
the transformational nature of the location problem, and Barndorff-Nielsen has
shown that a version of it (with an additional factor for the volume element)
holds for very general transformation models. He has also shown that for non-
transformation models, a version of the right-hand side of (3) while not
exactly equal to the left-hand side, remains a good asymptotic approximation for
it. (See also Hinkley, 1980, and McCullagh, 1984a.) In his paper in this
volume, Barndorff-Nielsen reviews these results, shows how the various observed
conditional geometrical quantities are calculated, and then derives his desired
expansion (of a version of the right-hand side of (3)) in terms of the geo-
metrical quantities that correspond to those used by Amari in his expected
geometry expansions. Barndorff-Nielsen devotes substantial attention to trans-
formation models, which may be treated within his framework of observed
conditional geometry. In this context, the models become Lie groups, for which
there is a rich mathematical theory.
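A sketch of why (3) is exact in the pure location case f(x − θ) may help fix
ideas (a standard argument, condensed here, not taken verbatim from the paper):
the configuration a = (x₁ − θ̂, ..., x_n − θ̂) is ancillary, and the likelihood
depends on the data only through (θ̂, a),

$$L(\theta) \;=\; \prod_{i=1}^{n} f(a_i + \hat\theta - \theta), \qquad L(\hat\theta) \;=\; \prod_{i=1}^{n} f(a_i).$$

For fixed a the joint density of (θ̂, a) is therefore proportional, as a
function of θ̂, to L(θ); since the distribution of a is free of θ, normalizing
gives p(θ̂ | a, θ) = c(a) L(θ)/L(θ̂), which is (3).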
In the fourth paper in this volume, Professor Rao returns to the
characterization of the information metric that originally led him (and also
Jeffreys) to introduce it: it is an infinitesimal measure of divergence based
on what is now called Shannon entropy. Rao considers here a more general class
of divergence measures, which he has found useful in the study of genetic
diversity, leading to a wide variety of metrics. He derives the quadratic and
cubic terms in Taylor series expansions of these measures and shows how, in the
case of Shannon entropy, the cubic term is related to the α-connections.
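For the Shannon-entropy (Kullback–Leibler) case the expansion can be sketched
directly: using E_θ[∂_i ℓ] = 0 and g_ij = −E_θ[∂_i ∂_j ℓ],

$$D(\theta \,\|\, \theta + d\theta) \;=\; \tfrac{1}{2}\, g_{ij}\, d\theta^i d\theta^j \;-\; \tfrac{1}{6}\, E_\theta[\partial_i \partial_j \partial_k \ell]\, d\theta^i d\theta^j d\theta^k \;+\; O(\|d\theta\|^4),$$

so the quadratic term is the information metric, while the cubic coefficient is
expressible through ∂_k g_ij and the quantities entering the α-connections.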
The papers here collectively show that geometrical structures of
statistical models can provide both conceptual simplifications and new methods
of analysis for problems of statistical inference. There is interesting
mathematics involved, but does the interesting mathematics lead to interesting
statistics?
The question arises because geometry has provided new techniques,
and its formalism produces convenient summaries for complicated multivariate
expressions in asymptotic expansions (as in Amari and Kumon, 1983, and
McCullagh, 1984b), but it has not yet created new methodology with clearly
important practical applications. Thus, it is already apparent from (1) that
there exists a wide class of estimators that minimize information loss (and are
second-order efficient): it consists of those having zero mixture curvature
for their associated ancillary families. It is interesting that the MLE is only
one member of this class, and it is nice to have Eguchi's (1983) derivation that
certain minimum contrast estimators are other members, but it seems unlikely -
though admittedly possible - that any competitor will replace maximum likelihood
estimation as the primary method of choice in practice. Similarly, there is
not yet any reason to think that alternative minimum a-divergence estimators or
their observed conditional geometry counterparts will be considered superior to
the MLE.
On the other hand, as I indicated at the outset, geometry does
give a definitive description of information loss and recovery. Since Fisher
remains our wisest yet most enigmatic sage, it is worth our while to try to
understand his pronouncements.** Together with the triumvirate of consistency,
sufficiency, and efficiency, information loss and recovery form the core of
Fisher's theory of estimation. On the basis of the geometrical results, it is
fair to say that we now know what Fisher was talking about, and that what he
said was true. Here, as in other problems (such as inference with nuisance
parameters, discussed in Amari's section 5, or in nonlinear regression, e.g.,
Bates and Watts, 1980, Cook and Tsai, 1985, Kass, 1984, McCullagh and Cox, 1986),
the geometrical formulation tends to shift the burden of derivation of results
away from proofs, toward definitions. Thus, once the statement of a proposition
is understood, its truth is easier to see and in this there is great simplifica-
tion. One could make this argument about much abstract mathematical develop-
ment, but it is particularly appropriate here.

** Since Rao's work on second order efficiency arose in an attempt to understand
Fisher's computation of information loss in estimation, it might appear that
Efron's investigation also began as an attempt to understand Fisher. He has
informed me, however, that he set out to define the curvature of a statistical
model and came later to its use in information loss and second-order efficiency.
Furthermore, there are reasons to think that future work in this
area could lead to useful results that would otherwise be difficult to obtain.
One important problem that structural research might solve is that of determin-
ing useful conditions under which a particular root of the likelihood equation
will actually maximize the likelihood. Global results on foliations might be
very helpful, as might be formulas relating computable characteristics of
statistical manifolds to the behavior of geodesies. The results in these papers
could turn out to play a central role in the solution of this or some other
practical problem of statistical theory. We will have to wait and see. Until
then, readers may enjoy the papers as informative excursions into an intriguing
realm of mathematical statistics.
Acknowledgements
I thank O. E. Barndorff-Nielsen, D. R. Cox, and C. R. Rao for their
comments on an earlier draft. This paper was prepared with support from the
National Science Foundation under Grant No. NSF/DMS - 8503019.
REFERENCES
Amari, S. (1982a). Differential geometry of curved exponential families -
curvatures and information loss. Ann. Statist. 10, 357-387.
Amari, S. (1982b). Geometrical theory of asymptotic ancillarity and conditional
inference. Biometrika 69, 1-17.
Amari, S. and Kumon, M. (1983). Differential geometry of Edgeworth expansions
in curved exponential family. Ann. Inst. Statist. Math. 35A, 1-24.
Barndorff-Nielsen, O. E. (1978). Information and Exponential Families,
New York: Wiley.
Barndorff-Nielsen, O. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, O. E. (1983). On a formula for the distribution of the
maximum likelihood estimator. Biometrika 70, 343-365.
Barndorff-Nielsen, O. E. and Cox, D. R. (1979). Edgeworth and Saddlepoint
approximations with statistical applications, (with Discussion).
J. R. Statist. Soc. B41, 279-312.
Bates, D. M. and Watts, D. G. (1980). Relative curvature measures of non-
linearity. J. R. Statist. Soc. B42, 1-25.
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models.
Ann. Statist. 5, 445-463.
Centsov, N. N. (1971). Statistical Decision Rules and Optimal Inference (in
Russian). Translated into English (1982), AMS, Rhode Island.
Cook, R. D. and Tsai, C.-L. (1985). Residuals in nonlinear regression.
Biometrika 72, 23-29.
Cox, D. R. (1980). Local ancillarity. Biometrika 67, 279-286.
Dawid, A. P. (1975). Discussion to Efron's paper. Ann. Statist. 3, 1231-1234.
Efron, B. (1975). Defining the curvature of a statistical problem (with
applications to second-order efficiency), (with Discussion).
Ann. Statist. 3, 1189-1242.
Efron, B. (1978). The geometry of exponential families. Ann. Statist. 6,
362-376.
Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the maximum
likelihood estimator: Observed versus expected Fisher information,
(with discussion). Biometrika 65, 457-487.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in
a curved exponential family. Ann. Statist. 11, 793-803.
Fisher, R. A. (1925). Theory of statistical estimation. Proc. Camb. Phil. Soc.
22, 700-725.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc.
R. Soc. A144, 285-307.
Ghosh, J. K. and Subramaniam, K. (1974). Second order efficiency of maximum
likelihood estimators. Sankhya 36A, 325-358.
Hinkley, D. V. (1980). Likelihood as approximate pivotal distribution.
Biometrika 67, 287-292.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation
problems. Proc. Roy. Soc. A186, 453-461.
Kass, R. E. (1984). Canonical parametrizations and zero parameter-effects
curvature. J. Roy. Statist. Soc. B46, 1, 86-92.
Madsen, L. T. (1979). The geometry of statistical model - a generalization of
curvature. Res. Report 79-1. Statist. Res. Unit, Danish Medical
Res. Council.
McCullagh, P. (1984a). On local sufficiency. Biometrika 71, 233-244.
McCullagh, P. (1984b). Tensor notation and cumulants of polynomials.
Biometrika 71, 461-476.
McCullagh, P. and Cox, D. R. (1986). Invariants and likelihood ratio statistics.
Ann. Statist. 14, 1419-1430.
Pierce, D. A. (1975). Discussion to Efron's paper. Ann. Statist. 3, 1219-1221.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of
statistical parameters. Bull. Calcutta Math. Soc. 37, 81-89.
Rao, C. R. (1961). Asymptotic efficiency and limiting information. Proc.
Fourth Berkeley Symp. Math. Statist. Prob., Edited by J. Neyman,
1, 531-545.
Rao, C. R. (1962). Efficient estimates and optimum inference procedures in
large samples (with discussion). J. Roy. Statist. Soc. B24, 46-72.
Rao, C. R. (1963). Criteria of estimation in large samples. Sankhya 25, 189-
206.
Reeds, J. (1975). Discussion to Efron's paper. Ann. Statist. 3, 1234-1238.
Skovgaard, I. (1985). A second-order investigation of asymptotic ancillarity.
Ann. Statist. 13, 534-551.
DIFFERENTIAL GEOMETRICAL THEORY OF STATISTICS
Shun-ichi Amari*
1. Introduction. 21
2. Geometrical Structure of Statistical Models . 25
3. Higher-Order Asymptotic Theory of Statistical Inference in
Curved Exponential Family . 38
4. Information, Sufficiency and Ancillarity - Higher Order Theory . 52
5. Fibre-Bundle Theory of Statistical Models . 59
6. Estimation of Structural Parameter in the Presence of Infinitely
Many Nuisance Parameters . 73
7. Parametric Models of Stationary Gaussian Time Series . 83
8. References. 91
* Department of Mathematical Engineering and Instrumentation Physics, University
of Tokyo, Tokyo, JAPAN
1. INTRODUCTION
Statistics is a science which studies methods of inference, from
observed data, concerning the probabilistic structure underlying such data.
The class of all the possible probability distributions is usually too wide to
consider all its elements as candidates for the true probability distribution
from which the data were derived. Statisticians often assume a statistical
model which is a subset of the set of all the possible probability distribu-
tions, and evaluate procedures of statistical inference assuming that the model
is faithful, i.e., it includes the true distribution. It should, however, be
remarked that a model is not necessarily faithful but is approximately so. In
either case, it is very important to know the shape of a statistical
model in the whole set of probability distributions. This is the geometry of a
statistical model. A statistical model often forms a geometrical manifold, so
that the geometry of manifolds should play an important role. Considering that
properties of specific types of probability distributions, for example, of
Gaussian distributions, of Wiener processes, and so on, have so far been studied
in detail, it seems rather strange that only a few theories have been proposed
concerning properties of a family of distributions itself. Here, by the proper-
ties of a family we mean such geometric relations as mutual distances, flatness
or curvature of the family, etc. Obviously it is not a trivial task to define
such geometric structures in a natural, useful and invariant manner.
Only local properties of a statistical model are responsible for the
asymptotic theory of statistical inference. Local properties are represented
by the geometry of the tangent spaces of the manifold. The tangent space has a
natural Riemannian metric given by the Fisher information matrix in the regular
case. It represents only a local property of the model, because the tangent
space is nothing but local linearization of the model manifold. In order to
obtain larger-scale properties, one needs to define mutual relations of the two
different tangent spaces at two neighboring points in the model. This can be
done by defining a one-to-one affine correspondence between two tangent spaces,
which is called an affine connection in differential geometry. By an affine
connection, one can consider local properties around each point beyond the
linear approximation. The curvature of a model can be obtained by the use of
this connection. It is clear that such a differential-geometrical concept pro-
vides a tool convenient for studying higher-order asymptotic properties of
inference. However, by connecting local tangent spaces further, one can obtain
global relations. Hence, the validity of the differential-geometrical method is
not limited to the framework of asymptotic theory.
It was Rao (1945) who first pointed out the importance of the
differential-geometrical approach. He introduced the Riemannian metric by using
the Fisher information matrix. Although a number of studies have been
carried out along this Riemannian line (see, e.g., Amari (1968), Atkinson and
Mitchell (1981), Dawid (1977), James (1973), Kass (1980), Skovgaard (1984),
Yoshizawa (1971), etc.), they did not have a large impact on statistics. Some
additional concepts are necessary to improve its usefulness. A new idea was
developed by Chentsov (1972) in his Russian book (and in some papers prior to
the book). He introduced a family of affine connections and proved their unique-
ness from the point of view of categorical invariance. Although his theory was
deep and fundamental, he did not discuss the curvature of a statistical model.
Efron (1975, 1978), independently of Chentsov's work, provided a new idea by
pointing out that the statistical curvature plays an important role in higher-
order properties of statistical inference. Dawid (1975) pointed out further
possibilities. Efron's idea was generalized by Madsen (1979) (see also Reeds
(1975)). Amari (1980, 1982a) constructed a differential-geometrical method in
statistics by introducing a family of affine connections, which however turned
out to be equivalent to Chentsov's. He further defined α-curvatures, and point-
ed out the fundamental roles played by the exponential and mixture curvatures in
statistical inference. The theory has been developed further by a number of
papers (Amari (1982b, 1983a, b), Amari and Kumon (1983), Kumon and Amari (1983,
1984, 1985), Nagaoka and Amari (1982), Eguchi (1983), Kass (1984)). The new
developments were also shown in the NATO Research Workshop on Differential Geo-
metry in Statistical Inference (see Barndorff-Nielsen (1985) and Lauritzen
(1985)). They together seem to prove the usefulness of differential geometry as
a fundamental method in statistics. (See also Csiszár (1975), Burbea and Rao
(1982), Pfanzagl (1982), Beale (1960), Bates and Watts (1980), etc., for other
geometrical work.)
The present article gives not only a compact review of various
achievements up to now by the differential geometrical method, most of which have
already been published in various journals and in Amari (1985), but also a pre-
view of new results and half-baked ideas in new directions, most of which have
not yet been published. Chapter 2 provides an introduction to the geometrical
method, and elucidates fundamental geometrical properties of statistical mani-
folds. Chapter 3 is devoted to the higher-order asymptotic theory of statisti-
cal inference, summarizing higher-order characteristics of various estimators
and tests in geometrical terms. Chapter 4 discusses a higher-order theory of
asymptotic sufficiency and ancillarity from the Fisher information point of
view. Refer to Amari (1985) for more detailed explanations in these chapters;
Lauritzen (1985) gives a good introduction to modern differential geometry. The
remaining Chapters 5, 6, and 7 treat new ideas and developments which are just
under construction. Chapter 5 introduces a fibre bundle approach, which
is necessary in order to study properties of statistical inference in a general
statistical model other than a curved exponential family. A Hilbert bundle and
a jet bundle are treated in a geometrical framework of statistical inference.
Chapter 6 gives a summary of a theory of estimation of a structural parameter
in the presence of nuisance parameters whose number increases in proportion to
the number of observations. Here, the Hilbert bundle theory plays an essential
role. Chapter 7 elucidates geometrical structures of parametric and non-para-
metric models of stationary Gaussian time series. The present approach is use-
ful not only for constructing a higher-order theory of statistical inference on
time series models, but also for constructing differential geometrical theory of
systems and information theory (Amari, 1983 c). These three chapters are
original and only sketches are given in the present paper. More detailed theo-
retical treatments and their applications will appear as separate papers in the
near future.
2. GEOMETRICAL STRUCTURE OF STATISTICAL MODELS
2.1 Metric and α-connection
Let S = {p(x,θ)} be a statistical model consisting of probability
density functions p(x,θ) of random variable x ∈ X with respect to a measure P on
X such that every distribution is uniquely parametrized by an n-dimensional
vector parameter θ = (θ^i) = (θ^1, ..., θ^n). Since the set {p(x)} of all the den-
sity functions on X is a subset of the L₁ space of functions in x, S is consid-
ered to be a subset of the L₁ space. A statistical model S is said to be geo-
metrically regular when it satisfies the following regularity conditions A₁ to
A₆, and S is regarded as an n-dimensional manifold with a coordinate system θ.

A₁. The domain Θ of the parameter θ is homeomorphic to an n-dimen-
sional Euclidean space R^n.

A₂. The topology of S induced from R^n is compatible with the
relative topology of S in the L₁ space.

A₃. The support of p(x,θ) is common for all θ ∈ Θ, so that the p(x,θ)
are mutually absolutely continuous.

A₄. Every density function p(x,θ) is a smooth function in θ
uniformly in x, and the partial derivative ∂/∂θ^i and integration of log p(x,θ)
with respect to the measure P(x) are always commutative.

A₅. The moments of the score function (∂/∂θ^i) log p(x,θ) exist up to
the third order and are smooth in θ.

A₆. The Fisher information matrix is positive definite.

Condition A₁ implies that S itself is homeomorphic to R^n. It is
[Figure 1]
possible to weaken Condition A₁. However, only local properties are treated
here so that we assume it for the sake of simplicity. In a later section, we
assume one more condition which guarantees the validity of Edgeworth expansions.
Let us denote by ∂_i = ∂/∂θ^i the tangent vector e_i of the i-th
coordinate curve θ^i (Fig. 1) at point θ. Then, n such tangent vectors e_i = ∂_i,
i = 1, ..., n, span the tangent space T_θ at point θ of the manifold S. Any tan-
gent vector A ∈ T_θ is a linear combination of the basis vectors ∂_i,

$$A = A^i \partial_i,$$

where the A^i are the components of vector A and Einstein's summation convention
is assumed throughout the paper, so that the summation Σ is automatically taken
for those indices which appear twice in one term, once as a subscript and once as
a superscript. The tangent space T_θ is a linearized version of a small neigh-
borhood at θ of S, and an infinitesimal vector dθ = dθ^i ∂_i denotes the vector
connecting two neighboring points θ and θ + dθ, or two neighboring distributions
p(x,θ) and p(x, θ + dθ).

Let us introduce a metric in the tangent space T_θ. It can be done
by defining the inner product g_ij(θ) = <∂_i, ∂_j> of two basis vectors ∂_i and ∂_j
at θ. To this end, we represent a vector ∂_i ∈ T_θ by a function ∂_i ℓ(x,θ) in x,
where ℓ(x,θ) = log p(x,θ) and ∂_i (in ∂_i ℓ) is the partial derivative ∂/∂θ^i.
Then, it is natural to define the inner product by

$$g_{ij}(\theta) \;=\; \langle \partial_i, \partial_j \rangle \;=\; E_\theta[\partial_i \ell(x,\theta)\; \partial_j \ell(x,\theta)], \qquad (2.1)$$
where E_θ denotes the expectation with respect to p(x,θ). This g_ij is the
Fisher information matrix. Two vectors A and B are orthogonal when

$$\langle A, B \rangle \;=\; \langle A^i \partial_i, B^j \partial_j \rangle \;=\; A^i B^j g_{ij} \;=\; 0.$$
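As a small illustration of definition (2.1) (our own example, not from the
text), the following computes g_ij symbolically for the normal model N(μ, σ²)
in the coordinates θ = (μ, σ):

```python
# Symbolic Fisher information matrix for N(mu, sigma^2), an illustrative
# instance of definition (2.1).
import sympy as sp

x = sp.Symbol("x", real=True)
mu = sp.Symbol("mu", real=True)
sigma = sp.Symbol("sigma", positive=True)

p = sp.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sp.sqrt(2 * sp.pi) * sigma)
l = sp.log(p)
coords = (mu, sigma)

E = lambda f: sp.integrate(sp.expand(f * p), (x, -sp.oo, sp.oo))

g = sp.Matrix(2, 2, lambda i, j:
              sp.simplify(E(sp.diff(l, coords[i]) * sp.diff(l, coords[j]))))
print(g)  # expected: Matrix([[1/sigma**2, 0], [0, 2/sigma**2]])
```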
It is sometimes necessary to compare a vector A ∈ T_θ of the tangent
space T_θ at one point θ with a vector B ∈ T_θ' belonging to the tangent space T_θ'
at another point θ'. This can be done by comparing the basis vectors ∂_i at T_θ
with the basis vectors ∂'_i at T_θ'. Since T_θ and T_θ' are two different vector
spaces, the two vectors ∂_i and ∂'_i are not directly comparable, and we need some
way of identifying T_θ with T_θ' in order to compare the vectors in them. This
can be accomplished by introducing an affine connection, which maps a tangent
space T_{θ+dθ} at θ + dθ to the tangent space T_θ at θ. The mapping should reduce
to the identity map as dθ → 0. Let m(∂'_j) be the image of ∂'_j ∈ T_{θ+dθ} mapped
to T_θ. It is slightly different from ∂_j ∈ T_θ. The vector

$$\nabla_{\partial_i} \partial_j \;=\; \lim_{d\theta \to 0} \frac{1}{d\theta^i}\, \{\, m(\partial'_j) - \partial_j \,\}$$

represents the rate at which the j-th basis vector ∂_j ∈ T_θ "intrinsically"
changes as the point θ moves from θ to θ + dθ (Fig. 2) in the direction ∂_i. We
call ∇_{∂_i} ∂_j the covariant derivative of the basis vector ∂_j in the direction
∂_i. Since it is a vector of T_θ, its components are given by

$$\Gamma_{ijk} \;=\; \langle \nabla_{\partial_i} \partial_j, \partial_k \rangle. \qquad (2.2)$$
[Figure 2]
and

$$\nabla_{\partial_i} \partial_j \;=\; \Gamma_{ij}^{\;\;k}\, \partial_k,$$

where Γ_ij^k = Γ_ijm g^{mk}. We call Γ_ijk the components of the affine connection.
An affine connection is specified by defining ∇_{∂_i} ∂_j or Γ_ijk. Let A(θ) be a
vector field, which assigns to every point θ ∈ S a vector A(θ) = A^i(θ) ∂_i ∈ T_θ.
The intrinsic change of the vector A(θ) as the position θ moves is now given by
the covariant derivative in the direction ∂_i of A(θ) = A^j(θ) ∂_j, defined by

$$\nabla_{\partial_i} A \;=\; (\partial_i A^j)\, \partial_j + A^j (\nabla_{\partial_i} \partial_j) \;=\; (\partial_i A^j + \Gamma_{ik}^{\;\;j} A^k)\, \partial_j,$$

in which the change in the basis vectors as well as that in the components
A^j(θ) is taken into account. The covariant derivative in the direction B =
B^i ∂_i is given by

$$\nabla_B A \;=\; B^i\, \nabla_{\partial_i} A.$$

We have defined the covariant derivative by the use of the basis
vectors ∂_i which are associated with the coordinate system or the parametriza-
tion θ. However, the covariant derivative ∇_B A is invariant under any parametri-
zation, giving the same result in any coordinate system. This yields the trans-
formation law for the components Γ_ijk of a connection. When another coordinate
system (parametrization) θ' = θ'(θ) is used, the basis vectors change from
{∂_i} to {∂'_{i'}}, where

$$\partial'_{i'} \;=\; B^i_{\;i'}\, \partial_i,$$

and B^i_{i'} = ∂θ^i/∂θ'^{i'} is the inverse of the Jacobian matrix of the coor-
dinate transformation. Since the components Γ'_{i'j'k'} of the connection are
written as

$$\Gamma'_{i'j'k'} \;=\; \langle \nabla_{\partial'_{i'}} \partial'_{j'},\; \partial'_{k'} \rangle$$

in this new coordinate system, we easily have the transformation law

$$\Gamma'_{i'j'k'} \;=\; B^i_{\;i'} B^j_{\;j'} B^k_{\;k'}\, \Gamma_{ijk} \;+\; B^k_{\;k'}\, g_{jk}\, (\partial'_{i'} B^j_{\;j'}).$$
We introduce the α-connection, where α is a real parameter, in the
statistical manifold S by the formula

$$\Gamma^{(\alpha)}_{ijk} \;=\; E_\theta\Bigl[\Bigl\{\partial_i \partial_j \ell(x,\theta) + \frac{1-\alpha}{2}\, \partial_i \ell(x,\theta)\, \partial_j \ell(x,\theta)\Bigr\}\, \partial_k \ell(x,\theta)\Bigr]. \qquad (2.3)$$

It is easily checked that the connection defined by (2.3) satisfies the trans-
formation law. In particular, the 1-connection is called the exponential con-
nection, and the −1-connection is called the mixture connection.
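A computation worth recording here (standard, and a sketch rather than a
quotation from the text): in an exponential family p(x,θ) = exp{θ^i x_i − ψ(θ)},
written in its natural coordinates, ∂_i ∂_j ℓ = −∂_i ∂_j ψ(θ) is non-random, so
by (2.3)

$$\Gamma^{(1)}_{ijk} \;=\; E_\theta[\partial_i \partial_j \ell\; \partial_k \ell] \;=\; -\,\partial_i \partial_j \psi(\theta)\, E_\theta[\partial_k \ell] \;=\; 0.$$

Thus the natural parameters are affine coordinates for the exponential (α = 1)
connection; dually, the −1-connection coefficients vanish in the mean value
coordinates η_i = E_θ[x_i].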
2.2 Imbedding and α-curvature

Let us consider an m-dimensional regular statistical model M =
{q(x,u)}, which is imbedded in S = {p(x,θ)} by

$$q(x,u) \;=\; p\{x, \theta(u)\}.$$

Here, u = (u^a) = (u^1, ..., u^m) is a vector parameter specifying distributions of
M, and defines a coordinate system of M. We assume that θ = θ(u) is smooth and
its Jacobian matrix has full rank. Moreover, it is assumed that M forms an
m-dimensional submanifold in S. We identify a point u ∈ M with the point
θ = θ(u) imbedded in S. The tangent space T_u(M) at u of M is spanned by m
vectors ∂_a, a = 1, ..., m, where ∂_a = ∂/∂u^a denotes the tangent vector of the
coordinate curve u^a in M. The basis ∂_a can be represented by a function
∂_a ℓ(x,u) in x as before, where ℓ(x,u) = log q(x,u). Since M is imbedded in S,
the tangent space T_u(M) of M is regarded as a subspace of the tangent space
T_{θ(u)}(S) of S at θ = θ(u). The basis vector ∂_a ∈ T_u(M) is written as a linear
combination of the ∂_i,

$$\partial_a \;=\; B^i_{\;a}(u)\, \partial_i,$$

where B^i_a = ∂θ^i(u)/∂u^a. This can be understood from the relation

$$\partial_a \ell(x,u) \;=\; B^i_{\;a}\, \partial_i \ell\{x, \theta(u)\}.$$

Hence, the tangential directions of M at u are represented by m vectors ∂_a
(a = 1, ..., m), or B_a = (B^i_a) in the component form with respect to the basis ∂_i
of T_{θ(u)}(S).

It is convenient to define n − m vectors ∂_κ, κ = m + 1, ..., n, in
T_{θ(u)}(S) such that the n vectors {∂_a, ∂_κ}, a = 1, ..., m; κ = m + 1, ..., n,
together form a basis of T_{θ(u)}(S) and moreover the ∂_κ's are orthogonal to the
∂_a's (Fig. 3),

$$g_{a\kappa}(u) \;=\; \langle \partial_a, \partial_\kappa \rangle \;=\; 0.$$
The vectors ∂_κ span the orthogonal complement of T_u(M) in T_{θ(u)}(S). We denote
the components of ∂_κ with respect to the basis ∂_i by ∂_κ = B^i_κ(u) ∂_i. The inner
products of any two basis vectors in {∂_a, ∂_κ} are given by

$$g_{ab}(u) \;=\; \langle \partial_a, \partial_b \rangle \;=\; B^i_{\;a} B^j_{\;b}\, g_{ij},$$

$$g_{\kappa\lambda}(u) \;=\; \langle \partial_\kappa, \partial_\lambda \rangle \;=\; B^i_{\;\kappa} B^j_{\;\lambda}\, g_{ij}.$$

The basis vector ∂_a may change in its direction as point u moves in
M. The change is measured by the α-covariant derivative ∇^{(α)}_{∂_b} ∂_a of ∂_a in
the direction ∂_b, where the notion of a connection is necessary, because we need
to compare two vectors ∂_a and ∂'_a belonging to different tangent spaces T_{θ(u)}(S)
and T_{θ(u')}(S). The α-covariant derivative ∇^{(α)}_{∂_b} ∂_a is calculated in S as

$$\nabla^{(\alpha)}_{\partial_b} \partial_a \;=\; (\partial_b B^j_{\;a} + B^i_{\;b} B^k_{\;a}\, \Gamma^{(\alpha)\;j}_{ik})\, \partial_j.$$

When the directions of the tangent space T_u(M) of M do not change as point u
moves in M, the manifold M is said to be α-flat in S, where the tangent direc-
tions are compared by the α-connection. Otherwise, M is curved in the sense of
the α-connection. The α-covariant derivative ∇^{(α)}_{∂_b} ∂_a is decomposed into the
tangential component belonging to ? (M) and the normal component perpendicular
to ? (M). The former component represents the way 3 changes within ? (M), u au
while the latter represents the change of 3^ in the directions perpendicular to a
? (M), as u moves in M. The normal component is measured by
H?b^-a(a)^b'9-=(^bBi +
BbBakia)j)BXr <2-5>
a
which is a tensor called the a-curvature of submanifold M in S. It is usually
called the imbedding curvature or Euler-Shouten curvature. This tensor repre-
sents how M is curved in S. A tensor is a mu? ti-li near mapping from a number of
tangent vectors to the real set. In the present case, for A = Aa3 e? (M)
? = ? 3. e? (?) and C = CK3 belonging to the orthogonal complement of ? (M), we
(a) have the multi-linear mapping Hx ,
H(a)(A,B,C) = h[iI
AaBbCK.
(ai (a) This Hx ' is the a-curvature tensor, and H^.; are its components. The sub-
fa) manifold M is a-flat in S when H\ ' - 0 holds. The m ? m matrix
auK
ru(a)-|2 _ ?(a)?(a) ?? Cd LHM Jab
" HacKHbdxg g
represents the square of the a-curvature of M, where gK and gc are the inverse
matrix of g and g ,, respectively. Efron called the scalar
2 . G?(1)?2 ab ? *
[HM ]ab g
the statistical curvature in a one-dimensional model M, which is the trace of
the square of the exponential- or 1-curvature of M in our terminology.
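To make these definitions concrete, consider a standard example (chosen here for illustration; it is not worked out in the text). Let S be the family of bivariate normal distributions N(μ, I) with known unit covariance, for which g_{ij} = δ_{ij} and T_{ijk} = 0 in the natural coordinates θ = μ, so that every α-connection vanishes. For the circle model θ(u) = (cos u, sin u) we have B_u = (−sin u, cos u) and g_{uu} = 1, while ∂_u B_u = (−cos u, −sin u) is a unit normal vector. Hence

γ² = [H_M^{(1)2}]_{uu} g^{uu} = 1,

the statistical curvature coinciding with the ordinary Euclidean curvature of the unit circle.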
Let θ = θ(t) be a curve in S parametrized by a scalar t. The curve c: θ = θ(t) forms a one-dimensional submanifold in S. The tangent vector ∂_t of the curve is represented in the component form as

∂_t = θ̇^i(t) ∂_i,

or shortly by θ̇, where the dot denotes d/dt. When the direction of the tangent vector ∂_t = θ̇ does not change along the curve in the sense of the α-connection, the curve is called an α-geodesic. By choosing an appropriate parameter, an α-geodesic θ(t) satisfies the geodesic equation

∇^{(α)}_{θ̇} θ̇ = 0,

or in the component form

θ̈^i + Γ^{(α)i}_{jk} θ̇^j θ̇^k = 0.   (2.6)
2.3 Duality in α-flat manifolds
Once an affine connection is defined in S, we can compare two tangent vectors A ∈ T_θ and A' ∈ T_{θ'} belonging to different tangent spaces T_θ and T_{θ'} by the following parallel displacement of a vector. Let c: θ = θ(t) be a curve connecting the two points θ and θ'. Let us consider a vector field A(t) = A^i(t) ∂_i ∈ T_{θ(t)} defined at each point θ(t) on the curve. If the vector A(t) does not change along the curve, i.e., the covariant derivative of A(t) in the direction θ̇ vanishes identically,

∇_{θ̇} A(t) = {Ȧ^i(t) + Γ^i_{jk} A^k(t) θ̇^j} ∂_i = 0,

the field A(t) is said to be a parallel vector field on c. Moreover, A(t') ∈ T_{θ(t')} at θ(t') is said to be a parallel displacement of A(t) ∈ T_{θ(t)} at θ(t). We can thus displace in parallel a vector A ∈ T_θ at θ to another point θ' along a curve θ(t) connecting θ and θ', by making a vector field A(t) which satisfies the differential equation ∇_{θ̇} A(t) = 0, with the boundary conditions θ = θ(0), θ' = θ(1), and A(0) = A ∈ T_θ. The vector A' = A(1) ∈ T_{θ'} at θ' = θ(1) is the parallel displacement of A from θ to θ' along the curve c: θ = θ(t). We denote it by A' = π_c A. When the α-connection is used, we denote the α-parallel displacement operator by π^{(α)}. The parallel displacement of A from θ to θ' in general depends on the path c: θ(t) connecting θ and θ'. When it does not depend on the path, the manifold is said to be flat. It is known that a manifold is flat when, and only when, the Riemann-Christoffel curvature vanishes identically (see textbooks of differential geometry). A statistical manifold S is said to be α-flat when it is flat under the α-connection.

The parallel displacement does not in general preserve the inner product, i.e., ⟨π_c A, π_c B⟩ = ⟨A, B⟩ does not necessarily hold. When a manifold has two affine connections with corresponding parallel displacement operators π and π*, and moreover

⟨π_c A, π*_c B⟩ = ⟨A, B⟩   (2.7)

holds, the two connections are said to be mutually dual. The two operators π and π* are then mutually adjoint. We have the following theorem in this regard (Nagaoka and Amari (1982)).

Theorem 2.1. The α-connection and the −α-connection are mutually dual. When S is α-flat, it is also −α-flat.
When a manifold S is α-flat, there exists a coordinate system (θ^i) such that

∇^{(α)}_{∂_i} ∂_j = 0, or Γ^{(α)}_{ijk}(θ) = 0,

holds identically. In this case, a basis vector ∂_i is the same at any point θ, in the sense that ∂_i ∈ T_θ is mapped to ∂_i ∈ T_{θ'} by the α-parallel displacement irrespective of the path connecting θ and θ'. Since all the coordinate curves θ^i are α-geodesics in this case, θ is called an α-affine coordinate system. A linear transformation of an α-affine coordinate system is also α-affine.

We give an example of a 1-flat (i.e., α = 1) manifold S. The density functions of an exponential family S = {p(x,θ)} can be written as

p(x,θ) = exp{θ^i x_i − ψ(θ)}

with respect to an appropriate measure, where θ = (θ^i) is called the natural or canonical parameter. From

∂_i ℓ(x,θ) = x_i − ∂_i ψ(θ),  ∂_i ∂_j ℓ(x,θ) = −∂_i ∂_j ψ(θ),

we easily have

g_{ij}(θ) = ∂_i ∂_j ψ(θ),  Γ^{(α)}_{ijk}(θ) = (1−α)/2 ∂_i ∂_j ∂_k ψ(θ).

Hence, the 1-connection Γ^{(1)}_{ijk} vanishes identically in the natural parameter, showing that θ gives a 1-affine coordinate system. A curve θ^i(t) = a^i t + b^i, which is linear in the θ-coordinates, is a 1-geodesic, and conversely.
Since an α-flat manifold is also −α-flat, there exists a −α-affine coordinate system η = (η_i) = (η_1,...,η_n) in an α-flat manifold S. Let ∂^i = ∂/∂η_i be the tangent vector of the coordinate curve η_i in the new coordinate system η. The vectors {∂^i} form a basis of the tangent space T_θ (i.e., at θ = θ(η)) of S. When the two bases {∂_i} and {∂^j} of the tangent space T_θ satisfy

⟨∂_i, ∂^j⟩ = δ_i^j

at every point θ (or η), where δ_i^j is the Kronecker delta (denoting the unit matrix), the two coordinate systems θ and η are said to be mutually dual (Nagaoka and Amari (1982)).
Theorem 2.2. When S is α-flat, there exists a pair of coordinate systems θ = (θ^i) and η = (η_i) such that i) θ is α-affine and η is −α-affine, ii) θ and η are mutually dual, iii) there exist potential functions ψ(θ) and φ(η) such that the metric tensors are derived by differentiation as

g_{ij}(θ) = ⟨∂_i, ∂_j⟩ = ∂_i ∂_j ψ(θ),
g^{ij}(η) = ⟨∂^i, ∂^j⟩ = ∂^i ∂^j φ(η),

where g_{ij} and g^{ij} are mutually inverse matrices, so that

∂_i = g_{ij} ∂^j,  ∂^i = g^{ij} ∂_j

holds, and iv) the coordinates are connected by the Legendre transformation

η_i = ∂_i ψ(θ),  θ^i = ∂^i φ(η),   (2.8)

where the potentials satisfy the identity

ψ(θ) + φ(η) − θ·η = 0,   (2.9)

in which θ·η = θ^i η_i.

In the case of an exponential family S, ψ becomes the cumulant generating function, the expectation parameter η = (η_i),

η_i = E_θ[x_i] = ∂_i ψ(θ),

is −1-affine, θ and η are mutually dual, and the dual potential φ(η) is given by the negative entropy,

φ(η) = E[log p],

where the expectation is taken with respect to the distribution specified by η.
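As a concrete illustration (the Poisson family, a standard instance not treated at this point in the text), take p(x,θ) = exp{θx − e^θ}/x!, x = 0, 1, 2, .... Then

ψ(θ) = e^θ,  η = ∂ψ/∂θ = e^θ = E_θ[x],  φ(η) = η log η − η,

and (2.8)-(2.9) are verified directly:

θ = ∂φ/∂η = log η,  ψ(θ) + φ(η) − θη = η + (η log η − η) − η log η = 0,

while g(θ) = ∂²ψ/∂θ² = η is the Fisher information in the natural parameter.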
2.4 α-divergence and α-projection
We can introduce the notion of the α-divergence D_α(θ,θ') in an α-flat manifold S, which represents the degree of divergence from the distribution p(x,θ) to p(x,θ'). It is defined by

D_α(θ,θ') = ψ(θ) + φ(η') − θ·η',   (2.10)

where η' = η(θ') are the η-coordinates of the point θ', i.e., the −α-coordinates of the distribution p(x,θ'). The α-divergence satisfies D_α(θ,θ') ≥ 0 with equality when and only when θ = θ'. The −α-divergence satisfies D_{−α}(θ,θ') = D_α(θ',θ). When S is an exponential family, the −1-divergence is the Kullback-Leibler information,

D_{−1}(θ,θ') = I[p(x,θ') : p(x,θ)] = ∫ p(x,θ) log {p(x,θ)/p(x,θ')} dP.

As a preview of later discussion, we may also note that, when S = {p(x)} is the function space of a non-parametric statistical model, the α-divergence is written as

D_α{p(x), q(x)} = 4/(1 − α²) {1 − ∫ p(x)^{(1−α)/2} q(x)^{(1+α)/2} dP}

when α ≠ ±1, and is the Kullback information or its dual when α = −1 or 1.
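For orientation (a special case we single out here): at α = 0 the formula gives

D_0{p, q} = 4{1 − ∫ (p q)^{1/2} dP} = 2 ∫ (√p − √q)² dP,

twice the squared Hellinger distance, and D_0 is symmetric in p and q.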
When θ and θ' = θ + dθ are infinitesimally close,

D_α(θ, θ + dθ) = ½ g_{ij}(θ) dθ^i dθ^j   (2.11)

holds, so that D_α can be regarded as a generalization of half of the square of the Riemannian distance, although neither symmetry nor the triangle inequality holds for D_α. However, the following Pythagorean theorem holds (Efron (1978) in an exponential family, Nagaoka and Amari (1982) in the general case).

Theorem 2.3. Let c be an α-geodesic connecting two points θ and θ', and let c' be a −α-geodesic connecting two points θ' and θ'' in an α-flat S. When the two curves c and c' intersect at θ' with a right angle such that θ, θ' and θ'' form a right triangle, the following Pythagorean relation holds,
D_α(θ,θ') + D_α(θ',θ'') = D_α(θ,θ'').   (2.12)
Let M = {q(x,u)} be an m-dimensional submanifold imbedded in an α-flat n-dimensional manifold S = {p(x,θ)} by θ = θ(u). For a distribution p(x,θ_0) ∈ S, we search for the distribution q(x,û) ∈ M which is the closest distribution in M to p(x,θ_0) in the sense of the α-divergence (Fig. 4a),

min_{u∈M} D_α{θ_0, θ(u)} = D_α{θ_0, θ(û)}.

We call the resulting û(θ_0) the α-approximation of p(x,θ_0) in M, assuming such exists uniquely. It is important in many statistical problems to obtain the α-approximation, especially the −1-approximation. Let c(u) be the α-geodesic connecting a point θ(u) ∈ M and θ_0,

c(u): θ = θ(t;u),  θ(u) = θ(0;u),  θ_0 = θ(1;u)

(Fig. 4b). When the α-geodesic c(u) is orthogonal to M at θ(u), i.e.,

⟨θ̇(0;u), ∂_a⟩ = 0,

where ∂_a = ∂/∂u^a are the basis vectors of T_u(M), we call this u the α-projection of θ_0 on M. The existence and the uniqueness of the α-approximation and the α-projection are in general guaranteed only locally. The following theorem was first given by Amari (1982a) and by Nagaoka and Amari (1982) in more general form.
Figure 4
Theorem 2.4. The α-approximation û(θ_0) of θ_0 in M is given by the α-projection u(θ_0) of θ_0 on M.
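To see the most important instance (the link to maximum likelihood, made explicit in Section 3 but worth recording here): let θ_0 = θ̂ be the point of S whose η-coordinates are an observed mean x̄, so that E_{θ̂}[x_i] = x̄_i. Then

D_{−1}{θ̂, θ(u)} = ∫ p(x,θ̂) log {p(x,θ̂)/q(x,u)} dP = const − {θ^i(u) x̄_i − ψ(θ(u))},

and minimizing over u maximizes the log-likelihood: the −1-approximation (equivalently, by Theorem 2.4, the −1-projection) of θ̂ in M is the maximum likelihood estimate.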
3. HIGHER-ORDER ASYMPTOTIC THEORY OF STATISTICAL INFERENCE IN CURVED EXPONENTIAL FAMILY

3.1 Ancillary family
Let S be an n-dimensional exponential family parametrized by the natural parameter θ = (θ^i), and let M = {q(x,u)} be an m-dimensional family parametrized by u = (u^a), a = 1,...,m. M is said to be an (n,m)-curved exponential family imbedded in S = {p(x,θ)} by θ = θ(u), when q(x,u) is written as

q(x,u) = exp[θ^i(u) x_i − ψ{θ(u)}].

The geometrical structures of S and M can easily be calculated as follows. The quantities in S in the θ-coordinate system are

g_{ij}(θ) = ∂_i ∂_j ψ(θ),
Γ^{(α)}_{ijk} = (1−α)/2 T_{ijk},
T_{ijk} = ∂_i ∂_j ∂_k ψ(θ).

The quantities in M are

g_{ab}(u) = ⟨∂_a, ∂_b⟩ = B_a^i B_b^j g_{ij},
Γ^{(α)}_{abc} = ⟨∇^{(α)}_{∂_a} ∂_b, ∂_c⟩ = (∂_a B_b^i) B_c^j g_{ij} + (1−α)/2 T_{abc},
T_{abc} = B_a^i B_b^j B_c^k T_{ijk},

where B_a^i = ∂_a θ^i(u). Here, the basis vector ∂_a of T_u(M) is the vector

∂_a = B_a^i ∂_i

in T_{θ(u)}(S). If we use the expectation coordinate system η in S, M is represented by η = η(u). The components of the tangent vector ∂_a are then given by
B_{ai} = ∂_a η_i(u) = B_a^j g_{ji},

where ∂_a = B_{ai} ∂^i and ∂^i = ∂/∂η_i.
Let x_{(1)}, x_{(2)},...,x_{(N)} be N independent observations from a distribution q(x,u) ∈ M. Then, their arithmetic mean

x̄ = (1/N) Σ_j x_{(j)}

is a minimal sufficient statistic. Since the joint distribution q(x_{(1)},...,x_{(N)}; u) can be written as

∏_{j=1}^N q(x_{(j)}, u) = exp[N{θ^i(u) x̄_i − ψ{θ(u)}}],

the geometrical structure of M based on N observations is the same as that based on one observation except for the constant factor N. We treat statistical inference based on x̄. Since a point x̄ in the sample space X can be identified with a point η = x̄ in S by using the expectation parameter η, the observed sufficient statistic x̄ defines a point θ̂ in S whose η-coordinates are given by x̄, η(θ̂) = x̄. In other words, we regard x̄ as the point (distribution) θ̂ in S whose expectation parameter is just equal to x̄. Indeed, this θ̂ is the maximum likelihood estimator in the exponential family S.
Let us attach an (n−m)-dimensional submanifold A(u) of S to each point u ∈ M, such that all the A(u)'s are disjoint (at least in some neighborhood of M, which is called a tubular neighborhood) and the union of the A(u)'s covers S (at least the tubular neighborhood of M). This is called a (local) foliation of S. Let v = (v^κ), κ = m+1,...,n, be a coordinate system in A(u). We assume that the pair (u,v) can be used as a coordinate system of the entire S (at least in a neighborhood of M). Indeed, a pair (u,v) specifies a point in S such that it is included in the A(u) attached to u and its position in A(u) is given by v (see Fig. 5). Let η = η(u,v) be the η-coordinates of the point specified by (u,v). This is the coordinate transformation of S from w = (u,v) to η, where w = (u,v) = (w^β) is an n-dimensional variable, β = 1,...,n, such that its first m components are u = (u^a) and the last n−m components are v = (v^κ).
Figure 5
Any point η (in some neighborhood of M) in S can be represented uniquely by w = (u,v). We assume that A(u) includes the point η = η(u) on M and that the origin v = 0 of A(u) is put at the point u ∈ M. This implies that η(u,0) is the point η(u) ∈ M. We call A = {A(u)} an ancillary family of the model M.

In order to analyze the properties of a statistical inference method, it is helpful to use the ancillary family which is naturally determined by the inference method. For example, an estimator û can be regarded as a mapping from S to M such that it maps the observed point η = x̄ in S, determined by the sufficient statistic x̄, to a point û(x̄) ∈ M. Its inverse image û^{-1}(u) defines an (n−m)-dimensional subspace A(u) attached to u ∈ M,

A(u) = û^{-1}(u) = {η ∈ S | û(η) = u}.

Obviously, the estimator û takes the value u when and only when the observed η is included in A(u). These A(u)'s form a family A = {A(u)} which we will call the ancillary family associated with the estimator û. As will be shown soon, large-sample properties of an estimator û are determined by the geometrical features of the associated ancillary submanifolds A(u). Similarly, a test T can be regarded as a mapping from S to the binary set {r, r̄}, where r and r̄ imply, respectively, rejection and acceptance of a null hypothesis. The inverse image T^{-1}(r) ⊂ S is called the critical region, and the hypothesis is rejected when and only when the observed point η = x̄ ∈ S is in T^{-1}(r). In order to analyze the characteristics of a test, it is convenient to use an ancillary family A = {A(u)} such that the critical region is composed of some of the A(u)'s and the acceptance region is composed of the other A(u)'s. Such an ancillary family is said to be associated with the test T.
In order to analyze the geometrical features of ancillary submanifolds, let us use the new coordinate system w = (u,v). The tangent of the coordinate curve w^β is given by ∂_β = ∂/∂w^β. The tangent space T_w(S) at the point η = η(w) of S is spanned by {∂_β}, β = 1,...,n. They are decomposed into two parts, {∂_β} = {∂_a, ∂_κ}, β = 1,...,n; a = 1,...,m; κ = m+1,...,n. The former part ∂_a = ∂/∂u^a spans the tangent space T_u(M) of M at u, and the latter ∂_κ = ∂/∂v^κ spans the tangent space T_v(A) of A(u). Their components are given by B_{βi} = ∂_β η_i(w) in the basis ∂^i. They are decomposed as

∂_a = B_{ai} ∂^i,  ∂_κ = B_{κi} ∂^i,

with B_{ai} = ∂_a η_i(u,v), B_{κi} = ∂_κ η_i(u,v). The metric tensor in the w-coordinate system is given by

g_{βγ}(w) = ⟨∂_β, ∂_γ⟩ = B_{βi} B_{γj} g^{ij},   (3.1)

where

B_β^i = g^{ij} B_{βj} = ∂θ^i(u,v)/∂w^β.

The metric tensor is decomposed into three parts:

g_{ab}(w) = ⟨∂_a, ∂_b⟩ = B_{ai} B_{bj} g^{ij}   (3.2)

is the metric tensor of M,

g_{κλ}(w) = ⟨∂_κ, ∂_λ⟩ = B_{κi} B_{λj} g^{ij}   (3.3)

is the metric tensor of A(u), and

g_{aκ}(w) = ⟨∂_a, ∂_κ⟩ = B_{ai} B_{κj} g^{ij}   (3.4)

represents the angles between the tangent spaces of M and A(u). When g_{aκ}(u,0) = 0, M and A(u) are orthogonal to each other at M. The ancillary family A = {A(u)} is said to be orthogonal when g_{aκ}(u) = 0, where f(u) is the abbreviation of f(u,0) when a quantity f(u,v) is evaluated on M, i.e., at v = 0. We may also treat an ancillary family A_N which depends on the number N of observations. In this case g_{aκ} also depends on N. When g_{aκ} = ⟨∂_a, ∂_κ⟩ is a quantity of order N^{-1/2}, converging to 0 as N tends to infinity, the ancillary family is said to be asymptotically orthogonal.
The α-connection in the w-coordinate system is given by

Γ^{(α)}_{βγδ} = ⟨∇^{(α)}_{∂_β} ∂_γ, ∂_δ⟩ = (∂_β B_{γi}) B_δ^i − (1+α)/2 T_{βγδ}
             = (∂_β B_γ^i) B_{δi} + (1−α)/2 T_{βγδ},   (3.5)

where T_{βγδ} = B_β^i B_γ^j B_δ^k T_{ijk}. The M-part Γ^{(α)}_{abc} gives the components of the α-connection of M, and the A-part Γ^{(α)}_{κλμ} gives those of the α-connection of A(u). When A is orthogonal, the α-curvatures of M and A(u) are given respectively by

H^{(α)}_{abκ} = Γ^{(α)}_{abκ},  H^{(α)}_{κλa} = Γ^{(α)}_{κλa}.   (3.6)

The quantities g_{aκ}(u), H^{(α)}_{abκ} and H^{(α)}_{κλa} are fundamental in evaluating asymptotic properties of statistical inference procedures. When α = 1, the 1-connection is called the exponential connection, and we use the suffix (e) instead of (1). When α = −1, the −1-connection is called the mixture connection, and we use the suffix (m) instead of (−1).
3.2 Edgeworth expansion
We study higher-order asymptotic properties of various statistics with the help of Edgeworth expansions. To this end, let us express the point η = x̄ defined by the observed sufficient statistic in the w-coordinate system. The w-coordinates ŵ = (û, v̂) are obtained by solving

x̄ = η(ŵ) = η(û, v̂).   (3.7)

The sufficient statistic x̄ is thus decomposed into two parts (û, v̂) which together are also sufficient. When the ancillary family A is associated with an estimator or a test, û gives the estimated value or the test statistic, respectively. We calculate the Edgeworth expansion of the joint distribution of (û, v̂) in geometrical terms. Here, it is necessary further to assume a condition which guarantees the Edgeworth expansion. We assume that Cramér's condition is satisfied; see, for example, Bhattacharya and Ghosh (1978).
When u_0 is the true parameter of the distribution, x̄ converges to η(u_0, 0) in probability as the number N of observations tends to infinity, so that the random variable ŵ also converges to w_0 = (u_0, 0). Let us put

x̃ = √N {x̄ − η(u_0, 0)},  w̃ = √N (ŵ − w_0),
ũ = √N (û − u_0),  ṽ = √N v̂.   (3.8)

Then, by expanding (3.7), we can express w̃ in a power series of x̃, and thereby obtain the Edgeworth expansion of the distribution p(w̃; u_0) of w̃ = (ũ, ṽ). However, it is simpler to obtain the distribution of the one-step bias-corrected version w̃* of w̃ defined by

w̃* = w̃ − E[w̃],

where E denotes the expectation with respect to p(x, w). The distribution of w̃ is obtained easily from that of w̃*. (See Amari and Kumon (1983).)
Theorem 3.1. The Edgeworth expansion of the probability density p(w̃*; u_0) of w̃*, where q(x, u_0) is the underlying true distribution, is given by

p(w̃*; u_0) = n(w̃*; g_{αβ}) {1 + 1/(6√N) K_{αβγ} h^{αβγ}(w̃*) + N^{-1} A_N(w̃*) + O(N^{-3/2})},   (3.9)

A_N(w̃*) = ¼ C_{2αβ} h^{αβ} + 1/24 K_{αβγδ} h^{αβγδ} + 1/72 K_{αβγ} K_{δεζ} h^{αβγδεζ},

where n(w̃*; g_{αβ}) is the multivariate normal density with mean 0 and covariance g^{αβ} = (g_{αβ})^{-1}; h^{αβγ}, etc., are the tensorial Hermite polynomials in w̃*; K_{αβγ} and K_{αβγδ} are the third- and fourth-order cumulant tensors of w̃; and

C_{2αβ} = Γ^{(m)}_{γδα} Γ^{(m)}_{εσβ} g^{γε} g^{δσ}, etc.

The tensorial Hermite polynomials in w̃ with metric g_{αβ} are defined by

h^{α_1...α_k}(w̃) = (−1)^k {D^{α_1} ⋯ D^{α_k} n(w̃; g_{αβ})} / n(w̃; g_{αβ}),

where D^α = g^{αβ}(∂/∂w̃^β); cf. Amari and Kumon (1983), McCullagh (1984). Hence,

h = 1,  h^α = w̃^α,  h^{αβ} = w̃^α w̃^β − g^{αβ},
h^{αβγ} = w̃^α w̃^β w̃^γ − g^{αβ} w̃^γ − g^{αγ} w̃^β − g^{βγ} w̃^α,  etc.
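In one dimension with g = 1 these reduce to the classical Hermite polynomials; for instance,

h³(w̃) = w̃³ − 3w̃,  h⁴(w̃) = w̃⁴ − 6w̃² + 3,

obtained by differentiating the standard normal density three and four times.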
Theorem 3.1 shows the Edgeworth expansion up to order N^{-1} of the joint distribution of ũ* and ṽ*, which together carry the full Fisher information. The marginal distribution can easily be obtained by integration.

Theorem 3.2. When the ancillary family is orthogonal, i.e., g_{aκ}(u) = 0, the distribution p(ũ*; u_0) of ũ* is given by

p(ũ*; u_0) = n(ũ*; g_{ab}) {1 + 1/(6√N) K_{abc} h^{abc} + N^{-1} A_N(ũ*)} + O(N^{-3/2}),   (3.10)

where K_{abc} is the third-order cumulant tensor of ũ, expressible in terms of T_{abc} and the mixture connection Γ^{(m)}_{abc} of M,

A_N(ũ*) = ¼ C_{2ab} h^{ab} + terms common to all the orthogonal ancillary families,

C_{2ab} = (Γ_M^{(m)2})_{ab} + 2 (H_M^{(e)2})_{ab} + (H_A^{(m)2})_{ab},   (3.11)

(Γ_M^{(m)2})_{ab} = Γ^{(m)}_{cda} Γ^{(m)}_{efb} g^{ce} g^{df},
(H_M^{(e)2})_{ab} = H^{(e)}_{acκ} H^{(e)}_{bdλ} g^{cd} g^{κλ},
(H_A^{(m)2})_{ab} = H^{(m)}_{κμa} H^{(m)}_{λνb} g^{κλ} g^{μν}.
3.3 Higher-order efficiency of estimation
Given an estimator û: S → M which maps the observed point η = x̄ ∈ S to û(x̄) ∈ M, we can construct the ancillary family A = {A(u)} by

A(u) = û^{-1}(u) = {η ∈ S | û(η) = u}.

The A(u) includes the point η(u) = η(u,0) when and only when the estimator is consistent. (We may treat the case when A(u) depends on N, denoting the ancillary family by A_N(u). In this case, an estimator is consistent if lim A_N(u) ∋ η(u,0).)

Let us expand the covariance of the estimation error ũ = √N(û − u_0) as

Cov[ũ^a, ũ^b] = g_1^{ab} + g_2^{ab} N^{-1/2} + g_3^{ab} N^{-1} + O(N^{-3/2}).

A consistent estimator is said to be first-order efficient, or simply efficient, when its first-order term g_1^{ab}(u) is minimal among all the consistent estimators at any u, where the minimality is in the sense of positive semidefiniteness of matrices. The second- and third-order efficiency is defined similarly.

Since the first-order term g_1^{ab} is given from (3.9) by

g_1^{ab} = (g_{ab} − g_{aκ} g_{bλ} g^{κλ})^{-1},

the minimality is attained when and only when g_{aκ} = 0, i.e., the associated ancillary family is orthogonal. From this and Theorem 3.2, we have the following results.

Theorem 3.3. A consistent estimator is first-order efficient iff the associated ancillary family is orthogonal. An efficient estimator is always second-order efficient, because g_2 = 0.
There exist no third-order efficient estimators in the sense that g_3^{ab}(u) is minimal at all u. This can be checked from the fact that g_3 includes a term linear in the derivative of the mixture curvature of A(u); see Amari (1985). However, if we calculate the covariance of the bias-corrected version û* = û − b(û) of an efficient estimator û, where b(u) = E_u[û] − u is the bias, we see that there exists a third-order efficient estimator among the class of all the bias-corrected efficient estimators. To state the result, let g_{3ab} = g_3^{cd} g_{ca} g_{db} be the lower-index version of g_3^{ab}.

Theorem 3.4. The third-order term g_{3ab} of the covariance of a bias-corrected efficient estimator û* is given by the sum of the three non-negative geometric quantities

g_{3ab} = ½ (Γ_M^{(m)2})_{ab} + (H_M^{(e)2})_{ab} + ½ (H_A^{(m)2})_{ab}.   (3.12)
The first is the square of the mixture connection components of M, and depends on the parametrization of M but is common to all the estimators. The second is the square of the exponential curvature of M, which does not depend on the estimator. The third is the square of the mixture curvature of the ancillary submanifold A(u) at η(u), which depends on the estimator. An efficient estimator is third-order efficient when and only when the associated ancillary family is mixture-flat at η(u). The m.l.e. is third-order efficient, because it is given by the mixture-projection of θ̂ to M.

The Edgeworth expansion (3.10) tells more about the characteristics of an efficient estimator û*. When H^{(m)}_{κλa} vanishes, an estimator is shown to be mostly concentrated around the true parameter u and is third-order optimal under a symmetric unimodal loss function. The effect of the manner of parametrizing M is also clear from (3.10). The α-normal coordinate system (parameter), in which the components of the α-connection become zero at a fixed point, is very important (cf. Hougaard, 1983; Kass, 1984).
3.4 Higher-order efficiency of tests
Let us consider a test T of a null hypothesis H_0: u ∈ D against the alternative H_1: u ∉ D in an (n,m)-curved exponential family, where D is a region or a submanifold in M. Let R be the critical region of test T, such that the hypothesis H_0 is rejected when and only when the observed point η = x̄ belongs to R. When T has a test statistic λ(x̄), the equation λ(x̄) = const. gives the boundary of the critical region R. The power function P_T(u) of the test T at the point u is given by

P_T(u) = ∫_R p(x̄; u) dx̄,

where p(x̄; u) is the density function of x̄ when the true parameter is u.

Given a test T, we can compose an ancillary family A = {A(u)} such that the critical region R is given by the union of some of the A(u)'s, i.e., it can be written as

R = ∪_{u ∈ R_M} A(u),
where R_M is a subset of M. Then, when we decompose the observed statistic η = x̄ into (û, v̂) by x̄ = η(û, v̂) in terms of the related w-coordinates, the hypothesis H_0 is rejected when and only when û ∈ R_M. Hence, the test statistic λ(x̄) is a function of û only. Since we have already obtained the Edgeworth expansion of the joint distribution of (û, v̂), or of (ũ*, ṽ*), we can analyze the characteristics of a test in terms of the geometry of the associated A(u)'s.
We first consider the case where M = {q(x,u)} is one-dimensional, so that u = (u^a) is a scalar parameter, the indices a, b, etc. being equal to 1. We test the null hypothesis H_0: u = u_0 against the alternative H_1: u ≠ u_0. Let u_t be a point which approaches u_0 as N tends to infinity by

u_t = u_0 + t (Ng)^{-1/2},   (3.13)

i.e., the point whose Riemannian distance from u_0 is approximately t N^{-1/2}, where g = g_{ab}(u_0). The power P_T(u_t, N) of a test T at u_t is expanded as

P_T(u_t, N) = P_{T1}(t) + P_{T2}(t) N^{-1/2} + P_{T3}(t) N^{-1} + O(N^{-3/2}).
A test T is said to be first-order uniformly efficient or, simply, efficient, if the first-order term P_{T1}(t) satisfies P_{T1}(t) ≥ P_{T'1}(t) at all t, compared with any other test T' of the same level. The second- and third-order uniform efficiency is defined similarly. Let P̄(u_t, N) be the envelope power function of the P_T(u_t, N)'s, defined by

P̄(u_t, N) = sup_T P_T(u_t, N).   (3.14)

Let us expand it as

P̄(u_t, N) = P̄_1(t) + P̄_2(t) N^{-1/2} + P̄_3(t) N^{-1} + O(N^{-3/2}).

It is clear that a test T is i-th order uniformly efficient iff

P_{Tk}(t) = P̄_k(t)

holds at any t for k = 1,...,i.
An ancillary family A = {A(u)} in this case consists of (n−1)-dimensional submanifolds A(u) attached to each u, or η(u) ∈ M. The critical region R is bounded by one of the ancillary submanifolds, say A(u_+), in the one-sided case, and by two submanifolds A(u_+) and A(u_−) in the two-sided unbiased case. The asymptotic behavior of a test T is determined by the geometric features of the boundary ∂R, i.e., A(u_+) [and A(u_−)]. In particular, the angle between M and A(u) is important. The angle is given by the inner product g_{aκ}(u) = ⟨∂_a, ∂_κ⟩ of the tangent ∂_a of M and the tangents ∂_κ of A(u). When g_{aκ}(u) = 0 for all u, A is orthogonal. In the case of a test, the critical region, and hence the associated ancillary A and g_{aκ}(u), depend on N. An ancillary family is said to be asymptotically orthogonal when g_{aκ}(u) is of order N^{-1/2}. We can assume g_{aκ}(u_0) = 0, and g_{aκ}(u_t) can be expanded as

g_{aκ}(u_t) = t Q_{aκ} (Ng)^{-1/2},   (3.15)

where Q_{aκ} = ∂_a g_{aκ}(u_0). The quantity Q_{aκ} represents the direction and the magnitude of the inclination of A(u) from being exactly orthogonal to M. We can now state the asymptotic properties of a test in geometrical terms (Kumon and Amari (1983), (1985)).
Theorem 3.5. A test T is first-order uniformly efficient iff the associated ancillary family A is asymptotically orthogonal. A first-order uniformly efficient test is second-order uniformly efficient.
Unfortunately, there exist no third-order uniformly efficient tests (unless the model M is an exponential family). An efficient test T is said to be third-order t_0-efficient when its third-order power P_{T3}(t) attains the envelope at t_0, i.e., when P_{T3}(t_0) = P̄_3(t_0), and when there exists no test T' satisfying P_{T'3}(t) > P_{T3}(t) for all t. An efficient test is third-order admissible when it is t_0-efficient at some t_0. We define the third-order power loss function (deficiency function) ΔP_T(t) of an efficient test T by

ΔP_T(t) = lim_{N→∞} N {P̄(u_t, N) − P_T(u_t, N)} = P̄_3(t) − P_{T3}(t).   (3.16)

It characterizes the behavior of an efficient test T. The power loss function can be explicitly given in geometrical terms of the associated ancillary A (Kumon and Amari (1983), Amari (1983a)).
Theorem 3.6. An efficient test T is third-order admissible only when the mixture curvature of A(u) vanishes as N → ∞ and the A(u) is not exactly orthogonal to M but asymptotically orthogonal, so as to compensate the exponential curvature H^{(e)} of the model M in such a way that

Q_{aκ} = c H^{(e)}_{abκ}   (3.17)

holds for some constant c. The third-order power loss function is then given by

ΔP_T(t) = a_i(t,α) {c − J_i(t,α)}² γ²,   (3.18)

where a_i(t,α) is some fixed function of t and α (involving the standard normal density φ), α being the level of the test,

γ² = H^{(e)}_{abκ} H^{(e)}_{cdλ} g^{κλ} g^{ac} g^{bd}   (3.19)

is the square of the exponential curvature (Efron's statistical curvature) of M, and

J_1(t,α) = 1 − t/{2u_1(α)},
J_2(t,α) = 1 − t/[2u_2(α) tanh{t u_2(α)}],

with i = 1 for the one-sided case and i = 2 for the two-sided case, u_1(α) and u_2(α) being the one-sided and two-sided 100α% points of the standard normal distribution, respectively.
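Two consequences can be read off from (3.18) by simple arithmetic (we record them here; the figures below draw the same picture). A third-order admissible test with constant c suffers no third-order power loss exactly where J_i(t,α) = c. Thus, in the one-sided case, the c = 0 test is loss-free at t = 2u_1(α), since J_1(2u_1(α),α) = 0, while the c = 1/2 test is loss-free at t = u_1(α), where J_1 = 1/2.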
The theorem shows that a third-order admissible test is characterized by its c value. It is interesting that the third-order power loss function (3.18) depends on the model M only through the statistical curvature γ², so that ΔP_T(t)/γ² gives a universal power loss curve common to all statistical models. It depends only on the value of c. Various widely used tests will next be shown to be third-order admissible, so that they are characterized by c values as follows.

Theorem 3.7. The test based on the maximum likelihood estimator (e.g., the Wald test) is characterized by c = 0. The likelihood ratio test is characterized by c = 1/2. The locally most powerful test is characterized by c = 1 in the one-sided case and by c = 1 − 1/{2u_2²(α)} in the two-sided case. The conditional test conditioned on the approximate ancillary statistic a = H^{(e)}_{abκ} ṽ^κ is characterized also by c = 1/2. The efficient-score test is characterized by c = 1, and is inadmissible in the two-sided case.
We show the universal third-order power loss functions of various tests in Fig. 6 for the two-sided case and in Fig. 7 for the one-sided case, where α = 0.05 (from Amari (1983a)). It is seen that the likelihood ratio test has fairly good performance throughout a wide range of t, while the locally most powerful test behaves badly when t > 2. The m.l.e. test is good at around t = 3.

We can generalize the present theory to the multi-parameter case with and without nuisance parameters. It is interesting that none of the above tests is third-order admissible in the multi-parameter case. However, it is easy to modify a test to get a third-order t_0-efficient test by the use of the asymptotic ancillary statistic a (Kumon and Amari, 1985). We can also design third-order t_0-most-powerful confidence region estimators and third-order minimal-size confidence region estimators.

It is also possible to extend the present results of estimation and testing to a statistical model with a nuisance parameter ξ. In this case, a set M(u_0) of distributions, in which the parameter of interest takes a fixed value u_0 but ξ takes arbitrary values, forms a submanifold. The mixture curvature and the exponential twister curvature of M(u_0) are responsible for the higher-order characteristics of statistical inference. The third-order admissibility of the likelihood ratio test and others is again proved. See Amari (1985).
ΔP_T(t)/γ², α = 0.05, two-sided tests: efficient score test, locally most powerful test, m.l.e. test, likelihood ratio test.

Figure 6

ΔP_T(t)/γ², α = 0.05, one-sided tests: efficient score test (locally most powerful test), m.l.e. test, likelihood ratio test.

Figure 7
4. INFORMATION, SUFFICIENCY AND ANCILLARITY: HIGHER-ORDER THEORY

4.1 Information and conditional information
Given a statistical model M = {p(x,u)}, u = (u^a), we can follow Fisher and define the amount g_{ab}(T) of information included in a statistic t = t(x) by

g_{ab}(T) = E[∂_a ℓ(t,u) ∂_b ℓ(t,u)],   (4.1)

where ℓ(t,u) is the logarithm of the density function of t when the true parameter is u. The information g_{ab}(T) is a positive-semidefinite matrix depending on u. Obviously, for the statistic X, g_{ab}(X) is the Fisher information matrix. Let T(X) and S(X) be two statistics. We similarly define, by using the joint distribution of T and S, the amount g_{ab}(T,S) of information which T and S together carry. The additivity

g_{ab}(T,S) = g_{ab}(T) + g_{ab}(S)
does not hold except when T and S are independent. We define the amount of conditional information carried by T when S is known by

g_{ab}(T|S) = E_S E_{T|S}[∂_a ℓ(t|s,u) ∂_b ℓ(t|s,u)],   (4.2)

where ℓ(t|s,u) is the logarithm of the conditional density function of T conditioned on S. Then, the following relation holds,

g_{ab}(T,S) = g_{ab}(T) + g_{ab}(S|T) = g_{ab}(S) + g_{ab}(T|S).

From g_{ab}(S|T) = g_{ab}(T,S) − g_{ab}(T), we see that the conditional information denotes the amount of loss of information when we discard s from a pair of statistics s and t, keeping only t. Especially,
Δg_{ab}(T) = g_{ab}(X) − g_{ab}(T) = g_{ab}(X|T)   (4.3)

is the amount of loss of information when we keep only t(x) instead of keeping the original x. The following relations are useful for calculation,

Δg_{ab}(T) = E_T Cov[∂_a ℓ(x,u), ∂_b ℓ(x,u) | t],   (4.4)
g_{ab}(S|T) = g_{ab}(T,S) − g_{ab}(T),   (4.5)

where Cov[·|t] is the conditional covariance.
A statistic S is sufficient when g_{ab}(S) = g_{ab}(X), or Δg_{ab}(S) = 0. When S is sufficient, g_{ab}(T|S) = 0 holds for any statistic T. A statistic A is ancillary when g_{ab}(A) = 0. When A is ancillary, g_{ab}(T,A) = g_{ab}(T|A) for any T. It is interesting that, although A itself has no information, A together with another statistic T recovers the amount

g_{ab}(A|T) = g_{ab}(T,A) − g_{ab}(T)

of information. An ancillary statistic carries some information in this sense, and this is the reason why ancillarity is important in statistical inference. We call g_{ab}(A|T) the amount of information of the ancillary A relative to the statistic T.
When N independent observations x_1,...,x_N are available, the Fisher information g_{ab}(X^N) is N g_{ab}(X), N times that of one observation. When M is a curved exponential family, x̄ = Σ x_j / N is a sufficient statistic, keeping the whole information, g_{ab}(X̄) = N g_{ab}(X). Let t(x̄) be a statistic which is a function of x̄. It is said to be asymptotically sufficient of order q, when

Δg_{ab}(T) = g_{ab}(X̄) − g_{ab}(T) = O(N^{−q+1}).   (4.6)

Similarly, a statistic t(x̄) is said to be asymptotically ancillary of order q, when

g_{ab}(T) = O(N^{−q})   (4.7)

holds. (The definition of the order in the present article is different from that by Cox (1980) etc.)
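For orientation (two instances implicit in what follows): the sufficient statistic x̄ itself has Δg_{ab} = 0 and is trivially asymptotically sufficient of every order, while an efficient estimator û, whose information loss is Δg_{ab}(Û) = O(1) by (4.8) below, is asymptotically sufficient of order q = 1 in the sense of (4.6).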
4.2 Asymptotic efficiency and ancillarity
Given a consistent estimator û(x̄) in an (n,m)-curved exponential family M, we can construct the associated ancillary family A. By introducing an adequate coordinate system v in each A(u), the sufficient statistic x̄ is decomposed into two statistics (û, v̂) by x̄ = η(û, v̂). The amount Δg_{ab}(Û) of information loss of the estimator û is calculated from (4.4), by using the stochastic expansion of ∂_a ℓ(x,u), as

Δg_{ab}(Û) = N g_{aκ} g_{bλ} g^{κλ} + O(1).

Hence, when and only when A is orthogonal, i.e., g_{aκ}(u) = 0, û is first-order sufficient. In this case, û is (first-order) efficient. The loss of information of an efficient estimator û is calculated as

Δg_{ab}(Û) = (H_M^{(e)2})_{ab} + ½ (H_A^{(m)2})_{ab} + O(N^{-1}),   (4.8)

where (H_M^{(e)2}) is the square of the exponential curvature of the model M and (H_A^{(m)2}) is the square of the mixture curvature of the associated ancillary family A at v = 0. Hence, the loss of information is minimized uniformly in u iff the mixture curvature of the associated ancillary family A(u) vanishes at v = 0 for all u. In this case, the estimator û is third-order efficient in the sense of the covariance in §3. The m.l.e. is such a higher-order efficient estimator.
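In the one-dimensional case this recovers the classical Fisher-Efron information loss of the m.l.e. (we state it as a consequence of (4.8), using γ² of §2.2): since the ancillary family of the m.l.e. is mixture-flat,

Δg(Û_{m.l.e.}) = (H_M^{(e)2})_{11} + O(N^{-1}) = γ² g + O(N^{-1}).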
Among all third-order efficient estimators, does there exist one whose loss of information is minimal at all u up to the term of order N^{-1}? Is the m.l.e. such a one? This problem is related to the asymptotic efficiency of estimators of order higher than three. By using the Edgeworth expansion (3.9) and the stochastic expansion of ∂_a ℓ(x,u), we can calculate the terms, which depend on the estimator, of the information loss of order N^{-1} in geometrical terms of the related ancillary family. The loss of order N^{-1} includes a term related to the derivatives of the mixture curvature H^{(m)}_{κλa} of A in the directions of ∂_κ and ∂_a (unpublished note). From this formula, one can conclude that there exist no estimators whose loss Δg_{ab}(Û) of information is minimal up to the term of order N^{-1} at all u among all other estimators. Hence, the loss of information of the m.l.e. is not uniformly minimal at all u, when the loss is evaluated up to the term of order N^{-1}.
We have already obtained the Edgeworth expansion up to order N^{-1} of the joint distribution of (û, v̂), or equivalently of (ũ*, ṽ*), in (3.9). By integration, we have the distribution of ṽ*,

p(ṽ*; u) = n(ṽ*; g_{κλ}) {1 + 1/(6√N) K_{κλμ} h^{κλμ} + O(N^{-1})},   (4.9)

where g_{κλ}(u) and K_{κλμ}(u) depend on the coordinate system v introduced in each A(u). The information g_{ab}(V*) of ṽ* can be calculated from this. It depends on the coordinate system v, too. It is always possible to choose a coordinate system v in each A(u) such that {∂_κ} is an orthonormal system at v = 0, i.e., g_{κλ}(u) = δ_{κλ}. Then, ṽ* is first-order ancillary. It is always possible to choose a coordinate system such that K_{κλμ}(u) = 0 further holds at v = 0 in every A(u). This coordinate system is indeed given by the (α = −1/3)-normal coordinate system at v = 0. The ṽ* is second-order ancillary in this coordinate system. By evaluating the term of order N^{-1} in (4.9), we can prove that there exists in general no third-order ancillary v̂.

However, Skovgaard (1985), by using the method of Chernoff (1949), showed that one can always construct an ancillary v̂ of order q for any q by modifying v̂ successively. The q-th order ancillary v̂ is then a function of x̄ depending on N. Hence, our previous result implies only that one cannot in general construct a third-order ancillary by using a function of x̄ not depending on N, or by relying on an ancillary family A = {A(u)} not depending on N. There is no reason to stick to an ancillary family not depending on N, as Skovgaard argued.
4.3 Decomposition of information
Since (û, v̂) together are sufficient, the information lost by summarizing x̄ into û is recovered by knowing the ancillary v̂. The amount of recovered information g_{ab}(V|U) is equal to Δg_{ab}(Û). Obviously, the amount of information of v̂ relative to û does not depend on the coordinate system of A(u). In order to recover the information of order 1 in Δg_{ab}(Û), not all the components of v̂ are necessary. Some functions of v̂ can recover the full information of order 1; some other functions of v̂ will recover the information of order N^{-1}, and some others further the information of order N^{-2}. We can decompose the whole ancillary v̂ into parts according to the order of magnitude of the amount of relative information.
The tangent space T_u(A) of the ancillary subspace A(u) associated with an efficient estimator û is spanned by the n − m vectors ∂_κ. The ancillary ṽ can be regarded as a vector ṽ = ṽ^κ ∂_κ belonging to T_u(A). Now we decompose T_u(A) as follows. Let us define

H_{a_1...a_p} = ∇^{(e)}_{a_1} ⋯ ∇^{(e)}_{a_{p-1}} ∂_{a_p},  p ≥ 2,   (4.10)

which is a tensor representing the higher-order exponential curvature of the model. When p = 2, it is nothing but the exponential curvature H^{(e)}_{abκ}, and when p = 3, H_{a_1a_2a_3} represents the rate of change in the curvature H^{(e)}, and so on. For fixed indices a_1,...,a_p, H_{a_1...a_p} is a vector in T_u(S), and its projection to T_u(A) has the components

H_{a_1...a_p κ} = ⟨H_{a_1...a_p}, ∂_κ⟩.

Let T_u(A)_p (p ≥ 2) be the subspace of T_u(A) spanned by the projections of H_{a_1a_2},..., H_{a_1...a_p} onto T_u(A), and let P_p be the orthogonal projection from T_u(A) to T_u(A)_p. We call

H^{(e)}_{a_1...a_p} = (I − P_{p−1}) H_{a_1...a_p}   (4.11)

the p-th order exponential curvature tensor of the model M, where I is the identity operator. The square of the p-th order curvature is defined by

(H_M^{(e)p,2})_{ab} = ⟨H^{(e)}_{a a_2...a_p}, H^{(e)}_{b b_2...b_p}⟩ g^{a_2b_2} ⋯ g^{a_pb_p}.   (4.12)

There exists a finite p_0 such that H^{(e)}_{a_1...a_p} vanishes for p > p_0.

Now let us consider the following sequence of statistics,

T_1 = {û},  T_2 = H^{(e)}_{a_1a_2κ} ṽ^κ, ...,  T_p = H^{(e)}_{a_1...a_pκ} ṽ^κ.

Moreover, let t_a = ∂_a ℓ(x̄,û), which vanishes if û is the m.l.e. Obviously, the sequence T_1, T_2, ... gives a decomposition of the ancillary statistic ṽ = (ṽ^κ) into the higher-order curvature directions of M. Let

t^1 = T_1,  t^2 = {t, T_1, T_2}, ...,  t^p = {t^{p−1}, T_p}.
Then, we have the following theorems (see Amari (1985)).
Theorem 4.1. The set of statistics t^p is asymptotically sufficient of order p. The statistic T_p carries information of order N^{−p+2} relative to t^{p−1},

g_{ab}(T_p | t^{p−1}) = N^{−p+2} (H_M^{(e)p,2})_{ab}.   (4.13)

Theorem 4.2. The Fisher information g_{ab}(X̄) = N g_{ab}(X) is decomposed into

g_{ab}(X̄) = Σ_{p≥1} g_{ab}(T_p | t^{p−1}) = g_{ab}(Û) + Σ_{p≥2} N^{−p+2} (H_M^{(e)p,2})_{ab}.   (4.14)
The theorems imply the following. An efficient estimator û carries all the information of order N. The ancillary ṽ, which together with û carries the remaining smaller-order information, is decomposed into the sum of the p-th order curvature-direction components a_p = H^{(e)}_{a_1...a_pκ} ṽ^κ, where a_p carries all the missing information of order N^{−p+2} relative to t^{p−1}. The proof is obtained by expanding ∂_a ℓ(x̄,û), where ũ = û − u, as

∂_a ℓ(x̄,û) = ∂_a ℓ(x̄,u) + Σ_p (1/p!) ∂_{a_1} ⋯ ∂_{a_p} ∂_a ℓ(x̄,u) ũ^{a_1} ⋯ ũ^{a_p},

and by calculating g_{ab}(T_p | t^{p−1}). The information carried by ∂_{a_1} ⋯ ∂_{a_p} ∂_a ℓ(x̄,u) is equivalent to that of H^{(e)}_{a a_1...a_p κ} ṽ^κ relative to t^{p−1}, up to the necessary order.
4.4 Conditional inference
When there exists an exact ancillary statistic a, the conditionality principle requires that statistical inference should be done by conditioning on a. However, there exist no non-trivial exact ancillary statistics in many problems. Instead, there exists an asymptotically ancillary statistic v̂, which can be refined to be higher-order ancillary. The asymptotic ancillary statistic carries information of order 1, and is very useful in improving higher-order characteristics of statistical inference. For example, the conditional covariance of an efficient estimator is evaluated by

N Cov[û^a, û^b | ṽ] = (g_{ab} + H^{(e)}_{abκ} ṽ^κ)^{-1} + higher-order terms,

where g_{ab} + H^{(e)}_{abκ} ṽ^κ = −∂_a ∂_b ℓ(x̄,û) is the observed Fisher information. When two groups of independent observations are obtained, we cannot get a third-order efficient estimator for the entire set of observations by combining only the two third-order efficient estimators û_1 and û_2 for the respective samples. If we can use the asymptotic ancillaries H^{(e)}_{abκ} ṽ_1^κ and H^{(e)}_{abκ} ṽ_2^κ, we can calculate the third-order efficient estimator (see Chapter 5). Moreover, the ancillary H^{(e)}_{abκ} ṽ^κ can be used to change the characteristics of an efficient test and of an efficient interval estimator. We can obtain the third-order t_0-efficient test or interval estimator by using the ancillary for any given t_0. It is interesting that the conditional test conditioned on the asymptotic ancillary ṽ is third-order admissible and its characteristic (deficiency curve) is the same as that of the likelihood-ratio test (Kumon and Amari (1983)).

In the above discussion, it is not necessary to refine v̂ to be a higher-order asymptotic ancillary. The curvature-direction components H^{(e)}_{abκ} ṽ^κ are important, and the other components play no role. Hence, we may say that v̂ is useful not because it is (higher-order) ancillary but because it recovers necessary information. It seems that we need a more fundamental study of the invariant structures of a model to elucidate the conditionality principle and ancillarity (see Kariya (1983), Barndorff-Nielsen (1987)). There are many interesting discussions in Efron and Hinkley (1978), Hinkley (1980), Cox (1980), Barndorff-Nielsen (1980). See also Amari (1985).
5. FIBRE-BUNDLE THEORY OF STATISTICAL MODELS

5.1 Hilbert bundle of a statistical model
In order to treat general statistical models other than curved exponential families, we need the notion of the fibre bundle of a statistical model. Let M = {q(x,u)} be a general regular m-dimensional statistical model parametrized by u = (u^a). To each point u ∈ M, we associate a linear space H_u consisting of functions r(x) in x defined by

H_u = {r(x) | E_u[r(x)] = 0, E_u[r²(x)] < ∞},   (5.1)

where E_u denotes the expectation with respect to the distribution q(x,u). Intuitively, each element r(x) ∈ H_u denotes a direction of deviation of the distribution q(x,u), as follows. Let ε q̃(x) be a small disturbance of q(x,u), where ε is a small constant, yielding another distribution q(x,u) + ε q̃(x), which does not necessarily belong to M. Here, ∫ q̃(x) dP = 0 should be satisfied. The logarithm is written as

log{q(x,u) + ε q̃(x)} ≈ ℓ(x,u) + ε q̃(x)/q(x,u),

where ℓ(x,u) = log q(x,u). If we put

r(x) = q̃(x)/q(x,u),

it satisfies E_u[r(x)] = 0. Hence, r(x) ∈ H_u denotes the deviation of q(x,u) in the direction q̃(x) = r(x) q(x,u). The condition E_u[r²] < ∞ implies that we consider only deviations having a second moment. (Note that, given r(x) ∈ H_u, the function

q(x,u) + ε r(x) q(x,u)
does not necessarily represent a probability density function, because the positivity condition

q(x,u) + ε r(x) q(x,u) > 0

might be broken for some x even when ε is an infinitesimally small constant.)
We can introduce an inner product in the linear space H_u by

⟨r(x), s(x)⟩_u = E_u[r(x) s(x)]

for r(x), s(x) ∈ H_u. Thus, H_u is a Hilbert space. Since the tangent vectors ∂_a ℓ(x,u), which span T_u(M), satisfy E[∂_a ℓ] = 0 and E[(∂_a ℓ)²] = g_{aa}(u) < ∞, they belong to H_u. Indeed, the tangent space T_u(M) of M at u is a linear subspace of H_u, and the inner product defined in T_u(M) is compatible with that in H_u. Let N_u be the orthogonal complement of T_u in H_u. Then, H_u is decomposed into the direct sum

H_u = T_u + N_u.
The aggregate of all the H_u's attached to every u ∈ M, with a suitable topology,

H(M) = ∪_{u∈M} H_u,   (5.2)

is called the fibre bundle with base space M and fibre space H_u. Since the fibre space is a Hilbert space, it is called a Hilbert bundle of M. It should be noted that H_u and H_{u'} are different Hilbert spaces when u ≠ u'. Hence, it is convenient to establish a one-to-one correspondence between H_u and H_{u'} when u and u' are neighboring points in M. When the correspondence is affine, it is called an affine connection. Let us assume that a vector r(x) ∈ H_u at u corresponds to r(x) + dr(x) ∈ H_{u+du} at a neighboring point u + du, where d denotes an infinitesimally small change. From

E_{u+du}[r(x) + dr(x)] = ∫ {q(x,u) + dq(x,u)}{r(x) + dr(x)} dP
                      = E_u[r] + E_u[dr(x) + ∂_a ℓ(x,u) r(x) du^a] = 0

and E_u[r] = 0, we see that dr(x) must satisfy

E_u[dr] = − E[∂_a ℓ r] du^a,

where we neglected higher-order terms. This leads us to the following definition of the α-connection: when dr(x) is given by

dr(x) = − (1+α)/2 E[∂_a ℓ r] du^a − (1−α)/2 ∂_a ℓ r du^a,   (5.3)

the correspondence is called the α-connection. More formally, the α-connection is given by the following α-covariant derivative ∇^{(α)}. Let r(x,u) be a vector field, which attaches a vector r(x,u) to every point u ∈ M. Then, the rate of the intrinsic change of the vector r(x,u), as u changes in the direction ∂_a, is given by the α-covariant derivative

∇_a^{(α)} r = ∂_a r(x,u) − (1+α)/2 E_u[∂_a r] + (1−α)/2 ∂_a ℓ r,   (5.4)

where E[∂_a ℓ r] = − E[∂_a r] is used. The α-covariant derivative in the direction A = A^a ∂_a ∈ T_u(M) is given by

∇_A^{(α)} r = A^a ∇_a^{(α)} r.

The 1-connection is called the exponential connection, and the −1-connection is called the mixture connection.
When we attach the tangent space T_u(M) to each point u ∈ M instead of attaching the Hilbert space H_u, we have a smaller aggregate

T(M) = ∪_{u∈M} T_u(M),

which is a subset of H(M), called the tangent bundle of M. We can define an affine connection in T(M) by introducing an affine correspondence between neighboring T_u and T_{u'}. When an affine connection is given in H(M) such that r ∈ H_u corresponds to r + dr ∈ H_{u+du}, it naturally induces an affine connection in T(M) such that r ∈ T_u(M) ⊂ H_u corresponds to the orthogonal projection of r + dr ∈ H_{u+du} to T_{u+du}(M). It can easily be shown that the geometry of M is indeed that of T(M), so that the α-connection of T(M) or M, which we have defined in Chapter 2, is exactly the one which the present α-connection of H(M) naturally induces. Hence, the α-geometry of H(M) is a natural extension of that of M.

Let u = u(t) be a curve in M. A vector field r(x,t) ∈ H_{u(t)} defined along the curve is said to be α-parallel when

∇_t^{(α)} r = ṙ − (1+α)/2 E_u[ṙ] + (1−α)/2 ℓ̇ r = 0   (5.5)
is satisfied, where ṙ denotes ∂r/∂t, etc. A vector r_1(x) ∈ H_{u_1} is the α-parallel shift of r_0(x) ∈ H_{u_0} along a curve u(t) connecting u_0 = u(t_0) and u_1 = u(t_1), when r_0(x) = r(x,t_0) and r_1(x) = r(x,t_1) in the solution r(x,t) of (5.5).

The parallel shift of a vector r(x) from u to u' in general depends on the curve u(t) along which the parallel shift takes place. When and only when the curvature of the connection vanishes, the shift is defined independently of the curve connecting u and u'. We can prove that the curvature of H(M) always vanishes for the α = ±1 connections, so that the e-parallel shift (α = 1) and the m-parallel shift (α = −1) can be performed from a point u to another point u' independently of the curve. Let ^{(e)}π_u^{u'} and ^{(m)}π_u^{u'} be the e- and m-parallel shift operators from u to u'. Then, we can prove the following important theorem.

Theorem 5.1. The exponential and mixture connections of H(M) are curvature-free. Their parallel shift operators are given, respectively, by

^{(e)}π_u^{u'} r(x) = r(x) − E_{u'}[r(x)],   (5.6)
^{(m)}π_u^{u'} r(x) = {q(x,u)/q(x,u')} r(x).   (5.7)

The e- and m-connections are dual in the sense of

⟨r, s⟩_u = ⟨^{(e)}π_u^{u'} r, ^{(m)}π_u^{u'} s⟩_{u'},

where ⟨·,·⟩_u is the inner product at u.
Proof. Let c: u(t) be a curve connecting two points u = u(0) and u' = u(1). Let r^{(α)}(x,t) be an α-parallel vector field defined along the curve c. Then, it satisfies (5.5). When α = 1, it reduces to

ṙ^{(e)}(x,t) = E_{u(t)}[ṙ^{(e)}(x,t)].

Since the right-hand side does not depend on x, the solution of this equation with the initial condition r(x) = r^{(e)}(x,0) is given by

r^{(e)}(x,t) = r(x) + a(t),

where a(t) is determined from

E_{u(t)}[r^{(e)}(x,t)] = 0

as

a(t) = − E_{u(t)}[r(x)].

This yields (5.6), where we put u(1) = u'. Since E_{u'}[r(x)] does not depend on the path connecting u and u', the exponential connection is curvature-free. Similarly, when α = −1, (5.5) reduces to

ṙ^{(m)}(x,t) + r^{(m)}(x,t) ℓ̇(x,u(t)) = 0.

The solution is

r^{(m)}(x,t) q(x,u(t)) = a(x),

which yields (5.7). This shows that the mixture connection is also curvature-free. The duality relation is directly checked from (5.6) and (5.7).
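The duality check, written out in one line (supplied here for completeness): by (5.6) and (5.7), and using E_u[s] = 0,

⟨^{(e)}π_u^{u'} r, ^{(m)}π_u^{u'} s⟩_{u'} = ∫ q(x,u') {r − E_{u'}[r]} {q(x,u)/q(x,u')} s dP = ∫ q(x,u) r s dP − E_{u'}[r] E_u[s] = ⟨r, s⟩_u.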
We have defined the imbedding α-curvature H^{(α)}_{abκ} of a curved exponential family. The concept of the imbedding curvature (which is sometimes called the relative or Euler-Schouten curvature) can be defined for a general M as follows. Let P_N be the projection operator of H_u to N_u, the orthogonal complement of T_u(M) in H_u. Then, the imbedding α-curvature of M is a function in x defined by

H^{(α)}_{ab}(x,u) = P_N ∇_a^{(α)} ∂_b ℓ(x,u),

which is an element of N_u ⊂ H_u. The square of the α-curvature is given by

(H_M^{(α)2})_{ab} = ⟨H^{(α)}_{ac}, H^{(α)}_{bd}⟩ g^{cd}.   (5.8)

The scalar γ² = g^{ab} (H_M^{(1)2})_{ab} is the statistical curvature defined by Efron in the one-dimensional case.
5.2 Exponential bundle
Given a statistical model M = {q(x,u)}, we define the following elements in H_u,

X_{1a} = ∂_a ℓ(x,u),
X_{2ab} = ∇_a^{(α)} X_{1b},
...
X_{k a_1 a_2...a_k} = ∇_{a_1}^{(α)} X_{k−1, a_2...a_k},

and attach to each point u ∈ M the vector space T_u^{(α,k)} spanned by these vectors, where we assume that they are linearly independent. The aggregate

T^{(α,k)}(M) = ∪_{u∈M} T_u^{(α,k)}   (5.9)
with suitable topology is then called the α-tangent bundle of degree k of M. All the α-tangent bundles of degree 1 are the same, being merely the tangent bundle T(M) of M. In the present paper, we treat only the exponential (i.e., α = 1) tangent bundle of degree 2, which we call the local exponential bundle of degree 2, although it is immediate to generalize our results to the general α-bundle of degree k. Note that when we replace the covariant derivative ∇^{(α)} by the partial derivative ∂, we have the so-called jet bundle. Its structures are the same as those of the exponential bundle, because ∇^{(e)} reduces to ∂ in the logarithmic expression ∂_a ℓ(x,u) of tangent vectors.
The space T_u^{(1,2)}, which we will also more briefly denote by T_u^{(2)}, is spanned by the vectors X_1 and X_2, where X_1 consists of the m vectors

X_a(x,u) = ∂_a ℓ(x,u),  a = 1,...,m,

and X_2 consists of the m(m+1)/2 vectors

X_{ab}(x,u) = ∇_a^{(e)} ∂_b ℓ = ∂_a ∂_b ℓ(x,u) + g_{ab}(u),  a, b = 1,...,m.
(See Fig. 8.) We often omit the indices a or a, b in the notation X_a or X_{ab}, briefly writing X_1 or X_2. Since the space T_u^{(2)} consists of all the linear combinations of X_1 and X_2, it is written as

T_u^{(2)} = {θ^i X_i(x,u)},

where the coefficients θ = (θ^1, θ^2) consist of θ^1 = (θ^a), θ^2 = (θ^{ab}), and

θ^i X_i = θ^1 X_1 + θ^2 X_2 = θ^a X_a + θ^{ab} X_{ab}.

The set X_i forms a basis of the linear space T_u^{(2)}. The metric tensor of T_u^{(2)} is then given by

g_{ij} = ⟨X_i, X_j⟩ = E_u[X_i(x,u) X_j(x,u)].

Here, g_{11} denotes the m × m matrix

g_{11} = ⟨X_a, X_b⟩ = E_u[X_a X_b] = g_{ab},
Figure 8
which is the metric tensor of the tangent space T_u(M) of M. The component g_{21} = g_{12} represents

g_{21} = g_{abc} = ⟨X_{ab}, X_c⟩ = Γ^{(e)}_{abc}.

Similarly, g_{22} is a quantity having four indices,

g_{22} = ⟨X_{ab}, X_{cd}⟩.
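A quick check of the construction (immediate, though not made explicitly in the text): if M is itself an exponential family in a natural parameter u, then ∂_a ∂_b ℓ(x,u) = −∂_a ∂_b ψ(u) = −g_{ab}(u), so that X_{ab} = 0 identically; the degree-2 part of the bundle measures precisely the departure from exponentiality.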
The exponential connection can be introduced naturally in the local exponential fibre bundle T^{(2)}(M) of degree 2 by the following principle:

1) The origin of T_{u+du}^{(2)} corresponds to the point X_1 du = X_a(x,u) du^a ∈ T_u^{(2)}.

2) The basis vector X_i(x, u+du) ∈ T_{u+du}^{(2)} is mapped to T_u^{(2)} by 1-parallelly shifting it in the Hilbert bundle H(M) and then projecting it to T_u^{(2)}.

We thus have the affine correspondence of elements in T_{u+du}^{(2)} and T_u^{(2)},

X_i(u + du) ↦ X_i(u) + dX_i = X_i(u) + Γ_{ai}^j X_j(u) du^a,

where the Γ_{ai}^j are the coefficients of the exponential affine connection in T^{(2)}(M). The coefficients are given from the above principle 2) by

Γ_{a1}^1 = 0,  Γ_{a1}^2 = δ,  Γ_{a2}^i = g^{ij} E[X_j ∂_a ∂_b ∂_c ℓ(x,u)],   (5.10)

where δ denotes the Kronecker symbol identifying ∇_a^{(e)} X_b with X_{ab}. We remark again that the index i = 1 stands for a single index b, for example, and i = 2 stands for a pair of indices, for example b, c.
Let $\theta(u) = \theta^i(u)X_i(x,u) \in T_u^{(2)}$ be a point in $T_u^{(2)}$. We can shift the point $\theta(u) \in T_u^{(2)}$ to a point $\theta(u') \in T_{u'}^{(2)}$ belonging to another point u' along a curve u = u(t). Since the point $\theta^i(u)X_i(u) \in T_u^{(2)}$ corresponds to the point $\theta^i(u+du)(X_i + dX_i) + X_a\,du^a \in T^{(2)}_{u+du}$, where $dX_i$ is determined from the affine connection and the last term $X_a\,du^a$ corresponds to the change in the origin, we have the following equation
$$\dot\theta^i + \Gamma^i_{aj}\theta^j\dot u^a + \delta^i_a\dot u^a = 0\,, \qquad (5.11)$$
whose solution θ(t) represents the corresponding point in $T^{(2)}_{u(t)}$, where $\dot\theta^i = \frac{d}{dt}\theta^i(t)$. Note that we are here talking about the parallel shift of a point in affine spaces, and not about the parallel shift of a vector in linear spaces, where in the latter case the origin is always fixed.
Let u' be a point close to u. Let θ(u';u) be the point in $T_u^{(2)}$ corresponding to the origin θ(u') = 0 of the affine space $T_{u'}^{(2)}$. The map depends in general on the curve connecting u and u'. However, when |u' − u| is small, the point θ(u';u) is given by
$$\theta^i(u';u) = \theta^i_1(u'-u) + \tfrac12\,\theta^i_2(u'-u)^2 + O(|u'-u|^3)\,.$$
Hence, if we neglect the term of order $|u'-u|^3$, the map does not depend on the route. In component form,
$$\theta_1(u';u) = \theta^a(u';u) = u'^a - u^a\,, \qquad \theta_2(u';u) = \theta^{bc}(u';u) = \tfrac12(u'^b - u^b)(u'^c - u^c)\,, \qquad (5.12)$$
where we neglected the term of order $|u'-u|^3$. Since the origin θ(u') = 0 of $T_{u'}^{(2)}$ can be identified with the point u' (the distribution q(x,u')) in the model M, this shows that, in the neighborhood of u, the model M is approximately represented in $T_u^{(2)}$ as a paraboloid given by (5.12).
Let us consider the exponential family $E_u = \{p(x,\theta;u)\}$ depending on u, whose density function is given by
$$p(x,\theta;u) = q(x,u)\exp\{\theta^i X_i(x,u) - \psi(\theta)\}\,, \qquad (5.13)$$
where θ is the natural parameter. We can identify the affine space $T_u^{(2)}$ with the exponential family $E_u$ by letting the point $\theta = \theta^i X_i \in T_u^{(2)}$ represent the distribution $p(x,\theta;u) \in E_u$ specified by θ.

Figure 9

We call $E_u$ the local exponential
family approximating M at u. The aggregate
$$E^{(2)}(M) = \bigcup_{u\in M} E_u$$
with suitable topology is called the fibre bundle of local exponential families of degree 2 of M. The metric and connection may be defined from the resulting identification of $E^{(2)}(M)$ with $T^{(2)}(M)$. The distribution q(x,u) exactly corresponds to the distribution p(x,0;u) in $E_u$, i.e., to the origin θ = 0 of $E_u$ or $T_u^{(2)}$. Hence, the point θ = θ(u';u), which is the parallel shift of θ(u') = 0 at $E_{u'}$, is the counterpart in $E_u$ of $q(x,u') \in M$; i.e., the distribution $p\{x,\theta(u';u);u\} \in E_u$ is an approximation in $E_u$ of $q(x,u') \in M$. For a fixed u, the distributions
$$\hat M_u = \{q(x,u';u)\}\,, \qquad q(x,u';u) = p\{x,\theta(u';u);u\}$$
form an m-dimensional curved exponential family imbedded in $E_u$ (Fig. 9). The point of this construction is that M is approximated by the curved exponential family $\hat M_u$ in the neighborhood of u. The tangent spaces $T_u(M)$ of M and $T_u(\hat M_u)$ of $\hat M_u$ exactly correspond at u, so that their metric structures are the same at u. Moreover, the squares of the imbedding curvatures are the same for both M and $\hat M_u$ at u, because the curvature is obtained from the second covariant
derivative of $X_a = \partial_a \ell$. This suggests that we can solve statistical inference problems in the curved exponential family $\hat M_u$ instead of in M, provided u is sufficiently close to the true parameter $u_0$.
5.3. Statistical inference in a local exponential family
Given N independent observations $x_{(1)},\dots,x_{(N)}$, we can define the observed point $\hat X(u) \in E_u$, for each u, by
$$\hat X_i(u) = \bar X_i(u) = \frac1N \sum_{j=1}^{N} X_i(x_{(j)},u)\,. \qquad (5.14)$$
We consider estimators based on the statistics $\hat X(u)$. We temporarily fix a point u, and approximate the model M by $\hat M_u$, which is a curved exponential family imbedded in $E_u$. Let e be a mapping from $E_u$ to $\hat M_u$ that maps the observed $\hat X(u) \in E_u$ to the estimated value e(u) in $\hat M_u$ when u is fixed; we denote this by
$$e(u) = e\{\hat X(u);u\}\,.$$
The estimated value depends on the point u at which M is approximated by $\hat M_u$. The estimator e defines the associated ancillary family $A = \{A_u(u')\}$, $u' \in \hat M_u$, for every u, where
$$A_u(u') = e^{-1}(u';u) = \{\eta \in E_u \mid e(\eta;u) = u'\}\,.$$
When the fixed u is equal to the true parameter $u_0$, $\hat M_{u_0}$ approximates M very well in the neighborhood of $u_0$. However, we do not know $u_0$. To get an estimator $\hat u$ from e, let us consider the equation
$$e\{\hat X(u);u\} = u\,.$$
The solution $\hat u$ of this equation is a statistic. It implies that, when M is approximated at $\hat u$, the value of the estimator e at $E_{\hat u}$ is exactly equal to $\hat u$. The characteristics of the estimator $\hat u$ associated with the estimator e in $\hat M_u$ are given by the following geometrical theorems, which are direct extensions of the theorems in the curved exponential family.
Theorem 5.2. An estimator $\hat u$ derived from e is first-order efficient when the associated ancillary family A is orthogonal to $\hat M_u$. A first-order efficient estimator is second-order efficient.
Theorem 5.3. The third-order term of the covariance of a bias-corrected efficient estimator is given by
$$g_{3ab} = \tfrac12\,(\Gamma^{(m)})^2_{ab} + (H^{(e)}_M)^2_{ab} + \tfrac12\,(H^{(m)}_A)^2_{ab}\,.$$
The bias-corrected maximum likelihood estimator is third-order efficient, because the associated ancillary family has vanishing mixture curvature.
The proof is obtained along the lines sketched in the following. The true distribution $q(x,u_0)$ is identical with the distribution $q(x,u_0;u_0)$ at $u_0$ of the curved exponential family $\hat M_{u_0}$. Moreover, when we expand q(x,u) and $q(x,u;u_0)$ at $u_0$ in Taylor series, they exactly coincide up to the terms of $u-u_0$ and $(u-u_0)^2$, because $E_u$ is composed of $X_1$ and $X_2$. Hence, if the estimation is performed in $E_{u_0}$, we can easily prove that Theorems 5.2 and 5.3 hold, because the Edgeworth expansion of the distribution of $\hat u$ is determined from the expansion of $\ell(x,u)$ up to the second order if the bias correction is used. However, we do not know the true $u_0$, so that the estimation is performed in $E_{\hat u}$. In order to evaluate the estimator $\hat u$, we can map $E_{\hat u}$ (and $\hat M_{\hat u}$) to $E_{u_0}$ by the exponential connection. In estimating the true parameter, we first summarize the N observations into $\hat X(u)$, which is a vector function of u, and then decompose it into the statistics $\hat X(\hat u) = \{\hat X_1(\hat u),\hat X_2(\hat u)\}$, where $e(\hat X(\hat u);\hat u) = \hat u$. The $\hat X_2(\hat u)$ becomes an asymptotic ancillary. When the estimator is the m.l.e., we have $\hat X_1(\hat u) = 0$ and $\hat X_{2ab}(\hat u) = H^{(e)}_{ab\kappa}\hat v^\kappa$ in $\hat M_{\hat u}$. The theorems can be proved by calculating the Edgeworth expansion of the joint distribution of $\hat X(\hat u)$ or $(\hat u,\hat v)$. The result is the same as before.
We have assumed that our estimator e is based on $\hat X(u)$. When a general estimator
$$u' = f(x_{(1)},\dots,x_{(N)})$$
is given, we can construct the related estimator given by the solution of $e_f(\hat X;u) = u$, where
$$e_f(\hat X;u) = E_u[f(x_{(1)},\dots,x_{(N)}) \mid \hat X(u) = \hat X]\,.$$
Obviously, $e_f(\hat X;u)$ is the conditional expectation of u' given $\hat X(u) = \hat X$. By virtue of the asymptotic version of the Rao-Blackwell theorem, the behavior of $e_f$ is equal to or better than that of u' up to the third order. This guarantees the
validity of the present theory.
The problem of testing the null hypothesis $H_0: u = u_0$ against $H_1: u \ne u_0$ can be solved immediately in the local exponential family $E_{u_0}$. When $H_0$ is not simple, we can also construct a similar theory by the use of the statistics $\hat u$ and $\hat X(\hat u)$. It is possible to evaluate the behaviors of various third-order efficient tests. The result is again the same as before.
We finally treat the problem of getting a better estimator $\hat u$ by gathering asymptotically sufficient statistics $\hat X(u)$ from a number of independent samples which are subject to the same distribution $q(x,u_0)$ in the same model. To be specific, let $x_{(1)1},\dots,x_{(1)N}$ and $x_{(2)1},\dots,x_{(2)N}$ be two independent samples, each consisting of N independent observations. Let $\hat u_1$ and $\hat u_2$ be the m.l.e.'s based on the respective samples. Let $\hat X_{(i)}(\hat u_i)$ be the observed point in $E_{\hat u_i}$, i = 1, 2. The statistic $\hat X_{(i)}$ consists of the two components $\hat X_{(i)1} = (\hat X_{(i)a})$ and $\hat X_{(i)2} = (\hat X_{(i)ab})$. Since $\hat u_i$ is the m.l.e.,
$$\hat X_{(i)1}(\hat u_i) = 0$$
is satisfied. The statistic $\hat u_i$ carries the whole information of order N included in the sample, and the statistic $\hat X_2(\hat u_i)$, which is asymptotically ancillary, carries the whole information of order 1 together with $\hat u_i$. Obviously $\hat X_{(i)2}$ is the curvature-direction component statistic, $\hat X_{(i)2ab} = H^{(e)}_{ab\kappa}\hat v^\kappa$, in the curved exponential family $E_{\hat u_i}$.

Given the two sets of statistics $(\hat u_i, \hat X_{(i)2}(\hat u_i))$, i = 1, 2, which summarize the original data, the problem is to obtain an estimator $\hat u$ which is third-order efficient for the 2N observations. Since the two statistics $\hat X(\hat u_i)$ give points $\hat X_{(i)} = \hat X(\hat u_i)$ in the different $E_{\hat u_i}$, in order to summarize them it is necessary to shift these points in parallel to a common $E_{u'}$. Then, we can average the two observed points in the common $E_{u'}$ and get an estimator $\hat u$ in this $E_{u'}$. The parallel affine shift of a point in $E_u$ to a different $E_{u'}$ has already been given by (5.11) in the θ-coordinate system. This can be rewritten in the η-coordinate system. In particular, when du = u − u' is of order $N^{-1/2}$ and η(u) is also of order $N^{-1/2}$, the parallel affine shift of $\eta(u) \in E_u$ to $E_{u'}$ is
given in the following expanded form for $\eta = (\eta_1,\eta_2)$, $\eta_1 = (\eta_a)$ and $\eta_2 = (\eta_{ab})$:
$$\eta_a(u') = \eta_a(u) + g_{ab}\,du^b - \eta_{ab}(u)\,du^b + \tfrac12\,\Gamma^{(m)}_{abc}\,du^b du^c + O(N^{-3/2})\,,$$
$$\eta_{ab}(u') = \eta_{ab}(u) + O(N^{-1})\,.$$
Now, we shift the two observed points $\hat X_{(i)}(\hat u_i)$ to a common $E_{u'}$, where u' may be any point between $\hat u_1$ and $\hat u_2$, because the same estimator $\hat u$ is obtained up to the necessary order by using any $E_{u'}$. Here, we simply put
$$u' = (\hat u_1 + \hat u_2)/2\,,$$
and let d be
$$d = (\hat u_1 - \hat u_2)/2\,.$$
Then, the point $\hat X_{(1)}(\hat u_1)$ is shifted to $\tilde X_{(1)}(u')$ of $E_{u'}$ as
$$\tilde X_{(1)a} = \hat X_{(1)a} + g_{ab}d^b - \hat X_{(1)ab}d^b + \tfrac12\,\Gamma^{(m)}_{abc}d^b d^c + O(N^{-3/2})\,,$$
and we get similar expressions for $\tilde X_{(2)}$ by changing d to −d. Since $\hat u_i$ is the m.l.e., $\hat X_{(i)1} = 0$. The average of $\tilde X_{(1)}$ and $\tilde X_{(2)}$ in the common $E_{u'}$ gives the estimated observed point $\tilde X(u') = (\tilde X_1, \tilde X_2)$ from the pooled statistics $(\hat u_i, \hat X_{(i)2}(\hat u_i))$:
$$\tilde X_{1a} = \tfrac12\,(\hat X_{(2)ab} - \hat X_{(1)ab})d^b + \tfrac12\,\Gamma^{(m)}_{abc}d^b d^c\,,$$
$$\tilde X_{2ab} = \tfrac12\,(\hat X_{(2)ab} + \hat X_{(1)ab})\,.$$
By taking the m.l.e. in $E_{u'}$ based on $(\tilde X_1, \tilde X_2)$, we have the estimator
$$\hat u^a = u'^a + \tfrac12\,g^{ab}(\hat X_{(2)bc} - \hat X_{(1)bc})d^c + \tfrac12\,g^{ab}\Gamma^{(m)}_{cdb}d^c d^d\,,$$
which indeed coincides with that obtained from the equation $e(\hat u) = \hat u$ up to the third order. Therefore, the estimator $\hat u$ is third-order efficient, so that it coincides with the m.l.e. based on all 2N observations up to the necessary order.

The above result can be generalized to the situation where k asymptotically sufficient statistics $(\hat u_i, \hat X_{(i)2}(\hat u_i))$ are given in $E_{\hat u_i}$, i = 1,...,k, $\hat u_i$ being the m.l.e. from $N_i$ independent observations. Let
$$u' = \sum_i N_i\hat u_i \Big/ \sum_i N_i\,.$$
Moreover, we define the following matrices:
$$G_{iab} = N_i\{g_{ab}(u') + \Gamma^{(m)}_{abc}(\hat u_i^c - u'^c) - \hat X_{(i)ab}\}\,, \qquad G_{ab} = \sum_i G_{iab}\,, \qquad (G^{ab}) = (G_{ba})^{-1}\,.$$
Then, we have the following theorem.
Theorem 5.4. The bias-corrected version of the estimator defined by the weighted average
$$\hat u^a = G^{ab}\sum_i G_{ibc}\,\hat u_i^c$$
is third-order efficient.
This theorem shows that the best estimator is given by the weighted average of the estimators from the partial samples, where the weights are given by $G_{iab}$. It is interesting that $G_{iab}$ is different from the observed Fisher information matrix
$$J_{iab} = -\sum \partial_a\partial_b\,\ell(x_{(i)},u')\,.$$
They are related by
$$G_{iab} = J_{iab} + \tfrac12\,N_i\,\Gamma^{(m)}_{abc}(\hat u_i^c - u'^c)\,.$$
See Akahira and Takeuchi [1981] and Amari [1985].
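As a concrete illustration of this pooling scheme in its simplest, first-order form, the following Python sketch combines the m.l.e.'s of several independent exponential samples by an observed-information-weighted average and compares the result with the m.l.e. computed from all observations at once. This is a hypothetical numerical check, not part of the original development: the curvature correction $\Gamma^{(m)}$ entering $G_{iab}$ is model-specific and is omitted here, so the weights reduce to the observed informations $J_i$.

    import numpy as np

    rng = np.random.default_rng(0)
    theta0 = 2.0                      # true rate of the Exp(theta) model
    samples = [rng.exponential(1.0/theta0, size=n) for n in (50, 80, 120)]

    # Per-sample m.l.e. and observed information J_i = n_i / theta_i^2
    # (from l_i(theta) = n_i log theta - theta * sum(x)).
    theta_i = np.array([len(x)/x.sum() for x in samples])
    J_i = np.array([len(x)/t**2 for x, t in zip(samples, theta_i)])

    # Information-weighted average of the partial-sample estimators.
    theta_pooled = (J_i * theta_i).sum() / J_i.sum()

    # m.l.e. from the pooled raw data, for comparison.
    all_x = np.concatenate(samples)
    theta_full = len(all_x) / all_x.sum()
    print(theta_pooled, theta_full)   # agree to the expected order

The point of the design is that only the summary statistics $(\hat u_i, J_i)$, not the raw samples, are needed to reconstruct an efficient estimator.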
6. ESTIMATION OF STRUCTURAL PARAMETER IN THE PRESENCE
OF INFINITELY MANY NUISANCE PARAMETERS
6.1. Estimating function and asymptotic variance
Let M = {p(x;θ,ξ)} be a family of probability density functions of a (vector) random variable x specified by two scalar parameters θ and ξ. Let $x_1, x_2,\dots,x_N$ be a sequence of independent observations such that the i-th observation $x_i$ is a realization from the distribution $p(x;\theta,\xi_i)$, where both θ and $\xi_i$ are unknown. In other words, the distributions of the $x_i$ are assumed to be specified by the common fixed but unknown parameter θ and also by the unknown parameter $\xi_i$ whose value changes from observation to observation. We call θ the structural parameter and ξ the incidental or nuisance parameter. The problem is to find the asymptotically best estimator $\hat\theta_N = \hat\theta_N(x_1,x_2,\dots,x_N)$ of the structural parameter θ when the number N of observations is large. The asymptotic variance of a consistent estimator is defined by
$$AV(\hat\theta,\Xi) = \lim_{N\to\infty} V[\sqrt N\,(\hat\theta_N - \theta)]\,, \qquad (6.1)$$
where V denotes the variance and Ξ denotes an infinite sequence Ξ = (ξ₁,ξ₂,...) of the nuisance parameters. An estimator $\hat\theta$ is said to be best in a class C of estimators when its asymptotic variance satisfies, at any θ,
$$AV[\hat\theta,\Xi] \le AV[\hat\theta',\Xi]$$
for all allowable Ξ and for any estimator $\hat\theta' \in C$. Obviously, there does not necessarily exist a best estimator in a given class C.
Now we restrict our attention to some classes of estimators. An estimator $\hat\theta$ is said to belong to the class $C_0$ when it is given by the solution of the equation
$$\sum_{i=1}^{N} y(x_i,\theta) = 0\,,$$
where y(x,θ) is a function of x and θ only, i.e., it does not depend on ξ. The function y is called the estimating function. Let $C_1$ be the subclass of $C_0$ consisting of all the consistent estimators in $C_0$. The following theorem is well known (see, e.g., Kumon and Amari [1984]).
Theorem 6.1. An estimator $\hat\theta \in C_0$ is consistent if and only if its estimating function y satisfies
$$E_{\theta,\xi}[y(x,\theta)] = 0\,, \qquad E_{\theta,\xi}[\partial_\theta y(x,\theta)] \ne 0\,,$$
where $E_{\theta,\xi}$ denotes the expectation with respect to p(x;θ,ξ) and $\partial_\theta = \partial/\partial\theta$. The asymptotic variance of an estimator $\hat\theta \in C_1$ is given by
$$AV(\hat\theta,\Xi) = \lim_{N\to\infty}\,\frac1N\sum_i V[y(x_i,\theta)]\Big/\Big\{\frac1N\sum_i E[\partial_\theta y(x_i,\theta)]\Big\}^2\,,$$
where $\frac1N\sum_i\partial_\theta y(x_i,\theta)$ is assumed to converge to a constant depending on θ and Ξ.
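The content of Theorem 6.1 is easy to check by simulation. The following minimal Python sketch (with an assumed exponential model, so that the score y(x,θ) = 1/θ − x is an estimating function) solves the estimating equation numerically and compares the Monte Carlo variance of $\sqrt N(\hat\theta_N - \theta)$ with the ratio of the two limits appearing in the theorem; for this model both equal θ².

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(1)
    theta0, N, reps = 2.0, 400, 2000

    def y(x, th):                 # estimating function: score of Exp(theta)
        return 1.0/th - x

    est = []
    for _ in range(reps):
        x = rng.exponential(1/theta0, size=N)
        # solve sum_i y(x_i, theta) = 0, i.e. 1/theta = mean(x)
        est.append(brentq(lambda th: y(x, th).sum(), 1e-3, 100.0))
    est = np.asarray(est)

    av_mc = N * est.var()         # Monte Carlo asymptotic variance
    # Theorem 6.1: AV = V[y] / {E[d_theta y]}^2
    #            = (1/theta^2) / (1/theta^2)^2 = theta^2 here
    print(av_mc, theta0**2)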
Let $H_{\theta,\xi}(M)$ be the Hilbert space attached to a point (θ,ξ) ∈ M,
$$H_{\theta,\xi}(M) = \{a(x) \mid E_{\theta,\xi}[a] = 0\,,\ E_{\theta,\xi}[a^2] < \infty\}\,.$$
The tangent space $T_{\theta,\xi}(M) \subset H_{\theta,\xi}(M)$ is spanned by $u(x;\theta,\xi) = \partial_\theta\,\ell(x;\theta,\xi)$ and $v(x;\theta,\xi) = \partial_\xi\,\ell(x;\theta,\xi)$. Let w be
$$w(x;\theta,\xi) = u - \frac{\langle u,v\rangle}{\langle v^2\rangle}\,v\,,$$
where $\langle v^2\rangle = \langle v,v\rangle$. Then, the partial information $g_{\theta\theta\cdot\xi}$ is given by
$$g_{\theta\theta\cdot\xi} = g_{\theta\theta} - g_{\theta\xi}^2/g_{\xi\xi} = \langle w^2\rangle\,,$$
where $g_{\theta\theta} = \langle u^2\rangle$, $g_{\xi\xi} = \langle v^2\rangle$, $g_{\theta\xi} = \langle u,v\rangle$ are the components of the Fisher information matrix. The theorem shows that the estimating function y(x,θ) of a consistent estimator belongs to $H_{\theta,\xi}$ for any ξ. Hence, it can be decomposed as
$$y(x,\theta) = \alpha(\theta,\xi)\,u(x;\theta,\xi) + \beta(\theta,\xi)\,v(x;\theta,\xi) + n(x;\theta,\xi)\,,$$
where n belongs to the orthogonal complement of $T_{\theta,\xi}$ in $H_{\theta,\xi}$, i.e.,
$$\langle u,n\rangle = \langle v,n\rangle = 0\,.$$
The class $C_1$ is often too large to guarantee the existence of the best estimator. A consistent estimator is said to be uniformly informative
(Kumon and Amari, 1984) when its estimating function y(x,θ) can be decomposed as
$$y(x,\theta) = w(x;\theta,\xi) + n(x;\theta,\xi)\,.$$
The class of the uniformly informative estimators is denoted by $C_{UI}$. A uniformly informative estimator satisfies
$$\langle y,w\rangle_{\theta,\xi} = \langle w^2\rangle_{\theta,\xi} = g_{\theta\theta\cdot\xi}(\theta,\xi)\,.$$
Let $C_{IU}$ be the class of the information unbiased estimators introduced by Lindsay [1982], which satisfy a similar relation,
$$\langle y,w\rangle_{\theta,\xi} = \langle y^2\rangle_{\theta,\xi}\,.$$
Note that $\langle y,w\rangle = \langle y,u\rangle$ holds.
Let us define the two quantities
$$\bar g_y(\Xi) = \lim_{N\to\infty}\frac1N\sum_i\langle y(x;\theta,\xi_i)^2\rangle\,,$$
which depends on the estimating function y(x,θ), and
$$\bar g(\Xi) = \lim_{N\to\infty}\frac1N\sum_i g_{\theta\theta\cdot\xi}(\theta,\xi_i)\,,$$
which latter is common to all the estimators. Then, the following theorem gives a new bound for the asymptotic variance in the class $C_{IU}$ (see Kumon and Amari (1984)).
Theorem 6.2. For an information unbiased estimator $\hat\theta$,
$$AV[\hat\theta;\Xi] = \bar g^{-1} + \bar g^{-2}\,\bar g_n\,,$$
where $\bar g_n$ is defined as $\bar g_y$ with the orthogonal part n of y in place of y.
We go beyond this theory by the use of Hilbert bundle theory.
6.2. Information, nuisance and orthogonal subspaces
We have already defined the exponential and mixture covariant derivatives $\nabla^{(e)}$ and $\nabla^{(m)}$ in the Hilbert bundle $H(M) = \bigcup_{\theta,\xi} H_{\theta,\xi}(M)$. A field $r(x;\theta,\xi) \in H_{\theta,\xi}(M)$ defined at all (θ,ξ) is said to be e-invariant when $\nabla^{(e)}_\xi r = 0$ holds. A field $r(x;\theta,\xi)$ is said to be strongly e-invariant (se-invariant) when r does not depend on ξ. An se-invariant field is e-invariant. An estimating function y(x,θ) belonging to $C_1$ is an se-invariant field, and conversely, an se-invariant y(x,θ) gives a consistent estimator, provided $\langle u,y\rangle \ne 0$. Hence, the problem of the existence of a consistent estimator in $C_0$ reduces to
the problem of the existence of an se-invariant field in the Hilbert bundle
H(M).
We next define the subspace $H^T_{\theta,\xi}$ of $H_{\theta,\xi}$ by
$$H^T_{\theta,\xi} = \{(\Pi^{(m)})_{\xi'\to\xi}\,a(x) \mid a(x) \in T_{\theta,\xi'}\}\,,$$
i.e., the subspace composed of all the m-parallel shifts to (θ,ξ) of the vectors belonging to the tangent spaces $T_{\theta,\xi'}$ at all (θ,ξ') with common θ. Then, $H_{\theta,\xi}$ is decomposed into the direct sum
$$H_{\theta,\xi} = H^T_{\theta,\xi} \oplus H^O_{\theta,\xi}\,,$$
where $H^O_{\theta,\xi}$ is the orthogonal complement of $H^T_{\theta,\xi}$. We call $H^O_{\theta,\xi}$ the orthogonal subspace at (θ,ξ). We next define the nuisance subspace $H^N_{\theta,\xi}$ at (θ,ξ), spanned by the m-parallel shifts $(\Pi^{(m)})_{\xi'\to\xi}\,v$ from (θ,ξ') to (θ,ξ) of the ξ-score vectors $v(x;\theta,\xi') = \partial_\xi\,\ell$ for all ξ'. It is a subspace of $H^T_{\theta,\xi}$, so that we have the decomposition
$$H^T_{\theta,\xi} = H^I_{\theta,\xi} \oplus H^N_{\theta,\xi}\,,$$
where $H^I_{\theta,\xi}$ is the orthogonal complement of $H^N_{\theta,\xi}$ in $H^T_{\theta,\xi}$. It is called the information subspace at (θ,ξ). Hence, any vector $r(x;\theta,\xi) \in H_{\theta,\xi}$ can uniquely be decomposed into the sum
$$r(x;\theta,\xi) = r^I(x;\theta,\xi) + r^N(x;\theta,\xi) + r^O(x;\theta,\xi)\,, \qquad (6.2)$$
where $r^I \in H^I_{\theta,\xi}$, $r^N \in H^N_{\theta,\xi}$ and $r^O \in H^O_{\theta,\xi}$ are called respectively the I-, N- and O-parts of r.
We now define some important vectors. Let us first decompose the θ-score vector $u = \partial_\theta\,\ell \in H_{\theta,\xi}$ into the three components. Let $u^I(x;\theta,\xi) \in H^I_{\theta,\xi}$ be the I-part of the θ-score $u \in T_{\theta,\xi}$. We next define the vector
$$\bar u(x;\theta,\xi;\xi') = (\Pi^{(m)})_{\xi'\to\xi}\,u(x;\theta,\xi') \qquad (6.3)$$
in $H_{\theta,\xi}$, which is the m-shift of the θ-score vector $u \in T_{\theta,\xi'}$ from (θ,ξ') to (θ,ξ). Let $\bar u^I$ be its I-part. The vectors $\bar u^I(x;\theta,\xi;\xi')$ in $H^I_{\theta,\xi}$, where (θ,ξ) is fixed, form a curve parametrized by ξ' in the information subspace $H^I_{\theta,\xi}$. When all of the $g^{\theta\theta}(\xi')\,\bar u^I(x;\theta,\xi;\xi') \in H^I_{\theta,\xi}$ lie in a hyperplane in $H^I_{\theta,\xi}$ for all ξ', we say that the $\bar u^I$ are coplanar. In this case, there exists a vector $w^I \in H^I_{\theta,\xi}$ for which
$$\langle w^I,\ \bar u^I(x;\theta,\xi;\xi')\rangle = g_{\theta\theta\cdot\xi}(\xi') \qquad (6.4)$$
holds for any ξ'. The vector $w^I(x;\theta,\xi) \in H^I_{\theta,\xi}$ is called the information vector. When it exists, it is unique.
6.3. Existence theorems and optimality theorems
It is easy to show that a field $r(x;\theta,\xi)$ is se-invariant if its nuisance part $r^N$ vanishes identically. Hence, any estimating function $y(x,\theta) \in C_1$ is decomposed into the sum
$$y(x,\theta) = y^I(x;\theta,\xi) + y^O(x;\theta,\xi)\,.$$
We can prove the following existence theorems.
Theorem 6.3. The class $C_1$ of the consistent estimators is nonempty if the information subspace $H^I_{\theta,\xi}$ includes a non-zero vector.
Theorem 6.4. The class $C_{UI}$ of the uniformly informative estimators in $C_1$ is nonempty if the $\bar u^I(x;\theta,\xi;\xi')$ are coplanar. All the uniformly informative estimators have the identical I-part $y^I(x;\theta,\xi)$, which is equal to the information vector $w^I(x;\theta,\xi)$.
Outline of proof of Theorem 6.3. When the class $C_1$ is nonempty, there exists an estimating function y(x,θ) in $C_1$. It is decomposed as
$$y(x,\theta) = y^I(x;\theta,\xi) + y^O(x;\theta,\xi)\,.$$
Since $y^O$ is orthogonal to the tangent space $T_{\theta,\xi}$, we have
$$\langle y^O, u\rangle = 0\,.$$
By differentiating $\langle y(x,\theta)\rangle = 0$ with respect to θ, we have
$$0 = \langle\partial_\theta y\rangle + \langle y,u\rangle = \langle\partial_\theta y\rangle + \langle y^I,u\rangle\,.$$
Since $\langle\partial_\theta y\rangle \ne 0$, we see that $y^I(x;\theta,\xi) \ne 0$, proving that $H^I_{\theta,\xi}$ includes a non-zero vector. Conversely, assume that there exists a non-zero vector a(x;θ,ξ) in $H^I_{\theta,\xi}$ for some ξ. Then, we define a vector
$$y(x;\theta,\xi') = (\Pi^{(e)})_{\xi\to\xi'}\,a(x,\theta) = a(x,\theta) - E_{\theta,\xi'}[a]$$
in each $H_{\theta,\xi'}$, by shifting a(x,θ) in parallel in the sense of the exponential connection. By differentiating $\langle a\rangle_{\theta,\xi} = E_{\theta,\xi}[a]$ with respect to ξ, we have
$$\partial_\xi\langle a\rangle = \langle\partial_\xi a\rangle + \langle a,v\rangle = 0\,,$$
because a does not include ξ and a is orthogonal to $H^N_{\theta,\xi}$. This proves
$$E_{\theta,\xi'}[a] = 0\,.$$
Hence, the above y(x;θ,ξ') does not depend on ξ', so that it is an estimating function belonging to $C_1$. Hence $C_1$ is nonempty, proving Theorem 6.3.
Outline of proof of Theorem 6.4. Assume that there exists an estimating function y(x,θ) belonging to $C_{UI}$. Then, we have
$$\langle y,\ u(x;\theta,\xi')\rangle_{\theta,\xi'} = g_{\theta\theta\cdot\xi}(\xi')\,,$$
because of $\langle y,v\rangle = 0$. Hence, when we shift y in exponential parallel and u in mixture parallel along the ξ-axis, the duality $\langle(\Pi^{(e)})y,\ (\Pi^{(m)})u\rangle = \langle y,u\rangle$ yields
$$\langle y^I(x;\theta,\xi),\ \bar u^I(x;\theta,\xi;\xi')\rangle = g_{\theta\theta\cdot\xi}(\xi')\,.$$
This shows that the $\bar u^I$ are coplanar, and the information vector $w^I$ is given by projecting y to $H^I_{\theta,\xi}$. Conversely, when the $\bar u^I$ are coplanar, there exists the information vector $w^I \in H^I_{\theta,\xi}$. We can extend it to any ξ' by shifting it in exponential parallel,
$$y(x,\theta) = (\Pi^{(e)})_{\xi\to\xi'}\,w^I\,,$$
which yields an estimating function belonging to $C_{UI}$.
The classes $C_1$ and $C_{UI}$ are sometimes empty. We will give an example later. Even when they are nonempty, the best estimators do not necessarily exist in $C_1$ and in $C_{UI}$. The following are the main theorems concerning best estimators. (See Lindsay (1982) and Begun et al. (1983) for other approaches to this problem.)
Theorem 6.5. A best estimator exists in $C_1$ iff the vector field $u^I(x;\theta,\xi)$, which is the I-part of the θ-score u, is e-invariant. The best estimating function y(x,θ) is given by the e-invariant $u^I$, which in this case is se-invariant.
Theorem 6.6. A best estimator exists in $C_{UI}$ iff the information vector $w^I(x;\theta,\xi)$ is e-invariant. The best estimating function y is given by the e-invariant $w^I$, which in this case is se-invariant.
Outline of proofs. Let $\hat\theta$ be an estimator in $C_1$ whose estimating function is y(x,θ). It is decomposed into the following sum,
$$y(x,\theta) = c(\theta,\xi)\,u^I + a^I(x;\theta,\xi) + y^O(x;\theta,\xi)\,,$$
where $u^I(\theta,\xi)$ is the projection of u(x;θ,ξ) to $H^I_{\theta,\xi}$, c(θ,ξ) is a scalar, and $a^I \in H^I_{\theta,\xi}$ is orthogonal to $u^I$ in $H^I_{\theta,\xi}$. The asymptotic variance of $\hat\theta$ is calculated as
$$AV[\hat\theta;\Xi] = \lim_{N\to\infty}\,\frac1N\sum_i(c_i^2 A_i + B_i)\Big/\Big\{\frac1N\sum_i c_i A_i\Big\}^2\,,$$
where Ξ = (ξ₁,ξ₂,...), $c_i = c(\theta,\xi_i)$, and
$$A_i = \langle u^I, u^I\rangle_i\,, \qquad B_i = \langle(a^I(x))^2\rangle + \langle(y^O)^2\rangle\,.$$
From this, we can prove that, when and only when $B_i = 0$, the estimator is uniformly best for all sequences Ξ. The best estimating function is $u^I(x;\theta,\xi)$ for Ξ = (ξ,ξ,ξ,...). Hence it is required that $u^I$ be se-invariant. This proves Theorem 6.5. The proof of Theorem 6.6 is obtained in a similar manner by using $w^I$ instead of $u^I$.
6.4. Some typical examples: nuisance exponential family
The following family of distributions,
$$p(x;\theta,\xi) = \exp\{s(x,\theta)\,\xi + r(x,\theta) - \psi(\theta,\xi)\}\,, \qquad (6.5)$$
is used frequently in the literature treating the present problem. When θ is fixed, it is an exponential family with the natural parameter ξ, admitting a minimal sufficient statistic s(x,θ) for ξ. We call this an n-exponential family. We can elucidate the geometrical structures of the present theory by applying it to this family. The tangent vectors are given by
$$u = \xi\,\partial_\theta s + \partial_\theta r - \partial_\theta\psi\,, \qquad v = s - \partial_\xi\psi\,.$$
The m-parallel shift of a(x) from (θ,ξ') to (θ,ξ) is
$$(\Pi^{(m)})_{\xi'\to\xi}\,a(x) = a(x)\exp\{(\xi' - \xi)s - \psi(\theta,\xi') + \psi(\theta,\xi)\}\,.$$
From this follows a useful Lemma.
Lemma. The nuisance subspace $H^N_{\theta,\xi}$ is composed of the random variables of the following form,
$$r^N = f[s(x,\theta)] - \bar f(\theta,\xi)\,,$$
where f is an arbitrary function and $\bar f(\theta,\xi) = E_{\theta,\xi}[f(s)]$. The I-part $a^I$ of a(x) is explicitly given as
$$a^I(x) = a(x) - E_{\theta,\xi}[a(x) \mid s(x,\theta)]\,, \qquad (6.6)$$
by the use of the conditional expectation E[a|s]. The information subspace $H^I_{\theta,\xi}$ is given by
$$H^I_{\theta,\xi} = \{h(s;\theta,\xi)(\partial_\theta s)^I + f(s;\theta,\xi)(\partial_\theta r)^I\}$$
for any f, where $h = \partial_\theta f + \xi f$.
We first show the existence of consistent estimators in $C_1$ by applying Theorem 6.3.
Theorem 6.7. The class $C_1$ of consistent estimators is nonempty in an n-exponential family, unless both $\partial_\theta s$ and $\partial_\theta r$ are functionally dependent on s, i.e., unless
$$(\partial_\theta s)^I = (\partial_\theta r)^I = 0\,.$$
On the other hand, a consistent estimator does not necessarily exist in general. We give a simple example. Let $x = (x_1,x_2)$ be a pair of random variables taking on the two values 0 and 1 with probabilities
$$P(x_1 = 0) = 1/(1 + \exp\{\theta + \xi\})\,, \qquad P(x_2 = 0) = 1/(1 + \exp\{k(\theta) + \xi\})\,,$$
where k is a known nonlinear function. The family M is of n-exponential type only when k is a linear function. We can prove that $H^I_{\theta,\xi} = \{0\}$ unless k is linear. This proves that there are no consistent estimators in this problem.
Now we can obtain the best estimator, when it exists, for the n-exponential family. The I-part of the θ-score u is given by
$$u^I(x;\theta,\xi) = \xi\,(\partial_\theta s)^I + (\partial_\theta r)^I\,.$$
It is e-invariant when and only when $(\partial_\theta s)^I = 0$.
Theorem 6.8. The optimal estimator exists in $C_1$ when and only when $(\partial_\theta s)^I = 0$, i.e., $\partial_\theta s(x,\theta)$ is functionally dependent on s. The optimal estimating function is given in this case by the conditional score $u^I = (\partial_\theta r)^I = \partial_\theta r - E[\partial_\theta r \mid s]$, and moreover the optimal estimator is information unbiased in this case.
According to Theorem 6.4, in order to guarantee the existence of uniformly informative estimators, it is sufficient to show the coplanarity of $\bar u^I(x;\theta,\xi;\xi')$, which guarantees the existence of the information vector $w^I(x;\theta,\xi) \in H^I_{\theta,\xi}$. By putting $w^I = h(s)(\partial_\theta s)^I + f(s)(\partial_\theta r)^I$, this reduces to the integro-differential equation in f,
$$\langle w^I,\ \xi'(\partial_\theta s)^I + (\partial_\theta r)^I\rangle_{\xi'} = g_{\theta\theta\cdot\xi}(\xi')\,. \qquad (6.7)$$
When the above equation has a solution f(s;θ,ξ), the $\bar u^I$ are coplanar and the information vector $w^I$ exists. Moreover, we can prove that when $(\partial_\theta r)^I = 0$, the information vector $w^I$ is e-invariant.
Theorem 6.9. The best uniformly informative estimator exists when $(\partial_\theta r)^I = 0$. The best estimating function is given by solving
$$E_{\theta,\xi'}[h(s)\,V[\partial_\theta s \mid s]] = g_{\theta\theta\cdot\xi}(\xi')/\xi'\,, \qquad (6.8)$$
where h(s;θ) does not depend on ξ' and $V[\partial_\theta s \mid s]$ is the conditional covariance.
We give another example to help understanding. Let $x = (x_1,x_2)$ be a pair of independent normal random variables, $x_1 \sim N(\xi,1)$, $x_2 \sim N(\theta\xi,1)$. Then, the logarithm of their joint density is
$$\ell(x;\theta,\xi) = -\tfrac12[(x_1 - \xi)^2 + (x_2 - \theta\xi)^2] - \log(2\pi) = \xi\,s(x,\theta) + r(x,\theta) - \psi(\theta,\xi)\,,$$
where $s(x,\theta) = x_1 + \theta x_2$, $r(x,\theta) = -(x_1^2 + x_2^2)/2$, $\psi(\theta,\xi) = \xi^2(1 + \theta^2)/2 + \log(2\pi)$. From $\partial_\theta s = x_2$, $\partial_\theta r = 0$, we have
$$(\partial_\theta s)^I = (x_2 - \theta x_1)/(1 + \theta^2)\,, \qquad (\partial_\theta r)^I = 0\,.$$
Hence, from Theorems 6.7 and 6.8, the class $C_1$ is nonempty, but the best estimator does not exist in $C_1$. Indeed, we have
$$u^I(x;\theta,\xi) = \xi(x_2 - \theta x_1)/(1 + \theta^2)\,,$$
which depends on ξ, so that it is not e-invariant. Since any vector $w^I$ in $H^I_{\theta,\xi}$ can be written as
$$w^I = h(s)(\partial_\theta s)^I$$
for some h(s;θ,ξ), the information vector $w^I(x;\theta,\xi) \in H^I_{\theta,\xi}$ can be obtained by solving (6.4) or (6.7), which reduces in the present case to
$$E_{\theta,\xi'}[h(s)(x_2 - \theta x_1)^2] = \xi'(1 + \theta^2)\,.$$
Hence, we have
$$h(s) = s/(1 + \theta^2)\,,$$
which does not depend on ξ. Therefore, there exists a best uniformly informative estimator whose estimating function is given by
$$y(x,\theta) = w^I(x,\theta) = h(s)(\partial_\theta s)^I = (x_2 - \theta x_1)(x_1 + \theta x_2)/(1 + \theta^2)^2\,,$$
or equivalently by $(x_2 - \theta x_1)(x_1 + \theta x_2)$. This coincides with the m.l.e.; it is not information unbiased.
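To see this estimating function at work, here is a hypothetical Monte Carlo sketch (a numerical check, not part of the original text): each observation carries its own nuisance value $\xi_i$, yet solving $\sum_i y(x_i,\theta) = 0$ with $y = (x_2 - \theta x_1)(x_1 + \theta x_2)$ recovers the structural parameter consistently. The expected estimating equation is proportional to $(\theta_0 - \theta)(1 + \theta\theta_0)$, so besides the consistent root there is a spurious root near $-1/\theta_0$; we bracket the positive one.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(2)
    theta0, N = 0.7, 5000
    xi = rng.uniform(0.5, 3.0, size=N)          # incidental nuisance parameters

    x1 = xi + rng.standard_normal(N)            # x1 ~ N(xi_i, 1)
    x2 = theta0 * xi + rng.standard_normal(N)   # x2 ~ N(theta*xi_i, 1)

    def psi(th):
        # estimating equation: sum of y(x_i, theta) over the sample
        return np.sum((x2 - th * x1) * (x1 + th * x2))

    theta_hat = brentq(psi, 0.0, 5.0)           # bracket the consistent root
    print(theta_hat)   # close to theta0 = 0.7 despite N nuisance parameters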
7. PARAMETRIC MODELS OF STATIONARY GAUSSIAN TIME SERIES
7.1. α-representation of the spectrum
Let $\tilde M$ be the set of all power spectrum functions S(ω) of zero-mean discrete-time stationary regular Gaussian time series, S(ω) satisfying the Paley-Wiener condition
$$\int_{-\pi}^{\pi}\log S(\omega)\,d\omega > -\infty\,.$$
The stochastic properties of a stationary Gaussian time series $\{x_t\}$, t = ..., −1, 0, 1, 2, ..., are indeed specified by its power spectrum S(ω), which is connected with the autocovariance coefficients $c_t$ by
$$c_t = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\cos\omega t\,d\omega\,, \qquad (7.1)$$
$$S(\omega) = c_0 + 2\sum_{t=1}^{\infty} c_t\cos\omega t\,, \qquad (7.2)$$
where
$$c_t = E[x_r\,x_{r+t}]$$
for any r. A power spectrum S(ω) specifies a probability measure on the sample space X = {x_t} of the stochastic processes. We study the geometrical structure of the manifold $\tilde M$ of the probability measures given by S(ω). A specific parametric model, such as the AR model $M_n^{AR}$ of order n, is treated as a submanifold imbedded in $\tilde M$.
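Relations (7.1) and (7.2) are straightforward to verify numerically. The following hypothetical Python sketch recovers $c_t$ from S(ω) by quadrature for the AR(1) spectrum, for which the autocovariances $c_t = \varphi^t/(1-\varphi^2)$ are known in closed form (the choice of model here is only for illustration).

    import numpy as np

    phi = 0.6                                   # AR(1) coefficient, |phi| < 1
    omega = np.linspace(-np.pi, np.pi, 20001)
    S = 1.0 / np.abs(1.0 - phi * np.exp(1j * omega))**2  # unit innovation variance

    for t in range(4):
        # c_t = (1/2pi) * integral of S(w) cos(wt) dw, cf. (7.1)
        c_t = np.trapz(S * np.cos(omega * t), omega) / (2 * np.pi)
        exact = phi**t / (1 - phi**2)           # known AR(1) autocovariance
        print(t, c_t, exact)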
Let us define the α-representation $\ell^{(\alpha)}(\omega)$ of the power spectrum S(ω) by
$$\ell^{(\alpha)}(\omega) = \begin{cases} -\dfrac1\alpha\,\{S(\omega)\}^{-\alpha}\,, & \alpha \ne 0\,,\\[1mm] \log S(\omega)\,, & \alpha = 0\,.\end{cases} \qquad (7.3)$$
(Remark: It would be better to define the α-representation by $-\frac1\alpha[\{S(\omega)\}^{-\alpha} - 1]$. However, calculations are easier in the former definition, although the following discussions are the same for both representations.) We impose on the members of $\tilde M$ the regularity condition that $\ell^{(\alpha)}$ can be expanded into the Fourier series, for any α, as
$$\ell^{(\alpha)}(\omega) = \theta_0^{(\alpha)} + 2\sum_{t=1}^{\infty}\theta_t^{(\alpha)}\cos\omega t\,, \qquad (7.4)$$
where
$$\theta_t^{(\alpha)} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\ell^{(\alpha)}(\omega)\cos\omega t\,d\omega\,, \qquad t = 0, 1, 2, \dots$$
We may denote the $\ell^{(\alpha)}(\omega)$ specified by $\theta^{(\alpha)} = \{\theta_t^{(\alpha)}\}$ by $\ell^{(\alpha)}(\omega;\theta^{(\alpha)})$. The infinite number of parameters $\{\theta_t^{(\alpha)}\}$ together specify a power spectrum by
$$S(\omega;\theta^{(\alpha)}) = \begin{cases}\{-\alpha\,\ell^{(\alpha)}(\omega;\theta^{(\alpha)})\}^{-1/\alpha}\,, & \alpha \ne 0\,,\\[1mm] \exp\{\ell^{(0)}(\omega;\theta^{(0)})\}\,, & \alpha = 0\,.\end{cases} \qquad (7.5)$$
Therefore, they are regarded as defining an infinite-dimensional coordinate system in $\tilde M$. We call $\theta^{(\alpha)}$ the α-coordinate system of $\tilde M$. Obviously, the −1-coordinates are given by the autocovariances, $\theta_t^{(-1)} = c_t$. The negatives of the 1-coordinates $\theta_t^{(1)}$, i.e., the Fourier coefficients of $S^{-1}(\omega)$, are denoted by $\tilde c_t$ and are called the inverse autocovariances.
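In code, passing between S(ω) and its α-coordinates is just a cosine-series transform of $\ell^{(\alpha)}$. The hypothetical sketch below (helper names are ours, assuming the definitions (7.3)-(7.5) with the $1/2\pi$ normalization above) computes a truncated set of α-coordinates and checks that the −1-coordinates reproduce the autocovariances, while the negated 1-coordinates give the inverse autocovariances.

    import numpy as np

    omega = np.linspace(-np.pi, np.pi, 20001)

    def l_alpha(S, alpha):
        """alpha-representation (7.3) of a spectrum sampled on `omega`."""
        return np.log(S) if alpha == 0 else -(S ** -alpha) / alpha

    def alpha_coords(S, alpha, tmax):
        """theta_t^{(alpha)} for t = 0..tmax, by numerical cosine transform."""
        la = l_alpha(S, alpha)
        return np.array([np.trapz(la * np.cos(omega * t), omega) / (2 * np.pi)
                         for t in range(tmax + 1)])

    phi = 0.6
    S = 1.0 / np.abs(1.0 - phi * np.exp(1j * omega))**2

    print(alpha_coords(S, -1, 3))   # the autocovariances c_t
    print(-alpha_coords(S, 1, 3))   # inverse autocovariances: 1+phi^2, -phi, 0, 0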
7.2. Geometry of parametric and non-parametric time-series models
Let M be a set of power spectra S(ω;u) which are smoothly specified by an n-dimensional parameter u = (u^a), a = 1, 2, ..., n, such that M becomes a submanifold of $\tilde M$; e.g., M could be an autoregressive model. This M is called a parametric time-series model. However, any member of $\tilde M$ can be specified by an infinite-dimensional parameter, e.g., by the α-coordinates $\theta^{(\alpha)} = \{\theta_t^{(\alpha)}\}$, t = 0, 1, ..., in the form $S(\omega;\theta^{(\alpha)})$. The following discussions are hence common to both the parametric and non-parametric models, irrespective of the dimension n of the parameter space.

We can introduce a geometrical structure in M or $\tilde M$ in the same manner as we introduced before in a family of probability distributions on
a sample space X, except that X = {x_t} is infinite-dimensional in the present time-series case (see Amari, 1983c). Let $p_T(x_1,\dots,x_T;u)$ be the joint probability density of T consecutive observations $x_1,\dots,x_T$ of a time series specified by u. Let
$$\ell_T(x_1,\dots,x_T;u) = \log p_T(x_1,\dots,x_T;u)\,.$$
Then, we can introduce in M or $\tilde M$ the following geometrical structures as before,
$$g_{ab}(u) = \lim_{T\to\infty}\frac1T\,E[\partial_a\ell_T\,\partial_b\ell_T]\,,$$
$$\Gamma^{(\alpha)}_{abc} = \lim_{T\to\infty}\frac1T\,E\Big[\Big\{\partial_a\partial_b\ell_T + \frac{1-\alpha}{2}\,\partial_a\ell_T\,\partial_b\ell_T\Big\}\partial_c\ell_T\Big]\,.$$
However, the limiting process is tedious, and we define the geometrical structure in terms of the spectral density S(ω) in the following.
Let us consider the tangent space $T_u$ at u of M or $\tilde M$, which is spanned by a finite or infinite number of basis vectors $\partial_a = \partial/\partial u^a$ associated with the coordinate system u. The α-representation of $\partial_a$ is the following function in ω,
$$\partial_a^{(\alpha)} = (\partial/\partial u^a)\,\ell^{(\alpha)}(\omega;u)\,.$$
Hence, in $\tilde M$, the basis $\partial_t^{(\alpha)}$ associated with the α-coordinates $\theta^{(\alpha)}$ is
$$\partial_t^{(\alpha)} = \begin{cases}1\,, & t = 0\,,\\ 2\cos\omega t\,, & t \ne 0\,.\end{cases}$$
Let us introduce the inner product $g_{ab}$ of $\partial_a$ and $\partial_b$ in $T_u$ by
$$g_{ab}(u) = \langle\partial_a,\partial_b\rangle = E_\alpha[\partial_a\ell^{(\alpha)}(\omega;u)\,\partial_b\ell^{(\alpha)}(\omega;u)]\,,$$
where $E_\alpha$ is the operator defined at u by
$$E_\alpha[a(\omega)] = \frac{1}{4\pi}\int\{S(\omega;u)\}^{2\alpha}\,a(\omega)\,d\omega\,.$$
The above inner product does not depend on α, and is written as
$$\langle\partial_a,\partial_b\rangle = \frac{1}{4\pi}\int\partial_a[\log S(\omega,u)]\,\partial_b[\log S(\omega,u)]\,d\omega\,. \qquad (7.6)$$
We next define the α-covariant derivative $\nabla_a^{(\alpha)}\partial_b$ of $\partial_b$ in the
direction of $\partial_a$ by the projection of $\partial_a\partial_b\,\ell^{(\alpha)}$ to $T_u$. Then, the components of the α-connection are given by
$$\Gamma^{(\alpha)}_{abc} = \langle\nabla_a^{(\alpha)}\partial_b,\ \partial_c\rangle = E_\alpha[\partial_a\partial_b\ell^{(\alpha)}\,\partial_c\ell^{(\alpha)}]\,. \qquad (7.7)$$
If we use the 0-representation, it is given by
$$\Gamma^{(\alpha)}_{abc} = \frac{1}{4\pi}\int(\partial_a\partial_b\log S - \alpha\,\partial_a\log S\,\partial_b\log S)\,\partial_c\log S\,d\omega\,.$$
From (7.4) and (7.7), we easily see that the α-connection vanishes identically in $\tilde M$ if the α-coordinate system $\theta^{(\alpha)}$ is used. Hence, we have
Theorem 7.1. The non-parametric $\tilde M$ is α-flat for any α. The α-affine coordinate system is given by $\theta^{(\alpha)}$. The two coordinate systems $\theta^{(\alpha)}$ and $\theta^{(-\alpha)}$ are mutually dual.
Since $\tilde M$ is α-flat, we can define the α-divergence from $S_1(\omega)$ to $S_2(\omega)$ in $\tilde M$. It is calculated as follows.
Theorem 7.2. The α-divergence from $S_1$ to $S_2$ is given by
$$D_\alpha(S_1,S_2) = \begin{cases}\dfrac{1}{\alpha^2}\displaystyle\int\big\{[S_2(\omega)/S_1(\omega)]^{\alpha} - 1 - \alpha\log[S_2/S_1]\big\}\,d\omega\,, & \alpha \ne 0\,,\\[2mm] \dfrac12\displaystyle\int[\log S_1(\omega) - \log S_2(\omega)]^2\,d\omega\,, & \alpha = 0\,.\end{cases}$$
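The two branches fit together continuously: with $l = \log(S_2/S_1)$, the α ≠ 0 integrand $\frac{1}{\alpha^2}(e^{\alpha l} - 1 - \alpha l)$ tends to $\frac12 l^2$ as α → 0. The hypothetical sketch below evaluates $D_\alpha$ for two AR(1) spectra by numerical quadrature, using the constants exactly as printed above (any fixed overall normalization would serve equally well for projection purposes).

    import numpy as np

    omega = np.linspace(-np.pi, np.pi, 20001)

    def ar1_spectrum(phi):
        return 1.0 / np.abs(1.0 - phi * np.exp(1j * omega))**2

    def D(alpha, S1, S2):
        """alpha-divergence of Theorem 7.2, by numerical quadrature."""
        r = S2 / S1
        if alpha == 0:
            return 0.5 * np.trapz(np.log(S1 / S2)**2, omega)
        return np.trapz(r**alpha - 1.0 - alpha * np.log(r), omega) / alpha**2

    S1, S2 = ar1_spectrum(0.3), ar1_spectrum(0.6)
    print(D(0.0, S1, S2))                     # limiting case
    print(D(1e-4, S1, S2))                    # approaches the alpha = 0 value
    print(D(-1.0, S1, S2), D(1.0, S1, S2))    # the dual divergences differ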
7.3. α-flat models
An α-model $M_n^{(\alpha)}$ of order n is a parametric model such that the α-representation of the power spectrum of a member in $M_n^{(\alpha)}$ is specified by n + 1 parameters $u = (u_k)$, k = 0, 1,...,n, as
$$\ell^{(\alpha)}(\omega;u) = u_0 + 2\sum_{k=1}^{n} u_k\cos k\omega\,.$$
Obviously, $M_n^{(\alpha)}$ is α-flat (and hence −α-flat), and u is its α-affine coordinate system.
The AR-model $M_n^{AR}$ of order n consists of the stochastic processes defined recursively by
$$\sum_{k=0}^{n} a_k\,x_{t-k} = \varepsilon_t\,,$$
where $\{\varepsilon_t\}$ is a white noise Gaussian process with unit variance and $a = (a_0, a_1,\dots,a_n)$ is the (n+1)-dimensional parameter specifying the members of $M_n^{AR}$.
Hence, it is an (n+1)-dimensional submanifold of $\tilde M$. The power spectrum S(ω;a) of the process specified by a is given by
$$S(\omega;a) = \Big|\sum_{k=0}^{n} a_k e^{ik\omega}\Big|^{-2}\,.$$
We can calculate the geometric quantities of $M_n^{AR}$ in terms of the AR-coordinate system a from the above expression.
Similarly, the MA-model $M_n^{MA}$ of order n is defined by the processes
$$x_t = \sum_{k=0}^{n} b_k\,\varepsilon_{t-k}\,,$$
where $b = (b_0, b_1,\dots,b_n)$ is the MA-parameter. The power spectrum S(ω;b) of the process specified by b is
$$S(\omega;b) = \Big|\sum_{k=0}^{n} b_k e^{ik\omega}\Big|^{2}\,.$$
The exponential model $M_n^{EXP}$ of order n introduced by Bloomfield (1973) is composed of the power spectra S(ω;e) parameterized by $e = (e_0, e_1,\dots,e_n)$ and given by
$$S(\omega;e) = \exp\Big\{e_0 + 2\sum_{k=1}^{n} e_k\cos k\omega\Big\}\,.$$
It is easy to show that the 1-representation of S(ω;a) in $M_n^{AR}$ is
$$\ell^{(1)}(\omega;a) = -S^{-1}(\omega;a) = -\Big|\sum_{k=0}^{n} a_k e^{ik\omega}\Big|^{2}\,,$$
so that the inverse autocovariances are
$$\tilde c_k = \sum_t a_t\,a_{t+k}\,, \quad k = 0, 1,\dots,n\,, \qquad \tilde c_k = 0\,, \quad k > n\,.$$
This shows that $M_n^{AR}$ is a submanifold specified by $\tilde c_k = 0$ (k > n) in $\tilde M$. Hence, it coincides exactly with the α-model $M_n^{(1)}$, although the coordinate system a is not 1-affine but curved. Similar discussions hold for $M_n^{MA}$.
Theorem 7.3. The AR-model $M_n^{AR}$ coincides with $M_n^{(1)}$, and hence is ±1-flat. The MA-model $M_n^{MA}$ coincides with $M_n^{(-1)}$, and hence is also ±1-flat. The exponential model $M_n^{EXP}$ coincides with $M_n^{(0)}$, and is 0-flat. Since it is self-dual, it is an (n+1)-dimensional Euclidean space with an orthogonal Cartesian coordinate system e.
7.4. α-approximation and α-projection
Given a parametric model $M_n = \{S(\omega;u)\}$, it is sometimes necessary to approximate a spectrum S(ω) by one belonging to $M_n$. For example, given finite observations $x_1,\dots,x_T$ of $\{x_t\}$, one tries to estimate u in the parametric model $M_n$ by obtaining first a non-parametric estimate $\hat S(\omega)$ based on $x_1,\dots,x_T$ and then approximating it by $S(\omega;u) \in M_n$. The α-approximation of $\hat S$ is the one that minimizes the α-divergence $D_\alpha[\hat S(\omega), S(\omega;u)]$, $u \in M_n$. It is well known that the −1-approximation is related to the maximum likelihood principle. As we have shown in §2, the α-approximation is given by the α-projection of $\hat S(\omega)$ to $M_n$. We now discuss the accuracy of the α-approximation. To this end, we consider a family of nested models $\{M_n\}$ such that $M_0 \subset M_1 \subset M_2 \subset \cdots \subset M_\infty = \tilde M$. The families $\{M_n^{AR}\}$, $\{M_n^{MA}\}$ and $\{M_n^{EXP}\}$ are nested models, in which $M_0$ is composed of the white noises of various powers.
Let $\{M_n^{(\alpha)}\}$ be a family of α-flat nested models, and let $S_n(\omega;u_n) \in M_n$ be the −α-approximation of S(ω), where $u_n$ is the (n+1)-dimensional parameter given by
$$\min_{u} D_{-\alpha}[S, S_n(\omega;u)] = D_{-\alpha}[S, S_n(\omega;u_n)]\,.$$
The error of the approximation by $S_n \in M_n$ is measured by the −α-divergence $D_{-\alpha}(S,S_n)$. We define
$$E_n(S) = \min_{S_n\in M_n} D_{-\alpha}(S,S_n) = D_{-\alpha}(S,S_n)\,. \qquad (7.8)$$
It is an interesting problem to find out how $E_n(S)$ decreases as n increases. We can prove the following Pythagorean relation (Fig. 10):
$$D_{-\alpha}(S,S_n) = D_{-\alpha}(S,S_{n+1}) + D_{-\alpha}(S_{n+1},S_n)\,.$$
The following theorem is a direct consequence of this relation.
Theorem 7.4. The approximation error $E_n(S)$ of S is decomposed as
$$E_n(S) = \sum_{k=n}^{\infty} D_{-\alpha}(S_{k+1},S_k)\,. \qquad (7.9)$$
Figure 10
Hence,
$$D_{-\alpha}(S,S_0) = \sum_{n=0}^{\infty} D_{-\alpha}(S_{n+1},S_n)\,.$$
The theorem is proved by the Pythagorean relation for the right triangle $S\,S_{n+1}\,S_n$ composed of the α-geodesic $S_{n+1}S_n$ included in $M_{n+1}$ and the −α-geodesic $S\,S_{n+1}$, intersecting at $S_{n+1}$ perpendicularly. The theorem shows that the approximation error $E_n(S)$ is decomposed into the sum of the −α-divergences of the successive approximations $S_k$, k = n+1,...,∞, where $S_\infty = S$ is assumed. Moreover, we can prove that the −α-approximation of $S_k$ in $M_n$ (n < k) is $S_n$. In other words, the sequence $\{S_n\}$ of the approximations of S has the property that $S_n$ is the best approximation of $S_k$ (k > n), and that the approximation error $E_n(S)$ is decomposed into the sum of the −α-divergences between the further successive approximations. This is proved from the fact that the α-geodesic in $\tilde M$ connecting two points S and S' belonging to $M_n^{(\alpha)}$ is completely included in $M_n^{(\alpha)}$ for an α-model $M_n^{(\alpha)}$.
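For α = 0 this picture collapses to Euclidean geometry: by (7.4) and Theorem 7.2, $D_0$ is (via Parseval's identity) a squared distance between the Fourier coefficients of the log-spectra, and the 0-projection onto $M_n^{EXP}$ is simply truncation of the cosine series of log S. The hypothetical sketch below verifies the Pythagorean relation numerically in this special case.

    import numpy as np

    omega = np.linspace(-np.pi, np.pi, 20001)

    def D0(S1, S2):
        # 0-divergence of Theorem 7.2
        return 0.5 * np.trapz(np.log(S1 / S2)**2, omega)

    def project_exp(S, n):
        # 0-projection onto M_n^EXP: truncate the cosine series of log S
        logS = np.log(S)
        coef = [np.trapz(logS * np.cos(omega * t), omega) / (2 * np.pi)
                for t in range(n + 1)]
        out = coef[0] + sum(2 * c * np.cos(omega * t)
                            for t, c in enumerate(coef) if t > 0)
        return np.exp(out)

    phi = 0.8
    S = 1.0 / np.abs(1.0 - phi * np.exp(1j * omega))**2  # not in any M_n^EXP
    S3, S2 = project_exp(S, 3), project_exp(S, 2)

    # Pythagorean relation: D0(S, S2) = D0(S, S3) + D0(S3, S2)
    print(D0(S, S2), D0(S, S3) + D0(S3, S2))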
Let us consider the family $\{M_n^{AR}\}$ of the AR-models. It coincides with $\{M_n^{(1)}\}$. Let $S_n$ be the −1-approximation of S. Let $c_t(S)$ and $\tilde c_t(S)$ be, respectively, the autocovariances and inverse autocovariances. Since $c_t$ and $\tilde c_t$ are the mutually dual −1-affine and 1-affine coordinate systems, the −1-approximation $S_n$ of S is determined by the following relations:
1) $c_t(S_n) = c_t(S)$, t = 0, 1, ..., n,
2) $\tilde c_t(S_n) = 0$, t = n+1, n+2, ....
This implies that the autocovariances of $S_n$ are the same as those of S up to t = n, and that the inverse autocovariances $\tilde c_t$ of $S_n$ vanish for t > n. Similar relations hold for any other α-flat nested models, where $c_t$ and $\tilde c_t$ are replaced by the dual pair of α- and −α-affine coordinates. Especially, since $\{M_n^{EXP}\}$ are the nested Euclidean submanifolds with the self-dual coordinates $\theta^{(0)}$, their properties are extremely simple.
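Conditions 1) and 2) are exactly the Yule-Walker equations: matching the first n+1 autocovariances within $M_n^{AR}$ determines the −1-approximation. A hypothetical sketch, assuming scipy is available and using an arbitrary numerical autocovariance sequence for illustration:

    import numpy as np
    from scipy.linalg import solve_toeplitz

    # Autocovariances c_0..c_n of the target spectrum; in practice
    # these come from (7.1).
    c = np.array([2.5, 1.4, 0.9, 0.5])        # c_0, c_1, c_2, c_3
    n = 3

    # Yule-Walker: solve sum_k phi_k c_{|t-k|} = c_t for t = 1..n, which
    # enforces condition 1), c_t(S_n) = c_t(S) for t <= n.
    phi = solve_toeplitz(c[:n], c[1:n + 1])
    sigma2 = c[0] - phi @ c[1:n + 1]          # innovation variance

    omega = np.linspace(-np.pi, np.pi, 4001)
    A = 1.0 - sum(p * np.exp(1j * omega * (k + 1)) for k, p in enumerate(phi))
    S_n = sigma2 / np.abs(A)**2               # the -1-approximation in M_n^AR
    print(phi, sigma2)

Condition 2) then holds automatically, since any AR(n) spectrum has $\tilde c_t = 0$ for t > n.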
We have derived some fundamental properties of α-flat nested parametric models. These properties seem to be useful for constructing the theory of estimation and approximation of time series. Although we have not discussed them here, the ARMA-models, which are not α-flat for any α, also have interesting global and local geometrical properties.
Acknowledgements
The author would like to express his sincere gratitude to Dr. M.
Kumon and Mr. H. Nagaoka for their collaboration in developing differential
geometrical theory. Some results of the present paper are due to joint work
with them. The author would like to thank Professor K. Takeuchi for his
encouragement. He also appreciates valuable suggestions and comments from the
referees of the paper.
REFERENCES
Akahira, M. and Takeuchi, K. (1981). On asymptotic deficiency of estimators
in pooled samples. Tech. Rep., Limburgs Univ. Centr., Belgium.
Amari, S. (1968). Theory of information spaces - a geometrical foundation of
the analysis of communication systems. RAAG Memoirs 4, 373-418.
Amari, S. (1980). Theory of information spaces - a differential geometrical
foundation of statistics. POST RAAG Report, No. 106.
Amari, S. (1982a). Differential geometry of curved exponential families -
curvatures and information loss. Ann. Statist. 10, 357-387.
Amari, S. (1982b). Geometrical theory of asymptotic ancillarity and condition-
al inference. Biometrika 69, 1-17.
Amari, S. (1983a). Comparisons of asymptotically efficient tests in terms of
geometry of statistical structures. Bull. Int. Statist. Inst.,
Proc. 44th Session, Book 2, 1190-1206.
Amari, S. (1983b). Differential geometry of statistical inference, Probability
Theory and Mathematical Statistics (ed. Ito, K. and Prokhorov,
J. V.), Springer Lecture Notes in Math 1021, 26-40.
Amari, S. (1983c). A foundation of information geometry. Electronics and
Communication in Japan, 66-A, 1-10.
Amari, S. (1985). Differential-Geometrical Methods in Statistics. Springer
Lecture Notes in Statistics, 28, Springer.
Amari, S. and Kumon, M. (1983). Differential geometry of Edgeworth expansions
in curved exponential family, Ann. Inst. Statist. Math. 35A,
1-24.
Atkinson, C. and Mitchell, A. F. (1981). Rao's distance measure. Sankhya A43,
345-365.
Barndorff-Nielsen, O. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, O. E. (1987). Differential and integral geometry in
statistical inference. IMS Monograph, this volume.
Bates, D. M. and Watts, D. G. (1980). Relative curvature measures of non-
linearity, J. Roy. Statist. Soc. B40, 1-25.
Beale, E. M. L. (1960). Confidence regions in non-linear estimation. J. Roy.
Statist. Soc. B22, 41-88.
Begun, J. M., Hall, W. J., Huang, W.-M. and Wellner, J. A. (1983). Informa-
tion and asymptotic efficiency in parametric-nonparametric models.
Ann. Statist. 11, 432-452.
Bhattacharya, R. N. and Ghosh, J. K. (1978). On the validity of the formal
Edgeworth expansion. Ann. Statist. 6, 434-451.
Bloomfield, P. (1973). An exponential model for the spectrum of a scalar time
series. Biometrika 60, 217-226.
Burbea, J. and Rao. C. R. (1982). Entropy differential metric, distance and
divergence measures in probability spaces: A unified approach.
J. Multivariate Anal. 12, 575-596.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Inference
(in Russian). Nauka, Moscow, translated in English (1982), AMS,
Rhode Island.
Chernoff, H. (1949). Asymptotic studentization in testing of hypotheses,
Ann. Math. Stat. 20, 268-278.
Cox, D. R. (1980). Local ancillarity. Biometrika 67, 279-286.
Csiszár, I. (1975). I-divergence geometry of probability distributions and
minimization problems. Ann. Prob. 3, 146-158.
Dawid, A. P. (1975). Discussions to Efron's paper. Ann. Statist. 3, 1231-
1234.
Dawid, A. P. (1977). Further comments on a paper by Bradley Efron. Ann.
Statist. 5, 1249.
Efron, B. (1975). Defining the curvature of a statistical problem (with
application to second order efficiency) (with Discussion). Ann.
Statist. 3, 1189-1242.
Efron, B. (1978). The geometry of exponential families. Ann. Statist. 6,
362-376.
Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the maximum
likelihood estimator: Observed versus expected Fisher information
(with Discussion). Biometrika 65, 457-487.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in
a curved exponential family. Ann. Statist. 11, 793-803.
Hinkley, D. V. (1980). Likelihood as approximate pivotal distribution.
Biometrika 67, 287-292.
Hougaard, P. (1983). Parametrization of non-linear models. J. R. Statist.
Soc. B44, 244-252.
James, A. T. (1973). The variance information manifold and the functions on it.
Multivariate Analysis (ed. Krishnaiah, P. K.), Academic Press,
157-169.
Kariya, T. (1983). An invariance approach in a curved model. Discussion paper
Ser. 88, Hitotsubashi Univ.
Kass, R. E. (1980). The Riemannian structure of model spaces: A geometrical
approach to inference. Ph.D. Thesis, Univ. of Chicago.
Kass, R. E. (1984). Canonical parametrization and zero parameter effects
curvature. J. Roy. Statist. Soc. B46, 86-92.
Kumon, M. and Amari, S. (1983). Geometrical theory of higher-order asymptotics
of test, interval estimator and conditional inference, Proc. Roy.
Soc. London A387, 429-458.
Kumon, M. and Amari, S. (1984). Estimation of structural parameter in the
presence of a large number of nuisance parameters. Biometrika 71,
445-459.
Kumon, M. and Amari, S. (1985). Differential geometry of testing hypothesis:
a higher order asymptotic theory in multi parameter curved exponen-
tial family, METR 85-2, Univ. Tokyo.
Lauritzen, S. L. (1987). Some differential geometrical notions and their use
in statistical theory. IMS Monograph, this volume.
Lindsay, B. G. (1982). Conditional score functions: Some optimality results.
Biometrika 69, 503-512.
McCullagh, P. (1984). Tensor notation and cumulants of polynomials.
Biometrika 71, 461-476.
Madsen, L. T. (1979). The geometry of statistical models - a generalization
of curvature. Research Report, 79-1, Statist. Res. Unit., Danish
Medical Res. Council.
Nagaoka, H. and Amari, S. (1982). Differential geometry of smooth families of
probability distributions, METR 82-7, Univ. Tokyo.
Pfanzagl, J. (1982). Contributions to General Asymptotic Statistical Theory.
Lecture Notes in Statistics 13, Springer.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of
statistical parameters. Bull. Calcutta Math. Soc. 37, 81-91.
Reeds, J. (1975). Discussions to Efron's paper. Ann. Statist. 3, 1234-1238.
Skovgaard, Ib. (1985). A second-order investigation of asymptotic ancillarity.
Ann. Statist. 13, 534-551.
Skovgaard, L. T. (1984). A Riemannian geometry of the multivariate normal
model. Scand. J. Statist. 11, 211-223.
Yoshizawa, T. (1971). A geometrical interpretation of location and scale
parameters. Memo TYH-2, Harvard Univ.
DIFFERENTIAL AND INTEGRAL GEOMETRY IN STATISTICAL INFERENCE
O. E. Barndorff-Nielsen
1. Introduction. 97
2. Review and Preliminaries . 99
3. Transformation Models . 118
4. Transformation Submodels . 127
5. Maximum Estimation and Transformation Models . 130
6. Observed Geometries . 135
7. Expansion of $c|\hat\jmath|^{1/2}\bar L$ . 147
8. Exponential Transformation Models . 152
9. Appendix 1. 154
10. Appendix 2. 156
11. Appendix 3. 157
12. References. 159
Department of Theoretical Statistics, Institute of Mathematics, University of
Aarhus, Aarhus, Denmark
1. INTRODUCTION
This paper gives an account of some of the recent developments in
statistical inference in which concepts and results from integral and differen-
tial geometry have been instrumental.
A great many important contributions to the field of integral and
differential geometry in statistics are not discussed or even referred to here,
but a rather comprehensive overview of the field can be obtained from the mate-
rial compiled in the present volume and from the survey paper by Barndorff-
Nielsen, Cox and Reid (1986).
Section 2 reviews pertinent parts of statistics and of integral
and differential geometry, and introduces some of the terminology and notation
that will be used in the rest of the paper.
A considerable part of the material in sections 3, 4, 5 and 8 and
in the appendices, which are mainly concerned with the systematic theory of
transformation models and exponential transformation models, has not been pub-
lished elsewhere.
Sections 6 and 7 describe a theory of "observed geometries" and its
relation to an asymptotic expansion of the formula $c|\hat\jmath|^{1/2}\bar L$ for the conditional
distribution of the maximum likelihood estimator; the results there are mostly
taken from Barndorff-Nielsen (1986a). Briefly speaking, the observed geome-
tries on the parameter space of a statistical model consist of a Riemannian
metric and an associated one-parameter family of affine connections, construct-
ed from the observed information matrix and from an auxiliary statistic a cho-
sen such that $(\hat\omega,a)$, where $\hat\omega$ denotes the maximum likelihood estimator of the
parameter of the model, is minimal sufficient. The observed geometries and the
closely related expansion of $c|\hat\jmath|^{1/2}\bar L$ form a parallel to the "expected geometries"
and the associated conditional Edgeworth expansions for curved exponential
families studied primarily by Amari (cf., in particular, Amari 1985, 1986), but
with some essential differences. In particular, the developments in sections 6
and 7 are, in a sense, closer to the actual data and they do not require inte-
grations over the sample space; instead they employ "mixed derivatives of the
log model function." Furthermore, whereas the studies of expected geometries
have been largely concerned with curved exponential families the approach taken
here makes it equally natural to consider other parametric models, and in par-
ticular transformation models. The viewpoint of conditional inference has been
instrumental for the constructions in question. However, the observed geometri-
cal calculus, as discussed in section 6, does not require the employment of
exact or approximate ancillaries.
The observed geometries provide examples of the concept of
statistical manifolds discussed by Lauritzen (1986).
Throughout the paper examples are given to illustrate the general
results.
2. REVIEW AND PRELIMINARIES
We shall consider parametrized statistical models M specified by $(\mathcal X, p(x;\omega), \Omega)$, where $\mathcal X$ is the sample space, Ω is the parameter space and p(x;ω) is the model function, i.e. $p(x;\omega) = dP_\omega/d\mu$ for some dominating measure μ. The dimension of the parameter ω will usually be denoted by d and we write ω in coordinate form as $(\omega^1,\dots,\omega^d)$. Generic coordinates of ω will be indicated as $\omega^r$, $\omega^s$, $\omega^t$, etc.
The present section is organized in a number of subsections and it
serves two purposes: to provide a survey of previous results and to set the
stage for the developments in the following sections.
Combinants. It is useful to have a term for functions which depend on both the observation x and the parameter ω, and we shall call any such function a combinant.
Jacobians. Our vectors are row vectors and we denote transposition of a matrix by an asterisk *. If f is a differentiable transformation of a space Y then the Jacobian matrix ∂f/∂y* of f at y ∈ Y is also denoted by $\underline J_f(y)$, while we write $J_f(y)$ for the Jacobian determinant, i.e. $J_f = |\underline J_f|$. When appropriate we interpret $J_f(y)$ as an absolute value, without explicitly stating this. We shall repeatedly use the fact that for differentiable transformations f and g we have
$$\underline J_{f\circ g}(y) = \underline J_g(y)\,\underline J_f(g(y)) \qquad (2.1)$$
and hence
$$J_{f\circ g}(y) = J_f(g(y))\,J_g(y)\,. \qquad (2.2)$$
Foliations. A partition of a manifold of dimension k into submani-
folds all of dimension m<k is called a foliation and the submanifolds are said
to be the leaves of the foliation.
A dimension-reducing statistical hypothesis may often, in a natural way, be viewed as a leaf of an associated foliation of the parameter space Ω.
Likelihood. We let L = L(ω) = L(ω;x) denote an arbitrary version of the likelihood function for ω and we set l = log L. Furthermore, we write $\partial_r = \partial/\partial\omega^r$, and $l_r = \partial_r l$, $l_{rs} = \partial_r\partial_s l$, etc. The observed information is the matrix
$$j(\omega) = -[l_{rs}] \qquad (2.3)$$
and the expected information is
$$i(\omega) = E_\omega\, j(\omega)\,. \qquad (2.4)$$
The inverse matrices of j and i are referred to as observed and expected formation, respectively.
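The distinction between j and i matters in practice because the observed information depends on the data. A hypothetical numerical illustration for the Cauchy location model, where the expected information per observation is the constant 1/2 while $j(\hat\omega)$ fluctuates from sample to sample:

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(3)

    def negloglik(om, x):            # Cauchy location log likelihood, negated
        return np.sum(np.log1p((x - om) ** 2))

    for _ in range(3):
        x = rng.standard_cauchy(40)
        om_hat = minimize_scalar(negloglik, args=(x,),
                                 bounds=(-5, 5), method="bounded").x
        # observed information j = -l'' at the m.l.e., by central differences
        h = 1e-4
        j = (negloglik(om_hat + h, x) - 2 * negloglik(om_hat, x)
             + negloglik(om_hat - h, x)) / h**2
        print(om_hat, j, len(x) * 0.5)   # j varies; i = n/2 is fixed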
Suppose the minimal sufficient statistic t for M is of dimension k. We then speak of M as a (k,d)-model (d being the dimension of the parameter ω). Let $(\hat\omega,a)$ be a one-to-one transformation of t, where $\hat\omega$ is the maximum likelihood estimator of ω and a, of dimension k−d, is an auxiliary statistic.
In most applications it will be essential to choose a so as to be distribution constant, either exactly or to the relevant asymptotic order. Then a is ancillary, and according to the conditionality principle the conditional model for $\hat\omega$ given a is considered the appropriate basis for inference on ω. However, unless explicitly stated, distribution constancy of a is not assumed in the following.
There will be no loss of generality in viewing the log likelihood l = l(ω) in its dependence on the observation x as being a function of the minimal sufficient $(\hat\omega,a)$ only. Henceforth we shall think of l in this manner and we will indicate this by writing
$$l = l(\omega;\hat\omega,a)\,.$$
Similarly, in the case of observed information we write
$$j = j(\omega;\hat\omega,a)\,,$$
etc. It turns out to be of interest to consider the function
$$\hat l(\omega) = \hat l(\omega;a) = l(\omega;\omega,a)\,, \qquad (2.5)$$
obtained from $l(\omega;\hat\omega,a)$ by substituting ω for $\hat\omega$. Similarly we write
$$\hat\jmath(\omega) = \hat\jmath(\omega;a) = j(\omega;\omega,a)\,. \qquad (2.6)$$
For a general parametric model p(x;ω) and for a general auxiliary a, a conditional probability function $p^*(\hat\omega;\omega\mid a)$ for $\hat\omega$ given a may be defined by
$$p^*(\hat\omega;\omega\mid a) = c\,|\hat\jmath|^{1/2}\,\bar L\,, \qquad (2.7)$$
where $\bar L$ is the normed likelihood function, i.e.
$$\bar L = p(x;\omega)/p(x;\hat\omega)\,,$$
and where $c = c(\omega,a)$ is a norming constant determined so as to make the integral of (2.7) with respect to $\hat\omega$ equal to 1.
Suppose now that a is approximately or exactly distribution constant. Then the probability function $p^*(\hat\omega;\omega\mid a)$, given by (2.7), is to be considered as an approximation to the conditional probability function $p(\hat\omega;\omega\mid a)$ of the maximum likelihood estimator $\hat\omega$ given a, cf. Barndorff-Nielsen (1980, 1983). In general, $p^*(\hat\omega;\omega\mid a)$ is simple to calculate since it only requires knowledge of standard likelihood quantities plus an integration over the sample space to determine the norming constant c. Moreover, to sufficient accuracy this norming constant can often be approximated by $(2\pi)^{-d/2}$, where d is the dimension of ω; and a more refined approximation to c solely in terms of mixed derivatives of the log model function is also available, cf. the next subsection and section 7. In a great number of cases, including virtually all transformation models, $p^*(\hat\omega;\omega\mid a)$ is, in fact, equal to $p(\hat\omega;\omega\mid a)$. Furthermore, outside these exactness cases one often has an asymptotic relation of the form
$$p(\hat\omega;\omega\mid a) = p^*(\hat\omega;\omega\mid a)\{1 + O(n^{-3/2})\} \qquad (2.8)$$
uniformly in $\hat\omega$ for $\sqrt n\,(\hat\omega - \omega)$ bounded, where n denotes sample size. This holds, in particular, for (k,d) exponential models. For more details and further
discussion, see Barndorff-Nielsen (1980, 1983, 1984, 1985, 1986a,b) and
Barndorff-Nielsen and Blaesild (1984).
Expansion of cjjl L in the single-parameter case. Suppose ? is
one-dimensional. From formulas (4.2) and (4.5) of Barndorff-Nielsen and Cox
(1984) we have
cj^C = f(?-?; j){l + CjHl
+ A-j^U-u)))
+ A2(a^U-o)))}
?{1 + 0(n"3/2)}.
(2.9)
Here <i>(w;y) denotes the probability density function of the normal distribution
with mean 0 and variance ?" . Furthermore, C,, A,, and A? are given by
Cl ?
?{-3U4 +
12U3,1 "5U3 +
24U2,1U3 -
24U2,1 "
12U2,2} <2-10>
and
A^u) =
P1(u)U2J +
P2(u)U3
A2(u) =
P3(u)U2j2 +
P4(u)?2tl +
P5(u)U4 +
P6(u)U3>1 +
P7(u)U3
+ P8?U>U2,1U3
where P.(u), i = 1,...,8, are polynomials, the explicit forms of which are
given in Barndorff-Nielsen (1985), and where U = U n and U ? are defined as ^ ' ? v,0 v,s
? / \ ? = 1,2,3,...
,, , ? as(rv;U^,a)} uv,s(u))-.(v+s)/2-
> * s = 0,1,2...
rv' denoting the v-th order derivative of 1 = lU;??>a) with respect to ? and
8S indicating differentiation s times with respect to ?. Note that, in the
repeated sampling situation, U is of order 0(n"^v s" ''
). Hence the ? ,s
quantities C.s ?-i and A2 are of order 0(n" ), 0(n ) and 0(n~ ), respectively.
Integration of (2.7) yields an approximation to the conditional
distribution of the likelihood ratio statistic
w = 2{1(?) - 1(?0) (2.11)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 103
for testing a dimension reducing hypothesis O0 of O. In particular, if O? is
a po^nt hypothesis, ?~ = {?^}, we have
^0 P*(w;Wfi|a) = ce_i~2W / |j|^ (2.12)
?|w,a
as an app^ imation to p(w;u)Q|a). (The leading term of (2.9) together with
(2.12) yields the usual ? approximation for w. For a connection to Bartlett
adjustment factors see Barndorff-Nielsen and Cox (1984)).
Furthermore, (2.9) may be integrated termwise to obtain expansions
for the conditional distribution function for ? and, by inversion, for confi-
-3/2 dence limits for ?, correct to order 0(n ), conditionally as well as uncon-
ditionally, cf. Barndorff-Nielsen (1985). The resulting expressions allow one
to carry out "conditional inference without conditioning and without integra-
tion."
For extensions to the case of multidimensional parameters see
section 7.
Reparametrization. A basic form of invariance is parametrization
invariance of statistical procedures (though parametrization equivariance might
be a more proper term). If we think of an inference frame as consisting of the
data in conjunction with the model and a particular parametrization of the
model, and of a statistical procedure p as a method which leads from the
inference frame to a conclusion formulated in terms of the parametrization of
the inference frame then parametrization invariance may be formally specified
as commutati vity of the diagram
inference reparametrization ^ inference
frame frame
procedure p
procedure
conclusion -? conclusion reparametri zati on
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
104 O. E. Barndorff-Nielsen
In words, the procedure p is parametrization invariant if changing the inference
base by shifting to another parametrization and then applying p yields the same
conclusion as first applying p and then translating the conclusion so as to be
expressed in terms of the new parametrization. (We might describe a parametri-
zation invariant procedure as a 0-th order generalized tensor.) Maximum
likelihood estimation and likelihood ratio testing are instances of parametri-
zation invariant procedures.
Example 2.1. Consider any log-likelihood function 1(?), of a one-
dimensional parameter ?. Define the functions r^ = r*-vJ(w), ? = 1,2,...,
recursively by
*Cl]U) = !(1)U)/iU)^
G[?].*?^f1(?)^ j f ?\w/ , v=2,3,..., ??
and set f*-v-* = r-v-*U). The derivatives rLvJ are parametrization invariant,
i.e. r^ takes the same value whatever the parametrization employed.
While parametrization invariance is clearly a desirable property,
there are a number of useful, and virtually indispensable, statistical methods
which do not have this property. Thus procecures which rely on the asymptotic
normality of the maximum likelihood estimator, such as the Wald test or stan-
dard ways of setting confidence intervals in non-linear regression problems,
are mostly not parametrization invariant. However, in cases of non parametri-
zation invariance particular caution must be exercised, as demonstrated for
instance for the Wald test by Hauck and Donner (1977) and Vaeth (1985).
We shall be interested in how various quantities behave under
reparametrizations of the model M. Let ?, of dimension d, be the parameter of
some parametrization of M, alternative to that indicated by ?. Coordinates of
? will be denoted by ??, ?s, etc. and we write a for 3/3?? and ?
r - a r/,,P r - a2 r/aip..o /? ? /?s ?
etc. Furthermore, we write 1(?) for the log likelihood under the parametriza-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 105
tion by ?, though formally this is in conflict with the notation 1(?), and
correspondingly we let lp
= 3 1 = a 1(?), etc.; similarly for other parameter
dependent quantities. Finally, the symbol ? over such a quantity indicates that
the maximum likelihood estimate has been substituted for the parameter.
Using this notation and adopting the summation convention that if a
suffix occurs repeatedly in a single expression then summation over that suffix
is understood, we have
p r /?
1 = "Lcw/ ?* + 1 ?, (2.13)
?s rs /? /s r /?s ? '
1 = ?^+?, ?7 ?, + ?^?, ?, [3] + 1 ?, (2.14) ?st rst /? /s /t rs /?s /tu
j r /?st ? '
etc., where [3] signifies a sum of three similar terms determined by permutation
of the indices ?,s,t. On substituting ? for ? in (2.13) we obtain the well-
known relation
Ko =
jrs%>
which, now by substitution of ? for ?, may be reexpressed as
ho -
K////0 <2?15>
or, written more explicitly,
Equation (2.15) shows that j is a metric tensor on M, for any given value of the
auxiliary statistic a. Moreover, in wide generality ? will be positive definite
on M, and we assum? henceforth that this is the case. In fact, for any ?eO we
have j? = j, i.e. observed information at the maximum likelihood point, which is
generally positive definite (though counterexamples do exist). r, ... r
Let ?(?) = [? ?(?)] be an array, depending on ? and where
sl "?
sq each of the ? + q indices runs from 1 to d. Then A is said to be a (p,q)
tensor, or a tensor of contravariant rank ? and covariant rank q, if under
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
106 O. E. Barndorff-Nielsen
reparametrization from ? to ? A obeys the transformation law
pr..pn s-, srt p, p^ rr..r
^...o^*)-"/lr-/V/v-^pAv-<w?
Example 2.2. A covariant tensor of rank q is given by
E j al al ?
In particular, the expected information i is a (0,2) tensor.
The inverse [irs] of i = [i ] is a contravariant second order
tensor.
r.r?... t^tp... The (outer) product of two tensors A and ?
sls2??' u1u2... is defined as the array C given by
S,Sp. . .U-^.
. . "~
S-jS?... U-jUp... '
This product is again a tensor, of rank (p' + p", q' + q") if (p',q') and
(p",q") are the ranks of A and B.
Lower rank tensors may be derived from higher rank tensors by con-
traction, i.e. by pairwise identification of upper and lower indices (which
implies a summation).
The parameter space as a manifold. The parameter space O may be
viewed as a (pseudo-) Riemannian manifold with (pseudo-) metric determined by
a metric tensor ?, i.e. ? is a rank 2 covariant, regular and symmetric tensor.
o The associated Riemannian connection ? is determined by the Christoffel symbols
?t rrs
where
?t tu ? r = f r
rs ? rsu
and
?rst =
*<Vst *
Vrs +
Vrt>? {2J6)
If ? is any affine connection with connection symbols r then
these symbols satisfy
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 107
y? a, = r* a. (2.17) ar
s rs t v ;
and the transformation law
G?>) -
[?(?>?,>?? +
?*po]*;t . (2.18)
On the other hand, any set of functions [r ] which satisfy the law (2.18)
constitute the connection symbols of an affine connection on O. It follows that
all affine connections on ? are of the form
t ?t t r = rL + S (2.19) rs rs rs v????*/
where the S are characterized by the transformation law
Sp>) =
S?s<??%%*/t * (2?20)
If, for a given metric tensor f, we define r . and S . by
G j. = ru f. and S . = Su f. rst rs^tu rst rs^tu
then (2.18), (2.19) and (2.20) are equivalent to, respectively,
G (?) = G?,^ + (?)?/ ?/ ?/ + F4...(?)?, ?7 (2.21) ?st?
' rst ' /? /s /t tu /?s /t
rrst ?
?rst +
Srst <2'22>
and
?st rst /? /s /t = S .?, ?# ?, . (2.23)
Thus, in particular, [S J is a tensor.
Suppose ?:3 -> ? is a mapping of full rank from an open subset ? of
a Euclidean space of dimension d? < d into O. Then ? is said to be an immer-
sion of ? in O. We denote coordinates of 3 by 3a,3 , etc. If f is a metric
tensor on ? then the metric tensor on ? induced from ? by ? is defined by
*ab(6) =
*rsU)Wb ? (2?24)
If ?* (?) is a connection on O and if r = r" ?. then the induced connection
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
108 O. E. Barndorff-Nielsen
on ? is defined by ^(3) =
rab(j(eHCd(3) and by
rabc(3) =
rrst(u))w/aw/bw/c +
*tu%bw/c ' (2'25)
Let G be a group acting smoothly on the parameter space. A metric
tensor f is said to be (G-) invariant if
FG5(?) =^??!1fG?5,(9?)^??1_, geG. (2.26) 3? d?a
For a given g let a new parametrization be introduced by ? = go*. From the
transformation law for tensors it follows that F is invariant if and only if
FGd(?) =
FG$(9?), geG. (2.27)
(On the left hand side the tensor is expressed in ? coordinates, on the right
hand side in ? coordinates.) Similarly, a connection r is said to be invariant
1f rJsU)
= r?;s(gu)), g?G. (2.28)
The pseudo-Riemannian connection derived from an invariant metric tensor is
invariant.
In generalization of (2.27) an arbitrary covariant tensor A,
is said to be (G-) invariant if V"rq
A? ? (?) = A (gui), geG. rr..rq rr..rq
If r is a G-invariant connection and if ? and S . are G- ?? G ? G w U
invariant tensors, with ? being a metric tensor, then r defined by
^t t . tU/K r = r + ? S rs rs ? rsu
is a G-invariant connection.
Now, let ? be the information tensor i on O. Then (2.16) takes the
form
?rst ?
E{1rsV +
*ilrVt>.
Obviously,
Trst =
Eilr1slt} (2.29)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 109
satisfies (2.23) and hence, for any real a an affine connection is defined by
"rst=E{1rsV+?E{VsV? <2?30>
These are the a-connections introduced and studied by Chentsov (1972) and
Amari (1982a,b, 1985, 1986).
However, we shall be mainly concerned with another type of connec-
tion, determined from observed information, more specifically from the metric
tensor j-, see sections 6-8. We refer to i and # as expected and observed in-
formation metric on M, respectively.
Suppose, as above, that ?:3 ?+ ? is an immersion of ? in O. The
submodel Mq of M obtained by restricting ? to lie in O = ?(?) has expected
information
iM'WiMW' (2?31)
Thus i(3) equals the Riemannian metric induced from the metric i(?) on O to
the imbedded submanifold ?0? Furthermore, the a-connection of the model M~
equals the connection on ?0 induced from the a-connection on O, by the general
construction (2.25).
The measures on ? defined by
and
|?|\?? (2.32)
???^a? (2.33)
are both geometric measures, relative to expected and observed information
metric, respectively. Note that (2.33) depends on the value of the auxiliary
statistic a. We shall speak of (2.32) and (2.33) as expected and observed
information measure, respectively. It is an important property of these mea-
sures that they are parametrization invariant. This property follows from
the fact that i and ?r are covariant tensors of rank 2. As a consequence we
have that c|j| L (of (2.7)) is parametrization invariant.
Invariant measures. A measure y on ?, is said to be invariant with
respect to a group G acting on X^ if gy = y for all geG.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
110 O. E. Barndorff-Nielsen
Invariant measures, when they exist, may often be constructed from
a quasi-invariant measure, as follows.
A measure ? on X is called quasi-invariant with multiplier
? = x(g?x) "if Qy and y are mutually absolutely continuous for every geG and if
d(g_1y)(x) = x(g,x)dy(x).
Furthermore, define a function m on X to be a modulator with associated
multiplier x(g,x) if m is positive and
m(gx) = x(g,x)m(x).
Then, if yx is quasi-invariant with multiplier x(g,x) and if m is a modulator
with the same multiplier we have that
? ? y = m yA
is an invariant measure on IX.
As quasi-invariance is clearly a very weak property the problem in
constructing invariant measures lies mainly in finding appropriate modulators.
It is usually possible to specify the modulators in terms of Jacobians.
In particular, in applications it is often the case that X^ is an
open subset of a Euclidean space. By the standard theorem on transformation
of integrals, Lebesgue measure ? on X is then quasi-invariant with multiplier
J /a\(x). Under mild conditions an invariant measure on X^ is then given by
dy(x) = <]?(2)(?G?<??(?).
(2.34)
Here J , ? denotes the Jacobian determinant of the mapping y(g) of iX onto itself
determined by geG and (z,u) constitutes an orbital decomposition of x, i.e.
(z,u) is a one-to-one transformation of ? such that ?e_? and u is maximal
invariant while ze& and x=zu. For a more detailed discussion see section 3
and appendix 1.
Transformation models. Let G be a group acting on the sample space
X. If the class ? of probability measures given by the statistical model is
invariant under the induced action of G on the set of all probability measures
on iX then the model is called a composite transformation model and if ?
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 111
consists of a single orbit we use the term transformation model. For a
composite transformation model, G acts on _P and we may, of course, equally
think of G as acting on the parameter space O. A parameter (function) ? which
is maximal invariant under this action is said to be an index parameter.
Virtually all composite transformation models of interest have the property
that after minimal sufficient reduction (and possibly after deletion of a null
set from _X) there exists a sub-group ? of G such that ? is the isotropy group
for a point on every one of the orbits of _X and of O. Each of these orbits is
then isomorphic to the homogeneous space G/K = {gK.^G} of left cosets of K.
For a transformation model the information measures (2.32) and
(2.33) are invariant measures relative to the action of G on O induced from the
action of G on X via the maximum likelihood estimator ?, which is an equivariant
mapping from _X to O. This action is the same as the above-mentioned action of
G on ? ? ? and also the same as the natural action of G on G/K ? ?.
It follows that relative to information measure on O the formula
(2.7) for the conditional distribution of ? is simply cL. From this it may be
shown that, with the auxiliary a as the maximal invariant statistic, ?*(?,?|a)
is exactly equal to ?(?;?|a).
These results are shown in outline in Barndorff-Nielsen (1983). A
more general statement will be derived in section 5.
Exponential models. A (k,d) exponential model has model function of
the form
p(x;u>) = exp{e(u>)-t(x) - ?(?(?)) - h(x)}. (2.35)
Here k is the order of the model (2.35) and is equal to the common dimension
of the vectors ?(?) and t(x), while d denotes the dimension of the parameter ?.
The full exponential model generated by (2.35) has model function
p(x;e) = exp{e-t(x) - ?(?) - h(x)} (2.36)
and ?(?) is the cumulant transform of the canonical statistic t = t(x). From
the viewpoint of inference on ? there is no restriction in assuming ? = t,
since t is minimal sufficient, and we shall often do so. We set t = t(?) = Et,
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
112 O. E. Barndorff-Nielsen
i.e. t is the mean value parameter of (2.36), and we write ? for x(int0)
where T denotes the canonical parameter domain of the full model (2.36).
Let f be a real differentiable function defined on an open subset
k t of R . The Legendre transform f of f is defined by
fT(y) = x-y-f(x)
where
y = (Df)(x) =|f(x) .
The Legendre transform is a useful tool in studying various, dualistic aspects
of exponential models (cf. Barndorff-Nielsen (1978a), Barndorff-Nielsen and
Blaesild (1983a)).
In particular, we may use the Legendre transform to define the -1
dual likelihood function 1 of (2.35) by
-1 1 (?) = ??t(?) - 1(t(?)). (2.37)
Here, and elsewhere, ' as top index indicates maximum likelihood estimation
under the full model. Further, in this connection we take 1 as the sup-log-
likelihood function of (2.36) and then 1 is, in fact, the Legendre transform of
?. Note that for t = t(?) e ? we have 1(t) = ??t - ?(?). An inference
methodology, parallel to that of likelihood inference for exponential families,
may be developed from the dual likelihood (2.37). The estimates, tests and
confidence regions discussed by Amari and others under the name of a = -1 (or
mixture) procedures are, essentially, part of the dual likelihood methodology.
More generally, based on Amari's concepts of a-geometry and a- a
divergence, one may for each ae[-1,1] introduce an "a-likelihood" L by
L(?>) = L(a>;t) = exp{-Da(e,e(?)))> (2.38)
where
Da^> =
W$#? <2?39>
Here ?(?;?) is given by (2.36) and the function f is defined as
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 113
? log ?, a = 1
f (?) = 4 {1.?(?a)/2}> _1<a<1
a ? c 1-a
-log ?, a = -1
(2.40)
a a
Letting 1 = log L we have, in particular,
1 1(?) = 1(?) = -?(?,?) = ?-t - ?(?) - ?(t) (2.41)
and -1
1(?) = -?(?,?) = ??t - ?(t) - ?(?) (2.42)
where I denotes the discrimination information. Furthermore, for -1<a<1,
1(e) ?-^ [e ? 2 2 2
_1L l-a?
Affine subsets of T are simple from the likelihood viewpoint while,
correspondingly, affine subsets of ? are simple in dual likelihood theory. Dual
affine foliations, of T and ? respectively, are therefore of some particular
interest. Such foliations have been studied in Barndorff-Nielsen and Blaesild
(1983a), see also Barndorff-Nielsen and Blaesild (1983b).
Suppose that the auxiliary component a of (?,a) is approximately or
exactly distribution constant, i.e. a is ancillary. For instance, a may be the
affine ancillary or the directed log likelihood ratio statistic, as defined in
Barndorff-Nielsen (1980, 1986b). We may think of the partitions generated,
respectively, by a and ? as foliations of T, to be called the ancillary
foliation and the maximum likelihood foliation. (Amari's ancillary subspaces
are then, in the present terminology and for a = 1, leaves of the maximum like-
lihood foliation.)
Exponential transformation models. A model M which is both trans-
formational and exponential is called an exponential transformation model. For
such models we have the following structure theorem (Barndorff-Nielsen,
Blaesild, Jensen and Jorgensen (1982), Eriksen (1984b)).
Theorem 2.1. Let M be an exponential transformation model with
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
114 O. E. Barndorff-Nielsen
acting group G. Suppose X_ is locally compact and that t is continuous. Fur-
thermore, suppose that G is locally compact and acts continuously on _X.
Then there exists, uniquely, a k-dimensional representation A(g) of
G and k-dimensional vectors B(g) and B(g) such that
t(gx) = t(x)A(g) + B(g) (2.43)
e(g) = eteWg"1)* + 8f(g) (2.44)
where ee& denotes the identity element. Furthermore, the full exponential model
generated by M is invariant under G, and &* = {[A(g" )*,&(g)]: geG} is a group of
affine transformations of R leaving T and into invariant in such a way that
e(gP) = e?PjA?g"1)* + B(g), geG, ?e? .
Dually, G = ?[A(g),B(g)]^G} is a group of affine transformations leaving
C = cl conv t( X_ ) as well as ? = x(inte) invariant. Finally, let 6 be the
function given by
6(g) = ?(Q(e))a(Q(g))-?exp('Q(g)M9)). (2.45)
We then have
a(e(gP)) = a(0(P))o(g)"1exp(-e(gP).B(g)). (2.46)
Exponential transformation models that are full are a rarity.
However, important examples of such models are provided by the family of Wishart
distributions and the transformational submodels of this.
In general, then, an exponential transformation model M is a curved
exponential model. It is seen from the above theorem that the full model M
generated by M is a composite transformation model and that, correspondingly,
M (and, hence T and T) is a foliated manifold with M as a leaf. It seems of
interest to study how the leaves of this foliation are related geometric-
statistically. Exponential transformation models of type (k,d), and in partic-
ular those of type (2,1), have been studied in some detail by Eriksen (1984a,c).
In the first of these papers the Jordan normal form of a matrix is an important
tool.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 115
Many of the classical differentiable manifolds with their associated
acting Lie groups are carriers of interesting exponential transformation models.
Instances of this are compiled in table 2.1.
Analogies between exponential models and transformation models.
There are some intriguing analogies between exponential models and transforma-
tion models.
Example 2.3. Under a d-dimensional location parameter model, with
? as the location parameter and for a fixed value of the (ancillary) configura-
tion statistic, the possible score functions are horizontal translates of each
other.
On the other hand, under a (k,d) exponential model, with ? as a
component of the canonical parameter and provided the complementary part of the
canonical statistic is a cut, the possible score functions are vertical trans-
lates of each other. (For details, see Barndorff-Nielsen (1982)).
Example 2.4. Suppose ? is one-dimensional. If ? is the location
parameter of a location model then the correction term C,, given by (2.10),
takes the simple form
1 ?(4) j(3)2 C1
= - 24 {3 -^-
+ 5 :3 > .
Exactly the same expression is obtained for a (1,1) exponential
model with ? as the canonical parameter.
(This was noted in Barndorff-Nielsen and Cox (1984)).
Maximum estimation. Suppose that for a certain class of models we
have an estimation procedure according to which the estimate ? of ? is obtained
by maximizing a positive function ? = ?(?) = ?(?;?) with respect to ?. Let
m = log M and suppose that
? = -[3rasm](20 (2.47)
is positive definite. We shall then say that we have a maximum estimation pro-
cedure. Maximum likelihood estimation and dual maximum likelihood estimation -1
(where m(u>) = 1(?) = ??t(?) - 1(?), cf. (2.37)) are examples of this. More
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
116 O. E. Barndorff-Nielsen
generally, minimum contrast estimation, as discussed by Eguchi (1983), is of
this type.
Suppose that M depends on ? through the minimal sufficient statis-
tic only and let a be an auxiliary statistic such that (?,a) is minimal suf-
ficient. In generalization of (2.7) we may consider
p*(2f;u)|a) = ?\?\\/?9 (2.48)
as a possible approximation to ?(?;?|?). Here t = iQ) and c is a norming
constant, determined so as to make the integral of the right hand side of
(2.48) with respect t? ? equal to 1.
It will be shown in section 5 that (2.48) is exactly equal to
?(?;?|a) for a considerable range of cases.
Finally, it may be noted that by an argument of analogy it would
seem rather natural to consider the modification of (2.48) in which the func-
tion M is substituted for the likelihood function L. While this approach is
not without interest its general asymptotic degree of accuracy is only 0(n )
-1 -3/2 in comparison with 0(n~ ) or 0(n"
' ) for (2.48). Also, for transformation
models this modification is exact in exceptional cases only.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 117
? ? (O
? F ? O S1 F
? * ?: U ??
4J F U co f S 0)
?H
U ?P r< I
x
o ?a & 33
?S C
en
r-\ O <H ?H
(H 05 U ?H ? ? o?
?
F ? <d ft co u s
u cd ft
o 1 >i W
JJ ? <d u * ? :d f ?*?*
3?1 ?
I
F H
CO PS ?P -H C ? ?? 3
I
f > ??
ss
8 ? ft
F -H 4J en
m
co -P (O
t? o
f O
t? O
?? ?M
O en
o H Cn
? ?H ?? ? ?d a*
U + CQ
O CO
? O
U
?
o
(d ? ?> o
o co
?a ? S
o
r-i I
o co
g.
Il
S
?H ?M ? F ? O
O
f ? co
fi ?? ?
?? ? ? F ? -? ? -? m -?
? F ? -? ? -?
?* ? (0 ?
f F
(0 ? ?? ?? ??
CO
? ?? ?
CQ
?
CM
F|
?-? ?
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
3. TRANSFORMATION MODELS
Transformation models were introduced in section 2. For any ?e?
the set Gx = {gxigeG} of points traversed by ? under the action of 6 is termed
the orbit of x. The sample space )Ms thus partitioned into disjoint orbits,
and if on each orbit we select a point u, to be called the orbit representative,
then any point ? in iX can be determined by specifying the representative u of
Sx and an element zeG such that ? = zu. In this way ? has, as it were, been
expressed in new coordinates (z,u) and we speak of (z,u) as an orbital decompo-
sition of x.
The orbit representative, or any one-to-one transformation thereof,
is a maximal invariant - and hence ancillary - statistic, and inference under
the model proceeds by first conditioning on that statistic.
The action of G on a space _X is said to be transitive if ^consists
of a single orbit and free if for any pair g and h of different elements of G
we have gx j hx for every xeX. Note that after conditioning on a maximal
invariant statistic u we have a transitive action of G on the conditional sample
space. For any ?e_? the set Gx = {g:gx = x) is a subgroup, called the isotropy
group of x. The space X_is said to be of constant orbit type if it is possible
to select the orbit representatives u so that G is the same for all u.
The situation is particularly transparent if the action of G on the
sample space ?X is free. Then for given ? and u there is only one choice of ZeG
such that ? = zu, and X, is thus representable as a product space of the form
U ? G where U is the subset of ^consisting of the orbit representatives u.
Note that u and ? as functions of ? are, respectively, invariant and equivariant
118
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 119
? .e.
u(gx) = u(x), z(gx) = gz(x).
It is o'ten feasible to construct an orbital decomposition by first finding an
equivariant mapping ? from X_ onto G and then defining the orbit representative
u for ? bv
? = z" x.
In particular, the maximum likelihood estimate g of g is equivariant, and may be
used as ? provided g(x) exists uniquely for every ?e_? and g(X) = G. In this
case, G's action on ? must also be free.
However, we shall need to treat more general cases where the actions
of 6 on X and on IP are not necessarily free.
Let ? and ? be subsets of G. We say that these constitute a
factorization of G if G is uniquely factorizable as
G = HK
in the sense that to each element geG there exists a unique pair (???)e??? such
that g = hk. We speak of a left factorization if, in addition, ? is a subgroup
of G, and similarly for right factorization. If a factorization is both left
and right then G is said to be the product of the groups H and K. An important
example of such a product is afforded by the well-known unique factorization of
a regular ? ? ? matrix A into a product UT of an orthogonal matrix U and a
lower triangular matrix with positive diagonal elements, i.e., using standard
notations for matrix groups, GL(n) is the product of 0(n) and T+(n).
A relevant left factorization is often generated in the following
way. Let ? be a member of the family P^ of probability measures for a transform-
ation model M, and let ? be the isotropy group Gp, i.e.
? = {geG:gP = P}.
For each ?e?^ we may select an element h of G such that ? = hP, and letting ? be
the set consisting of these elements we have a (left) factorization G = HK.
(In a more technical wording, the elements h are representatives of the left
cosets of K.) Note that G? =
hGph , and that the action of G on ? is free if
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
120 O. E. Barndorff-Nielsen
and only if ? consists of the identity element alone. The quantity h para-
metrizes f\
Suppose G = HK is a factorization of this kind. For most transform-
ation models of interest, if the action of G on X is not free then there exists
an orbital decomposition (z,u) of ? with ?e? and such that for every u the iso-
tropy group G equals ? and, furthermore, if ? and z% are different elements of
? then zu f z'u.
Example 3.1. Hyperboloid model. This model (Barndorff-Nielsen
(1978b), Jensen (1981)) is analogous to the von Mises-Fisher model but pertains
k-1 k to observations ? on the unit hyperboloid ? of R , i.e.
? k-1
{x:x*x = 1, Xq>0}
where ? = (xq,x,,...,x. ,) and * denotes the non-definite scalar product of
vectors in R which is given by
x*y = x0y0-x1y1-...-xk_1yk_r
The analogue of the orthogonal group 0(k) is the so called pseudo-
orthogonal group 0(1,k-1), which is the subgroup of GL(k) with matrix represent-
ation
0(1,k-1) = {U:U* I U = I}
where ? denotes the k ? k diagonal matrix
1 0
0 -1
0
0 .... -1
For k = 4 this is the Lorentz group of relativistic physics. Topologically,
the group 0(1,k-1) has four connected components, of which one is a subgroup of
0(1,k-1) and is defined by
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 121
SO+(l,k-l) = {lfcO(l,k-l):|U| = 1, uQQ>0}
(the elements of U are denoted by u.., i and j = 0,1,...,k-l). This subgroup ' J k-1
is called the special pseudo-orthogonal group and it acts on H by (U,x) -*xU*
k-1 (vector-matrix multiplication). The points of H can be expressed in hyper-
bolic-spherical coordinates as
Xq = cosh u
x, = sinh ? cos v,
Xp = sinh u sin v-. cos Vp
?. , = sinh u sin v, ... sin v. 2 ,
k-1 + and an invariant measure ? on ? , relative to the action of SO (l,k-l), is
specified by
k-2 k-3 dy = sinh u sin v, ... sin v. - dudv, ... dv. 2- (3.1)
The hyperboloid model function, relative to the invariant measure
(3.1) on Hk"\ is
?(?;?,?) = ak(x)e"x?*x (3.2)
where the parameters ? and ?, called the mean direction and the precision,
k-1 satisfy ?e? and ?>0, and where
ak(x) =
??</2-1/{(2p),</2-12?|</2.1(?)} (3.3)
with K. i2 ? ? Bessel function.
For any fixed ?, the hyperboloid distributions (3.2) constitute a
transformation model under the action of S0f(l,k-1), and the induced action on
the parameter space is (?,?) -> ??* (vector-matrix multiplication). The isotropy
group ? of the element ? = (1,0,...,0) may be identified with SO(k-l). Further-
more, S0f(l,k-1) can be factored as
S0*(l,k-1) = HK = H SO(k-l)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
122 O. E. Barndorff-Nielsen
where the matrix representation of ??e? is
h =
1 + l+xr
xlx2 1+??
Vl
X2X1 1+Xn
Xk-lXl 1+Xn
1 +
Vl
xlxk-l ?+??
X2Xk-l l+xr l+xr
xk-lx2 1+Xn
.k-1
1 + Ak-1 1+Xn
(3.4)
for ? = (xQ,x..,... ,?. , ) varying over ?
" . In relativity theory a Lorentz
transformation of the type (3.4) is termed a "pure Lorentz transformation" or
a "boost." (It may be noted that S0f(l,k-1) can equally be factored as KH with
the same ? and H as above.)
We have already mentioned the concept of equivariance of a mapping
from X_ onto G. More generally, if s is a mapping of X onto a space S and if
s(x) = s(x') implies s(gx) = s(gx') for ?,?'e?^ and all geG then s is said to be
equivariant. In this case we may define an action of G on S by gs = s(gx)
for s = s(x) and for any ?e?., and we speak of this as the action induced by s.
In the applications to be discussed later S is typically the parameter domain
under some parametrization of the model and s is the maximum likelihood estima-
tor, which is automatically equivariant.
We are now ready to state the results which constitute the main
tools of the theory of transformation models.
Subject to mild topological regularity conditions (for details, see
Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982)) we have
Lemma 3.1. Let u be an invariant statistic with range space U =
uOO, let s be an equivariant statistic with range space S = sQO, and assume
that the induced action of G on S is transitive. Furthermore, let y be
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 123
invariant measure on IL Then, we have (s,u)QO = S ? U and
(S,u)y = v x ?
where ? is an invariant measure on S and ? is some measure on U.
Suppose r, s and t are statistics on X^ (in general vector-valued).
The symbol rx s|t is used to indicate that r and s are conditionally indepen-
dent given t.
Theorem 3.1. Let the notations and assumptions be as in lemma 3.1,
and suppose that the transformation model has a model function p(x;g) relative
to an invariant measure ? on X such that p(x) = p(x;e) is of the form
p(x) = q(u)r(s,w) (3.5)
for some functions q and r and some invariant statistic w which is a function
of u.
Then the following conclusions are valid.
(i) The model function p(x;g) is of the form
p(x;g) = q(u)r(g"]s,w), (3.6)
and hence the statistic (s,w) is sufficient.
(ii) We have
s i u|w.
(iii) The invariant statistic u has probability function
p(u) = q(u)/r(s,w)dv(s) <p> (3.7)
(where ? is invariant measure on S).
(iv) The conditional probability function of s given w is
p(s;g|w) = c(w)r(g" s,w) <v> (3.8)
where c(w) is a norming constant.
It should be noted that the theorem covers the case where no suffi-
cient reduction is available (take q constant and w = u) as well as the case
where s - typically the maximum likelihood estimator - is sufficient (take w
degenerate). Note also that theorem 3.1 does not assume that the action of G
is free. If, however, the action is free and if (z,u) is an orbital decompo-
sition of ? then the theorem applies with s = z.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
124 O. E. Barndorff-Nielsen
Example 3.2. Hyperboloid model (continued). Let x-,,...,? be a
sample from the hyperboloid distribution (3.2) and let ? = (?,,...,? ) and
x+ = x,+ ... +x . Considering ? as fixed, theorem 3.1 applies with u as the
maximal invariant statistic, s = x+// x+*x+ and w = / x+*x+ . In particular,
it turns out that the conditional distribution of s given w (or, equivalently,
given u) is again a hyperboloid distribution, with mean direction ? and pre-
cision wx. This is in complete analogy with the von Mises-Fisher situation,
and accordingly s and w are termed the mean direction and the resultant length
of the sample. For details and further results see Jensen (1981) and Barndorff-
Nielsen, Blaesild, Jensen and Jorgensen (1982).
Lemma 3.1 and theorem 3.1 are formulated in terms of invariant
dominating measures on X^ and S. In applications, however, the probability func-
tions are ordinarily expressed relative to Lebesgue measure - or, more general-
ly, relative to geometric measure when the underlying space is a differentiable
manifold. It is therefore important to have a formula which gives the relation
between the two types of dominating measure.
Let ? be an action of G on a space ? and suppose Y_ has constant
orbit type under this action. Then there exists a subgroup ? of G, a subset ?
of G and an orbital decomposition (z,u) of ye? such that G = ? and ?e? for
every y. We assume that ? can be chosen so that HK constitutes a (left)
factorization of G. If ? is a differentiable manifold and if ? acts differen-
ti ably on ? then an invariant measure y on ? can typically be constructed from
geometric measure ? on _Y, by means of Jacobians. In particular, if ?_ is an
open subset of some Euclidean space Rr, so that ? is Lebesgue measure, then
y defined by
dy(y) = Jy{z)(u)']<lx(y)
(3.9)
will be invariant; here J / % denotes the Jacobian determinant of the mapping
y(g) of ? onto itself. A proof of this is sketched in appendix 1.
Example 3.3. Hyperboloid model (continued). We show here how the
k-1 invariant measure (3.1) on the unit hyperboloid H may be derived from
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 125
Lebesgue measure. For simplicity, suppose k = 3. The manifold H2 is in one-
to-one smooth correspondence with R through the mapping
2 2 ?? -> R?
F:
(x0,xrx2) ->
(xrx2)
2 * and we start by finding an invariant measure on R . The action of SO (1,2) on
2 9 ? is given by (U,x) -> xU* and the induced action on R is therefore of the
form (U,y) + f(f~ (y)U*). These actions are transitive, and if we take
u = (0,0) as the orbit representative of R and let ? be the boost
1 + yly2
1+Vn i+y,
y2yl 1 +
0
A
(3.10)
i+yf
y 2 2
1 + y-. + y2? then (u,z) constitutes an orbital decomposition of
2 yeR of the type required for the use of formula (3.9). Letting ? denote the
2 / 2 2~~ action of SO (1,2) on R one finds that J'(z\(u) =^ 1 + Y-i +
Y2 and hence the
measure
dy(y) ?y}?y2
2 is an invariant measure on R . Shifting to hyperbolic-spherical coordinates
(u,v) for (y-j,y?) this measure is transformed to (3.1) with k = 3.
Below and in sections 4 and 5 we shall draw several important con-
clusions from lemma 3.1 and theorem 3.1. Various other applications may be
found in Barndorff-Nielsen, Blaesild, Jensen and Jorgensen (1982).
Corollary 3.1. Let G = HK be a left factorization of G such that
? is the isotropy group of p. Thus the likelihood function depends on g through
h only. Suppose theorem 3.1 applies with S = H and let L(h) = L(h;x) be any
version of the likelihood function. Then, the conditional probability
function of s given w may be expressed in terms of the likelihood function as
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
126 O. E. Barndorff-Nielsen
p(s;h|w) = c(w)y|j
<v> . (3.11)
In formula (3.11) the likelihood function changes with the value of
s. However, an alternative expression for the conditional probability function
is available which employs only the single observed likelihood function. Sup-
pose for simplicity that ? consists of the identity element alone, so that
S = G. Further, let xQ denote the observed point in X^ and write Lf?(g) for
L(g;xQ). Also, for specificity, let the action of G on S = G be the so called
left action of G on itself, i.e. a geG acts on a point $e$ simply by multiply-
ing s on the left by g, in the group theoretic sense. (Thus, the two possible
interpretations of the symbol gs coincide). The situation here specified
occurs, in particular, if the action of G on X is free and if s is the group
component of an orbital decomposition of x. Setting sQ =
s(xQ) and wQ =
w(xQ),
we are interested in the conditional distribution of s given w = wQ
and by
(3.6) and (3.11) this may be written as
L0(s0rl9) P(s;g|w0)
= c(w0)?T-(i-)-
<a> ,
the invariant measure being denoted here by a, as a standard notation for left
invariant measure on G. This formula, which generalizes a similar
expression for the location-scale model due to Fisher (1934), shows how the
"shape and position" of the conditional distribution of s is simply determined
by the observed likelihood function and the observed sQ, respectively.
Formula (3.11), however, besides being slightly more general, seems
more directly applicable in practice.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
4. TRANSFORMATIONAL SUBMODELS
Let M be a transformation model with acting group G. If Pn is any
of the probability measures in M and if GQ
is a subgroup of G then P^ =
{gP0:9eGn* de'f'1'nes a transformation submodel M~ of M. For a given GQ the col-
lection of such submodels typically constitutes a foliation of M.
Suppose G is a Lie group, as is usually the case. The one-parameter
subgroups of G are then in one-to-one correspondence with TG , the tangent
space of G at the identity element e, and this in turn is in one-to-one corre-
spondence with the Lie algebra ? of left invariant vector fields on G. More
generally, each subalgebra h of the Lie algebra of G determines a connected
subgroup H of G whose Lie algebra is h (cf., for instance, Boothby (1975) chap-
ter 4, theorem 8.7). If ?e?? , the one-parameter subgroup of G determined by
A is of the form {exp(tA)^R}. In general, the subgroup of G determined
by r linearly independent elements A,,...,A. of TG may be represented as
exp?^A^.^exp?t A }.
Example 4.1. Let M be a location-scale model,
? ? ?(??,...,??;?,s)
= s"? ? f(s~ (? -y)). (4.1) 1 ? 1=1 ?
Here G is the affine group with elements l\i9o~\ which may be represented by
2?2 matrices
1 0
? s
the group operation being then ordinary matrix multiplication. The Lie algebra
of G, or equivalently TG , is represented as the set of 2 ? 2 matrices of the
127
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
128 O. E. Barndorff-Nielsen
form
A = 0 0
b a , a^R.
We have
etA = I + tA + ~|- t2A2 +..,
b/a(eta-l) eta
where the last expression is to be interpreted in the limiting sense if a = 0.
There are therefore four different types of submodels. Specifical-
ly, letting Uq^q) denote an arbitrary value of (m9o) and taking PQ as the
corresponding measure (4.1) we have
(i) If a = 0 then ?~ is a pure location model.
(ii) If a f 0, b = 0 and yQ = 0 then Pq is a pure scale model.
(iii) If a j= 0, b = 0 and ?O f 0 then M~ may be characterized as
the submodel of M for which the coefficient of variation y/s is constant and
equal to Uq/oq.
(iv) If both a and b are different from 0 then P~ may be character-
ized as the submodel ?L? of M for which s~ (y+b/a) is constant and equal to
c0 =
s0 (??+^a)' ???# ?^ we ^et c = ^a ^en Mn 1S determined by
s" (y+c) = CQ. (4.2)
on
Letting F denote the distribution function of f we can express (4.2) as the
condition that (y,a) is such that -c is the F(-c0)-quantile of the distributi
o-]f(o-\x-A).
The above example is prototypical in the sense that G is generally
a subgroup of the general linear group GL(m) for some m and TG may be repre-
sented as a linear subset of the set M(m) of all m ? m matrices.
Example 4.2. Hyperboloid model. The model function of the hyper-
boloid model with k = 3 and a known precision parameter ? may be written as
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 129
?(?,?;?,f) = (2^-Vs?nh u e-X{coshx cosh u"s1nhx sinh u cos(v-+)} (43)
where u > ?, ?e[0,2p) and ? > 0, fe[0,2p). The generating group G = S0f(l;2)
may be represented as the subgroup of GL(3) whose elements are of the form
0
COS4
-sind
0
sin4
COSd
coshx sinhx 0
sinhx coshx 0
0 0 1
1+^
? -?
(4.4)
where -??<?<-??. This determines the so called Iwasa decomposition (cf., for
instance, Barut and Raczka (1980) chapter 3) of S0*(l;2) into the product of
three subgroups, the three factors in (4.4) being the generic elements of the
respective subgroups. It follows that TG is the linear subspace of M(3) gen-
erated by the linearly independent elements
Ei -
r
E3 =
0 1
0 1
-1 0
Each of the three subgroups of the Iwasawa decomposition generates
a transformational foliation of the hyperboloid model given by (4.3), as dis-
cussed in general terms above. In particular, the group determined by the
third factor in (4.4) yields, when applied to the distribution (4.3) with
? = F = 0, the following one-parameter submodel of the hyperbolic model:
?(?,?;?)
2 (2 G^? "x(cosh u"^sinh u e"*5^ ^cosh u~sinh u cos V)"2C sinn u sin v>
The general form of the one-parameter subgroups of SO (1;2) is
expit } ,
where a, b, c are fixed real numbers.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
5. MAXIMUM ESTIMATION AND TRANSFORMATION MODELS
We shall be concerned with those situations in which there exists an
invariant measure y on X that dominates P_9 where P^ = {gP^G} is transformation-
al. Letting
?(x) = P(x;g)
and writing p(x) for p(x;e) we have
p(x;g) = p(g" ?) <p>.
In most cases of interest the model has the following additional structure (pos-
sibly after deletion of a null set from _X , cf. also section 3). There exists
a left factorization G = ?? of G, a K-invariant function f on X_> and an orbit-
al decomposition (h,u) of ? such that:
(i) G = ? for all u and, furthermore, Gp = K. Hence, in particu-
lar, ? may be viewed as the parameter space of the model.
(ii) For every ?e_? the function m(h) = f(h" x) has a unique maximum
on ? and the maximum point is h.
(iii) ? may be viewed as an open subset of some Euclidean space R
and for each fixed ?e?^ the function m is twice continuously differentiable on H
and the matrix * = 'K(h) given by
is positive definite.
In these circumstances we have:
Proposition 5.1. The maximum estimator h is an equivariant mapping
130
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 131
of X. onto ? and the action of G on ? induced by ?? coincides with the natural
action of G on H. Furthermore, if the mapping ? -* (h,u) is proper then there
exists an invariant measure ? on H, and for any fixed u such a measure is given
by
dv(h) = |*|^? (5.1)
where dh indicates the differential of Lebesgue measure on H.
(iii).
Here ? is considered as an open subset of R , in accordance with
'Xj Proof. The equi variance of h follows immediately from (ii). Obvi-
ously, there is a one-to-one correspondence between the family of left cosets
G/K = {gK^G} and H. Let ? be the mapping from G/K to ? which establishes this
correspondence. The natural action ? of G on G/K is given by
G ? G/K ^ G/K
f:
(g,gK) -> ggK
and we have to show that when this action is transferred to ? by ? it coincides
with the action ? of G on ? induced by ?V. In other words, we must verify that
for any geG the diagram
G/K-y H
F(9) j [ ?(9) (5.2)
G/K-y H P
commutes. Let ? be the mapping from G to ? that sends a geG into the uniquely
determined ??e? such that g = hk for some keK. For any ft = ft(x) in H we have
that y(g)?i = ft(gx) is determined by
fUfiitgx)}"1 gx) l fin"1 gx), ?e?. (5.3)
Now, by the K-invariance of f,
fin"1 gx) = f((g-\r\) = fOitg'V'x)
and here n(g h) ranges over all of ? when h ranges over H. Hence (5.3) may be
rewritten as
f??n?rt?gx))}"^} * fUr'x), heH,
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
132 O. E. Barndorff-Nielsen
i.e., by (ii),
or, equivalently,
f?(x) = n(g"^(gx))
R(x)k = rtgxjK
and this, precisely, expresses the commutativity of (5.2), since ? (h) = hK.
When the mapping ? -> (??,?) is proper the subgroup ? is compact
because ? = Gu- Hence there exists an invariant measure on H, cf. appendix 1.
That |1t| dh is such a measure follows from (3.9) and formula (5.10) below.
In particular, then, there is only one action of G on H at play,
namely ?, and
y(g)h = n(gh). (5.4)
Now, let h -> ? be an arbitrary reparametri zation of the model and
let ?t?(?) = m(h(u))) and
*(?) =*(?>;u) = - ~* (?;??). (5.5) s?s?
This matrix is a (0,2) tensor on O.
We shall now show that
-fc(h) = -R(h;u) = J (e)~]\(e9u)? (e)"1. (5.6) Y(h) Y(h)
Here the unit element e is to be thought of as a point in H.
We have
m(h) = f(h"]x) = f?h"1^) = f({n(?V"1h)}'1u)
where, again, we have used the K-invariance of f. Thus, with ? as the projec-
tion mapping defined above we obtain
M?hixi {h) = i?teLl (n(firlh)) ??f?? (h) (5.7)
and
a2m(h;x) ,.* _ 3?(?tG1?) ,hx a2m(h;u) . /fr-lhU 3n(f?"1h)* ah ah* (h)-ah*
' (h) ahah* Mh h)) ah ahah* *"' ah*-~ X"J ~a??ah*~~ ^" "" dh <h>
ah {r]{ri n)) ahah* + Mhiui(n(rlh)) .
*2iCh)M - (5.8)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 133
In these expressions we have, since n(ft~ h) = ?(?G )h, that
Mf^1 (h) - J t (h). (5.9)
y(h )
On inserting ft for h in (5.7), (5.8) and (5.9) (whereby (5.7) becomes 0) and
combining with (2.1) we obtain (5.6).
From (5.6) we may draw two important conclusions.
First, taking determinants we have
\*(h,u)\h =
J?{h)(e)~]\ne;u)\h (5.10)
and this, by (3.9) and the tensorial nature of -K, implies that |*(?)| ?? is an
invariant measure on O. In connection with formula (5.10) it may be noted that
Jy(h)(e) =
J6(h)(e)
where d denotes left action of the group G on itself. A proof of this latter
formula is given in appendix 2.
Secondly, the tensor *(?) is found to be G-invariant, whatever the
value of the ancillary. In fact, by (5.4) we have, for any ?0e? and c^G,
Y(y(g)h)h0 = i(g) ?
?(?)?0?
Consequently
-rr^h^(e) = ?
A^{e) ???^ ?(y(g)h) ?(?) -y{g)
and this together with (5.6) and (2.26) establishes the invariance.
In particular, observed information ^determines a G-invariant
Riemannian metric on the parameter space. The expected information metric i
can also be shown to be G-invariant.
From proposition 5.1 and corollary 3.1 we find
Corollary 5.1. The model function p*(or,u)|u) = c|l<| L/t is exactly
equal to ?(?;?|?).
By taking m of (ii) equal to the log likelihood function 1 this
corollary specializes to theorem 4.1 of Barndorff-Nielsen (1983).
Suppose, in particular, that the model is an exponential transform-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
134 O. E. Barndorff-Nielsen
a ation model. Then the above theory applies with ??(?) = 1(?). The essential
a -1 property to check is that 1(?;?(?)) is of the form f(h x). This follows simply
a from the definition of 1 and theorem 2.1.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
6. OBSERVED GEOMETRIES
In section 2 we briefly reviewed how the parameter space of the
model M may be set up as a manifold with expected information i as Riemannian
metric tensor and with an associated family of affine connections, the a-con-
nections (2.30). We shall now discuss a similar type of geometries on the
parameter space, related to observed information and depending on the choice of
the auxiliary statistic a which together with the maximum likelihood estimator
? constitutes a minimal sufficient statistic for M. These latter geometries
are termed observed geometries (Barndorff-Neilsen, 1986a). In applications to
statistical inference questions it will usually be appropriate to take a to
be ancillary but a great part of what we shall discuss does not require dis-
tribution constancy of a and, unless explicitly stated otherwise, the auxil-
iary a is considered arbitrary (except for the implicit smoothness properties).
Let an auxiliary a be chosen. We may now take partial derivatives
of 1 = l(?>;u),a) with respect to the coordinates ? of ? as well as with respect
to ?G. Letting ? = 3/3?G we introduce the notation
1 = a a a a 1 (6.1 ) rr..Vsr..sq rr.. rpsr.. sq
and refer to these quantities as mixed derivatives of the log model function.
The function of ? and a obtained from (6.1) by substituting ? for ? will be
denoted by * . Thus, for instance, rr..rp,sr..sq
*rs;t =
*rs;t(w) =
*rs;tUa) =
?G5;?(?;?'?)?
More generally, for any combinant g of the form g(ar,u),a) we write
135
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
136 O. E. Barndorff-Nielsen
-f =-?K?)-,a) = g(a>;?),a).
This is in consistency with the notation $ introduced by (2.6). The observed
geometries, to be discussed, are expressed in terms of the mixed derivatives
*r r -s s ? (6'2)
rr..rp,sr..sq
So are the terms of an asymptotic expansion of (2.7), cf. section 7.
Given the observed value of a the observed information tensor 3-, of
(2.6), defines the parameter space of M as a Riemannian manifold. The Rieman-
?t ?t man connection determined by ^ has connection symbols $* given by &* =
.tu? a- *rstand
?rst =
*<Vst -
3As +
Vr^
Employing the notation established above we have d.?r = -?+_+ -*-.+> etc.
u G? rit F5)t
so that
1st =
*rs;t -
^Pst +
W3])? ^
As we shall now show, the quantity
*rst =
-(*rst+*rs;t[3]) (6?4)
is a covariant tensor of rank 3, i.e.
? - -f ,.4.^/ ?, ?, . (6.5) ?st rst /? /a /t
First, from (2.14) we have
^ ~ ^+?/ ?/ ?/ + ^?^? ?/ [3]. (6.6) ?st rst /? /s /t rs /oo /t1" J ? '
Further, from (2.13) we obtain, on differentiating with respect to ?t and then
substituting parameter for estimate,
\ . - ^^.4-?, ?, ?, + ?^.+U/ ?, . (6.7) ?s;t rs;t /? /s /t r;t /?s /t ? '
Finally, differentiating the likelihood equation
*r = ?
we find
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 137
or
*rs +
*r;s = ? <6?8)
*r;s =
*rs? <6'9)
Combination of (6.4), (6.6), (6.7) and (6.9) yields (6.5).
It follows from the tensorial nature of ? and from (6.3) and (6.9) a
that for any real a an affine connection ? on M may be defined by
at __ .tu a
prs ' * ?rsu
with
'rst-WT^rsr {6J0)
In particular, we have 1 -1
*rst =
*rs;t ' *rst
= V,rs
^^
where to obtain the latter expression we have used
rst rs;t rt;s r;st
which follows on differentiation of (6.8). It may also be noted that
1-11-1
and
3t*rs "
*rts +
*str "
^str +
^rts
a ,, 1 -, -1 ? = J+2L ? + L?* ? *rst 2 *rst 2 ^rsf
a The connections -f, which we shall refer to as the observed a-con-
a
nections, are analogues of the expected a-connections r given by (2.30). The
a a
analogy between r and jp becomes more apparent by rewriting the skewness tensor
(2.29) as
Vst=-E{1rst +
VtC3^
the validity of which follows on differentiation of the formula
E{lrs +
lry = 0, (6.12)
which, in turn, may be compared to (6.8).
Under the specifications of a of primary statistical interest one
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
138 O. E. Barndorff-Nielsen
has that, in broad generality, the observed geometries converge to the corre-
sponding expected geometries as the sample size tends to infinity.
For (k,k) exponential models
?(?;?) = a(e)b(x)e6't(x) (6.13)
no auxiliary statistic is involved since ? is minimal sufficient, and we find a a
j- = i and F = r, aeR.
Let i,j,k,... be indices for the coordinates of ?, t and t, using
upper indices for ? and lower indices for t and t.
In the case of a curved exponential model (2.35), we have
lr =
(t-x).ejr (6.14)
and, letting ? denote the maximum likelihood estimator of ? under the full model
generated by (2,35), the relation + = j* takes the form r, s rs
W") =
KiJ(6)9/r^/s
Furthermore,
*rstW '
-*1jk<e>e/re/se/t -
Kij<e)e/re/st[3] +
(*-Vjrst> <6-16>
WU) =
KiJ(e)e/rs*/t=4Vst (6?17)
and
^rs-^J^/t^rs-'rsf (6J8)
It is also to be noted that, under mild regularity conditions, the quantities
?r and ^possess asymptotic expansions the first terms of which are given by
and
2- = ? >st rst {^ke/rse/te/x^
+ V/rse/tAC33
+ V/rst6/^*???' (6?20>
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 139
where a , ? = l,...,k-d, are the coordinates of the auxiliary statistic a. For
instance, in the repeated sampling situation and letting aQ denote the affine
ancillary, as defined in Barndorff-Nielsen (1980), we may take a = ? a and
the expansions (6.19) and (6.20) are asymptotic in powers of ? . (For further
comparison with Amari (1982a) it may be noted that the coefficient in the first e e
order correction term of (6.19) may be written as ??.??.?.. = nH where ? ?? /rs /? ij rsA rsA
is Amari's notation for the exponential curvature, or a-curvature with a = 1, of
the curved exponential model viewed as a manifold imbedded in the full (k,k)
model. )
For a transformation model we find
lr(h;x) =
1G,(?(??);?)?(??)^
(cf. the more general formula (5.7)) and hence
+ W<esu>?s +<Kr "?s*
<6-22>
where, for 3 = 3/3hr and a = 3/3hr,
?? =
3snr(h_1h),
so that
while
<? = ?J (e)"1)., (6.23) S "?(h)
rs
Bst ?
V/(fi"lh?
?;t "
^sVr(h_1h)
B;st "
V/(fi"lh'?
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
140 O. E. Barndorff-Nielsen
Furthermore, to write the coefficients of 1 , ,.,(e;u) in (6.21) and (6.22) as
indicated we have used the relation
^/(h^hiL =
-3?G(^)| A . (6.24) s h=h
s h=h
Formula (6.24) is proved in appendix 3.
We now briefly consider four examples. In the first three the
model is transformational and the auxiliary statistic a is taken to be the max-
imal invariant statistic, and thus a is exactly ancillary. In the fourth ex-
ample a is only approximately ancillary. Examples 6.1, 6.3 and 6.4 concern
curved exponential models whereas the model in example 6.2 - the location-scale
model - is exponential only if the error distribution is normal.
Example 6.1. Constant normal fractile. For known ae(?,?) and
ce(-oo,oo)5 let ? denote the class of normal distributions having the real ?a,C
number c as a-fractile, i.e.
? . = {?(?,s2):(?-?)/s = U }, ?a,C a
where u denotes the a-fractile of the standard normal distribution, and let a
x, ,...,x be a sample from a distribution in ? . The model for x = (x,,... ,xi ? ? ?a,c ? ?
thus defined is a (2,1) exponential model, except for u = 0 when it is a (1,1)
model. Henceforth we suppose that u f 0, i.e. a f ^ The model is also a
transformation model relative to the subgroup G of the group of one-dimensional
affine transformations given by
G = {[c(l - ?),?]:?>0},
the group operation being
[c(l - x),x][c(l - ?'),??] = [c(l - ??'),??']
and the action of G on the sample space being
[c(l - x),x](xr...,xn)
= (c(l - ?) + xxr...,c(l
- ?) + ???).
(Note that G is isomorphic to the multiplicative group.)
Letting
a = (x - c)/s\
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 141
where ? = (??, +...+ xn)/n and
s"2 = 1 ? (x, - x)2. ? i=1
t
we have that a is maximal invariant and, parametrizing the model by ? = log s,
that the maximum likelihood estimate is
? = log(bs')
where
b = b(a) = (u /2)a + /l + {(u /2)2 + l}a2. a a
Furthermore, (?,a) is a one-to-one transformation of the minimal sufficient
statistic (x,s*) and a is exactly ancillary.
The log likelihood function may be written as
1(?) = lU;E,a) = ?[? -?- h{b2e2[^] + (ua
+ ab'V^)2}]
from which it is evident that the model for ? given a is a location model.
Indicating differentiation with respect to ? and ? by subscripts ?
and ?, respectively, we find
1 = ?{-1 + ?"2e2(?~?) + ab-1(u + ab~V"^)e^} ? a '
and hence
2r = n{2b"2 + ab"](u + 2ab-1)} a
* = n{4b"2 + ab"](ua
+ 4ab-1)}
? - = -n{4b~2 + ab_1(u + 4ab~])} = *
-9 -1 -1 "? ]
3c - = n{4b ? + ab '(u + 4ab
' )} = -p= --F
?',?? a
and the observed skewness tensor is
Jc = n{8b"2 + 2ab"1(u + 4ab-1)}. a
Note also that a 1
We mention in passing that another normal submodel, that specified
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
142 0. E. Barndorff-Nielsen
by a known coefficient of variation ?/s, has properties similar to those ex-
hibited by example 6.1.
Example 6.2. Location-scale model. Let data ? consist of a sample
x,,...,x from a location-scale model, i.e. the model function is
?(?;?,s) = s n
? fM?) 1=1
s
for some known probability density function f. We assume that {x:f(x)>0} is an
open interval and that g = -log f has a positive and continuous second order
derivative on that interval. This ensures that the maximum likelihood estimate
(?,s) exists uniquely with probability 1 (cf., for instance, Burridge (1981)).
Taking as the auxiliary a Fisher's configuration statistic
a = (a19...,an)
= ( ?^?
? -? ? ),
which is an exact ancillary, we find
3-(?5s) = s
and, in an obvious notation,
-2 Eg" (a ) zag? (a)
za g"(a ) n+sa g"(a )
Jr = ^~3Eg,M(a.) ???? * 1
Jc = -a"3za.g,M(a.) ??,s Ia l'
^?s,? = "s
"WU^^g'"^)}
^?s,s =
-s"3{2^9???) +
S???9,?(???)>
^,y =
-"3{4Eai9''(ai) +
^,M(ai)>
* = -s ss,s
3{2? + Azaria.)
+ zajg,,,(a1)}
* = a~3zg"'(a,)
-3 * =
a-^g"^.) +
S3?.9"'(3.)}
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 143
*?ss =
^""3?4zaig"(ai) + S329"'(?.)}
^sss = s~3{4? +
^F'^) + ^3g"
? (a. )>.
Furthermore,
Jr = 2s~3* ((0,l);a) ??? ???
Jr = -2s~3? ((0,l);a) + 2s~3? ((0,l);a) ??s ??? ??s??
' '
J = -4s"33- ((0,l);a) + 2s"3+ ((0,l);a) ss? ??s ss? '
* = -6s~3^ ((0,l);a) + 2s~3* ((0,l);a). sss ss sss
Example 6.3. Hyperboloid model. Let (u,,?-j),... ,(u ,? ) be a
sample from the hyperboloid distribution (4.3) and suppose the precision ? is
known. The resultant length is
2 2 9 \ a = {(? cosh ???)
- (? sinh u^ cos v..) - (? sinh u. sin v.) }
and a is maximal invariant after minimal sufficient reduction. Furthermore,
the maximum likelihood estimate (?,?) of (?,?) exists uniquely, with probabil-
ity 1, (a,?,f) is minimal sufficient and the conditional distribution of (?,?)
given the ancillary a is again hyperboloidic, as in (4.3) but with u, ? and ?
replaced by ?, ? and ax. It follows that the log likelihood function is
1(?.F) = Hx?<l>;x.?.a) = -ax?coshx coshx - sinhx sinhx cos($-<|>)}
and hence
a a a a -F =-?=?.= -F . . . = 0
??? ??F ?F? FFF
a ? ?? = ax cosh ? sinh ?
?ff
a -f = -ax cosh ? sinh ?,
FF?
whatever the value of a. Thus, in this case, the a-geometries are identical.
We note again that whereas the auxiliary statistic a is taken so
as to be ancillary in the various examples discussed here - exactly distribu-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
144 O. E. Barndorff-Nielsen
ti on constant in the three examples above and asymptotically distribution con-
stant in the one to follow - ancillarity is no prerequisite for the general
theory of observed geometries.
Furthermore, let a be any statistic which depends on the minimal
sufficient statistic t, say, only and suppose that the mapping from t to (?,a)
is defined and one-to-one on some subset T~ of the full range ? of values of t
though not, perhaps, on all of T. We can then endow the model M with observed
geometries, in the manner described above, for values of t in T?. The
next example illustrates this point.
The above considerations allow us to deal with questions of non-
uniqueness and nonexistence of maximum likelihood estimates and nonexistence of
exact ancillaries, especially in asymptotic considerations.
Example 6.4. Inverse Gaussian - Gaussian model. Let x(?) and y(?)
be independent Brownian motions with a common diffusion coefficient s = 1 and
drift coefficients ?>0 and ?, respectively. We observe the process x(?) till it
first hits a level x~>0 and at the time u when this happens we record the value
? = y(u) of the second process. The joint distribution of u and ? is then
given by p(u,v;y,c)
- <2,rVV?-V*?t,2)""V"A*?-*\ ,6.25,
Suppose that (u, ,v, ),... ,(u ,v ) is a sample from the distribution
(6.25) and let t = (u,v) where ? and ? are the arithmetic means of the observa-
tions. Then t is minimal sufficient and follows a distribution similar to
(6.25), specifically ?(?,?;?,?)
, ???? 9 -
?(x2+v2)?-1 -
? ?2?+??? - ??2? =
(2p)"'?0?6 ?
?T2e 2 ?
e 2 2 . (6.26)
Now, assume ? equal to ?. The model (6.26) is then a (2,1) exponential model,
still with t as minimal sufficient statistic. The maximum likelihood estimate
of ? is undefined if t^T^ where
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 145
?f = it =
(?,v):x0 + ? > 0}
whereas for tej^, ? exists uniquely and is given by
-1 ? =
^(x0 + ?) ? . (6.27)
The event t$Tg happens with a probability that decreases exponentially fast with
the sample size ? and may therefore be ignored for most statistical purposes.
Defining, formally, ? to be given by (6.27) even for t$Tg
and let-
ting
a = F~(?;2??2,2 ??2),
where f"(?;?,?) denotes the distribution function of the inverse Gaussian dis-
tribution with density function
F-(?;?,?) = (Zw)-^ e^* x"3/2 e-,,(xx"1+*x) (6.28)
we have that the mapping t -> (?,a) is one-to-one from ? = it = (?,?):?>0> onto
(-??,+??) ? (0,?>) and that a is asymptotically ancillary and has the property
that p*(y ;y|a) =c | j | L approximates the actual conditional density of ? given
a to order 0(n"3/2), cf. Barndorff-Nielsen (1984).
Letting F_(?;?>?) denote the inverse function of F~(?;?>?) we may
write the log likelihood function for ? as
1(?) = l(y;?,a)
- 2 = n{(x0
+ ?)? - ?? }
= ?F (a;2nx2,2n{?2) {2??-?2} (6.29)
From this we find
so that
??
2 ~9 1 = -2?F (a;2n ??>2?? )
?? - U
2 2 ? =
2?F_(a;2??^ ,2?? )
* =0 ???
and
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
14 6 O. E. Barndorff-Nielsen
* : = 8?2?(?" ? F /o")(a;2nx2 2??2) ??,? - ? ?
= S = -h ? ??? ???
where f" denotes the derivative of f"(?;?,?) with respect to ?. By the well-
known result (Shuster (1968))
f-(?;?,?) = F(f? - x\'h) + e2^(-foV + xV*)),
where f is the distribution function of the standard normal distribution, f" ?
could be expressed in terms of F and ? = f'.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
EXPANSION OF c|j(^L
We shall derive an asymptotic expansion of (2.7), by Taylor expan-
sion of c|j| L in ? around ?, for fixed value of the auxiliary a. The various
terms of this expansion are given by mixed derivatives (cf. (6.2)) of the log
model function. It should be noted that for arbitrary choice of the auxiliary
statistic a the quantity c|j|C constitutes a probability (density) function on
the domain of variation of ? and the expansions below are valid. However,
c|j|L furnishes an approximation to the actual conditional distribution of ?
given a, as discussed in section 2, only for suitable ancillary specification
of a.
To expand c|j| L in ? around ? we first write L as exp{l-l} and
expand 1 in ? around ?. By Taylor's formula,
? r r 1-1= S -L (?-?) ^..(?-?) v(d 3 1)(?)
v=2 V?
rl rv
whence, expanding each of the terms (a ...3^ 1)(?) around ?, rl rv
1-1
f ? \v ri r (-1) /A \ 1 t \ ?
? ? ?-?) ...(?-?) v=2
= ? -r
? S ?-(?-?)5?...(<:-?)5? 3S ...3S \ ...f. (7.1) p=0 1 ? 1 ?
Consequently, writing d for ?-? and d ""'
for (?-?) (?-?) ..., we have
147
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
148 O. E. Barndorff-Nielsen
i-?-V%s +
wrst(*rs.t +
?*rst)
+ a?rStU^rs;tu+8i-rst;u
+ 3*rstu)
+ ??? ? <7?2>
L? Next, we wish to expand log{ | j |/ \?r\} in ? around ?. To do this we observe
that if A is a d ? d matrix whose elements a depend on ? then
3tlog|A| =
|A|"1at|A|
sr a
Vrs
rs where a denotes the (r,s)-element of the inverse of A. Furthermore, using
3tars = -arvawVa , t U t vw
which is obtained by differentiating a aus = 6S with respect to ? and solving
for ars, we find
3.3 log IA i = -avrasw3 a 3.a + asr3.3 a . t ? a| ' u vw t rs t ? rs
It follows that
logi|j|/M}*--wVVrst+*rs;t)
-?tu{/s(+rstu++rst.u++rsu.t++rs;tu)
+ irVXst^rs;t)(+vwu+J-vw.u)H...
- (7.3)
By means of (7.2) and (7.3) we therefore find
clJl^L = (2p)?/2?Fa(?-?;3-){1
+ A]
+ A2
+ ...} (7.4)
where F?(?;^) denotes the density function of the d-dimensional normal distribu-
tion with mean 0 and precision (i.e. inverse variance-covariance matrix) ?- and
where
A! =
-wV^rsit +
W +
?"*<+?* +
! W (7"5)
and
A2 =
? [- 36tu{2/s(+rstu +
+rst;u +
*rsu;t +
*rsstu)
+ (2/Vw - ?rVwmrs;t
+ W^w;u ^vwu)i
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 149
+ *rStU{(3*rstu
+ 8*rst;u
+ 6*rs;tu>
-^VXw;u++vwu)^rs;t +
!+rst)>
+ 36rstuvw(* . +4+ J(+ + U )], (7.6) v rs;t 3 rst/v uv;w 3 uvw/J' v-w
A-j and A2 being of order O(rf^) and 0(n" ), respectively, under ordinary repeat-
ed sampling.
By integration of (7.4) with respect to ? we obtain
(2p)a/2? = 1 + C,
+ ... , (7.7)
where C|
is obtained from Ap by changing the sign of A? and making the sub-
stitutions
_rs .rs d + a-
xrstu ^rs.tur^n d + a- a- [3]
rrstuvw .rs.tu.vwricn d + 3r 3- 3- [15],
the 3 and 15 terms in the two latter expressions being obtained by appropriate
permutations of the indices (thus, for example, <srstu -> jTs? u + j- aSU +
.ru.stx dr a- ).
Combination of (7.4) and (7.7) finally yields
c|j|^L = f(?-?;*){1 + A1
+ (A2+C1)
+ ...} (7.8)
-3/2 with an error term which in wide generality is of order 0(n ) under repeated
sampling. In comparison with an Edgeworth expansion it may be noted that the
expansion (7.8) is in terms of mixed derivatives of the log model function,
rather than in terms of cumulants, and that the error of (7.8) is relative,
rather than absolute.
In particular, under repeated sampling and if the auxiliary statis-
tic is (approximately or exactly) ancillary such that
?(?;?|a) = p*(?;u)|a){l + 0(n'3/2)}
(cf. section 2) we generally have
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
150 O. E. Barndorff-Nielsen
?(?;?|?) = F?(?-?;*){1
+ ?]
+ (?2
+ C])
+ 0(?"3/2)}. (7.9)
For one-parameter models, i.e. for d = 1, the expansion (7.8) with
A,, A2 and C, as given above reduces to the expansion (2.9). In Barndorff-
-3/2 Nielsen and Cox (1984) a relation valid to order 0(n
' ) was established, for
general d, between the norming constant c of (2.7) and the Bartlett adjustment
factors for likelihood ratio tests of hypotheses about ?. By means of this rel-
ation such adjustment factors may be simply calculated from the above expression
for Zy
Example 7.1. Suppose M is a (k,k) exponential model with model
function (6.13). Then the expression for C-. takes the form
r _ 1 ,0 rs tu (0 ru sv tw , 0 rs tu vwx, Cl
" 24 {3KrstuK
K " KrstKuvw(2K
? ? + 3? K K )}
where, for 3 = 3/3?G and ?(?) = -log a(e),
Vs... =
Vs ??? ?(?)
and where ?rs is the inverse matrix of ? .
From (7.8) we find the following expansion for the mean value of ?:
?? =? +??+?0+... ? ? c
? 1 ? 9 where ?? is of order 0(?" ), ?? is of order 0(n" ), and
y-, - -W a- +r;st
- -h* 3r -Fstr. (7.10)
Hence, from (7.8) and writing d1 for d-?,,
f'?^? = f?(?
-?- \iy ?r) ? +
(?] -
^?-^?*) + ...}
= F?(?
- ? - ??;?){1
+ ?irst(?' ;?)(*rs;t +
f *rst) + ..?>? (7.?)
-1 rT,,rn where the error term is of order 0(n~ ) and where h (?%&) denotes the
tensorial Hermite polynomial (as defined by Amari and Kumon (1983)), relative
write
-1/3
to the tensor ?r . Using (6.10) we may rewrite the last quantity in (7.11) as
+rs;t+f*rst-^rst+^st (7J2)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
where
Since
we find
Differential and Integral Geometry in Statistical Inference 151
*Let =
k*r.:t -
*(*rt:? + h*..J)? (7.13) rst 3 rs;t v rt;s st;r;
?rs^.t (d1;*) = ?'VVL - ?rVL[3] (7.14)
hrst(6';mrst
and hence (7.11) reduces to . , ? r-t -1/3
c|j|\ = f?(?
- ? - ?-????
- ^G^(d';j-) ^ + ...}, (7.15)
the error term being 0(n" ).
Suppose, in particular, that the model is an exponential (k,d)
model. We may then compare (7.15) with the Edgeworth expansion for an effi-
cient, bias adjusted estimate of ? given an ancillary statistic, provided by
formulas (3.33) and (3.25) in Amari and Kumon (1983). It appears that hrst "1/3 -1/3 abr
(d% \ir) ?? t
of (7.15) is the counterpart of Amari and Kumon's rabch -
e ab \c m a ?-? H.u h h + H , hh . Thus (7.15) offers some simplification over the cor- abK KXa r
responding expression provided by the Amari and Kumon paper.
Note that, again by the symmetry of (7.14), if
-1/3
*rst[3] = 0 (7.16)
for all r,s,t then the first order correction term in (7.15) is 0. Further- a
more, for any one-parameter model M the quantity -F with a = -1/3, can be made
to vanish by choosing that parametrization for which ? is the geodesic coordin-
ate for the -1/3 observed conditional connection. (Note that generally this
parametrization will depend on the value of the ancillary a.) An analogous
result holds for the Edgeworth expansion derived by Amari and Kumon (1983),
referred to above. The parametrization making the a = -1/3 expected connection a r vanish has the interpretation of a skewness reducing parametrization, cf.
Kass (1984).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
8. EXPONENTIAL TRANSFORMATION MODELS
Suppose M is an exponential transformation model and that the full
exponential model M generated by M is regular. By theorem 2.1 the group G acts
affinely on ? = t(?), and Lebesgue measure on ? is quasi-invariant (in fact,
relatively invariant) with multiplier |A(g)|. Assuming, furthermore, that M
and G have the structure discussed in section 3 with {g:|A(g)| = 1} <= ? we find,
since the mapping g -> A(g) is a representation of G, that
|A(h(gx))| = |A(g)||A(h(x))|.
Thus m(x) = |A(fi)| is a modulator and
dv(h) = |A(h) |"????? (8.1)
is an invariant measure on H (cf. appendix 1).
Again by theorem 2.1 the log likelihood function is of the form
1(h) = {?(?)?(?G???* + ?(n_1h)}.w - ?(?(?)?(?~??)* + &(?"*??)) (8.2)
where w = t(u) = h" t.
Some interesting special cases are
(i) ?(?) or Bf(.) or both are 0. Then d(?) of (2.45) is a multi-
plier (i.e. a homomorphism of G into (R+,?))? Furthermore, if &(?) = 0 and if
(2.35) is an exponential representation of M relative to an invariant dominat-
ing measure on X. then b(x) is a modulator.
(ii) The norming constant a(e(g)) does not depend on g. If in
addition B(g) does not depend on g, which implies that B(?) = 0, then the con-
ditional distribution of h given w is, on account of the exactness of (2.7),
152
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 153
p(h;h|w) = c(w)|j|^ ee(h" h)'w
(8.3)
where the norming constant does not depend on h.
Note that the form (8.3) is preserved under repeated sampling, i.e.
the conditional distribution of h is of the same "type" whatever the sample
size.
The von Mises-Fisher model for directional data with fixed precision
has this structure with w equal to the resultant length r, and as is well-
known the conditional model given r is also of this type, irrespective of
sample size. Other examples are provided by the hyperboloid model with fixed
precision and by the class or r-dimensional normal distributions with mean 0
and precision ? such that |d| = 1.
(iii) M is a (k,k-l) model.
For simplicity we now assume that M has all the above-mentioned
properties. There is then little further restriction in supposing that M is of
the form
?(?,?) = bWexp?-axe^h^hr^e^} (8.4)
where ? is the index parameter, a is maximal invariant and e, and e_, are
known nonrandom vectors. For (8.4) the log likelihood function is
1(h) = -axe^?f^e^ (8.5)
- _i* where we have written A for A . Hence
rrs =
ax(3t3u?ij)(e)elie_ljAj(h)^(h) (8.6)
where ? is given by (6.23). In this case, then, the conditional observed
ot geometries (<r(e;x,a),.F(-;A,a)) are all "proportional" for fixed a, with ax as
the proportionality factor. The geometric leaves of the foliation of M, deter-
mined as the partition of M generated by the index parameter x, are thus highly
similar. In this connection see example 6.3.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
APPENDIX 1
Construction of invariant measures
One may usefully generalize the concepts of invariant and relatively
invariant measures as follows. Let a measure y on X be called quasi-invariant
W1'th multiplier ? = x(g,x) if g? and ? are mutually absolutely continuous for
every geG and if
d(g" y)(x) = x(g?x)dy(x).
Furthermore, define a function m on X to be a modulator with associated multi-
plier x(g,x) if m is positive and
m(gx) = x(g,x)m(x). (Al.l)
Then, if ?? is quasi-invariant with multiplier x(g,x) and if m is a modulator
satisfying (Al.l) we have that
? = m~V (Al.2)
is an invariant measure on L?
In particular, to verify that the measure ? defined by (3.9) is
invariant one just has to show that m(y) = J (z\(u) is a modulator with associ-
ated multiplier J /a\(y) because, by the standard theorem on transformation of
integrals, Lebesgue measure x is quasi-invariant with multiplier J / \(y).
Corresponding to the factorization G = HK there are unique factorizations g = hk
and gz = hk and, using repeatedly the assumption that ? = G for every orbit
representative u, we find
m(gy) = Jy(h)(u)
=
JY(g)(y)JY(z)(u)JY(..1)(u)
= JY(g)(y)m(y).
154
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference 155
In the last step we have used the fact that
J ,k)(u)
= 1 for every keK. (Al.3)
To see the validity of (Al.3) one needs only note that for fixed u the mapping
k -> J /?^(u) is a multiplier on ? and since ? is compact this must be the
trivial multiplier 1. Actually, (Al.3) is a necessary and sufficient condition
for the existence of an invariant measure on ?. This may be concluded from
Kurita (1959), cf. also Santalo (1979), section 10.3.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
APPENDIX 2
An equality of Jacobians under left factorizations
Lemma. Let G = HK be a left factorization of G (as discussed in
sections 3 and 5), let ? denote the natural action of G on ? and let d denote
left action of G on itself. Then J'(h\(e)
= J?(h\(e)
for all heH.
Proof. Let g = hk denote an arbitrary element of G. Writing g
symbolically as (h,k) and employing the mappings ? and ? defined by
n:g ?> h c:g -> k
we have, for any h'eH,
?(h')g = 6(h')(h,k) = (n(h'h),c(h'hk))
and hence the differential of 6(h')g is
3n(h'h)*
D6(h')(g) =
3h
3c(h'hk)* 3?(h'hk)* 3h 3k
from which we find, using n(h'h) = y(h')h and c(h'k) = k,
J?(h')(e) =
JY(h')(e)!"Sk lk=e
?i(V){eh
156
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
APPENDIX 3
An inversion result
The validity of formula (6.24) is established by the following
Lemma. Let G = HK be a left factorization of the group G with the
associated mapping n:g = hk ?> h (as discussed in sections 3 and 5). Further-
more, let h' denote an arbitrary element of H. Then
3n(h;V)*| = _ 3n(h'"1h)*l (A3J) dh W
ah h=h?
Proof. The mapping h -> n(h" h1) may be composed of the three
mappings h + h'" h, g -> g" and ?, as indicated in the following diagram
,?
H
where i indicates the inversion g -> g" . This diagram of mappings between dif-
ferentiable manifolds induces a corresponding diagram for the associated dif-
ferential mappings between the tangent spaces of the manifolds, namely
157
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
158 O. E. Barndorff-Nielsen
D(h1_1.)
m -> TG , h ^ - h'-\
Di O
Dn
???(??~??)
From this latter diagram and from the well-known relation
(Di)(e) = -I,
where I indicates the identity matrix, formula (A3.1) may be read off immediate-
ly.
Acknowledgements
I am much indebted to Poul Svante Eriksen, Peter Jupp, Steffen L.
Lauritzen, Hans Anton Salomonsen and J?rgen Tornehave for helpful discussions?
and to Lars Smedegaard Andersen for a careful checking of the manuscript.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
REFERENCES
Amari, S.-I. (1982a). Differential geometry of curved exponential families -
curvatures and information loss. Ann. Statist. 10, 357-385.
Amari, S.-I. (1982b). Geometrical theory of asymptotic ancillarity and condi-
tional inference. Biometrika 69, 1-17.
Amari, S.-I. (1935). Differential-Geometric Methods in Statistics. Lecture
Notes in Statistics 28, Springer, New York.
Amari, S.-I. (1986). Differential geometrical theory of statistics - towards
new developments. This volume.
Amari, S.-I. and Kumon, M. (1983). Differential geometry of Edgeworth expansion
in curved exponential family. Ann. Inst. Statist. Math. 35, 1-24.
Barndorff-Nielsen, 0. E. (1978a). Information and Exponential Families.
Wiley, Chichester.
Barndorff-Nielsen, 0. E. (1978b). Hyperbolic distributions and distributions on
hyperbolae. Scand. J. Statist. !5, 151-157.
Barndorff-Nielsen, 0. E. (1980). Conditionality resolutions. Biometrika 67,
293-310.
Barndorff-Nielsen, 0. E. (1982). Contribution to the discussion of R. J.
Buehler: Some ancillary statistics and their properties. J. Amer.
Statist. Assoc. 77, 590-591.
Barndorff-Nielsen, 0. E. (1983). On a formula for the distribution of the maxi-
mum likelihood estimator. Biometrika 70, 343-365.
Barndorff-Nielsen, 0. E. (1984). On conditionality resolution and the likeli-
hood ratio for curved exponential families. Scand. J. Statist. ?, 157-
159
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
160 O. E. Barndorff-Nielsen
170. Amendment Scand. J. Statist. 12 (1985).
Barndorff-Nielsen, 0. E. (1985). Confidence limits from c|j| E in the single-
parameter case. Scand. J. Statist. 12, 83-87.
Barndorff-Nielsen, 0. E. (1986a). Likelihood and observed geometries. Ann.
Statist. U, 856-873.
Barndorff-Nielsen, 0. E. (1986b). Inference on full or partial parameters
based on the standardized signed log likelihood ratio. Biometrika 73,
307-322.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983a). Exponential models with
affine dual foliations. Ann. Statist. 11, 753-769.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983b). Reproductive exponential
families. Ann. Statist. 11, 770-782.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1984). Combination of reproductive
models. Research Report 107, Dept. Theor. Statist., Aarhus University.
Barndorff-Nielsen, 0. E., Blaesild, P., Jensen, J. L. and Jorgensen, B. (1982).
Exponential transformation models. Proc. R. Soc. A 379, 41-65.
Barndorff-Nielsen, 0. E. and Cox, D. R. (1984). Bartlett adjustments to the
likelihood ratio statistic and the distribution of the maximum likelihood
estimator. J. R. Statist. Soc. ? 46, 483-495.
Barndorff-Nielsen, 0. E., Cox. D. R. and Reid, N. (1986). The role of differen-
tial geometry in statistical theory. Int. Statist. Review 54, 83-96.
Barut, A. 0. and Raczka, R. (1980). Theory of Group Representations and Appli-
cations. Polish Scientific Publishers, Warszawa.
Boothby, W. M. (1975). An Introduction to Differentiable Manifolds and
Riemannian Geometry. Academic Press, New York.
Burridge, J. (1981). A note on maximum likelihood estimation for regression
models using grouped data. J. R. Statist. Soc. ? 43, 41-45.
Chentsov, ?. N. (1972). Statistical Decision Rules and Optimal Inference.
(In Russian.) Moscow, Nauka. English translation (1982). Translation of
Mathematical Monographs Vol. 53. American Mathematical Society, Providence,
Rhode Island.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential and Integral Geometry in Statistical Inference -^l
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a
curved exponential family. Ann. Statist. 11, 793-803.
Eriksen, P. S. (1984a). (k,l) exponential transformation models. Scand. J.
Statist. VL, 129-145.
Eriksen, P. S. (1984b). A note on the structure theorem for exponential trans-
formation models. Research Report 101, Dept. Theor. Statist., Aarhus
University.
Eriksen, P. S. (1984c). Existence and uniqueness of the maximum likelihood
estimator in exponential transformation models. Research Report 103,
Dept. Theor. Statist., Aarhus University.
Fisher, R. A. (1934). Two new properties of mathematical likelihood. Proc.
Roy. Soc. A 144, 285-307.
Hauck, W. W. and Donner, A. (1977). Wald's test as applied to hypotheses in
logit analysis. J. Amer. Statist. Ass. 72, 851-853. Corrigendum:
J. Amer. Statist. Ass. 75 (1980), 482.
Jensen, J. L. (1981). On the hyperboloid distribution. Scand. J. Statist. 8,
193-206.
Kurita, M. (1959). On the volume in homogeneous spaces. Nagoya Math. J. 15,
201-217.
Lauritzen, S. L. (1986). Statistical manifolds. This volume.
Santal ?, L. ?. (1979). Integral Geometry and Geometric Probability. Encyclo-
pedia of Mathematics and Its Applications. Vol. 1, Addison-Wesley, London.
Shuster, J. J. (1968). A note on the inverse Gaussian distribution function.
J. Amer. Statist. Assoc. 63, 1514-1516.
Vaeth, M. (1985). On the use of Wald's test in exponential families. Int.
Statist. Review 53, 199-214.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
STATISTICAL MANIFOLDS
Steffen L. Lauritzen
1. Introduction. 165
2. Some Differential Geometric Background . 167
3. The Differential Geometry of Statistical Models . i77
4. Statistical Manifolds . 179
5. The Univariate Gaussian Manifold . I90
6. The Inverse Gaussian Manifold . 198
7. The Gamma Manifold. 203
8. Two Special Manifolds. 206
9. Discussion and Unsolved Problems . 212
10. References. 215
Institute for Electronic Systems, Aalborg University Center, Aalborg, Denmark
163
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
1. INTRODUCTION
Euclidean geometry has served as the major tool in clarifying the
structural problems in connection with statistical inference in linear normal
models. A similar elegant geometric theory for other statistical problems
does not exist yet.
One could hope that a more general geometric theory could get the
same fundamental role in discussing structural and other problems in more
general statistical models.
In the case of non linear regression it seems clear that the
geometric framework is that of a Riemannian manifold, whereas in more general
cases it seems as if a non-standard differential geometry has yet to be
developed.
The emphasis in the present paper is to clarify the abstract
nature of this differential geometric object.
In section 2 we give a brief introduction to the notions of modern
differential geometry that we need to carry out our study. It is an extract
from Boothby (1975) and Spivak (1970-75) and we are mainly using a coordinate-
free setup.
Section 3 is an ultrashort summary of some previous developments.
The core of the paper is contained in section 4 where we abstract the notion
of a statistical manifold as a triple (M,g,D) where Misa manifold, g is a
metric and D is a symmetric trivalent tensor, called the skewness of the
manifold. Section 4 is fully devoted to a study of this abstract notion.
Sections 5, 6, 7, and 8 are detailed studies of some examples of
165
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
!66 Steffen L. Lauritzen
statistical manifolds of which some (the Gaussian, the inverse Gaussian and
the Gamma) manifolds are of interest because of their leading role in statis-
tical theory, whereas the examples in section 8 are mostly of interest because
they to a large extent produce counterexamples to many optimistic conjectures.
Through the examples we also try to indicate possibilities for discussing
geometric estimation procedures.
In section 9 we have tried to collect some of the questions that
naturally arise in connection with the developments here and in related pieces
of work.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
SOME DIFFERENTIAL GEOMETRIC BACKGROUND
A topological manifold Misa Hausdorff space with a countable
base such that each point ?e? has a neighborhood that is homeomorphic to an
open subset of IRm. m is the dimension of M and is well-defined. A differen-
tiable structure on M is a family
where U is an open subset of M and ? are homeomorphisms from U, onto an open ? ? X
subset of IRm, satisfying the following:
(1) UU, = M
? ?
1 m (2) for any ?^,?^et: ?? ??~ is a C??(IR ) function wherever it is well
defined
(3) if V is open, ?: V -> IRm is a homeomorphism, and ? ? ? ~
, ? ? ?" are ? ?
C?? wherever they are well defined, then (?,?)e?.
The condition (2) is expressed as ?. and ? being compatible. x1 x2
In very simple cases M is itself homeomorphic to an open subset
of IR and the differentiable structure is just given by (?,F?) and all sets
(U.,? ) where U, is an open subset of M and ? ? ? "
is a diffeomorphism. ? ? ? xu
The sets U are called coordinate neighborhoods and ? coordinates. ? ?
The pair (U ,? ) is called a local coordinate system. ? ?
M, equipped with a differentiable structure is called a differenti-
able manifold or a C??-manifo1d.
A differentiable structure can be specified by any system satisfy-
ing (1) and (2). Then there is a unique structure U_ containing the specified
167
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
168 Steffen L. Lauritzen
local coordinate system.
The differentiable structure gives rise to a natural way of defin-
ing a differentiable function. We say that f: M -* IR is in 0??(?) if it is
a usual C??-function when composed with the coordinates:
f e C??(M) <-* f ? f?
"? e C??UX(U)) for all X.
Important is the notion of a regular submanifold ? ? M of M. A subset N. of M
is a regular submanifold if it is a topological manifold with the relative
topology and if it has preferred coordinate neighborhoods, i.e. to each point
?e?_ there is a local coordinate system (U. ,?.) with ?e?. such that XX X
i) F?(?) = (0.....0); F?(??)
= ]-e,e[m
ii) F?(?????) = ?(?1.???.??.0,....0), |?1'|<e}
?^ inherits then in a natural way the differentiable structure from M by
(??.F?) where
?? =
U/1N, \ -
f?|??,
where (U.,??) is a preferred coordinate system. ? ?
All C??(N)-functions can then be obtained by restriction to N^ of
C??(M)-functions.
For ?e?, C??(p) is the set of functions whose restriction to some
open neighborhood U of ? is in C??(U). We here identify f and g e C??(p) if their
restriction to some open neighborhood of ? are identical.
The tangent space ? (M) to M at ? is now defined as the set of all
maps X : C??(p) -* IR satisfying
1) Xp(af+eg)
= aXp(f)+3Xp(g)
a,? e IR
?) Xp(fg)
= xp(f)g(p)+f(p)xp(g)
f,g e cro(p)
One should think of X as a directional derivative. X is called a tangent ? - ? ?2?
vector.
? (M) is in an obvious way a vector-space and one can show that
dim(Tp(M)) = m.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 169
For each particular choice of a coordinate system, there corre-
sponds a canonical basis for ? (M), with basis vectors being
Eip(f)=?7fU"1(x))|x=*(p)
A vector field is a smooth family of tangent vectors X = (? ,?e?) where
? e? (M). To define "smooth11 in the right way, we demand a vector field X to r H
be a map:
i)
ii)
and now we write
X: C??(M) - C"(M)
X(af+?g) = aX(f)+?X(g) a.?elR
x(fg) = x(f)g+fx(g) f,geC"(M)
xp(f) = x(f) (p)
The vector fields on M are denoted as X_(M). X_(M) is a module over C??(M): if
f,geC??(M), ?,?e?(?) then
(fX+gY) (h) = fX(h) + gY(h)
is also in X^(M). X_(M) is a Lie-algebra with the bracket operation defined as
CX,Y](f) = X(Y(f)) - Y(X(f)).
The Lie-bracket [ ] satisfies
[X,[Y.Z]] + [Y,[Z,X]] + [Z,[X,Y]] = 0
[X,Y] = -[Y,X]
[aX^?Xg.Y] =
a[XrY] +
?[X2,Y] a,??IR
[X,aY1+3Y2] =
a[?,??] +
?[?,?2] a,? e IR
Further one can easily show that
[X,fY] = f[X,Y] + (X(f))Y .
The locally defined vector fields ?., representing differentiation w.r.t. local
coordinates, constitute a natural basis for the module XjU), where U is a
coordinate neighborhood.
A covariant tensor D of order k is a C??-k-linear map
(Jacobi identity)
(anticommutati vity)
(bilinearity)
D: X(M)x...*X(M) - C??(M),
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
170 Steffen L. Lauritzen
i.e.
D(Xr...,Xk) e C~(M),
D(X1,...,fXi+gYi,X.+1,...,Xk)
= fD(Xr...,Xk)
+ gD(X1,...,Y.,X.+1,...,Xk).
A tensor is always pointwise defined in the sense that if X . = Y ., then XL- pl pi
D(Xr...,Xk)(p) =
D(Yr...,Yk)(p).
This means that any equations for tensors can be checked locally on a basis
e.g. of the form E.. These satisfy [?.,E.] = 0 and all tensorial equations hold
if they hold for vector fields with mutual Lie-brackets equal to zero. This is
a convenient tool for proving tensorial equations and we shall make use of it
in section 3.
A Riemannian metric g is a positive symmetric tensor of order two:
g(X,X) > 0 g(X,Y) = g(Y,X)
Since tensors are pointwise, it can be thought of as a metric g on each of the
tangent spaces ? (M).
A curve ? = (y(t),te[a,b]) is a C??-map of [a,b] into M. Note that
a curve is more than the set of points on it. It involves effectively the
parametrization and is thus not a purely geometric object.
Let now ? denote any vector field such that
;(f)(y(t)) = ? f (Y(t)) for all te[a,b],feC??(M)
The length of the curve ? is now given as
M = /b i/g(Y,Y)y(t)dt. a
Curve length can be shown to be geometric.
An important notion is that of an affine connection on a manifold.
We define an affine connection as an operator ?
?: X?M) ? ?(?) -> ?(?)
satisfying (where we write v?Y for the value)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 171
i) ??(a?+??) =
a??? +
????, a,? e IR
?) Vx(fY) = X(f)Y +
fvxY
iii) vfx+gYZ
= fvxZ +gvYZ
.
An affine connection can be thought of as a directional derivation of vector
fields, i.e. ???
is the "change" of the vector field Y in X's direction.
An affine connection can be defined in many ways, the basic reason
being, that "change" of Y is not well defined without giving a rule for compar-
ing vectors in ? (M) with vectors in ? (M), since they generally are different P-l
- P2
-
spaces.
An affine connection is exactly defining such a rule via the notion
of parallel transport, to be explained in the following. We first say that a
vector field X is parallel along the curve ? if
??? = 0 on ?, ?
where again ? is any vector field representing tt-.
Now for any vector X rx e ? ,? (_M) there is a unique curve of
vectors
XY(t).te[a,b], Xy(t) cTY(t)(M)
such that ??? = 0 on ?, i.e. such that these are all parallel, and such that
X / ? is equal to the given one. We then write
y(b) ?? y(ar
and say that p defines parallel transport along ?. p is in general an affine
map.
Note that p depends effectively on the curve in general.
An affine connection can be specified by choosing a local basis
for the vector-fields (E. ,i=l,... ,m) and defining the symbols (C??-functions)
k r.?,, i,j,k=l,...,m
by
Vrr!jM=k!/^'
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
172 Steffen L. Lauritzen
where we adopt the summation convention that whenever an index appears in an
expression as upper and lower, we sum over that index. Using the properties of
an affine connection we thus have for an arbitrary pair of vector-fields
X = f^., Y = giEi
vxY =
f1Ei(gJ)Ej+fV4.Ek
A geodesic is a curve with a parallel tangent vector field, i.e. where
??? = 0 on ?. ?
Associated with the notion of a geodesic is the exponential map induced by the
connection.
For all ?e?, ? e ? (M) there is a unique geodesic ?? , such that ?
?? (0) = ? ;? (0) = ? (**) ? ?
This is determined in coordinates by the differential equations below together
with the initial conditions (**)
xk(t) + ^(tl^tlrJ.Wt))
= 0
where ?? (t) = (x (t),...,xm(t)) in coordinates. XP
Defining now for ? e ? (M)
exp?X) = Yy (1)
? xp
we have exp?tX ? = ?? (t). ?
?? The exponential map is in general well defined at least in a neigh-
borhood of zero in ? (M) and can only in special cases be defined globally.
In general, geodesies have no properties of "minimizing" curve
length. However, on any Riemannian manifold, (i.e. a manifold with a metric
tensor g), there is a unique affine connection ? satisfying
i) ??? -
??? - [?,?] ? 0
11) Xg(Y.Z) = g(vxY,z)
+ g(Y,vxz).
This connection is called the Riemannian connection or the Levi-Civita connec-
tion.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 173
Property i) is called torsion-freeness and property ii) means that
the parallel transport p is isometric, which is seen by the argument.
yg(Y.Z) = g(v.Y.Z) + g(Y,v^Z)
= 0 if v^Y
= v^Z
= 0.
We can then write girMf^Z)^^
= g(Y,Z)y(a)
or just g(?^9Ji^l)
= g(Y,Z).
If ? is Riemannian, its geodesies will locally minimize curve length.
To all connections ? there is a torsion free connection ? such that
this has the same geodesies. All connections in the present paper are torsion
free, whereas not all of them are Riemannian.
When the manifold is equipped with a Riemannian metric, it is often
convenient to specify the connection through the symbols (C??-functions) G... , ? j ?
where
rijk "
9(^?G??
Defining the matrix of the metric tensor and its inverse as
gu-gtMj) (g^-ig^r1.
the symbols are related to those previously defined as
The Riemannian connection is given by
^k =
^9jk>+ ??^ -
M^ij??
A connection defines in a canonical way the covariant derivative of
a tensor D as
(vxD)(Xr...,Xk) =
XD(Xr...,Xk) - S
D(X1,...,VxX.,...,Xk).
(???) is again a covariant tensor of order k and the map
S(X,Xr...,Xk) =
(vxD)(Xr...,Xk)
becomes a tensor of order k+1. The fact that the Riemannian connection pre-
serves inner product under parallel translation can then be written as
(vxg)(Y,Z) ? 0.
Similarly, if D is a multilinear map from ?(?)?...??(?) into ?(M) its
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
174 Steffen L. Lauritzen
covariant derivative is defined as
(vxD)(Xr...,Xk) =
vxD(Xr...fXk) - S
D(X1,...,vxX.,...,Xk).
Such multilinear maps are called tensor fields.
An important tensor field associated with a space with an affine
connection is the curvature field, R: X{tt) ? X(M) ? X(M) -> J((M)
r(XjY)Z = ?????
- ?????
- v[XjY]Z.
A manifold with a connection satisfying R ? 0 is said to be flat. If the
connection is torsion free, the curvature satisfies the following identities:
a) R(X,Y)Z = -R(Y,X)Z
b) R(X,Y)Z + R(Y,Z)X + R(Z,X)Y = 0
(Bianchi's 1st identity)
c) (VXR)(Y,Z,W) +
(vyR)(Z,X,W) +
(vzR)(X,Y,W) = 0
(Bianchi's 2nd identity).
Strictly speaking, a) does not need torsion freeness.
On a Riemannian manifold, we also define the curvature tensor R as
R(X,Y,Z,W) = g(R(X,Y)Z,W)
where R is used in two meanings, both referring to the Riemannian connection.
The Riemannian curvature tensor satisfies
1) R(X,Y,Z,W) = -R(Y,X,Z,W)
ii) R(X,Y,Z,W) + R(Y,Z,X,W) + R(Z,X,Y,W) = 0
ili) R(X,Y,Z,W) = -R(X,Y,W,Z)
iv) R(X,Y,Z,W) = R(Z,W,X,Y)
We shall use the symbol R also for the curvature tensor
R(X,Y,Z,W) = g(R(X,Y)Z,W),
when M has a Riemannian metric g and a torsion-free but not necessarily
Riemannian connection v. Then i) and ii) are satisfied, but not necessarily
iii) and iv).
If (E,-D) is a local basis for ? (M), the curvature tensor can be
calculated as
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 175
Rijkm =
R<Ei'EJ>Ek'Em>
? ?El?rjV-Ejtr?k?9sm+(r1nnr?k-rJn.r?k)?
The sectional curvature is given as
?(s ) = 9(*(?.?)?.?) X,Y
g(x,x)g(Y,Y)-g(x,Y)2
and determines in a Riemannian manifold also the curvature. If the curvature
satisfies i) to iv) the sectional curvature also determines R.
Two other contractions of the curvature tensor are of interest:
The Ricci-curvature
ClR(X,X) =
ml] g(R(u.,X)X,u.) ? ?,=] ? ?
= g(x,x)m^ ?(s? )
1=1 ?'???
where (X/g(X,X),u,,...,u ,) is an orthonormal system for ? (M).
Finally the scalar curvature is
S(p) = S c.R(u.,u.) i=l
'
where u-.,...,u is an orthonormal system in ? (M). We then have the identity
S(p) = S ?(s , ). i9j i j
If ? is a regular submanifold of M, the tangent space of ? can in a natural way
be identified with the subspace of _X(M) determined by
? e X(N)5 X(M) *-* [f=g on ? -* X(f) = X(g) on N].
In that way all tensors etc. can be inherited to ? by restriction. If M has a
Riemannian metric, N^ inherits it in an obvious way, and this preserves curve
length, in the sense that the length of a curve in N^w.r.t. the metric inherit-
ed, is equal to that when the curve is considered as a curve in M.
An affine connection is inherited in a more complicated way:
We define
(????)(?) =
??(???)(?)
where ? is the projection w.r.t. g onto the tangent space ? (N)cT (M) of the
vector (???)? which is not necessarily in
Tp(N). In fact we define the
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
176 Steffen L. Lauritzen
embedding curvature of ? relative to M as the tensor field X(N) ? X_(N) -> X^(M)
or,equivalently, as
HN(X,Y) =
??? -
????
hn(x.y,z) =
g(HN(x,Y),z)
where ?,? e X(N), ? e X(NH (or ? e ?(?)). r r
If HjuiO we say that N^ is a totally geodesic submanifold of M. A
totally geodesic submanifold has the property that any curve in ? which is a
geodesic w.r.t. the connection on j^, also is a geodesic in M.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
THE DIFFERENTIAL GEOMETRY OF STATISTICAL MODELS
A family of probability measures ? on a topological space X. inher-
its its topological structure from the weak topology. Most statistical models
are parametrized at least locally by maps (homeomorphisms)
?: U->0clRm
where U is an open subset of P^ and T an open subset of IRm. From this para-
metrization we get P_ equipped with a differentiable structure, provided the
various local parametrizations are compatible. Considering for a while only
local aspects, we can think of ? as {?.?eT}. We let now f(x,e) denote the ?
density of Pa w.r.t. a dominating measure y and assume these to be C??-functions ?
of ?. Under suitable regularity assumptions we can now equip P^with a
Riemannian metric by defining 1(?,?) = log f (?,?) and
9??(?) =
9(?G?.) -le?MDEjO)). ?
The metric is the Fisher information and different parametrizations define the
same metric on P_. Similarly we can define a family of affine connections (the
a-connections) on ^P by the expressions
?ijk =
"rijk -
fTijk' aeIR' where
TlJk(Pe)-?etE1(l)EJ(l)Ek(l)}.and
r... is the Riemannian connection, ? j ?
The Fisher information as a metric was first studied by Rao (1945)
and the a-connections in the case of finite and discrete sample spaces by
Chentsov (1972). Later the a-connections were introduced and investigated
independently and in full generality by Amari (1982).
177
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
178 Steffen L. Lauritzen
For a more fair description of the history of the subject (the
above is indecently short), see e.g. the introduction by Kass in the present
monograph, Amari (1985) and/or Barndorff-Nielsen, Cox and Reid (1986).
Two of these connections play a special role:
The exponential connection (for a=l) and
the mixture connection (for a=-l). 1
The exponential connection has r... ? 0 when expressed in the ? j ?
canonical parameter in an exponential family, and similarly when we express -1 r... (the mixture connection) in the mean value coordinates of an exponential 1 JK
family r. .. ? 0. Further we have the formulae ? j ?
J^M?Mj?DE^DJand
1 ? T. ., = 2(r. ., - r... )
?jk v
?jk ijk'
which often are useful for computations.
These structures are in a certain sense canonical on a statistical
manifold. Chentsov (1972) showed in the case of discrete sample spaces that
the a-connections were the only invariant connections satisfying certain in-
variance properties related to a decision-theoretic approach. Similarly, the
Fisher information metric is the only invariant Riemannian metric. These re-
sults have recently been generalized to exponential families by Picard (1985).
On the other hand, similar geometric structures have recently
appeared such as minimum-contrast geometries (Eguchi, 1983) and the observed
geometries introduced by Barndorff-Nielsen in this monograph.
The common structure that seems to appear again and again in cur-
rent statistical literature is not standard in modern geometry since it involves
study of the interplay between a Riemannian metric and a non-Riemannian connec-
tion or even a whole family of such connections.
It seems thus worthwhile to spend some time on studying this
structure from a purely mathematical point of view. This has already been done
to some extent by Amari (1985). In the following section we shall outline the
mathematical structures.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
STATISTICAL MANIFOLDS
A statistical manifold is a Riemannian manifold with a symmetric
and covariant tensor D or order 3. In other words a triple (M,g,D) where M is
an m-dimensional C??-manifold, g is a metric tensor and D: X(M) ? _X(M) ? X_(M) -*
C??(M) a tri li near map satisfying
D(X,Y,Z) = D(Y,X,Z) = D(Y,Z,X)
(=D(X,Z,Y) = D(Z,X,Y) = D(Z,Y,X))
D is going to play the role T.... in the previous section. We use D to distin- ? j ?
guish the tensor from the torsion field. Dis called the skewness of the
manifold.
Instead of D we shall sometimes consider the tensor field ? defined
as
g(Bf(X,Y),Z) = D(X,Y,Z).
We have here used that the value of a tensor field is fully deter-
mined when the inner product with an arbitrary vector field ? is known for all
Z.
The above defined notion could seem a bit more general than neces-
sary, in the sense that some Riemannian manifolds with a symmetric trivalent
tensor D might not correspond to a particular statistical model.
On the other hand the notion is general enough to cover all known
examples, including the observed geometries studied by Barndorff-Nielsen and
the minimum contrast geometries studied by Eguchi (1983).
Further, all known results of geometric nature for statistical
manifolds as studied by Amari and others can be shown in this generality and
179
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
180 Steffen L. Lauritzen
it seems difficult to restrict the geometric structure further if all known
examples should be covered by the general notion.
a? Let now (M,g,D) (or(M,g,D)) be a statistical manifold. We now
define a family of connections as follows:
??? =
??? -
|D(X,Y) (3.1)
where ? is the Riemannian connection. We then have a
3.1 Proposition ? as defined by (3.1) is a torsion free connection. It is the
unique connection that is torsion free and satisfies
(vxg)(Y,Z) = aD(X,Y,Z) (3.2)
a Proof: That ? is a connection: Linearity in X is obvious. Scalar linearity
in Y as well. We have
vx(fY) =
vx(fY) -
f D(X,fY) = X(f)Y + fvxY.
Torsion freeness follows from symmetry of D:
??? -
??? - [?,?] =
??? -
??? - [?,?]
-f [D(X,Y) - D(Y,X)] = 0.
a That ? satisfies (3.2) follows from
(vxg)(Y,Z) = Xg(Y,Z) -
g(vxY,Z) -
g(Y,?xZ)
= (vxg)(Y,Z)
+ aD(X,Y,Z) = 0 + aD(X,Y,Z).
If ^ is torsion free and satisfies (3.2) we obtain:
i) Xg(Y,Z) = g(vxY,Z)
+ g(Y,vxZ)
+ aD(X,Y,Z)
ii) Zg(X,Y) = g(vxZ,Y)
+ g(vYZ,X)
+ aD(X,Y,Z)
+ g([z,x],Y) + g([z,Y],x)
iii) Yg(Z,X) = g(^YZ,X)
+ g(vxY,Z)
+ aD(X.Y.Z)
- g([x.Y],z)
Calculating now i) - ii) + iii) we get
Xg(Y,Z) - Zg(X,Y) + Yg(Z,X) = aD(X,Y,Z)
-g([z,x],Y) - g([z,Y],x) - g([x,Y],z) + 29(???,?).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 181
Since this equation also is fulfilled for ? we get
g(vxY,Z) =
g(vxY,Z), whereby ? = v.
0 . Obviously ? = v, the Riemannian connection.
To check what happens when we make a parallel translation we first
consider the notion of a conjugate connection (Amari, 1983).
Let (M,g) be a Riemannian manifold and ? an affine connection. The
conjugate connection v* is defined as
g(v*xY,Z) = Xg(Y,Z) -
9(?,???) (3.3)
3.2 Lemma v* is a connection, (v*)* = v.
Proof: Linearity in X is obvious. So is linearity in Y w.r.t. scalars. We
have
g(v*x(fY),Z) = Xg(fY,Z) -
g(fY,vxZ)
= X(f)g(Y,Z) + fXg(Y,Z) - fg(Y,vxZ)
= g(X(f)Y + fv*xY,Z).
And further
g((v*)*xY,Z) = Xg(Y,Z) -
g(v*xZ,Y)
= Xg(Y,Z) - {Xg(Z,Y) - g(vxY,Z)}
= g(vxY,Z).
If we now let p ,p* denote parallel transport along the curve ? we obtain:
3.3 Proposition
9(???,p*?) = g(X,Y)
Proof: Let X be v-parallel along ? and Y v*-parallel. Then we have
yg(x>Y) = g(vO(,Y) + g(x,v*or) = o.
In words Proposition 3.3 says that parallel transport of pairs of vectors w.r.t.
a pair of conjugate connections is "isometric" in the sense that inner product
is preserved.
Finally we have for the a-connections, defined by (3.1):
3.4 Proposition (?)* = ? .
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
182 Steffen L. Lauritzen
Proof:
g(vxY>z) =
g(vxY,z) -
|d(x,y,z)
g(Y>vxz) =
g(Y,vxZ) +
|D(X,Z,Y)
Adding and using the symmetry of D together with the defining property of the
Riemannian connection we get
g(vxY,Z) +
g(Y^vxZ) = Xg(Y,Z) (3.4)
The relation (3.4) is important and was also obtained by Amari (1983). If we
now consider the curvature tensors R and R* corresponding to ? and v* we obtain
the following identity:
3.5 Proposition
R(X,Y,Z,W) = -R*(X,Y,W,Z) (3.5)
Proof: Since we shall show a tensorial identity, we can assume [?,?] = 0 as
discussed in section 1. Then we get
XYg(Z,W) = X(g(vYZ,W)
+ g(Z,v*yW))
= g(vxvyZ,W)
+ g(vyZ,v*W)
+ g(vxZ,v*W)
+ g(Z,v*v*W).
By alternation we obtain
0 = [X,Y]g(Z,W) = XYg(Z,W) - YXg(Z,W)
= R(X,Y,Z,W) + R*(X,Y,W,Z).
Note that the Riemannian connection is self-conjugate which gives the well
known identity for the Riemannian curvature tensor, see section 1.
Consequently we obtain
3.6 Corollary The following conditions are equivalent
i ) R = R*
ii) R(X,Y,Z,W) = -R(X,Y,W,Z)
Proof: It follows directly from (3.5).
And, also as a direct consequence:
3.7 Corollary ? is flat if and only if v* is.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 183
The identity ii) is not without interest and we shall shortly
investigate for which classes of statistical manifolds this is true. Before we
get to that point we shall investigate the relation between a statistical mani-
fold and a manifold with a pair of conjugate connections.
We define the tensor field D,, and the tensor D, in a manifold with
a pair (v,v*) of conjugate connections by
?jiX.Y) =
?*?? -
???
?^?,?,?) =
9(0?(?,?),?).
We then have the following
3.8 Proposition If ? is torsion free, the following are equivalent
i) v* is torsion free
i i) D, is symmetric
iii) ? = ^(v+v*)
Proof: That D, is symmetric in the last two arguments follows from the
calculation
?^?,?,?) = g(v*Y,Z) -
g(vxY,Z)
= Xg(Y,Z) - g(Y,vxZ)
- [Xg(Y,Z)-g(Y,v*Z)]
= D.,(X,Z,Y)
The difference between two connections is always a tensor field, i) ?-> ii)
follows from the calculation
g(v*Y-v*X-[X,Y],Z) = g(vxY-vyX-[X,Y],Z)
+ ?^?,?,?)
- ?^?,?,?).
That iii) ?> i) is obvious since then v*=2v-v.
To show that i) ?> iii) we use the uniqueness of the Riemannian con-
nection. We define
? = ^(v+v*)
and see that this is torsion free, when ? and v* both are. But
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
184 Steffen L. Lauritzen
g(vxY,Z) +
g(Y,vxZ) =
^g(vxY,Z) + ^g(v*Y,Z)
+ ^(?,?*?) + ^(?,???)
= Xg(Y,Z)
showing that ? is Riemannian and thus equal to v.
Suppose now that ? is given with v* being torsion free. We can then
define a family of connections as
??? =
??? -
f ^(?,?)
and we obtain a -a? "1
3.9 Corollary ?* = ?, ?= ?, ? = ?*.
? Proof: It is enough to show ? = v. But
1
V =
^???+???) -
^(???-???) =
V'
We have thus established a one-to-one correspondence between a statistical
manifold (M,g,D) and a Riemannian manifold with a connection ? whose conjugate
v* is torsion free, the relation being given as
D(X,Y) = v*Y - ???
??? =
??? - y)(X,Y).
In some ways it is natural to think of the statistical manifolds as
being induced by the metric (Fisher information) and one connection (v) (the
exponential), but the representation (M,g,D) is practical for mathematical
purposes, because D has simpler transformational properties than v.
By direct calculation we further obtain the following identity for
a statistical manifold and its a-connections
3.10 Proposition
g(vxY,Z) -
g(vxZ,Y) =
g(vxY,Z) -
g(vxZ,Y) (3.6)
Proof: The result follows from
g(vxY,Z) -
g(vxZ,Y) =
g(vxY,Z) -
g(vxZ,Y)
- |D(X,Y,Z)
+ |D(X,Z,Y)
and the symmetry of D.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 185
We shall now return to studying the question of identities for the
curvature tensor of a statistical manifold. We define the tensor
F(X,Y,Z,W) = (vxD)(Y,Z,W)
where D is the skewness of the manifold, and ? is the Riemannian connection. We
then have
3.11 Proposition The following are equivalent a -a
i) R = R for ail aeIR
i i) F is symmetric
Proof: The proof reminds a bit of bookkeeping. We are simply going to estab-
lish the identity
R(X,Y,Z,W) - R(X,Y,Z,W) = a{F(X,Y,Z,W) - F(Y,X,Z,W)} (3.7)
by brute force.
Symmetry of F in the last three variables follows from the symmetry
of D. We have
2aF(X,Y,Z,W) = 2aXD(Y,Z,W)
-2a(D(vxY,Z,W) +
D(Y,VXZ,W) +
D(Y,Z,VXW)) a -a -a a
Since ? = h(v + v) and aD(X,Y,Z) = g(vxY,Z)
- g(vxY,Z)
we further get
2aD(vxY,Z,W) =
2g(vzW,vxY) -
2g(vzW,vxY)
-a a -a -a =
g(vzW,vxY) +
g(vzW,vxY) a a a -a
- g(vzW,vxY)
- g(vzw,vxY),
and similarly for the two other terms. Further we get
2aXD(Y,Z,W) = 2X(g(vYZ,W)
- g(vyZ,W))
= 2g(vxvYZ,W)
- 2g(vxvyZ,W)
-a a a -a +
2g(vyZ,vxW) -
2g(vyZ,vxW)
Collecting terms we get the following table of terms in 2aF(X,Y,Z,W), where
lines 1-3 are from 2aXD(Y,Z,W), 4 and 5 from 2aD(vxY,Z,W)
6 and 7 from
2aD(Y,vxZ,W) and 8 and 9 from 2aD(Y,Z,vxW).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
186 Steffen L. Lauritzen
Table of terms of 2aF(X,Y,Z,W)
with + sign with - sign
!? 2g(vx vyZ,W) 2g(vxvyZ,W) -a a a -a
2. g(vxY,vxw) g(vYZ,vxw)
-a a a -a 3.
g(vxY,vxW) g(vYz,vxW)
4. g(vzw,vxY) g(vzW,vxY)
a -a -a -a
g(vzw,vxY) g(vzw,vxY) 5.
6. g(vyW,vxZ) g(vyW,vxZ)
a -a -a a 7.
g(vyw,vxz) g(vyw,vxz)
a a -a -a
a a
g(vyZ,vxW) g(vyZ,vxW) a -a -a a
g(vyZ,vxW) g(vyZ,vxW)
Lines 4 and _5 disappear by torsion freeness and alternation. Lines 2_ + 9 add up
to zero. Lines 3_ + ? disappear by alternation. Lines 6. + 8 also. What is left
over are only terms from line ]_ whereby
2aF(X,Y,Z,W) - 2aF(Y,X,Z,W)
-a a = 2R(X,Y,Z,W) - 2R(X,Y,Z,W)
and the result and (3.7) follows.
A statistical manifold satisfying this kind of symmetry shall be
called conjugate symmetric. We get then immediately
3.12 Corollary The following is sufficient for a statistical manifold to be
conjugate symmetric a
3 a? j such that R ? 0,
i.e. that the manifold is a^-flat.
As shown e.g. in Amari (1985), exponential families are ?1 - flat
and therefore always conjugate symmetric.
In a conjugate symmetric space, the curvature tensor thus satisfies
all the identities of the Riemannian curvature tensor, i.e. also
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 187
R(X,Y,Z,W) = -R(X,Y,W,Zfi
\ (3?8)
R(X,Y,Z,W) = R(Z,W,X,Y)J
This implies as mentioned earlier that the sectional curvature determines the
curvature tensor.
We shall later see examples of statistical manifolds actually
generated by a statistical model that are not conjugate symmetric.
It also follows that the condition
3 a0 t 0 such that fP= R? (3.9)
is sufficient for conjugate symmetry.
Amari (1985) investigated the case when the statistical manifold was
aQ (and thus -aA flat in detail, showing the existence of local conjugate coor- a
dinates (??) and (?.) such that r... = 0 in the ?-coordinates and its conjugate -a 3
0 r-Mu
= ? in the ?-coordinates. 1 j ?
Further that potential functions ?(?) and f(?) then exist such that
gij(e) =
EiEj^e) 9ij(n) =
????:?(f(?))
and the ?- and ?-coordinates then are related by the Legendre transform:
?1 = ?.(f(?)) ?. = ?.(?(?))
?(?) + f(?) - ?????
= 0.
In a sense aQ-flat families are geometrically equivalent to exponential families..
If N^ is a regular submanifold of (M,g,D), the tensors g and D are
inherited in a simple way (by restriction). The a-connections are inherited by
orthogonal projections on to the space of tangent vectors to _N, i.e. by the
equation
g$xY,Z) =
g(vxY,Z) for ?,?,? e X(N). (3.10)
It follows from (3.10) that the a-connections induced by the restriction of g
and D to ?.(?) are equal to those obtained by projection (3.10). This consis-
tency condition is rather important although it is so easily verified.
A submanifold is totally a-geodesic (or just a-geodesic) if
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
188 Steffen L. Lauritzen
a
??? e ?(?) for all ?,? e ?(?).
If the submanifold is a-geodesic for all a we say that it is geodesic. We then
note the following
3.12 Proposition A regular submanifold ? is geodesic if and only if there
exist a, j a2 such that ? is a,-geodesic and g^-geodesic.
Proof: Let ?,? e X(N) and ? e ? (?)1 ? e ?.
Then IN is a.-geodesic, i=l, 2 iff
g(avj?Y,Z)p =
g(vxY,Z) = 0
for all such ?,?,?. But since
g(vxY,Z) =
g(vxY,Z) -
|D(X,Y,Z)
this happens if and only if D(X,Y,Z) = 0 for all such ?,?,?, whereby ? is geo-
desic iff it is a.-geodesic, i=l,2.
In statistical language, geodesic (a-geodesic) submanifolds will be
called geodesic (a-geodesic) hypotheses. A central issue is the problem of
existence and construction of a-geodesic and geodesic foliations of a statisti-
cal manifold.
A foliation of (M,g,D) is a partitioning
M = U ? - ?e^ -?
of the manifold into submanifolds ? of fixed dimension n(<m). N. are called ?? -?
the leaves of the foliation.
The foliation is said to be geodesic (or a-geodesic) if the leaves
are all geodesic (or a-geodesic).
It follows from Proposition 3.12 that geodesic foliations of full
exponential families (and of a-flat families) are those that are affine both in
the canonical and in the mean value parameters, in other words precisely the
affine dual foliations studied by Barndorff-Nielsen and Blaesild (1983). In
the paper cited it is shown that existence of such foliations are intimately
tied to basic statistical properties related to independence of estimates and
ancillarity. Proposition 3.12 shows that the concept itself is entirely geo-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 189
metric in its nature.
It seems reasonable to believe that the existence (locally as well
as globally) of foliations of statistical models could be quite informative. It
plays at least a role when discussing procedures to obtain estimates and an-
cillary statistics on a geometric basis.
Let H be a submanifold of M and suppose that ?e? is an estimate of
p, obtained assuming the model M. Amari (1982, 1985) discusses the a-estimate
of ? assuming ? as follows.
To each point ? of ? we associate an ancillary manifold A (p)
Aa(p) = exp
(Tp(N)A)
a i where exp is the exponential map associated with the a-connection and ? (?) is
the set of tangent vectors orthogonal to ? at p. In general the exponential map
might not be defined on all ? (N) , but then let it be maximally defined.
? is then an a-estimate of p, assuming ? if
? e ?a(?).
Amari (1985) shows that if M is a-flat and ? is -a-geodesic, then the a-estimate
is uniquely determined and it minimizes a certain divergence function.
This suggest that it might be worthwhile studying procedures that
use the -a-estimate for a-geodesic hypotheses H9 and call such a procedure
geometric estimation. In general it seems that one should study the decomposi-
tion of the tangent spaces at ?e?? as
Tp(M) =
??(?)F??(?)?
and especially the maps of these spaces onto itself induced by a-parallel trans-
port of vectors in ? (?), -a parallel transport of vectors in the complement,
both along closed curves in H.
It should also be possible to define a teststatistic in geometric
terms by a suitable lifting of the manifold N, see also Amari (1985). Things
are especially simple in the case where M has dimension 2 and N^ has dimension 1
and we shall try to play a bit with the above loose ideas in some of the
examples to come.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
5. THE UNIVARIATE GAUSSIAN MANIFOLD
Let us consider the family of normal distributions ?(?,s ), i.e.
the family with densities
? ?G" 1 2 f(x;y,a) = 1/2ps exp{--? (x-?) },?e^,s>0
2s
w.r.t. Lebesgue measure on IR. This manifold has been studied as a Riemannian
manifold by Atkinson and Mitchell (1981), Skovgaard (1984) and, as a statistical
manifold in some detail by Amari (1982, 1985). Working in the (?,s) parametri-
zation we obtain the following expressions for the metric, the a-connections
and the D-tensor (skewness) expressed as T... (cf. Amari, 1985).
??-ve ?) s M) V
a
G1? =
G122 =
G212 =
G221 = ?
a-? a9 a9 a-, ?? =r? =r? =rl = ? iy ?12 ?21 ?22
?
G112 - (1-a)/s3 G^
= (1-a)/(2s)
a a ? a1 a1 G121
= G211
= -(1+a)/s G?2 =
G21 = "(1+a)/c
G222 = "2(1+2a)/s G22
= -(1+2a)/s
?111 "
?122 ~
?212 "
?221 " ?
??2 =
?121 =
?2? = 2/s ?222
= 8/s
The a-curvature tensor is given by
d - p 2W 4 1212
" * ''? *
190
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 191
so the manifold is conjugate symmetric, and the scalar (sectional) curvature by
?a(s]2) =
R1221/(gng22) = -O"?2)/2
For a = 0 (the Riemannian case) we have ?(s,2)
= -1/2 and the manifold is the
space of constant negative curvature (Poincar?*s halfplane or hyperbolic space).
Note that it also has constant a-curvature for all a although nobody knows what
that implies, since such objects have never been studied previously.
To find all a-geodesic submanifolds of dimension 1 we proceed as
follows. Let (e,E) denote the tangent vector fields
e* J- E = -^. d\i do
a-? If we have ? =
?0 constant on _N, _X(N) is spanned by E. Since r22 = 0 we have
a vrE = f E for all a, t a
and thus that the submanifolds
? = {(?,s) |?=??},??e^ -v0
U U
are geodesic submanifolds and the family
(N ,peIR) (4.1)
constitutes a geodesic foliation of the Gaussian manifold.
If ? is non-constant on N, we must be able to parametrize ? locally
as
(t,a(t)), tei SIR.
The tangent space to ? is then spanned by
? = e + s E
d
the manifold by o(x9y):= o(x).
where we have let ?(t) = ^(t) and extended s to a function defined on all of
a a a a ?a
??? =
ve+aE^e+?E^ =
Vee +
2aVeE +
?? + ^? ^'2^
where we have used torsion freeness and the fact that e(a) = s, ?(s) = 0. Using
ak now the expressions for r.., we get
? s 2s s
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
192 Steffen L. Lauritzen
If this again has to be in the direction of N, we must have
1+a 0-2 _ 1-a _,_ ?? l+2a -2 2s = V1 + s
2s s s
which by multiplication with 2s reduces to the differential equation
20a + 2h2 = (a-1)
? o This is most conveniently solved by letting u = s , whereby ii = 2ss + 2s and
the equation becomes as simple as
?j = a_? ^ u(t) = ^(a-l)t2 + Bt + C, (4.3)
such that the a-geodesic submanifolds are either straight lines (a = 1) or
parabolas in the (?,s )-parametrisation.
The special case a = 1, ? = 0 corresponds to the manifolds
\ = {(?,s) |s=s0>, o^IR+
that give a 1-geodesic foliation.
Another special case is the submanifolds of constant variation
coefficient
V^ =
{(?,s)|s=??},?e^+
that we now see are a-geodesic if and only if a = 1+2? by inserting into (4.3).
V are now connected submanifolds but is composed by two non-connected submani- ??
folds V +
and V "
V + = {(?,s)|?>0}?? , V
" = {(?,s) |y>0}f?V .
The (V ,V ") manifolds do not represent a-geodesic foliations since they are
not a-geodesic for the same value of a. For a = 0 we see that the geodesic sub-
2 2 manifolds are parabola's in (?,s ) with coefficient -h to ? , a result also
obtained by Atkinson and Mitchell (1981) and Skovgaard (1984).
Consider now the hypothesis (?,s) e? , i.e. that of constant varia-
tion coefficient. We shall illustrate the idea of geodesic estimation in this
example as described at the end of section 3.
2 V is a=1+2? geodesic. The ancillary manifolds to be considered
are then -a-geodesic manifolds orthogonal to V .
An arbitrary -a-submanifold is the "parabola"
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 193
s = (-(1+?2)?2+??+0?5 ?
which follows from (4.3) with a = -(1+2? ). Its tangent vector is equal to
e+?E = ??: [-2(1+?2)?+?]?+?.
The tangent vector of the hypothesis is
e+??.
They are at right angles at (uo^q)
if and on1y if
1+1 [-2(1+?2)?0+?]=0
~ ?=(1+2?2)?0.
The ancillary manifold intersects at (?0>??0)
if and only if
-(1+?2)?2+(1^)??+0=?2?2 ^ C=0?
2 The -(1+2? )-geodesic ancillary manifolds are thus given as
W = {(?,s (t))|tel }, pcIRxiO}
(Wq =
{(0,a)|a?IR+})
where s 2(t) = -(l+y2)t2 + (l+2Y2)yt and
r 2
I Vl
' 1+2Y?
A
]0, ?%- u[ if U>0
(]-^?-?,0[ Tf ???. V. 1+?
2 The manifolds W , ?e IR actually constitute a -(1+2? ) -foliation of the Gaussian
manifold. To see this, let (x,s) be an arbitrary point in M. If we try to
solve the equation
(x,s2) = (t,-(l+Y2)t2+(l+2Y2)yt)
we obtain exactly one solution ? for xj09 given as
s2
^ (l+Y2)x2+s2 (1+?2)?+?2 ? ~J=
(1+2?2)? (1+2?2) *A (4.4)
s2 i.e. a linear combination of ? and ?jz.
y?X ?, as determined by (4.4) is the geometric estimate of ?, when ?
and s denote the empirical mean and standard deviation of a sample x,,...,x .
It is by construction (see Amari (1982)) consistent and first-order efficient.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
194 Steffen L. Lauritzen
A picture of the situation is given below in three different parametrizations:
2 -2 (??s), (?,s ), and (?,s ):
-2 0 2
Fig. 1: Geometric estimation with constant coefficient of variation, (?,s)-
param.
Fig. 2: Geometric estimation, (?,s )-param.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 195
??
Fig. 3: Geometric estimation, (?,-*-) param.
To obtain a geometric ancillary and test-statistic we proceed as follows:
We take a system of vectors on the hypotheses whose directions are
2 -(1+2? ) -parallel and whose lengths are equal to one. Further they are to be
orthogonal to the hypothesis (and thus tangent to the estimation manifolds).
The directions should thus be given as
? = (vrv2) -e + y- E.
2?
To obtain unit length, we get ||v| ?17^ 1 / 2??-1 _ ? 2?2+1
s V 0 2 "
\9 4 2 ?? ?y ?
when s=??, and our orthogonal field is thus
?(?) = [??(?),?2(?)]
= a[-y,^]
4 2 h where a = (2? /(2? +1)) . To find the exponential map
-(1+2?2) exp a?(?)} = (f(t,v),o(t,p))
we shall solve the equations
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
196 Steffen L. Lauritzen
s2(?,?) = -0+Y2)f(t,y)2 + (l+2y2)f (t ,? )? (4.5)
d fd {?'?)
= -ap and ?(0.?) =
f=2f ? (1+a) *- f = -2/YZf
J- (4.6)
since only the speed of the geodesic has to be determined. (4.6) is easily seen
to be equivalent to
a 2 f = ?s ? for some K+0. (4.7)
Inserting (4.5) into this we obtain
f = K(-(1+Y2)f2 + (l+2Y2)yf)'2Y
and separation of variables yield 2
/J[-(1+Y2)u2 + (1+2?2)??]2? du = Kt+C ?
Substituting v=u/y we get
v4Y2+1G(fit1}il) = Kt+C (4i8)
where G(x) = /J [-(l+Y2)v2+(1+2Y2)v]2Y2dv.
Using the initial condition ?(0,?)=? we get
C = p4y2+1G(1)
and the condition f(0,y) = -ay yields together with (4.7)
? = s4?2(0,?)(-3?) =-3?4?2?4?2+1,
whereby
/Ai ejf?LHi, . .aY4y>2+1t + ?4?2+16(?),
2 4? +1
and dividing by y Y yields thus
^l?hA) = -??4? + G(1)
and therefore f(t,y) = yh(t) where
2 h(t) = G'^-ay^ t + 6(1)).
Inserting this into (4.5) yields
a(t,y) = y /-(HY2)h(t)2+(l+2Y2)h(t)
which is linear in y. If we now interpret points of same "distance" from the
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 197
hypothesis as those where t is fixed and only y varying, we see that s/x is in
one-to-one correspondence with t. We shall therefore say that s/x is the
geometric ancillary and this it also is the geometric test statistic for the
hypothesis a=yy.
It is of course interesting, although not surprising, that this
test statistic (ancillary) is obtained solely by geometric arguments but still
equal to the "natural" when considering the transformation structure of the
model.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
6. THE INVERSE GAUSSIAN MANIFOLD
Consider the family of inverse Gaussian densities
*fv. ,\ - ?L ^ - ^(xx"1"?- ??) -3/2 . ~ t(?;?,?) -
y/g^e ? ? ???>0
w.r.t. Lebesgue measure on IR . We choose to study this manifold in the para-
metrization (?,?), where
? = x_1 ? = ?-
1???
, 2- . 1(?-1+?2?) f(x;n.e) = h(x)n"S
n 2?
The metric tensor and the skewness tensor can now be calculated either by using
their definition directly or by calculating these in the (?,?) coordinates and
using transformation rules of tensors. We get
?
?? /
-3 -1 -2 3 G112=0, ?1?=? ' ?122=? ? ' ?222=~72~"
* ? ?
The Riemannian connection is now determined by
?ijk =
^Wuc^k9?3' such that
arm = -(Ha)/(2n3), r112
= ?211
= ?]21
= 0
a ? a a ? G221
= (1-a)/(2?tG), G122 =
G212 = -(1+a)/(2?? )
G222 = (3a-1)/(2?2?)
Multiplying with the inverse metric we get
198
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 199
a-. a~ a? a,
Ty = -0+a)/? G^
= rj2
= r?1
= O
G22 =
G^ = -(1+a)/(2?) rj2
= (1-a)/?
G22 = (3a-1)/2?.
To find all geodesic submanifolds of dimension one we first notice
a2 that since r,, ? 0, the manifolds
\' <(?.?}|?-?0>
are a-geodesic for all a, i.e. geodesic and they constitute a geodesic foliation
of the inverse Gaussian manifold. Because
f X = ?"1
they correspond to hypotheses of constant expectation.
Consider now a submanifold of the form (n(t),t), i.e. with tangent
? given as
? = ? e + E, Where e = ^
E = ?-
.
We extend ? by letting n(x,y): = n(y)> i.e. such that e(n) = 0, ?(?) = ?. Then
a ?a #ot _ a vMN = ? ? e + 2r>v E + ne + v-E
? e e E
/*' 1+a ?2 . 1-ou , / 1+a ? , 3a-l\r- = (? . __ ? + __)e + (- _ n + ___)?
a We now have V..N = hN iff
?r 1+a ? , 3a-1t _ r?? 1+a ? , 1-a? ?[- ?"?
+ ~2G]
' [? " ~ ? + ~G]
which reduces to the differential equation
3a-l ? a-1 2t " t
'
This is first solved for a = ?:
2 2 n =
--^r+-*n =
-2-logt + C^
n(t) = - |
t log t + C-jt
+ C2.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
200 Steffen L. Lauritzen
1-3a 1 o
For a f 2 we get by letting u = nt that u satisfies the differential
equati0n - l+3a 1^3?
? = (a-i)t 2
^u(t) =fe^-t 2 +
C1
Whereby 3^
n(t) = y^-t
+ Bt 2 + C, a ? ]?
For a=l (the exponential connections) we get the parabolas:
n(t) = Bt2 + C
and for a=-l (the mixture connection) we get the curves:
n(t) = -t + B/t + C.
In the Riemannian case (a=0) we get
n(t) = -2t + B/F + C
that are parabolas in (/?G,?).
The curvature tensor is given by
a a a a a a a , 2
R1212
* <?G21>
- Mll^s
+ (rlr2r21
" r2r2r?l > =
^?
The manifold is thus conjugate symmetric (we already know, since it is an ex-
ponential family) and the sectional curvature is
?a(s]2) =
-R12l2/(gllg22) = "?-a2)/2.
Note that the Riemannian curvature (a=0) is again constant equal to -h9 as in
the Gaussian case. In fact the a-curvature is exactly as in the Gaussian case.
We can map the inverse Gaussian manifold to the Gaussian by letting
? = /2? s2 = ?/2
and this map is a Riemannian isometry. However, it does not preserve the skew-
ness tensor and thus the Gaussian and inverse Gaussian manifolds do not seem to
be isomorphic as statistical manifolds, although they are as Riemannian mani-
folds.
Corresponding to the hypothesis of constant coefficient of vari-
ation, we shall investigate the submanifold corresponding to the exponential
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 201
transformation model /?? = ?, ? fixed, i.e.
h(x)/?V 2? s>0
which in the (n,e)-parametrization is a straight line through the origin (as
const, coeff. of var.)
{? = ??} = V ??
This submanifold is a-geodesic if and only if
2(g-l) _ 2+? ? =
l^T ~ a -
2+37 ?
The tangent space to V is spanned by ye+ E, and the orthogonal -a-geodesic
submanifolds are given by solving the equations
l-3a
2ia^?=.2iI+^.+ B. 2
+c l-3a l+3a
to get the intersecting point and orthogonality at ( ,_I ' #,?) gives
3a+l
(5.1)
? = 8a y 2
l-9a2
Combining this with (5.1) we get C=0, i.e. the estimation manifolds are given as
3a+l l-3a
ncy (t) - i?Mt - SajL2
2 {Z) "
l+3a Z
? Q 2 r
The manifolds W^,, ?>0 again constitute a -a-foliation of the inverse Gaussian ??
manifold as is seen by solving the equations
(?0??0) =
(n#(t),t)
which gives t=eQ, and
- G (3a-?? "? 4a
3a+1 ? 3a-1
?0 +
^G ?0?0
3a+1
. G(3a-1)(a 90 [_
4a
-t 2
?11+ 9a2-1 ?0 8a ?
0 J
3a+1
This again determines a geometric estimate ? of ? from a sample x^,...,x from
the inverse Gaussian distribution, and this is obtained by letting
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
202 Steffen L. Lauritzen
?0 = ]/* ?0
= ? S??
" ?/* '
and inserting a = (2+?)/(2+3?) into the expression given above.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
7. THE GAMMA MANIFOLD
Consider the family of gamma densities
f(x;y,?) = (?/y)? x?"Vr(6) exp{- ^} y>0, ?>0
w.r.t. Lebesgue measure on IR+. The metric tensor is obtained by direct cal-
culation in the (y,3)-parametrization as
0
where ?(?) = D2 log r(?) - 1/$.
The Riemannian connection is now obtained by
fijk =
^3l9jk +
3jgik -
3kgij] t0 be
fin = -?/y3; f112
= - l/(2y2); f121 =
?211 = 1/(2?2)
f222 = V(?)> f221
= r122
= f212
= 0.
1 Similarly we calculate r... by the formula
? J ?
?1Jk -
t?^Ej?DE^D) to be
1 3 ] 2 G??
= -23/? G121 = 1/p
11111
G122 =
G?2 =
G212 =
G222 =
G221 = ?
and the skewness tensor T..k =
2(^1-j|< "
rijk^
Tlll = 2?/^ ??2
= T121
= T211
= ~1/y T222 = f'(?)
T221 =
T122 =
T212 = ?'
203
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
204 Steffen L. Lauritzen
whereby the a-connections are determined to be
aT - (1+?)? "
_ad_ hn
" 3 r112 0 2
? 2?
a a ,, a ?
1121 '211 0 2 !222 2 f m 2?
a a a
G122 =
G212 =
G221 = ?"
Multiplying by the inverse metric we get
? = . J+2. ?2 - a'? 11 y "
2?(3)
aA ?1 . J+a "2 _ha f'(?) M2
" *21 23 *22
" 2 Tf?T
and all other symbols equal to zero.
The curvature is by direct calculation found to be
" _ (a2-1)[f(?)+?f'(?)]
1212 4?F(3)
The space is conjugate symmetric and therefore the curvature tensor is fully
determined by the sectional (scalar) curvature which is
(a) _ 2 ? 22 . l-a2 [f(?)+?F'(?)3 K " Rl2129 9 -"?
?2f(?)
Note that this is even for a=0 different from the two previous examples in that
the curvature is non-constant and truly dependent on the shape parameter 3.
To find all geodesic submanifolds we proceed as follows:
If p=yQ is constant on N^, X,(N) is spanned by the tangent vector E corresponding
to differentiation w.r.t. the second coordinate. Since
h-
a -,
vEE- ?
these submanifolds are geodesic for all values of a and constitute a geodesic
foliation of the gamma manifold.
Considering the manifold given by 3=3q? its tangent space is span-
ned by e and since
"=-^e+ "-1 -
e P 2?2f(?)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 205
these are a-geodesic if and only if g=l.
In general let us consider a hypothesis (submanifold) of the type
(f(t),t). Its tangent vector is
f e + E and e(f) = 0, E(f) = f
we have a . ??a .a .. a v? ? ? c(f e + E) = f ? e = 2fv E + f e + ?G? f e + E ' e e E
= r.f2 Ha + f
1+e + i]e + [f2 _??_ + J^a .?^Lje ? ? 2/f(?)
2 F(?)
If we now let ?=t u=f and multiply the coefficient to E by f we obtain the
equation
-(l+a)^(Ha){+f =
4^+2^'(t)f
which unfortunately does not seem soluble in general. For o=l the solutions are
f(t) = t/(At+B).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
8. TWO SPECIAL MANIFOLDS
In the present section we shall see that things are not always as
simple as the previous examples suggest, but even then we seem to be able to get
some understanding from geometric considerations.
First we should like to notice that when we combine two experiments
independently with the same parameter space, both the Fisher information metric
and the skewness tensors are additive. Let X^PA ?^?? and let ?., ?. denote the DO 11
derivative of the two log-likelihood functions
Ai =
W7 log f (?;?) Bi =
drlog g(y;e)-
Then the skewness tensor is to be calculated as
V =
E^){VBj)(VBk>
? EWk+ EBiBjBk
since all terms containing both A1s and B's vanish due to the independence and
the fact that EA. = EB. = 0.
If we now let ?^?(?,s ), ?^?(s,?) and X and Y independent we get
by adding the information and skewness tensors that in the ^,a)-parametrization
? ? "Is-1- ?2 (?
?
and that, as in the Gaussian manifold, we have
3 3 Tlll
= T122
= T212
= T221
= ? ??2
= 2/s T222 = 8/s '
a Since derivatives of the metric are as in the Gaussian case, so are the r...-
symbols:
206
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 207
a a a a
G1? =
G122 =
G212 =
G221 = ?
a a ? a ^ G121
= G2?
= "(?+a)/s G222=-2(1+2a)/s^.
ak But the a-connections are truly different which is seen by looking at the r..-
* j
symbol s :
a9 ~ a, a?
G^ = (l-a)/(2a+aJ) rj2
= r^
= -(1+a)/s
G22 = -2(1+2a)/(2s+s3)
and all others equal to zero. Considering now the curvature tensor we get
R - (1 ? [2(1+a)+a2(2-a)] _
?
K1212 u a; _4??_2? K211
-a a4(2+a2)
2112
and this is clearly different from R-ioi? wherebY this space is not conjugate
symmetric. The sectional curvature is not determining the curvature tensor be- 1
cause e.g. R-ipi??? but the sPace is not 1 -"Plat since
R - "r - MO C2(l-a)+a2(2+g)] _ ?
R1221 "R1212 'U+a) 4,9^ 2x-R2121 s (2+s J
a From standard properties of the curvature tensor we have
R,?-.?^ = 0, but we
obtain by direct calculation that
a a a a
Rl 211 =
R2in =
R1222 =
R2122 = ?5
such that the above components are the only ones that are not vanishing.
If we try to find the geodesic submanifolds we first observe that
al because r0 = 0 for all a, the submanifolds
? = ?(?,s)|?=??}
are totally geodesic for all a, and thus constitute a geodesic foliation of the
manifold. Following the remarks at the end of section 4, relating geodesic
foliations to the affine dual foliations of Barndorff-Nielsen and Blaesild
(1983), it is of interest to know that also in this example, the maximum likeli-
2 hood estimates of s and ? are independent as expected from the foliation. We
shall now proceed to find the remaining geodesic manifolds.
If we consider manifolds of the type (t,f(t)) with tangent vector
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
208 Steffen L. Lauritzen
e + f E we get
Vf E (e+f E) = V
+ 2fveE
+ f2vEE
+ f E
= .?(1+a)e+(Jr?_.2?Il2aif2)E
2s+s 2s+s
Multiplying the coefficient to e with f and inserting a=f we get the equation
2f2 |1(?+a)
, ? i?lM + ?-? .f T
f(2+r) f(2+r)
Multiplying on both sides with f(2+f ) and collecting terms gives
2f2f2(l+a) + 2ff + ff3 + 2f2 = a-1
and this does not seem to have a particularly nice solution.
Note that f(t) and yt is not a solution since then f=y f=0 and we
obtain the equation for a:
2Y4t2(l+a) + 2?2 = a-1
which can only hold when a = -1 and then we get
2 2y = -2
which is impossible.
In this example the "constant coefficient of variation" does also
not have any simple group transformational properties.
It seems then of interest to see what happens if we consider the
2 model with ?^?(?,s ), Y^N(log s,?) which is related to the example just consid-
ered but where the "constant coefficient of variation" is_ transformational. The
model is also transformational itself (the affine group). By the same argument
as before the skewness tensor becomes identical to that of the univariate Gaus-
sian manifold. The metric, however, becomes
, ? o 2 / 1 0 > g =
-?- j g = s s I 0 3 '
whereby we calculate the Riemannian connection to be
\
G112 = l/s3 f211
= f121
= -1/s3
3 G222
= ~3/s G?1 =
G122 =
G212 =
G221 = 0#
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 209
The a-connections are
a
G122 = (1-a)/s? G121
= G211
= -(1+a)/s 3
a _
a /? ?, 3
a
G222 = (-3"4?)/s3 !",?
= G122
= G212
= G221
= 0,
or in the r. .-symbols: ? j an a?? a-?
G^ = (1-a)/3s G21
= G21
= -(1+a)/s
G22 = "(3+4a)/3s?
The curvature tensor can be calculated to be
2 - (1-a)(3+a) I . (?a)(3-a) K1212 4 K1221
" ' 4
s s
So we do indeed again have a manifold that is not conjugate symmetric. All
other components are again vanishing apart from ^112' R2121*
The sPace is not
flat for any value of a.
Considering the problem of finding all geodesic submanifolds we have
the same situation as earlier in that
_N = {(?,s) |?=??)
??
together constitute a foliation that is geodesic for all values of a, again in
accordance with the independence of ? and s.
Consider now a submanifold of the type [t,f(t)] with tangent
e + f E. We get a a .a .?a ? ^r(e+fE) = ? e + 2fv E + f?vcE + fE
e+fE 'e e E
= - -?- (l+a)e +
bjf-jjT-f + f JE
Multiplying the coefficient to e by f and everything by 3f and reducing, we
obtain the following differential equation:
(3+2a)f2 + 3ff = a-1 .
For a=0 (the Riemannian case), we get
3f2 + 3ff = -1.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
210 Steffen L. Lauritzen
o Letting f = ?G, u = f we obtain as in the Gaussian case the equation
ii = - |
?<-? u = - jt2
+ At + B,
2 i.e. again parabolas in the (?,s ) parametrization but with a different coef-
2 ficient to t .
Note that, in fact, considered as a Riemannian manifold there is no
essential difference between this and the univariate Gaussian manifold, since
we have constant scalar Riemannian curvature equal to
* 4 ? s _ ? " 4
# ?
* s
i.e. again a hyperbolic space.
If a | {1,-2*} the following special parabolas are solutions:
?2 = ?+T2
+ By + ?2 p^t?
B arbitrary?
2 2 s =
aQ is 1-geodesic. For a = -3/2 no parabolas are geodesic. The equation
then reduces to
f f = ? tT 3 '
the general solution to which cannot be obtained in a closed form.
If we consider the transformation submodel of "constant coefficient
of variation" s=?? corresponding to f(t)= t, we get the equation
(3+2a)?2 + 0 = a-1.
Solving this for a we find the following peculiarity:
a = (3?2+1)/(1-2?2) if y2jh
but if ? = ?/2/2, the equation has no solution!! In other words, all "constant
variation coefficient submanifolds" of the manifold studies are a-geodesic for
2 suitably chosen a except one (? = h).
A reasonable explanation for this is at present beyond my imagina-
tion. Is there a missing connection (a=?>)? Have I made a mistake in the cal-
culations? Or is it just due to the fact that the phenomenon is related to how
this model is a submodel of the strange two-dimensional model. In any case,
there is a remarkable disharmony between the group structure and the geometry.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 211
To go a bit further we consider the three-dimensional manifold
2 (^,a,c)-parametrized) obtained from considering X ^ ?(?,s ), ? ^ ?(?,1). The
metric for this becomes
2 s
0 0
and the skewness-tensor and the a-connections are identical to the Gaussian
case when only indices 1 and 2 appear and all involving the third coordinate
are equal to zero. Letting (e,E,F) denote the basis vectors for the tangent
space determined by coordinatewise differentiation, we consider now the "con-
stant coefficient of variation" submanifold:
{(t,Yt, log ? t), t e IR+}
with tangent-vector ? = e + ?? + Ar, and we get
v ?
Wie+YE) + <-
7>F
a a ?a ?. =
vee +
2???? + ? v?E
- -j F
Inserting the expressions for the a-connections we obtain
r, ? ? ^a r&~? . ?+2a\G 1 r ???
= -2t^- (m
+ y-r)E
_7F
= -l[2(l+e)e + (^+Y(H2a))E
+ lF].
If this derivative shall be in N's direction we must have
2(1+a) = 1 -> a=-h9
but also
? = Ir
+ ?(1+2a) -*2?2 = -1?
which is impossible. We conclude thereby that this transformational model is
not a-geodesic for any a, considered as a submodel of the full exponential
model.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
DISCUSSION AND UNSOLVED PROBLEMS
The present paper seems to raise more questions than it answers.
We want to conclude by pointing out some of these, thereby hoping to stimulate
research in the area.
1. How much structure of a statistical model is captured by its
"statistical manifold", the manifold being defined through expected geometries
as by Amari, minimum contrast geometries as by Eguchi or observed geometries as
by Barndorff-Nielsen? On the surface it looks as if only structures up to
third order are there and as if one should include symmetric tensors of higher
order to capture more.
2. Some statistical manifolds (M^g-pD-j)
and (M2,g2,D2) are
"alike", locally as well as globally. Various types of alikeness seems to be of
some interest. Of course the full isomorphism, i.e. maps from M, to M~ that
preserves both the Riemannian metric and the skewness tensor. But also maps
that preserve some structure, but not all could be of interest, in analogy with
the notion of a conformai map in Riemannian geometry (maps that preserve angles,
i.e. the metric up to multiplication with a function). There are several pos-
sibilities here. Isometries that preserve the skewness tensor up to a scalar
or up to a function. Maps that preserve the metric up to scalars and/or func-
tions and do and do not preserve skewness etc. etc.
3. In connection with the above there remains to be done a lot of
work on classification of statistical manifolds in a pure mathematical sense,
i.e. characterize manifolds up to various type of "conformai" equivalence,
"conformai" here taken in the senses described above. A classic result is that
212
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Statistical Manifolds 213
two Riemannian manifolds are locally isomorphic if they have identical curvature
tensors. Do similar things hold for statistical manifolds and their a-curva-
tures? Note that the inverse Gaussian and Gaussian manifolds seem to be alike
but not fully isomorphic. Results of Amari (1985) seem to indicate that a-flat
families are very similar to exponential families. Are they in some sense
equivalent? There might be many interesting things to be seen in this direc-
tion.
4. Some statistical manifolds seem to have special properties. As
mentioned above we have e.g. a-flat families, but also manifolds that are
conjugate symmetric or manifolds with constant a-curvature both for a particular
a and for all a at the same time. Which maps preserve these properties? Can
they in some sense be classified?
5. How does the geometric structures behave when we form marginal
and conditional experiments? Some work has been done on this by Barndorff-
Nielsen and Jupp (1984, 1985).
6. Is there a decomposition theory for statistical manifolds. We
have seen that there might be a connection between the existence of geodesic
foliations and independence of estimates. There might be a de Rham-like theory
to be discovered by studying parallel transports along closed curves in flat
manifolds?
7. Chentsov (1972) showed that the expected geometries were the
only ones that obeyed the axioms of a decision theoretic view of statistics, in
the case of finite sample spaces. It seems of interest to investigate general-
izations of this result, both to more general spaces and to other foundational
frameworks. Picard (1985) has generalized the result to the case of exponential
families and has some results pertaining to the general case.
8. What insight can be gained by studying the difference between
observed and expected geometries?
9. How is the relation between the geometric structure of a Lie-
transformation group and the geometric structure of its transformational statis-
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
214 Steffen L. Lauritzen
ti cal models?
Other questions and problems are raised by Barndorff-Nielsen, Cox,
and Reid (1986) and in the book by Amari (1985).
Acknowledgements
The author is grateful to Ole Barndorff-Nielsen, Preben Blaesild,
and Erik Jtfrgensen for discussions relevant to this manuscript at various
stages.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
REFERENCES
Amari, S.-I. (1982). Differential geometry of curved exponential families -
curvatures and information loss. Ann. Statist. 10, 357-385.
Amari, S.-I. (1985). Differential-Geometrical Methods in Statistics. Lecture
Notes in Statistics Vol. 28, Springer Verlag. Berlin, Heidelberg.
Atkinson, C. and Mitchell, A. F. S. (1981). Rao's distance measure. Sankhya
A 43 345-365.
Barndorff-Nielsen, 0. E. and Blaesild, P. (1983). Exponential models with
affine dual foliations. Ann. Statist. jj_ 753-769.
Barndorff-Nielsen, 0. E., Cox, D. R. and Reid, N. (1986). The role of differen-
tial geometry in statistical theory. Int. Statist. Rev, (to appear).
Barndorff-Nielsen, 0. E. and Jupp, P. E. (1984). Differential geometry, profile
likelihood and L-sufficiency. Res. Rep. 113. Dept. Theor. Stat., Aarhus
University.
Barndorff-Nielsen, 0. E. and Jupp, P. E. (1985). Profile likelihood, marginal
likelihood and differential geometry of composite transformation models.
Res. Rep. 122. Dept. Theor. Stat., Aarhus University.
Boothby, W. S. (1975). An Introduction to Differentiable Manifolds and Rieman-
nian Geometry, Academic Press.
Chentsov, N. N. (1972). Statistical Decision Rules and Optimal Conclusions (in
Russian) Nauka, Moscow. Translation in English (1982) by Amer. Math. Soc.
Rhode Island.
Efron, ?. (1975). Defining the curvature of a statistical problem (with discus-
sion). Ann. Statist. 3 1189-1242.
215
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
216 Steffen L. Lauritzen
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a
curved exponential family. Ann. Statist. lj[ 793-303.
Picard, D. (1985). Invariance properties of the Fisher-Rao metric and Chentsov-
Amari connections using le Cam deficiency. Manuscript. Orsay, France.
Rao, C. R. (1945). Information and the accuracy attainable in the estimation of
statistical parameters. Bull. Calcutta Math. Soc. 37 81-91.
Skovgaard, L. T. (1984). A Riemannian geometry of the multivariate normal
model. Scand. J. Statist. 11 211-223.
Spivak, M. (1970-75). Differential Geometry Vol. I-V. Publish or Perish.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
DIFFERENTIAL METRICS IN PROBABILITY SPACES
C. R. Rao*
1. Introduction.219
2. Jensen Difference and Entropy Differential Metric . 222
3. The Quadratic Entropy.226
4. Metrics Based on Divergence Measures . 228
5. Other Divergence Measures . 231
6. Geodesic Distances . 234
7. References.238
Department of Mathematics and Statistics, University of Pittsburgh,
Pittsburgh, PA
217
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
1. INTRODUCTION
In an early paper (Rao, 1945), the author introduced a Riemannian
(quadratic differential) metric over the space of a parametric family of prob-
ability distributions and proposed the geodesic distance induced by the metric
as a measure of dissimilarity between probability distributions. The metric
was based on the Fisher information matrix and it arose in a natural way
through the concepts of statistical discrimination feee also Rao, 1949,1954,1973
pp. 329-332, 1982a). Such a choice of the quadratic differential metric, which
we will refer to as the information metric, has indeed some attractive proper-
ties such as invariance for transformation of the variables as well as the para-
meters. It also seems to provide an appropriate (informative) geometry on the
probability space for studying large sample properties of estimators of para-
meters in terms of simple loss functions as demonstrated by Amari (1982, 1983),
Cencov (1982), Efron (1975, 1982), Eguchi (1983, 1984), Kass (1981) and others.
Kass (1980, Ph.D. thesis) explores the possibility of using differential geo-
metric ideas in statistical inference.
The geodesic distances based on the information metric have been
computed for a number of parametric family of distributions in recent papers by
Atkinson and Mitchell (1981), Burbea (1986), Kass (1981), Mitchell and
Krzanowski (1985), and Oiler and Cuadras (1985).
In two papers, Burbea and Rao (1982a, 1982b) gave some general
methods for constructing quadratic differential metrics on probability spaces,
of which the Fisher information metric belonged to a special class. In view of
the rich variety of possible metrics, it would be useful to lay down some
219
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
220 C. R. Rao
criteria for the choice of an appropriate metric for a given problem. Amari has
stated that a metric should reflect the stochastic and statistical properties
of the family of probability distributions. In particular he emphasized the
invariance of the metric under transformations of the variables as well as the
? parameters. Cencov (1972) shows that the Fisher information metric is unique
under some conditions including invariance. Burbea and Rao (1982a) showed that
the Fisher information metric is the only metric associated with invariant
divergence measures of the type introduced by Cisz?r (1967). However, there
exist other types of invariant metrics as shown in Section 3 of this paper.
The choice of a metric naturally depends on a particular problem
under investigation, and invariance may or may not be relevant. For instance,
consider the space of multinomial distributions, ? = {(?,,...,? ): p. > 0,
S?. = 1}, which is a submanifold of the positive orthant, X = {(x-j,...,x ):
?. > 0} of the Euclidean space Rn. A Riemannian metric on X automatically pro-
vides a metric on the submanifold ?. In a study of linkage and selection of
gametes in a biological population, Shahshahani (1979) considered the metric
? ? ??. 0 ds2=[?Ldx2 (1.1)
1 i
which induces the information metric on ?. This metric provided a convenient
framework for a discussion of certain biological problems. However, Nei (1978)
considered a distance measure associated with the Euclidean metric
ds2 = Sdx2 (1.2)
which he found to be more appropriate for evolutionary studies in biology. The
metric induced on d by (1.2) is not the Fisher information metric. Rao (1982a,
1982b) has shown that a more general type of metric
SS?. .dx.dx. (1.3) IJ I J
called the quadratic entropy is more meaningful in certain sociometric
and biometrie studies.
The object of the present paper is to provide some general methods
of constructing Riemannian metrics on probability spaces, and discuss in
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 221
particular the metric generated by the quadratic entropy which is an ideal
measure of diversity (see Lau, 1985 and Rao, 1982b), and has properties similar
to the information metric, like invariance. We also give a list of geodesic
distances based on the information metric computed by various authors (Atkinson
and Mitchell, 1981; Burbea, 1986; Mitchell and Krzanowski, 1985; Oiler and
Cuadras, 1985 and Rao, 1945).
The basic approach adopted in the paper is first to define a measure
of divergence or dissimilarity between two probability measures, and then to use
it to derive a metric on M, the manifold of parameters, by considering two
distributions defined by two contiguous points in M. We thus provide a method
for the construction of an appropriate geometry or geometries on the parameter
space for discussion of practical problems. Some divergence measures may be
more appropriate for discussing properties of estimators using simple loss
functions while others may be appropriate in the study of population dynamics in
biology. It is not unusual in practice to study a problem under different
models for observed data to examine consistency and robustness of results. The
variety of metrics reported in the paper would be of some use in this direction.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
2. JENSEN DIFFERENCE AND ENTROPY DIFFERENTIAL METRIC
Let ? be a s-finite additive measure defined on a s-algebra of
subsets of a measurable space X9 and P^ be the usual Lebesgue space of ? measur-
able density functions,
? = (p(x): p(x) > 0, ?e?, Lp(x)dv(x) = 1} . (2.1)
We call H: ?->R an entropy (functional) on P^ if
(i) H(p) = 0 when ? is degenerate,
(ii) H(p) is concave on P_.
In such a case, with ? > 0, ? > 0, ?+?=1, Rao (1982a) defined the Jensen
difference between ? and qeP^ as
J(A,y; p,q) = ?(?? + uq) - ??(?) - yH(q) . (2.2)
The function J: P_ ? F^->R is non-negative and vanishes ifp = q(iffp = q when
? is strictly concave). If the entropy function ? is regarded as a measure of
diversity within a population, then the Jensen difference J can be interpreted
as a measure of diversity (or dissimilarity) between two populations. For the
use of Jensen difference in the measurement, apportionment and analysis of di-
versity between populations, the reader is referred to Rao (1982a, 1982b).
Let us now consider a subset of probability densities characterized
by a vector parameter ?
P. = {?(?,?): ?(?,?)e?, ?e?, a manifold in Rn} ?? ?
and assume that ?(?,?) is a smooth function admitting derivatives of a certain
order with respect to ? and differention under the integral sign. For conven-
ience of notation, we write
222
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 223
?(?.?) = ??, ?(?) =
?(??), ?(?,f) = ?(???
+ ??f)
J(e^) = ?(?,f) - ??(?) - ??(f) (2.3)
where ?,fe?. Putting ? = ? + de and denoting the i-th component of a vector
with a subscript i, we consider the formal expansion of J(e,e+de),
J_ 55 3^(?,f=?) . . + JL yyy a3J(e,<F9) de de de + 2!
\\ 3F?3fa. deidej 3!
\\\ 3F?3F?3F|( de1dejdV???
= jr
SS 9ijte)deidej
+ ?G
SSS cijk(e)deid6jdV???
(2'4)
In (2.4), the coefficients of the first order differentials vanish since J(e^)
2 has a minimum at f = ?, and the notation such as 3 ?(?,f=?)/3?.3f. is used for
replacing ? by ? after carrying out the indicated differentiations. u
From the definition of the J function, it follows that the (gin?) is
a non-negative definite matrix and obeys the tensorial law under transformation
of parameters. We define the matrix and the associated differential metric
(gfj) and ??
gj^e-de^. (2.5)
as the ?-entropy information matrix and ?-entropy differential metric respec-
tively. We prove the following theorem which provides an alternative computa-
tion of the ?-information matrix directly from a given entropy H.
Theorem 2.1
H 32?(??O+???)
^3 3??.3( 3 (2.6)
Proof: By definition
IJ 3f^9f4
32?(?,f=?) _ 32?(f=?)
3f^3f? 3F??3f?? (2.7)
Since ?(?,?) attains a minimum at f = ?
3?(?,f=?) _ 3?(?) ?7 ?? 3f.
? 3T. V ' J J
Differentiating both sides of (2.8) with respect to e. we have
32?(?,f=?) 32?(?,f=?) , 92?(?) (2 gx 3?.3f. 3F1?3F1?
3T.3T. V J
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
224 C. R. Rao
which gives (2.6), and the desired result is proved.
Let us consider a general entropy function of the type
H(Pj h(pjdv(x) (2.10)
where h"9 the second derivative of h, is a non-negative function. Then using
(2.6) ??,<?>
? ??jW
? -
^^
3 h(Xp +ypj
? 3 dv(x)
9P* 9PQ
If h(x) = ? log ?, leading to Shannon's entropy, then
'ij 9^(?) = ??
?? 36i 3ej dv(x)
(2.?)
(2.12)
become the elements of Fisher's information matrix. If h(x) = (a-l)~ (xa-x),
a ? 1, we have the a-order entropy of Havrda and Charv?t (1967) and
9?, =
gjj?e) = a??
a logpa a log p.
36i 3T, dv(x) (2.13)
which provide the elements of a-order entropy information matrix, and the
corresponding differential metric given in Burbea and Rao (1982a, 1982b).
We prove Theorem 2.2 which gives alternative expressions for the
coefficients of the third order differentials in the expansion of J(e^).
Theorem 2.2.
H = r 93?(?,f=?) + 33?(?,f=?) + 33?(?,f=?)-, Cljk
" L 3??.3?a?3f|< 3?13f.3f|? 3?^3f?.3f|<
J
Proof: By definition
? , , = 33?(?,f=?) LljkV?;
9f13f3?3f?<
= 33?(?,f=?) 33?(?)
3f13f^.3f|<
" ? 3ei36j30k
(2.14)
(2.15)
From (2.9), writing i = j and j = k we have
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 225
92?(?,f=?) + 32?(?,f=?) = 32?(?)
aej3(|)k 9(i)ja(|)k y
aej"k "
Differentiating with respect to ?.
33?(?,f=?) + 33?(?,f=?) 33?(?,f=?) 33?(?,f=?) , 33?(?)
3??3?.3f|< 3f^3??3f^ 3?..3f.3f^ 3f^3f.3f^ ?
3T^3T.3T^
which gives (2.14) as equivalent to (2.15). This proves Theorem 2.2.
Let ? be Shannon's entropy. Then, an easy computation gives
cijk -
xy([r|J) +
d-x)Tijk] +
[?$ +
(1-?)?.jk] +
[r{??] +
(l-p)T1Jk]} (2.16)
where 2 m 3 log pft 3 log ? 3 log ? 3 log ? 3 log ?
i jk v
3?^3?. 39k ' * i jk
V 3T1
3T. 30k J '
(2.17)
Adopting the notation of Amari for a-connexion
AA . rO) +hT 1 i j k ljk 2 ijk
the expression (2.16) can be written
When ? = ? = 1, (2.18) becomes
c =lrr(0) + r(0) + r(0)l (2 19) cijk 4 Lrijk jki rikjJ
? u*,yj
Remark 1. In the definition of the Jensen difference (2.2), we
used apriori probabilities ? and ? for the two probability distributions ? and
q which have some relevance in population studies. But in problems of statis-
tical inference, a symmetric version may be used by taking ? = ? = j.
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
3. THE QUADRATIC ENTROPY
The quadratic entropy was introduced in Rao (1982a) as a general
measure of diversity of a probability distribution over any measurable space.
It is defined as a function Q: P+R+
Q(p) = [ K(x,y)p(x)p(y)dv(x)dv(y) (3.1)
where K(x,y) is symmetric, non-negative and conditionally negative definite,
i.e., nn
? K?x^x^a.aj < 0
for any choice of (x-|,...,x ) and of
(a^,...,a ) such that a,+...+a = 0, with
the further condition K(x,y) = 0 if ? = y. It was shown in Rao (1982b, 1984)
that the quadratic entropy is concave over P_ and its Jensen difference has
nice convexity properties which makes it an ideal measure of diversity. In
view of its usefulness in statistical applications, we give explicit expressions
for the quadratic differential metric and the connection coefficients associated
with the quadratic entropy, in the case of the parametric family P_. ?v
From Theorem 2.1, the (i,j)-th element of the Q-information matrix
(3.2)
is ? n 3^Q(Xp + ?? )
g. .(?) =---*- y!JV?;
3?.3f(].
Observing that
(3(??? +
???) =
j K(x,y)[xp(x,e)+yp(x^)][xp(y,e)+yp(y^)]dv(x)dv(y),
we find the explicit expression for (3.2) as
226
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 227
g?j(e) = -2?? ?K(x,y)3%iIl%iidv(x)3v(y) 3T. 3T.
? J (3.3)
= -2 ?? E[K(x,y) 3 1ogP<x'6> 3
l0g3P(y>6)] . * vi
Using the expression (2.14), we find on carrying out the necessary computations
cV,. = -2??(G... + r... + r,..) "ijk ijk i kj jkiJ
where
rijk J \K(x9y)^^^^Mx)My)
39k 3?.36a- (3.4)
It is of interest to note that the expressions (3.3) and (3.4) are invariant for
transformations of both the parameters and variables.
For further properties of quadratic entropies, the reader is refer-
red to Lau (1984) and Rao (1984).
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
4. METRICS BASED ON DIVERGENCE MEASURES
Burbea and Rao (1982a, 1982b), Burbea (1986) and Eguchi (1984)
have considered metrics arising out of a variety of divergence measures between
probability distributions. A typical divergence measure is of the form
F[p(x,e),p(x^)]dv(x) (4.1) DF(V?V jl
where F satisfies the following conditions:
(i) F(?,?) is a C -function of R+ ?
R+,
(ii) F(x,?) is strictly convex on R+ for every xeR+,
(iii) F(x,x) = 0 for every ? e R+,
(iv) aF^x^ = ?) = 0 for every ? e R^. dy +
Let us consider the expansion
VVW =
2T^ij<e>d6id6j +
?c^iejde^ejde^ ... (4.2)
F F and obtain explicit expressions for g.. and c... .
1 j 1J ?
Theorem 4.1. Let
F1(x)y)=%^-,F2(x,y) =
3^)
c - a2F(x,y) ? _ a2F(x,y) F . a2F(x,y) 11
"
3?2 ' rl2 axay
' r22 "
3y2
r = 93F(x,y) F222
3y3 ?
Then
(i) 9^e)
=
?F22[Pe,Pe]^^dv(x)
r 3pfl 3pfi
228
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
Differential Metrics in Probability Spaces 229
?) 'ijk 3?? 9?? 9??
?222[??'??] 3?7 3?-3?Ga?(?) ? j ?
F99CP0>P0]h aV 3Pc 3 ?O 3PC 3 Pc 3PC
22LKe'KeJL3ei36. 3ek 3?t3?? 36j -]dv(x)
-j"Jk ""1
The results are established by straight forward computations.
Let us consider the directed divergence measure of Csisz?r (1967),
which plays an important role in problems of statistical inference,
D(Pe.P,)-Jp(x.e)f({^)dv(x)
where f is a convex function. In this case
(4.3)
?3f.
= f"(1) ? J
J
(4.4)
where g.. are the elements of Fisher's information matrix. Thus a wide class * J
of invariant divergence measures provide the same informative geometry on the
parameter manifold. However, the c... coefficients may depend on the particular ? j ?
convex function f chosen as shown below
f cijk(Q)
* 33D
V9k
-f"(l)Cr{]j[*r{l] +
riy] + (f-(l) +
3f-(l))TiJk (4.5)
where t).'. and T... are as defined in (2.17). 1J ? 1 j ?
The results (4.4) and (4.5) have consequences in estimation theory,
specially in the study of second order efficiency. While a large number of
estimation procedures lead to first order efficient estimates (i.e., having the
same asymptotic variance based on the elements of Fisher information matrix),
they are distinguishable by different second order efficiencies of the derived
estimators (see Rao, 1962).
If f is a convex function, then
f*(u) = uf(l)
This content downloaded from 194.29.185.145 on Wed, 18 Jun 2014 23:33:49 PMAll use subject to JSTOR Terms and Conditions
230 C. R. Rao
is also convex, and the measure (4.3) associated with f+f* is
0*(??>??) :
?PJ& + Pj(ir)3dv(x) (4.6) ? V f
?
which is symmetric in $\theta$ and $\phi$. However, we may define (4.6) as a symmetric divergence measure without requiring $f$ to be a convex function, but only satisfying the condition that $x f(x^{-1}) + f(x)$ is non-negative on $\mathbb{R}_+$. In such a case
$$g^{f^*}_{ij}(\theta) = 2f''(1)\,g_{ij}(\theta), \qquad c^{f^*}_{ijk}(\theta) = 2f''(1)\left[\Gamma^{(1)}_{ijk} + \Gamma^{(1)}_{ikj} + \Gamma^{(1)}_{jki}\right] + 3f''(1)\,T_{ijk}. \tag{4.7}$$
Remarks on Sections 2, 3 and 4. As pointed out by a referee, a unified treatment of the results in these three sections is possible by considering a general dissimilarity measure $D : P \times P \to [0,\infty)$ satisfying
(a) $D(p_\theta, p_\phi)$ is a $C^\infty$ function of $\theta,\phi$,
(b) $D(p,p) = 0$ for every $p \in P$.
Then, putting
$$D_{i;jk} = \frac{\partial^3 D}{\partial\theta_i\,\partial\phi_j\,\partial\phi_k}\bigg|_{\phi=\theta}\,, \quad \text{etc.},$$
and differentiating the identity $D_{;j}\big|_{\phi=\theta} = 0$ yields
$$D_{i;j} + D_{;ij} = 0, \qquad D_{ik;j} + D_{i;jk} + D_{k;ij} + D_{;ijk} = 0,$$
giving expressions for $g_{ij}$ and $c_{ijk}$ for a general $D$. However, the approach adopted in the paper enabled a discussion of the construction of the distance measures $D$ through more basic functions like quadratic entropy, general entropy, cross entropy, and divergence between probability measures. The results expressed in terms of the basic functions are of some interest.
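The referee's recipe is easy to try numerically; the sketch below (with D taken, for illustration only, as the Kullback-Leibler divergence between N(θ,1) and N(φ,1), for which the Fisher information is 1) recovers the metric both as D_{;ij} and as -D_{i;j}:

```python
# Sketch of the unified recipe: for smooth D with D(p,p) = 0, the metric is
# g_ij = D_{;ij} = -D_{i;j} at phi = theta.  Illustrated with
# D = KL(N(theta,1) || N(phi,1)) = (theta - phi)^2 / 2 (our choice).
import numpy as np

def D(theta, phi):
    return 0.5 * (theta - phi) ** 2

t, h = 0.7, 1e-3
# D_{;11}: second difference in phi at phi = t
D_pp = (D(t, t + h) + D(t, t - h) - 2.0 * D(t, t)) / h**2
# D_{1;1}: mixed theta-phi difference at phi = theta = t
D_tp = (D(t + h, t + h) - D(t + h, t - h)
        - D(t - h, t + h) + D(t - h, t - h)) / (4.0 * h**2)
print(D_pp, -D_tp)        # both equal 1.0, the Fisher information
```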
It is also possible to regard the dissimilarity measures of Sections 3 and 4 as having the common form
$$D(p,q) = \iint_{X \times X} F\big(p(x),\,q(x),\,p(y),\,q(y)\big)\,dv(x,y)$$
where $v$ is a symmetric measure on $X \times X$. However, the expressions for $g_{ij}$ and $c_{ijk}$ are then not simple.
5. OTHER DIVERGENCE MEASURES
In the last section, we considered the f-divergence measure which
led to the Fisher information metric. A special case of this measure is the
city block distance, or the overlap distance (see Rao, 1948, 1982a),
$$D_0(\theta,\phi) = \int \big|p(x,\theta) - p(x,\phi)\big|\,dv(x) \tag{5.1}$$
obtained by choosing $f(x) = 1 - \min(x,1)$, which admits a direct interpretation in terms of errors of classification in discrimination problems. However, this is not a smooth function, and no formula of the type (4.7) is available to determine the coefficients of the differential metric. But in some cases it may turn out that
$$D_0(p_\theta, p_\phi) = D_0(\theta - \phi)$$
is a smooth function of $\theta$ and $\phi$, in which case
$$g_{ij} = \frac{\partial^2 D_0^2(\theta,\phi)}{\partial\phi_i\,\partial\phi_j}\bigg|_{\phi=\theta}\,. \tag{5.2}$$
In the case when $p(x,\mu)$ is a $p$-variate normal density with mean $\mu$ and fixed variance-covariance matrix $\Sigma$, the coefficient (5.2) can easily be computed to be proportional to $\sigma^{ij}$, the $(i,j)$-th element of $\Sigma^{-1}$, which is indeed the $(i,j)$-th element of the Fisher information matrix. The same result holds for any elliptical family, as then $D_0(\theta,\phi)$ is a function of the Mahalanobis distance between $\theta$ and $\phi$ (see Mitchell and Krzanowski, 1985).
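This can be checked numerically; the sketch below compares a Monte Carlo estimate of (5.1) for two bivariate normals with common Σ against the standard closed form 2[2Φ(Δ/2) - 1], Δ being the Mahalanobis distance between the means (the closed form is a known fact about normal densities, and the particular Σ and means are illustrative):

```python
# Sketch: the overlap distance (5.1) between normals with common Sigma
# depends on the means only through the Mahalanobis distance Delta;
# compare Monte Carlo with the known closed form 2(2 Phi(Delta/2) - 1).
import numpy as np
from scipy.stats import multivariate_normal, norm

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # illustrative covariance
mu1 = np.array([0.0, 0.0])
mu2 = np.array([1.0, -0.5])

d = mu1 - mu2
Delta = np.sqrt(d @ np.linalg.solve(Sigma, d))   # Mahalanobis distance

p1 = multivariate_normal(mu1, Sigma)
p2 = multivariate_normal(mu2, Sigma)
x = p1.rvs(size=200_000, random_state=0)
L1_mc = np.mean(np.abs(1.0 - p2.pdf(x) / p1.pdf(x)))   # E_p1 |1 - p2/p1|

L1_exact = 2.0 * (2.0 * norm.cdf(Delta / 2.0) - 1.0)
print(L1_mc, L1_exact)                  # agree to Monte Carlo accuracy
```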
Let $p(x,\theta)$ be the density of a uniform distribution on the interval $[0,\theta]$. Then it is seen that
$$D_0(\theta,\phi) = 2\Big(1 - \frac{\theta}{\phi}\Big) \ \text{ if } \theta \le \phi, \qquad D_0(\theta,\phi) = 2\Big(1 - \frac{\phi}{\theta}\Big) \ \text{ if } \phi \le \theta.$$
Although this is not a differentiable function, it is seen that
$$ds^2 = \frac{4\,d\theta^2}{\theta^2} \tag{5.3}$$
is the associated metric.
Another general divergence measure which has some practical
applications is
$$D_\lambda(p_\theta, p_\phi) = \int \big[\lambda(p_\theta) - \lambda(p_\phi)\big]^2\,dv(x),$$
which is indeed a smooth function if $\lambda$ is so. In this case
$$g^\lambda_{ij}(\theta) = 2\int \big[\lambda'(p_\theta)\big]^2\,\frac{\partial p_\theta}{\partial\theta_i}\,\frac{\partial p_\theta}{\partial\theta_j}\,dv(x),$$
$$c^\lambda_{ijk}(\theta) = 6\int \lambda'(p_\theta)\,\lambda''(p_\theta)\,\frac{\partial p_\theta}{\partial\theta_i}\,\frac{\partial p_\theta}{\partial\theta_j}\,\frac{\partial p_\theta}{\partial\theta_k}\,dv(x) + 2\int \big[\lambda'(p_\theta)\big]^2\left(\frac{\partial^2 p_\theta}{\partial\theta_i\,\partial\theta_j}\,\frac{\partial p_\theta}{\partial\theta_k} + \frac{\partial^2 p_\theta}{\partial\theta_i\,\partial\theta_k}\,\frac{\partial p_\theta}{\partial\theta_j} + \frac{\partial^2 p_\theta}{\partial\theta_j\,\partial\theta_k}\,\frac{\partial p_\theta}{\partial\theta_i}\right) dv(x).$$
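As a sketch, take λ(p) = √p on the Bernoulli model (an illustrative choice, for which D_λ is the squared Hellinger distance); then g^λ should be half the Fisher information:

```python
# Sketch of g^lambda with lambda(p) = sqrt(p) (illustrative choice): D_lambda
# is the squared Hellinger distance and g^lambda = Fisher information / 2.
import numpy as np

def D_lam(theta, phi):
    p = np.array([1.0 - theta, theta])   # Bernoulli(theta)
    q = np.array([1.0 - phi, phi])
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

theta, h = 0.25, 1e-4
g_numeric = (D_lam(theta, theta + h) + D_lam(theta, theta - h)) / h**2
p = np.array([1.0 - theta, theta])
dp = np.array([-1.0, 1.0])
g_formula = 2.0 * np.sum((0.5 / np.sqrt(p)) ** 2 * dp * dp)
print(g_numeric, g_formula, 0.5 / (theta * (1.0 - theta)))  # all agree
```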
Another measure of interest is the cross entropy introduced in Rao and Nayak (1985). If $H$ is any entropy function, then the cross entropy of $p_\phi$ with respect to $p_\theta$ was defined as
$$D(p_\theta, p_\phi) = H(p_\phi) - H(p_\theta) - \lim_{\lambda\to 0}\frac{H[p_\theta + \lambda(p_\phi - p_\theta)] - H(p_\theta)}{\lambda}\,. \tag{5.4}$$
Let
$$H(p) = \int h(p)\,dv(x)$$
as chosen in (2.10). Then (5.4) reduces to
$$D(p_\theta, p_\phi) = \int h(p_\phi)\,dv(x) - \int h(p_\theta)\,dv(x) - \int h'(p_\theta)\,(p_\phi - p_\theta)\,dv(x),$$
and then
$$g_{ij}(\theta) = -\int h''(p_\theta)\,\frac{\partial p_\theta}{\partial\theta_i}\,\frac{\partial p_\theta}{\partial\theta_j}\,dv(x),$$
which is the same as the h-entropy information matrix derived in (2.10), apart from a constant. Similarly,
$$c^h_{ijk} = \Gamma^{(1)}_{ijk} + \Gamma^{(1)}_{ikj} + \Gamma^{(1)}_{jki} + T_{ijk}\,,$$
where
$$\Gamma^{(1)}_{ijk} = -E\left[p_\theta\,h''(p_\theta)\,\frac{\partial^2 \log p_\theta}{\partial\theta_i\,\partial\theta_j}\,\frac{\partial \log p_\theta}{\partial\theta_k}\right],$$
$$T_{ijk} = -E\left[\big(3\,p_\theta\,h''(p_\theta) + p_\theta^2\,h'''(p_\theta)\big)\,\frac{\partial \log p_\theta}{\partial\theta_i}\,\frac{\partial \log p_\theta}{\partial\theta_j}\,\frac{\partial \log p_\theta}{\partial\theta_k}\right].$$
6. GEODESIC DISTANCES
In Rao (1945) it was suggested that the information metric could be
used to obtain the geodesic distances between probability distributions. Given
any quadratic differential metric
$$ds^2 = \sum_{i,j} g_{ij}(\theta)\,d\theta_i\,d\theta_j \tag{6.1}$$
where the matrix $(g_{ij})$ is positive definite, the geodesic curve $\theta = \theta(t)$ can in principle be determined from the Euler-Lagrange equations
$$\sum_i g_{ik}\,\ddot\theta_i + \sum_{i,j} \Gamma_{ijk}\,\dot\theta_i\,\dot\theta_j = 0, \qquad k = 1,\dots,n, \tag{6.2}$$
and from the boundary conditions
$$\theta(t_1) = \theta, \qquad \theta(t_2) = \phi.$$
In (6.2), the quantity
$$\Gamma_{ijk} = \frac{1}{2}\left[\frac{\partial}{\partial\theta_i}\,g_{jk} + \frac{\partial}{\partial\theta_j}\,g_{ki} - \frac{\partial}{\partial\theta_k}\,g_{ij}\right] \tag{6.3}$$
is known as the "Christoffel symbol of the first kind."
By definition of the geodesic curve $\theta = \theta(t)$, its tangent vector $\dot\theta = \dot\theta(t)$ is of constant length with respect to the metric $ds^2$. Thus
$$\sum_{i,j} g_{ij}\,\dot\theta_i\,\dot\theta_j = \text{constant}. \tag{6.4}$$
The constant may be chosen to be of value 1 when the curve parameter $t$ is the arc length parameter $s$, $0 \le s \le s_0$, with $\theta(0) = \theta$, $\theta(s_0) = \phi$; then $s_0 = g(\theta,\phi)$ is the geodesic distance between $\theta$ and $\phi$.
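The Euler-Lagrange route can be carried out numerically. The sketch below integrates (6.2) for the Poisson model, where g(θ) = 1/θ and (6.3) gives Γ₁₁₁ = -1/(2θ²), and recovers as arc length the closed-form distance of result (1) below (the model and endpoints are illustrative):

```python
# Sketch: integrate the geodesic equation (6.2)-(6.3) for the Poisson model
# (g(theta) = 1/theta) and check the arc length against the closed form
# 2|sqrt(theta) - sqrt(phi)| of result (1).
import numpy as np
from scipy.integrate import solve_ivp, trapezoid

def geodesic(t, y):                      # (6.2): theta'' = theta'^2 / (2 theta)
    th, v = y
    return [v, v * v / (2.0 * th)]

theta0, phi = 1.0, 4.0
# sqrt(theta(t)) is linear in t along a geodesic; this initial velocity
# makes theta(1) = phi
v0 = 2.0 * np.sqrt(theta0) * (np.sqrt(phi) - np.sqrt(theta0))
sol = solve_ivp(geodesic, (0.0, 1.0), [theta0, v0],
                dense_output=True, rtol=1e-10, atol=1e-12)

ts = np.linspace(0.0, 1.0, 2001)
th, v = sol.sol(ts)
length = trapezoid(np.abs(v) / np.sqrt(th), ts)  # integral sqrt(g) |theta'| dt
print(sol.sol(1.0)[0], phi)                      # endpoint is reached
print(length, 2.0 * abs(np.sqrt(phi) - np.sqrt(theta0)))  # equals Rao distance
```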
Atkinson and Mitchell (1981) describe two other methods of deriving
geodesic distances starting from a given differential metric. The distances
obtained by these authors in various cases are given below. In each case we give the probability function $p(x,\theta)$ and the associated geodesic distance $g(\theta,\phi)$ based on the Fisher information metric. (A short numerical sketch of several of these closed forms follows the list.)
(1) Poisson distribution
?(?,?) = e"e ??/?!, ? = 0,1,..., ?>0
g(e^) = 21?/?" - /f I
(2) Binomial distribution (n fixed)
$$p(x,\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \qquad x = 0,1,\dots,n, \quad 0 < \theta < 1$$
$$g(\theta,\phi) = 2\sqrt{n}\,\big|\sin^{-1}\sqrt{\theta} - \sin^{-1}\sqrt{\phi}\big| = 2\sqrt{n}\,\cos^{-1}\big[\sqrt{\theta\phi} + \sqrt{(1-\theta)(1-\phi)}\big].$$
(3) Exponential distribution
$$p(x,\theta) = \theta e^{-x\theta}, \qquad x > 0$$
$$g(\theta,\phi) = \big|\log\theta - \log\phi\big|\,.$$
(4) Gamma distribution (n fixed)
$$p(x,\theta) = e^{-\theta x}\,[\Gamma(n)]^{-1}\,\theta^n x^{n-1}, \qquad x > 0$$
$$g(\theta,\phi) = \sqrt{n}\,\big|\log\theta - \log\phi\big|$$
(5) Normal distribution (fixed variance)
$$p(x;\mu) = N(\mu,\sigma_0^2;x), \qquad \sigma_0^2 \text{ fixed}$$
$$g(\mu_1,\mu_2) = |\mu_1 - \mu_2|/\sigma_0$$
(6) Normal distribution (fixed mean)
$$p(x;\sigma^2) = N(\mu_0,\sigma^2;x), \qquad \mu_0 \text{ fixed}$$
$$g(\sigma_1,\sigma_2) = \sqrt{2}\,\big|\log\sigma_1 - \log\sigma_2\big|$$
(7) Normal distribution
$$p(x;\mu,\sigma^2) = N(\mu,\sigma^2;x), \qquad \mu \text{ and } \sigma^2 \text{ both variable.}$$
The information metric in this case is
$$ds^2 = \frac{d\mu^2 + 2\,d\sigma^2}{\sigma^2} \tag{6.5}$$
and the geodesic distance is
$$g = 2\sqrt{2}\,\tanh^{-1}\delta(1,2) \tag{6.6}$$
where $\delta(1,2)$ is the positive square root of
$$\frac{(\mu_1-\mu_2)^2 + 2(\sigma_1-\sigma_2)^2}{(\mu_1-\mu_2)^2 + 2(\sigma_1+\sigma_2)^2}\,.$$
The explicit form (6.6) is given in Burbea and Rao (1982a). From (6.6),
$$g(\mu,\sigma_1;\,\mu,\sigma_2) = \sqrt{2}\,\big|\log\sigma_1 - \log\sigma_2\big|,$$
which agrees with result (6). However, $g(\mu_1,\sigma;\,\mu_2,\sigma)$ does not reduce to result (5), since $\sigma = \text{constant}$ is not a geodesic curve with respect to the metric (6.5).
(8) Multivariate normal distribution
$$p(x) = N_p(\mu,\Sigma;x), \qquad \Sigma \text{ fixed}$$
$$g(\mu_1,\mu_2) = \big[(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)\big]^{1/2},$$
which is the Mahalanobis distance.
(9) Multivariate normal distribution
$$p(x) = N_p(\mu,\Sigma;x), \qquad \mu \text{ fixed}$$
$$g(\Sigma_1,\Sigma_2) = \Big[\tfrac{1}{2}\sum_{i=1}^p (\log\lambda_i)^2\Big]^{1/2}$$
where $0 < \lambda_1 \le \cdots \le \lambda_p$ are the roots of the determinantal equation $|\Sigma_2 - \lambda\Sigma_1| = 0$. The above explicit form is due to S. T. Jensen, as mentioned in Atkinson and Mitchell (1981).
(10) Negative binomial distribution
$$p(x,\theta) = [x!\,\Gamma(r)]^{-1}\,\Gamma(x+r)\,\theta^x(1-\theta)^r, \qquad r \text{ fixed}$$
$$g(\theta,\phi) = 2\sqrt{r}\,\cosh^{-1}\frac{1 - \sqrt{\theta\phi}}{\sqrt{(1-\theta)(1-\phi)}} = 2\sqrt{r}\,\log\frac{1 - \sqrt{\theta\phi} + \big|\sqrt{\theta} - \sqrt{\phi}\big|}{\sqrt{(1-\theta)(1-\phi)}}\,.$$
This computation is due to Oller and Cuadras (1985).
(11) Multinomial distribution
$$p(n_1,\dots,n_k;\,p_1,\dots,p_k) = \frac{n!}{n_1!\cdots n_k!}\,p_1^{n_1}\cdots p_k^{n_k}, \qquad n \text{ fixed.}$$
Let $p_1 = (p_{11},\dots,p_{k1})$ and $p_2 = (p_{12},\dots,p_{k2})$. Then
$$g(p_1,p_2) = 2\sqrt{n}\,\cos^{-1}\Big(\sum_{i=1}^k \sqrt{p_{i1}\,p_{i2}}\Big).$$
The above computation was originally done by Rao (1945), but an easier method
of derivation is given by Atkinson and Mitchell (1981).
Recently Burbea (1984) obtained geodesic distances in the case of
independent Poisson and Normal distributions which are given below. These
results (12) and (13) follow directly from (1) and (7) respectively as the
squared geodesic distances behave additively under combination of independent
distributions.
(12) Independent Poisson distributions
$$p(x_1,\dots,x_n;\,\theta_1,\dots,\theta_n) = \prod_{i=1}^n \frac{e^{-\theta_i}\theta_i^{x_i}}{x_i!}$$
$$g(\theta_1,\dots,\theta_n;\,\phi_1,\dots,\phi_n) = 2\Big[\sum_{i=1}^n \big(\sqrt{\theta_i} - \sqrt{\phi_i}\big)^2\Big]^{1/2}$$
(13) Independent Normal distributions
$$p = N(x_1;\mu_1,\sigma_1)\cdots N(x_n;\mu_n,\sigma_n)$$
$$g\big[(\mu_{11},\sigma_{11}),\dots,(\mu_{n1},\sigma_{n1});\,(\mu_{12},\sigma_{12}),\dots,(\mu_{n2},\sigma_{n2})\big] = \Big[\,2\sum_{k=1}^n \log^2\frac{1 + \delta_k(1,2)}{1 - \delta_k(1,2)}\Big]^{1/2}$$
where $\delta_k(1,2)$ is the positive square root of
$$\frac{(\mu_{k1}-\mu_{k2})^2 + 2(\sigma_{k1}-\sigma_{k2})^2}{(\mu_{k1}-\mu_{k2})^2 + 2(\sigma_{k1}+\sigma_{k2})^2}\,.$$
(14) Multivariate elliptic distributions
$$p(x\,|\,\mu,\Sigma) = |\Sigma|^{-1/2}\,h\big[(x-\mu)'\Sigma^{-1}(x-\mu)\big]$$
for some function $h$, with $\Sigma$ fixed;
$$g(\mu_1,\mu_2) = c_h\,\big[(\mu_1-\mu_2)'\Sigma^{-1}(\mu_1-\mu_2)\big]^{1/2}$$
where $c_h$ is a constant, so the distance is essentially the Mahalanobis distance. This result is due to Mitchell and Krzanowski (1985).
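Several of the closed forms above are straightforward to implement. The sketch below codes results (1), (7), (11) and (12) and checks, as noted in the text, that (6.6) reduces to result (6) when the means agree and that the squared distances of result (12) add across independent components:

```python
# Sketch implementing a few of the closed-form geodesic distances above.
import numpy as np

def rao_poisson(theta, phi):                     # result (1)
    return 2.0 * abs(np.sqrt(theta) - np.sqrt(phi))

def rao_normal(mu1, s1, mu2, s2):                # result (7), eq. (6.6)
    num = (mu1 - mu2) ** 2 + 2.0 * (s1 - s2) ** 2
    den = (mu1 - mu2) ** 2 + 2.0 * (s1 + s2) ** 2
    return 2.0 * np.sqrt(2.0) * np.arctanh(np.sqrt(num / den))

def rao_multinomial(p1, p2, n=1):                # result (11)
    s = np.sum(np.sqrt(np.asarray(p1) * np.asarray(p2)))
    return 2.0 * np.sqrt(n) * np.arccos(np.clip(s, -1.0, 1.0))

def rao_indep_poisson(thetas, phis):             # result (12)
    return 2.0 * np.sqrt(np.sum((np.sqrt(thetas) - np.sqrt(phis)) ** 2))

# (6.6) with equal means agrees with result (6): sqrt(2)|log s1 - log s2|
print(rao_normal(0.0, 1.0, 0.0, 3.0), np.sqrt(2.0) * np.log(3.0))
# squared distances add across independent Poisson components
th = np.array([1.0, 4.0]); ph = np.array([2.25, 9.0])
print(rao_indep_poisson(th, ph) ** 2,
      rao_poisson(th[0], ph[0]) ** 2 + rao_poisson(th[1], ph[1]) ** 2)
print(rao_multinomial([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))  # zero for equal p
```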
The use of the $c_{ijk}$ coefficients defined in (2.4) and (4.2) in the discussion of statistical problems will be considered in a future communication.
REFERENCES
Amari, S. I. (1982). Differential geometry of curved exponential families - curvature and information loss. Ann. Statist. 10, 357-385.
Amari, S. I. (1983). A foundation of information geometry. Electronics and
Communications in Japan 66-A, 1-10.
Atkinson, C. and Mitchell, A. F. S. (1981). Rao's distance measure. Sankhya A 43, 345-365.
Burbea, J. (1986). Informative geometry in probability spaces. Expo. Math.
4, 347-378.
Burbea, J. and Rao, C. Radhakrishna (1982a). Entropy differential metric,
distance and divergence measures in probability spaces: a unified
approach. J. Multivariate Anal. 12, 575-596.
Burbea, J. and Rao, C. Radhakrishna (1982b). Differential metrics in probabil-
ity spaces. Probability Math. Statist. 3, 115-132.
Cencov, N. N. (1982). Statistical decision rules and optimal inference. Translations of Mathematical Monographs 53, Amer. Math. Soc., Providence.
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica 2, 299-318.
Efron, B. (1975). Defining the curvature of a statistical problem (with
applications to second order efficiency, with discussion). Ann.
Statist. 3, 1189-1217.
Efron, B. (1982). Maximum likelihood decision theory. Ann. Statist. 10, 340-356.
Eguchi, S. (1983). Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Statist. 11, 793-803.
Eguchi, S. (1984). A differential geometric approach to statistical inference
on the basis of contrast functionals. Tech. Report No. 136,
Hiroshima University, Hiroshima, Japan.
Havrda, J. and Charvát, F. (1967). Quantification method of classification processes: Concept of structural α-entropy. Kybernetika 3, 30-35.
Kass, R. E. (1980). The Riemannian structure of model spaces: a geometrical
approach to inference. Ph.D. thesis, University of Chicago.
Kass, R. E. (1981). The geometry of asymptotic inference. Tech. Rept. 215.
Dept. of Statistics, Carnegie-Mellon University.
Lau, Ka-Sing (1985). Characterization of Rao's quadratic entropy. Sankhya A
47, 295-309.
Mitchell, A. F. S. and Krzanowski, W. J. (1985). The Mahalanobis distance and
elliptic distributions. (To appear in Biometrika).
Nei, M. (1978). The theory of genetic distance and evolution of human races.
Japan J. Human Genet. 23, 341-369.
Oller, J. M. and Cuadras, C. M. (1985). Rao's distance for negative multinomial distributions. Sankhya A 47, 75-83.
Rao, C. Radhakrishna (1945). Information and accuracy attainable in the estima-
tion of statistical parameters. Bull. Calcutta Math. Soc. 37,
81-91.
Rao, C. Radhakrishna (1948). The utilization of multiple measurements in problems of biological classification (with discussion). J. Roy. Statist. Soc. B 10, 159-203.
Rao, C. Radhakrishna (1949). On the distance between two populations. Sankhya
9, 246-248.
Rao, C. Radhakrishna (1954). On the use and interpretation of distance
functions in statistics. Bull. Inst. Inter. Statist. 34, 90-100.
Rao, C. Radhakrishna (1962). Efficient estimates and optimum inference procedures in large samples (with discussion). J. Roy. Statist. Soc. B 24, 46-72.
Rao, C. Radhakrishna (1973). Linear Statistical Inference and its Applications.
(Second edition) Wiley, New York.
Rao, C. Radhakrishna (1982a). Diversity and dissimilarity coefficients: a
unified approach. J. Theoret. Pop. Biology 21, 24-43.
Rao, C. Radhakrishna (1982b). Diversity: its measurement, decomposition,
apportionment and analysis. Sankhya A 44, 1-22.
Rao, C. Radhakrishna (1984). Convexity properties of entropy functions and analysis of diversity. In Inequalities in Statistics and Probability, IMS Lecture Notes-Monograph Series, Vol. 5, 68-77.
Rao, C. Radhakrishna and Nayak, T. K. (1985). Cross entropy, dissimilarity measures and characterizations of quadratic entropy. IEEE Trans. Information Theory IT-31, 589-593.
Shahshahani, S. (1979). A new mathematical framework for the study of linkage
and selection. Memoirs of the American Mathematical Society,
No. 211.