
Mapping of Probabilities

Theory for the Interpretation of Uncertain Physical Measurements

Albert Tarantola

Université de Paris, Institut de Physique du Globe
4, place Jussieu; 75005 Paris; France
E-mail: [email protected]

September 17, 2006

© A. Tarantola, 2006.


To Kike & Vittorio.


Preface

Note: this is an old preface that must be replaced.

In this book, I attempt to reach two goals. The first is purely mathematical: to clarify some of the basic concepts of probability theory. The second goal is physical: to clarify the methods to be used when handling the information brought by measurements, in order to understand how accurate the predictions we may wish to make can be.

Probability theory is solidly based on the Kolmogorov axioms, and there is no problem when treating discrete probabilities. But I am very unhappy with the usual way of extending the theory to continuous probability distributions. In this text, I introduce the notion of 'volumetric probability', different from the more usual notion of 'probability density'. I claim that some of the most basic problems of the theory of continuous probability distributions can only be solved within this framework, and that many of the well-known 'paradoxes' of the theory are fundamental misunderstandings, which I try to clarify.

I start the book with an introduction to tensor calculus, because I choose to develop the probability theory considering metric manifolds.

The second chapter deals with probability theory per se. I try to use intrinsic notions everywhere, i.e., I only introduce definitions that make sense irrespective of the particular coordinates being used in the manifold under investigation. The reader shall see that this leads to many developments that are at odds with those found in usual texts.

In physical applications one not only needs to define probability distributions over (typically) high-dimensional manifolds. One also needs to make use of them, and this is achieved by sampling the probability distributions using the 'Monte Carlo' methods described in chapter 3. There is no major discovery exposed in this chapter, but I make the effort to set up Monte Carlo methods using the intrinsic point of view mentioned above.

The metric foundation used here allows one to introduce the important notion of 'homogeneous' probability distributions. Contrary to the 'noninformative' probability distributions common in the Bayesian literature, the homogeneity notion is not controversial (provided one has agreed on a given metric over the space of interest).

After a brief chapter that explains what an ideal measuring instrument should be, the book enters the four chapters developing what I see as the four most basic inference problems in physics: (i) problems that are solved using the notion of 'sum of probabilities' (just an elaborate way of 'making histograms'), (ii) problems that are solved using the 'product of probabilities' (an approach that seems to be original), (iii) problems that are solved using 'conditional probabilities' (these including the so-called 'inverse problems'), and (iv) problems that are solved using the 'transport of probabilities' (like the typical [indirect] measurement problem, but solved here by transporting probability distributions, rather than just transporting 'uncertainties').

I am very indebted to my colleagues (Bartolome Coll, Georges Jobert, Klaus Mosegaard, Miguel Bosch, Guillaume Evrard, John Scales, Christophe Barnes, Frederic Parrenin and Bernard Valette) for illuminating discussions. I am also grateful to my collaborators at what was the Tomography Group at the Institut de Physique du Globe de Paris.

Paris, September 17, 2006
Albert Tarantola


Contents

1 Sets
  1.1 Sets
    1.1.1 Relations
    1.1.2 Sets
    1.1.3 Basic Properties
  1.2 Sigma-Fields
    1.2.1 Cardinality of a Set
    1.2.2 Field
    1.2.3 Sigma-Field
  1.3 Mappings
    1.3.1 Some Properties
    1.3.2 Indicator of the Image of a Set
  1.4 Assimilation of Data ("Inverse Problems")
    1.4.1 Exercise

2 Manifolds
  2.1 Manifolds and Coordinates
    2.1.1 Linear Spaces
    2.1.2 Manifolds
    2.1.3 Changing Coordinates
    2.1.4 Tensors, Capacities, and Densities
    2.1.5 Kronecker Tensors (I)
    2.1.6 Orientation of a Coordinate System
    2.1.7 Totally Antisymmetric Tensors
    2.1.8 Levi-Civita Capacity and Density
    2.1.9 Determinants
    2.1.10 Dual Tensors and Exterior Product of Vectors
    2.1.11 Capacity Element
    2.1.12 Integral
    2.1.13 Capacity Element and Change of Coordinates
  2.2 Volume
    2.2.1 Metric
    2.2.2 Bijection Between Forms and Vectors
    2.2.3 Kronecker Tensor (II)
    2.2.4 Fundamental Density
    2.2.5 Bijection Between Capacities, Tensors, and Densities
    2.2.6 Levi-Civita Tensor
    2.2.7 Volume Element
    2.2.8 Volume Element and Change of Variables
    2.2.9 Volume of a Domain
    2.2.10 Example: Mass Density and Volumetric Mass
  2.3 Mappings
    2.3.1 Image of the Volume Element
    2.3.2 Reciprocal Image of the Volume Element

3 Probabilities
  3.1 Introduction
    3.1.1 Preliminary Remarks
    3.1.2 What we Shall Do in this Chapter
    3.1.3 A Collection of Formulas; I: Discrete Probabilities
    3.1.4 A Collection of Formulas; II: Probabilities over Manifolds
  3.2 Probabilities
    3.2.1 Basic Definitions
    3.2.2 Image of a Probability
    3.2.3 Intersection of Probabilities
    3.2.4 Reciprocal Image of a Probability
    3.2.5 Compatibility Property
    3.2.6 Other Properties
    3.2.7 Marginal Probability
  3.3 Probabilities over Manifolds
    3.3.1 Probability Density
    3.3.2 Image of a Probability Density
    3.3.3 Marginal Probability Density
  3.4 Probabilities over Volume Manifolds
    3.4.1 Volumetric Probability
    3.4.2 Volumetric Histograms and Density Histograms
    3.4.3 Homogeneous Probability Function
    3.4.4 Image of a Volumetric Probability
    3.4.5 Intersection of Volumetric Probabilities
    3.4.6 Reciprocal Image of a Volumetric Probability
    3.4.7 Compatibility Property
    3.4.8 Assimilation of Uncertain Observations (the Bayes-Popper Approach)
    3.4.9 Popper-Bayes Algorithm
    3.4.10 Exercise
  3.5 Probabilities over Metric Manifolds
    3.5.1 Conditional Probability
    3.5.2 Marginal of a Conditional Probability
    3.5.3 Demonstration: marginals of the conditional
    3.5.4 Bayes Theorem

4 Examples
  4.1 Homogeneous Probability for Elastic Parameters
  4.2 Measuring a One-Dimensional Strain (I)
    4.2.1 Measuring a One-Dimensional Strain (II)
  4.3 Free-Fall of an Object
  4.4 Measure of Poisson's Ratio
  4.5 Mass Calibration
  4.6 Probabilistic Estimation of Earthquake Locations

5 Appendices
  5.1 APPENDICES FOR SET THEORY
    5.1.1 Proof of the Set Property ϕ[ A ∩ ϕ-1[ B ] ] = ϕ[A] ∩ B
  5.2 APPENDICES FOR MANIFOLDS
    5.2.1 Capacity Element and Change of Coordinates
    5.2.2 Conditional Volume
  5.3 APPENDICES FOR PROBABILITY THEORY
    5.3.1 Image of a Probability Density
    5.3.2 Proof of the Compatibility Property (Discrete Sets)
    5.3.3 Proof of the Compatibility Property (Manifolds)
    5.3.4 Axioms for the Union and the Intersection
    5.3.5 Union of Probabilities
    5.3.6 Conditional Volumetric Probability (I)
    5.3.7 Conditional Volumetric Probability (II)
    5.3.8 Marginal Probability Density
    5.3.9 Replacement Gymnastics
    5.3.10 The Borel 'Paradox'
    5.3.11 Sampling a Probability
    5.3.12 Random Points on the Surface of the Sphere
    5.3.13 Basic Probability Distributions
    5.3.14 Fisher from Gaussian (Demonstration)
    5.3.15 Probability Distributions for Tensors
    5.3.16 Homogeneous Distribution of Second Rank Tensors
    5.3.17 Center of a Probability Distribution
    5.3.18 Dispersion of a Probability Distribution
    5.3.19 Monte Carlo (Sampling) Methods
  5.4 APPENDICES FOR INVERSE PROBLEMS
    5.4.1 Inverse Problems
    5.4.2 Solution in the Observable Parameter Space
    5.4.3 Implementation of Inverse Problems
  5.5 OTHER APPENDICES
    5.5.1 Determinant of a Partitioned Matrix
    5.5.2 Operational Definitions can not be Infinitely Accurate
    5.5.3 The Ideal Output of a Measuring Instrument
    5.5.4 Measurements
    5.5.5 The 'Shipwrecked Person' Problem
    5.5.6 Problems Solved Using Conditional Probabilities
    5.5.7 Parameters

Bibliography

Index


Chapter 1

Sets

This is the introduction to the chapter. This is the introduction to the chapter. This is the introduction to the chapter. This is the introduction to the chapter. This is the introduction to the chapter. This is the introduction to the chapter. This is the introduction to the chapter. This is the introduction to the chapter. This is the introduction to the chapter. This is the introduction to the chapter.


1.1 Sets

1.1.1 Relations

The elements of a mathematical theory may have some properties, and may have some mutual relations. The elements are denoted by symbols (like x or 2 ), and the relations are denoted by inserting other symbols between the elements (like = or ∈ ). The element denoted by a symbol may be variable or may be determined.

Given an initial set of (non contradictory) relations, other relations may be demonstrated to be true or false. A relation (or property) containing variable elements is an identity if it becomes true for any determined value given to the variables. If R and S are two relations containing variable elements, one says that R implies S , and one writes R ⇒ S , if S is true every time that R is true. If R ⇒ S and S ⇒ R , then one writes R ⇔ S , and one says that R and S are equivalent (or that R is true if, and only if, S is true).

The relation ¬R , the negation of R , is true if R is false. Therefore, one has

¬(¬(R) ) ⇔ R . (1.1)

From R⇒ S it follows ¬S ⇒ ¬R :

( R ⇒ S ) ⇒ ( (¬S) ⇒ (¬R) ) . (1.2)

(but it does not follow (¬R) ⇒ (¬S) ).

If R and S are two relations, then R OR S is also a relation, that is true if at least one of the two relations R , S is true. Similarly, the relation R AND S is true only if the two relations R , S are both true. Therefore, for any two relations R , S , the relation R OR S is false only if both R and S are false:

¬( R OR S ) ⇔ (¬R) AND (¬S) , (1.3)

and R AND S is false if either of R , S is false:

¬( R AND S ) ⇔ (¬R) OR (¬S) . (1.4)

In theories where relations like a = b and a ∈ A make sense, the relation ¬(a = b) is written a ≠ b , while the relation ¬(a ∈ A) is written a ∉ A .


1.1.2 Sets

A set is a "well-defined collection" of (abstract) elements. An element belongs to a set, or is a member of a set. If an element a is a member of a set A one writes a ∈ A (or a ∉ A if a is not a member of A ). If a and b are elements of a given set, they may be different elements ( a ≠ b ), or they may, in fact, be the same element ( a = b ).

Two sets A and B are equal if they have the same elements, and one writes A = B (or A ≠ B if they are different). If a set A consists of the elements a, b, . . . , one writes A = {a, b, . . .} .

The empty set, the set without any element, is denoted ∅ . The set of all the subsets of a set A is called the power set of A , and is denoted ℘[A] . The Cartesian product of two sets A and B , denoted A × B , is the set whose elements are all the ordered pairs (a, b) , with a ∈ A and b ∈ B .

Given two sets A and B one says that A is a subset of B if every member of A is also a member of B . One then writes A ⊆ B , and one also says that A is contained in B . In particular, any set A is a subset of itself, A ⊆ A . If A ⊆ B but A ≠ B one says that A is a proper subset of B , and one writes A ⊂ B .

The union of two sets A1 and A2 , denoted A1 ∪ A2 , is the set consisting of all elements that belong to A1 , or to A2 , or to both:

a ∈ A1 ∪A2 ⇔ a ∈ A1 OR a ∈ A2 . (1.5)

The intersection of two sets A1 and A2 , denoted A1 ∩ A2 , is the set consisting of all elements that belong to both A1 and A2 :

a ∈ A1 ∩A2 ⇔ a ∈ A1 AND a ∈ A2 . (1.6)

If A1 ∩ A2 = ∅ , one says that A1 and A2 are disjoint.

Given some reference set A0 , to any subset A of A0 one associates its complement, that is, the set of all the elements of A0 that are not members of A . The complement of a set A (with respect to some reference set) is denoted Ac .

Given some reference set A0 , the indicator function¹ of a set A ⊆ A0 is the function that to every element a ∈ A0 associates the number one, if a ∈ A , or the number zero, if a ∉ A (see figure 1.1). This function may be denoted by a symbol like χA or ξA . For instance, using the former,

χA(a) = 1 if a ∈ A ,   χA(a) = 0 if a ∉ A .        (1.7)

The union and intersection of sets can be expressed in terms of indicator functions as

χA1∪A2 = max{ χA1 , χA2 } = χA1 + χA2 − χA1 χA2
χA1∩A2 = min{ χA1 , χA2 } = χA1 χA2 .        (1.8)

¹The indicator function is sometimes called characteristic function, but there is another sense for that name in probability theory.


Figure 1.1: The indicator of a subset A (of a given set) is the function that takes the value one for every element of the subset, and the value zero for the elements outside the subset.

As two subsets are equal if their indicators are equal, the properties of the two operations ∪ and ∩ (like those in equations 1.10–1.12) can be demonstrated using the numerical relations 1.8.
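To make this algebra of indicators concrete, here is a minimal numerical sketch (Python; the small reference set and the subsets are arbitrary, invented for the illustration) checking that the max/min rules 1.8 reproduce the union and the intersection, and that they imply the De Morgan laws 1.9:

```python
# Minimal numerical check of equations 1.8-1.12 using indicator functions.
# Sets are represented as Python sets of integers; all names are illustrative.

A0 = set(range(10))          # reference set
A1 = {0, 1, 2, 3, 4}
A2 = {3, 4, 5, 6}

def chi(A):
    """Indicator of A: returns a function a -> 1 if a in A else 0."""
    return lambda a: 1 if a in A else 0

chi1, chi2 = chi(A1), chi(A2)

for a in A0:
    # union: max(chi1, chi2) = chi1 + chi2 - chi1*chi2   (equation 1.8)
    assert max(chi1(a), chi2(a)) == chi1(a) + chi2(a) - chi1(a) * chi2(a)
    assert max(chi1(a), chi2(a)) == chi(A1 | A2)(a)
    # intersection: min(chi1, chi2) = chi1*chi2
    assert min(chi1(a), chi2(a)) == chi(A1 & A2)(a)
    # De Morgan (equation 1.9), complements taken with respect to A0
    assert chi(A0 - (A1 & A2))(a) == chi((A0 - A1) | (A0 - A2))(a)

print("indicator identities verified on", A0)
```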

A partition of a set A is a set P of subsets of A such that the union of all the subsets equals A (the elements of P "cover" A ) and such that the intersection of any two subsets is empty (the elements are "pairwise disjoint"). The elements of P are called the blocks of the partition.


1.1.3 Basic Properties

Let A1 , A2 , and A3 be arbitrary subsets of some reference set A0 . From A1 ⊆ A2 and A2 ⊆ A3 it follows that A1 ⊆ A3 , while from A1 ⊆ A2 and A2 ⊆ A1 it follows that A1 = A2 . Among the many other valid properties, let us note the De Morgan laws

(A∩ B)c = Ac ∪ Bc ; (A∪ B)c = Ac ∩ Bc , (1.9)

the commutativity relations

A1 ∪A2 = A2 ∪A1 ; A1 ∩A2 = A2 ∩A1 , (1.10)

the associativity relations

A1 ∪ ( A2 ∪ A3 ) = ( A1 ∪ A2 ) ∪ A3
A1 ∩ ( A2 ∩ A3 ) = ( A1 ∩ A2 ) ∩ A3 ,        (1.11)

and the distributivity relations

A1 ∪ ( A2 ∩ A3 ) = ( A1 ∪ A2 ) ∩ ( A1 ∪ A3 )
A1 ∩ ( A2 ∪ A3 ) = ( A1 ∩ A2 ) ∪ ( A1 ∩ A3 ) .        (1.12)


1.2 Sigma-Fields

1.2.1 Cardinality of a Set

When a set A has a finite number of elements, its cardinality, denoted |A| , is the number of elements in the set. Two sets with an infinite number of elements have the same cardinality if the elements of the two sets can be put in a one-to-one correspondence (through a bijection). The sets that can be put in correspondence with the set ℕ of natural numbers are called countable (or enumerable). The (infinite) cardinality of ℕ is denoted |ℕ| = ℵ0 (aleph-zero), so if a set A is countable, its cardinality is |A| = ℵ0 .

Cantor (1884) proved that the set ℝ of real numbers is not countable (the real numbers form a continuous set). The (infinite) cardinality of ℝ is denoted |ℝ| = ℵ1 (aleph-one). Any set A that can be put in correspondence with the set of real numbers has, therefore, the cardinality |A| = ℵ1 . One can give a clear sense to the relation ℵ1 > ℵ0 , but we don't need those details in this book.


1.2.2 Field

The power set of a set Ω , denoted ℘(Ω) , has been defined as the set of all possible subsets of Ω . If the set Ω has a finite or a countably infinite number of elements, we can build probability theory on ℘(Ω) , and we can then talk about the probability of any subset of Ω .

Things are slightly more complicated when the set Ω has an uncountable number of elements. As in most of our applications the set Ω is a (finite-dimensional) manifold, these complications matter. The difficulty is that one can consider subsets of a manifold whose 'shape' is so complicated that it is not possible to assign them a 'measure' (be it a 'volume' or a 'probability'). Then, when dealing with a set Ω with an uncountable number of elements, one needs to consider only subsets of Ω with shapes that are "simple enough". This is why we need to review here some of the mathematical concepts necessary for the development of probability theory, notably the notions of field, of sigma-field, and of Borel field.

Definition 1.1 Consider an arbitrary set Ω , and the set ℘(Ω) of all the subsets of Ω . A set F of subsets of Ω , i.e., a set F ⊆ ℘(Ω) , is called a field if

• the empty set belongs to F ,

∅ ∈ F , (1.13)

• F is closed under complementation,

A ∈ F ⇒ Ac ∈ F , (1.14)

• F is closed under the union of two sets,

A ∈ F AND B ∈ F ⇒ A∪ B ∈ F . (1.15)

Property 1.1 It immediately follows from this definition that

• the whole set Ω also belongs to F ,

Ω ∈ F , (1.16)

• F is closed under the intersection of two sets,

A ∈ F AND B ∈ F ⇒ A∩ B ∈ F . (1.17)


It also follows that F is closed under finite union and finite intersection of sets, i.e., for any finite n ,

A1 ∪ A2 ∪ . . . ∪ An ∈ F ; A1 ∩ A2 ∩ . . . ∩ An ∈ F . (1.18)

Example 1.1 Ω being an arbitrary set, F = {∅, Ω} is a field (called the trivial field).

Example 1.2 Ω being an arbitrary set, F = ℘(Ω) is a field.

Definition 1.2 Let C be a collection of subsets of Ω . The minimal field containing C , denoted F(C) , is the smallest field containing C . (See figure 1.2 for an example.)

Figure 1.2: Let Ω be an interval [a, b) of the real line (suggested at the top), and let C be the collection of the two intervals [a1, b1) and [a2, b2) suggested in the middle. The minimal field containing C is the collection of intervals suggested at the bottom.


1.2.3 Sigma-Field

Definition 1.3 Consider an arbitrary set Ω , and the set ℘(Ω) of all the subsets of Ω . A set F of subsets of Ω , i.e., a set F ⊆ ℘(Ω) , is called a sigma-field (σ-field) or sigma-algebra (σ-algebra) if

• the empty set belongs to F ,

∅ ∈ F , (1.19)

• F is closed under complementation,

A ∈ F ⇒ Ac ∈ F , (1.20)

• F is closed under any countable union of sets,

Ai ∈ F   ⇒   ⋃_{i=1}^{∞} Ai ∈ F . (1.21)

For short, one says that a σ-field is a collection of subsets that is closed under countable unions. It is easy to demonstrate that the same holds for the intersection, so one can say that a σ-field is a collection of subsets of Ω that is closed under countable unions and intersections. The elements of a σ-field are called measurable sets.

If a field has a finite number of elements, it is also a σ-field. A σ-field is always a field.

Example 1.3 The fields in examples 1.1 and 1.2 are σ-fields .

Example 1.4 Ω being an arbitrary set, consider a finite partition (resp. a countably infinite partition) of Ω , i.e., sets C1, C2, . . . , whose total union is Ω , and that do not overlap:

⋃_i Ci = Ω ; i ≠ j ⇒ Ci ∩ Cj = ∅ . (1.22)

The set F consisting of the empty set plus all finite (resp. countable) unions of sets Ci is a σ-field.
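As a small illustration of example 1.4 (a sketch only, for a finite partition; the particular set Ω and the blocks are invented for the example), the σ-field generated by a partition can be enumerated explicitly as the collection of all unions of blocks:

```python
# Sketch of example 1.4 for a finite partition: the sigma-field generated by
# the blocks C1, C2, ... is the set of all unions of blocks (plus the empty set).
# The partition below is an arbitrary illustrative choice.
from itertools import combinations

Omega = frozenset(range(6))
partition = [frozenset({0, 1}), frozenset({2}), frozenset({3, 4, 5})]

def sigma_field_from_partition(blocks):
    """All unions of subcollections of the blocks (the empty union gives the empty set)."""
    field = set()
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            field.add(frozenset().union(*combo))
    return field

F = sigma_field_from_partition(partition)
print(len(F))                                         # 2**3 = 8 measurable sets
assert Omega in F and frozenset() in F
assert all(Omega - A in F for A in F)                 # closed under complementation
assert all(A | B in F and A & B in F for A in F for B in F)   # closed under union and intersection
```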

Definition 1.4 Let C be a collection of subsets of Ω . The minimal σ-field containing C ,denoted σ(C) , is the smallest σ-field containing C .

If C is finite, then σ(C) = F(C) .

If Ω is non-denumerable and we use F = ℘(Ω) we get into trouble defining our probability measure because ℘(Ω) is too "huge", i.e., there will often be sets to which it will be impossible to assign a unique measure, giving rise to some difficulties². So we have to use a smaller σ-field (e.g. the Borel field of Ω , which is the smallest σ-field that makes all open sets measurable).

2Like the Banach-Tarski paradox.


Definition 1.5 Borel field. Let Ω = ℝ , and C the set of all intervals of the real line of the form [r1, r2) . The Borel field, denoted B , is the set σ(C) , i.e., the minimal σ-field containing C . The sets of the Borel field are called Borel sets.

The Borel field contains all countable sets of numbers, all open, semi-open, and closed intervals, and all the sets that can be obtained by countably many set operations. Although it contains a large collection of subsets of the real line, it is smaller than ℘(ℝ) , the power set of ℝ , and it is possible (but nontrivial) to define subsets of the real line that are not Borel sets. These sets do not matter for the problems examined in this book.

Example 1.5 Consider the n-dimensional Euclidean manifold, Ω = ℝn . Starting from the open. . . or the closed. . . the Borel field is. . .


1.3 Mappings

Consider a mapping (or function) from a set A0 into a set B0 . We write

a ↦ b = ϕ(a) , (1.23)

and we say that a is the argument of the mapping, and that b is the image of a under the mapping ϕ . Given A ⊆ A0 , the set B ⊆ B0 of all the points that are the images of the points in A is called the image of A under the mapping ϕ (see figure 1.3), and one writes

A ↦ B = ϕ[A] . (1.24)

Note that, while we write ϕ( · ) for the function mapping an element into an element, we write ϕ[ · ] for the function mapping a set into a set. Reciprocally, given B ⊆ B0 , the set A ⊆ A0 of all the points a ∈ A0 such that ϕ(a) ∈ B is called the reciprocal image (or preimage) of B (see figure 1.3), and one writes

A = ϕ-1[ B ] . (1.25)

The mapping ϕ-1[ · ] is called the reciprocal extension of ϕ( · ) . Of course, the notation ϕ-1 doesn't imply that the point-to-point mapping x ↦ ϕ(x) is invertible (in general, it is not). Note that there may exist sets B ⊆ B0 for which ϕ-1[ B ] = ∅ .

Figure 1.3: Left: a mapping from a set A0 into a set B0 . Center: the image ϕ[A] of a set A ⊆ A0 . Right: the reciprocal image ϕ-1[ B ] of a set B ⊆ B0 .
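The following short sketch (Python; the mapping ϕ(a) = a² and the finite sets are illustrative choices, not taken from the text) computes images and reciprocal images for a mapping that is neither injective nor surjective, including a set whose reciprocal image is empty:

```python
# A small sketch of the image and reciprocal image of sets under a
# non-injective, non-surjective mapping; the mapping phi is illustrative.
A0 = {-3, -2, -1, 0, 1, 2, 3}
B0 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

def phi(a):
    return a * a            # not injective (phi(-1) = phi(1)), not surjective onto B0

def image(A):
    """phi[A] : set of images of the points of A (equation 1.24)."""
    return {phi(a) for a in A}

def preimage(B):
    """phi^-1[B] : set of points of A0 whose image lies in B (equation 1.25)."""
    return {a for a in A0 if phi(a) in B}

print(image({-2, -1, 3}))        # {4, 1, 9}
print(preimage({1, 4}))          # {-2, -1, 1, 2}
print(preimage({7}))             # empty: some subsets of B0 have an empty reciprocal image
```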

A mapping ϕ from a set A into a set B is called surjective if for every b ∈ B there is at least one a ∈ A such that ϕ(a) = b (see figure 1.4). One also says that ϕ is a surjection, or that it maps A onto B . A mapping ϕ such that ϕ(a1) = ϕ(a2) ⇒ a1 = a2 is called injective (or a one-to-one mapping, or an injection). A mapping that is both injective and surjective is called bijective (or a bijection). It is then invertible: for every b ∈ B there is one, and only one, a ∈ A such that ϕ(a) = b , that one denotes a = ϕ-1(b) , and calls the inverse image of b . If ϕ is a bijection, then the reciprocal image of a set, as introduced by equation 1.25, equals the inverse image of the set.


Figure 1.4: The mapping at the top is not surjective (because there is one element in B that has no reciprocal image in A ), and is not injective (because two elements in A have the same image in B ). At the bottom, examples of a surjection (not injective), an injection (not surjective), and a bijection (surjective and injective).


1.3.1 Some Properties

In what follows, let us always denote by ϕ a mapping from a set A0 into a set B0 . The following properties are well known (see Bourbaki [1970] for a demonstration).

For any A ⊆ A0 , one has

A ⊆ ϕ-1[ϕ[A] ] , (1.26)

and one has A = ϕ-1[ϕ[A] ] if the mapping is injective. For any B ⊆ B0 , one has

ϕ[ϕ-1[ B ] ] ⊆ B , (1.27)

and one has ϕ[ϕ-1[ B ] ] = B if the mapping is surjective. For any two subsets of B0 ,

ϕ-1[B1 ∪ B2] = ϕ-1[B1] ∪ ϕ-1[B2]
ϕ-1[B1 ∩ B2] = ϕ-1[B1] ∩ ϕ-1[B2] .        (1.28)

For any two subsets of A0 ,

ϕ[A1 ∪ A2] = ϕ[A1] ∪ ϕ[A2]
ϕ[A1 ∩ A2] ⊆ ϕ[A1] ∩ ϕ[A2] ,        (1.29)

and one has ϕ[A1 ∩ A2] = ϕ[A1] ∩ ϕ[A2] if ϕ is injective (see figure 1.5).

Figure 1.5: In general, ϕ[A1 ∩ A2] ⊆ ϕ[A1] ∩ ϕ[A2] , unless ϕ is injective, in which case the two sets are equal.

I now introduce a relation that is similar to the second relation in 1.29, but with an equality symbol (this relation shall play a major role when we become interested in problems of data interpretation). For any A ⊆ A0 and any B ⊆ B0 , one has

ϕ[ A∩ϕ-1[ B ] ] = ϕ[A] ∩ B . (1.30)


Figure 1.6: One always has ϕ[ A ∩ ϕ-1[ B ] ] = ϕ[A] ∩ B . Note: this figure should perhaps replace figure 1.7.

See appendix 5.1.1 for the proof of this property, and figure 1.7 for a graphical illustration.


Figure 1.7: Geometrical construction illustrating the property ϕ[ A ∩ ϕ-1[B] ] = ϕ[A] ∩ B when the sets are one-dimensional manifolds.
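Property 1.30 is easy to test numerically. The sketch below (with an arbitrary non-injective mapping and randomly drawn finite sets; everything here is an illustrative assumption, not part of the text) checks both the inclusion in 1.29 and the equality 1.30 on a large number of random cases:

```python
# Numerical check of the inclusion in 1.29 and of the identity 1.30,
# phi[ A  ∩ phi^-1[B] ] = phi[A] ∩ B, on randomly drawn finite sets.
# The mapping and the sets are illustrative.
import random

A0 = list(range(-10, 11))
phi = lambda a: a * a % 7          # deliberately non-injective

def image(A):      return {phi(a) for a in A}
def preimage(B):   return {a for a in A0 if phi(a) in B}

random.seed(0)
for _ in range(1000):
    A1 = set(random.sample(A0, 6));  A2 = set(random.sample(A0, 6))
    B  = set(random.sample(range(7), 3))
    assert image(A1 & A2) <= image(A1) & image(A2)        # equation 1.29 (inclusion only)
    assert image(A1 & preimage(B)) == image(A1) & B       # equation 1.30 (always an equality)
```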


1.3.2 Indicator of the Image of a Set

In a later chapter of this book we shall be interested in the notions of image and of reciprocal image of a probability. Because a probability can be seen as a generalization of an indicator, we can prepare our future work by examining here the image and the reciprocal image of a set in terms of indicators.

Note: I should, perhaps, use the notation abuse

ϕ-1(b) ≡ ϕ-1[b] . (1.31)

Note: explain that I introduce the definition

χA1[ A2 ] ≡ 0 if A2 = ∅ ,   χA1[ A2 ] ≡ max_{a∈A2} χA1(a) if A2 ≠ ∅ .        (1.32)

1.3.2.1 Indicator of the image

Let ϕ be a mapping from a set A0 into a set B0 . For any A ⊆ A0 , we have introduced the image ϕ[ A ] ⊆ B0 , and by definition of indicator,

ξϕ[A](b) = 1 if b ∈ ϕ[A] ,   ξϕ[A](b) = 0 if b ∉ ϕ[A] .        (1.33)

How can we express the indicator ξϕ[A] in terms of the indicator χA ? Should the mapping ϕ be a bijection, then, of course,

ξϕ[A](b) = χA(ϕ-1(b) ) , (1.34)

but let us try to be general. If ϕ is an arbitrary mapping, then, for a given point b ∈ B0 , there is a set (perhaps empty) of points a ∈ A0 that map into b , i.e., such that ϕ(a) = b . This is the set ϕ-1[b] . Then,

ξϕ[A](b) = 0 if ϕ-1[b] = ∅ ,   ξϕ[A](b) = max_{a∈ϕ-1[b]} χA(a) if ϕ-1[b] ≠ ∅ ,        (1.35)

i.e., using the definition 1.32,

for any b ∈ B0 , ξϕ[A](b) = χA(ϕ-1[b] ) . (1.36)

The apparent simplicity of this expression should not mislead us: the evaluation of ξϕ[A](b) requires (for each b ∈ B0 ) the identification of all the elements in ϕ-1[b] , which may not be easy if the mapping ϕ is complicated.


1.3.2.2 Indicator of the reciprocal image.

Let ϕ be a mapping from a set A0 into a set B0 . For any B ⊆ B0 , we can express the indicator of the set ϕ-1[ B ] ⊆ A0 in terms of the indicator of the set B ⊆ B0 : it is the function χϕ-1[ B ] that to any a ∈ A0 associates the number χϕ-1[ B ](a) = ξB(ϕ(a) ) :

for any a ∈ A0 , χϕ-1[ B ](a) = ξB(ϕ(a) ) . (1.37)

Using the simplified notation for the indicators,

for any a ∈ A0 , (ϕ-1[ B ])(a) = B(ϕ(a) ) . (1.38)
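A minimal sketch of formulas 1.35–1.37 (Python; the finite sets and the mapping are invented for the illustration): the indicator of the image is obtained by maximizing χA over each reciprocal image ϕ-1[b] , while the indicator of the reciprocal image is simply ξB composed with ϕ :

```python
# Sketch of equations 1.35-1.37: computing the indicators of phi[A] and of
# phi^-1[B] from the indicators chi_A and xi_B. All names are illustrative.
A0 = {1, 2, 3, 4, 5}
B0 = {'p', 'q', 'r', 's'}
phi = {1: 'p', 2: 'p', 3: 'q', 4: 'q', 5: 'r'}.get     # a non-injective, non-surjective map

A = {1, 3}
B = {'q', 's'}
chi_A = lambda a: 1 if a in A else 0
xi_B  = lambda b: 1 if b in B else 0

def xi_image_A(b):
    """Indicator of phi[A] (equation 1.35): max of chi_A over the preimage of b."""
    pre = [a for a in A0 if phi(a) == b]
    return 0 if not pre else max(chi_A(a) for a in pre)

def chi_preimage_B(a):
    """Indicator of phi^-1[B] (equation 1.37): xi_B evaluated at phi(a)."""
    return xi_B(phi(a))

print({b: xi_image_A(b) for b in B0})      # nonzero exactly on phi[A] = {'p', 'q'}
print({a: chi_preimage_B(a) for a in A0})  # nonzero exactly on phi^-1[B] = {3, 4}
```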


1.4 Assimilation of Data (“Inverse Problems”)

We need to become familiar with properties like

a ∈ A ⇒ ϕ(a) ∈ ϕ[A]

a ∈ A : ϕ(a) ∈ ϕ[A]        (1.39)

and

ϕ(a) ∈ B ⇔ a ∈ ϕ-1[ B ] . (1.40)

We also need to remember here that the definition of intersection of sets (equation 1.6) can be written

a ∈ A1 AND a ∈ A2 ⇔ a ∈ A1 ∩ A2 . (1.41)

Consider an arbitrary mapping ϕ from a set A0 into a set B0 . For any A ⊆ A0 and any B ⊆ B0 , one has

( a ∈ A  AND  b ∈ B  AND  b = ϕ(a) )   ⇔   ( a ∈ A′ ≡ A ∩ ϕ-1[ B ]  AND  b ∈ B′ ≡ ϕ[A] ∩ B  AND  b = ϕ(a) ) .        (1.42)

While the implication ⇐ is trivial, the implication ⇒ is an actual restriction of the domains. We do not need a formal proof: a look at the bottom panel of figure 1.8 will show that 1.42 is, indeed, a general property.

What is less obvious is the relation

B′ = ϕ[A′] , (1.43)

that follows from the property ϕ[ A ∩ ϕ-1[ B ] ] = ϕ[A] ∩ B (equation 1.30, demonstrated in appendix 5.1.1).

Remark that we are inside the paradigm typical of an "inverse problem": the mapping a ↦ b = ϕ(a) can be seen as the typical mapping between the "model parameter space" and the "observable parameter space". Working with sets, instead of working with volumetric probabilities, corresponds to the "interval estimation philosophy" that some authors (e.g., Stark [1992, 1997]) prefer. Under this set theory setting, we can be precise about the "posterior domain" for a and also about the "posterior domain" for b = ϕ(a) .

Page 29: Mapping of Probabilities

1.4 Assimilation of Data (“Inverse Problems”) 19

In figure 1.8, the typical problem of "data assimilation" is examined (this is sometimes called an "inverse problem"). One starts with a set A0 (the set of all conceivable models) and a set B0 (the set of all conceivable observations). Any model a must satisfy a ∈ A0 and any observation b must satisfy b ∈ B0 (top row of the figure). One then may have the "a priori information" a ∈ A ⊆ A0 and the (imprecise) observation b ∈ B ⊆ B0 (second row of the figure). In the absence of any relation between a and b , this is all one has. But if some "physical theory" is able to provide a relation b = ϕ(a) (prediction of the observation, given a model), then one has a ∈ A AND b ∈ B AND b = ϕ(a) (third row of the figure). And, as explained above, this is equivalent to a ∈ A ∩ ϕ-1[ B ] AND b ∈ ϕ[A] ∩ B AND b = ϕ(a) (bottom row of the figure).

The use of the (imprecise) observation b ∈ B has allowed us to pass (via the relation b = ϕ(a) ) from the "a priori information" a ∈ A to the "a posteriori information" a ∈ A′ ≡ A ∩ ϕ-1[ B ] ⊆ A . This also increases the available information on b , as one passes from the initial b ∈ B to the final b ∈ B′ ≡ ϕ[A] ∩ B ⊆ B . And, because of the identity expressed in equation 1.30, one has the consistency relation B′ = ϕ[A′] .

In chapter 3 of this book this "theory" is generalized so as to work with probability distributions instead of with sets.



Figure 1.8: A typical problem of “data assimilation” (see text for details).


1.4.1 Exercise

This exercise is fully explained in the caption of figure 1.9. The exercise is the same as that in section 3.4.10, except that while here we use sets, in section 3.4.10 we use volumetric probabilities. The initial intervals here (initial information on x and on y ) correspond to the "two-sigma" intervals of the (truncated) Gaussians of section 3.4.10, so the results are directly comparable.


Figure 1.9: Two quantities x and y have some definite values xtrue and ytrue , that we try to uncover. For the time being, we have the following information on these two quantities: xtrue ∈ f = [1, 5] , and ytrue ∈ p = [14, 22] (the black intervals in the figure). We then learn that, in fact, ytrue = ϕ(xtrue) , with the function x ↦ y = ϕ(x) = x² − (x − 3)³ represented above. To deduce the "posterior" interval containing xtrue , we can first introduce the reciprocal image ϕ-1(p) of the interval p (blue interval at the bottom), then define the intersection g = f ∩ ϕ-1(p) (red interval at the bottom). To obtain the "posterior" interval containing ytrue , we can just evaluate the image of the interval g by the mapping: q = ϕ(g) = ϕ( f ∩ ϕ-1(p) ) . We could have obtained the interval q following a different route. We could first have evaluated the image of f , to obtain the interval ϕ( f ) (blue interval at the left). The intersection of the interval p with the interval ϕ( f ) then gives the same interval q , because of the property ϕ( f ∩ ϕ-1(p) ) = ϕ( f ) ∩ p .
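The construction described in the caption can be reproduced numerically by brute force, sampling the interval f on a fine grid (a sketch only; the grid resolution is arbitrary, and the code simply reports the extreme points of the posterior sets, which here happen to be intervals):

```python
# Brute-force numerical version of the exercise of figure 1.9.
# The function and the intervals are those of the caption; the grid is arbitrary.
import numpy as np

phi = lambda x: x**2 - (x - 3)**3

x = np.linspace(1.0, 5.0, 400001)           # grid over the prior interval f = [1, 5]
y = phi(x)

in_p = (y >= 14.0) & (y <= 22.0)            # prior information on y: p = [14, 22]
g = x[in_p]                                 # g = f ∩ phi^-1(p)
q = phi(g)                                  # q = phi(g) = phi(f) ∩ p  (property 1.30)

print("posterior set for x :", g.min(), "to", g.max())
print("posterior set for y :", q.min(), "to", q.max())
```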


Chapter 2

Manifolds

Old text begins.

Probability densities play an important role in physics. To handle them properly, we must have a clear notion of what 'integrating a scalar function over a manifold' means.

While mathematicians may assume that a manifold has a notion of 'volume' defined, physicists must check whether this is true in every application, and the answer is not always positive. We must understand how far we can go without having a notion of volume, and we must understand what supplementary theory appears when we do have such a notion.

It is my feeling that every book on probability theory should start with a chapter explaining all the notions of tensor calculus that are necessary to develop an intrinsic theory of probability. This is the role of this chapter. In it, in addition to 'ordinary' tensors, we shall find the tensor capacities and tensor densities that were common in the books of a certain epoch, but that are not in fashion today (wrongly, I believe).

Old text ends.


2.1 Manifolds and Coordinates

In this chapter, the basic notions of tensor calculus and of integration theory are introduced. I do not try to be complete. Rather, I try to develop the minimum theory that is necessary in order to develop probability theory in the subsequent chapters.

The reader is assumed to have a good knowledge of tensor calculus, the goal of the chapter being more to fix terminology and notations than to advance in the theory.

Many books on tensor calculus exist. The best are (of course) in French, and Brillouin (1960) is the best among them. Many other books contain introductory discussions of tensor calculus; Weinberg (1972) is particularly lucid.

Perhaps original in this text is a notation proposed to distinguish between densities and capacities. While the trick of using indices in upper or lower position to distinguish between vectors and forms (or, in metric spaces, to distinguish between 'contravariant' and 'covariant' components) makes formulas intuitive, I propose to use a bar (in upper or lower position) to distinguish between densities (like a probability density) and capacities (like the capacity element of integration theory), this also leading to intuitive results. In particular, the bijection existing between these objects in metric spaces becomes as 'natural' as the one just mentioned between contravariant and covariant components.

All through this book the implicit sum convention over repeated indices is used: an expression like tij nj means ∑j tij nj .


2.1.1 Linear Spaces

Consider a finite-dimensional linear space L , with vectors denoted u , v . . . If {ei} is a basis of the linear space, any vector v can be (uniquely) decomposed as

v = vi ei , (2.1)

this defining the components vi of the vector v in the basis {ei} .

A linear form over L is a linear application from L into the set of real numbers, i.e., a linear application that to every vector v ∈ L associates a real number. Denoting by f a linear form, the number λ associated by f to an arbitrary vector v is denoted

λ = 〈 f , v 〉 . (2.2)

For any given linear form, say f , there is a unique set of quantities fi such that for any vector v ,

〈 f , v 〉 = fi vi . (2.3)

It is easy to see that the set of linear forms over a linear space L is itself a linear space, that is denoted L∗ . The quantities fi can then be seen as being the components of the form f on a basis of forms {ei} , that is called the dual of the vector basis {ei} , and that may be defined by the condition

〈 ei , e j 〉 = δij (2.4)

(where δij is the 'symbol' that takes the value 'one' when i = j and 'zero' when i ≠ j ).

The two linear spaces L and L∗ are the 'building blocks' of an infinite series of more complex linear spaces. For instance, a set of coefficients tijk can be used to define the linear application

vi , fi , gi ↦ λ = tijk vi fj gk . (2.5)

As it is easy to define the sum of two such linear applications, and the multiplication of such a linear application by a real number, we can say that the coefficients tijk define an element of a linear space, denoted L∗ ⊗ L ⊗ L . The coefficients tijk can then be seen as the components of an element t of the linear space L∗ ⊗ L ⊗ L on a basis that is denoted ei ⊗ ej ⊗ ek , and one writes

t = tijk ei ⊗ e j ⊗ ek . (2.6)


2.1.2 Manifolds

Grossly speaking, a manifold is a 'space of points'. The physical 3D space is an example of a three-dimensional manifold, and the surface of a sphere is an example of a two-dimensional manifold. In our theory, we shall consider manifolds with an arbitrary (but finite) number of dimensions. Those manifolds may be flat or not (although the 'curvature' of a manifold will appear only in one of the appendixes [note: what about the curvature of the sphere?]).

We shall examine 'smooth manifolds' only. For instance, the surface of a sphere is a smooth manifold. The surface of a cone is smooth everywhere, except at the tip of the cone.

The points inside well-chosen portions of a manifold can be designated by their coordinates: a coordinate system with n coordinates defines a one-to-one application between a portion of a manifold and a portion of ℝn . We then say that the manifold has n dimensions. The term 'portion' is used here to stress that many manifolds cannot be completely covered by a single coordinate system: any single coordinate system on the surface of the sphere will be pathological at least at one point (the spherical coordinates are pathological at two points, the two poles).

In what follows, smooth manifolds shall be denoted by symbols like M and N , and the points of a manifold by symbols like P and Q . A coordinate system is denoted, for instance, by {xi} .

At each point P of an n-dimensional manifold M one can introduce the linear tangent space, and all the vectors and tensors that can exist¹ at that point. When a system of coordinates {xi} is defined over the manifold M , at each point P of the manifold there is the natural basis (of the tangent linear space at P ). Actual tensors can be defined at any point independently of any coordinate system (and of any local basis), but their components are, of course, only defined when a basis is chosen. Usually, this basis is the natural basis associated to a coordinate system. When changing coordinates, the natural basis changes, so the components of the tensors change too. The formulas describing the change of components of a tensor under a change of coordinates are recalled below.

While tensors are intrinsic objects, it is sometimes useful to introduce 'tensor densities' and 'tensor capacities', that depend on the coordinates being used in an essential way. These densities and capacities are useful, in particular, to develop the notion of volume (or of 'measure') on a manifold, and, therefore, to introduce the basic concept of integral. It is for this reason that, in addition to tensors, densities and capacities are also considered below.

¹The vectors belong to the tangent linear space, and the tensors belong to the different linear spaces that can be built at point P using the different tensor products of the tangent linear space and its dual.


2.1.3 Changing Coordinates

Consider, over a finite-dimensional (smooth) manifold M , a first system of coordinates {xi ; (i = 1, . . . , n)} and a second system of coordinates {xi′ ; (i′ = 1, . . . , n)} (putting the 'primes' in the indices rather than in the x's greatly simplifies many tensor equations).

One may write the coordinate transformation using any of the two equivalent functions

xi′ = xi′(x1, . . . , xn) ; (i′ = 1, . . . , n)
xi = xi(x1′ , . . . , xn′) ; (i = 1, . . . , n) .        (2.7)

We shall need the two sets of partial derivatives²

Xi′i = ∂xi′/∂xi ; Xii′ = ∂xi/∂xi′ . (2.8)

One has

Xi′k Xkj′ = δi′j′ ; Xik′ Xk′j = δij . (2.9)

To simplify language and notations, it is useful to introduce two matrices of partial derivatives, arranging the elements Xii′ and Xi′i as follows,

X = \begin{pmatrix} X^1_{1'} & X^1_{2'} & X^1_{3'} & \cdots \\ X^2_{1'} & X^2_{2'} & X^2_{3'} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} ;   X' = \begin{pmatrix} X^{1'}_1 & X^{1'}_2 & X^{1'}_3 & \cdots \\ X^{2'}_1 & X^{2'}_2 & X^{2'}_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} .        (2.10)

Then, equations 2.9 just tell that the matrices X and X′ are mutually inverse:

X′ X = X X′ = I . (2.11)

The two matrices X and X′ are called Jacobian matrices. As the matrix X′ is obtained by taking derivatives of the variables xi′ with respect to the variables xi , one obtains the matrix Xi′i as a function of the variables xi , so we can write X′(x) rather than just writing X′ . The reciprocal argument tells us that we can write X(x′) rather than just X . We shall later use this to make some notations more explicit.

Finally, the Jacobian determinants of the transformation are the determinants of the two Jacobian matrices:

X′ = det X′ ; X = det X . (2.12)

Of course, X X′ = 1 .

²Again, the same letter X is used here, the 'primes' in the indices distinguishing the different quantities.
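As a concrete (and entirely illustrative) instance of equations 2.8–2.12, the sketch below builds the two Jacobian matrices for the change between Cartesian and polar coordinates in the plane, and checks numerically that they are mutually inverse and that their determinants are reciprocal:

```python
# Numerical illustration of equations 2.8-2.12 for the change between
# Cartesian coordinates (x1, x2) and polar coordinates (r, theta) in the plane.
# The particular coordinate systems are an illustration, not part of the text.
import numpy as np

def to_polar(x1, x2):
    return np.hypot(x1, x2), np.arctan2(x2, x1)

def X_matrix(r, theta):
    """X = d x / d x' : derivatives of the Cartesian coordinates w.r.t. (r, theta)."""
    return np.array([[np.cos(theta), -r * np.sin(theta)],
                     [np.sin(theta),  r * np.cos(theta)]])

def Xp_matrix(x1, x2):
    """X' = d x' / d x : derivatives of (r, theta) w.r.t. the Cartesian coordinates."""
    r2 = x1**2 + x2**2
    r = np.sqrt(r2)
    return np.array([[ x1 / r,  x2 / r],
                     [-x2 / r2, x1 / r2]])

x1, x2 = 1.3, 0.7
r, theta = to_polar(x1, x2)
X, Xp = X_matrix(r, theta), Xp_matrix(x1, x2)

print(np.allclose(Xp @ X, np.eye(2)))                        # equation 2.11: mutually inverse
print(np.isclose(np.linalg.det(X) * np.linalg.det(Xp), 1))   # X X' = 1 for the determinants
print(np.isclose(np.linalg.det(X), r))                       # for polar coordinates, X = r
```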


2.1.4 Tensors, Capacities, and Densities

Consider a finite-dimensional manifold M with some coordinates {xi} . Let P be a point of the manifold, and {ei} a basis of the linear space tangent to M at P , this basis being the natural basis associated to the coordinates {xi} at point P .

Let T = Tij...kℓ... ei ⊗ ej · · · ek ⊗ eℓ · · · be a tensor at point P . The Tij...kℓ... are, therefore, the components of T on the basis ei ⊗ ej · · · ek ⊗ eℓ · · · .

On a change of coordinates from {xi} into {xi′} , the natural basis will change, and, therefore, the components of the tensor will also change, becoming Ti′j′...k′ℓ′... . It is well known that the new and the old components are related through

Ti′j′...k′ℓ′... = (∂xi′/∂xi) (∂xj′/∂xj) · · · (∂xk/∂xk′) (∂xℓ/∂xℓ′) · · · Tij...kℓ... , (2.13)

or, using the notations introduced above,

Ti′j′...k′ℓ′... = Xi′i Xj′j · · · Xkk′ Xℓℓ′ · · · Tij...kℓ... . (2.14)

In particular, for totally contravariant and totally covariant tensors,

Ti′j′... = Xi′i Xj′j · · · Tij... ; Ti′j′... = Xii′ Xjj′ · · · Tij... . (2.15)

In addition to actual tensors, we shall encounter other objects that also 'have indices', but that transform in a slightly different way: densities and capacities (see for instance Weinberg [1972] and Winogradzki [1979]). Rather than giving a general exposition of the properties of densities and capacities, let us anticipate that we shall only find totally contravariant densities and totally covariant capacities (like the Levi-Civita capacity, to be introduced below). From now on, in all this text,

• a density is denoted with an overline, like in a̅ ;

• a capacity is denoted with an underline, like in b̲ .

Let me now give what we can take as defining properties: Under the considered change of coordinates, a totally contravariant density a = a ij... ei ⊗ ej . . . changes components following the law

a i′j′... = (1/X′) Xi′i Xj′j · · · a ij... , (2.16)

or, equivalently, a i′j′... = X Xi′i Xj′j · · · a ij... . Here X = det X and X′ = det X′ are the Jacobian determinants introduced in equation 2.12. This rule for the change of components of a totally contravariant density is the same as that for a totally contravariant tensor (equation at left in 2.15), except that there is an extra factor, the Jacobian determinant X = 1/X′ .


Similarly, a totally covariant capacity b = b ij... ei ⊗ ej . . . changes components following the law

b i′j′... = (1/X) Xii′ Xjj′ · · · b ij... , (2.17)

or, equivalently, b i′j′... = X′ Xii′ Xjj′ · · · b ij... . Again, this rule for the change of components of a totally covariant capacity is the same as that for a totally covariant tensor (equation at right in 2.15), except that there is an extra factor, the Jacobian determinant X′ = 1/X .

The most notable examples of tensor densities and capacities are the Levi-Civita density and the Levi-Civita capacity (examined in section 2.1.8 below).

The number of terms in equations 2.16 and 2.17 depends on the 'variance' of the objects considered (i.e., on the number of indices they have). We shall find, in particular, scalar densities and scalar capacities, that do not have any index. The natural extension of equations 2.16 and 2.17 is (a scalar can be considered to be a totally antisymmetric tensor)

a′ = (1/X′) a = X a (2.18)

for a scalar density, and

b′ = (1/X) b = X′ b (2.19)

for a scalar capacity.

The most notable example of a scalar capacity is the capacity element (as explained in section 2.1.11, this is the equivalent of the 'volume' element that can be defined in metric manifolds). Scalar densities abound; for example, a probability density.

Let us write the two equations 2.18–2.19 more explicitly. Using x′ as variable,

a′(x′) = X(x′) a(x(x′)) ; b′(x′) = (1/X(x′)) b(x(x′)) , (2.20)

or, equivalently, using x as variable,

a′(x′(x)) = (1/X′(x)) a(x) ; b′(x′(x)) = X′(x) b(x) . (2.21)
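A one-dimensional numerical sketch of the scalar-density rule (the exponential density and the change of variable x′ = x³ are arbitrary choices made for this illustration): transforming a probability density with the factor 1/X′ preserves the probability of every interval, which is the point of the transformation law.

```python
# Sketch of how a probability density behaves as a scalar density (equations
# 2.18 and 2.21), for the one-dimensional change of variable x' = x**3.
# The density and the change of variable are illustrative choices.
import numpy as np

f = lambda x: np.exp(-x)                  # density of x in the coordinate x (x > 0)
X_prime = lambda x: 3.0 * x**2            # X'(x) = dx'/dx, the 1D Jacobian 'determinant'

# Equation 2.21 for a scalar density: f'(x'(x)) = f(x) / X'(x)
f_new = lambda xp: f(xp**(1.0 / 3.0)) / X_prime(xp**(1.0 / 3.0))

def integrate(fun, lo, hi, n=200000):
    """Midpoint-rule integral of fun over [lo, hi] (good enough for this check)."""
    h = (hi - lo) / n
    t = lo + h * (np.arange(n) + 0.5)
    return np.sum(fun(t)) * h

a, b = 0.4, 2.5
print(integrate(f, a, b))                 # ≈ exp(-0.4) - exp(-2.5) ≈ 0.588
print(integrate(f_new, a**3, b**3))       # same value: probabilities do not depend on the coordinate
```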

For completeness, let me mention here that densities and capacities of higher degree are also usually introduced (they appear briefly below). For instance, under a change of variables, a second degree (totally contravariant) tensor density would not satisfy equation 2.16, but, rather,

a i′j′... = (1/(X′)²) Xi′i Xj′j · · · a ij... , (2.22)


where the reader should note the double bar used to indicate that a i′j′... is a second degree tensor density. Similarly, under a change of variables, a second degree (totally covariant) tensor capacity would not satisfy equation 2.17, but, rather,

b i′j′... = (1/X²) Xii′ Xjj′ · · · b ij... . (2.23)

The multiplication of tensors is one possibility for defining new tensors, like in tijk = fj sik . Using the rules of change of components given above it is easy to demonstrate the following properties:

• the product of a density by a tensor gives a density (like in pi = ρ vi );

• the product of a capacity by a tensor gives a capacity (like in sij = ti uj );

• the product of a capacity by a density gives a tensor (like in dσ = g dτ ).

Therefore, in a tensor equality, the total number of bars on each side of the equality must be balanced (counting upper and lower bars with opposite sign).


2.1.5 Kronecker Tensors (I)

There are two Kronecker's 'symbols', δij and δij . They are defined similarly:

δij = 1 if i and j are the same index ,   δij = 0 if i and j are different indices , (2.24)

and

δij = 1 if i and j are the same index ,   δij = 0 if i and j are different indices . (2.25)

It is easy to verify that these are more than simple 'symbols': they are tensors. For under a change of variables we should have, using equation 2.14, δi′j′ = Xi′i Xjj′ δij , i.e., δi′j′ = Xi′i Xij′ , which is indeed true (see equation 2.9). Therefore, we shall say that δij and δij are the Kronecker tensors.

Warning: a common error among beginners is to give the value 1 to the symbol δii . In fact, the right value is n , the dimension of the space, as there is an implicit sum assumed: δii = δ11 + δ22 + · · · + δnn = 1 + 1 + · · · + 1 = n .
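A two-line numerical reminder of this warning (a sketch only; numpy assumed, purely illustrative):

```python
# The implicit-sum warning, checked numerically: delta^i_i = n, not 1.
import numpy as np
n = 5
delta = np.eye(n)                 # components of the Kronecker tensor
print(np.trace(delta))            # 5.0 : the contraction delta^i_i equals the dimension n
```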


2.1.6 Orientation of a Coordinate System

The Jacobian determinants associated to a change of variables x ↔ y have been defined in section 2.1.3. As their product must equal +1, they must be both positive or both negative. Two different coordinate systems x = {x1, x2, . . . , xn} and y = {y1, y2, . . . , yn} are said to have the 'same orientation' (at a given point) if the Jacobian determinants of the transformation are positive. If they are negative, it is said that the two coordinate systems have 'opposite orientation'.

Note: what follows is a useless complication!

Note: what about √det g ?

Note: talk to Tolo (he has very clear ideas on this subject. . . )

As changing the orientation of a coordinate system simply amounts to changing the order of two of its coordinates, in what follows we shall assume that in all our changes of coordinates the new coordinates are always ordered in a way that the orientation is preserved. The special one-dimensional case (where there is only one coordinate) is treated in an ad hoc way.

Example 2.1 In the Euclidean 3D space, a positive orientation is assigned to a Cartesian coordinate system {x, y, z} when the positive sense of the z axis is obtained from the positive senses of the x axis and the y axis following the screwdriver rule. Another Cartesian coordinate system {u, v, w} defined as u = y , v = x , w = z would then have a negative orientation. A system of three spherical coordinates, if taken in their usual order {r, θ, ϕ} , then also has a positive orientation, but when changing the order of two coordinates, like in {r, ϕ, θ} , the orientation of the coordinate system is negative. For a system of geographical coordinates³, the reverse is true: while {r, ϕ, λ} is a positively oriented system, {r, λ, ϕ} is negatively oriented.

³The geographical coordinate λ (latitude) is related to the spherical coordinate θ as λ + θ = π/2 . Therefore, cos λ = sin θ .


2.1.7 Totally Antisymmetric Tensors

A tensor is completely antisymmetric if any even permutation of indices does not change the value of the components, and if any odd permutation of indices changes the sign of the value of the components:

tpqr... = +tijk... if ijk . . . is an even permutation of pqr . . . ,
tpqr... = −tijk... if ijk . . . is an odd permutation of pqr . . .        (2.26)

For instance, a fourth rank tensor tijkl is totally antisymmetric if

tijkl = tiklj = tiljk = tjilk = tjkil = tjlki
      = tkijl = tkjli = tklij = tlikj = tljik = tlkji
      = −tijlk = −tikjl = −tilkj = −tjikl = −tjkli = −tjlik
      = −tkilj = −tkjil = −tklji = −tlijk = −tljki = −tlkij ,        (2.27)

a third rank tensor tijk is totally antisymmetric if

tijk = tjki = tkij = −tikj = −tjik = −tkji , (2.28)

a second rank tensor tij is totally antisymmetric if

tij = −tji . (2.29)

By convention, a first rank tensor ti and a scalar t are considered to be totally antisymmetric (they satisfy the properties typical of other antisymmetric tensors).


2.1.8 Levi-Civita Capacity and Density

When working in a manifold of dimension n , one introduces two Levi-Civita 'symbols', εi1i2...in and εi1i2...in (having n indices each). They are defined similarly:

εijk... = +1 if ijk . . . is an even permutation of 12 . . . n ,
εijk... = 0 if some indices are identical ,
εijk... = −1 if ijk . . . is an odd permutation of 12 . . . n ,        (2.30)

and

εijk... = +1 if ijk . . . is an even permutation of 12 . . . n ,
εijk... = 0 if some indices are identical ,
εijk... = −1 if ijk . . . is an odd permutation of 12 . . . n .        (2.31)

In fact, these are more than 'symbols': they are respectively a capacity and a density. Let us check this, for instance, for εijk... . In order for εijk... to be a capacity, one should verify that, under a change of variables over the manifold, expression 2.17 holds, so one should have εi′j′... = (1/X) Xii′ Xjj′ · · · εij... . That this is true follows from the property Xii′ Xjj′ · · · εij... = X εi′j′... , which can be demonstrated using the definition of a determinant (see equation 2.33). It is not obvious a priori that a property as strong as that expressed by the two equations 2.30–2.31 is conserved through an arbitrary change of variables. We see that this is due to the fact that the very definition of a determinant (equation 2.33) contains the Levi-Civita symbols.

Therefore, \underline{\varepsilon}_{ijk\ldots} is to be called the Levi-Civita capacity, and \overline{\varepsilon}^{ijk\ldots} is to be called the Levi-Civita density. By definition, these are totally antisymmetric.

In a space of dimension n the following properties hold

\underline{\varepsilon}_{s_1 \ldots s_n}\, \overline{\varepsilon}^{s_1 \ldots s_n} = n!

\underline{\varepsilon}_{i_1 s_2 \ldots s_n}\, \overline{\varepsilon}^{j_1 s_2 \ldots s_n} = (n-1)! \; \delta^{j_1}_{i_1}

\underline{\varepsilon}_{i_1 i_2 s_3 \ldots s_n}\, \overline{\varepsilon}^{j_1 j_2 s_3 \ldots s_n} = (n-2)! \left( \delta^{j_1}_{i_1} \delta^{j_2}_{i_2} - \delta^{j_2}_{i_1} \delta^{j_1}_{i_2} \right)

\cdots = \cdots ,     (2.32)

the successive equations involving the 'Kronecker determinants', whose theory is not developed here.
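A quick numerical check of the contraction identities 2.32 may help the reader; the following Python sketch (ours, not part of the text; the helper name levi_civita is illustrative) builds the n-index symbol from permutation parities and verifies the first two identities for n = 3.

```python
# Minimal numerical check of the Levi-Civita contraction identities (2.32), for n = 3.
import itertools
import numpy as np

def levi_civita(n):
    """Dense n-index Levi-Civita symbol, with entries -1, 0, +1."""
    eps = np.zeros((n,) * n, dtype=int)
    for perm in itertools.permutations(range(n)):
        inversions = sum(1 for a in range(n) for b in range(a + 1, n)
                         if perm[a] > perm[b])
        eps[perm] = -1 if inversions % 2 else 1   # sign of the permutation
    return eps

eps = levi_civita(3)
print(np.einsum('ijk,ijk->', eps, eps))      # n! = 6
print(np.einsum('iab,jab->ij', eps, eps))    # (n-1)! * identity = 2 * I
```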


2.1.9 Determinants

The Levi-Civita densities and capacities can be used to define determinants. For instance, in a space of dimension n , the determinants of the tensors Q_{ij} , R_i^{\;j} , S^i_{\;j} , and T^{ij} are defined by

Q = \frac{1}{n!}\, \overline{\varepsilon}^{i_1 i_2 \ldots i_n}\, \overline{\varepsilon}^{j_1 j_2 \ldots j_n}\, Q_{i_1 j_1} Q_{i_2 j_2} \ldots Q_{i_n j_n}

R = \frac{1}{n!}\, \overline{\varepsilon}^{i_1 i_2 \ldots i_n}\, \underline{\varepsilon}_{j_1 j_2 \ldots j_n}\, R_{i_1}^{\;j_1} R_{i_2}^{\;j_2} \ldots R_{i_n}^{\;j_n}

S = \frac{1}{n!}\, \underline{\varepsilon}_{i_1 i_2 \ldots i_n}\, \overline{\varepsilon}^{j_1 j_2 \ldots j_n}\, S^{i_1}_{\;j_1} S^{i_2}_{\;j_2} \ldots S^{i_n}_{\;j_n}

T = \frac{1}{n!}\, \underline{\varepsilon}_{i_1 i_2 \ldots i_n}\, \underline{\varepsilon}_{j_1 j_2 \ldots j_n}\, T^{i_1 j_1} T^{i_2 j_2} \ldots T^{i_n j_n} .     (2.33)

In particular, it is the first of equations 2.33 that is used below (equation 2.73) to define the metric determinant.
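As a small illustration of ours (not from the text), the first of equations 2.33 can be checked numerically against the usual determinant.

```python
# Determinant of a 3x3 array via two Levi-Civita contractions (first of eqs. 2.33).
import math
import numpy as np

n = 3
# dense 3-index Levi-Civita symbol: eps[i,j,k] = (i-j)(j-k)(k-i)/2 for indices 0,1,2
eps = np.array([[[(i - j) * (j - k) * (k - i) / 2 for k in range(n)]
                 for j in range(n)] for i in range(n)])
Q = np.random.default_rng(0).normal(size=(n, n))

det_eps = np.einsum('ijk,pqr,ip,jq,kr->', eps, eps, Q, Q, Q) / math.factorial(n)
print(det_eps, np.linalg.det(Q))   # the two values agree
```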


2.1.10 Dual Tensors and Exterior Product of Vectors

In a space of dimension n , to any totally antisymmetric tensor T^{i_1 \ldots i_n} of rank n one associates the scalar capacity

\underline{t} = \frac{1}{n!}\, \underline{\varepsilon}_{i_1 \ldots i_n}\, T^{i_1 \ldots i_n} ,     (2.34)

while to any scalar capacity \underline{t} we can associate the totally antisymmetric tensor of rank n

T^{i_1 \ldots i_n} = \overline{\varepsilon}^{i_1 \ldots i_n}\, \underline{t} .     (2.35)

These two equations are consistent when taken together (introducing one into the other gives an identity). We say that the capacity \underline{t} is the dual of the tensor T , and that the tensor T is the dual of the capacity \underline{t} .

In a space of dimension n , given n vectors v_1 , v_2 \ldots v_n , one defines the scalar capacity \underline{w} = \underline{\varepsilon}_{i_1 \ldots i_n} (v_1)^{i_1} (v_2)^{i_2} \ldots (v_n)^{i_n} , or, using simpler notations,

\underline{w} = \underline{\varepsilon}_{i_1 \ldots i_n}\, v_1^{i_1} v_2^{i_2} \ldots v_n^{i_n} ,     (2.36)

that is called the exterior product of the n vectors. This exterior product is usually denoted

\underline{w} = v_1 \wedge v_2 \wedge \cdots \wedge v_n ,     (2.37)

and we shall sometimes use this notation, although it is not manifestly covariant (the number of 'bars' at the left and the right is not balanced).

The exterior product changes sign if the order of two vectors is changed, and is zero if the vectors are not linearly independent.
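A small numerical sketch of ours (not from the text) of equation 2.36 for n = 3: the exterior product computed with the Levi-Civita symbol equals the determinant of the matrix whose columns are the three vectors, changes sign under a swap, and vanishes for dependent vectors.

```python
# Exterior product of three vectors via the Levi-Civita symbol (equation 2.36).
import numpy as np

eps = np.array([[[(i - j) * (j - k) * (k - i) / 2 for k in range(3)]
                 for j in range(3)] for i in range(3)])
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([1.0, 2.0, 0.0])
v3 = np.array([0.0, 1.0, 3.0])

print(np.einsum('ijk,i,j,k->', eps, v1, v2, v3))   #  6.0
print(np.einsum('ijk,i,j,k->', eps, v2, v1, v3))   # -6.0  (sign change)
print(np.einsum('ijk,i,j,k->', eps, v1, v2, v1))   #  0.0  (dependent vectors)
```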


2.1.11 Capacity Element

Consider, at a point P of an n-dimensional manifold M , n vectors dr_1, dr_2, \ldots, dr_n of the tangent linear space (the notation dr is used to suggest that, later on, a limit will be taken, where all these vectors will tend to the zero vector). Their exterior product is

\underline{dv} = \underline{\varepsilon}_{i_1 \ldots i_n}\, dr_1^{i_1} dr_2^{i_2} \ldots dr_n^{i_n} ,     (2.38)

or, equivalently,

\underline{dv} = dr_1 \wedge dr_2 \wedge \cdots \wedge dr_n .     (2.39)

Let us see why this has to be interpreted as the capacity element associated to the n vectors dr_1, dr_2, \ldots, dr_n .

Assume that some coordinates x^i have been defined over the manifold, and that we choose the n vectors at point P each tangent to one of the coordinate lines at this point:

dr_1 = \begin{pmatrix} dx^1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} ; \quad dr_2 = \begin{pmatrix} 0 \\ dx^2 \\ \vdots \\ 0 \end{pmatrix} ; \quad \cdots ; \quad dr_n = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ dx^n \end{pmatrix} .     (2.40)

The n vectors, then, can be interpreted as the 'perturbations' of the n coordinates. The definition in equation 2.38 then gives

\underline{dv} = \underline{\varepsilon}_{12\ldots n}\, dx^1 dx^2 \ldots dx^n .     (2.41)

Using a notational abuse, this capacity element is usually written, in mathematical texts, as

dv(x) = dx1 ∧ dx2 ∧ · · · ∧ dxn , (2.42)

while in physical texts, using more elementary notations, one simply writes

dv = dx1 dx2 . . . dxn . (2.43)

This is the usual capacity element that appears in elementary calculus to develop the notion of integral. I say 'capacity element' and not 'volume element' because the 'volume' spanned by the vectors dr_1, dr_2, \ldots, dr_n will only be defined when the manifold M is a 'metric manifold', i.e., when the 'distance' between two points of the manifold is defined.

The capacity element \underline{dv} can be interpreted as the 'small hyperparallelepiped' defined by the 'small vectors' dr_1, dr_2, \ldots, dr_n , as suggested in figure 2.1 for a three-dimensional space.

Under a change of coordinates (see an explicit demonstration in appendix 5.2.1) one has

dx^{1'} \wedge \cdots \wedge dx^{n'} = \det \mathbf{X}' \; dx^1 \wedge \cdots \wedge dx^n .     (2.44)

This, of course, is just a special case of equation 2.19 (that defines a scalar capacity).


Figure 2.1: From three 'small vectors' in a three-dimensional space one defines the three-dimensional capacity element \underline{dv} = \underline{\varepsilon}_{ijk}\, dr_1^i\, dr_2^j\, dr_3^k , that can be interpreted as representing the 'small parallelepiped' defined by the three vectors. To this parallelepiped there is no true notion of 'volume' associated, unless the three-dimensional space is metric.



2.1.12 Integral

Consider an n-dimensional manifold M , with some coordinates x^i , and assume that a scalar density \overline{f}(x^1, x^2, \ldots) has been defined at each point of the manifold (this function being a density, its value at each point depends on the coordinates being used; an example of practical definition of such a scalar density is given in section 2.2.10).

Dividing each coordinate line in 'small increments' \Delta x^i divides the manifold M (or some domain D of it) in 'small hyperparallelepipeds' that are characterized, as we have seen, by the capacity element (equations 2.41–2.43)

\Delta \underline{v} = \underline{\varepsilon}_{12\ldots n}\, \Delta x^1 \Delta x^2 \ldots \Delta x^n = \Delta x^1 \Delta x^2 \ldots \Delta x^n .     (2.45)

At every point, we can introduce the scalar \Delta \underline{v}\; \overline{f}(x^1, x^2, \ldots) and, therefore, for any domain D \subset M , the discrete sum \sum \Delta \underline{v}\; \overline{f}(x^1, x^2, \ldots) can be considered, where only the 'hyperparallelepipeds' that are inside the domain D (or at the border of the domain) are taken into account (as suggested by figure 2.2).

Figure 2.2: The volume of an arbitrarily shaped, smooth domain D of a manifold M can be defined as the limit of a sum, using elementary regions adapted to the coordinates (regions whose elementary capacity is well defined).

The integral of the scalar density \overline{f} over the domain D is defined as the limit (when it exists)

I = \int_D \underline{dv}\; \overline{f}(x^1, x^2, \ldots) \;\equiv\; \lim \sum \Delta \underline{v}\; \overline{f}(x^1, x^2, \ldots) ,     (2.46)

where the limit corresponds to taking smaller and smaller 'cells', i.e., to considering an infinite number of them.

This defines an invariant quantity: while the capacity values \Delta \underline{v} and the density values \overline{f}(x^1, x^2, \ldots) essentially depend on the coordinates being used, the integral does not (the product of a capacity times a density is a tensor).

This invariance is trivially checked when taking seriously the notation \int_D \underline{dv}\; \overline{f} .

In a change of variables x \leftrightarrow x' , the two capacity elements \underline{dv}(x) and \underline{dv}'(x') are related via (equation 2.19)

\underline{dv}'(x') = \frac{1}{X(x')}\; \underline{dv}(\, x(x')\,)     (2.47)


(where X(x') is the Jacobian determinant \det\, \partial x^i / \partial x^{i'} ), as they are tensorial capacities, in the sense of section 2.1.4. Also, for a density we have

\overline{f}'(x') = X(x')\; \overline{f}(\, x(x')\,) .     (2.48)

In the coordinates x we have

I(D) = \int_{x \in D} \underline{dv}(x)\; \overline{f}(x) ,     (2.49)

and in the coordinates x' ,

I(D)' = \int_{x' \in D} \underline{dv}'(x')\; \overline{f}'(x') .     (2.50)

Using the two equations 2.47–2.48, we immediately obtain I(D) = I(D)' , showing that the integral of a density (integrated using the capacity element) is an invariant.
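The following Python sketch (an illustration of ours, not part of the text) makes the invariance concrete in the Euclidean plane: the density is f = 1 in Cartesian coordinates and, by equation 2.48 with Jacobian determinant X = r, it becomes f' = r in polar coordinates; both discrete sums of equation 2.46 approximate the area of the unit disk.

```python
# The same density integrated with the capacity element in two coordinate systems.
import numpy as np

# Cartesian coordinates: sum f * dx * dy over cells inside the unit disk (f = 1)
dx = dy = 2e-3
x, y = np.meshgrid(np.arange(-1, 1, dx) + dx / 2, np.arange(-1, 1, dy) + dy / 2)
I_cart = np.sum(x**2 + y**2 <= 1.0) * dx * dy

# Polar coordinates: sum f' * dr * dt over the same domain (f' = r)
dr = dt = 2e-3
r, t = np.meshgrid(np.arange(0, 1, dr) + dr / 2, np.arange(0, 2 * np.pi, dt) + dt / 2)
I_polar = np.sum(r) * dr * dt

print(I_cart, I_polar, np.pi)   # the three values are approximately equal
```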


2.1.13 Capacity Element and Change of Coordinates

Note: real text yet to be written; this is a first attempt at the demonstration. The demonstration is possibly wrong, as I have not yet taken care to properly define the new capacity element.

At a given point of an n-dimensional manifold we can consider the n vectors dr_1, \ldots, dr_n , associated to some coordinate system x^1, \ldots, x^n , and we have the capacity element

\underline{dv} = \underline{\varepsilon}_{i_1 \ldots i_n}\, (dr_1)^{i_1} \ldots (dr_n)^{i_n} .     (2.51)

In a change of coordinates x^i \mapsto x^{i'} , each of the n vectors shall have its components changed according to

(dr)^{i'} = \frac{\partial x^{i'}}{\partial x^i}\, (dr)^i = X^{i'}_{\;i}\, (dr)^i .     (2.52)

Reciprocally,

(dr)^i = \frac{\partial x^i}{\partial x^{i'}}\, (dr)^{i'} = X^i_{\;i'}\, (dr)^{i'} .     (2.53)

The capacity element introduced above can now be expressed as

\underline{dv} = \underline{\varepsilon}_{i_1 \ldots i_n}\, X^{i_1}_{\;i'_1} \ldots X^{i_n}_{\;i'_n}\, (dr_1)^{i'_1} \ldots (dr_n)^{i'_n} .     (2.54)

I guess that I can insert here the factor \frac{1}{n!}\, \overline{\varepsilon}^{i'_1 \ldots i'_n}\, \underline{\varepsilon}_{j'_1 \ldots j'_n} , to obtain

\underline{dv} = \underline{\varepsilon}_{i_1 \ldots i_n}\, X^{i_1}_{\;i'_1} \ldots X^{i_n}_{\;i'_n} \left( \tfrac{1}{n!}\, \overline{\varepsilon}^{i'_1 \ldots i'_n}\, \underline{\varepsilon}_{j'_1 \ldots j'_n} \right) (dr_1)^{j'_1} \ldots (dr_n)^{j'_n} .     (2.55)

If yes, then I would have

\underline{dv} = \left( \tfrac{1}{n!}\, \underline{\varepsilon}_{i_1 \ldots i_n}\, \overline{\varepsilon}^{i'_1 \ldots i'_n}\, X^{i_1}_{\;i'_1} \ldots X^{i_n}_{\;i'_n} \right) \underline{\varepsilon}_{j'_1 \ldots j'_n}\, (dr_1)^{j'_1} \ldots (dr_n)^{j'_n} ,     (2.56)

i.e., using the definition of determinant (third of equations 2.33),

\underline{dv} = \det \mathbf{X} \;\; \underline{\varepsilon}_{j'_1 \ldots j'_n}\, (dr_1)^{j'_1} \ldots (dr_n)^{j'_n} .     (2.57)

We recognize here the capacity element \underline{dv}' = \underline{\varepsilon}_{j'_1 \ldots j'_n}\, (dr_1)^{j'_1} \ldots (dr_n)^{j'_n} associated to the new coordinates. Therefore, we have obtained

\underline{dv} = \det \mathbf{X} \;\; \underline{dv}' .     (2.58)

This, of course, is consistent with the definition of a scalar capacity (equation 2.19).


2.2 Volume

2.2.1 Metric

OLD TEXT BEGINS. In some parameter spaces, there is an obvious definition of distance between points, and therefore of volume. For instance, in the 3D Euclidean space the distance between two points is just the Euclidean distance (which is invariant under translations and rotations). Should we choose to parameterize the position of a point by its Cartesian coordinates x, y, z , then,

Note: I have to talk about the commensurability of distances,

ds^2 = ds_r^2 + ds_s^2 ,     (2.59)

every time I have to define the Cartesian product of two spaces each with its own metric. OLD TEXT ENDS.

A manifold is called a metric manifold if there is a definition of distance between points, such that the distance ds between the point of coordinates x = \{x^i\} and the point of coordinates x + dx = \{x^i + dx^i\} can be expressed as4

ds^2 = g_{ij}(x)\, dx^i dx^j ,     (2.60)

i.e., if the notion of distance is 'of the L2 type'5. At every point of a metric manifold, therefore, there is a symmetric tensor g_{ij} defined, the metric tensor.

The inverse metric tensor, denoted g^{ij} , is defined by the condition

g^{ij}\, g_{jk} = \delta^i_{\;k} .     (2.61)

It can be demonstrated that, under a change of variables, its components change like the components of a contravariant tensor, whence the notation g^{ij} . Therefore, the equations defining the change of components of the metric and of the inverse metric are (see equations 2.15)

g_{i'j'} = X^i_{\;i'}\, X^j_{\;j'}\, g_{ij} \qquad \text{and} \qquad g^{i'j'} = X'^{i'}_{\;\;i}\, X'^{j'}_{\;\;j}\, g^{ij} .     (2.62)

In section 2.1.2, we introduced the matrices of partial derivatives. It is useful to also introduce two metric matrices, with respectively the covariant and contravariant components of the metric:

\mathbf{g} = \begin{pmatrix} g_{11} & g_{12} & g_{13} & \cdots \\ g_{21} & g_{22} & g_{23} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} ; \qquad \mathbf{g}^{-1} = \begin{pmatrix} g^{11} & g^{12} & g^{13} & \cdots \\ g^{21} & g^{22} & g^{23} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} ,     (2.63)

4This is a property that is valid for any coordinate system that can be chosen over the space.
5As a counterexample, in a two-dimensional manifold, the distance defined as ds = |dx^1| + |dx^2| is not of the L2 type (it is L1).


the notation \mathbf{g}^{-1} for the second matrix being justified by the definition 2.61, that now reads

\mathbf{g}^{-1}\, \mathbf{g} = \mathbf{I} .     (2.64)

In matrix notation, the change of the metric matrix under a change of variables, as given by the two equations 2.62, is written

\mathbf{g}' = \mathbf{X}^t\, \mathbf{g}\, \mathbf{X} ; \qquad \mathbf{g}'^{-1} = \mathbf{X}'\, \mathbf{g}^{-1}\, \mathbf{X}'^t .     (2.65)

If at every point P of a manifold M there is a metric g_{ij} defined, then the metric can be used to define a scalar product over the linear space tangent to M at P : given two vectors v and w , their scalar product is

v \cdot w \;\equiv\; g_{ij}\, v^i w^j .     (2.66)

One can also define the scalar product of two forms f and h at P (forms that belong to the dual of the linear space tangent to M at P ):

f \cdot h \;\equiv\; g^{ij}\, f_i h_j .     (2.67)

The norm of a vector v and the norm of a form f are respectively defined as \| v \| = \sqrt{v \cdot v} = \sqrt{g_{ij}\, v^i v^j} and \| f \| = \sqrt{f \cdot f} = \sqrt{g^{ij}\, f_i f_j} .


2.2.2 Bijection Between Forms and Vectors

Let \{e_i\} be the basis of a linear space, and \{e^i\} the dual basis (that, as we have seen, is a basis of the dual space).

In the absence of a metric, there is no natural association between vectors and forms. When there is a metric, to a vector v^i e_i we can associate a form whose components on the dual basis \{e^i\} are

v_i \;\equiv\; g_{ij}\, v^j .     (2.68)

Similarly, to a form f = f_i e^i , one can associate the vector whose components on the vector basis \{e_i\} are

f^i \;\equiv\; g^{ij}\, f_j .     (2.69)

[Note: Give here some of the properties of this association (that the scalar product is preserved, etc.).]


2.2.3 Kronecker Tensor (II)

The Kronecker tensors \delta^i_{\;j} and \delta_i^{\;j} are defined whether the space has a metric or not. When one has a metric, one can raise and lower indices. Let us, for instance, lower the first index of \delta^i_{\;j} :

\delta_{ij} \;\equiv\; g_{ik}\, \delta^k_{\;j} = g_{ij} .     (2.70)

Equivalently, let us raise one index of g_{ij} :

g^i_{\;j} \;\equiv\; g^{ik}\, g_{kj} = \delta^i_{\;j} .     (2.71)

These equations demonstrate that when there is a metric, the Kronecker tensor and the metric tensor are identical. Therefore, when there is a metric, we can drop the symbols \delta_i^{\;j} and \delta^i_{\;j} , and use the symbols g_i^{\;j} and g^i_{\;j} instead.


2.2.4 Fundamental Density

Let \mathbf{g} be the metric tensor of the manifold. For any (positively oriented) system of coordinates, we define the quantity \overline{g} , that we call the metric density (in the given coordinates), as

\overline{g} = \sqrt{\det \mathbf{g}} .     (2.72)

More explicitly, using the definition of determinant in the first of equations 2.33,

\overline{g} = \sqrt{ \tfrac{1}{n!}\, \overline{\varepsilon}^{i_1 i_2 \ldots i_n}\, \overline{\varepsilon}^{j_1 j_2 \ldots j_n}\, g_{i_1 j_1} g_{i_2 j_2} \ldots g_{i_n j_n} } .     (2.73)

This equation immediately suggests what it is possible to prove: the quantity \overline{g} so defined is a scalar density (at the right, we have two upper bars under a square root).

The quantity

\underline{g} = 1 / \overline{g}     (2.74)

is obviously a capacity, that we call the metric capacity. It could also have been defined as

\underline{g} = \sqrt{\det \mathbf{g}^{-1}} = \sqrt{ \tfrac{1}{n!}\, \underline{\varepsilon}_{i_1 i_2 \ldots i_n}\, \underline{\varepsilon}_{j_1 j_2 \ldots j_n}\, g^{i_1 j_1} g^{i_2 j_2} \ldots g^{i_n j_n} } .     (2.75)


2.2.5 Bijection Between Capacities, Tensors, and Densities

As mentioned in section 2.1.4, (i) the product of a capacity by a density is a tensor, (ii) the product of a tensor by a density is a density, and (iii) the product of a tensor by a capacity is a capacity. So, when there is a metric, we have a natural bijection between capacities and tensors, and between tensors and densities.

For instance, to a tensor capacity \underline{t}^{ij\ldots}_{\;\;\;k\ell\ldots} we can associate the tensor

t^{ij\ldots}_{\;\;\;k\ell\ldots} \;\equiv\; \overline{g}\; \underline{t}^{ij\ldots}_{\;\;\;k\ell\ldots} ,     (2.76)

to a tensor s^{ij\ldots}_{\;\;\;k\ell\ldots} we can associate the tensor density

\overline{s}^{ij\ldots}_{\;\;\;k\ell\ldots} \;\equiv\; \overline{g}\; s^{ij\ldots}_{\;\;\;k\ell\ldots}     (2.77)

and the tensor capacity

\underline{s}^{ij\ldots}_{\;\;\;k\ell\ldots} \;\equiv\; \underline{g}\; s^{ij\ldots}_{\;\;\;k\ell\ldots} ,     (2.78)

and to a tensor density \overline{r}^{ij\ldots}_{\;\;\;k\ell\ldots} we can associate the tensor

r^{ij\ldots}_{\;\;\;k\ell\ldots} \;\equiv\; \underline{g}\; \overline{r}^{ij\ldots}_{\;\;\;k\ell\ldots} .     (2.79)

Equations 2.76–2.79 introduce an important notation (that seems to be novel): in the bijections defined by the metric density and the metric capacity, we keep the same letter for the tensors, and we just put bars or take out bars, much like in the bijection between vectors and forms defined by the metric, where we keep the same letter, and we raise or lower indices.


2.2.6 Levi-Civita Tensor

From the Levi-Civita capacity \underline{\varepsilon}_{ij\ldots k} we can define the Levi-Civita tensor \varepsilon_{ij\ldots k} as

\varepsilon_{ij\ldots k} = \overline{g}\; \underline{\varepsilon}_{ij\ldots k} .     (2.80)

Explicitly, this gives

\varepsilon_{ijk\ldots} \;=\; \begin{cases} +\sqrt{\det \mathbf{g}} & \text{if } ijk\ldots \text{ is an even permutation of } 12\ldots n \\ \;\;0 & \text{if some indices are identical} \\ -\sqrt{\det \mathbf{g}} & \text{if } ijk\ldots \text{ is an odd permutation of } 12\ldots n . \end{cases}     (2.81)

Alternatively, from the Levi-Civita density \overline{\varepsilon}^{ij\ldots k} we could have defined the contravariant tensor \varepsilon^{ij\ldots k} as

\varepsilon^{ij\ldots k} = \underline{g}\; \overline{\varepsilon}^{ij\ldots k} .     (2.82)

It can be shown that \varepsilon^{ij\ldots k} can be obtained from \varepsilon_{ij\ldots k} using the metric to raise the indices, so \varepsilon_{ij\ldots k} and \varepsilon^{ij\ldots k} are the same tensor (whence the notation).


2.2.7 Volume Element

We may here start by remembering equation 2.38,

\underline{dv} = \underline{\varepsilon}_{i_1 \ldots i_n}\, dr_1^{i_1} dr_2^{i_2} \ldots dr_n^{i_n} ,     (2.83)

that expresses the capacity element defined by n vectors dr_1 , dr_2 \ldots dr_n . In the special situation where the n vectors are taken successively along each of the n coordinate lines, this gives (equation 2.43) \underline{dv} = dx^1 dx^2 \ldots dx^n . The dx^i in this expression are mere coordinate increments, that bear no relation to a length. As we are now working under the hypothesis that we have a metric, we know that the length associated to the coordinate increment, say, dx^1 , is6 ds = \sqrt{g_{11}}\, dx^1 . If the coordinate lines were orthogonal at the considered point, then the volume element, say dv , associated to the capacity element \underline{dv} = dx^1 dx^2 \ldots dx^n would be dv = \sqrt{g_{11}}\, dx^1\, \sqrt{g_{22}}\, dx^2 \ldots \sqrt{g_{nn}}\, dx^n . If the coordinates are not necessarily orthogonal, this expression needs, of course, to be generalized.

One of the major theorems of integration theory is that the actual volume associated to the hyperparallelepiped characterized by the capacity element \underline{dv} , as expressed by equation 2.83, is

dv = \overline{g}\; \underline{dv} ,     (2.84)

where \overline{g} is the metric density introduced above. [Note: Should I give a demonstration of this property here?] We know that \underline{dv} is a capacity, and \overline{g} a density. Therefore, the volume element dv , being the product of a density by a capacity, is a true scalar. While \underline{dv} has been called a 'capacity element', dv is called a volume element.

The overbar in \overline{g} is to remember that the determinant of the metric tensor is a density, in the tensorial sense of section 2.1.4, while the underbar in \underline{dv} is to remember that the 'capacity element' is a capacity in the tensorial sense of the term. In equation 2.84, the product of a density times a capacity gives the volume element dv , that is a true scalar (i.e., a scalar whose value is independent of the coordinates being used). In view of equation 2.84, we can call \overline{g}(x) the 'density of volume' in the coordinates x = \{x^1, \ldots, x^n\} . For short, we shall call \overline{g}(x) the volume density7. It is important to realize that the values \overline{g}(x) do not represent any intrinsic property of the space, but, rather, a property of the coordinates being used.

Example 2.2 In the Euclidean 3D space, using spherical coordinates x = \{r, \theta, \varphi\} , as the length element is ds^2 = dr^2 + r^2 \sin^2\theta\, d\varphi^2 + r^2\, d\theta^2 , the metric matrix is

\begin{pmatrix} g_{rr} & g_{r\theta} & g_{r\varphi} \\ g_{\theta r} & g_{\theta\theta} & g_{\theta\varphi} \\ g_{\varphi r} & g_{\varphi\theta} & g_{\varphi\varphi} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & r^2 & 0 \\ 0 & 0 & r^2 \sin^2\theta \end{pmatrix} .     (2.85)

6Because the length of a general vector with components dx^i is ds^2 = g_{ij}\, dx^i dx^j .
7So we now have two names for \overline{g} , the 'metric density' and the 'volume density'.


and the metric determinant is

\overline{g} = \sqrt{\det \mathbf{g}} = r^2 \sin\theta .     (2.86)

As the capacity element of the space can be expressed (using notations that are not manifestly covariant)

\underline{dv} = dr\, d\theta\, d\varphi ,     (2.87)

the expression dv = \overline{g}\; \underline{dv} gives the volume element

dv = r^2 \sin\theta\; dr\, d\theta\, d\varphi .     (2.88)


2.2.8 Volume Element and Change of Variables

Assume that one has an n-dimensional manifold M and two coordinate systems, say \{x^1, \ldots, x^n\} and \{x^{1'}, \ldots, x^{n'}\} . If the manifold is metric, the components of the metric tensor can be expressed both in the coordinates x and in the coordinates x' . The (unique) volume element, say dv , accepts the two different expressions

dv = \sqrt{\det \mathbf{g}_x}\; dx^1 \wedge \cdots \wedge dx^n = \sqrt{\det \mathbf{g}_{x'}}\; dx^{1'} \wedge \cdots \wedge dx^{n'} .     (2.89)

The Jacobian matrices of the transformation (matrices with the partial derivatives), \mathbf{X} and \mathbf{X}' , have been introduced in section 2.1.3. The components of the metric are related through

g_{i'j'} = X^i_{\;i'}\, X^j_{\;j'}\, g_{ij} ,     (2.90)

or, using matrix notations, \mathbf{g}_{x'} = \mathbf{X}^t\, \mathbf{g}_x\, \mathbf{X} . Using the identities \det \mathbf{g}_{x'} = \det(\mathbf{X}^t\, \mathbf{g}_x\, \mathbf{X}) = (\det \mathbf{X})^2\, \det \mathbf{g}_x , one arrives at

\sqrt{\det \mathbf{g}_{x'}} = \det \mathbf{X}\; \sqrt{\det \mathbf{g}_x} .     (2.91)

This is the relation between the two fundamental densities associated to each of the two coordinate systems. Of course, this corresponds to equation 2.18 (page 29), used to define scalar densities.
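A numerical check of equations 2.90–2.91 for the change Cartesian → spherical in the Euclidean 3D space may be useful (the sketch below is ours, not part of the text); the Jacobian determinant of x(r, θ, ϕ) is r² sin θ.

```python
# Check sqrt(det g_spherical) = det X * sqrt(det g_cartesian) at one point.
import numpy as np

r, th, ph = 2.0, 0.7, 1.3
# Jacobian matrix X^i_{i'} = d(x, y, z) / d(r, theta, phi)
X = np.array([
    [np.sin(th) * np.cos(ph), r * np.cos(th) * np.cos(ph), -r * np.sin(th) * np.sin(ph)],
    [np.sin(th) * np.sin(ph), r * np.cos(th) * np.sin(ph),  r * np.sin(th) * np.cos(ph)],
    [np.cos(th),              -r * np.sin(th),               0.0],
])
g_cart = np.eye(3)                 # Cartesian metric
g_sph = X.T @ g_cart @ X           # equation 2.90 in matrix form

print(np.sqrt(np.linalg.det(g_sph)),
      np.linalg.det(X) * np.sqrt(np.linalg.det(g_cart)),
      r**2 * np.sin(th))           # all three values coincide
```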


2.2.9 Volume of a Domain

With the volume element available, we can now define the volume of a domain D of the manifold M , that we shall denote as

V(D) = \int_D dv ,     (2.92)

by the expression

\int_D dv \;\equiv\; \int_D \underline{dv}\; \overline{g} .     (2.93)

This definition makes sense because we have already defined the integral of a density in equation 2.46. Note that the (finite) capacity of a finite domain D cannot be defined, as an expression like \int_D \underline{dv} would not make any (invariant) sense.

We have here defined equation 2.92 in terms of equation 2.93, but it may well happen that, in numerical evaluations of an integral, the division of the space into small 'hyperparallelepipeds' that is implied by the use of the capacity element is not the best choice. Figure 2.3 suggests a division of the space into 'cells' having grossly similar volumes (to be compared with figure 2.4). If the volume \Delta v_p of each cell is known, the volume of a domain D can obviously be defined as a limit

V(D) = \lim_{\Delta v_p \to 0} \sum_p \Delta v_p .     (2.94)

We will discuss this point further in later chapters.

Figure 2.3: The volume of an arbitrarily shaped, smooth domain D of a manifold M can be defined as the limit of a sum, using elementary regions whose individual volume is known (for instance, triangles in this 2D illustration). This way of defining the volume of a region does not require the definition of a coordinate system over the space.

Figure 2.4: For the same shape of figure 2.3, the volume can be evaluated using, for instance, a polar coordinate system. In a numerical integration, regions near the origin may be oversampled, while regions far from the origin may be undersampled. In some situations, this problem may become crucial, so this sort of 'coordinate integration' is to be reserved for analytical developments only.

The finite volume obviously satisfies the following two properties:


• for any domain D of the manifold, V(D) ≥ 0 ;

• if D1 and D2 are two disjoint domains of the manifold, then V(D1 ∪ D2) = V(D1) + V(D2) .


2.2.10 Example: Mass Density and Volumetric Mass

Imagine that a large number of particles of equal mass are distributed in the physical space (assimilated to a Euclidean 3D space) and that, for some reason, we choose to work with cylindrical coordinates \{r, \varphi, z\} . Choosing small increments \{\Delta r, \Delta\varphi, \Delta z\} of the coordinates, we divide the space into cells of equal capacity, that (using notations that are not manifestly covariant) is given by

\Delta \underline{v} = \Delta r\, \Delta\varphi\, \Delta z .     (2.95)

We can count how many particles are inside each cell (see figure 2.5), and, therefore, what the mass \Delta m inside each cell is. The quantity \Delta m / \Delta \underline{v} , being the ratio of a scalar by a capacity, is a density. In the limit of an infinite number of particles, we can take the limit where \Delta r , \Delta\varphi , and \Delta z all tend to zero, and the limit

\overline{\rho}(r, \varphi, z) = \lim_{\Delta r \to 0,\, \Delta\varphi \to 0,\, \Delta z \to 0} \frac{\Delta m}{\Delta \underline{v}}     (2.96)

is the mass density at point r,ϕ, z .

Figure 2.5: We consider, in a Euclidean 3D space, a cylinder with a circular basis of radius 1, and cylindrical coordinates (r, \varphi, z) . Only a section of the cylinder is represented in the figure, with all its thickness, \Delta z , projected on the drawing plane. At the left, we have represented a 'map' of the corresponding circle, and, at the middle, the coordinate lines on a 'metric representation' of the space. By construction, all the 'cells' in the middle have the same capacity \Delta \underline{v} = \Delta r\, \Delta\varphi\, \Delta z . The points represent particles with given masses. As explained in the text, counting how many particles are inside each cell directly gives an estimation of the 'mass density' \overline{\rho}(r, \varphi, z) . To have, instead, a direct estimation of the 'volumetric mass' \rho(r, \varphi, z) , a division of the space into cells of equal volume (not equal capacity) should have been done, as suggested at the right.

Given the mass density \overline{\rho}(r, \varphi, z) , the total mass inside a domain D of the space is to be obtained as

M(D) = \int_D \underline{dv}\; \overline{\rho} ,     (2.97)


where the capacity element \underline{dv} appears, not the volume element dv . If instead of dividing the space into cells of equal capacity \Delta \underline{v} , we divide it into cells of equal volume \Delta v (as suggested at the right of figure 2.5), then the limit

\rho(r, \varphi, z) = \lim_{\Delta v \to 0} \frac{\Delta m}{\Delta v}     (2.98)

gives the volumetric mass \rho(r, \varphi, z) , different from the mass density \overline{\rho}(r, \varphi, z) . Given the volumetric mass \rho(r, \varphi, z) , the total mass inside a domain D of the space is to be obtained as

M(D) = \int_D dv\; \rho ,     (2.99)

where the volume element dv appears, not the capacity element \underline{dv} . The relation between the mass density \overline{\rho} and the volumetric mass \rho is the universal relation between any scalar density and a scalar,

\overline{\rho} = \overline{g}\, \rho ,     (2.100)

where \overline{g} is the metric density. As, in cylindrical coordinates, \overline{g} = r , the relation between mass density and volumetric mass is

\overline{\rho}(r, \varphi, z) = r\, \rho(r, \varphi, z) .     (2.101)
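The distinction can be visualized with a tiny simulation of ours (not from the text): points distributed uniformly (in the Euclidean sense) inside a disk of radius 1, counted either in cells of equal capacity (equal increments of r) or in cells of equal volume; the first count grows roughly linearly with r, as the relation \overline{\rho} = r\,\rho predicts for a constant volumetric mass, while the second is roughly flat.

```python
# Equal-capacity versus equal-volume cells for uniformly distributed points.
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
r = np.sqrt(rng.uniform(0.0, 1.0, N))       # radii of points uniform in the unit disk

edges_capacity = np.linspace(0.0, 1.0, 11)          # equal increments of r
edges_volume = np.sqrt(np.linspace(0.0, 1.0, 11))   # equal areas

print(np.histogram(r, edges_capacity)[0])   # counts grow roughly linearly with r
print(np.histogram(r, edges_volume)[0])     # counts roughly constant
```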

It is unfortunate that in common physical terminology the terms 'mass density' and 'volumetric mass' are used as synonyms. While for common applications this does not pose any problem, there is sometimes a serious misunderstanding in probability theory about the meaning of a 'probability density' and of a 'volumetric probability'.


2.3 Mappings

Note: explain here that we consider a mapping from a p-dimensional manifold M into a q-dimensional manifold N .

2.3.1 Image of the Volume Element

This section is very provisional. I do not know yet how much of it I will need. We have a p-dimensional manifold, with coordinates x^a = \{x^1, \ldots, x^p\} . At a given point, we have p vectors dx_1, \ldots, dx_p . The associated capacity element is

\underline{dv} = \underline{\varepsilon}_{a_1 \ldots a_p}\, (dx_1)^{a_1} \ldots (dx_p)^{a_p} .     (2.102)

We also have a second, q-dimensional manifold, with coordinates \psi^\alpha = \{\psi^1, \ldots, \psi^q\} . At a given point, we have q vectors d\psi_1, \ldots, d\psi_q . The associated capacity element is

\underline{d\omega} = \underline{\varepsilon}_{\alpha_1 \ldots \alpha_q}\, (d\psi_1)^{\alpha_1} \ldots (d\psi_q)^{\alpha_q} .     (2.103)

Consider now an application

x 7→ ψ = ψ(x) (2.104)

from the first into the second manifold. I examine here the case

p ≤ q , (2.105)

i.e., the case where the dimension of the first manifold is smaller than or equal to that of the second manifold. When transporting the p vectors dx_1, \ldots, dx_p from the first into the second manifold (via the application \psi(x) ), this will define on the q-dimensional manifold a p-dimensional capacity element, \underline{d\omega}_p . We wish to relate \underline{d\omega}_p and \underline{dv} .

It is shown in the appendix (check!) that one has

\underline{d\omega}_p = \sqrt{\det(\Psi^t\, \Psi)}\;\; \underline{dv} .     (2.106)

Let us be interested in the image of the volume element. We denote by g the metric tensor of the first manifold, and by \gamma the metric tensor of the second manifold.

Bla, bla, bla, and it follows from this that when letting d\omega_p be the (p-dimensional) volume element obtained in the (q-dimensional) second manifold by transport of the volume element dv of the first manifold, one has

\frac{d\omega_p}{dv} = \frac{\sqrt{\det(\Psi^t\, \gamma\, \Psi)}}{\sqrt{\det \mathbf{g}}} .     (2.107)

2.3.2 Reciprocal Image of the Volume Element

I do not know yet if I will need this section.


Chapter 3

Probabilities

Introduction to be written here.

Note: Throughout this chapter, I should always say "the probability function P " and "the probability value P[A] ".


3.1 Introduction

3.1.1 Preliminary Remarks

While probabilities over a discrete set are quite easy to introduce, in most of our applications we shall consider probabilities defined over a manifold, and the introduction of these probabilities requires care. In this respect, I disagree with some of the notions usually presented in the literature. Let me explain why.

A “probability function” over a set is a mapping that to some subsets of the givenset associates a real nonnegative number. When the considered set is, in fact, a (finite-dimensional) manifold, say M , it is a nontrivial question to discover to which kindof subsets of M one can unambiguously associate a probability value. Here, I do aseveryone else, and choose to associate a probability value only to a special class ofsubsets of M , those that belong to the “Borel field” of M (the set of all the subsets[of points] of M having a “reasonably simple shape”). Then, a probability functioncan axiomatically be introduced as the mapping that to every set A of the Borel fieldof M associates a number, say P[A] , the mapping satisfying some conditions (forinstance, the well known set of Kolmogorov axioms).

Once a probability function P has been introduced over a manifold M , one canpass to the next nontrivial definition, that of “conditional probability function”. IfC is a given set of the Borel field of M , with P[C] 6= 0 , the “conditional proba-bility function given C ” is the probability function over M that to any set A ofthe Borel field of M associates a value, denoted P[A|C] , and defined as P[A|C] =P[A∩C] / P[C] .

So far, so good. But in all the interesting situations we shall face (when interpret-ing physical measurements), the “conditioning set” C will not be an ordinary subsetof M , but a submanifold of M , i.e., a manifold N contained in M , but having a lowernumber of dimensions. At this point, one may have the illusion that this is a specialcase of the definition just mentioned, as a submanifold N of M can be seen as thelimit of an “ordinary set” C of M . But problems arise when one realizes that onecan define as many conditional probability functions (given N ) as there are waysof taking the limit C → N . And, with the sole elements so introduced there is no“natural” way of taking the limit. Unfortunately, the problem is well known, and itsurfaces from time to time. The difficulties then encountered are called “paradoxes”(like the so-called Borel paradox, see appendix 5.3.10). Those are no paradoxes, justthe effects of loose definitions.

This problem is solved by just imposing that the convergence of the set C ⊆M into the submanifold N , must be uniform, but this is a notion that can only beintroduced if the manifold M is a metric manifold, i.e., a manifold where the distancebetween points is defined, and is locally Euclidean (i.e., results from a length elementds2 ). (Note: I have to check if I need a metric manifold or just a manifold with a notionof distance between points [not necessarily derivable from a ds2 .]) (Note: explainthat, more dramatically, the notion of center of a probability distribution [“mean value”]


can be totally meaningless in the absence of a notion of distance over a manifold [thedefinition is in appendix 5.3.17].)

I think that most mathematicians are reluctant to introduce the extra structurebrought by a metric over the manifold. But in doing so, they leave probability theoryincomplete (and open the way to the mentioned “paradoxes”). I will emphasize inchapter 4 that the abstract manifolds that we will introduce (to represent physicalmeasurements) do accept a (nontrivial) metric.

We will not always need to consider metric manifolds. But there is another impor-tant problem encountered in practical applications that also requires the introductionof an extra structure on the manifold. I claim that some of the most basic inferenceproblems can (and must) be solved using the notion of “intersection of probabili-ties”, the analogue for probabilities of the intersection of sets1. We shall see that thisvery basic, and necessary, notion —the intersection of two probability functions– canonly be considered if one has introduced over the manifold a notion of volume, i.e.,if to every set A of the Borel field is attached a positive real number V[A] , calledthe volume of the set. This volume must be defined intrinsically (i.e., independentlyof any possible choice of coordinates over the manifold). Practically, this means thatone special “measure” function is introduced over the manifold M , that is very spe-cial in that all the probability functions to be considered over M must be “absolutelycontinuous” with respect to the volume measure (i.e., all the probability functions tobe introduced must give zero probability to the sets of zero volume). (Note: I haveto be careful here, as some of the probability functions to be considered may be deltadistributions.)

1Much in the way as the intersection of “fuzzy sets” is an analogue of the intersection of sets.


3.1.2 What we Shall Do in this Chapter

The probabilities we are going to formalize below are intended to be used as mathematical models for different "real world" situations. The two most common situations are as follows.

Consider a physical process producing "outcomes" that, when analyzed from a certain point of view, look "random" (in the intuitive sense of the word, without needing to enter into any metaphysical discussion about the reality of the apparent randomness). Then a probabilistic (mathematical) model can be used to represent the physical process. The better the probabilistic model, the better the predictions that can be made about some outcomes (unknown or yet to materialize). Should we have access to a potentially infinite sequence of outcomes, the best probabilistic model would simply be obtained by taking the limit of properly defined "histograms".

As a second example, one may wish to use probabilities to represent imperfectstates of information. In particular, any measurement of a physical quantity hasattached uncertainties, and the result of a measurement is well expressed as a prob-ability distribution on the possible values of the quantity being measured. Typically,this probability distribution results in part from statistical considerations, and, inpart, from a more or less subjective estimation of other uncertainties not amenableto statistical treatment (note: mention here ISO). An extreme case of this is when the“measurement” is, in fact, only an educated guess (as when using some forms of“a priori” information in a Bayesian reasoning).

To fix ideas, let us start with the first situation, when we use probabilities to model some "random process". The outcomes could be produced by some physical process, but we will more simply assume that they are generated by some random algorithm (practical algorithms are, in fact, pseudo-random). With this situation in mind, it will be possible to illustrate the basic ideas in the chapter, without the need to immediately enter into all the technicalities that will later be necessary.

In the first part of this discussion, let us assume that the outcomes are elementsof a discrete set (and to make the discussion simpler, consider that the number ofelements is finite). In the second part, each outcome will be a point on a manifold,and this will require some extra structure that is better not to introduce yet.

Evaluating a Probability (Intuitive Introduction)

Consider, then, a discrete and finite set, say A_0 , with elements \{a_1, a_2, \ldots, a_N\} , and some random algorithm that produces, on demand, sample elements of A_0 . For instance, if the algorithm is run n times, we may successively obtain a_7, a_1, a_7, a_7, a_3 \ldots . Alternatively, we may have n copies of the same algorithm (with different "random seeds") and run each algorithm once. When the two situations are "indistinguishable", we say that the sample elements are "independent".

We can use the random algorithm to generate n sample elements, and denote by k_i the number of times the element a_i has materialized. The "experimental frequency"


of a_i is defined as f(a_i; n) = k_i / n . If, when n \to \infty , the experimental frequencies f(a_i; n) converge to some definite values, say p(a_i) , then we say that the probability of the element a_i is p(a_i) , and we write

p(a_i) = \lim_{n \to \infty} f(a_i; n) = \lim_{n \to \infty} \frac{k_i}{n} .     (3.1)

Let A be a subset of A_0 . The probability of the set A , denoted P[A] , is defined, for obvious reasons, as

P[A] = \sum_{a \in A} p(a) ,     (3.2)

where a denotes a generic element of A_0 . One has P[\emptyset] = 0 and P[A_0] = 1 . The function P that to any set A \subseteq A_0 associates the value P[A] is called a probability function (or, for short, a probability), while the number P[A] is the probability value of the set A . For any a \in A_0 , p(a) is the elementary probability value of the element a .
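A short simulation of ours (with arbitrary illustrative values) shows the frequencies of equation 3.1 converging to the elementary probabilities, and the probability value of a subset computed as in equation 3.2.

```python
# Frequencies converging to elementary probabilities over a five-element set.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.3, 0.05, 0.35, 0.2])      # p(a_1) ... p(a_5)

for n in (100, 10_000, 1_000_000):
    samples = rng.choice(5, size=n, p=p)
    print(n, np.bincount(samples, minlength=5) / n)   # approaches p

A = [1, 3]                                      # the subset {a_2, a_4}
print(p[A].sum())                               # P[A] = 0.65
```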

Image of a Probability

Let A_0 be a discrete and finite set with elements \{a_1, a_2, \ldots, a_N\} , B_0 a discrete and finite set with elements \{b_1, b_2, \ldots, b_N\} , and \varphi a mapping from A_0 into B_0 . For any a \in A , we write a \mapsto b = \varphi(a) . To any probability P over A_0 we can associate a (unique) probability over B_0 , that we shall call the image of P , and shall denote \varphi[P] , as follows. Let \{a, a', a'', a''', \ldots\} be a set of elements of A_0 that are sample elements obtained from the probability P . We can then consider the set \{\varphi(a), \varphi(a'), \varphi(a''), \varphi(a'''), \ldots\} of elements of B_0 . By definition, this shall be a set of sample elements of the probability \varphi[P] , image of the probability P via the mapping \varphi .

As shown below, if p(\cdot) is the elementary probability associated with the probability P , and q(\cdot) is the elementary probability associated with Q = \varphi[P] , then, for any element b \in B_0 ,

q(b) = \sum_{a\,:\,\varphi(a) = b} p(a) .     (3.3)
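In code (a sketch of ours with an illustrative mapping), equation 3.3 is a simple accumulation:

```python
# Image of a discrete probability under a mapping phi: A0 (5 elements) -> B0 (3 elements).
import numpy as np

p = np.array([0.1, 0.3, 0.05, 0.35, 0.2])   # elementary probability on A0
phi = np.array([0, 2, 0, 1, 2])             # phi(a_i) = b_{phi[i]}

q = np.zeros(3)
np.add.at(q, phi, p)                        # q(b) = sum of p(a) over all a with phi(a) = b
print(q, q.sum())                           # [0.15 0.35 0.5], total 1.0
```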

Intersection of Probabilities

Given a set A0 , we can consider different probabilities over A0 . For instance, P1may be the probability function that to any A ⊂ A0 associates the probability valueP1[ A ] , while P2 may be the probability function that to any A ⊂ A0 associates theprobability value P2[ A ] . When assimilating a probability to a random algorithm,having two probabilities, means, of course, having two random algorithms.

Assume that we play the following game. We generate a pair of elements of A0 ,one using the random algorithm P1 , and one using the random algorithm P2 . These


two elements may be distinct or they may, in fact, be the same element. When they are distinct we ignore them, and when it is the same element, we keep it. We can then make the histogram of the elements obtained in that way (the histogram is made as outlined above). This histogram will, in the limit when the number of elements tends to infinity, define a new probability over A_0 , that we shall denote

P3 = P1 ∩ P2 (3.4)

and call the intersection of the two probabilities P_1 and P_2 . We shall see below that this operation deserves the name of "intersection" because it generalizes the notion of intersection of sets. We shall also see that the intersection of two probabilities can easily be expressed in terms of the associated elementary probabilities: one has, for any a \in A_0 ,

p_3(a) = \frac{1}{\nu}\, p_1(a)\, p_2(a) ,     (3.5)

where \nu is the normalization constant \nu = \sum_{a \in A_0} p_1(a)\, p_2(a) .
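The corresponding computation (a sketch of ours, values illustrative) is a normalized elementwise product:

```python
# Intersection of two discrete probabilities: normalized product (equation 3.5).
import numpy as np

p1 = np.array([0.1, 0.3, 0.05, 0.35, 0.2])
p2 = np.array([0.4, 0.1, 0.1, 0.1, 0.3])

p3 = p1 * p2
p3 /= p3.sum()          # nu = sum_a p1(a) p2(a)
print(p3)
```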

Reciprocal Image of a Probability

While the image of a probability generalizes, in some sense, the notion of image ofa set, we wish now to generalize for probabilities the notion of reciprocal image ofa set. So, let A0 be a discrete and finite set with elements a1, a2, . . . , aN , B0 adiscrete and finite set with elements b1, b2, . . . , bN , and ϕ a mapping from A0into B0 . For any a ∈ A , we write a 7→ b = ϕ(a) . To any probability Q over B0we wish to associate a (unique) probability over A0 , that we shall call the reciprocalimage of Q , and shall denote ϕ-1[ Q ] . The way is not obvious. For we may justremember two of the properties relating images and reciprocal images of sets (seeequations 1.26 and 1.27):

A ⊆ ϕ-1[ϕ[A] ] ; ϕ[ϕ-1[ B ] ] ⊆ B . (3.6)

(Is this useful?) Given Q , there are many P's such that Q = \varphi[P] . Then we sample A_0 homogeneously, we sample B_0 following Q , and we keep . . . when . . . This defines \varphi^{-1}[Q] . The formula is . . . Bla, bla, bla . . .

The Popperian Problem

Let A_0 be the "space of possible models", B_0 the "space of possible observables", and \varphi a mapping from A_0 into B_0 (the "prediction mapping"). We start with the a priori information that "the true model" belongs to A \subseteq A_0 :

a ∈ A ⊆ A0 . (3.7)


If a measurement gives us the information that "the true observable" belongs to B \subseteq B_0 ,

b \in B \subseteq B_0 ,     (3.8)

then, as, necessarily,

b = \varphi(a) ,     (3.9)

the true model must satisfy

a \in A \cap \varphi^{-1}[B] .     (3.10)

As A \cap \varphi^{-1}[B] \subseteq A , we may have actually reduced the domain where the true model can be. All the models, if any, that were in A but that are not in A \cap \varphi^{-1}[B] have been falsified.

By the way, as shown elsewhere in this text, one also has

b ∈ ϕ[ A∩ϕ-1[ B ] ] = ϕ[ A ]∩ B . (3.11)

As \varphi[A] \cap B \subseteq B , we may have actually reduced the domain where the true observable can be. All the observables, if any, that were in B but that are not in \varphi[A] \cap B \subseteq B have also been falsified.

We can imagine many algorithms that actually find the set A \cap \varphi^{-1}[B] and the set \varphi[A] \cap B \subseteq B . The simplest consists in visiting all the a \in A , in evaluating b = \varphi(a) for all of them, and in keeping the pair \{a, b\} only if b \in B . The set of all the a that have been kept is A \cap \varphi^{-1}[B] , while the set of all the b that have been kept is \varphi[A] \cap B \subseteq B .

But we need to consider another algorithm, much less efficient, but that opens the way to "probabilization". Bla, bla, bla . . .

We have a probability P over A_0 , a probability Q over B_0 , and a mapping \varphi from A_0 into B_0 .


3.1.3 A Collection of Formulas; I: Discrete Probabilities

Image of a Probability Function: Q = \varphi[P]

q(b) = \sum_{a\,:\,\varphi(a) = b} p(a) .     (3.12)

Intersection of Two Probability Functions: P_3 = P_1 \cap P_2

p_3(a) = \frac{1}{\nu}\, p_1(a)\, p_2(a) .     (3.13)

Reciprocal Image of a Probability Function: P = \varphi^{-1}[Q]

p(a) = \frac{1}{\nu}\, q(\varphi(a)) .     (3.14)

Bayes-Popper Inference: ( \varphi[\, P \cap \varphi^{-1}[Q]\,] = \varphi[P] \cap Q )

Expression of the term P_2 = P_1 \cap \varphi^{-1}[Q_1] :

p_2(a) = \frac{1}{\nu}\, p_1(a)\, q_1(\varphi(a)) .     (3.15)

Expression of the term Q_2 = \varphi[\, P_1 \cap \varphi^{-1}[Q_1]\,] = \varphi[P_1] \cap Q_1 :

q_2(b) = \frac{1}{\nu} \left( \sum_{a\,:\,\varphi(a) = b} p_1(a) \right) q_1(b) .     (3.16)

Note: q_2 is both the intersection of the image of p_1 with q_1 , and the image of p_2 .

Marginals of a Conditional:

p_2(a) = \frac{1}{\nu}\, p_1(a)\, q_1(\varphi(a)) .     (3.17)

q_2(b) = \frac{1}{\nu} \left( \sum_{a\,:\,\varphi(a) = b} p_1(a) \right) q_1(b) .     (3.18)

Bayes-Popper and marginals of the conditional are identical.
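The discrete formulas 3.15–3.16 are easy to exercise numerically; the sketch below (ours, reusing the illustrative p, q and \varphi of the previous examples) also checks that the image of p_2 coincides with the direct evaluation of q_2.

```python
# Discrete Bayes-Popper inference (equations 3.15-3.16).
import numpy as np

p1 = np.array([0.1, 0.3, 0.05, 0.35, 0.2])
q1 = np.array([0.5, 0.2, 0.3])
phi = np.array([0, 2, 0, 1, 2])

p2 = p1 * q1[phi]
p2 /= p2.sum()                                 # equation 3.15

q2 = np.zeros(3); np.add.at(q2, phi, p2)       # image of p2

# equation 3.16 evaluated directly, for comparison
q_img = np.zeros(3); np.add.at(q_img, phi, p1)
q2_direct = q_img * q1
q2_direct /= q2_direct.sum()

print(q2, q2_direct)                           # the two expressions agree
```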


3.1.4 A Collection of Formulas; II: Probabilities over Manifolds

Image of a Probability Function: Q = \varphi[P]

This is the only situation where we do not need any special structure over the manifolds (volume or metric). In terms of probability densities, one has2:

\overline{g}(y) = \sum_{x\,:\,\varphi(x) = y} \frac{\overline{f}(x)}{\sqrt{\det\left( \Phi(x)^t\, \Phi(x) \right)}} .     (3.19)

If the manifolds are metric manifolds3, one can also express the image of a probability function in terms of volumetric probabilities:

g(Q) = \sum_{P\,:\,\varphi(P) = Q} f(P)\, \frac{\sqrt{\det \gamma(P)}}{\sqrt{\det\left( \Phi(P)^t\, \Gamma(\varphi(P))\, \Phi(P) \right)}} .     (3.20)

Intersection of Two Probability Functions: P_3 = P_1 \cap P_2

f_3(P) = \frac{1}{\nu}\, f_1(P)\, f_2(P) .     (3.21)

Equivalently, in terms of probability densities,

\overline{f}_3(x) = \frac{1}{\nu}\, \frac{\overline{f}_1(x)\, \overline{f}_2(x)}{\overline{v}_x(x)} .     (3.22)

Reciprocal Image of a Probability Function: P = \varphi^{-1}[Q]

f(P) = \frac{1}{\nu}\, g(\varphi(P)) .     (3.23)

Equivalently, in terms of probability densities,

\overline{f}(x) = \frac{1}{\nu}\, \overline{g}(\varphi(x))\, \frac{\overline{v}_x(x)}{\overline{v}_y(\varphi(x))} .     (3.24)

2This equation corresponds to the case when the dimension of the departure manifold is smaller than or equal to the dimension of the arrival manifold. See the text for the other situation.
3If the manifolds are only volume manifolds, it is better to write this making explicit the volume densities: g(Q) = \sum_{P\,:\,\varphi(P) = Q} f(P)\, \frac{v_M(P)}{v_N(\varphi(P))\, \sqrt{\det\left( \Phi(P)^t\, \Phi(P) \right)}} .


Bayes-Popper Inference: ( \varphi[\, P \cap \varphi^{-1}[Q]\,] = \varphi[P] \cap Q )

Expression of the term P_2 = P_1 \cap \varphi^{-1}[Q_1] :

f_2(P) = \frac{1}{\nu}\, f_1(P)\, g_1(\varphi(P)) .     (3.25)

Equivalently, in terms of probability densities,

\overline{f}_2(x) = \frac{1}{\nu}\, \overline{f}_1(x)\, \frac{\overline{g}_1(\varphi(x))}{\overline{v}_y(\varphi(x))} .     (3.26)

Expression of the term Q_2 = \varphi[\, P_1 \cap \varphi^{-1}[Q_1]\,] = \varphi[P_1] \cap Q_1 :

g_2(Q) = \frac{1}{\nu} \left( \sum_{P\,:\,\varphi(P) = Q} f_1(P)\, \frac{\sqrt{\det \gamma(P)}}{\sqrt{\det\left( \Phi(P)^t\, \Gamma(Q)\, \Phi(P) \right)}} \right) g_1(Q) .     (3.27)

Equivalently, in terms of probability densities,

\overline{g}_2(Q) = \frac{1}{\nu} \left( \sum_{P\,:\,\varphi(P) = Q} \frac{\overline{f}_1(P)}{\sqrt{\det\left( \Phi(P)^t\, \Gamma(Q)\, \Phi(P) \right)}} \right) \overline{g}_1(Q) .     (3.28)

Note: g_2 is both the intersection of the image of f_1 with g_1 , and the image of f_2 .

Note: although I have chosen to write equations 3.27 and 3.28 in the typical case when the manifolds have a metric, the only structure required for defining the image of a probability, the reciprocal image of a probability, and the intersection of probabilities, is the volume. Therefore these two equations are expressible in terms of the volume densities in each of the two manifolds:

g_2(Q) = \frac{1}{\nu} \left( \sum_{P\,:\,\varphi(P) = Q} f_1(P)\, \frac{v_M(P)}{\sqrt{\det\left( \Phi(P)^t\, \Phi(P) \right)}} \right) \frac{g_1(Q)}{v_N(Q)}

\overline{g}_2(Q) = \frac{1}{\nu} \left( \sum_{P\,:\,\varphi(P) = Q} \frac{\overline{f}_1(P)}{\sqrt{\det\left( \Phi(P)^t\, \Phi(P) \right)}} \right) \frac{\overline{g}_1(Q)}{v_N(Q)} .     (3.29)

Marginals of a Conditional

Note: see formulas in figure 3.20.

f_2(P) = \frac{1}{\nu}\, f_1(P)\, g_1(\varphi(P))\, \frac{\sqrt{\det\left( \gamma(P) + \Phi(P)^t\, \Gamma(\varphi(P))\, \Phi(P) \right)}}{\sqrt{\det \gamma(P)}} .     (3.30)


Equivalently, in terms of probability densities,

\overline{f}_2(x) = \frac{1}{\nu}\, \overline{f}_1(x)\, \overline{g}_1(\varphi(x))\, \frac{\sqrt{\det\left( \gamma(x) + \Phi(x)^t\, \Gamma(\varphi(x))\, \Phi(x) \right)}}{\sqrt{\det \gamma(x)}\; \sqrt{\det \Gamma(\varphi(x))}} .     (3.31)

The other marginal is:

g_2(Q) = \frac{1}{\nu} \left( \sum_{P\,:\,\varphi(P) = Q} f_1(P)\, \frac{\sqrt{\det\left( \gamma(P) + \Phi^t(P)\, \Gamma(Q)\, \Phi(P) \right)}}{\sqrt{\det\left( \Phi(P)^t\, \Gamma(Q)\, \Phi(P) \right)}} \right) g_1(Q) .     (3.32)

Equivalently, in terms of probability densities,

\overline{g}_2(y) = \frac{1}{\nu} \left( \sum_{x\,:\,\varphi(x) = y} \overline{f}_1(x)\, \frac{\sqrt{\det\left( \gamma(x) + \Phi^t(x)\, \Gamma(y)\, \Phi(x) \right)}}{\sqrt{\det \gamma(x)}\; \sqrt{\det\left( \Phi(x)^t\, \Gamma(y)\, \Phi(x) \right)}} \right) \overline{g}_1(y) .     (3.33)

Comparison Between Bayes-Popper and Marginal of the Conditional

For Bayes-Popper, one has

f_2(P) = \frac{1}{\nu}\, f_1(P)\, g_1(\varphi(P))

g_2(Q) = \frac{1}{\nu} \left( \sum_{P\,:\,\varphi(P) = Q} f_1(P)\, \frac{v_M(P)}{\sqrt{\det\left( \Phi(P)^t\, \Phi(P) \right)}} \right) \frac{g_1(Q)}{v_N(Q)}
       = \frac{1}{\nu} \left( \sum_{P\,:\,\varphi(P) = Q} f_1(P)\, \frac{\sqrt{\det \gamma(P)}}{\sqrt{\det\left( \Phi(P)^t\, \Gamma(Q)\, \Phi(P) \right)}} \right) g_1(Q) ,     (3.34)

while for the marginal of the conditional, one has:

f_2(P) = \frac{1}{\nu}\, f_1(P)\, g_1(\varphi(P))\, \frac{\sqrt{\det\left( \gamma(P) + \Phi(P)^t\, \Gamma(\varphi(P))\, \Phi(P) \right)}}{\sqrt{\det \gamma(P)}}

g_2(Q) = \frac{1}{\nu} \left( \sum_{P\,:\,\varphi(P) = Q} f_1(P)\, \frac{\sqrt{\det\left( \gamma(P) + \Phi^t(P)\, \Gamma(Q)\, \Phi(P) \right)}}{\sqrt{\det\left( \Phi(P)^t\, \Gamma(Q)\, \Phi(P) \right)}} \right) g_1(Q) .     (3.35)

Besides the mathematical rigor, there is an obvious "nice property" of the Bayes-Popper way of reasoning: to evaluate the important volumetric probability, f_2(P) , no knowledge of the derivatives is needed (compare the first of equations 3.34 with the first of equations 3.35). In particular, a Monte Carlo sampling method will require (many) evaluations of \varphi(P) (i.e., resolutions of the "forward modeling problem"), but no evaluation of the derivatives \Phi(P) . Now, although the analytical expression for g_2(Q) (second and third of equations 3.34) does contain the derivatives, we do not need to use this analytical expression for sampling: the images of the samples of f_2(P) (obtained in the way just mentioned) are samples of g_2(Q) . So, the samples of g_2(Q) are obtained as a by-product of the process of sampling f_2(P) .


3.2 Probabilities

3.2.1 Basic Definitions

There are different ways for introducing the notion of probability. One may, for instance, introduce the Kolmogorov axioms (see, for instance, Kolmogorov, 1950), or follow Jaynes' (2003) ideas. One could also start from the notion of random algorithm (i.e., of an algorithm that produces random outputs), and obtain as properties the usual axioms of the theory. I choose to start the theory using (a simplified version of) the Kolmogorov axioms.

Let us, then, consider a non-empty set \Omega and its power set \wp(\Omega) (the set containing all the subsets of \Omega ). If the set \Omega has a finite number of elements, or if the elements are numerable, then one can consider the probability of any set A \in \wp(\Omega) . This is not as simple when \Omega is not numerable, as, then, there are in \wp(\Omega) sets so complicated that they are not 'measurable'. This is why, as we did when introducing a measure over a set, we must restrict our consideration to a suitably chosen subset of \wp(\Omega) , say \mathcal{F} \subseteq \wp(\Omega) , that must be a \sigma-field, this implying, in particular, that

• F contains the empty set, ∅ , and contains the whole set, Ω ;

• if a set belongs to F , then, the complement of the set also belongs to F ;

• the union and the intersection of any finite or countably infinite sequence of sets of \mathcal{F} belong to \mathcal{F} .

(Note: what follows is a little bit redundant with what precedes.) Typically, if \Omega has a finite number of elements, or if the elements are numerable, then one chooses \mathcal{F} = \wp(\Omega) . If the elements are not numerable (for instance, when \Omega is a finite-dimensional manifold) one chooses for \mathcal{F} the Borel field of \Omega , that has been introduced in section 1.2.3. Remember that this is a huge set of subsets of \Omega , containing all the points of \Omega , all the open and closed sets, and all sets having any "reasonable" shape (in fact, all the sets to which one could unambiguously associate a "volume", finite or not). Huge as this set is, it is much smaller than \wp(\Omega) .

Definition 3.1 Probability. Given a set \Omega and chosen a set \mathcal{F} of subsets of \Omega that is a \sigma-field, we shall call probability function (or, when no confusion is possible, probability) a mapping P from \mathcal{F} into the real interval [0, 1] such that

P[Ω] = 1 , (3.36)

and such that for any two sets of F

P[A1 ∪A2] = P[A1] + P[A2]− P[A1 ∩A2] . (3.37)

The number P[A] is called the probability value of the set A (or, when no confusion is possible, the probability of the set A ).


It follows immediately from the definition of probability that if two sets A_1 and A_2 are complementary (with respect to \Omega ), then P[A_2] = 1 - P[A_1] . In particular, the probability of the empty set is zero:

P[ ∅ ] = 0 . (3.38)

We shall call the triplet \{\Omega, \mathcal{F}, P\} a probability triplet. It is usually called a probability "space", but we should refrain from using this terminology here: given the pair \{\Omega, \mathcal{F}\} , we shall consider below the space of all probabilities over \{\Omega, \mathcal{F}\} (that we shall endow with an internal operation, the intersection of probabilities). So, here, what would deserve the name of "probability space" would be a given \Omega , a given \sigma-field \mathcal{F} \subseteq \wp(\Omega) and the collection of all the probabilities over \{\Omega, \mathcal{F}\} . So, to avoid any confusion, we had better not use the term "probability space", and call \{\Omega, \mathcal{F}, P\} a probability triplet. Also, the set \Omega is sometimes called the sample space, while the sets in \mathcal{F} are sometimes called events. We do not need to use this terminology here4.

Let \{\Omega, \mathcal{F}, P_1\} and \{\Omega, \mathcal{F}, P_2\} be two probability triplets (i.e., let P_1 and P_2 be two possibly different probability functions defined over the same set \mathcal{F} ). If

for any A ∈ F one has P1[A ] = P2[ A ] , (3.39)

then one says that P1 and P2 are identical, and one writes

P1 = P2 . (3.40)

Definition 3.2 Conditional Probability. Let \{\Omega, \mathcal{F}, P\} be a probability triplet, and let C \in \mathcal{F} be a given subset of \Omega with non-zero probability, P[C] > 0 . The conditional probability of any set A \in \mathcal{F} "given the set C " is denoted P[A|C] , and is defined as

P[A|C] = \frac{P[A \cap C]}{P[C]} .     (3.41)

It is easy to see that, for any given C with P[C] > 0 , the function A \mapsto P[A|C] is indeed a probability function (i.e., it satisfies the conditions in definition 3.1).

(Note: explain briefly here the intuitive meaning of the definition of conditional probability.)

Note: write here the Bayes theorem:

P[A|C] = \frac{P[C|A]\; P[A]}{P[C]} .     (3.42)

4Let us also choose to ignore what a “random variable” could be.


Property 3.1 If the elements of the set \Omega are numerable (or if there is a finite number of them), \Omega = \{\omega_1, \omega_2, \ldots\} , then, for any probability P over \mathcal{F} = \wp(\Omega) , there exists a unique set of non-negative real numbers \{p(\omega_1), p(\omega_2), \ldots\} , with \sum_i p(\omega_i) = 1 , such that for any A \subseteq \Omega

P[A] = \sum_{\omega_i \in A} p(\omega_i) .     (3.43)

It is then clear that the probability of a set containing a single element is

P[\{\omega_i\}] = p(\omega_i) .     (3.44)

Accordingly, we shall call the number p(\omega_i) the elementary probability (or, for short, probability) of the element \omega_i . While the function A \mapsto P[A] (defined on sets) is the probability function, we shall call the function a \mapsto p(a) (defined on elements) the elementary probability function.

A look at figure 3.1 makes this property obvious. In many practical situations, one does not reason on the abstract function P , that associates a number to every subset, but on the collection of numbers p_i = p(\omega_i) , one associated to each element of the set.


Figure 3.1: From a practical point of view, defining a probability P over a discrete set consists in assigning an elementary probability p_i to each of the elements of the set, with \sum_i p_i = 1 .

Note that, given a discrete set A_0 , we can consider different elementary probability functions over A_0 , say p_1 , p_2 \ldots . To any element a \in A_0 the probability function p_1 shall associate the probability value p_1(a) , the probability function p_2 shall associate the probability value p_2(a) , and so on. In a very similar way, given different sets A_1 , A_2 \ldots (all of them subsets of A_0 ), the indicator functions associated to each set shall associate, to any element a \in A_0 , the values \chi_{A_1}(a) , \chi_{A_2}(a) \ldots


and so on. As we call any of the p_1(a) , p_2(a) \ldots the probability of a , we could call any of the \chi_{A_1}(a) , \chi_{A_2}(a) \ldots the possibility of a : a possibility value of 1 meaning certainty, and a possibility value of 0 meaning impossibility.

Example 3.1 I have arbitrarily selected an element a from some set A_0 . Question: is it possible that the element a belongs, in fact, to some subset A \subseteq A_0 ? Answer: the possibility that the element a belongs to A is \chi_A(a) , i.e., if \chi_A(a) = 1 it is certain that a belongs to A , while if \chi_A(a) = 0 it is impossible that a belongs to A .

We thus see that the notion "possibility of an element a " bears some resemblance to the notion "probability of an element a ". While the possibility of an element can be 0 or can be 1 , the probability of an element can be any number in the interval [0, 1] , so, in some loose sense, probabilities are a generalization of indicators (i.e., of "possibilities"). (In that sense, a probability function is a "fuzzy set".) The following table displays the correspondences between sets and probability functions (the table is for probabilities over a discrete set, and a similar table is possible for probabilities over a manifold):

    set theory                                      |  probability theory
    ------------------------------------------------|--------------------------------------------------
    sets A_1 , A_2 ... that are subsets             |  probability functions P_1 , P_2 ... that to any
    of a set A_0                                    |  A ⊆ A_0 associate the probability values
                                                    |  P_1[A] , P_2[A] ...
    indicator functions χ_{A_1} , χ_{A_2} ...       |  elementary probability functions p_1 , p_2 ...
    that to any a ∈ A_0 associate the indicator     |  that to any a ∈ A_0 associate the elementary
    values χ_{A_1}(a) , χ_{A_2}(a) ...              |  probability values p_1(a) , p_2(a) ...

This table suggests that when, in the following pages, we generalize to probability functions some of the notions usually introduced for sets (intersection of two probability functions, image of a probability function [via a mapping], reciprocal image of a probability function) we should compare the formulas obtained for elementary probability functions (or for volumetric probabilities, in the case of a manifold) with the formulas of set theory expressed in terms of indicator functions.

Note: explain here that introducing the elementary conditional probability p(\omega|C) via

P[A|C] = \sum_{\omega \in A} p(\omega|C) ,     (3.45)

one finds

p(\omega|C) = \begin{cases} p(\omega) \,/\, \sum_{\omega' \in C} p(\omega') & \text{if } \omega \in C \\ 0 & \text{if } \omega \notin C . \end{cases}     (3.46)


Note: I have to define somewhere the notion of support of a probability function5. When we have a probability triplet \{\Omega, \mathcal{F}, P\} , the set \mathcal{F} will always be \wp(\Omega) , the set of all the subsets of \Omega (if \Omega has a numerable number of elements), the Borel set of \Omega (if \Omega is a manifold), or some explicitly introduced set (in the general case). So, to simplify the language, instead of saying that a probability function P is defined over \{\Omega, \mathcal{F}\} , we shall sometimes say that a probability function is defined over \Omega .

⁵ Recall: if M is a topological space, and f a real function defined over M , the support of f , denoted supp( f ) , is the closure of the set where f does not vanish (i.e., supp f is the smallest closed set such that f is zero on the complement of the set). One has the following two properties:

supp( f g) ⊆ supp( f ) ∩ supp(g)
supp( f + g) ⊆ supp( f ) ∪ supp(g) .

Note: I must understand why one has ⊆ instead of = in these two equations. (The inclusions can be strict: the set where f g does not vanish is the intersection of the sets where f and g do not vanish, but the closure of an intersection is only contained in the intersection of the closures; and f + g may vanish at points where f = −g ≠ 0 .) I must use this definition for probability functions, and explain that a probability function can be seen as a “modulation” of its support. I also have to explain that I wish the notions to be introduced below of ‘image of a probability’, of ‘reciprocal image of a probability’, and of ‘intersection of two probabilities’ to generalize the equivalent notions in set theory (‘image of a set’, ‘reciprocal image of a set’, and ‘intersection of two sets’), i.e., we must have

supp[ ϕ[P] ] = ϕ[ supp[P] ]
supp[ P1 ∩ P2 ] = supp[P1] ∩ supp[P2]
supp[ ϕ-1[Q] ] = ϕ-1[ supp[Q] ] ,

where, possibly, some of the signs = have to be replaced by ⊆ .


3.2.2 Image of a Probability

Let ϕ be a mapping from a set A0 into a set B0 , and let P be a probability functionover A0 . As we shall see below, there is a unique probability function over B0 ,denoted ϕ[ P ] , such that for any6 B ⊆ B0 ,

(ϕ[ P ])[ B ] = P[ϕ-1[ B ] ] . (3.47)

Definition 3.3 We shall say that ϕ[ P ] is the image of the probability P via the map-ping ϕ .

So, in simple words, the image of a probability function (via a mapping ϕ ) is suchthat the probability value of any subset B (in the arrival set) equals the probabilityvalue (in the departure set) of the reciprocal image of B (via the mapping ϕ ).

Note: In set theory, one introduces the notion of image of a set. I must verify that the following property holds: the image of the support of a probability function equals the support of the image of the probability function:

supp[ ϕ[ P ] ] = ϕ[ supp[ P ] ] . (3.48)

So, in this sense, our definition of image of a probability function is consistent with the definition of image of a set. In fact, the definition of image of a probability function can be seen as a generalization of the definition of image of a set, if a set is considered as a special case of a probability function (i.e., of a “fuzzy set”). There is another way to see this. The image of a set can be defined in terms of the indicator of a set (see equation 1.36) as

for any b ∈ B0 , χϕ[A](b) = χA( ϕ-1[b] ) , (3.49)

and bla, bla, bla. . .Let us check that we have, indeed, defined a probability. First, for any B ⊆

B0 , (ϕ[P])[ B ] = P[ϕ-1[ B ] ] ≥ 0 . Second, for any B1 ⊆ B0 , and any B2 ⊆B0 , (ϕ[P])(B1 ∪ B2) = P[ϕ-1[ B1 ∪ B2 ] ] = P[ϕ-1[ B1]∪ϕ-1[ B2 ] ] = P[ϕ-1[ B1] ] +P[ϕ-1[ B2] ] − P[ϕ-1[ B1 ∩ B2 ] ] = (ϕ[P])[ B1 ] + (ϕ[P])[ B2 ] − (ϕ[P])[ B1 ∩ B2 ] . So,that ϕ[P] is a probability follows immediately from the fact that P is a probability.

Taking B =ϕ[A] in the definition 3.47 gives the property

(ϕ[ P ])[ϕ[A] ] = P[ϕ-1[ϕ[A] ] ] . (3.50)

As (equation 1.26) A ⊆ϕ-1[ϕ[A] ] , in general P[ϕ-1[ϕ[A] ] ] ≥ P[A] , and, therefore,

(ϕ[ P ])[ϕ[A] ] ≥ P[A] , (3.51)

6More rigorously, for any B in the selected set F of subsets of B0 (a σ-field).


the equality holding if the mapping ϕ is injective (check!).In section 3.3.2 we shall examine what this definition implies when we consider

that the sets A0 and B0 are manifolds (where the probability of any single pointis, typically, zero). But when the two sets A0 and B0 are discrete, we can obtain asimple result:

Property 3.2 Let A0 and B0 be two discrete sets, and ϕ a mapping from A0 into B0 .As equation 3.47 must hold for any B ⊆ B0 , it must also hold for a set containing a singleelement. Therefore,

for any b ∈ B0 , (ϕ[P])(b) = P[ϕ-1[ b ] ] . (3.52)

More simply, we can state this property as follows. If p( · ) is the elementary prob-ability associated with the probability P , and ϕ[p]( · ) is the elementary probabilityassociated with ϕ[ P ] , then, for any element b ∈ B0 ,

(ϕ[p])(b) = ∑_{a : ϕ(a) = b} p(a) . (3.53)

This, of course, just results from the definition of the reciprocal image of a set.

Example 3.2 Figure 3.2 suggests a discrete set A0 = {a1, a2, a3} , a discrete set B0 = {b1, b2, b3} , and a mapping ϕ from A0 into B0 . To every probability P defined over A0 one can associate its image Q = ϕ[P] (defined over B0 ) as just introduced. Given the mapping suggested in the figure, one obtains (using equation 3.52)

Q(b1) = P(a1) + P(a2) ; Q(b2) = P(a3) ; Q(b3) = 0 . (3.54)

Note that from P(a1) + P(a2) + P(a3) = 1 it follows that Q(b1) + Q(b2) + Q(b3) = 1 .
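As a purely illustrative aside, the short Python sketch below implements equation 3.52 for the mapping of figure 3.2; the dictionaries encoding the sets, the mapping and the numerical values of p are assumptions made only for this example.

```python
# Minimal sketch: image of a discrete probability through a mapping (eq. 3.52).
# The sets, the mapping and the numerical values are illustrative assumptions.

def image(p, phi, B0):
    """Return the elementary probability q = phi[p] over B0,
    with q(b) = sum of p(a) over all a such that phi(a) = b."""
    q = {b: 0.0 for b in B0}
    for a, pa in p.items():
        q[phi[a]] += pa
    return q

p = {'a1': 0.2, 'a2': 0.5, 'a3': 0.3}            # elementary probability over A0
phi = {'a1': 'b1', 'a2': 'b1', 'a3': 'b2'}       # the mapping of figure 3.2
q = image(p, phi, B0=['b1', 'b2', 'b3'])
print(q)   # {'b1': 0.7, 'b2': 0.3, 'b3': 0.0}; note that the values sum to one
```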

Note: I have to demonstrate here the following property.

Property 3.3 Let ϕ be a mapping from a set A0 into a set B0 , and let P be a probability over A0 . If a1, a2, . . . is a sample of P , then ϕ(a1), ϕ(a2), . . . is a sample of ϕ[P] .

This property has both a conceptual and a practical importance. Conceptually, because it gives an intuitive understanding of the notion of transport of a probability function. Practically, because, to obtain a sample of ϕ[p] , one should not try to develop an ad hoc method based on the explicit expressions associated to ϕ[P] (like the expression for ϕ[p] in equation 3.53). One should rather just obtain a sample of P and then map the sample as suggested by property 3.3.
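A minimal sketch of this recipe, under the same illustrative assumptions as in the previous sketch: sample p , map each sample point through ϕ , and the relative frequencies of the mapped points approximate ϕ[p] .

```python
# Sketch of property 3.3: a sample of p, mapped through phi, is a sample of phi[p].
# Numerical values are illustrative assumptions.
import random
from collections import Counter

p = {'a1': 0.2, 'a2': 0.5, 'a3': 0.3}
phi = {'a1': 'b1', 'a2': 'b1', 'a3': 'b2'}

elements, weights = zip(*p.items())
sample_a = random.choices(elements, weights=weights, k=100_000)  # sample of p
sample_b = [phi[a] for a in sample_a]                            # mapped sample

freq = Counter(sample_b)
print({b: freq[b] / len(sample_b) for b in ('b1', 'b2', 'b3')})
# close to the exact image {'b1': 0.7, 'b2': 0.3, 'b3': 0.0}
```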

(Note: is the image of a σ-field a σ-field?)
(Note: is the reciprocal image of a σ-field a σ-field?)


Figure 3.2: The image of a probability via a mapping. [The figure shows the mapping ϕ from A0 = {a1, a2, a3} into B0 = {b1, b2, b3} , with ϕ(a1) = ϕ(a2) = b1 and ϕ(a3) = b2 ; given p(a1), p(a2), p(a3) , the image has q(b1) = p(a1) + p(a2) , q(b2) = p(a3) , q(b3) = 0 .]


3.2.3 Intersection of Probabilities

The notion of intersection of probabilities was introduced by Tarantola (1987). I think that some problems that are formulated using the notion of conditional probability (and Bayes theorem) are better formulated using this notion. This is true, in particular, for the so-called “inverse problems” (see an example in section XXX).

In a few words, the notion of intersection of sets plays a major role when formulating problems in terms of sets. As far as one can see a probability as a generalization of (the indicator of) a set, it is natural to ask how the intersection of sets generalizes when dealing with probabilities.

Of course, there will be strong similarities between the intersection of probabil-ities and the intersection of “fuzzy sets” (Zadeh, 1965), but the final equations arenot quantitatively equivalent, and the domain of application of the two definitionsis quite different.

Note: explain here that the intersection of probability functions can only be defined if a given probability function (perhaps an improper one, i.e., a measure function) H has been chosen beforehand, such that all other probability functions are absolutely continuous with respect to it. This special probability function is named homogeneous (note: explain why it is called homogeneous). Typically, when working with discrete sets, the (homogeneous) measure of a set will simply correspond to the number of elements in the set, while when working with manifolds, the measure of a set will result from an ad hoc volume element of the manifold (see examples XXX and XXX).

Note: the tactic I am going to follow here is to define the intersection as an internal operation between (normalized) probabilities. As the homogeneous measure must be a probability, this assumes that the measure of the whole set is finite (I call it the homogeneous probability, and represent it by the letter H ). Once the final formulas have been obtained, both for discrete sets and for manifolds, we extend the definition, when it makes sense, to sets with infinite measure.

Note: be careful, for the intersection of two probabilities to be defined, the intersection of their supports must be non-empty.

Definition 3.4 Intersection of probabilities. Let a volume triplet T = ( Ω , F , V ) be given. We can consider the space of all probabilities that can be built on the triplet T . We endow this space of all the probabilities with an internal operation, called the intersection, and denoted with the symbol ∩ , defined by the following conditions:

• the operation is commutative, i.e., for any two probability functions,

P1 ∩ P2 = P2 ∩ P1 , (3.55)

• the operation is associative, i.e., for any three probability functions,

(P1 ∩ P2)∩ P3 = P1 ∩ (P2 ∩ P3) , (3.56)


• the homogeneous probability function H is a neutral element of the operation, i.e., forany probability P ,

P∩H = H ∩ P = P , (3.57)

• for whatever A ∈ F ,

P1[A] = 1 ⇒ (P1 ∩ P2)[A] = 1 , (3.58)

• and for whatever A ∈ F ,

P1[A] = 0 ⇒ (P1 ∩ P2)[A] = 0 . (3.59)

Note: the last condition implies that the intersection P1 ∩ P2 is absolutely continuous with respect to both P1 and P2 .

Note: The examples below demonstrate here that there is at least one solution tothe previous set of conditions. It remains to prove that this solution is unique7.

Note: I still don’t know if I can prove the relation

supp[ P1 ∩ P2 ] = supp[P1]∩ supp[P2] (3.60)

or if I have to add it as an axiom.

Property 3.4 Assume that the set Ω is discrete, and that we choose F = ℘(Ω) . Let P1and P2 be two probabilities, with elementary probabilities respectively denoted p1 and p2 .The elementary probability representing P1 ∩ P2 , that we may denote p1 ∩ p2 , is given, forany element ω ∈ Ω , by

(p1 ∩ p2)(ω) = (1/ν) p1(ω) p2(ω) , (3.61)

where ν is the normalization factor ν = ∑_{ω∈Ω} p1(ω) p2(ω) .

It is obvious that with the expression above, the conditions of definition 3.4 are satisfied.
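As a small numerical illustration of equation 3.61, the sketch below forms the pointwise product of two elementary probabilities and normalizes it; the sets and the numerical values are illustrative assumptions, and the check with a uniform h only exemplifies the neutrality condition 3.57.

```python
# Sketch of the intersection of two discrete probabilities (eq. 3.61):
# pointwise product followed by normalization. Values are illustrative assumptions.

def intersection(p1, p2):
    prod = {w: p1[w] * p2[w] for w in p1}
    nu = sum(prod.values())                  # normalization factor
    if nu == 0.0:
        raise ValueError("supports do not intersect: the intersection is undefined")
    return {w: v / nu for w, v in prod.items()}

p1 = {'w1': 0.5, 'w2': 0.5, 'w3': 0.0}
p2 = {'w1': 0.2, 'w2': 0.6, 'w3': 0.2}
print(intersection(p1, p2))   # {'w1': 0.25, 'w2': 0.75, 'w3': 0.0}

h = {'w1': 1/3, 'w2': 1/3, 'w3': 1/3}        # homogeneous probability (neutral element)
print(intersection(p2, h))    # returns p2 again (up to rounding), as in equation 3.57
```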

For the expression of the intersection of probability functions defined over manifolds, see property 3.10 in section 3.4.5.

7For the time being, I have taken the simple example of a set with only two elements. Denotingby f (x, y) the formula that, in this simple example, expresses the intersection, the axioms impose thefollowing conditions: f (x, y) = f (y, x) , f (0, x) = 0 , f (1/2, x) = x , f (1, x) = 1 , f (x, f (y, z)) =f ( f (x, y), z) . One solution of this is the right expression, f (x, y) = x y/(1− x− y + 2 x y) , but I don’tknow yet if there are other solutions.


As we have expressed the intersection of two probabilities in terms of an elementary probability (for discrete sets) or in terms of volumetric probabilities (for manifolds), and as these are positive, normalized functions, it is clear that the intersection of two probabilities is, indeed, a probability (the operation is internal).

Note: warn here the reader that the expression 3.153 is not valid if, instead of using volumetric probabilities, one uses probability densities. See appendix ??.

So we see that the intersection of two probabilities is defined in terms of the product of the elementary probabilities (or the volumetric probabilities). At this point we may remember the second of equations 1.8, defining the intersection of sets in terms of their indicators: for any ω ∈ Ω , we had

χ_{A1 ∩ A2}(ω) = min{ χA1(ω) , χA2(ω) } = χA1(ω) χA2(ω) . (3.62)

The second of these expressions is similar to the two equations 3.61 and 3.153, except for the normalization factor, which makes no sense for indicators. Note that the intersection of fuzzy sets is, rather, defined by the expression in the middle of 3.62. An expression like this one is forbidden to us, because of the third of the conditions in definition 3.4 (that the homogeneous probability is neutral for the intersection operation).

(Note: I have moved the definition of the union of probabilities to the appendices[appendix 5.3.5, page 194].)


3.2.4 Reciprocal Image of a Probability

Let ϕ be a mapping from a set A0 into a set B0 , and let Q be a probability over B0 .How should we define ϕ-1[Q] , the reciprocal image of Q ? A look at figure 3.3suggests that there is no obvious way to use the mapping ϕ to infer the probabilityϕ-1[Q] (contrary to what happens in figure 3.2, where the inference is obvious [forinstance, via property 3.3, page 74]).

As we have already defined the intersection of two probabilities (see section 3.2.3),and we have seen that the intersection of probabilities is a generalization of the no-tion of intersection of sets, we shall define the reciprocal image of a probability byimposing that the equivalent of equation 1.30 (reproduced in equation 3.69 here be-low), that is valid for images, reciprocal images, and intersections of sets, remainsvalid for images, reciprocal images, and intersections of probability functions.

Note: I guess that the right definition is as follows. We consider a mapping from a set A0 into a set B0 . The mapping is arbitrary, in the sense that it is not assumed to be injective or surjective. Given a subset A ⊆ A0 one can always (check!) consider a decomposition of A into a collection of disjoint subsets Ai such that

A = ∪i Ai , (3.63)

and such that the mapping from each of the Ai into B0 is injective [as an example, draw here a cubic function].

Definition 3.5 Then, given an arbitrary mapping ϕ from a set A0 into a set B0 , and given a probability Q over B0 , there is a unique probability over A0 , that we shall denote ϕ-1[Q] and call the reciprocal image of Q , defined by the condition that for any A ⊆ A0 ,

(ϕ-1[Q])[A] = (ϕ-1[Q])[ ∪i Ai ] = (1/ν) ∑i Q[ ϕ[Ai] ] , (3.64)

where A = ∪i Ai is the partition introduced above, and where ν is the normalization constant (independent of A ) ensuring that (ϕ-1[Q])[A0] = 1 .

[I have to demonstrate here that I am indeed defining a probability over A0 (i.e.,that the function A 7→ (ϕ-1[Q])[A] so introduced satisfies the Kolmogorov axioms)].

Example 3.3 If the sets A0 and B0 are discrete, then the elementary probability of any element a ∈ A0 is given by

(ϕ-1[q])(a) = (1/ν) q( ϕ(a) ) , (3.65)

where ν is the normalization factor ν = ∑_{a∈A0} q( ϕ(a) ) .


See section 3.4.6 for the expression of the reciprocal image of a probability function when considering a mapping between manifolds.

Example 3.4 Figure 3.3 suggests a discrete set A0 = {a1, a2, a3} , a discrete set B0 = {b1, b2, b3} , and a mapping ϕ from A0 into B0 . To every probability Q defined over B0 one can associate its reciprocal image P = ϕ-1[Q] as just defined. Given the mapping suggested in the figure, one obtains (using equation 3.65)

P(a1) = Q(b1) / ( 2 Q(b1) + Q(b2) )
P(a2) = Q(b1) / ( 2 Q(b1) + Q(b2) )
P(a3) = Q(b2) / ( 2 Q(b1) + Q(b2) ) . (3.66)

Note that from Q(b1) + Q(b2) + Q(b3) = 1 it follows that P(a1) + P(a2) + P(a3) = 1 .
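A short sketch of equation 3.65 for this example; the mapping and the numerical values of q are illustrative assumptions.

```python
# Sketch of the reciprocal image of a discrete probability (eq. 3.65),
# for the mapping of figure 3.3. Numerical values are illustrative assumptions.

def reciprocal_image(q, phi):
    """Return p = phi^{-1}[q] over the departure set, with p(a) = q(phi(a)) / nu."""
    raw = {a: q[phi[a]] for a in phi}
    nu = sum(raw.values())                   # nu = sum over a of q(phi(a))
    return {a: v / nu for a, v in raw.items()}

q = {'b1': 0.3, 'b2': 0.3, 'b3': 0.4}
phi = {'a1': 'b1', 'a2': 'b1', 'a3': 'b2'}   # b3 is not reached by the mapping
print(reciprocal_image(q, phi))
# {'a1': 1/3, 'a2': 1/3, 'a3': 1/3}, since nu = 2 q(b1) + q(b2) = 0.9
```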

Figure 3.3: The reciprocal image of a probability via a mapping, P ≡ ϕ-1[Q] . [The figure shows the same mapping ϕ from A0 = {a1, a2, a3} into B0 = {b1, b2, b3} as in figure 3.2; given q(b1), q(b2), q(b3) , the reciprocal image has p(a1) = q(b1)/ν , p(a2) = q(b1)/ν , p(a3) = q(b2)/ν , with ν = 2 q(b1) + q(b2) .]

Note: I have to check that the property

supp[ϕ-1[Q] ] = ϕ-1[ supp[Q] ] (3.67)

holds.


3.2.5 Compatibility Property

We have just introduced the notion of image of a probability, ϕ[ P ] , of reciprocalimage of a probability ϕ-1[Q] , and of intersection of two probabilities, R1 ∩ R2 . Withthe definitions introduced above, the following property holds:

Property 3.5 Let ϕ be a mapping from some set A0 into some other set B0 , P be a prob-ability function defined over A0 , and Q a probability function defined over B0 . For anymapping ϕ , and any probabilities P and Q ,

ϕ[ P∩ϕ-1[Q] ] = ϕ[ P ]∩Q . (3.68)

This property is ALMOST demonstrated in appendix 5.3.2, where it is also demonstrated that the definition of reciprocal image of a probability, introduced above, is the only definition leading to this compatibility property.

We may remember here the relation 1.30, demonstrated in chapter 1: when Aand B are sets, ϕ[A] and ϕ-1[ B ] respectively denote the image and the reciprocalimage of a set, and C1 ∩C2 denotes the intersection of two sets, one has

ϕ[ A∩ϕ-1[ B ] ] = ϕ[A] ∩ B . (3.69)
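A small self-contained numerical check of property 3.5 on a discrete example may help fix ideas; it bundles the three operations already sketched above, and the mapping and numerical values are again illustrative assumptions.

```python
# Numerical check of the compatibility property (eq. 3.68) on a small discrete
# example. The mapping and the numerical values are illustrative assumptions.

def normalize(d):
    nu = sum(d.values())
    return {k: v / nu for k, v in d.items()}

def image(p, phi, B0):                       # eq. 3.52
    q = {b: 0.0 for b in B0}
    for a, pa in p.items():
        q[phi[a]] += pa
    return q

def reciprocal_image(q, phi):                # eq. 3.65
    return normalize({a: q[phi[a]] for a in phi})

def intersection(p1, p2):                    # eq. 3.61
    return normalize({w: p1[w] * p2[w] for w in p1})

p = {'a1': 0.2, 'a2': 0.5, 'a3': 0.3}
q = {'b1': 0.5, 'b2': 0.2, 'b3': 0.3}
phi = {'a1': 'b1', 'a2': 'b1', 'a3': 'b2'}
B0 = ['b1', 'b2', 'b3']

left = image(intersection(p, reciprocal_image(q, phi)), phi, B0)
right = intersection(image(p, phi, B0), q)
print(left)    # both sides equal {'b1': 0.854..., 'b2': 0.146..., 'b3': 0.0},
print(right)   # as stated by phi[ P ∩ phi^{-1}[Q] ] = phi[ P ] ∩ Q
```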


Note: mention here figure 3.4.

Figure 3.4: Illustration of the compatibility property: whether one follows the blue path or the red path, one arrives at the same values v(b) . (Note: explain this better.) [The figure annotates the mapping of figures 3.2 and 3.3 with: s(b1) = p(a1) + p(a2) , s(b2) = p(a3) , s(b3) = 0 ; r(a1) = q(b1)/ν , r(a2) = q(b1)/ν , r(a3) = q(b2)/ν , with ν = 2 q(b1) + q(b2) ; u(a1) = p(a1) r(a1)/ν′ = p(a1) q(b1)/ν′ , u(a2) = p(a2) q(b1)/ν′ , u(a3) = p(a3) q(b2)/ν′ ; and v(b1) = q(b1) s(b1)/ν′ = q(b1) ( p(a1) + p(a2) )/ν′ , v(b2) = q(b2) p(a3)/ν′ , v(b3) = 0 , where ν′ = ( p(a1) + p(a2) ) q(b1) + p(a3) q(b2) .]


3.2.6 Other Properties

I should perhaps write here some properties like the two below.A mapping a 7→ b = ϕ(a) (that to any element a ∈ A0 associates an element

b ∈ B0 ) is traditionally extended into a mapping A 7→ B =ϕ[A] that to any set A ⊆A0 associates a set B ⊆ B0 . The reciprocal of this mapping ϕ[ · ] is then introduced(see section 1.3), and one arrives at the two properties expressed by equations 1.26and 1.27, that we may remember here. For any A ⊆ A0 , one has

A ⊆ ϕ-1[ϕ[A] ] , (3.70)

and one has A = ϕ-1[ϕ[A] ] if the mapping is injective. For any B ⊆ B0 , one has

ϕ[ϕ-1[ B ] ] ⊆ B , (3.71)

and one has ϕ[ϕ-1[ B ] ] = B if the mapping is surjective.As we have extended the mapping ϕ to define a mapping P 7→ Q = ϕ[P] of

probabilities (that to any probability P over A0 associates a probability Q overB0 ), we need to examine how the expression ϕ-1[ϕ[P] ] relates to P , and how theexpression ϕ[ϕ-1[Q] ] relates to Q .

(Note: I have not yet derived general formulas concerning subsets, so, for thetime being, I write here the formulas concerning the elements of discrete sets.)

Property 3.6 For any (discrete) probability p and any mapping ϕ one has⁸ (for any element a ∈ A0 )

(ϕ-1[ϕ[p] ] )(a) = (1/ν) ∑_{a′ ∈ ϕ-1[ ϕ(a) ]} p(a′) , (3.72)

where ν is a normalization factor. Should the mapping ϕ be injective, then the right-handside of the expression would collapse into p(a) , so one would have

ϕ-1[ϕ[p] ] = p . (3.73)

Property 3.7 For any (discrete) probability q and any mapping ϕ one has⁹ (for any element b ∈ B0 )

(ϕ[ϕ-1[q] ] )(b) = (1/ν) n(b) q(b) , (3.74)

where n(b) is the number of elements in A0 that map into the element b (i.e., the measureof the set ϕ-1[b] ), and where ν is a normalization factor. Should the mapping ϕ bebijective, then the right-hand side of the expression would collapse into q(b) , so one wouldhave

ϕ[ϕ-1[q] ] = q . (3.75)

These two properties are illustrated in figures 3.5 and 3.6.

⁸ For one has (ϕ-1[ϕ[p] ] )(a) = (1/ν) (ϕ[p] )( ϕ(a) ) = (1/ν) ∑_{a′ ∈ ϕ-1[ ϕ(a) ]} p(a′) .

⁹ For one has (ϕ[ϕ-1[q] ] )(b) = ∑_{a ∈ ϕ-1[b]} (ϕ-1[q])(a) = ∑_{a ∈ ϕ-1[b]} (1/ν) q( ϕ(a) ) = (1/ν) ∑_{a ∈ ϕ-1[b]} q( ϕ(a) ) = (1/ν) n(b) q(b) .


Figure 3.5: Evaluating p′ = ϕ-1[ϕ[p] ] . [For the mapping of figure 3.2: given p(a1), p(a2), p(a3) , one has (ϕ[p])(b1) = p(a1) + p(a2) , (ϕ[p])(b2) = p(a3) , (ϕ[p])(b3) = 0 , and then p′(a1) = ( p(a1) + p(a2) )/ν , p′(a2) = ( p(a1) + p(a2) )/ν , p′(a3) = p(a3)/ν , with ν = 2 p(a1) + 2 p(a2) + p(a3) .]

Figure 3.6: Evaluating q′ = ϕ[ϕ-1[q] ] . [For the same mapping: given q(b1), q(b2), q(b3) , one has (ϕ-1[q])(a1) = q(b1)/ν , (ϕ-1[q])(a2) = q(b1)/ν , (ϕ-1[q])(a3) = q(b2)/ν , with ν = 2 q(b1) + q(b2) , and then q′(b1) = 2 q(b1)/ν , q′(b2) = q(b2)/ν , q′(b3) = 0 .]


3.2.7 Marginal Probability

Note: the importance of this definition has to be downplayed: it must not correspond to a whole section.

The definition given here below of marginal probability may seem less general than that found in other texts. In fact, I am just trying to give the definition that, when dealing with probabilities over manifolds, makes intrinsic sense, i.e., does not require invoking any special coordinate system.

So, to introduce the notion of marginal probability we do not consider an arbitrary set, but a set that is the Cartesian product of two sets: A0 × B0 . Let P then be a (general) probability function over A0 × B0 . To any set S ⊆ A0 × B0 the probability function P associates the probability value denoted P[ S ] .

The probability function P induces one probability function over A0 and one probability function over B0 , respectively denoted PA0 and PB0 , called marginal probability functions and defined as follows. PA0 is the function that to any set A ⊆ A0 associates the probability value

PA0 [ A ] ≡ P[ A× B0 ] , (3.76)

and PB0 is the function that to any set B ⊆ B0 associates the probability value

PB0 [ B ] ≡ P[ A0 × B ] . (3.77)

(Note: check these definitions, and demonstrate that they actually define a probabil-ity function.)

Example 3.5 If the set A0 × B0 is discrete, denoting a generic element of the set as (a, b) , we can introduce the elementary probability function p through the condition that for any set S ⊆ A0 × B0 , P[ S ] = ∑_{(a,b)∈S} p(a, b) . Similarly, we can introduce the elementary probability function pA0 through the condition that for any set A ⊆ A0 , PA0 [ A ] = ∑_{a∈A} pA0(a) , and the elementary probability function pB0 through the condition that for any set B ⊆ B0 , PB0 [ B ] = ∑_{b∈B} pB0(b) . It is then easy to see (note: check!) that the elementary marginal probability functions are given (for any a ∈ A0 and any b ∈ B0 ) by

pA0(a) = ∑_{b∈B0} p(a, b) (3.78)

and

pB0(b) = ∑_{a∈A0} p(a, b) . (3.79)
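A minimal sketch of equations 3.78 and 3.79 for a small discrete joint probability; the joint table is an illustrative assumption.

```python
# Sketch of the marginal elementary probabilities (eqs. 3.78 and 3.79) for a
# small discrete Cartesian product A0 x B0. The joint values are illustrative.

joint = {('a1', 'b1'): 0.1, ('a1', 'b2'): 0.3,
         ('a2', 'b1'): 0.2, ('a2', 'b2'): 0.4}

A0 = ['a1', 'a2']
B0 = ['b1', 'b2']

p_A = {a: sum(joint[(a, b)] for b in B0) for a in A0}   # eq. 3.78
p_B = {b: sum(joint[(a, b)] for a in A0) for b in B0}   # eq. 3.79

print(p_A)   # {'a1': 0.4, 'a2': 0.6}
print(p_B)   # {'b1': 0.3, 'b2': 0.7}  (both marginals sum to one)
```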


Example 3.6 If the two sets A0 and B0 are, in fact, two manifolds, then. . . (Note: check if I really need the volume elements.) Bla, bla, bla, and

C = A × B , (3.80)

bla, bla, bla, and

dvC = dvA dvB , (3.81)

bla, bla, bla, and

∫ dvC = ∫ dvA ∫ dvB , (3.82)

etc.


3.3 Probabilities over Manifolds

Note: say that all the general definitions given above hold. We only need to explicitly develop the special mathematics that appears when dealing with manifolds.

3.3.1 Probability Density

In this section we consider that the set Ω is, in fact, a finite-dimensional manifold,that we may denote M . We select for F the (THE?) Borel set of M .

Bla, bla, and selecting some coordinates xi over the manifold, bla, bla, and Radon-Nikodym theorem, and bla, bla, and we write

P[A] = ∫_{x1,...,xn ∈ A} dvx f (x1, . . . , xn) , (3.83)

where dvx = dx1 ∧ · · · ∧ dxn . Using more elementary notations,

dvx = dx1 dx2 . . . dxn , (3.84)

and equation 3.83 can be rewritten under the non manifestly covariant form

P[A] = ∫_{x1,...,xn ∈ A} dx1 dx2 . . . dxn f (x1, . . . , xn) . (3.85)

The function f (x1, . . . , xn) is called the probability density (associated to the probability distribution P and to the coordinates xi ). It is a density, in the tensorial sense of the term, i.e., under a change of variables x → y it changes according to the Jacobian rule (see below).

Example 3.7 Consider a homogeneous probability distribution at the surface of a sphere of unit radius. When parameterizing a point by its spherical coordinates (θ, ϕ) , the homogeneous probability distribution is represented by the (2D) probability density

f (θ, ϕ) = (1/4π) sin θ , (3.86)

and the probability of a domain A is computed as

P[A] = ∫∫_{(θ,ϕ)∈A} dθ dϕ f (θ, ϕ) , (3.87)

the integral over the whole surface giving one.


A probability density is a density in the tensorial sense of the term. Under a change of variables {x1, . . . , xn} → {y1, . . . , yn} , expression 3.83 becomes

P[A] = ∫_{y1,...,yn ∈ A} dvy g(y1, . . . , yn) , (3.88)

where dvy = dy1 ∧ · · · ∧ dyn . As this identity must hold for any domain A , it must also hold infinitesimally, so we can write

dP = f (x1, . . . , xn) dx1 ∧ · · · ∧ dxn = g(y1, . . . , yn) dy1 ∧ · · · ∧ dyn . (3.89)

We have already seen that the relation between the two capacity elements associated to the two coordinate systems is (see equation 2.44) dy1 ∧ · · · ∧ dyn = (1/X) dx1 ∧ · · · ∧ dxn , so we immediately obtain the Jacobian rule for probability densities

g(y1, . . . , yn) = f (x1, . . . , xn) X(y1, . . . , yn) . (3.90)

Note: explain that the coordinates xi at the right take the values xi = xi(y1, . . . , yn) . Of course, this is the general rule of transformation of scalar densities (equation 2.18, page 29), whether they represent a probability density or any other density. Note that the X appearing in this equation is the determinant of the matrix X^i_j = ∂x^i/∂y^j , not that of the matrix Y^i_j = ∂y^i/∂x^j . Note: I must also give the formula (referred to in the appendices)

g(y1, . . . , yn) = f (x1, . . . , xn) / Y(x1, . . . , xn) , (3.91)

and, perhaps, write it as

g(y) = f ( x(y) ) / Y( x(y) ) . (3.92)
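A one-dimensional numerical sketch of the Jacobian rule may be helpful here, using the change of variable ℓ = 1/ρ mentioned below; the lognormal density chosen for f is an illustrative assumption, not taken from the text.

```python
# Sketch of the Jacobian rule for probability densities (eqs. 3.90-3.92) in one
# dimension, for the change of variable rho -> ell = 1/rho.
# The lognormal density chosen for f is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

def f(rho):                       # probability density of rho (assumed lognormal)
    return np.exp(-0.5 * np.log(rho)**2) / (rho * np.sqrt(2 * np.pi))

# Under ell = 1/rho one has rho(ell) = 1/ell and |d rho / d ell| = 1/ell**2,
# so the Jacobian rule gives g(ell) = f(1/ell) / ell**2.
def g(ell):
    return f(1.0 / ell) / ell**2

# Monte Carlo check: empirical density of ell = 1/rho for samples of rho.
rho_samples = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)
ell_samples = 1.0 / rho_samples
counts, edges = np.histogram(ell_samples, bins=50, range=(0.2, 3.0))
hist = counts / (ell_samples.size * np.diff(edges))   # empirical density estimate
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - g(centers))))   # small: statistical fluctuation only
```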

(Note: correct what follows.) In the literature, the equivalent of equation 3.90 is presented taking the absolute value of the Jacobian determinant. In principle, we do not heed this absolute value here, as we have assumed above that the ‘new variables’ are always classified in a way such that the Jacobian determinant is positive. There remains the case of a one-dimensional variable, that we may treat in an ad hoc way¹⁰.

(Note: I must warn the reader that if there is a natural notion of volume on the manifold, integrating densities may be quite inefficient.)

Note: without a metric, we can not define the conditional probability density.

Note: without the notion of volume, we can not define the intersection of probabilities.

¹⁰ When, for instance, passing from ρ to ℓ = 1/ρ , we have a negative value of the Jacobian ‘determinant’, dℓ/dρ = −1/ρ² .


3.3.2 Image of a Probability Density

The notion of image of a probability function has been introduced in section 3.2.2 for probabilities defined over an arbitrary set. We are now interested in probabilities defined over a manifold, and these probabilities are represented by probability densities. So, our task here is to translate the general definition of image of a probability in terms of probability densities. By the way, of the three notions (image of a probability, intersection of two probabilities, and reciprocal image of a probability) that we now have to revisit (to see the implications when the probabilities are defined over manifolds), it is only the first, that of image of a probability, that does not require any special structure over the manifold. The two other notions (intersection of two probabilities, and reciprocal image of a probability) will only make sense if the manifolds have a notion of volume attached. Even worse, we shall later see that the notion of conditional probability density can only be introduced if the manifolds are metric (i.e., if the distance between points is defined).

So, consider a mapping ϕ from an m-dimensional manifold M into an n-dimensional manifold N . As we wish, in a moment, to introduce probability densities over M and over N , we need to endow each of the two manifolds with a coordinate system. So, let x ≡ {xα} = {x1, . . . , xm} be a coordinate system over M , and let y ≡ {yi} = {y1, . . . , yn} be a coordinate system over N . We can thus write the mapping as x 7→ y = ϕ(x) . Let P be a probability over (the Borel set of) M , with probability density f (x) . Its image via the mapping ϕ is, as defined in section 3.2.2, a probability Q = ϕ[P] defined over (the Borel set of) N , that shall have some probability density g(y) .

One obtains something like (see appendix)

g(y) = ∑_{x : ϕ(x)=y} f (x) / √( det( Φ(x)ᵗ Φ(x) ) ) ( if m ≤ n ) , (3.93)

where Φ is the matrix of partial derivatives of the mapping ϕ . In the case m ≥ n , bla, bla, bla. . .

Note: explain that we can write

g = ϕ[ f ] . (3.94)

For the expression of the image of a probability in terms of volumetric probabilities, see section 3.4.4.

We can give here a version of property 3.3 adapted to probabilities over manifolds. Note: I have to demonstrate here the following property.

Property 3.8 Let ϕ be a mapping from a manifold M into a manifold N , and consider a probability function over (the Borel set of) M , represented, in the given coordinates, by the probability density f (x) . If the set of points x1, x2, . . . is a sample of f (x) , then y1, y2, . . . = ϕ(x1), ϕ(x2), . . . is a sample of g = ϕ( f ) .


Again, to obtain a sample of g = ϕ[ f ] , one should not try to develop an ad hoc method based on the analytic expression of g (equation 3.93 or equation XXX), but, rather, one should just obtain a sample of f and then map the sample as suggested by property 3.8.

Note: figure 3.7 is old, and will probably be suppressed.

Figure 3.7: Left: Mapping from X into Y . Center: the image of a volumetric probability (evaluating the image of a probability density is a difficult problem). Right: the preimage of a volumetric probability (evaluating the preimage of a probability density is an easy problem).

Example 3.8 We examine here a common problem in experimental sciences:

• a set of quantities y1, y2, . . . , yq , that are not directly measurable, are defined in terms of some other, directly measurable quantities x1, x2, . . . , xp ,

y1 = ϕ1(x1, x2, . . . , xp)
y2 = ϕ2(x1, x2, . . . , xp)
. . .
yq = ϕq(x1, x2, . . . , xp) ; (3.95)

• the result of a measurement of the quantities x1, x2, . . . , xp is represented by the probability density f (x1, x2, . . . , xp) ;

• what is the probability density g(y1, y2, . . . , yq) that represents the information induced on the quantities y1, y2, . . . , yq ?

The answer is, of course, g = ϕ( f ) . One may wish to write down explicitly the function g(y) , in which case expression 3.93 or expression XXX has to be used (and the partial derivatives evaluated), or one may just wish to have a set of sample points of g(y) (from which some simple histograms can be drawn), in which case it is much simpler to sample f (x) and to transport the sample points, as sketched below. (Note: explain here that figure 4.7 below shows a result of such a Monte Carlo transport, compared with the result of the analytical transport.)
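A minimal Python sketch of this Monte Carlo transport follows; the measured quantities, their probability density and the mapping ϕ are all illustrative assumptions, not taken from the text.

```python
# Sketch of the Monte Carlo transport of measurement uncertainties described in
# example 3.8. The measured quantities, their probability density and the mapping
# phi are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# "Measurement result" for x = (x1, x2): independent Gaussians (an assumption).
x1 = rng.normal(loc=10.0, scale=0.3, size=n)
x2 = rng.normal(loc=2.0, scale=0.1, size=n)

# Quantity of interest defined through a mapping y = phi(x1, x2) (an assumption).
y = x1 / x2**2

# Any histogram of the transported samples approximates the induced density g(y);
# summary values can be read directly from the sample.
print("mean of y    :", y.mean())
print("68% interval :", np.percentile(y, [16, 84]))
```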

Note: explain why this is intimately related to the ISO recommendation for the“transportation of uncertainties”.


Note: explain here that this is one of the few problems in this book that can be fully and consistently posed in terms of probability densities.

Note: for a detailed description of the standard practice of transportation of uncertainties, see Dietrich, 1991. For a complete description of metrology and calibration see Fluke, 1994.

Note: I should explain here that there is another example where the same question arises, the typical prediction problem in physics: any serious physical theory has to be able to make predictions (that may be confronted with experiments). An engineer, for instance, may wish to predict the load at which a given bridge may collapse, or an astrophysicist may wish to predict the flux of neutrinos from the Sun. In these situations, the parameters defining the system (the bridge or the Sun) may be known with some uncertainties, and these uncertainties shall reflect as an uncertainty on the prediction.


3.3.3 Marginal Probability Density

Note: refer here to section 3.2.7.

The notion of marginal probability density (or marginal volumetric probability) tries to address one simple question: if we have a probability density f (x, y) in two ‘variables’ x and y , and we don’t care much about y , what is the probability density for x alone?

I choose here to develop the notion using the specific setting of this section (3.5.4), that is well adapted to our future needs. So, again, we consider a p-dimensional manifold Rp , metric or not, with some coordinates r = {rα} , and with the capacity element dv̄r . Consider also a q-dimensional manifold Sq , metric or not, with some coordinates s = {sa} , and with the capacity element dv̄s . We build the Cartesian product Mp+q = Rp × Sq of the two manifolds, i.e., we consider the (p + q)-dimensional manifold whose points consist in a couple of points, one in Rp and one in Sq . As coordinates over Mp+q we can obviously choose {r, s} . From the capacity elements dv̄r and dv̄s one can introduce the capacity element dv̄ = dv̄r ∧ dv̄s over Mp+q .

Assume now that some random process produces pairs of random points, one point of the pair in Rp and the other in Sq . In fact, the random process is producing points on Mp+q . We can make a histogram, and, when enough points have materialized, we have a probability density f̄(r, s) , that we assume normalized to one:

∫_{Mp+q} dv̄(r, s) f̄(r, s) = 1 . (3.96)

Instead, one may have just made the histogram of the points on Rp , disregarding the points of Sq , to obtain the probability density f̄r(r) . It is clear that one has

f̄r(r) = ∫_{Sq} dv̄s(s) f̄(r, s) . (3.97)

This function f̄r(r) is called the marginal probability density for the ‘variables’ r . The probability of a domain Dp ⊂ Rp is to be computed via

P(Dp) = ∫_{Dp} dv̄r(r) f̄r(r) , (3.98)

and the probability density is normed to one: ∫_{Rp} dv̄r(r) f̄r(r) = 1 .

Similarly, one may have just made the histogram of the points on Sq , disregarding the points of Rp , to obtain the probability density f̄s(s) . One then has

f̄s(s) = ∫_{Rp} dv̄r(r) f̄(r, s) . (3.99)


This function f̄s(s) is called the marginal probability density for the ‘variables’ s . The probability of a domain Dq ⊂ Sq is to be computed via

P(Dq) = ∫_{Dq} dv̄s(s) f̄s(s) , (3.100)

and the probability density is normed to one: ∫_{Sq} dv̄s(s) f̄s(s) = 1 .

To introduce this notion of marginal probability we have not assumed that the

manifolds are metric. Of course, the definitions also make sense when working in a metric context. Let us introduce the basic formulas. Let the distance element over Rp be ds²r = (gr)αβ drα drβ , and the distance element over Sq be ds²s = (gs)ab dsa dsb . Then, under our hypotheses here, the distance element over Mp+q is ds² = ds²r + ds²s . We have the metric determinants gr = √(det gr) , gs = √(det gs) and g = gr gs . The probability densities above are related to the volumetric probabilities via

f̄(r, s) = g f (r, s) ; f̄r(r) = gr fr(r) ; f̄s(s) = gs fs(s) , (3.101)

while the capacity elements are related to the volume elements via

dv(r, s) = g dv̄(r, s) ; dvr(r) = gr dv̄r(r) ; dvs(s) = gs dv̄s(s) . (3.102)

We can now easily translate the equations above in terms of volumetric probabilities. As an example, the marginal volumetric probability is (equation 3.99)

fs(s) = ∫_{Rp} dvr(r) f (r, s) , (3.103)

the probability of a domain Dq ⊂ Sq is evaluated as (equation 3.100)

P(Dq) = ∫_{Dq} dvs(s) fs(s) , (3.104)

and the volumetric probability is normed to one: ∫_{Sq} dvs(s) fs(s) = 1 .


3.4 Probabilities over Volume Manifolds

3.4.1 Volumetric Probability

Bla, bla, bla. . .

We shall assume that there is a way of defining the “size” or “volume” of the sets under consideration. For discrete sets, the volume of the set will just be the number of its elements. For manifolds, there is an assumption that is commonly made that, I think, is not correct: that there is a “natural” way of defining the volume of a set of points. We have seen in the previous chapter that there is a big difference between a capacity element and a volume element: capacity elements are attached to coordinate systems, and they are not intrinsically defined. The result is that one can not define “the volume of a set of points” intrinsically unless a particular measure has been explicitly introduced: the volume measure. For instance, if the manifold is a metric manifold, the volume measure is deduced from the metric (using the square root of the determinant of the metric, see section 2.2.7). Some of the fundamental operations to be introduced below (like the intersection of two probabilities) can not be defined unless this volume measure is given. So, we start here by introducing such a volume measure, which we will always assume is given beforehand, once and for all.

Definition 3.6 Measure. Given a set Ω and a suitable set F of subsets of Ω , a mea-sure is a mapping M from F into the real interval [0, ∞] that satisfies the following twoproperties

M(∅) = 0 , (3.105)

and for any two sets inside F

M(A1 ∪A2) = M(A1) + M(A2)− M(A1 ∩A2) . (3.106)

Definition 3.7 Volume measure. Given a set Ω and a suitable set F of subsets of Ω , the volume measure, denoted V , is the particular measure such that:

• if the set Ω is discrete, for any A ∈ F , V[A] equals the number of elements in A ;

• if the set Ω is a manifold, V is an explicitly introduced measure, that to any A ∈ F associates the volume of A , denoted

V[A] = ∫_{P∈A} dv(P) (3.107)

(where dv is the volume element), or, if some coordinates x1, . . . , xn are introduced over the manifold,

V[A] = ∫_{P∈A} dv̄(x1, . . . , xn) g(x1, . . . , xn) , (3.108)


where dv̄ is the capacity element associated to the coordinates, and where g is a given density. In a metric manifold, g = ±√| det g | (see section XXX for details).

Definition 3.8 Volume triplet. When a set Ω is given, a suitable set F of subsets of Ω considered, and a volume measure V over F chosen, we shall say that we have a volume triplet, that we shall denote ( Ω , F , V ) .

Bla, bla, bla. . .

Property 3.9 If the set Ω is a (finite-dimensional) manifold, its elements are called points, that we denote P, P′ . . . As the points of a manifold are not numerable, one may choose for F the usual Borel field B ⊂ ℘(Ω) associated to the manifold. Assume that, as explained above, we have introduced a particular volume measure V over F = B , so we have the triplet ( Ω , B , V ) . As already mentioned, to every set A ∈ B we can then associate its volume, that we write as

V[A] = ∫_{P∈A} dv(P) . (3.109)

Then, it follows from the Radon-Nikodym theorem that for any probability P over B there exists a unique positive function f (P) satisfying

∫_{P∈Ω} dv(P) f (P) = 1 , (3.110)

such that for any A ∈ B

P[A] = ∫_{P∈A} dv(P) f (P) . (3.111)

We call f (P) the volumetric probability of the point P .


3.4.1.1 Second Piece of Text

Consider an n-dimensional smooth manifold M , over which a notion of volume (asdefined in the previous chapter) exists. Then, to any domain A ⊂ M one can asso-ciate its volume:

A 7→ V[A] . (3.112)

A particular volume distribution having been introduced over M , once and for all, different ‘probability distributions’ may be considered, that we are about to characterize axiomatically.

We shall say that a probability distribution (or, for short, a probability) has been defined over M if to any domain A ⊂ M we can associate an adimensional real number,

A 7→ P[A] (3.113)

called the probability of A , that satisfies

Postulate 3.1 for any domain A ⊂ M ,

P[A] ≥ 0 ; (3.114)

Postulate 3.2 for disjoint domains of the manifold M , the probabilities are additive:

A1 ∩A2 = ∅ ⇒ P(A1 ∪A2) = P(A1) + P(A2) ; (3.115)

Postulate 3.3 the probability distribution must be absolutely continuous with respect tothe volume distribution, i.e., the probability P(A) of any domain A ⊂ M with vanishingvolume must be zero:

V[A] = 0 ⇒ P[A] = 0 . (3.116)

The probability of the whole manifold M may be zero, it may be finite, or it may be infinite. The first two axioms are due to Kolmogorov (1933). In common texts, there is usually an axiom concerning the behaviour of a probability when we consider an infinite collection¹¹ of sets, A1 , A2 , A3 . . . , but this is a technical issue that I choose to ignore. Our third axiom here is not usually introduced, as the distinction between the ‘volume distribution’ and a ‘probability distribution’ is generally not made: both are just considered as examples of ‘measure distributions’. This distinction shall, in fact, play a major role in the theory that follows.

When the probability of the whole manifold is finite, a probability distribution can be renormalized, so as to have P(M) = 1 . We shall then say that we face an ‘absolute probability’. If a probability distribution is not normalizable, we shall say

¹¹ Presentations of measure theory that pretend to mathematical rigor assume ‘finite additivity’ or, alternatively, ‘countable additivity’. See, for instance, the interesting discussion in Jaynes (1995).


that we have a ‘relative probability’: in that case, what usually matters is not the probability P[A] of a domain A ⊂ M , but the relative probability between two domains A1 and A2 , denoted P( A1 ; A2 ) , and defined as

P( A1 ; A2 ) ≡ P(A1) / P(A2) . (3.117)


3.4.1.2 Third Piece of Text

In what follows, a generic point of the manifold M is denoted with the symbol P , as in P ∈ M . The point of the manifold where some scalar quantity is considered is explicitly designated. For instance, the expression giving the volume of a domain A ⊂ M that in equation 2.92 was simply written V[A] = ∫_A dv shall, from now on, be written

V[A] = ∫_{P∈A} dv(P) . (3.118)

We have just defined a probability distribution over the manifold M that is absolutely continuous with respect to the volume distribution over the manifold. Then, by virtue of the Radon-Nikodym theorem (e.g., Taylor, 1966), one can define over M a volumetric probability f (P) such that the probability of any domain A of the manifold can be obtained as

P[A] = ∫_{P∈A} dv(P) f (P) . (3.119)

Note that this equation makes sense even if no particular coordinate system is defined over the manifold M , as the integral here can be understood in the sense suggested in figure 2.3. If a coordinate system x = {x1, . . . , xn} is defined over M , we may well wish to write equation 3.119 as

P[A] = ∫_{x∈A} dvx(x) fx(x) , (3.120)

where, now, dvx(x) is to be understood as the special expression of the volume element in the coordinates x . One may be interested in using the volume element dvx(x) directly for the integration (as suggested in figure 2.3). Alternatively, one may wish to use the coordinate lines for the integration (as suggested in figure 2.4). In this case, one writes (equation 2.84)

dvx(x) = gx(x) dv̄x(x) , (3.121)

to get

P[A] = ∫_{x∈A} dv̄x(x) gx(x) fx(x) . (3.122)

Using dv̄x(x) = dx1 ∧ · · · ∧ dxn (equation 2.39) and gx(x) = √(det g(x)) (equation 2.72), this expression can be written in the more explicit form

P[A] = ∫_{x∈A} dx1 ∧ · · · ∧ dxn √(det g(x)) fx(x) . (3.123)

These two (equivalent) expressions may be useful for analytical developments, butnot for numerical evaluations, where one should choose a direct handling of expres-sion 3.120.


3.4.1.3 Fourth Piece of Text

In a change of coordinates x 7→ y(x) , the expression 3.120

P[A] = ∫_{x∈A} dvx(x) fx(x) (3.124)

becomes

P[A] = ∫_{y∈A} dvy(y) fy(y) , (3.125)

where dvy(y) and fy(y) are respectively the expressions of the volume element and of the volumetric probability in the coordinates y . These are actual invariants (in the tensorial sense), so, when comparing this equation (written in the coordinates y ) to equation 3.120 (written in the coordinates x ), one simply has, at every point,

fy = fx ; dvy = dvx , (3.126)

or, to be more explicit,

fy(y) = fx( x(y) ) ; dvy(y) = dvx( x(y) ) . (3.127)

That under a change of variables x → y one has fy = fx for volumetric probabilities is an important property. It contrasts with the property found in usual texts (where the Jacobian of the transformation appears): remember that we are considering here volumetric probabilities, not the usual probability densities.

Later in this text, a systematic procedure for representing volumetric probabilities under a change of variables is suggested (see figures 4.7, 4.10 and 4.11).


3.4.2 Volumetric Histograms and Density Histograms

A volumetric probability or a probability density can be obtained as the limit of a histogram. It is important to understand which kind of histogram produces as a limit a volumetric probability and which other kind of histogram produces a probability density.

In short, when counting the number of samples inside a division of the space into cells of equal volume one obtains a volumetric probability. If, instead, one divides the space into cells of equal capacity (i.e., in fact, one divides the space using constant coordinate increments), one obtains a probability density.

Figure 3.8 presents a one-dimensional example of the building of a histogram. The manifold M under consideration here is a one-dimensional manifold where each point represents the volumetric mass ∆M/∆V of a rock. As a ‘coordinate’ over this one-dimensional manifold, we can use the value ρ = ∆M/∆V , but we could use its inverse ℓ = 1/ρ = ∆V/∆M as well (or, as considered below, the logarithmic volumetric mass). Let us first use the volumetric mass ρ . To make a histogram whose limit would be a volumetric probability we should divide the one-dimensional manifold M into cells of equal ‘one-dimensional volume’, i.e., of equal length. This requires that a definition of the distance between two points of the manifold is used.

For reasons exposed in chapter ??, the ‘good’ definition of distance between the point ρ1 and the point ρ2 is D = | log(ρ2/ρ1) | . When dividing the manifold M into cells of constant length ∆D one obtains the histogram in the middle of the figure, whose limit is a (one-dimensional) volumetric probability f (ρ) .

When, instead of dividing the manifold M in cells of constant length ∆D , one uses cells of constant ‘coordinate increment’ ∆ρ , one obtains a different histogram, displayed at the top of the figure. Its limit is a probability density f̄(ρ) .

The relation between the volumetric probability f (ρ) and the probability density f̄(ρ) is that expressed in equation ??. As ∆D = log( (ρ + ∆ρ)/ρ ) = ∆ρ/ρ + . . . , in this one-dimensional example, the equivalent of equation 3.121 is

dD = (1/ρ) dρ . (3.128)

Therefore, equation ?? here becomes

f̄(ρ) = f (ρ)/ρ , (3.129)

this being the relation between the volumetric probability and the probability density obtained in figure 3.8.

The advantage of using the histogram that produces a probability density is that one does not need to care about possible definitions of length or volume: whatever the coordinate being used, ρ or ℓ = 1/ρ , one has only to consider constant coordinate increments, δρ or δℓ , and make the histogram.

The advantage in using the histogram that produces a volumetric probability is that, as is obvious in the figure, when dividing the manifold under consideration into cells


Figure 3.8: Histograms of the volumetric mass of the 571 different rock types quoted by Johnson and Olhoeft (1984). The histogram at the top, where cells with constant values of ∆ρ have been used, has a probability density f̄(ρ) as limit. The histogram in the middle, where cells with constant length ∆D = ∆ρ/ρ have been used, has a volumetric probability f (ρ) as limit. The relation between the two is f̄(ρ) = f (ρ)/ρ (see text for details). In reality there is one natural variable for this problem (bottom), the logarithmic volumetric mass ρ∗ = log(ρ/ρ0) , as for this variable, intervals of constant length are also intervals with constant increment of the variable. [The horizontal axes run from 0 to 20 g/cm³ for ρ , and from 0.0 to 1.5 for ρ∗ = log10(ρ/ρ0) , with ρ0 = 1 g/cm³ .]


of equal volume (here, equal length), the number of samples inside each cell tends to be more equilibrated, and the histogram converges more rapidly to significant values.
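The following sketch builds the two kinds of histogram for a synthetic data set; the real rock data of Johnson and Olhoeft are not reproduced here, and the lognormal sample is an illustrative assumption.

```python
# Sketch of the two kinds of histogram discussed above, for a synthetic data set
# (a lognormal sample is an illustrative assumption).
import numpy as np

rng = np.random.default_rng(2)
rho = rng.lognormal(mean=1.0, sigma=0.5, size=5000)    # synthetic "volumetric mass"

# Density histogram: cells of constant coordinate increment (constant delta rho).
dens_counts, dens_edges = np.histogram(rho, bins=20)

# Volumetric histogram: cells of constant length delta D = delta rho / rho,
# i.e. cells of constant increment of log(rho).
log_edges = np.linspace(np.log(rho.min()), np.log(rho.max()), 21)
vol_counts, _ = np.histogram(np.log(rho), bins=log_edges)

print("equal-increment cells :", dens_counts)   # limit: probability density
print("equal-length cells    :", vol_counts)    # limit: volumetric probability
```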

Example 3.9 Note: explain here what is a volumetric histogram and a density histogram. Say that while the limit of a volumetric histogram is a volumetric probability, the limit of a density histogram is a probability density. Note: introduce the notion of ‘naïve histogram’. Consider a problem where we have two physical properties to analyze. The first is the property of electric resistance-conductance of a metallic wire, as it can be characterized, for instance, by its resistance R or by its conductance C = 1/R . The second is the ‘cold-warm’ property of the wire, as it can be characterized by its temperature T or its thermodynamic parameter β = 1/kT ( k being the Boltzmann constant). The ‘parameter manifold’ is, here, two-dimensional. In the ‘resistance-conductance’ manifold, the distance between two points, characterized by the resistances R1 and R2 , or by the conductances C1 and C2 , is, as explained in section XXX,

D = | log( R2 / R1 ) | = | log( C2 / C1 ) | . (3.130)

Similarly, in the ‘cold-warm’ manifold, the distance between two points, characterized by the temperatures T1 and T2 , or by the thermodynamic parameters β1 and β2 , is

D = | log( T2 / T1 ) | = | log( β2 / β1 ) | . (3.131)

A homogeneous probability distribution can be defined as . . . Bla, bla, bla. . . In figure ??, the two histograms that can be made from the first two diagrams give the volumetric probability. The naïve histogram that could be made from the diagram at the right would give a probability density.

Figure 3.9: Note: explain here how to make a volumetric histogram. Explain that when the electric resistance or the temperature span orders of magnitude, some diagrams become totally impractical. [The axes of the figure are labelled ν∗ = log10(ν/ν0) , with ν0 = 1 Hz .]


Figure 3.10: Note: write this caption. [The axes of the figure are labelled in Hz (ticks from 0 Hz to 1500 Hz).]


3.4.3 Homogeneous Probability Function

kakakakakaka

Definition 3.9 Homogeneous probability. Let a volume triplet ( Ω , F , V ) be given.If the volume measure of the whole set, V(Ω) , is finite, then one can introduce the ho-mogeneous probability, denoted H , that, by definition, to any set A ∈ F associates aprobability H[A] proportional to V[A] , the volume measure of A .

Example 3.10 If Ω is a set with a finite number of elements, say n , then the homogeneous probability H associates, to every element ω ∈ Ω , the (constant) elementary probability

p(ω) = 1/n . (3.132)

Example 3.11 If Ω is a manifold with finite volume measure V(Ω) , then the homogeneous probability H associates, to every point P ∈ Ω , the (constant) volumetric probability

f (P) = 1/V(Ω) . (3.133)

Definition 3.10 Step probability. Let a volume triplet ( Ω , F , V ) be given. To everysubset A ∈ F with finite volume measure V[A] , we associate a probability, denoted HA ,that to any A′ ∈ F associates a probability proportional to the volume measure of A∩A′ .

Example 3.12 For a discrete probability, let k be the number of elements in A , k = V[A] . The elementary probability associated to HA is (see figure 3.11)

p(ω) = 1/k if ω ∈ A ; 0 if ω ∉ A . (3.134)

Example 3.13 For a manifold, let V[A] be the volume measure of A . The volumetric probability associated to HA is

f (P) = 1/V[A] if P ∈ A ; 0 if P ∉ A . (3.135)


Figure 3.11: The step probability HA associated to a set A . The values of the elementary probability associated to HA are proportional to the values of the indicator of the set A (compare this with figure 1.1). [In the figure, the four elements of A carry the elementary probability 1/4 and all the other elements carry 0 .]

Example 3.14 (Note: compare this with example 3.7.) Consider a homogeneous probability distribution at the surface of a sphere of unit radius. When parameterizing a point by its spherical coordinates (θ, ϕ) , the associated (2D) volumetric probability is

f (θ, ϕ) = 1/(4π) . (3.136)

The probability of a domain A of the surface is computed as

P[A] = ∫∫_{(θ,ϕ)∈A} dS(θ, ϕ) f (θ, ϕ) , (3.137)

where dS(θ, ϕ) = sin θ dθ dϕ is the usual surface element of the sphere. The total probability (over the whole sphere) of course equals one.

kaakakakaka

Note: explain that the measure (or size) of a discrete set is the number of its elements. The measure of a set A ⊆ A0 can be expressed as

M[A] = ∑_{a∈A} χA0(a) , (3.138)

where χA is the indicator function associated to a set A . If the measure of the whole set A0 is finite, say M0 ,

M0 = M[A0] , (3.139)


then we can introduce a probability function, denoted H , that to any A ⊆ A0 associates the probability value

H[A] = M[A] / M0 . (3.140)

Then,

H[A] = ∑_{a∈A} h(a) , (3.141)

with the elementary probability function

h(a) = χA0(a) / M0 . (3.142)

kakakakakaka

If the set is a manifold, there is no “natural” measure of a set (one that would be independent of any coordinate system), and, for this reason, we have assumed above the existence of a particular volume element dv . Then, the measure (or volume) of a set A ⊆ M is

M[A] = ∫_{P∈A} dv . (3.143)

A special definition must be introduced. This is typically done by choosing an arbitrary system of coordinates and, in those coordinates, selecting one particular “volume density” (or “measure density”).

Note: rewrite this section.

A probability distribution P is homogeneous if the probability associated by P to any domain A ⊂ M is proportional to the volume of the domain, i.e., if there is a constant k such that for any A ⊂ M ,

P[A] = k V[A] . (3.144)

The homogeneous probability distribution is not necessarily normed. It follows immediately from this definition that the volumetric probability f associated to the homogeneous probability distribution is constant:

f (P) = k . (3.145)

Should one not work with volumetric probabilities, but with probability densities, one should realize that the probability density f̄ associated to the homogeneous probability distribution is not necessarily constant. For a probability density depends on the coordinates being used. When using a coordinate system x = {xi} , where


the metric determinant takes the value gx(x) , the probability density representing the homogeneous probability distribution is (using equation ??)

f̄x(x) = k gx(x) . (3.146)

For instance, in the physical 3D space, when using spherical coordinates, the ‘homogeneous probability density’ (this is a short name for ‘the probability density representing the homogeneous probability distribution’) is f̄(r, θ, ϕ) = k r² sin θ .


3.4.4 Image of a Volumetric Probability

In section 3.3.2 we have obtained the expression for the image of a probability function (defined over a manifold) in terms of probability densities. This has been possible because the notion of volume is not necessary to define the image of a probability density through a mapping. We arrived at (equation 3.93, writing here f̄ and ḡ for the probability densities)

ḡ(y) = ∑_{x : ϕ(x)=y} f̄(x) / √( det( Φ(x)ᵗ Φ(x) ) ) ( if m ≤ n ) . (3.147)

In the case n ≤ m , bla, bla, bla. . .

Now, should the manifolds have a notion of volume defined, there would be, for the coordinates x and y being used over the two manifolds, the volume densities (see chapter 2) vM(x) and vN(y) , so that the respective volume elements can be written in terms of the capacity elements as

dvM(x) = vM(x) dv̄M ; dvN(y) = vN(y) dv̄N , (3.148)

and the probability densities would be related to the volumetric probabilities as

f̄(x) = vM(x) f (x) ; ḡ(y) = vN(y) g(y) . (3.149)

Equation 3.147 would then translate into

g(Q) = ∑_{P : ϕ(P)=Q} f (P) vM(P) / ( vN( ϕ(P) ) √( det( Φ(P)ᵗ Φ(P) ) ) ) ( if m ≤ n ) (3.150)

(note: explain this). Should the two volume elements derive, in fact, from two metric tensors,

vM(P) = √( det γM(P) ) ; vN(Q) = √( det γN(Q) ) , (3.151)

and we could rewrite equation 3.150 as

g(Q) = ∑_{P : ϕ(P)=Q} f (P) √( det γ(P) ) / √( det( Φ(P)ᵗ Γ( ϕ(P) ) Φ(P) ) ) ( if m ≤ n ) . (3.152)


3.4.5 Intersection of Volumetric Probabilities

Note: say that the definition of intersection of probability functions is in section 3.2.3

Property 3.10 Assume that the set Ω is a manifold, and that we choose for F the usual Borel field of Ω . Let P1 and P2 be two probabilities, with volumetric probabilities respectively denoted f1 and f2 . The volumetric probability representing P1 ∩ P2 , that we may denote f1 ∩ f2 , is given, for any point P ∈ Ω , by

( f1 ∩ f2)(P) = (1/ν) f1(P) f2(P) , (3.153)

where ν is the normalization factor ν = ∫_{P∈Ω} dv(P) f1(P) f2(P) .

Example 3.15 Product of probability distributions. Let S represent the surface of the Earth, using geographical coordinates (longitude ϕ and latitude λ ). An estimation of the position of a floating object at the surface of the sea by an airplane navigator gives a probability distribution for the position of the object corresponding to the (2D) volumetric probability f(ϕ, λ) . By definition, then, the probability that the floating object is inside some region A of the Earth's surface is P[A] = ∫_A dS(ϕ, λ) f(ϕ, λ) , where dS(ϕ, λ) = cos λ dϕ dλ . An independent (and simultaneous) estimation of the position by another airplane navigator gives a probability distribution corresponding to the volumetric probability g(ϕ, λ) . How should the two volumetric probabilities f(ϕ, λ) and g(ϕ, λ) be 'combined' to obtain a 'resulting' volumetric probability? The answer is given by the intersection of the two volumetric probabilities:

( f · g )(ϕ, λ) = f(ϕ, λ) g(ϕ, λ) / ∫_S dS(ϕ, λ) f(ϕ, λ) g(ϕ, λ) .    (3.154)
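A short numerical sketch of equation 3.154 (Python with NumPy; the two 'estimations' below are arbitrary, hypothetical functions, and the grid is crude) combines two volumetric probabilities on the sphere using the surface element dS = cos λ dϕ dλ :

import numpy as np

# Geographical coordinates: longitude phi in (-pi, pi), latitude lam in (-pi/2, pi/2).
phi = np.linspace(-np.pi, np.pi, 361)
lam = np.linspace(-np.pi / 2, np.pi / 2, 181)
PHI, LAM = np.meshgrid(phi, lam, indexing="ij")
dS = np.cos(LAM) * (phi[1] - phi[0]) * (lam[1] - lam[0])   # surface element cos(lam) dphi dlam

# Two hypothetical position estimates, given as 2D volumetric probabilities.
def bump(phi0, lam0, s):
    return np.exp(-((PHI - phi0) ** 2 + (LAM - lam0) ** 2) / (2.0 * s ** 2))

f = bump(0.10, 0.40, 0.1)
g = bump(0.18, 0.35, 0.1)

# Equation 3.154: normalized product of the two volumetric probabilities.
fg = f * g
fg /= (fg * dS).sum()

print((PHI * fg * dS).sum(), (LAM * fg * dS).sum())   # mean position of the combined estimate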


3.4.6 Reciprocal Image of a Volumetric Probability

If the sets A0 and B0 are manifolds, with points respectively denoted A, A′ . . . and B, B′ . . . , then the volumetric probability A ↦ f(A) representing the probability P = ϕ⁻¹[Q] can be expressed in terms of the volumetric probability B ↦ g(B) representing Q via

f(A) = (1/ν) g( ϕ(A) ) ,    (3.155)

where ν = ∫_{A∈A0} dV g( ϕ(A) ) . We shall use the notation

f = ϕ⁻¹(g) .    (3.156)

See figure 3.7.


3.4.7 Compatibility Property

Note: remember that we have

ϕ( f ∩ ϕ⁻¹(g) ) = ϕ( f ) ∩ g .    (3.157)

Properties like this one suggest that volumetric probabilities can be interpreted as "fuzzy sets". It is then arguable that probability theory is the fuzzy set theory.


3.4.8 Assimilation of Uncertain Observations (the Bayes-Popper Approach)

Note: the present text is not Bayes-Popper. It has to be modified. And I have to find an example where the "approach" becomes a theorem.

An expression like x ∈ A has a crystal clear meaning: the element x belongs to the set A . I want to give meaning to the expression x ε f , where f is a volumetric probability. (For the time being, I use here the symbol ε instead of the symbol ∈ , with the hope of being later able to use a unique symbol.) It could be something like "the point x is a sample point of the volumetric probability f(x) ". But this seems not to lead to simple consequences. So, at present, my guess is that the right definition is as follows:

When f is a volumetric probability, the expression x ε f means that "the actual value of the quantity x is unknown, but, should you have to place bets on what this actual value could be, we inform you that you should place bets according to the volumetric probability f(x) ".

For short:

The expression x ε f is to be read " x PYBAT f(x) ", this meaning "for x Place Your Bets According To f(x) ".

Now, if equation XXX XXX XXX XXX remains valid for volumetric probabilities (it has become equation ??), will equation 1.42 also remain valid, i.e., can we demonstrate that the expression

x ε f AND ϕ(x) ε p ⇒ x ε f′ AND ϕ(x) ε p′ , where f′ = f ∩ ϕ⁻¹(p) , p′ = ϕ( f ) ∩ p ,    (3.158)

is a theorem? The meaning of the theorem would be as follows.

There is some given (but unknown) value of x and the associated value ϕ(x) .

• If we are informed that we should place our bets on the actual value of x according to f(x) ,

and, independently,

• we are informed that we should place our bets on the actual value of y = ϕ(x) according to p(y) ,

then, we must conclude that, in fact,


• we should place our bets on the actual value of x according to f′(x) ,

and

• we should place our bets on the actual value of y = ϕ(x) according to p′(y) .

This conjecture may be false. But, if it is true, this would be, at last, a justification of the formulas that Tarantola (1987, 2005) and Mosegaard and Tarantola (1995) have been using.

I find it quite pleasant that there is some flavor of "game theory" in this conjecture. I am working to demonstrate it.


3.4.9 Popper-Bayes Algorithm

Let M_p be a p-dimensional manifold, with volume element dv_p , and let M_q be a q-dimensional manifold, with volume element dv_q . The points of M_p are denoted P, P′ . . . , while the points of M_q are denoted Q, Q′ . . . . Let f be a volumetric probability over M_p and let ϕ be a volumetric probability over M_q . Let

P ↦ Q = a(P)    (3.159)

be an application from M_p into M_q . Let P be a sample point of f , and submit this point to a survive-or-perish test, the probability of survival being

π = ϕ( a(P) ) / ϕ_max    (3.160)

(and the probability of perishing being 1 − π ). If the point P perishes, then we start again the survive-or-perish test with a second sample point of f , and so on, until one point survives. Property: the surviving point is a sample point of a volumetric probability h over M_p whose normalized expression is

h(P) = (1/ν) f(P) ϕ( a(P) ) ,    (3.161)

the normalizing constant being ν = ∫_{M_p} dv_p f(P) ϕ( a(P) ) . [NOTE: this is still a conjecture, that shall be demonstrated here.]
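Whether or not the stated property is eventually demonstrated, the survive-or-perish rule itself is easy to implement. Here is a minimal one-dimensional Python sketch (NumPy assumed); f is sampled by a Gaussian generator, a(P) is a nonlinear map, and phi_vol stands for the volumetric probability over M_q — all of them hypothetical choices made only to show the mechanics:

import numpy as np

rng = np.random.default_rng(2)

def sample_f():                   # a sampler of the volumetric probability f over M_p
    return rng.normal(3.0, 1.0)

def a(P):                         # the application P -> Q = a(P)
    return P ** 2

def phi_vol(Q):                   # volumetric probability over M_q, scaled so its maximum is 1
    return np.exp(-0.5 * ((Q - 9.0) / 4.0) ** 2)

def popper_bayes_sample():
    while True:
        P = sample_f()                        # candidate sample point of f
        if rng.uniform() < phi_vol(a(P)):     # survives with probability phi(a(P)) / phi_max
            return P                          # surviving point: a sample of h ~ f(P) phi(a(P))

samples = np.array([popper_bayes_sample() for _ in range(20000)])
print(samples.mean(), samples.std())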


3.4.10 Exercise

This exercise is all explained in the caption of figure 3.12. The exercise is the same as that in section 1.4.1, except that while here we use volumetric probabilities, in section 1.4.1 we use sets. The initial (truncated) Gaussians here (initial information on x and on y ) are such that the "two-sigma" intervals of the Gaussians correspond to the intervals defining the sets in section 1.4.1. Therefore, the results are directly comparable.



Figure 3.12: Two quantities x and y have some definite values x_true and y_true , that we try to uncover. For the time being, we have some (independent) information on these two quantities, represented by the volumetric probabilities f(x) and p(y) (black functions in the figure). We then learn that, in fact, y_true = ϕ(x_true) , with the function x ↦ y = ϕ(x) = x² − (x − 3)³ represented at the top-right. To obtain the volumetric probability representing the "posterior" information on x_true one proceeds as follows. The volumetric probability p(y) is transported from the ordinate to the abscissa axis, by application of the notion of preimage of a volumetric probability. This gives the volumetric probability [ϕ⁻¹(p)](x) , by application of the formula ??. This function is represented in blue (dashed line) at the bottom of the figure. The volumetric probability representing the posterior information on x_true is then obtained by the intersection of f(x) and of [ϕ⁻¹(p)](x) , using formula ??. This gives the function g(x) represented in red at the bottom. In total, then, we have evaluated g = f ∩ ϕ⁻¹(p) . To obtain the "posterior" information on y_true we just transport the g(x) just obtained from the abscissa into the ordinate axis, i.e., we compute the image q = ϕ(g) of g using formula ??. This gives the function q(y) represented in red at the left of the figure. We then have q = ϕ(g) = ϕ( f ∩ ϕ⁻¹(p) ) . We could have arrived at q(y) following a different route. We could first have evaluated the image of f(x) , to obtain the function [ϕ( f )](y) , represented in blue (dashed line) at the left. The intersection of p(y) with [ϕ( f )](y) then gives the same q(y) , because, as demonstrated in the text, ϕ( f ∩ ϕ⁻¹(p) ) = ϕ( f ) ∩ p . Note: the original functions f(x) and p(y) were two Gaussians (respectively centered at x = 3 and y = 18 , and with standard deviations σ_x = 1 and σ_y = 2 ), truncated inside the working intervals x ∈ [ −1/2 , 7 ] and y ∈ [ ϕ(−1/2) , ϕ(7) ] .


3.5 Probabilities over Metric Manifolds

Note: explain here that to introduce the notion of "conditional volumetric probability given the constraint Q = ϕ(P) ", we need more than volume manifolds, we need metric manifolds (in order to be able to take a "uniform limit").

3.5.1 Conditional Probability

3.5.1.1 First Piece of Text

Note: explain here that a condition is a subset.

• Example: a ≥ 3 . ( P( a | a ≥ 3 ) )

• Example: b = a² . ( P( a, b | b = a² ) )

• Example: b ≠ a² . ( P( a, b | b ≠ a² ) )

Consider a probability P over a set A0 , and a given set C ⊆ A0 of nonzero probability (i.e., such that P[C] ≠ 0 ). The set C is called "the condition". The conditional probability (with respect to the condition C ) is, by definition, the probability (over A0 ) that to any A ⊆ A0 associates the number, denoted P[ A | C ] , defined as

P[ A | C ] = P[ A ∩ C ] / P[C] .    (3.162)

This number is called "the conditional probability of A given (the condition) C ".

This, of course, is the original Kolmogorov definition of conditional probability, so, to demonstrate that this defines, indeed, a probability, we can just outline here the original demonstration. (Note: do it!)

Example 3.16 A probability P over a discrete set A0 with elements a, a′ . . . is represented by the elementary probability p defined, as usual, as p(a) = P[a] . Then, the probability of any set A ⊆ A0 is P[A] = Σ_{a∈A} p(a) . Given, now, a set C ⊆ A0 , the elementary probability that represents the probability P[ · | C ] , denoted p( · | C ) , is given (for any a ∈ A0 ) by

p( a | C ) = p(a) / ν   if a ∈ C ;  0   if a ∉ C ,    (3.163)

where ν is the normalizing constant ν = Σ_{a∈C} p(a) . Then, for any A ⊆ A0 , P[ A | C ] = Σ_{a∈A} p( a | C ) .


Example 3.17 Let us consider a probability over ℕ×ℕ . By definition, then, each element a ∈ ℕ×ℕ is an ordered pair of natural numbers, that we may denote a = (n, m) . To any probability P over ℕ×ℕ is associated a nonnegative real function p(n, m) defined as

p(n, m) = P[(n, m)] .    (3.164)

Then, for any A ⊆ ℕ×ℕ ,

P[A] = Σ_{(n,m)∈A} p(n, m) .    (3.165)

As suggested above, we call p(n, m) the elementary probability of the element (n, m) . While P associates a number to every subset of ℕ×ℕ , p associates a number to every element of ℕ×ℕ . (Note: I have already introduced this notion above; say where.) Introduce now the condition m = 3 . This corresponds to the subset C of ℕ×ℕ made by all pairs of numbers of the form (n, 3) . If P is such that P[C] ≠ 0 , we can introduce the conditional probability P[ · | C ] . To this conditional probability is associated an elementary probability q(n, m) , that we may also denote p( n, m | m = 3 ) . It can be expressed as

q(n, m) = p( n, m | m = 3 ) = p(n, m) / Σ_{n′∈ℕ} p(n′, m)   if m = 3 ;  0   if m ≠ 3 .    (3.166)

Note that, although the elementary probability q(n, m) takes only nonzero values when m = 3 , it is a probability over ℕ×ℕ , not a probability over ℕ .
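A small numerical companion to this example (Python with NumPy; the grid truncation and the elementary probability chosen below are arbitrary, hypothetical choices): the conditioning on m = 3 of equation 3.166 amounts to zeroing everything outside the row m = 3 and renormalizing.

import numpy as np

# A truncated grid {0, ..., N-1} x {0, ..., N-1} standing in for the full product set.
N = 30
n, m = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
p = 0.7 ** n * 0.5 ** m                 # some elementary probability (up to a constant)
p /= p.sum()                            # normalize P over the (truncated) grid

# Condition m = 3 (equation 3.166): keep the row m = 3, renormalize, zero elsewhere.
q = np.where(m == 3, p, 0.0)
q /= q.sum()

print(q.sum())                          # 1.0 : q is again a probability over the product set
print(np.flatnonzero(q.sum(axis=0)))    # [3] : nonzero only where m = 3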

The definition of conditional probability applies equally well when we deal with discrete probabilities or when we deal with manifolds. But when working with manifolds, there is one particular circumstance that needs clarification: when, instead of conditioning the original probability by a subset of points that has (as a manifold) the same dimension as the original manifold, we consider a submanifold of lower dimension. As this situation shall have a tremendous practical importance (when dealing with the so-called inverse problems), let us examine it here.

Consider a manifold M with m dimensions, and let C be a (strict) submanifold of M . Denoting by c the number of dimensions of C , then, by hypothesis, c < m . Consider now a probability P defined over M . As usual, P[A] then denotes the probability of any set of points A ⊆ M . Can we define P[ A | C ] , the conditional probability over M given the (condition represented by the) submanifold C ? The answer is negative (unless some extra ingredient is added). Let us see this.

(Note: explain here that we need to take a "uniform limit", and, for that, we need a metric manifold.)

Example 3.18 Let M be a finite-dimensional manifold, with points denoted P, P′ . . . , and let C be a submanifold of M , i.e., a manifold contained in M and having a smaller number of dimensions. Let now P denote a probability function over M : to any set A ⊆ M it associates the (unconditional) probability value P[ A ] . How can we characterize the conditional probability function P[ · | C ] that to any set A ⊆ M associates the probability value P[ A | C ] ? This can be done by considering some set C̄ ⊆ M (having the same dimension as M ), and taking a uniform limit C̄ → C . Such a limit can only be taken if the manifold M is metric (note: explain why). This has two implications. First, there is a volume element over M , that we may denote dv_M , and, therefore, the unconditional probability P can be represented by a volumetric probability f(P) , such that, for any A ⊆ M ,

P[ A ] = ∫_{P∈A} dv_M f( P ) .    (3.167)

The second implication is that a metric over the manifold M induces a metric over any submanifold of M . Therefore, there will also be a volume element dv_C over C (whose expression we postpone to example 3.19). Given this volume element, we can introduce the conditional volumetric probability over C , that we may denote f( P | C ) . By definition, then, for any set B ⊆ C ,

P[ B | C ] = ∫_{P∈B} dv_C f( P | C ) .    (3.168)

It turns out (note: refer here to the demonstration in the appendixes) that (at every point P of C ) the conditional volumetric probability equals the unconditional one:

f( P | C ) = f( P ) .    (3.169)

This result is not very surprising, as we have introduced volumetric probabilities to have this kind of simple result (that would not hold if working with probability densities). It remains, in this example, that the most important problem —to express the induced volume element dv_C — is not solved (it is solved below).

Example 3.19 Let M and N be two finite-dimensional manifolds, with points respectively denoted P, P′ . . . and Q, Q′ . . . , and let ϕ be a mapping from M into N . We can represent this mapping as P ↦ Q = ϕ(P) . The points (P, Q) of the manifold M×N that satisfy Q = ϕ(P) , i.e., the points of the form ( P , ϕ(P) ) , constitute a submanifold, say C , of M×N . The dimension of C equals the dimension of M . Consider now a probability P over M×N . By definition, the (unconditional) probability value of any set A ⊆ M×N is P[A] . What is the conditional probability P[ A | Q = ϕ(P) ] ? Again, one starts by introducing some set C̄ ⊆ M×N , and tries to define the conditional probability function by taking a uniform limit C̄ → C . To define a uniform limit, we need a metric over M×N . Typically, one has a metric over M , with distance element ds²_M , and a metric over N , with distance element ds²_N , and one defines ds²_{M×N} = ds²_M + ds²_N . Let g_M denote the metric tensor over M and let g_N denote the metric tensor over N . Then, the metric tensor over M×N is g_{M×N} = g_M ⊗ g_N , and, as demonstrated in appendix XXX, the metric tensor induced over C is

g_C = g_M + Φᵗ g_N Φ ,    (3.170)


where Φ denotes the (linear) tangent mapping to the mapping ϕ (at the considered point), and where Φᵗ is the transpose of this linear mapping. We then have (i) a volume element dv_M over M (remember that a choice of coordinates over a metric manifold induces a capacity element dv̄ , and that the volume element can then be expressed as dv = g dv̄ , with g = √det g ), (ii) a volume element dv_N over N , (iii) a volume element dv_{M×N} = dv_M dv_N (note: check this!) over M×N , and (iv) a volume element dv_C over C (the volume element associated to the metric given in equation 3.170). We can now come back to the original (unconditional) probability P[ · ] defined over M×N . Associated to it is the (unconditional) volumetric probability f(P, Q) , and, by definition, the probability of any set A ⊆ M×N is

P[ A ] = ∫_{(P,Q)∈A} dv_{M×N} f(P, Q) .    (3.171)

The conditional volumetric probability f( P , Q | Q = ϕ(P) ) , associated to the conditional probability function P[ · | Q = ϕ(P) ] , is, by definition, such that the conditional probability of any set B ⊆ C is obtained as

P[ B | Q = ϕ(P) ] = ∫_{(P,Q)∈B} dv_C f( P , Q | Q = ϕ(P) ) .    (3.172)

But, for the reason explained in the previous example, for any point (on C ), f( P , Q | Q = ϕ(P) ) = f(P, Q) , so, for any set B ⊆ C ,

P[ B | Q = ϕ(P) ] = ∫_{(P,Q)∈B} dv_C f(P, Q) .    (3.173)

There are two differences between the sums at the right in expressions 3.171 and 3.173: (i) the first sum is over a subset of the manifold M×N , while the second sum is over a subset of the submanifold C ; and (ii) in the first sum, the volume element is the original volume element over M×N , while in the second sum, it is the volume element over C that is associated to the induced metric g_C (expressed in equation 3.170). (Note: say here that the expression of this volume element when some coordinates are chosen over the manifolds is given elsewhere in this text.)

Note: In the next section, the marginal of the conditional is defined. Using the formulas obtained elsewhere in this text (see section 3.5.4), we obtain the result —for the marginal of the conditional— that the probability of any set A ⊆ M is

P_M[ A | Q = ϕ(P) ] = (1/ν) ∫_{P∈A} dv_M ω(P) f( P , ϕ(P) ) ,    (3.174)

where

ω(P) = √det( g_M + Φᵗ g_N Φ ) ,    (3.175)

and where ν is a normalization constant. In a typical "inverse problem" one has

f(P, Q) = g(P) h(Q) ,    (3.176)


and equation 3.174 becomes

P_M[ A | Q = ϕ(P) ] = (1/ν) ∫_{P∈A} dv_M ω(P) g(P) h( ϕ(P) ) .    (3.177)

While the "prior" volumetric probability is g(P) , we see that the "posterior" volumetric probability is

i(P) = (1/ν) g(P) L(P) ,    (3.178)

where the "likelihood volumetric probability" L(P) is

L(P) = ω(P) h( ϕ(P) ) .    (3.179)

Note that we have the factor ω(P) that is not there when, instead of the "marginal of the conditional" approach, we follow the "mapping of probabilities" approach.
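A one-dimensional Python sketch of equations 3.177–3.179 (NumPy assumed) may make the role of the factor ω(P) concrete. Here M and N are taken one-dimensional and Euclidean, so that g_M = g_N = 1 and ω(P) = √(1 + ϕ′(P)²); the prior g , the mapping ϕ and the data term h are hypothetical choices made only for illustration:

import numpy as np

P = np.linspace(0.0, 4.0, 2001)
dv = P[1] - P[0]

def phi(P):
    return P ** 3 - 2.0 * P

def dphi(P):
    return 3.0 * P ** 2 - 2.0

def h(Q):                                              # information on Q = phi(P)
    return np.exp(-0.5 * ((Q - 4.0) / 1.0) ** 2)

g = np.exp(-0.5 * ((P - 2.0) / 0.8) ** 2)              # "prior" volumetric probability (unnormalized)

omega = np.sqrt(1.0 + dphi(P) ** 2)                    # equation 3.175 in this 1D Euclidean case
L = omega * h(phi(P))                                  # "likelihood volumetric probability", eq. 3.179
posterior = g * L
posterior /= (posterior * dv).sum()                    # equation 3.178, normalized on the grid

print((P * posterior * dv).sum())                      # posterior mean of P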


3.5.1.2 Second Piece of Text

We need here to introduce a very special probability distribution, denoted H_B , that is homogeneous inside a domain B ⊂ M and zero outside: for any domain B with finite volume V(B) we then have the volumetric probability

h_B(P) = 1/V(B)   if P ∈ B ;  0   if P ∉ B .    (3.180)

Warning: this definition has already been introduced.

What is the result of the intersection P ∩ H_B of this special probability distribution with a normed probability distribution P ? This intersection can be evaluated via the product of the volumetric probabilities, that, in a normalized form (equation ??), gives ( p · h_B )(P) = p(P) / ∫_{P′∈B} dv(P′) p(P′) if P ∈ B , and zero if P ∉ B . As ∫_{P∈B} dv(P) p(P) = P(B) , we have, in fact,

( p · h_B )(P) = p(P) / P(B)   if P ∈ B ;  0   if P ∉ B .    (3.181)

The probability of a domain A ⊂ M is to be calculated as ( P ∩ H_B )(A) = ∫_{P∈A} dv(P) ( p · h_B )(P) = (1/P(B)) ∫_{P∈A∩B} dv(P) p(P) , and this gives

( P ∩ H_B )(A) = P(A ∩ B) / P(B) .    (3.182)

The expression at the right corresponds to the usual definition of conditional probability. One usually says that it represents the probability of the domain A 'given' the domain B , this suggesting a possible use of the definition. For the time being, I prefer just to regard this expression as the result of the intersection of an arbitrary probability distribution P and the probability distribution H_B (that is homogeneous inside B and zero outside).

Note: should I introduce here the notion of volumetric probability, or should I simply send the reader to the appendix?

Note: mention somewhere figures 3.13 and 3.14.

Figure 3.13: Illustration of the intersection of two probability distributions, via the product of the associated volumetric probabilities.



Figure 3.14: Illustration of the definition of conditional probability. Given an initial probability distribution P( · ) (left of the figure) and a set B (middle of the figure), P( · | B) is identical to P( · ) inside B (except for a renormalization factor guaranteeing that P(B|B) = 1 ) and vanishes outside B (right of the figure).



3.5.1.3 Third Piece of Text

Assume that there is a probability distribution defined over an n-dimensional metric manifold M_n . This probability distribution is represented by the volumetric probability f_n(P) . Inside the manifold M_n we consider a submanifold M_p with dimension p ≤ n . What is the probability distribution induced over M_p ? This situation, schematized in figure 3.15, has a well defined solution because we assume that the initial manifold M_n is metric.¹²

Figure 3.15: In an n-dimensional metric manifold M_n , some random points suggest a probability distribution. On this manifold, there is a submanifold M_p , and we wish to evaluate the probability distribution over M_p induced by the probability distribution over M_n .


The careful, quantitative analysis of this situation is done in appendix 5.3.6. The result obtained is quite simple, and could have been guessed. Let us explain it here in some detail.

The initial volumetric probability (over M_n ) is f_n(P) , which we may assume to be normalized to one. The probability of a domain D_n ⊂ M_n is evaluated via

P(D_n) = ∫_{P∈D_n} dv_n(P) f_n(P) .    (3.183)

It happens (see appendix 5.3.6) that the induced volumetric probability over M_p has the value

f_p(P) = f_n(P) / ∫_{P′∈M_p} dv_p(P′) f_n(P′)    (3.184)

at any point P ∈ M_p (and is undefined at any point P ∉ M_p ). So, on the submanifold M_p , the induced volumetric probability f_p(P) takes the same values as the original volumetric probability f_n(P) , except for a renormalization factor. The probability of a domain D_p ⊂ M_p is to be evaluated as

P(D_p) = ∫_{P∈D_p} dv_p(P) f_p(P) .    (3.185)

¹² Therefore, we can consider a 'bar' of constant 'thickness' around M_p and take the limit where the thickness tends uniformly to zero. All this can be done without using any special coordinate system, so we obtain an invariant result. If the manifold is not metric, there is no way to define a uniform limit.


Example 3.20 In the Euclidean 3D space, consider an isotropic Gaussian probability distribution with standard deviation σ . What is the conditional (2D) volumetric probability it induces on the surface of a sphere of unit radius whose center is at unit distance from the center of the Gaussian? Using geographical coordinates (see figure 3.16), the answer is given by the (2D) volumetric probability

f(ϕ, λ) = k exp( sin λ / σ² ) ,    (3.186)

where k is a norming constant (demonstration in section 5.3.14). This is the celebrated Fisher probability distribution, widely used as a model probability on the sphere's surface. The surface element over the surface of the sphere could be obtained using the equations 3.191–3.192, but it is well known to be dS(ϕ, λ) = cos λ dϕ dλ .

Figure 3.16: The spherical Fisher distribution corresponds to the conditional probability distribution induced over a sphere by a Gaussian probability distribution in an Euclidean 3D space (see example 3.20). To have a full 3D representation of the property, this figure should be 'rotated around the vertical axis'.
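A brute-force Monte Carlo check of this example is easy to set up (Python with NumPy; σ , the shell half-width δ and the sample size are arbitrary choices): sampling the 3D Gaussian, keeping only the points that fall in a thin shell around the unit sphere (a crude version of the 'uniform limit'), and histogramming the latitude should reproduce cos λ exp( sin λ / σ² ) , i.e., the Fisher volumetric probability 3.186 times the surface element.

import numpy as np

rng = np.random.default_rng(3)

sigma, delta = 0.6, 0.01
x = rng.normal(loc=[0.0, 0.0, 1.0], scale=sigma, size=(2000000, 3))  # Gaussian centered at unit distance
r = np.linalg.norm(x, axis=1)
x = x[np.abs(r - 1.0) < delta]                          # keep a thin shell around the unit sphere
lam = np.arcsin(x[:, 2] / np.linalg.norm(x, axis=1))    # latitude, pole toward the Gaussian center

h, e = np.histogram(lam, bins=30, range=(-np.pi / 2, np.pi / 2), density=True)
mid = 0.5 * (e[:-1] + e[1:])
model = np.cos(mid) * np.exp(np.sin(mid) / sigma ** 2)  # area element times equation 3.186
model /= (model * (e[1] - e[0])).sum()

print(np.round(h / model, 2))    # ratios close to 1 (noisier in the bins with few samples)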

Equations 3.184–3.185 contain the induced volume element dv_p(P) . Sometimes this volume element is directly known, as in example 3.20, but when dealing with abstract manifolds, while the original n-dimensional volume element dv_n(P) may be given, the p-dimensional volume element dv_p(P) must be evaluated.

To do this, let us take over M_n a coordinate system x = {x^1, . . . , x^n} adapted to our problem, separating these coordinates into a group of p coordinates r = {x^1, . . . , x^p} and a group of q = n − p coordinates s = {s^1, . . . , s^q} , such that the p-dimensional manifold M_p is defined by the set of q constraints

s = s(r) .    (3.187)

Note that the coordinates r define a coordinate system over the submanifold M_p . The two equations 3.184–3.185 can now be written

f_p( r | s(r) ) = f( r , s(r) ) / ∫_{r∈M_p} dv_p(r) f( r , s(r) )    (3.188)

and

P(D_p) = ∫_{r∈D_p} dv_p(r) f_p( r | s(r) ) .    (3.189)


In these equations, the notation f_p( r | s(r) ) is used instead of just f_p(r) , to recall the constraint defining the conditional volumetric probability.

The volume element of the submanifold can be written

dv_p(r) = g_p(r) dv̄_p(r) ,    (3.190)

with dv̄_p(r) = dr^1 ∧ · · · ∧ dr^p , and where g_p(r) is the metric determinant of the metric g_p(r) induced over the submanifold M_p by the original metric over M_n :

g_p(r) = √det g_p(r) .    (3.191)

We have obtained in section 5.2.2 (equation 5.15, page 178):

g_p(r) = g_rr + g_rs S + Sᵗ g_sr + Sᵗ g_ss S ,    (3.192)

where S is the matrix of the partial derivatives of the relations s = s(r) . The probability of a domain D_p ⊂ M_p can then either be computed using equation 3.189 or as

P(D_p) = ∫_{r∈D_p} dv̄_p(r) g_p(r) f_p( r | s(r) ) .    (3.193)

The reader should realize that we have here a conditional volumetric probability, not a conditional probability density. The expression for the conditional probability density is given in appendix 5.3.7 (equation 5.92, page 198), and is quite complicated. This is so because we care here to introduce a proper (i.e., metric) limit. The expressions proposed in the literature, that look like expression 3.188 but are written with probability densities, are quite arbitrary.

The mistake, there, is to take, instead of a uniform limit, a limit that is guided by the coordinates being used, as if a coordinate increment were equivalent to a distance. See the discussion in figures 3.17 and 3.18.

Figure 3.17: In a two-dimensional manifold, some random points suggest a probability distribution. On the manifold, there is a curve, and we wish to evaluate the probability distribution over the curve induced by the probability distribution over the manifold.


Note: say somewhere that a nontrivial example of application of the notion of conditional volumetric probability is made in section 5.5.6.3 (adjusting a measurement to a theory).


Figure 3.18: To properly define the induced probability over the curve (see figure 3.17), one has to take a domain around the curve, evaluate the finite probability, and take the limit when the size of the domain tends to zero. The only intrinsic definition of limit can be made when the considered manifold is metric, as the limit can be taken 'uniform' and 'normal' to the curve. Careless definitions of 'conditional probability density' work without assuming that there is a metric over the manifold, and just take a sort of 'vertical limit' as suggested in the middle. This is as irrelevant as it would be to take a 'horizontal limit' (at the right). Some of the paradoxes of probability theory (like Borel's paradox) arise from this inconsistency.

Example 3.21 In the case where we work in a two-dimensional manifold M2 , with p = q = 1 , we can use the notation r and s (scalars) instead of r and s , so that the constraint 3.187 is written

s = s(r) ,    (3.194)

and the 'matrix' of partial derivatives is now a simple real quantity S = ds/dr . The conditional volumetric probability on the line s = s(r) induced by a volumetric probability f(r, s) is (equation 3.188)

f1(r) = f( r , s(r) ) / ∫ dℓ(r′) f( r′ , s(r′) ) ,    (3.195)

where, if the metric of the manifold M2 is written

g(r, s) = ( g_rr(r, s)  g_rs(r, s) ; g_sr(r, s)  g_ss(r, s) ) ,

the (1D) volume element is (equations 3.190–3.192)

dℓ(r) = √( g_rr(r, s(r)) + 2 S(r) g_rs(r, s(r)) + S(r)² g_ss(r, s(r)) ) dr .    (3.196)

The probability of an interval ( r1 < r < r2 ) along the line s = s(r) is then P = ∫_{r1}^{r2} dℓ(r) f1(r) . If the constraint 3.194 is, in fact, s = s0 , then equation 3.195 simplifies into

f1(r) = f(r, s0) / ∫ dℓ(r′) f(r′, s0) ,    (3.197)

and, as the partial derivative vanishes, S = 0 , the length element 3.196 becomes

dℓ(r) = √g_rr(r, s0) dr .    (3.198)


Example 3.22 Consider two Cartesian coordinates {x, y} on the Euclidean plane, associated to the usual metric ds² = dx² + dy² . It is easy to see (using, for instance, equation 2.65) that the metric matrix associated to the new coordinates (see figure 3.19)

r = x ; s = x y    (3.199)

is

g(r, s) = ( 1 + s²/r⁴   −s/r³ ; −s/r³   1/r² ) ,    (3.200)

with metric determinant √det g(r, s) = 1/r . Assume that all we know about the position of a given point is described by the volumetric probability f(r, s) . Then, we are told that, in fact, the point is on the line defined by the equation s = s0 . What can we now say about the coordinate r of the point? This is clearly a problem of conditional volumetric probability, and the information we have now on the position of the point is represented by the volumetric probability (on the line s = s0 ) given by equation 3.197:

f1(r) = f(r, s0) / ∫ dℓ(r′) f(r′, s0) .    (3.201)

Here, considering the special form of the metric in equation 3.200, the length element given by equation 3.198 is

dℓ(r) = √( 1 + s0²/r⁴ ) dr .    (3.202)

The special case s = s0 = 0 gives

f1(r) = f(r, 0) / ∫ dℓ(r′) f(r′, 0) ; dℓ(r) = dr .    (3.203)
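The induced length element 3.202 can be checked numerically (Python with NumPy; the value of s0 and the interval of r are arbitrary choices): along the curve s = s0 , which in Cartesian coordinates is the hyperbola y = s0/x , the length computed with √(1 + s0²/r⁴) dr must agree with the Euclidean arc length.

import numpy as np

s0 = 0.7
r = np.linspace(0.5, 2.0, 100001)
dr = r[1] - r[0]

# Length of the curve s = s0 computed from the induced metric (equation 3.202).
length_induced = (np.sqrt(1.0 + s0 ** 2 / r ** 4) * dr).sum()

# The same curve in Cartesian coordinates (y = s0 / x), measured as a polyline.
x, y = r, s0 / r
length_cartesian = np.hypot(np.diff(x), np.diff(y)).sum()

print(length_induced, length_cartesian)   # the two values agree up to discretization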

Figure 3.19: The Euclidean plane, with, at the left, two Cartesian coordinates {x, y} , and, at the right, the two coordinates u = x ; v = x y .


Example 3.23 To address a paradox mentioned by Jaynes (2003), let us solve the same problem solved in the previous example, but using the Cartesian coordinates {x, y} . The information that was represented by the volumetric probability f(r, s) is now represented by the volumetric probability h(x, y) given by (as volumetric probabilities are invariant objects)

h(x, y) = f(r, s)|_{r=x ; s=x y} .    (3.204)

As the condition s = 0 is equivalent to the condition y = 0 , and as the metric matrix is the identity, it is clear that we shall arrive, for the (1D) volumetric probability representing the information we have on the coordinate x , at

h1(x) = h(x, 0) / ∫ dℓ(x′) h(x′, 0) ; dℓ(x) = dx .    (3.205)

Not only is this equation similar in form to equation 3.203; replacing here h by f (using equation 3.204) we obtain an identity that can be expressed using any of the two equivalent forms

h1(x) = f1(r)|_{r=x} ; f1(r) = h1(x)|_{x=r} .    (3.206)

Along the line s = y = 0 , the two coordinates r and x coincide, so we obtain the same volumetric probability (with the same length elements dℓ(x) = dx and dℓ(r) = dr ). Trivial as it may seem, this result is not that found using the traditional definition of conditional probability density. Jaynes (2003) lists this as one of the paradoxes of probability theory. It is not a paradox, it is a mistake one makes when falling into the illusion that a conditional probability density (or a conditional volumetric probability) can be defined without invoking the existence of a metric (i.e., of a notion of distance) in the working space. This 'paradox' is related to the 'Borel-Kolmogorov paradox', that I address in appendix 5.3.10.


3.5.2 Marginal of a Conditional Probability

Let us place ourselves here in the situation where both conditional and marginal probabilities make sense:

• we have two sets A0 and B0 , and we introduce their Cartesian product S0 = A0 × B0 ;

• if the two sets A0 and B0 are, in fact, manifolds, they are assumed to be metric (with respective length elements ds²_{A0} and ds²_{B0} ), and the metric over S0 is introduced as ds²_{S0} = ds²_{A0} + ds²_{B0} .

Given a particular probability function P over S0 = A0 × B0 and given a particular set C ⊆ S0 with P[C] ≠ 0 (the set C is "the condition"), the conditional probability function given C is introduced as usual: it is the probability function over S0 that to any set S ⊆ S0 associates the probability value

P[ S | C ] ≡ P[ S ∩ C ] / P[ C ] .    (3.207)

This conditional probability function has two marginal probability functions (see section 3.2.7):

• the probability function over A0 that to any set A ⊆ A0 associates the probability value

P_{A0}[ A | C ] ≡ P[ A × B0 | C ] ,    (3.208)

• and the probability function over B0 that to any set B ⊆ B0 associates the probability value

P_{B0}[ B | C ] ≡ P[ A0 × B | C ] .    (3.209)

Explicitly, this gives

P_{A0}[ A | C ] = P[ (A × B0) ∩ C ] / P[ C ]   and   P_{B0}[ B | C ] = P[ (A0 × B) ∩ C ] / P[ C ] .    (3.210)

Example 3.24 When the two sets A0 and B0 are discrete, bla, bla, bla, and introducing the elementary probability p(a, b) via

P[ S ] = Σ_{(a,b)∈S} p(a, b) ,    (3.211)

and when the conditioning set C corresponds to a mapping a ↦ b = ϕ(a) , then, introducing the two elementary probabilities p_{A0}( a | b = ϕ(a) ) and p_{B0}( b | b = ϕ(a) ) via

P_{A0}[ A | b = ϕ(a) ] = Σ_{a∈A} p_{A0}( a | b = ϕ(a) ) ,    (3.212)

P_{B0}[ B | b = ϕ(a) ] = Σ_{b∈B} p_{B0}( b | b = ϕ(a) ) ,    (3.213)

we arrive at (note: check this!)

p_{A0}( a | b = ϕ(a) ) = (1/ν) p( a , ϕ(a) )    (3.214)

and (note: explain that, for every given b ∈ B0 , the summation is over all a ∈ A0 such that ϕ(a) = b )

p_{B0}( b | b = ϕ(a) ) = (1/ν) Σ_{a : ϕ(a)=b} p( a , ϕ(a) ) ,    (3.215)

where ν is the normalization constant ν = Σ_{a∈A0} p( a , ϕ(a) ) , or, equivalently (note: check!), ν = Σ_{b∈B0} Σ_{a : ϕ(a)=b} p( a , ϕ(a) ) .
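A small discrete Python sketch of equations 3.214–3.215 (NumPy assumed; the sizes of A0 and B0 , the joint probability and the mapping ϕ below are all hypothetical choices):

import numpy as np

rng = np.random.default_rng(4)

nA, nB = 6, 4
p = rng.uniform(size=(nA, nB))
p /= p.sum()                               # joint elementary probability p(a, b)

phi = np.array([0, 2, 2, 1, 3, 0])         # a mapping a -> b = phi(a)

nu = p[np.arange(nA), phi].sum()           # normalization constant
pA = p[np.arange(nA), phi] / nu            # equation 3.214
pB = np.array([pA[phi == b].sum() for b in range(nB)])   # equation 3.215

print(pA.sum(), pB.sum())                  # both equal 1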

Example 3.25 (Note: give here a concrete example of the situation analyzed in example 3.24. One of the drawings?)

Example 3.26 The most important example for us shall be when the two sets A0 and B0 are, in fact, manifolds —say M and N — and the conditioning set C is not an ordinary subset of M×N but a submanifold of M×N . Then. . .

Example 3.27 (Note: give here a concrete example of the situation analyzed in example 3.26.)

A common situation is when the original mapping P (over A0 × B0 ) is the product of two marginal mappings. . . (Note: continue this, checking that the notion of independence has already been introduced.)


3.5.3 Demonstration: marginals of the conditional

Equations 3.30 and 3.32 are demonstrated as follows.

One considers a p-dimensional metric manifold M , that, for the sake of the demonstration, we may endow with some coordinates x = {x^α} = {x^1, . . . , x^p} . Denoting as γ(x) the metric tensor, the volume element can then be written as

dv_M(x) = √det γ(x) dx^1 ∧ · · · ∧ dx^p .    (3.216)

One also considers a q-dimensional metric manifold N , also endowed with some coordinates y = {y^i} = {y^1, . . . , y^q} . Denoting as Γ(y) the metric tensor, the volume element is, then,

dv_N(y) = √det Γ(y) dy^1 ∧ · · · ∧ dy^q .    (3.217)

Finally, one considers a mapping ϕ from M into N , that, with the given coordinates, can be written as y = ϕ(x) . We shall denote as Φ the tangent linear mapping, in fact, the matrix of partial derivatives Φ^i_α = ∂y^i/∂x^α . This is defined at every point x , so we can write Φ(x) .

One introduces the (p + q)-dimensional manifold M×N . It is easy to see that the mapping x ↦ y = ϕ(x) defines, inside M×N , a submanifold whose dimension is p (the dimension of M ). Therefore, the coordinates {x^α} (of M ) can also be used as coordinates over that submanifold. Introducing, on that submanifold, the line element ds² = ds²_M + ds²_N , one can successively write

ds² = ds²_M + ds²_N
    = γ_αβ dx^α dx^β + Γ_ij dy^i dy^j
    = γ_αβ dx^α dx^β + Γ_ij Φ^i_α Φ^j_β dx^α dx^β
    = ( γ_αβ + Φ^i_α Γ_ij Φ^j_β ) dx^α dx^β ,    (3.218)

this showing that the components of the metric over the submanifold are γ_αβ + Φ^i_α Γ_ij Φ^j_β . Said otherwise, the metric at point x of the submanifold is γ(x) + Φᵗ(x) Γ( ϕ(x) ) Φ(x) . This implies that the volume element induced on the submanifold is

dv(x) = √det( γ(x) + Φᵗ(x) Γ( ϕ(x) ) Φ(x) ) dx^1 ∧ · · · ∧ dx^p .    (3.219)

Let now h(x, y) be a volumetric probability over M×N . The conditional volumetric probability is defined as the limit (which one? which one? which one?). This will lead to a volumetric probability c(x) (remember that we are using the coordinates {x^α} over the submanifold) that we shall express in a moment. But let us make clear before that, to evaluate the probability of a set A (at the same time a set of M and of the submanifold), the volumetric probability c(x) has to be integrated as

P[A] = ∫_A dv(x) c(x) ,    (3.220)

with the volume element dv(x) expressed in equation 3.219. Because (what? what? what?), the value of c(x) at any point ( x , ϕ(x) ) is just proportional to h( x , ϕ(x) ) ,

c(x) = (1/ν) h( x , ϕ(x) ) ,    (3.221)

where ν is the normalization constant

ν = ∫_M dv(x) h( x , ϕ(x) ) .    (3.222)

Now the two marginals f(x) and g(y) can be introduced by considering some dP on the submanifold, and by "projecting" it into M and into N . This can be written by identifying the three expressions

dP = h( x , ϕ(x) ) √det( γ(x) + Φᵗ(x) Γ( ϕ(x) ) Φ(x) ) dx^1 ∧ · · · ∧ dx^p
   = f(x) √det γ(x) dx^1 ∧ · · · ∧ dx^p
   = g(y) √det Γ(y) dy^1 ∧ · · · ∧ dy^q ,    (3.223)

i.e.,

dP = h( x , ϕ(x) ) √det( γ(x) + Φᵗ(x) Γ( ϕ(x) ) Φ(x) ) dx^1 ∧ · · · ∧ dx^p
   = f(x) √det γ(x) dx^1 ∧ · · · ∧ dx^p
   = g(y) √det( Φᵗ(x) Γ( ϕ(x) ) Φ(x) ) dx^1 ∧ · · · ∧ dx^p .    (3.224)

(Note: what have I done here???) And from that, it would follow

f(x) = (1/ν) h( x , ϕ(x) ) √det( γ(x) + Φ(x)ᵗ Γ( ϕ(x) ) Φ(x) ) / √det γ(x) ,    (3.225)

g(y) = (1/ν) Σ_{x : ϕ(x)=y} h( x , y ) √det( γ(x) + Φᵗ(x) Γ(y) Φ(x) ) / √det( Φ(x)ᵗ Γ(y) Φ(x) ) .    (3.226)


3.5.4 Bayes Theorem

3.5.4.1 First Piece of Text

When the manifold M_n just considered is general, the conditional probability distribution we have defined is over the considered surface, and there is not much more we can do. In the special situation where the manifold M_n has been built by 'Cartesian product' of two manifolds R_p and S_q , we can advance a little bit further.

Case M_n = R_p × S_q

Assume that we have a p-dimensional metric manifold R_p , with some coordinates r = {r^α} , and a metric tensor denoted g_r = {g_αβ} . We also have a q-dimensional metric manifold S_q , with some coordinates s = {s^a} , and a metric tensor denoted g_s = {g_ab} . Given two such manifolds, we can always introduce the manifold M_n = M_{p+q} = R_p × S_q , i.e., a manifold whose points consist of a pair of points, one in R_p and one in S_q . As R_p and S_q are both metric manifolds, it is always possible to also endow M_{p+q} with a metric. While the distance element over R_p is ds²_r = (g_r)_αβ dr^α dr^β , and the distance element over S_q is ds²_s = (g_s)_ab ds^a ds^b , the distance element over M_{p+q} is ds² = ds²_r + ds²_s .

The relation

s = s(r)    (3.227)

considered above (equation 3.187) can now be considered as a mapping from R_p into S_q . The conditional volumetric probability

f_p( r | s(r) ) = const. f( r , s(r) )    (3.228)

of equation 3.188 is defined over the submanifold M_p ⊂ M_{p+q} defined by the constraints s = s(r) . On this submanifold we integrate as (equation 3.193)

P(D_p) = ∫_{r∈D_p} dv̄_p(r) g_p(r) f_p( r | s(r) ) ,    (3.229)

with dv̄_p(r) = dr^1 ∧ · · · ∧ dr^p (because the coordinates {r^α} are also being used as coordinates over the submanifold), and where the metric determinant g_p(r) , given in equations 3.191 and 3.192, here simplifies into

g_p(r) = √det g_p(r) ; g_p = g_r + Sᵗ g_s S .    (3.230)

Again, this volumetric probability is defined over the submanifold of M_{p+q} corresponding to the constraints s = s(r) . Can we consider a volumetric probability defined over R_p ?

Yes, of course, and this is quite easy, as we are already considering over the submanifold the coordinates {r^α} that are, in fact, the coordinates of R_p . The only difference is that instead of evaluating integrals using the induced metric in equations 3.230 we have to use the actual metric of R_p , i.e., the metric g_r .

The basic criterion that shall allow us to make the link between the volumetric probability f_p( r | s(r) ) (in equation 3.228) and a volumetric probability, say f_r( r | s(r) ) , defined over R_p is that the integral over a given domain defined by the coordinates {r^α} gives the same probability (as the coordinates {r^α} are common to the submanifold M_p and to R_p ).

We easily obtain that the volumetric probability defined over R_p is

f_r( r | s(r) ) = const. ( √det( g_r + Sᵗ g_s S ) / √det g_r ) f( r , s(r) ) ,    (3.231)

and we integrate on a domain D_p ⊂ R_p as

P(D_p) = ∫_{r∈D_p} dv̄_r(r) g_r(r) f_r( r | s(r) ) = ∫_{r∈D_p} dv_r(r) f_r( r | s(r) ) ,    (3.232)

with dv̄_r(r) = dr^1 ∧ · · · ∧ dr^p , g_r(r) = √det g_r , and dv_r(r) = dv̄_r(r) g_r(r) . See figure 3.20 for a schematic representation of this definition of a volumetric probability over R_p . When there is no risk of confusion with the function f_p( r | s(r) ) of equation 3.188, we shall also call f_r( r | s(r) ) the conditional volumetric probability for r , given s = s(r) . Remember that while f_p( r | s(r) ) is defined on the p-dimensional submanifold M_p ⊂ M_{p+q} , f_r( r | s(r) ) is defined over R_p .

As a special case, the relation s = s(r) may just be

s = s0 ,    (3.233)

in which case the matrix of partial derivatives vanishes, S = 0 . In this case, the function f_r( r | s(r) ) simplifies into f_r( r | s0 ) = const. f( r , s0 ) . When dropping the index '0' in s0 we just write f_r( r | s ) = const. f( r , s ) , or, under normalized form,

f_r( r | s ) = f( r , s ) / ∫_{R_p} dv̄_r(r) g_r(r) f( r , s ) = f( r , s ) / ∫_{R_p} dv_r(r) f( r , s ) .    (3.234)

Example 3.28 With the notations of this section, consider that the metric g_r of the space R_p and the metric g_s of the space S_q are constant (i.e., that both the coordinates {r^α} and {s^i} are rectilinear coordinates in Euclidean spaces), and that the application s = s(r) is a linear application, that we can write

s = S r ,    (3.235)

as this is consistent with the definition of S as the matrix of partial derivatives, S^i_α = ∂s^i/∂r^α . Consider that we have a Gaussian probability distribution over the space R_p , represented by the volumetric probability

f_p(r) = (2π)^{-p/2} exp( −(1/2) (r − r0)ᵗ g_r (r − r0) ) ,    (3.236)



Figure 3.20: In an n-dimensional space M_n that is the Cartesian product of two spaces R_p and S_q , with coordinates r = {r^1, . . . , r^p} and s = {s^1, . . . , s^q} and metric tensors g_r and g_s , there is a volume element on each of R_p and S_q , and an induced volume element in M_n = R_p × S_q . Given a p-dimensional submanifold s = s(r) of M_n , there also is an induced volume element on it. A volumetric probability f(r, s) over M_n induces a (conditional) volumetric probability f_p(r) over the submanifold s = s(r) (equation 3.188), and, as the submanifold shares the same coordinates as R_p , a volumetric probability f_r(r) is also induced over R_p (equation 3.231).


that is normalized via ∫ dr^1 ∧ · · · ∧ dr^p √det g_r f_p(r) = √det g_r ∫ dr^1 ∧ · · · ∧ dr^p f_p(r) = 1 . Similarly, consider that we also have a Gaussian probability distribution over the space S_q , represented by the volumetric probability

f_q(s) = (2π)^{-q/2} exp( −(1/2) (s − s0)ᵗ g_s (s − s0) ) ,    (3.237)

that is normalized via ∫ ds^1 ∧ · · · ∧ ds^q √det g_s f_q(s) = √det g_s ∫ ds^1 ∧ · · · ∧ ds^q f_q(s) = 1 . Finally, consider the (p + q)-dimensional probability distribution over the space M_{p+q} defined as the product of these two volumetric probabilities,

f(r, s) = f_p(r) f_q(s) .    (3.238)

Given this (p + q)-dimensional volumetric probability f(r, s) and given the p-dimensional hyperplane s = S r , we obtain the conditional volumetric probability f_r(r) over R_p as given by equation 3.231. All simplifications done,¹³ one obtains the Gaussian volumetric probability¹⁴

f_r(r) = (2π)^{-p/2} ( √det g′_r / √det g_r ) exp( −(1/2) (r − r′0)ᵗ g′_r (r − r′0) ) ,    (3.239)

where the metric g′_r (inverse of the covariance matrix) is

g′_r = g_r + Sᵗ g_s S    (3.240)

and where the mean r′0 can be obtained solving the expression¹⁵

g′_r ( r′0 − r0 ) = Sᵗ g_s ( s0 − S r0 ) .    (3.241)

Note: I should now show here that f_s(s) , the volumetric probability in the space S_q , is given, in all cases ( p ≤ q or p ≥ q ) by

f_s(s) = (2π)^{-q/2} ( √det g′_s / √det g_s ) exp( −(1/2) (s − s′0)ᵗ g′_s (s − s′0) ) ,    (3.242)

where the metric g′_s (inverse of the covariance matrix) is

(g′_s)⁻¹ = S (g′_r)⁻¹ Sᵗ    (3.243)

and where the mean s′0 is

s′0 = S r′0 .    (3.244)

Note: say that this is illustrated in figure 3.21.

¹³ Note: explain this.
¹⁴ This volumetric probability is normalized by ∫ dr^1 ∧ · · · ∧ dr^p √det g_r f_r(r) = 1 .
¹⁵ Explicitly, one can write r′0 = r0 + (g′_r)⁻¹ Sᵗ g_s ( s0 − S r0 ) , but in numerical applications, the direct resolution of the linear system 3.241 is preferable.
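The linear Gaussian formulas 3.240–3.241 and 3.243–3.244 are easy to evaluate numerically. The following Python sketch (NumPy assumed; the metrics, the matrix S and the two means are arbitrary numbers chosen only for illustration) computes the quantities g′_r , r′0 , (g′_s)⁻¹ and s′0 :

import numpy as np

gr = np.diag([1.0, 2.0, 0.5])             # metric of R_p (p = 3)
gs = np.diag([4.0, 1.0])                  # metric of S_q (q = 2)
S = np.array([[1.0, -1.0, 0.5],
              [0.3,  2.0, 1.0]])          # the linear map s = S r
r0 = np.array([1.0, 0.0, -1.0])
s0 = np.array([2.0, 1.0])

g_r_post = gr + S.T @ gs @ S                                        # equation 3.240
r0_post = r0 + np.linalg.solve(g_r_post, S.T @ gs @ (s0 - S @ r0))  # equation 3.241

g_s_post_inv = S @ np.linalg.inv(g_r_post) @ S.T                    # equation 3.243
s0_post = S @ r0_post                                               # equation 3.244

print(r0_post)
print(s0_post)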


Figure 3.21: Provisional figure to illustrate example 3.28.



3.5.4.2 Second Piece of Text

In equation 3.234 we have written the conditional volumetric probability f_r( r | s ) = f( r , s ) / ∫_{R_p} dv_r(r) f( r , s ) , while in equation 3.103 we have written the marginal volumetric probability f_s(s) = ∫_{R_p} dv_r(r) f( r , s ) . Combining the two equations gives

f( r , s ) = f_r( r | s ) f_s(s) ,    (3.245)

an expression that we can read as follows: 'a "joint" volumetric probability can be expressed as the product of a conditional volumetric probability times a marginal volumetric probability.' Similarly, we could have obtained

f( r , s ) = f_s( s | r ) f_r(r) .    (3.246)

Combining the two last equations gives the Bayes theorem

f_s( s | r ) = f_r( r | s ) f_s(s) / f_r(r) ,    (3.247)

which allows one to express one of the conditionals in terms of the other conditional and the two marginals.

Of course, in general, f( r , s ) ≠ f_r(r) f_s(s) . When one has the property

f( r , s ) = f_r(r) f_s(s) ,    (3.248)

then it follows from the two equations 3.245–3.246 that

f_r( r | s ) = f_r(r)   and   f_s( s | r ) = f_s(s) ,    (3.249)

i.e., the conditionals equal the marginals. This means that the 'variable' r is independent from the variable s and vice versa. Therefore, when the relation 3.248 holds, it is said that the two variables are independent.
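A crude grid check of the Bayes theorem 3.247 (Python with NumPy; the joint volumetric probability below is an arbitrary, hypothetical choice, and the two 1D manifolds are taken Euclidean, so the volume elements are just dr and ds ):

import numpy as np

r = np.linspace(-3.0, 3.0, 301)
s = np.linspace(-3.0, 3.0, 301)
dr, ds = r[1] - r[0], s[1] - s[0]
R, S = np.meshgrid(r, s, indexing="ij")

f = np.exp(-0.5 * (R ** 2 + 0.8 * R * S + S ** 2))   # some joint volumetric probability
f /= f.sum() * dr * ds

fr = f.sum(axis=1) * ds                    # marginal f_r(r)
fs = f.sum(axis=0) * dr                    # marginal f_s(s)
fr_given_s = f / fs[None, :]               # conditional f_r(r|s)
fs_given_r = f / fr[:, None]               # conditional f_s(s|r)

# Bayes theorem, equation 3.247:
print(np.allclose(fs_given_r, fr_given_s * fs[None, :] / fr[:, None]))   # True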

Example 3.29 (This example has to be updated.) Over the surface of the unit sphere, using geographical coordinates, we have the two displacement elements

ds_ϕ(ϕ, λ) = cos λ dϕ ; ds_λ(ϕ, λ) = dλ ,    (3.250)

with the associated surface element (as the coordinates are orthogonal) dS(ϕ, λ) = cos λ dϕ dλ . Consider a (2D) volumetric probability f(ϕ, λ) over the surface of the sphere, normed under the usual condition

∫_surface dS(ϕ, λ) f(ϕ, λ) = ∫_{−π}^{+π} dϕ ∫_{−π/2}^{+π/2} dλ cos λ f(ϕ, λ) = ∫_{−π/2}^{+π/2} dλ cos λ ∫_{−π}^{+π} dϕ f(ϕ, λ) = 1 .    (3.251)

One may define the partial integrations

η_ϕ(ϕ) = ∫_{−π/2}^{+π/2} dλ cos λ f(ϕ, λ) ; η_λ(λ) = ∫_{−π}^{+π} dϕ f(ϕ, λ) ,    (3.252)

so that the probability of a sector between two meridians and of an annulus between two parallels are respectively computed as

P( ϕ1 < ϕ < ϕ2 ) = ∫_{ϕ1}^{ϕ2} dϕ η_ϕ(ϕ) ; P( λ1 < λ < λ2 ) = ∫_{λ1}^{λ2} dλ cos λ η_λ(λ) ,    (3.253)

but the terms dϕ and cos λ dλ appearing in these two expressions are not the displacement elements on the sphere's surface (equation 3.250). The functions η_ϕ(ϕ) and η_λ(λ) should not be mistaken for marginal volumetric probabilities: as the surface of the sphere is not the Cartesian product of two 1D spaces, marginal volumetric probabilities are not defined.


Chapter 4

Examples

Bla, bla, bla. . .


4.1 Homogeneous Probability for Elastic Parameters

In this section, we start from the assumption that the uncompressibility modulus and the shear modulus are Jeffreys parameters (they are the eigenvalues of the stiffness tensor c_ijkℓ ), and find the expression of the homogeneous probability density for other sets of elastic parameters, like the set {Young modulus, Poisson ratio} or the set {longitudinal wave velocity, transverse wave velocity}.

Uncompressibility Modulus and Shear Modulus

The 'Cartesian parameters' of elastic theory are the logarithm of the uncompressibility modulus and the logarithm of the shear modulus

κ* = log( κ / κ0 ) ; µ* = log( µ / µ0 ) ,    (4.1)

where κ0 and µ0 are two arbitrary constants. The homogeneous probability density is just constant for these parameters (a constant that we set arbitrarily to one):

f_{κ*µ*}(κ*, µ*) = 1 .    (4.2)

As is often the case for homogeneous 'probability' densities, f_{κ*µ*}(κ*, µ*) is not normalizable. Using the Jacobian rule, it is easy to transform this probability density into the equivalent one for the positive parameters themselves:

f_{κµ}(κ, µ) = 1/(κ µ) .    (4.3)

This 1/x form of the probability density remains invariant if we take any power of κ and of µ . In particular, if instead of the uncompressibility κ we use the compressibility γ = 1/κ , the Jacobian rule simply gives f_{γµ}(γ, µ) = 1/(γ µ) .

Associated to the probability density 4.2 there is the Euclidean definition of distance

ds² = (dκ*)² + (dµ*)² ,    (4.4)

that corresponds, in the variables (κ, µ) , to

ds² = ( dκ/κ )² + ( dµ/µ )² ,    (4.5)

i.e., to the metric

( g_κκ  g_κµ ; g_µκ  g_µµ ) = ( 1/κ²  0 ; 0  1/µ² ) .    (4.6)


Young Modulus and Poisson Ratio

The Young modulus Y and the Poisson ratio σ can be expressed as a function of the uncompressibility modulus and the shear modulus as

Y = 9 κ µ / (3κ + µ) ; σ = (1/2) (3κ − 2µ) / (3κ + µ) ,    (4.7)

or, reciprocally,

κ = Y / ( 3 (1 − 2σ) ) ; µ = Y / ( 2 (1 + σ) ) .    (4.8)

The absolute value of the Jacobian of the transformation is easily computed,

J = Y / ( 2 (1 + σ)² (1 − 2σ)² ) ,    (4.9)

and the Jacobian rule transforms the probability density 4.3 into

f_Yσ(Y, σ) = (1/(κ µ)) J = 3 / ( Y (1 + σ) (1 − 2σ) ) ,    (4.10)

which is the probability density representing the homogeneous probability distribution for elastic parameters using the variables (Y, σ) . This probability density is the product of the probability density 1/Y for the Young modulus and the probability density

g(σ) = 3 / ( (1 + σ) (1 − 2σ) )    (4.11)

for the Poisson ratio. This probability density is represented in figure 4.1. From the definition of σ it can be demonstrated that its values must range in the interval −1 < σ < 1/2 , and we see that the homogeneous probability density is singular at these points. Although most rocks have positive values of the Poisson ratio, there are materials where σ is negative (e.g., Yeganeh-Haeri et al., 1992).
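A finite-difference check of the Jacobian 4.9 and of the density 4.10 is straightforward (Python with NumPy; the test point Y = 50 , σ = 0.25 is an arbitrary choice):

import numpy as np

def kappa_mu(Y, sig):                      # equation 4.8
    return Y / (3.0 * (1.0 - 2.0 * sig)), Y / (2.0 * (1.0 + sig))

Y, sig, h = 50.0, 0.25, 1e-6
k0, m0 = kappa_mu(Y, sig)
dk_dY, dm_dY = (np.array(kappa_mu(Y + h, sig)) - np.array([k0, m0])) / h
dk_ds, dm_ds = (np.array(kappa_mu(Y, sig + h)) - np.array([k0, m0])) / h

J_numeric = abs(dk_dY * dm_ds - dk_ds * dm_dY)
J_formula = Y / (2.0 * (1.0 + sig) ** 2 * (1.0 - 2.0 * sig) ** 2)    # equation 4.9
print(J_numeric, J_formula)                                          # agree to about 5 digits

f_Ysig = J_formula / (k0 * m0)                                       # Jacobian rule applied to 4.3
print(f_Ysig, 3.0 / (Y * (1.0 + sig) * (1.0 - 2.0 * sig)))           # both sides of equation 4.10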

Figure 4.1: The homogeneous probability density for the Poisson ratio, as deduced from the condition that the uncompressibility and the shear modulus are Jeffreys parameters.


It may be surprising that the probability density in figure 4.1 corresponds to a homogeneous distribution. If we have many samples of elastic materials, and if their logarithmic uncompressibility modulus κ* and their logarithmic shear modulus µ* have a constant probability density (which is the definition of a homogeneous distribution of elastic materials), then σ will be distributed according to the g(σ) of the figure.

To be complete, let us mention that in a change of variables x^i ⇝ x^I , a metric g_ij changes to

g_IJ = Λ_I^i Λ_J^j g_ij = (∂x^i/∂x^I) (∂x^j/∂x^J) g_ij .    (4.12)

The metric 4.5 then transforms into

( g_YY  g_Yσ ; g_σY  g_σσ ) = ( 2/Y²   2/((1−2σ)Y) − 1/((1+σ)Y) ; 2/((1−2σ)Y) − 1/((1+σ)Y)   4/(1−2σ)² + 1/(1+σ)² ) .    (4.13)

The surface element is

dS_Yσ(Y, σ) = √det g dY dσ = 3 dY dσ / ( Y (1 + σ) (1 − 2σ) ) ,    (4.14)

a result from which expression 4.10 can be inferred.

Although the Poisson ratio has a historical interest, it is not a simple parameter, as shown by its theoretical bounds −1 < σ < 1/2 , or the form of the homogeneous probability density (figure 4.1). In fact, the Poisson ratio σ depends only on the ratio κ/µ (incompressibility modulus over shear modulus), as we have

(1 + σ)/(1 − 2σ) = (3/2) (κ/µ) .    (4.15)

The ratio J = κ/µ of two Jeffreys parameters being a Jeffreys parameter, a useful pair of Jeffreys parameters may be {κ, J} . The ratio J = κ/µ has a physical interpretation that is easy to grasp (the ratio between the uncompressibility and the shear modulus), and should be preferred, in theoretical developments, to the Poisson ratio, as it has simpler theoretical properties.

Longitudinal and Transverse Wave Velocities

Equation 4.3 gives the probability density representing the homogeneous probability distribution of elastic media, when parameterized by the uncompressibility modulus and the shear modulus:

f_κµ(κ, µ) = 1/(κ µ) .    (4.16)

Should we have been interested, in addition, in the mass density ρ , then we would have arrived (as ρ is another Jeffreys parameter) at the probability density

f_κµρ(κ, µ, ρ) = 1/(κ µ ρ) .    (4.17)


This is the starting point for this section.

What about the probability density representing the homogeneous probability distribution of elastic materials when we use as parameters the mass density and the two wave velocities? The longitudinal wave velocity α and the shear wave velocity β are related to the uncompressibility modulus κ and the shear modulus µ through

α = √( (κ + 4µ/3) / ρ ) ; β = √( µ / ρ ) ,    (4.18)

and a direct use of the Jacobian rule transforms the probability density 4.17 into

f_αβρ(α, β, ρ) = 1 / ( ρ α β ( 3/4 − β²/α² ) ) ,    (4.19)

which is the answer to our question.

That this function becomes singular for α = (2/√3) β is just due to the fact that the "boundary" α = (2/√3) β can not be crossed: the fundamental inequalities κ > 0 , µ > 0 impose that the two velocities are linked by the inequality constraint

α > (2/√3) β .    (4.20)

Let us focus for a moment on the homogeneous probability density for the two wave velocities (α, β) existing in an elastic solid (disregard here the mass density ρ ). We have

f_αβ(α, β) = 1 / ( α β ( 3/4 − β²/α² ) ) .    (4.21)

It is displayed in figure 4.2.

Figure 4.2: The joint homogeneous probability density for the velocities (α, β) of the longitudinal and transverse waves propagating in an elastic solid. Contrary to the incompressibility and the shear modulus, that are independent parameters, the longitudinal wave velocity and the transversal wave velocity are not independent (see text for an explanation). The scales for the velocities are unimportant: it is possible to multiply the two velocity scales by any factor without modifying the form of the probability (which is itself defined up to a multiplicative constant).

0

α

β

Page 156: Mapping of Probabilities

146 Examples

Let us demonstrate that the marginal probability density for both α and β is ofthe form 1/x . For we have to compute

fα(α) =∫ √

3α/2

0dβ f (α,β) (4.22)

andfβ(β) =

∫ +∞2β/

√3

dα f (α,β) (4.23)

(the bounds of integration can easily be understood by a look at figure 4.2). Theseintegrals can be evaluated as

fα(α) = limε→0

∫ √1−ε

√3α/2

√ε√

3α/2dβ f (α,β) = lim

ε→0

(43

log1−εε

)1α

(4.24)

and

fβ(β) = limε→0

∫ 2β/(√ε√

3)√

1+ε 2β/√

3dα f (α,β) = lim

ε→0

(23

log1/ε− 1ε

)1β

. (4.25)

The numerical factors tend to infinity, but this is only one more manifestation of thefact that the homogeneous probability densities are usually improper (not normaliz-able). Dropping these numerical factors gives

fα(α) = 1/α (4.26)

andfβ(β) = 1/β . (4.27)

It is interesting to note that we have here an example where two parameters that looklike Jeffreys parameters, but are not, because they are not independent (the homo-geneous joint probability density is not the product of the homogeneous marginalprobability densities.).

It is also worth to know that using slownesses instead of velocities ( n = 1/α, η =1/β ) leads, as one would expect, to

fnηρ(n, η,ρ) =1

ρ n η(

34 −

n2

η2

) . (4.28)

Page 157: Mapping of Probabilities

4.2 Measuring a One-Dimensional Strain (I) 147

4.2 Measuring a One-Dimensional Strain (I)

A one-dimensional material medium with an initial length X is deformed, and itslength becomes Y . The strain that has affected the medium, denoted ε , is definedas

ε = logYX

. (4.29)

A measurement of X and Y provides the information represented by a probabilitydensity f (X, Y) . This induces an information on the actual value of the strain, thatis represented by a probability density g(ε) . The problem is to express gε(ε) usingas ‘inputs’ the definition 4.29 and the probability density f (X, Y) .

We need to introduce a ‘slack variable’, say Z , in order to be able to change fromthe pair X, Y into a pair ε, Z . We have a total freedom for the choice of Z .I choose here the (geometric) average length Z =

√X Y . The change of variables is,

then,ε = log(Y/X)Z =

√X Y

X = Z e−ε/2

Y = Z eε/2 . (4.30)

The probability density for the new variables, say g(ε, Z) , can be obtained usingthe Jacobian rule:

g(ε, Z) = f (X, Y) | det

(∂X∂ε

∂X∂Z

∂Y∂ε

∂Y∂Z

)| = f (X, Y)

X YZ

. (4.31)

Replacing X and Y by their expressions in terms of ε and Z gives

g(ε, Z) = Z f ( Z e−ε/2 , Z eε/2 ) . (4.32)

The probability density for ε is then

gε(ε) =∫ ∞

0dZ g(ε, Z) =

∫ ∞0

dZ Z f ( Z e−ε/2 , Z eε/2 ) . (4.33)

Should we be interested in Z instead of ε we would evaluate

gZ(Z) =∫ +∞−∞ dε g(ε, Z) = Z

∫ +∞−∞ dε f ( Z e−ε/2 , Z eε/2 ) . (4.34)

As an example, assume that our measurement of the initial length X and of thefinal length Y has produced an independent information on X and Y that can bemodeled by the product of two log-normal functions1:

f (X, Y) = LN(X, X0,σx) LN(Y, Y0,σy) , (4.35)

1The log-normal model is acceptable, as X and Y are Jeffreys quantities.

Page 158: Mapping of Probabilities

148 Examples

whereLN(R, R0,σ) ≡ 1√

2π σ1R

exp(− 1

2

( 1σ

logRR0

)2). (4.36)

Here, X0 and Y0 are respectively the center of the probability distributions of Xand Y , and σx and σy are respectively the standard deviations of the associatedlogarithmic variables.

Then, using equation 4.33 one arrives at

gε(ε) =1√

2π σεexp

(− 1

2(ε−ε0)2

σ2ε

)(4.37)

whereε0 = log

Y0

X0; σε =

√σ2

x +σ2y . (4.38)

This is a normal probability density, centered at ε0 = log(Y0/X0) , that is the strainthat would be computed from the central values of X and Y . The standard devia-tion on the variable ε is σε =

√σ2

x +σ2y (remember that σx and σy are the standard

deviations associated to the logarithms of X and Y .Using equation 4.34 one arrives at

gZ(Z) =1√

2π σz

1Z

exp(− 1

2

( 1σz

logZZ0

)2)(4.39)

whereZ0 =

√X0 Y0 ; σZ =

√σ2

x +σ2y / 2 . (4.40)

This is a lognormal probability density, centered at Z0 =√

X0 Y0 . Remember thatZ was defined as the (geometric) average of X and Y , so it is quite reasonable thatthe Z0 , the average of Z , equals the average of X0 (that is the average of X ) andY0 (that is the average of Y ). The standard deviation of the logarithmic variableassociated with Z is σZ =

√σ2

x +σ2y / 2 .

This example has been treated using probability densities only. To pass fromprobability densities to volumetric probabilities we can introduce a metric in theX, Y manifold. As X and Y are Jeffreys quantities we can select the metricds2 = (dX/X)2 + (dY/Y)2 . This leads, for the quantities ε, Z , to the metricds2 = 1

2 dε2 + 2 (dZ/Z)2 .

Page 159: Mapping of Probabilities

4.2 Measuring a One-Dimensional Strain (I) 149

4.2.1 Measuring a One-Dimensional Strain (II)

A one-dimensional material medium with an initial length X is deformed into asecond state, where its length is Y . The strain that has affected the medium, denotedε , is defined as

ε = logYX

. (4.41)

A measurement of X and Y provides the information represented by a volumet-ric probability fr(Y, X) . This induces an information on the actual value of thestrain, that shall be represented by a volumetric probability fs(ε) . The problemis to express fs(ε) using as ‘inputs’ the definition 4.41 and the volumetric probabil-ity fr(Y, X) . Let us introduce the two-dimensional ‘data’ space R2 , over which thequantities X and Y are coordinates. The lengths X and Y being Jeffreys quanti-ties (see discussion in section XXX), we have, in the space R2 , the distance elementds2

r = ( dYY )2 + ( dX

X )2 , associated to the metric matrix

gr =

(1

Y2 00 1

X2

). (4.42)

This, in particular, gives √det gr =

1Y X

, (4.43)

so the (2D) volume element over R2 is dvr = dY∧dXY X , and any volumetric probability

fr(Y, X) over R2 is to be integrated via

Pr =∫

dY ∧ dX1

Y Xfr(Y, X) , (4.44)

over the appropriate bounds. In particular, a volumetric probability fr(Y, X) is nor-malized if the integral over ( 0 < Y < ∞ ; 0 < X < ∞ ) equals one. Let us alsointroduce the one-dimensional ‘space of deformations’ S1 , over which the quantityε is the chosen coordinate (one could as well chose the exponential of ε , or twice thestrain as coordinate). The strain being an ordinary Cartesian coordinate, we have,in the space of deformations S1 the distance element ds2

s = dε2 , associated to thetrivial metric matrix gs = (1) . Therefore,√

det gs = 1 . (4.45)

The (1D) volume element over S1 is dvs = dε , and any volumetric probability fs(ε)over S1 is to be integrated via

Ps =∫

dε fs(ε) , (4.46)

over given bounds. A volumetric probability fs(ε) is normalized by the conditionthat the integral over (−∞ < ε < +∞) equals one. As suggested in the general

Page 160: Mapping of Probabilities

150 Examples

theory, we must change the coordinates in R2 using as part of the coordinates thoseof S1 , i.e., here, using the strain ε . Then, arbitrarily, select X as second coordinate,so we pass in R2 from the coordinates Y , X to the coordinates ε , X . Then,the Jacobian matrix defined in equation ?? is

K =(

UV

)=(

∂ε/∂Y ∂ε/∂X∂X/∂Y ∂X/∂X

)=(

1/Y −1/X0 1

), (4.47)

and we obtain, using the metric 4.42,√det K g−1

r Kt = X . (4.48)

Noting that the expression 4.41 can trivially be solved for Y as

Y = X expε , (4.49)

everything is ready now to attack the problem. If a measurement of X and Y hasproduced the information represented by the volumetric probability fr(Y, X) , thistransports into a volumetric probability fs(ε) that is given by equation ??. Using theparticular expressions 4.45, 4.48 and 4.49 this gives

fs(ε) =∫ ∞

0dX

1X

fr( X expε , X ) . (4.50)

Example 4.1 In the context of the previous example, assume that the measurement of thetwo lengths X and Y has provided an information on their actual values that: (i) has in-dependent uncertainties and (ii) is Gaussian (which, as indicated in section 5.3.13.2, meansthat the dependence of the volumetric probability on the Jeffreys quantities X and Y is ex-pressed by the lognormal function). Then we have

fX(X) =1√

2π sXexp

(− 1

2 s2x

(log

XX0

)2)

, (4.51)

fY(Y) =1√

2π sYexp

(− 1

2 s2Y

(log

YY0

)2)

(4.52)

andfr(Y, X) = fY(Y) fX(X) . (4.53)

The volumetric probability for X is centered at point X0 , with standard deviation sX ,and the volumetric probability for Y is centered at point Y0 , with standard deviation sY(see section ?? for a precise —invariant— definition of standard deviation). In this simple

Page 161: Mapping of Probabilities

4.2 Measuring a One-Dimensional Strain (I) 151

example, the integration in equation 4.50 can be performed analytically, and one obtains aGaussian probability distribution for the strain, represented by the normal function

fs(ε) =1√

2π sεexp

(− (ε−ε0)2

2 s2ε

), (4.54)

where ε0 , the center of the probability distribution for the strain, equals the logarithm of theratio of the centers of the probability distributions for the lengths,

ε0 = logY0

X0, (4.55)

and where s2ε , the variance of the probability distribution for the strain, equals the sum of the

variances of the probability distributions for the lengths,

s2ε = s2

X + s2Y . (4.56)

Page 162: Mapping of Probabilities

152 Examples

4.3 Free-Fall of an Object

The one-dimensional free fall of an object (under the force of gravity) is given by theexpression

x = x0 + v0 t + 12 g t2 (4.57)

(note: explain the assumptions and what the variables are). The acceleration of grav-ity is assumed to have a fixed value (for instance, g = 9.81 m/s2 ). A firecracker isdropped that has been prepared with random values of initial position x0 , of initialvelocity v0 , and flying time t , and we are interested in the position x at which itwill explode. Assume the random values of x0, v0, t have been generated accord-ing to some probability density f (x0, v0, t) . Which is the probability density of thequantity x ?

This is a typical problem of transport of probabilities. Here we transport a prob-ability distribution defined in a three-dimensional manifold into a one-dimensionalmanifold.

We start by introducing two ‘slack quantities’ that, together with x , will form athree-dimensional set. Among the infinitely many possible choices, let us take thequantities ω and τ defined through the following change of variables

x = x0 + v0 t + 12 g t2

ω = v0τ = t

x0 = x−ωτ − 1

2 g τ2

v0 = ω

t = τ

, (4.58)

i.e., the quantity ω is, in fact, identical to the initial velocity v0 , and the quantity τis identical to the falling time t .

We can now apply the Jacobian rule to transform the probability density f (x0, v0, t)into a probability density g(x,ω, τ) :

g(x,ω, τ) = f (x0, v0, t) | det

∂x0∂x

∂x0∂ω

∂x0∂τ

∂v0∂x

∂v0∂ω

∂v0∂τ

∂t∂x

∂t∂ω

∂t∂τ

| . (4.59)

Because of the particular variables ω and τ chosen, the Jacobian determinant justequals 1, and, therefore, we simply have g(x,ω, τ) = f (x0, v0, t) . It is understoodin this expression that the three variables x0, v0, t have to be replaced, in the func-tion f , by their expression in terms of the three variables x,ω, τ (in the right offormula 4.58), so we could write, more explicitly,

g(x,ω, τ) = f ( x−ωτ − 12 g τ2 , ω , τ ) . (4.60)

The probability density we were seeking can now be obtained by integration: gx(x) =∫dω

∫dτ g(x,ω, τ) . Explicitly,

gx(x) =∫ ∞−∞ dω

∫ ∞−∞ dτ f ( x−ωτ − 1

2 g τ2 , ω , τ ) . (4.61)

Page 163: Mapping of Probabilities

4.3 Free-Fall of an Object 153

As an example, assume that the three variables x0, v0, t in the probability den-sity f are independent and, furthermore, that the probability distributions of x0and v0 are normal:

f (x0, v0, t) = N(x0, X0,σx) N(v0, V0,σv) h(t) . (4.62)

Here,

N(u, U,σ) ≡ 1√2π σ

exp(− 1

2(u−U)2

σ2

). (4.63)

This means that we assume that the variable x0 is centered at the value X0 witha standard deviation σx that the variable v0 is centered at the value V0 with astandard deviation σv and that the variable t has an arbitrary probability densityh(t) . The computation is just a matter of replacing the expression 4.62 into 4.61,and invoking a good mathematical software to perform the analytical integrations.In fact, it is better to do this in steps. First, in equation 4.62 we replace the variablesx0, v0, t by their expressions (at the right in equation 4.58) in terms of the variablesx,ω, τ , and we input the result in equation 4.60, to obtain the explicit expressionfor g(x,ω, τ) . We then first evaluate

g(x, τ) =∫ ∞−∞ dω g(x,ω, τ) , (4.64)

and, then,gx(x) =

∫ ∞−∞ dτ g(x, τ) . (4.65)

My mathematical software produces the result2

g(x, τ) = f (τ)1√

2π σ(τ)exp

(− 1

2(x− x(τ))2

σ(τ)2

), (4.66)

where

x(τ) = X0 + V0 τ + 12 g τ2 ; σ(τ) =

√σ2

x +σ2v τ

2 . (4.67)

Then, gx(x) is obtained by evaluation of the integral in equation 4.65.Note that x(τ) is the position the firecracker would have at time τ if its initial

position was X0 (the mean value of the distribution of x0 ) and if its initial velocitywas V0 (the mean value of the distribution of v0 ). Note also that the standarddeviation σ(τ) increases with time (as the result of the uncertainty in the initialvelocity v0 ).

2Unfortunately, at the time of this computation, the integral is well evaluated by the software(mathematica), but the simplification of the result has still to be made by hand.

Page 164: Mapping of Probabilities

154 Examples

4.4 Measure of Poisson’s Ratio

Hooke’s Law in Isotropic Media

For an elastic medium, in the limit of infinitesimal strains (Hooke’s law),

σi j = ci jk` εk` , (4.68)

where ci jk` is the stiffness tensor. If the elastic medium is isotropic,

ci jk` =λκ

3gi j gk` +

λµ

2(

gik g j` + gi` g jk −23

gi j gk`)

, (4.69)

where λκ (with multiplicity one) and λµ (with multiplicity five) are the two eigen-values of the stiffness tensor ci jk` . They are related to the common umcompressibil-ity modulus κ and shear modulus µ through

κ = λκ/3 ; µ = λµ/2 . (4.70)

The Hooke’s law 4.68 can, alternatively, be written

εi j = di jk`σk` , (4.71)

where di jk` , the inverse of the stiffness tensor, is called the compliance tensor. If theelastic medium is isotropic,

di jk` =γ

3gi j gk` +

ϕ

2(

gik g j` + gi` g jk −23

gi j gk`)

, (4.72)

where γ (with multiplicity one) and ϕ (with multiplicity five) are the two eigenval-ues of di jk` . These are, of course, the inverse of the eigenvalues of ci jk` :

γ =1λκ

=1

3κ; ϕ =

1λµ

=1

2µ. (4.73)

From now on, I shall call γ the eigencompressibility or, if there is no risk of confusionwith 1/κ , the compressibility. The quantitity ϕ shall be called the eigenshearabilityor, if there is no risk of confusion with 1/µ , the shearability.

With the isotropic stiffness tensor of equation 4.69, the Hooke’s law 4.68 becomes

σi j =λκ

3gi j εk

k + λµ(εi j −

13

gi j εkk) , (4.74)

or, equivalently, with the isotropic compliance tensor of equation 4.72, the Hooke’slaw 4.71 becomes

εi j =γ

3gi jσk

k +ϕ(σi j −

13

gi jσkk) . (4.75)

Page 165: Mapping of Probabilities

4.4 Measure of Poisson’s Ratio 155

Definition of the Poisson’s Ratio

Consider the experimental arrangement of figure 4.3, where an elastic medium issubmitted to the (homogeneous) uniaxial stress (using Cartesian coordinates)

σxx = σyy = σxy = σyz = σzx = 0 ; σzz 6= 0 . (4.76)

Then, the Hooke’s law 4.71 predicts the strain

εxx = εyy =13

(γ −ϕ)σzz

εzz =13

(γ + 2ϕ)σzz

σxy = σyz = σzx = 0 .

(4.77)

The Young modulus Y and the Poisson ratio ν are defined as

Y =σzz

εzz; ν = −εxx

εzz= −

εyy

εzz, (4.78)

and equation 4.77 gives

Y =3

2ϕ+γ; ν =

ϕ−γ2ϕ+γ

, (4.79)

with reciprocal relations

γ =1− 2ν

Y; ϕ =

1 + νY

. (4.80)

Figure 4.3: A possible experimental setup for measur-ing the Young modulus and the Poisson ratio of anelastic medium. The measurement of the force F ofthe ‘bar length’ Z and of the bar diameter X allowsto estimate the two elastic parameters. Details below.

Z

X

F

Note that when γ and ϕ take values inside their natural range

0 < γ < ∞ ; 0 < ϕ < ∞ , (4.81)

the variation of Y and ν is

0 < Y < ∞ ; −1 < ν < +1/2 . (4.82)

Although most materials have positive values of the Poisson ratio ν , there are ma-terials where it is negative (see figures 4.4 and 4.5)

The Poisson ratio has mainly a historical interest. Note that a simple function ofit would have given a bona fide Jeffreys quantity,

J =1 + ν

1− 2ν=λκ

λµ, (4.83)

with the natural domain of variation 0 < J < ∞ .

Page 166: Mapping of Probabilities

156 Examples

Figure 4.4: An example of a 2D elas-tic structure with a positive value ofthe Poisson ratio. When imposing astretching in one direction (the ‘horizon-tal’ here), the elastic structure reacts con-tracting in the perpendicular direction.

Figure 4.5: An example of a 2D elas-tic structure with a negative value ofthe Poisson ratio. When imposing astretching in one direction (the ‘hori-zontal’ here), the elastic structure reactsalso stretching in the perpendicular di-rection.

The Parameters

Although one may be interested in the Young modulus Y and the Poisson ratioν , we may choose to measure the compressibility γ = 1/λκ and the shearabilityϕ = 1/λµ . Any information we may need on Y and ν can be obtained, as usual,through the change of variables.

From the two first equations in expression 4.77 it follows that the relation betweenthe elastic parameters γ and ϕ , the stress and the strains is

γ =εzz + 2εxx

σzz; ϕ =

εzz −εxx

σzz. (4.84)

As the uniaxial tress is generated by a force F applied to one of the ends of the bar(and the reaction force of the support),

σzz =Fs

, (4.85)

where s , the section of the bar, is

s =π X2

4. (4.86)

The most general definition of strain (that does not assume the strains to be small) is

εxx = logXX0

; εzz = logZZ0

, (4.87)

where X0 and Z0 are the initial lengths (see figure 4.3) and X and Z are the finallengths. We have then the final relation

γ =π X2 ( log Z/Z0 + 2 log X/X0

)4 F

; ϕ =π X2 ( log Z/Z0 − log X/X0

)4 F

.(4.88)

Page 167: Mapping of Probabilities

4.4 Measure of Poisson’s Ratio 157

When necessary, these two expressions shall be written

γ = γ(X0, Z0, X, Z, F) ; ϕ = ϕ(X0, Z0, X, Z, F) . (4.89)

We shall later need to extract from these relations the two parameters X0 and Z0 :

X0 = X exp(−4 F (γ −ϕ)

3 π X2

); Z0 = Z exp

(−4 F (γ + 2ϕ)

3 π X2

), (4.90)

expressions that, when necessary, shall be written

X0 = X0(γ,ϕ, X, Z, F) ; Z0 = Z0(γ,ϕ, X, Z, F) . (4.91)

The Partial Derivatives

The variables me measure are

r = X0, Z0, X, Z, F , (4.92)

while we are interested in the two variables γ,ϕ . In order to have a set of filevariables, we take

s = γ,ϕ, X′, Z′, F′ , (4.93)

whereX′ = X ; Z′ = Z ; F′ = F . (4.94)

The relation s = s(r) corresponds to these three identities plus the two relations 4.88.We can then introduce the (inverse) matrix of partial derivatives

J−1 =

∂γ/∂X0 ∂γ/∂Z0 ∂γ/∂X ∂γ/∂Z ∂γ/∂F∂ϕ/∂X0 ∂ϕ/∂Z0 ∂ϕ/∂X ∂ϕ/∂Z ∂ϕ/∂F∂X/∂X0 ∂X/∂Z0 ∂X/∂X ∂X/∂Z ∂X/∂F∂Z/∂X0 ∂Z/∂Z0 ∂Z/∂X ∂Z/∂Z ∂Z/∂F∂F/∂X0 ∂F/∂Z0 ∂F/∂X ∂F/∂Z ∂F/∂F

, (4.95)

to easily obtain

J =16 F2 X0 Z0

3 π2 X4 . (4.96)

The Measurement

We measure X0, Z0, X, Z, F and describe the result of our measurement via a prob-ability density

f (X0, Z0, X, Z, F) . (4.97)

[Note: Explain this.]

Page 168: Mapping of Probabilities

158 Examples

Transportation of the Probability Distribution

To obtain the probability density in the variables γ,ϕ , we just apply equations ??–??. With the present notations this gives

g(γ,ϕ) =16

3 π2

∫ ∞0

dX∫ ∞

0dZ∫ +∞−∞ dF

F2 X0 Z0

X4 f (X0, Z0, X, Z, F)︸ ︷︷ ︸X0=X0(γ,ϕ,X,Z,F) ; Z0=Z0(γ,ϕ,X,Z,F)

, (4.98)

where the functions X0 = X0(γ,ϕ, X, Z, F) and Z0 = Z0(γ,ϕ, X, Z, F) are thoseexpressed by equations 4.90–4.91. The two associated marginal probability densitiesare, then,

gγ(γ) =∫ ∞

0dϕ g(γ,ϕ) and gϕ(ϕ) =

∫ ∞0

dγ g(γ,ϕ) . (4.99)

As γ and ϕ are Jeffreys quantities, we can easily transform these probabilitydensities into volumetric probabilities. One has

g(γ,ϕ) = γϕ g(γ,ϕ) , (4.100)

gγ(γ) =∫ ∞

0

dϕϕ

g(γ,ϕ) and gϕ(ϕ) =∫ ∞

0

dγγ

g(γ,ϕ) . (4.101)

To represent the results is better to use the ‘Cartesian parameters’ of the problem[note: explain]. Here, the logarithmic parameters

γ∗ = logγ

γ0ϕ∗ = log

ϕ

ϕ0, (4.102)

where γ0 and ϕ0 are two arbitray constants having the dimension of a complianceare Cartesian coordinates over the 2D space of elastic (isotropic) media. As volumet-ric probabilities are invariant, we simply have

h(γ∗,ϕ∗) = g(γ,ϕ)|γ=γ0 expγ∗ ; ϕ=ϕ0 expϕ∗ . (4.103)

Numerical Illustration

Let us use the notations N(u, u0, s) and L(U, U0, s) respectively for the normal andthe lognormal probability densities

N(u, u0, s) =1√2π s

exp(− (u− u0)2

2 s2

)L(U, U0, s) =

1√2π s

1U

exp(− 1

2 s2

(log

UU0

)2).

(4.104)

Page 169: Mapping of Probabilities

4.4 Measure of Poisson’s Ratio 159

Asume that the result of the measurement of the quantities X0 , Z0 (initial di-ameter and length of the bar), X , Z (final diameter and length of the bar), and theforce F , has given an information that can be represented by a probability desitywith independent uncertainties,

f (X, X0, Z, Z0, F) =

L(X0, Xobs0 , sX0) L(Z0, Zobs

0 , sZ0) L(X, Xobs, sX) L(Z, Zobs, sZ) N(F, Fobs, sF) ,(4.105)

with the numerical valuesXobs

0 = 1.000 m ; sX0 = 0.015Zobs

0 = 1.000 m ; sZ0 = 0.015Xobs = 0.975 m ; sX = 0.015Zobs = 1.105 m ; sZ = 0.015Fobs = 9.81 kg m/s2 ; sF ≈ 0 .

This is the probability density that appears at the right of equation 4.98. To simplifythe example I have assumed that the uncertainty on the force F is much smallerthan the other uncertainties, so, in fact, F can be treated as a constant. Figure 4.6displays the four (marginal) one-dimensional lognormal probability densities (withthe small uncertainties chosen, the lognormal probability densities in 4.105 visuallyappear as normal probability densities). To illustrate how the uncertaintiers in themeasurement of the lengths propagate into uncertainties in the elastic parameters,I have chosen the quite unrealistic example where the uncertainties in X and X0overlap: it is likely that the diameter of the rod has decreased (so the Poisson ratiois positive) but the probability that it has increased (negative Poisson ratio) is signif-icant. In fact, as we shall see, the measurement don’t even exclude the virtuality ofnegative elastis parameters γ and ϕ (this possibility being excluded by the elastictheory that in included in the present formulation).

Figure 4.6: The four 1D marginal volumetic probabilitities forthe initial and final lengths. Note that the uncertainties in Xand X0 overlap: it is likely that the diameter of the rod hasdecreased (so the Poisson ratio is positive) but the probabilitythat it has increased (negative Poisson ratio) is significant.

1 1.1

length ZZ0

1 1.1

diameterX X0

Figure 4.7 represents the volumetric probability h(γ∗,ϕ∗) defined by equations4.98, 4.103 and ??. It represents the information that the measurements of the lengthhas given on the elastic parameters γ and ϕ . [Note: Explain this better.] [Note:Explain that negative values of γ and ϕ are excluded ‘by hand’].

The two associated marginal volumetric probabilities are defined in equations 4.101,and are represented in figure 4.8.

Note: mention here figure 4.9.

Page 170: Mapping of Probabilities

160 Examples

-6 -4-8-10

-6

-5

-4

-10 -8 -6 -4

-6

-5

-4

γ∗ = log γ Q

ϕ∗ =

log ϕ

Q

γ∗ = log γ Q

ϕ∗ =

log ϕ

Q

( Q = 1N/m )2

Figure 4.7: The (2D) volumetric probability for the compressibility γ and the shear-ability ϕ , as induced from the measurement results. At the left a direct representa-tion of the volumetric probability defined by equation 4.98 and 4.103. At the right,a Monte Carlo simulation of the measurement (see section ??). Here, natural loga-rithms are used, and Q = 1 N/m2 . Of the 3000 points used, 9 falled at the left and 7below the domain plotted, and are not represented. The zone of nonvanishing prob-ability extends over all the space, and only the level lines automatically proposed bythe plotting software have been used.

Figure 4.8: The marginal(1D) volumetric proba-bilities defined by equa-tions 4.101.

-7 -6 -5 -4-12 -10 -8 -6 -4

γ∗ = log γ Q ϕ∗ = log ϕ Q

Log[X/k] = −0.094 Log[X/k] = +0.068Log[X/k] = −0.094 Log[X/k] = +0.068

Log[X /k] = −0.0940

Log[X /k] = +0.0680

Figure 4.9: The marginal probability distributions for the lengths X and X0 . At theleft, a Monte Carlo sampling of the probability distribution for X as X0 defined byequation 4.105 (the values Z and Z0 are also sampled, but are not shown). At theright, the same Monte Carlo sampling, but where only the points that correspond,through equation 4.88, to positive values of γ and ϕ (and, thus, acceptable by thetheory of elastic media). Note that many of the points ‘behind’ the diagonal bar havebeen suppressed.

Page 171: Mapping of Probabilities

4.4 Measure of Poisson’s Ratio 161

Translation into the Young Modulus and Poisson Ratio Language

From the volumetric probability g(γ,ϕ) we immediately deduce the expression ofthe volumetric probability q(Y,ν) for the Young modulus Y and the Poisson ra-tio ν :

q(Y,ν) = g(γ,ϕ)|γ= 1−2νY ν= 1+ν

Y. (4.106)

I prefer to suggest an alternative to the evaluation of q(Y,ν) . We have seen thatthe quantities γ∗ and ϕ∗ (logarithmic compressibility and and logarithmic sheara-bility) are Cartesian quantities in the 2D space of linear elastic media. My preferredchoice for visualizing q(Y,ν) is a direct representation of the ‘new coordinates’ on ametrically correct representation, i.e., to superimpose in figure 4.7, where the coordi-nates γ∗ and ϕ∗ where used, the new coordinates Y,ν (the change of variablesbeing deined by equations 4.79–4.80). This gives the representation displayed in fig-ure 4.10.

Figure 4.10: The metri-cally correct representationof the volumetric proba-bility q(Y,ν) , obtained byjust superimposing on thefigure 4.7 the new coordi-nates Y,ν . As above,Q = 1 N/m2 .

γ∗ = log γ Q

ϕ∗ =

log ϕ

Q

υ =

+0.

49Y = 100 Q

Y = 200 Q

Y = 300 Q

-10 -8 -6 -4

-6

-5

-4

υ =

−0.

2υ =

0

υ =

+0.

2

υ =

+0.

4

υ =

−0.

6

υ =

−0.

8

As this is not the conventional way of plotting probability distributions, let usalso examine the more conventional plot of q(Y,ν) in figure 4.11. One may observe,in particular, the ‘round’ character of the ‘level lines’ in this plot, due to the fact thatthe experiment was specially designed to have a good (and independent) resolutionof the Young modulus and the Poisson ratio.

Figure 4.11: The volumetric probability for the Youngmodulus Y and the Poisson ratio ν , deduced, usinga change of variables, from the volumetric probabil-ity on γ and ϕ represented in figure 4.7 (see equa-tion 4.106).

100 150 200

-0.2

0

0.2

0.4

Y

ν

Page 172: Mapping of Probabilities

162 Examples

As the metric matrix is not diagonal in the coordinates Y,ν , one can not de-fine marginal volumetric probabilities, but marginal probability densities only (seesection 3.3.3). Let us evaluate them.

We may start by the consideration that the distance element over the space γ,ϕis

ds2 =(

dγγ

)2

+(

dϕϕ

)2

, (4.107)

so the metric matrix is

gr =1c2

(1/γ2 0

0 1/ϕ2

). (4.108)

To obtain the expression of the metric in the coordinates Y,ν one can use thepartial derivatives of the old coordinates with respect to the new coordinates, andequation 2.65. Then, the metric matrix in equation 4.108, written in the coordinatesγ,ϕ becomes(

gYY gYνgνY gνν

)=

( 2Y2

2Y(1−2ν) −

1Y(1+ν)

2Y(1−2ν) −

1Y(1+ν)

4(1−2ν)2 + 1

(1+ν)2

), (4.109)

the metric determinant being√det g =

3Y (1 + ν)(1− 2ν)

. (4.110)

The the probability density is then q(Y,ν) =√

det g q(Y,ν) , i.e.,

q(Y,ν) =3 q(Y,ν)

Y (1 + ν) (1− 2ν). (4.111)

The marginal probability density for the Young modulus is then defined as qY(Y) =∫ +1/2−1 dν q(Y,ν) , i.e.,

qY(Y) =3Y

∫ +1/2

−1dν

q(Y,ν)(1 + ν) (1− 2ν)

, (4.112)

and the marginal probability density for the Poisson ratio is qν(ν) =∫∞

0 dY q(Y,ν) ,i.e.,

qν(ν) =3

(1 + ν) (1− 2ν)

∫ ∞0

dYq(Y,ν)

Y. (4.113)

Then, we can evaluate probabilities like

P(Y1 < Y < Y2) =∫ Y2

Y1

dY qY(Y) ; P(ν1 < ν < ν2) =∫ ν2

ν1

dν qν(ν) . (4.114)

As an example, the marginal probability density for the Poisson ratio, qν(ν) , isplotted in figure 4.12.

Page 173: Mapping of Probabilities

4.4 Measure of Poisson’s Ratio 163

Figure 4.12: The marginal probability density for the Pois-son ratio ν (equation 4.113).

-1 -0.5 0 0.5

ν

Page 174: Mapping of Probabilities

164 Examples

4.5 Mass Calibration

Note: I take this problem from Measurement Uncertainty and the Propagation ofDistributions, by Cox and Harris, 10-th International Metrology Congress, 2001.

When two bodies, with masses mW and mR , equilibrate in a balance that oper-ates in air of density a , one has (taking into account Archimedes’ buoyancy),(

1− aρW

)mW =

(1− a

ρR

)mR , (4.115)

where ρW and ρR are the two volumetric masses of the bodies.Given a body with mass m , and volumetric mass ρ , it is a common practice in

metrology to define its ‘conventional mass’, denoted m0 , as the mass of a (hypothet-ical) body of conventional density ρ0 = 8000 kg/m3 in air of conventional densitya0 = 1.2 kg/m3 . The equation above then gives the relation(

1− a0

ρ0

)m0 =

(1− a0

ρ

)m . (4.116)

In terms of conventional masses, equation 4.115 becomes

ρW − aρW − a0

mW,0 =ρR − aρR − a0

mR,0 . (4.117)

To evaluate the mass mW,0 of a body one puts a mass mR,0 in the other arm, andselects the (typically small) mass δmR,0 (with same volumetric mass as mR,0 ) thatequilibrates the balance. Replacing mR,0 by mR,0 + δmR,0 in the equation above,and solving for mW,0 gives

mW,0 =(ρR − a) (ρW − a0)(ρW − a) (ρR − a0)

(mR,0 + δmR,0) . (4.118)

The knowledge of the five quantities mR,0 , δmR,0 , a , ρW , ρR allows, via equa-tion 4.118, to evaluate mW,0 . Assume that a measure of these five quantities has pro-vided the information represented by the probability density f (mR,0, δmR,0, a,ρW ,ρR) .Which is the probability density induced over the quantity mW,0 by equation 4.118?

This is just a special case of the transport of probabilities considered in section ??,so we can directly apply here the results of the section. In the five-dimensional ‘mea-surement space’ over which the variables mR,0 , δmR,0 , a , ρW , ρR can be consid-ered as coordinates, we can change to the variables mW,0 , δmR,0 , a , ρW , ρR , thisdefining the matrix K of partial derivatives (see equation ??). One easily arrives atthe simple result

√det K Kt =

(ρR − a) (ρW − a0)(ρW − a) (ρR − a0)

. (4.119)

Page 175: Mapping of Probabilities

4.5 Mass Calibration 165

Because of the change of variables used, we shall also need to express mW,0 as afunction of mR,0 , δmR,0 , a , ρW , ρR . From equation 4.118 one immediately ob-tains

mR,0 =(ρW − a) (ρR − a0)(ρR − a) (ρW − a0)

mW,0 − δmR,0 . (4.120)

Equation ?? gives the probability density for mW,0 :

g(mW,0) =∫

dδmR,0

∫da∫

dρW

∫dρR

(ρW − a) (ρR − a0)(ρR − a) (ρW − a0)

f (mR,0, δmR,0, a,ρW ,ρR) ,

(4.121)where in f (mR,0, δmR,0, a,ρW ,ρR) one has to replace the variable mR,0 by its expres-sion as a function of the other five variables, as given by equation 4.120.

Given the probability density f (mR,0, δmR,0, a,ρW ,ρR) representing the informa-tion obtained though the measurement act, one can try an analytic integration (pro-vided the probability density f has an analytical expression, or it can be approxi-mated by one). More generally, the probability density f can be sampled using theMonte Carlo methods described in section XXX.

This is, in fact, quite trivial here. Let us denote r = mR,0, δmR,0, a,ρW ,ρR ands = mW,0 . Then the relation 4.118 can be written formally as s = s(r) . One just needsto sample f (r) to obtain points r1 , r2 , . . . . The points s1 = s(r1) , s2 = s(r2) , . . .are samples of g(s) (because of the very definition of the notion of transport of prob-abilities).

Page 176: Mapping of Probabilities

166 Examples

4.6 Probabilistic Estimation of Earthquake Locations

Earthquakes generate waves, and the arrival times of the waves at a network ofseismic observatories carries information on the location of the hypocenter. This in-formation is better understood by a direct examination of the probability densityf (X, Y, Z) defined by the arrival times, rather than just estimating a particular loca-tion (X, Y, Z) and the associated uncertainties.

Provided that a ‘black box’ is available that rapidly computes the travel times tothe seismic station from any possible location of the earthquake, this probabilistic ap-proach can be relatively efficient. This appendix shows that it is quite trivial to writea computer code that uses this probabilistic approach (much easier than to write acode using the traditional Geiger method, that seeks to obtain the ‘best’ hypocentralcoordinates).

A Priori Information on Model Parameters

The ‘unknowns’ of the problem are the hypocentral coordinates of an Earthquake3

X, Z , as well as the origin time T . We assume to have some a priori informationabout the location of the earthquake, as well as about ots origin time. This a prioriinformation is assumed to be represented using the probability density

ρm(X, Z, T) . (4.122)

Because we use Cartesian coordinates and Newtonian time, the homogeneous prob-ability density is just a constant,

µm(X, Y, T) = k . (4.123)

For consistency, we must assume (rule ??) that the limit of ρm(X, Z, T) for infinite‘dispersions’ is µm(X, Z, T) .

Example 4.2 We assume that the a priori probability density for (X, Z) is constant insidethe region 0 < X < 60 km , 0 < Z < 50 km , and that the (unnormalizable) probabilitydensity for T is constant.

Data

The data of the problem are the arrival times t1, t2, t3, t4 of the seismic waves at aset of four seismic observatories whose coordinates are xi, zi . The measurementof the arrival times will produce a probability density

σobs(t1, t2, t3, t4) (4.124)

3To simplify, here, we consider a 2D flat model of the Earth, and use Cartesian coordinates.

Page 177: Mapping of Probabilities

4.6 Probabilistic Estimation of Earthquake Locations 167

over the ‘data space’. As these are Newtonian times, the associated homogeneousprobability density is constant:

µo(t1, t2, t3, t4) = k . (4.125)

For consistency, we must assume (rule ??) that the limit of σobs(t1, t2, t3, t4) forinfinite ‘dispersions’ is µo(t1, t2, t3, t4) .

Example 4.3 Assuming Gaussian, independent uncertainties, we have

σobs(t1, t2, t3, t4) = k exp

(−1

2(t1 − t1

obs)2

σ21

)exp

(−1

2(t2 − t2

obs)2

σ22

)

× exp

(−1

2(t3 − t3

obs)2

σ23

)exp

(−1

2(t4 − t4

obs)2

σ24

).(4.126)

Solution of the Forward Problem

The forward problem consists in calculating the arrival times ti as a function of thehypocentral coordinates X, Z , and the origin time T :

ti = f i(X, Z, T) . (4.127)

Example 4.4 Assuming that the velocity of the medium is constant, equal to v ,

t1cal = T +

√(X− xi)2 + (Z− zi)2

v. (4.128)

Solution of the Inverse Problem

Note: explain here that ‘putting all this together’,

σm(X, Z, T) = kρm(X, Z, T) σobs(t1, t2, t3, t4)∣∣∣ti= f i(X,Z,T)

. (4.129)

Page 178: Mapping of Probabilities

168 Examples

Numerical Implementation

To show how simple is to implement an estimation of the hypocentral coordinatesusing the solution given by equation 4.129, we give, in extenso, all the commandsthat are necessary to the implementation, using a commercial mathematical software(Mathematica). Unfortunately, while it is perfectly possible, using this software, toexplicitly use quantities with their physical dimensions, the plotting routines requireadimensional numbers. This is why the dimensions have been suppresed in whayfollows. We use kilometers for the space positions and seconds for the time positions.

We start by defining the geometry of the seismic network (the vertical coordinatez is oriented with positive sign upwards):

x1 = 5;

z1 = 0;

x2 = 10;

z2 = 0;

x3 = 15;

z3 = 0;

x4 = 20;

z4 = 0;

The velocity model is simply defined, in this toy example, by giving its constantvalue ( 5 km/s ):

v = 5;

The ‘data’ of the problem are those of example 4.3. Explicitly:

t1OBS = 30.3;

s1 = 0.1;

t2OBS = 29.4;

s2 = 0.2;

t3OBS = 28.6;

s3 = 0.1;

t4OBS = 28.3;

s4 = 0.1;

rho1[t1_] := Exp[ - (1/2) (t1 - t1OBS)^2/s1^2 ]

rho2[t2_] := Exp[ - (1/2) (t2 - t2OBS)^2/s2^2 ]

rho3[t3_] := Exp[ - (1/2) (t3 - t3OBS)^2/s3^2 ]

rho4[t4_] := Exp[ - (1/2) (t4 - t4OBS)^2/s4^2 ]

rho[t1_,t2_,t3_,t4_]:=rho1[t1] rho2[t2] rho3[t3] rho4[t4]

Page 179: Mapping of Probabilities

4.6 Probabilistic Estimation of Earthquake Locations 169

Although an arbitrarily complex velocity velocity model could be consideredhere, let us take, for solving the forward problem, the simple model in example 4.4:

t1CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x1)^2 + (Z - z1)^2 ]

t2CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x2)^2 + (Z - z2)^2 ]

t3CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x3)^2 + (Z - z3)^2 ]

t4CAL[X_, Z_, T_] := T + (1/v) Sqrt[ (X - x4)^2 + (Z - z4)^2 ]

The posterior probability density is just that defined in equation 4.129:

sigma[X_,Z_,T_] := rho[t1CAL[X,Z,T],t2CAL[X,Z,T],t3CAL[X,Z,T],t4CAL[X,Z,T]]

We should have multiplied by the ρm(X, Z, T) defined in example 4.2, but as thisjust corresponds to a ‘trimming’ of the values of the probability density outside the‘box’ 0 < X < 60 km , 0 < Z < 50 km , we can do this afterwards.

The defined probability density is 3D, and we could try to represent it. Instead,let us just represent the marginal probabilty densities. First, we ask the software toevaluate analytically the space marginal:

sigmaXZ[X_,Z_] = Integrate[ sigma[X,Z,T], T,-Infinity,Infinity ];

This gives a complicated result, with hypergeometric functions4. Representing thisprobability density is easy, as we just need to type the command

ContourPlot[-sigmaXZ[X,Z],X,15,35,Z,0,-25,

PlotRange->All,PlotPoints->51]

The result is represented in figure 4.13 (while the level lines are those directly pro-duced by the software, there has been some additional editing to add the labels).When using ContourPlot, we change the sign of sigma, because we wish to reversethe software’s convention of using light colors for positive values. We have chosenthe right region of the space to be plotted (significant values of sigma) by a prelimi-nary plotting of ‘all’ the space (not represented here).

Should we have some a priori probability density on the location of the earth-quake, represented by the probability density f(X,Y,Z), then, the theory says that weshould multiply the density just plotted by f(X,Y,Z). For instance, if we have the apriori information that the hypocenter is above the level z = −10 km, we just put tozero everyhing below this level in the figure just plotted.

Let us now evaluate the marginal probability density for the time, by typing thecommand

sigmaT[T_] := NIntegrate[ sigma[X,Z,T], X,0,+60, Z,0,+50 ]

4Typing sigmaXZ[X,Z] presents the result.

Page 180: Mapping of Probabilities

170 Examples

Here, we ask Mathematica NOT to try to evaluate analytically the result, but toperform a numerical computation (as we have checked that no analytical result isfound). We use the ‘a priori information’ that the hypocenter must be inside a region0 < X < 60 km , 0 < Z < 50 km but limiting the integration domain to that area(see example 4.2). To represent the result, we enter the command

p = Table[0,i,1,400];

Do[ p[[i]] = sigmaT[i/10.] , i,100,300]

ListPlot[ p,PlotJoined->True, PlotRange->100,300,All]

and the produced result is shown (after some editing) in figure 4.14. The softwarewas not very stable in producing the results of the numerical integhration.

Figure 4.13: The probability density for thelocation of the hypocenter. Its asymmetricshape is quite typical, as seismic observato-ries tend to be asymmetrically placed.

15 20 25 30 35-25

-20

-15

-10

-5

0

0 km 10 km 20 km 30 km

0 km

-10 km

-20 km v = 5 km/s

t1ob

s =

(30

.3 ±

0.1

) s

t2ob

s =

(29

.4 ±

0.2

) s

t3ob

s =

(28

.6 ±

0.1

) s

t4ob

s =

(28

.3 ±

0.1

) s

Figure 4.14: The marginal probability densityfor the origin time. The asymmetry seen in theprobability density in figure 4.13, where the de-cay of probability is slow downwards, translateshere in significant probabilities for early times.The sharp decay of the probability density fort < 17s does not come from the values of thearrival times, but from the a priori informationthat the hypocenters must be above the depthZ = −50 km .

15 s 20 s 25 s 30 s10 s

Page 181: Mapping of Probabilities

4.6 Probabilistic Estimation of Earthquake Locations 171

An Example of Bimodal Probability Density for an Arrival Time.

As an exercise, the reader could reformulate the problem replacing the assumtionof Gaussian uncertainties in the arrival times by multimodal probability densities.For instance, figure 5.41 suggested the use of a bimodal probability density for thereading of the arrival time of a seismic wave. Using the Mathematica software, thecommand

rho[t_] := (If[8.0<t<8.8,5,1] If[9.8<t<10.2,10,1])

defines a probability density that, when plotted using the command

Plot[ rho[t],t,7,11 ]

produces the result displayed in figure 4.15.

Figure 4.15: In figure 5.41 it was sug-gested that the probability density forthe arrival time of a seismic phase maybe multimodal. This is just an exampleto show that it is quite easy to definesuch multimodal probability densities incomputer codes, even if they are not an-alytic.

8 9 10 11

2

4

6

8

10

Page 182: Mapping of Probabilities

172 Examples

Page 183: Mapping of Probabilities

Chapter 5

Appendices

5.1 APPENDICES FOR SET THEORY

Page 184: Mapping of Probabilities

174 Appendices

5.1.1 Proof of the Set Property ϕ[ A∩ϕ-1[ B ] ] = ϕ[A]∩ B

First Version of the Proof (note: keep only one version)

Let us here demonstrate the property that, if ϕ represents a mapping from a set A0into a set B0 , then, for any A ⊆ A0 and for any B ⊆ B0 ,

ϕ[ A∩ϕ-1[ B ] ] = ϕ[A]∩ B . (5.1)

First, consider an element y ∈ ϕ[ A∩ϕ-1[ B ] ] . This means that there is a x ∈A∩ϕ-1[ B ] with ϕ(x) = y . Now x ∈ϕ-1[ B ] means that ϕ(x) ∈ B . Hence we havethat x ∈ A and ϕ(x) ∈ B , and y =ϕ(x) , so y ∈ϕ[A]∩ B .

Conversely, if y ∈ ϕ[A]∩ B , there is an x ∈ A with y = ϕ(x) ∈ B . This canbe written equivalently as x ∈ ϕ-1[ B ] , y = ϕ(x) , x ∈ A , which can be written asy ∈ϕ[ A∩ϕ-1[ B ] ] .

Second Version of the Proof (note: keep only one version)

Consider an element y ∈ ϕ[ A∩ϕ-1[ B ] ] . This means that there is a x ∈ A∩ϕ-1[ B ]with ϕ(x) = y . Therefore, y ∈ ϕ[A] . From another side, as x ∈ ϕ-1[ B ] , there isone y′ ∈ B such that x = ϕ-1(y′) . Therefore, ϕ(x) = y′ = y . This demonstratesthat y ∈ϕ[A]∩ B . We conclude that ϕ[A∩ϕ-1[ B ]] ⊆ ϕ[A]∩ B .

Let now be one y ∈ ϕ[A]∩ B . This means that there is a x ∈ A with y = ϕ(x) .Therefore, ϕ-1(y) = x , and, as y ∈ B , we have x ∈ A∩ϕ-1[ B ] . This demonstratesthat y ∈ϕ[A∩ϕ-1[ B ]] . We conclude that ϕ[A]∩ B ⊆ ϕ[ A∩ϕ-1[ B ] ] .

Therefore, the two sets ϕ[ A∩ϕ-1[ B ] ] and ϕ[A]∩ B are identical.

Page 185: Mapping of Probabilities

5.2 APPENDICES FOR MANIFOLDS 175

5.2 APPENDICES FOR MANIFOLDS

Page 186: Mapping of Probabilities

176 Appendices

5.2.1 Capacity Element and Change of Coordinates

Consider the problem, when dealing with an n-dimensional manifold, to pass froma coordinate system xα = x1, . . . , xn to some other coordinate system xα

′ =x1′ , . . . , xn′ , and let be, as usual,

Xαα′ =∂xα

∂xα′; Xα

′α =

∂xα′

∂xα. (5.2)

The capacity elements in each of the two coordinate systems are

dv = εα1 ...αn dxα1 . . . dxαn ; dv′ = εα′1 ...α′n dxα′1 . . . dxα

′n , (5.3)

and one can write

dv′ = εα′1 ...α′n Xα′1α1 . . . Xα

′nαn dxα1 . . . dxαn . (5.4)

Because of the antisymmetry properties of the Levi-Civita densities and capacities,this can also be written (see relations 2.32) as

dv′ = εα′1 ...α′n Xα′1β1 . . . Xα

′nβn ( 1

n! εβ1 ...βn εα1 ...αn ) dxα1 . . . dxαn , (5.5)

i.e.,dv′ = ( 1

n! εα′1 ...α′n Xα′1β1 . . . Xα

′nβn ε

β1 ...βn ) εα1 ...αn dxα1 . . . dxαn , (5.6)

In the left-hand side, onr recognizes the definition of a determinant (see the thirdof equations 2.33) and one finds the capacity element dv introduced above, so onefinally has

dv′ = (det X′) dv , (5.7)

as one should, as this is the general expression for the change of value of a scalar1

capacity.

1A scalar capacity is a capacity of rank (or order) zero, i.e., a capacity “having no tensor indices”.

Page 187: Mapping of Probabilities

5.2 APPENDICES FOR MANIFOLDS 177

5.2.2 Conditional Volume

Consider an n-dimensional manifold Mn , with some coordinates x1, . . . , xn , anda metric tensor gi j(x) . Consider also a p-dimensional submanifold Mp of the n-dimensional manifold Mn (with p ≤ n ). The n-dimensional volume over Mn ,as characterized by the metric determinant g =

√det g , induces a p-dimensional

volume over the submanifold Mp . Let us try to characterize it.The simplest way to represent a p-dimensional submanifold Mp of the n-dimen-

sional manifold Mn is by separating the n coordinates x = x1, . . . , xn of Mninto one group of p coordinates r = r1, . . . , rp and one group of q coordinatess = s1, . . . , sq , with

p + q = n . (5.8)

Using the notations

x = x1, . . . , xn = r1, . . . , rp, s1, . . . , sq = r, s , (5.9)

the set of q relations

s1 = s1(r1, r2, . . . , rp)s2 = s2(r1, r2, . . . , rp)

. . . = . . .sq = sq(r1, r2, . . . , rp) , (5.10)

that, for short, may be writtens = s(r) , (5.11)

define a p-dimensional submanifold Mp in the (p + q)-dimensional manifold Mn .For later use, we can now introduce the matrix of partial derivatives

S =

S1

1 S12 · · · S1

pS2

1 S22 · · · S2

p...

... . . . ...Sq

1 Sq2 · · · Sq

p

=

∂s1

∂r1∂s1

∂r2 · · · ∂s1

∂rp

∂s2

∂r1∂s2

∂r2 · · · ∂s2

∂rp

...... . . . ...

∂sq∂r1

∂sq∂r2 · · · ∂sq

∂rp

. (5.12)

We can write S(r) for this matrix, as it is defined at a point x = r, s(r) . Note alsothat the metric over Mn can always be partitioned as

g(x) = g(r, s) =(

grr(r, s) grs(r, s)gsr(r, s) gss(r, s)

), (5.13)

with grs = (gsr)T .In what follows, let us use the Greek indexes α,β . . . r1, . . . , rp , like in

rα ; α ∈ 1, . . . , p , and the Latin indexes a, b . . . for the variables s1, . . . , sq ,like in sa ; a ∈ 1, . . . , q . Consider an arbitrary point r, s of the manifold M .

Page 188: Mapping of Probabilities

178 Appendices

r2

s

r1

s

r1

s = s( r , r )1 2

Some surface coordinates

of a coordinate system

over a 3D manifold

An elementary region

on the coordinate surface

defined by a condition

s = constant

An elementary region

on the surface

defined by a condition

dSdSr

2

s = s( r , r )1 2

Figure 5.1: On a 3D manifold, a coordinate system x1, x2, x3 = r1, r2, s is de-fined. Some characteristic surface coordinates are represented (left). In the middle,a surface element (2D volume element) on a coordinate surface s = const. is repre-sented, that corresponds to the expression in equation 5.14. In the right, a submani-fold (surface) is defined by an equation s = s(r1, r2) . A surface element (2D volumeelement) is represented on the submanifold, that corresponds to the expression inequation 5.15.

If the coordinates rα are perturbed to rα + drα , with the coordinates sa kept un-perturbed, one defines a p-dimensional submanifold of the n-dimensional manifoldMn . The volume element of this submanifold can be written (middle panel in fig-ure 5.1)

dvp(r, s) =√

det grr(r, s) dr1 ∧ · · · ∧ drp . (5.14)

Alternatively, consider a point (r, s) of Mn that, in fact, is on the submanifoldMp , i.e., a point that has coordinates of the form (r, s(r)) . It is clear that the variablesr1 . . . rp define a coordinate system over the submanifold, as it is enough to preciser to define a point in Mp . If the coordinates rα are perturbed to rα + drα , andthe coordinates sa are also perturbed to sa + dsa in a way that one remains on thesubmanifold, (i.e., with dsa = Sa

α drα ), then, with the metric over Mn partitioned asin equation 5.13, the general distance element ds2 = gi j dxi dx j can be written ds2 =(grr)αβ drα drβ + (grs)αb drα dsb + (gsr)aβ dsa drβ + (gss)ab dsa dsb , and replacing dsa

by dsa = Saα drα , we obtain ds2 = Gαβ drα drβ , with G = grr + grs S + ST gsr +

ST gss S . The ds2 just expressed gives the distance between two any points of Mp ,i.e., G is the metric matrix of the submanifold associated to the coordinates r .

The p-dimensional volume element on the submanifold Mp is, then, dvr =√det G dr1 ∧ · · · ∧ drp , i.e.,

dvp(r) =√

det (grr + grs S + ST gsr + ST gss S) dr1 ∧ · · · ∧ drp , (5.15)

Page 189: Mapping of Probabilities

5.2 APPENDICES FOR MANIFOLDS 179

where, if the variables are explicitly written, S = S(r) , grr = grr(r, s(r)) , grs =grs(r, s(r)) , gsr = gsr(r, s(r)) and gss = gss(r, s(r)) . Figure 5.1 illustrates this result.The expression 5.15 says that the p-dimensional volume density induced over thesubmanifold Mp is

gp =√

det (grr + grs S + ST gsr + ST gss S) . (5.16)

Note that the notion of ‘conditional volume’ just explored does not make anysense when the manifold is not metric.

Page 190: Mapping of Probabilities

180 Appendices

5.3 APPENDICES FOR PROBABILITY THEORY

Page 191: Mapping of Probabilities

5.3 APPENDICES FOR PROBABILITY THEORY 181

5.3.1 Image of a Probability Density

Consider a mapping ϕ from a p-dimensional manifold Mp into a q-dimensionalmanifold Mq . To every probability function P defined over (a σ-field of subsetsof) Mp , the mapping ϕ associates its image Q = ϕ[P] , the probability functionover Mq defined in section 3.2.2. (Note: I should check that the image of a σ-field isa σ-field.)

There are two different situations (illustrated in figure 5.2):

• p ≥ q : When the dimension of the departure manifold is larger or equal thanthe dimension of the arrival manifold, the probability function Q = ϕ[P] isregular, i.e., it is an ordinary probability function defined over the q-dimensionalarrival manifold. This is the most common situation.

• p < q : When the dimension of the departure manifold is smaller than the di-mension of the arrival manifold, the probability function Q =ϕ[P] is singular,in the sense that it is only defined over a p-dimensional submanifold of theq-dimensional arrival manifold. This is an unfrequent situation.

Figure 5.2: When considering theimage of a probability functiondefined over a manifold, thereare two different situations to beanalyzed: (i) when the dimen-sion of the departure manifoldis greater or equal and (ii) whenit is smaller (bottom). (Note: Ishould perhaps not draw coordi-nates in this figure.)

f(x1,x2)

y1

y2

y3

x1

x2

g(y1,y2,y3)y = ϕ(x)

y1

y2g(y1,y2)

f(x1,x2,x3)

x1

x2

x3

y = ϕ(x)

The case p ≥ q is . . . (Note: write this paragraph after having done the computa-tions.)

The case p ≤ q is relatively easy to analyze, as any coordinate system x1, . . . , xpchosen over Mp also constitutes a coordinate system over the p-dimensional sub-manifold of Mq where Q = ϕ[P] is defined. I this sense, one can consider that theimage of a probability density f (x1, . . . , xp) defined over Mp is the same probabil-ity density, as one is using the same coordinates x1, . . . , xp over the image of Mp .So, the only interesting problem in the case p ≤ q is for the case when the twomanifolds are metric manifolds as there are then volume elements defined indepen-dently of any coordinate system, and the image of a volumetric probability can thenbe expressed using a nontrivial formula (equation 5.48 below).

Page 192: Mapping of Probabilities

182 Appendices

Case p = q :

We consider here a mapping y = ϕ(x) between two manifolds having the samedimension. Should the mapping ϕ be a bijection, then, to pass from the probabilitydensity function f (x) —over the departure manifold— to the probability densityfunction g(y) —over the arrival manifold— one should just use a reasoning verysimilar to that used when changing coordinates on a manifold. The result wouldhave been (see section 3.3.1)

g(y) =f (ϕ-1(y) )Y(ϕ-1(y) )

, (5.17)

whereY(x) =

∂ϕ

∂x(x) ; Y(x) = det Y(x) , (5.18)

and the condition(ϕ[P])[ B ] = P[ϕ-1[ B ] ] (5.19)

would obviously be satisfied:

Q[ B ] = (ϕ[P])[ B ] =∫

Bdy1 ∧ · · · ∧ dyp g(y1, . . . , yp)

=∫ϕ-1[B]

dx1 ∧ · · · ∧ dxp f (x1, . . . , xp) = P[ϕ-1[ B ] ] .

(5.20)

If the mapping ϕ is not a bijection, we can always partition2 A = ϕ-1[ B ] into aset of subsets Ar = A1, A2, . . . such that the mapping ϕr between each of theAr and B is a bijection (see an illustration of this in figure 5.3). Each of the subsetsthen contributes in the same way to g(y) , so the result is, then,

g(y) = ∑r

f (ϕ-1r (y) )

Y(ϕ-1r (y) )

, (5.21)

or, equivalently,

g(y) = ∑r

f (x)Y(x)

∣∣∣∣∣x=ϕ-1

r (y)

. (5.22)

If the two manifolds have volume densities defined, say ω(x) and v(y) , then,replacing the probability densities by the volumetric probabilities gives

g(y) =1

v(y) ∑r

f (x)ω(x)Y(x)

∣∣∣∣x=ϕ-1

r (y). (5.23)

2A partition of a set A is a set of subsets of A such that the intersection of any two of the subsetsis empty, while the union of all the subsets equals A .

Page 193: Mapping of Probabilities

5.3 APPENDICES FOR PROBABILITY THEORY 183

Figure 5.3: When considering a mapping be-tween two manifolds with same dimension,one can partition the reciprocal image A =ϕ-1[ B ] of a set B into sets A1, A2, . . . suchthe mapping between each of the Ar and Bis a bijection.

B

A1 A2 A3 A4 A5

If the manifolds are, in fact, metric manifolds, with metric tensors respectively de-noted γ(x) and G(y) , then we can write this as

g(y) =1√

det G(y) ∑r

f (x)√det( Y(x)γ-1(x) Y(x)t )

∣∣∣∣∣x=ϕ-1

r (y)

. (5.24)

Case p ≥ q :

We contemplate here a mapping ϕ from a p-dimensional manifold Mp , endowedwith some coordinates xα = x1, . . . , xp , into a q-dimensional manifold Mq ,endowed with some coordinates yi = y1, . . . , yq . The dimension of the depar-ture manifold is larger than the dimension of the arrival manifold, so we are in thesituation suggested at the top of figure 5.2, and the image of a probability densityf (x1, . . . , xp) is a bona fide probability density g(y1, . . . , yq) .

The simplest way to address this problem consists in introducing a “slack man-ifold” Mp−q , whose dimension is p − q , and whose coordinates may be denotedyI = yq+1, . . . , yp . We can complete the q functions yi = ϕi(x1, . . . , xp) by someother (well chosen, but arbitrary) p− q functions yI = ϕI(x1, . . . , xp) , in order tohave p functions depending on p variables:

initial functions :

y1 = ϕ1(x1, . . . , xp)· · · = · · ·yq = ϕq(x1, . . . , xp)

arbitrary functions :

yq+1 = ϕq+1(x1, . . . , xp)· · · = · · ·yp = ϕp(x1, . . . , xp) .

(5.25)

So we have now a mapping from the p-dimensional manifold Mp into the p-dimen-sional manifold Mq ×Mp−q . Passing from the p-dimensional probability densityfunction f (x) ≡ f (x1, . . . , xp) to its image, the p-dimensional probability density

Page 194: Mapping of Probabilities

184 Appendices

function gp(yq, yp−q) ≡ gp(y1, . . . , yq; yq+1, . . . , yp) is a problem similar to the prob-lem of a change of variables, excepted that the (augmented) mapping ϕ may notbe a bijection. In this case, we do as we did for the case p = q , i.e., we partitionthe departure manifold Mp in as many subsets Ar = A1, A2, . . . as it may benecessary, so for all the subsets, the mapping ϕr from subset Ar to Mq ×Mp−q isa bijection. On Mq ×Mp−q we then have the probability density function (this issimilar to equation 5.21)

gp(yq, yp−q) = ∑r

f (ϕ-1r (yq, yp−q) )

Y(ϕ-1r (yq, yp−q) )

. (5.26)

Then we obtain the desired probability density function by marginalization:

g(y1, . . . , yq) =∫

dyq+1 ∧ · · · ∧ dyp gp(y1, . . . , yq, yq+1, . . . , yp) . (5.27)

(Note: I must demonstrate here that this marginal is independent from the arbitraryp− q functions chosen above.)

Putting all this together gives

g(y1, . . . , yq) =∫

dyq+1 ∧ · · · ∧ dyp∑r

f (ϕ-1r (y1, . . . , yq; yq+1, . . . , yp) )

Y(ϕ-1r (y1, . . . , yq; yq+1, . . . , yp) )

.

(5.28)There is another way3 for addressing this problem, but it will not be used in the

examples that we shall examine.

Example 5.1 Two quantities $X, Y$ can take any real positive value, and associated to them is the (two-dimensional lognormal) probability density function

\[
f(X,Y) \;=\; \frac{1}{2\pi\,\sigma^2}\,\frac{1}{X\,Y}\,
\exp\!\left(-\,\frac{1}{2}\,\frac{\log(X/X_0)^2+\log(Y/Y_0)^2}{\sigma^2}\right) .
\tag{5.29}
\]

³ The $p$ variables $x^\alpha$ can be separated into two groups, $\{x^1,\dots,x^q;x^{q+1},\dots,x^p\}\equiv\{u^1,\dots,u^q;v^1,\dots,v^{p-q}\}$, so we can write the original mapping as $y^1=\varphi^1(u^1,\dots,u^q;v^1,\dots,v^{p-q})$, $\cdots=\cdots$, $y^q=\varphi^q(u^1,\dots,u^q;v^1,\dots,v^{p-q})$. If we can solve these $q$ equations, we then obtain the $q$ variables $u^1=\psi^1(y^1,\dots,y^q;v^1,\dots,v^{p-q})$, $\cdots=\cdots$, $u^q=\psi^q(y^1,\dots,y^q;v^1,\dots,v^{p-q})$. If the system is not invertible, we partition the space into as many subsets as necessary, and solve the system in each subspace (as suggested above). Now, considering for a moment the $p-q$ variables $\{v^1,\dots,v^{p-q}\}$ as fixed parameters, we can pass from the original probability density, which can be written $f(u^1,\dots,u^q;v^1,\dots,v^{p-q})$, to the probability density $g_p(y^1,\dots,y^q;v^1,\dots,v^{p-q})$, as this is like a change of variables. Then, we obtain the desired probability density by marginalization: $g_q(y^1,\dots,y^q)=\int dv^1\wedge\cdots\wedge dv^{p-q}\; g_p(y^1,\dots,y^q,v^1,\dots,v^{p-q})$. I leave as an exercise to the reader to demonstrate that this marginal is independent from the arbitrary choice of variables $\{u^1,\dots,u^q\}$ and $\{v^1,\dots,v^{p-q}\}$. Note that while in this approach the change of variables concerns $q$ variables, in the previously suggested approach the change of variables concerns $p$ variables (and we have $p\ge q$).


Which is the probability density for the real positive quantity $U = X/Y$? To use the first method described above, we can introduce another real positive quantity, $V = X\,Y$. Using the Jacobian rule for the change of variables, one obtains

\[
g_2(U,V) \;=\; \frac{1}{2\pi\,\Sigma^2}\,\frac{1}{U\,V}\,
\exp\!\left(-\,\frac{1}{2}\,\frac{\log(U/U_0)^2+\log(V/V_0)^2}{\Sigma^2}\right) ,
\tag{5.30}
\]

with
\[
U_0 = X_0/Y_0 \;;\qquad V_0 = X_0\,Y_0 \;;\qquad \Sigma = \sqrt{2}\,\sigma ,
\tag{5.31}
\]

and computing the marginal $g(U) = \int_0^\infty dV\, g_2(U,V)$ gives the result we were searching for:

\[
g(U) \;=\; \frac{1}{\sqrt{2\pi}\,\Sigma}\,\frac{1}{U}\,
\exp\!\left(-\,\frac{1}{2}\,\frac{\log(U/U_0)^2}{\Sigma^2}\right) .
\tag{5.32}
\]

If, instead of choosing $V = X\,Y$, we had chosen $V = X$, we would have arrived at the probability density

\[
g_2(U,V) \;=\; \frac{1}{2\pi\,\sigma^2}\,\frac{1}{U\,V}\,
\exp\!\left(-\,\frac{1}{2}
\begin{pmatrix}\log(U/U_0)\\ \log(V/V_0)\end{pmatrix}^{\!t}
\mathbf C^{-1}
\begin{pmatrix}\log(U/U_0)\\ \log(V/V_0)\end{pmatrix}
\right) ,
\tag{5.33}
\]

with

\[
U_0 = X_0/Y_0 \;;\qquad V_0 = X_0 \;;\qquad
\mathbf C = \sigma^2\begin{pmatrix}2&1\\ 1&1\end{pmatrix} ,
\tag{5.34}
\]

and, when computing the marginal $g(U)$, to the same result as above (as we should). Finally, if we choose the alternative method suggested above, of the two variables in $f(X,Y)$ we keep, for instance, $Y$, and replace $X$ by $U = X/Y$. This leads to

\[
g_2(U,Y) \;=\; \frac{1}{2\pi\,\sigma^2}\,\frac{1}{U\,Y}\,
\exp\!\left(-\,\frac{1}{2}
\begin{pmatrix}\log(U/U_0)\\ \log(Y/Y_0)\end{pmatrix}^{\!t}
\mathbf C^{-1}
\begin{pmatrix}\log(U/U_0)\\ \log(Y/Y_0)\end{pmatrix}
\right) ,
\tag{5.35}
\]

with

\[
U_0 = X_0/Y_0 \;;\qquad
\mathbf C = \sigma^2\begin{pmatrix}2&-1\\ -1&1\end{pmatrix} ,
\tag{5.36}
\]

and, when computing the marginal $g(U)$, again to the same result as above. We could equally have kept $X$ and replaced $Y$ by $U = X/Y$.

Note: there are other multiplicities that will be discussed here. For the time being, have a look at figure 5.4.


Figure 5.4: Consider that we have a mapping from the Euclidean plane, with polar coordinates $\mathbf r = \{\rho,\varphi\}$, into a one-dimensional space with a metric coordinate $s$ (in this illustration, $s = s(\rho,\varphi) = \sin\rho/\rho$). When transporting a probability from the plane into the 'vertical axis', for a given value of $s = s_0$ we have, first, to obtain the set of discrete values $\rho_n$ giving the same $s_0$, and, for each of these values, we have to perform the integration for $-\pi<\varphi\le+\pi$ corresponding to that indicated in equation XXX.
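As a numerical check of example 5.1, the short script below (a sketch in Python; the values of $X_0$, $Y_0$ and $\sigma$ are arbitrary choices made for the test) draws samples of the pair $(X,Y)$ from the lognormal probability density 5.29, forms $U = X/Y$, and compares a histogram of the samples with the analytic marginal $g(U)$ of equation 5.32:

import numpy as np

rng = np.random.default_rng(0)
X0, Y0, sigma = 2.0, 5.0, 0.3            # arbitrary values for the check
U0, Sigma = X0 / Y0, np.sqrt(2.0) * sigma

# Sampling f(X,Y) of eq. 5.29: log(X/X0) and log(Y/Y0) are independent N(0, sigma^2)
N = 1_000_000
X = X0 * np.exp(sigma * rng.standard_normal(N))
Y = Y0 * np.exp(sigma * rng.standard_normal(N))
U = X / Y

# Analytic marginal g(U) of eq. 5.32 (a lognormal centered at U0, with 'Sigma')
def g(u):
    return np.exp(-0.5 * (np.log(u / U0) / Sigma) ** 2) / (np.sqrt(2.0 * np.pi) * Sigma * u)

# Compare a density-normalized histogram of the samples with g(U) at a few points
edges = np.linspace(0.2, 0.8, 31)
counts, _ = np.histogram(U, bins=edges)
hist = counts / (N * np.diff(edges))
centers = 0.5 * (edges[:-1] + edges[1:])
for c, h in zip(centers[::6], hist[::6]):
    print(f"U = {c:5.3f}   histogram = {h:6.3f}   g(U) = {g(c):6.3f}")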

Case p ≤ q :

As explained above, there is not much to say when working with probability densities. (Note: check that I really need the developments here below.) So let us assume that both the departure manifold $\mathcal M_p$ and the arrival manifold $\mathcal M_q$ are metric. Although our aim is to arrive at coordinate-free expressions, it is much easier to do the demonstrations using (arbitrary) coordinates. Let then $\mathbf x\equiv x^\alpha = \{x^1,\dots,x^p\}$ be a coordinate system over⁴ $\mathcal M_p$, and let $\mathbf y\equiv y^i = \{y^1,\dots,y^q\}$ be a coordinate system over⁵ $\mathcal M_q$. We shall denote $g_{ij}$ the components of the metric in $\mathcal M_q$.

At some point on $\mathcal M_p$, consider the "infinitesimal vector" $dx^\alpha$. It is mapped into the vector

\[
dy^i \;=\; \Phi^i{}_\alpha\, dx^\alpha ,
\tag{5.37}
\]

where $\Phi^i{}_\alpha = \partial\varphi^i/\partial x^\alpha$. This vector belongs to a $p$-dimensional submanifold of $\mathcal M_q$: the image of $\mathcal M_p$ through the mapping $\varphi$. The squared length of this vector is

\[
ds^2 \;=\; g_{ij}\, dy^i\, dy^j \;=\; g_{ij}\, \Phi^i{}_\alpha\, \Phi^j{}_\beta\, dx^\alpha\, dx^\beta ,
\tag{5.38}
\]

i.e.,
\[
ds^2 \;=\; G_{\alpha\beta}\, dx^\alpha\, dx^\beta ,
\tag{5.39}
\]

where
\[
G_{\alpha\beta} \;=\; \Phi^i{}_\alpha\, g_{ij}\, \Phi^j{}_\beta ,
\tag{5.40}
\]

or, using a matrix notation,
\[
\mathbf G \;=\; \boldsymbol\Phi^t\, \mathbf g\, \boldsymbol\Phi .
\tag{5.41}
\]

⁴ Or, at least, over a vicinity of the point where the reasoning is made.
⁵ Or, at least, over a vicinity of the image point.


This equation can be interpreted as follows. The coordinates $x^\alpha$ of $\mathcal M_p$ define a coordinate system over $\varphi[\mathcal M_p]\subseteq\mathcal M_q$. The metric $\mathbf g$ of $\mathcal M_q$ induces a metric over this submanifold, and the components of the metric so induced are, in the coordinates $x^\alpha$, the $G_{\alpha\beta}$ just expressed.

Let now $\gamma_{\alpha\beta}$ be the metric of $\mathcal M_p$ (in the coordinates $x^\alpha$). To the original vector $dx^\alpha$ was associated the capacity element $d\underline\omega = \epsilon_{\alpha_1\dots\alpha_p}\, dx^{\alpha_1}\cdots dx^{\alpha_p}$, and the volume element $d\omega = \overline\gamma\; d\underline\omega = \sqrt{\det\boldsymbol\gamma}\; d\underline\omega$. To the image vector $dy^i$ is associated a capacity element that has the same expression (we are also using the coordinates $x^\alpha$ on $\varphi[\mathcal M_p]$), but the volume element is $dv = \overline G\; d\underline\omega = \sqrt{\det(\boldsymbol\Phi^t\,\mathbf g\,\boldsymbol\Phi)}\; d\underline\omega$. So, the image of the original volume element can be expressed in terms of the original volume element as

\[
dv \;=\; \frac{\overline G}{\overline\gamma}\; d\omega
\;=\; \frac{\sqrt{\det(\boldsymbol\Phi^t\,\mathbf g\,\boldsymbol\Phi)}}{\sqrt{\det\boldsymbol\gamma}}\; d\omega .
\tag{5.42}
\]

The ratio is a ratio of two densities (defined using the same coordinates), so it is an invariant (independent of the coordinates), and we can write this equation using a notation that makes this invariance more apparent:

\[
dv \;=\; \frac{\sqrt{\det\big(\,\boldsymbol\Phi^t(P)\; \mathbf g(\varphi(P))\; \boldsymbol\Phi(P)\,\big)}}{\sqrt{\det\boldsymbol\gamma(P)}}\; d\omega .
\tag{5.43}
\]

Here, $\mathbf g$ and $\boldsymbol\gamma$ have to be understood as tensors, $\boldsymbol\Phi$ as an abstract (tangent) operator, and $\boldsymbol\Phi^t$ as the abstract transpose of this linear operator. (Note: I have to give somewhere the definition of the transpose of a linear operator.)

Consider now a probability function $P$ defined over $\mathcal M_p$, represented by the volumetric probability $f(P)$. To the volume $d\omega$ (in the departure manifold) the probability function $P$ associates the probability value

\[
dP \;=\; f(P)\; d\omega .
\tag{5.44}
\]

Denoting
\[
g \;=\; \varphi[f]
\tag{5.45}
\]

the volumetric probability that represents the image $Q = \varphi[P]$ of $P$, the probability of the image of the volume is

\[
dQ \;=\; g(\varphi(P))\; dv .
\tag{5.46}
\]

By definition of the image of a probability function (section 3.2.2), one must have $dQ = dP$, so one has

\[
g(\varphi(P))\; dv \;=\; f(P)\; d\omega .
\tag{5.47}
\]

Equation 5.43 then gives

\[
g(\varphi(P)) \;=\; \frac{\sqrt{\det\boldsymbol\gamma(P)}}{\sqrt{\det\big(\,\boldsymbol\Phi^t(P)\; \mathbf g(\varphi(P))\; \boldsymbol\Phi(P)\,\big)}}\; f(P) .
\tag{5.48}
\]


The volumetric probability $g$ is not defined at an arbitrary point $Q\in\mathcal M_q$: the point must be of the form $Q = \varphi(P)$, i.e., $Q$ must belong to the submanifold $\varphi[\mathcal M_p]$. For this reason, it is better to have written the equation above at a point $\varphi(P)$, without attempting to give an expression for $g(Q)$. The volumetric probability $g$ must be integrated, over the submanifold $\varphi[\mathcal M_p]$ of $\mathcal M_q$, using the volume element that the metric of $\mathcal M_q$ induces over $\varphi[\mathcal M_p]$. Of course, there may be more than one point $P$ that maps into the same $\varphi(P)$, but when dealing with the submanifold $\varphi[\mathcal M_p]$ the problem can easily be treated (remember that the best coordinates to use over $\varphi[\mathcal M_p]\subseteq\mathcal M_q$ are the coordinates of $\mathcal M_p$; in these coordinates, all possible "crossings" of the submanifold can be ignored, as suggested in figure 5.5).

Figure 5.5: A mapping from a one-dimensional manifold into a two-dimensional manifold can be represented as a "trajectory". Here, a coordinate $t$ has been chosen in the departure manifold, and two coordinates $\{y^1,y^2\}$ over the arrival manifold. The mapping here represented is $y^1 = \cos t + t/6$, $y^2 = \sin t$. In some problems, it may be natural to use the coordinate $t$ over the image of the departure manifold, as suggested in this figure. For the evaluation of probabilities, one then integrates over the variable $t$, and the possible "crossings" of the image manifold can be ignored.

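As an illustration of equation 5.48 (with $p=1$, $q=2$), the sketch below (Python; it assumes a flat Euclidean metric on the arrival plane and the coordinate $t$ with metric $\gamma=1$ on the departure line, and uses an arbitrary Gaussian $f(t)$) transports a volumetric probability onto the curve of figure 5.5 and checks that the image volumetric probability $g$ integrates to one with respect to the arclength element of the curve:

import numpy as np

# Mapping of figure 5.5 from the line (coordinate t, metric gamma = 1)
# into the Euclidean plane:  y1 = cos t + t/6 ,  y2 = sin t .
def speed(t):
    # sqrt(det(Phi^t g Phi)) for p = 1, q = 2 and g = identity:
    # the Euclidean norm of the tangent vector dphi/dt
    return np.hypot(-np.sin(t) + 1.0 / 6.0, np.cos(t))

# A volumetric probability on the departure manifold
# (1D Gaussian, eq. 5.145, with D(t, t0) = |t - t0|)
t0, s = 2.0 * np.pi, 1.5
def f(t):
    return np.exp(-0.5 * ((t - t0) / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

# Image volumetric probability on the curve (eq. 5.48): g(phi(t)) = f(t) / speed(t)
t = np.linspace(t0 - 8.0 * s, t0 + 8.0 * s, 200_001)
dt = t[1] - t[0]
g = f(t) / speed(t)

# The curve is parameterized by t, so integrals over the image submanifold
# use the induced volume (arclength) element  dv = speed(t) dt :
print("integral of g over the image:", np.sum(g * speed(t)) * dt)   # ~ 1.0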


5.3.2 Proof of the Compatibility Property (Discrete Sets)

Note: this demonstration is not yet general, as it is only valid for the special case where the sets are discrete.

Let $A_0$ and $B_0$ be two discrete sets. We wish to satisfy the condition that for any mapping $\varphi$ from $A_0$ into $B_0$, for any two probabilities $P$ (over $A_0$) and $Q$ (over $B_0$), and for any element $b\in B_0$,

\[
\big(\,\varphi[\, P\cap\varphi^{-1}[Q]\,]\,\big)[\,b\,] \;=\; \big(\,\varphi[P]\cap Q\,\big)[\,b\,] .
\tag{5.49}
\]

Using equation 3.53 (image of a probability), we can write the left-hand side of this expression as

\[
\big(\,\varphi[\, P\cap\varphi^{-1}[Q]\,]\,\big)[\,b\,]
\;=\; \sum_{\substack{\text{all } a \text{ such that}\\ \varphi(a)=b}}
\big(\, P\cap\varphi^{-1}[Q]\,\big)[a] ,
\tag{5.50}
\]

and, using equation 3.61 (intersection of probabilities),

\[
\big(\,\varphi[\, P\cap\varphi^{-1}[Q]\,]\,\big)[\,b\,]
\;=\; \frac{1}{\nu}\sum_{\substack{\text{all } a \text{ such that}\\ \varphi(a)=b}}
P[a]\;\big(\varphi^{-1}[Q]\big)[a] ,
\tag{5.51}
\]

where $\nu$ is a normalization constant.

Let us now take the right-hand side. Using equation 3.61 (intersection of probabilities) gives

\[
\big(\,\varphi[P]\cap Q\,\big)[\,b\,] \;=\; \frac{1}{\nu'}\;\big(\varphi[P]\big)[\,b\,]\; Q[\,b\,] ,
\tag{5.52}
\]

where $\nu'$ is a normalization constant. Using equation 3.53 (image of a probability) gives

\[
\big(\,\varphi[P]\cap Q\,\big)[\,b\,] \;=\; \frac{1}{\nu'}
\Bigg(\sum_{\substack{\text{all } a \text{ such that}\\ \varphi(a)=b}} P[a]\Bigg)\; Q[\,b\,] ,
\tag{5.53}
\]

or, equivalently,

\[
\big(\,\varphi[P]\cap Q\,\big)[\,b\,] \;=\; \frac{1}{\nu'}
\sum_{\substack{\text{all } a \text{ such that}\\ \varphi(a)=b}} P[a]\; Q[\,\varphi(a)\,] .
\tag{5.54}
\]


The two expressions at the right-hand side of equations 5.51 and 5.54 can be made identical (for any probability $P$ and any element $b$) if, and only if, one defines the reciprocal image of a probability via

\[
\big(\varphi^{-1}[Q]\big)[a] \;=\; \frac{1}{\nu''}\; Q[\,\varphi(a)\,] ,
\tag{5.55}
\]

where $\nu''$ is a normalization constant (this is exactly the definition used in equation 3.65). With this definition, expression 5.49 holds (for any mapping $\varphi$, for any two probabilities $P$ and $Q$, and for any element $b$).
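The defining formulas used in this proof (image, intersection, and reciprocal image of a probability over a discrete set) are simple enough to be checked numerically. The sketch below (Python; the sets, the mapping and the probabilities are arbitrary small examples) verifies relation 5.49 elementwise:

import numpy as np

rng = np.random.default_rng(1)

# A small discrete example: A0 has 6 elements, B0 has 3 elements
nA, nB = 6, 3
phi = np.array([0, 0, 1, 2, 2, 2])        # the mapping a -> phi(a)

def normalize(p):
    return p / p.sum()

def image(P):                              # eq. 3.53: sum P over the preimage of each b
    return np.array([P[phi == b].sum() for b in range(nB)])

def intersection(P1, P2):                  # eq. 3.61: normalized pointwise product
    return normalize(P1 * P2)

def reciprocal_image(Q):                   # eq. 5.55 (i.e., eq. 3.65)
    return normalize(Q[phi])

P = normalize(rng.random(nA))              # a probability over A0
Q = normalize(rng.random(nB))              # a probability over B0

lhs = image(intersection(P, reciprocal_image(Q)))
rhs = intersection(image(P), Q)
print(np.allclose(normalize(lhs), normalize(rhs)))   # True: eq. 5.49 holds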


5.3.3 Proof of the Compatibility Property (Manifolds)

Note: this demonstration is only valid for the special case where the sets are manifolds.

Note: for the time being I attempt to demonstrate the property only in the case where the dimension of the departure manifold equals the dimension of the arrival manifold. In that case, the image of a volumetric probability is given by (see equation 5.23):

\[
g(\mathbf y) \;=\; \frac{1}{v(\mathbf y)}\sum_r
\left.\frac{f(\mathbf x)\,\omega(\mathbf x)}{Y(\mathbf x)}\right|_{\mathbf x=\varphi_r^{-1}(\mathbf y)} .
\tag{5.56}
\]

I must also use the expression for the intersection of two volumetric probability functions:

\[
(h\cap h')(\mathbf z) \;=\; \frac{1}{\nu}\; h(\mathbf z)\; h'(\mathbf z) .
\tag{5.57}
\]

Let $\mathcal M_p$ and $\mathcal M_q$ be two manifolds that, for the time being, are assumed to have the same dimension. We wish to satisfy the condition that for any mapping $\varphi$ from $\mathcal M_p$ into $\mathcal M_q$, for any two volumetric probabilities $f$ (over $\mathcal M_p$) and $g$ (over $\mathcal M_q$), and at any point $\mathbf y\in\mathcal M_q$,

\[
\big(\,\varphi[\, f\cap\varphi^{-1}[g]\,]\,\big)(\mathbf y) \;=\; \big(\,\varphi[f]\cap g\,\big)(\mathbf y) .
\tag{5.58}
\]

Using equation 5.56 (image of a probability), we can write the left-hand side of this expression as

\[
\big(\,\varphi[\, f\cap\varphi^{-1}[g]\,]\,\big)(\mathbf y)
\;=\; \frac{1}{v(\mathbf y)}\sum_r
\left.\frac{\big(f\cap\varphi^{-1}[g]\big)(\mathbf x)\,\omega(\mathbf x)}{Y(\mathbf x)}\right|_{\mathbf x=\varphi_r^{-1}(\mathbf y)} ,
\tag{5.59}
\]

and, using equation 5.57 (intersection of probabilities),

\[
\big(\,\varphi[\, f\cap\varphi^{-1}[g]\,]\,\big)(\mathbf y)
\;=\; \frac{1}{\nu}\,\frac{1}{v(\mathbf y)}\sum_r
\left.\frac{f(\mathbf x)\;\varphi^{-1}[g](\mathbf x)\,\omega(\mathbf x)}{Y(\mathbf x)}\right|_{\mathbf x=\varphi_r^{-1}(\mathbf y)} ,
\tag{5.60}
\]

where $\nu$ is a normalization constant.

Let us now take the right-hand side. Using equation 5.57 (intersection of probabilities) gives

\[
\big(\,\varphi[f]\cap g\,\big)(\mathbf y) \;=\; \frac{1}{\nu'}\;\big(\varphi[f]\big)(\mathbf y)\; g(\mathbf y) ,
\tag{5.61}
\]

where $\nu'$ is a normalization constant. Using equation 5.56 (image of a probability) gives

\[
\big(\,\varphi[f]\cap g\,\big)(\mathbf y) \;=\; \frac{1}{\nu'}
\left(\frac{1}{v(\mathbf y)}\sum_r
\left.\frac{f(\mathbf x)\,\omega(\mathbf x)}{Y(\mathbf x)}\right|_{\mathbf x=\varphi_r^{-1}(\mathbf y)}\right) g(\mathbf y) ,
\tag{5.62}
\]

or, equivalently (as $\varphi_r(\mathbf x)=\varphi(\mathbf x)$),

\[
\big(\,\varphi[f]\cap g\,\big)(\mathbf y) \;=\; \frac{1}{\nu'}\,\frac{1}{v(\mathbf y)}\sum_r
\left.\frac{f(\mathbf x)\; g(\varphi(\mathbf x))\,\omega(\mathbf x)}{Y(\mathbf x)}\right|_{\mathbf x=\varphi_r^{-1}(\mathbf y)} .
\tag{5.63}
\]


The two expressions at the right-hand side of equations 5.60 and 5.63 can be made identical (for any volumetric probability $f$ and any point $\mathbf y$) if, and only if, one defines the reciprocal image of a volumetric probability via

\[
\big(\varphi^{-1}[g]\big)(\mathbf x) \;=\; \frac{1}{\nu''}\; g(\varphi(\mathbf x)) ,
\tag{5.64}
\]

where $\nu''$ is a normalization constant (note: this is exactly the definition I was guessing). With this definition, expression 5.58 holds (for any mapping $\varphi$, for any two volumetric probabilities $f$ and $g$, and for any point $\mathbf y$).


5.3.4 Axioms for the Union and the Intersection

5.3.4.1 The Union

I guess that the two defining axioms for the union of two probabilities are

\[
P(\mathcal D)=0 \;\text{ AND }\; Q(\mathcal D)=0 \;\;\Longrightarrow\;\; (P\cup Q)(\mathcal D)=0
\tag{5.65}
\]

and
\[
P(\mathcal D)\neq 0 \;\text{ OR }\; Q(\mathcal D)\neq 0 \;\;\Longrightarrow\;\; (P\cup Q)(\mathcal D)\neq 0 .
\tag{5.66}
\]

But the last property is equivalent to its contrapositive,

\[
P(\mathcal D)=0 \;\text{ AND }\; Q(\mathcal D)=0 \;\;\Longleftarrow\;\; (P\cup Q)(\mathcal D)=0 ,
\tag{5.67}
\]

and this can be combined with the first property, to give the single axiom

\[
P(\mathcal D)=0 \;\text{ AND }\; Q(\mathcal D)=0 \;\;\Longleftrightarrow\;\; (P\cup Q)(\mathcal D)=0 .
\tag{5.68}
\]

5.3.4.2 The Intersection

We only have the axiom

\[
P(\mathcal D)=0 \;\text{ OR }\; Q(\mathcal D)=0 \;\;\Longrightarrow\;\; (P\cap Q)(\mathcal D)=0 ,
\tag{5.69}
\]

and, of course, its (equivalent) contrapositive

\[
P(\mathcal D)\neq 0 \;\text{ AND }\; Q(\mathcal D)\neq 0 \;\;\Longleftarrow\;\; (P\cap Q)(\mathcal D)\neq 0 .
\tag{5.70}
\]


5.3.5 Union of Probabilities

Let $P$, $Q$, $\dots$ be elements of the space of all possible probability distributions (normalized or not) over $\mathcal M$. An internal operation $\{P,Q\}\mapsto P\cup Q$ of the space is called a union if the following conditions are satisfied:

Condition 5.1 (commutativity) for any $\mathcal D\subset\mathcal M$,
\[
\big(P\cup Q\big)(\mathcal D) \;=\; \big(Q\cup P\big)(\mathcal D) ;
\tag{5.71}
\]

Condition 5.2 (associativity) for any $\mathcal D\subset\mathcal M$,
\[
\big(\,(P\cup Q)\cup R\,\big)(\mathcal D) \;=\; \big(\,P\cup(Q\cup R)\,\big)(\mathcal D) ;
\tag{5.72}
\]

Condition 5.3 for any $\mathcal D\subset\mathcal M$,
\[
P(\mathcal D)=0 \;\text{ AND }\; Q(\mathcal D)=0 \;\;\Longrightarrow\;\; (P\cup Q)(\mathcal D)=0 ;
\tag{5.73}
\]

Condition 5.4 if there is some $\mathcal D\subset\mathcal M$ for which $P(\mathcal D)=0$, then, necessarily, for any probability $Q$,
\[
(P\cup Q)(\mathcal D) \;=\; Q(\mathcal D) .
\tag{5.74}
\]

There are explicitly defined operations that satisfy these conditions, as the following two examples illustrate.

Example 5.2 If a probability distribution $P$ is represented by the volumetric probability $p(P)$, and a probability distribution $Q$ is represented by the volumetric probability $q(P)$, then taking for $P\cup Q$ the probability distribution represented by the volumetric probability denoted $(p\cup q)(P)$ and defined by

\[
\big(p\cup q\big)(P) \;=\; p(P) + q(P)
\tag{5.75}
\]

defines, as it is easy to verify, a union operation. It is not assumed here that any of the probability distributions is normalized to one.

Example 5.3 An alternative solution would be what is used in fuzzy set theory to define the union of fuzzy sets. Translated to the language of volumetric probabilities, this would correspond to

\[
\big(p\cup q\big)(P) \;=\; \max\big(\,p(P)\,,\,q(P)\,\big) .
\tag{5.76}
\]


5.3.5.1 Old Text (To Check!)

With these particular choices, in addition to all the conditions set above, one has a supplementary property:

Property 5.1 The intersection is distributive with respect to the union, i.e., for any probability distributions $P$, $Q$, and $R$,

\[
P\cap\big(Q\cup R\big) \;=\; \big(P\cap Q\big)\cup\big(P\cap R\big) .
\tag{5.77}
\]

One important property of the two operations 'sum' and 'product' just introduced is that of invariance with respect to a change of variables: our definitions are independent of any possible choice of coordinates over $\mathcal M$. The reader must understand that equations like ?? and ?? are only valid because they are expressed in terms of volumetric probabilities: it would be a mistake to use them as they are, but with the volumetric probabilities replaced by the more common probability densities. Let us see this, for instance, with equation ??.


5.3.6 Conditional Volumetric Probability (I)

As in section 5.2.2, consider an $n$-dimensional manifold $\mathcal M_n$, with some coordinates $\mathbf x = \{x^1,\dots,x^n\}$, and a metric tensor $\mathbf g(\mathbf x) = \{g_{ij}(\mathbf x)\}$. The $n$-dimensional volume element is, then, $dV(\mathbf x) = \sqrt{\det\mathbf g(\mathbf x)}\;dx^1\wedge\cdots\wedge dx^n$.

In section 5.2.2, the $n$ coordinates $\mathbf x = \{x^1,\dots,x^n\}$ of $\mathcal M$ have been separated into one group of $p$ coordinates $\mathbf r = \{r^1,\dots,r^p\}$ and one group of $q$ coordinates $\mathbf s = \{s^1,\dots,s^q\}$, with $p+q=n$, and a $p$-dimensional submanifold $\mathcal M_p$ of the $n$-dimensional manifold $\mathcal M$ (with $p\le n$) has been introduced via the constraint

\[
\mathbf s \;=\; \mathbf s(\mathbf r) .
\tag{5.78}
\]

Consider a probability distribution $P$ over $\mathcal M_n$, represented by the volumetric probability $f(\mathbf x) = f(\mathbf r,\mathbf s)$. We wish to define (and to characterize) the 'conditional volumetric probability' induced over the submanifold by the volumetric probability $f(\mathbf x) = f(\mathbf r,\mathbf s)$.

Given the $p$-dimensional submanifold $\mathcal M_p$ of the $n$-dimensional manifold $\mathcal M_n$, one can define a set $\mathcal B(\Delta s)$ as the set of all points whose distance to the submanifold $\mathcal M_p$ is less than or equal to $\Delta s$. For any finite value of $\Delta s$, Kolmogorov's definition of conditional probability applies, and the conditional probability so defined associates, to any $\mathcal D\subset\mathcal M_n$, the probability ??. Except for a normalization factor, this conditional probability equals the original one, except that all the domain whose points are at a distance larger than $\Delta s$ has been 'trimmed away'. This is still a probability distribution over $\mathcal M_n$. In the limit $\Delta s\to 0$ this shall define a probability distribution over the submanifold $\mathcal M_p$ that we are about to characterize.

Consider a volume element $dv_p$ over the submanifold $\mathcal M_p$, and all the points of $\mathcal M_n$ that are at a distance smaller than or equal to $\Delta s$ from the points inside the volume element. For small enough $\Delta s$ the $n$-dimensional volume $\Delta v_n$ so defined is

\[
\Delta v_n \;\approx\; dv_p\; \Delta\omega_q ,
\tag{5.79}
\]

where $\Delta\omega_q$ is the volume of the $q$-dimensional sphere of radius $\Delta s$ that is orthogonal to the submanifold at the considered point. This volume is proportional to $(\Delta s)^q$, so we have

\[
\Delta v_n \;\approx\; k\; dv_p\; (\Delta s)^q ,
\tag{5.80}
\]

where $k$ is a numerical factor. The conditional probability associated to this $n$-dimensional domain by formula ?? is, by definition of volumetric probability,

\[
dP_{(p+q)} \;\approx\; k'\; f\; \Delta v_n \;\approx\; k''\; f\; dv_p\; (\Delta s)^q ,
\tag{5.81}
\]

where $k'$ and $k''$ are constants. The conditional probability of the $p$-dimensional volume element $dv_p$ of the submanifold $\mathcal M_p$ is then defined as the limit

\[
dP_p \;=\; \lim_{\Delta s\to 0}\; \frac{dP_{(p+q)}}{(\Delta s)^q} ,
\tag{5.82}
\]


this giving $dP_p = k''\, f\; dv_p$, or, making the variables explicit,

\[
dP_p(\mathbf r) \;=\; k''\; f(\mathbf r,\mathbf s(\mathbf r))\; dv_p(\mathbf r) .
\tag{5.83}
\]

We have thus arrived at a $p$-dimensional volumetric probability over the submanifold $\mathcal M_p$ that is given by

\[
f_p(\mathbf r) \;=\; k''\; f(\mathbf r,\mathbf s(\mathbf r)) ,
\tag{5.84}
\]

where $k''$ is a constant. If the probability is normalizable, and we choose to normalize it to one, then

\[
f_p(\mathbf r) \;=\; \frac{f(\mathbf r,\mathbf s(\mathbf r))}{\displaystyle\int_{\mathbf r\in\mathcal M_p} dv_p(\mathbf r)\; f(\mathbf r,\mathbf s(\mathbf r))} .
\tag{5.85}
\]

With this volumetric probability, the probability of a domain $\mathcal D_p$ of the submanifold is computed as

\[
P(\mathcal D_p) \;=\; \int_{\mathbf r\in\mathcal D_p} dv_p(\mathbf r)\; f_p(\mathbf r) .
\tag{5.86}
\]
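As a small numerical illustration of equations 5.85 and 5.86 (a sketch in Python; the manifold, the metric, the joint probability and the submanifold are arbitrary choices, not taken from the text), consider the Euclidean plane with Cartesian coordinates $(r,s)$, a two-dimensional Gaussian volumetric probability $f(r,s)$, and the submanifold defined by $s = s(r) = r^2$. The conditional volumetric probability $f_p(r)$ is normalized, and integrated, with the induced length element $dv_p(r) = \sqrt{1+s'(r)^2}\;dr$:

import numpy as np

# Joint volumetric probability over the plane (2D Gaussian, arbitrary parameters)
def f(r, s):
    return np.exp(-0.5 * (r**2 + (s - 1.0)**2)) / (2.0 * np.pi)

# Submanifold s = s(r) = r^2, and the induced length element dv_p = sqrt(1 + s'(r)^2) dr
s_of_r = lambda r: r**2
dvp    = lambda r: np.sqrt(1.0 + (2.0 * r)**2)

r = np.linspace(-5.0, 5.0, 200_001)
dr = r[1] - r[0]

# Eq. 5.85: conditional volumetric probability on the submanifold
norm = np.sum(f(r, s_of_r(r)) * dvp(r)) * dr
fp = f(r, s_of_r(r)) / norm

# Eq. 5.86: probability of the domain -1 <= r <= 1 of the submanifold
mask = np.abs(r) <= 1.0
print("P(D_p) =", np.sum(fp[mask] * dvp(r[mask])) * dr)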


5.3.7 Conditional Volumetric Probability (II)

Note to the reader: this section can be skipped, unless one is particularly interested in probability densities.

In view of equation 3.193, the conditional probability density (over the submanifold $\mathcal M_p$) is to be defined as

\[
\overline f_p(\mathbf r) \;=\; \overline g_p(\mathbf r)\; f_p(\mathbf r) ,
\tag{5.87}
\]

i.e.,
\[
\overline f_p(\mathbf r) \;=\; \eta_r\,\sqrt{\det\mathbf g_p(\mathbf r)}\;\; f_p(\mathbf r) ,
\tag{5.88}
\]

so the probability of a domain $\mathcal D_p$ of the submanifold is given by

\[
P(\mathcal D_p) \;=\; \int_{\mathbf r\in\mathcal D_p} d\underline v_p(\mathbf r)\;\; \overline f_p(\mathbf r) ,
\tag{5.89}
\]

where $d\underline v_p(\mathbf r) = dr^1\wedge\cdots\wedge dr^p$.

We must now express $\overline f_p(\mathbf r)$ in terms of $\overline f(\mathbf r,\mathbf s)$. First, from equations 3.188 and 5.88 we obtain

\[
\overline f_p(\mathbf r) \;=\; \eta_r\,\sqrt{\det\mathbf g_p(\mathbf r)}\;\;
\frac{f(\mathbf r,\mathbf s(\mathbf r))}{\displaystyle\int_{\mathbf r\in\mathcal M_p} dv_p(\mathbf r)\; f(\mathbf r,\mathbf s(\mathbf r))} .
\tag{5.90}
\]

As $f(\mathbf r,\mathbf s) = \overline f(\mathbf r,\mathbf s)\,/\,(\eta\sqrt{\det\mathbf g}\,)$ (equation ??),

\[
\overline f_p(\mathbf r) \;=\; \eta_r\,\sqrt{\det\mathbf g_p(\mathbf r)}\;\;
\frac{\overline f(\mathbf r,\mathbf s(\mathbf r))\,/\sqrt{\det\mathbf g}}
{\displaystyle\int_{\mathbf r\in\mathcal M_p} dv_p(\mathbf r)\;\overline f(\mathbf r,\mathbf s(\mathbf r))\,/\sqrt{\det\mathbf g}} .
\tag{5.91}
\]

Finally, using 3.190, and writing $\mathbf g_p(\mathbf r)$ explicitly,

\[
\overline f_p(\mathbf r) \;=\;
\frac{\dfrac{\sqrt{\det(\mathbf g_{rr}+\mathbf g_{rs}\,\mathbf S+\mathbf S^T\mathbf g_{sr}+\mathbf S^T\mathbf g_{ss}\,\mathbf S)}}{\sqrt{\det\mathbf g}}\;\overline f(\mathbf r,\mathbf s(\mathbf r))}
{\displaystyle\int_{\mathbf r\in\mathcal M_p} dr^1\wedge\cdots\wedge dr^p\;
\frac{\sqrt{\det(\mathbf g_{rr}+\mathbf g_{rs}\,\mathbf S+\mathbf S^T\mathbf g_{sr}+\mathbf S^T\mathbf g_{ss}\,\mathbf S)}}{\sqrt{\det\mathbf g}}\;\overline f(\mathbf r,\mathbf s(\mathbf r))} .
\tag{5.92}
\]

Again, it is understood here that all the 'matrices' are taken at the point $(\mathbf r,\mathbf s(\mathbf r))$.

This expression does not coincide with the conditional probability defined in usual texts (even when the manifold is defined by the condition $\mathbf s = \mathbf s_0 = \text{const.}$). This is because we contemplate here the 'metric' or 'orthogonal' limit to the manifold (in the sense of figure ??), while usual texts just consider the 'vertical' limit. Of course, I take this approach here because I think it is essential for consistent applications of the notion of conditional probability. The best known expression of this problem is the so-called 'Borel paradox' that we analyze in section 5.3.10.


Example 5.4 If we face the case where the space $\mathcal M$ is the Cartesian product of two spaces $\mathcal R\times\mathcal S$, with $\mathbf g_{rs} = \mathbf g_{sr} = 0$, $\mathbf g_{rr} = \mathbf g_r(\mathbf r)$ and $\mathbf g_{ss} = \mathbf g_s(\mathbf s)$, then $\det\mathbf g(\mathbf r,\mathbf s) = \det\mathbf g_r(\mathbf r)\,\det\mathbf g_s(\mathbf s)$, and the conditional probability density of equation 5.92 becomes

\[
\overline f_p(\mathbf r) \;=\;
\frac{\dfrac{\sqrt{\det\big(\mathbf g_r(\mathbf r)+\mathbf S^T(\mathbf r)\,\mathbf g_s(\mathbf s(\mathbf r))\,\mathbf S(\mathbf r)\big)}}{\sqrt{\det\mathbf g_r(\mathbf r)}\,\sqrt{\det\mathbf g_s(\mathbf s(\mathbf r))}}\;\overline f(\mathbf r,\mathbf s(\mathbf r))}
{\displaystyle\int_{\mathbf r\in\mathcal M_p} dr^1\wedge\cdots\wedge dr^p\;
\frac{\sqrt{\det\big(\mathbf g_r(\mathbf r)+\mathbf S^T(\mathbf r)\,\mathbf g_s(\mathbf s(\mathbf r))\,\mathbf S(\mathbf r)\big)}}{\sqrt{\det\mathbf g_r(\mathbf r)}\,\sqrt{\det\mathbf g_s(\mathbf s(\mathbf r))}}\;\overline f(\mathbf r,\mathbf s(\mathbf r))} .
\tag{5.93}
\]

Example 5.5 If, in addition to the condition of the previous example, the hypersurface is defined by a constant value of $\mathbf s$, say $\mathbf s = \mathbf s_0$, then the probability density becomes

\[
\overline f_p(\mathbf r) \;=\; \frac{\overline f(\mathbf r,\mathbf s_0)}{\displaystyle\int_{\mathbf r\in\mathcal M_p} dr^1\wedge\cdots\wedge dr^p\;\overline f(\mathbf r,\mathbf s_0)} .
\tag{5.94}
\]

Example 5.6 In the situation of the previous example, let us rewrite equation 5.94 dropping the index $0$ from $\mathbf s_0$, and use the notations

\[
\overline f_{r|s}(\mathbf r|\mathbf s) \;=\; \frac{\overline f(\mathbf r,\mathbf s)}{\overline f_s(\mathbf s)}
\;;\qquad
\overline f_s(\mathbf s) \;=\; \int_{\mathbf r\in\mathcal M_p} dr^1\wedge\cdots\wedge dr^p\;\overline f(\mathbf r,\mathbf s) .
\tag{5.95}
\]

We could redo all the computations to define the conditional for $\mathbf s$, given a fixed value of $\mathbf r$, but it is clear by simple analogy that we obtain, in this case,

\[
\overline f_{s|r}(\mathbf s|\mathbf r) \;=\; \frac{\overline f(\mathbf r,\mathbf s)}{\overline f_r(\mathbf r)}
\;;\qquad
\overline f_r(\mathbf r) \;=\; \int_{\mathbf s\in\mathcal M_q} ds^1\wedge\cdots\wedge ds^q\;\overline f(\mathbf r,\mathbf s) .
\tag{5.96}
\]

Solving these two equations for $\overline f(\mathbf r,\mathbf s)$ gives the 'Bayes theorem'

\[
\overline f_{s|r}(\mathbf s|\mathbf r) \;=\; \frac{\overline f_{r|s}(\mathbf r|\mathbf s)\;\overline f_s(\mathbf s)}{\overline f_r(\mathbf r)} .
\tag{5.97}
\]

Note that this theorem is valid only if we work in the Cartesian product of two spaces. In particular, we must have $\mathbf g_{rr}(\mathbf r,\mathbf s) = \mathbf g_r(\mathbf r)$ and $\mathbf g_{ss}(\mathbf r,\mathbf s) = \mathbf g_s(\mathbf s)$. Working, for instance, at the surface of the sphere with geographical coordinates $(\mathbf r,\mathbf s) = (\varphi,\lambda)$, this condition is not fulfilled, as $g_{\varphi\varphi} = \cos^2\lambda$ is a function of $\lambda$: the surface of the sphere is not the Cartesian product of two 1D spaces. As we shall see later, this enters in the discussion of the so-called 'Borel paradox' (there is no paradox, if we do things properly).


5.3.8 Marginal Probability Density

In the context of section ??, where a manifold $\mathcal M$ is built through the Cartesian product $\mathcal R\times\mathcal S$ of two manifolds, and given a 'joint' volumetric probability $f(\mathbf r,\mathbf s)$, the marginal volumetric probability $f_r(\mathbf r)$ is defined as (see equation ??)

\[
f_r(\mathbf r) \;=\; \int_{\mathbf s\in\mathcal S} dv_s(\mathbf s)\; f(\mathbf r,\mathbf s) .
\tag{5.98}
\]

Let us find the equivalent expression using probability densities instead of volumetric probabilities.

Here below, following our usual conventions, the notations

\[
\overline g(\mathbf r,\mathbf s) = \sqrt{\det\mathbf g(\mathbf r,\mathbf s)}
\;;\qquad
\overline g_r(\mathbf r) = \sqrt{\det\mathbf g_r(\mathbf r)}
\;;\qquad
\overline g_s(\mathbf s) = \sqrt{\det\mathbf g_s(\mathbf s)}
\tag{5.99}
\]

are introduced. First, we may use the relation

\[
f(\mathbf r,\mathbf s) \;=\; \frac{\overline f(\mathbf r,\mathbf s)}{\overline g(\mathbf r,\mathbf s)}
\tag{5.100}
\]

linking the volumetric probability $f(\mathbf r,\mathbf s)$ and the probability density $\overline f(\mathbf r,\mathbf s)$. Here, $\mathbf g$ is the metric of the manifold $\mathcal M$, which has been assumed to have a partitioned form (equation ??). Then, $f(\mathbf r,\mathbf s) = \overline f(\mathbf r,\mathbf s)\,/\,(\,\overline g_r(\mathbf r)\,\overline g_s(\mathbf s)\,)$, and equation 5.98 becomes

\[
f_r(\mathbf r) \;=\; \frac{1}{\overline g_r(\mathbf r)}\int_{\mathbf s\in\mathcal S} dv_s(\mathbf s)\;
\frac{\overline f(\mathbf r,\mathbf s)}{\overline g_s(\mathbf s)} .
\tag{5.101}
\]

As the volume element $dv_s(\mathbf s)$ is related to the capacity element $d\underline v_s(\mathbf s) = ds^1\wedge ds^2\wedge\cdots$ via the relation

\[
dv_s(\mathbf s) \;=\; \overline g_s(\mathbf s)\; d\underline v_s(\mathbf s) ,
\tag{5.102}
\]

we can write
\[
f_r(\mathbf r) \;=\; \frac{1}{\overline g_r(\mathbf r)}\int_{\mathbf s\in\mathcal S} d\underline v_s(\mathbf s)\; \overline f(\mathbf r,\mathbf s) ,
\tag{5.103}
\]

i.e.,
\[
\overline g_r(\mathbf r)\; f_r(\mathbf r) \;=\; \int_{\mathbf s\in\mathcal S} d\underline v_s(\mathbf s)\; \overline f(\mathbf r,\mathbf s) .
\tag{5.104}
\]

We recognize, at the left-hand side, the usual definition of a probability density as the product of a volumetric probability by the volume density, so we can introduce the marginal probability density

\[
\overline f_r(\mathbf r) \;=\; \overline g_r(\mathbf r)\; f_r(\mathbf r) .
\tag{5.105}
\]

Then, equation 5.104 becomes

\[
\overline f_r(\mathbf r) \;=\; \int_{\mathbf s\in\mathcal S} d\underline v_s(\mathbf s)\; \overline f(\mathbf r,\mathbf s) ,
\tag{5.106}
\]


an expression that could be taken as a direct definition of the marginal probability density $\overline f_r(\mathbf r)$ in terms of the 'joint' probability density $\overline f(\mathbf r,\mathbf s)$.

Note that this expression is formally identical to 5.98. This contrasts with the expression of a conditional probability density (equation 5.92), which is formally very different from the expression of a conditional volumetric probability (equation 3.188).


5.3.9 Replacement Gymnastics

In an $n$-dimensional manifold with coordinates $\mathbf x$, the volume element $dv_x(\mathbf x)$ is related to the capacity element $d\underline v_x(\mathbf x) = dx^1\wedge\cdots\wedge dx^n$ via the volume density $\overline g_x(\mathbf x) = \sqrt{\det\mathbf g_x(\mathbf x)}$,

\[
dv_x(\mathbf x) \;=\; \overline g_x(\mathbf x)\; d\underline v_x(\mathbf x) ,
\tag{5.107}
\]

while the relation between a volumetric probability $f_x(\mathbf x)$ and the associated probability density $\overline f_x(\mathbf x)$ is

\[
\overline f_x(\mathbf x) \;=\; \overline g_x(\mathbf x)\; f_x(\mathbf x) .
\tag{5.108}
\]

In a change of variables $\mathbf x\rightleftharpoons\mathbf y$, the capacity element changes according to

\[
d\underline v_x(\mathbf x) \;=\; X(\mathbf y)\; d\underline v_y(\mathbf y) ,
\tag{5.109}
\]

where the Jacobian determinant $X$ is the determinant of the matrix $X^i{}_j = \partial x^i/\partial y^j$, while the probability density changes according to

\[
\overline f_x(\mathbf x) \;=\; \frac{1}{X(\mathbf y)}\;\overline f_y(\mathbf y) .
\tag{5.110}
\]

In the variables $\mathbf y$, the relation between a volumetric probability $f_y(\mathbf y)$ and the associated probability density $\overline f_y(\mathbf y)$ is

\[
\overline f_y(\mathbf y) \;=\; \overline g_y(\mathbf y)\; f_y(\mathbf y) ,
\tag{5.111}
\]

where $\overline g_y(\mathbf y) = \sqrt{\det\mathbf g_y(\mathbf y)}$ is the volume density in the coordinates $\mathbf y$. Finally, the volume element $dv_y(\mathbf y)$ is related to the capacity element $d\underline v_y(\mathbf y) = dy^1\wedge\cdots\wedge dy^n$ through

\[
dv_y(\mathbf y) \;=\; \overline g_y(\mathbf y)\; d\underline v_y(\mathbf y) .
\tag{5.112}
\]

Using these relations in turn, we can obtain the following circle of equivalent equations:

\[
\begin{aligned}
P(\mathcal D) \;=\; \int_{P\in\mathcal D} dv(P)\; f(P)
&\;=\; \int_{\mathbf x\in\mathcal D} dv_x(\mathbf x)\; f_x(\mathbf x)\\
&\;=\; \int_{\mathbf x\in\mathcal D} d\underline v_x(\mathbf x)\; \overline g_x(\mathbf x)\; f_x(\mathbf x)\\
&\;=\; \int_{\mathbf x\in\mathcal D} d\underline v_x(\mathbf x)\; \overline f_x(\mathbf x)\\
&\;=\; \int_{\mathbf y\in\mathcal D} \big(X(\mathbf y)\, d\underline v_y(\mathbf y)\big)\,\Big(\frac{1}{X(\mathbf y)}\,\overline f_y(\mathbf y)\Big)\\
&\;=\; \int_{\mathbf y\in\mathcal D} d\underline v_y(\mathbf y)\; \overline f_y(\mathbf y)\\
&\;=\; \int_{\mathbf y\in\mathcal D} d\underline v_y(\mathbf y)\; \overline g_y(\mathbf y)\; f_y(\mathbf y)\\
&\;=\; \int_{\mathbf y\in\mathcal D} dv_y(\mathbf y)\; f_y(\mathbf y)
\;=\; \int_{P\in\mathcal D} dv(P)\; f(P) \;=\; P(\mathcal D) .
\end{aligned}
\tag{5.113}
\]

Each one of them may be useful in different circumstances. The student should be able to easily move from one equation to the next.

Example 5.7 In the example Cartesian-geographical, the equations above give, respectively (using the index $r$ for the geographical coordinates),

\[
\begin{aligned}
d\underline v_x(x,y,z) &\;=\; dx\wedge dy\wedge dz & (5.114)\\[1mm]
\overline f_x(x,y,z) &\;=\; f_x(x,y,z) & (5.115)\\[1mm]
dx\wedge dy\wedge dz &\;=\; r^2\cos\lambda\;\, dr\wedge d\varphi\wedge d\lambda & (5.116)\\[1mm]
\overline f_x(x,y,z) &\;=\; \frac{1}{r^2\cos\lambda}\;\overline f_r(r,\varphi,\lambda) & (5.117)\\[1mm]
\overline f_r(r,\varphi,\lambda) &\;=\; r^2\cos\lambda\;\, f_r(r,\varphi,\lambda) & (5.118)\\[1mm]
dv_r(r,\varphi,\lambda) &\;=\; r^2\cos\lambda\;\, dr\wedge d\varphi\wedge d\lambda , & (5.119)
\end{aligned}
\]


to obtain the circle of equations

\[
\begin{aligned}
P(\mathcal D) \;=\; \int_{P\in\mathcal D} dv(P)\; f(P)
&\;=\; \int_{x,y,z\in\mathcal D} dv_x(x,y,z)\; f_x(x,y,z)\\
&\;=\; \int_{x,y,z\in\mathcal D} dx\wedge dy\wedge dz\;\; f_x(x,y,z)\\
&\;=\; \int_{x,y,z\in\mathcal D} dx\wedge dy\wedge dz\;\; \overline f_x(x,y,z)\\
&\;=\; \int_{r,\varphi,\lambda\in\mathcal D} \big(r^2\cos\lambda\;\, dr\wedge d\varphi\wedge d\lambda\big)\,\Big(\frac{1}{r^2\cos\lambda}\,\overline f_r(r,\varphi,\lambda)\Big)\\
&\;=\; \int_{r,\varphi,\lambda\in\mathcal D} dr\wedge d\varphi\wedge d\lambda\;\; \overline f_r(r,\varphi,\lambda)\\
&\;=\; \int_{r,\varphi,\lambda\in\mathcal D} dr\wedge d\varphi\wedge d\lambda\;\; r^2\cos\lambda\;\, f_r(r,\varphi,\lambda)\\
&\;=\; \int_{r,\varphi,\lambda\in\mathcal D} dv_r(r,\varphi,\lambda)\; f_r(r,\varphi,\lambda)
\;=\; \int_{P\in\mathcal D} dv(P)\; f(P) \;=\; P(\mathcal D) .
\end{aligned}
\tag{5.120}
\]

Note that the Cartesian system of coordinates is special: scalar densities, scalar capacities and invariant scalars coincide.
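The invariance expressed by the 'circle of equations' 5.120 can be checked numerically. The sketch below (Python; the probability distribution, an isotropic Gaussian, and the domain $\mathcal D$, a spherical shell, are arbitrary choices made for the check) evaluates $P(\mathcal D)$ both by sampling the Cartesian probability density $\overline f_x(x,y,z)$ and by integrating the geographical probability density $\overline f_r(r,\varphi,\lambda) = r^2\cos\lambda\; f_r(r,\varphi,\lambda)$, and verifies that the two numbers agree:

import numpy as np

rng = np.random.default_rng(2)

def trapezoid(y, x):                       # simple trapezoid rule
    return 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(x))

# Volumetric probability (an invariant scalar): isotropic 3D Gaussian, unit variance.
# In Cartesian coordinates it coincides with the probability density fbar_x.
def f(r):                                  # depends only on the distance r to the origin
    return np.exp(-0.5 * r**2) / (2.0 * np.pi) ** 1.5

# Domain D: the spherical shell 0.5 <= r <= 1.5

# (a) Cartesian route: sample the density fbar_x directly (standard normal in x, y, z)
N = 2_000_000
radii = np.linalg.norm(rng.standard_normal((N, 3)), axis=1)
P_cartesian = np.mean((radii >= 0.5) & (radii <= 1.5))

# (b) Geographical route: integrate the density fbar_r = r^2 cos(lambda) f_r (eq. 5.118)
#     with dr dphi dlambda; for this isotropic f the triple integral factorizes.
r   = np.linspace(0.5, 1.5, 2001)
phi = np.linspace(-np.pi, np.pi, 2001)
lam = np.linspace(-np.pi / 2, np.pi / 2, 2001)
P_geographical = (trapezoid(r**2 * f(r), r)
                  * trapezoid(np.ones_like(phi), phi)
                  * trapezoid(np.cos(lam), lam))

print("P(D), Cartesian Monte Carlo   :", P_cartesian)      # the two values agree
print("P(D), geographical integration:", P_geographical)   # up to Monte Carlo error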


5.3.10 The Borel ‘Paradox’

[Note: This appendix has to be updated.]

A description of the paradox is given, for instance, by Kolmogorov (1933), in his Foundations of the Theory of Probability (see figure 5.6).

Figure 5.6: A reproduction of a section of Kolmogorov's book Foundations of the Theory of Probability (1950, pp. 50–51). He describes the so-called "Borel paradox". His explanation is not profound: instead of discussing the behaviour of a conditional probability density under a change of variables, it concerns the interpretation of a probability density over the sphere when using spherical coordinates. I do not agree with the conclusion (see main text).

A probability distribution is considered over the surface of the unit sphere, associating, as it should, to any domain $\mathcal D$ of the surface of the sphere, a positive real number $P(\mathcal D)$. To any possible choice of coordinates $\{u,v\}$ on the surface of the sphere will correspond a probability density $\overline f(u,v)$ representing the given probability distribution, through $P(\mathcal D) = \int du\int dv\;\overline f(u,v)$ (integral over the domain $\mathcal D$). At this point of the discussion, the coordinates $\{u,v\}$ may be the standard spherical coordinates or any other system of coordinates (as, for instance, the Cartesian coordinates in a representation of the surface of the sphere as a 'geographical map', using any 'geographical projection').

A great circle is given on the surface of the sphere that, should we use spherical coordinates, is not necessarily the 'equator' or a 'meridian'. Points on this circle may be parameterized by a coordinate $\alpha$, which, for simplicity, we may take to be the circular angle (as measured from the center of the sphere).

The probability distribution $P(\,\cdot\,)$ defined over the surface of the sphere will induce a probability distribution over the circle. Said otherwise, the probability density $\overline f(u,v)$ defined over the surface of the sphere will induce a probability density $g(\alpha)$ over the circle. This is the situation one has in mind when defining the notion of conditional probability density, so we may say that $g(\alpha)$ is the conditional probability density induced on the circle by the probability density $\overline f(u,v)$, given the condition that points must lie on the great circle.

The Borel-Kolmogorov paradox is obtained when the probability distribution over the surface of the sphere is homogeneous. If it is homogeneous over the sphere, the conditional probability distribution over the great circle must be homogeneous too, and as we parameterize by the circular angle $\alpha$, the conditional probability density over the circle must be

\[
g(\alpha) \;=\; \frac{1}{2\pi} ,
\tag{5.121}
\]

and this is not what one gets from the standard definition of conditional probability density, as we will see below.

From now on, assume that the spherical coordinates $\{\lambda,\varphi\}$ are used, where $\lambda$ is the latitude (rather than the colatitude $\theta$), so the domains of definition of the variables are

\[
-\pi/2 < \lambda \le +\pi/2 \;;\qquad -\pi < \varphi \le +\pi .
\tag{5.122}
\]

As the surface element is $dS(\lambda,\varphi) = \cos\lambda\; d\lambda\, d\varphi$, the homogeneous probability distribution over the surface of the sphere is represented, in spherical coordinates, by the probability density

\[
\overline f(\lambda,\varphi) \;=\; \frac{1}{4\pi}\,\cos\lambda ,
\tag{5.123}
\]

and we satisfy the normalization condition

\[
\int_{-\pi/2}^{+\pi/2} d\lambda \int_{-\pi}^{+\pi} d\varphi\;\; \overline f(\lambda,\varphi) \;=\; 1 .
\tag{5.124}
\]

The probability of any domain equals the relative surface of the domain (i.e., the ratio of the surface of the domain divided by the surface of the sphere, $4\pi$), so the probability density in equation 5.123 does represent the homogeneous probability distribution.

Two different computations follow. Both are aimed at computing the conditional probability density over a great circle.

The first one uses the nonconventional definition of conditional probability density introduced in section ?? of this text (and claimed to be 'consistent'). No paradox appears, no matter whether we take as great circle a meridian or the equator.

The second computation is the conventional one. The traditional Borel-Kolmogorov paradox appears when the great circle is taken to be a meridian. We interpret this as a sign of the inconsistency of the conventional theory. Let us develop the example.

We have the line element (taking a sphere of radius 1),

\[
ds^2 \;=\; d\lambda^2 + \cos^2\!\lambda\;\, d\varphi^2 ,
\tag{5.125}
\]


which gives the metric components

\[
g_{\lambda\lambda}(\lambda,\varphi) = 1 \;;\qquad g_{\varphi\varphi}(\lambda,\varphi) = \cos^2\!\lambda
\tag{5.126}
\]

and the surface element
\[
dS(\lambda,\varphi) \;=\; \cos\lambda\;\, d\lambda\, d\varphi .
\tag{5.127}
\]

Letting $\overline f(\lambda,\varphi)$ be a probability density over the sphere, consider the restriction of this probability to the (half) meridian $\varphi = \varphi_0$, i.e., the conditional probability density on this (half) meridian. It is, following equation ??,

\[
\overline f_\lambda(\lambda\,|\,\varphi=\varphi_0) \;=\; k\;\frac{\overline f(\lambda,\varphi_0)}{\sqrt{g_{\varphi\varphi}(\lambda,\varphi_0)}} .
\tag{5.128}
\]

In our case, using the second of equations 5.126,

\[
\overline f_\lambda(\lambda\,|\,\varphi=\varphi_0) \;=\; k\;\frac{\overline f(\lambda,\varphi_0)}{\cos\lambda} ,
\tag{5.129}
\]

or, in normalized version,

\[
\overline f_\lambda(\lambda\,|\,\varphi=\varphi_0) \;=\;
\frac{\overline f(\lambda,\varphi_0)/\cos\lambda}
{\displaystyle\int_{-\pi/2}^{+\pi/2} d\lambda\;\; \overline f(\lambda,\varphi_0)/\cos\lambda} .
\tag{5.130}
\]

If the original probability density $\overline f(\lambda,\varphi)$ represents a homogeneous probability, then it must be proportional to the surface element $dS$ (equation 5.127), so, in normalized form, the homogeneous probability density is

\[
\overline f(\lambda,\varphi) \;=\; \frac{1}{4\pi}\,\cos\lambda .
\tag{5.131}
\]

Then, equation 5.129 gives

\[
\overline f_\lambda(\lambda\,|\,\varphi=\varphi_0) \;=\; \frac{1}{\pi} .
\tag{5.132}
\]

We see that this conditional probability density is constant⁶.

This is in contradiction with usual 'definitions' of conditional probability density, where the metric of the space is not considered, and where, instead of the correct equation 5.128, the conditional probability density is 'defined' by

\[
\overline f_\lambda(\lambda\,|\,\varphi=\varphi_0) \;=\; k\;\overline f(\lambda,\varphi_0) \;=\;
\frac{\overline f(\lambda,\varphi_0)}
{\displaystyle\int_{-\pi/2}^{+\pi/2} d\lambda\;\; \overline f(\lambda,\varphi_0)}
\qquad\text{wrong definition} ,
\tag{5.133}
\]

⁶ This constant value is $1/\pi$ if we consider half a meridian, or $1/2\pi$ if we consider a whole meridian.


this leading, in the considered case, to the conditional probability density

\[
\overline f_\lambda(\lambda\,|\,\varphi=\varphi_0) \;=\; \frac{\cos\lambda}{2}
\qquad\text{wrong result} .
\tag{5.134}
\]

This result is the celebrated 'Borel paradox'. As any other 'mathematical paradox', it is not a paradox; it is just the result of an inconsistent calculation, with an arbitrary definition of conditional probability density.

The interpretation of the paradox by Kolmogorov (1933) sounds quite strange to us (see figure 5.6). Jaynes (1995) says "Whenever we have a probability density on one space and we wish to generate from it one on a subspace of measure zero, the only safe procedure is to pass to an explicitly defined limit [. . . ]. In general, the final result will and must depend on which limiting operation was specified. This is extremely counter-intuitive at first hearing; yet it becomes obvious when the reason for it is understood."

We agree with Jaynes, and go one step further. We claim that usual parameter spaces, where we define probability densities, normally accept a natural definition of distance, and that the 'limiting operation' (in the words of Jaynes) must be the uniform convergence associated to the metric. This is what we have done to define the notion of conditional probability. Many examples of such distances are shown in this text.
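The difference between the 'orthogonal' (metric) limit and the 'vertical' limit can be made tangible with a small Monte Carlo experiment (a sketch in Python; the band half-width $\Delta$ is an arbitrary small value). Points are drawn homogeneously on the sphere; those lying in a thin band around the meridian $\varphi=0$ are kept, the band being defined either by the great-circle (metric) distance to the meridian or by the 'vertical' condition $|\varphi|\le\Delta$. For small $\Delta$, the histogram of the latitude $\lambda$ of the retained points approaches the constant $1/\pi$ of equation 5.132 in the first case, and the $\frac{1}{2}\cos\lambda$ of equation 5.134 in the second:

import numpy as np

rng = np.random.default_rng(3)

# Homogeneous points on the unit sphere: phi uniform, sin(latitude) uniform
N = 5_000_000
phi = rng.uniform(-np.pi, np.pi, N)
lam = np.arcsin(rng.uniform(-1.0, 1.0, N))

delta = 0.02          # half-width of the band around the meridian phi = 0 (radians)

# 'Orthogonal' (metric) band: great-circle distance to the meridian <= delta.
# That distance is arcsin(|cos(lam) sin(phi)|); keep the half-meridian side cos(phi) > 0.
metric_band = (np.abs(np.cos(lam) * np.sin(phi)) <= np.sin(delta)) & (np.cos(phi) > 0)

# 'Vertical' band: simply |phi| <= delta
vertical_band = np.abs(phi) <= delta

# Histogram of the latitude of the retained points, normalized as a density in lambda
bins = np.linspace(-np.pi / 2, np.pi / 2, 13)
centers = 0.5 * (bins[:-1] + bins[1:])
for name, band in [("metric band  ", metric_band), ("vertical band", vertical_band)]:
    counts, _ = np.histogram(lam[band], bins=bins)
    density = counts / (counts.sum() * np.diff(bins))
    print(name, np.round(density, 3))

print("1/pi          =", round(1.0 / np.pi, 3))                    # eq. 5.132
print("cos(lambda)/2 =", np.round(np.cos(centers) / 2.0, 3))       # eq. 5.134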


5.3.11 Sampling a Probability

5.3.11.1 Sample Points (I)

Note: write here a simple section defining (intuitively) what a sample is, and describing the simplest sampling methods.

Note: explain that this section is not about the estimation of properties of a population from the properties of a sample (an important problem in statistics). We are here concerned with a different problem: given a probability over a set, how can we "draw" elements of the set, according to the given probability? (Well. . . this is not so clear, as what we essentially want is to evaluate the probability of an event, say $P[\mathcal A]$, using the sample points. This implies counting how many points fall in $\mathcal A$ and evaluating a ratio. I must give the basic probabilistic rules of sampling. . . )

Example 5.8 Assume that a deck of playing cards has twice as many clubs and spades as it has hearts and diamonds. Then, when randomly drawing a card, the probability of each of the suits is

\[
\begin{array}{c|cccc}
a & ♣ & ♠ & ♥ & ♦\\
\hline
p(a) & 2/6 & 2/6 & 1/6 & 1/6 .
\end{array}
\]

To mathematically sample this probability, one may use a virtual deck of cards (i.e., a computer software with a random number generator) or, equivalently, a virtual six-faced die, and use the following correspondence:

\[
\begin{array}{l|cccccc}
\text{die face} & 1 & 2 & 3 & 4 & 5 & 6\\
\hline
\text{associated suit} & ♣ & ♣ & ♠ & ♠ & ♥ & ♦ .
\end{array}
\]

An experiment produced the following sequence⁷: ♠ ♠ ♠ ♣ ♥ ♣ ♣ ♣ ♣ ♦ ♥ ♣ ♦ ♦ ♦ ♣ ♣ ♦ ♣ ♠ ♣ ♦ ♣ ♣ ♠ ♣ ♠ ♣ ♠ ♠ ♣ ♣ . . .

There are some advantages in using deterministic pseudo-random number generators (generation is fast, it is immediately available, and the results are reproducible [if the same "seed" is used]). In some situations, when the number of random drawings is huge, and when it is important to avoid any possible correlation, one may resort to "true" random number generators, that typically sample (and process) a source of entropy outside the computer. These "true" random number generators are available at different web sites (e.g., http://www.random.org/). They typically pass all the tests that a true random sequence should satisfy, so it is reasonable to rely on them for practical applications. It remains that any actual realization of a sequence of numbers will never be random in the mathematical sense of the term.

⁷ The experiment was stopped after 120 000 000 sample points had been generated. At that moment, the discrepancy between the experimental frequencies and the theoretical frequencies was of the order of $10^{-4}$ (as it should be, as $1/\sqrt{120\,000\,000} = 0.91\times 10^{-4}$).


Note: mention that we can easily obtain a random integer inside a finite set of integers, but not a random integer (the probability of any integer is zero). Also, given an interval of the real line, we can obtain a random finite-accuracy real number inside the interval, but not a true real number.

Note: mention somewhere the "resampling stats" method.


5.3.11.2 Sample Points (II)

Introduction

When a probability distribution has been defined, we have to face the problem of how to 'use' it. The definition of some central estimators (like the mean or the median) and of some estimators of dispersion (like the covariance matrix) lacks generality, as it is quite easy to find examples (like multimodal distributions in highly-dimensioned spaces) where these estimators fail to have any interesting meaning.

When a probability distribution has been defined over a space of low dimension (say, from one to four dimensions), then we can directly represent the associated volumetric probability. This is trivial in one or two dimensions. It is easy in three dimensions, using, for instance, virtual reality software. Some tricks may allow us to represent a four-dimensional probability distribution, but clearly this approach cannot be generalized to the high dimensional case.

Let us explain the only approach that seems practical, with the help of figure 5.7. At the left of the figure, there is an explicit representation of a 2D probability distribution (by means of the associated volumetric probability). In the middle, some random points have been generated (using the Monte Carlo method about to be described). It is clear that if we make a histogram with these points, in the limit of a sufficiently large number of points, we recover the representation at the left. Disregarding the histogram possibility, we can concentrate on the individual points. In the 2D example of the figure, we have actual points in a plane. If the problem is multidimensional, each 'point' may correspond to some abstract notion. For instance, for a physicist, a 'point' may be a given state of a physical system. This state may be represented in some way, for instance using some color drawing. Then a collection of 'points' is a collection of such drawings. Our experience shows that, given such a collection of randomly generated 'models', the human eye-brain system is extremely good at apprehending the basic characteristics of the underlying probability distribution, including possible multimodalities, correlations, etc.

Figure 5.7: An explicit representation of a 2D probability distribution, and the sampling of it, using Monte Carlo methods. While the representation at the top-left cannot be generalized to high dimensions, the examination of a collection of points can be done in arbitrary dimensions. Practically, Monte Carlo generation of points is done through a 'random walk' where a 'new point' is generated in the vicinity of the previous point.


When such a (hopefully large) collection of random models is available we can also answer quite interesting questions. For instance, a geologist may ask: at which depth is that subsurface structure? To answer this, we can make a histogram of the depth of the given geological structure over the collection of random models, and the histogram is the answer to the question. Which is the probability of having a zone with large values of mass density shallower than one kilometer? The ratio of the number of models presenting such a characteristic over the total number of models in the collection gives the answer (if the collection of models is large enough).

Any Monte Carlo sampling method to be used in a space with a large number of dimensions has to be very carefully designed: blind Monte Carlo searches will fail, except for very simple probability distributions, for large-dimensional spaces tend to be terribly empty, as figure 5.8 suggests.

In this chapter it is assumed that we work with a metric manifold. Then both the notion of distance between two points and the notion of volume make sense. And we work with volumetric probabilities, not probability densities. In principle, the formulas developed here could be adapted to the case where one may wish to use probability densities (the formulas become more complicated), but I see a major problem with this: the use of probability densities may give the illusion that one can work with manifolds where the distance between points is not defined. But then, what would it mean, in a Metropolis algorithm, to make a 'small' jump? And what would it mean to start with a random walk that samples the homogeneous probability distribution? I encourage the reader to start using Monte Carlo methods only after the notion of distance and the notion of volume have been carefully introduced in the manifold.

A comment to be mentioned somewhere. When using sampling methods for approximating probabilities via observed frequencies, the following question may arise:

Consider an event that has a probability $p$ of occurring. We generate $N$ random trials, and we observe that the event has occurred $n$ times ($0\le n\le N$). If $N$ is large, we should obviously have $n\approx p\,N$. More precisely, which is the probability of each possible value of $n$ (when $N$ is not necessarily large)?

The answer is provided by the binomial distribution,

\[
P(n) \;=\; \frac{N!}{n!\,(N-n)!}\; p^n\,(1-p)^{N-n}
\;;\qquad
\sum_{n=0}^{N} P(n) \;=\; 1 .
\tag{5.135}
\]
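The following lines (a minimal sketch in Python, with arbitrary values of $p$ and $N$) evaluate the binomial probabilities 5.135 and compare them with the frequencies observed in repeated simulated experiments:

import math
import numpy as np

rng = np.random.default_rng(4)
p, N = 0.3, 20                      # arbitrary values for the illustration

# Binomial probabilities of eq. 5.135
P = np.array([math.comb(N, n) * p**n * (1 - p)**(N - n) for n in range(N + 1)])
print("sum of P(n):", P.sum())      # = 1

# Frequencies of n over many simulated experiments of N trials each
experiments = 200_000
n_observed = rng.binomial(N, p, size=experiments)
freq = np.bincount(n_observed, minlength=N + 1) / experiments

for n in range(4, 9):
    print(f"n = {n}:  P(n) = {P[n]:.4f}   observed frequency = {freq[n]:.4f}")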



Figure 5.8: Consider a square and the inscribed circle. If the circle's surface is $\pi R^2$, that of the square is $(2R)^2$. If we generate a random point inside the square, with homogeneous probability distribution, the probability of hitting the circle equals the ratio of the surfaces, i.e., $P = \pi/4$. We can do the same in 3D, but, in this case, the ratio of volumes is $P = \pi/6$: the probability of hitting the target is smaller in 3D than in 2D. This probability tends dramatically to zero when the dimension of the space increases. For instance, in dimension 100, the probability of hitting the hypersphere inscribed in the hypercube is $P = 1.9\times 10^{-70}$, which means that it is practically impossible to hit the target 'by chance'. The formulas at the top give the volume of a hypersphere of radius $R$ in a space of even or odd dimension (the closed-form expression is not the same in the two cases; in general it is $\pi^{n/2}R^n/\Gamma(1+n/2)$ in dimension $n$), and the volume of a hypercube with sides of length $2R$. The graph at the bottom shows the evolution, as a function of the dimension of the space, of the ratio between the volume of the hypersphere and the volume of the hypercube. In large dimension, the hypersphere fills a negligible amount of the hypercube.


Notion of Sample

Let $\mathcal M$ be a finite-dimensional metric manifold, with points denoted $P_0$, $P$, $\dots$. Let $dv(P)$ represent the volume element of the manifold. If $f(P)$ is a normalized volumetric probability over $\mathcal M$, then, by definition, the probability of a domain $\mathcal A\subset\mathcal M$ is

\[
P(\mathcal A) \;=\; \int_{\mathcal A} dv(P)\; f(P) .
\tag{5.136}
\]

Assume that some random process (mathematical or physical) generates one random point $P_0$ on $\mathcal M$. The random point $P_0$ is called a sample of the probability distribution $f(P)$ if the probability that $P_0$ belongs to any subset $\mathcal A$ of $\mathcal M$ equals $P(\mathcal A)$.


Inversion Method

Consider a (1D) volumetric probability $f(x)$ depending on a scalar variable $x$, with length element $ds(x)$. This may occur when we really have one single random variable or, more often, when on a multidimensional manifold we consider a conditional distribution on a line (along which $x$ is a parameter). The 'inversion method' consists in introducing the cumulative probability

\[
y \;=\; F(x) \;=\; \int_{x_{\mathrm{min}}}^{x} ds(x')\; f(x') ,
\tag{5.137}
\]

which takes values in the interval $[0,1]$, and the inverse function $x = F^{-1}(y)$. It is easy to see that if one randomly generates values of $y$ with constant probability density in the interval $[0,1]$, then the values $x = F^{-1}(y)$ are random samples 'of' the volumetric probability $f(x)$. Provided the function $F^{-1}$ is available, the method is simple and efficient.
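When $F^{-1}$ is not available in closed form, it can be tabulated numerically. The sketch below (Python; the particular target, the lognormal volumetric probability of equation 5.147 with arbitrary $X_0$ and $\sigma$, and the grid bounds are choices made for the illustration) builds the cumulative 5.137 on a grid, using the length element $ds(X)=dX/X$, inverts it by interpolation, and checks the first two moments of $\log(X/X_0)$ of the resulting samples:

import numpy as np

rng = np.random.default_rng(5)

# Target: lognormal volumetric probability (eq. 5.147), with length element ds = dX/X
X0, sigma = 3.0, 0.5
def f(X):
    return np.exp(-0.5 * (np.log(X / X0) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Tabulate the cumulative probability F(X) = int ds(X') f(X') on a grid (eq. 5.137)
X = np.geomspace(X0 * np.exp(-8 * sigma), X0 * np.exp(8 * sigma), 20_001)
ds = np.diff(np.log(X))                        # ds = dX/X = d(log X)
F = np.concatenate([[0.0], np.cumsum(0.5 * (f(X[1:]) + f(X[:-1])) * ds)])
F /= F[-1]                                     # normalize the numerical cumulative

# Inversion method: y uniform in [0, 1], then X = F^{-1}(y) by interpolation
y = rng.uniform(0.0, 1.0, 1_000_000)
samples = np.interp(y, F, X)

t = np.log(samples / X0)
print("mean of log(X/X0):", t.mean(), "  (expected 0)")
print("std  of log(X/X0):", t.std(), "  (expected", sigma, ")")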

Example 5.9 Let $y_1$, $y_2$, $\dots$ be samples of a random variable with constant volumetric probability in the interval $[0,1]$, and let $\mathrm{erf}^{-1}$ be the inverse error function⁸. The numbers $\mathrm{erf}^{-1}(y_1)$, $\mathrm{erf}^{-1}(y_2)$, $\dots$ are then normally distributed, with zero mean and unit variance (see figure 5.9).

Figure 5.9: Use of the 'inversion method' to produce samples of a two-dimensional Gaussian volumetric probability.


⁸ The error function $\mathrm{erf}(x)$ is the integral between $-\infty$ and $x$ of a normalized Gaussian with zero mean and unit variance (be careful, there are different definitions). One may find in the literature different series expressions for $\mathrm{erf}^{-1}$.


Rejection Method

The 'rejection method' starts by generating samples $x_1$, $x_2$, $\dots$ of the homogeneous volumetric probability, which usually is a simple problem. Then, each sample is submitted to the possibility of a rejection, the probability that the sample $x_k$ is accepted being taken equal to

\[
P \;=\; \frac{f(x_k)}{f_{\mathrm{max}}} ,
\tag{5.138}
\]

where $f_{\mathrm{max}}$ stands for the maximum of all the values $f(x)$, or any larger number (the larger the number, the less efficient the method). It is then easy to prove that any accepted point is a sample of the volumetric probability $f(x)$.

This method works reasonably well in one or two dimensions, and could, in principle, be applicable in any number of dimensions. But, as already mentioned, large-dimensional spaces tend to be very empty, and the chances that this method accepts a point may be dramatically low when working with multidimensional spaces.
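The sketch below (Python; the target distribution, a Gaussian restricted to a hypercube, and the dimensions tried are arbitrary choices) implements this acceptance rule and prints the acceptance rate as the dimension of the space grows, illustrating the emptiness of large-dimensional spaces discussed around figure 5.8:

import numpy as np

rng = np.random.default_rng(6)

def rejection_sample(dim, n_proposals=200_000, half_width=4.0):
    # Homogeneous samples in the hypercube [-half_width, half_width]^dim
    x = rng.uniform(-half_width, half_width, size=(n_proposals, dim))
    # Target volumetric probability (up to a constant): isotropic Gaussian
    f = np.exp(-0.5 * np.sum(x**2, axis=1))
    f_max = 1.0                               # the maximum of exp(-|x|^2 / 2)
    accepted = x[rng.uniform(0.0, 1.0, n_proposals) < f / f_max]
    return accepted

for dim in (1, 2, 5, 10, 20):
    accepted = rejection_sample(dim)
    rate = len(accepted) / 200_000
    print(f"dimension {dim:2d}:  acceptance rate = {rate:.5f}")
    # (the expected rate is (2*pi)^(dim/2) / (2*half_width)^dim, which collapses quickly)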


Sequential Realization

In equation 3.245 (page 139), we have expressed a joint volumetric probability as the product of a conditional times a marginal. The conditional itself may sometimes be further decomposed, and so on, until one has an expression like

\[
f_n(\mathbf x_1,\mathbf x_2,\dots,\mathbf x_n) \;=\;
f_1(\mathbf x_1)\; f_{1|1}(\mathbf x_2|\mathbf x_1)\; f_{1|2}(\mathbf x_3|\mathbf x_1,\mathbf x_2)\;\cdots\;
f_{1|n-1}(\mathbf x_n|\mathbf x_1,\dots,\mathbf x_{n-1}) ,
\tag{5.139}
\]

where each of the $\mathbf x_i$ is, in general, multidimensional.

All these marginal and conditional volumetric probabilities are contained in the original $n$-dimensional joint volumetric probability $f_n(\mathbf x_1,\mathbf x_2,\dots,\mathbf x_n)$, and can, at least in principle, be evaluated from it using integrals. Assume that they are all known, and let us see how an $n$-dimensional sample could be generated.

One starts by generating a sample of the (perhaps multidimensional) variable $\mathbf x_1$, using the marginal $f_1(\mathbf x_1)$, this giving a value $\mathbf x_1^0$. With this value at hand, one generates a sample of the variable $\mathbf x_2$, using the conditional $f_{1|1}(\mathbf x_2|\mathbf x_1^0)$, this giving a value $\mathbf x_2^0$. Then, one generates a sample of the variable $\mathbf x_3$, using the conditional $f_{1|2}(\mathbf x_3|\mathbf x_1^0,\mathbf x_2^0)$, this giving a value $\mathbf x_3^0$. And so on, until one generates a sample of the variable $\mathbf x_n$, using the conditional $f_{1|n-1}(\mathbf x_n|\mathbf x_1^0,\dots,\mathbf x_{n-1}^0)$, this giving a value $\mathbf x_n^0$. In this manner a point $\{\mathbf x_1^0,\mathbf x_2^0,\dots,\mathbf x_n^0\}$ has been generated that is a sample of the original $f_n(\mathbf x_1,\mathbf x_2,\dots,\mathbf x_n)$.
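For a two-dimensional Gaussian, the decomposition 5.139 is explicit, which allows a direct check of the procedure. In the sketch below (Python; the correlation coefficient is an arbitrary choice), $\mathbf x_1$ is drawn from the marginal and $\mathbf x_2$ from the conditional given $\mathbf x_1$, and the empirical covariance of the pairs is compared with the target one:

import numpy as np

rng = np.random.default_rng(7)

# Target: zero-mean 2D Gaussian with unit variances and correlation rho.
# Decomposition (eq. 5.139):  f2(x1, x2) = f1(x1) f_{1|1}(x2 | x1), with
#   f1      = N(0, 1)
#   f_{1|1} = N(rho * x1, 1 - rho^2)
rho = 0.8
N = 1_000_000

x1 = rng.standard_normal(N)                                      # sample of the marginal
x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(N)   # sample of the conditional

samples = np.stack([x1, x2], axis=1)
print("empirical covariance matrix:")
print(np.round(np.cov(samples.T), 3))
print("target covariance matrix:")
print(np.array([[1.0, rho], [rho, 1.0]]))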


5.3.12 Random Points on the Surface of the Sphere

Figure 5.10: 1000 random points on the surface of the sphere.

Note: Figure 5.10 has been generated using the following Mathematica code:

spc[t_,p_,r_:1] := r {Sqrt[1-t^2] Cos[p], Sqrt[1-t^2] Sin[p], t}
Show[Graphics3D[Table[Point[spc[Random[Real,{-1,1}],
     Random[Real,{0,2Pi}]]], {1000}]]]

Figure 5.11: A geodesic dome dividing the surface of the sphere into domains with approximately the same area.

Figure 5.12: The coordinate division of the surface of the sphere.



Figure 5.13: Map representation of a random homogeneous distribution of points at the surface of the sphere. At the left, the naive division of the surface of the sphere using constant increments of the coordinates. At the right, the cylindrical equal-area projection. Counting the points inside each 'rectangle' gives, at the left, the probability density of points; at the right, the volumetric probability.


5.3.13 Basic Probability Distributions

5.3.13.1 Dirac’s Probability Distribution

In a metric manifold (where the notion of distance $D(P_1,P_2)$ between two points makes sense) we introduce the notion of homogeneous ball. The homogeneous ball of radius $r$ centered at $P_0\in\mathcal M$ is the probability distribution represented by the volumetric probability

\[
f(P;P_0,r) \;=\;
\begin{cases}
1/V(P_0,r) & \text{if } D(P,P_0)\le r\\[1mm]
0 & \text{if } D(P,P_0)>r ,
\end{cases}
\tag{5.140}
\]

where $V(P_0,r)$ is the volume of the 'spherical' domain here considered:

\[
V(P_0,r) \;=\; \int_{D(P,P_0)\le r} dv(P) .
\tag{5.141}
\]

This probability distribution is normalized to one.

For any scalar 'test function' $\psi(P)$ defined over the manifold $\mathcal M$, clearly,

\[
\int_{P\in\mathcal M} dv(P)\;\psi(P)\; f(P;P_0,r) \;=\;
\frac{1}{V(P_0,r)}\int_{D(P,P_0)\le r} dv(P)\;\psi(P) .
\tag{5.142}
\]

If the test function $\psi(P)$ is sufficiently regular, one can take the limit $r\to 0$ in this expression, to get $\lim_{r\to 0}\int_{P\in\mathcal M} dv(P)\,\psi(P)\, f(P;P_0,r) = \psi(P_0)$. One then formally writes

\[
\int_{P\in\mathcal M} dv(P)\;\psi(P)\;\delta(P;P_0) \;=\; \psi(P_0) ,
\tag{5.143}
\]

where, formally,
\[
\delta(P;P_0) \;=\; \lim_{r\to 0} f(P;P_0,r) ,
\tag{5.144}
\]

and we call $\delta(P;P_0)$ the Dirac probability distribution centered at point $P_0\in\mathcal M$. It associates probability one to any domain $\mathcal A\subset\mathcal M$ that contains $P_0$ and probability zero to any domain that does not contain $P_0$.

A Dirac probability density could also be introduced, but we do not need to enter into the technicalities necessary for its proper definition.


5.3.13.2 Gaussian Probability Distribution

One Dimensional Spaces

Warning: the formulas of this section have to be changed, to make them consistent with the multidimensional formulas 5.160 and 5.161. And I must assume a linear space!

Let $\mathcal M$ be a one-dimensional metric line with points $P$, $Q$, $\dots$, and let $D(Q,P)$ denote the distance between point $P$ and point $Q$. Given any particular point $P$ on the line, it is assumed that the line extends to infinite distances from $P$ in the two senses. The one-dimensional Gaussian probability distribution is defined by the volumetric probability

\[
f(P;P_0;\sigma) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma}\,
\exp\!\left(-\,\frac{D(P,P_0)^2}{2\sigma^2}\right) ,
\tag{5.145}
\]

and it follows from the general definition of volumetric probability that the probability of the interval between any two points $P_1$ and $P_2$ is

\[
P \;=\; \int_{P_1}^{P_2} ds(P)\; f(P;P_0;\sigma) ,
\tag{5.146}
\]

where $ds$ denotes the elementary length element. The following properties are easy to demonstrate:

• the probability of the whole line equals one (i.e., the volumetric probability $f(P;P_0;\sigma)$ is normalized);

• the mean of $f(P;P_0;\sigma)$ is the point $P_0$;

• the standard deviation of $f(P;P_0;\sigma)$ equals $\sigma$.

Example 5.10 Consider a coordinate $X$ such that the distance between two points is $D = |\log(X'/X)|$. Then, the Gaussian distribution 5.145 takes the form

\[
f_X(X;X_0,\sigma) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma}\,
\exp\!\left(-\,\frac{1}{2}\left(\frac{1}{\sigma}\,\log\frac{X}{X_0}\right)^{\!2}\right) ,
\tag{5.147}
\]

where $X_0$ is the mean and $\sigma$ the standard deviation. As, here, $ds(X) = dX/X$, the probability of an interval is

\[
P(X_1\le X\le X_2) \;=\; \int_{X_1}^{X_2} \frac{dX}{X}\; f_X(X;X_0,\sigma) ,
\tag{5.148}
\]

and we have the normalization

\[
\int_0^\infty \frac{dX}{X}\; f_X(X;X_0,\sigma) \;=\; 1 .
\tag{5.149}
\]

This expression of the Gaussian probability distribution, written in terms of the variable $X$, is called the lognormal law. I suggest that the information on the parameter $X$ represented by the volumetric probability 5.147 should be expressed by a notation like⁹

\[
\log\frac{X}{X_0} \;=\; \pm\,\sigma ,
\tag{5.150}
\]

which is the exact equivalent of the notation used in equation 5.154 below. Defining the difference $\delta X = X - X_0$, one converts this equation into $\log(1+\delta X/X_0) = \pm\sigma$, whose first-order approximation is $\delta X/X_0 = \pm\sigma$. This shows that $\sigma$ corresponds to what is usually called the 'relative uncertainty'. I do not recommend this terminology, as, with the definitions used in this book (see section ??), $\sigma$ is the actual standard deviation of the quantity $X$.

Exercise: write the equivalent of the three expressions 5.147–5.149 using, instead of the variable $X$, the variables $U = 1/X$ or $Y = X^n$.

Example 5.11 Consider a coordinate $x$ such that the distance between two points is $D = |x'-x|$. Then, the Gaussian distribution 5.145 takes the form

\[
f_x(x;x_0,\sigma) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma}\,
\exp\!\left(-\,\frac{1}{2}\,\frac{(x-x_0)^2}{\sigma^2}\right) ,
\tag{5.151}
\]

where $x_0$ is the mean and $\sigma$ the standard deviation. As, here, $ds(x) = dx$, the probability of an interval is

\[
P(x_1\le x\le x_2) \;=\; \int_{x_1}^{x_2} dx\; f_x(x;x_0,\sigma) ,
\tag{5.152}
\]

and we have the normalization

\[
\int_{-\infty}^{+\infty} dx\; f_x(x;x_0,\sigma) \;=\; 1 .
\tag{5.153}
\]

This expression of the Gaussian probability distribution, written in terms of the variable $x$, is called the normal law. The information on the parameter $x$ represented by the volumetric probability 5.151 is commonly expressed by a notation like¹⁰

\[
x \;=\; x_0\pm\sigma .
\tag{5.154}
\]

⁹ Equivalently, one may write $X = X_0\exp(\pm\sigma)$, or $X = X_0$ ·÷ $\Sigma$, where $\Sigma = \exp\sigma$.

¹⁰ More concise notations are also used. As an example, the expression $x = 1\,234.567\,89\ \mathrm m \pm 0.000\,11\ \mathrm m$ (here, 'm' represents the physical unit 'meter') is sometimes written $x = (1\,234.567\,89\pm 0.000\,11)\ \mathrm m$ or even $x = 1\,234.567\,89(11)\ \mathrm m$.


Example 5.12 It is easy to verify that through the change of variable

\[
x \;=\; \log\frac{X}{K} ,
\tag{5.155}
\]

where $K$ is an arbitrary constant, the equations of example 5.10 become those of example 5.11, and vice versa. In this case, the quantity $x$ has no physical dimensions (this is, of course, a possibility, but not a necessity, for the quantity $x$ in example 5.11).

The Gaussian probability distribution is represented in figure 5.14. Note that there is no need to make different plots for the normal and the lognormal volumetric probabilities.

[Figure 5.14 axes: top scale, the temperature $T$ from $10^{-4}\,\mathrm K$ to $10^{4}\,\mathrm K$; bottom scale, $t = \log_{10}(T/T_0)$ from $-4$ to $4$, with $T_0 = 1\,\mathrm K$.]

Figure 5.14: A representation of the Gaussian probability distribution, where the example of a temperature $T$ is used. Reading the scale at the top, we associate to each value of the temperature $T$ the value $h(T)$ of a lognormal volumetric probability. Reading the scale at the bottom, we associate to every value of the logarithmic temperature $t$ the value $g(t)$ of a normal volumetric probability. There is no need to make a special plot where the lognormal volumetric probability $h(T)$ would not be represented 'in a logarithmic axis', as this strongly distorts the beautiful Gaussian bell (see figures 5.15 and 5.16). In the figure represented here, one standard deviation corresponds to one unit of $t$, so the whole range represented equals $8\sigma$.

Figure 5.15: Left: the lognormal volumetric probability $h(X)$. Right: the lognormal probability density $\overline h(X)$. Distributions centered at 1, with standard deviations respectively equal to 0.1, 0.2, 0.4, 0.8, 1.6 and 3.2.

Figure 5.16 gives the interpretation of these functions in terms of histograms. By definition of volumetric probability, a histogram should be made by dividing the interval under study into segments of the same length $ds(X) = dX/X$, as opposed to the definition of probability density, where the interval should be divided into segments of equal 'variable increment' $dX$. We clearly see, at the right of the figure, the impracticality of making the histogram corresponding to the probability density: while the right part of the histogram oversamples the variable, the left part undersamples it. The histogram suggested at the left samples the variable homogeneously, but this only means that we are using constant steps of the logarithmic quantity $x$ associated to the positive quantity $X$. Better, then, to directly use the representation suggested in figure 5.14 or in figure ??. We have then a double conclusion: (i) the lognormal probability density (at the right in figures 5.15 and 5.16) does not correspond to any practical histogram; it is generally uninteresting. (ii) the lognormal volumetric probability (at the left in figures 5.15 and 5.16) does correspond to a practical histogram, but is better handled when the associated normal volumetric probability is used instead (figure 5.14 or figure ??). In short: lognormal functions should never be used.

Figure 5.16: A typical Gaussian distribution, with central point 1 and standard deviation 5/4, represented here, using a Jeffreys (positive) quantity, by the lognormal volumetric probability (left) and the lognormal probability density (right).



Multidimensional Spaces

In dimension greater than one, the spaces may have curvature. But the multidimensional Gaussian distribution only makes sense in linear spaces. In this section, x represents a vector, and an expression like

‖ x ‖^2 = x^t g x    (5.156)

represents the squared norm of the vector x with respect to some metric tensor g . Of course, the components of vectors can also be seen as linear coordinates of the points of the affine linear space associated to the vector space. In this manner, we can interpret the expression

D^2(x_2, x_1) = (x_2 − x_1)^t g (x_2 − x_1)    (5.157)

as the squared distance between two points. The volume element of this affine space is, then,

dv(x) = √(det g) dx^1 ∧ · · · ∧ dx^n .    (5.158)

As √(det g) is a constant, the only difference between volumetric probabilities and probability densities is, in the present situation, a multiplicative factor.

Let f(x) be a volumetric probability over the space. By definition, the probability of a domain D is

P(D) = ∫_D dv(x) f(x) ,    (5.159)

i.e.,

P(D) = ∫_D √(det g) dx^1 ∧ · · · ∧ dx^n f(x) .    (5.160)

The multidimensional Gaussian volumetric probability (and probability density) is

f(x) = ( 1 / (2π)^{n/2} ) ( √(det W) / √(det g) ) exp( − (1/2) (x − x_0)^t W (x − x_0) ) .    (5.161)

The following properties correspond to well-known results concerning the multidimensional Gaussian:

• f(x) is normed, i.e., ∫ dv(x) f(x) = 1 ;

• the mean of f(x) is x_0 ;

• the covariance matrix of f(x) is^11 C = W^{-1} .

11 Remember that the general definition of covariance gives here cov^{ij} = ∫ dv(x) (x^i − x_0^i)(x^j − x_0^j) f(x) , so this property is not as obvious as it may seem.
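As a quick sanity check of the last two properties, the following Python sketch (an illustration only; the 2-D weight matrix W and center x_0 are arbitrary choices, not values from the text) draws samples of the Gaussian 5.161 and compares the empirical mean and covariance with x_0 and W^{-1}.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D example: weight matrix W and center x0 chosen arbitrarily.
W = np.array([[2.0, 0.5],
              [0.5, 1.0]])
x0 = np.array([1.0, -2.0])

# The covariance matrix of the Gaussian of equation (5.161) is C = W^{-1}.
C = np.linalg.inv(W)

# Draw samples of the distribution and check the stated properties empirically.
samples = rng.multivariate_normal(mean=x0, cov=C, size=200_000)

print("empirical mean      :", samples.mean(axis=0))   # should be close to x0
print("empirical covariance:\n", np.cov(samples.T))    # should be close to W^{-1}
print("W^{-1}:\n", C)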


5.3.13.3 Laplacian Probability Distribution

Let M be a metric manifold with points P , Q . . . , and let D(P, Q) = D(Q, P) denote the distance between two points P and Q . The Laplacian probability distribution is represented by the volumetric probability

f(P) = k exp( − D(P, Q) / σ ) .    (5.162)

[Note: Elaborate this.]


5.3.13.4 Exponential Distribution

Definition

Consider a one-dimensional metric space, with length element (one-dimensional volume element) ds , and let P_0 be one of its points. Let us introduce the metric coordinate

s(P, P_0) = ∫_{P_0}^{P} ds .    (5.163)

Note that because of the definition of the one-dimensional integral, the variable s has a sign, and one has s(P_1, P_2) = − s(P_2, P_1) .

The exponential distribution has the (1D) volumetric probability

f(P; P_0) = α exp( − α s(P, P_0) ) ;   α ≥ 0 .    (5.164)

This volumetric probability is normed via ∫ ds(P) f(P; P_0) = 1 , where the sum concerns the half-interval at the right or at the left of point P_0 , depending on the orientation chosen (see examples 5.13 and 5.14).

Example 5.13 Consider a coordinate X such that the displacement between two points is s_X(X′, X) = log(X′/X) . Then, the exponential distribution 5.164 takes the form f_X(X; X_0) = k exp( −α log(X/X_0) ) , i.e.,

f_X(X) = α (X/X_0)^{−α} ;   α ≥ 0 .    (5.165)

As, here, ds(X) = dX/X , the probability of an interval is P(X_1 ≤ X ≤ X_2) = ∫_{X_1}^{X_2} (dX/X) f_X(X) . The volumetric probability f_X(X) has been normed using

∫_{X_0}^{∞} (dX/X) f_X(X) = 1 .    (5.166)

This form of the exponential distribution is usually called the Pareto law. The cumulative probability function is

g_X(X) = ∫_{X_0}^{X} (dX′/X′) f_X(X′) = 1 − (X/X_0)^{−α} .    (5.167)

It is negative for X < X_0 , zero for X = X_0 , and positive for X > X_0 . The power α of the 'power law' 5.165 may be any real number, but in most examples concerning the physical, biological or economical sciences, it is of the form α = p/q , with p and q being small positive integers^{12}. With a variable U = 1/X , equation 5.165 becomes

f_U(U) = k′ U^{α} ;   α ≥ 0 ,    (5.168)

12 In most problems, the variables seem to be chosen in such a way that α = 2/3 . This is the case for the probability distributions of earthquakes as a function of their energy (Gutenberg-Richter law, see figure 5.18), or of the probability distribution of meteorites hitting the Earth as a function of their volume (see figure 5.21).


The probability of an interval is P(U_1 ≤ U ≤ U_2) = ∫_{U_1}^{U_2} (dU/U) f_U(U) , and one typically uses the norming condition ∫_0^{U_0} (dU/U) f_U(U) = 1 , where U_0 is some selected point. Using a variable Y = X^n , one arrives at the volumetric probability

f_Y(Y) = k′ Y^{−β} ;   β = α/n ≥ 0 .    (5.169)
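As an illustration (not part of the text), the Pareto law 5.165 can be sampled by inverting the cumulative function 5.167: if u is uniform on (0, 1), then X = X_0 (1 − u)^{−1/α} is distributed according to f_X . A minimal Python sketch, with the arbitrary choices X_0 = 1 and α = 2/3 :

import numpy as np

rng = np.random.default_rng(0)

X0, alpha = 1.0, 2.0 / 3.0   # arbitrary reference point and exponent (not from the text)
u = rng.uniform(size=100_000)

# Inverse of the cumulative function (5.167): u = 1 - (X/X0)^(-alpha).
X = X0 * (1.0 - u) ** (-1.0 / alpha)

# Empirical check of (5.167): the fraction of samples below some X should
# be close to 1 - (X/X0)^(-alpha).
for x in (2.0, 10.0, 100.0):
    print(x, (X <= x).mean(), 1.0 - (x / X0) ** (-alpha))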

Example 5.14 Consider a coordinate x such that the displacement between two points is s_x(x′, x) = x′ − x . Then, the exponential distribution 5.164 takes the form

f_x(x) = α exp( −α (x − x_0) ) ;   α ≥ 0 .    (5.170)

As, here, ds(x) = dx , the probability of an interval is P(x_1 ≤ x ≤ x_2) = ∫_{x_1}^{x_2} dx f_x(x) , and f_x(x) is normed by

∫_{x_0}^{+∞} dx f_x(x) = 1 .    (5.171)

With a variable u = −x , equation 5.170 becomes

f_u(u) = α exp( α (u − u_0) ) ;   α ≥ 0 ,    (5.172)

and the norming condition is ∫_{−∞}^{u_0} du f_u(u) = 1 . For the plotting of these volumetric probabilities, sometimes a logarithmic 'vertical axis' is used, as suggested in figure 5.17. Note that via a logarithmic change of variables x = log(X/K) (where K is some constant) this example is identical to example 5.13. The two volumetric probabilities 5.165 and 5.170 represent the same exponential distribution.

Note: mention here figure 5.17.


Figure 5.17: Plots of the exponential distribution for different definitions of the variables. Top: the power functions f_X(X) ∝ X^{−α} and f_U(U) ∝ U^{α} . Middle: using logarithmic variables x and u , one has the exponential functions f_x(x) = exp(−α x) and f_u(u) = exp(α u) . Bottom: the ordinate is also represented using a logarithmic variable, this giving the typical log-log linear functions.



Example: Distribution of Earthquakes

The historically first example of a power-law distribution is the distribution of energies of earthquakes (the famous Gutenberg-Richter law).

An earthquake can be characterized by the seismic energy generated, E , or by the moment corresponding to the dislocation, that I denote here^{13} M . As a rough approximation, the moment is given by the product M = ν ℓ S , where ν is the elastic shear modulus of the medium, ℓ the average displacement between the two sides of the fault, and S is the fault's surface (Aki and Richards, 1980).

Figure 5.18 shows the distribution of earthquakes in the Earth. As the same logarithmic base (of 10) has been chosen in both axes, the slope of the line approximating the histogram (which is quite close to -2/3 ) directly leads to the power of the power-law (Pareto) distribution. The volumetric probability f(M) representing the distribution of earthquakes in the Earth is

f(M) = k / M^{2/3} ,    (5.173)

where k is a constant. Kanamori (1977) pointed out that the moment and the seismic energy liberated are roughly proportional: M ≈ 2.0 × 10^4 E (energy and moment have the same physical dimensions). This implies that the volumetric probability as a function of the energy has the same form as for the moment:

g(E) = k′ / E^{2/3} .    (5.174)
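Before looking at the data, it may help to see how such an exponent can be estimated from a catalog of values. The sketch below is an illustration only (the synthetic 'catalog' and the reference moment are made up); it uses the standard maximum-likelihood estimate for the Pareto law 5.165, α̂ = n / Σ_i log(X_i/X_0) , which carries the same information as the log-log slope read off figure 5.18.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic 'catalog' of seismic moments following the Pareto law (5.165),
# with an arbitrary reference moment M0 and the exponent alpha = 2/3.
M0, alpha_true = 1.0, 2.0 / 3.0
u = rng.uniform(size=50_000)
M = M0 * (1.0 - u) ** (-1.0 / alpha_true)        # inverse of the cumulative (5.167)

# Standard maximum-likelihood estimate of the Pareto exponent.
alpha_hat = M.size / np.log(M / M0).sum()
print("estimated exponent:", alpha_hat)          # close to 2/3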

Figure 5.18: Histogram of the number of earthquakes (in base 10 logarithmic scale) recorded by the global seismological networks in a period of xxx years, as a function of the logarithmic seismic moment (adapted from Lay and Wallace, 1995). More precisely, the quantity in the horizontal axis is µ = log10(M/M_K) , where M is the seismic moment, and M_K = 10^{-7} J = 1 erg is a constant, whose value is arbitrarily taken equal to the unit of moment (and of energy) in the cgs system of units. The vertical axis gives n = log10(number of events). [note: Ask for the permission to publish this figure.]


13 It is traditionally denoted M_0 .


Example: Shapes at the Surface of the Earth.

Note: mention here figure 5.19.

Figure 5.19: Wessel and Smith (1996) have compiled high-resolution shoreline data, and have processed it to suppress erratic points and crossing segments. The shorelines are closed polygons, and they are classified in 4 levels: ocean boundaries, lake boundaries, island-in-lake boundaries and pond-in-island-in-lake boundaries. The 180,496 polygons they encountered had the size distribution shown at the right (the approximate numbers are in the quoted paper; the exact numbers were kindly sent to me by Wessel). A line of slope -2/3 is suggested in the figure. (The horizontal axis is log10(S/S_0) with S_0 = 1 km^2 ; the vertical axis is log10 of the number of polygons.)


Example: Size of oil fields

Note: mention here figure 5.20.

Figure 5.20: Histogram of the sizes of oil fields in a domain of Texas. The horizontal axis corresponds, with a logarithmic scale, to the 'millions of Barrels of Oil Equivalent' (mmBOE). Extracted from chapter 2 (The fractal size and spatial distribution of hydrocarbon accumulation, by Christopher C. Barton and Christopher H. Scholz) of the book "Fractals in petroleum geology and Earth processes", edited by Christopher C. Barton and Paul R. La Pointe, Plenum Press, New York and London, 1995. [note: ask for the permission to publish this figure]. The slope of the straight line is -2/3, comparable to the value found with the data of Wessel & Smith (figure 5.19).


Example: Meteorites

Note: mention here figure 5.21.


Figure 5.21: The approximate number of meteorites falling on Earth every year is distributed as follows: 10^12 meteorites with a diameter of 10^-3 mm, 10^6 with a diameter of 1 mm, 1 with a diameter of 1 m, 10^-4 with a diameter of 100 m, and 10^-8 with a diameter of 10 km. The statement is loose, and I have extracted it from the general press. It is nevertheless clear that a log-log plot of this 'histogram' gives a linear trend with a slope equal to -2. Rather, transforming the diameter D into volume V = D^3 (which is proportional to mass) gives the 'histogram' at the right, with a slope of -2/3.



5.3.13.5 Spherical Distributions

The simplest probabilistic distributions over the circle and over the surface of the sphere are the von Mises and the Fisher probability distributions, respectively.

The von Mises Distribution

As already mentioned in example 3.20, and demonstrated in section 5.3.14 here below, the conditional volumetric probability induced over the unit circle by a 2D Gaussian is

f(θ) = k exp( cos θ / σ^2 ) .    (5.175)

The constant k is to be fixed by the normalization condition ∫_0^{2π} dθ f(θ) = 1 , this giving

k = 1 / ( 2π I_0(1/σ^2) ) ,    (5.176)

where I0( · ) is the modified Bessel function of order zero.
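As a quick numerical check (not in the text), the following Python sketch evaluates the constant 5.176 with scipy's modified Bessel function and verifies that the volumetric probability 5.175 integrates to one over the circle; the value σ = 0.7 is an arbitrary choice.

import numpy as np
from scipy.special import i0   # modified Bessel function of order zero

sigma = 0.7                    # arbitrary standard deviation
kappa = 1.0 / sigma**2

# Normalization constant of equation (5.176).
k = 1.0 / (2.0 * np.pi * i0(kappa))

# Numerical check that the von Mises volumetric probability (5.175) is normed.
theta = np.linspace(0.0, 2.0 * np.pi, 100_001)
f = k * np.exp(np.cos(theta) / sigma**2)
dtheta = theta[1] - theta[0]
print("integral over the circle:", np.sum(f[:-1]) * dtheta)   # close to 1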

Figure 5.22: The circular (von Mises) distribution corresponds to the intersection of a 2D Gaussian by a circle passing by the center of the Gaussian. Here, the unit circle has been represented, and two Gaussians with standard deviations σ = 1 (left) and σ = 1/2 (right). In fact, this is my preferred representation of the von Mises distribution, rather than the conventional functional display of figure 5.23.

Figure 5.23: The circular (von Mises) distribution, drawn for two full periods, centered at zero, and with values of σ equal to 2 , √2 , 1 , 1/√2 , 1/2 (from smooth to sharp).

The Fisher Probability Distribution

Note: mention here Fisher (1953).


As already mentioned in example 3.20, and demonstrated in section 5.3.14 here below, the conditional volumetric probability induced over the surface of a sphere by a 3D Gaussian is, using spherical coordinates,

f(θ, ϕ) = k exp( cos θ / σ^2 ) .    (5.177)

We can normalize this volumetric probability by

∫ dS(θ, ϕ) f(θ, ϕ) = 1 ,    (5.178)

with dS(θ, ϕ) = sin θ dθ dϕ . This gives

k = 1 / ( 4π χ(1/σ^2) ) ,    (5.179)

where

χ(x) = sinh(x) / x .    (5.180)
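As an illustration (not from the text), directions distributed according to the Fisher distribution 5.177 can be generated by drawing ϕ uniformly and drawing c = cos θ from the density proportional to exp(c/σ^2) on [−1, 1], whose cumulative function is easily inverted. A minimal Python sketch, with an arbitrary value of σ :

import numpy as np

rng = np.random.default_rng(0)

sigma = 0.5                      # arbitrary choice (concentration kappa = 1/sigma^2)
kappa = 1.0 / sigma**2
n = 100_000

# cos(theta) has density proportional to exp(kappa * c) on [-1, 1];
# invert its cumulative function analytically.
u = rng.uniform(size=n)
c = np.log(np.exp(-kappa) + u * (np.exp(kappa) - np.exp(-kappa))) / kappa

phi = rng.uniform(0.0, 2.0 * np.pi, size=n)
s = np.sqrt(1.0 - c**2)
x, y, z = s * np.cos(phi), s * np.sin(phi), c    # unit vectors on the sphere

print("mean direction:", x.mean(), y.mean(), z.mean())   # concentrated around the pole (0, 0, 1)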


5.3.14 Fisher from Gaussian (Demonstration)

Let us demonstrate here that the Fisher probability distribution is obtained as the conditional of a Gaussian probability distribution over a sphere. As the demonstration is independent of the dimension of the space, let us take a space with n dimensions, where the (generalized) geographical coordinates^{14} are

x^1 = r cos λ cos λ_2 cos λ_3 cos λ_4 · · · cos λ_{n−2} cos λ_{n−1}
x^2 = r cos λ cos λ_2 cos λ_3 cos λ_4 · · · cos λ_{n−2} sin λ_{n−1}
· · ·
x^{n−2} = r cos λ cos λ_2 sin λ_3
x^{n−1} = r cos λ sin λ_2
x^n = r sin λ .    (5.181)

We shall consider the unit sphere at the origin, and an isotropic Gaussian probability distribution with standard deviation σ , with its center along the x^n axis, at position x^n = 1 .

The Gaussian volumetric probability, when expressed as a function of the Cartesian coordinates, is

f_x(x^1, . . . , x^n) = k exp( − ( (x^1)^2 + (x^2)^2 + · · · + (x^{n−1})^2 + (x^n − 1)^2 ) / (2σ^2) ) .    (5.182)

As the volumetric probability is an invariant, to express it using the geographical coordinates we just need to use the replacements 5.181, to obtain

f_r(r, λ, λ′, . . . ) = k exp( − ( r^2 cos^2 λ + (r sin λ − 1)^2 ) / (2σ^2) ) ,    (5.183)

i.e.,

f_r(r, λ, λ′, . . . ) = k exp( − ( r^2 + 1 − 2 r sin λ ) / (2σ^2) ) .    (5.184)

The condition to be on the sphere is just

r = 1 ,    (5.185)

so that the conditional volumetric probability, as given in equation 3.188, is just obtained (up to a multiplicative constant) by setting r = 1 in equation 5.184,

f(λ, λ′, . . . ) = k′ exp( (sin λ − 1) / σ^2 ) ,    (5.186)

14 The geographical coordinates (longitude and latitude) generalize much better to high dimensions than the more usual spherical coordinates.


i.e., absorbing the constant exp(−1/σ^2) into the norming factor,

f(λ, λ′, . . . ) = k′′ exp( sin λ / σ^2 ) .    (5.187)

This volumetric probability corresponds to the n-dimensional version of the Fisher distribution. Its expression is identical in all dimensions, only the norming constant depends on the dimension of the space.


5.3.15 Probability Distributions for Tensors

In this appendix we consider a symmetric second rank tensor, like the stress tensor σ of continuum mechanics.

A symmetric tensor, σ_{ij} = σ_{ji} , has only six degrees of freedom, while it has nine components. It is important, for the development that follows, to agree on a proper definition of a set of 'independent components'. This can be done, for instance, by defining the following six-dimensional basis for symmetric tensors:

e_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} ;  e_2 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} ;  e_3 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}    (5.188)

e_4 = (1/√2) \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} ;  e_5 = (1/√2) \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix} ;  e_6 = (1/√2) \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} .    (5.189)

Then, any symmetric tensor can be written as

σ = s^α e_α ,    (5.190)

and the six values s^α are the six 'independent components' of the tensor, in terms of which the tensor writes

σ = \begin{pmatrix} s^1 & s^6/√2 & s^5/√2 \\ s^6/√2 & s^2 & s^4/√2 \\ s^5/√2 & s^4/√2 & s^3 \end{pmatrix} .    (5.191)

The only natural definition of distance between two tensors is the norm of their difference, so we can write

D(σ_2, σ_1) = ‖ σ_2 − σ_1 ‖ ,    (5.192)

where the norm of a tensor σ is^{15}

‖ σ ‖ = √( σ_{ij} σ^{ji} ) .    (5.193)

The basis in equation 5.189 is normed with respect to this norm^{16}. In terms of the independent components in expression 5.191 the norm of a tensor simply becomes

‖ σ ‖ = √( (s^1)^2 + (s^2)^2 + (s^3)^2 + (s^4)^2 + (s^5)^2 + (s^6)^2 ) .    (5.194)

15 Of course, as, here, σ_{ij} = σ_{ji} , one can also write ‖ σ ‖ = √( σ_{ij} σ^{ij} ) , but this expression is only valid for symmetric tensors, while the expression 5.193 is generally valid.

16 It is also orthonormed, with the obvious definition of scalar product from which this norm derives.


Equation 5.194 shows that the six components s^α play the role of Cartesian coordinates of this 6D space of tensors.
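The following Python sketch (an illustration, not from the text) implements the correspondence 5.190-5.191 between a symmetric tensor and its six independent components, and checks numerically that the tensor norm 5.193 equals the Euclidean norm 5.194 of the components.

import numpy as np

SQ2 = np.sqrt(2.0)

def tensor_from_components(s):
    """Build the symmetric tensor of equation (5.191) from s = (s1, ..., s6)."""
    s1, s2, s3, s4, s5, s6 = s
    return np.array([[s1,        s6 / SQ2,  s5 / SQ2],
                     [s6 / SQ2,  s2,        s4 / SQ2],
                     [s5 / SQ2,  s4 / SQ2,  s3      ]])

def components_from_tensor(t):
    """Inverse mapping: read the six independent components off a symmetric tensor."""
    return np.array([t[0, 0], t[1, 1], t[2, 2],
                     SQ2 * t[1, 2], SQ2 * t[0, 2], SQ2 * t[0, 1]])

rng = np.random.default_rng(0)
s = rng.normal(size=6)                   # arbitrary components
sigma = tensor_from_components(s)

# Norm (5.193) of the tensor versus Euclidean norm (5.194) of the components.
print(np.sqrt(np.sum(sigma * sigma)), np.linalg.norm(s))   # the two numbers coincide
print(np.allclose(components_from_tensor(sigma), s))       # round trip is exact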

A Gaussian volumetric probability in this space has then, obviously, the form

f_s(s) = k exp( − Σ_{α=1}^{6} (s^α − s_0^α)^2 / (2ρ^2) ) ,    (5.195)

or, more generally,

f_s(s) = k exp( − (1/(2ρ^2)) (s^α − s_0^α) W_{αβ} (s^β − s_0^β) ) .    (5.196)

It is easy to find probabilistic models for tensors, when we choose as coordinates the independent components of the tensor, as this Gaussian example suggests. But a symmetric second rank tensor may also be described using its three eigenvalues λ_1, λ_2, λ_3 and the three Euler angles ψ, θ, ϕ defining the eigenvectors' directions:

\begin{pmatrix} s^1 & s^6/√2 & s^5/√2 \\ s^6/√2 & s^2 & s^4/√2 \\ s^5/√2 & s^4/√2 & s^3 \end{pmatrix} = R(ψ) R(θ) R(ϕ) \begin{pmatrix} λ_1 & 0 & 0 \\ 0 & λ_2 & 0 \\ 0 & 0 & λ_3 \end{pmatrix} R(ϕ)^T R(θ)^T R(ψ)^T ,    (5.197)

where R denotes the usual rotation matrix. Some care is required when using the coordinates λ_1, λ_2, λ_3, ψ, θ, ϕ .

To write a Gaussian volumetric probability in terms of eigenvalues and eigendirections only requires, of course, to insert in the f_s(s) of equation 5.196 the expression 5.197 giving the tensor components as a function of the eigenvalues and eigendirections (we consider volumetric probabilities, which are invariant, and not probability densities, which would require an extra multiplication by the Jacobian determinant of the transformation),

f(λ_1, λ_2, λ_3, ψ, θ, ϕ) = f_s(s^1, s^2, s^3, s^4, s^5, s^6) .    (5.198)
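A small Python sketch of equations 5.197-5.198 (an illustration only; the z-y-z Euler convention and all numerical values are assumptions, not specified in the text): build the tensor from eigenvalues and Euler angles, read off its six components, and evaluate the Gaussian 5.195 there.

import numpy as np

def Rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def tensor_from_eigen(lam, psi, theta, phi):
    """Equation (5.197), assuming a z-y-z Euler convention for the rotations."""
    R = Rz(psi) @ Ry(theta) @ Rz(phi)
    return R @ np.diag(lam) @ R.T

def components(t):
    """Six independent components of a symmetric tensor (equation 5.191)."""
    sq2 = np.sqrt(2.0)
    return np.array([t[0, 0], t[1, 1], t[2, 2],
                     sq2 * t[1, 2], sq2 * t[0, 2], sq2 * t[0, 1]])

def f_gaussian(s, s0, rho):
    """Unnormalized isotropic Gaussian (5.195) in the component space."""
    return np.exp(-np.sum((s - s0) ** 2) / (2.0 * rho ** 2))

# Evaluate f(lambda, angles) = f_s(s(lambda, angles)), equation (5.198).
lam = np.array([3.0, 2.0, 1.0])              # arbitrary eigenvalues
sigma = tensor_from_eigen(lam, 0.3, 0.7, 0.1)
s0 = np.zeros(6)                             # arbitrary center and dispersion
print(f_gaussian(components(sigma), s0, rho=1.0))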

But then, of course, we still need to know how to integrate in the space using these new coordinates, in order to evaluate probabilities.

Before facing this problem, let us remark that it is the replacement in equation 5.196 of the components s^α in terms of the eigenvalues and eigendirections of the tensor that expresses a Gaussian probability distribution in terms of the variables λ_1, λ_2, λ_3, ψ, θ, ϕ . Using a function that would 'look Gaussian' in the variables λ_1, λ_2, λ_3, ψ, θ, ϕ would not correspond to a Gaussian probability distribution, in the sense of section 5.3.13.2.


The Jacobian of the transformation {s^1, s^2, s^3, s^4, s^5, s^6} → {λ_1, λ_2, λ_3, ψ, θ, ϕ} can be obtained using a direct computation, that gives^{17}

| ∂(s^1, s^2, s^3, s^4, s^5, s^6) / ∂(λ_1, λ_2, λ_3, ψ, θ, ϕ) | = (λ_1 − λ_2) (λ_2 − λ_3) (λ_3 − λ_1) sin θ .    (5.199)

The capacity elements (here written with an overbar) in the two systems of coordinates are

dv̄_s(s^1, s^2, s^3, s^4, s^5, s^6) = ds^1 ∧ ds^2 ∧ ds^3 ∧ ds^4 ∧ ds^5 ∧ ds^6
dv̄(λ_1, λ_2, λ_3, ψ, θ, ϕ) = dλ_1 ∧ dλ_2 ∧ dλ_3 ∧ dψ ∧ dθ ∧ dϕ .    (5.200)

As the coordinates s^α are Cartesian, the volume element of the space is numerically identical to the capacity element,

dv_s(s^1, s^2, s^3, s^4, s^5, s^6) = dv̄_s(s^1, s^2, s^3, s^4, s^5, s^6) ,    (5.201)

but in the coordinates λ_1, λ_2, λ_3, ψ, θ, ϕ the volume element and the capacity element are related via the Jacobian determinant in equation 5.199,

dv(λ_1, λ_2, λ_3, ψ, θ, ϕ) = (λ_1 − λ_2) (λ_2 − λ_3) (λ_3 − λ_1) sin θ  dv̄(λ_1, λ_2, λ_3, ψ, θ, ϕ) .    (5.202)

Then, while the evaluation of a probability in the variables s^1, s^2, s^3, s^4, s^5, s^6 should be done via

P = ∫ dv_s(s^1, . . . , s^6) f_s(s^1, . . . , s^6) = ∫ ds^1 ∧ ds^2 ∧ ds^3 ∧ ds^4 ∧ ds^5 ∧ ds^6 f_s(s^1, . . . , s^6) ,    (5.203)

in the variables λ_1, λ_2, λ_3, ψ, θ, ϕ it should be done via

P = ∫ dv(λ_1, λ_2, λ_3, ψ, θ, ϕ) f(λ_1, λ_2, λ_3, ψ, θ, ϕ)
  = ∫ dλ_1 ∧ dλ_2 ∧ dλ_3 ∧ dψ ∧ dθ ∧ dϕ (λ_1 − λ_2) (λ_2 − λ_3) (λ_3 − λ_1) sin θ f(λ_1, λ_2, λ_3, ψ, θ, ϕ) .    (5.204)

To conclude this appendix, we may remark that the homogeneous probability distribution (defined as the one that is 'proportional to the volume distribution') is obtained by taking both f_s(s^1, s^2, s^3, s^4, s^5, s^6) and f(λ_1, λ_2, λ_3, ψ, θ, ϕ) as constants.

[Note: I should explain somewhere that there is a complication when, instead of considering 'a tensor like the stress tensor', one considers a positive tensor (like an electric permittivity tensor). The treatment above applies approximately to the logarithm of such a tensor.]

17 If instead of the 3 Euler angles, we take 3 rotations around the three coordinate axes, the sin θ here above becomes replaced by the cosine of the second angle. This is consistent with the formula by Xu and Grafarend (1997).


5.3.16 Homogeneous Distribution of Second Rank Tensors

The usual definition of the norm of a tensor provides the only natural definition of distance in the space of all possible tensors. This shows that, when using a Cartesian system of coordinates, the components of a tensor are the 'Cartesian coordinates' in the 6D space of symmetric tensors. The homogeneous distribution is then represented by a constant (nonnormalizable) probability density:

f(σ_xx, σ_yy, σ_zz, σ_xy, σ_yz, σ_zx) = k .    (5.205)

Instead of using the components, we may use the three eigenvalues λ_1, λ_2, λ_3 of the tensor and the three Euler angles ψ, θ, ϕ defining the orientation of the eigendirections in the space. As the Jacobian of the transformation

{σ_xx, σ_yy, σ_zz, σ_xy, σ_yz, σ_zx} → {λ_1, λ_2, λ_3, ψ, θ, ϕ}    (5.206)

is

| ∂(σ_xx, σ_yy, σ_zz, σ_xy, σ_yz, σ_zx) / ∂(λ_1, λ_2, λ_3, ψ, θ, ϕ) | = (λ_1 − λ_2)(λ_2 − λ_3)(λ_3 − λ_1) sin θ ,    (5.207)

the homogeneous probability density 5.205 transforms into

g(λ_1, λ_2, λ_3, ψ, θ, ϕ) = k (λ_1 − λ_2)(λ_2 − λ_3)(λ_3 − λ_1) sin θ .    (5.208)

Although this is not obvious, this probability density is isotropic in spatial directions (i.e., the 3D referentials defined by the three Euler angles are isotropically distributed). In this sense, we recover 'isotropy' as a special case of 'homogeneity'.

The rule ??, imposing that any probability density on the variables λ_1, λ_2, λ_3, ψ, θ, ϕ has to tend to the homogeneous probability density 5.208 when the 'dispersion parameters' tend to infinity, imposes a strong constraint on the form of acceptable probability densities, a constraint that is generally overlooked.

For instance, a Gaussian model for the variables σ_xx, σ_yy, σ_zz, σ_xy, σ_yz, σ_zx is consistent (as the limit of a Gaussian is a constant). This induces, via the Jacobian rule, a probability density for the variables λ_1, λ_2, λ_3, ψ, θ, ϕ , a probability density that is not simple, but consistent. A Gaussian model for the parameters λ_1, λ_2, λ_3, ψ, θ, ϕ would not be consistent.


5.3.17 Center of a Probability Distribution

Let M be an n-dimensional manifold, and let P, Q, . . . represent points of M . The manifold is assumed to have a metric defined over it, i.e., the distance between any two points P and Q is defined, and denoted D(Q, P) . Of course, D(Q, P) = D(P, Q) .

A normalized probability distribution P is defined over M , represented by the volumetric probability f . The probability of D ⊂ M is obtained, using the notations of equation 3.119, as

P(D) = ∫_{P∈D} dv(P) f(P) .    (5.209)

If ψ(P) is a scalar (invariant) function defined over M , its average value is denoted 〈ψ〉 , and is defined as

〈ψ〉 ≡ ∫_{P∈M} dv(P) f(P) ψ(P) .    (5.210)

This clearly corresponds to the intuitive notion of 'average'.

Let p be a real number in the range 1 ≤ p < ∞ . To any point P we can associate the quantity (having the dimension of a length)

σ_p(P) = ( ∫_{Q∈M} dv(Q) f(Q) D(Q, P)^p )^{1/p} .    (5.211)

Definition 5.1 The point^{18} where σ_p(P) attains its minimum value is called the L_p-norm center of the probability distribution f(P) , and it is denoted P_p .

Definition 5.2 The minimum value of σ_p(P) is called the L_p-norm radius of the probability distribution f(P) , and it is denoted σ_p .

The interpretation of these definitions is simple. Take, for instance, p = 1 . Comparing the two equations 5.210–5.211, we see that, for a fixed point P , the quantity σ_1(P) corresponds to the average of the distances from the point P to all the points. The point P that minimizes this average distance is 'at the center' of the distribution (in the L_1-norm sense). For p = 2 , it is the average of the squared distances that is minimized, etc.

The following terminology shall be used:

• P1 is called the median, and σ1 is called the mean deviation;

18 If there is more than one point where σ_p(P) attains its minimum value, any such point is called a center (in the L_p-norm sense) of the probability distribution f(P) .


• P_2 is called the barycenter (or the center, or the mean), and σ_2 is called the standard deviation (while its square is called the variance);

• P_∞ is called^{19} the circumcenter, and σ_∞ is called the circumradius.

Calling P_∞ and σ_∞ respectively the 'circumcenter' and the 'circumradius' seems justified when considering, in the Euclidean plane, a volumetric probability that is constant inside a triangle, and zero outside. The 'circumcenter' of the probability distribution is then the circumcenter of the triangle, in the usual geometrical sense, and the 'circumradius' of the probability distribution is the radius of the circumscribed circle^{20}. More generally, the circumcenter of a probability distribution is always at the point that minimizes the maximum distance to all other points, and the circumradius of the probability distribution is this 'minimax' distance.

Example 5.15 Consider a one-dimensional space N , with a coordinate ν , such that the distance between the point ν_1 and the point ν_2 is

D(ν_2, ν_1) = | log( ν_2 / ν_1 ) | .    (5.212)

As suggested in XXX, the space N could be the space of musical notes, and ν the frequency of a note. Then, this distance is just (up to a multiplicative factor) the usual distance between notes, as given by the number of 'octaves'. Consider a normalized volumetric probability f(ν) , and let us be interested in the L_2-norm criterion. For p = 2 , equation 5.211 can be written

( σ_2(µ) )^2 = ∫_0^∞ ds(ν) f(ν) ( log( ν / µ ) )^2 .    (5.213)

The L_2-norm center of the probability distribution, i.e., the value ν_2 at which σ_2(µ) is minimum, is easily found^{21} to be

ν_2 = ν_0 exp( ∫_0^∞ ds(ν) f(ν) log( ν / ν_0 ) ) ,    (5.214)

where ν_0 is an arbitrary constant (in fact, and by virtue of the properties of the log-exp functions, the value ν_2 is independent of this constant).

19 The L_∞-norm center and radius are defined as the limit p → ∞ of the L_p-norm center and radius.

20 The circumscribed circle is the circle that contains the three vertices of the triangle. Its center (called circumcenter) is at the point where the perpendicular bisectors of the sides cross.

21 The minimization of the function σ_2(µ) is equivalent to the minimization of (σ_2(µ))^2 , and this gives the condition ∫ ds(ν) f(ν) log(ν/µ) = 0 . For any constant ν_0 , this is equivalent to ∫ ds(ν) f(ν) ( log(ν/ν_0) − log(µ/ν_0) ) = 0 , i.e., log(µ/ν_0) = ∫ ds(ν) f(ν) log(ν/ν_0) , from where the result follows. The constant ν_0 is necessary in these equations for reasons of physical dimensions (only the logarithm of adimensional quantities is defined).


This mean value ν_2 corresponds to what in statistical theory is called the 'geometric mean'. The variance of the distribution, i.e., the value of the expression 5.213 at its minimum, is

( σ_2 )^2 = ∫_0^∞ ds(ν) f(ν) ( log( ν / ν_2 ) )^2 .    (5.215)

The distance element associated to the distance in equation 5.212 is, clearly, ds(ν) = dν/ν , and the probability density associated to f(ν) is f̄(ν) = f(ν)/ν , so, in terms of the probability density f̄(ν) , equation 5.214 becomes

ν_2 = ν_0 exp( ∫_0^∞ dν f̄(ν) log( ν / ν_0 ) ) ,    (5.216)

while equation 5.215 becomes

( σ_2 )^2 = ∫_0^∞ dν f̄(ν) ( log( ν / ν_2 ) )^2 .    (5.217)

The reader shall easily verify that if, instead of the variable ν , one chooses to use the logarithmic variable ν* = log(ν/ν_0) , where ν_0 is an arbitrary constant (perhaps the same as above), then instead of the six expressions 5.212–5.217 we would have obtained, respectively,

s(ν*_2, ν*_1) = | ν*_2 − ν*_1 |
( σ_2(µ*) )^2 = ∫_{−∞}^{+∞} ds(ν*) f(ν*) (ν* − µ*)^2
ν*_2 = ∫_{−∞}^{+∞} ds(ν*) f(ν*) ν*
( σ_2 )^2 = ∫_{−∞}^{+∞} ds(ν*) f(ν*) (ν* − ν*_2)^2    (5.218)

ν*_2 = ∫_{−∞}^{+∞} dν* f̄(ν*) ν*    (5.219)

and

( σ_2 )^2 = ∫_{−∞}^{+∞} dν* f̄(ν*) (ν* − ν*_2)^2 ,    (5.220)

with, for this logarithmic variable, ds(ν*) = dν* and f̄(ν*) = f(ν*) . The two last expressions are the ordinary equations used to define the mean and the variance in elementary texts.
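As a small numerical illustration (not from the text), the following Python sketch draws frequencies from a lognormal-like distribution and compares the L_2-norm center 5.216, computed as the geometric mean, with the ordinary mean of the logarithmic variable ν* (equation 5.219); all numerical values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

nu0 = 440.0                       # arbitrary reference frequency (Hz)
# Arbitrary sample of 'frequencies': lognormal around nu0.
nu = nu0 * np.exp(rng.normal(loc=0.3, scale=0.8, size=100_000))

# L2-norm center (5.216): the geometric mean of nu.
nu2 = nu0 * np.exp(np.mean(np.log(nu / nu0)))

# Same thing computed with the logarithmic variable nu* (5.219).
nu_star = np.log(nu / nu0)
nu2_from_star = nu0 * np.exp(np.mean(nu_star))

print(nu2, nu2_from_star)                         # identical
print("radius sigma_2:", np.std(nu_star))         # square root of (5.220)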

Example 5.16 Consider a one-dimensional space, with a coordinate χ , the distance between two points χ_1 and χ_2 being denoted D(χ_2, χ_1) . Then, the associated length element is


dℓ(χ) = D( χ + dχ , χ ) . Finally, consider a (1D) volumetric probability f(χ) , and let us be interested in the L_1-norm case. Assume that χ runs from a minimum value χ_min to a maximum value χ_max (both could be infinite). For p = 1 , equation 5.211 can be written

σ_1(χ) = ∫ dℓ(χ′) f(χ′) D(χ′, χ) .    (5.221)

Denoting by χ_1 the median, i.e., the point where σ_1(χ) is minimum, one easily^{22} finds that χ_1 is characterized by the property that it separates the line into two domains of equal probability, i.e.,

∫_{χ_min}^{χ_1} dℓ(χ) f(χ) = ∫_{χ_1}^{χ_max} dℓ(χ) f(χ) ,    (5.222)

an expression that can readily be used for an actual computation of the median, and which corresponds to its elementary definition. The mean deviation is then given by

σ_1 = ∫_{χ_min}^{χ_max} dℓ(χ) f(χ) D(χ, χ_1) .    (5.223)

Example 5.17 Consider the same situation as in the previous example, but let us now be interested in the L_∞-norm case. Let χ_min and χ_max be the minimum and the maximum values of χ for which f(χ) ≠ 0 . It can be shown that the circumcenter of the probability distribution is the point χ_∞ that separates the interval [χ_min, χ_max] into two intervals of equal length, i.e., satisfying the condition

D(χ_∞, χ_min) = D(χ_max, χ_∞) ,    (5.224)

and that the circumradius is

σ_∞ = D(χ_max, χ_min) / 2 .    (5.225)

Example 5.18 Consider, in the Euclidean n-dimensional space E_n , with Cartesian coordinates x = {x^1, . . . , x^n} , a normalized volumetric probability f(x) , and let us be interested in the L_2-norm case. For p = 2 , equation 5.211 can be written, using obvious notations,

( σ_2(y) )^2 = ∫ dx f(x) ‖ x − y ‖^2 .    (5.226)

22 In fact, the property 5.222 of the median being intrinsic (independent of any coordinate system), we can limit ourselves to demonstrate it using a special 'Cartesian' coordinate, where dℓ(x) = dx , and D(x_1, x_2) = |x_2 − x_1| , where the property is easy to demonstrate (and well known).


Let x_2 denote the mean of the probability distribution, i.e., the point where σ_2(y) is minimum (or, equivalently, where ( σ_2(y) )^2 is minimum). The condition of minimum (the vanishing of the derivatives) gives ∫ dx f(x) (x − x_2) = 0 , i.e.,

x_2 = ∫ dx f(x) x ,    (5.227)

which is an elementary definition of mean. The variance of the probability distribution is then

( σ_2 )^2 = ∫ dx f(x) ‖ x − x_2 ‖^2 .    (5.228)

In the context of this example, we can define the covariance tensor

C = ∫ dx f(x) (x − x_2) ⊗ (x − x_2) .    (5.229)

Note that equation 5.227 and equation 5.229 can be written, using indices, as

x_2^i = ∫ dx^1 ∧ · · · ∧ dx^n f(x^1, . . . , x^n) x^i ,    (5.230)

and

C^{ij} = ∫ dx^1 ∧ · · · ∧ dx^n f(x^1, . . . , x^n) (x^i − x_2^i) (x^j − x_2^j) .    (5.231)


5.3.18 Dispersion of a Probability Distribution


5.3.19 Monte Carlo (Sampling) Methods

5.3.19.1 Random Walks and the Metropolis Rule

Blind random search in multidimensional spaces may be very inefficient, as already mentioned. This is why, when the probability distribution to be sampled is relatively uncomplicated, one may use a 'random walk', a sort of Brownian motion where the probability of getting lost in the vast emptiness of multidimensional spaces is kept low. When this works, it works very well, but it is not a panacea: for really complicated probability distributions, with isolated regions of significant probability, it may not work at all: the discovery of these isolated regions is an intrinsically difficult problem, where mathematics alone is not of much help. It is only the careful consideration of the physics involved in the problem, and of the particular properties of the probability distribution, that may suggest some strategy. This strategy shall be problem dependent.

In what follows, then, we concentrate on moderately complicated probability distributions, where random walks are appropriate for sampling. We analyze here random walks without memory: each step depends only on the last step. Such a walk without memory is technically called a Markov chain Monte Carlo (MCMC) random walk.

5.3.19.2 Modification of Random Walks

Assume here that we can start with a random walk that samples some normalized volumetric probability f(P) , and have the goal of obtaining a random walk that samples the volumetric probability

h(P) = (1/ν) f(P) g(P) ,    (5.232)

i.e., the conjunction of f with some other volumetric probability g . Here, ν is the normalizing factor ν = ∫_M dv(P) f(P) g(P) .

Call P_i the 'current point'. With this current point as starting point, run one step of the random walk that, unimpeded, would sample the volumetric probability f(P) , to generate a 'test point' P_test . Compute the value

g(P_test) .    (5.233)

If this value is 'high enough', let the point P_test 'survive'. If g(P_test) is not 'high enough', discard this point and generate another one (making another step of the random walk sampling the prior volumetric probability f(P) , using again P_i as starting point).

There are many criteria for deciding when a point should survive or should be discarded, all of them resulting in a collection of 'surviving points' that are samples


of the target volumetric probability h(P) . For instance, if we know the maximum possible value of g(P) , say g_max , then define

P_test = g(P_test) / g_max ,    (5.234)

and give the point P_test the probability P_test of survival (note that 0 < P_test < 1 ). It is intuitively obvious why the random walk modified using such a criterion produces a random walk that actually samples the volumetric probability h(P) defined by equation 5.232.

Among the many criteria that can be used, by far the most efficient is the Metropolis criterion, the criterion behind the Metropolis algorithm (Metropolis et al., 1953). In the following we shall describe this algorithm in some detail.

5.3.19.3 The Metropolis Rule

Consider the following situation. Some random rules define a random walk that samples the volumetric probability f(P) . At a given step, the random walker is at point P_i , and the application of the rules would lead to a transition to point P_j . If that 'proposed transition' P_i → P_j is always accepted, the random walker will sample the volumetric probability f(P) . Instead of always accepting the proposed transition P_i → P_j , we reject it sometimes by using the following rule to decide if the random walker is allowed to move to P_j or if it must stay at P_i :

• if g(P_j) ≥ g(P_i) , then accept the proposed transition to P_j ;

• if g(P_j) < g(P_i) , then decide randomly to move to P_j , or to stay at P_i , with the following probability of accepting the move to P_j :

P = g(P_j) / g(P_i) .    (5.235)

Then we have the following

Theorem 5.1 The random walker samples the conjunction h(P) of the volumetric probabilities f(P) and g(P) ,

h(P) = k f(P) g(P)    (5.236)

(see appendix ?? for a demonstration).

The algorithm above is reminiscent (see appendix ??) of the Metropolis algorithm (Metropolis et al., 1953), originally designed to sample the Gibbs-Boltzmann distribution. Accordingly, we will refer to the above acceptance rule as the Metropolis rule.
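A minimal Python sketch of this acceptance rule (an illustration only, not the text's implementation; all numerical choices are arbitrary): the unimpeded walk samples a broad 1-D Gaussian f directly, and the Metropolis test on g concentrates the walk on the conjunction h = k f g .

import numpy as np

rng = np.random.default_rng(0)

# f(P): a broad 1-D Gaussian that our unimpeded 'random walk' samples directly
# (each unimpeded step is simply an independent draw from f).
def sample_f():
    return rng.normal(0.0, 2.0)

# g(P): another (unnormalized) volumetric probability.
def g(p):
    return np.exp(-0.5 * (p - 3.0) ** 2)

# Metropolis rule: accept the proposed transition always if g increases,
# and with probability g(P_j)/g(P_i) otherwise.
current = sample_f()
samples = []
for _ in range(200_000):
    proposed = sample_f()
    if g(proposed) >= g(current) or rng.uniform() < g(proposed) / g(current):
        current = proposed
    samples.append(current)

samples = np.array(samples)
# The chain samples h = k f g; for these two Gaussians the conjunction has
# mean 2.4 and standard deviation sqrt(0.8), about 0.894.
print(samples.mean(), samples.std())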


5.3.19.4 The Cascaded Metropolis Rule

As above, assume that some random rules define a random walk that samples the volumetric probability f_1(P) . At a given step, the random walker is at point P_i ;

1. apply the rules that, unthwarted, would generate samples of f_1(P) , to propose a new point P_j ;

2. if f_2(P_j) ≥ f_2(P_i) , go to point 3; if f_2(P_j) < f_2(P_i) , then decide randomly to go to point 3 or to go back to point 1, with the following probability of going to point 3: P = f_2(P_j) / f_2(P_i) ;

3. if f_3(P_j) ≥ f_3(P_i) , go to point 4; if f_3(P_j) < f_3(P_i) , then decide randomly to go to point 4 or to go back to point 1, with the following probability of going to point 4: P = f_3(P_j) / f_3(P_i) ;

. . .

n. if f_n(P_j) ≥ f_n(P_i) , then accept the proposed transition to P_j ; if f_n(P_j) < f_n(P_i) , then decide randomly to move to P_j , or to stay at P_i , with the following probability of accepting the move to P_j : P = f_n(P_j) / f_n(P_i) .

Then we have the following

Theorem 5.2 The random walker samples the conjunction h(P) of the volumetric probabilities f_1(P), f_2(P), . . . , f_n(P) :

h(P) = k f_1(P) f_2(P) · · · f_n(P) .    (5.237)

(see appendix XXX for a demonstration).

5.3.19.5 Initiating a Random Walk

Consider the problem of obtaining samples of a volumetric probability h(P) defined as the conjunction of some volumetric probabilities f_1(P), f_2(P), f_3(P) . . . ,

h(P) = k f_1(P) f_2(P) f_3(P) . . . ,    (5.238)

and let us examine three common situations.

We may start with a random walk that actually samples f_1(P) . Then, a direct application of the cascaded Metropolis rule allows us to produce samples of h(P) .

Sometimes, we do not have readily available a random walk that samples f_1(P) . In that case, we rewrite expression 5.238 as

h(P) = k f_0(P) f_1(P) f_2(P) f_3(P) . . . ,    (5.239)


where f_0(P) is the homogeneous volumetric probability. Because we use volumetric probabilities (and not probability densities), f_0(P) is just a constant^{23}. Then, the first cascade of the cascaded Metropolis algorithm provides the random walk that samples f_1(P) , and we can proceed 'as usual'.

In the worst circumstance, we only have a random walk that samples some volumetric probability ψ(P) that is not of interest to us. Rewriting expression 5.238 as

h(P) = k ψ(P) ( f_0(P) / ψ(P) ) f_1(P) f_2(P) f_3(P) . . . ,    (5.240)

immediately suggests using the cascaded Metropolis algorithm to pass from a random walk that samples ψ(P) to a random walk that samples the homogeneous volumetric probability f_0(P) , then to a random walk that samples f_1(P) , and so on.

5.3.19.6 Choosing Random Directions and Step Lengths

A random walk is an iterative process where, from some 'current point', we may jump to a neighboring point. We must decide two things, the direction of the jump and its step length. Let us examine the two problems in turn.

5.3.19.6.1 Choosing Random Directions When the number of dimensions is small, a 'direction' in a space is something simple. This is not so when we work in large-dimensional spaces. Consider, for instance, the problem of choosing a direction in a space of functions. Of course, a space where each point is a function is infinite-dimensional, and we work here with finite-dimensional spaces, but we may just assume that we have discretized the functions using a large number of points, say 10 000 or 10 000 000 points.

If we are 'at the origin' of the space, i.e., at the point {0, 0, . . . } representing a function that is everywhere zero, we may decide to choose a direction pointing towards smooth functions, or fractal functions, Gaussian-like functions, functions having zero mean value, L_1 functions, L_2 functions, functions having a small number of large jumps, etc. This freedom of choice, typical of large-dimensional problems, has to be carefully analyzed, and it is indispensable to take advantage of it when designing random walks.

Assume that we are able to design a random walk that samples the volumetric probability f(P) , and we wish to modify it considering the values g(P) , using the Metropolis rule (or any equivalent rule), in order to obtain a random walk that samples

h(P) = k f(P) g(P) .    (5.241)

We can design many initial random walks that sample f(P) . Using the Metropolis modification of a random walk, we will always obtain a random walk that samples h(P) .

23 Whose value is the inverse of the total volume of the manifold.


A well designed initial random walk will 'present' to the Metropolis criterion test points P_test that have a large probability of being accepted (i.e., that have a large value of g(P_test) ). A poorly designed initial random walk will test points with a low probability of being accepted. Then, the algorithm is very slow in producing accepted points. Although a high acceptance probability can always be obtained with very small step lengths (if the volumetric probability to be sampled is smooth), we need to discover directions that give high acceptance ratios even for large step lengths.

5.3.19.6.2 Choosing Step Lengths Numerical algorithms are usually forced to compromise between some conflicting wishes. For instance, a gradient-based minimization algorithm has to select a finite step length along the direction of steepest descent. The larger the step length, the smaller may be the number of iterations required to reach the minimum, but if the step length is chosen too large, we may lose efficiency; we can even increase the value of the target function, instead of diminishing it.

The random walks contemplated here face exactly the same situation. The direction of the move is not deterministically calculated, but is chosen randomly, with the common-sense constraint discussed in the previous section. But once a direction has been decided, the size of the jump along this direction, that has to be submitted to the Metropolis criterion, has to be 'as large as possible', but not too large. Again, the 'Metropolis theorem' guarantees that the final random walk will sample the target probability distribution, but the better we are at choosing the step length, the more efficient the algorithm will be.

In practice, a neighborhood size giving an acceptance rate of 30%–60% (for the final, posterior sampler) can be recommended.


5.4 APPENDICES FOR INVERSE PROBLEMS


5.4.1 Inverse Problems

[Note: Complete and expand what follows.]

In the so-called 'inverse problems', values of the parameters describing physical systems are estimated, using as data some indirect measurements. A consistent formulation of inverse problems can be made using the concepts of probability theory. Data and attached uncertainties, (a possibly vague) a priori information on model parameters, and a physical theory relating the model parameters to the observations are the fundamental elements of any inverse problem. While the most general solution of the inverse problem requires extensive use of Monte Carlo methods, special hypotheses (e.g., Gaussian uncertainties) allow one, in some cases, to solve part of the problem analytically (e.g., using the method of least squares).

Given a physical system, the 'forward' or 'direct' problem consists, by definition, in using a physical theory to predict the outcome of possible experiments. In classical physics, this problem has a unique solution. For instance, given a seismic model of the whole Earth (elastic constants, attenuation, etc. at every point inside the Earth) and given a model of a seismic source, we can use current seismological theories to predict which seismograms should be observed at given locations at the Earth's surface.

The 'inverse problem' arises when we do not have a good model of the Earth, or a good model of the seismic source, but we have a set of seismograms, and we wish to use these observations to infer the internal Earth structure or a model of the source (typically we try to infer both).

There are many reasons that make the inverse problem underdetermined (the solution is not unique). In the seismic example, two different Earth models may predict the same seismograms^{24}, the finite bandwidth of our data sets will never allow us to resolve very small features of the Earth model, and there are always experimental uncertainties that allow different models to be 'acceptable'.

The name 'inverse problem' is widely accepted. I only like this name moderately, as I see the problem more as a problem of 'conjunction of states of information' (theoretical, experimental and a priori information). In fact, the equations used below have a range of applicability well beyond 'inverse problems': they can be used, for instance, to predict the values of observations in a realistic situation where the parameters describing the Earth model are not 'given', but only known approximately.

In fact, I like to think of an 'inverse' problem as merely a 'measurement': a measurement that can be quite complex, but the basic principles and the basic equations to be used are the same for a relatively complex 'inverse problem' as for a relatively simple 'measurement'.

24 For instance, we could fit our observations with a heterogeneous but isotropic Earth model or, alternatively, with a homogeneous but anisotropic Earth.


5.4.1.1 Model Parameters and Observable Parameters

Although the separation of all the variables of a problem into two groups may sometimes be artificial, we take this point of view here, since it allows us to propose a simple setting for a wide class of problems.

We may have in mind a given physical system, like the whole Earth, or a small crystal under our microscope. The system (or a given state of the system) may be described by assigning values to a given set of parameters m = {m^1, m^2, . . . , m^NM} that we will name the model parameters.

Let us assume that we make observations on this system. Although we are interested in the parameters m , they may not be directly observable, so we may make some indirect measurement, like obtaining seismograms at the Earth's surface for analyzing the Earth's interior, or making spectroscopic measurements for analyzing the chemical properties of a crystal. The set of observable parameters will be represented by o = {o^1, o^2, . . . , o^NO} .

We assume that we have a physical theory that solves the forward problem, i.e., that given an arbitrary model m , it allows us to predict the theoretical data values o that an ideal measurement should produce (if m were the actual system). The generally nonlinear function that associates to any model m the theoretical data values o may be represented by a notation like

o^i = o^i(m^1, m^2, . . . , m^NM) ;   i = 1, 2, . . . , NO ,    (5.242)

or, for short,

o = o(m) .    (5.243)

In fact, it is this expression that separates the whole set of our parameters into the subsets o and m , as sometimes there is no difference of nature between the parameters in o and the parameters in m . For instance, in the classical inverse problem of estimating the hypocenter coordinates of an earthquake, we may put in o the arrival times of the seismic waves at some seismic observatories, and we need to put in m the coordinates of the observatories (as these are parameters that are needed to compute the travel times), although we estimate arrival times of waves as well as coordinates of the observatories using similar types of measurements.

5.4.1.2 A Priori Information on Model Parameters

In a typical geophysical problem, the model parameters contain geometrical parameters (positions and sizes of geological bodies) and physical parameters (values of the mass density, of the elastic parameters, the temperature, the porosity, etc.).

The a priori information on these parameters is all the information we possess independently of the particular measurements that will be considered as 'data' (to be described below). This probability distribution is, generally, quite complex, as the model space may be high dimensional, and the parameters may have nonstandard probability densities.


To this, generally complex, probability distribution over the model space corresponds a volumetric probability that we denote as ρ_prior(m) .

If an explicit expression for the volumetric probability ρ_prior(m) is known, then it can be used in analytical developments. But such an explicit expression is, by no means, necessary. All that is needed is a set of probabilistic rules that allows us to generate samples of ρ_prior(m) in the model space (random samples distributed according to ρ_prior(m) ).

Example 5.19 Gaussian a priori Information. Of course, the simplest example of a probability distribution is the Gaussian (or 'normal') distribution. Not many physical parameters accept the Gaussian as a probabilistic model (we have, in particular, seen that many positive parameters are Jeffreys parameters, for which the simplest consistent volumetric probability is not the normal, but the lognormal). But if we have chosen the right parameters (for instance, taking the logarithms of all Jeffreys parameters), it may happen that the Gaussian probabilistic model is acceptable. We then have

ρ_prior(m) = k exp( − (1/2) (m − m_prior)^T M_prior^{-1} (m − m_prior) ) .    (5.244)

When this Gaussian volumetric probability is used, m_prior , the center of the Gaussian, is called the 'a priori model' while M_prior is called the 'a priori covariance matrix'. The name 'a priori model' is dangerous, as for large dimensional problems, the average model may not be a good representative of the models that can be obtained as samples of the distribution (see figure 5.34 as an example). Other usual sources of prior information are the ranges and distribution of media properties in the rocks, or probabilities for the localization of media discontinuities. If the information refers to marginals of the model parameters, and does not include the description of relations across model parameters, the prior volumetric probability reduces to a product of univariate volumetric probabilities. The next example illustrates this case.

Example 5.20 Prior Information for a 1D Mass Density Model. We consider the problem of describing a model consisting of a stack of horizontal layers with variable thickness and uniform mass density. The prior information is shown in figure 5.24, involving marginal distributions of the mass density and the layer thickness. Spatial statistical homogeneity is assumed, hence marginals do not depend on depth in this example. Additionally, they are independent of neighbor layer parameters. The model parameters consist of a sequence of thicknesses and a sequence of mass density parameters, m = {ℓ_1, ℓ_2, . . . , ℓ_NL, ρ_1, ρ_2, . . . , ρ_NL} . The marginal prior probability densities for the layer thicknesses are all assumed to be identical and of the form (exponential volumetric probability)

f(ℓ) = (1/ℓ_0) exp( − ℓ / ℓ_0 ) ,    (5.245)


where the constant ℓ_0 has the value ℓ_0 = 4 km (see the left of figure 5.24), while all the marginal prior probability densities for the mass density are also assumed to be identical, and of the form (lognormal volumetric probability)

g(ρ) = ( 1 / (√(2π) σ) ) exp( − (1/(2σ^2)) ( log( ρ / ρ_0 ) )^2 ) ,    (5.246)

where ρ_0 = 3.98 g/cm^3 and σ = 0.58 (see the right of figure 5.24). Assuming that the probability distribution of any layer thickness is independent of the thicknesses of the other layers, that the probability distribution of any mass density is independent of the mass densities of the other layers, and that layer thicknesses are independent of mass densities, the a priori volumetric probability in this problem is the product of a priori probability densities (equations 5.245 and 5.246) for each parameter,

ρ_prior(m) = ρ_m(ℓ_1, ℓ_2, . . . , ℓ_NL, ρ_1, ρ_2, . . . , ρ_NL) = k ∏_{i=1}^{NL} f(ℓ_i) g(ρ_i) .    (5.247)

Figure 5.25 shows (pseudo) random models generated according to this probability distribution. Of course, the explicit expression 5.247 has not been used to generate these random models. Rather, consecutive layer thicknesses and consecutive mass densities have been generated using the univariate probability densities defined by equations 5.245 and 5.246.
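A minimal Python sketch of this prior sampler (an illustration; ℓ_0 = 4 km, ρ_0 = 3.98 g/cm^3 and σ = 0.58 are the values quoted above, while the total depth of 100 km and everything else is an arbitrary choice): layers are stacked until the total depth is exceeded, thicknesses being exponential and mass densities lognormal.

import numpy as np

rng = np.random.default_rng(0)

L0 = 4.0        # km, mean layer thickness (equation 5.245)
RHO0 = 3.98     # g/cm^3, center of the lognormal (equation 5.246)
SIGMA = 0.58    # dispersion of the lognormal

def random_earth_model(total_depth=100.0):
    """One sample of the prior: a stack of layers down to 'total_depth' km."""
    thicknesses, densities = [], []
    depth = 0.0
    while depth < total_depth:
        l = rng.exponential(L0)                       # layer thickness, eq. (5.245)
        rho = RHO0 * np.exp(SIGMA * rng.normal())     # mass density, eq. (5.246)
        thicknesses.append(l)
        densities.append(rho)
        depth += l
    return np.array(thicknesses), np.array(densities)

# Three (pseudo) random models, as in figure 5.25.
for _ in range(3):
    l, rho = random_earth_model()
    print(len(l), "layers; first densities:", np.round(rho[:5], 2), "...")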

Figure 5.24: At left, the probability density for the layer thickness. At right, the probability density for the mass density.

Figure 5.25: Three random Earth models generated according to the a priori probability density in the model space. (Each model is plotted as mass density, in g/cm^3, versus depth, in km.)

Example 5.21 Geostatistical Modeling. [Note: I must give here, as an example of the use of a priori information in geophysical inverse problems, the geostatistics approach, as developed, for instance, by Journel and Huijbregts (1978). I may also mention the inverse stratigraphic modeling of Bornholdt et al. (1999) and of Cross and Lessenger (1999).]


5.4.1.3 Modeling Problem (or Forward Problem)

Physics analyzes the correlations existing between physical parameters. In standard mathematical physics, these correlations are represented by 'equalities' between physical parameters (like when we write f = m a to relate the force f applied to a particle, the mass m of the particle and the acceleration a ). In the context of inverse problems this corresponds to assuming that we have a function from the 'parameter space' to the 'data space' that we may represent as

o = o(m) .    (5.248)

We do not mean that the relation is necessarily explicit. Given m , we may need to solve a complex system of equations in order to get o , but this, nevertheless, defines a function m → o = o(m) .


5.4.1.4 Measurements and Experimental Uncertainties

Note: the text that was here has been moved to section 5.5.4.4. Remember that we end here with a volumetric probability σ_obs(o) , that represents the result of our measurements.


5.4.1.5 Combination of Available Information

5.4.1.6 Solution in the Model Parameter Space

The basic idea is easy to explain when imagining a Monte Carlo approach, that can be defined without the need of an explicit expression for the final result. Then, our task is to find the analytic expression corresponding to this Monte Carlo approach.

The data of the problem are as follows:

• a volumetric probability on the model parameter space M ,

ρ_prior(m) ,    (5.249)

representing the a priori information we have on the model parameters;

• a mapping from M into O ,

m ↦ o = o(m) ,    (5.250)

providing the solution of the modeling problem (or 'forward' problem);

• and a volumetric probability in the observable parameters manifold O ,

σ_obs(o) ,    (5.251)

representing the information on the observable parameters obtained from some observations (or 'measurements').

The approach about to be proposed25 is not the shorter nor the more elegant. Butit has the advantage of corresponding to the more general of all the possible imple-mentations. The justification of the proposed approach is obtained in section 5.4.2,where the link is made with the notion of conjunction of states of information.

Basic Monte Carlo Approach: We consider, in M , a "very large" set of points that is a sample of ρprior(m) . For each point m of the sample, we compute the predicted values of the observable parameters, o(m) . A random decision is taken to keep the model m or to discard it, the probability for the model m to be kept being

π = σ_obs( o(m) ) / σ_obs^max , (5.252)

i.e., the probability is proportional to the value of the volumetric probability σobs(o) at the point o(m) . The subset of models (of the initial set) that have been kept defines a volumetric probability, that we denote ρpost(m) , and that we call the posterior volumetric probability.
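As an illustration, here is a minimal Python sketch of this accept/reject rule. The functions sample_prior (a sampler of ρprior), forward (the map m ↦ o(m)) and sigma_obs (with known maximum sigma_obs_max) are placeholders to be supplied by the particular problem.

    import numpy as np

    def sample_posterior(sample_prior, forward, sigma_obs, sigma_obs_max, n_prior):
        rng = np.random.default_rng()
        kept = []
        for _ in range(n_prior):
            m = sample_prior()                             # point of the prior sample
            pi = sigma_obs(forward(m)) / sigma_obs_max     # acceptance probability, equation 5.252
            if rng.uniform() < pi:                         # keep the model with probability pi
                kept.append(m)
        return kept                                        # the kept subset samples rho_post (equation 5.253)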

[25] In some of my past writings, I have introduced inverse problems using the notion of conditional volumetric probability (or conditional probability density). The definitions are then different, and lead to different, more complex solutions (see, for instance, Mosegaard and Tarantola, 2002). The approach proposed here replaces the old one.


To obtain the expression for ρpost(m) , we only need to remark that this situation is exactly that examined in section ??. Therefore, the solution found there applies here: the posterior volumetric probability just defined can be expressed as

\rho_{\rm post}(m) \;=\; \frac{1}{\nu}\,\rho_{\rm prior}(m)\,\sigma_{\rm obs}(\,o(m)\,) \;, \qquad (5.253)

where ν is the normalizing constant

\nu \;=\; \int_{\cal M} dv_m\; \rho_{\rm prior}(m)\,\sigma_{\rm obs}(\,o(m)\,) \;. \qquad (5.254)

Example 5.22 Gaussian model. When the model parameter manifold and the observable parameter manifold are linear spaces, the Gaussian model for uncertainties may apply:

\sigma_{\rm obs}(o) \;=\; \frac{1}{(2\pi)^{n/2}\sqrt{\det O_{\rm obs}}}\, \exp\!\Big( -\tfrac{1}{2}\,(o-o_{\rm obs})^t\, O_{\rm obs}^{-1}\,(o-o_{\rm obs}) \Big)

\rho_{\rm prior}(m) \;=\; \frac{1}{(2\pi)^{n/2}\sqrt{\det M_{\rm prior}}}\, \exp\!\Big( -\tfrac{1}{2}\,(m-m_{\rm prior})^t\, M_{\rm prior}^{-1}\,(m-m_{\rm prior}) \Big) \;. \qquad (5.255)

Note: explain here the meaning of oobs (the 'observed values' of the observable parameters), Oobs , mprior , and Mprior . Then,

\rho_{\rm post}(m) \;=\; \frac{1}{\nu}\,\exp(\,-S(m)\,) \;, \qquad (5.256)

where the misfit function S(m) is the sum of squares defined through

2\,S(m) \;=\; (\,o(m)-o_{\rm obs}\,)^t\, O_{\rm obs}^{-1}\,(\,o(m)-o_{\rm obs}\,) \;+\; (\,m-m_{\rm prior}\,)^t\, M_{\rm prior}^{-1}\,(\,m-m_{\rm prior}\,) \;, \qquad (5.257)

and where ν is the normalizing constant ν = ∫_M dv_m(m) exp( -S(m) ) . The maximum likelihood model is the model maximizing ρpost(m) , i.e., the model minimizing S(m) . For that reason, one may call it the 'best model in the least-squares sense'.
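For reference, here is a minimal Python sketch of this Gaussian misfit; 'forward' is a placeholder for the map m ↦ o(m), and the covariance matrices are those of example 5.22.

    import numpy as np

    def misfit(m, forward, o_obs, O_obs, m_prior, M_prior):
        r_o = forward(m) - o_obs
        r_m = m - m_prior
        # sum of squares of equation 5.257 (solving linear systems instead of inverting)
        return 0.5 * (r_o @ np.linalg.solve(O_obs, r_o) + r_m @ np.linalg.solve(M_prior, r_m))

    def rho_post_unnormalized(m, forward, o_obs, O_obs, m_prior, M_prior):
        # equation 5.256, up to the normalizing constant 1/nu
        return np.exp(-misfit(m, forward, o_obs, O_obs, m_prior, M_prior))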

Example 5.23 Gaussian linear model. If the relation between model parameters and data parameters is linear, there is a matrix Ω such that

o = o(m) = Ω m . (5.258)


Then, the posterior probability density ρpost(m) is also Gaussian with mean

m_{\rm post} \;=\; (\,\Omega^t O_{\rm obs}^{-1} \Omega + M_{\rm prior}^{-1}\,)^{-1}\, (\,\Omega^t O_{\rm obs}^{-1} o_{\rm obs} + M_{\rm prior}^{-1} m_{\rm prior}\,)
\;=\; m_{\rm prior} + (\,\Omega^t O_{\rm obs}^{-1} \Omega + M_{\rm prior}^{-1}\,)^{-1}\, \Omega^t O_{\rm obs}^{-1}\, (\,o_{\rm obs} - \Omega\, m_{\rm prior}\,)
\;=\; m_{\rm prior} + M_{\rm prior}\,\Omega^t\, (\,\Omega M_{\rm prior} \Omega^t + O_{\rm obs}\,)^{-1}\, (\,o_{\rm obs} - \Omega\, m_{\rm prior}\,) \qquad (5.259)

and covariance

M_{\rm post} \;=\; (\,\Omega^t O_{\rm obs}^{-1} \Omega + M_{\rm prior}^{-1}\,)^{-1}
\;=\; M_{\rm prior} - M_{\rm prior}\,\Omega^t\, (\,\Omega M_{\rm prior} \Omega^t + O_{\rm obs}\,)^{-1}\, \Omega\, M_{\rm prior} \;. \qquad (5.260)
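The following Python (NumPy) sketch evaluates these formulas on arbitrary illustrative matrices and checks numerically that the alternative forms of equations 5.259 and 5.260 agree; every value here is made up for the test.

    import numpy as np

    rng = np.random.default_rng(1)
    p, q = 4, 6                                   # illustrative dimensions of m and o
    Omega = rng.normal(size=(q, p))
    M_prior = np.eye(p)
    O_obs = 0.1 * np.eye(q)
    m_prior = np.zeros(p)
    o_obs = rng.normal(size=q)

    A = Omega.T @ np.linalg.solve(O_obs, Omega) + np.linalg.inv(M_prior)
    m_post = np.linalg.solve(A, Omega.T @ np.linalg.solve(O_obs, o_obs)
                                + np.linalg.solve(M_prior, m_prior))     # first form of 5.259
    M_post = np.linalg.inv(A)                                            # first form of 5.260

    K = M_prior @ Omega.T @ np.linalg.inv(Omega @ M_prior @ Omega.T + O_obs)
    m_post_alt = m_prior + K @ (o_obs - Omega @ m_prior)                 # last form of 5.259
    M_post_alt = M_prior - K @ Omega @ M_prior                           # last form of 5.260

    assert np.allclose(m_post, m_post_alt) and np.allclose(M_post, M_post_alt)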

Example 5.24 If, in the previous example, there is, in fact, no a priori information on the model parameters, we can formally take Mprior → ∞ I , and the first of equations 5.259 reduces to

m_{\rm post} \;=\; (\,\Omega^t O_{\rm obs}^{-1} \Omega\,)^{-1}\, \Omega^t O_{\rm obs}^{-1}\, o_{\rm obs} \;, \qquad (5.261)

while the first of equations 5.260 gives

M_{\rm post} \;=\; (\,\Omega^t O_{\rm obs}^{-1} \Omega\,)^{-1} \;. \qquad (5.262)

Example 5.25 If, in the previous example, the number of model parameters equals the number of observable parameters, and the matrix Ω is invertible, then one has

m_{\rm post} \;=\; \Omega^{-1}\, o_{\rm obs} \;, \qquad (5.263)

an equation that corresponds to the Cramer solution of a linear system. The posterior covariance matrix can be written

M_{\rm post} \;=\; \Omega^{-1}\, O_{\rm obs}\, \Omega^{-t} \;. \qquad (5.264)

5.4.2 Solution in the Observable Parameter Space

We here raise a series of questions that, although apparently innocent, require intricate developments.

• Given the a priori information on the model parameters, as represented by the volumetric probability ρprior(m) , and given the theoretical mapping m ↦ o = o(m) , what is the (probabilistic) prediction we can make for the observable parameters o ? In other words, what is the volumetric probability σprior(o) obtained in the observable parameter manifold by transport, via m ↦ o(m) , of the prior volumetric probability ρprior(m) ?


• What is the volumetric probability σpost(o) obtained by transport of the posterior volumetric probability ρpost(m) ?

• How are the two volumetric probabilities σprior(o) and σpost(o) related?

We can anticipate the answer to the last question (it is demonstrated below). The relation is

\sigma_{\rm post}(o) \;=\; \frac{1}{\nu}\,\sigma_{\rm prior}(o)\,\sigma_{\rm obs}(o) \;, \qquad (5.265)

where ν is the same normalizing constant obtained above (equations 5.253–5.254). This is a very important expression. It demonstrates that the procedure we have used to define the solution of an inverse problem is consistent with the product of volumetric probabilities in the observable parameters manifold: the posterior distribution for the observable parameters equals the product of the prior distribution and the distribution describing the measurements. It is this internal consistency of the theory that gives weight to the definition of the solution of an inverse problem via the 'Monte Carlo paradigm' used above.

The starting point for the development is to consider the model parameter manifold M , on which the model parameters mα can be considered as coordinates. There also is the observable parameters manifold O , in which the observable parameters oi can be considered coordinates. Both manifolds are assumed to be metric, with respective metric tensors gm and go . Note, in particular, that the manifold O is assumed to exist, and to have a metric, independently of the existence of M . The relation o = o(m) defines an application from M into O .

As M and O have different dimensions, the kind of mapping considered matters. Let us start by assuming that there are "more data than unknowns", i.e., that the number of observable parameters oi is larger than the number of model parameters mα . Denoting p = dim(M) and q = dim(O) , we then have p ≤ q . This situation is represented in figure 5.26.

Figure 5.26: This representation corresponds to the case when there is one model parameter m and two observable parameters o1, o2 (case p < q ). [Diagram: the curve o = o(m) in the (m, o1, o2) space, with ρprior(m) defined over the m axis and σprior(o) induced on the curve.]

So, in the case now examined ( dim(M) ≤ dim(O) ), the mapping o = o(m) defines in the manifold O a subspace of dimension dim(M) : the image of M by the application o = o(m) , that we may denote as o(M) . As suggested in figure 5.26, the coordinates mα , that are coordinates of M , can also be used as coordinates over o(M) .

Our question is: which volumetric probability τ(o) is induced on O by the volumetric probability ρprior(m) over M and the application m ↦ o = o(m) ? It is obvious that there is one, and that it is unambiguously defined: a sample of ρprior(m) can be transported to O via the mapping o = o(m) , where it becomes, by definition, a sample of the transported probability distribution.

The expression of the induced volumetric probability, say σprior(o) , has been obtained in equation ??. It satisfies the relation

\sigma_{\rm prior}(\,o(m)\,) \;=\; \rho_{\rm prior}(m)\, \frac{\sqrt{\det g_m(m)}}{\sqrt{\det(\,\Omega^t(m)\, g_o(o(m))\, \Omega(m)\,)}} \;, \qquad (5.266)

where Ω is the matrix of partial derivatives Ω^i_α = ∂o^i/∂m^α . Note that this expression does not explicitly give a function of o (note: explain this). We know that the volumetric probability σprior(o) is singular, as it is only nonzero inside the submanifold o(M) ⊂ O , which is of dimension dim(M) . This (singular) volumetric probability σprior(o) is to be integrated with the volume element induced over o(M) by the metric go , that is (see equation ??)

d\omega_p \;=\; \sqrt{\det(\,\Omega^t(m)\, g_o(o(m))\, \Omega(m)\,)}\;\; dm^1 \wedge \cdots \wedge dm^p \;. \qquad (5.267)

(remember that we are using the coordinates mα over o(M) ). This ends the problem of 'prior data prediction', i.e., the problem of transporting ρprior(m) from M into O . The transport of ρpost(m) is done in the same way, so one obtains an expression similar to 5.266, but this time concerning the posterior distributions:

\sigma_{\rm post}(\,o(m)\,) \;=\; \rho_{\rm post}(m)\, \frac{\sqrt{\det g_m(m)}}{\sqrt{\det(\,\Omega^t(m)\, g_o(o(m))\, \Omega(m)\,)}} \;. \qquad (5.268)

Inserting here equation 5.253 one obtains

\sigma_{\rm post}(\,o(m)\,) \;=\; \frac{1}{\nu}\,\rho_{\rm prior}(m)\,\sigma_{\rm obs}(\,o(m)\,)\, \frac{\sqrt{\det g_m(m)}}{\sqrt{\det(\,\Omega^t(m)\, g_o(o(m))\, \Omega(m)\,)}} \;, \qquad (5.269)

i.e., using expression 5.266, σpost( o(m) ) = (1/ν) σprior( o(m) ) σobs( o(m) ) . Denoting o(m) by o , this can be written

\sigma_{\rm post}(o) \;=\; \frac{1}{\nu}\,\sigma_{\rm prior}(o)\,\sigma_{\rm obs}(o) \;, \qquad (5.270)

that is the result we had anticipated in equation 5.265.

Example 5.26 Note: demonstrate here that in the linear Gaussian case (example 5.23, page 260), σprior(o) is a Gaussian centered at o_prior = Ω m_prior with covariance matrix O_prior = Ω M_prior Ω^t , while σpost(o) is a Gaussian centered at o_post = Ω m_post with covariance matrix O_post = Ω M_post Ω^t . Explain that in the case analyzed here ( p ≤ q ), the q × q matrices O_prior and O_post can only be regular if p = q .
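A quick numerical check of this claim is easy to write in Python; the dimensions and matrices below are arbitrary illustrations.

    import numpy as np

    rng = np.random.default_rng(2)
    p, q = 3, 5
    Omega = rng.normal(size=(q, p))
    m_prior = rng.normal(size=p)
    M_prior = np.diag([1.0, 2.0, 0.5])

    m_samples = rng.multivariate_normal(m_prior, M_prior, size=200_000)
    o_samples = m_samples @ Omega.T                 # transport of the samples by o = Omega m

    # empirical mean and covariance approach Omega m_prior and Omega M_prior Omega^t
    print(np.allclose(o_samples.mean(axis=0), Omega @ m_prior, atol=0.02))
    print(np.allclose(np.cov(o_samples.T), Omega @ M_prior @ Omega.T, atol=0.05))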


It remains to analyze the case when p ≥ q (see figure 5.27). What is, in the case p ≥ q , the volumetric probability σprior(o) induced in O by the prior volumetric probability ρprior(m) and the mapping o(m) ?

Figure 5.27: Same as in figure 5.26, but in the case p > q . [Diagram: the mapping o = o(m) from the (m1, m2, m3) space, carrying ρprior(m), to the (o1, o2) plane, where σprior(o) is induced.]

The result is obtained by a direct application of equation ??. We must first separate the p model parameters into two subsets,

\{\,m^1,\dots,m^q,\,m^{q+1},\dots,m^p\,\} \;=\; \{\,\mu^1,\dots,\mu^q,\,\nu^{q+1},\dots,\nu^p\,\} \;, \qquad (5.271)

i.e., for short,

m \;=\; \{\mu,\nu\} \;. \qquad (5.272)

We must then rewrite the application m ↦ o = o(m) as

\{\mu,\nu\} \;\mapsto\; o \;=\; o(\mu,\nu) \;, \qquad (5.273)

and solve a q × q system for the model parameters µ :

\mu \;=\; \mu(o,\nu) \;. \qquad (5.274)

Then, in terms of probability densities (see equation ??)

\sigma_{\rm prior}(o) \;=\; \int d\nu^{q+1} \wedge \cdots \wedge d\nu^p\; \frac{\rho_{\rm prior}(\mu,\nu)}{|\det \Omega(\mu,\nu)|} \;, \qquad (5.275)

where Ω is the q × q matrix of partial derivatives Ω^i_α = ∂o^i/∂m^α . In the right-hand side of equation 5.275 it is understood that the two occurrences of µ have to be replaced by the function of o and ν obtained above (equation 5.274).

Denoting by gm(m) the metric tensor in the model parameter manifold, and by go(o) the metric tensor in the observable parameter manifold, we can transform equation 5.275 into an equation concerning volumetric probabilities:

\sigma_{\rm prior}(o) \;=\; \frac{1}{\sqrt{\det g_o(o)}} \int d\nu^{q+1} \wedge \cdots \wedge d\nu^p\; \frac{\sqrt{\det g_m(\mu,\nu)}}{|\det \Omega(\mu,\nu)|}\; \rho_{\rm prior}(\mu,\nu) \;. \qquad (5.276)

This solves the problem of 'prior data prediction' in the case p ≥ q .

For the transportation of ρpost(m) from M into O , one can follow exactly the same approach as for the case p ≤ q , to obtain σpost(o) = (1/ν) σprior(o) σobs(o) , i.e., equation 5.265 again. So we see that this equation is also valid in the case p ≥ q .

Example 5.27 Note: I have to revisit here example 5.26, in the case p ≥ q .


Figure 5.28: Scan. [Scanned handwritten notes.]


Figure 5.29: Scan. [Scanned handwritten notes.]


5.4.3 Implementation of Inverse Problems

Once the volumetric probability ρpost(m) has been defined, there are different ways of 'using' it.

If the model parameter manifold M has a small number of dimensions (say between one and four), the values of ρpost(m) can be computed at every point of a grid and a graphical representation of ρpost(m) can be attempted. A visual inspection of such a representation is usually worth a thousand 'estimators' (central estimators or estimators of dispersion). But, of course, if the values of ρpost(m) are known at all significant points, these estimators can also be computed. This point of view is emphasized in section XXX. If the 'model space' M has a large number of dimensions (say from five to many millions or billions), then an exhaustive exploration of the space is not possible, and we must turn to Monte Carlo sampling methods to extract information from ρpost(m) . We discuss the application of Monte Carlo methods to inverse problems in section 5.4.3.2. Finally, the optimization techniques are discussed in section 5.4.3.5.

5.4.3.1 Direct use of the Volumetric Probability

Note: write this section.


5.4.3.2 Using Monte Carlo Methods

[Note: Write a small introduction here].

5.4.3.3 Sampling the Prior Probability Distribution

The first step in the Monte Carlo analysis is to switch off the comparison between computed and observed data, thereby generating samples of the a priori probability density. This allows us to verify statistically that the algorithm is working correctly, and it allows us to understand the prior information we are using. We will refer to a large collection of models representing the prior probability distribution as the "prior movie". The more models present in this movie, the more accurate the representation of the prior probability density.

If we are interested in smooth Earth models (knowing, e.g., that only smooth properties are resolved by the data), a smooth movie can be produced simply by smoothing the individual models of the original movie.

5.4.3.4 Sampling the Posterior Probability Distribution

If we now switch on the comparison between computed and observed data using, e.g., the Metropolis rule, the random walk sampling the prior distribution is modified into a walk sampling the posterior distribution. Again, smoothed versions of this "posterior movie" can be generated by smoothing the individual models in the original, posterior movie.
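A minimal Python sketch of this modification, assuming a routine propose_prior_step that performs one step of a random walk sampling the prior (all names here are placeholders):

    import numpy as np

    def metropolis(m0, propose_prior_step, forward, sigma_obs, n_steps):
        rng = np.random.default_rng()
        m, L = m0, sigma_obs(forward(m0))            # L(m) = sigma_obs(o(m))
        movie = [m0]                                 # the 'posterior movie'
        for _ in range(n_steps):
            m_try = propose_prior_step(m)            # step of the prior-sampling walk
            L_try = sigma_obs(forward(m_try))
            if L == 0.0 or rng.uniform() < min(1.0, L_try / L):   # Metropolis acceptance rule
                m, L = m_try, L_try
            movie.append(m)
        return movie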

Since data rarely put strong constraints on the Earth, the "posterior movie" typically shows that many different models are possible. But even though the models in the posterior movie may be quite different, all of them predict data that agree with the observations within experimental uncertainties; they are all models with high likelihood. In other words, we must accept that the data alone cannot single out a preferred model.

The posterior movie allows us to perform a proper resolution analysis that helps us to choose between different interpretations of a given data set. Using the movie we can answer complicated questions about the correlations between several model parameters. To answer such questions, we can view the posterior movie and try to discover structure that is well resolved by the data. Such structure will appear as "persistent" in the posterior movie. Another, more traditional, way of investigating resolution is to calculate covariances and higher order moments.

Note: continue the discussion.


5.4.3.5 Appendix: Using Optimization Methods

As we have seen, the solution of an inverse problem essentially consists of a probability distribution over the space of all possible models of the physical system under study. In general, this 'model space' is high-dimensional, and the only general way to explore it is by using the Monte Carlo methods developed in section ??.

If the probability distributions are 'bell-shaped' (i.e., if they look like a Gaussian or like a generalized Gaussian), then one may simplify the problem by calculating only the point around which the probability is maximum, with an approximate estimation of the variances and covariances. This is the problem addressed in this section. [Note: I rephrased this sentence] Among the many methods available to obtain the point at which a scalar function reaches its maximum value (relaxation methods, linear programming techniques, etc.), we limit our scope here to the methods using the gradient of the function, which we assume can be computed analytically or, at least, numerically. For more general methods, the reader may have a look at Fletcher (1980, 1981), Powell (1981), Scales (1985), Tarantola (1987), or Scales et al. (1992).


5.4.3.6 Maximum Likelihood Point

Let us consider a space X , with a notion of volume element dV defined. If some coordinates x ≡ {x1, x2, . . . , xn} are chosen over the space, the volume element has an expression dV(x) = v(x) dx , and each probability distribution over X can be represented by a probability density f (x) . For any fixed small volume ∆V , we can search for the point xML such that the probability dP of the small volume, when centered around xML , gets a maximum. In the limit ∆V → 0 this defines the maximum likelihood point. The maximum likelihood point may be unique (if the probability distribution is monomodal), may be degenerate (if the probability distribution is 'roof-shaped') or may be multiple (as when we have the sum of a few bell-shaped functions).

The maximum likelihood point is not the point at which the probability density is maximum: our definition requires that what must be maximum is the ratio of the probability density to the function v(x) defining the volume element:

x = x_{\rm ML} \;\Longleftrightarrow\; F(x) = \frac{f(x)}{v(x)} \;\;\mbox{maximum} \;. \qquad (5.277)

We recognize in the ratio F(x) = f (x)/v(x) the volumetric probability associated to the probability density f (x) (see equation ??). As the homogeneous probability density is µ(x) = k v(x) (see rule ??), we can equivalently define the maximum likelihood point by the condition

x = x_{\rm ML} \;\Longleftrightarrow\; \frac{f(x)}{\mu(x)} \;\;\mbox{maximum} \;. \qquad (5.278)

The point at which a probability density has its maximum is not xML . In fact, the maximum of a probability density does not correspond to an intrinsic definition of a point: a change of coordinates x ↦ y = ψ(x) would change the probability density f (x) into the probability density g(y) (obtained using the Jacobian rule), but the point of the space at which f (x) is maximum is not the same as the point of the space where g(y) is maximum (unless the change of variables is linear). This contrasts with the maximum likelihood point, as defined by equation 5.278, that is an intrinsically defined point: no matter which coordinates we use in the computation, we always obtain the same point of the space.

5.4.3.7 Misfit

One of the goals here is to develop gradient-based methods for obtaining the maximum of F(x) = f (x)/µ(x) . As a quite general rule, gradient-based methods perform quite poorly for (bell-shaped) probability distributions, as when one is far from the maximum the probability densities tend to be quite flat, and it is difficult to get, reliably, the direction of steepest ascent. Taking a logarithm transforms a bell-shaped distribution into a paraboloid-shaped distribution on which gradient methods work well.

The logarithmic volumetric probability, or misfit, is defined as S(x) = − log( F(x)/F0 ) , where F0 is a constant, and is given by

S(x) \;=\; -\log \frac{f(x)}{\mu(x)} \;. \qquad (5.279)

The problem of maximization of the (typically) bell-shaped function f (x)/µ(x) has been transformed into the problem of minimization of the (typically) paraboloid-shaped function S(x) :

x = xML ⇐⇒ S(x) minimum . (5.280)

Example 5.28 The conjunction σ(x) of two probability densities ρ(x) and ϑ(x) was defined (equation ??) as

\sigma(x) \;=\; p\, \frac{\rho(x)\,\vartheta(x)}{\mu(x)} \;. \qquad (5.281)

Then,

S(x) \;=\; S_\rho(x) + S_\vartheta(x) \;, \qquad (5.282)

where

S_\rho(x) \;=\; -\log \frac{\rho(x)}{\mu(x)} \;; \qquad S_\vartheta(x) \;=\; -\log \frac{\vartheta(x)}{\mu(x)} \;. \qquad (5.283)

Example 5.29 In the context of Gaussian distributions, we have found the probability density (see example ??)

\rho_{\rm post}(m) \;=\; k\, \exp\!\Big( -\tfrac{1}{2}\, \big[\, (m-m_{\rm prior})^t\, M_{\rm prior}^{-1}\, (m-m_{\rm prior}) + (o(m)-o_{\rm obs})^t\, O_{\rm obs}^{-1}\, (o(m)-o_{\rm obs}) \,\big] \Big) \;. \qquad (5.284)

The limit of this distribution for infinite variances is a constant, so in this case µm(m) = k . The misfit function S(m) = − log( ρpost(m)/µm(m) ) is then given by

2\,S(m) \;=\; (m-m_{\rm prior})^t\, M_{\rm prior}^{-1}\, (m-m_{\rm prior}) \;+\; (o(m)-o_{\rm obs})^t\, O_{\rm obs}^{-1}\, (o(m)-o_{\rm obs}) \;. \qquad (5.285)

The reader should remember that this misfit function is valid only for weakly nonlinear problems (see examples ?? and ??). The maximum likelihood model here is the one that minimizes the sum of squares 5.285. This corresponds to the least-squares criterion.


Example 5.30 In the context of Laplacian distributions, we have found the probability density (see example ??)

\rho_{\rm post}(m) \;=\; k\, \exp\!\Big( -\Big[\, \sum_\alpha \frac{|m^\alpha - m^\alpha_{\rm prior}|}{\sigma^\alpha} + \sum_i \frac{|f^i(m) - o^i_{\rm obs}|}{\sigma^i} \,\Big] \Big) \;. \qquad (5.286)

The limit of this distribution for infinite mean deviations is a constant, so here µm(m) = k . The misfit function S(m) = − log( ρpost(m)/µm(m) ) is then given by

S(m) \;=\; \sum_\alpha \frac{|m^\alpha - m^\alpha_{\rm prior}|}{\sigma^\alpha} \;+\; \sum_i \frac{|f^i(m) - o^i_{\rm obs}|}{\sigma^i} \;. \qquad (5.287)

The reader should remember that this misfit function is valid only for weakly nonlinear problems. The maximum likelihood model here is the one that minimizes the sum of absolute values 5.287. This corresponds to the least absolute values criterion.

5.4.3.8 Gradient and Direction of Steepest Ascent

One must not consider as synonymous the notions of 'gradient' and 'direction of steepest ascent'. Consider, for instance, an adimensional misfit function[26] S(P, T) over a pressure P and a temperature T . Any sensible definition of the gradient of S will lead to an expression like

{\rm grad}\, S \;=\; \begin{pmatrix} \partial S/\partial P \\ \partial S/\partial T \end{pmatrix} \qquad (5.288)

and this by no means can be regarded as a 'direction' in the (P, T) space (for instance, the components of this 'vector' do not have the dimensions of pressure and temperature, but of inverse pressure and inverse temperature).

Mathematically speaking, the gradient of a function S(x) at a point x0 is the linear application that is tangent to S(x) at x0 . This definition of the gradient is consistent with the more elementary one, based on the use of the first-order development

S(x_0 + \delta x) \;=\; S(x_0) + \gamma_0^T\, \delta x + \dots \qquad (5.289)

Here, γ0 is what is called the gradient of S(x) at the point x0 . It is clear that S(x0) + γ0^T δx is a linear application, and that it is tangent to S(x) at x0 , so the two definitions are, in fact, equivalent.

[26] We take this example because typical misfit functions are adimensional, but the argument has general validity.

Explicitly, the components of the gradient at the point x0 are

(\gamma_0)_p \;=\; \frac{\partial S}{\partial x^p}(x_0) \;. \qquad (5.290)

Everybody is well trained at computing the gradient of a function (even if the interpretation of the result as a direction in the original space is wrong). How can we pass from the gradient to the direction of steepest ascent (a bona fide direction in the original space)? In fact, the gradient (at a given point) of a function defined over a given space E is an element of the dual of the space. To obtain a direction in E , we must pass from the dual to the primal space. As usual, it is the metric of the space that maps the dual of the space into the space itself. So if g is the metric of the space where S(x) is defined, and if γ is the gradient of S at a given point, the direction of steepest ascent is

\tilde\gamma \;=\; g^{-1}\, \gamma \;. \qquad (5.291)

The direction of steepest ascent must be interpreted as follows: if we are at a point x0 of the space, we can consider a very small hypersphere around x0 . The direction of steepest ascent points towards the point of the sphere at which S(x) gets its maximum value.

Example 5.31 Figure 5.30 represents the level lines of a scalar function S(u, v) in a 2D space. A particular point has been selected. What is the gradient of the function at the given point? As suggested in the main text, it is not an arrow 'perpendicular' to the level lines of the function at the considered point, as the notion of perpendicularity will depend on a metric not yet specified (and unnecessary to define the gradient). The gradient must be seen as 'the linear function that is tangent to S(u, v) at the considered point'. If S(u, v) has been represented by its level lines, then the gradient may also be represented by its level lines (right of the figure). We see that the condition, in fact, is that the level lines of the gradient are tangent to the level lines of the original function (at the considered point). Contrary to the notion of perpendicularity, the notion of tangency is metric-independent.

Figure 5.30: The gradient of a function is not to be seen as a vector orthogonal to the level lines, but as a form parallel to them (see text). [Left panel: a function, a point and the tangent level line. Right panel: the gradient of the function at the considered point.]


Example 5.32 In the context of least squares, we consider a misfit function S(m) and a covariance matrix OM . If γ0 is the gradient of S at a point x0 , and if we use OM to define distances in the space, the direction of steepest ascent is

\tilde\gamma_0 \;=\; O_M\, \gamma_0 \;. \qquad (5.292)

Example 5.33 If the misfit function S(P, T) depends on a pressure P and on a temperature T , the gradient of S is, as mentioned above (equation 5.288),

\gamma \;=\; \begin{pmatrix} \partial S/\partial P \\ \partial S/\partial T \end{pmatrix} \;. \qquad (5.293)

As the quantities P and T are Jeffreys quantities, associated to the metric ds^2 = (dP/P)^2 + (dT/T)^2 , the direction of steepest ascent is[27]

\tilde\gamma \;=\; \begin{pmatrix} P^2\, \partial S/\partial P \\ T^2\, \partial S/\partial T \end{pmatrix} \;. \qquad (5.294)
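As a small numerical illustration of equations 5.293–5.294, the Python sketch below computes the gradient by finite differences and converts it into a direction of steepest ascent with the inverse Jeffreys metric diag(P^2, T^2); the misfit S(P, T) used here is an arbitrary adimensional example.

    import numpy as np

    def S(P, T):
        return 0.5 * (np.log(P / 1.0e5)**2 + np.log(T / 300.0)**2)   # illustrative misfit

    def gradient(P, T, h=1.0e-6):
        dS_dP = (S(P * (1 + h), T) - S(P * (1 - h), T)) / (2 * h * P)   # dS/dP
        dS_dT = (S(P, T * (1 + h)) - S(P, T * (1 - h))) / (2 * h * T)   # dS/dT
        return np.array([dS_dP, dS_dT])                                  # equation 5.293

    def steepest_ascent(P, T):
        g_inv = np.diag([P**2, T**2])      # inverse of the metric diag(1/P^2, 1/T^2)
        return g_inv @ gradient(P, T)      # equation 5.294: components with dimensions of P and T

    print(steepest_ascent(2.0e5, 350.0))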

5.4.3.9 The Steepest Descent Method

Consider that we have a probability distribution defined over an n-dimensional space X . Having chosen some coordinates x ≡ {x1, x2, . . . , xn} over the space, the probability distribution is represented by the probability density f (x) whose homogeneous limit (in the sense developed in section ??) is µ(x) . We wish to calculate the coordinates xML of the maximum likelihood point. By definition (equation 5.278),

x = x_{\rm ML} \;\Longleftrightarrow\; \frac{f(x)}{\mu(x)} \;\;\mbox{maximum} \;, \qquad (5.295)

i.e.,

x = x_{\rm ML} \;\Longleftrightarrow\; S(x) \;\;\mbox{minimum} \;, \qquad (5.296)

where S(x) is the misfit (equation 5.279)

S(x) \;=\; -k\,\log \frac{f(x)}{\mu(x)} \;. \qquad (5.297)

[27] We have here \begin{pmatrix} g_{PP} & g_{PT} \\ g_{TP} & g_{TT} \end{pmatrix} = \begin{pmatrix} 1/P^2 & 0 \\ 0 & 1/T^2 \end{pmatrix} .


Let us denote by γ(xk) the gradient of S(x) at point xk , i.e. (equation 5.290),

(\gamma_0)_p \;=\; \frac{\partial S}{\partial x^p}(x_0) \;. \qquad (5.298)

We have seen above that γ(x) is not to be interpreted as a direction in the space X , but as a direction in the dual space. The gradient can be converted into a direction using some metric g(x) over X . In simple situations the metric g will be that used to define the volume element of the space, i.e., we will have µ(x) = k v(x) = k √(det g(x)) , but this is not a necessity, and iterative algorithms may be accelerated by the astute introduction of ad-hoc metrics.

Given, then, the gradient γ(xk) (at some particular point xk ), for any possible choice of metric g(x) we can define the direction of steepest ascent associated to the metric g by (equation 5.292)

\tilde\gamma(x_k) \;=\; g^{-1}(x_k)\, \gamma(x_k) \;. \qquad (5.299)

The algorithm of steepest descent is an iterative algorithm passing from the point xk to the point xk+1 by making a 'small jump' along the local direction of steepest descent,

x_{k+1} \;=\; x_k - \varepsilon_k\, g_k^{-1}\, \gamma_k \;, \qquad (5.300)

where εk is an ad-hoc (real, positive) value adjusted to force the algorithm to converge rapidly (if εk is chosen too small the convergence may be too slow; if it is chosen too large, the algorithm may even diverge).

Many elementary presentations of the steepest descent algorithm just forget to include the metric gk in expression 5.300. These algorithms are not consistent: even the physical dimensionality of the equation is not assured. I have traced some 'numerical' problems in existing computer implementations of steepest descent algorithms to this neglect of the metric.
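A minimal Python sketch of the metric-aware iteration 5.300; gradient and metric are placeholders returning γ(x) and g(x) for the problem at hand.

    import numpy as np

    def steepest_descent(x0, gradient, metric, eps, n_iter):
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            gamma = gradient(x)                             # gradient: element of the dual space
            direction = np.linalg.solve(metric(x), gamma)   # g^{-1} gamma: direction of steepest ascent
            x = x - eps * direction                         # equation 5.300
        return x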

Example 5.34 In the context of example 5.29, where the misfit function S(m) is given by

2\,S(m) \;=\; (o(m)-o_{\rm obs})^t\, O_{\rm obs}^{-1}\, (o(m)-o_{\rm obs}) \;+\; (m-m_{\rm prior})^t\, M_{\rm prior}^{-1}\, (m-m_{\rm prior}) \;, \qquad (5.301)

the gradient γ , whose components are γα = ∂S/∂mα , is given by the expression

\gamma(m) \;=\; F^t(m)\, O_{\rm obs}^{-1}\, (o(m)-o_{\rm obs}) \;+\; M_{\rm prior}^{-1}\, (m-m_{\rm prior}) \;, \qquad (5.302)

where F is the matrix of partial derivatives

F^i{}_\alpha \;=\; \frac{\partial f^i}{\partial m^\alpha} \;. \qquad (5.303)

An example of computation of partial derivatives is given in appendix ??.


Example 5.35 In the context of example 5.34 the model space M has an obvious metric, namely that defined by the inverse of the 'a priori' covariance operator, g = M_prior^{-1} . Using this metric and the gradient given by equation 5.302, the steepest descent algorithm 5.300 becomes

m_{k+1} \;=\; m_k - \varepsilon_k\, \Big(\, M_{\rm prior}\, F_k^t\, O_{\rm obs}^{-1}\, (f_k - o_{\rm obs}) + (m_k - m_{\rm prior}) \,\Big) \;, \qquad (5.304)

where Fk ≡ F(mk) and fk ≡ f(mk) . The real positive quantities εk can be fixed, after some trial and error, by accurate linear search, or by using a linearized approximation[28].

Example 5.36 In the context of example 5.34 the model space M has a less obvious metric, namely that defined by the inverse of the 'a posteriori' covariance operator, g = M_post^{-1} . Note: Explain here that the 'best current estimator' of Mpost is

M_{\rm post} \;\approx\; \Big(\, F_k^t\, O_{\rm obs}^{-1}\, F_k + M_{\rm prior}^{-1} \,\Big)^{-1} \;. \qquad (5.305)

Using this metric and the gradient given by equation 5.302, the steepest descent algorithm 5.300 becomes

m_{k+1} \;=\; m_k - \varepsilon_k\, \Big(\, F_k^t\, O_{\rm obs}^{-1}\, F_k + M_{\rm prior}^{-1} \,\Big)^{-1} \Big(\, F_k^t\, O_{\rm obs}^{-1}\, (f_k - o_{\rm obs}) + M_{\rm prior}^{-1}\, (m_k - m_{\rm prior}) \,\Big) \;, \qquad (5.306)

where Fk ≡ F(mk) and fk ≡ f(mk) . The real positive quantities εk can be fixed, after some trial and error, by accurate linear search, or by using a linearized approximation that simply gives[29] εk ≈ 1 .

The algorithm 5.306 is usually called a 'quasi-Newton algorithm'. The name is not quite appropriate: a Newton method applied to the minimization of the misfit function S(m) would use the second derivatives of S(m) , and thus the derivatives H^i{}_{αβ} = ∂²f^i/(∂m^α ∂m^β) , which are neither computed nor estimated when using this algorithm. It is just a steepest descent algorithm with a nontrivial definition of the metric in the working space. In this sense it belongs to the wider class of 'variable metric methods', not discussed here.
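For completeness, here is a Python sketch of iteration 5.306, with the posterior covariance estimate 5.305 (and 5.313) evaluated at the last iterate. The routine forward_and_jacobian, returning f(m) and F = ∂f/∂m, is a placeholder.

    import numpy as np

    def variable_metric_descent(m_prior, o_obs, O_obs, M_prior, forward_and_jacobian,
                                n_iter=20, eps=1.0):
        M_prior_inv = np.linalg.inv(M_prior)
        m = np.array(m_prior, dtype=float)
        for _ in range(n_iter):
            f, F = forward_and_jacobian(m)
            grad = F.T @ np.linalg.solve(O_obs, f - o_obs) + M_prior_inv @ (m - m_prior)   # eq. 5.302
            H = F.T @ np.linalg.solve(O_obs, F) + M_prior_inv                              # metric, eq. 5.305
            m = m - eps * np.linalg.solve(H, grad)                                         # eq. 5.306
        return m, np.linalg.inv(H)            # approximate posterior covariance, eq. 5.313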

Example 5.37 In the context of example 5.30, where the misfit function S(m) is given by

S(m) \;=\; \sum_i \frac{|f^i(m) - o^i_{\rm obs}|}{\sigma^i} \;+\; \sum_\alpha \frac{|m^\alpha - m^\alpha_{\rm prior}|}{\sigma^\alpha} \;, \qquad (5.307)

[28] As shown in Tarantola (1987), if \tilde\gamma_k is the direction of steepest ascent at the point m_k , i.e., \tilde\gamma_k = M_{\rm prior}\, F_k^t\, O_{\rm obs}^{-1}\, (f_k - o_{\rm obs}) + (m_k - m_{\rm prior}) , then a local linearized approximation for the optimal \varepsilon_k gives \varepsilon_k = \dfrac{\tilde\gamma_k^t\, M_{\rm prior}^{-1}\, \tilde\gamma_k}{\tilde\gamma_k^t\, (\, F_k^t\, O_{\rm obs}^{-1}\, F_k + M_{\rm prior}^{-1}\,)\, \tilde\gamma_k} .

[29] While a sensible estimation of the optimal values of the real positive quantities εk is crucial for the algorithm 5.304, they can, in many usual circumstances, be dropped from the algorithm 5.306.


the gradient γ , whose components are γα = ∂S/∂mα , is given by the expression

\gamma_\alpha \;=\; \sum_i F^i{}_\alpha\, \frac{1}{\sigma^i}\, {\rm sign}(f^i - o^i_{\rm obs}) \;+\; \frac{1}{\sigma^\alpha}\, {\rm sign}(m^\alpha - m^\alpha_{\rm prior}) \;, \qquad (5.308)

where F^i{}_α = ∂f^i/∂m^α . We can now choose in the model space the ad-hoc metric defined as the inverse of the 'covariance matrix' formed by the square of the mean deviations σi and σα (interpreted as if they were variances). Using this metric, the direction of steepest ascent associated to the gradient in 5.308 is

\tilde\gamma^\alpha \;=\; \sum_i F^i{}_\alpha\, \sigma^i\, {\rm sign}(f^i - o^i_{\rm obs}) \;+\; \sigma^\alpha\, {\rm sign}(m^\alpha - m^\alpha_{\rm prior}) \;. \qquad (5.309)

The steepest descent algorithm can now be applied:

m_{k+1} \;=\; m_k - \varepsilon_k\, \tilde\gamma_k \;. \qquad (5.310)

The real positive quantities εk can be fixed after some trial and error, or by accurate linear search.
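A minimal Python sketch of one iteration of this l1 steepest descent; forward_and_jacobian is again a placeholder returning f(m) and F = ∂f/∂m, and sigma_d, sigma_m are the vectors of mean deviations.

    import numpy as np

    def l1_steepest_descent_step(m, m_prior, o_obs, sigma_d, sigma_m,
                                 forward_and_jacobian, eps):
        f, F = forward_and_jacobian(m)
        # direction of steepest ascent for the ad-hoc diagonal metric, equation 5.309
        direction = F.T @ (sigma_d * np.sign(f - o_obs)) + sigma_m * np.sign(m - m_prior)
        return m - eps * direction              # equation 5.310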

An expression like 5.307 defines a sort of deformed polyhedron, and to solve this sort of minimization problem the linear programming techniques are often advocated (e.g., Claerbout and Muir, 1973). We have found that for problems involving many dimensions the crude steepest descent method defined by equations 5.309–5.310 performs extremely well. For instance, in Djikpesse and Tarantola (1999) a large-sized problem of waveform fitting is solved using this algorithm. It is well known that the sum of absolute values 5.307 provides a more robust[30] criterion than the sum of squares 5.301. If one fears that the data set to be used is corrupted by some unexpected errors, the least absolute values criterion should be preferred to the least squares criterion[31].

[30] A method is 'robust' if its output is not sensitive to a small number of large errors in the inputs.

[31] Of course, it would be much better to develop a realistic model of the uncertainties, and use the more general probabilistic methods developed above, but if those models are not available, then the least absolute values criterion is a valuable criterion.


5.4.3.10 Estimation of A Posteriori Uncertainties

In the Gaussian context, the Gaussian probability density that is tangent to ρpost(m) has its center at the point given by the iterative algorithm

m_{k+1} \;=\; m_k - \varepsilon_k\, \Big(\, M_{\rm prior}\, F_k^t\, O_{\rm obs}^{-1}\, (f_k - o_{\rm obs}) + (m_k - m_{\rm prior}) \,\Big) \;, \qquad (5.311)

(equation 5.304) or, equivalently, by the iterative algorithm

m_{k+1} \;=\; m_k - \varepsilon_k\, \Big(\, F_k^t\, O_{\rm obs}^{-1}\, F_k + M_{\rm prior}^{-1} \,\Big)^{-1} \Big(\, F_k^t\, O_{\rm obs}^{-1}\, (f_k - o_{\rm obs}) + M_{\rm prior}^{-1}\, (m_k - m_{\rm prior}) \,\Big) \qquad (5.312)

(equation 5.306). The covariance of the tangent Gaussian is

M_{\rm post} \;\approx\; \Big(\, F_\infty^t\, O_{\rm obs}^{-1}\, F_\infty + M_{\rm prior}^{-1} \,\Big)^{-1} \;, \qquad (5.313)

where F∞ refers to the value of the matrix of partial derivatives at the convergence point.

[note: Emphasize here the importance of Mpost ].


5.4.3.11 Some Comments on the Use of Deterministic Methods

5.4.3.11.1 About the Use of the Term 'Matrix' [note: Warning, old text to be updated.] Contrary to the next chapter, where the model parameter space and the data space may be functional spaces, I assume here that we have discrete spaces, with a finite number of dimensions. [Note: What is 'indicial' ?] Then, it makes sense to use the indicial notation

o \;=\; \{\,o^i\,,\ i \in I_D\,\} \;; \qquad m \;=\; \{\,m^\alpha\,,\ \alpha \in I_M\,\} \;, \qquad (5.314)

where ID and IM are two index sets, for the data and the model parameters respectively. In the simplest case, the indices are simple integers, ID = {1, 2, 3, . . .} and IM = {1, 2, 3, . . .} , but this is not necessarily true. For instance, figure 5.31 suggests a 2D problem where we compute the gravitational field from a distribution of masses. Then, the index α is better understood as consisting of a pair of integers.

Figure 5.31: A simple example where the index in m = {mα} is not necessarily an integer. In this case, where we are interested in predicting the gravitational field g generated by a 2-D distribution of mass, the index α is better understood as consisting of a pair of integers. Here, for instance, mA,B means the total mass in the block at row A and column B . [Diagram: a grid of blocks m1,1 , ... , m3,4 and four observation points g1 , ... , g4 .]

5.4.3.11.2 Linear, Weakly Nonlinear and Nonlinear Problems There are different degrees of nonlinearity. Figure 5.32 illustrates the four domains of nonlinearity allowing the use of the different optimization algorithms. This figure symbolically represents the model space on the abscissa axis, and the data space on the ordinate axis. The gray oval represents the information coming in part from a priori information on the model parameters and coming in part from the data observations[32]. It is the function ρ(o, m) = σobs(o) ρprior(m) seen elsewhere (note: say where).

To fix ideas, the oval suggests here a Gaussian probability, but the sorting of problems we are about to make as a function of their nonlinearity does not depend fundamentally on this.

[32] The gray oval is the product of the probability density over the model space, representing the a priori information, times the probability density over the data space, representing the experimental results.

Page 290: Mapping of Probabilities

280 Appendices

Figure 5.32: Illustration of the four domains of nonlinearity allowing the use of the different optimization algorithms. The model space is symbolically represented on the abscissa axis, and the data space on the ordinate axis. The gray oval represents the information coming in part from a priori information on the model parameters and coming in part from the data observations. What is important is not some intrinsic nonlinearity of the function relating model parameters to data, but how linear the function is inside the domain of significant probability. [Four panels: linear problem ( d = G m ), linearizable problem ( d − dprior = G0 (m − mprior) ), weakly nonlinear problem and nonlinear problem ( d = g(m) ); each panel shows dobs , mprior and σM(m) .]


First, there are some strictly linear problems. For instance, in the example illustrated by figure 5.31, the gravitational field g depends linearly on the masses inside the blocks[33].

[33] The gravitational field at the point x0 generated by a distribution of volumetric mass ρ(x) is given by g(x_0) = \int dV(x)\, \frac{x_0 - x}{\|x_0 - x\|^3}\, \rho(x) . When the volumetric mass is constant inside some predefined (2-D) volumes, as suggested in figure 5.31, this gives g(x_0) = \sum_A \sum_B G_{A,B}(x_0)\, m_{A,B} . This is a strictly linear equation between the data (the gravitational field at a given observation point) and the model parameters (the masses inside the volumes). Note that if, instead of choosing as model parameters the total masses inside some predefined volumes, one chooses the geometrical parameters defining the sizes of the volumes, then the gravity field is not a linear function of the parameters. More details can be found in Tarantola and Valette (1982b, page 229).

Strictly linear problems are illustrated at the top left of figure 5.32. The linear relationship between data and model parameters, o = Ω m , is represented by a straight line. The a priori probability density ρ(o, m) "induces", on this straight line, the a posteriori probability density (warning: this notation corresponds to volumetric probabilities) σ(o, m) , whose "projection" over the model space gives the a posteriori probability density over the model parameter space, ρpost(m) . Should the a priori probability densities be Gaussian, then the a posteriori probability distribution would also be Gaussian: this is the simplest situation (in such problems, as we will see later (section xxx), the problem reduces to finding the mean and the covariance of the a posteriori Gaussian).

Quasi-linear problems are illustrated at the bottom-left of figure 5.32. If the relationship linking the observable data o to the model parameters m ,

o = o(m) , (5.315)

is approximately linear inside the domain of significant a priori probability (i.e., inside the gray oval of the figure), then the a posteriori probability is as simple as the a priori probability. For instance, if the a priori probability is Gaussian, the a posteriori probability is also Gaussian.

In this case also, the problem can be reduced to the computation of the mean and the covariance of the Gaussian. Typically, one begins at some "starting model" m0 (typically, one takes for m0 the "a priori model" mprior ) (note: explain clearly somewhere in this section that "a priori model" is a language abuse for the "mean a priori model"), one linearizes the function o = o(m) around m0 , and one looks for a model m1 "better" than m0 .

Iterating such an algorithm, one tends to the model m∞ at which the "quasi-Gaussian" ρpost(m) is maximum. The linearizations made in order to arrive at m∞ are not, so far, an approximation: the point m∞ is perfectly defined independently of any linearization, and of any method used to find it. But once the convergence to this point has been obtained, a linearization of the function o = o(m) around this point,

o− o(m∞) = Ω∞ (m−m∞) , (5.316)

allows one to obtain a good approximation of the a posteriori uncertainties. For instance, if the a priori probability is Gaussian, this will give the covariance of the "tangent Gaussian".

Between linear and quasi-linear problems there are the "linearizable problems". The scheme at the top-right of figure 5.32 shows the case where the linearization of the function o = o(m) around the a priori model,

o - o(m_{\rm prior}) \;=\; \Omega_{\rm prior}\, (m - m_{\rm prior}) \;, \qquad (5.317)

gives a function that, inside the domain of significant probability, is very similar to the true (nonlinear) function.

In this case, there is no practical difference between this problem and the strictly linear problem, and the iterative procedure necessary for quasi-linear problems is here superfluous.

It remains to analyze the true nonlinear problems that, using a pleonasm, are sometimes called strongly nonlinear problems. They are illustrated at the bottom-right of figure 5.32.

In this case, even if the a priori probability is simple, the a posteriori probability can be quite complicated. For instance, it can be multimodal. Such problems are, in general, quite complex to solve, and only the Monte Carlo methods described in the previous chapter are sufficiently general to handle them.

If full Monte Carlo methods cannot be used, because they are too expensive, then one can mix some random part (for instance, to choose the starting point) and some deterministic part. The optimization methods applicable to quasi-linear problems can, for instance, allow us to go from the randomly chosen starting point to the "nearest" optimal point (note: explain this better). Repeating these computations for different starting points, one can arrive at a good idea of the a posteriori probability in the model space.

5.4.3.11.3 The Maximum Likelihood Model The most likely model is, by definition, that at which the volumetric probability σβ(m) attains its maximum. As σβ(m) is maximum when S(m) is minimum, we see that the most likely model is also the 'best model' obtained when using a 'least squares criterion'. Should we have used the double exponential model for all the uncertainties, then the most likely model would be defined by a 'least absolute values' criterion.

There are many circumstances where the most likely model is not an interesting model. One trivial example is when the volumetric probability has a 'narrow maximum', with small total probability (see figure 5.33). A much less trivial situation arises when the number of parameters is very large, as for instance when we deal with a random function (that, in all rigor, corresponds to an infinite number of random variables). Figure XXX, for instance, shows a few realizations of a Gaussian function with zero mean and an (approximately) exponential correlation. The most likely function is the center of the Gaussian, i.e., the null function shown at the left. But this is not a representative sample (specimen) of the probability distribution, as any realization of the probability distribution will have, with a probability very close to one, the 'oscillating' characteristics of the three samples shown at the right.

Figure 5.33: One of the circumstances where the 'maximum likelihood model' may not be very interesting is when it corresponds to a narrow maximum, with small total probability, as the peak at the left of this probability distribution. [Plot over the interval −40 to 40, vertical scale 0 to 1.]

5.4.3.11.4 The Interpretation of 'The Least Squares Solution' Note: explain here that when working with a large number of dimensions, the center of a Gaussian is a bad representative of the possible realizations of the Gaussian.

Mention somewhere that mpost is not the 'posterior model', but the center of the a posteriori Gaussian, and explain that for multidimensional problems, the center of a Gaussian is not representative of a random realization of the Gaussian.

[note: Mention somewhere that one should not compute the inverse of the matrices, but solve the associated linear system.]

Figure 5.34: At the right, three random realizations of a Gaussian random function with zero mean and (approximately) exponential correlation function. The most likely function, i.e., the center of the Gaussian, is shown at the left. We see that the most likely function is not representative of the probability distribution.


5.5 OTHER APPENDICES

5.5.1 Determinant of a Partitioned Matrix

Using well known properties of matrix algebra (e.g., Lütkepohl, 1996), the determinant of a partitioned matrix can be expressed as

\det \begin{pmatrix} g_{rr} & g_{rs} \\ g_{sr} & g_{ss} \end{pmatrix} \;=\; \det g_{rr}\; \det\big(\, g_{ss} - g_{sr}\, g_{rr}^{-1}\, g_{rs} \,\big) \;. \qquad (5.318)


5.5.2 Operational Definitions can not be Infinitely Accurate

Note: refer here to figure 5.35, and explain that "the length" of a real object (as opposed to a mathematically defined object) can only be defined by specifying the measuring instrument. There are different notions of length associated to a given object. For instance, figure 5.35 suggests that the length of a piece of wood is larger when defined by the use of a calliper[34] than when defined by the use of a ruler[35], because a calliper tends to measure the distance between extremal points, while an observer using a ruler tends to average the rugosities at the wood ends.

Figure 5.35: Different definitions of the length of an object.

[34] Calliper: an instrument for measuring diameters (as of logs or trees) consisting of a graduated beam and at right angles to it a fixed arm and a movable arm. From the Digital Webster.

[35] Ruler: a smooth-edged strip (as of wood or metal) that is usu. marked off in units (as inches) and is used as a straightedge or for measuring. From the Digital Webster.


5.5.3 The Ideal Output of a Measuring Instrument

Note: mention here figures 5.36 and 5.37.

Figure 5.36: Instrument built to measure the pitches of musical notes. Due to unavoidable measuring noises, a measurement is never infinitely accurate. Figure 5.37 suggests an ideal instrument output. [Diagram: a sensor and a measuring system, affected by environmental noise and instrument noise, producing the instrument output.]


Figure 5.37: The ideal output of a measuring instrument (in this example, measuring frequencies-periods). The curve in the middle corresponds to the volumetric probability describing the information brought by the measurement (on 'the measurand'). Five different scales are shown (in a real instrument, the user would just select one of the scales). Here, the logarithmic scales correspond to the natural logarithms that a physicist should prefer, but engineers could select scales using decimal logarithms. Note that all the scales are 'linear' (with respect to the natural distance in the frequency-period space [see section XXX]): I do not recommend the use of a scale where the frequencies (or the periods) would 'look linear'. [Scales shown: frequency ν , period T , and the logarithmic coordinates ν* = log(ν/ν0) and T* = log(T/T0) , with ν0 = 1/T0 = 1 Hz ; the displayed example is centered at ν = 440 Hz (T ≈ 2.27 10^{-3} s), with radius (standard deviation) σ = 0.12 .]


5.5.4 Measurements

5.5.4.1 Output as Conditional Probability Density

As suggested by figure 5.38, a 'measuring instrument' is specified when the conditional volumetric probability f (y|x) for the output y , given the input x , is given.

Figure 5.38: The input (or measurand) and the output of a measuring instrument. The output is never an actual value, but a probability distribution, in fact, a conditional volumetric probability f (y|x) for the output y , given the input x .

5.5.4.2 A Little Bit of Theory

We want to measure a given property of an object, say the quantity x . Assume that the object has been randomly selected from a set of objects, so that the 'prior' probability for the quantity x is fx(x) .

Then, the conditional... Then, the Bayes theorem...

5.5.4.3 Example: Instrument Specification

[Note: This example is to be put somewhere, I don't know yet where.] It is unfortunate that ordinary measuring instruments tend to just display some 'observed value', the 'measurement uncertainty' tending to be hidden inside some written documentation. Awaiting the day when measuring instruments directly display a probability distribution for the measurand, let us contemplate the simple situation where the maker of an instrument, say a frequencymeter, writes something like the following.

This frequencymeter can operate, with high accuracy, in the range 10^2 Hz < ν < 10^9 Hz . When very far from this range, one may face uncontrollable uncertainties. Inside (or close to) this range, the measurement uncertainty is, with a good approximation, independent of the value of the measured frequency. When the instrument displays the value ν0 , this means that the (1D) volumetric probability for the measurand is

f(\nu) \;=\; \left\{ \begin{array}{ll} 0 & \mbox{if}\ \log(\nu/\nu_0) \le -\sigma \\ \dfrac{2}{9\sigma^2}\, \big(\, 2\sigma - \log(\nu/\nu_0) \,\big) & \mbox{if}\ -\sigma < \log(\nu/\nu_0) < +2\sigma \\ 0 & \mbox{if}\ +2\sigma \le \log(\nu/\nu_0) \end{array} \right. \qquad (5.319)

where σ = 10^{-4} . This volumetric probability is displayed at the top of figure 5.39. Using the logarithmic frequency as coordinate, this is an asymmetric triangle.
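A small Python sketch of this specification, evaluating the triangle 5.319 and checking that it integrates to one in the coordinate u = log(ν/ν0):

    import numpy as np

    def f_measurand(nu, nu0, sigma=1.0e-4):
        u = np.log(nu / nu0)                                   # logarithmic frequency coordinate
        inside = (u > -sigma) & (u < 2.0 * sigma)
        return np.where(inside, (2.0 / (9.0 * sigma**2)) * (2.0 * sigma - u), 0.0)   # equation 5.319

    nu0 = 1.0e6
    u = np.linspace(-2.0e-4, 4.0e-4, 200_001)
    du = u[1] - u[0]
    print(np.sum(f_measurand(nu0 * np.exp(u), nu0)) * du)      # ~ 1.0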

Figure 5.39: Figure for 'instrument specification'. Note: write this caption. [The plot uses the logarithmic coordinate ν* = log10( ν/K ) with K = 1 Hz ; the example shown corresponds to ν0 = 1.0000 10^6 Hz (ν0* = 6.0000) and σ = 10^{-4} , with marks at ν0* − σ = 5.9999 and ν0* + 2σ = 6.0002 .]


5.5.4.4 Measurements and Experimental Uncertainties

Observation of geophysical phenomena is represented by a set of parameters d that we usually call data. These parameters result from prior measurement operations, and they are typically seismic vibrations on the instrument site, arrival times of seismic phases, gravity or electromagnetic fields. As in any measurement, the data are determined with an associated uncertainty, described with a volumetric probability over the data parameter space, that we denote here ρd(d). This density describes not only the marginals for the individual datum values, but also possible cross-relations in the data uncertainties.

Although the instrumental errors are an important source of data uncertainties, in geophysical measurements there are other sources of uncertainty. The errors associated with the positioning of the instruments, the environmental noise, and the human appreciation (like for picking arrival times) are also relevant sources of uncertainty.

Example 5.38 Non-analytic volumetric probability. Assume that we wish to measure the time t of occurrence of some physical event. It is often assumed that the result of a measurement corresponds to something like

t = t0 ±σ . (5.320)

An obvious question is the exact meaning of the ±σ . Has the experimenter in mind that she/he is absolutely certain that the actual arrival time satisfies the strict conditions t0 − σ ≤ t ≤ t0 + σ , or has she/he in mind something like a Gaussian probability, or some other probability distribution (see figure 5.40)? We accept, following ISO's recommendations (1993), that the result of any measurement has a probabilistic interpretation, with some sources of uncertainty being analyzed using statistical methods ('type A' uncertainties), and other sources of uncertainty being evaluated by other means (for instance, using Bayesian arguments) ('type B' uncertainties). But, contrary to ISO suggestions, we do not assume that the Gaussian model of uncertainties should play any central role. In an extreme example, we may well have measurements whose probabilistic description may correspond to a multimodal volumetric probability. Figure 5.41 shows a typical example for a seismologist: the measurement on a seismogram of the arrival time of a certain seismic wave, in the case where one hesitates in the phase identification, or in the identification of noise and signal. In this case the volumetric probability for the arrival of the seismic phase does not have an explicit expression like f (t) = k exp( −(t − t0)^2/(2σ^2) ) , but is a numerically defined function. Using, for instance, the Mathematica (registered trademark) computer language we may define the volumetric probability f (t) as

f[t_] := If[t1 < t < t2, a, If[t3 < t < t4, b, c]]   (* level a on [t1,t2], level b on [t3,t4], background c elsewhere *)

Here, a and b are the 'levels' of the two steps, and c is the 'background' volumetric probability.


Figure 5.40: What has an experimenter in mind when she/he describes the result of a measurement by something like t = t0 ± σ ? [Four panels of candidate probability distributions, each centered at t0 .]

Figure 5.41: A seismologist tries to measure the arrival time of a seismic wave at a seismic station, by 'reading' the seismogram at the top of the figure. The seismologist may find quite likely that the arrival time of the wave is between times t3 and t4 , and believe that what is before t3 is just noise. But if there is a significant probability that the signal between t1 and t2 is not noise but the actual arrival of the wave, then the seismologist should define a bimodal volumetric probability, as the one suggested at the bottom of the figure. Typically, the actual form of each peak of the volumetric probability is not crucial (here, box-car functions are chosen), but the position of the peaks is important. Rather than assigning a zero volumetric probability to the zones outside the two intervals, it is safer (more 'robust') to attribute some small 'background' value, as we may never exclude some unexpected source of error. [Top panel: signal amplitude versus time, with the intervals (t1, t2) and (t3, t4) marked. Bottom panel: the corresponding probability density versus time.]


Example 5.39 The Gaussian model for uncertainties. The simplest probabilistic model that can be used to describe experimental uncertainties is the Gaussian model

\rho_D(d) \;=\; k\, \exp\!\Big( -\tfrac{1}{2}\, (d - d_{\rm obs})^T\, C_D^{-1}\, (d - d_{\rm obs}) \Big) \;. \qquad (5.321)

It is here assumed that we have some 'observed data values' dobs , with uncertainties described by the covariance matrix CD . If the uncertainties are uncorrelated,

\rho_D(d) \;=\; k\, \exp\!\Big( -\tfrac{1}{2}\, \sum_i \Big( \frac{d^i - d^i_{\rm obs}}{\sigma^i} \Big)^{\!2} \Big) \;, \qquad (5.322)

where the σ i are the ‘standard deviations’.

Example 5.40 The Generalized Gaussian model for uncertainties. An alternative to the Gaussian model is to use the Laplacian (double exponential) model for uncertainties,

\rho_D(d) \;=\; k\, \exp\!\Big( -\sum_i \frac{|d^i - d^i_{\rm obs}|}{\sigma^i} \Big) \;. \qquad (5.323)

While the Gaussian model leads to least-squares related methods, this Laplacian model leads to absolute-values methods (see section ??), well known for producing robust[36] results. More generally, there is the Lp model of uncertainties

\rho_p(d) \;=\; k\, \exp\!\Big( -\frac{1}{p} \sum_i \frac{|d^i - d^i_{\rm obs}|^p}{(\sigma_p)^p} \Big) \qquad (5.324)

(see figure 5.42).
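A one-line Python version of the Lp density 5.324, with a few illustrative values (p = 2 reproduces the Gaussian shape and p = 1 the Laplacian shape of figure 5.42); all numbers are made up for the example.

    import numpy as np

    def rho_p(d, d_obs, sigma_p, p, k=1.0):
        return k * np.exp(-np.sum(np.abs(d - d_obs)**p / (p * sigma_p**p)))   # equation 5.324

    d = np.array([1.0, 2.0, 3.0])
    d_obs = np.array([1.1, 1.8, 3.3])
    sigma_p = np.array([0.2, 0.2, 0.5])
    for p in (1.0, np.sqrt(2.0), 2.0, 4.0, 8.0):
        print(p, rho_p(d, d_obs, sigma_p, p))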

Figure 5.42: Generalized Gaussian for values of the parameter p = 1, √2, 2, 4, 8 and ∞ . [Six panels, each plotted on the interval −6 ≤ x ≤ 6 with vertical scale 0 to 0.5.]

[36] A numerical method is called robust if it is not sensitive to a small number of large errors.


5.5.5 The ‘Shipwrecked Person’ Problem

Note: this example is to be developed. For the time being this is just a copy of example 3.15.

Let S represent the surface of the Earth, using geographical coordinates (longitude ϕ and latitude λ ). An estimation of the position of a floating object at the surface of the sea by an airplane navigator gives a probability distribution for the position of the object corresponding to the (2D) volumetric probability f (ϕ, λ) , and an independent, simultaneous estimation of the position by another airplane navigator gives a probability distribution corresponding to the volumetric probability g(ϕ, λ) . How should the two volumetric probabilities f (ϕ, λ) and g(ϕ, λ) be 'combined' to obtain a 'resulting' volumetric probability? The answer is given by the 'product' of the two volumetric probabilities:

(f \cdot g)(\varphi,\lambda) \;=\; \frac{f(\varphi,\lambda)\, g(\varphi,\lambda)}{\int_S dS(\varphi,\lambda)\; f(\varphi,\lambda)\, g(\varphi,\lambda)} \;. \qquad (5.325)
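As an illustration, here is a Python sketch of this combination on a regular longitude-latitude grid, where the surface element is dS = cos λ dϕ dλ (unit radius; the constant factor cancels). The uniform test distributions are placeholders.

    import numpy as np

    def combine(f, g, lats, lons):
        dphi = lons[1] - lons[0]
        dlam = lats[1] - lats[0]
        dS = np.cos(lats)[:, None] * dphi * dlam          # surface element of each grid cell
        product = f * g
        return product / np.sum(product * dS)             # equation 5.325

    n_lat, n_lon = 181, 360
    lats = np.deg2rad(np.linspace(-90.0, 90.0, n_lat))
    lons = np.deg2rad(np.arange(n_lon, dtype=float))
    f = np.ones((n_lat, n_lon))                           # placeholder estimations f and g
    g = np.ones((n_lat, n_lon))
    fg = combine(f, g, lats, lons)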


5.5.6 Problems Solved Using Conditional Probabilities

Note: Say here that we consider two problems: (i) the Bayes theorem and (ii) adjusting measurements.

These two problems are mathematically very similar, and are essentially solved using either the notion of 'conditional probability' or the notion of 'product of probabilities' (see chapter ??).

Note: what follows comes from an old text:
A so-called ‘inverse problem’ usually consists in a sort of quite complex measurement, sometimes a gigantic measurement, involving years of observations and thousands of instruments. Any measurement is indirect (we may weigh a mass by observing the displacement of the cursor of a balance), and as such, a possibly nontrivial analysis of uncertainties must be done.

Any good guide describing good experimental practice (see, for instance, ISO’s Guide to the expression of uncertainty in measurement [ISO, 1993] or the shorter description by Taylor and Kuyatt, 1994) acknowledges that any measurement involves, at least, two different sources of uncertainties: those that we estimate using statistical methods, and those that we estimate using subjective, common sense estimations. Both are described using the axioms of probability theory, and this article clearly takes the probabilistic point of view for developing inverse theory.


5.5.6.1 First Example (Bayes Theorem)

Figure 5.43: Scan


Figure 5.44: The columns of this drawing represent, for each value of the quantity x, the conditional f_{y|x}(y|x).



Figure 5.45: If the marginal f_x(x) is also known, then we can, first, evaluate the joint probability density f(x, y) = f_{y|x}(y|x) f_x(x), then the other marginal f_y(y) = ∫ dx f(x, y) = ∫ dx f_{y|x}(y|x) f_x(x).


Figure 5.46: The conditional we were seeking, f_{x|y}(x|y), can now be obtained as f_{x|y}(x|y) = f(x, y) / f_y(y) = f_{y|x}(y|x) f_x(x) / f_y(y) = f_{y|x}(y|x) f_x(x) / ∫ dx f_{y|x}(y|x) f_x(x). The rows of this drawing represent, for each value of the quantity y, the conditional f_{x|y}(x|y).

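The chain of figures 5.44–5.46 is just Bayes’ theorem written with densities; the following Python sketch, built on an invented discretized example (not one from the text), goes through the same three steps on a grid: start from f_{y|x}(y|x) and f_x(x), build the joint and the marginal f_y(y), and recover f_{x|y}(x|y).

```python
import numpy as np

x = np.linspace(0.0, 1.0, 201)
y = np.linspace(0.0, 1.0, 201)

# Assumed ingredients: a conditional f(y|x) (Gaussian in y, centred at x)
# and a marginal f(x) (another Gaussian), both invented for illustration.
f_y_given_x = np.exp(-((y[None, :] - x[:, None]) ** 2) / (2 * 0.05 ** 2))
f_y_given_x /= np.trapz(f_y_given_x, y, axis=1)[:, None]   # normalize in y for each fixed x
f_x = np.exp(-((x - 0.3) ** 2) / (2 * 0.1 ** 2))
f_x /= np.trapz(f_x, x)

joint = f_y_given_x * f_x[:, None]            # f(x, y) = f(y|x) f(x)      (figure 5.45)
f_y = np.trapz(joint, x, axis=0)              # f(y) = integral dx f(x, y)
f_x_given_y = joint / f_y[None, :]            # f(x|y) = f(x, y) / f(y)    (figure 5.46)

# Each fixed-y slice of f_x_given_y is itself a normalized density in x:
print(np.allclose(np.trapz(f_x_given_y, x, axis=0), 1.0, atol=1e-3))
```

The last line checks that, for every value of y, the reconstructed conditional integrates to 1 over x.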


5.5.6.2 Second Example

Note: mention here that the problem in section ?? (chemical concentrations) has been solved using conditional probabilities.

5.5.6.3 Example: Adjusting a Measurement to a Theory

When a particle of mass m is submitted to a force F , one has

\[
F \;=\; m \,\frac{d}{dt} \frac{v}{\sqrt{1 - v^2/c^2}} \; . \tag{5.326}
\]

Assuming initial conditions of rest (at a time arbitrarily set to 0), the trajectory of the particle is

\[
x(t) \;=\; \frac{c^2}{\gamma} \left( \sqrt{1 + (\gamma t/c)^2} - 1 \right) , \tag{5.327}
\]

where
\[
\gamma \;=\; F/m \; . \tag{5.328}
\]

Note: introduce here the problem set in the caption of figure 5.47. Say, in particular, that we have a measurement whose results are represented by the volumetric probability f(t, x).

The problem here is clearly a problem of conditional probability, and it makes sense because we do have a metric over our 2D space: from the expression of the distance element ds² = dt² − dx²/c² follows the Minkowski metric
\[
\begin{pmatrix} g_{tt} & g_{tx} \\ g_{xt} & g_{xx} \end{pmatrix}
\;=\;
\begin{pmatrix} 1 & 0 \\ 0 & -1/c^2 \end{pmatrix} \; . \tag{5.329}
\]

When taking the conditional volumetric probability of f(t, x) given the expression x = x(t) in equation 5.327, we simply obtain (see equation 3.188)

\[
g_t(t) \;=\; \frac{1}{\nu}\, f(\,t\,,\,x(t)\,) \; , \tag{5.330}
\]

where ν is the normalization factor

\[
\nu \;=\; \int_{-\infty}^{+\infty} ds_t(t)\; f(\,t\,,\,x(t)\,) \; . \tag{5.331}
\]

The probability of a time interval is computed (see equation 3.189) via

\[
P(t_1 < t < t_2) \;=\; \int_{t_1}^{t_2} ds_t(t)\; g_t(t) \; . \tag{5.332}
\]


Figure 5.47: In the space-time of special relativity, we have measured the space-time coordinates of an event, and obtained the volumetric probability f(t, x) displayed in the figure at the top. We then learn that that event happened on the trajectory of a particle with mass m submitted to a constant force F (equation 5.327). This trajectory is represented in the figure at the middle. It is clear that, thanks to the theory, we can improve the knowledge of the coordinates of the event, by considering the conditional volumetric probability induced on the trajectory. See text for details. To scale the axes of this drawing, the quantities T = c/γ and X = c²/γ have been introduced.

[Figure 5.47 panels: three (t, x) diagrams, with the t axis graduated in units of T and the x axis in units of X.]

Figure 5.48: The length element induced by the two-dimensional metric over the one-dimensional manifold where the conditional probability distribution is defined.

[Figure 5.48 panel: the (t, x) plane, with ds = dt along the t axis, ds = dx/c along the x axis, and ds_t(t) = dt / √(1 + (γt/c)²) along the trajectory.]


The length element ds_t(t) is the length induced over the line x = x(t) by the two-dimensional Minkowski metric (figure 5.48). We can evaluate it using equations 3.191–3.192. Here, we obtain

\[
ds_t(t) \;=\; \sqrt{\, g_{tt} + x'(t)\, g_{xx}\, x'(t) \,}\; dt \; , \tag{5.333}
\]

and this gives

\[
ds_t(t) \;=\; \frac{dt}{\sqrt{1 + (\gamma t/c)^2}} \; . \tag{5.334}
\]
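For completeness, here is the short computation behind equation 5.334 (it is not spelled out in the text); it only uses the trajectory 5.327 and the metric 5.329:
\[
x'(t) \;=\; \frac{\gamma t}{\sqrt{1+(\gamma t/c)^2}} \; ,
\qquad\text{so}\qquad
g_{tt} + x'(t)\, g_{xx}\, x'(t)
\;=\; 1 - \frac{1}{c^2}\,\frac{(\gamma t)^2}{1+(\gamma t/c)^2}
\;=\; \frac{1}{1+(\gamma t/c)^2} \; ,
\]
and taking the square root in equation 5.333 gives equation 5.334.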

Equations 5.331–5.332 can now be written, explicitly,

\[
\nu \;=\; \int_{-\infty}^{+\infty} dt\; \frac{ f(\,t\,,\,x(t)\,) }{ \sqrt{1+(\gamma t/c)^2} } \tag{5.335}
\]

and

\[
P(t_1 < t < t_2) \;=\; \int_{t_1}^{t_2} dt\; \frac{ g_t(t) }{ \sqrt{1+(\gamma t/c)^2} } \; . \tag{5.336}
\]

The three equations 5.330, 5.335, and 5.336 solve our problem as far as the variable t is concerned.
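As an illustration (with an invented measurement, not one from the text), the following Python sketch evaluates equations 5.330, 5.335, and 5.336 numerically, taking for f(t, x) an isotropic Gaussian near the trajectory and setting c = γ = 1, so that T = X = 1.

```python
import numpy as np

c = gamma = 1.0                                # so T = c/gamma = 1 and X = c**2/gamma = 1

def x_of_t(t):
    """Trajectory of equation 5.327."""
    return (c**2 / gamma) * (np.sqrt(1.0 + (gamma * t / c) ** 2) - 1.0)

def f(t, x):
    """Assumed measurement result: a Gaussian volumetric probability around (2T, X)."""
    return np.exp(-((t - 2.0) ** 2 + (x - 1.0) ** 2) / (2 * 0.3 ** 2))

t = np.linspace(0.0, 6.0, 6001)
ds_dt = 1.0 / np.sqrt(1.0 + (gamma * t / c) ** 2)      # equation 5.334

nu = np.trapz(f(t, x_of_t(t)) * ds_dt, t)              # equation 5.335
g_t = f(t, x_of_t(t)) / nu                             # equation 5.330

mask = (t > 1.5) & (t < 2.5)
P = np.trapz(g_t[mask] * ds_dt[mask], t[mask])         # equation 5.336
print(f"P(1.5 T < t < 2.5 T) = {P:.3f}")
```

The same script, with t(x) taken from equation 5.337, gives the distribution g_x(x) of equations 5.338–5.340 discussed next.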

We may, instead, be primarily interested in the variable x. We have two equivalent procedures:

1. we can do exactly what we have just done, but starting with the trajectory 5.327 not written as x = x(t) but as t = t(x),

\[
t \;=\; \sqrt{\, 2x/\gamma + (x/c)^2 \,} \; ; \tag{5.337}
\]

2. we can take the results just obtained and make the change of variables t ↦ x (using equation 5.337).

The computations are left as an exercise to the reader. One reaches the conclusion that the information on the position x is represented by the volumetric probability

\[
g_x(x) \;=\; \frac{1}{\mu}\, f(\,t(x)\,,\,x\,) \; , \tag{5.338}
\]

where

\[
\mu \;=\; \int_{-\infty}^{+\infty} dx\; \frac{ f(\,t(x)\,,\,x\,) }{ \sqrt{\, 2\gamma x + (\gamma x/c)^2 \,} } \; , \tag{5.339}
\]

and the probability of an interval is computed via

\[
P(x_1 < x < x_2) \;=\; \int_{x_1}^{x_2} dx\; \frac{ g_x(x) }{ \sqrt{\, 2\gamma x + (\gamma x/c)^2 \,} } \; . \tag{5.340}
\]


It is important to realize that a consistent formulation of this problem has only been possible because in the space t, x we have a metric (the Minkowski metric). Note that the question raised here still makes perfect sense in Galilean (nonrelativistic) physics, where the trajectory 5.327 degenerates into its nonrelativistic limit

\[
x(t) \;=\; \tfrac{1}{2}\, \gamma\, t^2 \; . \tag{5.341}
\]

Taking the limit c → ∞ in all the equations above gives valid equations, but these equations correspond to using in the space t, x the degenerate metric
\[
ds^2 \;=\; dt^2 \; , \tag{5.342}
\]

i.e., a degenerate metric where only time distances matter. From a strict Galilean point of view, this metric is arbitrary, and the problem is that any other metric in the space t, x may be just as arbitrary. This implies that, unless one has an ad-hoc reason for selecting a particular metric in the space t, x, this simple problem cannot be solved consistently in a Galilean framework.


5.5.7 Parameters

To describe a physical system (a planet, an elastic sample, etc.) we use physical quantities (temperature and mass density at some given points, total mass, etc.). I examine here the situation where the quantities take real values (i.e., I do not try to consider the case where the quantities take integer or complex values). The real values may have a physical dimension, length, mass, etc.

We will see here that there is one very important type of quantity (called below the ‘Jeffreys quantities’), and three other, marginal, types of quantities.


5.5.7.1 Jeffreys Quantities

5.5.7.2 Definition

Let us examine ‘positive parameters’, like a temperature, a period, etc. One of the properties of the parameters we have in mind is that they occur in pairs of mutually reciprocal parameters:

Period T = 1/ν ; Frequency ν = 1/T
Resistivity ρ = 1/σ ; Conductivity σ = 1/ρ
Temperature T = 1/(kβ) ; Thermodynamic parameter β = 1/(kT)
Mass density ρ = 1/ℓ ; Lightness ℓ = 1/ρ
Compressibility γ = 1/κ ; Bulk modulus (uncompressibility) κ = 1/γ .

When physical theories are elaborated, one may freely choose one of these parameters or its reciprocal.

Sometimes these pairs of equivalent parameters come from a definition, like when we define frequency ν as a function of the period T, by ν = 1/T. Sometimes these parameters arise when analyzing an idealized physical system. For instance, Hooke’s law, relating stress σ_ij to strain ε_ij, can be expressed as σ_ij = c_ij^kℓ ε_kℓ, thus introducing the stiffness tensor c_ijkℓ, or as ε_ij = d_ij^kℓ σ_kℓ, thus introducing the compliance tensor d_ijkℓ, inverse of the stiffness tensor. Then the respective eigenvalues of these two tensors belong to the class of scalars analyzed here.

Let us take, as an example, the pair conductivity–resistivity (this may be thermal, electric, etc.). Assume we have two samples in the laboratory, S1 and S2, whose resistivities are respectively ρ1 and ρ2. Correspondingly, their conductivities are σ1 = 1/ρ1 and σ2 = 1/ρ2. How should we define the ‘distance’ between the two samples? As we have |ρ2 − ρ1| ≠ |σ2 − σ1|, choosing one of the two expressions as the ‘distance’ would be arbitrary. Consider the following definition of ‘distance’ between the two samples

\[
D(S_1, S_2) \;=\; \left|\, \log\frac{\rho_2}{\rho_1} \,\right| \;=\; \left|\, \log\frac{\sigma_2}{\sigma_1} \,\right| \; . \tag{5.343}
\]

This definition (i) treats symmetrically the two equivalent parameters ρ and σ and, more importantly, (ii) has an invariance of scale (what matters is how many ‘octaves’ we have between the two values, not the plain difference between the values). In fact, it is the only ‘sensible’ definition of distance between the two samples S1 and S2.
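A two-line numerical check of equation 5.343 (the resistivity values below are arbitrary illustrative numbers): the logarithmic distance computed from resistivities and from conductivities coincides, and it is unchanged if both samples are rescaled by a common factor.

```python
import math

rho1, rho2 = 3.0, 48.0                     # assumed resistivities of the two samples
sigma1, sigma2 = 1.0 / rho1, 1.0 / rho2

D_rho = abs(math.log(rho2 / rho1))         # distance from resistivities (equation 5.343)
D_sigma = abs(math.log(sigma2 / sigma1))   # the same distance from conductivities
D_scaled = abs(math.log((10 * rho2) / (10 * rho1)))   # scale invariance

print(D_rho, D_sigma, D_scaled)            # three identical numbers: log(16) = 2.7725...
```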

Note: this is an old text. Associated to the distance D = | log(x2/x1) | is the distance element

ds = dx/x . (5.344)

Defining the reciprocal parameter y = 1/x, the same distance D now becomes D = | log(y2/y1) | and we have the distance element

ds = dy/y . (5.345)


Introducing the logarithmic parameters

x∗ = log(x/x0) ; y∗ = log(y/y0) , (5.346)

where x0 and y0 are arbitrary positive constants, leads to D = |x*2 − x*1| = |y*2 − y*1|, and to the distance elements

ds = dx∗ ; ds = dy∗ . (5.347)

Note: I have to explain here that, for all four parameters, the homogeneous volumetric probability is a constant (that I arbitrarily take equal to one)

fx(x) = 1 ; fy(y) = 1 ; fx∗(x∗) = 1 ; fy∗(y∗) = 1 . (5.348)

Should one, for some reason, choose to work with probability densities, then we convert volumetric probabilities into probability densities using equation ?? (page ??). We then see that the same homogeneous probability distribution is represented by the following homogeneous probability densities:

\[
\bar{f}_x(x) = 1/x \; ; \quad \bar{f}_y(y) = 1/y \; ; \quad \bar{f}_{x^*}(x^*) = 1 \; ; \quad \bar{f}_{y^*}(y^*) = 1 \; . \tag{5.349}
\]

One should note that the homogeneous probability density for a Jeffreys parameterx is 1/x .
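The following Python sketch illustrates this last statement numerically: sampling the logarithmic parameter x* homogeneously (uniformly) and mapping back to x = exp(x*) produces a histogram whose density is proportional to 1/x. The interval chosen here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = rng.uniform(np.log(1.0), np.log(100.0), size=1_000_000)  # homogeneous in x*
x = np.exp(x_star)                                                # the Jeffreys parameter

counts, edges = np.histogram(x, bins=np.geomspace(1.0, 100.0, 41), density=True)
centers = np.sqrt(edges[:-1] * edges[1:])
# density * x should be (approximately) constant, as expected for f(x) = k / x
print(np.round(counts * centers, 3))
```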

The association of the probability density f(x) = k/x to positive parameters was first made by Jeffreys (1939). To honor him, we propose to use the term Jeffreys parameters for all the parameters of the type considered above. The 1/x probability density was advocated by Jaynes (1968), and a nontrivial use of it was made by Rietsch (1977), in the context of inverse problems.

If we have a Jeffreys parameter x, we know that the distance element is ds = dx/x. Defining y = x^k, i.e., some power of the parameter, leads to ds = (1/k) dy/y. This is, up to a multiplicative constant, the same expression. Therefore, if a parameter x is a Jeffreys parameter, then its inverse, its square, and, in general, any power of the parameter is also a Jeffreys parameter.
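The one-line computation behind this statement (not written out in the text):
\[
y = x^k \quad\Longrightarrow\quad \frac{dy}{y} \;=\; \frac{k\,x^{k-1}\,dx}{x^k} \;=\; k\,\frac{dx}{x} \; ,
\qquad\text{i.e.}\qquad ds \;=\; \frac{dx}{x} \;=\; \frac{1}{k}\,\frac{dy}{y} \; .
\]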

It is important to recognize when we do not face a Jeffreys parameter. Among the many parameters used in the literature to describe an isotropic linear elastic medium we find parameters like the Lamé coefficients λ and µ, the bulk modulus κ, the Poisson ratio σ, etc. A simple inspection of the theoretical range of variation of these parameters shows that the first Lamé parameter λ and the Poisson ratio σ may take negative values, so they are certainly not Jeffreys parameters. In contrast, Hooke’s law σ_ij = c_ijkℓ ε^kℓ, defining a linearity between stress σ_ij and strain ε_ij, defines the positive definite stiffness tensor c_ijkℓ or, if we write ε_ij = d_ijkℓ σ^kℓ, defines its inverse, the compliance tensor d_ijkℓ. The two reciprocal tensors c_ijkℓ and d_ijkℓ are ‘Jeffreys tensors’. This is a notion that would take too long to develop here, but we can give the following rule: The eigenvalues of a Jeffreys tensor are Jeffreys quantities.


Note: This solves the complete problem for isotropic tensors only. I have to mention here the rules valid for general anisotropic tensors.

As the two (different) eigenvalues of the stiffness tensor c_ijkℓ are λκ = 3κ (with multiplicity 1) and λµ = 2µ (with multiplicity 5), we see that the uncompressibility modulus κ and the shear modulus µ are Jeffreys parameters^{37} (as is any parameter proportional to them, or any power of them, including the inverses). If, for some reason, instead of working with κ and µ, we wish to work with other elastic parameters, like for instance the Young modulus Y and the Poisson ratio σ, then the homogeneous probability distribution must be found using the Jacobian of the transformation between (Y, σ) and (κ, µ). This is done in appendix 4.1.
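The eigenvalue statement can be checked directly; the sketch below builds the 6×6 matrix of an isotropic stiffness tensor in the Kelvin (orthonormal Voigt) representation, a representation choice assumed here and not taken from the text, and verifies that its eigenvalues are 3κ (multiplicity 1) and 2µ (multiplicity 5).

```python
import numpy as np

def isotropic_stiffness_kelvin(kappa, mu):
    """6x6 Kelvin-notation matrix of c_ijkl = kappa g_ij g_kl
    + mu (g_ik g_jl + g_il g_jk - (2/3) g_ij g_kl), Euclidean metric."""
    C = np.zeros((6, 6))
    C[:3, :3] = kappa - 2.0 * mu / 3.0      # lambda = kappa - 2 mu / 3 fills the 3x3 block
    C[:3, :3] += 2.0 * mu * np.eye(3)       # plus 2 mu on its diagonal
    C[3:, 3:] = 2.0 * mu * np.eye(3)        # shear block (Kelvin scaling)
    return C

kappa, mu = 37.0, 44.0                      # arbitrary positive moduli
eigvals = np.linalg.eigvalsh(isotropic_stiffness_kelvin(kappa, mu))
print(np.round(np.sort(eigvals), 6))        # five times 2*mu = 88, once 3*kappa = 111
```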

There is a problem of terminology in the Bayesian literature. The homogeneous probability distribution is a very special distribution. When the problem of selecting a ‘prior’ probability distribution arises, in the absence of any information except the fundamental symmetries of the problem, one may select as prior probability distribution the homogeneous distribution. But enthusiastic Bayesians do not call it ‘homogeneous’ but ‘noninformative’. I do not agree with this. The homogeneous probability distribution is as informative as any other distribution; it is just the homogeneous one.

In general, each time we consider an abstract parameter space, each point being represented by some parameters x = x1, x2 . . . xn, we will start by solving the (sometimes nontrivial) problem of defining a distance between points that respects the necessary symmetries of the problem. Note: continue this discussion.

5.5.7.3 Benford Law

Let us play a game. We randomly generate many real numbers x1, x2, . . . in the interval (e^{−100}, e^{+100}), with a homogeneous probability distribution (in the elementary sense of ‘homogeneous’ for real numbers). Then we compute the positive quantities

\[
X_1 = e^{x_1} \; ; \quad X_2 = e^{x_2} \; \ldots \; , \tag{5.350}
\]

and write these numbers in the common way, i.e., using the base ten numbering system. The first digit of these numbers may then be 1, 2, 3, 4, 5, 6, 7, 8, or 9. What is the frequency of each of the nine digits? The answer is (note: explain here why): the frequency with which the digit n appears as first digit is

\[
p_n \;=\; \log_{10} \frac{n+1}{n} \; . \tag{5.351}
\]

^{37} The definition of the elastic constants was made before the tensorial structure of the theory was understood. Seismologists, today, should never introduce, at a theoretical level, parameters like the first Lamé coefficient λ or the Poisson ratio. Instead they should use κ and µ (and their inverses). In fact, my suggestion is to use the true eigenvalues of the stiffness tensor, λκ = 3κ and λµ = 2µ, that I propose to call the eigen-bulk-modulus and the eigen-shear-modulus.


This means that:

30.1% of the times the first digit is 1
17.6% of the times the first digit is 2
12.5% of the times the first digit is 3
9.7% of the times the first digit is 4
7.9% of the times the first digit is 5
6.7% of the times the first digit is 6
5.8% of the times the first digit is 7
5.1% of the times the first digit is 8
4.6% of the times the first digit is 9          (5.352)
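A short Python check of the frequencies 5.352, following the construction of figure 5.49 (base 10, an arbitrarily chosen interval) rather than the e-based one of equation 5.350: sample x homogeneously, form X = 10^x, and count first digits.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-100.0, 100.0, size=1_000_000)   # homogeneous on an (arbitrary) interval
X = 10.0 ** x                                    # these exhibit the Benford effect

first_digit = (X / 10.0 ** np.floor(np.log10(X))).astype(int)   # leading digit, 1..9
observed = np.bincount(first_digit, minlength=10)[1:] / X.size
benford = np.log10((np.arange(1, 10) + 1) / np.arange(1, 10))   # equation 5.351

for n, (o, b) in enumerate(zip(observed, benford), start=1):
    print(f"digit {n}:  observed {o:.3f}   Benford {b:.3f}")
```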

Note: mention here figure 5.49.

Figure 5.49: Generate points, uniformly at random, ‘on the real axis’ (left of the figure). The values x1, x2 . . . will not have any special property, but the quantities X1 = 10^{x1}, X2 = 10^{x2} . . . will present the Benford effect: as the figure suggests, the intervals 0.1–0.2, 1–2, 10–20, etc. are longer (so have greater probability) than the intervals 0.2–0.3, 2–3, 20–30, etc., and so on. It is easy to see that the probability that the first digit of the coordinate X equals n is p_n = log10 (n + 1)/n (Benford law).

[Figure 5.49 panel: the x axis (x = log10 X, graduated from −1 to 2) mapped onto the X axis (X = 10^x, graduated from 0.1 to 100).]

Note: explain that this is independent of the exponentiation we make in equation 5.350. We could, for instance, have defined

\[
X_1 = 10^{x_1} \; ; \quad X_2 = 10^{x_2} \; \ldots \; . \tag{5.353}
\]

Note: explain that if, instead of writing the numbers X1, X2, . . . using base 10, we use a base b, the first digit of these numbers may then be 1, 2, . . . , (b − 1). Then, the frequency with which the digit n appears as first digit is

\[
p_n \;=\; \log_b \frac{n+1}{n} \; . \tag{5.354}
\]

Note: explain here that Jeffreys quantities exhibit the Benford effect (they tend to start with ones or twos).

5.5.7.4 Examples of the Benford Effect

5.5.7.4.1 First Digit of the Fundamental Physical Constants Note: mention here figure 5.50, and explain. Say that the negative numbers of the table are ‘false negatives’. Figure 5.52: statistics of surfaces and populations of States and Islands.


Figure 5.50: Statistics of the first digit in the table of Fundamental Physical Constants (1998 CODATA least-squares adjustment; Mohr and Taylor, 2001). I have indiscriminately taken all the constants of the table (263 in total). The ‘model’ corresponds to the prediction that the relative frequency of digit n in a base K system of numeration is log_K (n + 1)/n. Here, K = 10.

[Figure 5.50 panel: histogram ‘First digit of the Fundamental Physical Constants (1998 CODATA least-squares adjustment)’; horizontal axis: digits 1 to 9; vertical axis: frequency, 0 to 80; actual statistics compared with the model.]

5.5.7.4.2 First Digit of Territories and Islands Note: mention here figures 5.51 and 5.52.

Figure 5.51: The beginning of the list of the States, Territories and Principal Islands of the World, in the Times Atlas of the World (Times Books, 1983), with the first digit of the surfaces and populations highlighted. The statistics of this first digit is shown in figure 5.52.

STATES, TERRITORIES & PRINCIPAL ISLANDS OF THE WORLD
Name [Plate] and Description | Sq. km | Sq. miles | Population
Abu Dhabi, see United Arab Emirates
Afghanistan [31] (Capital: Kabul) | 636,267 | 245,664 | 15,551,358* (1979)
Ajman, see United Arab Emirates
Åland [51] (Self-governing Island Territory of Finland) | 1,505 | 581 | 22,000 (1981)
Albania [83] (Capital: Tirana (Tiranë)) | 28,748 | 11,097 | 2,590,600 (1979)
Aleutian Islands [113] (Territory of U.S.A.) | 17,666 | 6,821 | 6,730* (1980)
Algeria [88] (Capital: Algiers (Alger)) | 2,381,745 | 919,354 | 18,250,000 (1979)
American Samoa [10] (Unincorporated Territory of U.S.A.) | 197 | 76 | 30,600 (1977)
Andorra [75] (Capital: Andorra la Vella) | 465 | 180 | 35,460 (1981)
Angola [91] (Capital: Luanda) | 1,246,700 | 481,226 | 6,920,000 (1981)


Figure 5.52: Statistics of the first digit in the table of the surfaces (both in square kilometers and square miles) and populations of the States, Territories and Principal Islands of the World, as printed in the first few pages of the Times Atlas of the World (Times Books, 1983). As for figure 5.50, the ‘model’ corresponds to the prediction that the relative frequency of digit n is log10 (n + 1)/n.

[Figure 5.52 panel: histogram ‘Surfaces and Populations of the States, Territories and Principal Islands (Times Atlas of the World)’; horizontal axis: digits 1 to 9; vertical axis: frequency, 0 to 400; actual statistics compared with the model.]


5.5.7.5 Cartesian Quantities

Note: explain here that a Cartesian quantity x has as finite distance the expression

D = |x2 − x1| . (5.355)

Note: Explain here that most of the Cartesian quantities we find in physics are the logarithms of Jeffreys quantities.


5.5.7.6 Quantities ‘[0-1]’

Note: mention here the quantities x that, like a chemical concentration, take values in the range [0, 1]. Note: explain that defining

\[
X \;=\; \frac{x}{1-x} \tag{5.356}
\]

introduces a Jeffreys quantity (with range [0, ∞] ).
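A worked consequence of this definition (not spelled out in the text): since X = x/(1 − x) is a Jeffreys quantity, its distance element dX/X pulls back to a distance element on the original [0, 1] quantity,
\[
\frac{dX}{X} \;=\; \frac{d}{dx}\!\left(\frac{x}{1-x}\right) \frac{1-x}{x}\; dx
\;=\; \frac{1}{(1-x)^2}\,\frac{1-x}{x}\, dx
\;=\; \frac{dx}{x\,(1-x)} \; ,
\]
which, by the argument used above for Jeffreys quantities, makes the homogeneous probability density over [0, 1] proportional to 1/(x(1 − x)).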


5.5.7.7 Ad-hoc Quantities

Note: mention here the ad-hoc quantities, like the Lamé parameters or the Poisson ratio, that we should not use.


Bibliography

Aki, K. and Lee, W.H.K., 1976, Determination of three-dimensional velocity anomalies under a seismic array using first P arrival times from local earthquakes, J. Geophys. Res., 81, 4381–4399.
Aki, K., Christofferson, A., and Husebye, E.S., 1977, Determination of the three-dimensional seismic structure of the lithosphere, J. Geophys. Res., 82, 277–296.
Aki, K., and Richards, P.G., 1980, Quantitative seismology (2 volumes), Freeman and Co.
Andresen, B., Hoffmann, K.H., Mosegaard, K., Nulton, J.D., Pedersen, J.M., and Salamon, P., On lumped models for thermodynamic properties of simulated annealing problems, Journal de Physique, 49, 1485–1492, 1988.
Backus, G., 1970a. Inference from inadequate and inaccurate data: I, Proceedings of the National Academy of Sciences, 65, 1, 1–105.
Backus, G., 1970b. Inference from inadequate and inaccurate data: II, Proceedings of the National Academy of Sciences, 65, 2, 281–287.
Backus, G., 1970c. Inference from inadequate and inaccurate data: III, Proceedings of the National Academy of Sciences, 67, 1, 282–289.
Backus, G., 1971. Inference from inadequate and inaccurate data, Mathematical problems in the Geophysical Sciences: Lectures in applied mathematics, 14, American Mathematical Society, Providence, Rhode Island.
Backus, G., and Gilbert, F., 1967. Numerical applications of a formalism for geophysical inverse problems, Geophys. J. R. astron. Soc., 13, 247–276.
Backus, G., and Gilbert, F., 1968. The resolving power of gross Earth data, Geophys. J. R. astron. Soc., 16, 169–205.
Backus, G., and Gilbert, F., 1970. Uniqueness in the inversion of inaccurate gross Earth data, Philos. Trans. R. Soc. London, 266, 123–192.
Bamberger, A., Chavent, G., Hemon, Ch., and Lailly, P., 1982. Inversion of normal incidence seismograms, Geophysics, 47, 757–770.
Ben-Menahem, A., and Singh, S.J., 1981. Seismic waves and sources, Springer Verlag.
Bender, C.M., and Orszag, S.A., 1978. Advanced mathematical methods for scientists and engineers, McGraw-Hill.
Borel, E., 1967, Probabilites, erreurs, 14e ed., Paris.
Borel, E. (ed.), 1924–1952, Traite du calcul des probabilites et de ses applications, 4 vols., Gauthier-Villars, Paris.
Bornholdt, S., Nordlund, U. and Westphal, H., 1999, Inverse stratigraphic modelling using genetic algorithms, in: Harbaugh, J., Watney, W.L., Rankey, E.C., Slingerland, R., Goldstein, R.H. and Franseen, E.K. (eds.), Numerical Experiments in Stratigraphy, SEPM Special Publication 62, p. 85–90.
Bourbaki, N., 1970, Elements de mathematique, Hermann.
Cantor, G., 1884, Uber unendliche, lineare Punktmannigfaltigkeiten, Arbeiten zur Mengenlehre aus dem Jahren 1872–1884, Leipzig, Teubner.
Cary, P.W., and C.H. Chapman, Automatic 1-D waveform inversion of marine seismic refraction data, Geophys. J. R. Astron. Soc., 93, 527–546, 1988.
Claerbout, J.F., 1971. Toward a unified theory of reflector mapping, Geophysics, 36, 467–481.
Claerbout, J.F., 1976. Fundamentals of Geophysical data processing, McGraw Hill.
Claerbout, J.F., 1985. Imaging the Earth's interior, Blackwell Science Publishers.
Claerbout, J.F., and Muir, F., 1973. Robust modelling with erratic data, Geophysics, 38, 5, 826–844.
Cross, T.A., and Lessenger, M.A., 1999, Construction and application of a stratigraphic inverse model, in: J.W. Harbaugh, W.L. Watney, E.C. Rankey, R. Slingerland, R.H. Goldstein, E.K. Franseen (eds.), Numerical Experiments in Stratigraphy: Recent Advances in Stratigraphic and Sedimentologic Computer Simulations, SEPM Special Publication 62, SEPM (Society for Sedimentary Geology), p. 69–83.
Dahl-Jensen, D., Mosegaard, K., Gundestrup, N., Clow, G.D., Johnsen, S.J., Hansen, A.W., and Balling, N., 1998, Past temperatures directly from the Greenland Ice Sheet, Science, Oct. 9, 268–271.
Davidon, W.C., 1959, Variable metric method for minimization, AEC Res. and Dev., Report ANL-5990 (revised).
DeGroot, M., 1970, Optimal statistical decisions, McGraw-Hill.
Devaney, A.J., 1984, Geophysical diffraction tomography, IEEE Trans. Geos. Remote Sensing, Vol. GE-22, No. 1.
Dietrich, C.F., 1991. Uncertainty, calibration and probability - the statistics of scientific and industrial measurement, Adam Hilger.
Djikpesse, H.A. and Tarantola, A., 1999, Multiparameter ℓ1 norm waveform fitting: Interpretation of Gulf of Mexico reflection seismograms, Geophysics, Vol. 64, No. 4, 1023–1035.
Enting, I.G., 2002, Inverse problems in atmospheric constituent transport, Cambridge University Press.
Evrard, G., 1995, La recherche des parametres des modeles standard de la cosmologie vue comme un probleme inverse, Doctoral thesis, Univ. Montpellier.
Evrard, G., 1996, Objective prior for cosmological parameters, Proc. of the Maximum Entropy and Bayesian Methods 1995 workshop, K. Hanson and R. Silver (eds.), Kluwer.
Evrard, G. and P. Coles, 1995. Getting the measure of the flatness problem, Classical and Quantum Gravity, Vol. 12, No. 10, pp. L93–L97.
Feller, W., An introduction to probability theory and its applications, Wiley, N.Y., 1971 (or 1970?).
Fisher, R.A., 1953, Dispersion on a sphere, Proc. R. Soc. London, A, 217, 295–305.
Fletcher, R., 1980. Practical methods of optimization, Volume 1: Unconstrained optimization, Wiley.
Fletcher, R., 1981. Practical methods of optimization, Volume 2: Constrained optimization, Wiley.
Fluke, 1994. Calibration: philosophy in practice, Fluke Corporation.
Franklin, J.N., 1970. Well posed stochastic extensions of ill posed linear problems, J. Math. Anal. Applic., 31, 682–716.
Gauss, C.F., 1809, Theoria Motus Corporum Coelestium.
Gauthier, O., Virieux, J., and Tarantola, A., 1986. Two-dimensional inversion of seismic waveforms: numerical results, Geophysics, 51, 1387–1403.
Geiger, L., 1910, Herdbestimmung bei Erdbeben aus den Ankunftszeiten, Nachrichten von der Koniglichen Gesellschaft der Wissenschaften zu Gottingen, 4, 331–349.
Geman, S., and Geman, D., Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, Inst. Elect. Electron. Eng. Trans. on Pattern Analysis and Machine Intelligence, PAMI-6, 721–741, 1984.
Goldberg, D.E., Genetic algorithms in search, optimization, and machine learning, Addison-Wesley, 1989.
Hadamard, J., 1902, Sur les problemes aux derivees partielles et leur signification physique, Bull. Univ. Princeton, 13.
Hadamard, J., 1932, Le probleme de Cauchy et les equations aux derivees partielles lineaires hyperboliques, Hermann, Paris.
Hammersley, J.M., and Handscomb, D.C., Monte Carlo Methods, in: Monographs on Statistics and Applied Probability, Cox, D.R., and Hinkley, D.V. (eds.), Chapman and Hall, 1964.
Herman, G.T., 1980. Image reconstruction from projections, the fundamentals of computerized tomography, Academic Press.
Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
Ikelle, L.T., Diet, J.P., and Tarantola, A., 1986. Linearized inversion of multi offset seismic reflection data in the f-k domain, Geophysics, 51, 1266–1276.
ISO, 1993, Guide to the expression of uncertainty in measurement, International Organization for Standardization, Switzerland.
Jackson, D.D., The use of a priori data to resolve non-uniqueness in linear inversion, Geophys. J. R. Astron. Soc., 57, 137–157, 1979.
Jannane, M., Beydoun, W., Crase, E., Cao Di, Koren, Z., Landa, E., Mendes, M., Pica, A., Noble, M., Roth, G., Singh, S., Snieder, R., Tarantola, A., Trezeguet, D., and Xie, M., Wavelengths of earth structures that can be resolved from seismic reflected data, Geophysics, 54, 906–910, 1988.
Jaynes, E.T., Prior probabilities, IEEE Transactions on Systems, Science, and Cybernetics, Vol. SSC-4, No. 3, 227–241, 1968.
Jaynes, E.T., 2003, Probability theory, the logic of science, Cambridge University Press.
Jaynes, E.T., Where do we go from here?, in: Smith, C.R., and Grandy, W.T., Jr. (eds.), Maximum-entropy and Bayesian methods in inverse problems, Reidel, 1985.
Jeffreys, H., 1939, Theory of probability, Clarendon Press, Oxford. Reprinted in 1961 by Oxford University Press. Here he introduces the positive parameters.
Johnson, G.R. and Olhoeft, G.R., 1984, Density of rocks and minerals, in: CRC Handbook of Physical Properties of Rocks, Vol. III, ed. R.S. Carmichael, CRC, Boca Raton, Florida, USA.
Journel, A. and Huijbregts, Ch., 1978, Mining Geostatistics, Academic Press.
Kalos, M.H. and Whitlock, P.A., 1986. Monte Carlo methods, John Wiley and Sons.
Kandel, A., 1986, Fuzzy mathematical techniques with applications, Addison-Wesley.
Keilis-Borok, V.J., and Yanovskaya, T.B., Inverse problems in seismology (structural review), Geophys. J. R. astr. Soc., 13, 223–234, 1967.
Khan, A., Mosegaard, K., and Rasmussen, K.L., 2000, A New Seismic Velocity Model for the Moon from a Monte Carlo Inversion of the Apollo Lunar Seismic Data, Geophys. Res. Lett. (in press).
Khintchine, A.I., 1969, Introduction a la theorie des probabilites (Elementarnoe vvedenie v teoriju verojatnostej), translated by M. Gilliard, 3rd ed., Paris; in English: An elementary introduction to the theory of probability, with B.V. Gnedenko, New York, 1962.
Kirkpatrick, S., Gelatt, C.D., Jr., and Vecchi, M.P., Optimization by Simulated Annealing, Science, 220, 671–680, 1983.
Kolmogorov, A.N., 1950. Foundations of the theory of probability, Chelsea, New York.
Koren, Z., Mosegaard, K., Landa, E., Thore, P., and Tarantola, A., Monte Carlo estimation and resolution analysis of seismic background velocities, J. Geophys. Res., 96, B12, 20,289–20,299, 1991.
Kullback, S., 1967, The two concepts of information, J. Amer. Statist. Assoc., 62, 685–686.
Landa, E., Beydoun, W., and Tarantola, A., Reference velocity model estimation from prestack waveforms: coherency optimization by simulated annealing, Geophysics, 54, 984–990, 1989.
Lehtinen, M.S., Paivarinta, L., and Somersalo, E., 1989, Linear inverse problems for generalized random variables, Inverse Problems, 5, 599–612.
Lions, J.L., 1968. Controle optimal de systemes gouvernes par des equations aux derivees partielles, Dunod, Paris. English translation: Optimal control of systems governed by partial differential equations, Springer, 1971.
Lutkepohl, H., 1996, Handbook of Matrices, John Wiley & Sons.
Marroquin, J., Mitter, S., and Poggio, T., 1987, Probabilistic solution of ill-posed problems in computational vision, Journal of the American Statistical Association, 82, 76–89.
Mehrabadi, M.M., and S.C. Cowin, 1990, Eigentensors of linear anisotropic elastic materials, Q. J. Mech. Appl. Math., 43, 15–41.
Mehta, M.L., 1967, Random matrices and the statistical theory of energy levels, Academic Press, New York and London.
Menke, W., 1984, Geophysical data analysis: discrete inverse theory, Academic Press.
Metropolis, N., and Ulam, S.M., The Monte Carlo Method, J. Amer. Statist. Assoc., 44, 335–341, 1949.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E., Equation of State Calculations by Fast Computing Machines, J. Chem. Phys., Vol. 21, No. 6, 1087–1092, 1953.
Miller, K.S., 1964, Multidimensional Gaussian distributions, John Wiley and Sons, New York.
Minster, J.B. and Jordan, T.M., 1978, Present-day plate motions, J. Geophys. Res., 83, 5331–5354.
Mohr, P.J., and B.N. Taylor, 2001, The Fundamental Physical Constants, Physics Today, Vol. 54, No. 8, BG6–BG13.
Moritz, H., 1980. Advanced physical geodesy, Herbert Wichmann Verlag, Karlsruhe; Abacus Press, Tunbridge Wells, Kent.
Morse, P.M., and Feshbach, H., 1953. Methods of theoretical physics, McGraw Hill.
Mosegaard, K., and Rygaard-Hjalsted, C., 1999, Bayesian analysis of implicit inverse problems, Inverse Problems, 15, 573–583.
Mosegaard, K., Singh, S.C., Snyder, D., and Wagner, H., 1997, Monte Carlo Analysis of seismic reflections from Moho and the W-reflector, J. Geophys. Res. B, 102, 2969–2981.
Mosegaard, K., and Tarantola, A., 1995, Monte Carlo sampling of solutions to inverse problems, J. Geophys. Res., Vol. 100, No. B7, 12,431–12,447.
Mosegaard, K., and Tarantola, A., 2002, Probabilistic Approach to Inverse Problems, International Handbook of Earthquake & Engineering Seismology, Part A, p. 237–265, Academic Press.
Mosegaard, K. and Vestergaard, P.D., A simulated annealing approach to seismic model optimization with sparse prior information, Geophysical Prospecting, 39, 599–611, 1991.
Nercessian, A., Hirn, A., and Tarantola, A., 1984. Three-dimensional seismic transmission prospecting of the Mont-Dore volcano, France, Geophys. J. R. astr. Soc., 76, 307–315.
Nolet, G., 1985. Solving or resolving inadequate and noisy tomographic systems, J. Comp. Phys., 61, 463–482.
Nulton, J.D., and Salamon, P., 1988, Statistical mechanics of combinatorial optimization, Physical Review A, 37, 1351–1356.
Parker, R.L., 1975. The theory of ideal bodies for gravity interpretation, Geophys. J. R. astron. Soc., 42, 315–334.
Parker, R.L., 1977. Understanding inverse theory, Ann. Rev. Earth Plan. Sci., 5, 35–64.
Parker, R.L., 1994, Geophysical Inverse Theory, Princeton University Press.
Pedersen, J.B., and Knudsen, O., Variability of estimated binding parameters, Biophys. Chemistry, 36, 167–176, 1990.
Pica, A., Diet, J.P., and Tarantola, A., 1990, Nonlinear inversion of seismic reflection data in a laterally invariant medium, Geophysics, Vol. 55, No. 3, pp. 284–292.
Polack, E. and Ribiere, G., 1969. Note sur la convergence de methodes de directions conjuguees, Revue Fr. Inf. Rech. Oper., 16-R1, 35–43.
Popper, K., Objective knowledge, Oxford, 1972. French translation: La logique de la decouverte scientifique, Payot, Paris, 1978.
Powell, M.J.D., 1981. Approximation theory and methods, Cambridge University Press.
Press, F., Earth models obtained by Monte Carlo inversion, J. Geophys. Res., 73, 5223–5234, 1968.
Press, F., An introduction to Earth structure and seismotectonics, Proceedings of the International School of Physics Enrico Fermi, Course L, Mantle and Core in Planetary Physics, J. Coulomb and M. Caputo (eds.), Academic Press, 1971.
Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., Numerical Recipes, Cambridge, 1986.
Pugachev, V.S., Theory of random functions and its application to control problems, Pergamon, 1965.
Renyi, A., 1966, Calcul des probabilites, Dunod, Paris.
Renyi, A., 1970, Probability theory, Elsevier, New York.
Rietsch, E., The maximum entropy approach to inverse problems, J. Geophys., 42, 489–506, 1977.
Rothman, D.H., Nonlinear inversion, statistical mechanics, and residual statics estimation, Geophysics, 50, 2797–2807, 1985.
Rothman, D.H., Automatic estimation of large residual static corrections, Geophysics, 51, 332–346, 1986.
Scales, L.E., 1985. Introduction to non-linear optimization, Macmillan.
Scales, J.A., Smith, M.L., and Fischer, T.L., 1992, Global optimization methods for multimodal inverse problems, Journal of Computational Physics, 102, 258–268.
Scales, J., 1996, Uncertainties in seismic inverse calculations, in: Inverse methods. Interdisciplinary elements of methodology, computation, and applications, eds. B.H. Jacobsen, K. Mosegaard and P. Sibani, Springer, Berlin, p. 79–97.
Shannon, C.E., 1948, A mathematical theory of communication, Bell System Tech. J., 27, 379–423.
Simon, J.L., 1995, Resampling: the new statistics, Resampling Stats Inc., Arlington, VA, USA.
Stark, P.B., 1992, Inference in infinite-dimensional inverse problems: Discretization and duality, J. Geophys. Res., 97, 14,055–14,082.
Stark, P.B., 1997, Does God play dice with the Earth? (And if so, are they loaded?), Fourth SIAM Conference on Mathematical and Computational Methods in the Geosciences, oral presentation, available at www.stat.berkeley.edu/users/stark/Seminars/doesgod.htm
Stein, S.R., 1985, Frequency and time — their measure and characterization, in: Precision frequency control, Vol. 2, edited by E.A. Gerber and A. Ballato, Academic Press, New York, pp. 191–232 and pp. 399–416.
Tarantola, A., 1984. Linearized inversion of seismic reflection data, Geophysical Prospecting, 32, 998–1015.
Tarantola, A., 1984. Inversion of seismic reflection data in the acoustic approximation, Geophysics, 49, 1259–1266.
Tarantola, A., 1984. The seismic reflection inverse problem, in: Inverse Problems of Acoustic and Elastic Waves, edited by F. Santosa, Y.-H. Pao, W. Symes, and Ch. Holland, SIAM, Philadelphia.
Tarantola, A., 1986. A strategy for nonlinear elastic inversion of seismic reflection data, Geophysics, 51, 1893–1903.
Tarantola, A., 1987. Inverse problem theory; methods for data fitting and model parameter estimation, Elsevier.
Tarantola, A., 1987. Inversion of travel time and seismic waveforms, in: Seismic Tomography, edited by G. Nolet, Reidel.
Tarantola, A., 1990, Probabilistic foundations of Inverse Theory, in: Geophysical Tomography, Desaubies, Y., Tarantola, A., and Zinn-Justin, J. (eds.), North Holland.
Tarantola, A., 2005, Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM.
Tarantola, A., Jobert, G., Trezeguet, D., and Denelle, E., 1987. The inversion of seismic waveforms can either be performed by time or by depth extrapolation, submitted to Geophysics.
Tarantola, A. and Nercessian, A., 1984. Three-dimensional inversion without blocks, Geophys. J. R. astr. Soc., 76, 299–306.
Tarantola, A., and Valette, B., 1982a. Inverse Problems = Quest for Information, J. Geophys., 50, 159–170.
Tarantola, A., and Valette, B., 1982b. Generalized nonlinear inverse problems solved using the least-squares criterion, Rev. Geophys. Space Phys., 20, No. 2, 219–232.
Taylor, S.J., 1966, Introduction to measure and integration, Cambridge Univ. Press.
Taylor, A.E., and Lay, D.C., 1980. Introduction to functional analysis, Wiley.
Taylor, B.N., and C.E. Kuyatt, 1994, Guidelines for evaluating and expressing the uncertainty of NIST measurement results, NIST Technical Note 1297.
Watson, G.A., 1980. Approximation theory and numerical methods, Wiley.
Weinberg, S., 1972, Gravitation and Cosmology: Principles and Applications of the General Theory of Relativity, John Wiley & Sons.
Wiggins, R.A., 1969, Monte Carlo Inversion of Body-Wave Observations, J. Geoph. Res., Vol. 74, No. 12, 3171–3181.
Wiggins, R.A., 1972, The General Linear Inverse Problem: Implication of Surface Waves and Free Oscillations for Earth Structure, Rev. Geoph. and Space Phys., Vol. 10, No. 1, 251–285.
Wilks, S., 1962, Mathematical Statistics, Wiley and Sons.
Winogradzki, J., 1979, Calcul Tensoriel (I), Masson.
Winogradzki, J., 1987, Calcul Tensoriel (II), Masson.
Xu, P. and Grafarend, E., 1997, Statistics and geometry of the eigenspectra of 3-D second-rank symmetric random tensors, Geophys. J. Int., 127, 744–756.
Xu, P., 1999, Spectral theory of constrained second-rank symmetric random tensors, Geophys. J. Int., 138, 1–24.
Yeganeh-Haeri, A., Weidner, D.J. and Parise, J.B., 1992, Elasticity of α-cristobalite: a silicon dioxide with a negative Poisson ratio, Science, 257, 650–652.
Zadeh, L.A., 1965, Fuzzy sets, Information and Control, Vol. 8, pp. 338–353.
