
Probability, Random Processes, and Ergodic Properties


Robert M. Gray

Probability, Random Processes, and Ergodic Properties

Second Edition


Robert M. Gray
Department of Electrical Engineering
Stanford University
Stanford, CA 94305-4075
[email protected]

ISBN 978-1-4419-1089-9        e-ISBN 978-1-4419-1090-5
DOI 10.1007/978-1-4419-1090-5
Springer Dordrecht Heidelberg London New York

Library of Congress Control Number: 2009932896

© Springer Science+Business Media, LLC 2009

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To my brother, Peter R. Gray
April 1940 – May 2009


Preface

This book has a long history. It began over two decades ago as the first half of a book then in progress on information and ergodic theory. The intent was and remains to provide a reasonably self-contained advanced (at least for engineers) treatment of measure theory, probability theory, and random processes, with an emphasis on general alphabets and on ergodic and stationary properties of random processes that might be neither ergodic nor stationary. The intended audience was mathematically inclined engineering graduate students and visiting scholars who had not had formal courses in measure theoretic probability or ergodic theory. Much of the material is familiar stuff for mathematicians, but many of the topics and results had not then appeared in books. Several topics still find little mention in the more applied literature, including descriptions of random processes as dynamical systems or flows; general alphabet models including standard spaces, Polish spaces, and Lebesgue spaces; nonstationary and nonergodic processes which have stationary and ergodic properties; metrics on random processes which quantify how different they are in ways useful for coding and signal processing; and stationary or sliding-block mappings – coding or filtering – of random processes.

The original project grew too large and the first part contained much that would likely bore mathematicians and discourage them from the second part. Hence I followed a suggestion to separate the material and split the project in two. The original justification for the present manuscript was the pragmatic one that it would be a shame to waste all the effort thus far expended. A more idealistic motivation was that the presentation had merit as filling a unique, albeit small, hole in the literature. Personal experience indicates that the intended audience rarely has the time to take a complete course in measure and probability theory in a mathematics or statistics department, at least not before they need some of the material in their research.


Many of the existing mathematical texts on the subject are hard for the intended audience to follow, and the emphasis is not well matched to engineering applications. A notable exception is Ash’s excellent text [1], which was likely influenced by his original training as an electrical engineer. Still, even that text devotes little effort to ergodic theorems, perhaps the most fundamentally important family of results for applying probability theory to real problems. In addition, there are many other special topics that are given little space (or none at all) in most texts on advanced probability and random processes. Examples of topics developed in more depth here than in most existing texts are the following.

Random processes with standard alphabets

The theory of standard spaces as a model of quite general process alphabets is developed in detail. Although not as general (or abstract) as examples sometimes considered by probability theorists, standard spaces have useful structural properties that simplify the proofs of some general results and yield additional results that may not hold in the more general abstract case. Three important examples of results holding for standard alphabets that have not been proved in the general abstract case are the Kolmogorov extension theorem, the ergodic decomposition, and the existence of regular conditional probabilities. All three results play fundamental roles in the development and understanding of mathematical information theory and signal processing. Standard spaces include the common models of finite alphabets (digital processes), countable alphabets, and real alphabets as well as more general complete separable metric spaces (Polish spaces). Thus they include many function spaces, Euclidean vector spaces, two-dimensional image intensity rasters, etc. The basic theory of standard Borel spaces may be found in the elegant text of Parthasarathy [82], and treatments of standard spaces and the related Lusin and Suslin spaces may be found in Blackwell [7], Christensen [12], Schwartz [94], Bourbaki [8], and Cohn [15].

The development of the basic results here is more coding-oriented and attempts to separate clearly the properties of standard spaces, which are useful and easy to manipulate, from the demonstrations that certain spaces are standard, which are more complicated and can be skipped. Unlike the traditional treatments, we define and study standard spaces first from a purely probability theory point of view and postpone the topological metric space details until later. In a nutshell, a standard measurable space is one for which any nonnegative set function that is finitely additive (usually easy to prove) on a field has a unique extension to a probability measure on the σ-field generated by the field. Spaces with this property ease the construction of probability measures from set functions that arise naturally in real systems, such as sample averages and relative frequencies. The fact that well behaved (separable and complete) metric spaces with the usual (Borel) σ-field are standard also provides a natural measure of approximation (or distortion) between random variables, vectors, and processes that will be seen to be quite useful in quantifying the performance of codes and filters.

This book and its sequel have numerous overlaps with ergodic theory, and most of the ergodic theory literature focuses on another class of probability spaces and the associated models of random processes and dynamical spaces — Lebesgue spaces. Standard spaces and Lebesgue spaces are intimately connected, and both are discussed here and comparisons made. The full connection between the two is stated, but not proved because doing so would require significant additional material not part of the primary goals of this book.

Nonstationary and nonergodic processes

The theory of asymptotically mean stationary processes and the ergodic decomposition are treated in depth in order to model many physical processes better than can traditional stationary and ergodic processes. Both topics are virtually absent in all books on random processes, yet they are fundamental to understanding the limiting behavior of nonergodic and nonstationary processes. Both topics are considered in Krengel’s excellent book on ergodic theorems [62], but the treatment here is in greater depth. The treatment considers both the common two-sided or bilateral random processes, which are considered to have been producing outputs forever, and the often more difficult one-sided or unilateral processes, which better model processes that are “turned on” at some specific time and which exhibit transient behavior.

Ergodic properties and theorems

The notions of time averages and probabilistic averages are developed together in order to emphasize their similarity and to demonstrate many of the implications of the existence of limiting sample averages. The ergodic theorem is proved for the general case of asymptotically mean stationary processes. In fact, it is shown that asymptotic mean stationarity is both sufficient and necessary for the classical pointwise or almost everywhere ergodic theorem to hold for all bounded measurements. The subadditive ergodic theorem of Kingman [60] is included as it is useful for studying the limiting behavior of certain measurements on random processes that are not simple arithmetic averages. The proofs are based on simple proofs of the ergodic theorem developed by Ornstein and Weiss [79], Katznelson and Weiss [59], Jones [55], and Shields [98]. These proofs use coding arguments reminiscent of information and communication theory rather than the traditional (and somewhat tricky) maximal ergodic theorem. The interrelations of stationary and ergodic properties of processes that are stationary or ergodic with respect to block shifts are derived. Such processes arise naturally in coding and signal processing systems that operate sequentially on disjoint blocks of values produced by stationary and ergodic processes. The topic largely originated with Nedoma [73] and it plays an important role in the general versions of Shannon channel and source coding theorems in the sequel and elsewhere.

Process distance measures

Measures of “distance” between random processes which quantify how “close” one process is to another are developed for two applications. The distributional distance is introduced as a means of considering spaces of random processes, which in turn provide the means of proving the ergodic decomposition of certain functionals of random processes. The optimal coupling or Kantorovich/Wasserstein/Ornstein or generalized d-distance provides a means of characterizing how close or different the long term behavior of distinct random processes can be expected to be. The distributional distance quantifies how well the probabilities on a generating field match, and the optimal coupling distance characterizes the smallest achievable distortion between two processes with given distributions. The optimal coupling distance has cropped up in many areas under a variety of names as a distance between distributions, random variables, or random objects, but here the emphasis is on extensions of Ornstein’s distance between random processes. The process distance measures are related to other metrics on measures and random variables, including the Prohorov, variation, and distribution distances.

B-processes

Several classes of random processes are introduced as examples of the general ideas, including the class of B-processes introduced by Ornstein, arguably the most important class for virtually all engineering applications, yet a class virtually invisible in the engineering literature, except for the special case of the outputs of time-invariant stable linear filters driven by independent, identically distributed (IID) processes. A discrete-time B-process is the result of a time-invariant or stationary coding or filtering of an IID process; a continuous time B-process is one for which there is some sampling period T for which the discrete-time process formed by sampling every T seconds is a discrete-time B-process. This class is of fundamental importance in ergodic theory because it is the largest class of stationary and ergodic processes for which the Ornstein isomorphism theorem holds, that is, the largest class for which equal entropy rate implies isomorphic processes. It is also of importance in applications, because it is exactly the class of processes which can be modeled as stationary coding or filtering of coin flips, roulette wheel spins, or Gaussian or uniform IID sequences. In a sense, B-processes are the most random possible since they have at their core IID sequences.

What’s new in the Second Edition

Much of the material has been rearranged and revised for pedagogical reasons to improve the flow of ideas and to introduce several basic topics earlier in order to reduce the clutter in the later, more difficult, ideas. The original overlong first chapter has been split in order to allow a more thorough treatment of basic probability before tackling random processes and dynamical systems. This permits moving material on many of the basic spaces treated in the book — metric spaces, Borel spaces, Polish spaces — into the discussion of sample spaces, providing a smoother flow of ideas both in the fundamentals of probability and in the later development of Polish Borel spaces. The final chapter in the original printed version has similarly been broken into two pieces to provide separate emphasis on process metrics and the ergodic decomposition of affine functionals.

Completion of event spaces and probability measures is treated in more detail in order to allow a more careful discussion of the metric space based Borel spaces emphasized here and the Lebesgue spaces ubiquitous in ergodic theory. I have long found existing treatments of the relative merits of these two important probability spaces confusing and insufficient in describing why one or the other of the two models should be preferred. The emphasis remains on the former model, but I have made every effort to clarify why Borel spaces are the more natural and useful for many of the results developed here and in the sequel.

Much more material is now given on the history and ideas underlying the Ornstein process distance and its extensions. The d-distance and its extensions have their roots in the classic problem of “optimal transportation” or “optimal transport” introduced by Monge in 1781 [71] and extended by Kantorovich in the mid twentieth century [56, 57, 58]. A discussion of the optimal transport problem has been included as it provides insight into the origins and properties of a key tool used in this book and its sequel as well as a liaison with a currently active area of research which has found application in a variety of branches of pure and applied mathematics. The discussion of the distance has been considerably expanded and reorganized to facilitate comparison with the related Monge/Kantorovich history, while still emphasizing Ornstein’s extension of these ideas to processes. Examples have been added where the distance can actually be computed, including independent identically distributed random processes and Gaussian random processes. Previously unpublished results extend the Gaussian results to a family of linear process models driven by a common memoryless process.

More specific examples of random processes have been introduced, most notably B-processes and K-processes, in order to better describe a hierarchy of examples of stationary and ergodic processes in terms of increasingly strong mixing properties. In addition, I believe strongly that the idea of B-processes is insufficiently known in information theory and signal processing, and that many results in those fields can be strengthened and simplified by focusing on B-processes.

Numerous minor additions have been made of material I have found useful through over four decades of work in the field. Many classic inequalities (such as those of Jensen, Young, Hölder, and Minkowski) are now incorporated into the text along with proofs, and many citations have been added.

Missing topics

Having described the topics treated here that are lacking in most texts, I admit to the omission of many topics usually contained in advanced texts on random processes or second books on random processes for engineers. First, the emphasis is on discrete time processes and the development for continuous time processes is not nearly as thorough. It is more complete, however, than in earlier versions of this book, where it was ignored. The emphasis on discrete time is somewhat justified by a variety of excuses: The advent of digital systems and sampled-data systems has made discrete time processes at least as important as continuous time processes in modeling real world phenomena. The shift in emphasis from continuous time to discrete time in texts on electrical engineering systems can be verified by simply perusing modern texts. The theory of continuous time processes is inherently more difficult than that of discrete time processes. It is harder to construct the models precisely and much harder to demonstrate the existence of measurements on the models, e.g., it is usually harder to prove that limiting integrals exist than limiting sums. One can, however, approach continuous time models via discrete time models by letting the outputs be pieces of waveforms. Thus, in a sense, discrete time systems can be used as a building block for continuous time systems.


Another topic clearly absent is that of spectral theory and its applications to estimation and prediction. This omission is a matter of taste and there are many books on the subject. Power spectral densities are considered briefly in the discussion of process metrics.

A further topic not given the traditional emphasis is the detailed theory of the most popular particular examples of random processes: Gaussian and Poisson processes. The emphasis of this book is on general properties of random processes rather than the specific properties of special cases. Again, I do consider some specific examples, including the Gaussian case, as a sideline in the development.

The final noticeably absent topic is martingale theory. Martingales are only briefly discussed in the treatment of conditional expectation. This powerful theory is not required in this book or its sequel on information and ergodic theory. If I had had the time, however, I would have included the convergence of conditional probability given increasing or decreasing σ-fields, results following from Doob’s martingale convergence theorem.

Goals

The book’s original goal of providing the needed machinery for a subsequent book on information and ergodic theory remains. That book rests heavily on this book and only quotes the needed material, freeing it to focus on the information measures and their ergodic theorems and on source and channel coding theorems and their geometric interpretations. In hindsight, this manuscript also serves an alternative purpose. I have been approached by engineering students who have taken a master’s level course in random processes using my book with Lee Davisson [36, 37] and who are interested in exploring more deeply the underlying mathematics that is often referred to, but rarely exposed. This manuscript provides such a sequel and fills in many details only hinted at in the lower level text.

As a final, and perhaps less idealistic, goal, I intended in this book to provide a catalogue of many results that I have found need of in my own research together with proofs that I could follow. I also intended to clarify various connections that I had found confusing or insufficiently treated in my own reading. These are goals for which I can judge the success — I have often found myself consulting this manuscript to find the conditions for some convergence result, the reasons for some required assumption, the generality of the existence of some limit, or the relative merits of different models for the various spaces arising in probability and random processes. If the book provides similar service for others, it will have succeeded in a more global sense. Each time the book has been revised, an effort has been made to expand the index to help searching by topic.

Assumed Background

The book is aimed at graduate engineers and hence does not assume even an undergraduate mathematical background in functional analysis or measure theory. Therefore topics from these areas are developed from scratch, although the developments and discussions often diverge from traditional treatments in mathematics texts. Some mathematical sophistication is assumed for the frequent manipulation of deltas and epsilons, and hence some background in elementary real analysis or a strong knowledge of calculus is required. Familiarity with basic set theory is also assumed.

Exercises

A few exercises are strewn throughout the book. In general these are intended to make handy results which do not merit the effort and space of a formal proof, but which provide additional useful examples or properties or otherwise reinforce the text.

Errors

As I age my frequency of typographical and other errors seems to grow along with my ability to see through them. I apologize for any that remain in the book. I will keep a list of all errors found by me or sent to me at [email protected] and I will post the list at my Web site, http://ee.stanford.edu/~gray/.

Acknowledgments

The research in information theory that yielded many of the results and some of the new proofs for old results in this book was supported by the National Science Foundation. Portions of the research and much of the early writing were supported by a fellowship from the John Simon Guggenheim Memorial Foundation. Later revisions were supported by gifts from Hewlett Packard, Inc.

The book benefited greatly from comments from numerous students and colleagues through many years: most notably Paul Shields, Lee Davisson, John Kieffer, Dave Neuhoff, Don Ornstein, Bob Fontana, Jim Dunham, Farivar Saadat, Mari Ostendorf, Michael Sabin, Paul Algoet, Wu Chou, Phil Chou, Tom Lookabaugh, and, in recent years, Tamás Linder, John Gill, Stephen Boyd, and Abbas El Gamal. They should not be blamed, however, for any mistakes I have made in implementing their suggestions. I would like to thank those who have sent me typos for correction through the years, including Hossein Kakavand, Ye Wang, and Zhengdao Wang.

I would also like to acknowledge my debt to Al Drake for introducing me to elementary probability theory, to Wilbur Davenport for introducing me to random processes, and to Tom Pitcher for introducing me to measure theory. All three were extraordinary teachers.

Finally and most importantly, I would like to thank my family, especially Lolly, Tim, and Lori, for all their love and support through the years spent writing and rewriting. I am dedicating this second edition to my brother Pete, who taught me how to sail, helped me to survive my undergraduate and early graduate years, was a big fan of my Web books about our grandmother Amy Heard Gray, and set a high standard for living a good life — even during his final battle with Alzheimer’s and Parkinson’s diseases. Sailing with Pete as his heavy-weather crew when MIT won the 1961 Henry A. Morss Memorial Trophy at Annapolis remains an unforgettable high point of my life.

La Honda, California
June 2009

Robert M. Gray


Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi

1 Probability Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   1.1 Sample Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   1.2 Metric Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
   1.3 Measurable Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
   1.4 Borel Measurable Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
   1.5 Polish Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
   1.6 Probability Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
   1.7 Complete Probability Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
   1.8 Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2 Random Processes and Dynamical Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
   2.1 Measurable Functions and Random Variables . . . . . . . . . . . . . . . . . . . . . 35
   2.2 Approximation of Random Variables and Distributions . . . . . . . . . . . . . 38
   2.3 Random Processes and Dynamical Systems . . . . . . . . . . . . . . . . . . . . . . . 40
   2.4 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
   2.5 Equivalent Random Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
   2.6 Codes, Filters, and Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
   2.7 Isomorphism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3 Standard Alphabets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
   3.1 Extension of Probability Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
   3.2 Standard Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
   3.3 Some Properties of Standard Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
   3.4 Simple Standard Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
   3.5 Characterization of Standard Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
   3.6 Extension in Standard Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
   3.7 The Kolmogorov Extension Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
   3.8 Bernoulli Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78



   3.9 Discrete B-Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
   3.10 Extension Without a Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
   3.11 Lebesgue Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
   3.12 Lebesgue Measure on the Real Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4 Standard Borel Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
   4.1 Products of Polish Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
   4.2 Subspaces of Polish Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
   4.3 Polish Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
   4.4 Product Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
   4.5 IID Random Processes and B-processes . . . . . . . . . . . . . . . . . . . . . . . . . . 110
   4.6 Standard Spaces vs. Lebesgue Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5 Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
   5.1 Discrete Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
   5.2 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
   5.3 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
   5.4 Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
   5.5 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
   5.6 Integrating to the Limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
   5.7 Time Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
   5.8 Convergence of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
   5.9 Stationary Random Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
   5.10 Block and Asymptotic Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6 Conditional Probability and Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
   6.1 Measurements and Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
   6.2 Restrictions of Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
   6.3 Elementary Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
   6.4 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
   6.5 The Radon-Nikodym Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
   6.6 Probability Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
   6.7 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
   6.8 Regular Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
   6.9 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
   6.10 Independence and Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

7 Ergodic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
   7.1 Ergodic Properties of Dynamical Systems . . . . . . . . . . . . . . . . . . . . . . . . 195
   7.2 Implications of Ergodic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
   7.3 Asymptotically Mean Stationary Processes . . . . . . . . . . . . . . . . . . . . . . . 206
   7.4 Recurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
   7.5 Asymptotic Mean Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
   7.6 Limiting Sample Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
   7.7 Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
   7.8 Block Ergodic and Totally Ergodic Processes . . . . . . . . . . . . . . . . . . . . . 230


   7.9 The Ergodic Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

8 Ergodic Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
   8.1 The Pointwise Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
   8.2 Mixing Random Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
   8.3 Block AMS Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
   8.4 The Ergodic Decomposition of AMS Systems . . . . . . . . . . . . . . . . . . . . . 252
   8.5 The Subadditive Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

9 Process Approximation and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
   9.1 Distributional Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
   9.2 Optimal Coupling Distortion/Transportation Cost . . . . . . . . . . . . . . . . . . 269
   9.3 Coupling Discrete Spaces with the Hamming Distance . . . . . . . . . . . . . . 274
   9.4 Fidelity Criteria and Process Optimal Coupling Distortion . . . . . . . . . . . 277
   9.5 The Prohorov Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
   9.6 The Variation/Distribution Distance for Discrete Alphabets . . . . . . . . . . 288
   9.7 Evaluating dp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
   9.8 Measures on Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

10 The Ergodic Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
   10.1 The Ergodic Decomposition Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . 295
   10.2 The Ergodic Decomposition of Markov Processes . . . . . . . . . . . . . . . . . 299
   10.3 Barycenters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
   10.4 Affine Functions of Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
   10.5 The Ergodic Decomposition of Affine Functionals . . . . . . . . . . . . . . . . 309

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317



Introduction

Abstract Random processes are introduced with their traditional basic definition as a family of random variables, and two additional models are described and motivated. The Kolmogorov or directly-given representation describes a random process as a probability distribution on the space of output sequences or waveforms producible by the random process. A dynamical system representation yields a random process by defining a single random variable or measurement on a probability space with a transformation or family of transformations on the underlying space. The content of the book is summarized as filling in the details of the many components of these models and developing their ergodic properties — behavior exhibited by limiting sample averages — and ergodic theorems — results relating sample averages to probabilistic averages. The organization of the book is described by a tour of the chapters, including introductory comments and overviews.

Random processes provide the mathematical structure for the study of random or stochastic phenomena in the real world, such as the overwhelming variety of signals, measurements, and data communicated through wires, cables, the atmosphere, and fiber; stored in hard disks, compact discs, flash cards, and volatile memory; and processed by seemingly uncountable computers and other devices to produce useful information for human or machine users. Most of the theory underlying the analysis of such systems for communications, transmission, storage, interpretation, recognition, classification, estimation, detection, signal processing, and so on is based on the theory of random processes, especially on results relating computable and tractable predictions of behavior based on mathematical models to the likely actual behavior resulting when algorithms based on the theory are applied to “real-world” signals.


Mathematically, a random process is simply an indexed family of random variables {X_t; t ∈ T} defined on a common probability space (Ω, B, P), which consists of a sample space Ω, an event space or σ-field B, and a probability measure P. These objects, along with all of the others mentioned in this introductory chapter, are developed in depth in the remaining chapters. Typically the index set T corresponds to time, which might be discrete, say the integers or the nonnegative integers, or continuous, say the real line or the nonnegative portion of the real line. In this book the emphasis is on the discrete index or discrete time case, and we typically write {X_n; n ∈ T} to represent a discrete time random process or abbreviate it to X_n or even X when clear from context.

The random variables X_n take values in some set or alphabet A, which might be as simple as the binary numbers {0, 1}, the unit interval [0, 1), or the real line R, or as complicated as a vector or function space.

There are two other representations of random processes of particular interest here. One is the Kolmogorov or directly given representation, where a new probability space is constructed having the form (A∞, B_A∞, µ), where A∞ is the sample space of output sequences of the random process, B_A∞ is a suitable event space, and µ is a probability measure on the sequence space called the process distribution of the random process. This construction deals directly with probabilities of sequences produced by the random process and not with abstract random variables defined on an underlying space. Both constructions have their uses, and they are clearly related — the Kolmogorov construction gives us a model of the first type by simply defining the X_n as the coordinate or sampling functions; that is, given a sequence x = (…, x_{n−1}, x_n, x_{n+1}, …) ∈ A∞, define X_n(x) = x_n. Probability theory allows us to go the other way: we can begin with the basic definition of an infinite family of random variables and then construct the Kolmogorov model by deriving the necessary distributions for the output process from the original probability space and the definitions of the random variables; the sequence space inherits its measure from the original space. Both the original model and the Kolmogorov model produce the same results so far as the random process outputs are concerned. The original model, however, might also have other random processes of interest defined on it, not all of which are determinable from the Kolmogorov model for X.

The third representation of a random process derives from the idea of a dynamical system, the principal object of interest in ergodic theory, a branch of probability theory that deals with measure-preserving transformations. This book and its sequel, Entropy and Information Theory [30], draw heavily on the ideas and methods from ergodic theory in the exposition and application of random processes. The early incorporation of the basic relevant structures eases the development.

To approach this idea from the random process point of view, consider the Kolmogorov model for a two-sided random process {X_n; n ∈ Z}, where Z is the set of all integers. If A is the alphabet of the X_n, then the action of “time” is captured by a (left) shift transformation T : A∞ → A∞ defined by

T(…, x_{n−1}, x_n, x_{n+1}, …) = (…, x_n, x_{n+1}, x_{n+2}, …).

In words, T takes a sequence and moves every symbol in the sequence one slot to the left. The time n value of the random process can now be written as a function of the sequence as

X_n(x) = X_0(T^n x);

that is, instead of viewing the random process as an infinite family of random variables defined on a common probability space, we can view it as a single random variable X_0 combined with a shift transformation T. The collection (A∞, B_A∞, µ, T) consisting of a probability space and a transformation on that space is a special case of a dynamical system. The general case is obtained by not requiring that the underlying probability space be a sequence space or Kolmogorov representation of a process. A dynamical system (Ω, B, P, T) is a probability space together with a transformation defined on the space (to be sure there are additional technical assumptions to make this precise, but this is an introduction). A dynamical system combined with a random variable, say f, produces a random process in the same way X_n was constructed above: f_n = fT^n, by which we mean f_n(ω) = f(T^n ω) for every ω ∈ Ω. Given the underlying probability space (Ω, B, P) and the transformation, we can (in theory, at least) find out everything we wish to know about a process.
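Since the shift picture can look abstract at first, here is a minimal computational sketch of my own (not taken from the book; a finite list stands in for a point of A∞) showing the left shift T, the coordinate function X_0, and the identity X_n(x) = X_0(T^n x):

```python
# A finite-window sketch (mine, not the book's) of the dynamical-system view:
# a stretch of a process output sequence x, the left shift T, and the
# coordinate random variable X0, so that X_n(x) = X_0(T^n x).
import random

def T(x):
    """Left shift: drop the first symbol (a finite stand-in for the shift on A^infinity)."""
    return x[1:]

def X0(x):
    """Zero-time coordinate function: read the symbol in slot zero."""
    return x[0]

def Xn(x, n):
    """Time-n sample obtained by shifting n times and then reading coordinate zero."""
    y = x
    for _ in range(n):
        y = T(y)
    return X0(y)

random.seed(0)
x = [random.randint(0, 1) for _ in range(100)]     # a finite stretch of a binary sequence
assert all(Xn(x, n) == x[n] for n in range(50))    # X_n(x) = x_n, as in the coordinate model
```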

The transformation T is measure preserving if the probability of any event (any set in B) is unchanged by shifting. In mathematical terms this means

µ(F) = µ({x : x ∈ F}) = µ({x : Tx ∈ F}) = µ(T^{-1}F)

for all events F. The fact that shifting or transforming by T preserves the measure µ can also be interpreted as stating that, given the transformation T, the measure µ is invariant or stationary with respect to the transformation T. Thus the theory of measure-preserving transformations in ergodic theory is simply the theory of stationary processes. In this book, however, we do not confine interest to measure-preserving transformations or stationary processes. The intent is to capture the most general possible class of processes or transformations which possess the same useful properties of stationary processes or measure-preserving transformations, but which do not satisfy the strict requirements of stationary processes. This leads to a class of processes that asymptotically have similar properties, but which can depart from strict stationarity in a variety of ways that arise in applications, including the possession of transients that die out with time, artifacts due to blocking, and anomalies that get averaged out over time.
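As a quick check of the measure-preservation condition above (an illustration of mine, not an example worked in this introduction): if µ is the distribution of an IID fair-coin process on binary sequences, then for a thin cylinder F = {x : x_0 = a_0, …, x_{k−1} = a_{k−1}} we have µ(F) = 2^{-k}, while T^{-1}F = {x : Tx ∈ F} = {x : x_1 = a_0, …, x_k = a_{k−1}} also has measure 2^{-k}, so µ(T^{-1}F) = µ(F) on a class of events that generates the event space and hence on all events; the fair-coin process is stationary.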

Each of the three models — a family of random variables, a distribution on a sequence space, and a dynamical system plus a measurement — says something about the others, and each will prove the most convenient representation at times, depending on what we are trying to prove.

This book develops the details of many aspects of random processes, beginning with the fundamental components: a sample space Ω, a σ-field or event space B, a measurable space (Ω, B), a probability measure P, a probability space (Ω, B, P), a process alphabet A, measurable functions or random variables, random vectors, and random processes, and dynamical systems. In addition to studying the outputs of random processes, we study what happens when the processes are themselves coded or filtered to form new processes. In fact we have already built this generality into the infrastructure: we might begin with one random process X and define a random variable f : A∞ → B mapping a sequence of the original random process into a symbol in a new alphabet. The dynamical system implied by the Kolmogorov representation of the first process plus the random variable together yield a new random process, say Y_n = fT^n, that is, Y_n(x) = f(T^n x). This is an example of a time-invariant or stationary code or filter, and it combines ideas from random processes and dynamical systems to model a common construction in communications and signal processing — mapping one process into another.
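A small concrete instance of such a stationary code (a toy of my own; the two-symbol window and the particular map are arbitrary illustrative choices, not an example from the text): an encoder that reads a short window of a binary input sequence at each time and emits the XOR of the two symbols, so that the same map is applied at every shift of the input.

```python
# A toy sliding-block (stationary) code illustrating Y_n = f(T^n x):
# f looks at a two-symbol window of the input sequence and emits one symbol.
def f(x):
    """Map a sequence to one output symbol using its first two coordinates."""
    return x[0] ^ x[1]                      # XOR of two adjacent binary symbols

def encode(x):
    """Apply the code at every time with a full window: Y_n = f(T^n x)."""
    return [f(x[n:]) for n in range(len(x) - 1)]

print(encode([0, 1, 1, 0, 1]))              # [1, 0, 1, 1]
```

In particular, a stationary coding of an IID process like this one yields a B-process, the class emphasized in the preface.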

A primary goal of the theory (and of ergodic theory in particular) is to quantify the properties and interrelationships between two kinds of averages: probabilistic averages or expectations and time/sample/empirical averages. Probabilistic averages are found by calculus (integration) and look like

E_P f = ∫ f dP = ∫ f(ω) dP(ω),

where P tells us the probability measure and f is a measurement or random variable, usually assumed to be real-valued or possibly a real-valued vector. An important example of a measurement f is the indicator function of an event F, taking the form

f(x) = 1_F(x) = 1 if x ∈ F and 0 if x ∉ F.

In this case the expected value is simply the probability of the event:

E_P 1_F = ∫_F dP = P(F).

Often these simple averages are not enough to quantify the behavior in sufficient detail, so we deal with conditional expectations which take the form E_P(f | M), where we are given some information about what happened, usually the value of some other measurements on the system or the occurrence or nonoccurrence of relevant events. Time averages look at an actual output of the process, make a sequence of measurements on the process, form a simple sum of the successive values, and divide by the number of terms; e.g., given a sequence x ∈ A∞ and a measurement f : A∞ → R, how does the sample average

<f>_n(x) = (1/n) Σ_{i=0}^{n−1} f(T^i x)    or, simply,    <f>_n = (1/n) Σ_{i=0}^{n−1} f T^i

behave in the limit as n → ∞? In the special case where f = X_0, the zero-time output of a random process {X_n}, the sample average is the time-average mean

(1/n) Σ_{i=0}^{n−1} X_i.

In the special case where the measurement is an indicator function, the sample average becomes a relative frequency or empirical probability:

<1_F>_n(x) = (1/n) · (number of i, 0 ≤ i < n, such that T^i x ∈ F).
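To see these relative frequencies in action, here is a small simulation of my own (it assumes an IID fair-coin model rather than any particular example from the book) that computes <1_F>_n along one realization for F = {x : x_0 = 1}:

```python
# A numerical illustration of the relative frequencies <1_F>_n defined above,
# under an assumed IID fair-coin model, with F = {x : x_0 = 1}.
import random

random.seed(1)
x = [random.randint(0, 1) for _ in range(10000)]   # one sample sequence

def relative_frequency(x, n):
    """<1_F>_n(x): fraction of times 0 <= i < n with T^i x in F, i.e. x_i = 1."""
    return sum(x[:n]) / n

for n in (10, 100, 1000, 10000):
    print(n, relative_frequency(x, n))
# For this IID model the ergodic theorem (here just the law of large numbers)
# says these values converge almost surely to P(F) = 1/2 as n grows.
```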

Results characterizing the behavior of such limiting sample averages when they exist are known as ergodic properties, and results relating such limits to expectations are known as ergodic theorems or laws of large numbers. Together such results allow us to use the calculus of probability to predict the behavior of the actual outputs of random systems, at least if our models are good. An important related topic is how we measure how good our models are. This raises the issue of how well one process approximates another (one process might be the model we construct, the other process the unknown true model). Having a notion of the distortion or distance between two processes can be useful in quantifying the mismatch resulting from doing theory with one model and applying the results to another. This topic is touched upon in the consideration of distortion and distance measures (different from probability measures) and fidelity criteria considered in some detail in this book.

The organization of the book is most easily described by an introduction and overview to each chapter. The remainder of this introduction presents such a tour of the topics and the motivation behind them.


Chapter 1: Probability Spaces

The basic tool for describing and quantifying random phenomena is probability theory, which is measure theory for finite measures normalized to one. The history of probability theory is long, fascinating, and rich (see, for example, Maistrov [69]). Its modern origins lie primarily in the axiomatic development of Kolmogorov in the 1930s [61]. Notable landmarks in the subsequent development of the theory (and still good reading) are the books by Cramér [16], Loève [66], and Halmos [45]. More recent treatments that I have found useful for background and reference are Ash [1], Breiman [10], Chung [14], the treatment of probability theory in Billingsley [3], and Doob’s gem of a book on measure theory [21].

There are a variety of types of general “spaces” considered in this book, including sample spaces, metric spaces, measurable spaces, measure spaces, and the special case of probability spaces. All of these are introduced in this chapter, along with several special cases such as Borel spaces and Polish spaces — complete, separable metric spaces. Additional spaces are encountered later, including standard spaces and Lebesgue spaces. All of these spaces form models of various aspects of random phenomena and often arise in multiple settings. For example, metric spaces can be used to construct probability spaces on which random variables or processes are defined, and then new metric spaces can be constructed quantifying the quality of approximation of random variables and processes.

Chapter 2: Random Processes and Dynamical Systems

The basic ideas of mathematical models of random processes are introduced in this chapter. The emphasis is on discrete time or discrete parameter processes, but continuous time processes are also considered. Such processes are also called stochastic processes or information sources, and, in the discrete time case, time series. Intuitively a random process is something that produces a succession of symbols called “outputs” in a random or nondeterministic manner. The symbols produced may be real numbers such as produced by voltage measurements from a transducer, binary numbers as in computer data, two-dimensional intensity fields as in a sequence of images, continuous or discontinuous waveforms, and so on. The space containing all of the possible output symbols at any particular time is called the alphabet of the random process, and the directly given model of a random process is an assignment of a probability measure to events consisting of sequences or collections of samples drawn from the alphabet. In this chapter three models are developed for random processes: the traditional model of a family of random variables defined on a common probability space, the directly given model of a distribution on a sequence or waveform space of outputs of the random process, and the dynamical system model of a transformation on a probability space combined with a random variable or measurement.

A variety of means of producing new random processes from a given random process are described, including coding and filtering one process to obtain another. The very special case of invertible mappings and isomorphism is described. In ergodic theory two processes are considered to be the “same” if one can be coded or filtered into another in an invertible manner. This book does not go into isomorphism theory in any detail, but the ideas are required both here and in the sequel for describing relevant results from the literature and interpreting several results developed here.

My favorite classic reference for random processes is Doob [20], and for classical dynamical systems and ergodic theory Halmos [46] remains a wonderful introduction.

Chapter 3: Standard Alphabets

It is desirable to develop a theory under the most general possible assumptions. Random process models with very general alphabets are useful because they include all conceivable cases of practical importance. On the other hand, considering only the abstract spaces of Chapter 2 can result in both weaker properties and more complicated proofs. Restricting the alphabets to possess some structure is necessary for some results and convenient for others. Ideally, however, we can focus on a class of alphabets that possesses useful structure and still is sufficiently general to well model all examples likely to be encountered in the real world. Standard spaces are a candidate for this goal and are the topic of this chapter and the next. In this chapter the focus is on the definitions and properties of standard spaces, leaving the more complicated demonstration that specific spaces are standard to the next chapter. The reader in a hurry can skip the next chapter. The theory of standard spaces is usually somewhat hidden in theories of topological measure spaces. Standard spaces are related to or include as special cases standard Borel spaces, analytic spaces, Lusin spaces, Suslin spaces, and Radon spaces. Such spaces are usually defined by their relation via a mapping to a complete separable metric space. Good supporting texts are Parthasarathy [82], Christensen [12], Schwartz [94], Bourbaki [8], and Cohn [15], and the papers by Mackey [68] and Bjornsson [6]. Standard spaces are also intimately connected with Lebesgue spaces, as developed by Rohlin [90, 89], as will be discussed.


The presentation here differs from the traditional one in that we focus on the fundamental properties of standard spaces purely from a probability theory viewpoint and postpone application of the topological and metric space ideas until later. As a result, we can define standard spaces by the properties that are needed and not by an isomorphism to a particular kind of probability space with a metric space alphabet. This provides a shorter route to a description of such a space and to its basic properties. In particular, it is demonstrated that standard measurable spaces are the only measurable spaces for which all finitely additive candidate probability measures are also countably additive and that this in turn implies many other useful properties. A few simple, but important, examples are provided: sequence spaces drawn from countable alphabets and certain subspaces of such sequence spaces. The next chapter shows that more general topological probability spaces satisfy the required properties.

Chapter 4: Standard Borel Spaces

In this chapter we develop the most important (and, in a sense, the most general) class of standard spaces – Borel spaces formed from complete separable metric spaces. In particular, we show that if

1. (A, B(A)) is a Polish Borel measurable space; that is, A is a complete separable metric space and B(A) is the σ-field generated by the open subsets of A, and

2. Ω ∈ B(A); that is, Ω is a Borel set of A,

then (Ω, Ω ∩ B(A)) is a standard measurable space. The event Ω itself need not be Polish. We accomplish this by showing that such spaces are isomorphic to a standard subspace of a countable alphabet sequence space and hence are themselves standard. The proof involves a form of coding, this time corresponding to quantization — mapping continuous spaces into discrete spaces. The classic text for this material is Parthasarathy [82], but we prove a simpler, if weaker, result.
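For a rough feel for the kind of quantization coding meant here (a toy of my own; the chapter's actual construction is more careful), a dyadic quantizer maps a point of the continuous space [0, 1) to a prefix of its binary expansion, and letting the prefix length grow is what ties the continuous alphabet to a countable-alphabet sequence space:

```python
# A toy dyadic quantizer (illustrative only, not the chapter's construction):
# map a point of [0,1) to the first k binary digits of its expansion.
def dyadic_code(t, k):
    """Return the first k binary digits of t in [0, 1)."""
    bits = []
    for _ in range(k):
        t *= 2
        b = int(t)          # next binary digit of the expansion
        bits.append(b)
        t -= b
    return bits

print(dyadic_code(0.625, 4))    # [1, 0, 1, 0]
```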

Chapter 5: Averages

The basic focus of classical ergodic theory was the development of conditions under which sample or time averages consisting of arithmetic means of a sequence of measurements on a random process converged to a probabilistic or ensemble average of the measurement as expressed by an integral of the measurement with respect to a probability measure. Theorems relating these two kinds of averages are called ergodic theorems or laws of large numbers.

In this chapter these two types of averages are defined, and several useful properties are gathered and developed. The treatment of ensemble averages or expectations is fairly standard, except for the emphasis on quantization and the fact that a general integral can be considered as a natural limit of simple integrals (sums, in fact) of quantized versions of the function being integrated. In this light, general (Lebesgue) integration is actually simpler to define and deal with than is the integral usually taught to most engineers, the ill-behaved and complicated Riemann integral. The only advantage of the Riemann integral is in computation, actually evaluating the integrals of specific functions. Although an important attribute for applications, it is not as important as convergence and continuity properties in developing theory.

The theory of expectation is simply a special case of the theory of integration. Excellent treatments may be found in Halmos [45], Ash [1], and Chung [14]. Many proofs in this chapter follow those in these books.

The treatment of expectation and sample averages together is uncommon, but it is useful to compare the definitions and properties of the two types of average in preparation for the ergodic theorems to come.

As we wish to form arithmetic sums and integrals of measurements, in this chapter we emphasize real-valued random variables.

Chapter 6: Conditional Probability and Expectation

We begin the chapter by exploring some relations between measurements, that is, measurable functions, and events, that is, members of a σ-field. In particular, we explore the relation between knowledge of the value of a particular measurement or class of measurements and knowledge of an outcome of a particular event or class of events. Mathematically these are relations between classes of functions and σ-fields. Such relations are useful for developing properties of certain special functions such as limiting sample averages arising in the study of ergodic properties of information sources. In addition, they are fundamental to the development and interpretation of conditional probability and conditional expectation, that is, probabilities and expectations when we are given partial knowledge about the outcome of an experiment.

Although conditional probability and conditional expectation are both common topics in an advanced probability course, the fact that we are living in standard spaces results in additional properties not present in the most general situation. In particular, we find that the conditional probabilities and expectations have more structure that enables them to be interpreted and often treated much like ordinary unconditional probabilities. In technical terms, there will always be regular versions of conditional probability and we will be able to define conditional expectations constructively as expectations with respect to conditional probability measures.

Chapter 7: Ergodic Properties

In this chapter we formally define ergodic properties as the existence of limiting sample averages, and we study the implications of such properties. We shall see that if sample averages converge for a sufficiently large class of measurements, e.g., the indicator functions of all events, then the random process must have a property called asymptotic mean stationarity or be asymptotically mean stationary (AMS) and that there is a stationary measure, called the stationary mean of the process, that has the same sample averages. In addition, it is shown that the limiting sample averages can be interpreted as conditional probabilities or conditional expectations and that under certain conditions convergence of sample averages implies convergence of the corresponding expectations to a single expectation with respect to the stationary mean. A key role is played by invariant measurements and invariant events, measurements and events which are not affected by shifting.

A random process is said to be ergodic if all invariant events are trivial, that is, have probability 0 or 1. Ergodicity is shown to be a necessary condition for limiting sample averages to be constants instead of to general random variables.

Finally, we show that any dynamical system or random process possessing ergodic properties has an ergodic decomposition in that its ergodic properties are the same as those produced by a mixture of stationary and ergodic processes. In other words, if a process has ergodic properties but it is not ergodic, then the process can be considered as a random selection from a collection of ergodic processes.

Although there are several forms of convergence of random variables and hence we could consider several types of ergodic properties, we focus on almost everywhere convergence and later consider implications for other forms of convergence. The primary goal of this chapter is the development of necessary conditions for a random process to possess ergodic properties and an understanding of the implications of such properties. In Chapter 8 sufficient conditions are developed.


Chapter 8: Ergodic Theorems

At the heart of ergodic theory are the ergodic theorems, results providing sufficient conditions for dynamical systems or random processes to possess ergodic properties, that is, for sample averages of the form

<f>_n = (1/n) ∑_{i=0}^{n−1} f T^i

to converge to an invariant limit. Traditional developments of the pointwise ergodic theorem focus on stationary systems and use a subsidiary result known as the maximal ergodic lemma (or theorem) to prove the ergodic theorem. The general result for AMS systems then follows since an AMS source inherits ergodic properties from its stationary mean; that is, since the set {x : <f>_n(x) converges} is invariant and since a system and its stationary mean place equal probability on all invariant events, one measure possesses almost everywhere ergodic properties with respect to a class of measurements if and only if the other measure does and the limiting sample averages are the same.

The maximal ergodic lemma is due to Hopf [52], and Garsia's simple and elegant proof [28] is almost universally used in published developments of the ergodic theorem. Unfortunately, however, the maximal ergodic lemma is not very intuitive and the proof is somewhat tricky and adds little insight into the ergodic theorem itself. Furthermore, we also develop limit theorems for sequences of measurements that are not additive, but are instead subadditive or superadditive, in which case the maximal lemma is insufficient to obtain the desired results without additional techniques.

During the early 1980s several new and simpler proofs of the ergodic theorem for stationary systems were developed and subsequently published by Ornstein and Weiss [79], Jones [55], Katznelson and Weiss [59], and Shields [98]. These developments dealt directly with the behavior of long run sample averages, and the proofs involved no mathematical trickery. We here largely follow the approach of Katznelson and Weiss.

My favorite modern text for ergodic theorems, which includes many topics not touched here, is Krengel [62].

Chapter 9: Process Approximation and Metrics

Given two probability measures on a common probability space, how different or distant from each other are they? Similarly, given two random processes with known distributions, how different are the processes from each other and what impact does such a difference have on their respective ergodic properties?

There is a wealth of notions of distance between measures. For example, more than seventy examples of probability metrics can be found in the encyclopedic treatise by Rachev [86]. A metric on spaces of probability measures yields a metric space of such measures, and thereby yields useful notions of convergence and continuity of measures and distributions. Of the many notions, however, we focus on two which are particularly relevant for the purposes of this book and its sequel. A few other relevant metrics are considered for reasons of comparison and as useful tools in their own right. More generally, however, we are interested in quantifying the difference between two random objects in a way that need not satisfy the properties of a metric or even a pseudometric. Such generalizations of metrics are called distortion measures and fidelity criteria in information theory, but they are also known as cost functions in other fields. As the sequel to this book is concerned with information theory, we use the term distortion measure and the corresponding intuition that one random object can be considered as a distorted version of another and that the distortion measure quantifies the degree of distortion. Even within information theory, however, “cost function” has been used as a description for distortion measures, especially when describing the constrained optimization problems arising in communications systems and the associated Shannon coding theorems. Distortion measures and fidelity criteria — a family of distortion measures — need not be symmetric in their arguments and need not satisfy the triangle inequality.

The classic example of an important distortion measure that is not a metric is the squared-error ubiquitous in information and communications theory and signal processing, as well as in the practical world of signal quality measurements based on signal-to-noise ratios. True, a sum of squares can be converted into a metric by taking the square root to produce the Euclidean distance, but the square root operation changes the mathematics and does not reflect the actual practice. On the other hand, the fact that a distortion measure can be a simple function of a metric, such as a positive power of the metric, allows many results for metrics to be extended to the distortion measure. In particular, the triangle inequality provides a useful measure of mismatch and robustness in information theory — it leads to useful bounds on the loss of performance when designing a system for one random object but applying the system to another. The richest collection of results follows for the case where distortion measures are powers of metrics, which turns out to be the primary case of interest in the parallel theory of optimal transportation.

In this chapter we develop two quite distinct notions of the distortion or distance between measures or processes and use these ideas to delve further into the ergodic properties of processes. The first notion considered, the distributional distance, measures how well the probabilities of certain important events match up for the two probability measures, and hence this metric need not relate to any underlying metric on the original sample space. The second measure of distortion between processes is constructed from a more basic measure of distortion on the underlying process alphabets, which might be a metric on a common alphabet. The distributional distance is primarily useful for the derivation of the final result of this book in the next chapter: the ergodic decomposition of affine functionals. This result plays an important role in the sequel where certain information theoretic quantities are shown to be affine functionals of the underlying processes, notably the Shannon entropy rate and the Shannon distortion-rate function.

The second notion of distortion has arisen in multiple fields and has many names, including the Monge/Kantorovich/Wasserstein/Ornstein distance, transportation distance, d̄-distance, optimal coupling distance, earth mover's distance, and Mallows' distance. We indulge in more of an historical discussion than is generally the case in this book because past books and papers in ergodic theory and information theory have scarcely mentioned the older history of the Monge/Kantorovich optimal transportation theory and the fundamentally similar ideas between the distance measures used, and because the modern rich literature of optimal transportation theory makes virtually no mention of Ornstein's d̄-distance in ergodic theory and its extensions in information theory. Similar ideas have cropped up under other names in a variety of branches of mathematics, probability, economics, and physics. The process distortion will be seen to quantify the difference between generic or frequency-typical sequences from distinct stationary ergodic sources, which provides a geometric interpretation of ergodic properties of random processes. In the sequel, the distortion will be used extensively as a tool in the development of Shannon source coding theory. It quantifies the goodness or badness of an approximation to or reproduction of an original process or “source” and it provides a measure of the mismatch resulting when a communications system optimized for a particular random process is applied to another. These coding applications are not pursued in this volume, but they provide motivation for a careful look at such process metrics and their properties and evaluation.

There are several excellent and encyclopedic texts on the coupling distortion for random variables and vectors, including [86, 87, 109, 110, 111]. Fewer references exist for the extension of most interest here: process distortion measures and metrics, but the classics are Ornstein's survey paper and book: [77, 78]. See also [97, 42, 99].


Chapter 10: The Ergodic Decomposition

The ergodic decomposition as developed in Chapters 7 and 8 demonstrates that all AMS measures have a stationary mean that can be considered as a mixture of stationary and ergodic components. Mathematically, any stationary measure m is a mixture of stationary and ergodic measures p_x and takes the form

m = ∫ p_x dm(x), (0.1)

which is an abbreviation for

m(F) = ∫ p_x(F) dm(x), all F ∈ B.

In addition, any m-integrable f can be integrated in two steps as

∫ f(x) dm(x) = ∫ (∫ dp_x(y) f(y)) dm(x). (0.2)

Thus stationary measures can be viewed as the output of mixtures of stationary and ergodic measures or, equivalently, a stationary and ergodic measure is randomly selected from some collection and then made visible to an observer. We effectively have a probability measure on the space of all stationary and ergodic probability distributions and the first measure is used to select one of the distributions of the given class. The preceding relations show that both probabilities and expectations of measurements over the resulting mixture can be computed as integrals of the probabilities and expectations over the component measures.

In applications of the theory of random processes such as information and communication theory, one is often concerned not only with ordinary functions or measurements, but also with functionals of probability measures, that is, mappings of the space of probability measures into the real line. Examples include the entropy rate of a dynamical system, the distortion-rate function of a random process with respect to a fidelity criterion, the information capacity of a noisy channel, the rate required to code a given process with nearly zero reproduction error, the rate required to code a given process for transmission with a specified fidelity, and the maximum rate of coded information that can be transmitted over a noisy channel with nearly zero error probability. (See, for example, the papers and commentary in [38].) All of these quantities are functionals of measures: a given process distribution or measure yields a real number as a value of the functional.

Given such a functional of measures, say D(m), it is natural to inquire under what conditions the analog of (0.2) for functions might hold, that is, conditions under which

D(m) = ∫ D(p_x) dm(x). (0.3)

This is in general a much more difficult issue than before since unlike an ordinary measurement f, the functional depends on the underlying probability measure. In this chapter conditions are developed under which (0.3) holds.

The classic result along this line is Jacobs' ergodic decomposition of entropy rate [54] and this material was originally developed to extend his results to more general affine functionals, which in particular need not be bounded (but must be nonnegative). This chapter's primary use is in the sequel [30], but it belongs in this book because it is an application of ergodic properties and ergodic theorems.


Chapter 1

Probability Spaces

Abstract The fundamentals of probability spaces are developed: sample spaces, event spaces or σ-fields, and probability measures. The emphasis is on sample spaces that are Polish spaces — complete, separable metric spaces — and on Borel σ-fields — event spaces generated by the open sets of the underlying metric space. Complete probability spaces are considered in order to facilitate later comparison between standard spaces and Lebesgue spaces. Outer measures are used to prove the Carathéodory extension theorem, providing conditions under which a candidate probability measure on a field extends to a unique probability measure on a σ-field.

1.1 Sample Spaces

A sample space Ω is an abstract space, a nonempty collection of points or members or elements called sample points (or elementary events or elementary outcomes). The abstract space intuitively corresponds to the "finest grain" outcomes of an experiment, but mathematically it is simply a set containing points. Important common examples are the binary space {0,1}, a finite collection of consecutive integers Z_M = {0,1,...,M−1}, the collection of all integers Z = {...,−2,−1,0,1,2,...}, and the collection of all nonnegative integers Z_+ = {0,1,2,...}. These are examples of discrete sample spaces. Common continuous sample spaces include the unit interval [0,1) = {r : 0 ≤ r < 1}, the real line R = (−∞,∞), and the nonnegative real line R_+ = [0,∞).

Ordinary set theory is the language for manipulating points and sets in sample spaces. Points will usually be denoted by lower case Greek or Roman letters such as ω, a, x, r, with point inclusion in a set F denoted by ω ∈ F and set inclusion of F = {ω : ω ∈ F} in another set G by F ⊂ G. The usual set theoretic operations of union ∪, intersection ∩, complementation c, set difference −, and set symmetric difference ∆ will be introduced and defined as needed.

All of these spaces can be used to construct more complicated spaces. For example, given a space A and a positive integer n, a simple product space or Cartesian product space Ω = A^n is defined as the collection of all n-tuples ω = a^n = (a_0, a_1, ..., a_{n−1}), a_i ∈ A for i ∈ Z_n. For example, R^n is n-dimensional Euclidean space and {0,1}^n is the space of all binary n-tuples. Along with these finite-dimensional product spaces we can consider infinite dimensional product spaces such as M ≜ {0,1}^{Z_+} = ×_{i=0}^∞ {0,1}, the space of all one-sided binary sequences (u_0, u_1, ...), u_i ∈ {0,1} for i ∈ Z_+. More general product spaces will be introduced later, but the simple special cases of the binary space {0,1} and the one-sided binary sequence space M will prove to be of fundamental importance throughout the book, a fact which can be interpreted as implying that flipping coins is the most fundamental of all random phenomena.

1.2 Metric Spaces

A metric space generalizes the cited examples of sample spaces in a way that provides an extremely useful notion of the quality of approximation of points in the sample space, how well or badly one point approximates another. Most of this book will focus on random processes with outputs taking values in metric spaces.

A set A with elements called points is called a metric space if for every pair of points a, b in A there is an associated nonnegative real number d(a,b) such that

d(a,b) = 0 if and only if a = b, (1.1)

d(a,b) = d(b,a) (symmetry), (1.2)

d(a,b) ≤ d(a,c) + d(c,b), all c ∈ A (triangle inequality). (1.3)

The function d : A×A → [0,∞) is called a metric or distance or a distance function. A metric space will be denoted by (A,d) or, simply, A if the metric d is clear from context.

If d(a,a) = 0 for all a ∈ A, but d(a,b) = 0 does not necessarily imply that a = b, then d is called a pseudometric and A is a pseudometric space. Pseudometric spaces can be converted into metric spaces by replacing the original space A by the collection of equivalence classes of points having zero distance from each other, but following Doob [21] we will usually just stick with the pseudometric space. Most definitions and results stated for metric spaces will also hold for pseudometric spaces.


In information theory, signal processing, communications, and other applications it is often useful to consider more general notions of approximation that might be neither symmetric nor satisfy the triangle inequality. A function d : A × B → [0,∞) will be called a distortion measure or cost function. A common example of a distortion measure is a nonnegative power of a metric.

Metric spaces provide mathematical models of many interesting alphabets. Several common examples are listed below.

Example 1.1. Hamming distance A is a discrete (finite or countable) set with metric

d(a,b) = d_H(a,b) = 1 − δ_{a,b},

where δ_{a,b} is the Kronecker delta function: δ_{a,b} = 1 if a = b and δ_{a,b} = 0 if a ≠ b.

Example 1.2. Lee distance A = Z_M with metric d(a,b) = d_L(a,b) = |a − b| mod M, where k mod M = r in the unique representation k = aM + r, a an integer, 0 ≤ r ≤ M − 1. d_L is called the Lee metric or circular distance.

Example 1.3. The unit interval A = [0,1) or the real line A = R with d(a,b) = |a − b|, the absolute magnitude of the difference.
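The scalar distances of Examples 1.1 through 1.3 are simple enough to state directly in code. The following Python sketch (the function names are ours, not the book's) implements the Hamming, Lee, and absolute-difference distances and includes a brute-force check of the metric axioms (1.1)-(1.3) over a finite list of points.

# A minimal sketch (names are illustrative, not from the text) of the scalar
# distances of Examples 1.1-1.3 and a brute-force check of axioms (1.1)-(1.3).

def hamming(a, b):
    """Example 1.1: d_H(a, b) = 1 - delta_{a,b}."""
    return 0.0 if a == b else 1.0

def lee(M):
    """Example 1.2 as written: d_L(a, b) = |a - b| mod M on Z_M = {0, ..., M-1}."""
    def d(a, b):
        return float(abs(a - b) % M)
    return d

def absolute(a, b):
    """Example 1.3: absolute magnitude of the difference on R or [0, 1)."""
    return abs(a - b)

def is_metric(d, points, tol=1e-12):
    """Check (1.1)-(1.3) exhaustively on a finite list of points."""
    for a in points:
        for b in points:
            if d(a, b) < 0:
                return False
            if (d(a, b) < tol) != (a == b):          # (1.1)
                return False
            if abs(d(a, b) - d(b, a)) > tol:         # (1.2) symmetry
                return False
            for c in points:                         # (1.3) triangle inequality
                if d(a, b) > d(a, c) + d(c, b) + tol:
                    return False
    return True

if __name__ == "__main__":
    print(is_metric(hamming, ["x", "y", "z"]))             # True
    print(is_metric(lee(5), range(5)))                      # True
    print(is_metric(absolute, [0.0, 0.25, 0.5, 0.9]))       # True

The same checker can be pointed at any candidate distance on a small finite set, which is a convenient sanity check when experimenting with the later examples.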

These familiar examples can be used as sample spaces by defining Ω = A. They can also be used to construct more interesting examples such as product spaces.

Example 1.4. Average Hamming distance A is a discrete (finite or countable) set and Ω = A^n with metric

d(x^n, y^n) = n^{−1} ∑_{i=0}^{n−1} d_H(x_i, y_i),

the arithmetic average of the Hamming distances in the coordinates.

This distance is also called the Hamming distance, but the name is ambiguous since the Hamming distance in the sense of Example 1.1 applied to A^n can take only values of 0 and 1 while the distance of this example can take on values of i/n; i = 0,1,...,n. Better names for this example are the average Hamming distance or mean Hamming distance. This distance is useful in error correcting and error detecting coding examples as a measure of the change in digital codewords caused by noisy communication media.
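As a quick illustration, here is a small Python sketch (our own helper names, not from the text) contrasting the Example 1.1 distance applied to whole n-tuples with the average Hamming distance of Example 1.4, to make the distinction in the preceding paragraph concrete.

# A small sketch contrasting Example 1.1 applied to whole n-tuples with the
# average (per-coordinate) Hamming distance of Example 1.4.

def hamming_tuple(x, y):
    """Example 1.1 on the product alphabet: 0 if the tuples are identical, else 1."""
    return 0.0 if tuple(x) == tuple(y) else 1.0

def average_hamming(x, y):
    """Example 1.4: (1/n) * number of coordinates in which x and y differ."""
    assert len(x) == len(y), "both n-tuples must have the same length n"
    n = len(x)
    return sum(xi != yi for xi, yi in zip(x, y)) / n

x = (0, 1, 1, 0, 1)
y = (0, 1, 0, 0, 0)
print(hamming_tuple(x, y))    # 1.0: the tuples differ somewhere
print(average_hamming(x, y))  # 0.4: they differ in 2 of 5 coordinates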


Example 1.5. Levenshtein distance Ω = A^n is as in the preceding example. Given two elements x^n and y^n of Ω, an ordered set of integers K = {(k_i, j_i); i = 1,...,r} is called a match of x^n and y^n if x_{k_i} = y_{j_i} for i = 1,2,...,r. The number r = r(K) is called the size of the match K. Let K(x^n, y^n) denote the collection of all matches of x^n and y^n. The best match size m_n(x^n, y^n) is the largest r such that there is a match of size r, that is,

m_n(x^n, y^n) = max_{K∈K(x^n,y^n)} r(K).

The Levenshtein distance λ_n on A^n is defined by

λ_n(x^n, y^n) = n − m_n(x^n, y^n) = min_{K∈K(x^n,y^n)} (n − r(K)),

that is, it is the minimum number of insertions, deletions, and substitutions required to map one of the n-tuples into the other [64] [105].

The Levenshtein distance is useful in correcting and detecting errors in communication systems where the communication medium may not only cause symbol errors: it can also drop symbols or insert erroneous symbols. Note that unlike the previous example, the distance between vectors is not the sum of the component distances, that is, it is not additive. The Levenshtein distance is also called the edit distance.
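A sketch of how λ_n might be computed, under the reading that a match is an order-preserving pairing of equal symbols, so that the best match size is the length of a longest common subsequence; the dynamic program below is a standard LCS recursion of ours, not an algorithm given in the text.

# Sketch: compute the best match size m_n as a longest common subsequence
# (assuming matches are order-preserving) and then lambda_n = n - m_n.

def best_match_size(x, y):
    """Length of a longest common subsequence of the tuples x and y."""
    nx, ny = len(x), len(y)
    # m[i][j] = best match size between x[:i] and y[:j]
    m = [[0] * (ny + 1) for _ in range(nx + 1)]
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            if x[i - 1] == y[j - 1]:
                m[i][j] = m[i - 1][j - 1] + 1
            else:
                m[i][j] = max(m[i - 1][j], m[i][j - 1])
    return m[nx][ny]

def levenshtein_lambda(x, y):
    """lambda_n(x, y) = n - m_n(x, y) for n-tuples x and y of the same length."""
    assert len(x) == len(y)
    return len(x) - best_match_size(x, y)

x = (1, 0, 1, 1, 0)
y = (0, 1, 1, 0, 0)
print(best_match_size(x, y))     # 4, e.g. the common subsequence (0, 1, 1, 0)
print(levenshtein_lambda(x, y))  # 1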

Example 1.6. Euclidean distance For n = 1,2,..., A = [0,1)^n, the unit cube, or A = R^n, n-dimensional Euclidean space, with

d(a^n, b^n) = (∑_{i=0}^{n−1} |a_i − b_i|^2)^{1/2},

the Euclidean distance.

Example 1.7. A = R^n with

d(a^n, b^n) = max_{i=0,...,n−1} |a_i − b_i|.

The previous two examples are also special cases of normed linear vector spaces. Such spaces are described next.

Example 1.8. Normed linear vector space A with norm ‖·‖ and distance d(a,b) = ‖a − b‖. A vector space or linear space A is a space consisting of points called vectors and two operations — one called addition which associates with each pair of vectors a, b a new vector a + b ∈ A and one called multiplication which associates with each vector a and each number r ∈ R (called a scalar) a vector ra ∈ A — for which the usual rules of arithmetic hold, that is, if a, b, c ∈ A and r, s ∈ R, then

a + b = b + a (commutative law)
(a + b) + c = a + (b + c) (associative law)
r(a + b) = ra + rb (left distributive law)
(r + s)a = ra + sa (right distributive law)
(rs)a = r(sa) (associative law for multiplication)
1a = a.

In addition, it is assumed that there is a zero vector, also denoted 0, for which

a + 0 = a
0a = 0.

We also define the vector −a = (−1)a. A normed linear space is a linear vector space A together with a function ‖a‖ called a norm defined for each a such that for all a, b ∈ A and r ∈ R, ‖a‖ is nonnegative and

‖a‖ = 0 if and only if a = 0, (1.4)

‖a + b‖ ≤ ‖a‖ + ‖b‖ (triangle inequality), (1.5)

‖ra‖ = |r| ‖a‖. (1.6)

If ‖a‖ = 0 does not imply that a = 0, then ‖·‖ is called a seminorm.

For example, the Euclidean space A = R^n of Example 1.6 is a normed linear space with norm

‖a‖ = (∑_{i=0}^{n−1} a_i^2)^{1/2},

which gives the Euclidean distance of Example 1.6. The space A = R^n is also a normed linear space with norm

‖a‖ = ∑_{i=0}^{n−1} |a_i|,

which yields the distance of Example 1.7. Both of these norms and the corresponding metrics are special cases of the following.

Example 1.9. ℓ_p norm For any p ≥ 1, the Euclidean space of Example 1.6 is a normed linear space with norm

‖a‖_p = (∑_{i=0}^{n−1} |a_i|^p)^{1/p}.


The fact that the triangle inequality holds follows from Minkowski's inequality for sums, which is later proved in Corollary 5.7.
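A brief numerical sketch of the ℓ_p norms (helper names ours): for a handful of values of p it computes ‖a‖_p and spot-checks the triangle inequality ‖a + b‖_p ≤ ‖a‖_p + ‖b‖_p on random vectors, which is the content of the Minkowski inequality cited above.

# Sketch: the l_p norm of Example 1.9 and a numerical spot-check of the
# triangle inequality (Minkowski's inequality) for a few values of p >= 1.
import random

def lp_norm(a, p):
    """||a||_p = (sum_i |a_i|^p)^(1/p) for p >= 1."""
    return sum(abs(ai) ** p for ai in a) ** (1.0 / p)

def lp_dist(a, b, p):
    """The metric induced by the l_p norm: ||a - b||_p."""
    return lp_norm([ai - bi for ai, bi in zip(a, b)], p)

random.seed(0)
for p in (1, 1.5, 2, 3):
    for _ in range(1000):
        a = [random.uniform(-1, 1) for _ in range(5)]
        b = [random.uniform(-1, 1) for _ in range(5)]
        lhs = lp_norm([x + y for x, y in zip(a, b)], p)
        assert lhs <= lp_norm(a, p) + lp_norm(b, p) + 1e-9
print("triangle inequality held in all sampled cases")
print(lp_dist([0, 0], [3, 4], 2))  # 5.0, the familiar Euclidean case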

Example 1.10. Inner product space A with inner product (·,·) and distance d(a,b) = ‖a − b‖, where ‖a‖ = (a,a)^{1/2} is a norm. An inner product space (or pre-Hilbert space) is a linear vector space A such that for each pair of vectors a, b ∈ A there is a real number (a,b) called an inner product such that for a, b, c ∈ A, r ∈ R

(a,b) = (b,a)
(a + b, c) = (a,c) + (b,c)
(ra, b) = r(a,b)
(a,a) ≥ 0 and (a,a) = 0 if and only if a = 0.

Inner product spaces and normed linear vector spaces include a variety of general alphabets and function spaces, many of which will be encountered in this book.

Example 1.11. Fractional powers of metrics Given any metric space (A,d) and 0 < p < 1, then (A,d^p) is also a metric space. Clearly d^p(a,b) is nonnegative and symmetric if d is. To show that d^p satisfies the triangle inequality, we use the following simple inequality.

Lemma 1.1. Given a, b ≥ 0 and 0 < p < 1,

(a + b)^p ≤ a^p + b^p. (1.7)

Proof. If 0 < p < 1 and 0 < x ≤ 1, then x^p ≥ x. Thus

(a^p + b^p)/(a + b)^p = a^p/(a + b)^p + b^p/(a + b)^p
                      = (a/(a + b))^p + (b/(a + b))^p
                      ≥ a/(a + b) + b/(a + b) = 1. □

Given any a, b, c ∈ A, since d is a metric we have that d(a,c) ≤ d(a,b) + d(b,c), which with Lemma 1.1 yields

d^p(a,c) ≤ (d(a,b) + d(b,c))^p ≤ d^p(a,b) + d^p(b,c),

proving that d^p satisfies the triangle inequality.
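A tiny numerical illustration (ours, not from the text): raising the absolute-difference metric on R to a fractional power preserves the triangle inequality on sampled points, while an exponent greater than 1 breaks it, which is why the squared error discussed earlier is a distortion measure but not a metric.

# Sketch: for 0 < p < 1 the p-th power of the absolute-difference metric still
# satisfies the triangle inequality on sampled points, while p > 1 can fail it.
import random

def violates_triangle(p, trials=10000):
    random.seed(1)
    dp = lambda x, y: abs(x - y) ** p
    for _ in range(trials):
        a, b, c = (random.uniform(-10, 10) for _ in range(3))
        if dp(a, c) > dp(a, b) + dp(b, c) + 1e-12:
            return True
    return False

print(violates_triangle(0.5))  # False: d^p remains a metric (Example 1.11)
print(violates_triangle(2.0))  # True: squared error is not a metric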

Example 1.12. Product Spaces Given a metric space (A,d) and an index set I that is some subset of the integers (e.g., Z_+ = {0,1,2,...}), the Cartesian product

A^I = ×_{i∈I} A_i,

where the A_i are all replicas of A, is a metric space with metric

d_I(a,b) = ∑_{i∈I} 2^{−|i|} d(a_i,b_i)/(1 + d(a_i,b_i)).

The somewhat strange form above ensures that the metric on the sequences in A^I is finite – simply adding up the component metrics would likely yield a sum that blows up. Dividing by 1 plus the metric ensures that each term is bounded and the 2^{−|i|} ensure convergence of the infinite sum. In the special case where the index set I is finite, then the product space is a metric space with the simpler metric

d_I(a,b) = ∑_{i∈I} d(a_i,b_i).

If I = Z_n, then the following shorthand is common:

d_n(a,b) = d_{Z_n}(a,b) = ∑_{i=0}^{n−1} d(a_i,b_i). (1.8)

The triangle inequality follows from that of the component metrics.
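The following Python sketch (function names ours) implements the weighted product-space metric d_I for a finite slice of a one-sided sequence space, together with the finite-index shorthand (1.8), using the absolute difference as the component metric.

# Sketch: the product-space metric of Example 1.12 with component metric
# d(a, b) = |a - b|.  For infinite index sets the weighted sum is truncated
# here, which is an approximation choice of ours, not part of the text.

def d_component(a, b):
    return abs(a - b)

def d_product_weighted(x, y, index_set):
    """d_I(x, y) = sum over i of 2^{-|i|} d(x_i, y_i) / (1 + d(x_i, y_i))."""
    total = 0.0
    for i in index_set:
        di = d_component(x[i], y[i])
        total += 2.0 ** (-abs(i)) * di / (1.0 + di)
    return total

def d_n(x, y):
    """Shorthand (1.8) for a finite index set Z_n: the sum of component distances."""
    return sum(d_component(xi, yi) for xi, yi in zip(x, y))

# Two "sequences" compared on the first 8 coordinates of Z_+.
x = [0.0, 1.0, 0.5, 0.0, 2.0, 0.0, 0.0, 1.0]
y = [0.0, 0.0, 0.5, 1.0, 0.0, 0.0, 0.0, 1.0]
print(d_product_weighted(x, y, range(len(x))))  # bounded even if components are large
print(d_n(x, y))                                # plain sum: 1 + 1 + 2 = 4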

Example 1.13. Product Spaces and Metric Powers A more general construction of a metric d_n on an n-fold product space is based on positive powers of the component metrics. Given a metric space (A,d) and p > 0, define

d_n^{(p)}(a,b) = (∑_{i=0}^{n−1} d^p(a_i,b_i))^{1/p} if p ≥ 1,
d_n^{(p)}(a,b) = ∑_{i=0}^{n−1} d^p(a_i,b_i) if 0 < p ≤ 1. (1.9)

If 0 < p ≤ 1, the triangle inequality is immediate as in (1.8) since d^p is a metric. For p ≥ 1, the triangle inequality follows from the Minkowski sum inequality (Corollary 5.7).
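A short sketch of (1.9) (the helper name is ours), using the absolute difference as the component metric d:

# Sketch of the metric powers construction (1.9) on an n-fold product space,
# with component metric d(a, b) = |a - b|.

def d_n_p(x, y, p):
    """d_n^{(p)}(x, y) from (1.9): p-th root of the sum for p >= 1, plain sum for 0 < p <= 1."""
    if p <= 0:
        raise ValueError("p must be positive")
    s = sum(abs(xi - yi) ** p for xi, yi in zip(x, y))
    return s ** (1.0 / p) if p >= 1 else s

x = (0.0, 3.0, 1.0)
y = (4.0, 0.0, 1.0)
print(d_n_p(x, y, 2))    # 5.0, the Euclidean case
print(d_n_p(x, y, 1))    # 7.0, the sum of absolute differences
print(d_n_p(x, y, 0.5))  # 2 + sqrt(3), about 3.73, the fractional-power case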

Open Sets and Topology

Let A denote a metric space with metric d. Define for each a in A and each real r > 0 the open sphere (or open ball) with center a and radius r by S_r(a) = {b : b ∈ A, d(a,b) < r}. For example, an open sphere in the real line R with respect to the Euclidean metric is simply an open interval (a,b) = {x : a < x < b}. A subset F of A is called an open set if given any point a ∈ F there is an r > 0 such that S_r(a) is completely contained in F, that is, there are no points in F on the edge of F. Let O_A denote the class of all open sets in A. The following lemma collects several useful properties of open sets. The proofs are left as an exercise as they are easy and may be found in any text on topology or metric spaces; e.g., [100].

Lemma 1.2. All spheres are open sets. A and ∅ are open sets. A set is open if and only if it can be written as a union of spheres; in fact,

F = ⋃_{a∈F} ⋃_{r: r<d(a,F^c)} S_r(a), (1.10)

where

d(a,G) = inf_{b∈G} d(a,b).

If {F_t; t ∈ T} is any (countable or uncountable) collection of open sets, then ⋃_{t∈T} F_t is open. If F_i, i ∈ N, is any finite collection of open sets, then ⋂_{i∈N} F_i is open.

When we have a class of open sets V such that every open set F in O_A can be written as a union of elements in V, e.g., the preceding open spheres, then we say that V is a base for the open sets and write O_A = OPEN(V) = {all unions of sets in V}.

In general, a set A together with a class of subsets O_A called open sets such that all unions and finite intersections of open sets yield open sets is called a topology on A. If V is a base for the open sets, then it is called a base for the topology. Observe that many different metrics may yield the same open sets and hence the same topology. For example, the metrics d(a,b) and d(a,b)/(1 + d(a,b)) yield the same open sets and hence the same topology. Two metrics will be said to be equivalent on A if they yield the same topology on A.

Convergence in Metric Spaces

Given a metric space (A,d), a sequence {a_n, n = 1,2,...} of points in A is said to converge to a point a if

lim_{n→∞} d(a_n,a) = 0,

that is, if given ε > 0 there is an N such that n ≥ N implies that d(a_n,a) ≤ ε. If a_n converges to a, then we write a = lim_{n→∞} a_n or a_n →_{n→∞} a.

A metric space A is said to be sequentially compact if every sequence has a convergent subsequence, that is, if a_n ∈ A, n = 1,2,..., then there is a subsequence n(k) of increasing integers and a point a in A such that lim_{k→∞} a_{n(k)} = a. The notion of sequential compactness is considered somewhat old-fashioned in some modern analysis texts, but occasionally it will prove to be exactly what we need.

We will need the following two easy results on convergence in product spaces.

Lemma 1.3. Given a metric space A with metric d and a countable index set I, let A^I denote the Cartesian product with the metric of Example 1.12. A sequence x^I_n = {x_{n,i}; i ∈ I} converges to a sequence x^I = {x_i, i ∈ I} if and only if lim_{n→∞} x_{n,i} = x_i for all i ∈ I.

Proof. The metric on the product space of Example 1.12 goes to zero if and only if all of the individual terms of the sum go to zero. □

Corollary 1.1. Let A and A^I be as in the lemma. If A is sequentially compact, then so is A^I.

Proof. Let x_n, n = 1,2,... be a sequence in the product space A^I. Since the first coordinate alphabet is sequentially compact, we can choose a subsequence, say x^1_n, that converges in the first coordinate. Since the second coordinate alphabet is sequentially compact, we can choose a further subsequence, say x^2_n, of x^1_n that converges in the second coordinate. We continue in this manner, each time taking a subsequence of the previous subsequence that converges in one additional coordinate. The so-called diagonal sequence x^n_n will then converge in all coordinates as n → ∞ and hence by the lemma will converge with respect to the given product space metric. This method of taking subsequences of subsequences with desirable properties is referred to as a standard diagonalization. □

Limit Points and Closed Sets

If A is a metric space with metric d, a point a in A is said to be a limit point of a set F if every open sphere S_r(a) contains a point b ≠ a such that b is in F. A set F is closed if it contains all of its limit points. The set F together with its limit points is called the closure of F and is denoted by F̄. Given a set F, define its diameter

diam(F) = sup_{a,b∈F} d(a,b).

Lemma 1.4. A set F is closed if and only if for every sequence a_n, n = 1,2,... of elements of F such that a_n → a we have that also a is in F. If d and m are two metrics on A that yield equivalent notions of convergence, that is, d(a_n,a) → 0 if and only if m(a_n,a) → 0 for a_n, a in A, then d and m are equivalent metrics on A, that is, they yield the same open sets and generate the same topology.

Proof. If a is a limit point of F, then one can construct a sequence of elements a_n in F such that a = lim_{n→∞} a_n. For example, choose a_n as any point in S_{1/n}(a) (there must be such an a_n since a is a limit point). Thus if a set F contains all limits of sequences of its points, it must contain its limit points and hence be closed.

Conversely, let F be closed and a_n a sequence of elements of F that converge to a point a. If for every ε > 0 we can find a point a_n distinct from a in S_ε(a), then a is a limit point of F and hence must be in F since it is closed. If we can find no such distinct point, then all of the a_n must equal a for n greater than some value and hence a must be in F. If two metrics d and m are such that d-convergence and m-convergence are equivalent, then they must yield the same closed sets and hence by the previous lemma they must yield the same open sets. □

Exercises

1. Prove Lemma 1.2.
2. Show that an inner product space with norm ‖·‖ satisfies the parallelogram law

‖a + b‖² + ‖a − b‖² = 2‖a‖² + 2‖b‖².

3. Show that d_I and d of Example 1.12 are metrics.
4. Show that if d_i; i = 0,1,...,n−1 are metrics on A, then so is d(x,y) = max_i d_i(x,y).
5. Suppose that d is a metric on A and r is a fixed positive real number. Define ρ(x,y) to be 0 if d(x,y) ≤ r and K if d(x,y) > r. Is ρ a metric? a pseudometric?

1.3 Measurable Spaces

A measurable space (Ω,B) is a pair consisting of a sample space Ω together with a σ-field B of subsets of Ω (also called the event space). A σ-field or σ-algebra B is a collection of subsets of Ω with the following properties:

Ω ∈ B. (1.11)

If F ∈ B, then F^c = {ω : ω ∉ F} ∈ B. (1.12)

If F_i ∈ B; i = 1,2,..., then ⋃_{i=1}^∞ F_i = {ω : ω ∈ F_i for some i} ∈ B. (1.13)

The definition is in terms of unions, but from basic set theory this also implies that countable sequences of other set theoretic operations on events also yield events. For example, from de Morgan's “law” of elementary set theory it follows that countable intersections are also in B:

⋂_{i=1}^∞ F_i = {ω : ω ∈ F_i for all i} = (⋃_{i=1}^∞ F_i^c)^c ∈ B.

Other set theoretic operations of interest are set difference, defined by F − G = F ∩ G^c, and symmetric difference, defined by

F∆G = (F ∩ G^c) ∪ (F^c ∩ G),

which implies that

F∆G = (F ∪ G) − (F ∩ G) = (F − G) ∪ (G − F). (1.14)

Set difference is often denoted by a backslash, F\G. Since both set difference and set symmetric difference can be expressed in terms of unions and complements and intersections, these operations on events yield other events.

Set theoretic difference will quite often be used to express arbitrary unions of events as unions of disjoint events, a construction summarized in the following lemma for reference. The proof is omitted as it is elementary set theory.

Lemma 1.5. An arbitrary sequence of events F_n, n = 1,2,... can be converted into a sequence of disjoint events G_n with the same finite unions as follows: for n = 1,2,...

G_n = F_n − ⋃_{i=1}^{n−1} F_i = F_n − ⋃_{i=1}^{n−1} G_i, (1.15)

⋃_{i=1}^{n} G_i = ⋃_{i=1}^{n} F_i. (1.16)
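A small Python sketch of the construction in Lemma 1.5 (using Python sets of sample points as events; the names are ours): it produces the disjoint events G_n of (1.15) and confirms (1.16) for each n.

# Sketch of Lemma 1.5: convert a sequence of events F_1, F_2, ... into disjoint
# events G_n = F_n - (F_1 u ... u F_{n-1}) with the same finite unions.

def disjointify(F_list):
    """Return the list of G_n from (1.15) for events given as Python sets."""
    G_list = []
    union_so_far = set()
    for F in F_list:
        G_list.append(set(F) - union_so_far)   # G_n = F_n minus the earlier F_i
        union_so_far |= set(F)
    return G_list

F = [{1, 2, 3}, {2, 3, 4}, {5}, {1, 5, 6}]
G = disjointify(F)
print(G)  # [{1, 2, 3}, {4}, {5}, {6}]

# Check (1.16): the partial unions agree for every n, and the G_n are disjoint.
for n in range(1, len(F) + 1):
    assert set().union(*F[:n]) == set().union(*G[:n])
for i in range(len(G)):
    for j in range(i + 1, len(G)):
        assert not (G[i] & G[j])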

Thus an event space is a collection of subsets of a sample space (called events by virtue of belonging to the event space) such that any countable sequence of set theoretic operations (union, intersection, complementation, difference, symmetric difference) on events produces other events. Note that there are two extreme σ-fields: the largest possible σ-field of Ω is the collection of all subsets of Ω (sometimes called the power set), and the smallest possible σ-field is {Ω,∅}, the entire space together with the null set ∅ = Ω^c (called the trivial space).

If instead of the closure under countable unions required by (1.13), we only require that the collection of subsets be closed under finite unions, then we say that the collection of subsets is a field or algebra. Fields will usually be simpler to work with, but unfortunately they do not have sufficient structure to develop required convergence properties. In particular, σ-fields possess the additional important property that they contain all of the limits of sequences of sets in the collection. In particular, if F_n, n = 1,2,... is an increasing sequence of sets in a σ-field, i.e., F_{n−1} ⊂ F_n, and if F = ⋃_{n=1}^∞ F_n (in which case we write F_n ↑ F or lim_{n→∞} F_n = F), then also F is contained in the σ-field. This property may not hold true for fields, that is, fields need not contain the limits of sequences of field elements. If a field has the property that it contains the limits of all increasing sequences of its members, then it is also a σ-field. In a similar fashion we can define decreasing sets. If F_n decreases to F in the sense that F_{n+1} ⊂ F_n and F = ⋂_{n=1}^∞ F_n, then we write F_n ↓ F. If F_n ∈ B for all n, then F ∈ B.

Because of the importance of the notion of converging sequences of sets, it is useful to generalize the definition of a σ-field to emphasize such limits. A collection M of subsets of Ω is called a monotone class if it has the property that if F_n ∈ M for n = 1,2,... and either F_n ↑ F or F_n ↓ F, then also F ∈ M. Clearly a σ-field is a monotone class, but the reverse need not be true. If a field is also a monotone class, however, then it must be a σ-field.

A σ-field is sometimes referred to as a Borel field in the literature and the resulting measurable space called a Borel space. We will reserve this nomenclature for the more common use of these terms as the special case of a σ-field having a certain topological structure described in Section 1.4.

Generating σ -Fields and Fields

When constructing probability measures it will often be natural to define a candidate probability measure on a particular collection of important events, such as intervals in the real line, and then to extend the candidate measure to a measure on a σ-field that contains the important collection. It is convenient to do this in two steps, first extending the candidate measure to a field containing the important events (the field may have a simple structure easing such an extension), and then extending from the field to the full σ-field. This approach is captured in the idea of generating a field or σ-field from a collection of important sets.

Given any collection G of subsets of a space Ω, then σ(G) will denote the smallest σ-field of subsets of Ω that contains G and it will be called the σ-field generated by G. By smallest we mean that any σ-field containing G must also contain σ(G). The σ-field is well defined since there must exist at least one σ-field containing G, the collection of all subsets of Ω. The intersection of all σ-fields that contain G must itself be a σ-field, it must contain G, and it must in turn be contained by all σ-fields that contain G. Thus there is a smallest σ-field containing G.


Similarly, define the field generated by G, F(G), to be the smallest field containing G. Clearly G ⊂ F(G) ⊂ σ(G) and the field will often provide a useful intermediate step. In particular, σ(G) = σ(F(G)) so that if it is convenient or necessary to have a generating class be a field, F(G) is the natural class to consider.

A σ-field B will be said to be countably generated if there is a countable class G which generates B, that is, for which B = σ(G).

It would be a good idea to understand the structure of a σ-field generated from some class of events. Unfortunately, however, there is no constructive means of describing the σ-field so generated. That is, we cannot give a prescription of adding all countable unions, then all complements, and so on, and be ensured of thereby giving an algorithm for obtaining all σ-field members as sequences of set theoretic operations on members of the original collection. We can, however, provide some insight into the structure of such generated σ-fields when the original collection is a field. This structure will prove useful when considering extensions of measures.

Recall that a collection M of subsets of Ω is a monotone class if whenever F_n ∈ M, n = 1,2,..., and F_n ↑ F or F_n ↓ F, then also F ∈ M.

Lemma 1.6. Given a field F, then σ(F) is the smallest monotone class containing F.

Proof. Let M be the smallest monotone class containing F and let F ∈ M. Define M_F as the collection of all sets G ∈ M for which F ∩ G, F ∩ G^c, and F^c ∩ G are all in M. Then M_F is a monotone class. If F ∈ F, then all members of F must also be in M_F since they are in M and since F is a field. Since both classes contain F and M is the minimal monotone class, M ⊂ M_F. Since the members of M_F are all chosen from M, M = M_F. This implies in turn that for any G ∈ M, then all sets of the form G ∩ F, G ∩ F^c, and G^c ∩ F for any F ∈ F are in M. Thus for this G, all F ∈ F are members of M_G, or F ⊂ M_G for any G ∈ M. By the minimality of M, this means that M_G = M. We have now shown that for F, G ∈ M = M_F, then F ∩ G, F ∩ G^c, and F^c ∩ G are also in M. Thus M is a field. Since it also contains increasing limits of its members, it must be a σ-field and hence it must contain σ(F) since it contains F. Since σ(F) is a monotone class containing F, it must contain M; hence the two classes are identical.

Exercises

1. Prove (1.14). Hint: When proving two sets F and G are equal, the straightforward approach is to show first that if ω ∈ F, then also ω ∈ G and hence F ⊂ G. Reversing the procedure proves the sets equal.
2. Show that

F ∪ G = F ∪ (G − F),

and hence the union of two events can always be written as the union of two disjoint events.
3. Prove de Morgan's laws.
4. Show that the intersection of a collection of σ-fields is itself a σ-field.
5. Let Ω be an arbitrary space. Suppose that F_i, i = 1,2,... are all σ-fields of subsets of Ω. Define the collection F = ⋂_i F_i; that is, the collection of all sets that are in all of the F_i. Show that F is a σ-field.
6. Given a measurable space (Ω,F), a collection of sets G is called a sub-σ-field of F if it is a σ-field and if all of its elements belong to F, in which case we write G ⊂ F. Show that G is the intersection of all σ-fields of subsets of Ω of which it is a sub-σ-field.
7. Prove Lemma 1.5.

1.4 Borel Measurable Spaces

Given a metric space A together with a class of open sets O_A of A, the corresponding Borel field (actually a Borel σ-field) B(A) is defined as σ(O_A), the σ-field generated by the open sets. The members of B(A) are called the Borel sets of A. A metric space A together with a Borel σ-field B(A) is called a Borel measurable space (A,B(A)) or, simply, a Borel space when it is clear from context that the space is a measurable space. Thus a Borel space is a measurable space where the σ-field is generated by the open sets of the alphabet with respect to some metric. Observe that equivalent metrics on a space A will yield the same Borel measurable space since they yield the same open sets.

The structure of a Borel measurable space leads to several properties.

Lemma 1.7. A set is closed if and only if its complement is open and hence all closed sets are in the Borel σ-field. Arbitrary intersections and finite unions of closed sets are closed. B(A) = σ(all closed sets). The closure of a set is closed. If F̄ is the closure of F, then diam(F) = diam(F̄).

Proof. If F is closed and we choose x in F^c, then x is not a limit point of F and hence there is an open sphere S_r(x) that is disjoint with F and hence contained in F^c. Thus F^c must be open. Conversely, suppose that F^c is open and x is a limit point of F and hence every sphere S_r(x) contains a point of F. Since F^c is open this means that x cannot be in F^c and hence must be in F. Hence F is closed. This fact, de Morgan's laws, and Lemma 1.2 imply that finite unions and arbitrary intersections of closed sets are closed. Since the complement of every closed set is open and vice versa, the σ-fields generated by the two classes are the same. To prove that the closure of a set is closed, first observe that if x is in F̄^c, and hence is neither a point nor a limit point of F, then there is an open sphere S centered at x which has no points in F. Since every point y in S therefore has an open sphere containing it which contains no points in F, no such y can be a limit point of F. Thus F̄^c contains S and hence is open. Thus its complement, the closure of F, must be closed. Any points in the closure of a set must have points in the set within an arbitrarily close distance. Hence the diameter of a set and that of its closure are the same. □

In addition to open and closed sets, two other kinds of sets arise often in the study of Borel spaces: Any set of the form ⋂_{i=1}^∞ F_i with the F_i open is called a G_δ set or simply a G_δ. Similarly, any countable union of closed sets is called an F_σ. Observe that G_δ and F_σ sets are not necessarily either closed or open. We have, however, the following relation.

Lemma 1.8. Any closed set is a G_δ, any open set is an F_σ.

Proof. If F is a closed subset of A, then

F = ⋂_{n=1}^∞ {x : d(x,F) < 1/n},

and hence F is a G_δ. Taking complements completes the proof since any closed set is the complement of an open set and the complement of a G_δ is an F_σ. □

The added structure of Borel measurable spaces and Lemma 1.8 permit a different characterization of σ-fields:

Lemma 1.9. Given a metric space A, then B(A) is the smallest class of subsets of A that contains all the open (or closed) subsets of A, and which is closed under countable disjoint unions and countable intersections.

Proof. Let G be the given class. Clearly G ⊂ B(A). Since G contains the open (or closed) sets (which generate the σ-field B(A)), we need only show that it contains complements to complete the proof. Let H be the class of all sets F such that F ∈ G and F^c ∈ G. Clearly H ⊂ G. Since G contains the open sets and countable intersections, it also contains the G_δ's and hence also the closed sets from the previous lemma. Thus both the open sets and closed sets all belong to H. Suppose now that F_i ∈ H, i = 1,2,... and hence both F_i and their complements belong to G. Since G contains countable disjoint unions and countable intersections,

⋃_{i=1}^∞ F_i = ⋃_{i=1}^∞ (F_i ∩ F_1^c ∩ F_2^c ∩ ··· ∩ F_{i−1}^c) ∈ G

and

(⋃_{i=1}^∞ F_i)^c = ⋂_{i=1}^∞ F_i^c ∈ G.

Thus the countable unions of members of H and the complements of such sets are in G. This implies that H is closed under countable unions. If F ∈ H, then F and F^c are in G, hence also F^c ∈ H. Thus H is closed under complementation and countable unions and hence H contains both countable disjoint unions and countable intersections. Since H also contains the open sets, it must therefore contain G. Since we have already argued the reverse inclusion, H = G. This implies that G is closed under complementation since H is, thus completing the proof. □

1.5 Polish Spaces

Separable Metric Spaces

Let (A,d) be a metric space. A set F is said to be dense in A if every point in A is a point in F or a limit point of F. A metric space A is said to be separable if it has a countable dense subset, that is, if there is a discrete set, say B, such that all points in A can be well approximated by points in B in the sense that all points in A are points in B or limits of points in B. For example, the rational numbers are dense in the real line R and are countable, and hence R of Example 1.3 is separable. Similarly, n-dimensional vectors with rational coordinates are countable and dense in R^n, and hence Example 1.6 also provides a separable metric space. The discrete metric spaces of Examples 1.1 through 1.5 are trivially separable since A is countable and is dense in itself. An example of a nonseparable space is the binary sequence space M = {0,1}^{Z_+} with the sup-norm metric

d(x,y) = sup_{i∈Z_+} |x_i − y_i|.

The principal use of separability of metric spaces is the following lemma.

Lemma 1.10. Let A be a metric space with Borel σ-field B(A). If A is separable, then B(A) is countably generated and it contains all of the points of the set A.

Proof. Let {a_i; i = 1,2,...} be a countable dense subset of A and define the class of sets

V_A = {S_{1/n}(a_i); i = 1,2,...; n = 1,2,...}. (1.17)


Any open set F can be written as a countable union of sets in V_A analogous to (1.10) as

F = ⋃_{i: a_i∈F} ⋃_{n: 1/n<d(a_i,F^c)} S_{1/n}(a_i). (1.18)

Thus V_A is a countable class of open sets and σ(V_A) contains the class of open sets O_A which generates B(A). Hence B(A) is countably generated. To see that the points of A belong to B(A), index the sets in V_A as V_i; i = 0,1,2,... and define the sets V_i(a) by V_i if a is in V_i and V_i^c if a is not in V_i. Then

{a} = ⋂_i V_i(a) ∈ B(A). □

For example, the class of all open intervals (a,b) = {x : a < x < b} with rational endpoints b > a is a countable generating class of open sets for the Borel σ-field of the real line with respect to the usual Euclidean distance. Similar subsets of [0,1) form a similar countable generating class for the Borel sets of the unit interval.

Complete Metric Spaces

A sequence {a_n; n = 1,2,...} in A is called a Cauchy sequence if for every ε > 0 there is an integer N such that d(a_n,a_m) < ε if n ≥ N and m ≥ N. Alternatively, a sequence is Cauchy if and only if

lim_{n→∞} diam({a_n, a_{n+1}, ...}) = 0.

A metric space is complete if every Cauchy sequence converges, that is, if a_n is a Cauchy sequence, then there is an a in A for which a = lim_{n→∞} a_n. In the words of Simmons [100], p. 71, a complete metric space is one wherein “every sequence that tries to converge is successful.” A standard result of elementary real analysis is that Examples 1.3 and 1.6, [0,1)^k and R^k with the Euclidean distance, are complete (see Rudin [93], p. 46). A complete inner product space (Example 1.10) is called a Hilbert space. A complete normed space (Example 1.8) is called a Banach space.

A fundamental property of complete metric spaces is given in the following lemma, which is known as Cantor's intersection theorem. The corollary is a slight generalization.

Lemma 1.11. Given a complete metric space A, let F_n, n = 1,2,... be a decreasing sequence of nonempty closed subsets of A for which diam(F_n) → 0. Then the intersection of all of the F_n contains exactly one point.

Proof. Since the diameter of the sets is decreasing to zero, the intersection cannot contain more than one point. If x_n is a sequence of points in F_n, then it is a Cauchy sequence and hence must converge to some point x since A is complete. If we show that x is in the intersection, the proof will be complete. Since the F_n are decreasing, for each n the sequence x_k, k ≥ n is in F_n and converges to x. Since F_n is closed, however, it must contain x. Thus x is in all of the F_n and hence in the intersection. □

The similarity of the preceding condition to the finite intersection property of atoms of a basis of a standard space is genuine and will be exploited soon.

Corollary 1.2. Given a complete metric space A, let F_n, n = 1,2,... be a decreasing sequence of nonempty subsets of A such that diam(F_n) → 0 and such that the closure of each set is contained in the previous set, that is,

F̄_n ⊂ F_{n−1}, all n,

where we define F_0 = A. Then the intersection of all of the F_n contains exactly one point.

Proof. Since F_n ⊂ F̄_n ⊂ F_{n−1},

⋂_{n=1}^∞ F_n ⊂ ⋂_{n=1}^∞ F̄_n ⊂ ⋂_{n=0}^∞ F_n.

Since the F_n are decreasing, the leftmost and rightmost terms above are equal. Thus

⋂_{n=1}^∞ F_n = ⋂_{n=1}^∞ F̄_n.

Since the F_n are decreasing, so are the F̄_n. Since these sets are closed, the right-hand term contains a single point from the lemma. □

The following technical result will be useful toward the end of the book, but its logical place is here and it provides a little practice.

Lemma 1.12. The space [0,1] with the Euclidean metric |x − y| is sequentially compact. The space [0,1]^I with I countable and the metric of Example 1.12 is sequentially compact.

Proof. The second statement follows from the first and Corollary 1.1. To prove the first statement, let S_i^n, i = 1,2,...,n denote the closed spheres [(i−1)/n, i/n] having diameter 1/n and observe that for each n = 1,2,...

[0,1] ⊂ ⋃_{i=1}^n S_i^n.

Let x_i be an infinite sequence of distinct points in [0,1]. For n = 2 there is an integer k_1 such that K_1 = S_{k_1}^2 contains infinitely many x_i. Since S_{k_1}^n ⊂ ⋃_i S_i^3 we can find an integer k_2 such that K_2 = S_{k_1}^2 ∩ S_{k_2}^3 contains infinitely many x_i. Continuing in this manner we construct a sequence of sets K_n such that K_n ⊂ K_{n−1} and the diameter of K_n is less than 1/n. Since [0,1] is complete, it follows from the finite intersection property of Lemma 1.11 that ⋂_n K_n consists of a single point x ∈ [0,1]. Since every neighborhood of x contains infinitely many of the x_i, x must be a limit point of the x_i. □

Polish Spaces

A complete separable metric space is called a Polish space because of the pioneering work by Polish mathematicians on such spaces. Polish spaces play a central role in functional analysis and provide models for most alphabets of scalars, vectors, sequences, and functions that arise in communications applications. The best known examples are the Euclidean spaces, but we shall encounter others such as the space of square integrable functions. A Borel space (A,B(A)) with A a Polish metric space will be called a Polish Borel space.

1.6 Probability Spaces

A probability space (Ω,B,P) is a triple consisting of a sample space Ω, a σ-field B of subsets of Ω, and a probability measure P defined on the σ-field; that is, P(F) assigns a real number to every member F of B so that the following conditions are satisfied:

Nonnegativity:

P(F) ≥ 0, all F ∈ B, (1.19)

Normalization:

P(Ω) = 1. (1.20)

Countable Additivity:

If F_i ∈ B, i = 1,2,... are disjoint, then

P(⋃_{i=1}^∞ F_i) = ∑_{i=1}^∞ P(F_i). (1.21)

Since the probability measure is defined on a σ-field, such countable unions of subsets of Ω in the σ-field are also events in the σ-field.


Alternatively, a probability space is a measurable space together with a probability measure on the σ-field of the measurable space. A set function satisfying (1.21) only for finite sequences of disjoint events is said to be additive or finitely additive.

A set function P satisfying only (1.19) and (1.21) but not necessarily (1.20) is called a measure, and the triple (Ω,B,P) is called a measure space. The emphasis here will be almost entirely on probability spaces rather than the more general case of measure spaces. There will, however, be occasional uses for the more general idea as a means of constructing specific probability measures of interest. A measure space (Ω,B,P) is said to be finite if P(Ω) < ∞. Any finite measure space can be converted into a probability space by suitable normalization. A measure space is said to be sigma-finite if Ω is the union of a collection of measurable sets of finite measure, that is, if there is a collection of events {F_i} such that ⋃_i F_i = Ω and P(F_i) < ∞ for all i.

A straightforward exercise provides an alternative characterization of a probability measure involving only finite additivity, but requiring the addition of a continuity requirement: a set function P defined on events in the σ-field of a measurable space (Ω,B) is a probability measure if (1.19) and (1.20) hold and if the following conditions are met:

Finite Additivity: If F_i ∈ B, i = 1,2,...,n are disjoint, then

P(⋃_{i=1}^n F_i) = ∑_{i=1}^n P(F_i). (1.22)

Continuity at ∅: If G_n ↓ ∅ (the empty or null set); that is, if G_{n+1} ⊂ G_n, all n, and ⋂_{n=1}^∞ G_n = ∅, then

lim_{n→∞} P(G_n) = 0. (1.23)

The equivalence of continuity and countable additivity is obvious from Lemma 1.5. Countable additivity for the F_n will hold if and only if the continuity relation holds for the G_n. It is also easy to see that condition (1.23) is equivalent to two other forms of continuity:

Continuity from Below:

If F_n ↑ F, then lim_{n→∞} P(F_n) = P(F). (1.24)

Continuity from Above:

If F_n ↓ F, then lim_{n→∞} P(F_n) = P(F). (1.25)

Thus a probability measure is an additive, nonnegative, normalized set function on a σ-field or event space with the additional property that if a sequence of sets converges to a limit set, then the corresponding probabilities must also converge.

If we wish to demonstrate that a set function P is indeed a valid probability measure, then we must show that it satisfies the preceding properties (1.19), (1.20), and either (1.21) or (1.22) and one of (1.23), (1.24), or (1.25).

Observe that if a set function satisfies (1.19), (1.20), and (1.22), then for any disjoint sequence of events F_i and any n

P(⋃_{i=0}^∞ F_i) = P(⋃_{i=0}^n F_i) + P(⋃_{i=n+1}^∞ F_i)
               ≥ P(⋃_{i=0}^n F_i) = ∑_{i=0}^n P(F_i),

and hence we have, taking the limit as n → ∞, that

P(⋃_{i=0}^∞ F_i) ≥ ∑_{i=0}^∞ P(F_i). (1.26)

Thus to prove that P is a probability measure one must show that the preceding inequality is in fact an equality.

Example 1.14. Discrete Probability Spaces As a special case that is both simple and important, consider the case where Ω is discrete, either a finite or a countably infinite set. In this case the most common σ-field is the power set and any probability measure P has a simple description in terms of the probabilities it assigns to one-point sets. Given a probability measure P, define the corresponding probability mass function (PMF) p by

p(ω) = P({ω}). (1.27)

From countable additivity, the probability of any event F is easily found by summing the PMF:

P(F) = ∑_{ω∈F} p(ω). (1.28)

Conversely, given any function p : Ω → [0,1] that is nonnegative and normalized in the sense that

∑_{ω∈Ω} p(ω) = 1,

then the properties of summation imply that the set function defined by (1.28) is a probability measure. This is obvious for a finite discrete set, and it follows from the properties of summation for a countably infinite set.
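To make the discrete construction concrete, here is a small Python sketch (an illustration only; the sample space and PMF are invented) that builds a probability measure from a PMF as in (1.27) and evaluates events by summation as in (1.28), checking (1.19), (1.20), and (1.22) on the power set.

```python
# A discrete probability space: Omega finite, sigma-field = power set,
# and P defined by summing a probability mass function p over events.
from itertools import chain, combinations

Omega = ["a", "b", "c", "d"]
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # a PMF: nonnegative, sums to 1

def P(F):
    """Probability of an event F (any subset of Omega), as in (1.28)."""
    return sum(p[omega] for omega in F)

# The power set of Omega is the usual sigma-field in the discrete case.
events = [set(s) for s in chain.from_iterable(
    combinations(Omega, r) for r in range(len(Omega) + 1))]

assert abs(P(Omega) - 1.0) < 1e-12            # normalization (1.20)
assert all(P(F) >= 0 for F in events)         # nonnegativity (1.19)
F, G = {"a"}, {"c", "d"}                      # finite additivity (1.22) on disjoint events
assert abs(P(F | G) - (P(F) + P(G))) < 1e-12
print(P({"a", "b"}))                          # 0.75
```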


Elementary Properties of Probability

The next lemma collects a variety of elementary probability relations for later reference.

Lemma 1.13. Suppose that (Ω,B, P) is a probability space. All sets below are assumed to be members of B.

1. If G ⊂ F, then P(F − G) = P(F) − P(G).
2. Monotonicity: If G ⊂ F, then P(G) ≤ P(F).
3. Countable Subadditivity: Given a sequence of events Fi; i = 1, 2, . . ., then

P(⋃_{i=1}^∞ Fi) ≤ ∑_{i=1}^∞ P(Fi). (1.29)

This is also referred to as the Bonferroni or union bound.
4. Given two events F and G, two notions of how dissimilar the sets are in a probabilistic sense will be encountered shortly and throughout the book: the probability of the symmetric difference of the two sets and the difference of the probabilities of the two sets. The following inequality shows that the latter never exceeds the former:

| P(F) − P(G) | ≤ P(F∆G). (1.30)

5.

P(F∆G) = P(F ∪G)− P(F ∩G) (1.31)

= P(F −G)+ P(G − F) (1.32)

= P(F)+ P(G)− 2P(F ∩G) (1.33)

6. A triangle inequality

P(F∆G) ≤ P(F∆H)+ P(H∆G) (1.34)

Proof. 1. Since G ⊂ F, P(F) = P(F ∩ G) + P(F ∩ Gc) = P(G) + P(F − G).
2. Follows from the previous part and nonnegativity of probability.
3. Use Lemma 1.5, additivity, and monotonicity to write for all n

P(⋃_{i=1}^n Fi) = P(⋃_{i=1}^n Gi) = ∑_{i=1}^n P(Gi) ≤ ∑_{i=1}^n P(Fi). (1.35)

Taking the limit as n → ∞, the result follows from the continuity of probability and the fact that the right hand side is monotonically nondecreasing and hence has a limit.

4. Since F ∩ G ⊂ F ∪ G, P(F∆G) = P(F ∪ G − F ∩ G) = P(F ∪ G) − P(F ∩ G), and P(F ∪ G) = P(F) + P(G) − P(F ∩ G), we have

P(F∆G) = P(F) + P(G) − 2P(F ∩ G)
≥ P(F) + P(G) − 2 min(P(F), P(G))
= | P(F) − P(G) |.

5. Eq. (1.31) follows from (1.14) and the first part of the lemma. Eq. (1.32) follows from (1.14) and the third axiom of probability. Eq. (1.33) follows from (1.31) and (1.37).

6.

F∆G = (F ∩ Gc) ∪ (Fc ∩ G)
= (F ∩ Gc ∩ H) ∪ (F ∩ Gc ∩ Hc) ∪ (Fc ∩ G ∩ H) ∪ (Fc ∩ G ∩ Hc)
= (F ∩ Gc ∩ Hc) ∪ (Fc ∩ G ∩ H) ∪ (F ∩ Gc ∩ H) ∪ (Fc ∩ G ∩ Hc)
⊂ (F ∩ Hc) ∪ (Fc ∩ H) ∪ (Gc ∩ H) ∪ (G ∩ Hc)
= (F∆H) ∪ (H∆G)

and therefore from monotonicity and the union bound

P(F∆G) ≤ P((F∆H) ∪ (H∆G)) ≤ P(F∆H) + P(H∆G).

2

Approximation of Events and Probabilities

Lemma 1.13 yields some useful notions of approximation for events and probability measures. First consider the triangle inequality for the probability of symmetric differences.

Example 1.15. Given a probability space (Ω,B, P), define the function dP : B × B → [0,1] by

dP(F,G) = P(F∆G). (1.36)

Then dP is a pseudometric and (B, dP) is a pseudometric space.

The function dP is only a pseudometric because the distance is 0 if the two events differ by a null set. We could construct an ordinary metric space by replacing B by the space of equivalence classes of its members, that is, by identifying any two events which differ by a null set. The pseudometric space (B, dP) will be referred to as the pseudometric space associated with the probability space (Ω,B, P) or simply as the associated pseudometric space.
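As a quick numerical illustration of the associated pseudometric (the uniform six-point space below is an invented example), the following Python fragment computes dP(F,G) = P(F∆G) on the power set of a small space and verifies the triangle inequality (1.34) exhaustively.

```python
# The pseudometric d_P(F, G) = P(F symmetric-difference G) on the events of a
# small discrete probability space, with a brute-force check of (1.34).
from itertools import chain, combinations

Omega = list(range(6))
p = {w: 1 / 6 for w in Omega}               # uniform PMF

def P(F):
    return sum(p[w] for w in F)

def dP(F, G):
    return P(F ^ G)                         # ^ is symmetric difference of sets

events = [frozenset(s) for s in chain.from_iterable(
    combinations(Omega, r) for r in range(len(Omega) + 1))]

for F in events:
    for G in events:
        for H in events:
            assert dP(F, G) <= dP(F, H) + dP(H, G) + 1e-12
print(dP({0, 1, 2}, {2, 3}))                # 3/6 = 0.5
```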

A similar metric space based on probabilities is the partition distance between two measurable partitions of a probability space. Given a measurable space (Ω,B) and an index set I (such as Z(n)), P = {Pi; i ∈ I} is a (measurable) partition of Ω if the Pi are disjoint, ⋃_{i∈I} Pi = Ω, and Pi ∈ B for all i ∈ I. The sets Pi are called the atoms of the partition.

Example 1.16. Suppose that (Ω,B, µ) is a probability space. Let P denote the class of all measurable partitions of Ω with K atoms, that is, a member of P has the form P = {Pi, i = 1, 2, . . . , K}, where the Pi ∈ B are disjoint. A metric on the space is

d(P,Q) = ∑_{i=1}^K µ(Pi∆Qi).

This metric is called the partition distance. As previously, strictly speaking d is a pseudometric.

Exercises

1. Prove that if P satisfies (1.19), (1.20), and (1.22), then (1.23)-(1.25) are equivalent, that is, any one holds if and only if the other two also hold.

2. Prove the following elementary properties of probability (all sets are assumed to be events).

a. P(F ∪ G) = P(F) + P(G) − P(F ∩ G). (1.37)
b. P(Fc) = 1 − P(F).
c. For all events F, P(F) ≤ 1.
d. If Fi, i = 1, 2, . . . is a sequence of events that all have probability 1, show that ⋂_i Fi also has probability 1.

3. Suppose that Pi, i = 1, 2, . . . is a countable family of probability measures on a space (Ω,B) and that ai, i = 1, 2, . . . is a sequence of positive real numbers that sums to one. Show that the set function m defined by

m(F) = ∑_{i=1}^∞ ai Pi(F)

is also a probability measure on (Ω,B). This is an example of a mixture probability measure.

4. Show that the partition distance in Example 1.16 satisfies

d(P,Q) = 2 ∑_{i≠j} m(Pi ∩ Qj) = 2(1 − ∑_i m(Pi ∩ Qi)).


1.7 Complete Probability Spaces

Completion is a technical detail which we will ignore most of the time, but which needs to be considered occasionally because it is required to explain differences among the most important probability spaces, especially the standard Borel spaces emphasized here and the Lebesgue spaces that are ubiquitous in ergodic theory. Completion is literally a topic of much ado about nothing: how to treat possibly nonmeasurable subsets of sets of measure zero. The issue will take up space here and in the development of the extension theorem in order to have a reference point for avoiding the issue later, save for occasional comparisons. Doubtless mathematicians may find the discussion belaboring the obvious and engineers will wonder why it is raised at all, but I have found the subject sufficiently annoying at times to try to clarify it in context and thereby to avoid the appearance of being too cavalier where the subject does not crop up.

By way of motivation, completion yields a bigger collection of events when one has a specific measure or candidate measure, but not considering completion allows consideration of important properties of a measurable space common to all measures. Usually our emphasis is on the latter, but there are two specific named measures, the Borel measure and the Lebesgue measure, which deserve special consideration, and that requires the notion of complete probability spaces.

Given a probability space (Ω,B, P), an event F ∈ B is called a null set (with respect to P) or a P-null set if P(F) = 0. A probability space is said to be complete if all subsets of null sets are themselves in B. Sometimes it is the σ-field B that is said to be complete, but completeness depends on the measure and not only on the σ-field. Any probability space that is not complete can be made complete (completed) by simply incorporating all subsets of null sets in a suitable manner, as described in the following lemma.

Lemma 1.14. Given a probability space (Ω,B, P), define BP as the collection of all sets of the form F ∪ N, where F ∈ B and N is a subset of a P-null set. Then BP is complete, contains B, and is the smallest σ-field with these properties. The probability measure P extends to a probability measure P̄ on the completed space (Ω,BP) so that P̄(F) = P(F) for all F in the original σ-field B.

Proof. Since ∅ is a subset of all sets including null sets, B ⊂ BP. By construction BP is closed under countable unions. To see that BP is also closed under complementation and hence is a σ-field, first observe that if F ∈ B and N is a subset of a null set G, then F ∪ N = (F ∪ G) − [G ∩ (F ∪ N)c] = F′ − G′ = F′ ∩ (G′)c, where F′ = F ∪ G ∈ B and G′ = G ∩ (F ∪ N)c is a subset of a null set. Thus (F ∪ N)c = (F′)c ∪ G′ ∈ BP.


If B′ is a complete σ-field that contains B, then it must contain all subsets of P-null sets and hence all sets of the form F ∪ N for F ∈ B and N a subset of a P-null set; thus B′ must contain BP. Therefore BP is the smallest σ-field with the desired properties.

Define the set function P̄ on BP by P̄(F ∪ N) = P(F). To see that the set function is well defined, suppose that G = F ∪ N = F′ ∪ N′ with F, F′ ∈ B, N, N′ subsets of P-null sets, say M and M′, respectively. Then the symmetric difference F∆F′ ∈ B since F, F′ ∈ B, and

F∆F′ ⊂ (F ∩ N′) ∪ (F′ ∩ N) ⊂ (F ∩ M′) ∪ (F′ ∩ M) ⊂ M ∪ M′,

which is a null set. Hence P(F∆F′) = 0. From (1.30) this implies that P(F) = P(F′), and hence the set function P̄ is well-defined.

By construction P̄ is countably additive and normalized and hence is a probability measure. Also by construction, P̄ is complete and for every set F ∈ B, P̄(F) = P(F), so that the two probability measures agree on sets for which both are defined. 2

The complete probability space (Ω,BP, P̄) is called the completion of (Ω,B, P). It is common practice to drop the overbar and superscript P and simply use the same notation for the completed space as for the original space unless it is useful to distinguish the two spaces.

Completion has no effect on the computation of probabilities of interest, and in a similar vein completion will not affect the computation of expectations encountered later. It does affect the measurability of sets and, as described later, of functions. As mentioned earlier, the primary inconvenience of dealing with complete probability spaces is that the details depend on the specific measure. We will often be concerned with constructing measures and hence will initially be dealing with measurable spaces that do not yet have a measure with respect to which completion is defined. Hence the issue of completion is usually ignored until or unless it is needed, at which point we can complete the probability space with respect to a specific measure.

1.8 Extension

Event spaces or σ-fields can be generated from a class of sets, for example a simple field of subsets of Ω. In this section a similar method is used to find a probability measure on an event space by extending a nonnegative, countably additive, normalized set function on a field. This solves the first and more traditional step of constructing a probability measure: demonstrating that a nonnegative, countably additive, normalized set function on a field extends to a probability measure on the σ-field generated by the field. The more difficult problem, left for later, is demonstrating that a candidate set function on a field is in fact countably additive there; the remaining properties are usually easy to demonstrate. This first step is accomplished by the Carathéodory extension theorem. The theorem states that if we have a probability measure on a field, then there exists a unique probability measure on the σ-field that agrees with the given probability measure on events in the field. We shall develop the result in a series of steps. The development is patterned on that of Halmos [45].

Suppose that F is a field of subsets of a space Ω and that P is a probability measure on the field F; that is, P is a nonnegative, normalized, countably additive set function when confined to sets in F. We wish to obtain a probability measure, say λ, on σ(F) with the property that for all F ∈ F, λ(F) = P(F). Eventually we will also wish to demonstrate that there is only one such λ. Toward this end define the set function

λ(F) = inf_{Fi∈F, F⊂⋃_i Fi} ∑_i P(Fi). (1.38)

The infimum is over all countable collections of field elements whose unions contain the set F. Such a collection of field members whose union contains F is called a cover of F. From Lemma 1.5 we could confine interest to covers whose members are all disjoint. Observe that this set function is defined for all subsets of Ω. From the definition, given any set F and any ε > 0, there exists a cover {Fi} such that

∑_i P(Fi) − ε ≤ λ(F) ≤ ∑_i P(Fi). (1.39)

A cover satisfying (1.39) will be called an ε-cover for F.

The goal is to show that λ is in fact a probability measure on σ(F). Obviously λ is nonnegative, so it must be shown to be normalized and countably additive. This is accomplished in a series of steps, beginning with the simplest:

Lemma 1.15. The set function λ of (1.38) satisfies

(a) λ(∅) = 0.
(b) Monotonicity: If F ⊂ G, then λ(F) ≤ λ(G).
(c) Subadditivity: For any two sets F, G,

λ(F ∪ G) ≤ λ(F) + λ(G). (1.40)

(d) Countable Subadditivity: Given sets Fi,

λ(⋃_i Fi) ≤ ∑_{i=1}^∞ λ(Fi). (1.41)


Proof. Property (a) follows immediately from the definition since ∅ ∈ F and contains itself, hence λ(∅) ≤ P(∅) = 0. From (1.39), given G and ε we can choose a cover {Gn} for G such that

∑_i P(Gi) ≤ λ(G) + ε.

Since F ⊂ G, a cover for G is also a cover for F and hence

λ(F) ≤ ∑_i P(Gi) ≤ λ(G) + ε.

Since ε is arbitrary, (b) is proved. To prove (c), let {Fi} and {Gi} be ε/2 covers for F and G. Then {Fi ∪ Gi} is a cover for F ∪ G and hence

λ(F ∪ G) ≤ ∑_i P(Fi ∪ Gi) ≤ ∑_i P(Fi) + ∑_i P(Gi) ≤ λ(F) + λ(G) + ε.

Since ε is arbitrary, this proves (c). To prove (d), for each Fi let {Fik; k = 1, 2, . . .} be an ε2^{−i} cover for Fi. Then {Fik; i = 1, 2, . . . ; k = 1, 2, . . .} is a cover for ⋃_i Fi and hence

λ(⋃_i Fi) ≤ ∑_{i=1}^∞ ∑_{k=1}^∞ P(Fik) ≤ ∑_{i=1}^∞ (λ(Fi) + ε2^{−i}) = ∑_{i=1}^∞ λ(Fi) + ε,

which completes the proof since ε is arbitrary. 2

We note in passing that a set function λ on a collection of sets having properties (a)-(d) of the lemma is called an outer measure on the collection of sets.

The simple properties have an immediate corollary: the set function λ agrees with P on field events and hence it makes sense to call λ an extension of P.

Corollary 1.3. If F ∈ F , then λ(F) = P(F). Thus, for example, λ(Ω) = 1.

Proof. Since a set covers itself, we have immediately that λ(F) ≤ P(F) for all field events F. Suppose that {Fi} is an ε-cover for F. Then

λ(F) ≥ ∑_i P(Fi) − ε ≥ ∑_i P(F ∩ Fi) − ε.

Since F ∩ Fi ∈ F and F ⊂ ⋃_i Fi, ⋃_i (F ∩ Fi) = F ∩ ⋃_i Fi = F ∈ F and hence, invoking the countable additivity of P on F (taking the cover to be disjoint),

∑_i P(F ∩ Fi) = P(⋃_i (F ∩ Fi)) = P(F)

and hence

λ(F) ≥ P(F) − ε.


2
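As a concrete, if toy, illustration of the construction, the following Python sketch computes λ of (1.38) by brute force for a finite field of subsets of a four-point space and checks Corollary 1.3, that λ agrees with P on field events. In the finite case covers reduce to finite unions, so the infimum can be enumerated; the specific space, field, and measure are invented for the example.

```python
# Brute-force evaluation of the outer measure lambda of (1.38) on a toy
# finite example, checking that lambda agrees with P on field events
# (Corollary 1.3).
from itertools import chain, combinations

def powerset(xs):
    xs = list(xs)
    return [frozenset(s) for s in chain.from_iterable(
        combinations(xs, r) for r in range(len(xs) + 1))]

atoms = [frozenset({0}), frozenset({1}), frozenset({2, 3})]       # a partition of Omega
field = [frozenset().union(*(atoms[i] for i in c)) if c else frozenset()
         for c in powerset(range(len(atoms)))]                    # all unions of atoms

p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
def P(F):                        # P restricted to the field is a probability measure
    return sum(p[w] for w in F)

def lam(F):
    """Outer measure (1.38): infimum of sum_i P(F_i) over field covers of F."""
    best = float("inf")
    for cover in powerset(range(len(field))):
        sets = [field[i] for i in cover]
        if sets and F <= frozenset().union(*sets):
            best = min(best, sum(P(S) for S in sets))
    return best

for F in field:                  # Corollary 1.3: lambda = P on field events
    assert abs(lam(F) - P(F)) < 1e-12
print(lam(frozenset({2})))       # 0.7: the cheapest field cover of {2} is {2,3}
```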

Thus far the properties of λ have depended primarily on manipulating the definition. In order to show that λ is countably additive (or even only finitely additive) over σ(F), a new concept is required, one which yields a new collection of sets that we will later see contains σ(F). The definitions may seem artificial, but some similar form of tool seems to be necessary to get to the desired goal. By way of motivation, we are trying to show that a set function is additive on some class of sets. Perhaps the simplest form of additivity looks like

λ(F) = λ(F ∩ R)+ λ(F ∩ Rc).

Hence it should not seem too strange to build at least this form into the class of sets considered. To do this, define a set R ∈ σ(F) to be λ-measurable if

λ(F) = λ(F ∩ R)+ λ(F ∩ Rc), all F ∈ σ(F).

In words, a set R is λ-measurable if it splits all events in σ(F) in an additive fashion. This property is not determined by the σ-field alone; it depends on how the outer measure λ behaves on members of σ(F).

Let H denote the collection of all λ-measurable sets. We shall see that indeed λ is countably additive on the collection H and that H contains σ(F). Observe for later use that since λ is subadditive, to prove that R ∈ H requires only that we prove

λ(F) ≥ λ(F ∩ R)+ λ(F ∩ Rc), all F ∈ σ(F).

Lemma 1.16. H is a field.

Proof. Clearly Ω ∈ H since λ(∅) = 0 and Ω ∩ F = F. Equally clearly, F ∈ H implies that Fc ∈ H. The only work required is to show that if F, G ∈ H, then also F ∪ G ∈ H. To accomplish this begin by recalling that for F, G ∈ H and for any H ∈ σ(F) we have that

λ(H) = λ(H ∩ F)+ λ(H ∩ Fc)

and hence since both arguments on the right are members of σ(F)

λ(H) = λ(H ∩ F ∩G)+ λ(H ∩ F ∩Gc)+ λ(H ∩ Fc ∩G)+ λ(H ∩ Fc ∩Gc). (1.42)

As (1.42) is valid for any event H ∈ σ(F), we can replace H by H ∩ (F ∪ G) to obtain


λ(H ∩ (F ∪G)) = λ(H ∩ (F ∩G))+ λ(H ∩ (F ∩Gc))+ λ(H ∩ (Fc ∩G)), (1.43)

where the fourth term has disappeared since the argument is null. Plugging (1.43) into (1.42) yields

λ(H) = λ(H ∩ (F ∪G))+ λ(H ∩ (Fc ∩Gc))

which implies that F ∪ G ∈ H since Fc ∩ Gc = (F ∪ G)c. 2

Lemma 1.17. H is a σ -field.

Proof. Suppose that Fn ∈ H, n = 1, 2, . . . and

F = ⋃_{i=1}^∞ Fi.

It is convenient to work with disjoint sets, so use the decomposition of Lemma 1.5 and define as in (1.16) Gk = Fk − ⋃_{i=1}^{k−1} Fi, k = 1, 2, . . . Since H is a field, clearly the Gi and all finite unions of Fi or Gi are also in H. First apply (1.43) with G1 and G2 replacing F and G. Using the fact that G1 and G2 are disjoint,

λ(H ∩ (G1 ∪ G2)) = λ(H ∩ G1) + λ(H ∩ G2).

Using this equation and mathematical induction yields

λ(H ∩ ⋃_{i=1}^n Gi) = ∑_{i=1}^n λ(H ∩ Gi). (1.44)

For convenience define the set

F(n) = ⋃_{i=1}^n Fi = ⋃_{i=1}^n Gi.

Since F(n) ∈ H and since H is a field, F(n)c ∈ H and hence for any event H ∈ σ(F)

λ(H) = λ(H ∩ F(n)) + λ(H ∩ F(n)c).

Using the preceding formula, (1.44), the fact that F(n) ⊂ F and hence Fc ⊂ F(n)c, the monotonicity of λ, and the countable subadditivity of λ,

λ(H) ≥ ∑_{i=1}^∞ λ(H ∩ Gi) + λ(H ∩ Fc) ≥ λ(H ∩ F) + λ(H ∩ Fc). (1.45)

Since H is an arbitrary event in σ(F), this proves that the countable union F is in H and hence that H is a σ-field. 2


In fact much more was proved than the stated lemma; the implicit conclusion is made explicit in the following corollary.

Corollary 1.4. λ is countably additive on H.

Proof. Take Gn to be an arbitrary sequence of disjoint events in H in the preceding proof (instead of obtaining them as differences of another arbitrary sequence) and let F denote the union of the Gn. Then from the lemma F ∈ H and (1.45) with H = F implies that

λ(F) ≥ ∑_{i=1}^∞ λ(Gi).

Since λ is countably subadditive, the inequality must be an equality, proving the corollary. 2

The strange class H is next shown to be exactly the σ-field generated by F.

Lemma 1.18. H = σ(F).

Proof. Since the members of H were all chosen from σ(F), H ⊂ σ(F). Let F ∈ F and G ∈ σ(F) and let {Gn} be an ε-cover for G. Then

λ(G) + ε ≥ ∑_{n=1}^∞ P(Gn) = ∑_{n=1}^∞ (P(Gn ∩ F) + P(Gn ∩ Fc))
≥ λ(G ∩ F) + λ(G ∩ Fc),

which implies that for F ∈ F

λ(G) ≥ λ(G ∩ F) + λ(G ∩ Fc)

for all events G ∈ σ(F) and hence that F ∈ H. This implies that H contains F. Since H is a σ-field, however, it therefore must contain σ(F). 2

It has now been proved that λ is a probability measure on σ(F) which agrees with P on F. The only remaining question is whether it is unique. The following lemma resolves this issue.

Lemma 1.19. If two probability measures λ and µ on σ(F) satisfy λ(F) =µ(F) for all F in the field F , then the measures are identical; that is, λ(F)= µ(F) for all F ∈ σ(F).

Proof. Let M denote the collection of all sets F such that λ(F) = µ(F). From the continuity of probability, M is a monotone class. Since it contains F, it must contain the smallest monotone class containing F. That class is σ(F) from Lemma 1.6 and hence M contains the entire σ-field. 2

Combining all of the pieces of this section yields the principal result.


Theorem 1.1. The Carathéodory Extension Theorem. If a set function P satisfies the properties of a probability measure (nonnegativity, normalization, and countable additivity or finite additivity plus continuity) for all sets in a field F of subsets of a space Ω, then there exists a unique measure λ on the measurable space (Ω, σ(F)) that agrees with P on F.

When no confusion is possible, we will usually denote the extension of a measure P by P rather than by a different symbol.

This section is closed with an important corollary to the extension theorem that shows that an arbitrary event can be approximated closely by a field event in a probabilistic way.

Corollary 1.5. (Approximation Theorem) Given a probability space (Ω,B, P) and a generating field F, that is, B = σ(F), then given F ∈ B and ε > 0, there exists an F0 ∈ F such that P(F∆F0) ≤ ε.

Proof. From the definition of λ in (1.38) (which yielded the extension of the original measure P) and the ensuing discussion one can find a countable collection of disjoint field events {Fi} such that F ⊂ ⋃_i Fi and

∑_{i=1}^∞ P(Fi) − ε/2 ≤ λ(F) ≤ ∑_{i=1}^∞ P(Fi).

Renaming λ as P,

P(F∆⋃_{i=1}^∞ Fi) = P(⋃_{i=1}^∞ Fi − F) = P(⋃_{i=1}^∞ Fi) − P(F) ≤ ε/2.

Choose a finite union F0 = ⋃_{i=1}^n Fi with n chosen large enough to ensure that

P(⋃_{i=1}^∞ Fi ∆ F0) = P(⋃_{i=1}^∞ Fi − F0) = ∑_{i=n+1}^∞ P(Fi) ≤ ε/2.

From the triangle inequality (1.34) we see that

P(F∆F0) ≤ P(F∆⋃_{i=1}^∞ Fi) + P(⋃_{i=1}^∞ Fi ∆ F0) ≤ ε. 2

If a field F is used to generate the σ-field, then the approximation theorem says that for a probability space (Ω, σ(F), P), any event can be approximated arbitrarily closely, in the sense of probability of symmetric difference and hence in the pseudometric dP of the associated pseudometric space (σ(F), dP) (see Example 1.15), by one of a countable collection of “points” in σ(F), that is, by a member of F. If this were a metric space, this would mean that the metric space is separable. Extending the definition of separability to pseudometric spaces (or replacing the pseudometric space by a metric space of equivalence classes) shows that if a σ-field is countably generated, then it is separable with respect to dP.
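For intuition only, the following Python fragment (an invented illustration, not part of the formal development) approximates the set F = [0, 1/3) under the uniform measure on [0,1) by finite unions of dyadic intervals [k2^{−n}, (k+1)2^{−n}), which are field events generating the Borel σ-field; the probability of the symmetric difference shrinks as the resolution grows.

```python
# Approximating F = [0, 1/3) under the uniform measure on [0,1) by finite
# unions of dyadic intervals (generating field events).
from fractions import Fraction

F_left, F_right = Fraction(0), Fraction(1, 3)

def best_dyadic_approximation(n):
    """Union of cells [k/2^n, (k+1)/2^n) minimizing P(F delta F0)."""
    width = Fraction(1, 2**n)
    chosen, sym_diff = [], Fraction(0)
    for k in range(2**n):
        a, b = k * width, (k + 1) * width
        overlap = max(Fraction(0), min(b, F_right) - max(a, F_left))
        if overlap > width - overlap:        # include the cell iff it helps
            chosen.append((a, b))
            sym_diff += width - overlap      # part of the cell outside F
        else:
            sym_diff += overlap              # part of F not covered
    return chosen, sym_diff

for n in (2, 4, 8):
    F0, d = best_dyadic_approximation(n)
    print(n, float(d))      # P(F delta F0) shrinks as the resolution n grows
```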

Exercise

1. Show that if m and p are two probability measures on (Ω, σ(F)), where F is a field, then given an arbitrary event F and ε > 0 there is a field event F0 ∈ F such that m(F∆F0) ≤ ε and p(F∆F0) ≤ ε.


Chapter 2

Random Processes and Dynamical Systems

Abstract The theoretical fundamentals of random variables, random processes, and dynamical systems are developed and elaborated. The emphasis is on discrete time, where the general notion of a dynamical system, a probability space along with a transformation, yields a random process when combined with a random variable. The transformation might, but need not, correspond to a time shift operation. Continuous time processes are discussed briefly by considering a family of transformations or flow. Probability distributions describing the variables and processes are related using product σ-fields. The basic ideas of codes, filters, factors, and isomorphism are introduced.

2.1 Measurable Functions and Random Variables

Given a measurable space (Ω,B), let (A,BA) denote another measurable space. The first space can be thought of as an input space and the second as an output space. A random variable or measurable function defined on (Ω,B) and taking values in (A,BA) is a mapping or function f : Ω → A (or f : (Ω,B) → (A,BA) when it helps to make the σ-fields explicit) with the property that

if F ∈ BA, then f−1(F) = {ω : f(ω) ∈ F} ∈ B. (2.1)

The name random variable is commonly associated with the case where A is the real line R and BA = B(R), the Borel field of subsets of the real line. Occasionally a more general sounding name such as random object is used for a measurable function to include implicitly real-valued random variables (A = R, the real line), real-valued random vectors (A = Rk, k-dimensional Euclidean space), and real-valued random processes (A a sequence or waveform space), as well as not necessarily real-valued random variables, vectors, and processes. We will usually use the term random variable in the general sense of including both scalar- and vector-valued functions, providing clarification when needed or helpful. When it is not clear from context, random variables etc. will be said to be real-valued if the space A is either a subset of the real line or a product space of such subsets.

A random variable is just a function or mapping with the property that inverse images of output events determined by the random variable are events in the original measurable space (input events). This simple property ensures that the output of the random variable will inherit its own probability measure. For example, with the probability measure Pf defined by

Pf(B) = P(f−1(B)) = P({ω : f(ω) ∈ B}); B ∈ BA,

(A,BA, Pf) becomes a probability space since measurability of f and elementary set theory ensure that Pf is indeed a probability measure. The induced probability measure Pf is called the distribution of the random variable f. The measurable space (A,BA) or, simply, the sample space A is called the alphabet of the random variable f. We shall occasionally also use the abbreviated notation Pf−1 for Pf−1(F) = P(f−1(F)). The shorter notation is less awkward when f itself is a function with a complicated name, e.g., ΠT→M.
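The following short Python sketch (the space, PMF, and function are an invented toy example) computes the induced distribution Pf of a random variable on a finite probability space by pushing the measure forward through f.

```python
# Induced distribution ("distribution of f"): push a probability measure P
# on a finite sample space forward through a measurable function f.
from collections import defaultdict

Omega = [0, 1, 2, 3, 4, 5]
p = {w: 1 / 6 for w in Omega}            # uniform P on Omega (a fair die, say)

def f(w):                                # a random variable f : Omega -> A
    return w % 2                          # A = {0, 1}: parity of the outcome

Pf = defaultdict(float)                  # Pf(a) = P(f^{-1}({a}))
for w in Omega:
    Pf[f(w)] += p[w]

print(dict(Pf))                          # {0: 0.5, 1: 0.5}
```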

If the alphabet A of a random variable f is not clear from context, then we shall refer to f as an A-valued random variable. If f is a measurable function from (Ω,B) to (A,BA), we will say that f is B/BA-measurable. The names of the σ-fields may be omitted if they are clear from context.

If the alphabet A of a random variable f is a discrete set, then the random variable is called a discrete-alphabet or discrete random variable and its distribution Pf is completely described by a probability mass function.

Composite Mappings

Given a measurable space (Ω,B), a measurable function f defined on that space, and the induced output measurable space (A,BA) for the measurable function, a new random variable g can be defined on the output space (A,BA) with its own output measurable space (B,BB). The composite or cascade mapping g ◦ f : (Ω,B) → (B,BB) defined by g ◦ f(ω) = g(f(ω)) is then a measurable function on the original space since for F ∈ BB we have that (g ◦ f)−1(F) = f−1(g−1(F)) ∈ B since the individual functions are measurable. When it is clear from context that composite functions and not ordinary products are being considered, the notation will be abbreviated to gf = g ◦ f.

Continuous Mappings

Suppose that we have two Borel spaces (A,B(A)) and (B,B(B)) with metrics d and ρ, respectively. A function f : A → B is continuous at a in A if for each ε > 0 there exists a δ > 0 such that d(x,a) < δ implies that ρ(f(x), f(a)) < ε. It is an easy exercise to show that if a is a limit point, then f is continuous at a if and only if an → a implies that f(an) → f(a).

A function is continuous if it is continuous at every point in A. The following lemma collects the main properties of continuous functions: a definition in terms of open sets and their measurability.

Lemma 2.1. A function f is continuous if and only if the inverse image f−1(S) of every open set S in B is open in A. A continuous function is Borel measurable.

Proof. Assume that f is continuous and S is an open set in B. To prove f−1(S) is open we must show that about each point x in f−1(S) there is an open sphere Sr(x) contained in f−1(S). Since S is open we can find an s such that ρ(f(x), y) < s implies that y is in S. By continuity we can find an r such that d(x, b) < r implies that ρ(f(x), f(b)) < s. If f(x) and f(b) are in S, however, x and b are in the inverse image and hence Sr(x) so obtained is also in the inverse image. Conversely, suppose that f−1(S) is open for every open set S in B. Given x in A and ε > 0 define the open sphere S = Sε(f(x)) in B. By assumption f−1(S) is then also open, hence there is a δ > 0 such that d(x, y) < δ implies that y ∈ f−1(S) and hence f(y) ∈ S and hence ρ(f(x), f(y)) < ε, completing the proof. 2

Since inverse images of open sets are open and hence Borel sets, and since open sets generate the Borel sets, f is measurable from Lemma 2.4.

Random Variables and Completion

How is measurability of a function affected by completing a probability space? Happily things behave as they should. Suppose that f : (Ω,B) → (A,BA) is a random variable, that P is a probability measure on (Ω,B), and that Pf = Pf−1 is the induced probability measure on (A,BA). Suppose that B is replaced by the larger σ-field BP, the completion of B with respect to P, and let P̄ denote the completed measure. Since B ⊂ BP, the function f is automatically BP/BA-measurable. If the output event space is similarly completed with respect to Pf to form BPA, then f is also BP/BPA-measurable. This follows since if F ∈ BPA, it must have the form F = F′ ∪ N for F′ ∈ BA and N a subset of a Pf-null set, say M ∈ BA, Pf(M) = 0. Then f−1(F) = f−1(F′ ∪ N) = f−1(F′) ∪ f−1(N) since inverse images preserve set theoretic operations. Preservation of set theoretic operations also means that f−1(N) ⊂ f−1(M). But P(f−1(M)) = Pf(M) = 0, and hence f−1(N) is a subset of a P-null set and hence f−1(F′) ∪ f−1(N) ∈ BP.

Exercises

1. Let f : Ω → A be a random variable. Prove the following properties of inverse images:

f−1(Gc) = (f−1(G))c

f−1(⋃_{k=1}^∞ Gk) = ⋃_{k=1}^∞ f−1(Gk)

If G ∩ F = ∅, then f−1(G) ∩ f−1(F) = ∅.

If G ⊂ F, then f−1(G) ⊂ f−1(F).

Thus inverse images preserve set theoretic operations and relations.
2. If F ∈ B, show that the indicator function 1F defined by 1F(x) = 1 if x ∈ F and 0 otherwise is a random variable. Describe its distribution. Is the product of indicator functions measurable?

3. Show that for two events F and G,

|1F(x)− 1G(x)| = 1F∆G(x).

2.2 Approximation of Random Variables and Distributions

Two measurements or random variables defined on a common probability space are considered to be the same if their distributions are the same. Suppose the distributions are not the same; then how does one quantify how different the two random variables are? Several approaches are possible, only one of which will extend to random processes in a useful way.

Example 2.1. Suppose that X and Y are two A-valued random variables defined on a common probability space (Ω,B,m), where A is discrete.


Define the distance between X and Y as the probability that they are different, that is, d(X,Y) = Pr(X ≠ Y) = m({ω : X(ω) ≠ Y(ω)}).

This is a pseudometric, but it can be turned into a metric in the usual way by identifying random variables that are equal with probability one. This is just the partition metric of Example 1.16 in disguise. Define the partitions P = {Pa; a ∈ A} and Q = {Qa; a ∈ A} by

Pa = {ω : X(ω) = a} = X−1(a), Qa = {ω : Y(ω) = a} = Y−1(a). (2.2)

The partition distance is given by

dpartition(P,Q) = ∑_{a∈A} m(Pa∆Qa) = ∑_{a∈A} m((Pa ∩ Qa^c) ∪ (Pa^c ∩ Qa))
= ∑_{a∈A} (m(Pa ∩ Qa^c) + m(Pa^c ∩ Qa))
= ∑_{a∈A} m(Pa ∩ Qa^c) + ∑_{a∈A} m(Pa^c ∩ Qa)
= m(⋃_{a∈A} (Pa ∩ Qa^c)) + m(⋃_{a∈A} (Pa^c ∩ Qa))

and therefore

d(X,Y) = m({ω : X(ω) ≠ Y(ω)})
= ∑_{a∈A, b∈A: a≠b} m({ω : X(ω) = a, Y(ω) = b})
= ∑_{a∈A} ∑_{b∈A: b≠a} m({ω : X(ω) = a, Y(ω) = b})
= ∑_{a∈A} m(Pa ∩ ⋃_{b∈A: b≠a} Qb) = ∑_{a∈A} m(Pa ∩ Qa^c)
= m(⋃_{a∈A} (Pa ∩ Qa^c))

and a similar line of equalities shows that d(X,Y) = m(⋃_{a∈A} (Pa^c ∩ Qa)), thus

d(X,Y) = ∑_{a∈A, b∈A: a≠b} m({ω : X(ω) = a, Y(ω) = b})
= (1/2) dpartition(P,Q) = (1/2) ∑_{a∈A} m(Pa∆Qa). (2.3)
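A quick numerical check of (2.3) with an invented joint PMF (illustrative only): Pr(X ≠ Y) equals half the partition distance between the partitions induced by X and Y.

```python
# Check (2.3): Pr(X != Y) equals half the partition distance between the
# partitions {X = a} and {Y = a}.  The joint PMF below is invented.
A = ["a", "b", "c"]
joint = {("a", "a"): 0.3, ("a", "b"): 0.1,
         ("b", "b"): 0.2, ("b", "c"): 0.1,
         ("c", "c"): 0.2, ("c", "a"): 0.1}     # m(X = x, Y = y)

d_XY = sum(pr for (x, y), pr in joint.items() if x != y)

d_partition = 0.0
for a in A:                                    # sum over atoms of m(P_a delta Q_a)
    m_Pa_not_Qa = sum(pr for (x, y), pr in joint.items() if x == a and y != a)
    m_Qa_not_Pa = sum(pr for (x, y), pr in joint.items() if y == a and x != a)
    d_partition += m_Pa_not_Qa + m_Qa_not_Pa

print(d_XY, d_partition / 2)                   # both approximately 0.3
```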

The previous example provides a metric on two random variables defined on a common probability space. We will also be interested in a distance between two random variables or their distributions when they are not defined on a common space.

Example 2.2. Suppose that X and Y are two random variables with a common discrete alphabet A with event space BA and distributions PX and PY, respectively. Suppose one has a choice of all possible joint distributions PX,Y having PX and PY as marginals, that is, PX,Y(F × A) = PX(F) and PX,Y(A × F) = PY(F) for all F ∈ BA. What choice minimizes the metric d(X,Y) of the previous example? This distance between two distributions of random variables is formalized as

d(PX, PY) = inf_{PX,Y ⇒ PX, PY} PX,Y(X ≠ Y). (2.4)

This type of distance has a long history and many names, and it will be treated in some depth in later chapters.

Example 2.3. Another distance on the space of probability measures on the discrete measurable space (A,BA) of Example 2.2 is the ℓ1 norm of the difference of the two distributions,

| PX − PY | = ∑_{a∈A} | pX(a) − pY(a) |, (2.5)

where lower case denotes the probability mass function for the discrete distribution, that is, pX(a) = PX({a}). This distance is called the distribution distance. As we shall later see in Lemma 9.5, it is closely related to the variation distance or total variation distance.

It will later be seen in Lemma 9.5 that d(PX, PY ) = |PX − PY |/2.
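The following Python sketch (invented numbers, purely illustrative) compares the two quantities on a small alphabet: since the diagonal of any coupling can carry at most min(pX(a), pY(a)) at each a, Pr(X ≠ Y) is at least 1 − ∑_a min(pX(a), pY(a)), and that lower bound equals |PX − PY|/2 here, consistent with the relation quoted above.

```python
# Toy comparison of (2.4) and (2.5): the smallest possible Pr(X != Y) over
# couplings of pX and pY equals half the l1 distance for these numbers.
pX = {"a": 0.5, "b": 0.3, "c": 0.2}
pY = {"a": 0.2, "b": 0.3, "c": 0.5}

half_l1 = sum(abs(pX[a] - pY[a]) for a in pX) / 2     # (2.5) divided by two
best_agreement = sum(min(pX[a], pY[a]) for a in pX)   # largest possible diagonal mass
lower_bound = 1 - best_agreement                      # smallest possible Pr(X != Y)

print(half_l1, lower_bound)                           # both 0.3 (up to rounding)
```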

2.3 Random Processes and Dynamical Systems

We now consider two mathematical models for a random process. The first is the familiar one in elementary and advanced courses: a random process is just an indexed family {Xt; t ∈ T} of random variables, which we will abbreviate to X when context makes it clear that we are considering random processes. The second model is less familiar in the engineering literature: a random process can be constructed from a single random variable together with an abstract dynamical system consisting of a probability space and a family of transformations on the space. The two models are connected by considering transformations as time shifts, but examples will show that other transformations can be useful. The dynamical system framework will be seen to provide a general description of random processes that provides additional insight into their structure and a natural structure for stating and proving ergodic theorems.


Random Processes

A random process is a family of random variables {Xt}t∈T or {Xt; t ∈ T} defined on a common probability space (Ω,B, P), where T is an index set. We assume that all of the random variables Xt share a common alphabet A. If T is a discrete set such as Z or Z+, the process is said to be a discrete time or discrete parameter random process. If T is a continuous set such as R or R+, the process is said to be a continuous time or continuous parameter random process. If the index set T = Z or R, the process is said to be two-sided (or double-sided or bilateral). If the index set T = Z+ or R+, the process is said to be one-sided (or single-sided or unilateral). One-sided random processes will usually prove to be more difficult in theory, but they provide better models for physical random processes that must be “turned on” at some time or that have transient behavior.

For a fixed ω ∈ Ω, the collection {Xt(ω); t ∈ T} is called a sample sequence if T is discrete and a sample function or sample waveform if T is continuous.

These are by far the most common and important examples of index sets, and usually the index corresponds to time as measured in some units. Most of the theory will remain valid, however, if we simply require that T be closed under addition, that is, if t and s are in T, then so is t + s (where the “+” denotes a suitably defined addition in the index set). Mathematically, we assume that T is a semi-group with respect to addition, that is, T is closed with respect to addition and addition is associative. In the two-sided case, T becomes a group since there is an identity with respect to addition (0) and an inverse. We henceforth make this assumption. In general T need not correspond to time or even be one-dimensional. For example, T = R2 or Z2 can be used to define random fields for use in modeling random images.

We abbreviate a random process to {Xt} or even Xt or X when the context makes clear the index set and that a process rather than a single random variable is being considered. In the discrete time case we will usually write {Xn}. In the continuous time case the notation X(t) is common, and in the discrete time case X[n] is often encountered in the engineering literature.

Since the alphabet A is general, a continuous time random process could also be modeled as a discrete time random process by letting A consist of a family of waveforms defined on an interval, e.g., each random variable Xn could in fact be a continuous time waveform X(t) for t ∈ [nT, (n + 1)T), where T is some fixed positive real number.

In the special case where T is a finite set, random vector is a better name than random process.

As a final technical point, a random process can be viewed as a function of two arguments, Xt(ω) or X : Ω × T → A. We will assume that the process is measurable as a function of two arguments with respect to suitably defined σ-fields (except possibly on a null set; as usual we ignore null sets). This assumption is unnecessary in the discrete time case, as simply having a family of random variables guarantees measurability. But it is needed in the continuous time case, primarily in order to make sense of time integrals such as ∫_0^T Xt dt. We will assume all continuous time random processes encountered herein are measurable, leaving proofs to the literature.

Discrete-Time Dynamical Systems: Transformations

A discrete-time abstract dynamical system consists of a probability space (Ω,B, P) together with a measurable transformation T : Ω → Ω of Ω into itself. Measurability means that if F ∈ B, then also T−1F = {ω : Tω ∈ F} ∈ B. The quadruple (Ω,B, P, T) is called a (discrete-time) dynamical system in ergodic theory.

The interested reader can find excellent introductions to classical ergodic theory and dynamical system theory in the books of Halmos [46] and Sinai [101]. More complete treatments may be found in [3] [97] [84] [17] [113] [78] [26] [62]. The name dynamical systems comes from the focus of the theory on the long term dynamics or dynamical behavior of repeated applications of the transformation T on the underlying measure space.

An alternative to modeling a random process as a sequence or family of random variables defined on a common probability space is to consider a single random variable together with a transformation defined on the underlying probability space. The outputs of the random process will then be values of the random variable taken on transformed points in the original space. The transformation will usually be related to shifting in time, and hence this viewpoint will focus on the action of time itself. Suppose now that T is a measurable mapping of points of the sample space Ω into itself. It is easy to see that the cascade or composition of measurable functions is also measurable. Hence the transformation Tn defined as T2ω = T(Tω) and so on (Tnω = T(Tn−1ω)) is a measurable function for all positive integers n. The sequence {Tnω; n ∈ T} is called the orbit of ω ∈ Ω.

If f is an A-valued random variable defined on (Ω,B), then the functions fTn : Ω → A defined by fTn(ω) = f(Tnω) for ω ∈ Ω will also be random variables for all n in Z+. Thus a dynamical system together with a random variable or measurable function f defines a single-sided discrete-time random process {Xn}n∈Z+ by Xn(ω) = f(Tnω). Note in particular that setting n = 0 (assuming of course that 0 ∈ T) yields

X0(ω) = f(ω)


and hence f can be thought of as the zero-time mapping or projection.

If it should be true that T is invertible, that is, T is one-to-one and its inverse T−1 is measurable, then one can define a double-sided random process by Xn = fTn, all n in Z.

The most common dynamical system for modeling random processes is that consisting of a sequence space Ω containing all one- or two-sided A-valued sequences together with the shift transformation T, that is, the transformation that maps a sequence {xn} into the sequence {xn+1} wherein each coordinate has been shifted to the left by one time unit. Thus, for example, let Ω = AZ+ = {all x = (x0, x1, . . .) with xi ∈ A for all i} and define T : Ω → Ω by

T(x0, x1, x2, . . .) = (x1, x2, x3, . . .).

The transformation T is called the shift or left shift transformation on the one-sided sequence space. The shift for two-sided spaces is defined similarly. We shall treat this sequence-space representation of a random process in more detail later in this chapter when we consider distributions of random processes.

Some interesting dynamical systems in communications applications do not, however, have the structure of a shift on sequences. As an example, suppose that a probability space (Ω,B, P) has sample space Ω = [0,1), the unit interval. Define a binary random variable f : Ω → {0,1} by

f(u) =
  1,  u ∈ [0, 1/2)
  0,  u ∈ [1/2, 1)

(this is an example of a quantizer) and a transformation T : Ω → Ω defined by

Tu = 2u mod 1 =
  2u,      u ∈ [0, 1/2)
  2u − 1,  u ∈ [1/2, 1).

The resulting binary alphabet random process is now defined by Xn = fTn.
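A short simulation of this dynamical system (illustrative only): starting from a point u drawn according to the uniform measure on [0,1), iterate Tu = 2u mod 1 and read off Xn = f(Tⁿu). The output sequence is determined by the binary expansion of u, and under the uniform measure the process behaves like a sequence of fair coin flips.

```python
# Simulate X_n = f(T^n u) for the doubling map T u = 2u mod 1 and the binary
# quantizer f (1 on [0, 1/2), 0 on [1/2, 1)).  Illustrative sketch only.
import random

def T(u):
    return (2 * u) % 1.0

def f(u):
    return 1 if u < 0.5 else 0

def sample_path(u0, n):
    """Return (X_0, ..., X_{n-1}) along the orbit of u0."""
    u, xs = u0, []
    for _ in range(n):
        xs.append(f(u))
        u = T(u)
    return xs

random.seed(1)
u0 = random.random()             # u0 chosen according to the uniform measure
print(sample_path(u0, 16))       # X_n is determined by the (n+1)st binary digit of u0
```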

As another example, consider a mathematical model of a device called a sigma-delta modulator, which is used for analog-to-digital conversion, that is, encoding a sequence of real numbers into a binary sequence (analog-to-digital conversion), which is then decoded into a reproduction sequence approximating the original sequence (digital-to-analog conversion) [53] [11] [33]. Given an input sequence {xn} and an initial state u0, the operation of the encoder is described by the difference equations

en = xn − q(un),
un = en−1 + un−1,


where q(u) is +b if its argument is nonnegative and −b otherwise (q is a binary quantizer). The decoder is described by the equation

x̂n = (1/N) ∑_{i=1}^N q(un−i).

The basic idea of the code’s operation is this: An incoming continuous time, continuous amplitude waveform is sampled at a rate that is so high that the incoming waveform stays fairly constant over N sample times (in engineering parlance the original waveform is oversampled or sampled at many times the Nyquist rate). The binary quantizer then produces outputs for which the average over N samples is very near the input, so that the decoder output x̂kN is a good approximation to the input at the corresponding times. Since x̂n has only a discrete number of possible values (N + 1 to be exact), one has thereby accomplished analog-to-digital conversion. Because the system involves only a binary quantizer used repeatedly, it is a popular one for microcircuit implementation.

As an approximation to a very slowly changing input sequence {xn}, it is of interest to analyze the response of the system to the special case of a constant input xn = x ∈ [−b, b) for all n (called a quiet input). This can be accomplished by recasting the system as a dynamical system as follows: Given a fixed input x, define the transformation T by

Tu =
  u + x − b,  if u ≥ 0
  u + x + b,  if u < 0.

Given a constant input xn = x, n = 1, 2, . . . , N, and an initial condition u0 (which may be fixed or random), the resulting un sequence is given by

un = Tnu0.

If the initial condition u0 is selected at random, then the preceding formula defines a dynamical system which can be analyzed.
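A small simulation of the quiet-input system (the values of b, x, and N and the uniform initial condition are arbitrary choices for the demonstration): iterating the map above and averaging the quantizer outputs shows the decoder output settling near the constant input x.

```python
# Quiet-input sigma-delta modulator: constant input x, state recursion
# u_n = u_{n-1} + x - q(u_{n-1}), decoder = running average of q outputs.
import random

b = 1.0
x = 0.37            # constant ("quiet") input in [-b, b)
N = 64              # decoder averaging window

def q(u):           # binary quantizer
    return b if u >= 0 else -b

def T(u):           # the quiet-input transformation from the text
    return u + x - b if u >= 0 else u + x + b

random.seed(0)
u = random.uniform(-b, b)          # random initial condition u0
outputs = []
for n in range(10 * N):
    outputs.append(q(u))
    u = T(u)

decoded = sum(outputs[-N:]) / N    # decoder output based on the last N quantizer bits
print(decoded)                      # close to x = 0.37
```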

The example is provided simply to emphasize the fact that time shifts are not the only interesting transformations when modeling communication systems.

The two viewpoints provide equivalent models for a given process: one emphasizing the sequence of outputs and the other emphasizing the action of a transformation on the underlying space in producing these outputs.


Continuous-Time Dynamical Systems: Flows

A continuous-time abstract dynamical system (Ω,B, P, {Tt; t ∈ T}) is described by a family of measurable transformations {Tt; t ∈ T}, Tt : Ω → Ω, instead of by a single transformation, where it is assumed that Tt+s = TtTs, that is, Tt+s(ω) = Tt(Ts(ω)). The collection {Tt; t ∈ T} is called a flow. Sometimes it is called a semi-flow or a (semi-)group action of T on Ω if T is one-sided. As in the discrete-time case, the waveform {Ttω; t ∈ T} is called the orbit of ω ∈ Ω.

A continuous-time dynamical system combined with a random variable f yields a continuous-time random process Xt(ω) = f(Ttω). The discrete-time dynamical system fits into this general definition since the entire family {Tn; n ∈ T} is determined by the single transformation T = T1 through Tn = T^n, the n-fold iteration of T. The name flow is usually reserved for the continuous-time case, however, and we use the simpler notation (Ω,B, P, T) instead of (Ω,B, P, {Tn; n ∈ T}) in the discrete time case. As with continuous-time random processes, we will always make an extra technical assumption when dealing with continuous-time dynamical systems: we assume that the mapping Tt(ω) mapping T × Ω into Ω is measurable.

Discrete-time dynamical systems can be constructed from continuous-time dynamical systems by sampling. Given a continuous-time dynamical system (Ω,B, P, {Tt; t ∈ R}) and a fixed time τ > 0, then (Ω,B, P, Tτ) is a discrete-time dynamical system which can be interpreted as a sampling of the continuous-time system every τ seconds. Given a random variable f, the continuous-time dynamical system yields a random process Xt = fTt, t ∈ R, while the discrete-time dynamical system yields a random process Xn = fTnτ, n ∈ Z.
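A toy illustration (the rotation flow and the parameters are invented for the example): the flow Tt(ω) = (ω + αt) mod 1 on [0,1) satisfies Tt+s = TtTs, and fixing τ gives the discrete-time system generated by the single map Tτ; combining it with a random variable f yields the sampled process Xn = f(Tnτ ω).

```python
# Sampling a continuous-time flow to obtain a discrete-time dynamical system.
# Flow: T_t(w) = (w + alpha * t) mod 1 on Omega = [0, 1); it satisfies
# T_{t+s} = T_t T_s.  Sampling at period tau gives the single map T_tau.
alpha = 0.3          # flow "speed" (arbitrary)
tau = 0.5            # sampling period (arbitrary)

def T(t, w):
    """The flow: transformation T_t applied to the point w."""
    return (w + alpha * t) % 1.0

def f(w):
    """A random variable on the space: here a 4-level quantizer."""
    return int(4 * w)

w0 = 0.1234
# group property check at one pair of times: T_{t+s}(w) == T_t(T_s(w))
assert abs(T(0.7 + 1.1, w0) - T(0.7, T(1.1, w0))) < 1e-12

# the sampled discrete-time process X_n = f(T_{n tau} w0) = f(T_tau^n w0)
X = [f(T(n * tau, w0)) for n in range(8)]
print(X)
```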

2.4 Distributions

Although in principle all probabilistic properties of a random process can be determined from the underlying probability space, it is often more convenient to deal with the induced probability measures or distributions on the space of possible outputs of the random process. In particular, this allows us to compare different random processes without regard to the underlying probability spaces and thereby permits us to reasonably equate two random processes if their outputs have the same probabilistic structure, even if the underlying probability spaces are quite different.

We have already seen that each random variable Xt of the random process {Xt; t ∈ T} inherits a distribution because it is measurable. To describe a process, however, we need more than simply probability measures on output values of separate single random variables: we require probability measures on families or collections of random variables such as vectors and sequences of random variables formed by sampling the random process. In order to place probability measures on collections of outputs of a random process, we first must construct the appropriate measurable spaces. A convenient technique for accomplishing this is to consider product spaces, spaces for sequences or waveforms formed by concatenating spaces for individual outputs, as introduced in Section 1.1.

Product Spaces

Let T be the index set of the random process {Xt; t ∈ T}. Denote a sample function or sequence of the random process by xT = {xt; t ∈ T}. If Xt has alphabet At for t ∈ T, then xT will be in the Cartesian product space

×_{t∈T} At = {all xT : xt ∈ At, all t ∈ T}.

In all cases of interest in this book, all of the At will be replicas of a single alphabet A and the preceding product will be denoted simply by AT. In a similar fashion we can consider a collection of sample values of the random variables Xt taken for t ∈ S ⊂ T. The most important example of such a subset S will be a finite collection of the form I = (t1, t2, . . . , tn) of n ordered sample times t1 < t2 < · · · < tn, ti ∈ T, i = 1, 2, . . . , n. In this case we have a vector

xI = (xt1 , xt2 , . . . , xtn) ∈ AI = An,

where the final form An follows since it is simply the Cartesian product of n identical copies of A. In the common special case where I = Z(n) = {0, 1, 2, . . . , n − 1} we abbreviate xZ(n) to xn = (x0, x1, . . . , xn−1).

For the moment, however, it is useful to deal with a more general S ⊂ T which need not be finite or even countable if T is continuous. We allow S = T in the definitions below.

To obtain a useful σ-field for a product space

AS = ×t∈SAt

we introduce the idea of a rectangle in a product space.


Rectangles in Product Spaces

Suppose that Bt are the coordinate σ-fields for the alphabets At, t ∈ T. A set B ⊂ AS is said to be a (finite-dimensional) rectangle in AS if it has the form

B = {xS ∈ AS : xt ∈ Bt; all t ∈ I}, (2.6)

where I is a finite subset of S and Bt ∈ Bt for all t ∈ I. The phrase measurable rectangle is also used.

A rectangle as in (2.6) can be written as a finite intersection of one-dimensional rectangles as

B = ⋂_{t∈I} {xS ∈ AS : xt ∈ Bt}. (2.7)

As rectangles in AS are clearly fundamental events, they should be members of any useful σ-field of subsets of AS. One approach is simply to define the product σ-field BAS as the smallest σ-field containing all of the rectangles, that is, the collection of sets that contains the clearly important class of rectangles and the minimum amount of other stuff required to make the collection a σ-field. This notion will often prove useful.

Product σ -Fields and Fields

Given a set S ⊂ T, let rect(Bt, t ∈ S) denote the set of all rectangles in AS taking coordinate values in sets in Bt, t ∈ S. In other words, every element of rect(Bt, t ∈ S) has the form of (2.6) for some finite set I ⊂ T. Define the product σ-field and the product field of AS by

BAS = σ(rect(Bt, t ∈ S)) (2.8)

FAS = F(rect(Bt, t ∈ S)) (2.9)

The appropriate σ-fields for product spaces have been generated from the class of finite-dimensional rectangles. Although generated σ-fields usually lack simple constructions, the field generated by the rectangles does have a simple constructive definition, which will be extremely useful when constructing probability measures for random processes. The construction is given and proved in the following lemma.

Lemma 2.2. Let Bt be the coordinate σ-fields for the alphabets At, t ∈ T. Given a set S ⊂ T, let rect(Bt, t ∈ S) denote the set of all rectangles in AS taking coordinate values in sets in Bt, t ∈ S. In other words, every element of rect(Bt, t ∈ S) has the form of (2.6) for some finite set I ⊂ T.


Then the field generated by the rectangles, FAS = F(rect(Bt, t ∈ S)), is exactly the collection of all finite disjoint unions of rectangles.

Before proving Lemma 2.2, we isolate the key technical step in a separate lemma.

Lemma 2.3. Given the setup of the previous lemma, any finite union of rectangles can be written as a finite union of disjoint rectangles.

Proof of Lemma 2.3. The proof uses the trick of (1.16), and the heart of the proof is the demonstration that taking differences of rectangles yields rectangles. Suppose that we are given a finite union of rectangles Fi ∈ rect(Bt, t ∈ S); i = 1, 2, . . . , N with

Fi = {xS ∈ AS : xt ∈ Fi,t; all t ∈ Ii}; Fi,t ∈ Bt, t ∈ Ii.

We wish to show that ⋃_{i=1}^N Fi can be expressed as a finite union of disjoint rectangles. It simplifies notation to combine the finite collection of index sets into a single index set I = ⋃_{i=1}^N Ii. This causes no loss of generality since if for a fixed i an index j ∈ I is not in Ii, then we set Fi,j = A and have the same rectangle with the expanded index set. Thus we can assume

Fi = {xS ∈ AS : xt ∈ Fi,t; all t ∈ I}; Fi,t ∈ Bt, t ∈ I.

Define as in (1.16) the difference sets

Gn = Fn − ⋃_{i=1}^{n−1} Fi = Fn − ⋃_{i=1}^{n−1} Gi; n = 2, . . . , N,

⋃_{i=1}^N Gi = ⋃_{i=1}^N Fi.

To prove the lemma requires only a demonstration that the Gn are themselves rectangles. Consider first n = 2.

G2 = F2 − F1
= {xS ∈ AS : xt ∈ F2,t; all t ∈ I} − {xS ∈ AS : xt ∈ F1,t; all t ∈ I}
= {xS ∈ AS : xt ∈ F2,t − F1,t; all t ∈ I},

which is a rectangle since F2,t − F1,t ∈ Bt, t ∈ I. In a similar manner, for n = 3, . . . , N,


Gn = Fn − ⋃_{i=1}^{n−1} Fi
= {xS ∈ AS : xt ∈ Fn,t; all t ∈ I} − ⋃_{i=1}^{n−1} {xS ∈ AS : xt ∈ Fi,t; all t ∈ I}
= {xS ∈ AS : xt ∈ Fn,t − ⋃_{i=1}^{n−1} Fi,t; all t ∈ I},

which is a rectangle since Fn,t − ⋃_{i=1}^{n−1} Fi,t ∈ Bt, t ∈ I. 2

Proof of Lemma 2.2. Let H be the collection of all finite disjoint unions of rectangles and note that H ⊂ FAS. The lemma is proved by showing that H is a field, since FAS is by definition the smallest field containing the rectangles. First consider complements. Let F ∈ H so that it has the form F = ⋃_{i=1}^N Fi for disjoint rectangles

Fi = {xS ∈ AS : xt ∈ Fi,t; all t ∈ Ii}; Fi,t ∈ Bt, t ∈ Ii.

As in the previous proof, we can assume a common index set Ii = I. The complement is

Fc = (⋃_{i=1}^N Fi)^c = ⋂_{i=1}^N Fi^c
= ⋂_{i=1}^N {xS ∈ AS : xt ∈ Fi,t^c; all t ∈ I}
= {xS ∈ AS : xt ∈ ⋂_{i=1}^N Fi,t^c; all t ∈ I},

which is a rectangle. Finite unions of finite unions of disjoint rectangles are finite unions of possibly nondisjoint rectangles, but from Lemma 2.3 any such union equals a finite union of disjoint rectangles. Hence H is closed under complementation and finite unions and hence is a field. 2

Distributions

Given an index set T and an A-valued random process {Xt}t∈T defined on an underlying probability space (Ω,B, P), at first glance it would appear that for any S ⊂ T the measurable space (AS, BAS) should inherit a probability measure from the underlying space through the random variables XS = {Xt; t ∈ S}. The only hitch is that so far we know only that individual random variables Xt for each individual t are measurable (and hence inherit a probability measure, called the marginal distribution). To make sense here we must first show that collections of random variables XS = {Xt; t ∈ S} such as Xn = (X0, . . . , Xn−1) are also measurable and hence themselves random variables in the general sense.

It is easy to show that inverse images of the mapping XS : Ω → AS yield events in B if we confine attention to rectangles. For a rectangle B = ⋂_{t∈I} {xS ∈ AS : xt ∈ Bt} we have that

(XS)−1(B) = ⋂_{t∈I} Xt−1(Bt) (2.10)

and hence the measurability of each individual Xt and the fact that countable intersections of events are events implies that

(XS)−1(B) ∈ B. (2.11)

We will have the desired measurability if we can show that if (2.11) is satisfied for all rectangles, then it is also satisfied for all events in the σ-field BAS = σ(rect(Bt, t ∈ S)) generated by the rectangles. This result is an application of an approach named the good sets principle by Ash [1], p. 5. We shall occasionally wish to prove that all events possess some particular desirable property that is easy to prove for generating events. The good sets principle consists of the following argument: Let S be the collection of good sets consisting of all events F ∈ σ(G) possessing the desired property. If

• G ⊂ S and hence all the generating events are good, and
• S is a σ-field,

then σ(G) ⊂ S and hence all of the events F ∈ σ(G) are good.

Lemma 2.4. Given measurable spaces (Ω1,B) and (Ω2, σ(G)), then a function f : Ω1 → Ω2 is B-measurable if and only if f−1(F) ∈ B for all F ∈ G; that is, measurability can be verified by showing that inverse images of generating events are events.

Proof. If f is B-measurable, then f−1(F) ∈ B for all F ∈ σ(G) and hence for all F ∈ G. Conversely, if f−1(F) ∈ B for all generating events F ∈ G, then define the class of sets

S = {G : G ∈ σ(G), f−1(G) ∈ B}.

It is straightforward to verify that S is a σ-field: clearly Ω2 ∈ S since its inverse image is Ω1. The fact that S contains countable unions of its elements follows from the fact that σ(G) is closed under countable unions and inverse images preserve set theoretic operations, that is,

f−1(∪iGi) = ∪if−1(Gi).


Furthermore, S contains every member of G by assumption. Since S contains G and is a σ-field, σ(G) ⊂ S by the good sets principle. 2

We have shown that given a random process {X_t; t ∈ T} defined on a probability space (Ω, B, P) and a set S ⊂ T, then the mapping X^S : Ω → A^S is B/σ(rect(B_t, t ∈ S))-measurable. In other words, the output measurable space (A^S, B_A^S) inherits a probability measure from the underlying probability space and thereby determines a new probability space (A^S, B_A^S, P_{X^S}), where the induced probability measure is defined by

P_{X^S}(F) = P((X^S)^{-1}(F)) = P({ω : X^S(ω) ∈ F}),  F ∈ B_A^S.    (2.12)

Such probability measures induced on the outputs of random variables are referred to as distributions for the random variables, exactly as in the simpler case first treated. When S = {m, m+1, …, m+n−1}, e.g., when we are treating X^n taking values in A^n, the distribution is referred to as an n-dimensional or nth order distribution and it describes the behavior of an n-dimensional random variable. If S = T, then the induced probability measure is said to be the distribution of the process or, simply, the process distribution. Thus, for example, a probability space (Ω, B, P) together with a doubly infinite sequence of random variables {X_n}_{n∈Z} induces a new probability space (A^Z, B_A^Z, P_{X^Z}) and P_{X^Z} is the distribution of the process. For simplicity, we usually denote a process distribution simply by m (or µ or some other simple symbol).
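To make (2.12) concrete, here is a small computational sketch (not from the text; the sample space, the random variables, and the function name induced_distribution are invented for the illustration). It pushes a probability measure P on a six-point space forward through X^S = (X_0, X_1) by summing P over inverse images, producing a second order distribution and a marginal:

from fractions import Fraction

# A toy underlying probability space (Omega, all subsets, P):
# six equally likely points.
Omega = range(6)
P = {omega: Fraction(1, 6) for omega in Omega}

# Two coordinate random variables defined on Omega.
X = {
    0: lambda omega: omega % 2,         # X_0: parity of omega
    1: lambda omega: (omega // 2) % 2,  # X_1: second binary digit of omega
}

def induced_distribution(indices):
    """Distribution P_{X^S}: push P forward through omega -> (X_t(omega), t in S)."""
    dist = {}
    for omega in Omega:
        value = tuple(X[t](omega) for t in indices)
        dist[value] = dist.get(value, Fraction(0)) + P[omega]
    return dist

# Second order (n = 2) distribution of (X_0, X_1) and the marginal of X_0.
print(induced_distribution((0, 1)))
print(induced_distribution((0,)))

The same brute-force push-forward gives the nth order distributions for any finite index set in this toy setting.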

2.5 Equivalent Random Processes

Since the sequence space (A^T, B_A^T, m) of a random process {X_t}_{t∈T} is a probability space, we can define random variables and hence also random processes on this space. One simple and useful such definition is that of a sampling or coordinate or projection function defined as follows: Given a product space A^T, define for any t ∈ T the sampling function Π_t : A^T → A by

Π_t(x^T) = x_t,  x^T ∈ A^T, t ∈ T.    (2.13)

The sampling function is named Π since it is also a projection. Observe that the distribution of the random process {Π_t}_{t∈T} defined on the probability space (A^T, B_A^T, m) is exactly the same as the distribution of the random process {X_t}_{t∈T} defined on the probability space (Ω, B, P). In fact, so far they are the same process since the Π_t simply read off the values of the X_t.

What happens, however, if we no longer build the Πt on the Xt , thatis, we no longer first select ω from Ω according to P , then form thesequence xT = XT(ω) = Xt(ω)t∈T, and then define Πt(xT) = Xt(ω)?


Instead we directly choose an x in A^T using the probability measure m and then view the sequence of coordinate values. In other words, we are considering two completely separate experiments, one described by the probability space (Ω, B, P) and the random variables X_t and the other described by the probability space (A^T, B_A^T, m) and the random variables Π_t. In these two separate experiments, the actual sequences selected may be completely different. Yet intuitively the processes should be the same in the sense that their statistical structures are identical, that is, they have the same process distribution. We make this intuition formal by defining two processes to be equivalent if their process distributions are identical, that is, if the probability measures on the output spaces are the same, regardless of the functional form of the random variables and of the underlying probability spaces. Simply put, we consider two random variables to be equivalent if their distributions are identical.

We have described two equivalent processes or two equivalent modelsfor the same random process, one defined as a sequence of perhaps verycomplicated random variables on an underlying probability space, theother defined as a probability measure directly on the measurable spaceof possible output sequences. The second model will be referred to as adirectly given random process or the Kolmogorov representation of theprocess.

Which model is better depends on the application. For example, a di-rectly given model for a random process may focus on the random pro-cess itself and not its origin and hence may be simpler to deal with. If therandom process is then coded or measurements are taken on the randomprocess, then it may be better to model the encoded random process interms of random variables defined on the original random process andnot as a directly given random process. This model will then focus onthe input process and the coding operation. We shall let conveniencedetermine the most appropriate model.

Recall that previously we introduced dynamical systems model for arandom process in the context of a directly-given random process. Thuswe have considered three methods of modeling a random process:

• An indexed family of random variables defined on a common proba-bility space.

• A directly-given or Kolmogorov representation as a distribution on asequence or function space.

• A dynamical system consisting of a probability space and a family oftransformations together with a single random variable.

In a sense, the dynamical system model is the most general since it can be used to model any random process defined by the other models, but not every dynamical system can be well modeled by a random process. Each of these models will have its uses, but the dynamical systems approach has the advantage of capturing the notion of evolution in time by a transformation or family of transformations.

The shift transformation introduced previously on a sequence space is the most important transformation that we shall encounter. It is not, however, the only important transformation. Hence when dealing with transformations we will usually use the notation T to reflect the fact that it is related to the action of a simple left shift of a sequence, yet we should keep in mind that occasionally other operators will be considered and the theory to be developed will remain valid; that is, T is not required to be a simple time shift. For example, we will also consider block shifts of vectors instead of samples and variable length shifts.

Most texts on ergodic theory focus on the case of an invertible transformation, that is, where T is a one-to-one transformation and the inverse mapping T^{-1} is measurable. This is the case for the shift on A^Z and A^R, the two-sided shifts. Here the noninvertible case is also considered in detail, but the simplifications and additional properties in the simpler invertible case are described.

Since random processes are considered equivalent if their distributions are the same, we shall adopt the notation [A, m, X] for a random process {X_t; t ∈ T} with alphabet A and process distribution m, the index set T usually being clear from context. We will occasionally abbreviate this to [A, m], but it is often convenient to note the name of the output random variables as there may be several; e.g., an input X and output Y. By the associated probability space or Kolmogorov representation of a random process [A, m, X] we shall mean the sequence probability space (A^T, B_A^T, m). It will often be convenient to consider the random process as a directly given random process, that is, to view X_t as the coordinate functions Π_t on the sequence or function space A^T rather than as being defined on some other abstract space. This will not always be the case, however, as processes might be formed by coding or communicating other random processes. Context should render such bookkeeping details clear.

Exercises

1. Given a random process Xn with alphabet A, show that the classF0 = rect(Bi; i ∈ T) of all rectangles is a field.

2. Let F(G) denote the field generated by a class of sets G, that is, F(G)contains the given class and is in turn contained by all other fieldscontaining G. Show that σ(G) = σ(F(G)).


2.6 Codes, Filters, and Factors

In communication systems and signal processing it is common to beginwith one random process and to perform an operation on the processto obtain another random process. Such operations are at the heart ofmany applications, including signal processing, communications, and in-formation theory. The multiple models for random processes provide anatural context for a few basic ideas along this line.

Begin with a directly-given model of a random process {X_n; n ∈ T} and hence a probability space (Ω, B, P) = (A^T, B^T, m_X), where m_X denotes the process distribution of {X_n; n ∈ T}. Let T_A denote the shift transformation on A^T. Let f : A^T → B be a random variable with alphabet B. Invoking the dynamical systems model for a random process, the original process plus the mapping yields a new B-valued random process {Y_n; n ∈ T} defined by

Y_n(x^T) = f(T_A^n x^T);  n ∈ T.    (2.14)

Intuitively, in the discrete-time case the sample sequence yT of the out-put random process is formed from the sample sequence of the inputxT by first mapping the input into a value y0 = f(xT), then shifting theinput sequence by one and again applying f to form y1 = f(TxT), thenshifting again to form y2 = f(T 2xT), and so on.

As a simple concrete example, suppose that T = Z, A = B = {0,1}, and that

f(x^T) = g(x_0, x_{−1}, x_{−2})

where g is given by the table:

x_0 x_{−1} x_{−2}   000  001  010  011  100  101  110  111
        g            0    0    0    0    0    0    0    1

The code can be depicted using a shift register and a mapping as


[Figure: the current input x_n feeds a three-stage shift register holding (x_n, x_{n−1}, x_{n−2}), whose contents are passed to the table-lookup function g to produce the output y_n = g(x_n, x_{n−1}, x_{n−2}).]

and a portion of the input and output sequences is shown below:

n     -1   0   1   2   3   4   5   6   7   8   9  10  11  12  13
x_n    1   1   0   0   0   1   1   0   1   1   1   1   0   0   1
Y_n            0   0   0   0   0   0   0   0   1   1   0   0   0

The code is called a sliding-block or sliding-window code because it “slides” a block or window across the input sequence, looks up the block in a table, and prints out a single symbol. This is in contrast to a block code which independently maps disjoint input blocks into disjoint output blocks.
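The example can be checked mechanically. The following sketch (illustrative only; sliding_block is an invented helper, not the book's notation) implements the table g above, reproduces the output row Y_1, …, Y_13, and verifies that coding a shifted input produces a shifted output:

# A sketch of the sliding-block code above (illustrative only): the window
# (x_n, x_{n-1}, x_{n-2}) is looked up in the table g, which here outputs 1
# only for the all-ones block 111.
g = {block: 0 for block in
     ((a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))}
g[(1, 1, 1)] = 1

def sliding_block(x, window=3):
    """y_n = g(x_n, x_{n-1}, ..., x_{n-window+1}) wherever the window fits."""
    return [g[tuple(x[n - k] for k in range(window))]
            for n in range(window - 1, len(x))]

# The input segment from the table, indexed n = -1, 0, 1, ..., 13.
x = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
print(sliding_block(x))   # [0,0,0,0,0,0,0,0,1,1,0,0,0], matching Y_1, ..., Y_13

# Shifting the input by one sample shifts the output by one sample.
print(sliding_block(x[1:]) == sliding_block(x)[1:])   # True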

In general such a mapping is called a time-invariant coding if the input alphabet A is discrete, and a time-invariant filtering if it is continuous. A synonym for time-invariant is stationary. The distinction between coding and filtering is historical, and we will generally refer to both operations as sliding-block or stationary or time-invariant coding.

The mapping f of a sample function into an output symbol induces a mapping into an output sample function, f̄ : A^T → B^T, defined by

f̄(x^T) = {f(T_A^n x^T); n ∈ T}.    (2.15)

This mapping has the property that shifting the input results in a corresponding shift of the output, that is,

f̄(T_A x^T) = T_B f̄(x^T),  x^T ∈ A^T,    (2.16)

which is why the coding or filtering is said to be shift-invariant or stationary.

There are a variety of extensions of this idea which allow differentshifts on the input and output spaces, e.g., in the discrete-time case Bmight itself be a product space or we might shift the input sequence Ntimes and produce N output symbols at each operation.


The abstraction of the idea of a stationary coding or filtering fromrandom processes to dynamical systems is called a factor in ergodictheory. Suppose that we have two dynamical systems (Ω,B, P , T) and(Λ,S, µ, S) and a measurable mapping f : Ω → Λ with the property thatf(Tω) = Sf(ω). In the random processes case this corresponds to astationary code or filter, i.e., shifting the input yields a shifted output.In the coding case T is the shift on AT and S is the shift on BT. In thisabstract case, the new dynamical system (Λ,S, µ, S) is said to be a factorof the original system (Ω,B, P , T).

2.7 Isomorphism

We have defined random variables or random processes to be equivalentif they have the same output probability spaces, that is, if they have thesame probability measure or distribution on their output value measur-able space. This is not the only possible notion of two random processesbeing the same. For example, suppose we have a random process witha binary output alphabet and hence an output space made up of binarysequences. We form a new directly given random process by taking eachsuccessive pair of the binary process and considering the outputs to bea sequence of quaternary symbols, that is, the original random processis coded into a new random process via the mapping

00 → 0

01 → 1

10 → 2

11 → 3

applied to successive pairs of binary symbols.

The two random processes are not equivalent since their output sequence measurable spaces are different; yet clearly they are the same since each can be obtained by a simple relabeling or coding of the other. This leads to a more fundamental definition of sameness: isomorphism. The definition of isomorphism is for dynamical systems rather than for random processes since, as previously noted, this is the more fundamental notion and hence the definition applies to random processes.
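For finite segments of sequences the relabeling and its inverse are easy to write down explicitly. The sketch below (illustrative; the dictionaries and function names are invented) codes successive binary pairs into quaternary symbols and decodes back, which is the invertibility that the notion of isomorphism is meant to capture:

# Illustrative sketch (not from the text) of the relabeling that makes the two
# processes "the same": successive pairs of binary symbols are coded into one
# quaternary symbol, and the coding is invertible.
PAIR_TO_QUAT = {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 3}
QUAT_TO_PAIR = {q: pair for pair, q in PAIR_TO_QUAT.items()}

def binary_to_quaternary(bits):
    """Code successive pairs of binary symbols into quaternary symbols."""
    assert len(bits) % 2 == 0
    return [PAIR_TO_QUAT[(bits[2 * k], bits[2 * k + 1])]
            for k in range(len(bits) // 2)]

def quaternary_to_binary(quats):
    """Invert the coding by expanding each quaternary symbol into a pair."""
    return [b for q in quats for b in QUAT_TO_PAIR[q]]

x = [0, 1, 1, 1, 0, 0, 1, 0]
y = binary_to_quaternary(x)               # [1, 3, 0, 2]
print(y, quaternary_to_binary(y) == x)    # the relabeling is invertible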

There are, in fact, several notions of isomorphism: isomorphic mea-surable spaces, isomorphic probability spaces, and isomorphic dynami-cal systems. We present these definitions together as they are intimatelyconnected.


Isomorphic Measurable Spaces

Two measurable spaces (Ω,B) and (Λ,S) are isomorphic if there existsa measurable function f : Ω → Λ that is one-to-one and has a measur-able inverse f−1. In other words, the inverse image f−1(λ) of a pointλ ∈ Λ consists of exactly one point in Ω and the inverse mapping sodefined, say g : Λ → Ω, g(λ) = f−1(λ), is itself a measurable mapping.The function f (or its inverse g) with these properties is called an iso-morphism. An isomorphism between two measurable spaces is thus aninvertible mapping between the two sample spaces that is measurable inboth directions.

Isomorphic Probability Spaces

Two probability spaces (Ω, B, P) and (Λ, S, Q) are isomorphic if there is an isomorphism f : Ω → Λ between the two measurable spaces (Ω, B) and (Λ, S) with the added property that

Q = Pf^{-1} and P = Qf,

that is,

Q(F) = P(f^{-1}(F)),  F ∈ S
P(G) = Q(f(G)),  G ∈ B.

Two probability spaces are isomorphic

1. if one can find for each space a random variable defined on that spacethat has the other as its output space, and

2. the random variables can be chosen to be inverses of each other; thatis, if the two random variables are f and g, then f(g(λ)) = λ andg(f(ω)) =ω.

Note that if the two probability spaces (Ω,B, P) and (Λ,S,Q) are iso-morphic and f : Ω → Λ is an isomorphism with inverse g, then therandom variable fg defined by fg(λ) = f(g(λ)) is equivalent to theidentity random variable i : Λ→ Λ defined by i(λ) = λ.

Isomorphism Mod 0

A weaker notion of isomorphism between two probability spaces is that of isomorphism mod 0. Two probability spaces are isomorphic mod 0 or metrically isomorphic if the mappings have the desired properties on sets of probability 1, that is, the mappings can be defined except for null sets in the respective spaces. That is, two probability spaces (Ω, B, P) and (Λ, S, Q) are isomorphic (mod 0) if there are null sets Ω_0 ∈ B and Λ_0 ∈ S and a measurable one-to-one onto map f : Ω − Ω_0 → Λ − Λ_0 with measurable inverse such that Q(F) = P(f^{-1}(F)) for all F ∈ S. This weaker notion is the standard one in ergodic theory.

Isomorphic Dynamical Systems

With the ideas of isomorphic measurable spaces and isomorphic proba-bility spaces in hand, we now can define isomorphic dynamical systems.Roughly speaking, two dynamical systems are isomorphic if one can becoded or filtered onto the other in an invertible way so that the codingcarries one transformation into the other, that is, one can code from onesystem into the other and back again and coding and transformationscommute.

Two dynamical systems (Ω, B, P, S) and (Λ, S, m, T) are isomorphic if there exists an isomorphism f : Ω → Λ such that

Tf(ω) = f(Sω),  ω ∈ Ω.

As with the isomorphism of probability spaces, isomorphic mod 0 means that the properties need hold only on sets of probability 1, but we also require the null sets in the respective spaces on which the isomorphism is not defined to be invariant with respect to the appropriate transformation.

If the probability space (Λ,S,m) is the sequence space of a directlygiven random process, T is the shift on this space, and Π0 the samplingfunction on this space, then the random process Π0Tn = Πn definedon (Λ,S,m) is equivalent to the random process Π0(fSn) defined onthe probability space (Ω,B, P). More generally, any random process ofthe form gTn defined on (Λ,S,m) is equivalent to the random processg(fSn) defined on the probability space (Ω,B, P). A similar conclusionholds in the opposite direction. Thus, any random process that can bedefined on one dynamical system as a function of transformed pointspossesses an equivalent model in terms of the other dynamical systemand its transformation. In addition, not only can one code from one sys-tem into the other, one can recover the original sample point by invertingthe code (at least with probability 1).

The binary example introduced at the beginning of this section is easily seen to meet this definition of sameness. Let B(Z(n)) denote the σ-field of subsets of Z(n) comprising all possible subsets of Z(n) (the power set of Z(n)). The described mapping of binary pairs or members of Z(2)^2 into Z(4) induces a mapping f : Z(2)^{Z_+} → Z(4)^{Z_+} mapping binary sequences into quaternary sequences. This mapping is easily seen to be invertible (by construction) and measurable (use the good sets principle and focus on rectangles). Let T be the shift on the binary sequence space and S be the shift on the quaternary sequence space; then the dynamical systems (Z(2)^{Z_+}, B(Z(2))^{Z_+}, m, T^2) and (Z(4)^{Z_+}, B(Z(4))^{Z_+}, mf^{-1}, T) are isomorphic; that is, the quaternary model with an ordinary shift is isomorphic to the binary model with the two-shift that shifts symbols a pair at a time.

Isomorphism will often provide a variety of equivalent models for ran-dom processes. Unlike the previous notion of equivalence, however, iso-morphism will often yield dynamical systems that are decidedly differ-ent, but that are isomorphic and hence produce equivalent random pro-cesses by invertible coding for discrete-alphabet processes or filteringfor continuous-alphabet processes.


Chapter 3

Standard Alphabets

Abstract Standard spaces provide a model for random process outputsthat is arguably as general as needed in almost all applications and hassufficient structure to yield several fundamental results for random pro-cesses that do not necessarily hold for more abstract models. The struc-ture eases the construction of probability measures on product spacesand from sample sequences which is essential to the Kolmogorov exten-sion theorem, the existence of regular conditional probability measures,and the ergodic decomposition, all results encountered in this book. Inthis chapter the basic definitions and properties are developed for stan-dard measurable spaces and random processes with standard alphabetsand simple examples are given. The properties are used to prove theKolmogorov extension theorem for processes with standard alphabets.As an example of the Kolmogorov theorem, simple Bernoulli processes(discrete IID processes) are constructed and used to generate a fairlygeneral class of discrete random processes known as B-processes. In thenext chapter more complicated and general examples of standard spacesare treated.

3.1 Extension of Probability Measures

It is often desirable to construct probability measures and sources byspecifying the values of the probability of certain simple events and thenby finding a probability measure on the σ -field containing these events.It was shown in Section 1.8 that if one has a set function meeting therequired conditions of a probability measure on a field, then there is aunique extension of that set function to a consistent probability measureon the σ -field generated by the field. Unfortunately, the extension the-orem is often difficult to apply since countable additivity or continuityof a set function can be difficult to prove, even on a field. Nonnegativity,


normalization, and finite additivity are, however, usually quite simple totest. Because several of the results to be developed will require such ex-tensions of finitely additive measures to countably additive measures,we shall develop in this chapter a class of alphabets for which such ex-tension will always be possible.

To apply the Carathéodory extension theorem, one must first pindown a candidate probability measure on a generating field. In most suchconstructions it is at best feasible to force a set function to be nice on acountable collection of sets, so we focus on σ -fields that are countablygenerated, that is, σ -fields for which there is a countable collection ofsets G that generates. Say that we have a σ -field B = σ(G) for somecountable class G. Let F(G) denote the field generated by G, that is, thesmallest field containing G. Unlike σ -fields, there is a simple construc-tive definition of the field generated by a class: F(G) consists exactly ofall elements of G together with all sets obtainable from finite sequencesof set theoretic operations on G. Thus if G is countable, so is F(G). It iseasy to show using the good sets principle that if B = σ(G), then alsoB = σ(F(G)) and hence if B is countably generated, then it is generatedby a countable field.

Our goal is to find countable generating fields which have the propertythat every nonnegative, normalized, finitely additive set function on thefield is also countably additive on the field (and hence will have a uniqueextension to a probability measure on the σ -field generated by the field).We formalize this property in a definition:

A field F is said to have the countable extension property if it hasa countable number of elements and if every set function P satisfying(1.19), (1.20), and (1.22) on F also satisfies (1.23) on F . A measurablespace (Ω,B) is said to have the countable extension property if B = σ(F)for some field F with the countable extension property.

Thus a measurable space has the countable extension property if thereis a countable generating field such that all normalized, nonnegative,finitely additive set functions on the field extend to a probability mea-sure on the full σ -field. This chapter is devoted to characterizing thosefields and measurable spaces possessing the countable extension prop-erty and to prove one of the most important results for such spaces–the Kolmogorov extension theorem. The Kolmogorov result provides oneof the most useful means for describing specific basic random processmodels, which can then in turn be used to construct a rich variety ofother processes by means of coding or filtering. We also develop somesimple properties of standard spaces in preparation for the next chap-ter, where we develop the most important and most general known classof such spaces.


3.2 Standard Spaces

As a first step towards characterizing fields with the countable extensionproperty, we consider the special case of fields having only a finite num-ber of elements. Such finite fields trivially possess the countable exten-sion property. We shall then proceed to construct a countable generatingfield from a sequence of finite fields and we will determine conditionsunder which the limiting field will inherit the extendibility properties ofthe finite fields.

Let F = {F_i, i = 0,1,2,…,n−1} be a finite field of a sample space Ω, that is, F is a finite collection of sets in Ω that is closed under finite set theoretic operations. Note that F is trivially also a σ-field. F itself can be represented as the field generated by a more fundamental class of sets. A set F in F will be called an atom if its only subsets that are also field members are itself and the empty set, that is, it cannot be broken up into smaller pieces that are also in the field. Let A denote the collection of atoms of F. Clearly there are fewer than n atoms. It is easy to show that A consists exactly of all nonempty sets of the form

⋂_{i=0}^{n−1} F_i^*

where F∗i is either Fi or Fci . Such sets are called intersection sets andany two intersection sets must be disjoint since for at least one i oneintersection set must lie inside Fi and the other within Fci . Thus all inter-section sets must be disjoint. Any field element can be written as a finiteunion of intersection sets — just take the union of all intersection setscontained in the given field element. Let G be an atom of F . Since it is anelement of F , it is the union of disjoint intersection sets. There can beonly one nonempty intersection set in the union, however, or G wouldnot be an atom. Hence every atom is an intersection set. Conversely, ifG is an intersection set, then it must also be an atom since otherwiseit would contain more than one atom and hence contain more than oneintersection set, contradicting the disjointness of the intersection sets.

In summary, given any finite field F of a space Ω we can find a uniquecollection of atoms A of the field such that the sets in A are disjoint,nonempty, and have the entire space Ω as their union (since Ω is a fieldelement and hence can be written as a union of atoms). Thus A is apartition of Ω. Furthermore, since every field element can be written as aunion of atoms, F is generated by A in the sense that it is the smallestfield containingA. Hence we write F = F(A). Observe that if we assignnonnegative numbers pi to the atoms Gi in A such that their sum is 1,then this immediately gives a finitely additive, nonnegative, normalizedset function on F by the formula


P(F) = ∑_{i: G_i ⊂ F} p_i.

Furthermore, this set function is trivially countably additive since thereare only a finite number of elements in F .
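The atom construction and the resulting set function can be illustrated computationally. In the sketch below (not from the text; the sample space, the generating sets, and the weights are invented), the atoms are computed as the nonempty intersection sets and P is evaluated on a field element as the sum of the weights of the atoms it contains:

from itertools import product

# Illustrative sketch: atoms of the finite field generated by two subsets of a
# finite sample space, obtained as the nonempty intersection sets.
Omega = frozenset(range(8))
F1, F2 = frozenset({0, 1, 2, 3}), frozenset({2, 3, 4, 5})
generators = [F1, F2]

def atoms(generators, Omega):
    """Nonempty sets of the form the intersection over i of F_i or its complement."""
    result = []
    for choice in product([True, False], repeat=len(generators)):
        cell = Omega
        for keep, F in zip(choice, generators):
            cell = cell & (F if keep else Omega - F)
        if cell:
            result.append(cell)
    return result

A = atoms(generators, Omega)
print(A)   # a partition of Omega: {2,3}, {0,1}, {4,5}, {6,7}

# Assign nonnegative weights p_i summing to 1 and extend additively to any
# field element, here illustrated on the generator F1.
p = {atom: 1.0 / len(A) for atom in A}
P_F1 = sum(p[atom] for atom in A if atom <= F1)
print(P_F1)   # 0.5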

The next step is to consider an increasing sequence of finite fieldsthat in the limit will give a countable generating field. A sequence offinite fields Fn, n = 1,2, . . . is said to be increasing if the elements ofeach field are also elements of all of the fields with higher indices, thatis, if Fn ⊂ Fn+1, all n. This implies that if An are the correspondingcollections of atoms, then the atoms inAn+1 are formed by splitting upthe atoms in An. Given an increasing sequence of fields Fn, define thelimit F as the union of all of the elements in all of the Fn, that is,

F = ⋃_{i=1}^∞ F_i.

F is easily seen to be itself a field. For example, any F ∈ F must be in Fnfor some n, hence the complement Fc must also be in Fn and hence alsoinF . Similarly, any finite collection of sets Fn, n = 1,2, . . . ,mmust all becontained in some Fk for sufficiently large k. The latter field must hencecontain the union and hence so must F . Thus the increasing sequenceFn can be thought of as increasing to a limit field F , an idea capturedby the notation

F_n ↑ F

if F_n ⊂ F_{n+1}, all n, and F is the union of all of the elements of the F_n. Note that F has by construction a countable number of elements. When F_n ↑ F, we shall say that the sequence F_n asymptotically generates F.

Lemma 3.1. Given any countable field F of subsets of a space Ω, then there is a sequence of finite fields F_n; n = 1,2,… that asymptotically generates F. In addition, the sequence can be constructed so that the corresponding sets of atoms A_n of the field F_n can be indexed by a subset of the set of all binary n-tuples, that is, A_n = {G_{u^n}, u^n ∈ B}, where B is a subset of {0,1}^n, with the property that G_{u^n} ⊂ G_{u^{n−1}}. Thus if u^n is a prefix of u^m for m > n, then G_{u^m} is contained in G_{u^n}. (We shall refer to such a sequence of atoms of finite fields as a binary indexed sequence.)

Proof. Let F = Fi, i = 0,1, . . .. Consider the sequence of finite fieldsdefined by Fn = F(Fi, i = 0,1, . . . , n− 1), that is, the field generated bythe first n elements inF . The sequence is increasing since the generatingclasses are increasing. Any given element in F is eventually in one of theFn and hence the union of the Fn contains all of the elements in F andis itself a field, hence it must contain F . Any element in the union of theFn, however, must be in an Fn for some n and hence must be an elementof F . Thus the two fields are the same. A similar argument to that used


above to construct the atoms of an arbitrary finite field will demonstrate that the atoms in F(F_0, …, F_{n−1}) are simply all nonempty sets of the form ⋂_{k=0}^{n−1} F_k^*, where F_k^* is either F_k or F_k^c. For each such intersection set let u^n denote the binary n-tuple having a one in each coordinate i for which F_i^* = F_i and zeros in the remaining coordinates and define G_{u^n} as the corresponding intersection set. Then each G_{u^n} is either an atom or empty and all atoms are obtained as u varies through the set {0,1}^n of all binary n-tuples. By construction,

ω ∈ G_{u^n} if and only if u_i = 1_{F_i}(ω), i = 0,1,…,n−1,    (3.1)

where for any set F, 1_F(ω) is the indicator function of the set F, that is,

1_F(ω) = 1 if ω ∈ F, and 1_F(ω) = 0 if ω ∉ F.

Eq. (3.1) implies that for any n and any binary n-tuple u^n that G_{u^n} ⊂ G_{u^{n−1}}, completing the proof. 2

Before proceeding to further study sequences of finite fields, we notethat the construction of the binary indexed sequence of atoms for acountable field presented above provides an easy means of determiningwhich atoms contain a given sample point. For later use we formalizethis notion in a definition.

Given an enumeration {F_n; n = 0,1,…} of a countable field F of subsets of a sample space Ω and the single-sided binary sequence space

M_+ = {0,1}^{Z_+} = ×_{i=0}^∞ {0,1},

define the canonical binary sequence function f : Ω → M_+ by

f(ω) = {1_{F_i}(ω); i = 0,1,…}.    (3.2)

Given an enumeration of a countable field and the corresponding binary indexed set of atoms A_n = {G_{u^n}} as above, then for any point ω ∈ Ω we have that

ω ∈ G_{f(ω)^n},  n = 1,2,…,    (3.3)

where f(ω)^n denotes the binary n-tuple comprising the first n symbols in f(ω).

Thus the sequence of decreasing atoms containing a given pointω canbe found as a prefix of the canonical binary sequence function. Observe,however, that f is only an into mapping, that is, some binary sequencesmay not correspond to points in Ω. In addition, the function may bemany to one, that is, different points in Ω may yield the same binarysequences.
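A small sketch (illustrative; the finite sample space and the enumeration are invented) shows the indexing at work: the first n bits of f(ω) pick out the atom of F(F_0, …, F_{n−1}) containing ω, and these atoms decrease as n grows, as in (3.3):

# Illustrative sketch: prefixes of the canonical binary sequence function index
# the decreasing atoms containing a point.
Omega = set(range(8))
enumeration = [
    {0, 1, 2, 3},        # F_0
    {2, 3, 4, 5},        # F_1
    {0, 2, 4, 6},        # F_2
]

def f_prefix(omega, n):
    """First n coordinates of the canonical binary sequence function (3.2)."""
    return tuple(1 if omega in F else 0 for F in enumeration[:n])

def atom(u):
    """The intersection set indexed by the binary n-tuple u, as in (3.1)."""
    cell = set(Omega)
    for bit, F in zip(u, enumeration):
        cell &= F if bit else Omega - F
    return cell

omega = 3
for n in range(1, 4):
    u = f_prefix(omega, n)
    print(n, u, sorted(atom(u)), omega in atom(u))
# The atoms indexed by successive prefixes of f(omega) shrink and all contain omega.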


Unfortunately, the sequence of finite fields converging upward to acountable field developed above does not have sufficient structure toguarantee that probability measures on the finite fields will imply a prob-ability measure on the limit field. The missing item that will prove to bethe key is specified in the following definitions:

A sequence of finite fields F_n, n = 0,1,…, is said to be a basis for a field F if it has the following properties:

1. It asymptotically generates F, that is,

F_n ↑ F.    (3.4)

2. If G_n is a sequence of atoms of F_n such that G_n ∈ A_n and G_{n+1} ⊂ G_n, n = 0,1,2,…, then

⋂_{n=1}^∞ G_n ≠ ∅;    (3.5)

that is, a basis is an asymptotically generating sequence of finite fields with the property that a decreasing sequence of atoms cannot converge to the empty set. The property (3.5) of a sequence of fields is called the finite intersection property. The name comes from the similar use in topology and reflects the fact that if all finite intersections of the sequence of sets G_n are not empty (which is true since the G_n are a decreasing sequence of nonempty sets), then the intersection of all of the G_n cannot be empty.

A sequence Fn, n = 1, . . . is said to be a basis for a measurable space(Ω,B) if the Fn form a basis for a field that generates B.

If the sequence F_n ↑ F and F generates B, that is, if

B = σ(F) = σ(⋃_{n=1}^∞ F_n),

then the F_n are said to asymptotically generate the σ-field B.

A field F is said to be standard if it possesses a basis. A measurable space (Ω, B) is said to be standard if B can be generated by a standard field, that is, if B possesses a basis.

The requirement that a σ -field be generated by the limit of a sequenceof simple finite fields is a reasonably intuitive one if we hope for the σ -field to inherit the extendibility properties of the finite fields. The secondcondition — that a decreasing sequence of atoms has a nonempty limit— is less intuitive, however, and will prove harder to demonstrate. Al-though nonintuitive at this point, we shall see that the existence of a ba-sis is a sufficient and necessary condition for extending arbitrary finitelyadditive measures on countable fields to countably additive measures.

The proof that the standard property is sufficient to ensure that any finitely additive set function is also countably additive requires additional machinery that will be developed later. The proof of necessity, however, can be presented now and will perhaps provide some insight into the standard property by showing what can go wrong in its absence.

Lemma 3.2. Let F be a field of subsets of a space Ω. A necessary conditionfor F to have the countable extension property is that it be standard, thatis, that it possess a basis.

Proof. We assume that F does not possess a basis and we construct afinitely additive set function that is not continuous at ∅ and hence notcountably additive. To have the countable extension property,F must becountable. From Lemma 3.1 we can construct a sequence of finite fieldsFn such that Fn ↑ F . Since F does not possess a basis, we know that forany such sequence Fn there must exist a decreasing sequence of atomsGn of Fn such that Gn ↓ ∅. Define set functions Pn on Fn as follows: IfGn ⊂ F , then Pn(F) = 1, if F ∩Gn = ∅, then Pn(F) = 0. Since F ∈ Fn, Feither wholly contains the atom Gn or F and Gn are disjoint, hence thePn are well defined. Next define the set function P on the limit field Fin the natural way: If F ∈ F , then F ∈ Fn for some smallest value ofn (e.g., if the Fn are constructed as before as the field generated by thefirst n elements of F , then eventually every element of the countablefield F must appear in one of the Fn). Thus we can set P(F) = Pn(F). Byconstruction, ifm ≥ n then also Pm(F) = Pn(F) and hence P(F) = Pn(F)for any n such that F ∈ Fn. P is obviously nonnegative, P(Ω) = 1 (sinceall of the atoms Gn in the given sequence are in the sample space), andP is finitely additive. To see the latter fact, let Fi, i = 1, . . . ,m be a finitecollection of disjoint sets in F . By construction, all must lie in some fieldFn for sufficiently large n. If Gn lies in one of the sets Fi (it can lie in atmost one since the sets are disjoint), then (1.22) holds with both sidesequal to one. If none of the sets contains Gn, then (1.22) holds with bothsides equal to zero. Thus P satisfies (1.19), (1.20), and (1.22). To proveP to be countably additive, then, we must verify (1.23). By constructionP(Gn) = Pn(Gn) = 1 for all n and hence

lim_{n→∞} P(G_n) = 1 ≠ 0,

and therefore (1.23) is violated since by assumption the G_n decrease to the empty set ∅ which has zero probability. Thus P is not continuous at ∅ and hence not countably additive.

If a field is not standard, then finitely additive probability measuresthat are not countably additive can always be constructed by puttingprobability on a sequence of atoms that collapses down to nothing. Thusthere can always be probability on ever smaller sets, but the limit cannotsupport the probability since it is empty. Thus the necessity of the stan-dard property for the extension of arbitrary additive probability mea-sures justifies its candidacy for a general, useful alphabet.
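The failure mode in the proof can be made concrete for a countable alphabet identified with the nonnegative integers (anticipating the example of Section 3.4). The sketch below is illustrative only: the encoding of field elements and the helper names are invented. It exhibits a set function that is finitely additive on each finite field yet assigns probability one to a sequence of atoms that shrinks to the empty set:

from itertools import combinations

# An element of the finite field F_n is represented as (S, g): a set S of points
# below n plus, if g is True, the "garbage" atom G_n = {n, n+1, ...}.

def P(element):
    """The troublesome set function: all probability rides on the garbage atom."""
    S, g = element
    return 1 if g else 0

def disjoint(a, b):
    return not (a[0] & b[0]) and not (a[1] and b[1])

def union(a, b):
    return (a[0] | b[0], a[1] or b[1])

# Finite additivity holds: check it exhaustively for disjoint pairs in F_3.
n = 3
elements = [(frozenset(S), g) for g in (False, True)
            for r in range(n + 1) for S in combinations(range(n), r)]
ok = all(P(union(a, b)) == P(a) + P(b)
         for a, b in combinations(elements, 2) if disjoint(a, b))
print(ok)   # True

# But continuity at the empty set fails: the garbage atoms G_n decrease to the
# empty set (any fixed k leaves G_n once n > k), while P(G_n) = 1 for every n.
G = lambda n: (frozenset(), True)      # the garbage atom of F_n in this encoding
print([P(G(n)) for n in range(6)])     # [1, 1, 1, 1, 1, 1]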


Corollary 3.1. A necessary condition for a countably generated measur-able space (Ω,B) to have the countable extension property is that it be astandard space.

Proof. To have the countable extension property, a measurable spacemust have a countable generating field. If the measurable space is notstandard, then no such field can possess a basis and hence no such fieldwill possess the countable extension property. In particular, one can al-ways find as in the proof of the lemma a generating field and an additiveset function on that field which is not countably additive and hence doesnot extend. 2

Exercise

1. A class of subsets V of A is said to be separating if given any two points x, y ∈ A, there is a V ∈ V that contains only one of the two points and not the other. Suppose that a σ-field B has a countable generating class V = {V_i; i = 1,2,…} that is also separating. Describe the intersection sets

⋂_{n=1}^∞ V_n^*,

where V_n^* is either V_n or V_n^c.

3.3 Some Properties of Standard Spaces

The following results provide some useful properties of standard spaces.In particular, they show how certain combinations of or mappings onstandard spaces yield other standard spaces. These results will proveuseful for demonstrating that certain spaces are indeed standard. Thefirst result shows that if we form a product space from a countable num-ber of standard spaces as in Section 2.4, then the product space is alsostandard. Thus if the alphabet of a source or random process for onesample time is standard, then the space of all sequences produced bythe source is also standard.

Lemma 3.3. Let F_i, i ∈ I, be a family of standard fields for some countable index set I. Let F be the product field generated by all rectangles of the form F = {x^I : x_i ∈ F_i, i ∈ M}, where F_i ∈ F_i all i and M is any finite subset of I. That is,

F = F(rect(F_i, i ∈ I)),

then F is also standard.


Proof. Since I is countable, we may assume that I = 1,2, . . .. For eachi ∈ I, Fi is standard and hence possesses a basis, say Fi(n),n =1,2, . . .. Consider the sequence

Gn = F(rect(Fi(n), i = 1,2, . . . , n)), (3.6)

that is, Gn is the field of subsets formed by taking all rectangles formedfrom the nth order basis fields Fi(n), i = 1, . . . , n in the first n coordi-nates. The lemma will be proved by showing that Gn forms a basis forthe field F and hence the field is standard. The fields Gn are clearly finiteand increasing since the coordinate fields are. The field generated by theunion of all the fields Gn will contain all of the rectangles in F since foreach i the union in that coordinate contains the full coordinate field Fi.Thus Gn ↑ F . Say we have a sequence Gn of atoms of Gn (Gn ∈ Gn forall n) decreasing to the null set. Each such atom must have the form

G_n = {x^I : x_i ∈ G_n(i); i = 1,2,…,n},

where G_n(i) is an atom of the coordinate field F_i(n). For G_n ↓ ∅, however, this requires that G_n(i) ↓ ∅ at least for one i, violating the definition of a basis for the ith coordinate. Thus G_n must be a basis. 2

Corollary 3.2. Let (Ai,Bi), i ∈ I, be a family of standard spaces for somecountable index set I. Let (A,B) be the product measurable space, that is,

(A, B) = ×_{i∈I}(A_i, B_i) = (×_{i∈I} A_i, ×_{i∈I} B_i),

where

×_{i∈I} A_i = {all x^I = (x_i, i ∈ I) : x_i ∈ A_i; i ∈ I}

is the Cartesian product of the alphabets A_i, and

×_{i∈I} B_i = σ(rect(B_i, i ∈ I)).

Then (A,B) is standard.

Proof. Since each (Ai,Bi), i ∈ I, is standard, each possesses a basis, sayFi(n);n = 1,2, . . ., which asymptotically generates a coordinate fieldFi which in turn generates Bi. (Note that these are not the same as theFi(n) of (3.6).) From Lemma 3.3 the product field of the nth order basisfields given by (3.6) is a basis for F . Thus we will be done if we can showthat F generates B. It is an easy exercise, however, to show that if Figenerates Bi, all i ∈ I, then

B = σ(rect(B_i, i ∈ I)) = σ(rect(F_i, i ∈ I)) = σ(F(rect(F_i, i ∈ I))) = σ(F).    (3.7)

2


We now turn from Cartesian products of different standard spaces tocertain subspaces of standard spaces. First, however, we need to definesubspaces of measurable spaces.

Let (A,B) be a measurable space (not necessarily standard). For anyF ∈ B and any class of sets G define the new class G ∩ F = all setsof the form G ∩ F for some G ∈ G. Given a set F ∈ B we can definea measurable space (F,B ∩ F) called a subspace of (A,B). It is easy toverify that B ∩ F is a σ -field of subsets of F . Intuitively, it is the fullσ -field B cut down to the set F .

We cannot show in general that arbitrary subspaces of standardspaces are standard. One case where they are is described next.

Lemma 3.4. Suppose that (A,B) is standard with a basis Fn, n = 1,2, . . ..If F is any countable union of members of the Fn, that is, if

F = ⋃_{i=1}^∞ F_i;  F_i ∈ ⋃_{n=1}^∞ F_n,

then the space (A− F,B∩ (A− F)) is also standard.

Proof. The basic idea is that if you take a standard space with a basisand remove some atoms, the original basis with these atoms removedstill works since decreasing sequences of atoms must still be empty forsome finite n or have a nonempty intersection. Given a basis Fn for(A,B), form a candidate basis Gn = Fn ∩ Fc . This is essentially the orig-inal basis with all of the atoms in the sets in F removed. Let us callthe countable collection of basis field atoms in F “bad atoms.” From thegood sets principle, the Gn asymptotically generate B∩ Fc . Suppose thatGn ∈ Gn is a decreasing sequence of nonempty atoms. By constructionGn = An ∩ Fc for some sequence of atoms An in Fn. Since the originalspace is standard, the intersection

⋂_{n=1}^∞ A_n is nonempty, say some set A. Suppose that A ∩ F^c is empty; then A ⊂ F and hence A must be contained in one of the bad atoms, say A ⊂ B_m ∈ F_m. But the atoms of F_m are disjoint and A ⊂ A_m with A_m ∩ B_m = ∅, a contradiction unless A_m is empty in violation of the assumptions. 2

Next we consider spaces isomorphic to standard spaces. Recall thatgiven two measurable spaces (A,B) and (B,S), an isomorphism f is ameasurable function f : A → B that is one-to-one and has a measur-able inverse f−1. Recall also that if such an isomorphism exists, the twomeasurable spaces are said to be isomorphic. We first provide a usefultechnical lemma and then show that if two spaces are isomorphic, thenone is standard if and only if the other one is standard.

Lemma 3.5. If (A, B) and (B, S) are isomorphic measurable spaces, let f : A → B denote an isomorphism. Then B = f^{-1}(S) = {all sets of the form f^{-1}(G) for G ∈ S}.


Proof. Since f is measurable, the class of sets F = f−1(S) has all its ele-ments in B. Conversely, since f is an isomorphism, its inverse function,say g, is well defined and measurable. Thus if F ∈ B, then g−1(F) = f(F)= D ∈ S and hence F = f−1(D). Thus every element of B is given byf−1(D) for some D in S. In addition, F is a σ -field since the inverseimage operation preserves all countable set theoretic operations. Sup-pose that there is an element D in B that is not in F . Let g : B → Adenote the inverse operation of f . Since g is measurable and one-to-one, g−1(D) = f(D) is in G and therefore its inverse image under f ,f−1(f (D)) = D, must be in F = f−1(G) since f is measurable. But thisis a contradiction since D was assumed to not be in F and hence F = B.2

Corollary 3.3. If (A,B) and (B,S) are isomorphic, then (A,B) is standardif and only if (B,S) is standard.

Proof. Say that (B,S) is standard and hence has a basis Gn, n = 1,2, . . .and that G is the generating field formed by the union of all of the Gn. Letf : A→ B denote an isomorphism and define Fn = f−1(Gn), n = 1,2, . . .Since inverse images preserve set theoretic operations, the Fn are in-creasing finite fields with

⋃ F_n = f^{-1}(G). A straightforward application of the good sets principle shows that if G generates S and B = f^{-1}(S), then f^{-1}(G) = F also generates B and hence the F_n asymptotically generate B. Given any decreasing set of atoms, say D_n, n = 1,2,…, of F_n, then f(D_n) = g^{-1}(D_n) must also be a decreasing sequence of atoms of G_n (again, since inverse images preserve set theoretic operations). The latter sequence cannot be empty since the G_n form a basis, hence neither can the former sequence since the mappings are one-to-one. Reversing the argument shows that if either space is standard, then so is the other, completing the proof. 2

As a final result of this section we show that if a field is standard, thenwe can always find a basis that is a binary indexed sequence of fields inthe sense of Lemma 3.1. Indexing the atoms by binary vectors will provea simple and useful representation.

Lemma 3.6. If a field F is standard, then there exists a binary indexedbasis and a corresponding canonical binary sequence function.

Proof. The proof consists simply of relabeling a basis in binary vectors. Since F is standard, it possesses a basis, say F_n, n = 0,1,…. Use this basis to enumerate all of the elements of F in order, that is, first list the elements of F_0, then the elements of F_1, and so on. This produces a sequence {F_n, n = 0,1,…} = F. Now construct a corresponding binary indexed asymptotically generating sequence of fields, say G_n, n = 0,1,…, as in Lemma 3.1. To prove G_n is a basis we need only show that it has the finite intersection property of (3.5). First observe


that for every n there is an m ≥ n such that Fn = Gm. That is, eventu-ally the binary indexed fields Gm match the original fields, they just takelonger to get all of the elements since they include them one new oneat a time. If Gn is a decreasing sequence of atoms in Gn, then there is asequence of integers n(k) ≥ k, k = 1,2, . . ., such that Gn(k) = Bk, whereBk is an atom of Fk. If Gn collapses to the empty set, then so must thesubsequence Gn(k) and hence the sequence of atoms Bk of Fk. But thisis impossible since the Fk are a basis. Thus the Gn must also possessthe finite intersection property and be a basis. The remaining conclusionfollows immediately from Lemma 3.1 and the definition that follows it.2

3.4 Simple Standard Spaces

First consider a measurable space (A,B) with A a finite set and B theclass of all subsets of A. This space is trivially standard–simply let allfields in the basis be B. Alternatively one can observe that since B isfinite, the space trivially possesses the countable extension property andhence must be standard.

Next assume that (A, B) has a countable alphabet and that B is again the class of all subsets of A. A natural first guess for a basis is to first enumerate the points of A as {a_0, a_1, …} and to define the increasing sequence of finite fields F_n = F({a_0}, …, {a_{n−1}}). The sequence is indeed asymptotically generating, but it does not have the finite intersection property. The problem is that each field has as an atom the leftover or garbage set G_n = {a_n, a_{n+1}, …} containing all of the points of A not yet included in the generating class. This sequence of atoms is nonempty and decreasing, yet it converges to the empty set and hence F_n is not a basis.

In this case there is a simple fix. Simply adjoin the undesirable atoms G_n containing the garbage to a good atom, say {a_0}, to form a new sequence of finite fields G_n = F(b_0(n), b_1(n), …, b_{n−1}(n)) with b_0(n) = {a_0} ∪ G_n, b_i(n) = {a_i}, i = 1,…,n−1. This sequence of fields also generates the full σ-field (note for example that {a_0} = ⋂_{n=0}^∞ b_0(n)). It also possesses the finite intersection property and hence is a basis. Thus any countable measurable space is standard.

The above example points out that it is easy to find sequences ofasymptotically generating finite fields for a standard space that are notbases and hence which cannot be used to extend arbitrary finitely addi-tive set functions. In this case, however, we were able to rig the field sothat it did provide a basis.


From Corollary 3.2 any countable product of standard spaces must bestandard. Thus products of finite or countable alphabets give standardspaces, a conclusion summarized in the following lemma.

Lemma 3.7. Given a countable index set I, if Ai are finite or countablyinfinite alphabets and Bi the σ -fields of all subsets of Ai for i ∈ I, then

×_{i∈I}(A_i, B_i)

is standard. Thus, for example, the following spaces are standard: theinteger sequence space

(N_+, B_{N_+}) = (Z_+^{Z_+}, B_{Z_+}^{Z_+}),    (3.8)

where Z_+ = {0,1,…} and B_{Z_+} is the σ-field of all subsets of Z_+ (the power set of Z_+), and the binary sequence space

(M_+, B_{M_+}) = ({0,1}^{Z_+}, B_{{0,1}}^{Z_+}).    (3.9)

The proof of Lemma 3.3 coupled with the discussion for the finite and countable cases provides a specific construction for the bases of the above two processes. For the case of the binary sequence space M_+, the fields F_n consist of n-dimensional rectangles of the form {x : x_i ∈ B_i, i = 0,1,…,n−1}, B_i ⊂ {0,1} all i. These rectangles are themselves finite unions of simpler rectangles of the form

c(b^n) = {x : x_i = b_i; i = 0,1,…,n−1}    (3.10)

for b^n ∈ {0,1}^n. These sets are called thin cylinders and are the atoms of F_n. The thin cylinders generate the full σ-field and any decreasing sequence of such rectangles converges down to a single binary sequence. Hence it is easy to show that this space is standard. Observe that this also implies that individual binary sequences x = {x_i; i = 0,1,…} are themselves events since they are a countable intersection of thin cylinders. For the case of the integer sequence space N_+, the field F_n contains all rectangles of the form {x : x_i ∈ F_i, i = 0,1,…,n}, where F_i ∈ F_i(n) for all i, and F_i(n) = {b_0(n), b_1(n), …, b_{n−1}(n)} for all i with b_0(n) = {0, n, n+1, n+2, …} and b_j(n) = {j} for j = 1,2,…,n−1.
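A minimal sketch (illustrative; finite prefixes stand in for infinite sequences) shows why thin cylinders behave so well as atoms: the thin cylinders containing a fixed sequence are indexed by its prefixes, they are nested, and together they determine the sequence coordinate by coordinate:

# Illustrative sketch: a thin cylinder is represented by its defining prefix b^n;
# a sequence x lies in c(b^n) exactly when its first n symbols equal b^n.
def in_thin_cylinder(x, prefix):
    return tuple(x[:len(prefix)]) == tuple(prefix)

x = (1, 0, 1, 1, 0, 0, 1, 0)          # a finite stand-in for an infinite sequence

# The thin cylinders containing x are indexed by the prefixes of x, and they nest:
# c(x^1) contains c(x^2) contains ..., converging down to x itself.
prefixes = [x[:n] for n in range(1, len(x) + 1)]
print(all(in_thin_cylinder(x, p) for p in prefixes))                    # True
print(all(longer[:len(shorter)] == shorter                              # nested prefixes
          for shorter, longer in zip(prefixes, prefixes[1:])))          # True

# Any sequence lying in every c(x^n) must agree with x in every coordinate, which
# is why a decreasing sequence of thin cylinders converges to a single sequence.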

As previously mentioned, it is not true in general that an arbitrarysubset of a standard space is standard, but the previous construction ofa basis combined with Lemma 3.4 does provide a means of showing thatcertain subspaces ofM+ andN+ are standard.

Lemma 3.8. Given the binary sequence spaceM+ or the integer sequencespace N+, if we remove from these spaces any countable collection ofrectangles, then the remaining space is standard.


Proof. The rectangles are eventually in the fields of the basis and theresult follows from Lemma 3.4. 2

Exercises

1. Consider the space of (3.9) of all binary sequences and the corre-sponding event space. Is the class of all finite unions of thin cylindersa field? Show that a decreasing sequence of nonempty atoms of Fnmust converge to a point, i.e., a single binary sequence.

3.5 Characterization of Standard Spaces

As a step towards demonstrating that a space is standard, the followingtheorem provides an alternative to the definition.

Theorem 3.1. A field F of subsets of Ω is standard if and only if (a) F is countable, and (b) if F_n, n = 1,2,…, is a sequence of nonempty disjoint elements in F, then ⋃_n F_n ∉ F. Given (a), condition (b) is equivalent to the condition (b') that if G_n, n = 1,2,…, is a sequence of strictly decreasing nonempty elements in F, then ⋂_n G_n ≠ ∅.

Proof. We begin by showing that (b) and (b’) are equivalent. We first showthat (b) implies (b’). Suppose that Gn satisfy the hypothesis of (b’). Then

H = (⋂_{n=1}^∞ G_n)^c = ⋃_{n=1}^∞ G_n^c = ⋃_{n=1}^∞ (G_n^c − G_{n−1}^c),

where G_0 = Ω. Defining F_n = G_n^c − G_{n−1}^c yields a sequence satisfying the conditions of (b) and hence H and therefore H^c are not in F, showing that

⋂_n G_n = H^c cannot be empty (or it would be in F since a field must contain the empty set). To prove that (b') implies (b), suppose that the sequence F_n satisfies the hypothesis of (b) and define F = ⋃_n F_n and G_n = F − ⋃_{j=1}^n F_j. Then G_n is a strictly decreasing sequence of nonempty sets with empty intersection. From (b'), the sequence G_n cannot all be in F, which implies that F ∉ F.

We now show that the equivalent conditions (b) and (b’) hold if andonly if the field is standard. First suppose that (b) and (b’) hold and thatF = Fn, n = 1,2, . . .. Define Fn = F(F1, F2, . . . , Fn) for n = 1,2, . . ..Clearly F =

⋃_{n=1}^∞ F_n since F contains all of the elements of all of the F_n and each element in F is in some F_n. Let A_n be a nonincreasing sequence of nonempty atoms in F_n as in the hypothesis for a basis. If the A_n are all equal for large n, then clearly ⋂_{n=1}^∞ A_n ≠ ∅. If this is not


true, then there is a strictly decreasing subsequence A_{n_k} and hence from (b')

⋂_{n=1}^∞ A_n = ⋂_{k=1}^∞ A_{n_k} ≠ ∅.

This means that the Fn form a basis for F and hence F is standard.Lastly, suppose that F is standard with basis Fn, n = 1,2, . . ., and

suppose that Gn is a sequence of strictly decreasing elements in F asin (b’). Define the nonincreasing sequence of nonempty sets Hn ∈ Fn asfollows: If G1 ∈ F1, then set H1 = G1 and set n1 = 1. If this is not thecase, then G1 is too “fine” for the field F1 so we set H1 = Ω. Continuelooking for an n1 such that G1 ∈ Fn1 . When it is found (as it mustbe since G1 is a field element), set Hi = Ω for i < n1 and Hn1 = G1.Next consider G2. If G2 ∈ Fn1+1, then set Hn1+1 = G2. Otherwise setHn1+1 = G1 and find an n2 such that G2 ∈ Fn2 . Then set Hi = G1 forn1 ≤ i < n2 and Hn2 = G2. Continue in this way filling in the sequence ofGn with enough repeats to meet the condition. The sequence Hn ∈ Fnis thus a nonempty nonincreasing sequence of field elements. Clearly byconstruction

⋂_{n=1}^∞ H_n = ⋂_{n=1}^∞ G_n

and hence we will be done if we can prove this intersection nonempty.We accomplish this by assuming that the intersection is empty andshow that this leads to a contradiction with the assumption of a stan-dard space. Roughly speaking, a decreasing set of nonempty field ele-ments collapsing to the empty set must contain a decreasing sequenceof nonempty basis atoms collapsing to the empty set and that violatesthe properties of a basis.

Pick for each n an ωn ∈ Hn and consider the sequence f(ωn) inthe binary sequence space M+. The binary alphabet 0,1 is triviallysequentially compact and hence so isM+ from Corollary 1.1. Thus thereis a subsequence, say f(ωn(k)), which converges to some u ∈ M+ ask → ∞. From Lemma 1.3 this means that f(ωn(k)) must converge to uin each coordinate as k→∞. Since there are only two possible values foreach coordinate in u, this means that given any positive integer n, thereis an N such that for n(k) ≥ N , f(ωn(k))i = ui for i = 0,1, . . . , n− 1,and hence from (3.1) that ωn(k) ∈ Aun , the atom of Fn indexed by un.Assuming that we also choose k large enough to ensure that n(k) ≥ n,then the fact that the fields are decreasing implies that ωn(k) ∈ Hn(k) ⊂Hn. Thus we must have that Gun is contained in Hn. That is, the pointωn(k) is contained in an atom of Fn(k) which is itself contained in anatom of Fn. Since the point is also in the set Hn of Fn, the set mustcontain the entire atom. Thus we have constructed a sequence of atoms


contained in a sequence of sets that decreases to the empty set, whichviolates the standard assumption and hence completes the proof. 2

3.6 Extension in Standard Spaces

In this section we combine the ideas of the previous two sections tocomplete the proofs of the following results.

Theorem 3.2. A field has the countable extension property if and only ifit is standard.

Corollary 3.4. A measurable space has the countable extension propertyif and only if it is standard.

These results will complete the characterization of those spaces having the desirable extension property discussed at the beginning of the section. The “only if” portions of the results are contained in Section 3.2. The corollary follows immediately from the theorem and the definition of standard. Hence we need only now show that a standard field has the countable extension property.

Proof of the theorem. Since F is standard, construct as in Lemma 3.1 a basis with a binary indexed sequence of atoms and let f : Ω → M_+ denote the corresponding canonical binary sequence function of (3.2). Let P be a nonnegative, normalized, finitely additive set function on F. We wish to show that (1.23) is satisfied. Let F_n be a sequence of field elements decreasing to the null set. If for some finite n = N, F_N is empty, then we are done since then

limn→∞

P(Fn) = P(FN) = P(∅) = 0. (3.11)

Thus we need only consider sequences Fn of nonempty decreasing fieldelements that converge to the empty set. From Theorem 3.1, however,there cannot be such sequences since the field is standard. Thus (3.11)trivially implies (1.23) and the theorem is proved by the Carathéodoryextension theorem. 2

The key to proving sufficiency above is the characterization of stan-dard fields in Theorem 3.1. This has the interpretation using part (b)that a field is standard if and only if truly countable unions of nonemptydisjoint field elements are not themselves in the field. Thus finitely ad-ditive set functions are trivially countably additive on the field becauseall countable disjoint unions of field elements can be written as a finiteunion of disjoint nonempty field elements.

Page 107: Probability Random Processes And Ergodic Properties

3.7 The Kolmogorov Extension Theorem 77

3.7 The Kolmogorov Extension Theorem

In Chapter 1 in the section on distributions we saw that given a randomprocess, we can determine the distributions of all finite dimensional ran-dom vectors produced by the process. We now possess the machineryto prove the reverse result–the Kolmogorov extension theorem–whichstates that given a family of distributions of finite dimensional randomvectors that is consistent, that is, higher dimensional distributions im-ply the lower dimensional ones, then there is a process distribution thatagrees with the given family. Thus a consistent family of finite dimen-sional distributions is sufficient to completely specify a random processpossessing the distributions. This result is one of the most important inprobability theory since it provides realistic requirements for the con-struction of mathematical models of physical phenomena.

Theorem 3.3. Let I be a countable index set, let (Ai,Bi), i ∈ I, be a familyof standard measurable spaces, and let

(AI,BI) = ×i∈I(Ai,Bi)

be the product space. Similarly define for any index set M ⊂ I the productspace (AM,BM). Given index subsets K ⊂ M ⊂ I, define the projectionsΠM→K : AM → AK by ΠM→K(xM) = xK,

that is, the projection ΠM→K simply looks at the vector or sequencexM = xi, i ∈ M and produces the subsequence xK. A family of proba-bility measures PM on the measurable spaces (AM,BM) for all finite indexsubsets M ⊂ I is said to be consistent if whenever K ⊂M, then

PK(F) = PM(Π−1M→K(F)), all F ∈ BK, (3.12)

that is, probability measures on index subsets agree with those on furthersubsets. Given any consistent family of probability measures, then thereexists a unique process probability measure P on (AI,BI) which agreeswith the given family, that is, such that for all M ⊂ I

PM(F) = P(Π−1I→M(F)), all F ∈ BM, all finite M ⊂ I. (3.13)

Proof. Abbreviate by Πi : AI → Ai the one dimensional sampling func-tion defined by Πi(xI) = xi, i ∈ I. The consistent families of probabilitymeasures induces a set function P on all rectangles of the form

B =⋂i∈M

Π−1i (Bi)

for any Bi ∈ Bi via the relation

Page 108: Probability Random Processes And Ergodic Properties

78 3 Standard Alphabets

P(B) = PM(×i∈MBi), (3.14)

that is, we can take (3.13) as a definition of P on rectangles. Since thespaces (Ai,Bi) are standard, each has a basis Fi(n), n = 1,2, . . . andthe corresponding generating fieldFi. From Lemma 3.3 and Corollary 3.2the product fields given by (3.6) form a basis for F = F(rect(Fi, i ∈ I)),which in turn generates BI; hence P is also countably additive on F .Hence from the Carathéodory extension theorem, P extends to a uniqueprobability measure on (AI,BI). Lastly observe that for any finite indexset M ⊂ I, both PM and P(ΠI→M)−1 (recall that Pf−1(F) is defined asP(f−1(F)) for any function f : it is the distribution of the random vari-able f) are probability measures on (AM,BM) that agree on a generatingfield F(rect(Fi, i ∈M)) and hence they must be identical. Thus the pro-cess distribution indeed agrees with the finite dimensional distributionsfor all possible index subsets, completing the proof. 2

By simply observing in the above proof that we only required the prob-ability measure on the coordinate generating fields and that these fieldsare themselves standard, we can weaken the hypotheses of the theoremsomewhat to obtain the following corollary. The details of the proof areleft as an exercise.

Corollary 3.5. Given standard measurable spaces (Ai,Bi), i ∈ I, for acountable index set I, let Fi be corresponding generating fields possessinga basis. If we are given a family of normalized, nonnegative, finitely ad-ditive set functions PM that are defined on all sets F ∈ F(rect(Fi, i ∈ I))and are consistent in the sense of satisfying (3.13) for all such sets, thenthe conclusions of the previous theorem hold.

In summary, if we have a sequence space composed of standard mea-surable spaces and a family of finitely additive candidate probabilitymeasures that are consistent, then there exists a unique random pro-cess or a process distribution that agrees with the given probabilities onthe generating events.

3.8 Bernoulli Processes

The principal advantage of a standard space is the countable extensionproperty ensuring that any finitely additive candidate probability mea-sure on a standard field is also countably additive. We have not yet, how-ever, actually demonstrated an example of the construction of a proba-bility measure by extension. To accomplish this in the manner indicatedwe must explicitly construct a basis for a sample space and then providea finitely additive set function on the associated finite fields. In this sec-tion the simplest possible examples of this construction are provided,

Page 109: Probability Random Processes And Ergodic Properties

3.8 Bernoulli Processes 79

the Bernoulli processes or Bernoulli shifts. Although very simple, theywill prove useful in constructing quite general classes of random pro-cesses.

We have demonstrated an explicit basis for the special case of a dis-crete space and for the relatively complicated example of a productspace with discrete coordinate spaces. Hence in principle we can con-struct measures on these spaces by providing a finitely additive set func-tion for this basis. As a specific example, consider the binary sequencespace (M+,BM+) of (3.9). Define the set function m on thin cylindersc(bn) = x : xn = bn by

m(c(bn)) = 2−n; all bn ∈ 0,1n,n = 1,2,3, . . .

and extend this to Fn and hence to F =⋃∞n=1Fn in a finitely additive

way, that is, any G ∈ F is in Fn for some n and hence can be writtenas a disjoint union of thin cylinders, say G =

⋃Ni=1 c(b

ni ). In this case de-

fine m(G) =∑Ni=1m(c(b

Ni )). This defines m for all events in F and the

resulting m is finitely additive by construction. It was shown in Section3.4 that the Fn, n = 1,2, . . . form a basis for F , however, and hence mextends uniquely to a measure on BM+ . Thus with a basis we can eas-ily construct probability measures by defining them on simple sets anddemonstrating only finite and not countable additivity.

Observe that in this case of equal probability on all thin cylinders of agiven length, the probability of individual sequences must be 0 from thecontinuity of probability. For example, given a sequence x = (x0, x1, . . .),we have that

n⋂i=1

c(xn) ↓ x

and hencem(x) = lim

n→∞m(c(xn)) = lim

n→∞2−n = 0.

The probability space (M+,BM+ ,m,T) yields an important dynami-cal system by letting T be the shift onM+. Similarly, the example yieldsan important random process by defining X0 as the zero-time coordi-nate function on M+ and defining the random process Xn;n ∈ Z+ bydefining Xn = TnX0, that is, Xn(ω) = X0(Tnω), ω ∈M+. Alternatively,any ω ∈ M+ has the form ω = (x0, x1, · · · ) and hence Xn(ω) = xn.From the definition it is immediate that for any positive integer n andany collection of sample times (t0, t1, . . . , tn−1) that the distribution forthe random vector (Xt0 , Xt1 , . . . , Xtn−1) is described by the PMF

pXt0 ,Xt2 ,...,Xtn−1(x0, x1, . . . , xn−1) =

12n

;

xn = (x0, x1, . . . , xn−1) ∈ 0,1n.

Page 110: Probability Random Processes And Ergodic Properties

80 3 Standard Alphabets

Thus, for example, the marginal PMF’s pXn(x) are all identical and henceequal to a common PMF pX(x). Furthermore,

pXt1 ,Xt2 ,...,Xtn (x1, x2, . . . , xn) =n∏i=1

pX(xi). (3.15)

When a discrete-alphabet random process has distributions of this form,i.e., the PMFs for sample vectors are products of a common PMF, theprocess is said to be independent and identically distributed (IID). If thealphabet is binary, the process is called a Bernoulli process, although theterm is also used for more general discrete-alphabet processes satisfying(3.15). In the general (not necessary binary or equiprobable) discrete-alphabet case, the corresponding dynamical system is called a Bernoullishift in ergodic theory.

Bernoulli processes provide a classic model for random phenomenalike flipping coins and spinning roulette wheels. The binary example withequiprobable 0 and 1 (pX(x) = 1/2 for x = 0,1) will be referred to asthe fair coin flipping process. The IID random process with an alphabetof 37 and a uniform marginal PMF is a roulette process (38 in the USA,where 00 is added to 0).

We have constructed one-sided models for these processes, but two-sided models with the same distributions follow in the same way byconsidering two-sided Cartesian products. Such random processes haveno memory and may appear to be very limited in their ability to modelreal-life random phenomena, but they can lead to a useful and moregeneral class of processes by coding as discussed in Section 2.6. Thisconstruction is explored in the next section.

3.9 Discrete B-Processes

Fix a discrete alphabet A and let p denote a PMF on A and let P de-note the corresponding distribution on BA (the power set of A sinceA is discrete). As in the previous constructions, construct a sequencemeasurable space (AZ,BZ

A) and the Kolmogorov representation of an IIDrandom process (AZ,BZ

A, µ) where µ is the process distribution from theKolmogorov extension theorem applied to the product PMFs satisfying(3.15). Let TA denote the (left) shift on AZ. Given any discrete alpha-bet A and any PMF p on A the previous construction demonstrates theexistence of an IID random process X = Xn with this alphabet andmarginal PMF.

As in Section 2.6, suppose that Y is a stationary coding of X; that is,there is a measurable mapping f : AZ → B from (AZ,BZ

A) to a space(B,BB) and the implied stationary code f : AZ → BZ defined by

Page 111: Probability Random Processes And Ergodic Properties

3.9 Discrete B-Processes 81

f (xZ) = f(TnAxZ);n ∈ Z.

For a concrete simple example, see the construction of Section 2.6.Stationary coding yields a new random process Yn with a Kol-

mogorov directly-given representation (BZ,BZB, η) where η = µf−1. The

corresponding dynamical system is (BZ,BZB, η, TB), where TB is the shift

on BZ. As in Section 2.6,

f (TAxZ) = TBf (xZ). (3.16)

The class of all random processes constructed in this fashion, thatis, by stationary (or sliding-block) coding an IID process, is called theclass of B-processes after Ornstein[77, 78]. While it is not by any meansthe only class of interesting processes, it is arguably the most importantsingle class of random process models for both theory and practical ap-plications.

The dependence of the code f on the entire input sequence is forgenerality, in practice it may depend on the input sequence through onlya few coordinate values. For example, f might view only a “window” ofrecent inputs, say (x−K+1, x−K+2, . . . , x0), in order to produce the output

y0 = f(xZ) = fK(x−K+1, x−K+2, . . . , x0)

for a mapping fK : AK → B. As examples, suppose that the input processis binary, A = 0,1. fK might produce a 1 if there are an odd number ofones in in (x−K+1, x−K+2, . . . , x0) and a 0 otherwise. This can be viewedas a code, or as a linear filter using mod 2 arithmetic — y0 is the mod2 sum of the x−K+1, x−K+2, . . . , x0. As another code, we might divide AKinto two classes, say C0 and C1, that is, C0, C1 for a partition of AK .fK is then defined as 0 if (x−K+1, x−K+2, . . . , x0) ∈ C0 and 1 otherwise.The output alphabet B need not be the same as A. For example, we coulddefine

fK(x−K+1, x−K+2, . . . , x0) =1K

0∑i=−K+1

xi,

which has a discrete alphabet 0,1/K, . . . , (K − 1)/K ⊂ [0,1). Alterna-tively, we could form an exponentially weighted sum

fK(x−K+1, x−K+2, . . . , x0) =0∑

i=−K+1

xi2−i.

Since the xi are binary, this sum converges as K →∞ to produce a limit-ing code

f(x) =0∑

i=−∞xi2−i−1 ∈ [0,1]. (3.17)

Page 112: Probability Random Processes And Ergodic Properties

82 3 Standard Alphabets

While the mappings fK are measurable since the alphabet B is discrete,we have a problem now verifying measurability because the output spaceB is no longer discrete, it is the entire unit interval, and we have not yetconstructed a σ -field for this case. We will do so in the next section.

Given an IID input, none of these new processes is itself IID. By allow-ing general mappings f , one can construct quite complicated randomprocesses. Just how general is a question that is fundamental to ergodictheory and will be occasionally considered in this book and its sequel.The immediate purpose, however, was to provide examples of the con-struction of complete random process models, and the IID processes andB-processes provide both the simplest possible example and an impor-tant more general class.

Before continuing, we note a simple but important property of B-processes.

Lemma 3.9. A stationary coding of a B-process is itself a B-process.

Proof. A B-process is a stationary coding of an IID process. If the B-process is itself coded by a stationary mapping, then the cascade of thetwo mappings will be measurable and stationary, hence the final processis also a stationary coding of the original IID process. 2

3.10 Extension Without a Basis

One goal of this section is to show that in general it is not easy to con-struct a basis as it was in the product space construction used to de-scribe IID discrete-alphabet random processes. This is true even in anapparently simple case such as the unit interval [0,1). Thus although astandard space may be useful in theory, it may be less useful in practiceif one only knows that a basis exists, but is unable to construct one.

A second goal of this section is to show that the inability to constructa basis may not be a problem when constructing certain explicit, butvery important, probability measures. A basis is nice and important fortheory because all finitely additive candidate probability measures ona basis are also countably additive and hence have an extension. If weare trying to prove a specific candidate probability measure is countablyadditive, we can sometimes use the structure of a related standard spacewithout having to explicitly describe a basis.

To demonstrate the above considerations we focus on a particularfundamentally important example: The Borel and Lebesgue measures onthe unit interval. We shall show the difficulties in trying to find a basisand how to circumvent them by instead using the structure of binarysequence space.

Page 113: Probability Random Processes And Ergodic Properties

3.10 Extension Without a Basis 83

Borel Measure on the Unit Interval

Suppose that we wish to construct a probability measure m on the unitinterval [0,1) = r : 0 ≤ r < 1 in such a way that the probability of anopen interval of the form (a, b) = r : a < r < b for 0 ≤ a < b < 1 willbe simply the length of the interval, b − a, that is,

m((a,b)) = b − a. (3.18)

The question is, is this enough to completely describe a measure on[0,1)? That is, does this set function extend to a normalized, nonneg-ative, countably additive set function on a suitably defined σ -field?

To begin, a useful σ -field for this alphabet is required. Here it isnatural to use the the Borel field for metric spaces of Section 1.4. LetB([0,1)) = σ(O([0,1)), the collection of all open intervals of the formx : a < x < b, 0 ≤ a < b < 1, in [0,1)). Note that B([0,1)) mustcontain all of the one-point sets a for a ∈ [0,1). To see this, if a > 0,a = limn→∞(a − 1/n,a + 1/n) and for large enough n all of the sets(a−1/n,a+1/n) are in [0,1) and the σ -field must contain such decreas-ing limits. If a = 0, the a = (0,1)c and hence 0 is in the σ -field. Thushalf open intervals of the form [a, b) = r : a ≤ r < b = a ∪ (a, b)are in σ(O([0,1)).

Lemma 3.10. B([0,1)) is countably generated.

Proof. Define G([0,1)) as the collection of all half open intervals of theform [0, k/n) = r : 0 ≤ r < k/n for all positive integers 0 ≤ k < n.This collection is clearly countable, and hence the σ -field σ(G([0,1)))is countably generated. The σ -field σ(G([0,1))) must contain all halfopen intervals of the form [0, a) for 0 ≤ a < 1 since we can constructa sequence of decreasing rational an converging downward to a andhence [0, an) is in the σ -field and hence so is its limit [0, a). Givenany 0 ≤ a < b < 1, [a, b) = [0, b) − [0, a) must be in σ(G([0,1)))since it is a set theoretic difference of two sets in σ(G([0,1))). Points in[0,1) must be in the σ field, e.g., since a = limn→∞[a,a + 1/n). Thismeans all open intervals are in σ(G([0,1))) since (a, b) = [a, b) − a.Since O ⊂ σ(G([0,1))), and the latter is a σ -field, σ(O) ⊂ σ(G([0,1))).Conversely, every set in G([0,1)) is a half-open interval and henceis in σ(O([0,1))) and therefore σ(G([0,1))) ⊂ σ(O([0,1))). Thusσ(G([0,1))) = σ(O([0,1)). 2

Next consider the field generated by the countable collection G([0,1)),F(G([0,1))). It has been seen that this field is formed by includingG([0,1)) together with all finite sequences of set theoretic operationson members of G([0,1)). It turns out, however, that there is an evensimpler description of F(G([0,1))).

Page 114: Probability Random Processes And Ergodic Properties

84 3 Standard Alphabets

Lemma 3.11. Define as before G([0,1)) as the collection of all half openintervals [0, k/n) for positive integers 0 ≤ k < n. Then every set F ∈F(G([0,1))) has the form

F =n⋃i=1

[ai, bi)

for some finite positive integer n and rational 0 ≤ ai < bi < 1, i =1,2, . . . n, where the intervals are all disjoint.

Note: The importance of the lemma is that it provides an obvious exten-sion of the definition of m from G([0,1)) to F(G([0,1))) using finiteadditivity:

m(F) =n∑i=1

m([ai, bi)) =n∑i=1

(bi − ai). (3.19)

It immediately follows that m is finitely additive on F(G([0,1)))

Proof. Consider the collectionH of all sets of the form F =⋃ni=1[ai, bi)

where the ai and bi are rational and the sets are disjoint. First considercomplements. If F has this form, then

Fc =n⋂i=1

[ai, bi)c =n⋂i=1

([0, ai)∪ [bi,1))

= n⋂i=1

[0, ai)

∪ n⋂i=1

[bi,1)

= [0, amin)∪ [bmax,1), (3.20)

whereamin =min

iai, bmax =max

ibi.

This has the required form and hence Fc ∈ H . Next consider fi-nite unions. Suppose that we have n sets Fi; i = 1,2, . . . , n in H , sayFj =

⋃nji=1[aj,i, bj,i), and we form the union

F =N⋃j=1

Fj =N⋃j=1

nj⋃i=1

[aj,i, bj,i).

This at least has the form F =⋃ni=1[ci, di), but the half-open intervals

[ci, di) need not be disjoint. It will be seen, however, that this collectioncan be used to obtain an equivalent representation of the desired form.First relabel things if necessary to order the ci so that 0 ≤ c1 ≤ c2 ≤c3 ≤ · · · ≤ cn. Analogous to the construction based on (1.16) used in theproof of Lemma 1.17, define the sets

Page 115: Probability Random Processes And Ergodic Properties

3.10 Extension Without a Basis 85

G1 = [c1, d1)G2 = [c2, d2)−G1

G3 = [c3, d3)−2⋃i=1

Gi

...

Gn = [cn,dn)−n−2⋃i=1

Gi.

By construction the Gi are disjoint and

F =n⋃i=1

Gi.

The proof is completed by showing that each of the Gi has the desiredform of being a finite union of disjoint half-intervals and hence F alsohas this form since the Gi are disjoint and F is their union.

Consider first G2:

G2 = [c2, d2)− [c1, d1) = [c2, d2)∩ [c1, d1)c

= [c2, d2)∩ ([0, c1)∪ [d1,1))= [c2, d2)∩ [d1,1)

since c2 ≥ c1. Thus

G2 = [c2, d2)∩ [d1,1) =

∅ if d2 ≤ d1

[d1, d2) if c2 ≤ d1 < d2

[c2, d2) if d1 < c2

and hence G2 is either empty or is a half-open interval disjoint from G1.Thus also

⋃2i=1Gi =

⋃2i=1[ci, di) is a finite union of disjoint intervals.

Proceed by induction. Suppose that⋃n−1i=0 Gi is a finite union of disjoint

intervals, say⋃mi=0[ai, bi), and use (3.20) to obtain

Gn = [cn,dn)−n−1⋃i=0

Gi = [cn,dn)∩ m⋃i=0

[ai, bi)

c= [cn,dn)∩ ([0, amin)∪ [bmax,1))= ([cn,dn)∩ [0, amin))∪ ([cn,dn)∩ [bmax,1))

which is either empty, a single half-open interval, or the union of twodisjoint half-open intervals. Thus Gn is a finite union of disjoint half-open intervals (possibly empty) and hence so is

⋃ni=1Gi. 2

Page 116: Probability Random Processes And Ergodic Properties

86 3 Standard Alphabets

Thus we have constructed a finitely-additive candidate probabilitymeasure on a countable field F(G([0,1))) that generates B([0,1)) Thequestion now is whether m is also countably additive on F . This wouldfollow if we new the measurable space ([0,1),B([0,1))) were standard,but we have not (yet) shown this to be the case.

The discussion of standard spaces suggests that we should constructa basis Fn ↑ F = F(G([0,1))). Knowing that m is finitely additive onFn will then suffice to prove that it is countably additive on F and henceuniquely extends to B([0,1)).

So far so good. Unfortunately, however, there is no simple, easily de-scribable basis for F . A natural guess is to have ever larger and finer Fngenerated by intervals with rational endpoints. For example, let Fn beall finite unions of intervals of the form

Gn,k = [k2−n, (k+ 1)2−n), k = 0,1, . . . ,2n. (3.21)

Analogous to the binary sequence example, the measure of the atomsGn,k of Fn is 2−n and the set function extends to a finitely additive setfunction on F . Clearly Fn ↑ F , but unfortunately this simple sequencedoes not form a basis because atoms can collapse to empty sets! As anexample, consider the atoms Gn ∈ Fn defined by

Gn = [kn2−n, (kn + 1)2−n),

where we choose kn = 2n−1−1 so that Gn = [2−1−2−n,2−1). The atomsGn are nonempty and have nonempty finite intersections, but

∞⋂n=1

Gn = ∅,

which violates the conditions required for Fn to form a basis.Unlike the discrete or product space case, there does not appear to be

a simple fix and no simple variation on the above Fn appears to providea basis. Thus we cannot immediately infer that m is countably additiveon F .

A second approach is to try to use an isomorphism as in Lemma 3.5 toextend m by actually extending a related measure in a known standardspace. Again there is a natural first guess. Let (M+,BM+) be the binarysequence space of (3.9), where BM+ is the σ -field generated by the thincylinders. The space is standard and there is a natural mapping f :M+ →[0,1) defined by

f(u) =∞∑k=0

uk2−k−1 mod 1, (3.22)

where u = (u0, u1, u2, · · · ) and where by mod1 we take the value of thesum modulo 1 so that the sequence of all 1s is mapped into 0. This

Page 117: Probability Random Processes And Ergodic Properties

3.10 Extension Without a Basis 87

technical detail allows us to consider the output alphabet of f to bethe half-open unit interval [0,1) instead of [0,1]. The mapping f(u) issimply the real number whose binary expansion is .u0u1u2 · · · , and wehave previously encountered it in (3.17) as a sliding-block code.

The function f will be shortly proved to be measurable and henceone might hope that [0,1) would indeed inherit a basis fromM+ throughthe mapping f . Sadly, this is not immediately the case — the problembeing that f is a many-to-one mapping and not a one-to-one mapping.Hence it cannot be an isomorphism. For example, both 011111 · · · and10000 · · · yield the value f(u) = 1/2. We could make f a one-to-onemapping by removing from M+ all sequences with only a finite numberof 0’s, that is, form a new space

M+′ =M+ −H (3.23)

by removing the countable collection H of all sequences sequenceswith no 0’s, with a single 0, with two zeros, etc.: 11111 · · · , 01111 · · · ,00111 · · · , 10111 · · · , 000111 · · · and so on and consider f as a map-ping fromM+

′ to [0,1).Here again we are stymied — none of the results proved so far per-

mit us to conclude that the reduced space M+′ is standard. In fact,

since we have removed individual sequences (rather than cylinders asin Lemma 3.8), there will be atoms collapsing to the empty set and thepreviously constructed basis forM+ does not work for the smaller spaceM+

′.We have reached an apparent impasse in that we can not (yet) either

show that F has a basis or thatm on F is countably additive. We resolvethis dilemma by giving up trying to find a basis for F and instead usethe standard spaceM+ to show directly that m is countably additive.

We begin by showing that f : M+ → [0,1) is measurable. FromLemma 2.4 this will follow if we can show that for any G ∈ F , f−1(G) ∈BM+ . Since every G ∈ F can be written as a finite union of disjoint setsof the form (3.21), it suffices to show that f−1(G) ∈ BM+ for all sets Gn,kof the form (3.21). To find

f−1(Gn,k) = u : k2−n ≤ f(u) < (k+ 1)2−n,

let bn = (b0, b1, · · · , bn−1) satisfy

n−1∑j=0

bj2n−j−1 =n−1∑j=0

bn−j−12j = k, (3.24)

that is, (bn−1, · · · , b0) is the unique binary representation of the integerk. The thin cylinder c(bn) will then have the property that if u ∈ c(bn),then

Page 118: Probability Random Processes And Ergodic Properties

88 3 Standard Alphabets

f(u) =∞∑j=0

bj2−j−1 ≥n−1∑j=0

bj2−j−1 = 2−nn−1∑j=0

bj2n−j−1 = k2−n

and

f(u) ≤n−1∑j=0

bj2−j−1 +∞∑j=n

2−j−1 = k2−n + 2−n

with equality on the upper bound if and only if bj = 1 for all j ≥ n.Thus if u ∈ c(bn)−bn111 · · · , then u ∈ f−1(Gn,k) and hence c(bn)−bn111 · · · ⊂ f−1(Gn,k). Conversely, if u ∈ f−1(Gn,k), then f(u) ∈Gn,k and hence

k2−n ≤∞∑j=0

uj2−j−1 < (k+ 1)2−n. (3.25)

The left-hand inequality of (3.25) implies

k ≤n−1∑j=0

uj2n−j−1 + 2n∞∑j=nuj2−j−1 ≤

n−1∑j=0

uj2n−j−1 + 1, (3.26)

with equality on the right if and only if uj = 1 for all j > n. The right-hand inequality of (3.25) implies that

n−1∑j=0

uj2n−j−1 ≤n−1∑j=0

uj2n−j−1 + 2n∞∑j=nuj2−j−1

2n∞∑j=0

uj2n−j−1 < k+ 1. (3.27)

Combining (3.26) and (3.27) yields for u ∈ f−1(Gn,k)

k− 1 ≤n−1∑j=0

uj2n−j−1 < k+ 1, (3.28)

with equality on the left if and only if uj = 1 for all j ≥ n. Since the sumin (3.28) must be an integer, it can only be k or k−1. It can only be k−1,however, if uj = 1 for all j ≥ n. Thus either

n−1∑j=0

uj2n−j−1 = k,

in which case u ∈ c(bn) with bn defined as in (3.24), or

Page 119: Probability Random Processes And Ergodic Properties

3.10 Extension Without a Basis 89

n−1∑j=0

uj2n−j−1 = k− 1 and uj = 1, all j ≥ n.

The second relation can only occur if for bn as in (3.24) we have thatbn−1 = 1 and uj = bj , j = 0, . . . , n − 2, un−1 = 0, and uj = 1, j ≥ n.Letting w = (bn−10111 · · · ) denote this sequence, we therefore havethat

f−1(Gn,k) ⊂ c(bn)∪w.In summary, every sequence in c(bn) is contained in f−1(Gn,k) except

the single sequence (bn111 · · · ) and every sequence in f−1(Gn,k) is con-tained in c(bn) except the sequence (bn−10111 · · · ) if bn = 1 in (3.24).Thus given Gn,k, if bn = 0

f−1(Gn,k) = c(bn)− (bn111 · · · ) (3.29)

and if bn = 1

(c(bn)− (bn111 · · · ))∪ (bn−10111 · · · ). (3.30)

In both cases, f−1(Gn,k) ∈ BM+ , proving that f is measurable.We note in passing that this shows that the sliding-block code of (3.17)

is measurable and hence well-defined.Measurability of f almost immediately provides the desired conclu-

sions: Simply specify a measure P on the standard space (M+,BM+) via

P(c(bn)) = 2−n; all bn ∈ 0,1n

and extend this to the field generated by the cylinders in an additivefashion as before. Since this field is standard, this provides a measureon (M+,BM+). Now define a measure m on ([0,1),B([0,1))) by

m(G) = P(f−1(G)), all G ∈ B[0,1).

As in Section 3.2, f is a random variable and m is its distribution. Inparticular, m is a measure. From (3.28) and the fact that the individualsequences have 0 measure, we have that

m([k2−n, (k+ 1)2−n)) = P(c(bn)) = 2−n, k = 1,2, . . . , n− 1.

From the continuity of probability, this implies that for any interval[a, b), 1 > b > a ≥ 0, m([a,b)) = b − a. Also from the conti-nuity of property, m(a) = 0 for any point a ∈ [0,1), and hencem((a,b)) = b−a as originally desired. Since this agrees with the originaldescription of the measure on intervals, it must also agree on the fieldgenerated by intervals. Thus there exists a probability measure agreeingwith (3.18) and hence with the additive extension of (3.18) to the field

Page 120: Probability Random Processes And Ergodic Properties

90 3 Standard Alphabets

generated by the intervals. By the uniqueness of extension, the measurewe have constructed must be the unique extension of (3.18).

The probability measure m constructed in this way is called the Borelmeasure on [0,1) and the space ([0,1),B([0,1)),m) is an example ofa Borel measure space. The corresponding completed probability spaceof Lemma 1.14 ([0,1),B, λ), with B = B(([0,1))m, the completion ofB(([0,1)) with respect to m, and λ the extension of m to B, is calledthe Lebesgue measure space on [0,1) and λ is the Lebesgue measure on[0,1). The name “Lebesgue measure” is sometimes used in the literatureto denote the uncompleted probability space as well as the completedmeasure. The dominant usage, however, is to use “Lebesgue” for thecompleted measure. Unfortunately “Borel measure” also has a more gen-eral meaning as any measure on a Borel measurable space. The meaningshould be clear from context, and we will use the articles “a” and “the”to distinguish between the general and particular cases.

We have not shown that the measurable space ([0,1),B([0,1))) wehave constructed based on the unit interval is standard, all we haveshown is that for this measurable space we could extend a specificfinitely additive set function (candidate measure) to a measure — theBorel measure — on this measurable space. (We will show that this mea-surable space is standard in the next chapter, but this will require ad-ditional work.) Thus even without a basis or an isomorphism, we canconstruct probability measures on a given space if we have a measur-able mapping from a standard space to the given space. The exampleof the Borel measure space may seem artificially simple, but it will turnout to lead to a surprisingly general class of probability spaces, which isdefined in the next and final section of this chapter.

The same construction can be used to obtain a Borel measure space(and the corresponding Lebesgue measure space by completion) on anyfinite interval [a, b) for −∞ < a < b <∞. Specifically, there is a measurem on [a, b) and a σ -field of subsets generated by G[a,b), the collectionall sets of the form [a,a + k/n) for all positive integers k,n such thatk/n ≤ b−a, such that for any a ≤ c < d < b m([c,d)) = d−c, the lengthof the interval. The measure can be made into a probability measure bynormalizing, that is, by defining m([c,d)) = (d− c)/(b − a). The name“Lebesgue measure” is reserved for the completed unnormalized case,so that the measure of an interval is the length of the interval.

3.11 Lebesgue Spaces

The mapping f that took binary sequences into numbers in the unit in-terval was almost an isomorphism, and it was the failure to actually bean isomorphism that caused all of the technical difficulties, since other-

Page 121: Probability Random Processes And Ergodic Properties

3.12 Lebesgue Measure on the Real Line 91

wise it would have been immediate that ([0,1),B([0,1))) was standard.Suppose that f is cut down to the reduced space M+

′ = M+ − H of3.23. Then the mapping f : M+

′ → [0,1) is still measurable since wehave removed a countable number of sequences from its domain of def-inition and the mapping f is now one-to-one. Furthermore its inversef−1 is well defined, it is the unique binary expansion (without havingonly a finite number of 0s) of [0,1). Since M+

′ was formed by remov-ing a countable set from, f is an isomorphism except for a null set inthe binary sequence space and hence is an isomorphism mod 0 as intro-duced in Section 2.7. This sets the stage for a general class of probabilityspaces. A complete probability space (Ω,B, P) is called a Lebesgue spaceif P is a mixture of the form P = λP1 + (1 − λ)P2 for λ ∈ [0,1], where(Ω,B, P1) is isomorphic (mod 0) to a Lebesgue measure space and whereP2 is discrete probability measure, that is, there are a countable numberof points xi ∈ Ω, i = 0,1,2, . . ., with probabilities pi = P1(xi) and

P2(F) =∑i:xi∈F

pi;F ∈ Ω.The theory of Lebesgue spaces is generally credited to Rohlin [90] and

Halmos and von Neuman [48]. Lebesgue spaces consist of a continuousportion isomorphic to and hence essentially the same as a uniform distri-bution on the unit interval plus a purely discrete component, with bothextremes being possible. Thus, for example, the Kolmogorov model forfair coin flips constructed earlier in this chapter is a Lebesgue space.

Lebesgue spaces are the most common probability space used in er-godic theory because of the useful properties they possess (includingthose developed in this chapter) and their generality (which will be dis-cussed later). They are sometimes referred to as “standard Lebesguespaces” and we shall see that they are essentially equivalent (if we ignorenull sets) to the probability spaces of most interest here — probabilityspaces built from Polish spaces. Our focus will provide more structurebased on the underlying metric, and the structure will prove useful in thesequel book on information theory [30]. We shall see in the next chapterthat the measurable space constructed for a Lebesgue space is, in fact, astandard measurable space.

3.12 Lebesgue Measure on the Real Line

This section is the only section of this book to focus on a measure thatis not a probability measure and cannot be normalized to make a prob-ability measure — the Lebesgue measure on the real line. The case isimportant in the context of this book because it provides a means of de-

Page 122: Probability Random Processes And Ergodic Properties

92 3 Standard Alphabets

scribing the probability distributions of continuous random variables interms of integrals of probability density functions.

For any positive integer n let ([−n,n),B([−n,n)), λn) denote theLebesgue measure space on [−n,n) as developed in the previous sec-tion. Given the real line R, define the collection of subsets

BR = all F : F ∩ [−n,n) ∈ B([−n,n)) for all n = 1,2, . . ..

It is easy to verify that BR is a σ -field. First consider complements. If F ∈BR, then for every n F∩[−n,n) ∈ B([−n,n)) and hence the complementof F∩[−n,n) in [−n,n), [−n,n)−F−[−n,n)∩Fc is in B([−n,n)) for alln, which implies that Fc ∈ BR. Next consider countable unions. Supposethat for i = 1,2, . . . Fi ∈ BR and hence Fi ∩ [−n,n) ∈ B([−n,n)) for alln. Since B([−n,n)) is a σ -field,

⋃i(Fi ∩ [−n,n)) = (

⋃i Fi) ∩ [−n,n) ∈

B([−n,n)) and therefore⋃i Fi ∈ BR. Thus BR is indeed a σ -field.

To define a measure on the measurable space, (R,BR), let F ∈ BR andconsider the measures λn(F ∩ [−n,n)). First observe that if F ∈ B[−k,k)for a fixed k, then λn(F) = λk(F) for all n ≥ k. This is obvious forintervals and hence also for the field generated by the intervals. Sincethe two measures agree on a generating field, they must be the same.This implies that for a fixed F , λn(F ∩ [−n,n)) is an non-decreasingsequence in n since

λn+1(F ∩ [−(n+ 1),n+ 1))= λn+1(F ∩ [−n,n))+ λn+1(F ∩ ([−(n+ 1),n+ 1)− [−n,n))≥ λn+1(F ∩ [−n,n)) = λn(F ∩ [−n,n)).

Since λn(F∩[−n,n)) is a nondecreasing and bounded by 1, it has a limit,say

λ(F) = limn→∞

λn(F ∩ [−n,n)). (3.31)

The set function λ is obviously nonnegative, so to demonstrate that itis a measure only countable additivity need be demonstrated. Supposethat Fk ∈ BR for k = 1,2, . . . are disjoint. Then

λ(∞⋃k=1

Fk) = limn→∞

λn(

∞⋃k=1

Fk

∩ [−n,n))= limn→∞

λn(∞⋃k=1

(Fk ∩ [−n,n)))

= limn→∞

∞∑k=1

λn(Fk ∩ [−n,n)).

Since λn(Fk∩[−n,n)) is nondecreasing, from Corollary 5.8 the limit canbe taken term-by-term to obtain

Page 123: Probability Random Processes And Ergodic Properties

3.12 Lebesgue Measure on the Real Line 93

λ(∞⋃k=1

Fk) =∞∑k=1

limn→∞

λn(Fk ∩ [−n,n)) =∞∑k=1

λ(Fk),

and hence λ is a measure on (R,BR), which is called the Lebesgue mea-sure on the real line. The Lebesgue measure on the real line is sigma-finite since R =

⋃∞n=1[−n,n) and λ([−n,n)) = λn([−n,n)) = 2n is

finite for all n.Although the Lebesgue measure on the real line is not itself a prob-

ability measure, it will later be seen to be quite useful in constructingprobability measures.

Page 124: Probability Random Processes And Ergodic Properties

Chapter 4

Standard Borel Spaces

Abstract The most important class of standard spaces is developed,standard Borel spaces formed using separate, complete metric spacesor Polish spaces. The spaces are shown to be standard by constructingan isomorphism with a standard subset of a discrete sequence space.

4.1 Products of Polish Spaces

This section and the next present properties of Polish spaces that resem-ble those of standard spaces. We show that Cartesian products of Polishspaces are Polish spaces, and that certain subsets of Polish spaces arealso Polish. Given a Polish space A and an index set I, define a metric dI

on the product space AI as in Example 1.12. The following lemma showsthat the product space is also Polish.

Lemma 4.1.

(a) If A is a Polish space with metric d and I is a countable index set,then the product space AI is a Polish space with respect to the productspace metric dI.

(b) In addition, if B(A) is the Borel σ -field of subsets of A with re-spect to the metric d, then the product σ -field of Section 2.4, B(A)I =σ(rect(Bi, i ∈ N)), where the Bi are all replicas of B(A), is exactly theBorel σ -field of AI with respect to dI. That is, B(A)I = B(AI).

Proof. (a) If A has a dense subset, say B, then a dense subset for AIcan be obtained as the class of all sequences taking values in B on afinite number of coordinates and taking some fixed reference value, sayb, on the remaining coordinates. Hence AI is separable. If a sequencexIn = xn,i, i ∈ I is Cauchy, then dI(xI

n,xIm) → 0 as n,m → ∞. From

Lemma 1.3 this implies that for each coordinate d(xn,i, xm,i) → 0 as

© Springer Science + Business Media, LLC 2009R.M. Gray, Probability, Random Processes, and Ergodic Properties, 95DOI: 10.1007/978-1-4419-1090-5_4,

Page 125: Probability Random Processes And Ergodic Properties

96 4 Standard Borel Spaces

n,m → ∞ and hence the sequence xn,i is a Cauchy sequence in A foreach i. Since A is complete, the sequence has a limit, say yi. From Lemma1.3 again, this implies that xI

n →n→∞ y I and hence AI is complete.(b) We next show that the product σ -field of a countably generated

Borel σ -field is the Borel σ -field with respect to the product space met-ric; that is, both means of constructing σ -fields yield the same result.(This result need not be true for products of Borel σ -fields that are notcountably generated.)

For each integer n let Πn : AI → A be the coordinate or samplingfunction Πn(xI) = xn. It is easy to show that Πn is a continuous functionand hence is also measurable from Lemma 2.1. Thus Π−1

n (F) ∈ B(AI) forall F ∈ B(A) and n ∈ N . This implies that all finite intersections of suchsets and hence all rectangles with coordinates in B(A)must be in B(AI).Since these rectangles generate B(A)I, we have shown that

B(A)I ⊂ B(AI).

Conversely, let Sr (x) be an open sphere in AI: Sr (x) = y : y ∈AI, dI(x,y) < r. Observe that for any n

dI(x,y) ≤n∑k=1

2−kd(xk,yk)

1+ d(xk,yk)+ 2−n

and hence we can write

Sr (x) =⋃

n:2−n<r

⋃dk:

∑nk=1 dk<r−2−n

n⋂k=0

Π−1k (a : dI(xk, a)

dk1− dk

),

where the union over the dk is restricted to rational dk. Every pieceof the union on the right-hand side is contained in the sphere on theleft, and every point in the sphere on the left is eventually includedin one of the terms in the union on the right for sufficiently large n.Thus all spheres in AI under dI can be written as countable unions ofof rectangles with coordinates in B(A), and hence these spheres are inB(A)I. Since AI is separable under dI all open sets in AN can be writ-ten as a countable union of spheres in AI using (1.18), and hence allopen sets in AI are also in B(A)I. Since these open sets generate B(AI),B(AI) ⊂ B(A)I, completing the proof. 2

Example 4.1. Integer Sequence Spaces. As a simple example of some ofthe ideas under consideration, consider the space of integer sequencesintroduced in Lemma 3.7. Let B = Z+ and A = N = BB , the space ofsequences with nonnegative integer values.

A discrete space such as A0 is a Polish space with respect to the Ham-ming metric dH of Example 1.1. This is trivial since the countable set of

Page 126: Probability Random Processes And Ergodic Properties

4.2 Subspaces of Polish Spaces 97

symbols is dense in itself and if a sequence an is Cauchy, then choos-ing ε < 1 implies that there is an integer N such that d(an,am) < 1 andhence an = am for all n, m ≥ N . Note that an open sphere Sε(a) in Bwith ε < 1 is just a point and that any subset of B is a countable unionof such open spheres. Thus in this case the Borel σ -field B(Z+) is justthe power set, the collection of all subsets of B.

The sequence space A is a metric space with the metric of Exam-ple 1.12, which with a coordinate Hamming distance becomes

dZ+(x,y) = 12

∞∑i=0

2−idH(xi,yi). (4.1)

Applying Lemma 4.1 (b) and (4.1) we have

B(N) = B(ZZ++ ) = B(Z+)Z+ . (4.2)

It is interesting to compare the structure used in Lemma 3.7 with theBorel or topological structure currently under consideration. Recall thatin Lemma 3.7 the σ -field B(Z+)Z+ was obtained as the σ -field generatedby the thin cylinders, the rectangles or thin cylinders having the formx : xn = an for some positive integer n and some n-tuple an ∈ Bn.For convenience abbreviate dB to d. Suppose we know that d(x,y) <1/2. Then necessarily x0 = y0 and hence both sequences must lie ina common thin cylinder, the thin cylinder u : u0 = x0. Similarly, ifd(x,y) < 1/4, then necessarily x2 = y2 and both sequences must againlie in a common thin cylinder, u : u2 = x2. In general, if d(x,y) < εand ε ≤ 2−(n+1), then xn = yn. Thus if x ∈ A and ε ≤ 2−(n+1), then

Sε(x) = y : yn = xn. (4.3)

This means that the thin cylinders are simply the open spheres in A!Thus the generating class used in Lemma 3.7 is really the same as thatused to obtain a Borel σ -field for A. In other words, not only are theσ -fields of (4.2) all equal, the collections of sets used to generate thesefields are the same.

4.2 Subspaces of Polish Spaces

The following result shows that certain subsets of Polish spaces arethemselves Polish, although in some cases we need to consider an equiv-alent metric rather than the original metric.

Lemma 4.2. Let A be a complete, separable metric space with respect toa metric d. If F = C , C a closed subset of A, then F is itself a complete,

Page 127: Probability Random Processes And Ergodic Properties

98 4 Standard Borel Spaces

separable metric space with respect to d. If F = O, where O is an opensubset of A, then F is itself a complete separable metric space with respectto a metric d′ such that d′-convergence and d-convergence are equivalenton F and hence d and d′ are equivalent metrics on F . In fact, the followingprovides such a metric:

d′(x,y) = d(x,y)+ | 1d(x,Oc)

− 1d(y,Oc)

|. (4.4)

If F is the intersection of a closed set C with an open set O of A, thenF is a complete separable metric space with respect to a metric d′ suchthat d′-convergence and d-convergence are equivalent on F , and hence dand d′ are equivalent metrics on F . Here, too, d′ of (4.4) provides such ametric.

Proof. Suppose that F is an arbitrary subset of a separable metric spaceA with metric d having the countable base VA = Vi; i = 1,2, . . . of(1.18). Define the class of sets VF = F ∩ Vi; i = 1,2, . . ., where weeliminate all empty sets. Let yi denote an arbitrary point in F ∩ Vi. Weclaim that the set of all yi is dense in F and hence F is separable underd. To see this let x ∈ F . Given ε there is a Vi of radius less than εwhich contains x and hence F

⋂Vi is not empty (since it at least contains

x ∈ F) and

d(x,yi) ≤ supy∈F∩Vi

d(x,y) ≤ supy∈Vi

d(x,y) ≤ 2ε.

Thus for any x in F we can find a yi ∈ VF that is arbitrarily close to it.Thus VF is dense in F and hence F is separable with respect to d. Theresulting metric space is not in general complete, however, because com-pleteness of A only ensures that Cauchy sequences converge to somepoint in A and hence a Cauchy sequence of points in F may convergeto a point not in F . If F is closed, however, it contains all of its limitpoints and hence F is complete with respect to the original metric. Thiscompletes the proof of the lemma for the case of a closed subset.

If F is an open set O, then we can find a metric that makes O com-plete by modifying the original metric to blow up near the boundary andhence prevent Cauchy sequences from converging to something not inO. In other words, we modify the metric so that if a sequence an ∈ O isCauchy with respect to d but converges to a point outside O, then thesequence will not be Cauchy with respect to d′. Define the metric d′ onO as in (4.4). Observe that by construction d ≤ d′, and hence conver-gence or Cauchy convergence under d′ implies the same under d. Fromthe triangle inequality, |d(an,Oc)−d(a,Oc)| ≤ d(an,a). Thus providedthe an and a are in O, d-convergence implies d′-convergence. Thus d-convergence and d′-convergence are equivalent on O and hence yield thesame open sets from Lemma 1.4. In addition, points in O are limit points

Page 128: Probability Random Processes And Ergodic Properties

4.2 Subspaces of Polish Spaces 99

under d if and only if they are limit points under d′ and hence O is sep-arable with respect to d′ since it is separable with respect to d. Since asequence Cauchy under d′ is also Cauchy under d, a d′-Cauchy sequencemust have a limit since A is complete with respect to d. This point mustalso be in O, however, since otherwise 1/d(an,Oc) → 1/d(a,Oc) = ∞and hence d′(an,am) could not be converging to zero and hence ancould not be Cauchy. (Recall that for such a double limit to tend to zero,it must so tend regardless of the manner in which m and n go to infin-ity.) Thus O is complete. Since O is complete and separable, it is Polish.

If F is the intersection of a closed set C and an open set O, then thepreceding metric d′ yields a Polish space with the desired properties forthe same reasons. In particular, d-convergence and d′-convergence areequivalent for the same reasons and a d′-Cauchy sequence must con-verge to a point that is in the intersection of C and O: it must be in Csince C is closed and it must be in O or d′ would blow up. 2

It is important to point out that d-convergence and d′-convergenceare only claimed to be equivalent within F ; that is, a sequence an in Fconverges to a point a in F under d if and only if it converges under d′.It is possible, however, for a sequence an to converge to a point a notin F under d and to not converge at all under d′ (the d′ distance is noteven defined for points not in F). In other words, F may not be a closedsubset of A with respect to d.

Example 4.2. Integer Sequence Spaces RevisitedWe return to the integer sequence space of Example 4.1 to consider

a simple but important special case where a subset of a Polish spaceis Polish. The following manipulation will resemble that of Lemma 3.4.Let A and d be as in that lemma and recall that A is Polish and has aBorel σ -field B(A) = B(B)B . Suppose that we have a countable collectionFn, n = 1,2, . . . of thin cylinders with union F =

⋃n Fn. Consider the

sequence space

A = A−∞⋃n=1

Fn =∞⋂n=1

Fcn

formed by removing the countable collection of thin cylinders from A.Since the thin cylinders are open sets, their complements are closed.Since an arbitrary intersection of closed sets is closed, A is a closed sub-set of A. Thus A is itself Polish with respect to the same metric, d.

The Borel σ -fields of the spaces A and A can be easily related byobserving as in Example 4.1 that an open sphere about a sequence ineither sequence space is simply a thin cylinder. Every thin cylinder in Ais either contained in or disjoint fromA−F . ThusB(A) is generated by allof the thin cylinders in A andB(A) is generated by the thin cylinders in Aand hence by all sets of the form x : xn = an∩(A−F), that is, by all thethin cylinders except those removed. Thus every open sphere in A is also

Page 129: Probability Random Processes And Ergodic Properties

100 4 Standard Borel Spaces

an open sphere in A and is contained in Fc ; thus B(A) ⊂ B(A)∩ (A− F).Similarly, every open sphere in A that is contained in F is also an opensphere in A, and hence the reverse inclusion holds. In summary,

B(A) = B(A)∩ (A− F). (4.5)

Carving

We now turn to a technique for carving up complete, separable metricspaces into pieces with a useful structure. From the previous lemma, wewill then also be able to carve up certain other sets in a similar manner.

Lemma 4.3. (The carving lemma.) Let A be a complete, separable metricspace. Given any real r > 0, there exists a countably infinite family ofBorel sets (sets in B(A)) Gn, n = 1,2, . . . with the following properties:

(a) The Gn partition A; that is, they are disjoint and their union is A.(b) Each Gn is the intersection of the closure of an open sphere and

an open set.(c) For all n, diam(Gn) ≤ 2r .

Proof. Let ai,1 = 0,1,2, . . . be a countable dense set in A. Since Ais separable, the family Vi, i = 0,1,2, . . . of the closures of the openspheres Si = x : d(x,ai) < r covers A; that is, its union is A. Inaddition, diam(Vi) = diam(Si) ≤ r .

The proof uses the construction of (3.11) in a manner similar to thatused in the proof of Lemma 1.17.

Construct the set Gi(1) as the set Vi with all sets of lower index re-moved and all empty sets omitted; that is,

G0(1) = V0

G1(1) = V1 − V0

...

Gn(1) = Vn −⋃i<nVi

= Vn −⋃i<nGi(1)

...

where we remove all empty sets from the sequence. The sets are clearlydisjoint and partition B. Furthermore, each set Gn(1) has the form

Gn(1) = Vn ∩ (⋂i<nVci );

Page 130: Probability Random Processes And Ergodic Properties

4.3 Polish Schemes 101

that is, it is the intersection of the closure of an open sphere with anopen set. Since Gn(1) is a subset of Vn, it has diameter less than r . Thisprovides the sequence with properties (a)–(c). 2

We are now ready to iterate on the previous lemmas to show thatwe can carve up Polish spaces into shrinking pieces that are also Polishspaces that can then be themselves carved up. This representation of aPolish space is called a scheme, and it is detailed in the next section.

Exercises

1. Prove the claims made in the first paragraph of this section.2. Let (Ω,B, P) be a probability space and let A = B, the collection of

all events. Two events are considered to be the same if the proba-bility of the symmetric difference is 0. Define a metric d on A as inExample 1.15. Is the corresponding Borel field a countably generatedσ -field? Is A a separable metric space under d?

3. Prove that in the Example 4.2 that all finite dimensional rectangles inA are open sets with respect to d. Since the complements of rectanglesare other rectangles, this means that all rectangles (including the thincylinders) are also closed. Thus the thin cylinders are both open andclosed sets. (These unusual properties occur because the coordinatealphabets are discrete.)

4.3 Polish Schemes

In this section we develop a representation of a Polish space that re-sembles an infinite version of the basis used to define standard spaces;the principal difference is that at each level we can now have an infiniterather than finite number of sets, but now the diameter of the sets mustbe decreasing as we descend through the levels. It will provide a canoni-cal representation of Polish spaces and will provide the key element forproving that Polish spaces yield standard Borel spaces.

Recall that Z+ =0,1,2, . . ., Zn+ is the corresponding finite dimen-sional Cartesian product, and N = ZZ+

+ is the one-sided infinite Cartesianproduct. Given a metric space A, a scheme is defined as a family of setsA(un), n = 1,2, . . ., indexed by un in Zn+ with the following properties:

1.A(un) ⊂ A(un−1) (4.6)

(the sets are decreasing),

Page 131: Probability Random Processes And Ergodic Properties

102 4 Standard Borel Spaces

A(un)∩A(vn) = ∅ if un 6= vn (4.7)

(the sets are disjoint),2.

limn→∞

diam(A(un)) = 0, all u = u0, u1, . . . ∈ N. (4.8)

3. If a decreasing sequence of sets A(un),n = 1,2, . . . are all nonempty,then the limit is not empty; that is,

A(un) 6= ∅ all n implies Au 6= ∅, (4.9)

where Au is defined by

Au =∞⋂n=1

A(un). (4.10)

The scheme is said to be measurable if all of the sets A(un) are Borelsets. The collection of all nonempty sets Au, u ∈ N, of the form (4.10)will be called the kernel of the scheme. We shall let A(u0) denote theoriginal space A.

Theorem 4.1. If A is a complete, separable metric space, then it is thekernel of a measurable scheme.

Proof. Construct the sequence Gn(1) as in the first part of the carvinglemma to satisfy properties (a)–(c) of that lemma. Choose a radius r of,say, 1. This produces the first level sets A(u1),u1 ∈ Z+. Each set thusproduced is the intersection of the d closure C(u1) of a d-open sphereand a d−open set O(u1) in A, and hence by Lemma 4.2 each set A(u1)is itself a Polish space with respect to a metric du1 that within A(u1)is equivalent to d. We shall also make use of the property of the spe-cific equivalent metric of (4.4) of that lemma that du1(a, b) ≤ d(a,b).Henceforth let du0 denote the original metric d. The closure of A(u1)with respect to the original metric d may not be in A(u1) itself, but it isin the set C(u1).

We then repeat the procedure on these sets by applying Lemma 4.2to each of the new Polish spaces with an r of 1/2 to obtain sets A(u2)that are intersections of a du1 closure C(u2) of an open sphere anda du1 -open set O(u2). Each of these sets is then itself a Polish spacewith respect to a metric du2(x,y) which is equivalent to the parent met-ric du1(x,y) within A(u2). In addition, du2(x,y) ≥ du1(x,y). SinceA(u2) ⊂ A(u1) and the parent metric du1 is equivalent to its parentmetric du0 within A(u1), du2 is equivalent to the original metric d withinA(u2).

A final crucial property is that the closure of A(u2) with respect toits parent metric du1 is contained within its parent set A(u1). This istrue since A(u2) = C(u2) ∩ O(u2) and C(u2) is a du1 -closed subset of

Page 132: Probability Random Processes And Ergodic Properties

4.3 Polish Schemes 103

A(u1) and therefore the du1 - closure of A(u2) lies within C(u2) andhence within A(u1).

We now continue in this way to construct succeeding levels of thescheme so that the preceding properties are retained at each level. Atlevel n, n = 1,2, . . ., we have sets

A(un) = C(un)∩O(un), (4.11)

where C(un) is a dun−1 -closed subset of A(un−1) and O(un) is a dun−1 -open subset of A(un−1). The set C(un) is the dun−1 -closure of an openball of the form

S(un−1, i) = S1/n(ci;un−1) = x : x ∈ A(un−1),dun−1(x, ci) < 1/n, i = 0,1,2, . . . . (4.12)

Each set A(un) is then itself a Polish space with metric

dun(x,y) =

dun−1(x,y)+∣∣∣∣∣ 1dun−1(x,O(un)c)

− 1dun−1(y,O(un)c)

∣∣∣∣∣ . (4.13)

We have by construction that for any un in Zn+

d ≤ du1 ≤ du2 ≤ . . . ≤ dun (4.14)

and hence since A(un) is in the closure of an open sphere of the form(4.12) having dun−1 radius 1/n, we have also that the d-diameter of thesets satisfies

diam(A(un)) ≤ 2/n. (4.15)

Furthermore, on A(un), dun (the complete metric) and dun−1 yield equiv-alent convergence and are equivalent metrics. Since each A(un) is con-tained in all of the Aui for i = 0, . . . , n− 1, this implies that

given ak ∈ A(un), k = 1,2, . . . , and a ∈ A(un), then

dun(ak,a) →k→∞

0 iff duj(ak, a) →k→∞

0 for all j = 0,1,2, . . . , n− 1. (4.16)

This sequence of sets satisfies conditions (4.6)–(4.8). To complete theproof of the theorem we need only show that (4.9) is satisfied.

If we can demonstrate that the d-closure in A of the set A(un) liescompletely in its parent set A(un−1), then (4.9) will follow immediatelyfrom Corollary 1.2. We next provide this demonstration.

Say we have a sequence a_k, k = 1,2,..., such that a_k ∈ A(u^n), all k, and a point a ∈ A such that d(a_k, a) → 0 as k → ∞. We will be done if we can show that necessarily a ∈ A(u^{n−1}). We accomplish this by assuming the contrary and developing a contradiction. Hence we assume that a_k → a in d, but a ∉ A(u^{n−1}). First observe from (4.14) and (4.12) that since all of the a_k are in the closure of a sphere of d_{u^{n−1}}-radius 1/n centered at, say, c, we must have that

d_{u^j}(a_k, c) ≤ 1/n,  j = 0,1,2,...,n−1; all k;        (4.17)

that is, the a_k must all be close to the center c ∈ A(u^{n−1}) of the sphere used to construct C(u^n) with respect to all of the previous metrics.

By assumption a ∉ A(u^{n−1}). Let l be the smallest index such that a ∉ A(u^l). Since the A(u^i) are nested, we then have that a ∈ A(u^i) for i < l. In addition, l ≥ 1 since A(u^0) is the entire space and hence clearly a ∈ A(u^0).

Since a ∈ A(u^{l−1}) and a_k ∈ A(u^n) ⊂ A(u^{l−1}), from (4.16) applied to A(u^{l−1}), d-convergence of a_k to a is equivalent to d_{u^{l−1}}-convergence. Since the set C(u^l) ⊂ A(u^{l−1}) used to form A(u^l) via A(u^l) = C(u^l) ∩ O(u^l) is closed under d_{u^{l−1}}, we must have that a ∈ C(u^l). Thus since

a ∉ A(u^l) = C(u^l) ∩ O(u^l)

and

a ∈ C(u^l),

then necessarily a ∈ O(u^l)^c. This, however, contradicts (4.17) for j = l since

d_{u^l}(a_k, c) = d_{u^{l−1}}(a_k, c) + | 1/d_{u^{l−1}}(a_k, O(u^l)^c) − 1/d_{u^{l−1}}(c, O(u^l)^c) |
≥ | 1/d_{u^{l−1}}(a_k, O(u^l)^c) − 1/d_{u^{l−1}}(c, O(u^l)^c) | → ∞ as k → ∞,

since

d_{u^{l−1}}(a_k, O(u^l)^c) = inf_{b ∈ O(u^l)^c} d_{u^{l−1}}(a_k, b) ≤ d_{u^{l−1}}(a_k, a) → 0 as k → ∞,

where the final inequality follows since a ∈ O(u^l)^c. Thus a cannot be in A(u^{l−1}). But this contradicts the assumption that l is the smallest index for which a ∉ A(u^l) and thereby completes the proof that the given sequence of sets is indeed a scheme.

Since at each stage every point in A must be in one of the sets A(u^n), if we define A(u^n)(x) as the set containing x, then

x = ⋂_{n=1}^∞ A(u^n)(x)        (4.18)

and hence every singleton set {x} is in the kernel of the scheme. Observe that many of the sets in the scheme may be empty and not all sequences u in N = Z_+^{Z_+} will produce a point via (4.18). Since every point in A is in the kernel of a measurable scheme, (4.18) gives a map f : A → N where f(x) is the u for which (4.18) is satisfied. This scheme can be thought of as a quantizer mapping a continuous space A into a discrete representation in the following sense: Suppose that we view a point a and we encode it into a sequence of integers u_n, n = 1,2,..., which are then viewed by a receiver. The receiver at time n uses the n integers to produce an estimate or reproduction of the original point, say g_n(u^n). The badness or distortion of the reproduction can be measured by d(a, g_n(u^n)). The goal is to decrease this distortion to 0 as n → ∞. The scheme provides such a code: Let the encoder at time n map x into the u^n for which x ∈ A(u^n) and let the decoder g_n(u^n) simply produce an arbitrary member of A(u^n) (if it is not empty). This distortion will be bounded above by diam(A(u^n)) ≤ 2/n. This produces the sequence encoder mapping f : A → N.
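To make the quantizer interpretation concrete, the following Python sketch (with illustrative names; it is not the construction used in the proof, which works with balls of d_{u^{n−1}}-radius 1/n in a general Polish space) encodes points of A = [0,1) by nested dyadic cells, whose diameters shrink to zero just as diam(A(u^n)) ≤ 2/n does.

# A toy version of the scheme-as-quantizer idea for the Polish space A = [0,1) with
# the usual metric d(x,y) = |x - y|.  Level n uses the dyadic cells [k 2^-n, (k+1) 2^-n),
# which are nested exactly as the sets A(u^n) are.  The encoder emits the cell labels,
# the decoder returns a point of the finest cell, and the distortion is bounded by the
# cell diameter.  Names and the choice of cells are illustrative only.

def encode(x: float, levels: int) -> list[int]:
    """Cell labels u_1, ..., u_levels of the dyadic cells containing x."""
    return [int(x * 2**n) for n in range(1, levels + 1)]

def decode(u: list[int]) -> float:
    """Reproduce a point (the midpoint) of the finest cell received."""
    n = len(u)
    return (u[-1] + 0.5) / 2**n

x = 0.637
for n in (1, 2, 5, 10, 20):
    x_hat = decode(encode(x, n))
    assert abs(x - x_hat) <= 2**-n   # distortion bounded by the level-n cell diameter
    print(n, x_hat)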

Unfortunately the mapping f is in general into and not onto. For example, if at level n for some n-tuple b^n ∈ Z_+^n the set A(b^n) is empty, then the encoding will produce no sequences that begin with b^n since there can never be an x in A(b^n). In other words, any sequence u with u^n = b^n cannot be of the form u = f(x) for any x ∈ A. Roughly speaking, not all of the integers are useful at each level since some of the cells may be empty. Alternatively, some integer sequences do not correspond to (they cannot be decoded into) any point in A. We pause to tackle the problem of empty cells.

Let N_1 denote all integers u_1 for which A(u_1) is empty. Let N_2 denote the collection of all integer pairs u^2 such that A(u^2) is empty but A(u_1) is not. That is, once we throw out an empty cell we no longer consider its descendants. Continue in this way to form the sets N_k consisting of all u^k for which A(u^k) is empty but A(u^{k−1}) is not. These sets are clearly all countable. Define next the space

N_0 = N − ⋃_{n=1}^∞ ⋃_{b^n ∈ N_n} {u : u^n = b^n} = ⋂_{n=1}^∞ ⋂_{b^n ∈ N_n} {u : u^n = b^n}^c.        (4.19)

Thus N_0 is formed by removing all thin cylinders of the form {u : u^n = b^n} for some b^n ∈ N_n, that is, all thin cylinders for which the first part of the sequence corresponds to an empty cell. This space has several convenient properties:

1. For all x ∈ A, f(x) ∈ N_0. This follows since if x ∈ A, then x ∈ A(u^n) for some u^n for every n; that is, x can never lie in an empty cell, and hence f(x) cannot be in the portion of N that was removed.

2. The set N_0 is a Borel set since it is formed by removing a countable number of thin cylinders from the entire space.

3. If we form a σ-field of N_0 in the usual way as B(N) ∩ N_0, then from Lemma 3.4 (or Lemma 3.8) the resulting measurable space (N_0, B(N) ∩ N_0) is standard.

4. From Examples 3.2.1 and 3.2.2 and (4.5), the Borel field B(N_0) (using d_{Z_+}) is the same as B(N) ∩ N_0. Thus with Property 3, (N_0, B(N_0)) is standard.

5. In the reduced space N_0, all of the sequences correspond to sequences of nonempty cells in A. Because of the properties of a scheme, any descending sequence of nonempty cells must contain a point, and if the integer sequences ever differ, the points must differ. Define the mapping g : N_0 → A by

g(u) = ⋂_{n=1}^∞ A(u^n).

Since every point in A can be encoded uniquely into some integer sequence in N_0, the mapping g must be onto.

6. N_0 = f(A); that is, N_0 is the range of f. To see this, observe first that Property 1 implies that f(A) ⊂ N_0. Property 5 implies that every point in N_0 must be of the form f(x) for some x ∈ A, and hence N_0 ⊂ f(A), thereby proving the claim. Note that with Property 2 this provides an example wherein the forward image f(A) of a Borel set is also a Borel set.

By construction we have for any x ∈ A that g(f(x)) = x and for any u ∈ N_0 that f(g(u)) = u. Thus if we define f : A → N_0 as the mapping f considered as having range N_0, then f is onto, the mapping g is just the inverse of the mapping f, and the mappings are one-to-one. If these mappings are measurable, then the spaces (A, B(A)) and (N_0, B(N_0)) are isomorphic. Thus if we can show that the mappings f and g are measurable, then (A, B(A)) will be proved standard by Lemma 3.5 and Property 4 because it is isomorphic to a standard space. We now prove this measurability and complete the proof.

First consider the mapping g : N_0 → A. N and hence the subspace N_0 are metric spaces under the product space metric d_{Z_+} using the Hamming metric of Example 1.1; that is,

d_{Z_+}(u, v) = Σ_{i=1}^∞ 2^{−i} d_H(u_i, v_i).

Given an ε > 0, find an n such that 2/n < ε and then choose a δ < 2^{−n}. If d_{Z_+}(u, v) < δ, then necessarily u^n = v^n, which means that both g(u) and g(v) must be in the same A(u^n) and hence d(g(u), g(v)) ≤ 2/n < ε. Thus g is continuous and hence B(N_0)-measurable.

Next consider f : A → N_0. Let G = c(v^n) denote the thin cylinder {u : u ∈ N, u^n = v^n} in N_0. Then by construction f^{−1}(G) = g(G) = A(v^n), which is a measurable set. From Example 3.2.2, however, such thin cylinders generate B(N_0). Thus f is measurable on a generating class and hence, by Lemma 2.4, f is measurable.

We have now proved the following result.

Theorem 4.2. If A is a Polish space, then the Borel space (A, B(A)) is standard.

Thus the real line, the unit interval, Euclidean space, and separable Hilbert spaces and Banach spaces are standard. We shall encounter several specific examples in the sequel.

The development of the theorem and some simple observations provide some subsidiary results that we now collect.

We saw that a Polish space A is the kernel of a measurable scheme and that the Borel space (A, B(A)) is isomorphic to (N_0, B(N_0)), a subspace of the space of integer sequences, using the metric d_{Z_+}. In addition, the inverse mapping in the construction is continuous. The isomorphism and its properties depended only on the fact that the set was the kernel of a measurable scheme. Thus if any set F in a metric space is the kernel of a measurable scheme, then it is isomorphic to some (N_0, B(N_0)), N_0 ⊂ N, with the metric d_{Z_+} on N_0, and the inverse mapping is continuous. We next show that this subspace, say N_0, is closed with respect to d_{Z_+}. If a_k → a and a_k ∈ N_0, then the first n coordinates of a_k must match those of a for sufficiently large k and hence be contained in the thin cylinder c(a^n) = {u : u^n = a^n}. But these sets are all nonempty for all n (or some a_k would not be in N_0), and hence a ∉ N_0^c and a ∈ N_0. Thus N_0 is closed. Since it is a closed subset of a Polish space, it is also Polish with respect to the same metric. Thus any kernel of a measurable scheme is isomorphic to a Polish space (N_0 in particular) with a continuous inverse.

Conversely, suppose that F is a subset of a Polish space and that (F, B(F)) is itself isomorphic to another Polish Borel space (B, B(B)) with metric m, and suppose the inverse mapping is continuous. Suppose that B(u^n) forms a measurable scheme for B. If f : F → B is the isomorphism and g : B → F the continuous inverse, then the sets A(u^n) defined by A(u^n) = f^{−1}(B(u^n)) = g(B(u^n)) satisfy the conditions (4.6) and (4.7) because inverse images preserve set theoretic operations. Since g is continuous, diam(B(u^n)) → 0 implies that also diam(A(u^n)) → 0. If the A(u^n) are nonempty for all n, then so are the B(u^n). Since B is Polish, they must have a nonempty intersection of exactly one point, say b, but then the intersection of the A(u^n) will be g(b) and hence nonempty. Thus F is the kernel of a measurable scheme. We have thus proved the following lemma.

Lemma 4.4. Let A be a Polish space and F a Borel set in A. Then F is the kernel of a measurable scheme if and only if there is an isomorphism with continuous inverse from (F, B(F)) to some Polish Borel space.

Corollary 4.1. Let (A, B(A)) be the Borel space of a Polish space A. Let G be the class of all Borel sets F with the property that F is the kernel of a measurable scheme. Then G is closed under countable disjoint unions and countable intersections.

Proof. If F_i, i = 1,2,..., are disjoint kernels of measurable schemes, then it is easy to see that the union ⋃_i F_i will also be the kernel of a measurable scheme. The cells of the first level are all of the first level cells for all of the F_i. They can be reindexed to have the required form. The descending levels follow exactly as in the individual F_i. Since the F_i are disjoint, so are the cells, and the remaining properties follow from those of the individual sets F_i.

Define the intersection F = ⋂_{i=1}^∞ F_i. From the lemma there are isomorphisms f_i : F_i → N_i with N_i being Polish subspaces of N which have continuous inverses g_i : N_i → F_i. Define the space

N_∞ = ×_{i=1}^∞ N_i

and a subset H of N_∞ containing all of those (z_1, z_2, ...), z_i ∈ N_i, such that g_1(z_1) = g_2(z_2) = ..., and define a mapping g : H → F by g(z_1, z_2, ...) = g_1(z_1) (which equals g_n(z_n) for all n). Since the g_n are continuous, H is a closed subset of N_∞. Since H is a closed subset of N_∞ and since N_∞ is Polish from Lemma 4.1, H is also Polish. g is continuous since the g_n are, hence it is also measurable. It is also one-to-one, with its inverse f : F → H defined by f(a) = (f_1(a), f_2(a), ...). Since the mappings f_n are measurable, so is f. Thus F is isomorphic to a Polish space and has a continuous inverse and hence is the kernel of a measurable scheme. □

Theorem 4.3. If F is a Borel set of a Polish space A, then the Borel space (F, B(F)) is standard.

Comment It should be emphasized that the set F might not itself be Polish.

Proof. From the previous corollary the collection of all Borel sets that are kernels of measurable schemes is closed under countable disjoint unions and countable intersections. First, we know that each such set is isomorphic to a Polish Borel space by Lemma 4.4 and that each of these Polish spaces is standard by Theorem 4.2. Hence each set in this collection is isomorphic to a standard space and so by Lemma 3.5 is itself standard. To complete the proof we need to show that our collection includes all the Borel sets. Consider the closed subsets of A. Lemma 4.2 implies that each such subset is itself a Polish space and hence, from Theorem 4.1, is the kernel of a measurable scheme. Hence our collection contains all closed subsets of A. Since it is also closed under countable disjoint unions and countable intersections, by Lemma 1.2 it contains all the Borel sets. Since our collection is itself a subset of the Borel sets, the collection must be exactly the Borel σ-field. □

This theorem gives the most general known class of standard Borel spaces. The development here departs from the traditional and elegant development of Parthasarathy [82] in a variety of ways, although the structure and use of schemes strongly resembles his use of analytic sets. The use of schemes allowed us to map a Polish space directly onto a subset of a standard space that was easily seen to be both a measurable set and standard itself. By showing that all Borel sets had such schemes, the general result was obtained. The traditional method is instead to map the original space into a subset of a known standard space that is then shown to be a Borel set and to possess a subset isomorphic to the full standard space. By showing that a set sandwiched between two isomorphic spaces is itself isomorphic, the result is proved. As a side result it is shown that all Polish spaces are actually isomorphic to N or, equivalently, to M. This development perhaps adds insight, but it requires the additional and difficult proof that the one-to-one image of a Borel set is itself a Borel set: a result that we avoided by mapping onto a subset of N formed by simply removing a countable number of thin cylinders from it. While the approach adopted here does not yield the rich harvest of related results of Parthasarathy, it is a simpler and more direct route to the desired result. Alternative but often similar developments of the properties of standard Borel spaces may be found in Schwartz [94], Christensen [12], Bourbaki [8], Cohn [15], Mackey [68], and Bjornsson [6].

4.4 Product Measures

Product measures on Borel subsets of R^k provide an example of the machinery developed thus far. We could consider more general Polish spaces, but R provides a concrete example. Let (R, B(R)) be the Borel measurable space of the real line, which has been shown to be a standard measurable space. Let A ∈ B(R). The measurable space (A, B(A)), where B(A) = {all F ∩ A, F ∈ B(R)}, has also been shown to be a standard measurable space. Let P_1 be a probability measure on the space.

Next consider the product measurable space (A^k, B(A)^k), where B(A)^k is the σ-field generated by the rectangles. Recall from Lemma 4.1(b) that since A is Polish, B(A)^k = B(A^k), that is, the σ-field generated by the rectangles is the same as the Borel σ-field generated by the open sets with respect to the Euclidean metric. Since (A^k, B(A)^k) is the product of a standard measurable space, it is standard. Define a set function P^k on rectangles F = ×_{i=1}^k F_i, F_i ∈ B(A), by

P^k(×_{i=1}^k F_i) = ∏_{i=1}^k P_1(F_i).

This set function immediately extends to the field F_A^k generated by the rectangles using Lemma 2.2 since every F ∈ F_A^k has the form F = ⋃_{n=1}^N F_n, where the F_n = ×_{i=1}^k F_{n,i} are disjoint rectangles:

P^k(F) = Σ_{n=1}^N P^k(F_n) = Σ_{n=1}^N ∏_{i=1}^k P_1(F_{n,i}).        (4.20)

This set function is obviously nonnegative and it is finitely additive by construction since the union of unions of disjoint rectangles is itself a union of disjoint rectangles. Hence, since the measurable space is standard, the measure on the field extends to the σ-field generated by the field, and hence there is a unique probability measure P^k on (A^k, B(A)^k) agreeing with its values on the rectangles. This probability measure is called a product measure and is a generalization of the discrete-alphabet Bernoulli processes already considered.
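As a concrete illustration of (4.20), the following Python sketch computes P^k of a finite disjoint union of rectangles when P_1 is the Borel measure λ on [0,1); the function names and the example set are hypothetical.

# A sketch of the product set function P^k of (4.20) for the special case
# P_1 = Borel measure lambda on [0,1).  A rectangle gets the product of its
# coordinate measures, and a finite union of *disjoint* rectangles gets the
# sum of the rectangle measures.  All names are illustrative, not from the text.

def P1(interval):
    """lambda([a,b)) = b - a for [a,b) intersected with [0,1)."""
    a, b = interval
    return max(0.0, min(b, 1.0) - max(a, 0.0))

def Pk_rectangle(rect):
    """Measure of a rectangle given as a list of k intervals (one per coordinate)."""
    out = 1.0
    for interval in rect:
        out *= P1(interval)
    return out

def Pk_union(rects):
    """Measure of a finite union of disjoint rectangles, as in (4.20)."""
    return sum(Pk_rectangle(r) for r in rects)

# P^2 of the L-shaped set [0,1/2) x [0,1)  union  [1/2,1) x [0,1/2)
print(Pk_union([[(0.0, 0.5), (0.0, 1.0)],
                [(0.5, 1.0), (0.0, 0.5)]]))   # 0.5 + 0.25 = 0.75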

As an example, let (A, B(A), P_1) = ([0,1), B([0,1)), λ), with λ([a,b)) = b − a for all 0 ≤ a < b < 1, that is, the Borel (or uncompleted Lebesgue) measure on [0,1). Then the product measure λ^k on ([0,1)^k, B([0,1)^k)) is well defined. The product measure λ^k extended to the product measurable space is called the Borel measure on [0,1)^k and the completion is called the Lebesgue measure on [0,1)^k. (As mentioned before, the name “Lebesgue measure” in the literature sometimes refers to the uncompleted measure.)

As in the case with k = 1, essentially the same arguments can be used to derive the Borel or Lebesgue measure λ^k on ([a,b)^k, B([a,b)^k)) for any finite a, b and the Borel or Lebesgue measure on (R^k, B(R^k)).

4.5 IID Random Processes and B-processes

The construction of the previous section together with the Kolmogorov extension theorem provides the continuous-alphabet analog of the discrete-time IID processes. Again suppose that (R, B(R)) is the Borel measurable space of the real line and A ∈ B(R), and let µ_1 be a probability measure on the measurable space (A, B(A)). Let T be Z or Z_+ and consider the product measurable space (A^T, B(A)^T). The goal is a probability measure µ on this product space of sequences which will be a process distribution for a Kolmogorov directly given random process {X_n; n ∈ T}, where X_n is the coordinate function on A^T.

For any finite index set I ⊂ T, define the product measure µ^I on the product measurable space (A^I, B(A)^I) exactly as in the previous section; that is, define the set function on rectangles, extend it to the field generated by the rectangles, and then extend it to (A^I, B(A)^I). This provides a distribution for any finite collection of samples {X_n; n ∈ I} from the random process {X_n; n ∈ T}. These distributions are all consistent since rectangles of smaller dimension can be formed by simply substituting the alphabet A for one or more of the coordinate events of the higher dimensional rectangles. Thus the conditions of the Kolmogorov extension theorem are met and there exists a unique process distribution agreeing with the distributions for finite-dimensional sample vectors. A random process having identical product distributions for all possible n-dimensional sample vectors is said to be independent, identically distributed (IID), generalizing the discrete-alphabet Bernoulli processes already considered in Section 3.8.

For example, ([0,1), B([0,1)), λ_1) with λ_1 the Borel measure on [0,1) could be the scalar probability space used to produce an IID process with uniformly distributed outputs X_n. Alternatively, the process could be constructed using the real line as an alphabet, using the scalar space (R, B(R), µ_1) with µ_1(F) = λ_1(F ∩ [0,1)) for all F ∈ B(R).

As with IID discrete-alphabet processes, these memoryless processes can be used to construct new processes by stationary mappings, either coding (with discrete outputs) or filtering (with continuous outputs). The discrete-alphabet B-process definition immediately extends to the real-valued alphabet case: a B-process is defined as any random process that is equivalent to a process formed by a stationary mapping of an IID process. As an important example, suppose that {X_n; n ∈ T} is an IID uniform process and that h_k, k = 0,...,K, are finite real numbers. Define the time-invariant filter mapping an input sequence x into an output symbol

y_0 = f(x) = g(x_{−K}, x_{−K+1}, ..., x_{−1}, x_0) = Σ_{k=0}^K h_k x_{−k}.

Then the output at time n will be

y_n = f(T^n x) = g(x_{n−K}, x_{n−K+1}, ..., x_{n−1}, x_n) = Σ_{k=0}^K h_k x_{n−k},

or in terms of the random processes,

Y_n = Σ_{k=0}^K h_k X_{n−k}.

This is an example of a linear time-invariant (LTI) filter, sometimes called a moving-average filter. We include only a finite number of terms to avoid the technicalities of convergence in the general case, but the basic idea does not change. Driving an LTI filter with an IID continuous-alphabet process produces a B-process.
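A simulation sketch of such a B-process, with arbitrary illustrative filter taps h_k, is given below; it simply draws an IID uniform sequence and forms the moving average Y_n = Σ_k h_k X_{n−k}.

# A sketch of a B-process: an IID sequence of uniform random variables on [0,1)
# passed through the finite moving-average (LTI) filter Y_n = sum_k h_k X_{n-k}.
# The filter taps below are arbitrary illustrative values.
import random

h = [0.5, 0.3, 0.2]                      # h_0, ..., h_K with K = 2
K = len(h) - 1

random.seed(1)
N = 10_000
x = [random.random() for _ in range(N)]  # IID uniform process X_n

# Output defined for n >= K so that all needed past inputs exist.
y = [sum(h[k] * x[n - k] for k in range(K + 1)) for n in range(K, N)]

# The output mean should be (sum_k h_k) * E[X_0] = 1.0 * 0.5 = 0.5.
print(sum(y) / len(y))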

4.6 Standard Spaces vs. Lebesgue Spaces

A natural question is how standard spaces and Lebesgue spaces are related. As has been mentioned, Lebesgue spaces are the ubiquitous model in ergodic theory, but standard spaces are emphasized in this book because they capture the key properties of standard Borel spaces for many of the results developed herein. This section highlights similarities and differences between the two types of probability space.

Suppose that a probability space (Ω, B, P) is a Lebesgue space and hence

• P can be expressed as a mixture of two measures P = αP_1 + (1−α)P_2, where α ∈ [0,1],
• (Ω, B, P_1) is isomorphic (mod 0) to the Lebesgue measure space on [0,1), and
• P_2 is purely discrete, that is, there is an at most countable collection of points Ω_2 = {a_i, i = 1,2,...} with P_2(Ω_2) = 1.

Isomorphism mod 0 means that there are sets Ω_1 ∈ B, Λ ∈ B([0,1)) for which P_1(Ω_1) = λ(Λ) = 1 and there is an isomorphism φ : Ω_1 → Λ, that is, φ is one-to-one and both φ and its inverse φ^{−1} are measurable.

Lebesgue spaces are in a sense more general since they are complete while in general standard measure spaces are not; in particular, their σ-fields contain more sets. But the additional sets do not change the probabilities of events with nonzero probabilities, they simply include as events all subsets of events with 0 probability.

The Lebesgue measure space ([0,1), B, λ) is the completion of the Borel measure space ([0,1), B([0,1)), λ), where λ of any interval is the length of the interval and the measurable space ([0,1), B([0,1))) is standard. Since the Lebesgue measure space of the unit interval is the completion of a particular standard Borel measure space (which is a standard measurable space), the continuous part of any Lebesgue space is isomorphic mod 0 to the completion of a standard Borel measure space. The discrete part of any Lebesgue space is countable and hence standard (and standard Borel). Hence any Lebesgue space is a mixture of a standard measure space and a measure space isomorphic mod 0 to the completion of a particular standard Borel space. In this sense Lebesgue spaces can be considered as being less general than standard measurable spaces since they are a mixture of spaces isomorphic mod 0 to the completion of a particular standard measure space, the Borel measure space of the unit interval.

On the other hand, Borel spaces are not really more general in the following sense, which we state without proof (the result follows, e.g., from results of Parthasarathy [82]): the completion of a standard Borel space is a Lebesgue space. Thus Lebesgue spaces can be viewed as simply completed standard Borel measure spaces. The two spaces are intimately connected and share many properties. In general the standard Borel measurable space has the countable additivity property while the Lebesgue measurable space does not, but the Lebesgue space is complete and the Borel measure space is not.

The key fact for this book is that standard measure spaces have the countable additivity property, that is, any finitely additive set function uniquely extends to a countably additive set function, and hence it will be enough to construct candidate probability measures that are finitely additive. This property is not possessed by Lebesgue spaces in general because null sets with respect to one set function may not be null sets with respect to another; hence including all subsets of a null set with respect to one set function in the σ-field can cause problems for extending another set function which does not assign zero probability to the original null set. Standard measure spaces usually have smaller σ-fields than do Lebesgue spaces (they will be the same if the sample space is discrete), but this eases extension of candidate probability measures to true probability measures.


Chapter 5

Averages

Abstract Probabilistic averages or expectations are developed along with time averages, and the two types of averages are seen to share many fundamental properties. The underlying connections between the two are explored in detail in the chapters on ergodic properties (Chapter 7) and ergodic theorems (Chapter 8).

5.1 Discrete Measurements

We begin by focusing on measurements that take on only a finite number of possible values. This simplifies the definitions and proofs of the basic properties and provides a fundamental first step for the more general results. Let (Ω, B, m) be a probability space, e.g., the probability space (A^I, B_A^I, m) of a directly given random process with alphabet A. A real-valued random variable f : Ω → R will also be called a measurement since it is often formed by taking a mapping or function of some other set of more general random variables, e.g., the outputs of some random process that might not have real-valued outputs. Measurements made on such processes, however, will always be assumed to be real.

Suppose next we have a measurement f whose range space or alphabet f(Ω) ⊂ R of possible values is finite. Then f is called a discrete random variable or discrete measurement or digital measurement or, in the common mathematical terminology, a simple function. Such discrete measurements are of fundamental mathematical importance because many basic results are easy to prove for simple functions and, as we shall see, all real random variables can be expressed as limits of such functions. In addition, discrete measurements are good models for many important physical measurements. No human can distinguish all real numbers on a meter, nor is any meter perfectly accurate. Hence, in a sense, all physical measurements have for all practical purposes a finite (albeit possibly quite large) number of possible readings. Finally, many measurements of interest in communications systems are inherently discrete: binary computer data and ASCII codes being ubiquitous examples.

Given a discrete measurement f, suppose that its range space is f(Ω) = {b_i, i = 1,...,N}, where the b_i are distinct. Define the sets F_i = f^{−1}(b_i) = {x : f(x) = b_i}, i = 1,...,N. Since f is measurable, the F_i are all members of B. Since the b_i are distinct, the F_i are disjoint. Since every input point in Ω must map into some b_i, the union of the F_i equals Ω. Thus the collection {F_i; i = 1,2,...,N} forms a partition of Ω. We have therefore shown that any discrete measurement f can be expressed in the form

f(x) = Σ_{i=1}^M b_i 1_{F_i}(x),        (5.1)

where b_i ∈ R, the F_i ∈ B form a partition of Ω, and 1_{F_i} is the indicator function of F_i, i = 1,...,M. Every simple function has a unique representation in this form except possibly for sets that differ on a set of measure 0.

The expectation or ensemble average or probabilistic average or mean of a discrete measurement f : Ω → R as in (5.1) with respect to a probability measure m is defined by

E_m f = Σ_{i=1}^M b_i m(F_i).        (5.2)

The following simple lemma shows that the expectation of a discrete measurement is well defined.

Lemma 5.1. Given a discrete measurement (measurable simple function) f : Ω → R defined on a probability space (Ω, B, m), then expectation is well defined; that is, if

f(x) = Σ_{i=1}^N b_i 1_{F_i}(x) = Σ_{j=1}^M a_j 1_{G_j}(x),

then

E_m f = Σ_{i=1}^N b_i m(F_i) = Σ_{j=1}^M a_j m(G_j).

Proof. Given the two partitions {F_i} and {G_j}, form the new partition (called the join of the two partitions) {F_i ∩ G_j} and observe that

f(x) = Σ_{i=1}^N Σ_{j=1}^M c_{ij} 1_{F_i ∩ G_j}(x),

where c_{ij} = b_i = a_j whenever F_i ∩ G_j is nonempty. Thus

Σ_{i=1}^N b_i m(F_i) = Σ_{i=1}^N b_i (Σ_{j=1}^M m(G_j ∩ F_i)) = Σ_{i=1}^N Σ_{j=1}^M b_i m(G_j ∩ F_i)
= Σ_{i=1}^N Σ_{j=1}^M c_{ij} m(G_j ∩ F_i) = Σ_{j=1}^M Σ_{i=1}^N a_j m(G_j ∩ F_i)
= Σ_{j=1}^M a_j (Σ_{i=1}^N m(G_j ∩ F_i)) = Σ_{j=1}^M a_j m(G_j). □

An immediate consequence of the definition of expectation is the simple but useful fact that for any event F in the original probability space,

E_m 1_F = m(F);        (5.3)

that is, probabilities can be found from expectations of indicator functions.
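The definitions (5.1)–(5.3) can be exercised numerically on a small finite probability space; the sample space, measure, and partition in the Python sketch below are made-up examples.

# A sketch of (5.1)-(5.3) on a finite sample space: a simple function is a list of
# (value b_i, cell F_i) pairs over a partition, and its expectation is
# E_m f = sum_i b_i m(F_i).
from fractions import Fraction as Fr

omega = ["a", "b", "c", "d"]
m = {"a": Fr(1, 2), "b": Fr(1, 4), "c": Fr(1, 8), "d": Fr(1, 8)}

def measure(F):
    return sum(m[w] for w in F)

def expectation(simple):
    """simple = list of (b_i, F_i) with the F_i a partition of omega."""
    return sum(b * measure(F) for b, F in simple)

f = [(3, {"a", "b"}), (-1, {"c"}), (0, {"d"})]
print(expectation(f))                      # 3*(3/4) - 1*(1/8) = 17/8

# (5.3): the expectation of an indicator recovers the probability of the event.
F = {"b", "c"}
indicator = [(1, F), (0, set(omega) - F)]
print(expectation(indicator), measure(F))  # both 3/8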

The next lemma collects the basic properties of expectations of discrete measurements. These properties will later be seen to hold for general expectations, and similar ones will be seen to hold for time averages. Thus they can be viewed as fundamental properties of averages.

Lemma 5.2. (a) If f ≥ 0 with probability 1, then E_m f ≥ 0.

(b) E_m 1 = 1.

(c) Expectation is linear; that is, for any real α, β and any discrete measurements f and g,

E_m(αf + βg) = αE_m f + βE_m g.

(d) If f is a discrete measurement, then

|E_m f| ≤ E_m|f|.

(e) Given two discrete measurements f and g for which f ≥ g with probability 1 (m({x : f(x) ≥ g(x)}) = 1), then

E_m f ≥ E_m g.

Comments: The first three statements bear a strong resemblance to the first three axioms of probability and will be seen to follow directly from those axioms: since probabilities are nonnegative, normalized, and additive, expectations inherit similar properties. A limiting version of additivity will be postponed. Properties (d) and (e) are useful basic integration inequalities. The following proofs are all simple, but they provide some useful practice manipulating integrals of discrete functions.

Proof. (a) If f ≥ 0 with probability 1, then b_i ≥ 0 in (5.1) and E_m f ≥ 0 from (5.2) and the nonnegativity of probability.

(b) Since 1 = 1_Ω, E_m 1 = 1 · m(Ω) = 1 and the property follows from the normalization of probability.

(c) Suppose that

f = Σ_i b_i 1_{F_i},  g = Σ_j a_j 1_{G_j}.

Again consider the join of the two partitions and write

f = Σ_i Σ_j b_i 1_{F_i ∩ G_j},
g = Σ_i Σ_j a_j 1_{F_i ∩ G_j}.

Then

αf + βg = Σ_i Σ_j (αb_i + βa_j) 1_{F_i ∩ G_j}

is a discrete measurement with the partition {F_i ∩ G_j} and hence

E_m(αf + βg) = Σ_i Σ_j (αb_i + βa_j) m(F_i ∩ G_j)
= α Σ_i Σ_j b_i m(F_i ∩ G_j) + β Σ_i Σ_j a_j m(F_i ∩ G_j)
= α Σ_i b_i m(F_i) + β Σ_j a_j m(G_j),

from the additivity of probability. This proves the claim.

(d) Let J denote those i for which the b_i in (5.1) are nonnegative. Then

E_m f = Σ_i b_i m(F_i) = Σ_{i∈J} |b_i| m(F_i) − Σ_{i∉J} |b_i| m(F_i) ≤ Σ_i |b_i| m(F_i) = E_m|f|.

Applying the same bound to −f gives −E_m f ≤ E_m|f|, and hence |E_m f| ≤ E_m|f|.

(e) This follows from (a) applied to f − g. □

In the next section we introduce the notion of quantizing arbitrary real measurements into discrete approximations in order to extend the definition of expectation to such general measurements by limiting arguments.


5.2 Quantization

Again let (Ω, B, m) be a probability space and f : Ω → R a measurement, that is, a real-valued random variable or measurable real-valued function. The following lemma shows that any measurement can be expressed as a limit of discrete measurements and that if the measurement is nonnegative, then the simple functions converge upward.

Lemma 5.3. Given a real random variable f, define the sequence of quantizers q_n : R → R, n = 1,2,..., as follows:

q_n(r) =
  n,                r ≥ n
  (k−1)2^{−n},      (k−1)2^{−n} ≤ r < k2^{−n},  k = 1,2,...,n2^n
  −(k−1)2^{−n},     −k2^{−n} ≤ r < −(k−1)2^{−n},  k = 1,2,...,n2^n
  −n,               r < −n.

The sequence of quantized measurements has the following properties:

(a) f(x) = lim_{n→∞} q_n(f(x)), all x.

(b) If f(x) ≥ 0, all x, then f ≥ q_{n+1}(f) ≥ q_n(f), all n; that is, q_n(f) ↑ f. More generally, |f| ≥ |q_{n+1}(f)| ≥ |q_n(f)|.

(c) If f ≥ g ≥ 0, then also q_n(f) ≥ q_n(g) ≥ 0 for all n.

(d) The magnitude quantization error |q_n(f) − f| satisfies

|q_n(f) − f| ≤ |f|        (5.4)
|q_n(f) − f| ↓ 0;        (5.5)

that is, the magnitude error is monotonically nonincreasing and converges to 0.

(e) If f is G-measurable for some sub-σ-field G, then so are the q_n(f). Thus if f is G-measurable, then there is a sequence of G-measurable simple functions which converges to f.

Proof. Statements (a)–(c) are obvious from the construction. To prove (d) observe that if x is fixed, then either f(x) is nonnegative or it is negative. If it is nonnegative, then so is q_n(f(x)) and |q_n(f) − f| = f − q_n(f). This is bounded above by f and decreases to 0 since q_n(f) ↑ f from part (b). If f(x) is negative, then so is q_n(f) and |q_n(f) − f| = q_n(f) − f. This is bounded above by −f = |f| and decreases to 0 since q_n(f) ↓ f from part (b). To prove (e), suppose that f is G-measurable and hence that inverse images of all Borel sets under f are in G. If q_n, n = 1,2,..., is the quantizer sequence, then the functions q_n(f) are also G-measurable (and hence also B(R)-measurable) since the inverse images satisfy

q_n(f)^{−1}(F) = {ω : q_n(f)(ω) ∈ F} = {ω : q_n(f(ω)) ∈ F} = {ω : f(ω) ∈ q_n^{−1}(F)} = f^{−1}(q_n^{−1}(F)) ∈ G

since all inverse images under f of events F in B(R) are in G. □

Note further that if |f| ≤ K for some finite K, then for n > K, |f(x) − q_n(f(x))| ≤ 2^{−n}.
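The quantizers q_n are easy to implement directly; the following Python sketch (which resolves the boundary ties of the definition by rounding toward zero) checks properties (a), (b), and (d) and the 2^{−n} bound just noted on a sample value.

# A transcription (up to boundary ties) of the quantizer sequence q_n of Lemma 5.3,
# with numerical checks that |q_n(r)| is nondecreasing, the error is bounded by |r|,
# and the error drops below 2^{-n} once n exceeds |r|.

def q(n: int, r: float) -> float:
    if r >= n:
        return float(n)
    if r < -n:
        return float(-n)
    k = int(abs(r) * 2**n)               # k 2^{-n} <= |r| < (k+1) 2^{-n}
    level = k / 2**n
    return level if r >= 0 else -level

r = -3.14159
prev = 0.0
for n in range(1, 12):
    qn = q(n, r)
    assert abs(qn) >= abs(prev)          # property (b): |q_n(f)| nondecreasing
    assert abs(qn - r) <= abs(r)         # property (d): error bounded by |r|
    if n > abs(r):
        assert abs(qn - r) <= 2**-n      # the remark following the proof
    prev = qn
print(q(11, r), r)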

Measurability

As a prerequisite to the study of averages, we pause to consider some special properties of real-valued random variables. We will often be interested in the limiting properties of a sequence f_n of integrable functions. Of particular importance is the measurability of limits of measurements and conditions under which we can interchange limits and expectations. In order to state the principal convergence results, we require first a few definitions. Given a sequence a_n, n = 1,2,..., of real numbers, define the limit supremum or upper limit or limit superior

lim sup_{n→∞} a_n = inf_n sup_{k≥n} a_k

and the limit infimum or lower limit or limit inferior

lim inf_{n→∞} a_n = sup_n inf_{k≥n} a_k.

In other words, lim sup_{n→∞} a_n = a if a is the largest number for which a_n has a subsequence, say a_{n(k)}, k = 1,2,..., converging to a. Similarly, lim inf_{n→∞} a_n is the smallest limit attainable by a subsequence. When clear from context, we shall often drop the n → ∞ and simply write lim sup a_n or lim inf a_n. The limit of a sequence a_n, say lim_{n→∞} a_n = a, will exist if and only if lim sup a_n = lim inf a_n = a, that is, if all subsequences converge to the same thing, the limit of the sequence.

Lemma 5.4. Given a sequence of measurements f_n, n = 1,2,..., the upper and lower limits, f̄ = lim sup_{n→∞} f_n and f̲ = lim inf_{n→∞} f_n, are also measurements, that is, are also measurable functions. If lim_{n→∞} f_n exists, then it is a measurement. If f and g are measurements, then so are the functions f + g, f − g, |f − g|, min(f, g), and max(f, g); that is, all these functions are measurable.

Proof. To prove a real-valued function f : Ω → R measurable, it suffices from Lemma 2.4 to show that f^{−1}(F) ∈ B for all F in a generating class. From Chapter 4, R is generated by the countable collection of open spheres of the form {x : |x − r_i| < 1/n}, i = 1,2,..., n = 1,2,..., where the r_i run through the rational numbers. More simply, we can generate the Borel σ-field of R from the class of all sets of the form (−∞, r_i] = {x : x ≤ r_i}, with the same r_i. This is because the σ-field containing these sets also contains the open intervals, since it contains the half-open intervals

(a, b] = (−∞, b] − (−∞, a]

with rational endpoints, the rationals themselves:

{a} = ⋂_{n=1}^∞ (a − 1/n, a],

and hence all open intervals with rational endpoints:

(a, b) = (a, b] − {b}.

Since this includes the spheres described previously, this class generates the σ-field.

Consider the limit supremum f̄. For f̄ ≤ r we must have for any ε > 0 that f_n ≤ r + ε for all but a finite number of n. Define

F_n(ε) = {x : f_n(x) ≤ r + ε}.

Then the collection of x for which f_n(x) ≤ r + ε for all but a finite number of n is

F(ε) = ⋃_{n=1}^∞ ⋂_{k=n}^∞ F_k(ε).

This set is called the limit infimum of the F_n(ε) in analogy to the notion for sequences. As this type of set will occasionally be useful, we pause to formalize the definition: Given a sequence of sets F_n, n = 1,2,..., define the limit infimum of the sequence of sets by

F = lim inf_{n→∞} F_n = ⋃_{n=1}^∞ ⋂_{k=n}^∞ F_k = {x : x ∈ F_n for all but a finite number of n}.        (5.6)

Since the f_n are measurable, F_n(ε) ∈ B for all n and ε, and hence F(ε) ∈ B. Since we must be in F(ε) for all ε for f̄ ≤ r,

F = {x : f̄(x) ≤ r} = ⋂_{n=1}^∞ F(1/n).

Since the F(1/n) are all measurable, so is F. This proves the measurability of f̄. That for f̲ follows similarly. If the limit exists, then f̄ = f̲ and measurability follows from the preceding argument. The remaining results are easy if the measurements are discrete. The conclusions then follow from this fact and Lemma 5.3. □

Exercises

1. Fill in the details of the preceding proof; that is, show that sums, differences, minima, and maxima of measurable functions are measurable.

2. Show that if f and g are measurable, then so is the product fg. Under what conditions is the division f/g measurable?

3. Prove the following corollary to the lemma:

Corollary 5.1. Let f_n be a sequence of measurements defined on (Ω, B, m). Define F as the set of x ∈ Ω for which lim_{n→∞} f_n(x) exists; then 1_F is a measurement and F ∈ B.

4. Is it true that if F is the limit infimum of a sequence F_n of events, then

1_F(x) = lim inf_{n→∞} 1_{F_n}(x)?

5.3 Expectation

We now define expectation for general measurements in two steps. If f ≥ 0, then choose as in Lemma 5.3 the sequence of asymptotically accurate quantizers q_n and define

E_m f = lim_{n→∞} E_m(q_n(f)).        (5.7)

Since the q_n are discrete measurements on R, the q_n(f) are discrete measurements on Ω (q_n(f)(x) = q_n(f(x)) is a simple function), and hence the individual expectations are well defined. Since the q_n(f) are nondecreasing, so are the E_m(q_n(f)) from Lemma 5.2(e). Thus the sequence must either converge to a finite limit or grow without bound, in which case we say it converges to ∞. In both cases the expectation E_m f is well defined, although it may be infinite. It is easily seen that if f is a discrete measurement, then (5.7) is consistent with the previous definition (5.2).

If f is an arbitrary real random variable, define its positive and negative parts f^+(x) = max(f(x), 0) and f^−(x) = −min(f(x), 0), so that f(x) = f^+(x) − f^−(x), and set

E_m f = E_m f^+ − E_m f^−        (5.8)

provided this does not have the form +∞ − ∞, in which case the expectation does not exist. Observe that an expectation exists for all nonnegative random variables, but that it may be infinite. The following lemma provides an alternative definition of expectation as well as a limiting theorem that shows that there is nothing magic about the particular sequence of discrete measurements used in the definition.

Lemma 5.5. (a) If f ≥ 0, then

E_m f = sup_{discrete g: g ≤ f} E_m g.

(b) Suppose that f ≥ 0 and f_n is any nondecreasing sequence of discrete measurements for which lim_{n→∞} f_n(x) = f(x); then

E_m f = lim_{n→∞} E_m f_n.

Comments: Part (a) states that the expected value of a nonnegative measurement is the supremum of the expected values of discrete measurements that are no greater than f. This property is often used as a definition of expectation. Part (b) states that any nondecreasing sequence of asymptotically accurate discrete approximations to f can be used to evaluate the expectation. This is a special case of the monotone convergence theorem to be considered shortly.

Proof. Since the q_n(f) are all discrete measurements less than or equal to f, clearly

E_m f = lim_{n→∞} E_m q_n(f) ≤ sup_g E_m g,

where the supremum is over all discrete measurements less than or equal to f. For ε > 0 choose a discrete measurement g with g ≤ f and

E_m g ≥ sup_g E_m g − ε.

From Lemmas 5.3 and 5.2

E_m f = lim_{n→∞} E_m q_n(f) ≥ lim_{n→∞} E_m q_n(g) = E_m g ≥ sup_g E_m g − ε.

Since ε is arbitrary, this completes the proof of (a).

To prove (b), use (a) to choose a discrete function g = Σ_i g_i 1_{G_i} that nearly gives E_m f; that is, fix ε > 0 and choose g ≤ f so that

E_m g = Σ_i g_i m(G_i) ≥ E_m f − ε/2.

Define for all n the set

F_n = {x : f_n(x) ≥ g(x) − ε/2}.

Since g(x) ≤ f(x) and f_n(x) ↑ f(x) by assumption, F_n ↑ Ω, and hence the continuity of probability from below implies that for any event H,

lim_{n→∞} m(F_n ∩ H) = m(H).        (5.9)

Clearly 1 ≥ 1_{F_n}, and for each x ∈ F_n, f_n(x) ≥ g(x) − ε/2. Thus

f_n ≥ 1_{F_n} f_n ≥ 1_{F_n}(g − ε/2),

and all three measurements are discrete. Application of Lemma 5.2(e) therefore gives

E_m f_n ≥ E_m(1_{F_n} f_n) ≥ E_m(1_{F_n}(g − ε/2)),

which, from the linearity of expectation for discrete measurements (Lemma 5.2(c)), implies that

E_m f_n ≥ E_m(1_{F_n} g) − (ε/2) m(F_n) ≥ E_m(1_{F_n} g) − ε/2
= E_m(1_{F_n} Σ_i g_i 1_{G_i}) − ε/2 = E_m(Σ_i g_i 1_{G_i ∩ F_n}) − ε/2
= Σ_i g_i m(F_n ∩ G_i) − ε/2.

Taking the limit as n → ∞, using (5.9) for each of the finite number of probabilities in the sum,

lim_{n→∞} E_m f_n ≥ Σ_i g_i m(G_i) − ε/2 = E_m g − ε/2 ≥ E_m f − ε,

which completes the proof. □

With the preceding alternative characterizations of expectation in hand, we can now generalize the fundamental properties of Lemma 5.2 from discrete measurements to more general measurements.

Lemma 5.6. (a) If f ≥ 0 with probability 1, then E_m f ≥ 0.

(b) E_m 1 = 1.

(c) Expectation is linear; that is, for any real α, β and any measurements f and g,

E_m(αf + βg) = αE_m f + βE_m g.

(d) E_m f exists and is finite if and only if E_m|f| is finite, and

|E_m f| ≤ E_m|f|.

(e) Given two measurements f and g for which f ≥ g with probability 1, then

E_m f ≥ E_m g.

Proof. (a) follows by defining q_n(f) as in Lemma 5.3 and applying Lemma 5.2(a) to conclude E_m(q_n(f)) ≥ 0, which in the limit proves (a). (b) follows from Lemma 5.2(b). To prove (c), first consider nonnegative functions by using Lemma 5.2(c) to conclude that

E_m(αq_n(f) + βq_n(g)) = αE_m(q_n(f)) + βE_m(q_n(g)).

The right-hand side converges upward to αE_m f + βE_m g. By the linearity of limits, the sequence αq_n(f) + βq_n(g) converges upward to αf + βg, and hence, from Lemma 5.5, the left-hand side converges to E_m(αf + βg), proving the result for nonnegative functions. Note that it was to prove this piece that we first considered Lemma 5.5 and hence focused on a nontrivial limiting property before proving the fundamental properties of this lemma. The result for general measurements then follows by breaking the functions into positive and negative parts and using the nonnegative measurement result.

Part (d) follows from the fact that f = f^+ − f^− and |f| = f^+ + f^−. The expectation E_m|f| exists and is finite if and only if both positive and negative parts have finite expectations, this implying that E_m f exists and is also finite. Conversely, if E_m f exists and is finite, then both positive and negative parts must have finite integrals, this implying that E_m|f| must be finite. If the integrals exist, clearly

E_m|f| = E_m f^+ + E_m f^− ≥ E_m f^+ ≥ E_m f^+ − E_m f^− = E_m f.

Part (e) follows immediately by application of part (a) to f − g. □

Convex Functions and Jensen’s Inequality

The linearity of expectation has a trivial, but important, special case. Suppose that X is a random variable defined on (Ω, B, m) and that f(x) = ax is a simple linear function of its argument for some scalar a. Then E(f(X)) = E(aX) = aE(X) = f(EX), so that for this very special case we can interchange expectation and the function f. Clearly in general it is not possible to perform this exchange; for example, for nontrivial distributions E(e^X) ≠ e^{EX}. We can make the exchange only for linear functions. Unfortunately most interesting functions are, like the exponential, not linear. There is, however, an important class of functions that generalize linear functions for which a simple and powerful bound relates the expected value of a function to the function of the expected value. We pause to describe the function and prove the inequality since the techniques of the previous lemma are key to the proof.

A real-valued random variable f on a linear vector space Ω is said to be convex if

f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y);  x, y ∈ Ω, θ ∈ (0,1).        (5.10)

The random variable is said to be concave if −f is convex. In some of the literature convex is referred to as convex up or convex ∪ and concave is referred to as convex down or convex ∩. Convexity can be thought of as subadditivity: as θ goes from 0 to 1, θx + (1−θ)y traces a straight line from x to y. For each point on this line, the function evaluated at the point lies below the corresponding linear combination of the values of the function at the end points, θf(x) + (1−θ)f(y). Many interesting functions in information theory and signal processing are convex, and there is a wide literature on the constrained optimization of such functions (see, e.g., [9]).

Lemma 5.7. Given a real-valued random variable X defined on a probability space (Ω, B, m) and a convex function f, then

f(E_m X) ≤ E_m[f(X)].        (5.11)

Proof. First consider the case where X is a discrete random variable with only two possible values x_i, i = 0,1, with probabilities p_0 = θ, p_1 = 1−θ. In this case the definition of convexity reduces to (5.11). Next suppose that X is discrete with an arbitrary finite alphabet x_i, i = 0,1,...,K−1. Jensen's inequality then follows by induction from the binary case. The general case then follows by a limiting argument as in the proof of linearity of expectation. □
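A quick Monte Carlo check of (5.11) with the convex function f(x) = e^x and a Gaussian X is sketched below in Python; the distribution and sample size are arbitrary illustrative choices.

# A numerical sanity check of Jensen's inequality (5.11) for the convex function
# f(x) = exp(x), using a crude Monte Carlo estimate of the expectations.
import math, random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(200_000)]   # X ~ N(0,1)

mean_x = sum(xs) / len(xs)
mean_fx = sum(math.exp(x) for x in xs) / len(xs)

print(math.exp(mean_x), mean_fx)       # roughly 1.0 versus roughly exp(1/2) ~ 1.65
assert math.exp(mean_x) <= mean_fx     # f(E X) <= E f(X) for convex f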

Young’s Inequality

We consider as an example the one special case needed later in this chapter for the derivation of other inequalities. Suppose that r > 0 and f(x) = r^x. To show that f is convex it is necessary to show that

θf(x) + (1−θ)f(y) − f(θx + (1−θ)y) = θr^x + (1−θ)r^y − r^{θx} r^{(1−θ)y}

is nonnegative for all real x, y and θ ∈ (0,1). Dividing both sides by r^y, this is equivalent to proving that θr^{x−y} + (1−θ) − r^{θ(x−y)} is nonnegative for all x, y and θ ∈ (0,1) or, equivalently, that

γ(z) = θz + (1−θ) − z^θ

is nonnegative for all positive z and θ ∈ (0,1). This, however, follows by a simple calculus minimization: setting the derivative of γ(z) equal to zero shows a stationary point at z = 1 with γ(1) = 0, and the stationary point is a unique minimum since the second derivative of γ(z) is positive for θ ∈ (0,1) and positive z. Convexity of r^z implies that

r^{θx} r^{(1−θ)y} ≤ θr^x + (1−θ)r^y.

If we choose α = r^x, β = r^y, this yields the inequality

α^θ β^{1−θ} ≤ θα + (1−θ)β.        (5.12)

Alternatively, setting a = α^θ, b = β^{1−θ}, the inequality becomes, for nonnegative real a, b and θ ∈ (0,1),

ab ≤ θa^{1/θ} + (1−θ)b^{1/(1−θ)},        (5.13)

which is known as Young's inequality.
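Young's inequality is easy to spot-check numerically; the following Python sketch verifies (5.13) on a small grid of nonnegative a, b and θ ∈ (0,1) chosen purely for illustration.

# A numeric spot check of Young's inequality (5.13): for nonnegative a, b and
# theta in (0,1), ab <= theta * a**(1/theta) + (1-theta) * b**(1/(1-theta)).
import itertools

values = [0.0, 0.1, 0.5, 1.0, 2.0, 10.0]
thetas = [0.1, 0.25, 0.5, 0.75, 0.9]

for a, b, theta in itertools.product(values, values, thetas):
    lhs = a * b
    rhs = theta * a ** (1 / theta) + (1 - theta) * b ** (1 / (1 - theta))
    assert lhs <= rhs + 1e-12, (a, b, theta)   # small tolerance for equality cases
print("Young's inequality verified on the test grid")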

Integration

The expectation is also called an integral and is denoted by any of the following:

E_m f = ∫ f dm = ∫ f(x) dm(x) = ∫ f(x) m(dx).

The subscript m denoting the measure with respect to which the expectation is taken will occasionally be omitted if it is clear from context. The development for expectations assumed probability measures, but all of the methods and results remain true in the more general case of an integral where m need not be normalized.

We will continue to equate measurable functions with measurements, but we will reserve the name “random variable” for the case of probability spaces with probability measures. A measurement f is said to be integrable or m-integrable if ∫ f dm exists and is finite. Define L_1(m) to be the space of all m-integrable functions. Given any m-integrable f and an event B, define

∫_B f dm = ∫ f(x) 1_B(x) dm(x).

Two measurable functions f and g are said to be equal m-almost-everywhere or equal m-a.e. if m(f ≠ g) = m({x : f(x) ≠ g(x)}) = 0. If m is a probability measure, then we also say that f = g with m-probability one since m(f = g) = m({x : f(x) = g(x)}) = 1. The m- is dropped if it is clear from context.

Sums

Sums fit nicely into the integration framework. Suppose that the sample space is Z_+ and the σ-field is the collection B of all subsets of Z_+. Define the measure of points in the space by m(k) = a_k ≥ 0 and define the measure of any event by

m(F) = Σ_{k∈F} a_k.

If m(Ω) = Σ_{k∈Z_+} a_k = 1, then m is a (discrete) probability measure. The integral of a measurement f then becomes

∫ f dm = Σ_{k∈Z_+} f(k) a_k.

For example, if a_k = 1 for all k, this becomes

∫ f dm = Σ_{k∈Z_+} f(k).        (5.14)

5.4 Limits

The emphasis now returns to probability measures and expectations, but unless stated otherwise the appropriate extensions to integrals follow in the same manner. The following corollary to the definition of expectation and its linearity characterizes the expectation of an arbitrary integrable random variable as the limit of the expectations of the simple measurements formed by quantizing the random variables. This is an example of an integral limit theorem, a result giving conditions under which the integral and limit can be interchanged.

Corollary 5.2. Let q_n be the quantizer sequence of Lemma 5.3. If f is m-integrable, then

E_m f = lim_{n→∞} E_m q_n(f)

and

lim_{n→∞} E_m|f − q_n(f)| = 0.

Proof. If f is nonnegative, the two results are equivalent and follow immediately from the definition of expectation and the linearity of Lemma 5.6(c). For general f, q_n(f) = q_n(f^+) − q_n(f^−) with q_n(f^+) ↑ f^+ and q_n(f^−) ↑ f^− by construction, and hence it follows from the nonnegative-f result and the linearity of expectation that

E|f − q_n(f)| = E(f^+ − q_n(f^+)) + E(f^− − q_n(f^−)) → 0 as n → ∞

and

E(f − q_n(f)) = E(f^+ − q_n(f^+)) − E(f^− − q_n(f^−)) → 0 as n → ∞. □

The corollary can be used to prove that expectations have continuity properties with respect to the underlying probability measure:

Corollary 5.3. If f is m-integrable, then

lim_{m(F)→0} ∫_F f dm = 0;

that is, given ε > 0 there is a δ > 0 such that if m(F) < δ, then

|∫_F f dm| < ε.

Proof. If f is a simple function of the form (5.1), then

∫_F f dm = E(1_F Σ_{i=1}^M b_i 1_{F_i}) = Σ_{i=1}^M b_i E(1_{F∩F_i}) = Σ_{i=1}^M b_i m(F ∩ F_i),

so that

|∫_F f dm| ≤ Σ_{i=1}^M |b_i| m(F) → 0 as m(F) → 0.

If f is integrable, let q_n be the preceding quantizer sequence. Given ε > 0, use the previous corollary to choose k large enough to ensure that

E_m|f − q_k(f)| < ε.

From the preceding result for simple functions,

|∫_F f dm| = |∫ f 1_F dm − ∫ q_k(f) 1_F dm + ∫ q_k(f) 1_F dm|
≤ |∫ (f − q_k(f)) 1_F dm| + |∫ q_k(f) 1_F dm|
≤ ∫ |f − q_k(f)| 1_F dm + |∫_F q_k(f) dm|
≤ ε + |∫_F q_k(f) dm| → ε as m(F) → 0.

Since ε is arbitrary, the result is proved. □

The preceding proof is typical of many results involving expectations or, more generally, integrals: the result is easy for discrete measurements and it is extended to general functions via limiting arguments. This continuity property has two immediate consequences, the proofs of which are left as exercises.

Corollary 5.4. If f is nonnegative m-a.e., m-integrable, and E_m f ≠ 0, then the set function m defined by

m(F) = ∫_F f dm

is a measure, and the set function P defined by

P(F) = ∫_F f dm / E_m f

is a probability measure.

Corollary 5.5. If f is m-integrable, then

lim_{r→∞} ∫_{f≥r} f dm = 0.

Corollary 5.6. Suppose that f ≥ 0 m-a.e. Then ∫ f dm = 0 if and only if f = 0 m-a.e.

Corollary 5.6 shows that we can consider L_1(m) to be a metric space with respect to the metric d(f, g) = ∫ |f − g| dm if we consider measurements f and g identical when m(f ≠ g) = 0.

5.5 Inequalities

Probability theory contains many important inequalities, results which bound probabilities and distortions in situations where evaluations may be tedious or impossible. In this section we provide the most important and useful of these inequalities.

Given a real-valued random variable f with distribution m and p ≥ 1, define

‖f‖_p = (E_m(|f|^p))^{1/p}.        (5.15)

Lemma 5.8. Inequalities

(a) The Markov InequalityGiven a random variable f with f ≥ 0 m-a.e., then for a > 0

m(f ≥ a) ≤ Emfa.

Furthermore, for p > 0

m(f ≥ a) ≤ Em(fp)

ap.

(b) Tchebychev’s InequalityGiven a random variable f with expected value Ef , then if a > 0,

m(|f − Ef | ≥ a) ≤ Em(|f − Ef |2)

a2.

(c) If E(f 2) = 0, the m(f = 0) = 1.(d) The Cauchy-Schwarz Inequality

Given random variables f and g, then

|E(fg)| ≤ E(f 2)1/2E(g2)1/2.

(e) Hölder’s Inequality(A generalization of the Cauchy-Schwarz inequality.) Suppose thatp,q ≥ 1 and that

1p+ 1q= 1.

Then‖fg‖1 ≤ ‖f‖p‖g‖q

or

E(|fg|) ≤ (E(|f |p)1p(E(|g|q)

) 1q .

(f) Minkowski’s InequalitySuppose that p ≥ 1 and f , g ∈ Lp. Then

Page 160: Probability Random Processes And Ergodic Properties

132 5 Averages

‖f + g‖p ≤ ‖f‖p + ‖g‖p,

and hence ‖ · ‖ is a norm on Lp (if one identifies functions that areequal a.e.).

Proof. (a) Let 1_{f≥a} denote the indicator function of the event {x : f(x) ≥ a}. We have that

Ef = E(f(1_{f≥a} + 1_{f<a})) = E(f 1_{f≥a}) + E(f 1_{f<a}) ≥ aE1_{f≥a} = am(f ≥ a),

proving Markov's inequality. The generalization follows by substituting f^p for f and using the fact that for p > 0, f^p ≥ a^p if and only if f ≥ a.

(b) Tchebychev's inequality follows by replacing f by |f − Ef|^2 and a by a^2.

(c) From the general Markov inequality, if E(f^2) = 0, then for any ε > 0,

m(f^2 ≥ ε^2) ≤ E(f^2)/ε^2 = 0,

and hence m(|f| ≤ 1/n) = 1 for all n. Since {ω : |f(ω)| ≤ 1/n} ↓ {ω : f(ω) = 0}, from the continuity of probability, m(f = 0) = 1.

(d) To prove the Cauchy-Schwarz inequality set a = E(f^2)^{1/2} and b = E(g^2)^{1/2}. If a = 0, then f = 0 a.e. from (c) and the inequality is trivial. Hence we can assume that both a and b are nonzero. In this case

0 ≤ E[(f/a ± g/b)^2] = 2 ± 2 E(fg)/(ab),

so that |E(fg)| ≤ ab, as claimed.

(e) Applying Young's inequality (5.13) with θ = 1/p, 1 − θ = 1/q, a = |f|/‖f‖_p, b = |g|/‖g‖_q, and integrating,

∫ (|f|/‖f‖_p)(|g|/‖g‖_q) dm ≤ ∫ [ θ|f|^p/‖f‖_p^p + (1−θ)|g|^q/‖g‖_q^q ] dm = θ + 1 − θ = 1,

which yields E(|fg|) ≤ ‖f‖_p ‖g‖_q.

(f) First we show that if f, g ∈ L_p, then also f + g ∈ L_p. Using the simple inequality

max(|f|, |g|) ≤ |f| + |g| ≤ 2 max(|f|, |g|)

it follows for any p > 0 that

|f + g|^p ≤ (|f| + |g|)^p ≤ (2 max(|f|, |g|))^p = max(2^p|f|^p, 2^p|g|^p) ≤ 2^p(|f|^p + |g|^p).        (5.16)

Thus the integrability of the right-hand side ensures that of the left. The result is immediate if ‖f + g‖_p = 0, so assume it is not. Since

|f(x) + g(x)|^p = |f(x) + g(x)| × |f(x) + g(x)|^{p−1}
≤ |f(x)| × |f(x) + g(x)|^{p−1} + |g(x)| × |f(x) + g(x)|^{p−1},

Hölder's inequality can be applied to each term to obtain

‖f + g‖_p^p = ∫ |f + g|^p dm ≤ ‖f‖_p ‖f + g‖_p^{p−1} + ‖g‖_p ‖f + g‖_p^{p−1}.

Dividing both sides by ‖f + g‖_p^{p−1} completes the proof of Minkowski's inequality. □
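The Markov and Tchebychev bounds can be checked empirically; the Python sketch below uses exponential samples (an arbitrary illustrative choice) and compares the empirical tail probabilities with the bounds computed from the empirical mean and variance, for which the inequalities hold exactly.

# A quick empirical check of the Markov and Tchebychev bounds of Lemma 5.8 for an
# exponential(1) random variable, estimated by Monte Carlo.  Parameters are
# illustrative only.
import random

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(200_000)]   # f >= 0, E f = 1, var f = 1
n = len(xs)
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

a = 3.0
markov_lhs = sum(x >= a for x in xs) / n              # m(f >= a)
tcheby_lhs = sum(abs(x - mean) >= a for x in xs) / n  # m(|f - Ef| >= a)

print(markov_lhs, mean / a)      # ~0.05  <=  ~0.33
print(tcheby_lhs, var / a**2)    # ~0.018 <=  ~0.11
assert markov_lhs <= mean / a
assert tcheby_lhs <= var / a**2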

For ease of reference, we state Minkowski's inequality for sums. It follows from the fact that sums are special cases of integrals.

Corollary 5.7. Given two sequences of nonnegative numbers {a_i}, {b_i} and p ≥ 1,

(Σ_i (a_i + b_i)^p)^{1/p} ≤ (Σ_i a_i^p)^{1/p} + (Σ_i b_i^p)^{1/p}.        (5.17)

Uniform Integrability

A sequence f_n is said to be uniformly integrable if the limit of Corollary 5.5 holds uniformly in n; that is, if

lim_{r→∞} ∫_{|f_n|≥r} |f_n| dm = 0

uniformly in n, i.e., if given ε > 0 there exists an R such that

∫_{|f_n|≥r} |f_n| dm < ε,  all r ≥ R and all n.

Alternatively, the sequence f_n is uniformly integrable if

lim_{r→∞} sup_n ∫_{|f_n|≥r} |f_n| dm = 0.

The following lemma collects several examples of uniformly integrable sequences.

Lemma 5.9. If there is an integrable function g (e.g., a constant) for which |f_n| ≤ g, all n, then the f_n are uniformly integrable. If a > 0 and sup_n E(|f_n|^{1+a}) < ∞, then the f_n are uniformly integrable. If |f_n| ≤ |g_n| for all n and the g_n are uniformly integrable, then so are the f_n. If a sequence f_n^2 is uniformly integrable, then so is the sequence f_n.


Proof. The first statement is immediate from the definition and from Corollary 5.4. The second follows from Markov's inequality. The third example follows directly from the definition. The fourth example follows from the definition and the Cauchy-Schwarz inequality. □

5.6 Integrating to the Limit

The following lemma collects the basic asymptotic integration results on integrating to the limit.

Lemma 5.10. (a) Monotone Convergence Theorem
If f_n, n = 1,2,..., is a sequence of measurements such that f_n ↑ f and f_n ≥ 0, m-a.e., then

lim_{n→∞} ∫ f_n dm = ∫ f dm.        (5.18)

In particular, if lim_{n→∞} ∫ f_n dm < ∞, then ∫ (lim_{n→∞} f_n) dm exists and is finite.

(b) Extended Monotone Convergence Theorem
If there is a measurement h such that

∫ h dm > −∞        (5.19)

and if f_n, n = 1,2,..., is a sequence of measurements such that f_n ↑ f and f_n ≥ h, m-a.e., then (5.18) holds. Similarly, if there is a function h such that

∫ h dm < ∞        (5.20)

and if f_n, n = 1,2,..., is a sequence of measurements such that f_n ↓ f and f_n ≤ h, m-a.e., then (5.18) holds.

(c) Fatou's Lemma
If f_n is a sequence of measurements and f_n ≥ h, where h satisfies (5.19), then

∫ (lim inf_{n→∞} f_n) dm ≤ lim inf_{n→∞} ∫ f_n dm.        (5.21)

If f_n is a sequence of measurements and f_n ≤ h, where h satisfies (5.20), then

lim sup_{n→∞} ∫ f_n dm ≤ ∫ (lim sup_{n→∞} f_n) dm.        (5.22)

(d) Extended Fatou's Lemma
If f_n, n = 1,2,..., is a sequence of uniformly integrable functions, then

∫ (lim inf_{n→∞} f_n) dm ≤ lim inf_{n→∞} ∫ f_n dm ≤ lim sup_{n→∞} ∫ f_n dm ≤ ∫ (lim sup_{n→∞} f_n) dm.        (5.23)

Thus, for example, if the limit of a sequence f_n of uniformly integrable functions exists a.e., then that limit is integrable and

∫ (lim_{n→∞} f_n) dm = lim_{n→∞} ∫ f_n dm.

(e) Dominated Convergence Theorem
If f_n, n = 1,2,..., is a sequence of measurable functions that converges almost everywhere to f and if h is an integrable function for which |f_n| ≤ h, then f is integrable and (5.18) holds.

Proof. (a) Part (a) follows in exactly the same manner as the proof of Lemma 5.5(b) except that the measurements f_n need not be discrete and Lemma 5.6 is invoked instead of Lemma 5.2.

(b) If ∫ h dm is infinite, then so are ∫ f_n dm and ∫ f dm. Hence assume it is finite. Then f_n − h ≥ 0 and f_n − h ↑ f − h and the result follows from (a). The result for decreasing functions follows by a similar manipulation.

(c) Define g_n = inf_{k≥n} f_k and g = lim inf_{n→∞} f_n. Then g_n ≥ h for all n, h satisfies (5.19), and g_n ↑ g. Thus from the fact that g_n ≤ f_n and part (b)

∫ f_n dm ≥ ∫ g_n dm ↑ ∫ g dm = ∫ (lim inf_{n→∞} f_n) dm,

which proves (5.21). Eq. (5.22) follows in like manner.

(d) For any r > 0,

∫ f_n dm = ∫_{f_n < −r} f_n dm + ∫_{f_n ≥ −r} f_n dm.

Since the f_n are uniformly integrable, given ε > 0, r can be chosen large enough to ensure that the first integral on the right has magnitude less than ε. Thus

∫ f_n dm ≥ −ε + ∫ 1_{f_n ≥ −r} f_n dm.

The integrand on the right is bound below by −r, which is integrable, and hence from part (c)

lim inf_{n→∞} ∫ f_n dm ≥ −ε + ∫ (lim inf_{n→∞} 1_{f_n ≥ −r} f_n) dm
                       ≥ −ε + ∫ (lim inf_{n→∞} f_n) dm,

where the last inequality follows since 1_{f_n ≥ −r} f_n ≥ f_n. Since ε is arbitrary, this proves the limit infimum part of (5.23). The limit supremum part follows similarly.

(e) Follows immediately from (d) and Lemma 5.9. 2
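The following numerical sketch (an illustration of my own, not from the text) approximates Lebesgue measure on (0,1] by a fine grid and contrasts the monotone convergence theorem, where f_n = min(f, n) ↑ f = x^{-1/2} and the integrals increase to ∫ f dm = 2, with a non-uniformly-integrable sequence g_n = n·1_{(0,1/n]} for which Fatou's inequality (5.21) is strict: lim inf ∫ g_n dm = 1 > 0 = ∫ (lim inf g_n) dm.

```python
import numpy as np

x = (np.arange(1_000_000) + 0.5) / 1_000_000     # midpoint grid approximating Lebesgue measure on (0,1]
dm = 1.0 / len(x)
integral = lambda fvals: float(np.sum(fvals) * dm)

f = 1.0 / np.sqrt(x)                              # integrable, with integral 2

# Monotone convergence: f_n = min(f, n) increases to f and the integrals increase to 2
for n in (1, 10, 100, 10_000):
    print(f"MCT  n={n:>6}: integral of min(f, n) = {integral(np.minimum(f, n)):.4f}")

# Strict Fatou: g_n = n on (0, 1/n], else 0; g_n -> 0 pointwise but each integral equals 1
for n in (10, 1000, 100_000):
    g_n = np.where(x <= 1.0 / n, float(n), 0.0)
    print(f"Fatou n={n:>6}: integral of g_n = {integral(g_n):.4f}  (pointwise limit integrates to 0)")
```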

Application of the monotone convergence theorem to the special caseof sums using (5.14) yields the following simple corollary, which will beuseful later.

Corollary 5.8. Suppose that f_n is a sequence of measurable functions such that f_n ↑ f. Then

lim_{n→∞} Σ_{k=1}^∞ f_n(k) = Σ_{k=1}^∞ lim_{n→∞} f_n(k) = Σ_{k=1}^∞ f(k),

that is, the infinite sum can be evaluated term-by-term or, in other words, the limit over n and the infinite summation can be interchanged.

To complete this section we present several additional properties ofintegration and expectation that will be used later. The proofs are simpleand provide practice with manipulating the preceding fundamental re-sults. The first result provides a characterization of uniformly integrablefunctions.

Lemma 5.11. A sequence of integrable functions f_n is uniformly integrable if and only if the following two conditions are satisfied: (i) The integrals ∫ |f_n| dm are bounded; that is, there is some finite K that bounds all of these expectations from above. (ii) The continuity in the sense of Corollary 5.5 of the integrals of f_n is uniform; that is, given ε > 0 there is a δ > 0 such that if m(F) < δ, then

∫_F |f_n| dm < ε, all n.

Proof. First assume that (i) and (ii) hold. From Markov's inequality and (i) it follows that

m(|f_n| ≥ r) ≤ r^{-1} ∫ |f_n| dm ≤ K/r →_{r→∞} 0

uniformly in n. This plus uniform continuity then implies uniform integrability. Conversely, assume that the f_n are uniformly integrable. Fix ε > 0 and choose r so large that

∫_{|f_n| ≥ r} |f_n| dm < ε/2, all n.

Then

∫ |f_n| dm = ∫_{|f_n| < r} |f_n| dm + ∫_{|f_n| ≥ r} |f_n| dm ≤ ε/2 + r

uniformly in n, proving (i). Choosing F so that m(F) < ε/(2r),

∫_F |f_n| dm = ∫_{F ∩ {|f_n| ≥ r}} |f_n| dm + ∫_{F ∩ {|f_n| < r}} |f_n| dm
            ≤ ∫_{|f_n| ≥ r} |f_n| dm + r m(F)
            ≤ ε/2 + ε/2 = ε,

proving (ii) since ε is arbitrary. 2

Corollary 5.9. If a sequence of random variables f_n is uniformly integrable, then so is the sequence of arithmetic means

n^{-1} Σ_{i=0}^{n−1} f_i.

Proof. If the f_n are uniformly integrable, then there exists a finite K such that E(|f_n|) < K, and hence

E( | (1/n) Σ_{i=0}^{n−1} f_i | ) ≤ (1/n) Σ_{i=0}^{n−1} E|f_i| < K.

The integrals of the f_n are also uniformly continuous, and hence given ε there is a δ such that if m(F) ≤ δ, then

∫_F |f_i| dm ≤ ε, all i,

and hence

∫_F | n^{-1} Σ_{i=0}^{n−1} f_i | dm ≤ n^{-1} Σ_{i=0}^{n−1} ∫_F |f_i| dm ≤ ε, all n.

Applying the previous lemma then proves the corollary. 2

The final lemma of this section collects a few additional journeyman results on expectation. The proofs are simple and provide practice with manipulating the preceding fundamental results; they are left as exercises.

Lemma 5.12. (a) Given two integrable functions f and g, if

∫_F f dm ≤ ∫_F g dm, all events F,

then f ≤ g, m-a.e. If the preceding relation holds with equality, then f = g m-a.e.

(b) (Change of Variables Formula) Given a measurable mapping f from a standard probability space (Ω, S, P) to a standard measurable space (Γ, B) and a real-valued random variable g : Γ → R, let Pf^{-1} denote the probability measure on (Γ, B) defined by Pf^{-1}(F) = P(f^{-1}F), all F ∈ B. Then

∫ g(y) dPf^{-1}(y) = ∫ g(f(x)) dP(x);

that is, if either integral exists so does the other and they are equal.

(c) Chain Rule
Consider a probability measure P defined as in Corollary 5.4 as

P(F) = ( ∫_F f dm ) / E_m f

for some m-integrable f with nonzero expectation. If g is P-integrable, then

E_P g = E_m(gf) / E_m f.

Exercises

1. Prove that (5.7) yields (5.2) as a special case if f is a discrete measurement.
2. Prove Corollary 5.4.
3. Prove Corollary 5.5.
4. Prove Corollary 5.6.
5. Prove Lemma 5.12.
   Hint for proof of (b) and (c): first consider simple functions and then take limits.
6. Show that for a nonnegative random variable X

   Σ_{n=1}^∞ Pr(X ≥ n) ≤ EX.

7. Verify the following relations:

   m(F∆G) = E_m[(1_F − 1_G)^2]
   m(F ∩ G) = E_m(1_F 1_G)
   m(F ∪ G) = E_m[max(1_F, 1_G)].

8. Verify the following:

   m(F ∩ G) ≤ √(m(F) m(G))
   m(F∆G) ≥ |√m(F) − √m(G)|^2.


9. Given disjoint events F_n, n = 1,2,..., and a measurement f, prove that integrals are countably additive in the following sense:

   ∫_{∪_{n=1}^∞ F_n} f dm = Σ_{n=1}^∞ ∫_{F_n} f dm.

10. Prove that if (∫_F f dm)/m(F) ∈ [a, b] for all events F, then f ∈ [a, b] m-a.e.
11. Show that if f, g ∈ L^2(m), then

    ∫ |fg| dm ≤ (1/2) ∫ f^2 dm + (1/2) ∫ g^2 dm.

12. Show that if f ∈ L^2(m), then

    ∫_{|f| > r} |f| dm ≤ (∫ f^2 dm)^{1/2} (m(|f| > r))^{1/2}.

5.7 Time Averages

Given a random process X_n we will often wish to apply a given measurement repeatedly at successive times, that is, to form a sequence of outputs of a given measurement. For example, if the random process has a real alphabet corresponding to a voltage reading, we might wish to record the successive outputs of the meter and hence the successive random process values X_n. Similarly, we might wish to record successive values of the energy in a unit resistor: X_n^2. We might wish to estimate the autocorrelation of lag or delay k via successive values of the form X_n X_{n+k}. Although all such sequences are random and hence themselves random processes, we shall see that under certain circumstances the arithmetic or Cesàro means of such sequences of measurements will converge. Such averages are called time averages or sample averages and this section is devoted to their definition and to a summary of several possible forms of convergence.

A useful general model for a sequence of measurements is provided by a dynamical system (A^I, B_A^I, m, T), that is, a probability space (A^I, B_A^I, m) and a transformation T : A^I → A^I (such as the shift), together with a real-valued measurement f : A^I → R. Define the corresponding sequence of measurements fT^i, i = 0,1,2,..., by fT^i(x) = f(T^i x); that is, fT^i corresponds to making the measurement f at time i. For example, assume that the dynamical system corresponds to a real-alphabet random process with distribution m and that T is the shift on A^{Z_+}. If we are given a sequence x = (x_0, x_1, ...) in A^{Z_+}, where Z_+ = {0,1,2,...}, and if f(x) = Π_0(x) = x_0, the "time-zero" coordinate function, then f(T^i x) = x_i, the outputs of the directly given process with distribution m. More generally, f could be a filtered or coded version of an underlying process. The key idea is that a measurement f made on a sample sequence at time i corresponds to taking f on the ith shift of the sample sequence.

Given a measurement f on a dynamical system (Ω, B, m, T), the nth order time average or sample average < f >_n is defined as the arithmetic mean or Cesàro mean of the measurements at times 0,1,2,...,n−1; that is,

< f >_n (x) = n^{-1} Σ_{i=0}^{n−1} f(T^i x),  (5.24)

or, more briefly,

< f >_n = n^{-1} Σ_{i=0}^{n−1} fT^i.

For example, in the case with f being the time-zero coordinate function,

< f >_n (x) = n^{-1} Σ_{i=0}^{n−1} x_i,

the sample mean or time-average mean of the random process. If f(x) = x_0 x_k, then

< f >_n (x) = n^{-1} Σ_{i=0}^{n−1} x_i x_{i+k},

the sample autocorrelation of lag k of the process. If k = 0, this is the time-average power of the process since if the x_k represent voltages, x_k^2 is the energy across a unit resistor at time k.
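To make these definitions concrete, here is a small sketch (illustrative only; the AR(1) model is my own choice, not the text's) that computes finite-order sample averages < f >_n along a single sample path for f the time-zero coordinate (the sample mean) and for f(x) = x_0 x_k (the sample autocorrelation of lag k).

```python
import numpy as np

rng = np.random.default_rng(4)
a, sigma, n = 0.6, 1.0, 100_000
x = np.empty(n)                       # one sample sequence x_0, x_1, ... of an AR(1) process
x[0] = rng.normal(scale=sigma / np.sqrt(1 - a**2))
for i in range(n - 1):
    x[i + 1] = a * x[i] + rng.normal(scale=sigma)

def sample_average(f_of_shift, n_terms):
    """< f >_n = (1/n) sum_{i=0}^{n-1} f(T^i x), where f_of_shift(i) evaluates f on the i-th shift."""
    return sum(f_of_shift(i) for i in range(n_terms)) / n_terms

k = 3
mean_est = sample_average(lambda i: x[i], 50_000)                 # f(x) = x_0
autocorr_est = sample_average(lambda i: x[i] * x[i + k], 50_000)  # f(x) = x_0 x_k
power_est = sample_average(lambda i: x[i] ** 2, 50_000)           # f(x) = x_0^2 (lag 0)

print("sample mean            :", round(mean_est, 3))      # near E X_0 = 0
print("sample autocorr, lag 3 :", round(autocorr_est, 3))  # near sigma^2 a^3/(1-a^2) ~ 0.34
print("sample power (lag 0)   :", round(power_est, 3))     # near sigma^2/(1-a^2) ~ 1.56
```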

Finite order sample averages are themselves random variables, but thefollowing lemma shows that they share several fundamental propertiesof expectations.

Lemma 5.13. Sample averages have the following properties for all n:

(a) If fT^i ≥ 0 with probability 1, i = 0,1,...,n−1, then < f >_n ≥ 0.
(b) < 1 >_n = 1.
(c) For any real α, β and any measurements f and g,

    < αf + βg >_n = α < f >_n + β < g >_n.

(d) | < f >_n | ≤ < |f| >_n.
(e) Given two measurements f and g for which f ≥ g with probability 1, then

    < f >_n ≥ < g >_n.


Proof. The properties all follow from the properties of summations. Notethat for (a) we must strengthen the requirement that f is nonnegative a.e.to the requirement that all of the fT i are nonnegative a.e. Alternatively,this is a special case of Lemma 5.5 in which for a particular x we de-fine a probability measure on measurements taking values f(Tkx) withprobability 1/n for k = 0,1, . . . , n− 1. 2

We will usually be interested not in the finite order sample averages,but in the long term or asymptotic sample averages obtained as n → ∞.Since the finite order sample averages are themselves random variables,we need to develop notions of convergence of random variables (or mea-surable functions) in order to make such limits precise. There are, infact, several possible approaches. In the remainder of this section weconsider the notion of convergence of random variables closest to theordinary definition of a limit of a sequence of real numbers. In the nextsection three other forms of convergence are considered and compared.

Given a measurement f on a dynamical system (Ω, B, m, T), let F_f denote the set of all x ∈ Ω for which the limit

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} f(T^k x)

exists. For all such x we define the time average or limiting time average or sample average < f > by this limit; that is,

< f > = lim_{n→∞} < f >_n.

Before developing the properties of < f >, we show that the set F_f is indeed measurable. This follows from the measurability of limits of measurements, but it is of interest to prove it directly. Let f_n denote < f >_n and define the events

F_n(ε) = {x : |f_n(x) − f(x)| ≤ ε}  (5.25)

and define as in (5.6)

F(ε) = lim inf_{n→∞} F_n(ε) = ∪_{n=1}^∞ ∩_{k=n}^∞ F_k(ε)
     = {x : x ∈ F_n(ε) for all but a finite number of n}.  (5.26)

Then the set

F_f = ∩_{k=1}^∞ F(1/k)  (5.27)

is measurable since it is formed by countable set-theoretic operations on measurable sets. The sequence of sample averages < f >_n will be said


to converge to a limit < f > almost everywhere or with probability one ifm(Ff ) = 1.

The following lemma shows that limiting time averages exhibit behav-ior similar to the expectations of Lemma 5.6.

Lemma 5.14. Time averages have the following properties when they arewell defined; that is, when the limits exist:

(a) If fT^i ≥ 0 with probability 1 for i = 0,1,..., then < f > ≥ 0.
(b) < 1 > = 1.
(c) For any real α, β and any measurements f and g,

    < αf + βg > = α < f > + β < g >.

(d) | < f > | ≤ < |f| >.
(e) Given two measurements f and g for which f ≥ g with probability 1, then

    < f > ≥ < g >.

Proof. The proof follows from Lemma 5.13 and the properties of limits. In particular, (a) is proved as follows: If fT^i ≥ 0 a.e. for each i, then the event {fT^i ≥ 0, all i} has probability one and hence so does the event {< f >_n ≥ 0, all n}. (The countable intersection of a collection of sets with probability one also has probability one.) 2

Note that unlike Lemma 5.13, Lemma 5.14 cannot be viewed as beinga special case of Lemma 5.6.

5.8 Convergence of Random Variables

We have introduced the notion of almost everywhere convergence of asequence of measurements. We now further develop the properties ofsuch convergence along with three other notions of convergence.

Suppose that f_n, n = 1,2,..., is a sequence of measurements (random variables, measurable functions) defined on a probability space with probability measure m.

Definitions:

1. The sequence f_n is said to converge with probability one or with certainty or m-almost surely (m-a.s. or m-a.e.) to a random variable f if

   m({x : lim_{n→∞} f_n(x) = f(x)}) = 1;

   that is, if with probability one the usual limit of the sequence f_n(x) exists and equals f(x). In this case we write f = lim_{n→∞} f_n, m-a.e.


2. The f_n are said to converge in probability to a random variable f if for every ε > 0

   lim_{n→∞} m({x : |f_n(x) − f(x)| > ε}) = 0,

   in which case we write f_n →_m f or f = m-lim_{n→∞} f_n.
   If for all positive real r

   lim_{n→∞} m({x : f_n(x) ≤ r}) = 0,

   we write f_n →_m ∞ or m-lim_{n→∞} f_n = ∞.

3. The space L^1(m) of all m-integrable functions is a pseudo-normed space with pseudo-norm ‖f‖_1 = ∫ |f| dm, and hence it is a pseudo-metric space with metric

   d(f, g) = ∫ |f − g| dm.

   We can consider the space to be a normed metric space if we identify or consider equal functions for which d(f, g) is zero. Note that d(f, g) = 0 if and only if f = g m-a.e. and hence the metric space really has as elements equivalence classes of functions rather than functions. A sequence f_n of measurements is said to converge to a measurement f in L^1(m) if f is integrable and if lim_{n→∞} ‖f_n − f‖_1 = 0, in which case we write

   f_n →_{L^1(m)} f.

   Convergence in L^1(m) has a simple but useful consequence. From Lemma 5.2(d) for any event F

   | ∫_F f_n dm − ∫_F f dm | = | ∫_F (f_n − f) dm | ≤ ∫_F |f_n − f| dm ≤ ‖f_n − f‖_1 →_{n→∞} 0

   and therefore

   lim_{n→∞} ∫_F f_n dm = ∫_F f dm if f_n → f in L^1(m).  (5.28)

   In particular, if F is the entire space, then lim_{n→∞} ∫ f_n dm = ∫ f dm.


4. Denote by L^2(m) the space of all square-integrable functions, that is, the space of all f such that ∫ f^2 dm is finite. This is an inner product space with inner product (f, g) = ∫ fg dm, hence also a normed space with norm ‖f‖_2 = (∫ f^2 dm)^{1/2}, and hence also a metric space with pseudo-metric d(f, g) = (E[(f − g)^2])^{1/2} (a metric if we identify functions f and g for which d(f, g) = 0). A sequence of square-integrable functions f_n is said to converge in L^2(m) or to converge in mean square or to converge in quadratic mean to a function f if f is square-integrable and lim_{n→∞} ‖f_n − f‖_2 = 0, in which case we write

   f_n →_{L^2(m)} f

   or

   f = l.i.m._{n→∞} f_n,

   where l.i.m. stands for limit in the mean.

Although we shall focus on L^1 and L^2, we note in passing that they are special cases of general L^p norms defined as follows: Let L^p denote the space of all measurements f for which |f|^p is integrable for some integer p ≥ 1. Then

‖f‖_p = ( ∫ |f|^p dm )^{1/p}

is a pseudo-norm on this space. It is a norm if we consider measurements to be the same if they are equal almost everywhere. We say that a sequence of measurements f_n converges to f in L^p if ‖f_n − f‖_p → 0 as n → ∞. As an example of such convergence, the following lemma shows that the quantizer sequence of Lemma 5.3 converges to the original measurement in L^p provided that f is in L^p.

Lemma 5.15. If f ∈ L^p(m) and q_n(f), n = 1,2,..., is the quantizer sequence of Lemma 5.3, then

q_n(f) →_{L^p} f.

Proof. From Lemma 5.3, |q_n(f) − f| is bound above by |f|, is monotonically nonincreasing, and goes to 0 everywhere. Hence |q_n(f) − f|^p also is monotonically nonincreasing, tends to 0 everywhere, and is bound above by the integrable function |f|^p. The lemma then follows from the dominated convergence theorem (Lemma 5.10(e)). 2

The following lemmas provide several properties and alternative char-acterizations of and some relations among the various forms of conver-gence.

Of all of the forms of convergence, only almost everywhere conver-gence relates immediately to the usual notion of convergence of se-quences of real numbers. Hence almost everywhere convergence inherits


several standard limit properties of elementary real analysis. The follow-ing lemma states several of these properties without proof. (Proofs maybe found, e.g., in Rudin [93].)

Lemma 5.16. If lim_{n→∞} f_n = f, m-a.e., and lim_{n→∞} g_n = g, m-a.e., then lim_{n→∞}(f_n + g_n) = f + g and lim_{n→∞} f_n g_n = fg, m-a.e. If also g ≠ 0 m-a.e., then lim_{n→∞} f_n/g_n = f/g, m-a.e. If h : R → R is a continuous function, then lim_{n→∞} h(f_n) = h(f), m-a.e.

The next lemma and corollary provide useful tools for proving andinterpreting almost everywhere convergence. The lemma is essentiallythe classic Borel-Cantelli lemma of probability theory.

Lemma 5.17. Given a probability space (Ω, B, m) and a sequence of events G_n ∈ B, n = 1,2,..., define the set {G_n i.o.} as the set {ω : ω ∈ G_n for an infinite number of n}; that is,

{G_n i.o.} = ∩_{n=1}^∞ ∪_{j=n}^∞ G_j = lim sup_{n→∞} G_n.

Then m(G_n i.o.) = 0 if and only if

lim_{n→∞} m( ∪_{k=n}^∞ G_k ) = 0.

A sufficient condition for m(G_n i.o.) = 0 is that

Σ_{n=1}^∞ m(G_n) < ∞.  (5.29)

Proof. From the continuity of probability,

m(G_n i.o.) = m( ∩_{n=1}^∞ ∪_{j=n}^∞ G_j ) = lim_{n→∞} m( ∪_{j=n}^∞ G_j ),

which proves the first statement. From the union bound,

m( ∪_{j=n}^∞ G_j ) ≤ Σ_{j=n}^∞ m(G_j),

which goes to 0 as n → ∞ if (5.29) holds. 2

Corollary 5.10. lim_{n→∞} f_n = f m-a.e. if and only if for every ε > 0

lim_{n→∞} m( ∪_{k=n}^∞ {x : |f_k(x) − f(x)| ≥ ε} ) = 0.

If for every ε > 0

Σ_{n=0}^∞ m(|f_n| > ε) < ∞,  (5.30)

then f_n → 0 m-a.e.

Proof. Define F_n(ε) and F(ε) as in (5.25)-(5.26). Applying the lemma to G_n = F_n(ε)^c shows that the condition of the corollary is equivalent to m(F_n(ε)^c i.o.) = 0 and hence m(F(ε)) = 1 and therefore

m({x : lim_{n→∞} f_n(x) = f(x)}) = m( ∩_{k=1}^∞ F(1/k) ) = 1.

Conversely, if

m( ∩_k F(1/k) ) = 1,

then since ∩_{k≥1} F(1/k) ⊂ F(ε) for all ε > 0, m(F(ε)) = 1 and therefore

lim_{n→∞} m( ∪_{k≥n} F_k(ε)^c ) = 0.

Eq. (5.30) is (5.29) for G_n = {x : |f_n(x)| > ε}, and hence this condition is sufficient from the lemma. 2

Note that almost everywhere convergence requires that the probabilitythat |fk − f | > ε for any k ≥ n tend to zero as n → ∞, whereas conver-gence in probability requires only that the probability that |fn − f | > εtend to zero as n→∞.
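To make this distinction concrete, the following sketch (an illustrative construction of mine, not from the text) uses the classic "sliding indicator" sequence on [0,1) with Lebesgue measure: f_n is the indicator of an interval of length 1/k whose position sweeps across [0,1). The interval lengths shrink, so m(|f_n| > ε) → 0 and f_n → 0 in probability, yet every x lies in infinitely many of the intervals, so m(∪_{k≥n}{|f_k| > ε}) stays at 1 and the sequence does not converge almost everywhere.

```python
import numpy as np

def interval(n):
    """Return the interval [a, b) on which f_n = 1 for the sliding-indicator sequence.

    The sequence enumerates, for k = 1, 2, ..., the k intervals [j/k, (j+1)/k),
    j = 0, ..., k-1, so the interval length 1/k shrinks while the intervals
    keep sweeping across [0, 1)."""
    k, m = 1, n
    while m >= k:              # find which block of length k the index n falls in
        m -= k
        k += 1
    return m / k, (m + 1) / k

x = np.linspace(0.0, 1.0, 2001, endpoint=False)   # grid approximating Lebesgue measure
eps, N = 0.5, 500

prob_exceed = [interval(n)[1] - interval(n)[0] for n in range(N)]   # m(|f_n| > eps)

tail_union = []                 # m( union_{k >= n} {|f_k| > eps} ) for a few n
for n in (0, 100, 400):
    hit = np.zeros_like(x, dtype=bool)
    for k in range(n, N):
        a, b = interval(k)
        hit |= (x >= a) & (x < b)
    tail_union.append(hit.mean())

print("m(|f_n| > eps) for large n        :", round(prob_exceed[-1], 4))  # small: convergence in probability
print("m(union_{k>=n} {|f_k| > eps})     :", tail_union)                 # stays ~1: no a.e. convergence
```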

Lemma 5.18. Given random variables f_n, n = 1,2,..., and f, then

f_n →_{L^2(m)} f ⇒ f_n →_{L^1(m)} f ⇒ f_n →_m f,

lim_{n→∞} f_n = f m-a.e. ⇒ f_n →_m f.

Proof. From the Cauchy-Schwarz inequality ‖f_n − f‖_1 ≤ ‖f_n − f‖_2, and hence L^2(m) convergence implies L^1(m) convergence. From Markov's inequality

m(|f_n − f| ≥ ε) ≤ E_m(|f_n − f|)/ε,

and hence L^1(m) convergence implies convergence in m-probability. If f_n → f m-a.e., then

m(|f_n − f| > ε) ≤ m( ∪_{k=n}^∞ {|f_k − f| > ε} ) →_{n→∞} 0

and hence m-lim_{n→∞} f_n = f. 2


Corollary 5.11. If f_n → f and f_n → g in any of the given forms of convergence, then f = g, m-a.e.

Proof. Fix ε > 0 and define F_n(δ) = {x : |f_n(x) − f(x)| ≤ δ} and G_n(δ) = {x : |f_n(x) − g(x)| ≤ δ}; then

F_n(ε/2) ∩ G_n(ε/2) ⊂ {x : |f(x) − g(x)| ≤ ε}

and hence

m(|f − g| > ε) ≤ m( F_n(ε/2)^c ∪ G_n(ε/2)^c ) ≤ m(F_n(ε/2)^c) + m(G_n(ε/2)^c) →_{n→∞} 0.

Since this is true for all ε, m(|f − g| > 0) = 0, so the corollary is proved for convergence in probability. From the lemma, convergence in probability is implied by the other forms, and hence the proof is complete. 2

The following lemma shows that the metric spaces L1(m) and L2(m)are each Polish spaces (when we consider two measurements to be thesame if they are equal m-a.e.).

Lemma 5.19. The spaces L^p(m), p = 1,2,..., are Polish spaces (complete, separable metric spaces).

Proof. The spaces are separable since from Lemma 5.15 the collection of all simple functions with rational values is dense. Let f_n be a Cauchy sequence, that is, ‖f_m − f_n‖ →_{n,m→∞} 0, where ‖·‖ denotes the L^p(m) norm. Choose an increasing sequence of integers n_l, l = 1,2,..., so that

‖f_m − f_n‖ ≤ (1/4)^l for m, n ≥ n_l.

Let g_l = f_{n_l} and define the set G_n = {x : |g_n(x) − g_{n+1}(x)| ≥ 2^{-n}}. From the Markov inequality

m(G_n) ≤ 2^{np} ‖g_n − g_{n+1}‖^p ≤ 2^{-np}

and therefore from Lemma 5.17

m(lim sup_{n→∞} G_n) = 0.

Thus with probability 1, x ∉ lim sup_n G_n. If x ∉ lim sup_n G_n, however, then |g_n(x) − g_{n+1}(x)| < 2^{-n} for large enough n. This means that for such x, g_n(x) is a Cauchy sequence of real numbers, which must have a limit since R is complete under the Euclidean norm. Thus g_n converges on a set of probability 1 to a function g. The proof is completed by showing that f_n → g in L^p. We have that

‖f_n − g‖^p = ∫ |f_n − g|^p dm = ∫ lim inf_{l→∞} |f_n − g_l|^p dm
            ≤ lim inf_{l→∞} ∫ |f_n − f_{n_l}|^p dm = lim inf_{l→∞} ‖f_n − f_{n_l}‖^p

using Fatou's lemma and the definition of g_l. The right-most term, however, can be made arbitrarily small by choosing n large, proving the required convergence. 2

Lemma 5.20. If f_n →_m f and the f_n are uniformly integrable, then f is integrable and f_n → f in L^1(m). If f_n →_m f and the f_n^2 are uniformly integrable, then f ∈ L^2(m) and f_n → f in L^2(m).

Proof. We only prove the L^1 result as the other is proved in a similar manner. Fix ε > 0. Since the f_n are uniformly integrable we can choose from Lemma 5.11 a δ > 0 sufficiently small to ensure that if m(F) < δ, then

∫_F |f_n| dm < ε/4, all n.

Since the f_n converge to f in probability, we can choose N so large that m(|f_n − f| > ε/4) < δ/2 for n > N. Then for all n, m > N (using the L^1(m) norm)

‖f_n − f_m‖ = ∫_{|f_n−f| ≤ ε/4 and |f_m−f| ≤ ε/4} |f_n − f_m| dm + ∫_{|f_n−f| > ε/4 or |f_m−f| > ε/4} |f_n − f_m| dm.

If f_n and f_m are each within ε/4 of f, then they are within ε/2 of each other from the triangle inequality. Also using the triangle inequality, the right-most integral is bound by

∫_F |f_n| dm + ∫_F |f_m| dm,

where F = {|f_n − f| > ε/4 or |f_m − f| > ε/4}, which, from the union bound, has probability less than δ/2 + δ/2 = δ, and hence the preceding integrals are each less than ε/4. Thus ‖f_n − f_m‖ ≤ ε/2 + ε/4 + ε/4 = ε, and hence f_n is a Cauchy sequence in L^1(m), and it therefore must have a limit, say g, since L^1(m) is complete. From Lemma 5.18, however, this means that also the f_n converge to g in probability, and hence from Corollary 5.11 f = g, m-a.e., completing the proof. 2

The lemma provides a simple generalization of Corollary 5.2 and analternative proof of a special case of Lemma 5.15.

Corollary 5.12. If qn is the quantizer sequence of Lemma 5.3 and f issquare-integrable, then qn(f)→ f in L2.


Proof. From Lemma 5.5, qn(f(x)) → f(x) for all x and hence also inprobability. Since for all n |qn(f)| ≤ |f | by construction, |qn(f)|2 ≤|f |2, and hence from Lemma 5.9 the qn(f)2 are uniformly integrableand the conclusion follows from the lemma. 2

The next lemma uses uniform integrability ideas to provide a partialconverse to the convergence implications of Lemma 5.18.

Lemma 5.21. If a sequence of random variables f_n converges in probability to a random variable f, then the following statements are equivalent (where p = 1 or 2):

(a) f_n →_{L^p} f.
(b) The f_n^p are uniformly integrable.
(c) lim_{n→∞} E(|f_n|^p) = E(|f|^p).

Proof. Lemma 5.20 implies that (b) ⇒ (a). If (a) is true, then the triangle inequality implies that | ‖f_n‖ − ‖f‖ | ≤ ‖f_n − f‖ → 0, which implies (c). Finally, assume that (c) holds. Consider the integral

∫_{|f_n|^p ≥ r} |f_n|^p dm = ∫ |f_n|^p dm − ∫_{|f_n|^p < r} |f_n|^p dm
                          = ∫ |f_n|^p dm − ∫ min(|f_n|^p, r) dm + r ∫_{|f_n|^p ≥ r} dm.  (5.31)

The first term on the right tends to ∫ |f|^p dm by assumption. We consider the limits of the remaining two terms. Fix δ > 0 small and observe that

E( |min(|f_n|^p, r) − min(|f|^p, r)| ) ≤ δ m( ||f_n|^p − |f|^p| ≤ δ ) + r m( ||f_n|^p − |f|^p| > δ ).

Since f_n → f in probability, it must also be true that |f_n|^p → |f|^p in probability, and hence the preceding expression is bound above by δ as n → ∞. Since δ is arbitrarily small,

lim_{n→∞} |E(min(|f_n|^p, r)) − E(min(|f|^p, r))| ≤ lim_{n→∞} E( |min(|f_n|^p, r) − min(|f|^p, r)| ) = 0.

Thus the middle term on the right of (5.31) converges to E(min(|f|^p, r)). The remaining term is r m(|f_n|^p ≥ r). For any ε > 0


m(|f_n|^p ≥ r) = m(|f_n|^p ≥ r and ||f_n|^p − |f|^p| ≤ ε) + m(|f_n|^p ≥ r and ||f_n|^p − |f|^p| > ε).

The right-most term tends to 0 as n → ∞, and the remaining term is bound above by m(|f|^p ≥ r − ε). From the continuity of probability, this can be made arbitrarily close to m(|f|^p ≥ r) by choosing ε small enough, e.g., choosing F_n = {|f|^p ∈ [r − 1/n, ∞)}, then F_n decreases to the set F = {|f|^p ∈ [r, ∞)} and lim_{n→∞} m(F_n) = m(F).

Combining all of the preceding facts,

lim_{n→∞} ∫_{|f_n|^p ≥ r} |f_n|^p dm = ∫_{|f|^p ≥ r} |f|^p dm.

From Corollary 5.5, given ε > 0 we can choose R so large that for r > R the right-hand side is less than ε/2, and then choose N so large that

∫_{|f_n|^p > R} |f_n|^p dm ≤ ε, n > N.  (5.32)

Then

lim_{r→∞} sup_n ∫_{|f_n|^p > r} |f_n|^p dm ≤ lim_{r→∞} sup_{n≤N} ∫_{|f_n|^p > r} |f_n|^p dm + lim_{r→∞} sup_{n>N} ∫_{|f_n|^p > r} |f_n|^p dm.  (5.33)

The first term to the right of the inequality is zero since the limit of each of the finite terms in the sequence is zero from Corollary 5.5. To bound the second term, observe that for r > R each of the integrals inside the supremum is bound above by a similar integral over the larger region {|f_n|^p > R}. From (5.32) each of these integrals is bound above by ε, and hence so is the entire right-most term. Since the left-hand side of (5.33) is bound above by any ε > 0, it must be zero, and hence the |f_n|^p are uniformly integrable. 2

Note as a consequence of the lemma that if f_n → f in L^p for p = 1,2,..., then necessarily the f_n^p are uniformly integrable.

Exercises

1. Prove Lemma 5.16.
2. Show that if f_k ∈ L^1(m) for k = 1,2,..., then there exists a subsequence n_k, k = 1,2,..., such that with probability one

   lim_{k→∞} f_{n_k}/n_k = 0.

   Hint: Use the Borel-Cantelli lemma.
3. Suppose that f_n converges in probability to f. Show that given ε, δ > 0 there is an N such that if

   F_{n,m}(ε) = {x : |f_n(x) − f_m(x)| > ε},

   then m(F_{n,m}(ε)) ≤ δ for n, m ≥ N; that is,

   lim_{n,m→∞} m(F_{n,m}(ε)) = 0, all ε > 0.

   In words, if f_n converges in probability, then it is a Cauchy sequence in probability.
4. Show that if f_n → f and g_n − f_n → 0 in probability, then g_n → f in probability. Show that if f_n → f and g_n → g in L^1, then f_n + g_n → f + g in L^1. Show that if f_n → f and g_n → g in L^2 and sup_n ‖f_n‖_2 < ∞, then f_n g_n → fg in L^1.

5.9 Stationary Random Processes

We have seen in the discussions of ensemble averages and time aver-ages that uniformly integrable sequences of functions have several niceconvergence properties. In particular, we can compare expectations oflimits with limits of expectations and we can infer L1(m) convergencefrom m-a.e. convergence from such sequences. As time averages of theform < f >n are the most important sequences of random variables forergodic theory, a natural question is whether or not such sequences areuniformly integrable if f is integrable. Of course in general time averagesof integrable functions need not be uniformly integrable, but in this sec-tion we shall see that in a special, but very important, case the responseis affirmative.

Let (Ω,B,m,T) be a dynamical system, where as usual we keep inmind the example where the system corresponds to a directly given ran-dom process Xn,n ∈ I with alphabet A and where T is a shift.

In general the expectation of the shift fT^i of a measurement f can be found from the change of variables formula of Lemma 5.12(b); that is, if mT^{-i} is the measure defined by

mT^{-i}(F) = m(T^{-i}F), all events F,  (5.34)

then

E_m fT^i = ∫ fT^i dm = ∫ f d(mT^{-i}) = E_{mT^{-i}} f.  (5.35)


We shall find that the sequences of measurements fTn and sampleaverages < f >n will have special properties if the preceding expecta-tions do not depend on i; that is, the expectations of measurements donot depend on when they are taken. From (5.35), this will be the case ifthe probability measuresmT−n defined bymT−n(F) =m(T−nF) for allevents F and all n are the same. This motivates the following definition.The dynamical system is said to be stationary (or T -stationary or sta-tionary with respect to T) or invariant (or T -invariant or invariant withrespect to T) if

m(T−1F) =m(F), all events F. (5.36)

We shall also say simply that the measure m is stationary with respectto T if (5.36) holds. A random process Xn is said to be stationary (withthe preceding modifiers) if its distribution m is stationary with respectto T , that is, if the corresponding dynamical system with the shift T isstationary. A stationary random process is one for which the probabil-ities of an event are the same regardless of when it occurs. Stationaryrandom processes will play a central, but not exclusive, role in the er-godic theorems.

Observe that (5.36) implies that

m(T−nF) =m(F), all events F, all n = 0,1,2, . . . (5.37)

The next lemma follows immediately from (5.34), (5.35), and (5.37).

Lemma 5.22. If m is stationary with respect to T and f is integrable with respect to m, then

E_m f = E_m fT^n, n = 0,1,2,...

As an example, suppose that f = X_0 is the zero-time coordinate function in a Kolmogorov representation, i.e., f(x) = x_0, and hence the nth coordinate is X_n = fT^n. Then for a stationary process the mean function E(X_n) satisfies E(X_n) = E(X_0). Another common moment of interest is the autocorrelation function R_X(n, k) = E(X_n X_k). Application of the lemma implies that R_X(n, k) = E(X_{n−k} X_0). Since this depends on the two arguments only through their difference, the notation is usually simplified to R_X(n − k). Functions depending on two arguments only through their difference are known as Toeplitz functions. This shows that the autocorrelation of a stationary process is Toeplitz.
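As an illustrative numerical check (my own example, not from the text), the sketch below simulates a stationary Gaussian AR(1) process by drawing X_0 from its stationary distribution, estimates E(X_n X_k) over an ensemble of sample paths, and confirms that the estimate depends on (n, k) essentially only through n − k.

```python
import numpy as np

rng = np.random.default_rng(1)
a, sigma = 0.8, 1.0                 # AR(1): X_{n+1} = a X_n + W_n, W_n ~ N(0, sigma^2)
paths, length = 200_000, 6

# start in the stationary distribution so the process is stationary
x = np.empty((paths, length))
x[:, 0] = rng.normal(scale=sigma / np.sqrt(1 - a**2), size=paths)
for n in range(length - 1):
    x[:, n + 1] = a * x[:, n] + rng.normal(scale=sigma, size=paths)

# ensemble estimate of R_X(n, k) = E(X_n X_k)
R = x.T @ x / paths
for lag in range(length):
    diag = np.diagonal(R, offset=lag)   # entries with n - k = lag
    print(f"lag {lag}: estimates {np.round(diag, 3)}")
# each printed row should be nearly constant, i.e., R_X(n, k) depends only on n - k (Toeplitz),
# and should be close to sigma^2 a^lag / (1 - a^2)
```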

Lemma 5.23. If f is in L^1(m) and m is stationary with respect to T, then the sequences {fT^n; n = 0,1,2,...} and {< f >_n; n = 0,1,2,...}, where

< f >_n = n^{-1} Σ_{i=0}^{n−1} fT^i,

are uniformly integrable. If also f ∈ L^2(m), then in addition the sequence {f fT^n; n = 0,1,2,...} and its arithmetic mean sequence {w_n; n = 0,1,...} defined by

w_n(x) = n^{-1} Σ_{i=0}^{n−1} f(x) f(T^i x)  (5.38)

are uniformly integrable.

Proof. First observe that if G = {x : |f(x)| ≥ r}, then T^{-i}G = {x : |f(T^i x)| ≥ r} since x ∈ T^{-i}G if and only if T^i x ∈ G if and only if |f(T^i x)| ≥ r. Thus by the change of variables formula and the assumed stationarity

∫_{T^{-i}G} |fT^i| dm = ∫_G |f| dm, independent of i,

and therefore

∫_{|fT^i| ≥ r} |fT^i| dm = ∫_{|f| ≥ r} |f| dm →_{r→∞} 0 uniformly in i.

This proves that the fT^i are uniformly integrable. Corollary 5.9 implies that the arithmetic mean sequence is then also uniformly integrable. 2

We close this section with an important attribute of stationary random processes. We have seen that random processes can be one-sided or two-sided. In the stationary case the two types of processes are essentially equivalent. Suppose that {X_n; n ∈ Z} is a two-sided stationary random process with alphabet A and a directly given Kolmogorov representation (A^Z, B_A^Z, m_X). Then m_X implies a distribution on all finite-dimensional cylinders, and the distribution is unchanged by shifting. Hence we can construct a one-sided process {X'_n; n ∈ Z_+} such that for any N and any N-dimensional cylinder set C ∈ B_A^N we have that Pr((X_n, X_{n+1}, ..., X_{n+N−1}) ∈ C) = Pr((X'_n, X'_{n+1}, ..., X'_{n+N−1}) ∈ C) for all n ≥ 0. In other words, the two distributions agree on all nonnegative-time cylinders, and hence also on the σ-field these cylinders generate. Thus a one-sided process can be extracted from the two-sided process by restricting the probabilities to nonnegative time indexes. Similarly, the cylinder probabilities from the one-sided process define the probabilities for all shifts of cylinders in the two-sided process, so one can embed the stationary one-sided process into a two-sided process. This trick only works for stationary processes, but we will see that most nonstationary processes of interest are related to a stationary process.


5.10 Block and Asymptotic Stationarity

While stationarity is the most common assumption used to obtain er-godic theorems relating sample averages and expectations, there are gen-eralizations of the idea which lead to similar results. The generalizationsmodel nonstationary behavior that arises in real-world applications suchas artifacts due to mapping blocks or vectors when coding or signal pro-cessing and initial conditions that die out with time.

A dynamical system (Ω,B,m,T) is said to be N-stationary for a posi-tive integer N if it is stationary with respect to TN ; that is,

m(T−NF) =m(F), all F ∈ B. (5.39)

This implies that for any event F m(T−nNF) = m(F) and hence thatm(T−nF) is a periodic function in n with period N .

An N-stationary process can be “stationarized” by forming a finite mixture of its shifts. Define the measure m_N by

m_N(F) = (1/N) Σ_{i=0}^{N−1} m(T^{-i}F)

for all events F, a definition that is abbreviated by

m_N = (1/N) Σ_{i=0}^{N−1} m T^{-i}.

If m is N-stationary, then m_N is stationary since

m_N(T^{-1}F) = (1/N) Σ_{i=0}^{N−1} m(T^{-i}T^{-1}F)
             = (1/N) [ Σ_{i=1}^{N−1} m(T^{-i}F) + m(T^{-N}F) ]
             = (1/N) Σ_{i=0}^{N−1} m(T^{-i}F) = m_N(F).
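As a toy illustration (my own construction, not from the text), consider a process of independent Gaussian samples whose variance alternates with the parity of the time index. The process is 2-stationary but not stationary; averaging the process distribution with its one-step shift as above, here realized by randomizing the starting phase, yields a stationary mixture. The sketch below checks only the one-dimensional marginals, which is enough to see the effect.

```python
import numpy as np

rng = np.random.default_rng(2)
sd = np.array([1.0, 2.0])          # std dev at even / odd times: 2-stationary, not stationary

def sample_paths(shift, paths=100_000, length=6):
    """Sample X_shift, ..., X_{shift+length-1}, i.e., the process distribution shifted by T^{-shift}."""
    scales = sd[(shift + np.arange(length)) % 2]
    return rng.normal(size=(paths, length)) * scales

x_m = sample_paths(0)                               # the original (2-stationary) measure m
# the stationarized measure m_2 = (1/2)(m + m T^{-1}): pick the shift uniformly per path
phase = rng.integers(0, 2, size=100_000)
x_m2 = np.where(phase[:, None] == 0, sample_paths(0), sample_paths(1))

print("per-time variance under m   :", np.round(x_m.var(axis=0), 2))   # alternates ~1, 4, 1, ...
print("per-time variance under m_2 :", np.round(x_m2.var(axis=0), 2))  # constant ~2.5 = (1+4)/2
```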

A dynamical system with measure m is said to be asymptotically stationary if there is a stationary measure m̄ for which

lim_{n→∞} m(T^{-n}F) = m̄(F), all F ∈ B.

A dynamical system with measure m is said to be asymptotically mean stationary if there is a stationary measure m̄ for which

lim_{n→∞} (1/n) Σ_{i=0}^{n−1} m(T^{-i}F) = m̄(F), all F ∈ B.

The following lemma relates these stationarity properties to the cor-responding expectations. The proof is left as an exercise.

Lemma 5.24. Let (Ω, B, m, T) be a dynamical system and suppose that f is a measurement on the system. If m is N-stationary, then

E_{m_N} f = (1/N) Σ_{i=0}^{N−1} E_m(fT^i).

If m is asymptotically mean stationary with stationary mean m̄ and f is a bounded measurement, that is, |f| ≤ K for some finite K with probability one, then

E_{m̄} f = lim_{n→∞} (1/n) Σ_{i=0}^{n−1} E_m(fT^i).

Exercises

1. Does Lemma 5.23 hold for N-stationary processes?
2. Prove Lemma 5.24. Hint: As usual, first consider discrete measurements and then take limits.
3. Show that stationary, N-stationary, and asymptotically stationary systems are also asymptotically mean stationary.


Chapter 6

Conditional Probability and Expectation

Abstract The relations between measurements, that is, measurable functions, and events, that is, members of a σ-field, are explored and used to extend the basic ideas of probability to probability conditioned on measurements and events. Such relations are useful for developing properties of certain special functions such as limiting sample averages arising in the study of ergodic properties of information sources. In addition, they are fundamental to the development and interpretation of conditional probability and conditional expectation, that is, probabilities and expectations when we are given partial knowledge about the outcome of an experiment. Although conditional probability and conditional expectation are both common topics in an advanced probability course, the fact that we are living in standard spaces results in additional properties not present in the most general situation. In particular, we find that the conditional probabilities and expectations have more structure, which enables them to be interpreted and often treated much like ordinary unconditional probabilities. In technical terms, there are always regular versions of conditional probability, and we will be able to define conditional expectations constructively as expectations with respect to conditional probability measures.

6.1 Measurements and Events

Say we have measurable spaces (Ω, B) and (A, B(A)) and a function f : Ω → A. Recall from Chapter 2 that f is a random variable or measurable function or a measurement if f^{-1}(F) ∈ B for all F ∈ B(A). Clearly, being told the outcome of the measurement f provides information about certain events and likely provides no information about others. In particular, if f assumes a value in F ∈ B(A), then we know that f^{-1}(F) occurred. We might call such an event in B an f-event since it is



specified by f taking a value in B(A). Consider the class G = f−1(B(A))= all sets of the form f−1(F), F ∈ B(A). Since f is measurable, allsets in G are also in B. Furthermore, it is easy to see that G is a σ -fieldof subsets of Ω. Since G ⊂ B and it is a σ -field, it is referred to as asub-σ -field.

Intuitively G can be thought of as the σ - field or event space consistingof those events whose outcomes are determinable by observing the out-put of f . Hence we define the σ -field induced by f as σ(f) = f−1(B(A)).

Next observe that by construction f−1(F) ∈ σ(f) for all F ∈ B(A);that is, the inverse images of all output events may be found in a sub-σ -field. That is, f is measurable with respect to the smaller σ -field σ(f) aswell as with respect to the larger σ -field B. As we will often have morethan one σ -field for a given space, we emphasize this notion. Say we aregiven measurable spaces (Ω,B) and (A,B(A)) and a sub-σ -field G of B(and hence (Ω,G) is also a measurable space). Then a function f : Ω→ Ais said to be measurable with respect to G or G-measurable or, in moredetail, G/B(A)-measurable if f−1(F) ∈ G for all F ∈ B(A). Since G ⊂ Ba G-measurable function is clearly also B-measurable. Usually we shallrefer to B-measurable functions as simply measurable functions or asmeasurements.

We have seen that σ(f) is the σ-field induced by a measurement f and that f is measurable with respect to σ(f). Observe that if G is any other σ-field with respect to which f is measurable, then G must contain all sets of the form f^{-1}(F), F ∈ B(A), and hence must contain σ(f). Thus σ(f) is the smallest σ-field with respect to which f is measurable.

Having begun with a measurement and found the smallest σ -fieldwith respect to which it is measurable, we can next reverse directionand begin with a σ -field and study the class of measurements that aremeasurable with respect to that σ -field. The short term goal will be tocharacterize those functions that are measurable with respect to σ(f).Given a σ -field G letM(G) denote the class of all measurements that aremeasurable with respect to G. We shall refer to M(G) as the measure-ment class induced by G. Roughly speaking, this is the class of measure-ments whose output events can be determined from events in G. Thus,for example, M(σ(f)) is the collection of all measurements whose out-put events could be determined by events in σ(f) and hence by outputevents of f . We shall shortly make this intuition precise, but first a com-ment is in order. The basic thrust here is that if we are given informationabout the outcome of a measurement f , in fact we then have informa-tion about the outcomes of many other measurements, e.g., f 2, |f |, andother functions of f . The class of functions M(σ(f)) will prove to beexactly that class containing functions whose outcome is determinablefrom knowledge of the outcome of the measurement f .

The following lemma shows that if we are given a measurementf : Ω → A, then a real-valued function g is in M(σ(f)), that is, is mea-


surable with respect to σ(f), if and only if g(ω) = h(f(ω)) for somemeasurement h : A → R, that is, if g depends on the underlying samplepoints only through the value of f .

Lemma 6.1. Given a measurable space (Ω,B) and a measurement f :Ω → A, then a measurement g : Ω → R is in M(σ(f)), that is, is measur-able with respect to σ(f), if and only if there is a B(A)/B(R)-measurablefunction h : A→ R such that g(ω) = h(f(ω)), all ω ∈ Ω.

Proof. If there exists such an h, then for F ∈ B(R)

g^{-1}(F) = {ω : h(f(ω)) ∈ F} = {ω : f(ω) ∈ h^{-1}(F)} = f^{-1}(h^{-1}(F)) ∈ σ(f)

since h^{-1}(F) is in B(A) from the measurability of h and hence its inverse image under f is in σ(f) by definition. Thus any g of the given form is in M(σ(f)). Conversely, suppose that g is in M(σ(f)). Let q_n : R → R be the sequence of quantizers of Lemma 5.3. Then the g_n defined by g_n(ω) = q_n(g(ω)) are σ(f)-measurable simple functions that converge to g. Thus we can write g_n as

g_n(ω) = Σ_{i=1}^M r_i 1_{F(i)}(ω),

where the F(i) are disjoint, and hence, since g_n ∈ M(σ(f)), g_n^{-1}(r_i) = F(i) ∈ σ(f). Since σ(f) = f^{-1}(B(A)), there are disjoint sets Q(i) ∈ B(A) such that F(i) = f^{-1}(Q(i)). Define the function h_n : A → R by

h_n(a) = Σ_{i=1}^M r_i 1_{Q(i)}(a)

and

h_n(f(ω)) = Σ_{i=1}^M r_i 1_{Q(i)}(f(ω)) = Σ_{i=1}^M r_i 1_{f^{-1}(Q(i))}(ω) = Σ_{i=1}^M r_i 1_{F(i)}(ω) = g_n(ω).

This proves the result for simple functions. By construction we have that g(ω) = lim_{n→∞} g_n(ω) = lim_{n→∞} h_n(f(ω)) where, in particular, the right-most limit exists for all ω ∈ Ω. Define the function h(a) = lim_{n→∞} h_n(a) where the limit exists and 0 otherwise. Then g(ω) = lim_{n→∞} h_n(f(ω)) = h(f(ω)). 2

Thus far we have developed the properties of σ -fields induced by ran-dom variables and of classes of functions measurable with respect toσ -fields. The idea of a σ -field induced by a single random variable is


easily generalized to random vectors and sequences. We wish, however,to consider the more general case of a σ -field induced by a possiblyuncountable class of measurements. Then we will have associated witheach class of measurements a natural σ -field and with each σ -field anatural class of measurements. Toward this end, given a class of mea-surementsM, define σ(M) as the smallest σ -field with respect to whichall of the measurements in M are measurable. Since any σ -field satis-fying this condition must contain all of the σ(f) for f ∈ M and hencemust contain the σ -field induced by all of these sets and since this lattercollection is a σ -field,

σ(M) = σ( ∪_{f∈M} σ(f) ).

The following lemma collects some simple relations among σ -fieldsinduced by measurements and classes of measurements induced by σ -fields.

Lemma 6.2. Given a class of measurementsM, then

M⊂M(σ(M)).

Given a collection of events G, then

G ⊂ σ(M(σ(G))).

If G is also a σ -field, then

G = σ(M(G)),

that is, G is the smallest σ-field with respect to which all G-measurable functions are measurable. If G is a σ-field and I(G) = {all 1_G, G ∈ G} is the collection of all indicator functions of events in G, then

G = σ(I(G)),

that is, the smallest σ -field induced by indicator functions of sets in G isthe same as that induced by all functions measurable with respect to G.

Proof. If f ∈ M, then it is σ(f)-measurable and hence in M(σ(M)). If G ∈ G, then its indicator function 1_G is σ(G)-measurable and hence 1_G ∈ M(σ(G)). Since 1_G ∈ M(σ(G)), it must be measurable with respect to σ(M(σ(G))) and hence 1_G^{-1}(1) = G ∈ σ(M(σ(G))), proving the second statement. If G is a σ-field, then since all functions in M(G) are G-measurable and since σ(M(G)) is the smallest σ-field with respect to which all functions in M(G) are measurable, G must contain σ(M(G)). Since I(G) ⊂ M(G),

σ(I(G)) ⊂ σ(M(G)) = G.

If G ∈ G, then 1_G is in I(G) and hence must be measurable with respect to σ(I(G)); hence 1_G^{-1}(1) = G ∈ σ(I(G)), and hence G ⊂ σ(I(G)). 2

We conclude this section with a reminder of the motivation for con-sidering classes of events and measurements. We shall often considersuch classes either because a particular class is important for a partic-ular application or because we are simply given a particular class. Inboth cases we may wish to study both the given events (measurements)and the related class of measurements (events). For example, given aclass of measurementsM, σ(M) provides a σ -field of events whose oc-currence or nonoccurrence is determinable by the output events of thefunctions in the class. In turn, M(σ(M)) is the possibly larger class offunctions whose output events are determinable from the occurrence ornonoccurrence of events in σ(M) and hence by the output events ofM. Thus knowing all output events of measurements inM is effectivelyequivalent to knowing all output events of measurements in the morestructured and possibly larger class M(σ(M)), which is in turn equiva-lent to knowing the occurrence or nonoccurrence of events in the σ -fieldσ(M). Hence when a class M is specified, we can instead consider themore structured classes σ(M) or M(σ(M)). From the previous lemma,this is as far as we can go; that is,

σ(M(σ(M))) = σ(M). (6.1)

We have seen that a function g will be in M(σ(f)) if and only if itdepends on the underlying sample points through the value of f . Sincewe did not restrict the structure of f , it could, for example, be a randomvariable, vector, or sequence, that is, the conclusion is true for countablecollections of measurements as well as individual measurements. If in-stead we have a general classM of measurements, then it is still a usefulintuition to think ofM(σ(M)) as being the class of all functions that de-pend on the underlying points only through the values of the functionsinM.

Exercises

1. Which of the following relations are true and which are false?

f 2 ∈ σ(f), f ∈ σ(f 2)

f + g ∈ σ(f ,g), f ∈ σ(f + g)


2. If f : A → Ω, g : Ω → Ω, and g(f) : A → Ω is defined by g(f)(x) =g(f(x)), then g(f) ∈ σ(f).

3. Given a classM of measurements, is⋃f∈M σ(f) a σ -field?

4. Suppose that (A,B) is a measure space and B is separable with acountable generating class Vn; n = 1,2 . . .. Describe σ(1Vn ; n =1,2, . . .).

6.2 Restrictions of Measures

The first application of the classes of measurements or events considered in the previous section is the notion of the restriction of a probability measure to a sub-σ-field. This occasionally provides a shortcut to evaluating expectations of functions that are measurable with respect to sub-σ-fields and to comparing such functions. Given a probability space (Ω, B, m) and a sub-σ-field G of B, define the restriction of m to G, m_G, by

m_G(F) = m(F), F ∈ G.

Thus (Ω, G, m_G) is a new probability space with a smaller event space. The following lemma shows that if f is a G-measurable real-valued random variable, then its expectation can be computed with respect to either m or m_G.

Lemma 6.3. Given a G-measurable real-valued measurement f ∈ L^1(m), then also f ∈ L^1(m_G) and

∫ f dm = ∫ f dm_G,

where m_G is the restriction of m to G.

Proof. If f is a simple function, the result is immediate from the defini-tion of restriction. More generally use Lemma 5.3(e) to infer that qn(f)is a sequence of simple G-measurable functions converging to f andcombine the simple function result with Corollary 5.2 applied to bothmeasures. 2

Corollary 6.1. Given G-measurable functions f, g ∈ L^1(m), if

∫_F f dm ≤ ∫_F g dm, all F ∈ G,

then f ≤ g m-a.e. If the preceding holds with equality, then f = g m-a.e.

Proof. From the previous lemma, the integral inequality holds with mreplaced by mG , and hence from Lemma 5.12 the conclusion holds mG-


a.e. Thus there is a set, say G, in G with mG probability one and hencealso m probability one for which the conclusion is true. 2

The usefulness of the preceding corollary is that it allows the com-parison of G-measurable functions by considering only the restrictedmeasures and the corresponding expectations.

6.3 Elementary Conditional Probability

Say we have a probability space (Ω,B,m), how is the probability mea-sure m altered if we are told that some event or collection of eventsoccurred? For example, how is it influenced if we are given the outputsof a measurement or collection of measurements? The notion of condi-tional probability provides a response to this question. In fact there aretwo notions of conditional probability: elementary and nonelementary.

Elementary conditional probabilities cover the case where we are given an event, say F, having nonzero probability: m(F) > 0. We would like to define a conditional probability measure m(G|F) for all events G ∈ B. Intuitively, being told an event F occurred will put zero probability on the collection of all points outside F, but it should not affect the relative probabilities of the various events inside F. In addition, the new probability measure must be renormalized so as to assign probability one to the new “certain event” F. This suggests the definition m(G|F) = k m(G ∩ F), where k is a normalization constant chosen to ensure that m(F|F) = k m(F ∩ F) = k m(F) = 1. Thus we define, for any F such that m(F) > 0, the conditional probability measure

m(G|F) = m(G ∩ F)/m(F), all G ∈ B.

We shall often abbreviate the elementary conditional probability measure m(·|F) by m_F. Given a probability space (Ω, B, m) and an event F ∈ B with m(F) > 0, we have a new probability space (F, B ∩ F, m_F), where B ∩ F = {all sets of the form G ∩ F, G ∈ B}. It is easy to see that we can relate expectations with respect to the conditional and unconditional measures by

E_{m_F}(f) = ∫ f dm_F = ( ∫_F f dm ) / m(F) = E_m(1_F f) / m(F),  (6.2)

where the existence of either side ensures that of the other. In particular, f ∈ L^1(m_F) if and only if f 1_F ∈ L^1(m). Note further that if G = F^c, then

E_m f = m(F) E_{m_F}(f) + m(G) E_{m_G}(f).  (6.3)
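A quick finite-space check of these relations (an illustrative sketch of my own, not from the text; the die-based events are arbitrary choices):

```python
from fractions import Fraction

# A finite probability space: a fair six-sided die.
omega = [1, 2, 3, 4, 5, 6]
m = {w: Fraction(1, 6) for w in omega}

prob = lambda event: sum(m[w] for w in event)
expect = lambda f, measure: sum(f(w) * measure[w] for w in measure)

F = {2, 4, 6}                                   # conditioning event: "even outcome"
G = {w for w in omega if w not in F}            # its complement

# elementary conditional measures m_F(.) = m(. ∩ F)/m(F) and likewise m_{F^c}
mF = {w: m[w] / prob(F) for w in F}
mG = {w: m[w] / prob(G) for w in G}

f = lambda w: w * w                             # an arbitrary measurement
# (6.2): E_{m_F}(f) = E_m(1_F f)/m(F)
assert expect(f, mF) == sum(f(w) * m[w] for w in F) / prob(F)
# (6.3): E_m f = m(F) E_{m_F}(f) + m(F^c) E_{m_{F^c}}(f)
assert expect(f, m) == prob(F) * expect(f, mF) + prob(G) * expect(f, mG)
print("conditional measure m_F:", {w: str(p) for w, p in mF.items()})
```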


Instead of being told that a particular event occurred, we might be told that a random variable or measurement f is discrete and takes on a specific value, say a, with nonzero probability. Then the elementary definition immediately yields

m(G|f = a) = m(G ∩ {f = a}) / m({f = a}).

If, however, the measurement is not discrete and takes on a particu-lar value with probability zero, then the preceding elementary definitiondoes not work. One might attempt to replace the previous definition bysome limiting form, but this does not yield a useful theory in general andit is clumsy. The standard alternative approach is to replace the preced-ing constructive definition by a descriptive definition, that is, to defineconditional probabilities by the properties that they should possess. Themathematical problem is then to prove that a function possessing thedesired properties exists.

In order to motivate the descriptive definition, we make several observations on the elementary case. First note that the previous conditional probability depends on the value a assumed by f, that is, it is a function of a. It will prove more useful and more general instead to consider it a function of ω that depends on ω only through f(ω), that is, to consider conditional probability as a function m(G | f)(ω) = m(G|{λ : f(λ) = f(ω)}), or, simply, m(G | f = f(ω)), the probability of an event G given that f assumes the value f(ω). Thus a conditional probability is a function of the points in the underlying sample space and hence is itself a random variable or measurement. Since it depends on ω only through f, from Lemma 6.1 the function m(G | f) is measurable with respect to σ(f), the σ-field induced by the given measurement. This leads to the first property:

For any fixed G, m(G|f) is σ(f)-measurable.  (6.4)

Next observe that in the elementary case we can compute the probability m(G) by averaging or integrating the conditional probability m(G | f) over all possible values of f; that is,

∫ m(G|f) dm = Σ_a m(G|f = a) m(f = a) = Σ_a m(G ∩ {f = a}) = m(G).  (6.5)

In fact we can and must say more about such averaging of conditionalprobabilities. Suppose that F is some event in σ(f) and hence its occur-rence or nonoccurrence is determinable by observation of the value off , that is, from Lemma 6.1 1F(ω) = h(f(ω)) for some function h. Thus


if we are given the value of f(ω), being told that ω ∈ F should not add any knowledge and hence should not alter the conditional probability of an event G given f. To try to pin down this idea, assume that m(F) > 0 and let m_F denote the elementary conditional probability measure given F, that is, m_F(G) = m(G ∩ F)/m(F). Applying (6.5) to the conditional measure m_F yields the formula

∫ m_F(G|f) dm_F = m_F(G),

where m_F(G | f) is the conditional probability of G given the outcome of the random variable f and given that the event F occurred. But we have argued that this should be the same as m(G | f). Making this substitution, multiplying both sides of the equation by m(F), and using (6.2) we derive

∫_F m(G|f) dm = m(G ∩ F), all F ∈ σ(f).  (6.6)

To make this plausibility argument rigorous observe that since f is assumed discrete we can write

f(ω) = Σ_{a∈A} a 1_{f^{-1}(a)}(ω)

and hence

1_F(ω) = h(f(ω)) = h( Σ_{a∈A} a 1_{f^{-1}(a)}(ω) )

and therefore

F = ∪_{a: h(a)=1} f^{-1}(a).

We can then write

∫_F m(G|f) dm = ∫ 1_F m(G|f) dm = ∫ h(f) m(G|f) dm
             = Σ_a h(a) [ m(G ∩ {f = a}) / m(f = a) ] m(f = a) = Σ_a h(a) m(G ∩ f^{-1}(a))
             = m( G ∩ ∪_{a: h(a)=1} f^{-1}(a) )
             = m(G ∩ F).

Although (6.6) was derived for F with m(F) > 0, the equation is triv-ially satisfied ifm(F) = 0 since both sides are then 0. Hence the equationis valid for all F in σ(f) as stated. Eq. (6.6) states that not only must webe able to average m(G | f) to get m(G), a similar relation must holdwhen we cut down the average to any σ(f)-measurable event.
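The following short sketch (an illustrative construction of mine, not from the text) verifies (6.5) and (6.6) on a finite space for a discrete measurement f: it builds m(G|f)(ω) = m(G | {λ : f(λ) = f(ω)}) and checks that ∫_F m(G|f) dm = m(G ∩ F) for every F ∈ σ(f), i.e., for every union of atoms of f.

```python
from fractions import Fraction
from itertools import combinations

omega = list(range(8))
m = {w: Fraction(1, 8) for w in omega}          # uniform measure on 8 points
f = lambda w: w % 3                             # a discrete measurement with values 0, 1, 2
G = {0, 1, 2, 3}                                # an arbitrary event

prob = lambda E: sum(m[w] for w in E)
atoms = {a: {w for w in omega if f(w) == a} for a in set(map(f, omega))}

# m(G|f)(w) = m(G ∩ {f = f(w)}) / m({f = f(w)})
cond = {w: prob(G & atoms[f(w)]) / prob(atoms[f(w)]) for w in omega}

# (6.5): integrating the conditional probability recovers m(G)
assert sum(cond[w] * m[w] for w in omega) == prob(G)

# (6.6): for every F in sigma(f), i.e., every union of atoms of f,
#        the integral of m(G|f) over F equals m(G ∩ F)
atom_list = list(atoms.values())
for k in range(len(atom_list) + 1):
    for chosen in combinations(atom_list, k):
        F = set().union(*chosen) if chosen else set()
        assert sum(cond[w] * m[w] for w in F) == prob(G & F)
print("(6.5) and (6.6) verified on this finite example")
```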


Equations (6.4) and (6.6) provide the properties needed for a rigorousdescriptive definition of the conditional probability of an event G given arandom variable or measurement f . In fact, this conditional probabilityis defined as any function m(G | f)(ω) satisfying these two equations.At this point, however, little effort is saved by confining interest to a sin-gle conditioning measurement, and hence we will develop the theory formore general classes of measurements. In order to do this, however, werequire a basic result from measure theory—the Radon-Nikodym theo-rem. The next two sections develop this result.

Exercises

1. Let f and g be discrete measurements on a probability space (Ω, B, m). Suppose in particular that

   g(ω) = Σ_i a_i 1_{G_i}(ω).

   Define a new random variable E(g|f) by

   E(g|f)(ω) = E_{m(·|f(ω))}(g) = Σ_i a_i m(G_i|f(ω)).

   This random variable is called the conditional expectation of g given f. Prove that

   E_m(g) = E_m[E(g|f)].

6.4 Projections

One of the proofs of the Radon-Nikodym theorem is based on the projection theorem of Hilbert space theory. As this proof is more intuitive than others (at least to me) and as the projection theorem is of interest in its own right, this section is devoted to developing the properties of projections and the projection theorem. Good references for the theory of Hilbert spaces and projections are Halmos [47] and Luenberger [67].

Recall from Chapter 1 that a Hilbert space is a complete inner product space as defined in Example 1.10, which is in turn a normed linear space as in Example 1.8. In fact, all of the results considered in this section will be applied in the next section to a very particular Hilbert space, the space L^2(m) of all square-integrable functions with respect to a probability measure m. Since this space is Polish, it is a Hilbert space.


Let L be a Hilbert space with inner product (f,g) and norm ‖f‖ = (f,f)^{1/2}. For the case where L = L2(m), the inner product is given by

(f,g) = Em(fg).

A (measurable) subset M of L will be called a subspace of L if f, g ∈ M implies that af + bg ∈ M for all scalar multipliers a, b. We will be primarily interested in closed subspaces of L. Recall from Lemma 4.2 that a closed subset of a Polish space is also Polish with the same metric, and hence M is a complete space with the same norm.

A fundamentally important property of linear spaces with inner products is the parallelogram law

‖f + g‖^2 + ‖f − g‖^2 = 2‖f‖^2 + 2‖g‖^2. (6.7)

This is proved simply by expanding the left-hand side as

(f,f) + 2(f,g) + (g,g) + (f,f) − 2(f,g) + (g,g).

A simple application of this formula and the completeness of a subspace yield the following result: given a fixed element f in a Hilbert space and given a closed subspace of the Hilbert space, we can always find a unique member of the subspace that provides the best approximation to f in the sense of minimizing the norm of the difference.

Lemma 6.4. Fix an element f in a Hilbert space L and let M be a closed subspace of L. Then there is a unique element f̂ ∈ M such that

‖f − f̂‖ = inf_{g∈M} ‖f − g‖, (6.8)

where elements f and g of L are considered identical if ‖f − g‖ = 0. Thus, in particular, the infimum is actually a minimum.

Proof. Let ∆ denote the infimum of (6.8) and choose a sequence of fn ∈ M so that ‖f − fn‖ → ∆. From the parallelogram law

‖fn − fm‖^2 = 2‖fn − f‖^2 + 2‖fm − f‖^2 − 4‖(fn + fm)/2 − f‖^2.

Since M is a subspace, (fn + fm)/2 ∈ M and hence the latter norm squared is bound below by ∆^2. This means, however, that

lim_{n→∞, m→∞} ‖fn − fm‖^2 ≤ 2∆^2 + 2∆^2 − 4∆^2 = 0,

and hence fn is a Cauchy sequence. Since M is complete, it must have a limit, say f̂. From the triangle inequality

‖f − f̂‖ ≤ ‖f − fn‖ + ‖fn − f̂‖ →_{n→∞} ∆,

which proves that there is an f̂ in M with the desired property. Suppose next that there are two such functions, e.g., f̂ and ĝ. Again invoking (6.7),

‖f̂ − ĝ‖^2 = 2‖f̂ − f‖^2 + 2‖ĝ − f‖^2 − 4‖(f̂ + ĝ)/2 − f‖^2.

Since (f̂ + ĝ)/2 ∈ M, the latter norm can be no smaller than ∆, and hence

‖f̂ − ĝ‖^2 ≤ 2∆^2 + 2∆^2 − 4∆^2 = 0,

and therefore ‖f̂ − ĝ‖ = 0 and the two are identical. 2

Two elements f and g of L will be called orthogonal, and we write f ⊥ g, if (f,g) = 0. If M ⊂ L, we say that f ∈ L is orthogonal to M if f is orthogonal to every g ∈ M. The collection of all f ∈ L which are orthogonal to M is called the orthogonal complement of M.

Lemma 6.5. Given the assumptions of Lemma 6.4, f̂ will yield the infimum if and only if f − f̂ ⊥ M; that is, f̂ is chosen so that the error f − f̂ is orthogonal to the subspace M.

Proof. First we prove necessity. Suppose that f − f̂ is not orthogonal to M. We will show that f̂ cannot then yield the infimum of Lemma 6.4. By assumption there is a g ∈ M such that g is not orthogonal to f − f̂. We can assume that ‖g‖ = 1 (otherwise just normalize g by dividing by its norm). Define a = (f − f̂, g) ≠ 0. Now set h = f̂ + ag. Clearly h ∈ M and we now show that it is a better estimate of f than f̂ is. To see this write

‖f − h‖^2 = ‖f − f̂ − ag‖^2
= ‖f − f̂‖^2 + a^2‖g‖^2 − 2a(f − f̂, g)
= ‖f − f̂‖^2 − a^2 < ‖f − f̂‖^2

as claimed. Thus orthogonality is necessary for the minimum norm estimate. To see that it is sufficient, suppose that f − f̂ ⊥ M. Let g be any other element in M. Then

‖f − g‖^2 = ‖f − f̂ + f̂ − g‖^2
= ‖f − f̂‖^2 + ‖f̂ − g‖^2 + 2(f − f̂, f̂ − g).

Since f̂ − g ∈ M, the inner product is 0 by assumption, proving that ‖f − g‖ is strictly greater than ‖f − f̂‖ if ‖f̂ − g‖ is not identically 0. 2

Combining the two lemmas we have the following famous result.

Theorem 6.1. (The Projection Theorem)


Suppose that M is a closed subspace of a Hilbert space L and that f ∈ L. Then there is a unique f̂ ∈ M with the properties that

f − f̂ ⊥ M

and

‖f − f̂‖ = inf_{g∈M} ‖f − g‖.

The resulting f̂ is called the projection of f onto M and will be denoted PM(f).
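
The projection of Theorem 6.1 is easy to compute explicitly in finite-dimensional examples. The sketch below is purely illustrative (the weights, the function f, and the subspace are assumptions): L2(m) on a six-point space is treated as R^6 with the weighted inner product (f,g) = Em(fg), f is projected onto the span of two indicator functions by solving the normal equations, and the error f − f̂ is checked to be orthogonal to the subspace. Note that f̂ comes out constant on each atom of the underlying partition, anticipating the connection with conditional expectation in exercise 3 below.

import numpy as np

m = np.full(6, 1/6)                       # probability masses (assumption)
f = np.array([3.0, -1.0, 2.0, 0.5, 4.0, 1.0])
# M spanned by indicators of the partition {0,1,2}, {3,4,5}
B = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1]], dtype=float).T   # 6 x 2 basis matrix

W = np.diag(m)                            # inner product (u, v) = E_m[u v]
A = B.T @ W @ B                           # Gram matrix of the basis
c = np.linalg.solve(A, B.T @ W @ f)       # normal equations
fhat = B @ c                              # projection P_M(f)

# f - fhat is orthogonal to every element of M (here, to each basis vector):
print(np.allclose(B.T @ W @ (f - fhat), 0))
# and fhat beats an arbitrary alternative element of M in weighted squared error:
print(np.dot(m, (f - fhat)**2) <= np.dot(m, (f - B @ np.array([1.0, 2.0]))**2))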

Exercises

1. Show that projections satisfy the following properties:

a. If f ∈ M, then PM(f) = f.
b. For any scalar a, PM(af) = aPM(f).
c. For any f, g ∈ L and scalars a, b: PM(af + bg) = aPM(f) + bPM(g).

2. Consider the special case of L2(m) of all square-integrable functions with respect to a probability measure m. Let P be the projection with respect to a closed subspace M. Prove the following:

a. If f is a constant, say c, then P(f) = c.
b. If f ≥ 0 with probability one, then P(f) ≥ 0.

3. Suppose that g is a discrete measurement and hence is in L2(m). Let M be the space of all square-integrable σ(g)-measurable functions. Assume that M is complete. Given a discrete measurement f, let E(f|g) denote the conditional expectation of exercise 5.4.1. Show that E(f|g) = PM(f), the projection of f onto M.

4. Suppose that H ⊂ M are closed subspaces of a Hilbert space L. Show that for any f ∈ L,

‖f − PM(f)‖ ≤ ‖f − PH(f)‖.

6.5 The Radon-Nikodym Theorem

This section is devoted to the statement and proof of the Radon-Nikodym theorem, one of the fundamental results of measure and integration theory and a key result in the proof of the existence of conditional probability measures.

A measure m is said to be absolutely continuous with respect to another measure P on the same measurable space, and we write m ≪ P, if P(F) = 0 implies m(F) = 0, that is, m inherits all of P's zero probability events.

Theorem 6.2. The Radon-Nikodym Theorem
Given two measures m and P on a measurable space (Ω,B) such that m ≪ P, then there is a measurable function h : Ω → R with the properties that h ≥ 0 P-a.e. (and hence also m-a.e.) and

m(F) = ∫_F h dP, all F ∈ B. (6.9)

If g and h both satisfy (6.9), then g = h P-a.e. If in addition f : Ω → R is measurable, then

∫_F f dm = ∫_F f h dP, all F ∈ B. (6.10)

The function h is called the Radon-Nikodym derivative of m with respect to P and is written h = dm/dP.

Proof. First note that (6.10) follows immediately from (6.9) for simple functions. The general result (6.10) then follows, using the usual approximation techniques. Hence we need prove only (6.9), which will take some work.

Begin by defining a mixture measure q = (m + P)/2, that is, q(F) = m(F)/2 + P(F)/2 for all events F. For any f ∈ L2(q), necessarily f ∈ L2(m), and hence Emf is finite from the Cauchy-Schwarz inequality. As a first step in the proof we will show that there is a g ∈ L2(q) for which

Emf = ∫ f g dq, all f ∈ L2(q). (6.11)

Define the class of measurements M = {f : f ∈ L2(q), Emf = 0} and observe that the properties of expectation imply that M is a closed linear subspace of L2(q). Hence the projection theorem implies that any f ∈ L2(q) can be written as f = f̄ + f′, where f̄ ∈ M and f′ ⊥ M. Furthermore, this representation is unique q-a.e. If all of the functions in the orthogonal complement of M were identically 0, then f = f̄ ∈ M and (6.11) would be trivially true with f′ = 0, since Emf = 0 for f ∈ M by definition. Hence we can assume that there is some r ⊥ M that is not identically 0. Furthermore, ∫ r dm ≠ 0, lest we have an r ∈ M such that r ⊥ M, which can only be if r is identically 0. Define, for x ∈ Ω,


g(x) = r(x) Emr / ‖r‖^2,

where here ‖r‖ is the L2(q) norm. Clearly g ∈ L2(q) and by construction g ⊥ M since r is. We have that

∫ f g dq = ∫ (f − r Emf/Emr) g dq + ∫ r g (Emf/Emr) dq.

The first term on the right is 0 since g ⊥ M and f − r(Emf)/(Emr) has zero expectation with respect to m and hence is in M. Thus

∫ f g dq = (Emf/Emr) ∫ r g dq = (Emf/Emr) (Emr/‖r‖^2) ‖r‖^2 = Emf,

which proves (6.11). Observe that if we set f = 1F for any event F, then

Em1F = m(F) = ∫_F g dq,

a formula which resembles the desired formula (6.9) except that the integral is with respect to q instead of m. From the definition of q, 0 ≤ m(F) ≤ 2q(F) and hence

0 ≤ ∫_F g dq ≤ ∫_F 2 dq, all events F.

From Lemma 5.12 this implies that g ∈ [0,2] q-a.e. Since m ≪ q and P ≪ q, this statement also holds m-a.e. and P-a.e.

As a next step rewrite (6.11) in terms of P and m as

∫ f (1 − g/2) dm = ∫ f (g/2) dP, all f ∈ L2(q). (6.12)

We next argue that (6.12) holds more generally for all measurable f in the sense that each integral exists if and only if the other does, in which case they are equal. To see this, first assume that f ≥ 0 and choose the usual quantizer sequence of Lemma 5.3. Then (6.12) applied to the bounded simple functions qn(f) ∈ L2(q) implies that

∫ qn(f)(1 − g/2) dm = ∫ qn(f)(g/2) dP

for all n. Since g is between 0 and 2 with both m and P probability 1, both integrands are nonnegative a.e. and converge upward together to a finite number or diverge together. In either case, the limit is ∫ f(1 − g/2) dm = ∫ f(g/2) dP and (6.12) holds. For general f apply this argument to the pieces of the usual decomposition f = f^+ − f^−, and the conclusion follows, both integrals either existing or not existing together.


To complete the proof, define the set B = {x : g(x) = 2}. Application of (6.12) to 1B implies that

P(B) = ∫_B dP = ∫ 1B (g/2) dP = ∫ 1B (1 − g/2) dm = 0,

and hence P(Bc) = 1. Combining this fact with (6.12) with f = (1 − g/2)^{-1} implies that for any event F

m(F) = m(F ∩ B) + m(F ∩ Bc)
= m(F ∩ B) + ∫_{F∩Bc} [(1 − g/2)/(1 − g/2)] dm
= m(F ∩ B) + ∫_{F∩Bc} [(g/2)/(1 − g/2)] dP. (6.13)

Define the function h by

h(x) = [g(x)/(2 − g(x))] 1_{Bc}(x), (6.14)

and (6.13) becomes

m(F) = m(F ∩ B) + ∫_{F∩Bc} h dP, all F ∈ B. (6.15)

Note that (6.15) is true in general; that is, we have not yet used the fact that m ≪ P. Taking this fact into account, m(B) = 0, and hence since P(Bc) = 1,

m(F) = ∫_F h dP (6.16)

for all events F. This formula combined with (6.14) and the fact that g ∈ [0,2] with P and m probability 1 completes the proof. 2

In fact we have proved more than just the Radon-Nikodym theorem; we have also almost proved the Lebesgue decomposition theorem. We next complete this task.

Theorem 6.3. The Lebesgue Decomposition Theorem
Given two probability measures m and P on a measurable space (Ω,B), there are unique probability measures ma and p and a number λ ∈ [0,1] such that m = λma + (1 − λ)p, where ma ≪ P and where p and P are singular in the sense that there is a set G ∈ B with P(G) = 1 and p(G) = 0. ma is called the part of m absolutely continuous with respect to P.

Proof. If m ≪ P, then the result is just the Radon-Nikodym theorem with ma = m and λ = 1. If m is not absolutely continuous with respect to P, then several cases are possible: If m(B) = 0, then the result


again follows as in (6.16). If m(B) = 1, then p = m and P are singular and the result holds with λ = 0. If m(B) ∈ (0,1), then (6.15) yields the theorem by identifying

ma(F) = ∫_F h dP / ∫_Ω h dP,

p(F) = mB(F) = m(F ∩ B)/m(B),

and

λ = ∫_Ω h dP = m(Bc),

where the last equality follows from (6.15) with F = Bc. 2

The Lebesgue decomposition theorem is important because it permits any probability measure to be decomposed into an absolutely continuous part and a singular part. The absolutely continuous part can always be computed by integrating a Radon-Nikodym derivative, and hence such derivatives can be thought of as probability density functions.
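
For purely discrete measures both constructions reduce to elementary arithmetic with point masses, as in the following sketch (the two measures below are arbitrary assumptions, not an example from the text): the part of m supported where P puts mass is the absolutely continuous part, the rest is singular, and the Radon-Nikodym derivative of the absolutely continuous part is a ratio of point masses.

# Discrete Lebesgue decomposition and Radon-Nikodym derivative (toy example).
P = {0: 0.3, 1: 0.3, 2: 0.4, 3: 0.0, 4: 0.0}
m = {0: 0.2, 1: 0.1, 2: 0.3, 3: 0.25, 4: 0.15}

supp_P = {x for x, p in P.items() if p > 0}
lam = sum(m[x] for x in supp_P)                                 # weight of the a.c. part
ma = {x: (m[x] / lam if x in supp_P else 0.0) for x in m}       # m_a << P
p_sing = {x: (m[x] / (1 - lam) if x not in supp_P else 0.0) for x in m}
h = {x: (ma[x] / P[x] if x in supp_P else 0.0) for x in m}      # h = dm_a/dP

# check m = lam * ma + (1 - lam) * p_sing and ma(F) = integral of h dP over F
F = {1, 2, 3}
print(abs(sum(m[x] for x in F)
          - (lam * sum(ma[x] for x in F) + (1 - lam) * sum(p_sing[x] for x in F))) < 1e-12)
print(abs(sum(ma[x] for x in F) - sum(h[x] * P[x] for x in F)) < 1e-12)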

Exercises

1. Show that if m ≪ P, then P(F) = 1 implies that m(F) = 1. Two measures m and P are said to be equivalent if m ≪ P and P ≪ m. How are the Radon-Nikodym derivatives dP/dm and dm/dP related?

6.6 Probability Densities

The Radon-Nikodym theorem provides a recipe for constructing many common examples of random variables, vectors, and processes. Suppose that λ is the (uncompleted) Lebesgue measure (i.e., the Borel measure) on some interval in R or on R itself. Completion is not an issue here, but we will refer to the measure as Lebesgue, and completing the measure has no effect save to expand the σ-field of the measurable space. Suppose that P is a probability measure on (R,B(R)) and that P ≪ λ. Then the Radon-Nikodym theorem implies that there is a nonnegative function f = dP/dλ such that

P(F) = ∫_F f dλ, (6.17)

an equation which is commonly written as

P(F) = ∫_F f(x) dx. (6.18)


From the axioms of probability

∫_R f(x) dx = 1. (6.19)

A nonnegative function which integrates to 1 and whose integral over an event F gives the probability of the event is called a probability density function or PDF. Distributions of continuous real-valued random variables are usually described by their PDFs. A few common examples are listed below.

Uniform PDF Given −∞ < a < b < ∞,

f(x) = 1/(b − a) for x ∈ [a,b), and f(x) = 0 otherwise.

This is simply a normalized Lebesgue measure.

Exponential PDF Given λ > 0,

f(x) = λe^{−λx} for x ≥ 0, and f(x) = 0 otherwise.

I apologize for using λ to denote a positive constant so soon after having used it to denote Lebesgue measure, but this use of λ is ubiquitous in the engineering literature.

Doubly exponential or Laplacian PDF Given λ > 0,

f(x) = (λ/2)e^{−λ|x|}, x ∈ R.

Gaussian PDF Given σ^2 > 0, m ∈ R,

f(x) = (1/√(2πσ^2)) e^{−(x−m)^2/(2σ^2)}, x ∈ R.

All of these densities can be used to construct IID random processes as in Sections 3.8 and 4.5, which considered only discrete alphabets and uniformly distributed marginals. In turn, these can be used to construct B-processes, generalizing the special cases of Sections 3.9 and 4.5.

In a few examples it is possible to specify consistent multidimensional probability distributions and hence a random process using PDFs. The most important examples of such distributions are the IID distributions and the Gaussian PDF.

Multidimensional Gaussian PDF A k-dimensional PDF f(x), x ∈ R^k, is Gaussian if it has the form

f(x) = (2π)^{−k/2} (det Λ)^{−1/2} e^{−(x−m)^t Λ^{−1} (x−m)/2}, x ∈ R^k, (6.20)


where t denotes transpose, m ∈ R^k, and where Λ is a symmetric positive definite matrix, that is, a^t Λ a > 0 for all nonzero a ∈ R^k.

The above definition assumes that Λ is (strictly) positive definite and hence invertible. The definition of a Gaussian random vector (and later process) can be extended to the case where Λ is only assumed to be nonnegative definite and hence not necessarily invertible, but here we stick to the simpler case. It can be shown with some tedious calculus that m is the mean vector in the sense that if X is a random vector with a Gaussian PDF as in (6.20), then EX = m, and that Λ is the covariance matrix, E[(X − m)(X − m)^t] = Λ; det Λ is the determinant of the matrix Λ. Since the matrix Λ is positive definite, the inverse of Λ exists and hence the PDF is well defined. A random vector with a distribution specified by a Gaussian PDF is said to be a Gaussian random vector.
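
Formula (6.20) is easy to evaluate numerically. The sketch below is a direct transcription of (6.20); the mean vector, covariance matrix, and evaluation point are illustrative assumptions only.

import numpy as np

def gaussian_pdf(x, mean, Lam):
    # evaluate the k-dimensional Gaussian PDF (6.20) at x
    k = len(mean)
    d = x - mean
    quad = d @ np.linalg.inv(Lam) @ d
    return (2 * np.pi) ** (-k / 2) * np.linalg.det(Lam) ** (-0.5) * np.exp(-quad / 2)

mvec = np.array([0.0, 1.0])
Lam = np.array([[2.0, 0.5],
                [0.5, 1.0]])            # symmetric, positive definite (assumption)
x = np.array([0.3, 0.7])
print(gaussian_pdf(x, mvec, Lam))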

Gaussian random process A random process {Xt; t ∈ T} is Gaussian if there is a mean function {mt; t ∈ T} and a covariance function {Λ(t,s); t,s ∈ T} such that for every positive integer k and collection of times t0, t1, . . . , tk−1 the random vector (Xt0, Xt1, . . . , Xtk−1) has a Gaussian distribution with mean vector (mt0, mt1, . . . , mtk−1) and covariance matrix {Λ(ti,tj); i,j = 0,1,...,k−1}.

Gaussian random vectors and processes have many nice properties. See, for example, [20, 37]. These are important enough to list (without proof):

• Linear operations on Gaussian random vectors (processes) produce new Gaussian random vectors (processes). This implies that if a Gaussian IID process is the input to a time-invariant linear filter as in Section 4.5, then the output is also Gaussian.
• If a stationary Gaussian random process Xn is uncorrelated in the sense that for all n

RX(n) = E(XnX0) = E(X0^2)δ_{n,0},

then the process is also IID.
• A Gaussian random process is said to be nondeterministic if the logarithm of the Fourier transform of its autocorrelation function is not negative infinity. Any nondeterministic Gaussian process can be modeled as a time-invariant linear filtering of an IID Gaussian process. (This is not too hard to prove for the finite-order filters considered earlier; it requires significant extra machinery in the general case.) Roughly speaking, a nondeterministic Gaussian process is one for which one can never perfectly predict the future given the past.
• Many real-world phenomena are well modeled as Gaussian processes. This includes thermal noise and other processes formed by adding up a large number of independent random components (the central limit theorem).


Our main purpose in introducing such processes is that they provide a nice example for process metrics and for information measures in the sequel.

6.7 Conditional Probability

Assume that we are given a probability space (Ω,B,m) and a class of measurements M, and we wish to define a conditional probability m(G|M) for all events G. Intuitively, if we are told the output values of all of the measurements f in the class M (which might contain only a single measurement or an uncountable family), what then is the probability that G will occur? The first observation is that we have not yet specified what the given output values of the class are, only the class itself. We can easily nail down these output values by thinking of the conditional probability as a function of ω ∈ Ω, since given ω all of the f(ω), f ∈ M, are fixed. Hence we will consider a conditional probability as a function of sample points of the form m(G|M)(ω) = m(G|f(ω), all f ∈ M), that is, the probability of G given that f = f(ω) for all measurements f in the given collection M. Analogous to (6.4), since m(G|M) is to depend only on the output values of the f ∈ M, mathematically it should be in M(σ(M)), that is, it should be measurable with respect to σ(M), the σ-field induced by the conditioning class.

Analogous to (6.6), the conditional probability of G given M should not be changed if we are also given that an event in σ(M) occurred, and hence averaging yields

m(G ∩ F) = ∫_F m(G|M) dm, F ∈ σ(M).

This leads us to the following definition. Given a class of measurements M and an event G, the conditional probability m(G|M) of G given M is defined as any function such that

m(G|M) is σ(M)-measurable, and (6.21)

m(G ∩ F) = ∫_F m(G|M) dm, all F ∈ σ(M). (6.22)

Clearly we have yet to show that in general such a function exists. First, however, observe that the preceding definition really depends on M only through the σ-field that it induces. Hence more generally we can define a conditional probability of an event G given a sub-σ-field G as any function m(G|G) such that

m(G|G) is G-measurable, and (6.23)

m(G ∩ F) = ∫_F m(G|G) dm, all F ∈ G. (6.24)

Intuitively, the conditional probability given a σ-field is the conditional probability given the knowledge of which events in the σ-field occurred and which did not or, equivalently, given the outcomes of the indicator functions for all events in the σ-field. That is, as stated in Lemma 6.2, if G is a σ-field and M = {all 1F; F ∈ G}, then G = σ(M) and hence m(G|G) = m(G|M).

By construction,

m(G|M) = m(G|σ(M)). (6.25)

The conditional probability m(G|M(σ(M))) given the induced class of measurements is given similarly by m(G|σ(M(σ(M)))). From (6.1), however, the conditioning σ-field is exactly σ(M). Thus

m(G|M) = m(G|σ(M)) = m(G|M(σ(M))), (6.26)

reinforcing the intuition discussed in Section 6.1 that knowing the outcomes of functions in M is equivalent to knowing the occurrence or nonoccurrence of events in σ(M), which is in turn equivalent to knowing the outcomes of functions in M(σ(M)). It will usually be more convenient to develop general results for the more general notion of a conditional probability given a σ-field.

We are now ready to demonstrate the existence and essential uniqueness of conditional probability.

Theorem 6.4. Given a probability space (Ω,B,m), an event G ∈ B, and a sub-σ-field G, then there exists a version of the conditional probability m(G|G). Furthermore, any two such versions are equal m-a.e.

Proof. If m(G) = 0, then (6.24) becomes

0 = ∫_F m(G|G) dm, all F ∈ G,

and hence if m(G|G) is G-measurable, it is 0 m-a.e. from Corollary 6.1, completing the proof for this case. If m(G) ≠ 0, then define the probability measure mG on (Ω,G) as the elementary conditional probability

mG(F) = m(F|G) = m(F ∩ G)/m(G), F ∈ G.

This measure, the restriction of m(·|G) to G, is absolutely continuous with respect to m^G, the restriction of m to G. Thus mG ≪ m^G and hence from the Radon-Nikodym theorem there exists an essentially unique almost everywhere nonnegative G-measurable function h such that

mG(F) = ∫_F h dm^G,

and hence from Lemma 6.3

mG(F) = ∫_F h dm,

and hence

m(F ∩ G)/m(G) = ∫_F h dm, F ∈ G,

and hence m(G)h is the desired conditional probability measure m(G|G).

We close this section with a useful technical property: a chain rule for Radon-Nikodym derivatives.

Lemma 6.6. Suppose that three measures on a common space satisfy m ≪ λ ≪ P. Then P-a.e.

dm/dP = (dm/dλ)(dλ/dP).

Proof. From (6.9) and (6.10)

m(F) = ∫_F (dm/dλ) dλ = ∫_F (dm/dλ)(dλ/dP) dP,

which defines (dm/dλ)(dλ/dP) as a version of dm/dP.

Exercise

1. Given a probability space (Ω,B,m), suppose that G is a finite field with atoms Gi, i = 1,2,...,N. Is it true that

m(F|G)(ω) = Σ_{i=1}^N m(F|Gi) 1_{Gi}(ω)?

6.8 Regular Conditional Probability

Given an event F, the elementary conditional probability m(G|F) of G given F is itself a probability measure m(·|F) when considered as a function of the events G. A natural question is whether or not a similar conclusion holds in the more general case. In other words, we have defined conditional probabilities for a fixed event G and a sub-σ-field G. What if we fix the sample point ω and consider the set function m(·|G)(ω); is it a probability measure on B? In general it may not be, but we shall see two special cases in this section for which a version of the conditional probabilities exists such that with probability one m(F|G)(ω), F ∈ B, is a probability measure. The first case considers the trivial case of a discrete sub-σ-field of an arbitrary σ-field. The second case considers the more important (and more involved) case of an arbitrary sub-σ-field of a standard space.

Given a probability space (Ω,B,m) and a sub-σ-field G, a regular conditional probability given G is a function f : B × Ω → [0,1] such that

f(F,ω); F ∈ B (6.27)

is a probability measure for each ω ∈ Ω, and

f(F,ω); ω ∈ Ω (6.28)

is a version of the conditional probability of F given G.

Lemma 6.7. Given a probability space (Ω,B,m) and a discrete sub-σ-field G (that is, G has a finite or countable number of members), then there exists a regular conditional probability measure given G.

Proof. Let {Fi} denote a countable collection of atoms of the countable sub-σ-field G. Then we can use elementary probability to write a version of the conditional probability for each event G:

m(G|G)(ω) = m(G|Fi) = m(G ∩ Fi)/m(Fi); ω ∈ Fi,

for those Fi with nonzero probability. If ω is in a zero probability atom, then the conditional probability can be set to p*(G) for some arbitrary probability measure p*. If we now fix ω, it is easily seen that the given conditional probability measure is regular since for each atom Fi in the countable sub-σ-field the elementary conditional probability m(·|Fi) is a probability measure. 2
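
The construction of Lemma 6.7 is concrete enough to compute directly in a finite example. The sketch below makes arbitrary assumptions (an eight-point space, a uniform measure, a three-atom partition generating the sub-σ-field) and checks both requirements of a regular conditional probability: for fixed ω the set function is a probability measure, and averaging it over an event F in the sub-σ-field recovers m(G ∩ F).

# Regular conditional probability given a finite partition (toy example).
omega = list(range(8))
m = {w: 1/8 for w in omega}
atoms = [{0, 1, 2}, {3, 4}, {5, 6, 7}]          # atoms of the sub-sigma-field

def reg_cond_prob(G, w):
    # elementary conditional probability of G given the atom containing w
    atom = next(a for a in atoms if w in a)
    pa = sum(m[u] for u in atom)
    return sum(m[u] for u in atom if u in G) / pa

G = {1, 3, 4, 6}
# For fixed omega, G -> reg_cond_prob(G, omega) is a probability measure:
print(abs(reg_cond_prob(set(omega), 0) - 1.0) < 1e-12)
# Averaging over an event F in the sub-sigma-field recovers m(G ∩ F):
F = atoms[0] | atoms[2]
lhs = sum(reg_cond_prob(G, w) * m[w] for w in F)
print(abs(lhs - sum(m[w] for w in G & F)) < 1e-12)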

Theorem 6.5. Given a probability space (Ω,B,m) and a sub-σ-field G, then if (Ω,B) is standard there exists a regular conditional probability measure given G.

Proof. For each event G let m(G|G)(ω), ω ∈ Ω, be a version of the conditional probability of G given G. We have that

∫_F m(G|G) dm = m(G ∩ F) ≥ 0, all F ∈ G,

and hence from Corollary 6.1 m(G|G) ≥ 0 a.e. Thus for each event G there is an event, say H(G), with m(H(G)) = 1 such that if ω ∈ H(G), then m(G|G)(ω) ≥ 0. In addition, H(G) ∈ G since the indicator function of H(G) is 1 on the set of ω on which a G-measurable function is nonnegative, and hence this indicator function must also be G-measurable. Since (Ω,B) is standard, it has a basis {Fn} which generates B as in Chapter 2. Let F be the countable union of all of the sets in the basis. Define the set

H0 = ⋂_{G∈F} H(G),

and hence H0 ∈ G, m(H0) = 1, and if ω ∈ H0, then m(G|G)(ω) ≥ 0 for all G ∈ F. We also have that

∫_F m(Ω|G) dm = m(Ω ∩ F) = m(F) = ∫_F 1 dm, all F ∈ G,

and hence m(Ω|G) = 1 a.e. Let H1 be the set {ω : m(Ω|G)(ω) = 1} and hence H1 ∈ G and m(H1) = 1. Similarly, let F1, F2, . . . , Fn be any finite collection of disjoint sets in F. Then

∫_F m(⋃_{i=1}^n Fi|G) dm = m((⋃_{i=1}^n Fi) ∩ F) = m(⋃_{i=1}^n (Fi ∩ F))
= Σ_{i=1}^n m(Fi ∩ F) = Σ_{i=1}^n ∫_F m(Fi|G) dm
= ∫_F Σ_{i=1}^n m(Fi|G) dm.

Hence

m(⋃_{i=1}^n Fi|G) = Σ_{i=1}^n m(Fi|G) a.e.

Thus there is a set of probability one in G on which the preceding relation holds. Since there are a countable number of choices of finite collections of disjoint unions of elements of F, we can find a set, say H2 ∈ G, with m(H2) = 1 such that if ω ∈ H2, then m(·|G)(ω) is finitely additive on F. Note that we could not use this argument to demonstrate countable additivity on F since there are an uncountable number of choices of countable collections of elements of F. Although this is a barrier in the general case, it does not affect the standard space case. We now have a set H = H0 ∩ H1 ∩ H2 ∈ G with probability one such that for any ω in this set, m(·|G)(ω) is a normalized, nonnegative, finitely additive set function on F. We now construct the desired regular conditional probability measure.

Define f(F,ω) = m(F|G)(ω) for all ω ∈ H, F ∈ F. Fix an arbitrary probability measure P and for all ω ∉ H define f(F,ω) = P(F), all F ∈ B. From Theorem 2.6.1 F has the countable extension property, and hence for each ω ∈ H we have that f(·,ω) extends to a unique probability measure on B, which we also denote by f(·,ω). We now have a function f such that f(·,ω) is a probability measure for all ω ∈ Ω, as required. We will be done if we can show that if we fix an event G, then the function we have constructed is a version of the conditional probability for G given G, that is, if we can show that it satisfies (6.23) and (6.24).

First observe that for G ∈ F, f(G,ω) = m(G|G)(ω) for ω ∈ H, where H ∈ G, and f(G,ω) = P(G) for ω ∈ Hc. Thus f(G,ω) is a G-measurable function of ω for fixed G in F. This and the extension formula (1.38) imply that more generally f(G,ω) is a G-measurable function of ω for any G ∈ B, proving (6.23).

If G ∈ F, then by construction m(G|G)(ω) = f(G,ω) a.e. and hence (6.24) holds for f(G,·). To prove that it holds for more general G, first observe that (6.24) is trivially true for those F ∈ G with m(F) = 0. If m(F) > 0 we can divide both sides of the formula by m(F) to obtain

mF(G) = ∫_F f(G,ω) dm(ω)/m(F).

We know the preceding equality holds for all G in a generating field F. To prove that it holds for all events G ∈ B, first observe that the left-hand side is clearly a probability measure. We will be done if we can show that the right-hand side is also a probability measure, since two probability measures agreeing on a generating field must agree on all events in the generated σ-field. The right-hand side is clearly nonnegative since f(·,ω) is a probability measure for all ω. This also implies that f(Ω,ω) = 1 for all ω, and hence the right-hand side yields 1 for G = Ω. If Gk, k = 1,2,..., is a countable sequence of disjoint sets with union G, then since the f(Gi,ω) are nonnegative and the f(·,ω) are probability measures,

Σ_{i=1}^n f(Gi,ω) ↑ Σ_{i=1}^∞ f(Gi,ω) = f(G,ω),

and hence it follows from the monotone convergence theorem that

Σ_{i=1}^∞ ∫_F f(Gi,ω) dm(ω)/m(F) = lim_{n→∞} Σ_{i=1}^n ∫_F f(Gi,ω) dm(ω)/m(F)
= lim_{n→∞} ∫_F (Σ_{i=1}^n f(Gi,ω)) dm(ω)/m(F)
= ∫_F (lim_{n→∞} Σ_{i=1}^n f(Gi,ω)) dm(ω)/m(F)
= ∫_F f(G,ω) dm(ω)/m(F),

which proves the countable additivity and completes the proof of the theorem. 2

The existence of a regular conditional probability is one of the most useful aspects of probability measures on standard spaces.

We close this section by describing a variation on the regular conditional probability results. This form will be the most common use of the theorem.

Let X : Ω → AX and Y : Ω → AY be two random variables defined on a probability space (Ω,B,m). Let σ(Y) = Y^{-1}(BAY) denote the σ-field generated by the random variable Y and for each G ∈ B let m(G|σ(Y))(ω) denote a version of the conditional probability of G given σ(Y). For the moment focus on an event of the form G = X^{-1}(F) for F ∈ BAX. This function must be measurable with respect to σ(Y), and hence from Lemma 6.1 there must be a function, say g : AY → [0,1], such that m(X^{-1}(F)|σ(Y))(ω) = g(Y(ω)). Call this function g(y) = PX|Y(F|y). By changing variables we know from the properties of conditional probability that

PXY(F × D) = m(X^{-1}(F) ∩ Y^{-1}(D))
= ∫_{Y^{-1}(D)} m(X^{-1}(F)|σ(Y))(ω) dm(ω)
= ∫_{Y^{-1}(D)} PX|Y(F|Y(ω)) dm(ω) = ∫_D PX|Y(F|y) dPY(y),

where PY = mY^{-1} is the induced distribution for Y. The preceding formulas simply translate the conditional probability statements from the original space and σ-fields to the more concrete example of distributions for random variables conditioned on other random variables.

Corollary 6.2. Let X : Ω → AX and Y : Ω → AY be two random variables defined on a probability space (Ω,B,m). Then if any of the following conditions is met, there exists a regular conditional probability measure PX|Y(F|y) for X given Y:

(a) The alphabet AX is discrete.
(b) The alphabet AY is discrete.
(c) Both of the alphabets AX and AY are standard.

Proof. Using the correspondence between conditional probabilities given sub-σ-fields and random variables, two cases are immediate: If the alphabet AY of the conditioning random variable is discrete, then the result follows from Lemma 6.7. If the two alphabets are standard, then so is the product space, and the result follows in the same manner from Theorem 6.5. If AX is discrete, then its σ-field is standard and we can mimic the proof of Theorem 6.5. Let F now denote the union of all sets in a countable basis for BAX and replace m(F|G)(ω) by PX|Y(F|y) and dm by dPY. The argument then goes through virtually unchanged. 2

6.9 Conditional Expectation

In the previous section we showed that given a standard probability space (Ω,B,m) and a sub-σ-field G, there is a regular conditional probability measure given G: m(G|G)(ω), G ∈ B, ω ∈ Ω; that is, for each fixed ω, m(·|G)(ω) is a probability measure, and for each fixed G, m(G|G)(ω) is a version of the conditional probability of G given G, i.e., it satisfies (6.23) and (6.24). We can use the individual probability measures to define a conditional expectation

E(f|G)(ω) = ∫ f(u) dm(u|G)(ω) (6.29)

if the integral exists. In this section we collect a few properties of conditional expectation and relate the preceding constructive definition to the more common and more general descriptive one.

Fix an event F ∈ G and let f be the indicator function of an event G ∈ B. Integrating (6.29) over F using (6.24) and the fact that E(1G|G)(ω) = m(G|G)(ω) yields

∫_F E(f|G) dm = ∫_F f dm (6.30)

since the right-hand side is simply m(G ∩ F). From the linearity of expectation (by (6.29) a conditional expectation is simply an ordinary expectation with respect to a measure that is a regular conditional probability measure for a particular sample point) and integration, (6.30) also holds for simple functions. For a nonnegative measurable f, take the usual quantizer sequence qn(f) ↑ f of Lemma 5.3, and (6.30) then holds since the two sides are equal for each n and since both sides converge up to the appropriate integral, the right-hand side by the definition of the integral of a nonnegative function and the left-hand side by virtue of the definition and the monotone convergence theorem. For general f, use the usual decomposition f = f^+ − f^− and apply the result to each piece. The preceding relation then holds in the sense that if either integral exists, then so does the other and they are equal. Note in particular that this implies that if f is in L1(m), then E(f|G)(ω) must be finite m-a.e.

In a similar sequence of steps, the fact that for fixed G the conditional probability of G given G is a G-measurable function implies that E(f|G)(ω) is a G-measurable function if f is an indicator function, which implies that it is also G-measurable for simple functions and hence also for limits of simple functions.

Thus we have the following properties: If f ∈ L1(m) and h(ω) = E(f|G)(ω), then

h(ω) is G-measurable, (6.31)

and

∫_F h dm = ∫_F f dm, all F ∈ G. (6.32)

Eq. (6.32) is often referred to as iterated expectation or nested expectation since it shows that the expectation of f can be found in two steps: first find the conditional expectation given a σ-field or class of measurements (in which case G is the induced σ-field) and then integrate the conditional expectation.

These properties parallel the defining descriptive properties of conditional probability and reduce to those properties in the case of indicator functions. The following simple lemma shows that the properties provide an alternative definition of conditional expectation that is valid more generally, that is, holds even when the alphabets are not standard. Toward this end we now give the general descriptive definition of conditional expectation: Given a probability space (Ω,B,m), a sub-σ-field G, and a measurement f ∈ L1(m), then any function h satisfying (6.31) and (6.32) is said to be a version of the conditional expectation E(f|G). From Corollary 6.1, any two versions must be equal a.e. If the underlying space is standard, then the descriptive definition therefore is consistent with the constructive definition. It only remains to show that the conditional expectation exists in the general case, that is, that we can always find a version of the conditional expectation even if the underlying space is not standard. To do this observe that if f is a simple function Σ_i ai 1Gi, then

h = Σ_i ai m(Gi|G)

is G-measurable and satisfies (6.32) from the linearity of integration and the properties of conditional probability:

∫_F (Σ_i ai m(Gi|G)) dm = Σ_i ai ∫_F m(Gi|G) dm = Σ_i ai ∫_F 1Gi dm
= ∫_F (Σ_i ai 1Gi) dm = ∫_F f dm.

We then proceed in the usual way. If f ≥ 0, let qn be the asymptotically accurate quantizer sequence of Lemma 5.3. Then qn(f) ↑ f and hence E(qn(f)|G) is nondecreasing (Lemma 6.3), and h = lim_{n→∞} E(qn(f)|G) satisfies (6.31) since limits of G-measurable functions are measurable. The dominated convergence theorem implies that h also satisfies (6.32). The result for general integrable f follows from the decomposition f = f^+ − f^−. We have therefore proved the following result.

Lemma 6.8. Given a probability space (Ω,B,m), a sub-σ-field G, and a measurement f ∈ L1(m), then there exists a G-measurable real-valued function h satisfying the formula

∫_F h dm = ∫_F f dm, all F ∈ G; (6.33)

that is, there exists a version of the conditional expectation E(f|G) of f given G. If the underlying space is standard, then also (6.29) holds a.e.; that is, the constructive and descriptive definitions are equivalent on standard spaces.

The next result shows that the remark preceding (6.30) holds even if the space is not standard.

Corollary 6.3. Given a probability space (Ω,B,m) (not necessarily standard), a sub-σ-field G, and an event G ∈ B, then with probability one

m(G|G) = E(1G|G).

Proof. From the descriptive definition of conditional expectation

∫_F E(1G|G) dm = ∫_F 1G dm = m(G ∩ F), for all F ∈ G,

for any G ∈ B. But, by (6.24), this is just ∫_F m(G|G) dm for all F ∈ G. Since both E(1G|G) and m(G|G) are G-measurable, Corollary 6.1 completes the proof. 2

Corollary 6.4. If f ∈ L1(m) is G-measurable, then E(f|G) = f m-a.e.

Proof. The proof follows immediately from (6.32) and Corollary 6.1. 2

If a conditional expectation is defined on a standard space using the constructive definition, then it inherits the properties of an ordinary expectation; e.g., Lemma 5.6 can be immediately extended to conditional expectations. The following lemma shows that Lemma 5.6 extends to conditional expectation in the more general case of the descriptive definition.

Lemma 6.9. Let (Ω,B,m) be a probability space, G a sub-σ-field, and f and g integrable measurements.

(a) If f ≥ 0 with probability 1, then Em(f|G) ≥ 0.
(b) Em(1|G) = 1.
(c) Conditional expectation is linear; that is, for any real α, β and any measurements f and g,

Em(αf + βg|G) = αEm(f|G) + βEm(g|G).

(d) Em(f|G) exists and is finite if and only if Em(|f||G) is finite, and

|Em(f|G)| ≤ Em(|f||G).

(e) Given two measurements f and g for which f ≥ g with probability 1, then

Em(f|G) ≥ Em(g|G).

Proof. (a) If f ≥ 0 and

∫_F E(f|G) dm = ∫_F f dm, all F ∈ G,

then Lemma 5.6 implies that the right-hand side is nonnegative for all F ∈ G. Since E(f|G) is G-measurable, Corollary 6.1 then implies that E(f|G) ≥ 0.

(b) For f = 1, the function h = 1 is both G-measurable and satisfies

∫_F h dm = ∫_F f dm = m(F), all F ∈ G.

Since (6.31)-(6.32) are satisfied with h = 1, 1 must be a version of E(1|G).

(c) Let E(f|G) and E(g|G) be versions of the conditional expectations of f and g given G. Then h = αE(f|G) + βE(g|G) is G-measurable and satisfies (6.32) with f replaced by αf + βg. Hence h is a version of E(αf + βg|G).

(d) Let f = f^+ − f^− be the usual decomposition into positive and negative parts, f^+ ≥ 0, f^− ≥ 0. From part (c), E(f|G) = E(f^+|G) − E(f^−|G). From part (a), E(f^+|G) ≥ 0 and E(f^−|G) ≥ 0. Thus again using part (c)

|E(f|G)| ≤ E(f^+|G) + E(f^−|G) = E(f^+ + f^−|G) = E(|f||G).

(e) The proof follows from parts (a) and (c) by replacing f by f − g. 2

The following result shows that if a measurement f is square-integrable, then its conditional expectation E(f|G) has a special interpretation. It is the projection of the measurement onto the space of all square-integrable G-measurable functions.

Lemma 6.10. Given a probability space (Ω,B,m), a measurement f ∈ L2(m), and a sub-σ-field G ⊂ B, let M denote the subset of L2(m) consisting of all square-integrable G-measurable functions. Then M is a closed subspace of L2(m) and

E(f|G) = PM(f).

Proof. Since sums and limits of G-measurable functions are G-measurable, M is a closed subspace of L2(m). From the projection theorem (Theorem 6.1) there exists a function f̂ = PM(f) with the properties that f̂ ∈ M and f − f̂ ⊥ M. This fact implies that

∫ f g dm = ∫ f̂ g dm (6.34)

for all g ∈ M. Choosing g = 1G for any G ∈ G yields

∫_G f dm = ∫_G f̂ dm, all G ∈ G.

Since f̂ is G-measurable, this defines f̂ as a version of E(f|G) and proves the lemma. 2

Eq. (6.34) and the lemma immediately yield a generalization of the nesting property of conditional expectation.

Corollary 6.5. If g is G-measurable and f, g ∈ L2(m), then

E(fg) = E(gE(f|G)).

The next result provides some continuity properties of conditional expectation.

Lemma 6.11. For i = 1,2, if f ∈ Li(m), then

‖E(f|G)‖i ≤ ‖f‖i, i = 1,2,

and thus if f ∈ Li(m), then also E(f|G) ∈ Li(m). If f, g ∈ Li(m), then

‖E(f|G) − E(g|G)‖i ≤ ‖f − g‖i.

Thus for i = 1,2, if fn → f in Li(m), then also E(fn|G) → E(f|G) in Li(m).

Proof. The first inequality implies the second by replacing f with f − g and using linearity. For i = 1 it follows from the usual decomposition f = f^+ − f^−, f^+ ≥ 0, f^− ≥ 0, and the linearity of conditional expectation (Lemma 6.9(c)) that



‖E(f|G)‖1 = ∫ |E(f|G)| dm = ∫ |E(f^+ − f^−|G)| dm = ∫ |E(f^+|G) − E(f^−|G)| dm.

From Lemma 6.9(a), however, E(f^+|G) ≥ 0 and E(f^−|G) ≥ 0 a.e., and hence the right-hand side is bound above by

∫ (E(f^+|G) + E(f^−|G)) dm = ∫ E(f^+ + f^−|G) dm = ∫ E(|f||G) dm = E(|f|) = ‖f‖1.

For i = 2 observe that

0 ≤ E[(f − E(f|G))^2] = E(f^2) + E[E(f|G)^2] − 2E[fE(f|G)].

Apply iterated expectation and Corollary 6.5 to write

E[fE(f|G)] = E(E[fE(f|G)]|G) = E(E(f|G)E(f|G)) = E[E(f|G)^2].

Combining these two equations yields E(f^2) − E[E(f|G)^2] ≥ 0, or ‖f‖2^2 ≥ ‖E(f|G)‖2^2. 2

A special case where the conditional results simplify occurs when the measurement and the conditioning are independent. Given a probability space (Ω,B,m), two events F and G are said to be independent if m(F ∩ G) = m(F)m(G). Two measurements or random variables f : Ω → Af and g : Ω → Ag are said to be independent if the events f^{-1}(F) and g^{-1}(G) are independent for all events F and G; that is,

m(f ∈ F and g ∈ G) = m(f^{-1}(F) ∩ g^{-1}(G)) = m(f^{-1}(F))m(g^{-1}(G)) = m(f ∈ F)m(g ∈ G); all F ∈ BAf, G ∈ BAg.

The following intuitive result for independent measurements is proved easily by first treating simple functions and then applying the usual methods to get the general result.

Corollary 6.6. If f and g are independent measurements, then

E(f|σ(g)) = E(f),

and

E(fg) = E(f)E(g).

Thus, for example, if f and g are independent, then m(f ∈ F|g) = m(f ∈ F) a.e.


Exercises

1. Prove Corollary 6.6.
2. Prove the following generalization of Corollary 6.5.

Lemma 6.12. Suppose that G1 ⊂ G2 are two sub-σ-fields of B and that f ∈ L1(m). Prove that

E(E(f|G2)|G1) = E(f|G1).

3. Given a probability space (Ω,B,m) and a measurement f, for what sub-σ-field G ⊂ B is it true that E(f|G) = E(f) a.e.? For what sub-σ-field H ⊂ B is it true that E(f|H) = f a.e.?

4. Suppose that Gn is an increasing sequence of sub-σ-fields, Gn ⊂ Gn+1, and f is an integrable measurement. Define the random variables

Xn = E(f|Gn). (6.35)

Prove that Xn has the properties that it is measurable with respect to Gn and

E(Xn|X1, X2, . . . , Xn−1) = Xn−1; (6.36)

that is, the conditional expectation of the next value given all the past values is just the most recent value. A sequence of random variables with this property is called a martingale and there is an extremely rich theory regarding the convergence and uniform integrability properties of such sequences. (See, e.g., the classic reference of Doob [20] or the treatments in Ash [1] or Breiman [10].) We point out in passing the general form of martingales, but we will not develop the properties in any detail. A martingale consists of a sequence of random variables Xn defined on a common probability space (Ω,B,m) together with an increasing sequence of sub-σ-fields Bn with the following properties: for n = 1,2,...

E|Xn| < ∞, (6.37)

Xn is Bn-measurable, (6.38)

and

E(Xn+1|Bn) = Xn. (6.39)

If the third relation is replaced by E(Xn+1|Bn) ≥ Xn, then {Xn,Bn} is instead a submartingale. If instead E(Xn+1|Bn) ≤ Xn, then {Xn,Bn} is a supermartingale. Martingales have their roots in gambling theory, in which a martingale can be considered as a fair game: the capital expected next time is given by the current capital. A submartingale represents an unfavorable game (as in a casino), and a supermartingale a favorable game (as was the stock market during the decade preceding the first edition of this book). (A small numerical sketch illustrating (6.39) appears after these exercises.)

5. Show that {Xn,Bn} is a martingale if and only if for all n

∫_G Xn+1 dm = ∫_G Xn dm, all G ∈ Bn.
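
The following sketch is a Monte Carlo check, not part of the text, and its parameters are arbitrary assumptions. It illustrates the martingale property in the integral form of exercise 5 for the simplest example, partial sums of zero-mean IID random variables: for an event G ∈ Bn such as {Xn > 0}, the empirical averages of Xn+1 1G and Xn 1G should agree up to simulation error.

import numpy as np

rng = np.random.default_rng(0)
trials, n = 200_000, 10
Y = rng.uniform(-1.0, 1.0, size=(trials, n + 1))   # zero-mean IID increments
X = np.cumsum(Y, axis=1)                            # X_k = Y_1 + ... + Y_k
Xn, Xn1 = X[:, n - 1], X[:, n]                      # X_n and X_{n+1}
G = Xn > 0                                          # an event determined by the past
print(np.mean(Xn1 * G), np.mean(Xn * G))            # approximately equal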

6.10 Independence and Markov Chains

In this section we apply some of the results and interpretations of the previous sections to obtain formulas for probabilities involving random variables that have a particular relationship called the Markov chain property.

Let (Ω,B,P) be a probability space and let X : Ω → AX, Y : Ω → AY, and Z : Ω → AZ be three random variables defined on this space with σ-fields BAX, BAY, and BAZ, respectively. As we wish to focus on the three random variables rather than on the original space, we can consider (Ω,B,P) to be the space (AX × AY × AZ, BAX × BAY × BAZ, PXYZ).

As a preliminary, let PXY denote the joint distribution of the random variables X and Y; that is,

PXY(F × G) = P(X^{-1}(F) ∩ Y^{-1}(G)); F ∈ BAX, G ∈ BAY.

Translating the definition of independence of measurements of Section 6.9 into distributions, the random variables X and Y are independent if and only if

PXY(F × G) = PX(F)PY(G); F ∈ BAX, G ∈ BAY,

where PX and PY are the marginal distributions of X and Y. We can also state this in terms of joint distributions in a convenient way. Given two distributions PX and PY, define the product distribution PX × PY or PX×Y (both notations are used) as the distribution on (AX × AY, BAX × BAY) specified by

PX×Y(F × G) = PX(F)PY(G); F ∈ BAX, G ∈ BAY.

Thus the product distribution has the same marginals as the original distribution, but it forces the random variables to be independent.

Lemma 6.13. Random variables X and Y are independent if and only if PXY = PX×Y. Given measurements f : AX → Af and g : AY → Ag, then if X and Y are independent, so are f(X) and g(Y) and

∫ fg dPX×Y = (∫ f dPX)(∫ g dPY).


Proof. The first statement follows immediately from the definitions of independence and product distributions. The second statement follows from the fact that

Pf(X),g(Y)(F × G) = PX,Y(f^{-1}(F) × g^{-1}(G)) = PX(f^{-1}(F))PY(g^{-1}(G)) = Pf(X)(F)Pg(Y)(G).

The final statement then follows from Corollary 6.6. 2

We next introduce a third random variable and a form of conditional independence among random variables. As considered in Section 6.8, the conditional probability P(F|σ(Z)) can be written as g(Z(ω)) for some function g(z), and we define this function to be P(F|z). Similarly the conditional probability P(F|σ(Z,Y)) can be written as r(Z(ω),Y(ω)) for some function r(z,y), and we denote this function P(F|z,y). We can then define the conditional distributions

PX|Z(F|z) = P(X^{-1}(F)|z)
PY|Z(G|z) = P(Y^{-1}(G)|z)
PXY|Z(F × G|z) = P(X^{-1}(F) ∩ Y^{-1}(G)|z)
PX|Z,Y(F|z,y) = P(X^{-1}(F)|z,y).

We say that Y → Z → X is a Markov chain if for any event F ∈ BAX, with probability one

PX|Z,Y(F|z,y) = PX|Z(F|z),

or, in terms of the original space, with probability one

P(X^{-1}(F)|σ(Z,Y)) = P(X^{-1}(F)|σ(Z)).

The following lemma provides an equivalent form of this property in terms of a conditional independence relation.

Lemma 6.14. The random variables Y, Z, X form a Markov chain Y → Z → X if and only if for any F ∈ BAX and G ∈ BAY, with probability one

PXY|Z(F × G|z) = PX|Z(F|z)PY|Z(G|z),

or, equivalently, with probability one

P(X^{-1}(F) ∩ Y^{-1}(G)|σ(Z)) = P(X^{-1}(F)|σ(Z))P(Y^{-1}(G)|σ(Z)).

If the random variables have the preceding property, we also say that X and Y are conditionally independent given Z.


Proof. For any F ∈ BAX, G ∈ BAY, and D ∈ BAZ we know from the properties of conditional probability that

P(X^{-1}(F) ∩ Y^{-1}(G) ∩ Z^{-1}(D)) = ∫_{Y^{-1}(G)∩Z^{-1}(D)} P(X^{-1}(F)|σ(Z,Y)) dP
= ∫_{Y^{-1}(G)∩Z^{-1}(D)} P(X^{-1}(F)|σ(Z)) dP,

using the definition of the Markov property. Using first (6.32) and then Corollary 6.3 this integral can be written as

∫ 1_{Z^{-1}(D)} 1_{Y^{-1}(G)} P(X^{-1}(F)|σ(Z)) dP
= ∫ E[1_{Z^{-1}(D)} 1_{Y^{-1}(G)} P(X^{-1}(F)|σ(Z)) | σ(Z)] dP
= ∫ 1_{Z^{-1}(D)} E[1_{Y^{-1}(G)}|σ(Z)] P(X^{-1}(F)|σ(Z)) dP,

where we have used the fact that both 1_{Z^{-1}(D)} and P(X^{-1}(F)|σ(Z)) are measurable with respect to σ(Z) and hence come out of the conditional expectation. From Corollary 6.3, however,

E[1_{Y^{-1}(G)}|σ(Z)] = P(Y^{-1}(G)|σ(Z)),

and hence combining the preceding relations yields

P(X^{-1}(F) ∩ Y^{-1}(G) ∩ Z^{-1}(D)) = ∫_{Z^{-1}(D)} P(Y^{-1}(G)|σ(Z))P(X^{-1}(F)|σ(Z)) dP.

We also know from the properties of conditional probability that

P(X^{-1}(F) ∩ Y^{-1}(G) ∩ Z^{-1}(D)) = ∫_{Z^{-1}(D)} P(Y^{-1}(G) ∩ X^{-1}(F)|σ(Z)) dP,

and hence

P(X^{-1}(F) ∩ Y^{-1}(G) ∩ Z^{-1}(D))
= ∫_{Z^{-1}(D)} P(Y^{-1}(G)|σ(Z))P(X^{-1}(F)|σ(Z)) dP
= ∫_{Z^{-1}(D)} P(Y^{-1}(G) ∩ X^{-1}(F)|σ(Z)) dP. (6.40)

Since the integrands in the two right-most integrals are measurable with respect to σ(Z), it follows from Corollary 6.1 that they are equal almost everywhere, proving the lemma. 2

As a simple example, if the random variables are discrete with probability mass function pXYZ, then Y, Z, X form a Markov chain if and only if for all x, y, z

pX|Y,Z(x|y,z) = pX|Z(x|z)

or, equivalently,

pX,Y|Z(x,y|z) = pX|Z(x|z)pY|Z(y|z).
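
The discrete criterion is easy to verify by brute force. In the sketch below (the pmf values are arbitrary assumptions, not data from the text) a joint pmf pXYZ is built so that X and Y are conditionally independent given Z, and the factorization pX,Y|Z = pX|Z pY|Z is checked at every point.

import itertools
import numpy as np

pZ = np.array([0.4, 0.6])                     # marginal of Z (assumption)
pX_given_Z = np.array([[0.2, 0.8],            # rows indexed by z, columns by x
                       [0.7, 0.3]])
pY_given_Z = np.array([[0.5, 0.5],            # rows indexed by z, columns by y
                       [0.1, 0.9]])
pXYZ = np.einsum('z,zx,zy->xyz', pZ, pX_given_Z, pY_given_Z)

ok = True
for x, y, z in itertools.product(range(2), repeat=3):
    pz = pXYZ[:, :, z].sum()
    lhs = pXYZ[x, y, z] / pz                                  # p_{X,Y|Z}(x,y|z)
    rhs = (pXYZ[x, :, z].sum() / pz) * (pXYZ[:, y, z].sum() / pz)
    ok &= abs(lhs - rhs) < 1e-12
print(ok)   # True: Y -> Z -> X is a Markov chain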

We close this section and chapter with an alternative statement of the previous lemma. The result is an immediate consequence of the lemma and the definitions.

Corollary 6.7. Given the conditions of the previous lemma, let PXYZ denote the distribution of the three random variables X, Y, and Z. Let PX×Y|Z denote the distribution (if it exists) for the same random variables specified by the formula

PX×Y|Z(F × G × D) = ∫_{Z^{-1}(D)} P(Y^{-1}(G)|σ(Z))P(X^{-1}(F)|σ(Z)) dP
= ∫_D PX|Z(F|z)PY|Z(G|z) dPZ(z), F ∈ BAX, G ∈ BAY, D ∈ BAZ;

that is, PX×Y|Z is the distribution on X, Y, Z which agrees with the original distribution PXYZ on the conditional distributions of X given Z and Y given Z and with the marginal distribution of Z, but which is such that Y → Z → X forms a Markov chain or, equivalently, such that X and Y are conditionally independent given Z. Then Y → Z → X forms a Markov chain if and only if PXYZ = PX×Y|Z.

Comment: The corollary contains the technical qualification that the given distribution exists because it need not in general; that is, there is no guarantee that the given set function is indeed countably additive and hence a probability measure. If the alphabets are assumed to be standard, however, then the conditional distributions are regular and it follows that the set function PX×Y|Z is indeed a distribution.

Exercise

1. Suppose that Yn, n = 1,2,..., is a sequence of independent zero-mean random variables with the same marginal distribution PYn = PY. Define the sum

Xn = Σ_{i=1}^n Yi.

Show that the sequence Xn is a martingale. This points out that martingales can be viewed as one type of memory in a random process, here formed simply by adding up memoryless random variables.


Chapter 7

Ergodic Properties

Abstract Random processes and dynamical systems possess ergodic properties if the time averages of measurements converge. This chapter looks at the implications of ergodic properties and limiting sample averages. The relation between expectations of the averages and averages of the expectations is derived, and it is seen that if a process has sufficient ergodic properties, then the process must be asymptotically mean stationary and the properties describe a stationary measure with the same ergodic properties as the asymptotically mean stationary process. A variety of related characterizations of the system transformation are given that relate to the asymptotic behavior, including ergodicity, recurrence, mixing, block stationarity and ergodicity, and total ergodicity. The chapter concludes with the ergodic decomposition, which shows that the stationary measure implied by convergent sample averages is equivalent to a mixture of stationary and ergodic processes. The ergodic theorems of the next chapter provide conditions under which a process or system will possess ergodic properties for large classes of measurements. In particular, it is shown that asymptotic mean stationarity is a sufficient condition as well as a necessary condition.

7.1 Ergodic Properties of Dynamical Systems

As in the previous chapter we focus on time and ensemble averages of measurements made on a dynamical system. Let (Ω,B,m,T) be a dynamical system. The case of principal interest will be the dynamical system corresponding to a random process, that is, a dynamical system (A^I, B(A)^I, m, T), where I is an index set, usually either the nonnegative integers Z+ or the set Z of all integers, and T is a shift on the sequence space A^I. The sequence of coordinate random variables Πn, n ∈ I, is the corresponding random process and m is its distribution. The dynamical


system may or may not be that associated with a directly given alphabet A random process {Xn; n ∈ I} described by a distribution m and having T be the shift on the sequence space A^I.

Dynamical systems corresponding to a directly given random process will be the most common example, but we shall be interested in others as well. For example, the space may be the same, but we may wish to consider block or variable-length shifts instead of the unit time shift.

As previously, given a measurement f (i.e., a measurable mapping of Ω into R), define the sample averages

<f>_n = n^{-1} Σ_{i=0}^{n-1} f(T^i x).

A dynamical system will be said to have the ergodic property with respect to a measurement f if the sample average <f>_n converges almost everywhere as n → ∞. A dynamical system will be said to have ergodic properties with respect to a class M of measurements if the sample averages <f>_n converge almost everywhere as n → ∞ for all f in the given class M. When unclear from context, the measure m will be explicitly given. We shall also refer to mean ergodic properties (of order i) when the convergence is in Li.

If lim_{n→∞} <f>_n(x) exists, we will usually denote it by <f>(x) or f̄(x). For convenience we consider <f>(x) to be 0 if the limit does not exist. From Corollary 5.11 such limits are unique only in an almost everywhere sense; that is, if <f>_n → <f> and <f>_n → g a.e., then <f> = g a.e. Interesting classes of functions that we will consider include the class of all indicator functions of events, the class of all bounded measurements, and the class of all integrable measurements with respect to a given probability measure.
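
A numerical sketch of the sample averages <f>_n for a shift may make the definition concrete. Everything below is an assumption made for illustration: the sequence is a single sample path of an IID binary process, and f is the indicator that the current coordinate equals 1, so <f>_n is just a relative frequency, which for this process settles near the marginal probability as n grows.

import numpy as np

rng = np.random.default_rng(1)
p = 0.3
x = rng.random(100_000) < p                 # one sample sequence from A^I with A = {0, 1}
f = x.astype(float)                         # f(T^i x) = 1 if x_i = 1, else 0
sample_avg = np.cumsum(f) / np.arange(1, len(f) + 1)      # <f>_n for n = 1, 2, ...
print(sample_avg[99], sample_avg[9_999], sample_avg[-1])  # drifting toward 0.3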

The properties of limits immediately yield properties of limiting sample averages that resemble those of probabilistic averages.

Lemma 7.1. Given a dynamical system, let E denote the class of measurements with respect to which the system has ergodic properties.

(a) If f ∈ E and fT^i ≥ 0 a.e. for all i, then also <f> ≥ 0 a.e. (The condition is met, for example, if f(ω) ≥ 0 for all ω ∈ Ω.)
(b) The constant functions f(ω) = r, r ∈ R, are in E. In particular, <1> = 1.
(c) If f, g ∈ E, then also <af + bg> = a<f> + b<g>, and hence af + bg ∈ E for any real a, b. Thus E is a linear space.
(d) If f ∈ E, then also fT^i ∈ E, i = 1,2,..., and <fT^i> = <f>; that is, if a system has ergodic properties with respect to a measurement, then it also has ergodic properties with respect to all shifts of the measurement and the limit is the same.

Proof. Parts (a)-(c) follow from Lemma 5.14. To prove (d) we need only consider the case i = 1 since that implies the result for all i. If f ∈ E, then with probability one


lim_{n→∞} (1/n) Σ_{i=0}^{n-1} f(T^i x) = f̄(x)

or, equivalently,

|<f>_n − f̄| →_{n→∞} 0.

The triangle inequality shows that <fT>_n must have the same limit:

|<fT>_n − <f>| = |((n+1)/n)<f>_{n+1} − n^{-1}f − <f>|
≤ |((n+1)/n)<f>_{n+1} − <f>_{n+1}| + |<f>_{n+1} − f̄| + n^{-1}|f|
≤ (1/n)|<f>_{n+1}| + |<f>_{n+1} − f̄| + (1/n)|f| →_{n→∞} 0,

where the middle term goes to zero by assumption, implying that the first term must also go to zero, and the right-most term goes to zero since f cannot assume ∞ as a value. 2

The preceding properties provide some interesting comparisons between the space E of all measurements on a dynamical system (Ω,B,m,T) with limiting sample averages and the space L1 of all measurements with expectations with respect to the same measure. Properties (b) and (c) of limiting sample averages are shared by expectations. Property (a) is similar to the property of expectation that f ≥ 0 a.e. implies that Ef ≥ 0, but here we must require that fT^i ≥ 0 a.e. for all nonnegative integers i. Property (d) is not, however, shared by expectations in general since integrability of f does not in general imply integrability of fT^i. There is one case, however, where the property is shared: if the measure is invariant, then from Lemma 5.22 f is integrable if and only if fT^i is, and, if the integrals exist, they are equal. In addition, if the measure is stationary, then

m({ω : f(T^n ω) ≥ 0}) = m(T^{-n}{ω : f(ω) ≥ 0}) = m(f ≥ 0),

and hence in the stationary case the condition for (a) to hold is equivalent to the condition that f ≥ 0 a.e. Thus for stationary measures, there is an exact parallel among the properties (a)-(d) of measurements in E and those in L1. These similarities between time and ensemble averages will be useful for both the stationary and nonstationary systems.

The proof of the lemma yields a property of limiting sample averages that is very useful. We formalize this fact as a corollary after we give a needed definition: Given a dynamical system (Ω,B,m,T), a function f : Ω → Λ is said to be invariant (with respect to T) or stationary (with respect to T) if f(Tω) = f(ω), all ω ∈ Ω. When it is understood that T is a shift, the name time-invariant is also used.


Corollary 7.1. If a dynamical system has ergodic properties with respectto a measurement f , then the limiting sample average < f > is an invari-ant function, that is, < f > (Tx) =< f > (x).

Proof. If lim_{n→∞} < f >_n (x) = < f > (x), then from the lemma the limiting time average < fT > (x) of the shifted measurement fT also exists and equals < f > (x). But < fT >_n (x) = < f >_n (Tx), and hence < fT > (x) = < f > (Tx), proving the corollary. □

The corollary simply formalizes the observation that shifting x oncecannot change a limiting sample average, hence such averages must beinvariant.

Ideally one would like to consider the most general possible measure-ments at the outset, e.g., integrable or square-integrable measurements.We initially confine attention to bounded measurements, however, as acompromise between generality and simplicity. The class of boundedmeasurements is simple because the measurements and their sampleaverages are always integrable and uniformly integrable and the class ofmeasurements considered does not depend on the underlying measure.All of the limiting results of probability theory immediately and easilyapply. This permits us to develop properties that are true of a givenclass of measurements independent of the actual measure or measuresunder consideration. Eventually, however, we will wish to demonstratefor particular measures that sample averages converge for more generalmeasurements.

Although a somewhat limited class, the class of bounded measure-ments contains some important examples. In particular, it contains theindicator functions of all events, and hence a system with ergodic prop-erties with respect to all bounded measurements will have limiting val-ues for all relative frequencies, that is, limits of < 1F >n as n → ∞. Thequantizer sequence of Lemma 5.3 can be used to prove a form of con-verse to this statement: if a system possesses ergodic properties withrespect to the indicator functions of events, then it also possesses er-godic properties with respect to all bounded measurable functions.

Lemma 7.2. A dynamical system has ergodic properties with respect tothe class of all bounded measurements if and only if it has ergodic prop-erties with respect to the class of all indicator functions of events, i.e., allmeasurable indicator functions.

Proof. Since indicator functions are a subclass of all bounded functions,one implication is immediate. The converse follows by a standard ap-proximation argument, which we present for completeness. Assume thata system has ergodic properties with respect to all indicator functionsof events. Let E be the class of all measurements with respect to whichthe system has ergodic properties. Then E contains all indicator func-tions of events, and, from the previous lemma, it is linear. Hence it alsocontains all measurable simple functions. The general limiting sample


averages can then be constructed from those of the simple functionsin the same manner that we used to define ensemble averages, that is,for nonnegative functions form the increasing limit of the averages ofsimple functions and for general functions form the nonnegative andnegative parts.

Assume that f is a bounded measurable function with, say, | f |≤ K.Let ql denote the quantizer sequence of Lemma 5.3 and assume thatl > K and hence | ql(f (ω)) − f(ω) |≤ 2−l for all l and ω by construc-tion. This also implies that | ql(f (T iω))− f(T iω) | ≤ 2−l for all i, l,ω.It is the fact that this bound is uniform over i (because f is bounded)that makes the following construction work. Begin by assuming that f isnonnegative. Since ql(f ) is a simple function, it has a limiting sample av-erage, say < ql(f ) >, by assumption. We can take these limiting sampleaverages to be increasing in l since j ≥ l implies that

< q_j(f) >_n − < q_l(f) >_n = < q_j(f) − q_l(f) >_n = n^{-1} ∑_{i=0}^{n-1} ( q_j(fT^i) − q_l(fT^i) ) ≥ 0,

since each term in the sum is nonnegative by construction of the quantizer sequence. Hence from Lemma 7.1 we must have that < q_j(f) > − < q_l(f) > = < q_j(f) − q_l(f) > ≥ 0, and hence < q_l(f) > (ω) is nondecreasing in l and must have a limit, say f̂. Furthermore, since f and hence q_l(f) and hence < q_l(f) >_n and hence < q_l(f) > are all in [0, K], so must f̂ be. We now have from the triangle inequality that for any l

| < f >_n − f̂ | ≤ | < f >_n − < q_l(f) >_n | + | < q_l(f) >_n − < q_l(f) > | + | < q_l(f) > − f̂ |.

The first term on the right-hand side is bound above by 2−l. Sincewe have assumed that < ql(f ) >n converges with probability one to< ql(f ) >, then with probability one we get an ω such that

< q_l(f) >_n → < q_l(f) > as n → ∞

for all l = 1, 2, . . . (The intersection of a countable number of sets with probability one also has probability one.) Assume that there is an ω in this set. Then given ε we can choose an l so large that | < q_l(f) > − f̂ | ≤ ε/2 and 2^{-l} ≤ ε/2, in which case the preceding inequality yields

lim sup_{n→∞} | < f >_n − f̂ | ≤ ε.


Since this is true for arbitrary ε, for the given ω it follows that < f >_n (ω) → f̂(ω), proving the almost everywhere result for nonnegative f. The result for general bounded f is obtained by applying the nonnegative result to f^+ and f^- in the decomposition f = f^+ − f^- and using the fact that for the given quantizer sequence

q_l(f) = q_l(f^+) − q_l(f^-). □

Because of the importance of these ergodic properties, we shall sim-plify the phrase “ergodic properties with respect to all bounded func-tions” to simply “bounded ergodic properties.” Thus a dynamical system(or random process) is said to have bounded ergodic properties if for anybounded measurable function (including any measurable indicator func-tion) it assigns probability 1 to the set of all sequences for which thesample average of the function (or the relative frequency of the event)converges.

7.2 Implications of Ergodic Properties

To begin developing the implications of ergodic properties, suppose thata system possesses ergodic properties for a class of bounded measure-ments. Then corresponding to each bounded measurement f in the classthere will be a limit function, say < f >, such that with probability one< f >n→< f > as n → ∞. Convergence almost everywhere and theboundedness of f imply L1 convergence, and hence for any event G

| n^{-1} ∑_{i=0}^{n-1} ∫_G fT^i dm − ∫_G < f > dm | = | ∫_G ( n^{-1} ∑_{i=0}^{n-1} fT^i − < f > ) dm |
≤ ∫_G | n^{-1} ∑_{i=0}^{n-1} fT^i − < f > | dm ≤ ‖ < f >_n − < f > ‖_1 → 0 as n → ∞.

Thus we have proved the following lemma.

Lemma 7.3. If a dynamical system with measure m has the ergodic property with respect to a bounded measurement f, then

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} ∫_G fT^i dm = ∫_G < f > dm,  all G ∈ B. (7.1)

Thus, for example, if we take G as the whole space


lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} E_m(fT^i) = E_m < f >. (7.2)

Suppose that a random process has ergodic properties with respect to the class of all bounded measurements and let f be an indicator function of an event F. Application of (7.1) yields

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F ∩ G) = lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} ∫_G 1_F T^i dm = E_m(< 1_F > 1_G),  all F, G ∈ B. (7.3)

For example, if G is the entire space then (7.3) becomes

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F) = E_m < 1_F >,  all events F. (7.4)

The limit properties of the preceding lemma are trivial implicationsof the definition of ergodic properties, yet they will be seen to have far-reaching consequences. We note the following generalization for lateruse.

Corollary 7.2. If a dynamical system with measure m has the ergodicproperty with respect to a measurement f (not necessarily bounded) andif the sequence < f >n is uniformly integrable, then (7.1) and hence also(7.2) hold.

Proof. The proof follows immediately from Lemma 5.20 since this en-sures the needed L1 convergence. 2

Recall from Corollary 5.9 that if the fTn are uniformly integrable,then so are the < f >n. Thus, for example, ifm is stationary and f ism-integrable, then Lemma 5.23 and the preceding lemma imply that (7.1)and (7.2) hold, even if f is not bounded.

In general, the preceding lemmas state that if the arithmetic meansof a sequence of “nice” measurements converges in any sense, then thecorresponding arithmetic means of the expected values of the measure-ments must also converge. In particular, they imply that the arithmeticmean of the measuresmT−i must converge, a fact that we emphasize asa corollary:

Corollary 7.3. If a dynamical system with measure m has ergodic prop-erties with respect to the class of all indicator functions of events, then

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F) exists,  all events F. (7.5)


Recall that a measure m is stationary or invariant with respect to T ifall of these measures are identical. Clearly (7.5) holds for stationary mea-sures. More generally, say that (7.5) holds for a measurem and transfor-mation T . Define the set functions mn,n = 1,2, . . . by

m_n(F) = n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F).

The m_n are easily seen to be probability measures, and hence (7.5) implies that there is a sequence of measures m_n such that for every event F the sequence m_n(F) has a limit. If the limit of the arithmetic mean of the transformed measures mT^{-i} exists, we shall denote it by m̄. The following lemma shows that if a sequence of measures m_n is such that m_n(F) converges for every event F, then the limiting set function is itself a measure. Thus if (7.5) holds and we define the limit as m̄, then m̄ is itself a probability measure.

Lemma 7.4. (The Vitali-Hahn-Saks Theorem) Given a sequence of probability measures m_n and a set function m such that

lim_{n→∞} m_n(F) = m(F),  all events F,

then m is also a probability measure.

Proof. It is easy to see that m is nonnegative, normalized, and finitelyadditive. Hence we need only show that it is countably additive. Observethat it would not help here for the space to be standard as that wouldonly guarantee thatm defined on a field would have a countably additiveextension, not that the given m that is defined on all events would itselfbe countably additive. To prove that m is countably additive we assumethe contrary and show that this leads to a contradiction.

Assume that m is not countably additive and hence from (1.26) theremust exist a sequence of disjoint sets Fj with union F such that

∑_{j=1}^{∞} m(F_j) = b < a = m(F). (7.6)

We shall construct a sequence of integers M(k), k = 1,2, . . ., a corre-sponding grouping of the Fj into sets Bk as

B_k = ⋃_{j=M(k−1)+1}^{M(k)} F_j,  k = 1, 2, . . . , (7.7)

a set B defined as the union of all of the Bk for odd k:

B = B1 ∪ B3 ∪ B5 ∪ . . . , (7.8)


and a sequence of integers N(k), k = 1,2,3, . . ., such that mN(k)(B) os-cillates between values of approximately b when k is odd and approx-imately a when k is even. This will imply that mn(B) cannot convergeand hence will yield the desired contradiction.

Fix ε > 0 much smaller than a − b. Fix M(0) = 0.

Step 1: Choose M(1) so that

b − ∑_{j=1}^{M} m(F_j) < ε/8  if M ≥ M(1). (7.9)

and then choose N(1) so that

| ∑_{j=1}^{M(1)} m_n(F_j) − b | = | m_n(B_1) − b | < ε/4,  n ≥ N(1), (7.10)

and

| m_n(F) − a | ≤ ε/4,  n ≥ N(1). (7.11)

Thus for large n, mn(B1) is approximately b and mn(F) is approxi-mately a.

Step 2: We know from (7.10)-(7.11) that for n ≥ N(1)

| m_n(F − B_1) − (a − b) | ≤ | m_n(F) − a | + | m_n(B_1) − b | ≤ ε/2,

and hence we can choose M(2) sufficiently large to ensure that

| m_{N(1)}( ⋃_{j=M(1)+1}^{M(2)} F_j ) − (a − b) | = | m_{N(1)}(B_2) − (a − b) | ≤ 3ε/4. (7.12)

From (7.9)

b − ∑_{j=1}^{M(2)} m(F_j) = b − m(B_1 ∪ B_2) < ε/8,

and hence analogous to (7.10) we can choose N(2) large enough to en-sure that

| b − m_n(B_1 ∪ B_2) | < ε/4,  n ≥ N(2). (7.13)

Observe that mN(1) assigns probability roughly b to B1 and roughly(a− b) to B2, leaving only a tiny amount for the remainder of F . mN(2),however, assigns roughly b to B1 and to the union B1 ∪ B2 and henceassigns almost nothing to B2. Thus the compensation term a − b mustbe made up at higher indexed Bi’s.

Step 3: From (7.13) of Step 2 and (7.11)


| m_n( F − ⋃_{i=1}^{2} B_i ) − (a − b) | ≤ | m_n(F) − a | + | m_n( ⋃_{i=1}^{2} B_i ) − b | ≤ ε/4 + ε/4 = ε/2,  n ≥ N(2).

Hence as in (7.12) we can choose M(3) sufficiently large to ensure that

| m_{N(2)}( ⋃_{j=M(2)+1}^{M(3)} F_j ) − (a − b) | = | m_{N(2)}(B_3) − (a − b) | ≤ 3ε/4. (7.14)

Analogous to (7.13) using (7.9) we can choose N(3) large enough to en-sure that

| b − m_n( ⋃_{i=1}^{3} B_i ) | ≤ ε/4,  n ≥ N(3). (7.15)

...

Step k: From step k− 1 for sufficiently large M(k)

| m_{N(k−1)}( ⋃_{j=M(k−1)+1}^{M(k)} F_j ) − (a − b) | = | m_{N(k−1)}(B_k) − (a − b) | ≤ 3ε/4 (7.16)

and for sufficiently large N(k)

| b − m_n( ⋃_{i=1}^{k} B_i ) | ≤ ε/4,  n ≥ N(k). (7.17)

Intuitively, we have constructed a sequencemN(k) of probability mea-sures such that mN(k) always puts a probability of approximately b onthe union of the first k Bi and a probability of about a − b on Bk+1. Infact, most of the probability weight of the union of the first k Bi lies in B1

from (7.9). Together this gives a probability of about a and hence yieldsthe probability of the entire set F , thusmN(k) puts nearly the total prob-ability of F on the first B1 and on Bk+1. Observe then that mN(k) keepspushing the required difference a − b to higher indexed sets Bk+1 as kgrows. If we now define the set B of (7.8) as the union of all of the oddBj , then the probability of B will include b from B1 and it will includethe a − b from Bk+1 if and only if k + 1 is odd, that is, if k is even. Theremaining sets will contribute a negligible amount. Thus we will havethat mN(k)(B) is about b + (a − b) = a for k even and b for k odd. Wecomplete the proof by making this idea precise. If k is even, then from(7.10) and (7.16)


m_{N(k)}(B) = ∑_{odd j} m_{N(k)}(B_j) ≥ m_{N(k)}(B_1) + m_{N(k)}(B_{k+1})
≥ b − ε/4 + (a − b) − 3ε/4 = a − ε,  k even. (7.18)

For odd k the compensation term (a − b) is not present and we havefrom (7.11) and (7.16)

m_{N(k)}(B) = ∑_{odd j} m_{N(k)}(B_j) ≤ m_{N(k)}(F) − m_{N(k)}(B_{k+1})
≤ a + ε/4 − ((a − b) − 3ε/4) = b + ε. (7.19)

Eqs. (7.18)-(7.19) demonstrate thatmN(k)(B) indeed oscillates as claimedand hence cannot converge, yielding the desired contradiction and com-pleting the proof of the lemma. 2

Say that a probability measure m and transformation T are such that the arithmetic means of the iterates of m with respect to T converge to a set function m̄; that is,

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F) = m̄(F),  all events F. (7.20)

From the previous lemma, m̄ is a probability measure. Furthermore, it follows immediately from (7.20) that m̄ is stationary with respect to T, that is,

m̄(T^{-1}F) = m̄(F),  all events F.

Since m̄ is the limiting arithmetic mean of the iterates of a measure m and since it is stationary, a probability measure m for which (7.5) holds, that is, for which all of the limits of (7.20) exist, is said to be asymptotically mean stationary or AMS. The limiting probability measure m̄ is called the stationary mean of m.

Since we have shown that a random process possessing ergodic prop-erties with respect to the class of all indicator functions of events is AMSwe have proved the following result.

Theorem 7.1. A necessary condition for a random process to possess er-godic properties with respect to the class of all indicator functions ofevents (and hence with respect to any larger class such as the class ofall bounded measurements) is that it be asymptotically mean stationary.


We shall see in the next chapter that the preceding condition is alsosufficient.

Exercise

1. A dynamical system (Ω,B,m,T) is N-stationary for a positive integerN if it is stationary with respect to TN , that is, if for all F m(T−iF) isperiodic with period N . A system is said to be block stationary if it isN-stationary for some N . Show that an N-stationary process is AMSwith stationary mean

m̄(F) = N^{-1} ∑_{i=0}^{N-1} m(T^{-i}F).

Show in this case that m ≪ m̄.
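A concrete check of the formula in the preceding exercise may be helpful; the following Python fragment is an added illustration, not part of the original text. It takes the 2-stationary process that deterministically alternates 0, 1, 0, 1, . . . starting in phase 0 and verifies numerically that the arithmetic means n^{-1} ∑_{i<n} m(T^{-i}F) for F = {x : x_0 = 0} converge to the stationary mean value (m(F) + m(T^{-1}F))/2 = 1/2.

# x_k = k mod 2 with probability one, so m(T^{-i}F) = Pr(x_i = 0) = 1 if i even, else 0.
def m_shifted(i):
    return 1.0 if i % 2 == 0 else 0.0

for n in (1, 2, 10, 11, 1000, 1001):
    print(n, sum(m_shifted(i) for i in range(n)) / n)   # tends to 1/2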

7.3 Asymptotically Mean Stationary Processes

In this section we continue to develop the implications of ergodic prop-erties. The emphasis here is on a class of measurements and eventsthat are intimately connected to ergodic properties and will play a ba-sic role in further necessary conditions for such properties as well asin the demonstration of sufficient conditions. These measurements helpdevelop the relation between an AMS measurem and its stationary meanm. The theory of AMS processes follows the work of Dowker [22] [23],Rechard [88], and Gray and Kieffer [39].

Invariant Events and Measurements

As before, we have a dynamical system (Ω,B,m,T) with ergodic prop-erties with respect to a measurement, f and hence with probability onethe sample average

lim_{n→∞} < f >_n = lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} fT^i (7.21)

exists. From Corollary 7.1, the limit, say < f >, must be an invariantfunction. Define the event


F = {x : lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} f(T^i x) exists}, (7.22)

and observe that by assumption m(F) = 1. Consider next the set T−1F .Then x ∈ T−1F or Tx ∈ F if and only if the limit

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} f(T^i(Tx)) = f̄(Tx)

exists. Provided that f is a real-valued measurement and hence cannottake on infinity as a value (that is, it is not an extended real-valued mea-surement), then as in Lemma 7.1 the left-hand limit is exactly the sameas the limit

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} f(T^i x).

In particular, either limit exists if and only if the other one does and hence x ∈ T^{-1}F if and only if x ∈ F. We formalize this property in a definition: If an event F is such that T^{-1}F = F, then the event is said to be invariant or T-invariant or invariant with respect to T.

Thus we have seen that the event that limiting sample averages con-verge is an invariant event and that the limiting sample averages them-selves are invariant functions. Observe that we can write for all x that

f̄(x) = 1_F(x) lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} f(T^i x).

We shall see that invariant events and invariant measurements are inti-mately connected. When T is the shift, invariant measurements can beinterpreted as measurements whose output does not depend on whenthe measurement is taken and invariant events are events that are unaf-fected by shifting.

Limiting sample averages are not the only invariant measurements;other examples are

lim sup_{n→∞} < f >_n (x)

and

lim inf_{n→∞} < f >_n (x).

Observe that an event is invariant if and only if its indicator functionis an invariant measurement. Observe also that for any event F , the event


F_{i.o.} = lim sup_{n→∞} T^{-n}F = ⋂_{n=0}^{∞} ⋃_{k=n}^{∞} T^{-k}F = {x : x ∈ T^{-n}F i.o.} = {x : T^n x ∈ F i.o.} (7.23)

is invariant, where i.o. means infinitely often. To see this, observe that

T^{-1}F_{i.o.} = ⋂_{n=1}^{∞} ⋃_{k=n}^{∞} T^{-k}F

and

⋃_{k=n}^{∞} T^{-k}F ↓ F_{i.o.}.

Invariant events are “closed” to transforming, e.g., shifting: If a pointbegins inside an invariant event, it must remain inside that event for allfuture transformations or shifts. Invariant measurements are measure-ments that yield the same value on Tnx for any n. Invariant measure-ments and events yield special relationships between an AMS randomprocess and its asymptotic mean, as described in the following lemma.

Lemma 7.5. Given an AMS system with measure m and stationary mean m̄, then

m(F) = m̄(F),  all invariant F,

and

E_m f = E_{m̄} f,  all invariant f,

where the preceding equation means that if either integral exists, then so does the other and they are equal.

Proof. The first equality follows by direct substitution into the formuladefining an AMS measure. The second formula then follows from stan-dard approximation arguments, i.e., since it holds for indicator func-tions, it holds for simple functions by linearity. For positive functionstake an increasing quantized sequence and apply the simple functionresult. The two limits must converge or diverge together. For generalf , decompose it into positive and negative parts and apply the positivefunction result. 2

In some applications we deal with measurements or events that are “almost invariant” in the sense that they equal their shift with probability one instead of being invariant in the strict sense. As there are two measures to consider here, an AMS measure and its stationary mean, a little more care is required to conclude results like Lemma 7.5. For example, suppose that an event F is invariant m-a.e. in the sense that m(F∆T^{-1}F) = 0. It does not then follow that F is also invariant m̄-a.e.


or that m(F) = m̄(F). The following lemma shows that if one strengthens the definitions of almost invariant to include all shifts, then m and m̄ will behave similarly for invariant measurements and events.

Lemma 7.6. An event F is said to be totally invariant m-a.e. if

m(F∆T−kF) = 0, k = 1,2, . . . .

Similarly, a measurement f is said to be totally invariant m-a.e. if

m(f = fTk;k = 1,2, . . .) = 1.

Suppose that m is AMS with stationary mean m̄. If F is totally invariant m-a.e. then

m(F) = m̄(F),

and if f is totally invariant m-a.e.

E_m f = E_{m̄} f.

Furthermore, if f is totally invariant m-a.e., then the event f^{-1}(G) = {ω : f(ω) ∈ G} is also totally invariant and hence

m(f^{-1}(G)) = m̄(f^{-1}(G)).

Proof. It follows from the elementary inequality

|m(F) − m(G)| ≤ m(F∆G)

that if F is totally invariant m-a.e., then m(F) = m(T^{-k}F) for all positive integers k. This and the definition of m̄ imply the first relation. The second relation then follows from the first as in Lemma 7.5. If m puts probability 1 on the set D of ω for which the f(T^k ω) are all equal, then for any k

m(f^{-1}(G) ∆ T^{-k}f^{-1}(G)) = m({ω : f(ω) ∈ G} ∆ {ω : f(T^k ω) ∈ G})
= m(D ∩ ({ω : f(ω) ∈ G} ∆ {ω : f(T^k ω) ∈ G})) = 0,

since if ω is in D, f(ω) = f(T^k ω) and the given difference set is empty. □

Lemma 7.5 has the following important implication.

Corollary 7.4. Given an AMS system (Ω,B,m,T) with stationary mean m̄, then the system has ergodic properties with respect to a measurement f if and only if the stationary system (Ω,B,m̄,T) has ergodic properties


with respect to f . Furthermore, the limiting sample average can be chosento be the same for both systems.

Proof. The set F = x : limn→∞ < f >n exists is invariant and hencem(F) =m(F) from the lemma. Thus either both measures assign prob-ability one to this set and hence both systems have ergodic propertieswith respect to f or neither does. Choosing < f > (x) as the given limitfor x ∈ F and 0 otherwise completes the proof. 2

Another implication of the lemma is the following relation between an AMS measure and its stationary mean. We first require the definition of an asymptotic form of domination or absolute continuity. Recall that m ≪ η means that η(F) = 0 implies m(F) = 0. We say that a measure η asymptotically dominates m (with respect to T) if η(F) = 0 implies that

lim_{n→∞} m(T^{-n}F) = 0.

Roughly speaking, if η asymptotically dominates m, then zero probability events under η will have vanishing probability under m if they are shifted far into the future. This can be considered as a form of steady state behavior in that such events may have nonzero probability as transients, but eventually their probability must tend to 0 as things settle down.
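A hypothetical example (not in the original text) may make asymptotic domination concrete. Consider a one-sided Markov process started in a transient state 0 that moves to an absorbing state 1 with probability q on each step. The stationary mean concentrates on the all-ones sequence, so the event F = {x : x_0 = 0} has stationary-mean probability 0 while m(F) = 1; asymptotic domination only requires that m(T^{-n}F) = Pr(X_n = 0) vanish, as the following Python lines check.

q = 0.3                              # arbitrary escape probability from the transient state
for n in (0, 1, 5, 10, 50):
    print(n, (1 - q) ** n)           # m(T^{-n}F) = Pr(X_n = 0) = (1 - q)^n, which tends to 0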

Corollary 7.5. If m is asymptotically mean stationary with stationary mean m̄, then m̄ asymptotically dominates m. If T is invertible (e.g., the two-sided shift), then asymptotic domination implies ordinary domination and hence m ≪ m̄.

Proof. From the lemma, m and m̄ place the same probability on invariant sets. Suppose that F is an event such that m̄(F) = 0. Then the set G = lim sup_{n→∞} T^{-n}F of (7.23) has m̄ measure 0 since from the continuity of probability and the union bound

m̄(G) = lim_{n→∞} m̄( ⋃_{k=n}^{∞} T^{-k}F ) ≤ m̄( ⋃_{k=1}^{∞} T^{-k}F ) ≤ ∑_{k=1}^{∞} m̄(T^{-k}F) = 0.

In addition G is invariant, and hence from the lemma m(G) = 0. Apply-ing Fatou’s lemma (Lemma 5.10(c)) to indicator functions yields

lim sup_{n→∞} m(T^{-n}F) ≤ m(lim sup_{n→∞} T^{-n}F) = 0,

proving the first statement. Suppose that η asymptotically dominatesm and that T is invertible. Say η(F) = 0. Since T is invertible the set


⋃_{n=−∞}^{∞} T^n F is measurable and as in the preceding argument has η measure 0. Since the set is invariant it also has m measure 0. Since the set contains the set F, we must have m(F) = 0, completing the proof. □

The next lemma pins down the connection between invariant eventsand invariant measurements. Let I denote the class of all invariantevents and, as in the previous chapter, let M(I) denote the class of allmeasurements (B-measurable functions) that are measurable with re-spect to I . The following lemma shows that M(I) is exactly the class ofall invariant measurements.

Lemma 7.7. Let I denote the class of all invariant events. Then I is a σ -field and M(I) is the class of all invariant measurements. In addition,I = σ(M(I)) and hence I is the σ -field generated by the class of allinvariant measurements.

Proof. That I is a σ -field follows immediately since inverse images (un-der T) preserve set theoretic operations. For example, if F and G areinvariant, then T−1(F ∪ G) = T−1F ∪ T−1G = F ∪ G and hence theunion is also invariant. The class M(I) contains all indicator functionsof invariant events, hence all simple invariant functions. For example, iff(ω) =

∑bi1Fi(ω) is a simple function in this class, then its inverse im-

ages must be invariant, and hence f−1(bi) = Fi ∈ I is an invariant event.Thus f must be an invariant simple function. Taking limits in the usualway then implies that every function in M(I) is invariant. Conversely,suppose that f is a measurable invariant function. Then for any Borelset G, T−1f−1(G) = (fT)−1(G) = f−1(G), and hence the inverse imagesof all Borel sets under f are invariant and hence in I . Thus measurableinvariant function must be inM(I). The remainder of the lemma followsfrom Lemma 6.2. 2

Tail Events and Measurements

The sub-σ -field of invariant events plays an important role when study-ing limiting averages. There are other sub-σ -fields that likewise play animportant role in studying asymptotic behavior, especially when deal-ing with one-sided processes or noninvertible transformations. Given adynamical system (Ω,B,m,T), consider a random process Xn con-structed in the usual way from a random variable X0 and the transfor-mation T as Xn(ω) = X0(Tnω). Define

F0 = σ(X0, X1, X2, . . .), (7.24)

the sub-σ-field generated by the collection of random variables with nonnegative time indexes. If we are dealing with the Kolmogorov representation of a one-sided random process (noninvertible shift), then this is


simply the full σ -field for the process Xn, F0 = B. Define the sub-σ -fields

F_n = T^{-n}F_0 = σ(X_n, X_{n+1}, X_{n+2}, . . .);  n = 0, 1, . . . (7.25)

F_∞ = ⋂_{n=0}^{∞} F_n. (7.26)

F∞ is called the tail σ -field of the process Xn if F0 has the form of(7.24). The tail σ -field F∞ goes by many other names, including the re-mote σ -field , the right tail sub-σ -field , and the future tail sub-σ -fieldIntuitively, it contains all of those events that can be determined by

looking at future outcomes arbitrarily far into the future; that is, mem-bership in a tail event cannot be determined by any of the individual Xn,only by their aggregate behavior in the distant future. Strictly speaking,we should always refer to the relevant process Xn when referring toa tail σ -field, but for the present discussion we will simply assume thatwe are dealing with a directly-given representation of the process Xnand that (7.24) holds.

If T is invertible, one can also define a left tail sub-σ -field by replacingT−1 by T — which considers events depending only on the remote past.The left tail sub-σ -field is also called the past tail sub-σ -field. We shallconcentrate on the right or future tail.

In the case of a directly-given one-sided random process Xn, F0 = Band hence

F_n = T^{-n}B

F_∞ = ⋂_{n=0}^{∞} T^{-n}B.

If F ∈ F_0 is invariant, then T^{-n}F = F ∈ T^{-n}F_0 = F_n for all n and hence F ∈ F_∞, so that invariant events are tail events and I ⊂ F_∞. This argument does not hold in the directly-given two-sided case, however, where F_0 ≠ B and T^{-n}B = B. In the one-sided case, this fact has an interesting and simple implication. The process {X_n} is ergodic if and only if I, the σ-field of invariant events, is trivial, that is, if all events in this σ-field have probability 0 or 1. A process {X_n} is said to satisfy the 0-1 law and to be a Kolmogorov process or K-process if it has the property that all events in F_∞ are trivial, that is, if all tail events have probability 0 or 1. See, e.g., [3, 103, 77]. Thus in the one-sided case at least, a K-process must be ergodic. A dynamical system (Ω,B,m,T) is said to be K or Kolmogorov if all random processes defined on it in the usual way are K-processes; that is, for any random variable X_0 defined on (Ω,B,m) the random process {X_n} defined by X_n(ω) = X_0(T^n ω) is a K-process.


In the two-sided case, things are more complicated. Invariant setsmight not be tail events, but it can be shown that if the measure is sta-tionary, then invariant sets differ from tail events by null sets (see, e.g.,[62], p. 27) and hence a stationary K-process is ergodic in general.

A measurement f is called a tail function if it is measurable with re-spect to the tail σ -field. Intuitively, a measurement f is a tail functiononly if we can determine its value by looking at the outputs Xn,Xn+1, . . .for arbitrarily large values of n.

Asymptotic Domination by a Stationary Measure

We have seen that if T is invertible, as in a two-sided directly given random process, and if m is AMS with stationary mean m̄, then m ≪ m̄, and hence any event having 0 probability under m̄ also has 0 probability under m. In the one-sided case this is not in general true, although we have seen that it is true for invariant events. We shall shortly see that it is also true for tail events. First, however, we need a basic property of asymptotic dominance. If m ≪ m̄, the Radon-Nikodym theorem implies that m(F) = ∫_F (dm/dm̄) dm̄. The following lemma provides an asymptotic version of this relation for asymptotically dominated measures.

Lemma 7.8. Given measuresm and η, let (mT−n)a denote the absolutelycontinuous part ofmT−n with respect to η as in the Lebesgue decomposi-tion theorem (Theorem 6.3) and define the Radon-Nikodym derivative

f_n = d(mT^{-n})_a / dη;  n = 1, 2, . . . .

If η asymptotically dominates m then

lim_{n→∞} [ sup_{F∈B} | m(T^{-n}F) − ∫_F f_n dη | ] = 0.

Proof. From the Lebesgue decomposition theorem and the Radon-Nikodym theorem (Theorems 6.3 and 6.2), for each n = 1, 2, . . . there exists a B_n ∈ B such that

mT^{-n}(F) = mT^{-n}(F ∩ B_n) + ∫_F f_n dη;  F ∈ B, (7.27)

with η(B_n) = 0. Define B = ⋃_n B_n. Then η(B) = 0, and hence by assumption

0 ≤ mT^{-n}(F) − ∫_F f_n dη = mT^{-n}(F ∩ B_n) ≤ mT^{-n}(F ∩ B) ≤ mT^{-n}(B) → 0 as n → ∞.


Since the bound is uniform over F, the lemma is proved. □

Corollary 7.6. Suppose that η is stationary and asymptotically dominatesm (e.g., m is AMS and η is the stationary mean), then if F is a tail event(F ∈ F∞) and η(F) = 0, then also m(F) = 0.

Proof. First observe that if T is invertible, the result follows from the fact that asymptotic dominance implies dominance. Hence assume that T is not invertible and that we are dealing with a one-sided process {X_n} and hence F_∞ = ⋂_n T^{-n}B. Let F ∈ F_∞ have η probability 0. Find F_n so that T^{-n}F_n = F, n = 1, 2, . . . (This is possible since F ∈ F_∞ means that also F ∈ T^{-n}B for all n and hence there must be an F_n ∈ B such that F = T^{-n}F_n.) From the lemma, m(F) − ∫_{F_n} f_n dη → 0. Since η is stationary, ∫_{F_n} f_n dη = ∫_F f_n T^n dη = 0, hence m(F) = 0. □

Corollary 7.7. Let µ and η be probability measures on (Ω,B) with η stationary. Then the following are equivalent:

(a) η asymptotically dominates µ.
(b) If F ∈ B is invariant and η(F) = 0, then also µ(F) = 0.
(c) If F ∈ F_∞ and η(F) = 0, then also µ(F) = 0.

Proof. The previous corollary shows that (a) => (c). (c) => (b) immedi-ately from the fact that invariant events are also tail events. To prove that(b) => (a), assume that η(F) = 0. The set G = lim supn→∞ T−nF of (7.23)is invariant and has η measure 0. Thus as in the proof of Corollary 7.5

lim sup_{n→∞} µ(T^{-n}F) ≤ µ(lim sup_{n→∞} T^{-n}F) = 0. □

Corollary 7.8. Let µ and η be probability measures on (Ω,B). If η is sta-tionary and has ergodic properties with respect to a measurement f andif η asymptotically dominates µ, then also µ has ergodic properties withrespect to f .

Proof. The set {x : lim_{n→∞} < f >_n exists} is invariant and hence if its complement (which is also invariant) has measure 0 under η, it also has measure zero under µ from the previous corollary. □

Exercises

1. Is the event

lim inf_{n→∞} T^{-n}F = ⋃_{n=0}^{∞} ⋂_{k=n}^{∞} T^{-k}F = {x : x ∈ T^{-n}F for all but a finite number of n}


invariant? Is it a tail event?

2. Show that given an event F and ε > 0, the event

{x : | n^{-1} ∑_{i=0}^{n-1} f(T^i x) | ≤ ε  i.o.}

is invariant.

3. Suppose that m is stationary and let G be an event for which T^{-1}G ⊂ G. Show that m(G − T^{-1}G) = 0 and that there is an invariant event G̃ with m(G ∆ G̃) = 0.

4. Is an invariant event a tail event? Is a tail event an invariant event? IsFi.o. of (7.23) a tail event?

5. Suppose that m is the distribution of a stationary process and that Gis an event with nonzero probability. Show that if G is invariant, thenthe conditional distribution mG defined by mG(F) = m(F|G) is alsostationary. Show that even if G is not invariant, mG is AMS.

6. Suppose that T is an invertible transformation (e.g., the shift in a two-sided process) and thatm is AMS with respect to T . Show thatm(F) >0 implies that m(TnF) > 0 i.o. (infinitely often).

7.4 Recurrence

We have seen that AMS measures are closely related to stationary mea-sures. In particular, an AMS measure is either dominated by a stationarymeasure or asymptotically dominated by a stationary measure. Since thesimpler and stronger nonasymptotic form of domination is much easierto work with, the question arises as to conditions under which an AMSsystem will be dominated by a stationary system. We have already seenthat this is the case if the transformation T is invertible, e.g., the systemis a two-sided random process with shift T . In this section we provide amore general set of conditions that relates to what is sometimes calledthe most basic ergodic property of a dynamical system, that of recur-rence.

The original recurrence theorem goes back to Poincaré [85]. We briefly describe it as it motivates some related definitions and results.

Given a dynamical system (Ω,B,m,T), let G ∈ B. A point x ∈ G is said to be recurrent with respect to G if there is a finite n = n_G(x) such that

T^{n_G(x)} x ∈ G,

that is, if the point in G returns to G after some finite number of shifts. An event G having positive probability is said to be recurrent if almost every point in G is recurrent with respect to G. In order to have a useful


definition for all events (with or without positive probability), we state itin a negative fashion: Define

G^* = ⋃_{n=1}^{∞} T^{-n}G, (7.28)

the set of all sequences that land in G after one or more shifts. Then

B = G −G∗ (7.29)

is the collection of all points in G that never return to G, the bad or nonrecurrent points. We now define an event G to be recurrent if m(G − G^*) = 0. The dynamical system will be called recurrent if every event is a recurrent event. Observe that if the system is recurrent, then for all events G

m(G ∩ G^*) = m(G) − m(G ∩ G^{*c}) = m(G) − m(G − G^*) = m(G). (7.30)

Thus if m(G) > 0, then the conditional probability that the point willeventually return to G is m(G∗|G) =m(G ∩G∗)/m(G) = 1.

Before stating and proving the Poincaré recurrence theorem, we make two observations about the preceding sets. First observe that if x ∈ B, then we cannot have T^k x ∈ G for any k ≥ 1. Since B is a subset of G, this means that we cannot have T^k x ∈ B for any k ≥ 1, and hence the events B and T^{-k}B must be pairwise disjoint for k = 1, 2, . . .; that is,

B ∩ T−kB = ∅;k = 1,2, . . .

If these sets are empty, then so must be their inverse images with respectto Tn for any positive integer n, that is,

∅ = T−n(B ∩ T−kB) = (T−nB)∩ (T−(n+k)B)

for all integers n and k. This in turn implies that

T^{-k}B ∩ T^{-j}B = ∅;  k, j = 1, 2, . . . , k ≠ j; (7.31)

that is, the sets T−nB are pairwise disjoint. We shall call any set Wwith the property that T−nW are pairwise disjoint a wandering set. Thus(7.31) states that the collection G − G∗ of nonrecurrent sequences is awandering set.

As a final property, observe that G∗ as defined in (7.28) has a simpleproperty when it is shifted:


T^{-1}G^* = ⋃_{n=1}^{∞} T^{-n-1}G = ⋃_{n=2}^{∞} T^{-n}G ⊂ G^*,

and hence

T^{-1}G^* ⊂ G^*. (7.32)

Thus G∗ has the property that it is shrunk or compressed by the inversetransformation. Repeating this shows that

T^{-k}G^* = ⋃_{n=k+1}^{∞} T^{-n}G ⊃ T^{-(k+1)}G^*. (7.33)

We are now prepared to state and prove what is often considered thefirst ergodic theorem.

Theorem 7.2. (The Poincaré Recurrence Theorem) If the dynamical sys-tem is stationary, then it is recurrent.

Proof. We need to show that for any event G, the event B = G−G∗ has 0measure. We have seen, however, that B is a wandering set and hence itsiterates are disjoint. This means from countable additivity of probabilitythat

m( ⋃_{n=0}^{∞} T^{-n}B ) = ∑_{n=0}^{∞} m(T^{-n}B).

Since the measure is stationary, all of the terms of the sum are the same. They then must all be 0 or the sum would blow up, violating the fact that the left-hand side is less than 1. □
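For readers who want to see recurrence numerically, the sketch below is an added illustration (the rotation angle and target interval are arbitrary choices). It simulates the stationary system given by rotation of the unit interval, Tx = (x + α) mod 1 with Lebesgue measure, and finds a finite return time to G = [0, 0.05) for several points of G, as the theorem guarantees for almost every point of G.

import math

alpha = math.sqrt(2) - 1             # an irrational rotation angle (assumed example)

def T(x):
    return (x + alpha) % 1.0

def return_time(x, a=0.0, b=0.05, max_iter=10**6):
    # smallest n >= 1 with T^n x in G = [a, b), or None if none is found
    y = x
    for n in range(1, max_iter + 1):
        y = T(y)
        if a <= y < b:
            return n
    return None

for x in (0.0, 0.01, 0.04):
    print(x, return_time(x))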

Although such a recurrence theorem can be viewed as a primitive er-godic property, unlike the usual ergodic properties, there is no indicationof the relative frequency with which the point will return to the targetset. We shall shortly see that recurrence at least guarantees infinite re-currence in the sense that the point will return infinitely often, but againthere is no notion of the fraction of time spent in the target set.

We wish next to explore the recurrence property for AMS systems. We shall make use of several of the ideas developed above. We begin with several definitions regarding the given dynamical system. Recall for comparison that the system is stationary if m(T^{-1}G) = m(G) for all events G. The system is said to be incompressible if every event G for which T^{-1}G ⊂ G has the property that m(G − T^{-1}G) = 0. Note the resemblance to stationarity: for all such G, m(G) = m(T^{-1}G). The system is said to be conservative if all wandering sets have 0 measure. As a final definition, we modify the definition of recurrent. Instead of the requirement that a recurrent system have the property that if an event has nonzero probability, then points in that event must return to that event with probability 1, we can strengthen the requirement to require that the points return to the event infinitely often with probability 1. Define as in (7.23) the set


G_{i.o.} = ⋂_{n=0}^{∞} ⋃_{k=n}^{∞} T^{-k}G

of all points that are in G infinitely often. The dynamical system will besaid to be infinitely recurrent if for all events G

m(G −Gi.o.) = 0.

Note from (7.33) that

⋃_{k=n+1}^{∞} T^{-k}G = T^{-n}G^*,

and hence since the T^{-n}G^* are decreasing

G_{i.o.} = ⋂_{n=0}^{∞} T^{-(n+1)}G^* = ⋂_{n=0}^{∞} T^{-n}G^*. (7.34)

The following theorem is due to Halmos [44], and the proof largely follows Wright [114], Petersen [84], or Krengel [62]. We do not, however, assume as they do that the system is nonsingular in the sense that mT^{-1} ≪ m. When this assumption is made, T is said to be null preserving because shifts of the measure inherit the zero probability sets of the original measure.

Theorem 7.3. Suppose that (Ω,B,m,T) is a dynamical system. Then the following statements are equivalent:

(a) T is infinitely recurrent.
(b) T is recurrent.
(c) T is conservative.
(d) T is incompressible.

Proof. (a)⇒ (b): In words: if we must return an infinite number of times,then we must obviously return at least once. Alternatively, applicationof (7.34) implies that

G − G_{i.o.} = G − ⋂_{n=0}^{∞} T^{-n}G^* (7.35)

so that G − G^* ⊂ G − G_{i.o.} and hence the system is recurrent if it is infinitely recurrent.

(b) ⇒ (c): Suppose that W is a wandering set. From the recurrence property W − W^* must have 0 measure. Since W is wandering, W ∩ W^* = ∅ and hence m(W) = m(W − W^*) = 0 and the system is conservative.

(c) ⇒ (d): Consider an event G with T^{-1}G ⊂ G. Then G^* = T^{-1}G since all the remaining terms in the union defining G^* are further subsets. Thus


G − T^{-1}G = G − G^*. We have seen, however, that G − G^* is a wandering set and hence has 0 measure since the system is conservative.

(b) and (d) ⇒ (a): From (7.35) we need to prove that

m(G − G_{i.o.}) = m( G − ⋂_{n=0}^{∞} T^{-n}G^* ) = 0.

Toward this end we use the fact that the T−nG∗ are decreasing to de-compose G∗ as

G^* = ( ⋂_{n=1}^{∞} T^{-n}G^* ) ∪ ⋃_{n=0}^{∞} ( T^{-n}G^* − T^{-(n+1)}G^* ).

This breakup can be thought of as a “core”⋂∞n=1 T−nG∗ containing the

sequences that shift into G∗ for all shifts combined with a union of an in-finity of “doughnuts,” where the nth doughnut contains those sequencesthat when shifted n times land in G∗, but never arrive in G∗ if shiftedmore than n times. The core and all of the doughnuts are disjoint andtogether yield all the sequences in G∗. Because the core and doughnutsare disjoint,

⋂_{n=1}^{∞} T^{-n}G^* = G^* − ⋃_{n=0}^{∞} ( T^{-n}G^* − T^{-(n+1)}G^* )

so that

G − ⋂_{n=1}^{∞} T^{-n}G^* = (G − G^*) ∪ ( G ∩ ⋃_{n=0}^{∞} ( T^{-n}G^* − T^{-(n+1)}G^* ) ). (7.36)

Thus since the system is recurrent we know by (7.36) and the unionbound that

m( G − ⋂_{n=1}^{∞} T^{-n}G^* ) ≤ m(G − G^*) + m( G ∩ ⋃_{n=0}^{∞} ( T^{-n}G^* − T^{-(n+1)}G^* ) )
≤ m(G − G^*) + m( ⋃_{n=0}^{∞} ( T^{-n}G^* − T^{-(n+1)}G^* ) )
= m(G − G^*) + ∑_{n=0}^{∞} m( T^{-n}G^* − T^{-(n+1)}G^* ),

using the disjointness of the doughnuts. (Actually, the union bound andan inequality would suffice here.) Since T−1G∗ ⊂ G∗, it is also true thatT−1T−nG∗ = T−nT−1G∗ ⊂ T−nG∗ and hence since the system is incom-pressible, all the terms in the previous sum are zero.


(d) ⇒ (b) Given an event G, let F = G ∪G∗. Then T−1F = G∗ ⊂ F andhencem(F −T−1F) = 0 by incompressibility. But F −T−1F = (G∪G∗)−G∗ = G −G∗ and hence m(G −G∗) = 0 implying recurrence.

We have now shown that (a) ⇒ (b) ⇒ (c) ⇒ (d), that (b) and (d) together imply (a), and that (d) ⇒ (b) and hence (d) ⇒ (a), which completes the proof that any one of the conditions implies the other three. □

Thus, for example, Theorem 7.2 implies that a stationary system hasall the preceding properties. Our goal, however, is to characterize theproperties for an AMS system. The following theorem characterizes re-current AMS processes: An AMS process is recurrent if and only if it isdominated by its stationary mean.

Theorem 7.4. Given an AMS dynamical system (Ω,B,m,T) with stationary mean m̄, then m ≪ m̄ if and only if the system is recurrent.

Proof. If m ≪ m̄, then the system is obviously recurrent since m̄ is recurrent from the previous two theorems and m inherits all of the zero probability sets from m̄. To prove the converse implication, suppose the contrary. Say that there is a set G for which m(G) > 0 and m̄(G) = 0 and that the system is recurrent. From Theorem 7.3 this implies that the system is also infinitely recurrent, and hence

m(G −Gi.o.) = 0

and therefore m(G_{i.o.}) ≥ m(G) > 0.

We have seen in Section 7.3, however, that G_{i.o.} is an invariant set, and hence from Lemma 7.5, m̄(G_{i.o.}) > 0. Since m̄ is stationary and gives probability 0 to G, m̄(T^{-k}G) = 0 for all k, and hence

m̄(G_{i.o.}) = m̄( ⋂_{n=0}^{∞} ⋃_{k=n}^{∞} T^{-k}G ) ≤ m̄( ⋃_{k=0}^{∞} T^{-k}G ) ≤ ∑_{k=0}^{∞} m̄(T^{-k}G) = 0,

yielding a contradiction and proving the theorem. □

If the transformation T is invertible, e.g., the two-sided shift, then we have seen that m ≪ m̄ and hence the system is recurrent. When the shift is noninvertible, the theorem proves that domination by a stationary measure is equivalent to recurrence. Thus, in particular, not all AMS processes need be recurrent.


Exercises

1. Show that if an AMS dynamical system is nonsingular (mT^{-1} ≪ m), then m and the stationary mean m̄ will be equivalent (have the same probability 0 events) if and only if the system is recurrent.

2. Show that ifm is nonsingular (null preserving) with respect to T , thenincompressibility of T implies that it is conservative and hence thatall four conditions of Theorem 7.3 are equivalent.

7.5 Asymptotic Mean Expectations

We have seen that the existence of ergodic properties implies that limits of sample means of expectations exist; e.g., if < f >_n converges, then so does n^{-1} ∑_{i=0}^{n-1} E_m fT^i. In this section we relate these limits to expectations with respect to the stationary mean m̄.

Lemma 7.3 showed that the arithmetic mean of the expected values of transformed measurements converged. The following lemma shows that the limit is given by the expected value of the original measurement under the stationary mean.

Lemma 7.9. Let m be an AMS random process with stationary mean m̄. If a measurement f is bounded, then

lim_{n→∞} E_m(< f >_n) = lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} E_m(fT^i) = lim_{n→∞} E_{m_n} f = E_{m̄}(f). (7.37)

Proof. The result is immediately true for indicator functions from thedefinition of AMS. It then follows for simple functions from the linearityof limits. In general let qk denote the quantizer sequence of Lemma 5.3and observe that for each k we have

|E_m(< f >_n) − E_{m̄}(f)| ≤ |E_m(< f >_n) − E_m(< q_k(f) >_n)| + |E_m(< q_k(f) >_n) − E_{m̄}(q_k(f))| + |E_{m̄}(q_k(f)) − E_{m̄}(f)|.

Choose k larger than the maximum of the bounded function f. The middle term goes to zero from the simple function result. The rightmost term is bound above by 2^{-k} since we are integrating a function that is everywhere less than 2^{-k}. Similarly the first term on the right is bound above by

n^{-1} ∑_{i=0}^{n-1} E_m|fT^i − q_k(fT^i)| ≤ 2^{-k},


where we have used the fact that q_k(f)(T^i x) = q_k(f(T^i x)). Since k can be made arbitrarily large and hence 2 × 2^{-k} arbitrarily small, the lemma is proved. □
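Continuing the hypothetical transient-state Markov example introduced in Section 7.3 (state 0 transient, state 1 absorbing, escape probability q, started in state 0), the following added Python lines check numerically that for f = 1_{x_0 = 0} the arithmetic means n^{-1} ∑_{i<n} E_m(fT^i) converge to the expectation of f under the stationary mean, which is 0.

q = 0.3                                     # same arbitrary escape probability as before

def expected_f_shift(i):
    return (1 - q) ** i                     # E_m(f T^i) = Pr(X_i = 0)

for n in (1, 10, 100, 1000):
    print(n, sum(expected_f_shift(i) for i in range(n)) / n)   # tends to 0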

Thus the stationary mean provides the limiting average expectationof the transformed measurements. Combining the previous lemma withLemma 7.3 immediately yields the following corollary.

Corollary 7.9. Let m be an AMS measure with stationary mean m̄. If m has ergodic properties with respect to a bounded measurement f and if the limiting sample average is < f >, then

E_m < f > = E_{m̄} f. (7.38)

Thus the expectation of the measurement under the stationary meangives the expectation of the limiting sample average under the originalmeasure.

If we also have uniform integrability, we can similarly extend Corol-lary 7.2.

Lemma 7.10. Let m be an AMS random process with stationary mean m̄ that has the ergodic property with respect to a measurement f. If the < f >_n are uniformly m-integrable, then f is also m̄-integrable and (7.37) and (7.38) hold.

Proof. From Corollary 7.2

lim_{n→∞} E_m(< f >_n) = E_m < f >.

In particular, the right-hand integral exists and is finite. From Corollary 7.1 the limit < f > is invariant. If the system is AMS, the invariance of < f > and Lemma 7.5 imply that

E_m < f > = E_{m̄} < f >.

Since m̄ is stationary, Lemma 5.23 implies that the sequence < f >_n is uniformly integrable with respect to m̄. Thus from Lemma 5.20 and the convergence of < f >_n on a set of m̄ probability one (Corollary 7.4), the right-hand integral is equal to

lim_{n→∞} E_{m̄}( n^{-1} ∑_{i=0}^{n-1} fT^i ).

From the linearity of expectation, the stationarity of m̄, and Lemma 5.22, for each value of n the preceding expectation is simply E_{m̄} f. □


Exercises

1. Show that if m is AMS, then for any positive integer N and all i = 0, 1, . . . , N − 1 the following limits exist:

lim_{n→∞} n^{-1} ∑_{j=0}^{n-1} m(T^{-i-jN}F).

2. Define the preceding limit to be m_i(F). Show that the m_i are N-stationary and that if m̄ is the stationary mean of m, then

m̄(F) = N^{-1} ∑_{i=0}^{N-1} m_i(F).

7.6 Limiting Sample Averages

In this section limiting sample averages are related to conditional expec-tations. The basis for this development is the result of Section 7.1 thatlimiting sample averages are invariant functions and hence are measur-able with respect to I , the σ -field of all invariant events.

We begin with limiting relative frequencies, that is, the limiting sample averages of indicator functions. Say we have a dynamical system (Ω,B,m,T) that has ergodic properties with respect to the class of all indicator functions and which is therefore AMS. Fix an event G and let < 1_G > denote the corresponding limiting relative frequency, that is, < 1_G >_n → < 1_G > as n → ∞ a.e. This limit is invariant and therefore in M(I) and is measurable with respect to I. Fix an event F ∈ I and observe that from Lemma 7.5

∫_F < 1_G > dm = ∫_F < 1_G > dm̄. (7.39)

For invariant F , 1F is an invariant function and hence

< 1_F f >_n = n^{-1} ∑_{i=0}^{n-1} (1_F f)T^i = 1_F n^{-1} ∑_{i=0}^{n-1} fT^i = 1_F < f >_n

and hence if < f >n→< f > then

< 1Ff >= 1F < f > . (7.40)

Thus from Corollary 7.9 we have for invariant F that


∫_F < 1_G > dm = E_m < 1_F 1_G > = E_{m̄}(1_F 1_G) = m̄(F ∩ G),

which with (7.39) implies that

∫_F < 1_G > dm̄ = m̄(F ∩ G),  all F ∈ I. (7.41)

Eq. (7.41) together with the I-measurability of the limiting sample average yield (6.23) and (6.24) and thereby show that the limiting sample average is a version of the conditional probability m̄(G|I) of G given the σ-field I or, equivalently, given the class of all invariant measurements! We formalize this in the following theorem.

Theorem 7.5. Given a dynamical system (Ω,B,m,T) let I denote the class of all T-invariant events. If the system is AMS and if it has ergodic properties with respect to the indicator function 1_G of an event G (e.g., if the system has ergodic properties with respect to all indicator functions of events), then with probability one under m and m̄

< 1_G >_n = n^{-1} ∑_{i=0}^{n-1} 1_G T^i → m̄(G|I) as n → ∞.

In other words,

< 1_G > = m̄(G|I). (7.42)

It is important to note that the conditional probability is defined by using the stationary mean measure m̄ and not the original probability measure m. Intuitively, this is because it is the stationary mean that describes asymptotic events.

Eq. (7.42) shows that the limiting sample average of the indicator func-tion of an event, that is, the limiting relative frequency of the event,satisfies the descriptive definition of the conditional probability of theevent given the σ -field of invariant events. An interpretation of this re-sult is that if we wish to guess the probability of an event G based on theknowledge of the occurrence or nonoccurrence of all invariant eventsor, equivalently, on the outcomes of all invariant measurements, thenour guess is given by one particular invariant measurement: the relativefrequency of the event in question.

A natural question is whether Theorem 7.5 holds for more generalfunctions than indicator functions; that is, if a general sample average< f >n converges, must the limit be the conditional expectation? Thefollowing lemma shows that this is true in the case of bounded measure-ments.


Lemma 7.11. Given the assumptions of Theorem 7.5, if the system has ergodic properties with respect to a bounded measurement f, then with probability one under m and m̄

< f >_n = n^{-1} ∑_{i=0}^{n-1} fT^i → E_{m̄}(f|I) as n → ∞.

In other words,

< f > = E_{m̄}(f|I).

Proof. Say that the limit of the left-hand side is an invariant function< f >. Apply Lemma 7.9 to the bounded measurement g = f1F with aninvariant event F and

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} ∫_F fT^i dm = ∫_F f dm̄.

From Lemma 7.3 the left-hand side is also given by ∫_F < f > dm. The equality of the two integrals for all invariant F and the invariance of the limiting sample average imply from Lemma 7.12 that < f > is a version of the conditional expectation of f given I. □

Replacing Lemma 7.9 by Lemma 7.10 in the preceding proof gives partof the following generalization.

Corollary 7.10. Given a dynamical system (Ω,B,m,T) let I denote the class of all T-invariant events. If the system is AMS with stationary mean m̄ and if it has ergodic properties with respect to a measurement f for which either (i) the < f >_n are uniformly integrable with respect to m or (ii) f is m̄-integrable, then with probability one under m and m̄

< f >_n = n^{-1} ∑_{i=0}^{n-1} fT^i → E_{m̄}(f|I) as n → ∞.

In other words,

< f > = E_{m̄}(f|I).

Proof. Part (i) follows from the comment before the corollary. From Corollary 8.5, since m has ergodic properties with respect to f, then so does m̄, and the limiting sample average < f > is the same (up to an almost everywhere equivalence under both measures). Since f is by assumption m̄-integrable and since m̄ is stationary, the < f >_n are uniformly m̄-integrable, and hence application of (i) to m̄ proves that the given equality holds m̄-a.e. The set on which the equality holds is invariant, however, since both < f > and E_{m̄}(f|I) are invariant. Hence the given equality also holds m-a.e. from Lemma 7.5. □


Exercises

1. Let m be AMS with stationary mean m. Let M denote the space of allsquare-integrable invariant measurements. Show that M is a closedlinear subspace of L2(m). If m has ergodic properties with respect toa bounded measurement f , show that

< f > = P_M(f);

that is, the limiting sample average is the projection of the originalmeasurement onto the subspace of invariant measurements. From theprojection theorem, this means that < f > is the minimum mean-squared estimate of f given the outcome of all invariant measure-ments.

2. Suppose that f ∈ L2(m) is a measurement and fn ∈ L2(m) is a se-quence of measurements for which ||fn − f || →n→∞ 0. Suppose alsothat m is stationary. Prove that

n^{-1} ∑_{i=0}^{n-1} f_i T^i → < f >  in L_2(m),

where < f > is the limiting sample average of f .

7.7 Ergodicity

A special case of ergodic properties of particular interest occurs whenthe limiting sample averages are constants instead of random variables,that is, when a sample average converges with probability one to a num-ber known a priori rather than to a random variable. This will be the case,for example, if a system is such that all invariant measurements equalconstants with probability one. If this is true for all measurements, thenit is true for indicator functions of invariant events. Since such functionscan assume only values of 0 or 1, for such a system

m(F) = 0 or 1 for all F such that T−1F = F. (7.43)

Conversely, if (7.43) holds every indicator function of an invariantevent is either 1 or 0 with probability 1, and hence every simple invariantfunction is equal to a constant with probability one. Since every invariantmeasurement is the pointwise limit of invariant simple measurements(combine Lemma 5.3 and Lemma 7.7), every invariant measurement isalso a constant with probability one (the ordinary limit of the constants


equaling the invariant simple functions). Thus every invariant measure-ment will be constant with probability one if and only if (7.43) holds.

Suppose next that a system possesses ergodic properties with respectto all indicator functions or, equivalently, with respect to all boundedmeasurements. Since the limiting sample averages of all indicator func-tions of invariant events (in particular) are constants with probabilityone, the indicator functions themselves are constant with probabilityone. Then, as previously, (7.43) follows.

Thus we have shown that if a dynamical system possesses ergodicproperties with respect to a class of functions containing the indicatorfunctions, then the limiting sample averages are constant a.e. if and onlyif (7.43) holds. This motivates the following definition: A dynamical sys-tem (or the associated random process) is said to be ergodic with respectto a transformation T or T -ergodic or, if T is understood, simply ergodic,if every invariant event has probability 1 or 0. Another name for ergodicthat is occasionally found in the mathematical literature is metricallytransitive.

Given the definition, we have proved the following result.

Lemma 7.12. A dynamical system is ergodic if and only if all invariantmeasurements are constants with probability one. A necessary conditionfor a system to have ergodic properties with limiting sample averagesbeing a.e. constant is that the system be ergodic.

Lemma 7.13. If a dynamical system (Ω,B,m,T) is AMS with stationary mean m̄, then (Ω,B,m̄,T) is ergodic if and only if (Ω,B,m,T) is.

Proof. The proof follows from the definition and Lemma 7.5. 2

Coupling the definition and properties of ergodic systems with thepreviously developed ergodic properties of dynamical systems yields thefollowing results.

Lemma 7.14. If a dynamical system has ergodic properties with respect to a bounded measurement f and if the system is ergodic, then with probability one

lim_{n→∞} < f >_n = E_m f = lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} E_m(fT^i).

If more generally the < f >_n are uniformly integrable but not necessarily bounded (and hence the convergence is also in L1(m)), then the preceding still holds. If the system is AMS and either the < f >_n are uniformly integrable with respect to m or f is m̄-integrable, then also with probability one

lim_{n→∞} < f >_n = E_m̄ f.

In particular, the limiting relative frequencies are given by


lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} 1_F T^i = m̄(F), all F ∈ B.

In addition,

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} ∫_G fT^i dm = (E_m̄ f) m(G), all G ∈ B,

where again the result holds for measurements for which the < f >_n are uniformly integrable or f is m̄-integrable. Letting f be the indicator function of an event F, then

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F ∩ G) = m̄(F)m(G), all F,G ∈ B. (7.44)

Proof. If the limiting sample average is constant, then it must equal itsown expectation. The remaining results then follow from the results ofthis chapter. 2

The lemma shows that (7.44) is necessary for a process with ergodic properties to be ergodic. It is also sufficient since if F is invariant, then from Lemma 7.5 m(F) = m̄(F), and hence with G = F the left-hand side of (7.44) is m(F) and the right-hand side is m̄(F)m(F) = m(F)², so that m(F) = 0 or 1. Thus we have proved the following corollary.

Corollary 7.11. A necessary and sufficient condition for a system with er-godic properties for all bounded measurements to be ergodic is that (7.44)hold for all events F and G.

Thus (7.44) provides a test for ergodicity. It can be viewed as a weak form of asymptotic independence. Happily we shall see shortly that the test need not be applied to all events, just to a generating field.
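As a concrete illustration (not from the text), the following Python fragment evaluates the Cesàro averages in (7.44) exactly for a simple two-state Markov source; the transition matrix, the events F = G = {x : x_0 = 0}, and all variable names are hypothetical choices made only for this sketch.

import numpy as np

# hypothetical stationary two-state Markov source
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])   # stationary distribution, solves pi P = pi

# F = G = {x : x_0 = 0}; m(T^{-i}F ∩ G) = P(X_i = 0, X_0 = 0) = pi_0 P^i(0,0)
n = 5000
Pi = np.eye(2)
total = 0.0
for i in range(n):
    total += pi[0] * Pi[0, 0]
    Pi = Pi @ P
print("Cesaro average of m(T^-i F and G):", total / n)
print("m(F) m(G)                        :", pi[0] * pi[0])

For this source the two printed values agree up to an O(1/n) transient, consistent with ergodicity; for a nonergodic mixture of two such chains the limiting average would differ from the product.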

Observe that if m is actually stationary and hence its own stationary mean, then (7.44) becomes

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F ∩ G) = m(F)m(G) (7.45)

for all events G.

The final result of the previous lemma yields a useful test for deter-

mining whether a system with ergodic properties is ergodic.

Lemma 7.15. Suppose that a dynamical system (Ω,B,m,T) has ergodic properties with respect to the indicator functions and hence is AMS with stationary mean m̄, and suppose that B is generated by a field F . Then m is ergodic if and only if


lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F ∩ F) = m̄(F)m(F), all F ∈ F . (7.46)

Proof. Necessity follows from the previous result. To show that the preceding formula is sufficient for ergodicity, first assume that the formula holds for all events F and let F be invariant. Then the left-hand side is simply m(F) = m̄(F) and the right-hand side is m̄(F)m(F) = m(F)². But these two quantities can be equal only if m(F) is 0 or 1, i.e., if it (and hence also m̄) is ergodic. We will be done if we can show that the preceding relation holds for all events given that it holds on a generating field. Toward this end fix ε > 0. From Corollary 1.5 we can choose a field event F0 such that m(F∆F0) ≤ ε and m̄(F∆F0) ≤ ε. To see that we can choose a field event F0 that provides a good approximation to F simultaneously for both measures m and m̄, apply Corollary 1.5 to the mixture measure p = (m + m̄)/2 to obtain an F0 for which p(F∆F0) ≤ ε/2. This implies that both m(F∆F0) and m̄(F∆F0) must be less than ε.

From the triangle inequality

| (1/n) ∑_{i=0}^{n-1} m(T^{-i}F ∩ F) − m̄(F)m(F) | ≤

| (1/n) ∑_{i=0}^{n-1} m(T^{-i}F ∩ F) − (1/n) ∑_{i=0}^{n-1} m(T^{-i}F0 ∩ F0) | +

| (1/n) ∑_{i=0}^{n-1} m(T^{-i}F0 ∩ F0) − m̄(F0)m(F0) | + | m̄(F0)m(F0) − m̄(F)m(F) | .

The middle term on the right goes to 0 as n → ∞ by assumption. Therightmost term is bound above by 2ε. The leftmost term is bound aboveby

(1/n) ∑_{i=0}^{n-1} | m(T^{-i}F ∩ F) − m(T^{-i}F0 ∩ F0) | .

Since for any events D, C

|m(D)−m(C)| ≤m(D∆C) (7.47)

each term in the preceding sum is bound above by m((T^{-i}F ∩ F)∆(T^{-i}F0 ∩ F0)). Since for any events D, C, H, m(D∆C) ≤ m(D∆H) + m(H∆C), each of the terms in the sum is bound above by

m((T^{-i}F ∩ F)∆(T^{-i}F0 ∩ F)) + m((T^{-i}F0 ∩ F)∆(T^{-i}F0 ∩ F0)) ≤ m(T^{-i}(F∆F0)) + m(F∆F0).


Thus the remaining sum term is bound above by

(1/n) ∑_{i=0}^{n-1} [m(T^{-i}(F∆F0)) + m(F∆F0)] →_{n→∞} m̄(F∆F0) + m(F∆F0) ≤ 2ε,

which completes the proof. 2

7.8 Block Ergodic and Totally Ergodic Processes

A random process is N-ergodic if it is ergodic with respect to T^N, that is, if every event F with T^{-N}F = F has probability 0 or 1. An N-ergodic process is also ergodic since a T-invariant event is also T^N-invariant and hence has probability zero or one. The converse is not true, however: an ergodic process need not be N-ergodic. This can cause difficulties when looking at properties of processes produced by block encoding an ergodic process, that is, processes resulting when successive N-tuples of the original process are operated on in an independent fashion to produce N-tuples of a new process.

A process is said to be totally ergodic if it is N-ergodic for all N.
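A standard example of the gap between ergodicity and N-ergodicity is the stationary binary source that equiprobably emits one of the two alternating sequences 0101··· and 1010···: it is ergodic, but the event {x : x_0 = 0} is T²-invariant with probability 1/2, so it is not 2-ergodic. The short Python sketch below (an illustration only; the sampling routine and names are not from the text) exhibits the phenomenon through sample averages.

import random

def alternating_path(n):
    # pick one of the two phase classes at random, then emit deterministically
    phase = random.randint(0, 1)
    return [(phase + i) % 2 for i in range(n)]

n = 10**5
x = alternating_path(n)
avg_T  = sum(x) / n               # sample average of f(x) = x_0 along T
avg_T2 = sum(x[0:n:2]) / (n // 2) # sample average of f along T^2
print(avg_T, avg_T2)
# avg_T is 1/2 for every sample path (T-ergodic behavior), while avg_T2 is
# 0 or 1 depending on the path, so the T^2 sample averages are not constant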

Exercises

1. Suppose that a system is N-ergodic and N-stationary. Show that the mixture

   m̄(F) = (1/N) ∑_{i=0}^{N-1} m(T^{-i}F)

   is stationary and ergodic.
2. Suppose that a process is N-ergodic and N-stationary and has the

ergodic property with respect to f. Show that

   Ef = (1/N) ∑_{i=0}^{N-1} E(fT^i).

3. If a process is stationary and ergodic and T is invertible, show that m(F) > 0 implies that

   m(⋃_{i=−∞}^{∞} T^i F) = 1.

   Show also that if m(F) > 0 and m(G) > 0, then for some i, m(F ∩ T^i G) > 0. Show that, in fact, m(F ∩ T^i G) > 0 infinitely often.


4. Suppose that (Ω,B,m) is a directly given stationary and ergodicsource with shift T and that f : Ω → Λ is a measurable mappingof sequences in Ω into another sequence space Λ with shift S with theproperty that f(Tx) = Sf(x). Such a mapping is said to be station-ary or time invariant since shifting the input simply results in shift-ing the output. Intuitively, the mapping does not change with time.Such sequence mappings are induced by sliding-block codes and time-invariant filters as considered in Section 2.6. Show that the inducedprocess (with distribution mf−1) is also stationary and ergodic.

5. Show that if m has ergodic properties with respect to all indicatorfunctions and it is AMS with stationary mean m, then

lim_{n→∞} (1/n) ∑_{i=0}^{n-1} m(T^{-i}F ∩ F) = lim_{n→∞} (1/n²) ∑_{i=0}^{n-1} ∑_{j=0}^{n-1} m(T^{-i}F ∩ T^{-j}F).

6. Show that a process m is stationary and ergodic if and only if for allf , g ∈ L2(m) we have

lim_{n→∞} (1/n) ∑_{i=0}^{n-1} E(f gT^i) = (Ef)(Eg).

7. Show that the process is AMS with stationary mean m̄ and ergodic if and only if for all bounded measurements f and g

   lim_{n→∞} (1/n) ∑_{i=0}^{n-1} E_m(f gT^i) = (E_m f)(E_m̄ g).

8. Suppose that m is a process and f ∈ L2(m). Assume that the following hold for all integers k and j:

   E_m(fT^k) = E_m f
   E_m((fT^k)(fT^j)) = E_m((f)(fT^{|k−j|}))
   lim_{k→∞} E_m(f · fT^{|k|}) = (E_m f)².

   Prove that < f >_n → E_m f in L2(m).
9. Consider the measures m and N^{-1} ∑_{i=0}^{N-1} mT^{-i}. Does ergodicity of either imply that of the other?
10. Suppose that m is N-stationary and N-ergodic. Show that the measures mT^{-n} have the same properties.
11. Suppose an AMS process is not stationary but it is ergodic. Is it recurrent? Infinitely recurrent?
12. Is a mixture of recurrent processes also recurrent?


7.9 The Ergodic Decomposition

In this section we find a characterization of the stationary mean of aprocess with bounded ergodic properties as a mixture of stationary andergodic processes. Roughly speaking, if we view the outputs of a systemwith ergodic properties, then the relative frequencies and sample aver-ages will be the same as if we were viewing an ergodic system, but we donot know a priori which one. If we view the limiting sample frequencies,however, we can determine the ergodic source being seen.

We have seen that under the given assumptions, the limiting relativefrequencies (the sample averages of the indicator functions) are condi-tional probabilities, < 1F >= m(F|I), where I is the sigma-field of in-variant events. In addition, if the alphabet is standard, one can selecta regular version of the conditional probability and hence using iter-ated expectation one can consider the stationary mean distribution mas a mixture of measures m(·|I)(x); x ∈ Ω. In this section we showthat almost all of these measures are both stationary and ergodic, andhence µ can be written as a mixture of stationary and ergodic distribu-tions. Intuitively, observing the asymptotic sample averages of systemwith ergodic properties is equivalent to observing the same measure-ments for a stationary and ergodic system that is an ergodic componentof the stationary mean, but we may not know in advance which ergodiccomponent is being observed. This result is known as the ergodic de-composition and it is due in various forms to Von Neumann [112], Rohlin[89], Kryloff and Bogoliouboff [63], and Oxtoby [80]. The developmenthere most closely parallels that of Oxtoby. (See also Gray and Davisson[34] for the simple special case of discrete alphabet processes.) We heretake the somewhat unusual step of proving the theorem first under themore general assumption that a process has bounded ergodic propertiesrather than that the process is stationary. The usual theorem will thenfollow for AMS systems from the ergodic theorem of the next chapter,which includes stationary systems as a special case. Our point is thatthe key parts of the ergodic decomposition are a direct result of the er-godic properties of sequences with respect to indicator functions ratherthan of the ergodic theorem. Although we could begin with Theorem 7.5,which characterizes the limiting relative frequencies as conditional prob-abilities, and then use Theorem 6.4 to choose a regular version of theconditional probability, it is more instructive to proceed in a direct fash-ion by assuming bounded ergodic properties. In addition, this yields astronger result wherein the decomposition is determined by the prop-erties of the points themselves and not the underlying measure; that is,the same decomposition works for all measures with bounded ergodicproperties.


We begin by gathering together the necessary definitions. Let (Ω,B,m,T) be a dynamical system and assume that (Ω,B) is a standard measurable space with a countable generating set F .

For each event F define the invariant set G(F) = {x : lim_{n→∞} < 1F >_n (x) exists} and denote the invariant limit by < 1F >. Define the set

G(F ) = ⋂_{F∈F} G(F),

the set of points x for which the limiting relative frequencies of allgenerating events exist. We shall call G(F) the set of generic points orfrequency-typical points

Thus for any x ∈ G(F)

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} 1_F(T^i x) = < 1F >(x); all F ∈ F .

Observe that membership in G(F) is determined strictly by propertiesof the points and not by an underlying measure. If m has ergodic prop-erties with respect to the class of all indicator functions (and is thereforenecessarily AMS) then clearlym(G(F)) = 1 for any event F and hence wealso have for the countable intersection thatm(G(F)) = 1. If x ∈ G(F),then the set function Px defined by

Px(F) = < 1F > (x); F ∈ F

is nonnegative, normalized, and finitely additive on F from the properties of sample averages. Since the space is standard, this set function extends to a probability measure, which we shall also denote by Px, on (Ω,B). Thus we easily obtain that every generic point induces a probability measure Px and hence also a dynamical system (Ω,B, Px, T) via its convergent relative frequencies. For x not in G(F ) we can simply fix Px as some fixed arbitrary stationary and ergodic measure p∗.
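To make the construction concrete, the following Python sketch estimates Px on two thin cylinders from a single long sample path of a biased IID binary source; the bias 0.3 and all names are hypothetical choices for illustration and are not part of the text.

import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                   # hypothetical coin bias
x = (rng.random(10**6) < p).astype(int)   # one sample path, standing in for a generic point

# relative frequencies along the path play the role of P_x(F) = <1_F>(x)
Px_x0_is_1 = np.mean(x == 1)                          # F = {x : x_0 = 1}
Px_pair_11 = np.mean((x[:-1] == 1) & (x[1:] == 1))    # F = {x : x_0 = 1, x_1 = 1}
print(Px_x0_is_1, Px_pair_11)   # close to 0.3 and 0.09 = 0.3**2, the product-measure values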

We now confine interest to systems with bounded ergodic proper-ties so that the set of generic sequences will have probability one. ThenPx(F) =< 1F > (x) almost everywhere and Px(F) = m(F|I)(x) for allx in a set of m measure one. Since this is true for any event F , we candefine the countable intersection

H = x : Px(F) =m(F|I) all F ∈ F

and infer that m(H) = 1. Thus with probability one under m the twomeasures Px and m(·|I)(x) agree on F and hence are the same. Thusfrom the properties of conditional probability and conditional expecta-tion



m̄(F ∩ D) = ∫_D Px(F) dm̄, all D ∈ I, (7.48)

and if f is in L1(m̄), then

∫_D f dm̄ = ∫_D (∫ f(y) dPx(y)) dm̄(x), all D ∈ I. (7.49)

Thus choosing D = Ω we can interpret (7.48) as stating that the sta-tionary meanm is a mixture of the probability measures Px and (7.49) asstating that the expectation of measurements with respect to the station-ary mean can be computed (using iterated expectation) in two steps: firstfind the expectations with respect to the components Px for all x, andsecond take the average. Such mixture and average representations willbe particularly useful because, as we shall show, the component mea-sures are both stationary and ergodic with probability one. For focusand motivation, we next state the principal result of this section — theergodic decomposition theorem — which makes this notion rigorous ina useful form. The remainder of the section is then devoted to the proofof this theorem via a sequence of lemmas that are of some interest intheir own right.

Theorem 7.6. The Ergodic Decomposition of Systems with Bounded Ergodic Properties

Given a standard measurable space (Ω,B), let F denote a countable generating field. Given a transformation T : Ω → Ω define the set of generic sequences

G(F ) = {x : lim_{n→∞} < 1F >_n (x) exists for all F ∈ F }.

For each x ∈ G(F) let Px denote the measure induced by the limitingrelative frequencies. Define the set E ⊂ G(F) by E = x : Px is stationaryand ergodic. Define the stationary and ergodic measures px by px = Pxfor x ∈ E and otherwise px = p∗, some fixed stationary and ergodicmeasure. (Note that so far all definitions involve properties of the pointsonly and not of any underlying measure.) We shall refer to the collec-tion of measures px ;x ∈ Ω as the ergodic decomposition of the triple(Ω,B, T ). We have then that

(a) E is an invariant measurable set,
(b) px(F) = pTx(F) for all points x ∈ Ω and events F ∈ B,
(c) If m has bounded ergodic properties and m̄ is its stationary mean, then

m(E) = m̄(E) = 1, (7.50)

(d) for any F ∈ B,

m̄(F) = ∫ px(F) dm(x) = ∫ px(F) dm̄(x), (7.51)


(e) and if f ∈ L1(m̄), then so is ∫ f dpx and

∫ f dm̄ = ∫ (∫ f dpx) dm̄(x). (7.52)

(g) In addition, given any f ∈ L1(m̄), then

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} f(T^i x) = E_{px} f, m-a.e. and m̄-a.e. (7.53)

Comment: The final result is extremely important for both mathematicsand intuition. Say that a system (Ω,B,m,T) has bounded ergodic prop-erties and that f is integrable with respect to the stationary mean m;e.g., f is measurable and bounded and hence integrable with respect toany measure. Then with probability one a point will be produced suchthat the limiting sample average exists and equals the expectation ofthe measurement with respect to the stationary and ergodic measure in-duced by the point. Thus if a system has bounded ergodic properties,then the limiting sample averages will behave as if they were in fact pro-duced by a stationary and ergodic system. These ergodic componentsare determined by the points themselves and not the underlying mea-sure; that is, the same collection works for all measures with boundedergodic properties.

Proof. We begin the proof by observing that the definition of Px implies that

P_{Tx}(F) = < 1F > (Tx) = < 1F > (x) = Px(F); all F ∈ F ; all x ∈ G(F ) (7.54)

since limiting relative frequencies are invariant. Since PTx and Px agreeon generating field events, however, they must be identical. Since we willselect the px to be the Px on a subset of G(F), this proves (b) for x ∈ E.

Next observe that for any x ∈ G(F) and F ∈ F

Px(F) = < 1F > (x) = lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} 1_F(T^i x)
= lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} 1_{T^{-1}F}(T^i x) = < 1_{T^{-1}F} > (x) = Px(T^{-1}F).

Thus for x ∈ G(F) the measure Px looks stationary, at least on generat-ing events. The following lemma shows that this is enough to infer thatthe Px are indeed stationary.


Lemma 7.16. A measure m is stationary on (Ω, σ(F)) if and only ifm(T−1F) =m(F) for all F ∈ F .

Proof. Both m and mT−1 are measures on the same space, and hencethey are equal if and only if they agree on a generating field. 2

Note in passing that no similar countable description of an AMS measure is known; that is, if lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F) converges for all F in a generating field, it does not follow that the measure is AMS.

As before let E denote the subset of G(F ) consisting of all of the points x for which Px is ergodic. From Lemma 7.15, the definition of Px, and the stationarity of the Px we can write E as

E = {x : x ∈ G(F ); lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} Px(T^{-i}F ∩ F) = Px(F)², all F ∈ F }.

Since the probabilities Px are measurable functions, the set E wherethe given limit exists and has the given value is also measurable. SincePx(F) = PTx(F), the set is also invariant. This proves (a) of the theorem.We have already proved (b) for x ∈ E. For x 6∈ E we have that px = p∗.Since E is invariant, we also know that Tx 6∈ E and hence also pTx = p∗so that pTx = px .

We now consider a measure m with bounded ergodic properties sothat the system is AMS. Lemma 7.15 provided a countable descriptionof ergodicity that was useful to prove that the collection of all pointsinducing ergodic measures is an event, the following lemma providesanother countable description of ergodicity that is useful for computingthe probability of the set E.

Lemma 7.17. Let (Ω, σ(F ),m,T) be an AMS dynamical system with stationary mean m̄. Define the sets

G(F, m̄) = {x : lim_{n→∞} (1/n) ∑_{i=0}^{n-1} 1_F(T^i x) = m̄(F)}; F ∈ F .

Thus G(F, m̄) is the set of sequences for which the limiting relative frequency exists for all events F in the generating field and equals the probability of the event under the stationary mean. Define also

G(F , m̄) = ⋂_{F∈F} G(F, m̄).

A necessary and sufficient condition for m to be ergodic is that m(G(F , m̄)) = 1.


Comment: The lemma states that an AMS measure m is ergodic if andonly if it places all of its probability on the set of generic points thatinduce m through their relative frequencies.

Proof. If m is ergodic then < 1F >= m(F) m-a.e. for all F and theresult follows immediately. Conversely, if m(G(F ,m)) = 1 then alsom(G(F ,m)) = 1 since the set is invariant. Thus integrating to the limitusing the dominated convergence theorem yields

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m̄(T^{-i}F ∩ F) = ∫_{G(F ,m̄)} 1_F ( lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} 1_F T^i ) dm̄ = m̄(F)², all F ∈ F ,

which with Lemma 7.15 proves that m̄ is ergodic, and hence so is m from Lemma 7.13. 2

Proof of the Theorem. If we can show that m(E) = 1, then the remainderof the theorem will follow from (7.48) and (7.49) and the fact that px =m(·|I)(x) m−a.e. From the previous lemma, this will be the case if withpx probability one

< 1F > = px(F), all F ∈ F ,

or, equivalently, if for all F ∈ F

< 1F > = px(F), px-a.e.

This last equation will hold if for each F ∈ F

∫ dpx(y) | < 1F >(y) − px(F) |² = 0.

This relation will hold on a set of x of m̄ probability one if

∫ dm̄(x) ∫ dpx(y) | < 1F >(y) − px(F) |² = 0.

Thus proving the preceding equality will complete the proof of the ergodic decomposition theorem. Expanding the square and then using the fact that G(F ) has m̄ probability one yields

∫_{G(F )} dm̄(x) ∫_{G(F )} dpx(y) < 1F >(y)² − 2 ∫_{G(F )} dm̄(x) px(F) ∫_{G(F )} dpx(y) < 1F >(y) + ∫_{G(F )} dm̄(x) px(F)².


Since < 1F > (y) = py(F), the first and third terms on the right are equal from (7.49). (Set f(x) = < 1F > (x) = px(F) in (7.49).) Since px is stationary,

∫ dpx < 1F > = px(F).

Thus the middle term cancels the remaining terms and the precedingexpression is zero, as was to be shown. This completes the proof ofthe ergodic decomposition theorem for a dynamical system (or the cor-responding directly given random process) which has bounded ergodicproperties. 2


Chapter 8

Ergodic Theorems

Abstract The basic ergodic theorems are proved showing that asymp-totic mean stationarity is a sufficient as well as necessary condition fora random process or dynamical system to have convergent sample aver-ages for all bounded measurements. Conditions are found for the con-vergence of unbounded sample averages in both the almost everywhereand mean sense. The proof follows the coding-flavored proofs developedin the ergodic theory literature in the 1970s rather than the classicalroute of the maximal ergodic theorem.

8.1 The Pointwise Ergodic Theorem

This section will be devoted to proving the following result, known as the pointwise or almost everywhere ergodic theorem or as the Birkhoff ergodic theorem after George Birkhoff, who first proved the result for stationary transformations.

Theorem 8.1. A sufficient condition for a dynamical system (Ω,B,m,T) to possess ergodic properties with respect to all bounded measurements f is that it be AMS. In addition, if the stationary mean of m is m̄, then the system possesses ergodic properties with respect to all m̄-integrable measurements, and if f ∈ L1(m̄), then with probability one under m and m̄

lim_{n→∞} (1/n) ∑_{i=0}^{n-1} fT^i = E_m̄(f|I),

where I is the sub-σ-field of invariant events.
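As an illustration of the role of the conditional expectation in the theorem (a hypothetical numerical sketch, not part of the proof), consider a stationary mixture of two IID binary sources with biases 0.2 and 0.8. Each sample path has a convergent sample average, but the limit is the mean of the component that produced the path rather than the overall mean 0.5.

import numpy as np

rng = np.random.default_rng(1)

def limiting_sample_average(n):
    # the source first selects an ergodic component (a bias), then emits IID bits;
    # the mixture is stationary but not ergodic
    bias = rng.choice([0.2, 0.8])
    x = rng.random(n) < bias
    return x.mean()

print([round(limiting_sample_average(10**5), 3) for _ in range(6)])
# each entry is near 0.2 or 0.8 -- a version of the conditional expectation of f
# given the invariant events, evaluated on that path -- and not near the overall mean 0.5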

Observe that the final statement follows immediately from the exis-tence of the ergodic property from Corollary 7.10. In this section we willassume that we are dealing with processes to aid interpretations; that


is, we will consider points to be sequences and T to be the shift. Thisis simply for intuition, however, and the mathematics remain valid forthe more general dynamical systems. We can assume without loss ofgenerality that the function f is nonnegative (since if it is not it can bedecomposed into f = f+−f− and the result for the nonnegative f+ andf− and linearity then yield the result for f). Define

f̲ = lim inf_{n→∞} (1/n) ∑_{i=0}^{n-1} fT^i

and

f̄ = lim sup_{n→∞} (1/n) ∑_{i=0}^{n-1} fT^i.

To prove the first part of the theorem we must prove that f̲ = f̄ with probability one under m. Since both f̲ and f̄ are invariant, the event F = {x : f̲(x) = f̄(x)} = {x : lim_{n→∞} < f >_n exists} is also invariant and hence has the same probability under m and m̄. Thus we must show that m̄(f̲ = f̄) = 1. Thus we need only prove the theorem for stationary measures and the result will follow for AMS measures. As a result, we can henceforth assume that m is itself stationary.

Since f̲ ≤ f̄ everywhere by definition, the desired result will follow if we can show that

∫ f̄(x) dm(x) ≤ ∫ f̲(x) dm(x).

In other words, if f̄ − f̲ ≥ 0 and ∫(f̄ − f̲) dm ≤ 0, then we must have f̄ − f̲ = 0, m-a.e. We accomplish this by proving the following:

∫ f̄(x) dm(x) ≤ ∫ f(x) dm(x) ≤ ∫ f̲(x) dm(x). (8.1)

By assumption f ∈ L1(m) since in the theorem statement either f is bounded or it is integrable with respect to m̄, which is m since we have now assumed m to be stationary. Thus (8.1) implies not only that lim < f >_n exists, but that it is finite.

We simplify things at first by temporarily assuming that f is bounded: Suppose for the moment that the supremum of f is M < ∞. Fix ε > 0. Since f̄ is the limit supremum of < f >_n, for each x there must exist a finite n for which

(1/n) ∑_{i=0}^{n-1} f(T^i x) ≥ f̄(x) − ε. (8.2)


Define n(x) to be the smallest integer for which (8.2) is true. Since f̄ is an invariant function, we can rewrite (8.2) using the definition of n(x) as

∑_{i=0}^{n(x)-1} f̄(T^i x) ≤ ∑_{i=0}^{n(x)-1} f(T^i x) + n(x)ε. (8.3)

Since n(x) is finite for all x (although it is not bounded), there must be an N for which

m({x : n(x) > N}) = ∑_{k=N+1}^{∞} m({x : n(x) = k}) ≤ ε/M

since the sum converges to 0 as N → ∞. Choose such an N and define B = {x : n(x) > N}. B has small probability and can be thought of as the “bad” sequences, where the sample average does not get close to the limit supremum within a reasonable time. Observe that if x ∈ Bc, then also T^i x ∈ Bc for i = 1,2, . . . , n(x) − 1; that is, if x is not bad, then the next n(x) − 1 shifts of x are also not bad. To see this assume the contrary, that is, that there exists a p < n(x) − 1 such that T^i x ∈ Bc for i = 1,2, . . . , p − 1, but T^p x ∈ B. The fact that p < n(x) implies that

∑_{i=0}^{p-1} f(T^i x) < p(f̄(x) − ε)

and the fact that T^p x ∈ B implies that

∑_{i=p}^{n(x)-1} f(T^i x) < (n(x) − p)(f̄(x) − ε).

These two facts together, however, imply that

∑_{i=0}^{n(x)-1} f(T^i x) < n(x)(f̄(x) − ε),

contradicting the definition of n(x).

We now modify f(x) and n(x) for bad sequences so as to avoid the

problem caused by the fact that n(x) is not bounded. Define

f̂(x) = f(x)   x ∉ B
       M      x ∈ B

and

n̂(x) = n(x)   x ∉ B
       1      x ∈ B.


Analogous to (8.3) the modified functions satisfy

∑_{i=0}^{n̂(x)-1} f̄(T^i x) ≤ ∑_{i=0}^{n̂(x)-1} f̂(T^i x) + n̂(x)ε (8.4)

and n̂(x) is bounded above by N. To see that (8.4) is valid, note that it is trivial if x ∈ B. If x ∉ B, as argued previously T^i x ∉ B for i = 1,2, . . . , n(x) − 1 and hence f̂(T^i x) = f(T^i x) and (8.4) follows from (8.3).

Observe for later reference that

∫ f̂(x) dm(x) = ∫_{Bc} f̂(x) dm(x) + ∫_B f̂(x) dm(x)
≤ ∫_{Bc} f̂(x) dm(x) + ∫_B M dm(x)
≤ ∫ f(x) dm(x) + M m(B)
≤ ∫ f(x) dm(x) + ε, (8.5)

where we have used the fact that f was assumed positive in the secondinequality.

We now break up the sequence into nonoverlapping blocks such thatwithin each block the sample average of f is close to the limit supre-mum. Since the overall sample average is a weighted sum of the sampleaverages for each block, this will give a long term sample average nearthe limit supremum. To be precise, choose an L so large that NM/L < εand inductively define nk(x) by n0(x) = 0 and

n_k(x) = n_{k-1}(x) + n̂(T^{n_{k-1}(x)} x).

Let k(x) be the largest k for which nk(x) ≤ L− 1, that is, the number ofthe blocks in the length L sequence. Since

∑_{i=0}^{L-1} f̄(T^i x) = ∑_{k=1}^{k(x)} ∑_{i=n_{k-1}(x)}^{n_k(x)-1} f̄(T^i x) + ∑_{i=n_{k(x)}(x)}^{L-1} f̄(T^i x),

we can apply the bound of (8.4) to each of the k(x) blocks, that is, eachof the previous k(x) inner sums, to obtain

∑_{i=0}^{L-1} f̄(T^i x) ≤ ∑_{i=0}^{L-1} f̂(T^i x) + Lε + (N − 1)M,

where the final term overbounds the values in the final block by M andwhere the nonnegativity of f allows the sum to be extended to L−1. We


can now integrate both sides of the preceding equation, divide by L, use the stationarity of m, and apply (8.5) to obtain

∫ f̄ dm ≤ ∫ f̂ dm + ε + (N − 1)M/L ≤ ∫ f dm + 3ε, (8.6)

which proves the left-hand inequality of (8.1) for the case where f isnonnegative and f is bounded since ε is arbitrary.

Next suppose that f is not bounded. Define for any M the “clipped” function f̄_M(x) = min(f̄(x), M) and replace f̄ by f̄_M from (8.2) in the preceding argument. Since f̄_M is also invariant, the identical steps then lead to

∫ f̄_M dm ≤ ∫ f dm.

Taking the limit as M → ∞ then yields the desired left-hand inequalityof (8.1) from the monotone convergence theorem.

To prove the right-hand inequality of (8.1) we proceed in a similar manner. Fix ε > 0 and define n(x) to be the least positive integer for which

(1/n) ∑_{i=0}^{n-1} f(T^i x) ≤ f̲(x) + ε

and define B = {x : n(x) > N}, where N is chosen large enough to ensure that

∫_B f(x) dm(x) < ε.

This time define

f̂(x) = f(x)   x ∉ B
       0      x ∈ B

and

n̂(x) = n(x)   x ∉ B
       1      x ∈ B

and proceed as before.

This proves (8.1) and thereby shows that the sample averages of all

m-integrable functions converge. From Corollary 7.10, the limit must bethe conditional expectations. (The fact that the limits must be condi-tional expectations can also be deduced from the preceding argumentsby restricting the integrations to invariant sets.) 2

Combining the theorem with the results of the previous chapter immediately yields the following corollaries.

Corollary 8.1. If a dynamical system (Ω,B,m,T) is AMS with stationary mean m̄ and if the sequence n^{-1} ∑_{i=0}^{n-1} fT^i is uniformly integrable with respect to m, then the following limit is true m-a.e., m̄-a.e., and in L1(m):


lim_{n→∞} (1/n) ∑_{i=0}^{n-1} fT^i = < f > = E_m̄(f|I),

where I is the sub-σ -field of invariant events, and

E_m < f > = E_m̄ < f > = E_m̄ f.

If in addition the system is ergodic, then

lim_{n→∞} (1/n) ∑_{i=0}^{n-1} fT^i = E_m̄(f)

in both senses.

Corollary 8.2. A necessary and sufficient condition for a dynamical sys-tem to possess bounded ergodic properties is that it be AMS.

Corollary 8.3. A necessary and sufficient condition for an AMS process to be ergodic is that (7.44) hold; that is,

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F ∩ G) = m̄(F)m(G), all F,G ∈ B. (8.7)

Corollary 8.4. A necessary and sufficient condition for an AMS process to be ergodic is that (7.46) hold; that is,

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F ∩ F) = m̄(F)m(F), all F ∈ F , (8.8)

where F generates B.

Exercises

1. Suppose thatm is stationary and that g is a measurement that has theform g = f−fT for some measurement f ∈ L2(m). Show that < g >nhas an L2(m) limit as n → ∞. Show that any such g is orthogonal toI2, the subspace of all square-integrable invariant measurements; thatis, E(gh) = 0 for all h ∈ I2. Suppose that gn is a sequence of functionsof this form that converges to a measurement r in L2(m). Show that< r >n converges in L2(m).

2. Suppose that < f >n converges in L2(m) for a stationary measurem.Show that the sample averages look increasingly like invariant func-tions in the sense that


|| < f >n − < f >n T || →n→∞ 0.

3. Show that if p and m are two stationary and ergodic measures, thenthey are either identical or singular in the sense that there is an eventF for which p(F) = 1 and m(F) = 0.

4. Show that if p and m are distinct stationary and ergodic processesand λ ∈ (0,1), then the mixture process λp+ (1−λ)m is not ergodic.

8.2 Mixing Random Processes

We can harvest several important examples of ergodic processes from Corollary 8.4. Suppose that we have a random process {Xn} with a distribution m and alphabet A. Recall from Section 4.5 that a process is said to be independent and identically distributed (IID) if there is a measure q on (A,BA) such that for any rectangle G = ×_{j∈J} G_j, G_j ∈ BA, j ∈ J, J ⊂ I a finite collection of distinct indices,

m(×_{j∈J} G_j) = ∏_{j∈J} q(G_j);

that is, the random variables Xn are mutually independent. Clearly an IID process has the property that if G and F are two finite-dimensional rectangles, then there is an integer M such that for n ≥ M

m(F ∩ T^{-n}G) = m(F)m(G);

that is, shifting rectangles sufficiently far apart ensures their indepen-dence. From Corollary 8.4, IID processes are ergodic.

A generalization of this idea is obtained by having a process satisfythis behavior asymptotically as n→∞. A measurem is said to be mixingor strongly mixing with respect to a transformation T if for all F,G ∈ B

lim_{n→∞} |m(F ∩ T^{-n}G) − m(F)m(T^{-n}G)| = 0 (8.9)

and weakly mixing if for all F,G ∈ B

lim_{n→∞} (1/n) ∑_{j=0}^{n-1} |m(F ∩ T^{-j}G) − m(F)m(T^{-j}G)| = 0. (8.10)

Mixing is a form of asymptotic independence: the probability of an event and a large shift of another event approach the product of the probabilities.

The definitions of mixing are usually restricted to stationary pro-cesses, in which case they simplify. A stationary measure m is said to


be mixing or strongly mixing with respect to a transformation T if for all F,G ∈ B

lim_{n→∞} |m(F ∩ T^{-n}G) − m(F)m(G)| = 0 (8.11)

and weakly mixing if for all F,G ∈ B

lim_{n→∞} (1/n) ∑_{j=0}^{n-1} |m(F ∩ T^{-j}G) − m(F)m(G)| = 0. (8.12)
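For a concrete (and purely illustrative) check of (8.11), the following Python lines compute |m(F ∩ T^{-n}G) − m(F)m(G)| exactly for a stationary two-state Markov source with F = G = {x : x_0 = 0}; the transition matrix is a hypothetical choice and the geometric decay rate 0.7 is its second eigenvalue.

import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])             # hypothetical stationary Markov source
pi = np.array([2.0 / 3.0, 1.0 / 3.0])  # its stationary distribution

# F = G = {x : x_0 = 0}: m(F ∩ T^{-n}G) = P(X_0 = 0, X_n = 0) = pi_0 P^n(0,0)
for n in (0, 5, 10, 20):
    joint = pi[0] * np.linalg.matrix_power(P, n)[0, 0]
    print(n, abs(joint - pi[0] * pi[0]))
# the gap shrinks like 0.7**n, so this source satisfies the strong mixing
# condition on these cylinder events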

We have shown that IID processes are ergodic, and stated that a nat-ural generalization of IID process is strongly mixing, but we have notactually proved that an IID process is mixing, since the mixing condi-tions must be proved for all events, not just the rectangles considered inthe discussion of the IID process. The following lemma shows that prov-ing that (8.9) or (8.10) holds on a generating field is sufficient in certaincases.

Lemma 8.1. If a measurem is AMS and if (8.10) holds for all sets in a gen-erating field, then m is weakly mixing. If a measure m is asymptoticallystationary in the sense that limn→∞m(T−nF) exists for all events F and if(8.9) holds for all sets in a generating field, then m is strongly mixing.

Proof. The first result is proved by using the same approximation techniques of the proof of Lemma 7.15, that is, approximate arbitrary events F and G by events in the generating field for both m and its stationary mean m̄. In the second case, as in the theory of asymptotically mean stationary processes, the Vitali-Hahn-Saks theorem implies that there must exist a measure m̄ such that

lim_{n→∞} m(T^{-n}F) = m̄(F),

and hence arbitrary events can again be approximated by field eventsunder both measures and the result follows. 2

An immediate corollary of the lemma is the fact that IID processes arestrongly mixing since (8.9) is satisfied for all rectangles and rectanglesgenerate the entire event space.

If the measure m is AMS, then either of the preceding mixing proper-ties together with the ergodic theorem ensuring ergodic properties forAMS processes implies that the conditions of Lemma 7.15 are satisfiedsince


| n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F ∩ F) − m̄(F)m(F) |

≤ | n^{-1} ∑_{i=0}^{n-1} (m(T^{-i}F ∩ F) − m(T^{-i}F)m(F)) | + | n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F)m(F) − m̄(F)m(F) |

≤ n^{-1} ∑_{i=0}^{n-1} | m(T^{-i}F ∩ F) − m(T^{-i}F)m(F) | + m(F) | n^{-1} ∑_{i=0}^{n-1} m(T^{-i}F) − m̄(F) | →_{n→∞} 0.

Thus AMS weakly mixing systems are ergodic, and hence AMS stronglymixing systems and IID processes are also ergodic. In fact, Lemma 7.15can be (and often is) interpreted as a very weak mixing condition. In thecase of m stationary, it is obvious that weak mixing implies ergodicitysince the magnitude of a sum is bound above by the sum of the magni-tudes.

A simpler proof of the ergodicity of weakly and strongly mixing mea-sures follows from the observation that if F is invariant, then m(F ∩T−nF) =m(F) and hence weak or strong mixing implies thatm(F∩F) =m(F)2, which in turn implies that m(F) must be 0 or 1 and hence thatm is ergodic.

Kolmogorov Mixing

A final form of mixing which is even stronger than strong mixing is moredifficult to describe, but we take a little time and space for it because thecondition plays a significant role in ergodic theory. For the current dis-cussion we consider only stationary measures. A random process Xnwith process distributionm is said to be Kolmogorov mixing or K-mixingif for every event F

lim_{n→∞} sup_{G ∈ F_n = σ(X_n, X_{n+1}, ···)} |m(F ∩ G) − m(F)m(G)| = 0 (8.13)

A dynamical system (Ω,B,m,T) is said to be K-mixing if every random process defined on the system (using T) is K-mixing, but we focus on a single process here.

In the case where the process is two-sided and the shift is invertible, a K-mixing process is called a K-automorphism in ergodic theory. See, e.g., Smorodinsky [103] or Petersen [84]. It can



be shown (with some effort) that for stationary processes, K-mixing pro-cesses are equivalent to K-processes, that is, a process is K-mixing if andonly if it has trivial tail. One way is easy. Suppose a process is K-mixing.Choose an event F = G in the tail σ -field F∞ = ∩∞n=0Fn. (See Section 7.3.)Then F ∈ Fn for all n and the condition reduces to m(F) = m2(F),which implies thatm(F) = 0 or 1. Thus a K-mixing process must be a K-process. The other way is more difficult and requires the convergence ofconditional probability measures given a decreasing set of sigma-fields,a result which follows from Doob’s martingale convergence theorem. De-tails can be found, e.g., in [3, 103]. It can further be shown that sta-tionary K-mixing processes are strong mixing. Suppose that a stationaryprocess is is K-mixing. Since stationarity implies asymptotic stationary,Lemma 8.1 implies that we need only show the defining condition forstrong mixing holds on a generating field, so focus on cylinder sets. LetF and G be any two finite dimensional cylinders and consider the proba-bilitiesm(F ∩T−nG). By stationarity we can assume without loss of gen-erality that both F and G are nonnegative time cylinders, that is, both arein F0 = σ(X0, X1, X2, · · · ). Thus T−nG ∈ Fn = σ(Xn,Xn+1, Xn+2, · · · ).Thus from the definition of K-mixing,

|m(F ∩ T^{-n}G) − m(F)m(G)| ≤ sup_{G∈F_n} |m(F ∩ G) − m(F)m(G)| →_{n→∞} 0,

and hence the process is strong mixing.

The original example of a K-mixing process is an IID process, and

Kolmogorov’s proof that IID processes have trivial tail is the classic Kol-mogorov 0-1 law. I have chosen not to include further details primarilybecause of the extra machinery required to prove some of these results,in particular the necessity of using the Doob martingale convergence the-orem showing the convergence of conditional probabilities giving nestedσ -fields. B-processes are K-processes (see, e.g., Ornstein [77]), and henceK-processes sit between strongly mixing processes and B-processes interms of generality. It is quite difficult, however, to construct K-processesthat are not B-processes (see, e.g., [78]) and Kolmogorov himself conjec-tured earlier that all K-processes were also B-processes.

Exercises

1. Show that if m and p are two stationary and ergodic processes withergodic properties with respect to all indicator functions, then eitherthey are identical or they are singular in the sense that there is anevent G such that p(G) = 1 and m(G) = 0.

2. Suppose that mi are distinct stationary and ergodic sources with er-godic properties with respect to all indicator functions. Show that the


mixture

m(F) = ∑_i λ_i m_i(F),

where ∑_i λ_i = 1

is stationary but not in general ergodic. Show that more generally ifthe mi are AMS, then so is m.

3. Prove that if a stationary process m is weakly mixing, then there isa subsequence J = j1 < j2 < . . . of the positive integers that hasdensity 0 in the sense that

lim_{n→∞} (1/n) ∑_{i=0}^{n-1} 1_J(i) = 0

and that has the property that

lim_{n→∞, n∉J} |m(T^{-n}F ∩ G) − m(F)m(G)| = 0

for all appropriate events F and G.
4. Are mixtures of mixing processes mixing?
5. Show that IID processes and strongly mixing processes are totally er-

godic.

8.3 Block AMS Processes

Given a dynamical system (Ω,B,m,T) or a corresponding source Xn,one may be interested in the ergodic properties of successive blocks orvectors

X^N_{nN} = (X_{nN}, X_{nN+1}, . . . , X_{(n+1)N−1})

or, equivalently, in the N-shift TN . For example, in communication sys-tems one may use block codes that parse the input source into N-tuplesand then map these successive nonoverlapping blocks into codewordsfor transmission. A natural question is whether or not a source beingAMS with respect to T (T -AMS) implies that it is also TN -AMS or viceversa. This section is devoted to answering this question.

If a source is T-stationary, then it is clearly N-stationary (stationary with respect to T^N) since for any event F, m(T^{-N}F) = m(T^{-(N-1)}F) = · · · = m(F). The converse is not true, however, since, for example, a process that produces a single sequence that is periodic with period N is N-stationary but not stationary. The following lemma, which is a corol-


lary to the ergodic theorem, provides a needed step for treating AMSsystems.

Lemma 8.2. If µ and η are probability measures on (Ω,B) and if η is T -stationary and asymptotically dominates µ (with respect to T), then µ isT -AMS.

Proof. If η is stationary, then from the ergodic theorem it possessesergodic properties with respect to all bounded measurements. Since itasymptotically dominates µ, µ must also possesses ergodic propertieswith respect to all bounded measurements from Corollary 7.8. From The-orem 7.1, µ must therefore be AMS. 2

The principal result of this section is the following:

Theorem 8.2. If a system (Ω,B,m,T) is TN -AMS for any positive integerN , then it is TN -AMS for all integers N .

Proof. Suppose first that the system is T^N-AMS and let m̄_N denote its N-stationary mean. Then for any event F

lim_{n→∞} (1/n) ∑_{i=0}^{n-1} m(T^{-iN}T^{-j}F) = m̄_N(T^{-j}F); j = 0,1, . . . ,N − 1. (8.14)

We then have

(1/N) ∑_{j=0}^{N-1} lim_{n→∞} (1/n) ∑_{i=0}^{n-1} m(T^{-iN}T^{-j}F) = lim_{n→∞} (1/(nN)) ∑_{j=0}^{N-1} ∑_{i=0}^{n-1} m(T^{-iN}T^{-j}F)
= lim_{n→∞} (1/n) ∑_{i=0}^{n-1} m(T^{-i}F), (8.15)

where, in particular, the last limit must exist for all F and hencem mustbe AMS

Next assume only that m is AMS with stationary mean m̄. This implies from Corollary 7.5 that m̄ T-asymptotically dominates m, that is, if m̄(F) = 0, then it is also true that lim sup_{n→∞} m(T^{-n}F) = 0. But this also means that for any integer K

lim sup_{n→∞} m(T^{-nK}F) = 0

and hence that m̄ also T^K-asymptotically dominates m. But this means from Lemma 8.2 applied to T^K that m is T^K-AMS. 2

Corollary 8.5. Suppose that a process m is AMS and has a stationary mean

m̄(F) = lim_{n→∞} (1/n) ∑_{i=0}^{n-1} m(T^{-i}F)


and an N-stationary mean

m̄_N(F) = lim_{n→∞} (1/n) ∑_{i=0}^{n-1} m(T^{-iN}F).

Then the two means are related by

(1/N) ∑_{i=0}^{N-1} m̄_N(T^{-i}F) = m̄(F).

Proof. The proof follows immediately from (8.14)-(8.15). 2

As a final observation, if an event is invariant with respect to T , thenit is also N-invariant, that is, invariant with respect to TN . Thus if asystem is N-ergodic or ergodic with respect to TN , then all N-invariantevents must have probability of 0 or 1 and hence so must all invariantevents. Thus the system must also be ergodic. The converse, however, isnot true: An ergodic system need not be N-ergodic. A system that is N-ergodic for all N is said to be totally ergodic. A strongly mixing systemcan easily be seen to be totally ergodic.

Exercises

1. Prove that a strongly mixing source is totally ergodic.
2. Suppose that m is N-ergodic and N-stationary. Is the mixture

   m̄ = (1/N) ∑_{i=0}^{N-1} mT^{-i}

stationary and ergodic? Are the limits of the form

(1/n) ∑_{i=0}^{n-1} fT^i

the same under both m and m̄? What about limits of the form

(1/n) ∑_{i=0}^{n-1} fT^{iN}?

3. Suppose that you observe a sequence from the process mT−i whereyou know m but not i. Under what circumstances could you deter-mine i correctly with probability one by observing the infinite outputsequence?


8.4 The Ergodic Decomposition of AMS Systems

Combining the ergodic decomposition theorem for systems with bounded ergodic properties, Theorem 7.6, with the equivalence of bounded ergodic properties and AMS, Corollary 8.2, yields immediately the ergodic decomposition of AMS systems, which is largely a repeat of the statement of Theorem 7.6.

Theorem 8.3. The Ergodic Decomposition of AMS Dynamical Systems

Given a standard measurable space (Ω,B), let F denote a countable generating field. Given a transformation T : Ω → Ω define the set of frequency-typical or generic sequences by

G(F ) = {x : lim_{n→∞} < 1F >_n (x) exists for all F ∈ F }.

For each x ∈ G(F) let Px denote the measure induced by the limitingrelative frequencies. Define the set E ⊂ G(F) by E = x : Px is stationaryand ergodic. Define the stationary and ergodic measures px by px = Pxfor x ∈ E and otherwise px = p∗, some fixed stationary and ergodicmeasure. (Note that so far all definitions involve properties of the pointsonly and not of any underlying measure.) We shall refer to the collectionof measures px ;x ∈ Ω as the ergodic decomposition. We have thenthat

(a) E is an invariant measurable set,
(b) px(F) = pTx(F) for all points x ∈ Ω and events F ∈ B,
(c) If m is an AMS measure with stationary mean m̄, then

m(E) = m̄(E) = 1, (8.16)

(d) for any F ∈ B,

m̄(F) = ∫ px(F) dm(x) = ∫ px(F) dm̄(x), (8.17)

(e) and if f ∈ L1(m̄), then so is ∫ f dpx and

∫ f dm̄ = ∫ (∫ f dpx) dm̄(x). (8.18)

(g) In addition, given any f ∈ L1(m̄), then

lim_{n→∞} n^{-1} ∑_{i=0}^{n-1} f(T^i x) = E_{px} f, m-a.e. and m̄-a.e. (8.19)

Comment: The following comments are almost a repeat of those follow-ing Theorem 7.6, but for the benefit of the reader who does not recall



those comments or has not read the earlier result they are included. Theergodic decomposition is extremely important for both mathematics andintuition. Say that a system s AMS and that f is integrable with respectto the stationary mean; e.g., f is measurable and bounded and hence in-tegrable with respect to any measure. Then with probability one a pointwill be produced such that the limiting sample average exists and equalsthe expectation of the measurement with respect to the stationary andergodic measure induced by the point. Thus if a system is AMS, then thelimiting sample averages will behave as if they were in fact produced bya stationary and ergodic system. These ergodic components are deter-mined by the points themselves and not the underlying measure; that is,the same collection works for all AMS measures.

The Nedoma N-Ergodic Decomposition

Although an ergodic process need not be N-ergodic, an application ofthe ergodic decomposition theorem to TN yields the following resultshowing that a stationary ergodic process can be written as a mixture ofN-ergodic processes. The proof is left as an exercise.

Corollary 8.6. Suppose that a measure m is stationary and ergodic and N is a fixed positive integer. Then there is a measure η that is N-stationary and N-ergodic with the property that

m = (1/N) ∑_{i=0}^{N-1} ηT^{-i}.

This result is due to Nedoma [73]. A direct proof of the result for discreterandom processes can be found in Gallager [27], where the result playsa key role in the proof of the Shannon theorem on block source codingwith a fidelity criterion.

8.5 The Subadditive Ergodic Theorem

Sample averages are not the only measurements of systems that dependon time and converge. In this section we consider another class of mea-surements that have an ergodic theorem: subadditive measurements. Anexample of such a measurement is the Levenshtein distance between se-quences (see Example 1.5).

The subadditive ergodic theorem was first developed by Kingman [60], but the proof again follows that of Katznelson and Weiss [59]. A sequence of functions {fn; n = 1,2, . . .} satisfying the relation

f_{n+m}(x) ≤ fn(x) + fm(T^n x), all n,m = 1,2, . . .

for all x is said to be subadditive. Note that the unnormalized sample averages fn = n < f >_n of a measurement f are additive and satisfy the relation with equality and hence are a special case of subadditive functions.
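Returning to the Levenshtein distance example mentioned above, the following Python sketch (illustrative only; the dynamic-programming routine and the random test strings are not from the text) treats f_n as the edit distance between the first n coordinates of a pair of sequences and spot-checks the subadditivity relation, which holds because alignments of the two halves concatenate into an alignment of the whole.

import random

def lev(a, b):
    # single-row dynamic-programming Levenshtein (edit) distance
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[len(b)]

random.seed(0)
x = [random.randint(0, 1) for _ in range(200)]
y = [random.randint(0, 1) for _ in range(200)]

# f_n(x, y) = edit distance between the first n coordinates of the pair (x, y);
# check f_{n+m} <= f_n + f_m(T^n x, T^n y) on a few block sizes
for n, m in [(10, 20), (37, 55), (64, 64)]:
    assert lev(x[:n + m], y[:n + m]) <= lev(x[:n], y[:n]) + lev(x[n:n + m], y[n:n + m])
print("subadditivity holds on the sampled cases")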

The following result is for stationary systems. Generalizations to AMSsystems are considered later.

Theorem 8.4. Let (Ω,B,m,T) be a stationary dynamical system. Suppose that {fn; n = 1,2, . . .} is a subadditive sequence of m-integrable measurements. Then there is an invariant function φ(x) such that m-a.e.

lim_{n→∞} (1/n) fn(x) = φ(x).

The function φ(x) is given by

φ(x) = inf_n (1/n) E_m(fn|I) = lim_{n→∞} (1/n) E_m(fn|I).

Most of the remainder of this section is devoted to the proof of thetheorem. On the way some lemmas giving properties of subadditive func-tions are developed. Observe that since the conditional expectations withrespect to the σ -field of invariant events are themselves invariant func-tions, φ(x) is invariant as claimed.

Since the functions are subadditive,

Em(fn+m|I) ≤ Em(fn|I)+ Em(fmTn|I).

From the ergodic decomposition theorem, however, the third term isEPx(fmTn) = EPx(fm) since Px is a stationary measure with probabilityone. Thus we have with probability one that

Em(fn+m|I)(x) ≤ Em(fn|I)(x)+ Em(fm|I)(x).

For a fixed x, the preceding is just a sequence of numbers, say {a_k}, with the property that

a_{n+k} ≤ a_n + a_k.

A sequence satisfying such an inequality is said to be subadditive. The following lemma (due to Hammersley and Welch [49]) provides a key property of subadditive sequences.

Lemma 8.3. If {a_n; n = 1,2, . . .} is a subadditive sequence, then

lim_{n→∞} a_n/n = inf_{n≥1} a_n/n,


where if the right-hand side is −∞, the preceding means that an/n di-verges to −∞.

Proof. First observe that if {a_n} is subadditive, for any k, n,

a_{kn} ≤ a_n + a_{(k−1)n} ≤ a_n + a_n + a_{(k−2)n} ≤ · · · ≤ k a_n. (8.20)

Call the infimum γ and first consider the case where it is finite. Fix ε > 0 and choose a positive integer N such that N^{-1} a_N < γ + ε. Define b = max_{1≤k≤3N−1} a_k. For each n ≥ 3N there is a unique integer k(n) for which (k(n) + 2)N ≤ n < (k(n) + 3)N. Since the sequence is subadditive, a_n ≤ a_{k(n)N} + a_{n−k(n)N}. Since n − k(n)N < 3N by construction, a_{n−k(n)N} ≤ b and therefore

γ ≤ a_n/n ≤ (a_{k(n)N} + b)/n.

Using (8.20) this yields

γ ≤ a_n/n ≤ k(n) a_N/n + b/n ≤ (k(n)N/n)(a_N/N) + b/n ≤ (k(n)N/n)(γ + ε) + b/n.

By construction k(n)N ≤ n − 2N ≤ n and hence

γ ≤ a_n/n ≤ γ + ε + b/n.

Thus given an ε, choosing n0 large enough yields γ ≤ n^{-1} a_n ≤ γ + 2ε for all n > n0, proving the lemma. If γ is −∞, then for any positive M we can choose n so large that n^{-1} a_n ≤ −M. Proceeding as before, for any ε there is an n0 such that if n ≥ n0, then n^{-1} a_n ≤ −M + ε. Since M is arbitrarily big and ε arbitrarily small, n^{-1} a_n goes to −∞. 2
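A quick numerical illustration of the lemma (a hypothetical example, not from the text): the sequence a_n = n + √n is subadditive because √(n+m) ≤ √n + √m, and a_n/n decreases to its infimum 1.

import math

a = lambda n: n + math.sqrt(n)   # a hypothetical subadditive sequence
for n in (1, 10, 100, 10**4, 10**6):
    print(n, a(n) / n)
# a_n / n = 1 + 1/sqrt(n) converges downward to inf_n a_n / n = 1, as Lemma 8.3 asserts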

Returning to the proof of the theorem, the preceding lemma implies that with probability one

φ(x) = inf_{n≥1} (1/n) E_m(fn|I)(x) = lim_{n→∞} (1/n) E_m(fn|I)(x).

As in the proof of the ergodic theorem, define

f̄(x) = lim sup_{n→∞} (1/n) fn(x)

and

f̲(x) = lim inf_{n→∞} (1/n) fn(x).

Lemma 8.4. Given a subadditive sequence of functions {fn(x); n = 1,2, . . .}, f̄ and f̲ as defined previously are invariant functions m-a.e.


Proof. From the definition of subadditive functions

f_{1+n}(x) ≤ f_1(x) + f_n(Tx)

and hence

f_n(Tx)/n ≥ ((1+n)/n) · f_{1+n}(x)/(1+n) − f_1(x)/n.

Taking the limit supremum yields f̄(Tx) ≥ f̄(x). Since m is stationary, E_m f̄T = E_m f̄ and hence f̄(x) = f̄(Tx) m-a.e. The result for f̲ follows similarly. 2

We now tackle the proof of the theorem. It is performed in two steps. The first uses the usual ergodic theorem and subadditivity to prove that f̄ ≤ φ a.e. Next a construction similar to that of the proof of the ergodic theorem is used to prove that f̲ = φ a.e., which will complete the proof.

As an introduction to the first step, observe that since the measurements are subadditive,

(1/n) fn(x) ≤ (1/n) ∑_{i=0}^{n-1} f_1(T^i x).

Since f_1 ∈ L1(m) for the stationary measure m, we know immediately from the ergodic theorem that

f̄(x) ≤ < f_1 >(x) = E_m(f_1|I). (8.21)

We can obtain a similar bound for any of the fn by breaking it up into fixed size blocks of length, say, N and applying the ergodic theorem to the sum of the f_N. Fix N and choose n ≥ N. For each i = 0,1,2, . . . ,N−1 there are integers j, k such that n = i + jN + k and 1 ≤ k < N. Hence using subadditivity

fn(x) ≤ fi(x) + ∑_{l=0}^{j-1} f_N(T^{lN+i}x) + f_k(T^{jN+i}x).

Summing over i yields

N fn(x) ≤ ∑_{i=0}^{N-1} fi(x) + ∑_{l=0}^{jN-1} f_N(T^l x) + ∑_{i=0}^{N-1} f_{n−i−jN}(T^{jN+i}x),

and hence

(1/n) fn(x) ≤ (1/(nN)) ∑_{l=0}^{jN-1} f_N(T^l x) + (1/(nN)) ∑_{i=0}^{N-1} fi(x) + (1/(nN)) ∑_{i=0}^{N-1} f_{n−i−jN}(T^{jN+i}x).


As n → ∞ the middle term on the right goes to 0. Since f_N is assumed integrable with respect to the stationary measure m, the ergodic theorem implies that the leftmost term on the right tends to N^{-1} < f_N >. Intuitively the rightmost term should also go to zero, but the proof is a bit more involved. The rightmost term can be bound above as

(1/(nN)) ∑_{i=0}^{N-1} |f_{n−i−jN}(T^{jN+i}x)| ≤ (1/(nN)) ∑_{i=0}^{N-1} ∑_{l=1}^{N} |f_l(T^{jN+i}x)|

where j depends on n (the last term includes all possible combinationsof k in fk and shifts). Define this bound to be wn/n. We shall prove thatwn/n tends to 0 by using Corollary 5.10, that is, by proving that

∑_{n=0}^{∞} m(w_n/n > ε) < ∞. (8.22)

First note that since m is stationary,

w_n = (1/N) ∑_{i=0}^{N-1} ∑_{l=1}^{N} |f_l(T^{jN+i}x)|

has the same distribution as

w = (1/N) ∑_{i=0}^{N-1} ∑_{l=1}^{N} |f_l(T^i x)|.

Thus we will have proved (8.22) if we can show that

∑_{n=0}^{∞} m(w > nε) < ∞. (8.23)

To accomplish this we need a side result, which we present as a lemma.

Lemma 8.5. Given a nonnegative random variable X defined on a probability space (Ω,B,m),

∑_{n=1}^{∞} m(X ≥ n) ≤ EX.

Proof.


EX = ∑_{n=1}^{∞} E(X|X ∈ [n−1, n)) m(X ∈ [n−1, n))
≥ ∑_{n=1}^{∞} (n−1) m(X ∈ [n−1, n))
= ∑_{n=1}^{∞} n m(X ∈ [n−1, n)) − 1.

By rearranging terms the last sum can be written as

∑_{n=0}^{∞} m(X ≥ n),

which proves the lemma. 2
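The bound is easy to confirm numerically; the following Monte Carlo sketch (an illustration with a hypothetical exponential X of mean 3, not part of the proof) compares a truncated tail sum with the sample mean.

import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=3.0, size=10**6)            # nonnegative X with EX = 3
tail_sum = sum((x >= n).mean() for n in range(1, 200))
print(tail_sum, x.mean())   # the tail sum (about 2.53 here) stays below the mean, as the lemma requires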

Applying the lemma to the left side of (8.23) yields

∑_{n=0}^{∞} m(w/ε > n) ≤ ∑_{n=0}^{∞} m(w/ε ≥ n) ≤ (1/ε) E_m(w) < ∞,

since m-integrability of the fn implies m-integrability of their abso-lute magnitudes and hence of a finite sum of absolute magnitudes. Thisproves (8.23) and hence (8.22). Thus we have shown that

f̄(x) ≤ (1/N) < f_N >(x) = (1/N) E_m(f_N|I)

for all N and hence that

f̄(x) ≤ φ(x), m-a.e. (8.24)

This completes the first step of the proof.

If φ(x) = −∞, then (8.24) completes the proof since the desired limit

then exists and is −∞. Hence we can focus on the invariant set G = {x : φ(x) > −∞}. Define the event G_M = {x : φ(x) ≥ −M}. We shall show that for all M

∫_{G_M} f̲ dm ≥ ∫_{G_M} φ dm, (8.25)

which with (8.24) proves that f̲ = φ on G_M. We have, however, that

G = ⋃_{M=1}^{∞} G_M.

Thus proving that f̲ = φ on all G_M also will prove that f̲ = φ on G and complete the proof of the theorem.


We now proceed to prove (8.25) and hence focus on x in G_M and hence only x for which φ(x) ≥ −M. We proceed as in the proof of the ergodic theorem. Fix an ε > 0 and define f̲_M = max(f̲, −M−1). Observe that this implies that

φ(x) ≥ f̲_M(x), all x. (8.26)

Define n(x) as the smallest integer n ≥ 1 for which n^{-1} fn(x) ≤ f̲_M(x) + ε. Define B = {x : n(x) > N} and choose N large enough so that

∫_B (|f_1(x)| + M + 1) dm(x) < ε. (8.27)

Define the modified functions as before:

f̂_M(x) = f̲_M(x)   x ∉ B
         f_1(x)    x ∈ B

and

n̂(x) = n(x)   x ∉ B
       1      x ∈ B.

From (8.27)

∫ f̂_M dm = ∫_{Bc} f̂_M dm + ∫_B f̂_M dm = ∫_{Bc} f̲_M dm + ∫_B f_1 dm
≤ ∫_{Bc} f̲_M dm + ∫_B |f_1| dm ≤ ∫_{Bc} f̲_M dm + ε − ∫_B (M + 1) dm
≤ ∫_{Bc} f̲_M dm + ε + ∫_B f̲_M dm ≤ ∫ f̲_M dm + ε. (8.28)

Since f̲_M is invariant,

f_{n(x)}(x) ≤ ∑_{i=0}^{n(x)-1} f̲_M(T^i x) + n(x)ε

and hence also

f_{n̂(x)}(x) ≤ ∑_{i=0}^{n̂(x)-1} f̂_M(T^i x) + n̂(x)ε

Thus for L > N

f_L(x) ≤ ∑_{i=0}^{L-1} f̂_M(T^i x) + Lε + N(M − 1) + ∑_{i=L−N}^{L-1} |f_1(T^i x)|.

Integrating and dividing by L results in

Page 287: Probability Random Processes And Ergodic Properties

260 8 Ergodic Theorems∫φ(x)dm(x) ≤

∫1LfL(x)dm(x)

=∫

1LfL(x)dm(x)

≤∫f M(x)dm(x)+ ε+

N(M + 1)L

+ NL

∫|f1(x)|dm(x).

Letting L→∞ and using (8.28) yields∫φ(x)dm(x) ≤

∫fM(x)dm(x).

From (8.26), however, this can only hold if in fact fM(x) = φ(x) a.e. Thisimplies, however, that f(x) = φ(x) a.e. since the event f(x) 6= fM(x)implies that fM(x) = −M − 1 which can only happen with probability0 from (8.26) and the fact that fM(x) = φ(x) a.e. This completes theproof of the subadditive ergodic theorem for stationary measures. 2

Unlike the ergodic theorem, the subadditive ergodic theorem does noteasily generalize to AMS measures since ifm is not stationary, it does notfollow that f or f are invariant measurements or that the event f = fis either an invariant event or a tail event. The following generalizationis a slight extension of unpublished work of Wu Chou. [13]

Theorem 8.5. Let (Ω,B,m,T) be an AMS dynamical system with station-ary mean m. Suppose that fn;n = 1,2, . . . is a subadditive sequenceofm-integrable measurements. There is an invariant function φ(x) suchthat m-a.e.

limn→∞

1nfn(x) = φ(x)

if and only if m <<m, in which case the function φ(x) is given by

φ(x) = infn

1nEm(fn|I) = lim

n→∞1nEm(fn|I).

Comment: Recall from Theorem 7.4 that the condition that the AMSmeasure is dominated by its stationary mean is equivalent to the con-dition that the system be recurrent. Note that the theorem immediatelyimplies that the Kingman ergodic theorem holds for two-sided AMS pro-cesses. What is new here is the result for one-sided AMS processes.

Proof. If m << m, then the conclusions follow immediately from theergodic theorem for stationary sources since m inherits the probability1 and 0 sets fromm. Suppose thatm is not dominated bym. Then fromTheorem 7.4, the system is not conservative and hence there exists awandering set W with nonzero probability, that is, the sets T−nW arepairwise disjoint and m(W) > 0. We now construct a counterexampleto the subadditive ergodic theorem, that is, a subadditive function thatdoes not converge. Define

Page 288: Probability Random Processes And Ergodic Properties

8.5 The Subadditive Ergodic Theorem 261

fn(x) =

−2kn x ∈ W and 2kn ≤ n ≤ 2kn+1, kn = 0,1,2, . . .fn(T jx)− 2n x ∈ T−jW0 otherwise.

We first prove that fn is subadditive, that is, that fn+m(x) ≤ fn(x) +fm(Tnx) for all x ∈ Ω. To do this we consider three cases: (1) x ∈ W , (2)x ∈ W∗ =

⋃∞i=1 T−iW , and (3) x 6∈ W ∪W∗. In case (1),

fn(x) = −2kn

andfn+m(x) = −2kn+m.

To evaluate fm(Tnx) observe that x ∈ W implies that Tnx 6∈ W ∪W∗since otherwise

x ∈ T−n∞⋃k=0

T−kW =∞⋃k=nT−kW,

which is impossible because x ∈ W and W is a wandering set. Thusfn(Tmx) = 0. Thus for x ∈ W

fn(x)+ fm(Tnx)− fn+m(x) = −2kn + 2kn+m.

Since by construction kn +m ≥ kn, this is nonnegative and f is subad-ditive in case (1). For case (3) the result is trivial: If x 6∈ W ∪W∗, then noshift of x is ever in W , and all the terms are 0. This leaves case (2).

Suppose now that x ∈ T−jW for some k. Then

fn+m(x) = fn+m(Tjx)− 2n+m = −2kn+m − 2n+m,

andfn(x) = fn(Tjx)− 2n = −2kn − 2n.

For the final term there are two cases. If n > j, then analogously to case(1) x ∈ T−jW implies that Tnx 6∈ W ∪W∗ since otherwise

x ∈ T−n∞⋃k=0

T−kW =∞⋃k=nT−kW,

which is impossible since W is a wandering set and hence x cannot be inboth T−jW and in

⋃∞k=n T−kW since n > j. If n ≤ j, then

fm(Tnx) = fm(Tjx)− 2m = −2km − 2m.

Therefore if n > j,

fn(x)+ fm(Tnx)− fn+m(x) = −2kn − 2n + 0+ 2kn+m + 2n+m ≥ 0.

Page 289: Probability Random Processes And Ergodic Properties

262 8 Ergodic Theorems

If n ≤ j then

fn(x)+ fm(Tnx)− fn+m(x) = −2kn − 2n − 2km − 2m + 2kn+m + 2n+m.

Suppose that n ≥m ≥ 1, then

2n+m − 2n − 2m = 2n(2m − 1− 2−|n−m|) ≥ 0.

Similarly, since then kn ≥ km, then

2kn+m − 2kn − 2km = 2kn(2kn+m−kn − 1− 2−|kn−km| ≥ 0.

Similar inequalities follow when n ≤ m, proving that fn is indeed sub-additive on Ω.

We now complete the proof. The eventW has nonzero probability, butif x ∈ W , then

limn→∞

12nf2n(x) = lim

n→∞−2n

2n= −1

whereas

limn→∞

12n − 1

f2n−1(x) = limn→∞

−2n−1

2n − 1= −1

2

and hence n−1fn does not have a limit on this set of nonzero probability,proving that the subadditive ergodic theorem does not hold in this case.2

Page 290: Probability Random Processes And Ergodic Properties

Chapter 9

Process Approximation and Metrics

Abstract The goodness or badness of approximation of one process foranother can be quantified using various metrics (or distances) or distor-tion measures (or cost functions). Two of the most useful such notionsof approximation are introduced and studied in this chapter — the dis-tributional distance and the dp distortion. The former leads to a usefulmetric space of probability measures. The second combines ideas fromthe Monge/Kantorovich optimal transportation cost and Ornstein’s d (d-bar) distance on random variables, vectors, and processes and is usefulfor quantifying the different time-average behavior of frequency-typicalor frequency-typical sequences produced by distinct random processes.

9.1 Distributional Distance

We begin with probability metric that is not very strong, but suffices fordeveloping the ergodic decomposition of affine functionals in the nextchapter.

Given any class G = Fi; i = 1,2, . . . consisting of a countable col-lection of events we can define for any measures p,m ∈ P(Ω,B) thefunction

dG(p,m) =∞∑i=1

2−i|p(Fi)−m(Fi)|. (9.1)

If G contains a generating field, then dG is a metric on P(Ω,B) sincefrom Lemma 1.19 two measures defined on a common σ -field generatedby a field are identical if and only if they agree on the generating field.We shall always assume that G contains such a field. We shall call sucha distance a distributional distance. Usually we will require that G be astandard field and hence that the underlying measurable space (Ω,B)is standard. Occasionally, however, we will wish a larger class G, but

© Springer Science + Business Media, LLC 2009R.M. Gray, Probability, Random Processes, and Ergodic Properties, 263DOI: 10.1007/978-1-4419-1090-5_9,

Page 291: Probability Random Processes And Ergodic Properties

264 9 Process Approximation and Metrics

will not require that it be standard. The different requirements will berequired for different results. When the class G is understood from con-text, the subscript will be dropped.

Let (P(Ω,B), dG) denote the metric space of P(Ω,B) with the metricdG . A key property of spaces of probability measures on a fixed measur-able space is given in the following lemma.

Lemma 9.1. The metric space (P((Ω,B)), dG) of all probability measureson a measurable space (Ω,B) with a countably generated sigma-field isseparable if G contains a countable generating field. If also G is standard(as is possible if the underlying measurable space (Ω,B) is standard), thenalso (P, dG) is complete and hence Polish.

Proof. For each n let An denote the set of nonempty intersection setsor atoms of F1, . . . , Fn, the first n sets in G. (These are Ω sets, eventsin the original space.) For each set G ∈ An choose an arbitrary point xGsuch that xG ∈ G. We will show that the class of all measures of the form

r(F) =∑G∈An

pG1F(xG),

where the pG are nonnegative and rational and satisfy∑G∈An

pG = 1,

forms a dense set in P(Ω,B). Since this class is countable, P(Ω,B) is sep-arable. Observe that we are approximating all measures by finite sums ofpoint masses. Fix a measure m ∈ (P(Ω,B), dG) and an ε > 0. Choose nso large that 2−n < ε/2. Thus to match up two measures in d = dG , (9.1)implies that we must match up the probabilities of the first n sets in Gsince the contribution of the remaining terms is less that 2−n. Define

rn(F) =∑G∈An

m(G)1F(xG)

and note thatm(F) =

∑G∈An

m(G)m(F|G),

wherem(F|G) =m(F∩G)/m(G) is the elementary conditional probabil-ity of F given G if m(G) > 0 and is arbitrary otherwise. For conveniencewe now consider the preceding sums to be confined to those G for whichm(G) > 0.

Since the G are the atoms of the first n sets Fi, for any of theseFi either G ⊂ Fi and hence G ∩ Fi = G or G ∩ Fi = ∅. In the first case1Fi(xG) = m(Fi|G) = 1, and in the second case 1Fi(xG) = m(Fi|G) = 0,and hence in both cases

Page 292: Probability Random Processes And Ergodic Properties

9.1 Distributional Distance 265

rn(Fi) =m(Fi); i = 1,2, . . . , n.

This implies that

d(rn,m) ≤∞∑

i=n+1

2−i = 2−n ≤ ε2.

Enumerate the atoms of An as Gl; l = 1,2, . . . , L, where L ≤ 2n. Forall l but the last (l = L) pick a rational number pGl such that

m(Gl)− 2−lε4≤ pGl ≤m(Gl);

that is, we choose a rational approximation tom(Gl) that is slightly lessthan m(Gl). We define pGL to force the rational approximations to sumto 1:

pGL = 1−L−1∑l=1

pGl.

Clearly pGL is also rational and

|pGL −m(GL)| = |L−1∑l=1

pGl −L−1∑l=1

m(Gl)|

≤L−1∑l=1

|pGl −m(Gl)| ≤ε4

L−1∑l=1

2−l ≤ ε4.

Thus

∑G∈An

|pG −m(G)| =L−1∑l=1

|pGl −m(Gl)| + |pGL −m(GL)| ≤ε2.

Define the measure tn by

tn(F) =∑G∈An

pG1F(xG)

and observe that

d(tn, rn) =n∑i=1

2−i|tn(Fi)− rn(Fi) =n∑i=1

2−i|∑G∈An

1Fi(pG −m(G))|

≤n∑i=1

2−i∑G∈An

|pG −m(G)| ≤ε4.

Thus

Page 293: Probability Random Processes And Ergodic Properties

266 9 Process Approximation and Metrics

d(tn,m) ≤ d(tn, rn)+ d(rn,m) ≤ε2+ ε

2= ε,

which completes the proof of separability.Next assume that mn is a Cauchy sequence of measures. From the

definition of d this can only be if also mn(Fi) is a Cauchy sequenceof real numbers for each i. Since the real line is complete, this meansthat for each i mn(Fi) converges to something, say α(Fi). The set func-tion α is defined on the class G and is clearly nonnegative, normalized,and finitely additive. Hence if G is also standard, then α extends to aprobability measure on (Ω,B); that is, α ∈ P(Ω,B). By constructiond(mn,α) → 0 as n → ∞, and hence in this case P(Ω,B) is complete.2

Lemma 9.2. Assume as in the previous lemma that (Ω,B) is standard andG is a countable generating field. Then (P((Ω,B)), dG) is sequentiallycompact; that is, if µn is a sequence of probability measures in P(Ω,B),then it has a subsequence µnk that converges.

Proof. Since G is a countable generating field, a probability measure µin P is specified via the Carathéodory extension theorem by the infinitedimensional vector µ = µ(F);F ∈ G ∈ [0,1]Z+ . Suppose that µn isa sequence of probability measures and let µn also denote the corre-sponding sequence of vectors. From Lemma 1.12, [0,1]Z+ is sequentiallycompact with respect to the product space metric, and hence there mustexist a µ ∈ [0,1]Z+ and a subsequence µnk such that µnk → µ. The vectorµ provides a set function µ assigning a value µ(F) for all sets F in thegenerating field G. This function is nonnegative, normalized, and finitelyadditive. In particular, if F =

⋃i Fi is a union of a finite number of dis-

joint events in G, then

µ(F) = limk→∞

µnk(F) = limk→∞

∑iµnk(Fi)

=∑i

limk→∞

µnk(Fi) =∑iµ(Fi).

Since the alphabet is standard, µ must extend to a probability mea-sure on (Ω,B). By construction µnk(F) → µ(F) for all F ∈ G and hencedG(µnk, µ)→ 0, completing the proof. 2

An Example

As an example of the previous construction, suppose that Ω is itself aPolish space with some metric d and that G is the field generated bythe countable collection of spheres VΩ of Lemma 1.10. Thus if mn → µ

Page 294: Probability Random Processes And Ergodic Properties

9.1 Distributional Distance 267

under dG , mn(F) → µ(F) for all spheres in the collection and all finiteintersections and unions of such spheres. From (1.18) any open set Gcan be written as a union of elements of VΩ, say

⋃i Vi. Given ε, choose

an M such that

m(M⋃i=1

Vi) > m(G)− ε.

Then

m(G)− ε < m(M⋃i=1

Vi) = limn→∞

mn(M⋃i=1

Vi) ≤ lim infn→∞

mn(G).

Thus we have proved that convergence of measures under dG impliesthat

lim infn→∞

mn(G) ≥m(G); all open G.

Since the complement of a closed set is open, this also implies that

lim supn→∞

mn(G) ≤m(G); all closed G.

Along with the preceding limiting properties for open and closed sets,convergence under dG has strong implications for continuous functions.Recall from Section 1.4 that a function f : Ω → R is continuous if givenω ∈ Ω and ε > 0 there is a δ such that d(ω,ω′) < δ ensures that|f(ω)− f(ω′)| < ε. We consider two such results, one in the form wewill later need and one to relate our notion of convergence to that ofweak convergence of measures.

Lemma 9.3. If f is a nonnegative continuous function and (Ω,B) and dGare as previously, then dG(mn,m)→ 0 implies that

lim supn→∞

Emnf ≤ Emf .

Proof. For any n we can divide up [0,∞) into the countable collection ofintervals [k/n, (k + 1)/n), k = 0,1,2, . . . that partition the nonnegativereal line. Define the sets Gk(n) = r : k/n ≤ f(r) < (k + 1)/n anddefine the closed sets Fk(n) = r : k/n ≤ f(r). Observe that Gk(n) =Fk(n)− Fk+1(n). Since f is nonnegative for any n,

∞∑k=0

knm(Gk(n)) ≤ Emf ≤

∞∑k=0

k+ 1nm(Gk(n)) =

∞∑k=0

knm(Gk(n))+

1n.

The sum can be written as

1n

∞∑k=0

k (m(Fk(n))−m(Fk+1(n)) =1n

∞∑k=0

m(Fk(n)),

Page 295: Probability Random Processes And Ergodic Properties

268 9 Process Approximation and Metrics

and therefore1n

∞∑k=0

m(Fk(n)) ≤ Emf .

By a similar construction

Emnf <1n

∞∑k=0

mn(Fk(n))+1n

and hence from the property for closed sets

lim supn→∞

Emnf ≤1n

∞∑k=0

lim supn→∞

mn(Fk(n))+1n

≤ 1n

∞∑k=0

m(Fk(n))+1n≤ Emf +

1n.

Since this is true for all n, the lemma is proved. 2

Corollary 9.1. Given (Ω,B) and dG as in the lemma, suppose that f is abounded, continuous function. Then dG(mn,m)→ 0 implies that

limn→∞

Emnf = Emf .

Proof. If f is bounded we can find a finite constant c such that g =f + c ≥ 0. As g is clearly continuous, we can apply the lemma to g toobtain

lim supn→∞

Emng ≤ Emg = c + Emf .

Since the left-hand side is c + lim supn→∞ Emnf ,

lim supn→∞

Emnf ≤ Emf ,

as we found with positive functions. Now, however, we can apply thesame result to −f (which is also bounded) to obtain

lim infn→∞

Emnf ≥ Emf ,

which together with the limit supremum result proves the corollary. 2

Given a space of measures on a common metric space (Ω,B), a se-quence of measures mn is said to converge weakly to m if

limn→∞

Emnf = Emf

for all bounded continuous functions f . We have thus proved the follow-ing:

Page 296: Probability Random Processes And Ergodic Properties

9.2 Optimal Coupling Distortion/Transportation Cost 269

Corollary 9.2. Given the assumptions of Lemma 9.3, if dG(mn,m) → 0,then mn converges weakly to m.

We will not make much use of the notion of weak convergence, butthe preceding result is included to relate the convergence that we use toweak convergence for the special case of Polish alphabets. For a moredetailed study of weak convergence of measures, the reader is referredto Billingsley [4, 5].

9.2 Optimal Coupling Distortion/Transportation Cost

In the previous section we considered a distance function on probabilitymeasures. This section treats a very different distance or distortion mea-sure, which can be viewed as a metric or distortion measure between ran-dom variables or vectors. In the next section the ideas will be extendedto random processes. Unlike the distributional distances which measurehow well or badly probabilities match up, this distortion or cost func-tion will measure how small the average distortion between two randomvariables or vectors can be. The idea is extended to random processesin the next section by means of a fidelity criterion, a family of distortionmeasures. This is a particularly useful notion when dealing with commu-nications and signal processing systems where the quality of a systemis usually measured by how good (or how bad) one process is as an ap-proximation to another.

Distortion Measures

A distortion measure is not a “measure” in the sense used so far; it isan assignment of a nonnegative real number which indicates how bad anapproximation one symbol or random object is of another — the smallerthe distortion, the better the approximation. The idea is a natural gener-alization of a metric or distance — a distortion measure need not satisfythe symmetry and triangle inequality properties of a metric. If the tworandom objects correspond to the input and output of a communicationsystem, for example, then the distortion provides a measure of the qual-ity of performance of the system, small (large) distortion means good(bad) quality. As another example, we might try to simulate a particu-lar process by another process from a constrained class. The constraintmight be a finite-alphabet constraint, a constraint on the memory, or, asconsidered in the sequel [30], a constraint on the entropy rate. The ques-tion naturally arises as to what the best such constrained approximation

Page 297: Probability Random Processes And Ergodic Properties

270 9 Process Approximation and Metrics

might be, and a process metric provides a useful means of quantifyingthis optimization.

Given two measurable spaces (A,B(A)) and (B,B(B)), a distortionmeasure on A×B is a nonnegative measurable mapping ρ : A×B → [0,∞)which assigns a real number ρ(x,y) to each x ∈ A as y ∈ B which canbe thought of as the cost of reproducing x and y . Infinite values can beallowed in the abstract, but most distortion measures of practical inter-est take on only finite (but possibly unbounded) values.

Metrics or distances are a common choice for distortion measures,and for our purposes the most important example of a metric distortionmeasure arises when the input space A is a Polish space, a complete sep-arable metric space, and B is either A itself or a Borel subset of A. Inthis case the distortion measure is fundamental to the structure of thealphabet and the alphabets are standard since the space is Polish. Manypractically important distortion measures are not metrics. The most im-portant single distortion measure in signal processing, is squared error,which is the square of the Euclidean metric and not itself a metric. Thestructure of the underlying metric, however, is preserved in this case,and more generally in the case of distortion measures formed as a posi-tive power of a metric. This is the case that will be emphasized here.

Optimal Coupling Distortion and Transportation Distance

Suppose that π is a measure on the product measurable space (A ×B,B(A)×B(B)). The average distortion is defined in the obvious way asEπρ. The average distortion can be restated in terms of random variablesor vectors. Suppose that X and Y are random objects with alphabets AXand AY . Let (X, Y) denote the identity mapping on the product spaceAX × AY and let X and Y denote the projections onto the componentspaces given by X(x,y) = x, Y(x,y) = y . Then π can be viewed asthe joint distribution of X and Y and the average distortion is given byEπρ(X,Y).

Any such measure π on the product space induces marginal measureson the individual component spaces, say µX and µY , where

µX(F) = π(F × B);F ∈ B(A); µY (G) = π(A×G);G ∈ B(B) (9.2)

A joint measure consistent with two given marginals is called a couplingor joining of the two marginals.

The question now naturally arises: given two marginal distributionsµX and µY , what joint distribution π minimizes the average distortion?This yields the optimization problem

Page 298: Probability Random Processes And Ergodic Properties

9.2 Optimal Coupling Distortion/Transportation Cost 271

ρ(µX, µY ) = infπ∈P(µX,µY )

Eπρ(X,Y), (9.3)

where P(µX, µY ) is the collection of all joint measures π satisfying (9.2).Note that the set P(µX, µY ) is not empty since it at least contains theproduct measure.

We first encountered this idea in Example 2.2, where we defined adistance measure between distributions PX and PY describing randomvariables X and Y by

d(PX, PY ) = infPX,Y∈P(PX,PY )

PX,Y (X ≠ Y).

This is simply 9.3 for the particular case ρ = dH , the Hamming distanceof Example 1.1 where dH(x,y) = 0 if x = y and 1 otherwise, so that

PX,Y (X ≠ Y) = EPX,Y dH(X, Y). (9.4)

This definition is more suited to random variables and vectors thanto random processes, but before extending the definition to processes inthe Section 9.4, some historical remarks are in order.

The above formulation dates back to Kantorovich’s minimization ofaverage cost over couplings with prescribed marginals [57]. Kantorovichconsidered compact metric spaces with metric cost functions. The mod-ern literature considers cost functions that are essentially as generalas the distortion measures considered in information theoretic codingtheorems. Results exist for general distortion measures/cost functions,but stronger and more useful results exist when the structure of met-ric spaces is added by concentrating on distortion measures which arepositive powers of metrics.

The problem for Kantorovich arose in the study of resource allocationin economics, work for which he later shared the 1974 Nobel prize foreconomics. The problem is an early example of linear programming, andKantorovich is generally recognized as the inventor of linear program-ming. Kantorovich later realized [58] that his problem was a generaliza-tion of a 1781 problem of Monge [71], which can be described in in thecurrent context as follows. Fix as before the marginal distributions µX ,µY . Given a measurable mapping f : A → B, define as usual the inducedmeasure µXf−1 by

µXf−1(G) = µX(f−1(G));G ∈ BB. (9.5)

Defineρ(µX, µY ) = inf

f :µXf−1=µYEµXρ(X, f (X)). (9.6)

In this formulation the coupling of two random objects by a joint mea-sure is replaced by a deterministic mapping of the first into the second.

Page 299: Probability Random Processes And Ergodic Properties

272 9 Process Approximation and Metrics

This effectively constrains the optimization to the special case of deter-ministic mappings and the resulting optimization is no longer one withlinear constraints, which results in increased difficulty of analysis andevaluation.

The problem of (9.3) is widely known as the Monge/Kantorovich op-timal transportation or transport problem, and when the cost functionis a metric the resulting minimum average distortion has been calledthe Monge/Kantorovich distance. There is also a more general meaningof the name “Monge/Kantorovich distance,” as will be discussed. Whileit is the Kantorovich formulation that is most relevant here, the Mongeformulation will provide interesting analogies in the sequel when cod-ing schemes are introduced since the mapping f can be interpreted asa coding of X into Y . A significant portion of the optimal transporta-tion literature deals with conditions under which the two optimizationsyield the same values and the optimal coupling is in fact a deterministicmapping.

When the idea was rediscovered in 1974 [74, 42] as part of an ex-tension of Ornstein’s d- distance for random processes [76, 77, 78](more about this later), it was called the ρ or “rho-bar distance.” It wasrecognized, however, that the finite dimensional case (random vectors)was Vasershtein’s distance [108], which had been popularized by Do-brushin [18], and, in [74], as the transportation distance. In the optimaltransportation literature, the German spelling of Vasershtein, Wasser-stein, also caught on as a suitable name for the Monge/Kantorovich dis-tance and remains so.

The Monge/Kantorovich optimal cost provides a distortion or distancebetween measures, random variables, and random vectors. It has founda wide variety of applications and there are several thorough modernbooks detailing the theory and applications. The interested reader is re-ferred to [86, 87, 109, 110, 111] and the enormous number of referencestherein. The richness of the field is amply demonstrated by the morethan 700 references cited in [111] alone! Many examples of evaluation,dual formulations, and methods for constructing the measures and func-tions involved have appeared in the literature. Names associated with thedistance include Gini [29], Fréchet [25], Vasershtein/Wasserstein [108],Vallender [107], Dobrushin [18], Mallows [70], and Bickel and Freed-man [2]. It was rediscovered in the computer science literature as the“earth mover’s distance” with an intuitive description close to that ofthe original transportation distance [92, 65].

Because of the plethora of contributing authors from different fields,the distortion presents a challenge of choosing both notation and aname. Because this book shares the emphasis on random processes withOrnstein and ergodic theory, the notation and nomenclature will gener-ally follow the ergodic theory literature and the derivative informationtheory literature. Initially, however, we adopt the traditional names for

Page 300: Probability Random Processes And Ergodic Properties

9.2 Optimal Coupling Distortion/Transportation Cost 273

the traditional setting, and much of the notation in the various fields issimilar.

The Monge/Kantorovich Distance

Assume that both µX and µY are distributions on random objects X andY defined on a common Polish space (A,d). Define the distortion mea-sure ρ = dp for p ≥ 0, where the p = 0 case is shorthand for the Ham-ming distance, that is,

d0(x,y) = dH(x,y) = 1− δx,y . (9.7)

Define the collection of measures Pp(A,B(A)) as the collection of allprobability measures µ on (A,B(A)) for which there exists a point a∗ ∈A such that ∫

dp(x,a∗)dµ(x) <∞. (9.8)

The point a∗ is called a reference letter in information theory [27].Define

dp(µX, µY ) = ρ(µX, µY )min(1,1/p) =ρ(µX, µY ) 0 ≤ p ≤ 1

ρ1/p(µX, µY ) 1 ≤ p . (9.9)

The following lemma shows that dp is indeed a metric on Pp(A,B(A)).Lemma 9.4. Given a Polish space (A,d) with a Borel σ -field B(A), letPp(A,B(A)) denote the collection of all Borel probability measures on theBorel measurable space (A,B(A)) with a reference letter. Then for anyp ≥ 0, dp is a metric on Pp(A,B(A)).Proof. First consider the case with 0 ≤ p ≤ 1, in which case ρ is itself ametric. This immediately implies that ρ is nonnegative and symmetric inits arguments. To prove that the triangle inequality holds, suppose thatµX , µY , and µZ are three distributions on (A,B(A)). Suppose that π ap-proximately yields ρ(µX, µZ) and that r approximately yields ρ(µZ, µY ):

Eπρ(X,Z) ≤ ρ(µX, µZ)+ εErρ(Z, Y) ≤ ρ(µZ, µY )+ ε.

The existence of the reference letter ensures that the expectations are allfinite. These two distributions can be used to construct a joint distribu-tion q for the triple (X,Z, Y) such that X → Z → Y is a Markov chain andsuch that the coordinate vectors have the correct pairwise and marginaldistributions. In particular, r and µZ together imply a conditional distri-bution PY |Z and π and µX together imply a conditional distribution PZ|X .

Page 301: Probability Random Processes And Ergodic Properties

274 9 Process Approximation and Metrics

Since the spaces are standard these can be chosen to be regular condi-tional probabilities and hence the distribution specified by its values onrectangles. If we define νx(F) = pZ|X(F|x), then

q(F ×G ×D) =∫DdµX(x)

∫Gdνx(z)pY |Z(F|z).

For this distribution, since ρ is a metric,

ρ(µX, µY ) ≤ Eqρ(X,Y) ≤ Eqρ(X,Z)+ Eqρ(Z, Y)= Eπρ(X,Z)+ Erρ(Z, Y)≤ ρ(µX, µZ)+ ρ(µZ, µY )+ 2ε,

which proves the triangle inequality for ρ. Suppose that ρ = 0. Thisimplies that there exists a π ∈ Pp(µX, µY ) such that π(x,y : x 6= y) =0 since ρ(x,y) is 0 only if x = y . This is only possible, however, ifµX(F) = µY (F) for all events F , that is, the two marginals are identical.

Next assume that p ≥ 1, in which case ρ need not be a metric. Con-struct the measures π , r and q as before except that now π and r arechosen to satisfy

(Eπρ(X,Z))1/p ≤ ρ1/p(µX, µZ)+ ε(Erρ(Z, Y))1/p ≤ ρ1/p(µZ, µY )+ ε.

Again the existence of the reference letter ensures that the expectationsare all finite. Then

dp(µX, µY ) = ρ1/p(µX, µY )

≤(Eqρ(X,Y)

)1/p=(Eqdp(X,Y)

)1/p

≤(Eq [(d(X,Z)+ d(Z, Y))p]

)1/p

≤ (Eπ [dp(X,Z)])1/p + (Er [dp(Z, Y)])1/p

using Minkowski’s inequality (Lemma 5.8). Thus

dp(µX, µY ) ≤ dp(µX, µZ)+ dp(µZ, µY )+ 2ε.

2

9.3 Coupling Discrete Spaces with the Hamming Distance

Dobrushin [18] developed several interesting properties for the spe-cial case of the transportation or Monge/Kantorovich distance, which

Page 302: Probability Random Processes And Ergodic Properties

9.3 Coupling Discrete Spaces with the Hamming Distance 275

he called the Vasershtein distance [108], with a Hamming distortion.The properties prove useful when developing examples of the processdistance, so they are collected here. Although the properties are oftenquoted in the literature, they are usually given without proof or for moregeneral continuous spaces where the proof is more complicated.

For this section we consider probability measures µX and µY on acommon discrete space A. We will use µX to denote both the measureand the PMF, e.g., µX(G) is the probability of a subset G ⊂ S and µ(x) =µ(x) is the probability of the one point set x. Specifically, X and Yare random objects with alphabets AX = AY = A. Let (X, Y) denote theidentity mapping on the product space A×A and let X and Y denote theprojections onto the component spaces given by X(x,y) = x, Y(x,y) =y . Denote as before the collection of all couplings π satisfying (9.2) byP(µX, µY ).

The Hamming transportation distance is given by

dH(µX, µY ) = infπ∈P(µX,µY )

dH(X,Y) = infπ∈P(µX,µY )

π(X ≠ Y).

Two other distances between µX and µY ) are considered here: The vari-ation distance

var(µX, µY ) = supG| µX(G)− µY (G) |,

where the supremum is over all events G, and the distribution distanceof Example 2.3

| µX − µY |=∑x∈A

| µX(x)− µY (x) | .

The variation (or variational or total variation) distance quantifies thenotion of the maximum possible difference the two measures can assignto a single event. The distribution distance is simply an `1 norm of thedifference of the two measures.

Lemma 9.5.

dH(µX, µY ) = var(µX, µY ) =12| µX − µY | .

Proof. Define the set

B = x : µX(x) ≥ µY (x). (9.10)

For any event G,

Page 303: Probability Random Processes And Ergodic Properties

276 9 Process Approximation and Metrics

| µX(G)− µY (G) | = |∑x∈G

µX(x)−∑x∈G

µY (x) |=|∑x∈G(µX(x)− µY (x)) |

= |∑

x∈G∩B(µX(x)− µY (x))︸ ︷︷ ︸

a

−∑

x∈G∩Bc(µY (x)− µX(x))︸ ︷︷ ︸

b

|

The terms a and b are each positive. If for a given event G, a > b, then|a−b| = a−b. In this case µX(G)−µY (G) can be increased by removingall of the x ∈ Bc from G and adding to G all of the x ∈ B, which impliesthat

| µX(G)− µY (G) |≤ µX(B)− µY (B).Similarly, if a ≤ b, then

| µX(G)− µY (G) | ≤ µY (Bc)− µX(Bc

= (1− µY (B))− (1− µX(B)) = µX(B)− µY (B).

Since one of these conditions must hold | µX(G) − µY (G) |≤ µX(B) −µY (B) for all G, which upper bound can be achieved by choosing G = B.Thus var(µX, µY ) = µX(B)− µY (B). Since

| µX(G)− µY (G) | =∑x|µX(x)− µY (x)|

=∑x∈B(µX(x)− µY (x))+

∑x∈Bc

(µY (x)− µX(x))

= µX(B)− µY (B)+ µY (Bc)− µX(Bc)= µX(B)− µY (B)+ (1− µY (B))− (1− µX(B))= 2(µX(B)− µY (B)),

whence

var(µX, µY ) =12| µX − µY | .

Suppose thatπ ∈ P(µX, µY ).Then sinceπ(x,x) ≤min(µX(x), µY (x)),

π(X ≠ Y) = 1−π(X = Y) = 1−∑xπ(x,x)

≥ 1−∑x

min(µX(x), µY (x))

= 12

∑x(µX(x)+ µY (x)− 2 min(µX(x), µY (x)))

= 12

∑x| µX(x)− µY (x) |=

12| µX − µY | .

This lower bound is in fact achievable by defining π in terms of the eventB of (9.10) for x = y by

Page 304: Probability Random Processes And Ergodic Properties

9.4 Fidelity Criteria and Process Optimal Coupling Distortion 277

π(x,x) =min(µX(x), µY (x)) (9.11)

and for x ≠ y by

(µX(x)− µY (x))1B(x)(µY (x)− µX(x))1Bc (x).µX(B)− µY (B)

. (9.12)

Thus

dH(µX, µY ) =12|µX − µY |.

2

9.4 Fidelity Criteria and Process Optimal CouplingDistortion

In information theory, ergodic theory, and signal processing, it is im-portant to quantify the distortion between processes rather than onlybetween probability distributions, random variables, or random vectors.This means that instead of having a single space of interest, one isinterested in a limit of finite-dimensional spaces or a single infinite-dimensional space. In this case one needs a family of distortion mea-sures for the finite-dimensional spaces which will yield a notion of dis-tortion and optimal coupling for the entire process. Towards this end,suppose next that we have a sequence of product spaces An and Bn (withthe associated product σ -fields) for n = 1,2, · · · . A fidelity criterion ρn,n = 1,2, · · · is a sequence of distortion measures on An × Bn [95, 96].If one has a pair random process, say Xn,Yn, then it will be of interestto find conditions under which there is a limiting per symbol distortionin the sense that

ρ∞(x,y) = limn→∞

1nρn(xn,yn)

exists. As one might guess, the distortion measures in the sequence needto be interrelated in order to get useful behavior. The simplest and mostcommon example is that of an additive or single-letter fidelity criterionwhich has the form

ρn(xn,yn) =n−1∑i=0

ρ1(xi,yi).

Here if the pair process is AMS and ρ1 satisfies suitable integrabilityassumptions, then the limiting distortion will exist and be invariant fromthe ergodic theorem.

More generally, if ρn is subadditive in the sense that

Page 305: Probability Random Processes And Ergodic Properties

278 9 Process Approximation and Metrics

ρn(xn,yn) ≤ ρk(xk,yk)+ ρn−k(xn−kk ,yn−kk ),

then stationarity of the pair process and integrability of ρ1 will ensurethat n−1ρn converges from the subadditive ergodic theorem. For exam-ple, if d is a distortion measure (possibly a metric) on A× B, then

ρn(xn,yn) =n−1∑i=0

d(xi,yi)p1/p

for p > 1 is subadditive from Minkowski’s inequality.By far the bulk of the information theory literature considers only

additive fidelity criteria and we will share this emphasis. The most com-mon examples of additive fidelity criteria involve defining the per-letterdistortion ρ1 in terms of an underlying metric d. The most commonly oc-curring examples are to set ρ1(x,y) = d(x,y), in which case ρn is alsoa metric for all n, or ρ1(x,y) = dp(x,y) for some p > 0. If 1 > p > 0,then again ρn is a metric for all n. If p ≥ 1, then ρn is not a metric, but

ρ1/pn is a metric. We do not wish to include the 1/p in the definition of

the fidelity criterion, however, because what we gain by having a metricdistortion we more than lose when we take the expectation. For example,a popular distortion measure is the expected squared error, not the ex-pectation of the square root of the squared error. Thanks to Lemma 9.4we can still get the benefits of a distance even with the nonmetric dis-tortion measure, however, by taking the appropriate root outside of theexpectation.

A small fraction of the information theory and communications liter-ature consider the case of ρ1 = f(d) for a nonnegative convex functionf , which is more general than ρ1 = dp if p ≥ 1. We will concentrate onthe simpler case of a positive power of a distortion.

We will henceforth make the following assumptions:

• ρn;n = 1,2, . . . is an additive fidelity criterion defined by

ρn(xn,yn) =n−1∑i=0

dp(xi,yi), (9.13)

where d is a metric and p > 0. We will refer to this as the dp-distortion. We will also allow the case p = 0 which is defined by

ρ1(x,y) = d0(x,y) = dH(x,y),

the Hamming distance dH(x,y) = 1− δx,y .• The random process distributions µ considered will possess a refer-

ence letter in the sense of Gallager [27]: given µ there exists an a ∈ Afor which

Page 306: Probability Random Processes And Ergodic Properties

9.4 Fidelity Criteria and Process Optimal Coupling Distortion 279∫dp(x,a∗)dµ1(x) <∞. (9.14)

This is trivially satisfied if the distortion is bounded.

Define Pp(A,d) to be the space of all stationary process distributions µon (A,B(A))T satisfying (9.14), where both T = Z and Z+ will be consid-ered. Note that previously Pp(A,d) referred to a space of distributionsof random objects taking values in a metric space A. Now it refers toa space of process distributions with all individual component randomvariables taking values in a common metric space A, that is, the processdistributions are on (A,B(A))T.

In the process case we will consider Monge/Kantorovich distances onthe spaces of random vectors with alphabets An for all n = 1,2, . . ., butwe shall define a process distortion and metric as the supremum over allpossible vector dimensions.

Process Optimal Coupling Distortion

Given two process distributions µX and µY describing random processesXn and Yn, let the induced n-dimensional distributions be µXn, µYn .For each positive integer nwe can define the n-dimensional or n-th orderoptimal coupling distortion as the distortion between the induced n-dimensional distributions:

ρn(µXn, µYn) = infπ∈Pp(µXn ,µYn)

Eπρn(Xn, Yn) (9.15)

The process optimal coupling distortion (or ρ distortion) between µXand µY is defined by

ρ(µX, µY ) = supn

1nρn(µXn, µYn). (9.16)

The extension of the optimal coupling distortion or transportation costto processes was developed in the early 1970s by D.S. Ornstein for thecase of ρ1 = dH and the resulting process metric, called the d distance,played a fundamental role in the development of the Ornstein isomor-phism theorem of ergodic theory (see [76, 77, 78] and the referencestherein). The idea was extended to processes with Polish alphabets andmetric distortion measures and the square of metric distortion measuresin 1975 [42] and applied to problems of quantizer mismatch [35], Shan-non information theory [40, 75, 41, 31, 43, 32], and robust statistics [81].While there is a large literature on the finite-dimensional optimal cou-pling distance, the literature for the process optimal coupling distanceseems limited to the information theory and ergodic theory literature.

Page 307: Probability Random Processes And Ergodic Properties

280 9 Process Approximation and Metrics

Here the focus is on the process case, but the relevant finite-dimensionalresults are also treated as needed.

The key aspect of the process distortion is that if it is small, thennecessarily the distortion between all sample vectors produced by thetwo processes is also small.

The dp-distance

The process distortion is clearly a metric for the important special caseof an additive fidelity criterion with a metric per letter distortion. Thissubsection shows how a process metric can be obtained in the mostimportant special case of an additive fidelity criterion with a per letterdistortion given by a positive power of a metric. The result generalizesthe special cases of process metrics in [77, 42, 106] and extends the wellknown finite-dimensional version of optimal transportation theory (e.g.,Theorem 7.1 in [110]).

Theorem 9.1. Given a Polish space (A,d) and p ≥ 0, define the additivefidelity criterion ρn : An ×An → R+;n = 1,2, . . . by

ρn(xn,yn) =n−1∑n=0

dp(xi,yi),

where d0 is shorthand for the Hamming distance dH . Define

dp(µX, µY ) = supnn−1ρmin(1,1/p)

n (µXn, µYn) = ρmin(1,1/p)(µX, µY ). (9.17)

Then dp is a metric on Pp(A,d), the space of all stationary random pro-cesses with alphabet A.

Comments: The theorem together with Lemma 9.4 says that if ρ1 = dp,then dp = ρ1/p is a metric for both the vector and process case if p ≥ 1,and dp = ρ is a metric if 0 ≤ p ≤ 1. The two cases agree for p = 1and the p = 0 case is simply shorthand for the p = 1 case with theHamming metric. In the case of p = 0, the process distance d0 is Orn-stein’s d-distance and the notation is usually abbreviated to simply dto match usage in the ergodic theory literature. It merits pointing outthat d0(µXn, µYn) is not the transportation distance with a Hamming dis-tance on An, it is the transportation distance with respect to the dis-tance d(xn,yn) =

∑n−1i=0 dH(xi,yi), the sum of the Hamming distances

between symbols. This is also n times the average Hamming distanceof Example 1.4. The two metrics on An are related through the simplebounds

Page 308: Probability Random Processes And Ergodic Properties

9.4 Fidelity Criteria and Process Optimal Coupling Distortion 281

n−1∑i=0

dH(xi,yi) ≥ dH(xn,yn) ≥1n

n−1∑i=0

dH(xi,yi). (9.18)

Proof. Since the dp(µXn, µYn) = ρmin(1,1/p)n (µXn, µYn) are nonnegative

and symmetric, so is their supremum dp(µX, µY ). Lemma 9.4 shows thatfor each n, dp(µXn, µYn) satisfies the triangle inequality. Hence so doesthe supremum since

dp(µX, µY ) = supn

1ndp(µXn, µYn)

≤ supn

1n

(dp(µXn, µZn)+ dp(µZn, µYn)

)≤ sup

n

1ndp(µXn, µZn)+ sup

n

1ndp(µZn, µYn)

= dp(µX, µZ)+ dp(µZ, µY ).

Suppose that dp(µX, µY ) = 0 and hence also dp(µXn, µZn) = 0 for all n.As in the proof of Lemma 9.4, this implies that µXn = µYn for each n, andhence that the process distributions µX and µY must also be identical. 2

The next theorem collects several more properties of the dp dis-tance, including the facts that the supremum defining the process dis-tance is a limit, that the distance between IID processes reduces to theMonge/Kantorovich distance between the first order marginals, and acharacterization of the process distance as an optimization over pro-cesses.

Theorem 9.2. Suppose that µX and µY are stationary process distributionswith a common standard alphabet A and that ρ1 = dp is a positive powerof a metric on A and that ρn is defined on An in an additive fashion asbefore. Then

(a) limn→∞n−1ρn(µXn, µYn) exists and equals supn n−1ρn(µXn, µYn).Thus dp(µX, µY ) = limn→∞n−1ρmin(1,1/p)

n (µXn, µYn).(b) If µX and µY are both IID, then ρ(µX, µY ) = ρ1(µX0 , µY0) and hence

dp(µX, µY ) = ρmin(1,1/p)1 (µX0 , µY0)

(c) Let Ps = Ps(µX, µY ) denote the collection of all stationary distribu-tions πXY having µX and µY as marginals, that is, distributions onXn,Yn with coordinate processes Xn and Yn having the givendistributions. Define the process distortion measure ρ′

ρ′(µX, µY ) = infπXY∈Ps

EπXY ρ(X0, Y0).

Thenρ(µX, µY ) = ρ′(µX, µY );

Page 309: Probability Random Processes And Ergodic Properties

282 9 Process Approximation and Metrics

that is, the limit of the finite dimensional minimizations is given by aminimization over stationary processes.

(d) Suppose that µX and µY are both stationary and ergodic. DefinePe = Pe(µX, µY ) as the subset of Ps containing only ergodic processes,then

ρ(µX, µY ) = infπXY∈Pe

EπXY ρ(X0, Y0).

(e) Suppose that µX and µY are both stationary and ergodic. Let GX de-note a collection of frequency-typical sequences for µX in the sense ofSection 7.9. Recall that by measuring relative frequencies on frequency-typical sequences one can deduce the underlying stationary and er-godic measure that produced the sequence. Similarly let GY denote aset of frequency-typical sequences for µY . Define the process distortionmeasure

ρ′′(µX, µY ) = infx∈GX,y∈GY

lim supn→∞

1n

n−1∑i=0

ρ1(x0, y0).

Thenρ(µX, µY ) = ρ′′(µX, µY ).

that is, the ρ distortion gives the minimum long term time averagedistance obtainable between frequency-typical sequences from the twoprocesses.

(f) The infima defining ρn and ρ′ are actually minima.

Proof. (a) Suppose that πN is a joint distribution on (AN×AN,B(A)N×B(A)N) describing (XN, YN) that approximately achieves ρN , e.g., πNhas µXN and µYN as marginals and for ε > 0

EπNρ(XN, YN) ≤ ρN(µXN , µYN )+ ε.

For any n < N let πn be the induced distribution for (Xn, Yn) andπN−nn that for (XN−nn , YN−nn ). Then since the processes are stationaryπn ∈ Pn and πN−nn ∈ PN−n and hence

EπNρ(XN, YN) = Eπnρ(Xn, Yn)+ EπN−nnρ(XN−nn , YN−nn )

≥ ρn(µXn, µYn)+ ρN−n(µXN−n, µYN−n).

Since ε is arbitrary we have shown that if an = ρn(µXn, µYn), then forall N > n

aN ≥ an + aN−n;

that is, the sequence an is superadditive. It then follows from Lemma8.4 (since the negative of a superadditive sequence is a subadditivesequence) that

Page 310: Probability Random Processes And Ergodic Properties

9.4 Fidelity Criteria and Process Optimal Coupling Distortion 283

limn→∞

ann= supn≥1

ann,

which proves (a).(b) Suppose that π1 approximately yields ρ1, that is, has the correct

marginals andEπ1ρ1(X0, Y0) ≤ ρ1(µX0 , µY0)+ ε.

Then for any n the product measure πn defined by

πn = ×n−1i=0 π

1

has the correct vector marginals for Xn and Yn since they are IID, andhence πn ∈ Pn, and therefore

ρn(µXn, µYn) ≤ Eπnρn(Xn, Yn) = nEπ1ρ1(X0, Y0)≤ n(ρ1(µX0 , µY0)+ ε).

Since by definition ρ is the supremum of n−1ρn, this proves (c).(c) Given ε > 0 let π ∈ Ps be such that Eπρ1(X0, Y0) ≤ ρ′(µX, µY ) + ε.

The induced distribution on Xn,Yn is then contained in Pn, andhence using the stationarity of the processes

ρn(µXn, µYn) ≤ Eρn(Xn, Yn) = nEρ1(X0, Y0) ≤ nρ′(µX, µY ),

and therefore ρ′ ≥ ρ since ε is arbitrary.Let πn ∈ Pn, n = 1,2, . . . be a sequence of measures such that

Eπn[ρn(Xn, Yn)] ≤ ρn + εn

where εn → 0 as n→∞. Let qn denote the product measure on the se-quence space induced by the πn, that is, for any N and N-dimensionalrectangle F = ×i∈IFi with all but a finite number N of the Fi beingA× B,

q(F) =∏j∈Iπn(×(j+1)n−1

i=jn Fi);

that is, qn is the pair process distribution obtained by gluing togetherindependent copies of πn. This measure is obviouslyN-stationary andwe form a stationary process distribution πn by defining

πn(F) =1n

n−1∑i=0

qn(T−iF)

for all events F . If we now consider the metric space of all processdistributions on (AI × AI, B(A)I × B(A)I) of Section 9.1 with a met-ric dG based on a countable generating field of rectangles (which ex-ists since the spaces are standard), then from Lemma 9.2 the space

Page 311: Probability Random Processes And Ergodic Properties

284 9 Process Approximation and Metrics

is sequentially compact and hence there is a subsequence pnk thatconverges in the sense that there is a measure π and πnk(F) → π(F)for all F in the generating field. Assuming the generating field to con-tain all shifts of field members, the stationarity of the πn implies thatπ(T−1F) = π(F) for all rectangles F in the generating field and henceπ is stationary. A straightforward computation shows that for all nπ induces marginal distributions on Xn and Yn of µXn and µYn andhence π ∈ Ps . In addition,

Eπρ1(X0, Y0) = limk→∞

Eπnkρ1(X0, Y0)

= limk→∞

n−1k

nk−1∑i=0

Eqnkρ1(Xi, Yi)

= limk→∞

(ρnk + εnk) = ρ,

proving that ρ′ ≤ ρ and hence that they are equal.(d) If π ∈ Ps nearly achieves ρ′, then from the ergodic decomposition

it can be expressed as a mixture of stationary and ergodic processesπλ and that

Eπρ1 =∫dW(λ)Eπλρ1.

All of the πλ must have µX and µY as marginal process distributions(since they are ergodic) and at least one value of λmust yield an Eπλρ1

no greater than the preceding average. Thus the infimum over station-ary and ergodic pair processes can be no worse than that over station-ary processes. (Obviously it can be no better.)

(e) Suppose that x ∈ GX and y ∈ GY . Fix a dimension n and define foreach N and each F ∈ B(A)n, G ∈ B(A)n

µN(F ×G) =1N

N−1∑i=0

1F×G(T i(x,y)).

The set function µN extends to a measure on (An×An,B(A)n×B(A)n).Since the sequences are assumed to be frequency-typical,

µN(F ×AI) →N→∞

µXn(F),

µN(AI ×G) →N→∞

µYn(G),

for all events F and G and generating fields for B(A)n. From Lemma9.2 (as used previously) there is a convergent subsequence of the µN .If µ is such a limit point, then for generating F , G µ(F ×AI) = µXn(F)and µ(AI×G) = µYn(G), which imply that the same relation must holdfor all F and G (since the measures agree on generating events). Thus

Page 312: Probability Random Processes And Ergodic Properties

9.4 Fidelity Criteria and Process Optimal Coupling Distortion 285

µ ∈ Pn. Since µN is just a sample distribution,

EµN [ρn(Xn, Yn)] = 1

N

N−1∑j=0

ρn(xnj ,ynj ) =

1N

N−1∑j=0

[n−1∑i=0

ρ1(xi+j, yi+j)].

Thus1nEµ[ρn(Xn, Yn)] ≤ lim sup

N→∞

1N

N−1∑i=0

ρ1(xi,yi)

and hence ρ ≤ ρ′.Choose for ε > 0 a stationary measure p ∈ Ps such that

Eπρ1(X0, Y0) = ρ′ + ε.

Since π is stationary,

ρ∞(x,y) = limn→∞

1n

n−1∑i=0

ρ1(xi,yi)

exists with probability one. Furthermore, since the coordinate pro-cesses are stationary and ergodic, with probability one x and y mustbe frequency-typical. Thus

ρ∞(x,y) ≥ ρ′′.

Taking the expectation with respect to π and invoking the ergodictheorem,

ρ′′ ≤ Eπρ∞ = Eπρ1(X0, Y0) ≤ ρ′ + ε.Hence ρ′ ≤ ρ′ = ρ and hence ρ′ = ρ. (g) Consider ρ′. Choose a se-quence πn ∈ Ps such that

Eπnρ1(X0, Y0) ≤ ρ′(µX, µY )+ 1/n.

From Lemma 9.2 πn must have a convergent subsequence with limit,say, π . The measure π must be stationary and have the correctmarginal processes (since all of the πn do) and from the precedingequation

Eπρ1(X0, Y0) ≤ ρ′(µX, µY ).A similar argument works for ρn.

2

Page 313: Probability Random Processes And Ergodic Properties

286 9 Process Approximation and Metrics

9.5 The Prohorov Distance

In this section the focus is on metric distortion measures for simplicity.The section develops relations between the the Prohorov and transporta-tion distances between random vectors. The Prohorov distance measureresembles the Monge/Kantorovich distance for random variables andhas many applications in probability theory, ergodic theory, and statis-tics. (See, for example, Dudley [24], Billingsley [4] [5], Moser, et al. [72],Strassen [104], and Hample [50].) The Prohorov distance unfortunatelydoes not extend in a useful way to provide a distance measure betweenprocesses and it does not ensure the frequency-typical sequence close-ness properties that the ρ distortion possesses. We present a brief de-scription, however, for completeness and to provide the relation betweenthe two distance measures on random vectors. The development follows[81].

Analogous to the definition of the nth order ρ distance, we define forpositive integer n the nth order Prohorov distance Πn = Πn(µXn, µYn)between two distributions for random vectors Xn and Yn with a commonstandard alphabet An and with a metric ρn defined on An by

Πn = infπ∈P

infγ : π(xn,yn :1nρn(xn,yn) > γ) ≤ γ

where P is as in the definition of ρn, that is, it is the collection of all jointdistributions µXn,Yn having the given marginal distributions µXn and µYn .It is known from Strassen [104] and Dudley [24] that the infimum isactually a minimum. It can be proved in a manner similar to that for thecorresponding ρn result.

Lemma 9.6. The ρ and Prohorov distances satisfy the following inequali-ties: Πn(µXn, µYn)2 ≤ 1

nρn(µXn, µYn) ≤ ρ(µX, µY ). (9.19)

If ρ1 is bounded, say by ρmax, then

1nρn(µXn, µYn) ≤ Πn(µXn, µYn)(1+ ρmax). (9.20)

If there exists an a∗ such that

EµX0(ρ1(X0, a∗)2) ≤ ρ∗ <∞,

EµY0(ρ1(Y0, a∗)2) ≤ ρ∗ <∞, (9.21)

then

1nρn(µXn, µYn) ≤ Πn(µXn, µYn)+ 2(ρ∗Πn(µXn, µYn))1/2. (9.22)

Page 314: Probability Random Processes And Ergodic Properties

9.5 The Prohorov Distance 287

Comment: The Lemma shows that the ρ distance is stronger than the Pro-horov in the sense that making it small also makes the Prohorov distancesmall. If the underlying metric either is bounded or has a reference letterin the sense of (9.21), then the two distances provide the same topologyon nth order distributions.

Proof. From Theorem 9.2 the infimum defining the ρn distance is actu-ally a minimum. Suppose that π achieves this minimum; that is,

Eπρn(Xn, Yn) = ρn(µXn, µYn).

For simplicity we shall drop the arguments of the distance measures andjust refer to Πn and ρn. From the Markov inequality

p(xn,yn :1nρn(xn,yn) > ε) ≤

Eπ 1nρn(X

n, Yn)ε

=1nρnε.

Setting ε = (n−1ρn)1/2 then yields

p(xn,yn :1nρn(xn,yn) >

√n−1ρn) ≤

√n−1ρn,

proving the first part of the first inequality. The fact that ρ is the supre-mum of ρn completes the first inequality.

If ρ1 is bounded, let π yield Πn and write

1nρn ≤ Eπ

1nρn(Xn, Yn)

≤ Πn +π(xn,yn :1nρn(xn,yn) > Πn)ρmax

= Πn(1+ ρmax).

Next consider the case where (9.21) holds. Again let π yield Πn, andwrite, analogously to the previous equations,

1nρn ≤ Πn + ∫

Gdp(xn,yn)

1nρn(xn,yn)

whereG = xn,yn : n−1ρn(xn,yn) > Πn.

Let 1G be the indicator of G and let E denote expectation with respect toπ . Since ρn is a metric, the triangle inequality and the Cauchy-Schwartzinequality yield

Page 315: Probability Random Processes And Ergodic Properties

288 9 Process Approximation and Metrics

1nρn ≤

1nEρn(Xn, Yn)

≤ Πn + 1nE(ρn(Xn,a∗n)1G)+ E(

1nρn(Yn,a∗n)1G)

≤ Πn + [E( 1nρn(Xn,a∗n))2]1/2[E(12

G)]1/2

+[E( 1nρn(Yn,a∗n))2]1/2[E(12

G)]1/2.

Applying the Cauchy-Schwartz inequality for sums and invoking (9.21)yield

E(1nρn(Xn,a∗n))2 = E

( 1n

n−1∑i=0

ρ1(Xi, a∗))2

≤ E( 1n

n−1∑i=0

ρ1(Xi, a∗)2) ≤ ρ∗.

Thus1nρn ≤ 2(ρ∗)1/2(p(G))1/2 = Πn + 2(ρ∗)1/2Π1/2

n ,

completing the proof. 2

9.6 The Variation/Distribution Distance for DiscreteAlphabets

In the case of discrete alphabets and the average Hamming fidelity cri-terion, the resulting d0-distance is simply Ornstein’s d distance and thevariation distance provides some useful bounds and comparisons. Forconvenience we modify notation slightly and and let d = dH denote thesymbol distortion ρ1 and consider the additive fidelity criterion

ρn(xn,yn) =n−1∑i=0

d(xi,yi)

and following the ergodic theory literature we will denote dn = ρn,dn(µXn, µYn) = ρn(µXn, µYn), and d(µX, µY ) = ρ(µX, µY ). Then Lemma 9.5and (9.18) immediately yield the following lemma. (See, e.g., Neuhoff [74].)

Lemma 9.7. Given discrete-alphabet process distributions, forn = 1,2, · · ·

dn(µXn, µYn) ≥12| µXn −µYn | = dH(µXn, µYn) ≥

1ndn(µXn, µYn) (9.23)

Page 316: Probability Random Processes And Ergodic Properties

9.7 Evaluating dp 289

with equality for n = 1.

The lemma shows that for discrete alphabet random processes, the ddistance on vectors is equivalent to the variation or distribution distance,if the dimension is fixed forcing either distance to be small makes theother one small as well. The conclusion is not true in the limit, however.Having a very small d distance between processes forces d/n to be smallfor all n, but it does not force the variation distance to be small.

The lemma coupled with Theorem 9.2 provides a means of computingthe d distance between two IID processes as

d(µX, µY ) = d1(µX0 , µY0) =12| µX0 − µY0 | . (9.24)

Thus, for example, the distance between two binary Bernoulli processeswith parameters p and q is simply |p − q|.

9.7 Evaluating dp

There is a substantial literature on bounding and evaluating the vari-ous optimal coupling distances and distortions for random variables,vectors, and processes. Most of it is concerned with one-dimensionalreal valued random variables, which from Theorem 9.2 also yields theprocess metric for IID random processes with the given marginals.Lemma 9.7 solves the problem for IID discrete alphabet random pro-cesses with an additive fidelity criterion with the Hamming per symboldistance, that is, for the classic Ornstein d distance.

Another evaluation example applicable to IID processes follows froma result for one-dimensional real random variables due to Hoeffding [51]and Fréchet [25]. Suppose that X and Y are real-valued random vari-ables with distributions µX and µY and cumulative distribution func-tions (CDFs) FX(x) = µX((−∞, x]) = Pr(X ≤ x) and FY (y), respec-tively, and that the distortion is squared error. Define the inverse CDFF−1X (u) = infx:FX(x)>u x. Then

ρ(µX, µY ) =∫ 1

0| F−1X (u)− F−1

Y (u) |2 du (9.25)

d2(µX, µY ) =(∫ 1

0| F−1X (u)− F−1

Y (u) |2 du) 1

2

(9.26)

It is a standard result of elementary probability theory that if X is a real-valued random variable with CDF FX and if U is a uniformly distributedrandom variable on [0,1], then the random variable F−1

X (U) has CDFFX . Hence one coupling for µX and µY is the distribution of the pair of

Page 317: Probability Random Processes And Ergodic Properties

290 9 Process Approximation and Metrics

random variables (F−1X (U), F

−1Y (U)). Thus the right hand side of (9.25)

immediately gives an upper bound. The proof of equality can be found,e.g., in Chapter 3 of Rachev and Rüschendorf [87] or Theorem 2.18 ofVillani [110].

Gaussian Processes

Few examples of the exact evaluation of the distance between processeswith memory are known. An exception is the ρ distortion with a squared-error distortion and hence also the d2-distance between Gaussian pro-cesses, found by Gray, Neuhoff, and Shields [42]. As pointed out byTchen [106], the development holds somewhat more generally, but theGaussian case is the most important. The development here follows [42],but a few details are added and the Gaussian assumption is not madeuntil needed.

The development requires some basic spectral theory, but mostly theelementary kind sufficient for engineering developments of random pro-cesses. Suppose that Xn is a real-valued random process with pro-cess distribution µX and constant mean EµX (Xn) = 0. The assump-tion of zero mean makes things much simpler and only zero meanprocesses will be considered for the remainder of this section. Definethe autocorrelation function RX(n, k) = EµX (XnXk) for all appropriaten,k. The process Xn is said to be weakly stationary or stationaryin the wide sense or wide-sense stationary RX(n, k) depends on its ar-guments only through their difference (it is toeplitz), in which case wewrite RX(n, k) = RX(n − k). The autocorrelation is a nonnegative defi-nite, symmetric function. Wide-sense stationarity is a generalization ofstationarity, and obviously a stationary process is also weakly stationary.We consider the more general notion for the moment since it is sufficientfor the basic results we are after.

Assume that R_X(k) is sufficiently well behaved to have a Fourier transform (e.g., it is absolutely summable or square summable with the appropriate mathematical machinery added). Define the power spectral density of the process as the Fourier transform of the autocorrelation:

$$S_X(f) = \sum_{k=-\infty}^{\infty} R_X(k) e^{-j2\pi f k}, \qquad f \in \left[-\frac{1}{2},\frac{1}{2}\right]. \qquad (9.27)$$

The autocorrelation can be recovered from the power spectral density by taking the inverse Fourier transform:

$$R_X(k) = \int_{-1/2}^{1/2} S_X(f) e^{j2\pi f k}\,df. \qquad (9.28)$$


See, e.g., [37] for an engineering-oriented development or [91] for an advanced treatment of the properties and applications of autocorrelation functions and power spectral densities along with the various ideas from linear systems introduced below. Here our interest is solely in its use in finding the d2-distance between Gaussian processes and a related class of processes.
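As a quick numerical check of the transform pair (9.27)–(9.28), the following sketch is illustrative only: the first-order autoregressive model X_n = a X_{n−1} + U_n with unit-variance uncorrelated U, the truncation length, and the grid sizes are assumptions, not from the text. Its autocorrelation is R_X(k) = a^{|k|}/(1 − a²), and the forward and inverse transforms should be consistent.

```python
# Minimal sketch: compute a PSD from an autocorrelation via (9.27) and invert it via (9.28).
import numpy as np

a, K, M = 0.8, 200, 4000
k = np.arange(-K, K + 1)
R = a ** np.abs(k) / (1 - a ** 2)           # autocorrelation of the assumed AR(1) model

f = (np.arange(M) + 0.5) / M - 0.5          # midpoint grid on [-1/2, 1/2]
S = (R[None, :] * np.exp(-2j * np.pi * np.outer(f, k))).sum(axis=1).real   # (9.27)

for lag in (0, 1, 5):                       # invert via (9.28) with a Riemann sum
    R_back = (S * np.exp(2j * np.pi * f * lag)).mean().real
    print(lag, R[K + lag], R_back)          # the two values agree closely
```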

Now assume there are two processes X and Y with process distributions µ_X and µ_Y and power spectral densities S_X and S_Y, respectively. Let π denote a coupling of the two processes and define the cross correlation function

$$R_{XY}(k,n) = E_\pi(X_k Y_n).$$
Assume that the two processes are jointly weakly stationary in the sense that they are both weakly stationary and, in addition, R_{XY}(k,n) depends on its arguments only through their difference, in which case we write R_{XY}(k,n) = R_{XY}(k − n). Joint stationarity of the two processes implies joint wide-sense stationarity. Again assume that the function is sufficiently well behaved to have a Fourier transform, which we call the cross spectral density

$$S_{XY}(f) = \sum_{k=-\infty}^{\infty} R_{XY}(k) e^{-j2\pi f k}. \qquad (9.29)$$

For any such coupling π ,

$$E_\pi[(X_0 - Y_0)^2] = E_\pi[X_0^2 + Y_0^2 - 2X_0Y_0] = R_X(0) + R_Y(0) - 2R_{XY}(0) = \int_{-1/2}^{1/2}\bigl(S_X(f) + S_Y(f) - 2S_{XY}(f)\bigr)\,df.$$

An inequality of Rozanov [91] states that

|SXY (f )|2 ≤ SX(f)SY (f ). (9.30)

This implies that for any coupling π ,

$$E_\pi[(X_0 - Y_0)^2] \ge \int_{-1/2}^{1/2}\Bigl(S_X(f) + S_Y(f) - 2\sqrt{S_X(f)S_Y(f)}\Bigr)\,df = \int_{-1/2}^{1/2}\Bigl|\sqrt{S_X(f)} - \sqrt{S_Y(f)}\Bigr|^2\,df, \qquad (9.31)$$

the square of the $\ell_2$ distance between the square roots of the power spectral densities. Since this is true for all couplings, it follows that

$$d_2(\mu_X,\mu_Y) \ge \left[\int_{-1/2}^{1/2}\Bigl|\sqrt{S_X(f)} - \sqrt{S_Y(f)}\Bigr|^2\,df\right]^{1/2}, \qquad (9.32)$$


so that the d2 distance is always bounded below by the $\ell_2$ distance between the square roots of the spectra. This bound can be achieved under certain circumstances, which means that then (9.32) actually yields the d2 distance between processes.

Suppose that X_n has the following form: there is a sequence U_n with R_U(k) = 1 for k = 0 and 0 otherwise and a sequence h_n for which

$$X_n = \sum_k h_{n-k} U_k \quad \text{for all } n.$$

In words, X_n can be expressed as the output of a linear time-invariant filter with an uncorrelated input (such as an IID input). We are being intentionally casual in the mathematics here in order to avoid the details required for appropriate generality. Things are simple, for example, in the two-sided case when h_n has only a finite number of nonzero values. The sequence h_n is called the Kronecker delta response (or discrete-time impulse response) of the system. If U_n is IID, then X_n is a B-process. Since the filter is a linear time-invariant filter, such processes might reasonably be dubbed linear B-processes. A fundamental result of the theory of Gaussian random processes is that every purely nondeterministic (or regular) stationary Gaussian random process has this form, that is, can be written as a linear filtering of an IID random process, which is also Gaussian. The Fourier transform of the Kronecker delta response, H(f), is called the frequency response of the linear system, and a classic result of linear systems analysis is that S_X(f) = |H(f)|^2 S_U(f), which in our case reduces to S_X(f) = |H(f)|^2.

Suppose that both X and Y can be expressed in this way, where X is formed by passing U through a linear filter with frequency response H(f) and Y is formed by passing the same U through a linear filter with frequency response, say, G(f). Thus, in particular, S_X(f) = |H(f)|^2 and S_Y(f) = |G(f)|^2. Standard linear systems analysis shows that the coupling implied by generating X and Y in this way, separate linear filtering of a common uncorrelated source, yields a cross spectral density satisfying

$$|S_{XY}(f)| = |H(f)|\,|G(f)| = \sqrt{S_X(f)S_Y(f)}, \qquad (9.33)$$

which achieves equality in (9.30) and hence shows that for such processes

$$d_2(\mu_X,\mu_Y) = \left[\int_{-1/2}^{1/2}\Bigl|\sqrt{S_X(f)} - \sqrt{S_Y(f)}\Bigr|^2\,df\right]^{1/2}. \qquad (9.34)$$

This result was derived for stationary Gaussian random processes in [42], but the argument shows that the result holds more generally for any two weakly stationary processes with linear models driven by a common uncorrelated process. For example, this gives the optimal coupling and


the d2-distance between any two “linear B-processes” formed by linear time-invariant filterings of a common IID process, which could have Gaussian, uniform, or other marginal distributions.
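A hedged numerical sketch of (9.34): assuming two first-order autoregressive models driven by a common unit-variance uncorrelated source (the helper names psd_ar1 and d2_spectral are illustrative, not from the text), the d2 distance is the $\ell_2$ distance between the square roots of their spectra, which can be evaluated by a simple quadrature.

```python
# Minimal sketch: evaluate (9.34) for two assumed AR(1) spectra on [-1/2, 1/2].
import numpy as np

def psd_ar1(a, f):
    """PSD of X_n = a X_{n-1} + U_n with unit-variance uncorrelated input U."""
    return 1.0 / np.abs(1.0 - a * np.exp(-2j * np.pi * f)) ** 2

def d2_spectral(Sx, Sy, n=100_000):
    """L2 distance between the square roots of two spectra, as in (9.34)."""
    f = (np.arange(n) + 0.5) / n - 0.5      # midpoints of [-1/2, 1/2]
    integrand = (np.sqrt(Sx(f)) - np.sqrt(Sy(f))) ** 2
    return np.sqrt(np.mean(integrand))      # mean = integral, since the interval has length 1

print(d2_spectral(lambda f: psd_ar1(0.9, f), lambda f: psd_ar1(0.5, f)))
```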

9.8 Measures on Measures

Measures can be placed on a suitably defined space of measures. To accomplish this we consider the distributional distance for a particular G and hence a particular metric for generating the Borel field of measures. We require that (Ω,B) be standard and take G to be a standard generating field. We will denote the resulting metric simply by d. Thus P(Ω,B) will itself be Polish from Lemma 9.1. We will usually wish to assign nonzero probability only to a nice subset of this space, e.g., the subset of all stationary measures. Henceforth let P be a closed and convex subset of P(Ω,B). Since P is a closed subset of a Polish space, it is also Polish under the same metric. Let B = B(P, d) denote the Borel field of subsets of P with respect to the metric d. Since P is Polish, this measurable space (P, B) is standard.

Now consider a probability measure W on the standard space (P, B). This probability measure thus is a member of the class P(P, B) of all probability measures on the standard measurable space (P, B), where P is a closed convex subset of the collection of all probability measures on the original standard measurable space (Ω,B).

We now argue that any such W induces a new measure m ∈ P via the mixture relation

$$m = \int \alpha\,dW(\alpha). \qquad (9.35)$$

To be precise and to make sense of the preceding formula, consider for any fixed F ∈ G and any measure α ∈ P the expectation E_α 1_F. We can consider this as a function (or functional) mapping probability measures in P into R, the real line. This function is continuous, however, since given ε > 0, if F = F_i we can choose d(α,m) < ε 2^{-i} and find that

$$|E_\alpha 1_F - E_m 1_F| = |\alpha(F) - m(F)| \le 2^i \sum_{j=1}^{\infty} 2^{-j}\,|\alpha(F_j) - m(F_j)| \le \varepsilon.$$

Since the mapping is continuous, it is measurable from Lemma 2.1. Since it is also bounded, the following integrals all exist:

$$m(F) = \int E_\alpha 1_F\,dW(\alpha); \quad F \in G. \qquad (9.36)$$


This defines a set function on G that is nonnegative, normalized, and finitely additive from the usual properties of integration. Since the space (Ω,B) is standard, this set function extends to a unique probability measure on (Ω,B), which we also denote by m. This provides a mapping from any measure W ∈ P(P, B) to a unique measure m ∈ P ⊂ P(Ω,B). This mapping is denoted symbolically by (9.35) and is defined by (9.36). We will use the notation W → m to denote it. The induced measure m is called the barycenter of W.
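In the simplest discrete case the barycenter formula (9.36) is just a weighted average of probabilities. The sketch below is illustrative only; the restriction to 3-blocks, the two Bernoulli components, and the weights are assumptions, not from the text. It computes m(F) = Σ_i W(α_i) α_i(F) for a W supported on two IID measures.

```python
# Minimal sketch: barycenter of a finite mixture of (block distributions of) IID measures.
import numpy as np
from itertools import product

def bernoulli_block(p, n):
    """Distribution of an IID Bernoulli(p) process restricted to n-blocks."""
    return {x: np.prod([p if b else 1 - p for b in x]) for x in product((0, 1), repeat=n)}

# W puts mass 1/3 on Bernoulli(0.2) and 2/3 on Bernoulli(0.7)
weights = [1 / 3, 2 / 3]
components = [bernoulli_block(0.2, 3), bernoulli_block(0.7, 3)]

def barycenter_prob(F):
    """m(F) = sum_i W(alpha_i) * alpha_i(F), the discrete form of (9.36)."""
    return sum(w * sum(alpha[x] for x in F) for w, alpha in zip(weights, components))

F = {x for x in product((0, 1), repeat=3) if x[0] == 1}   # cylinder {x0 = 1}
print(barycenter_prob(F))      # (1/3)*0.2 + (2/3)*0.7 = 0.5333...
```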


Chapter 10

The Ergodic Decomposition

Abstract In previous chapters it was shown that an asymptotically mean stationary dynamical system or random process has an ergodic decomposition, which means that the sample averages imply a stationary and ergodic process distribution with the same ergodic properties. This followed because AMS systems or processes have ergodic properties — convergent sample averages — for all bounded measurements, and this in turn implied an ergodic decomposition. In this chapter an alternative form for the ergodic decomposition is derived which clarifies its nature as the random selection of a stationary and ergodic system, the ergodic decomposition is applied to Markov processes, and an ergodic decomposition for affine, lower semicontinuous functionals of random processes in terms of the distributional distance is derived.

10.1 The Ergodic Decomposition Revisited

Say we have a stationary dynamical system (Ω,B,m,T) with a standard measurable space. The ergodic decomposition then implies a mapping ψ : Ω → P((Ω,B)) defined by x → p_x as given in the ergodic decomposition theorem. In fact, this mapping is into the subset of ergodic and stationary measures. In this section we show that this mapping is measurable, relate integrals in the two spaces, and determine the barycenter of the induced measure on measures. In the process we obtain an alternative statement of the ergodic decomposition.

Lemma 10.1. Given a stationary system (Ω,B,m,T) with a standard measurable space, let P = P(Ω,B) be the space of all probability measures on (Ω,B) with the topology of Section 9.8. Then the mapping ψ : Ω → P defined by the mapping x → p_x of the ergodic decomposition is a measurable mapping. (All generic points x are mapped into the


corresponding ergodic component; all remaining points are mapped into a common arbitrary stationary ergodic process.)

Proof. To prove ψ measurable, it suffices to show that any set of the form V = {β : d(α,β) < r} has an inverse image ψ^{-1}(V) ∈ B. This inverse image is given by

$$\Bigl\{x : \sum_{i=1}^{\infty} 2^{-i}\,|p_x(F_i) - \alpha(F_i)| < r\Bigr\}.$$

Defining the function γ : Ω → R by

$$\gamma(x) = \sum_{i=1}^{\infty} 2^{-i}\,|p_x(F_i) - \alpha(F_i)|,$$

then the preceding set can also be expressed as γ^{-1}((−∞, r)). The function γ, however, is a sum of measurable real-valued functions and hence is itself measurable, and hence this set is in B.

Given the ergodic decomposition {p_x} and a stationary measure m, define the measure W = mψ^{-1} on (P, B). Given any measurable function f : P → R, from the measurability of ψ and the chain rule of integration
$$\int f(\alpha)\,dW(\alpha) = \int f(\alpha)\,dm\psi^{-1}(\alpha) = \int f(\psi(x))\,dm(x) = \int f(p_x)\,dm(x).$$

Thus the rather abstract looking integrals on the space of measures equal the desired form suggested by the ergodic decomposition, that is, one integral exists if and only if the other does, in which case they are equal. This demonstrates that the existence of certain integrals in the more abstract space implies their existence in the more concrete space. The more abstract space, however, is more convenient for developing the desired results because of its topological structure. In particular, the ergodic decomposition does not provide all possible probability measures on the original probability space, only the subset of ergodic and stationary measures. The more general space is required by some of the approximation arguments. For example, one cannot use the separability lemma of Section 9.8 to approximate arbitrary ergodic processes by point masses on ergodic processes since a mixture of distinct ergodic processes is not ergodic!

Letting f(α) = E_α 1_F = α(F) for F ∈ G we see immediately that
$$\int E_\alpha 1_F\,dm\psi^{-1}(\alpha) = \int p_x(F)\,dm(x),$$

and hence, since the right side is m(F),


$$m(F) = \int E_\alpha 1_F\,dm\psi^{-1}(\alpha);$$

that is, m is the barycenter of the measure induced on the space of all measures by the original stationary measure m and the mapping defined by the ergodic decomposition.

Since the ergodic decomposition puts probability one on stationary and ergodic measures, we should be able to cut down the measure W to these sets and hence confine interest to functionals of stationary measures. In order to do this we need only show that these sets are measurable, that is, in the Borel field of all probability measures on (Ω,B). This is accomplished next.

Lemma 10.2. Given the assumptions of the previous lemma, assume also that G is chosen to include a countable generating field and that G is chosen so that F ∈ G implies T^{-1}F ∈ G. (If this is not true, simply expand G to $\bigcup_i T^{-i}G$. For example, G could be a countable generating family of cylinders.) Let P_s and P_e denote the subsets of P consisting respectively of all stationary measures and all stationary and ergodic measures. Then P_s is closed (and hence measurable) and convex. P_e is measurable.

Proof. Let α_n ∈ P_s be a sequence of stationary measures converging to a measure α and hence for every F ∈ G, α_n(F) → α(F). Since T^{-1}F ∈ G this means that α_n(T^{-1}F) → α(T^{-1}F). Since the α_n are stationary, however, this sequence is the same as α_n(F) and hence converges to α(F). Thus α(T^{-1}F) = α(F) on a generating field and hence α is stationary. Thus P_s is a closed subset under d and hence must be in the Borel field.

From Lemma 7.15, an α ∈ Ps is also in Pe if and only if

$$\alpha \in \bigcap_{F \in G} G(F),$$

where here

$$G(F) = \Bigl\{m : \lim_{n\to\infty} n^{-1}\sum_{i=0}^{n-1} m(T^{-i}F \cap F) = m(F)^2\Bigr\} = \Bigl\{m : \lim_{n\to\infty} n^{-1}\sum_{i=0}^{n-1} E_m 1_{T^{-i}F\cap F} = (E_m 1_F)^2\Bigr\}.$$

The E_m 1_F terms are all measurable functions of m since the sets are all in the generating field. Hence the collection of m for which the limit exists and has the given value is also measurable. Since the G(F) are all in B, so is their countable intersection.

Thus we can take P to be P_s and observe that for any AMS measure, W(P_s) = m({x : p_x ∈ P_s}) = 1 from the ergodic decomposition. □


The following theorem uses the previous results to provide an alternative and intuitive statement of the ergodic decomposition theorem.

Theorem 10.1. Fix a standard measurable space (Ω,B) and a transformation T : Ω → Ω. Then there are a standard measurable space (Λ,L), a family of stationary ergodic measures {m_λ; λ ∈ Λ} on (Ω,B), and a mapping ψ : Ω → Λ such that

(a) ψ is measurable and invariant, and
(b) if m is a stationary measure on (Ω,B) and W_ψ is the induced distribution, that is, W_ψ(G) = m(ψ^{-1}(G)) for G ∈ L (which is well defined from (a)), then

$$m(F) = \int dm(x)\,m_{\psi(x)}(F) = \int dW_\psi(\lambda)\,m_\lambda(F), \quad \text{all } F \in B,$$

and if f ∈ L^1(m), then so is $\int f\,dm_\lambda$ (W_ψ-a.e.), and

$$E_m f = \int dW_\psi(\lambda)\,E_{m_\lambda} f.$$

Finally, for any event F, m_ψ(F) = m(F|ψ); that is, given the ergodic decomposition and a stationary measure m, the ergodic component m_λ is a version of the conditional probability under m given ψ = λ.

Proof. Let P denote the space of all probability measures described in Lemma 9.2 along with the Borel sets generated by using the given distance and a generating field. Then from Lemma 9.1 the space is Polish. Let Λ = P_e denote the subset of P consisting of all stationary and ergodic measures. From Lemma 10.2 this set is measurable and hence is a Borel set. Theorem 4.3 then implies that Λ together with its Borel sets is standard. Define the mapping ψ : Ω → P as in Lemma 10.1. From Lemma 10.1 the mapping is measurable. Since ψ maps points in Ω into stationary and ergodic processes, it can be considered as a mapping ψ : Ω → Λ that is also measurable since it is the restriction of a measurable mapping to a measurable set. Simply define m_λ = λ (since each λ is a stationary ergodic process). The formulas for the probabilities and expectations then follow from the ergodic decomposition theorem and change of variable formulas. The conditional probability m(F|ψ = λ) is defined by the relation

$$m(F \cap \{\psi \in G\}) = \int_G m(F|\psi = \lambda)\,dW_\psi(\lambda),$$

for all G ∈ L. Since ψ is invariant, σ(ψ) is contained in the σ-field of all invariant events. Combining the ergodic decomposition theorem and the ergodic theorem we have that E_{p_x} f = E_m(f|I), where I is the σ-field of invariant events. (This follows since both equal $\lim_{n\to\infty} n^{-1}\sum_i f T^i$.) Thus the left-hand side is


$$m(F \cap \psi^{-1}(G)) = \int_{\psi^{-1}(G)} p_x(F)\,dm(x) = \int_{\psi^{-1}(G)} m_{\psi(x)}(F)\,dm(x) = \int_G m_\lambda(F)\,dW_\psi(\lambda),$$

where the first equality follows using the conditional expectation of f = 1_F, and the second follows from a change of variables.

The theorem can be viewed as simply a reindexing of the ergodic components and a description of a stationary measure as a weighted average of the ergodic components where the weighting is a measure on the indices and not on the original points. In addition, one can view the function ψ as a special random variable defined on the system or process that tells which ergodic component is in effect and hence view the ergodic component as the conditional probability with respect to the stationary measure given the ergodic component in effect. For this reason we will occasionally refer to the ψ of Theorem 10.1 as the ergodic component function.
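The ergodic component function has a concrete reading for a mixture of IID processes: ψ is (almost everywhere) determined by limiting relative frequencies. The following sketch is illustrative only; the two Bernoulli components, the mixture weights, and the classifier estimate_component are assumptions, and the finite-sample frequency only approximates the limit that defines ψ.

```python
# Minimal sketch: a stationary mixture of two IID coins and a frequency-based
# stand-in for the ergodic component function psi.
import numpy as np

rng = np.random.default_rng(0)

def sample_path(n):
    """Draw one path of the mixture: pick an ergodic component once, then run it."""
    p = rng.choice([0.2, 0.7], p=[1 / 3, 2 / 3])   # which component is in effect
    return p, rng.random(n) < p

def estimate_component(x, candidates=(0.2, 0.7)):
    """Classify the path by its relative frequency (approximating psi)."""
    freq = x.mean()
    return min(candidates, key=lambda p: abs(p - freq))

p_true, x = sample_path(100_000)
print(p_true, x.mean(), estimate_component(x))     # relative frequency ~ p_true
```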

10.2 The Ergodic Decomposition of Markov Processes

The ergodic decomposition has some rather surprising properties for certain special kinds of processes. Suppose we have a process {X_n} with standard alphabet and distribution m. For the moment we allow the process to be either one-sided or two-sided. The process is said to be kth order Markov or k-step Markov if for all n and all F ∈ σ(X_n, ...)

$$m((X_n, X_{n+1}, \ldots) \in F \mid X_i;\, i \le n-1) = m((X_n, X_{n+1}, \ldots) \in F \mid X_i;\, i = n-k, \ldots, n-1);$$

that is, if the probability of a future event given the entire past depends only on the k most recent samples. When k is 0 the process is said to be memoryless. The following provides a test for whether or not a process is Markov.

Lemma 10.3. A process {X_n} is k-step Markov if and only if the conditional probabilities m((X_n, X_{n+1}, ...) ∈ F | X_i; i ≤ n−1) are measurable with respect to σ(X_{n−k}, ..., X_{n−1}), i.e., if for all F, m((X_n, X_{n+1}, ...) ∈ F | X_i; i ≤ n−1) depends on X_i; i ≤ n−1 only through X_{n−k}, ..., X_{n−1}.

Proof. By iterated conditional probabilities,

$$E\bigl(m((X_n, X_{n+1}, \ldots) \in F \mid X_i;\, i \le n-1) \mid X_{n-k}, \ldots, X_{n-1}\bigr) = m((X_n, X_{n+1}, \ldots) \in F \mid X_i;\, i = n-k, \ldots, n-1).$$


Since the spaces are standard, we can treat a conditional expectation as an ordinary expectation, that is, an expectation with respect to a regular conditional probability. Thus if the argument ζ = m((X_n, X_{n+1}, ...) ∈ F | X_i; i ≤ n−1) is measurable with respect to σ(X_{n−k}, ..., X_{n−1}), then from Corollary 6.4 the conditional expectation of ζ is ζ itself and hence the left-hand side equals the right-hand side, which in turn implies that the process is k-step Markov. This conclusion can also be drawn by using Lemma 6.12. □

If a k-step Markov process is stationary, then it has an ergodic decomposition. What do the ergodic components look like? In particular, must they also be Markov? First note that in the case of k = 0 the answer is trivial: If the process is memoryless and stationary, then it must be IID, and hence it is strongly mixing and hence it is ergodic. Thus the process puts all of its probability on a single ergodic component, the process itself. In other words, a stationary memoryless process produces with probability one an ergodic memoryless process. Markov processes exhibit a similar behavior: We shall show in this section that a stationary k-step Markov process puts probability one on a collection of ergodic k-step Markov processes having the same transition or conditional probabilities. The following lemma will provide an intuitive explanation.

Lemma 10.4. Suppose that {X_n} is a two-sided stationary process with distribution m. Let {m_λ; λ ∈ Λ} denote the ergodic decomposition for a standard space Λ of Theorem 10.1 and let ψ be the ergodic component function. Then the mapping ψ of Theorem 10.1 can be taken to be measurable with respect to σ(X_{-1}, X_{-2}, ...). Furthermore,

$$m((X_0, X_1, \ldots) \in F \mid X_{-1}, X_{-2}, \ldots) = m((X_0, X_1, \ldots) \in F \mid \psi, X_{-1}, X_{-2}, \ldots) = m_\psi((X_0, X_1, \ldots) \in F \mid X_{-1}, X_{-2}, \ldots),$$

m-a.e.

Comment: The lemma states that we can determine the ergodic component of a two-sided process by looking at the past alone. Since the past determines the ergodic component, the conditional probability of the future given the past and the component is the same as the conditional probability given the past alone.

Proof. The ergodic component in effect can be determined by the limiting sample averages of a generating set of events. The mapping of Theorem 10.1 produced a stationary measure ψ(x) = p_x specified by the relative frequencies of a countable number of generating sets computed from x. Since the finite-dimensional rectangles generate the complete σ-field and since the probabilities under stationary measures are not affected by shifting, it is enough to determine the probabilities of all events of the form {X_i ∈ F_i; i = 0, ..., k−1} for all one-dimensional generating sets F_i. Since the process is stationary, the shift T is invertible, and


hence for any event F of this type we can determine its probability from the limiting sample average

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=k}^{n-1} 1_F(T^{-i}x)$$

(we require in the definition of generic sets that both these time averages as well as those with respect to T converge). Since these relative frequencies determine the ergodic component and depend only on x_{-1}, x_{-2}, ..., ψ has the properties claimed (and still satisfies the construction that was used to prove Theorem 10.1; we have simply constrained the sets considered and taken averages with respect to T^{-1}). Since ψ is measurable with respect to σ(X_{-1}, X_{-2}, ...), the conditional probability of a future event given the infinite past is unchanged by also conditioning on ψ. Since conditioning on ψ specifies the ergodic component in effect, these conditional probabilities are the same as those computed using the ergodic component. (The final statement can be verified by plugging into the defining equations for conditional probabilities.) □

We are now ready to return to k-step Markov processes. If m is two-sided and stationary and k-step Markov, then from the Markov property and the previous lemma

$$m((X_0, \ldots) \in F \mid X_{-1}, \ldots, X_{-k}) = m((X_0, \ldots) \in F \mid X_{-1}, X_{-2}, \ldots) = m_\psi((X_0, \ldots) \in F \mid X_{-1}, X_{-2}, \ldots).$$

But this says that m_ψ((X_0, ...) ∈ F | X_{-1}, X_{-2}, ...) is measurable with respect to σ(X_{-1}, ..., X_{-k}) (since it is equal to a function that is), which implies from Lemma 10.3 that it is k-step Markov. We have now proved the following lemma for two-sided processes.

Lemma 10.5. If {X_n} is a stationary k-step Markov process and {m_λ; λ ∈ Λ} is the ergodic decomposition, and ψ is the ergodic component function, then

$$m_\psi((X_n, \ldots) \in F \mid X_i;\, i \le n-1) = m((X_n, \ldots) \in F \mid X_{n-k}, \ldots, X_{n-1})$$

almost everywhere. Thus with probability one the ergodic components are also k-step Markov with the same transition probabilities.

Proof. For two-sided processes the statement follows from the previous argument and the stationarity of the measures. If the process is one-sided, then we can embed the process in a two-sided process with the same behavior. Given a one-sided process m, let p be the stationary two-sided process for which p((X_k, ..., X_{k+n-1}) ∈ F) = m((X_0, ..., X_{n-1}) ∈ F) for all integers k and n-dimensional sets F. This uniquely specifies the two-sided process. Similarly, given any stationary two-sided process there is a unique one-sided process assigning the same probabilities to


finite-dimensional events. It is easily seen that the ergodic decomposition of the one-sided process yields that for the two-sided process. The one-sided process is Markov if and only if the two-sided process is, and the transition probabilities are the same. Application of the two-sided result to the induced two-sided process then implies the one-sided result. □
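Lemma 10.5 can be visualized with a finite reducible chain. The sketch below is an assumption-laden illustration, not from the text: the particular transition matrix, the mixture weight, and the eigenvector-based helper stationary are choices made for the example. The matrix has two closed classes; each class carries an ergodic Markov component with the same transition probabilities, and every stationary distribution is a mixture of the two.

```python
# Minimal sketch: ergodic decomposition of a stationary measure for a reducible Markov chain.
import numpy as np

P = np.array([[0.9, 0.1, 0.0, 0.0],     # closed class {0, 1}
              [0.4, 0.6, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],     # closed class {2, 3}
              [0.0, 0.0, 0.2, 0.8]])

def stationary(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

pi1 = stationary(P[:2, :2])             # ergodic component supported on {0, 1}
pi2 = stationary(P[2:, 2:])             # ergodic component supported on {2, 3}
lam = 0.25                              # weight of the first component
mix = np.concatenate([lam * pi1, (1 - lam) * pi2])
print(np.allclose(mix @ P, mix))        # the mixture is again stationary for P
```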

10.3 Barycenters

Barycenters have some useful linearity and continuity properties. These are the subject of this section.

To begin, recall from Section 9.8 and (9.35) that the barycentric mapping is a mapping of a measure W ∈ P(P, B), the space of probability measures on the space P, into a measure m ∈ P, that is, m is a probability measure on the original space (Ω,B). The mapping is denoted by W → m or by m = ∫ α dW(α).

First observe that if W1 →m1 and W2 →m2 and if λ ∈ (0,1), then

λW1 + (1− λ)W2 → λm1 + (1− λ)m2;

that is, the mapping W → m of a measure into its barycenter is affine. The following lemma provides a countable version of this property.

Lemma 10.6. If W_i → m_i, i = 1,2,..., and if $\sum_{i=1}^{\infty} a_i = 1$

for a nonnegative sequence ai, then

$$\sum_{i=1}^{\infty} a_i W_i \to \sum_{i=1}^{\infty} a_i m_i.$$

Comment: The given countable mixtures of probability measures are easily seen to be themselves probability measures. As an example of the construction, suppose that W is described by a probability mass function, that is, W places a probability a_i on a probability measure α_i ∈ P for each i. We can then take the W_i of the lemma as the point masses placing probability 1 on α_i. Then W_i → α_i and the mixture W of these point masses has the corresponding mixture $\sum_i a_i\alpha_i$ as barycenter.

Proof. Let W denote the weighted average of the W_i and m the weighted average of the m_i. The assumptions imply that for all i


$$m_i(F) = \int E_\alpha 1_F\,dW_i(\alpha); \quad F \in G.$$

(That E_α 1_F is measurable was discussed prior to (9.36).) Thus to prove the lemma we must prove that

$$\sum_{i=1}^{\infty} a_i \int E_\alpha 1_F\,dW_i(\alpha) = \int E_\alpha 1_F\,dW(\alpha)$$

for all F ∈ G. We have by construction that for any indicator function of an event D ∈ B that

$$\sum_{i=1}^{\infty} a_i \int 1_D(\alpha)\,dW_i(\alpha) = \int 1_D\,dW.$$

Via the usual integration arguments this implies the corresponding equality for simple functions and then for bounded nonnegative measurable functions and hence for the E_α 1_F. □

The next lemma shows that if a measure W puts probability 1 on a closed ball in P, then the barycenter must be inside the ball.

Lemma 10.7. Given a closed ball S = {α : d(α,β) ≤ r} in P((Ω,B)) and a measure W such that W(S) = 1, then W → m implies that m ∈ S.

Proof. We have

$$d(m,\beta) = \sum_{i=1}^{\infty} 2^{-i}\,|m(F_i) - \beta(F_i)| = \sum_{i=1}^{\infty} 2^{-i}\,\Bigl|\int_S \alpha(F_i)\,dW(\alpha) - \beta(F_i)\Bigr| \le \sum_{i=1}^{\infty} 2^{-i}\int_S |\alpha(F_i) - \beta(F_i)|\,dW(\alpha) = \int_S d(\alpha,\beta)\,dW(\alpha) \le r,$$

where the integral is pulled out of the sum via the dominated convergence theorem. This completes the proof. □

The final result of this section provides two continuity properties of the barycentric mapping.

Lemma 10.8. (a) Assume as earlier that the original measurable space (Ω,B) is standard. Form a metric space from the subset P of all measures on (Ω,B) using a distributional metric d = d_W with W a generating field for B. If B is the resulting Borel field, then the Borel space (P, B) is also standard. Given these definitions, the barycentric mapping W →


m is continuous when considered as a mapping from the metric space (P(P, B), d_G) to the metric space (P, d), where d_G is the metric based on the countable collection of sets G = F ∪ S, where F is a countable generating field of the standard sigma-field B of subsets of P and where S is the collection of all sets of the form

$$\Bigl\{\alpha : \alpha(F) = E_\alpha 1_F \in \Bigl[\frac{k}{n}, \frac{k+1}{n}\Bigr)\Bigr\}$$

for all F in the countable field that generates B and all n = 1,2,..., and k = 0,1,...,n−1. (These sets are all in B since the functions defining them are measurable.)

(b) Given the preceding and a nonnegative bounded measurable function D : P → R mapping probability measures into the real line, define the class H of subsets of P as all sets of the form

$$\Bigl\{\alpha : \frac{D(\alpha)}{D_{\max}} \in \Bigl[\frac{k}{n}, \frac{k+1}{n}\Bigr)\Bigr\}$$

where D_max is the maximum absolute value of D, n = 1,2,..., and k = 0, ±1, ..., ±(n−1). Let d_G denote the metric on P((P, B)) induced by the class of sets G = F ∪ H, with F and H as before. Then $\int D(\alpha)\,dW(\alpha)$ is continuous with respect to d_G; that is, if d_G(W_n, W) → 0 as n → ∞, then also

$$\lim_{n\to\infty}\int D(\alpha)\,dW_n(\alpha) = \int D(\alpha)\,dW(\alpha).$$

Comment: We are simply adding enough sets to the metric to make sure that the various integrals have the required limits.

Proof. (a) Define the uniform quantizer
$$q_n(r) = \frac{k}{n}; \quad r \in \Bigl[\frac{k}{n}, \frac{k+1}{n}\Bigr).$$

Assume that dG(WN,W)→ 0 and consider the integral

$$\begin{aligned} |m_N(F) - m(F)| &= \Bigl|\int E_\alpha 1_F\,dW_N(\alpha) - \int E_\alpha 1_F\,dW(\alpha)\Bigr| \\ &\le \Bigl|\int E_\alpha 1_F\,dW_N(\alpha) - \int q_n(E_\alpha 1_F)\,dW_N(\alpha)\Bigr| + \Bigl|\int q_n(E_\alpha 1_F)\,dW_N(\alpha) - \int q_n(E_\alpha 1_F)\,dW(\alpha)\Bigr| \\ &\quad + \Bigl|\int q_n(E_\alpha 1_F)\,dW(\alpha) - \int E_\alpha 1_F\,dW(\alpha)\Bigr|. \end{aligned}$$

The uniform quantizer provides a uniformly good approximation to within 1/n and hence given ε we can choose n so large that the leftmost and rightmost terms of the bound are each less than, say, ε/3 uniformly


in N. The center term is the difference of the integrals of a fixed simple function. Since S was included in G, taking N → ∞ forces the probabilities under W_N of these events to match the probabilities under W and hence the difference of the finite weighted combinations of these probabilities goes to zero as N → ∞. In particular, N can be chosen large enough to ensure that this term is less than ε/3 for the given n. This proves (a). Part (b) follows in essentially the same way. □

10.4 Affine Functions of Measures

We now are prepared to develop the principal result of this chapter. Let P ⊂ P((Ω,B)) denote as before a convex closed subset of the space of all probability measures on a standard measurable space (Ω,B). Let d be the metric on this space induced by a standard generating field of subsets of Ω. A real-valued function (or functional) D : P → R is said to be measurable if it is measurable with respect to the Borel field B, affine if for every α1, α2 ∈ P and λ ∈ (0,1) we have

D(λα1 + (1− λ)α2) = λD(α1)+ (1− λ)D(α2),

and upper semicontinuous if

$$d(\alpha_n,\alpha) \xrightarrow[n\to\infty]{} 0 \;\Rightarrow\; D(\alpha) \ge \limsup_{n\to\infty} D(\alpha_n).$$

If the inequality is reversed and the limit supremum replaced by a limit infimum, then D is said to be lower semicontinuous. An upper or lower semicontinuous function is measurable. We will usually assume that D is nonnegative, that is, that D(m) ≥ 0 for all m.

Theorem 10.2. If D : P → R is a nonnegative upper semicontinuous affine functional on a closed and convex set of measures on a standard space (Ω,B), if D(α) is W-integrable, and if m is the barycenter of W, then

$$D(m) = \int D(\alpha)\,dW(\alpha).$$

Comment: The import of the theorem is that if a functional of probability measures is affine, then it can be expressed as an integral. This is a sort of converse to the usual implication that a functional expressible as an integral is affine.

Proof. We begin by considering countable mixtures. Suppose that W is a measure assigning probability to a countable number of measures, e.g., W assigns probability a_i to α_i for i = 1,2,.... Since the barycenter of W is


$$m = \int \alpha\,dW(\alpha) = \sum_{i=1}^{\infty} \alpha_i a_i,$$

it follows that

$$D\Bigl(\sum_{i=1}^{\infty} \alpha_i a_i\Bigr) = D(m).$$

As in Lemma 10.6, we consider a slightly more general situation. Suppose that we have a sequence of measures W_i ∈ P((P, B)), the corresponding barycenters m_i, and a nonnegative sequence of real numbers a_i that sum to one. Let W = Σ_i a_i W_i and m = Σ_i a_i m_i. From Lemma 10.6, W → m.

Define for each n

$$b_n = \sum_{i=n+1}^{\infty} a_i$$

and the probability measures

$$m^{(n)} = b_n^{-1}\sum_{i=n+1}^{\infty} a_i m_i,$$

and hence we have a finite mixture

$$m = \sum_{i=1}^{n} m_i a_i + b_n m^{(n)}.$$

From the affine property

$$D(m) = D\Bigl(\sum_{i=1}^{n} a_i m_i + b_n m^{(n)}\Bigr) = \sum_{i=1}^{n} a_i D(m_i) + b_n D(m^{(n)}). \qquad (10.1)$$

Since D is nonnegative, this implies that

$$D(m) \ge \sum_{i=1}^{\infty} D(m_i)\,a_i. \qquad (10.2)$$

Making the W_i point masses as in the comment following Lemma 10.6 yields the fact that

$$D(m) \ge \sum_{i=1}^{\infty} D(\alpha_i)\,a_i. \qquad (10.3)$$

Observe that if D were bounded, but not necessarily nonnegative, then (10.1) would imply the countable version of the theorem, that is, that

$$D(m) = \sum_{i=1}^{\infty} D(\alpha_i)\,a_i.$$


First fix M large and define D_M(α) = min(D(α), M). Note that D_M may not be affine (it is if D is bounded and M is larger than the maximum). D_M is clearly measurable, however, and hence we can apply the metric of part (b) of the previous lemma together with Lemma 9.1 to get a sequence of measures W_n consisting of finite sums of point masses (as in Lemma 9.1) with the property that if W_n → W and m_n is the barycenter of W_n, then also m_n → m (in d) and
$$\int D_M(\alpha)\,dW_n(\alpha) \xrightarrow[n\to\infty]{} \int D_M(\alpha)\,dW(\alpha)$$

as promised in the previous lemma. From upper semicontinuity

$$D(m) \ge \limsup_{n\to\infty} D(m_n),$$

and from (10.3)

$$D(m_n) \ge \int D(\alpha)\,dW_n(\alpha).$$

Since D(α) ≥ DM(α), therefore

$$D(m) \ge \limsup_{n\to\infty}\int D_M(\alpha)\,dW_n(\alpha) = \int D_M(\alpha)\,dW(\alpha).$$

Since D_M(α) is monotonically increasing to D(α), from the monotone convergence theorem

$$D(m) \ge \int D(\alpha)\,dW(\alpha),$$

sequence of closed balls

S1/n(αi) = β : d(αi, β) ≤ 1/n; i = 1,2, , . . .

Call the collection of all such sets G = Gi, i = 1,2, . . .. We use thiscollection to construct a sequence of finite partitions Vn = Vi(n),i = 0,1,2, . . . ,Nn, n=1,2, . . . . Let Vn be the collection of all nonemptyintersection sets of the form

$$\bigcap_{i=1}^{n} G_i^{*},$$

where G_i^* is either G_i or G_i^c. For convenience we always take V_0(n) as the set $\bigcap_{i=1}^{n} G_i^c$ of all points in the space not yet included in one of the first n G_i (assuming that the set is not empty). Note that S_{1/1}(α_i) is the whole set since d is bounded above by one. Observe also that V_n has no more than 2^n atoms and that each V_n is a refinement of its predecessor.


Finally note that for any fixed α and any ε > 0 there will always be, for sufficiently large n, an atom in V_n that contains α and that has diameter less than ε. If we denote by V_n(α) the atom of V_n that contains α (and one of them must), this means that

$$\lim_{n\to\infty} \operatorname{diam}(V_n(\alpha)) = 0; \quad \text{all } \alpha \in P. \qquad (10.4)$$

For each n, define for each V ∈ V_n for which W(V) > 0 the elementary conditional probability W_V; that is, W_V(F) = W(V ∩ F)/W(V). Break up W using these elementary conditional probabilities as

$$W = \sum_{V \in V_n,\, W(V) > 0} W(V)\,W_V.$$

For each such V let m_V denote the barycenter of W_V, that is, the mixture of the measures in V with respect to the measure W_V that gives probability one to V. Since W has barycenter m and since it has the finite mixture form given earlier, Lemma 10.6 implies that

$$m = \sum_{V \in V_n,\, W(V) > 0} m_V\,W(V);$$

that is, for each n we can express W as a finite mixture of measures W_V for V ∈ V_n, and m will be the corresponding mixture of the barycenters m_V.

Thus since D is affine

$$D(m) = \sum_{V \in V_n,\, W(V) > 0} W(V)\,D(m_V).$$

Define D_n(α) as D(m_V) if α ∈ V for V ∈ V_n. The preceding equation then becomes

$$D(m) = \int D_n(\alpha)\,dW(\alpha). \qquad (10.5)$$

Recall that V_n(α) is the set in V_n that contains α. Let G denote those α for which W(V_n(α)) > 0 for all n. We have W(G) = 1 since at each stage we can remove any atoms having zero probability. The union of all such atoms is countable and hence also has zero probability.

For α ∈ G, given ε we have as in (10.4) that we can choose N large enough to ensure that for all n > N, α is in an atom V ∈ V_n that is contained in a closed ball of diameter less than ε. Since the measure W_V on V can also be considered as a measure on the closed ball containing V, Lemma 10.7 implies that the barycenter of W_V must be within the closed ball and hence also

$$d(\alpha, m_{V_n(\alpha)}) \le \varepsilon; \quad n \ge N,$$


and hence the given sequence of barycenters m_{V_n(α)} converges to α. From upper semicontinuity this implies that

$$D(\alpha) \ge \limsup_{n\to\infty} D(m_{V_n(\alpha)}) = \limsup_{n\to\infty} D_n(\alpha).$$

Since the D_n(α) are nonnegative and D(α) is W-integrable, this means that the D_n(α) are uniformly integrable. To see this, observe that the preceding equation implies that given ε > 0 there is an N such that

$$D_n(\alpha) \le D(\alpha) + \varepsilon; \quad n \ge N,$$

and hence for all n

$$D_n(\alpha) \le \sum_{i=1}^{N} D_i(\alpha) + D(\alpha) + \varepsilon,$$

which from Lemma 5.9 implies the D_n are uniformly integrable since the right-hand side is integrable by (10.5) and the integrability of D(α). Since the sequence is uniformly integrable, we can apply Lemma 5.10 and (10.5) to write
$$\int D(\alpha)\,dW(\alpha) \ge \int \Bigl(\limsup_{n\to\infty} D_n(\alpha)\Bigr)dW(\alpha) \ge \limsup_{n\to\infty}\int D_n(\alpha)\,dW(\alpha) = D(m),$$

which completes the proof of the theorem. □

10.5 The Ergodic Decomposition of Affine Functionals

Now assume that we have a standard measurable space (Ω,B) and a measurable transformation T : Ω → Ω. Define P = P_s. From Lemma 10.2 this is a closed and convex subset of P((Ω,B)), and hence Theorem 10.2 applies. Thus we have immediately the following result:

Theorem 10.3. Let (Ω,B,m,T) be a dynamical system with a standard measurable space (Ω,B) and an AMS measure m with stationary mean $\overline{m}$. Let {p_x; x ∈ Ω} denote the ergodic decomposition. Let D : P_s → R be a nonnegative functional defined for all stationary measures with the following properties:

(a) D(p_x) is m-integrable.
(b) D is affine.
(c) D is an upper semicontinuous function; that is, if P_n is a sequence of stationary measures converging to a stationary measure P in the sense that P_n(F) → P(F) for all F in a standard generating field for B, then


$$D(P) \ge \limsup_{n\to\infty} D(P_n).$$

Then

$$D(\overline{m}) = \int D(p_x)\,d\overline{m}(x).$$

Comment: The functional is required to have the given properties only for stationary measures since we only require its value in the preceding expression for the stationary mean (the barycenter) and for the stationary and ergodic components. We cannot use the previous techniques to evaluate D of the AMS process m because we have not shown that it has a representation as a mixture of ergodic components. (It is easy to show when the transformation T is invertible and the measure is dominated by its stationary mean.) We can at least partially relate such functionals for the AMS measure and its stationary mean since

$$n^{-1}\sum_{i=0}^{n-1} mT^{-i}(F) \xrightarrow[n\to\infty]{} \overline{m}(F),$$

and hence the affine and upper semicontinuity properties imply that

$$\lim_{n\to\infty} n^{-1}\sum_{i=0}^{n-1} D(mT^{-i}) \le D(\overline{m}). \qquad (10.6)$$

The preceding development is patterned after that of Jacobs [54] for the case of bounded (but not necessarily nonnegative) D, except that here the distributional distance is used and there weak convergence is used.


References

1. R. B. Ash. Real Analysis and Probability. Academic Press, New York, 1972.2. P. J. Bickel and D. A. Freedman. Some asymptotic theory for the bootstrap.

Annals of Statistics, Vol. 9, No. 1 196–1217, 1981.3. P. Billingsley. Ergodic Theory and Information. Wiley, New York, 1965.4. P. Billingsley. Convergence of Probability Measures. Wiley Series in Probability

and Mathematical Statistics. Wiley, New York, 1968.5. P. Billingsley. Weak Convergence of Measures: Applications in Probability. SIAM,

Philadelphia, 1971.6. O. J. Bjornsson. A note on the characterization of standard borel spaces. Math.

Scand., Vol. 47, 135–136, 1980.7. D. Blackwell. On a class of probability spaces. In Proc. 3rd Berkeley Symposium

on Math. Sci. and Prob., Vol. II, 1–6, Berkeley, 1956. Univ. California Press.8. N. Bourbaki. Elements de Mathematique,Livre VI,Integration. Hermann, Paris,

1956–1965.9. S. Boyd and L. Vandenberghe, Convex Optimization Cambridge University Press,

Cambridge, 2004.10. L. Breiman. Probability. Addison-Wesley, Menlo Park,Calif., 1968.11. J. C. Candy. A use of limit cycle oscillations to obtain robust analog-to-digital

converters. IEEE Trans. Comm., Vol. COM-22, 298–305, March 1974.12. J. P. R. Christensen. Topology and Borel Structure, volume Mathematics Studies

10. North-Holland/American Elsevier, New York, 1974.13. W. Chou, unpublished.14. K. L. Chung. A Course in Probability Theory. Academic Press, New York, 1974.15. D. C. Cohn. Measure Theory. Birkhauser, New York, 1980.16. H. Cramér. Mathematical Methods of Statistics. Princeton University Press, Prince-

ton,NJ, 1946.17. M. Denker, C. Grillenberger, and K. Sigmund. Ergodic Theory on Compact Spaces,

volume 57 of Lecture Notes in Mathematics. Springer-Verlag, New York, 1970.18. R.L. Dobrushin Prescribing a system of random variables by conditional distri-

butions. Theor. Prob. Appl. 15:458–486, 1970.19. R. Dobrushin. private correspondence.20. J. L. Doob. Measure Theory. Graduate Texts in Mathematics 143, Springer, New

York, 1994.21. J. L. Doob. Stochastic Processes. Wiley, New York, 1953.22. Y. N. Dowker. Finite and sigma-finite invariant measures. Ann. of Math., Vol. 54,

595–608, November 1951.23. Y. N. Dowker. On measurable transformations in finite measure spaces. Ann. of

Math., Vol. 62, 504–516, November 1955.


24. R. M. Dudley. Distances of probability measures and random variables. Ann. ofMath. Statist., Vol. 39, 1563–1572, 1968.

25. M. Fréchet. Sur la distance de deux lois de probabilité. C. R. Acad. Sci. Paris, Vol. 244, 689–692, 1957.

26. N. A. Friedman. Introduction to Ergodic Theory. Van Nostrand Reinhold Com-pany, New York, 1970.

27. R. G. Gallager. Information Theory and Reliable Communication John Wiley andSons, New York, 1968.

28. A. Garsia. A simple proof of E. Hopf's maximal ergodic theorem. J. Math. Mech., Vol. 14, 381–382, 1965.

29. C. Gini. Sulla misura della concentrazione e della variabilità dei caratteri. Atti del R. Istituto Veneto, 1913–1914, 1914.

30. R. M. Gray. Entropy and Information Theory. Springer-Verlag. http://ee.stanford.edu/~gray/it.html. (Second edition in preparation.)

31. R. M. Gray, “Sliding-block source coding,”IEEE Trans. on Info. Theory, Vol. IT-21,No. 4, 357–368, July 1975.

32. R. M. Gray, “Time-invariant trellis encoding of ergodic discrete-time sources witha fidelity criterion,” IEEE Trans. on Info. Theory, Vol. IT-23, 71–83, Jan. 1977.

33. R. M. Gray. Oversampled sigma-delta modulation. IEEE Trans. Comm., Vol. COM-35, 481–489, April 1987.

34. R. M. Gray and L. D. Davisson. Source coding without the ergodic assumption.IEEE Trans. Inform. Theory, Vol. IT-20, 502–516, 1974.

35. R. M. Gray and L. D. Davisson, “Quantizer Mismatch,”IEEE Trans. on Comm., Vol.COM–23, No. 4, 439–442, April 1975

36. R. M. Gray and L. D. Davisson. Random Processes: A Mathematical Approach forEngineers. Prentice-Hall, Englewood Cliffs,New Jersey, 1986.

37. R.M. Gray and L. D. Davisson, Introduction to Statistical Signal Processing, Cam-bridge University Press, Cambridge, UK, 2004.

38. R. M. Gray and L. D. Davisson, editors. Ergodic and Information Theory, volume 19 of Benchmark Papers in Electrical Engineering and Computer Science. Dowden, Hutchinson & Ross, Stroudsburg, PA, 1977.

39. R. M. Gray and J. C. Kieffer. Asymptotically mean stationary measures. Ann.Probab., Vol. 8, 962–973, 1980.

40. R. M. Gray, D. L. Neuhoff and D. S. Ornstein, “Nonblock source coding with afidelity criterion,”Annals of Prob., Vol. 3, No. 3, 478–491, June 1975.

41. R. M. Gray, D. L. Neuhoff and J. K. Omura, “Process definitions of distortion ratefunctions and source coding theorems,”IEEE Trans. on Info. Theory, Vol. IT-21,No. 5, 524–532, Sept. 1975.

42. R. M. Gray, D. L. Neuhoff, and P. C. Shields. A generalization of Ornstein's d-bar distance with applications to information theory. Ann. Probab., Vol. 3, 315–328, April 1975.

43. R. M. Gray and D. S. Ornstein, “Sliding-block joint source/noisy–channel codingtheorems,”IEEE Trans. on Info. Theory, Vol. IT-22, 682–690, Nov. 1976.

44. P. R. Halmos. Invariant measures. Ann. of Math., Vol. 48, 735–754, 1947.45. P. R. Halmos. Measure Theory. Van Nostrand Reinhold, New York, 1950.46. P. R. Halmos. Lectures on Ergodic Theory. Chelsea, New York, 1956.47. P. R. Halmos. Introduction to Hilbert Space. Chelsea, New York, 1957.48. P.R. Halmos and J. von Neumann. “Operator methods in classical mechanics. II”

Ann. of Math., Vol. 43, No. 2, 332–350, 1942.
49. J. M. Hammersley and D. J. A. Welsh. First-passage, percolation, subadditive processes, stochastic networks, and generalized renewal theory. Bernoulli-Bayes-Laplace Anniversary Volume, 1965.

50. F. R. Hample. A general qualitative definition of robustness. Ann. of Math.Statist., Vol. 42, 1887–1896, 1971.


51. W. Hoeffding. Maßstabinvariante Korrelationstheorie. Schriften des Mathematischen Instituts und des Instituts für Angewandte Mathematik der Universität Berlin, Vol. 5, 181–233, 1940.

52. E. Hopf. Ergodentheorie. Springer-Verlag, Berlin, 1937.
53. H. Inose and Y. Yasuda. A unity bit coding method by negative feedback. Proc.

IEEE, Vol. 51, 1524–1535, November 1963.54. K. Jacobs. The ergodic decomposition of the Kolmogorov-Sinai invariant. In

F. B. Wright, editor, Ergodic Theory. Academic Press, New York, 1963.

55. R. Jones. New proof for the maximal ergodic theorem and the hardy-littlewoodmaximal inequality. Proc. Amer. Math. Soc., Vol. 87, 681–684, 1983.

56. L. V. Kantorovich. "On one effective method of solving certain classes of extremal problems," Dokl. Akad. Nauk, Vol. 28, 212–215, 1940.

57. L.V. Kantorovich, “On the transfer of masses,” Uspekhi Mat. Nauk, Vol. 37, 7–8,1942.

58. L.V. Kantorovich, “On a problem of Monge,” Dokl. Akad. Nauk, Vol. 3, 225–226,1948.

59. I. Katznelson and B. Weiss. A simple proof of some ergodic theorems. IsraelJournal of Mathematics, Vol. 42, 291–296, 1982.

60. J. F. C. Kingman. The ergodic theory of subadditive stochastic processes. Ann.Probab., Vol. 1, 883–909, 1973.

61. A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea, New York,1950.

62. U. Krengel. Ergodic Theorems. De Gruyter Series in Mathematics. De Gruyter,New York, 1985.

63. N. Kryloff and N. Bogoliouboff. La théorie générale de la mesure dans son appli-cation à l’étude des systèmes de la mecanique non linéaire. Ann. of Math., Vol.38, 65–113, 1937.

64. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions,andreversals. Sov. Phys.-Dokl., Vol. 10, 707–710, 1966.

65. E. Levina and P. Bickel. The earth mover’s distance is the Mallows distance: someinsights from statistics. Proceedings Eighth IEEE International Conference on InComputer Vision (ICCV 2001), Vol.2, 251–256, 2001.

66. M. Loeve. Probability Theory. D. Van Nostrand, Princeton,New Jersey, 1963.Third Edition.

67. D. G. Luenberger. Optimization by Vector Space Methods. Wiley, New York, 1969.68. G. Mackey. Borel structures in groups and their duals. Trans. Amer. Math. Soc.,

Vol. 85, 134–165, 1957.69. L. E. Maistrov. Probability Theory: A Historical Sketch. Academic Press, New York,

1974. Translated by S. Kotz.
70. C. L. Mallows. A note on asymptotic joint normality. Ann. Math. Stat., Vol. 43, 508–515, 1972.
71. G. Monge. Mémoire sur la théorie des déblais et des remblais, 1781.
72. J. Moser, E. Phillips, and S. Varadhan. Ergodic Theory: A Seminar. Courant Insti-

tute of Math. Sciences, New York, 1975.73. J. Nedoma. On the ergodicity and r-ergodicity of stationary probability measures.

Z. Wahrsch. Verw. Gebiete, Vol. 2, 90–97, 1963.74. David Lee Neuhoff Source coding and distance measures on random processes

PhD dissertation, Stanford University August 197475. D.L. Neuhoff, R. M. Gray and L. D. Davisson. Fixed rate universal block source

coding with a fidelity criterion. IEEE Trans. on Info. Theory, Vol. IT-21, No. 5,511–523, Sept. 1975.

76. D. S. Ornstein, Bernoulli shifts with the same entropy are isomorphic, Advancesin Math., Vol. 4, 337–352, 1970.


77. D. Ornstein. An application of ergodic theory to probability theory. Ann. Probab.,Vol. 1, 43–58, 1973.

78. D. Ornstein. Ergodic Theory,Randomness,and Dynamical Systems. Yale Univer-sity Press, New Haven, 1975.

79. D. Ornstein and B. Weiss. The Shannon-McMillan-Breiman theorem for a class ofamenable groups. Israel J. of Math, Vol. 44, 53–60, 1983.

80. J. C. Oxtoby. Ergodic sets. Bull. Amer. Math. Soc., Vol. 58, 116–136, 1952.81. P. Papantoni-Kazakos and R. M. Gray. Robustness of estimators on stationary

observations. Ann. Probab., Vol. 7, 989–1002, Dec. 1979.82. K. R. Parthasarathy. Probability Measures on Metric Spaces. Academic Press, New

York, 1967.83. I. Paz. Stochastic Automata Theory. Academic Press, New York, 1971.84. K. Petersen. Ergodic Theory. Cambridge University Press, Cambridge, 1983.85. H. Poincaré. Les méthodes nouvelles de la mécanique céleste, volume I,II,III.

Gauthiers-Villars, Paris, 1892,1893,1899. Also Dover,New York,1957.86. S.T. Rachev Probability metrics and the stability of stochastic models. John Wiley

& Sons Ltd., Chichester, 1991.87. S.T. Rachev and L. Rüschendorf Mass Transportation Problems Vol. I: Theory,

Vol. II: Applications. Probability and its applications. Springer- Verlag, New York,1998.

88. O. W. Rechard. Invariant measures for many-one transformations. Duke J. Math.,Vol. 23, 477–488, 1956.

89. V. A. Rohlin. Selected topics from the metric theory of dynamical systems. Us-pechi Mat. Nauk., Vol. 4, 57–120, 1949. English translation: Amer. Math. Soc.Transl. (2) Vol. 49, 171–240, 1966.

90. V. A. Rohlin. On the fundamental ideas of measure theory. Mat. Sb., Vol. 25, 107–150, 1949. English translation: Amer. Math. Soc. Transl.

91. Y. Rozanov. Stationary Random Processes. Holden Day, San Francisco, 1967.92. Y. Rubner, C. Tomasi, and L.J. Guibas. A metric for distributions with applica-

tions to image databases. In Proceedings of the IEEE International Conference onComputer Vision, 59-66. Bombay, India, Jan. 1998.

93. W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, New York, 1964.94. L. Schwartz. Radon Measures on Arbitrary Topological Spaces and Cylindrical

Measures. Oxford University Press, Oxford, 1973.95. C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., Vol.

27, 379–423, 623–656, 1948.96. C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. In

IRE National Convention Record,Part 4, 142–163, 1959.97. P. C. Shields. The Theory of Bernoulli Shifts. The University of Chicago Press,

Chicago,Ill., 1973.98. P. C. Shields. The ergodic and entropy theorems revisited. IEEE Trans. Inform.

Theory, Vol. IT-33, 263–266, 1987.99. P. C. Shields. The Ergodic Theory of Discrete Sample Paths. Graduate Studies in

Mathematics, Vol. 13. American Mathematical Society, Providence, Rhode Island,1991.

100. G. F. Simmons. Introduction to Topology and Modern Analysis. McGraw-Hill,New York, 1963.

101. Ya. G. Sinai. Introduction to Ergodic Theory. Mathematical Notes,PrincetonUniversity Press, Princeton, 1976.

102. M. Smorodinsky. A partition on a Bernoulli shift which is not weak BernoulliMath. Systems Theory, Vol. 5, No. 3, 201–203, 1971.

103. M. Smorodinsky. Ergodic Theory, Entropy in Lecture Notes in Mathematics, Vol.214 Springer-Verlag, New York, 1971.


104. V. Strassen. The existence of probability measures with given marginals. Ann.of Math. Statist., Vol. 36, 423–429, 1965.

105. E. Tanaka and T. Kasai. Synchronization and substitution error correcting codes for the Levenshtein metric. IEEE Trans. Inform. Theory, Vol. IT-22, 156–162, 1976.

106. A. H. Tchen. Inequalities for distributions with given marginals. The Annals of Probability, Vol. 8, No. 4, 814–827, August 1980.

107. S. S. Vallender. Computing the Wasserstein distance between probability dis-tributions on the line. Theory Probab. Appl., Vol. 18, 824–827, 1973.

108. L. N. Vasershtein. Markov processes on countable product space describinglarge systems of automata. Problemy Peredachi Informatsii, Vol. 5, 64–73, 1969.

109. A. M. Vershik. The Kantorovich metric: the initial history and little-known applications. Zap. Nauchn. Sem. S.-Peterburg. Otdel. Mat. Inst. Steklov. (POMI) 312, Teor. Predst. Din. Sist. Komb. i Algoritm. Metody. 11 (2004), 69–85, 311.

110. C. Villani. Topics in Optimal Transportation. Vol. 58 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2003.

111. C. Villani. Optimal Transport, Old and New. Springer, 2009.
112. J. von Neumann. Zur operatorenmethode in der klassischen mechanik. Ann. of

Math., Vol. 3, 587–642, 1932.113. P. Walters. Ergodic Theory-Introductory Lectures. Lecture Notes in Mathematics

No. 458. Springer-Verlag, New York, 1975.114. F. B. Wright. The recurrence theorem. Amer. Math. Monthly, Vol. 68, 247–248,

1961.


Index

Fσ set, 15Gδ, 15Gδ set, 15Lp norm, 144N-ergodic, 230Z, 1, 41ZM , 1R, 1, 41σ -field, 10, 158

countably generated, 13remote, 212

σ -fieldsproduct, 47

Z+, 1, 41R+, 1, 410-1 law, 212

absolutely continuous, 170, 172addition, 4, 41additive, 20

finitely, 20affine, 302, 305almost everywhere, 128, 142almost everywhere ergodic theorem,

239almost surely, 142alphabet, xx, xxiv, 36, 115AMS, 205

block, 249analytic space, xxvarithmetic mean, 140asymptotically dominates, 210asymptotic mean expectation, 221asymptotic mean stationary, xxviiiasymptotically stationary, 154asymptotically generate, 66asymptotically generates, 64, 66

asymptotically mean stationary, 154,205

atom, 63atoms, 24autocorrelation, 140autocorrelation function, 290average, xxviiaverage Hamming distance, 3

B-process, viiiB-processes, 80bad atoms, 70ball

open, 7Banach

space, 17Banach space, 17barycenter, 294, 302base, 8basis, 66Bernoulli process, 78, 80Bernoulli shift, 80binary indexed, 64binary sequence function, 65binary sequence space, 73Birkhoff ergodic theorem, 239Birkhoff, George, 239block AMS, 249block ergodic, 230block stationary, 206Borel field, 14Borel measurable space, 14Borel measure, 82Borel measure on [0,1), 90Borel measure on [0,1)k, 110Borel measure space, 90Borel sets, 14


Borel space, xxv, xxvi, 14Borel-Cantelli lemma, 145bounded ergodic properties, 200

Cesàro mean, 140canonical binary sequence function, 65Cantor’s intersection theorem, 17Carathéodory, 27, 32, 62, 76, 78, 266Cartesian product, 6Cartesian product space, 46carving, 100cascade mappings, 36Cauchy sequence, 17Cauchy-Schwarz inequality, 131chain rule, 138change of variables, 137Chou, Wu, 260circular distance, 3class

separating, 68closure, 9code, 56

stationary, 55coding, 55coin flipping, 80coin flips, 91complete metric space, 17completion, 25, 26composite mappings, 36compressed, 217concave, 126conditional expectation, 166conservative, 217consistent, 77continuity at ∅, 20continuity from above, 20continuity from below, 20continuous, 267continuous function, 37continuous time random process, 41continuous-alphabet, 110converge

in mean square, 144in quadratic mean, 144

converge weakly, 268convergence, 8, 142L1(m), 143L2(m), 144with probability one, 142

convex, 126convex function, 125coordinate function, 140cost function, 3cost functions, xxx

countable extension property, 62, 76countable subadditivity, 27countably generated, 13, 62coupling, 270covariance function, 175cover, 27convergence

in probability, 143cylinders

thin, 73

de Morgan’s law, 11dense, 16diagonal sequence, 9diagonalization, 9diameter, 9digital measurement, 115directly given random process, 52distance, 2

circular, 3distribution, 40, 275distributional, 263edit, 4Hamming, 3Levenshtein, 4Prohorov, 286total variation , 40variation, 40, 275

distortion measure, 3, 269, 270distortion measures, xxxdistribution, 36, 51

marginal, 50distribution distance, 275distribution, process, 51distributional distance, 263dominated convergence theorem, 135dynamical system

recurrent, 216dynamical system, discrete-time, 42, 45

edit distance, 4elementary events, 1elementary outcomes, 1empirical probability, xxiiiensemble average, 116equivalent, 173equivalent metrics, 8equivalent random processes, 52ergodic, xxviii, 227

totally, 251ergodic component, 299ergodic decomposition, 232, 234, 252,

295, 309ergodic properties, xxiii, 195, 226


    bounded, 200
ergodic property, 196
ergodic theorem, xxvii
    subadditive, 253
ergodic theorem, subadditive, 278
ergodic theorems, xxiii
ergodic theory, 42
ergodicity, 226
Euclidean space, 4, 19
event
    elementary, 1
    recurrent, 216
event space, xx, 10
events
    independent, 188
    output, 36
expectation, 116
    conditional, 166
extended Fatou’s lemma, 134
extended monotone convergence theorem, 134
extension of probability measures, 61
extension theorem, 77

factor, 56
Fatou’s lemma, 134, 148
fidelity criteria, xxx
fidelity criterion, 277
fidelity criterion, additive, 277
fidelity criterion, single-letter, 277
fidelity criterion, subadditive, 277
field, 11
    standard, 66, 76
fields
    increasing, 64
    product, 47
filtering, 55
finite additivity, 20
finite intersection property, 66
finitely additive, 20
flow, 45
frequency-typical points, 233
frequency-typical sequences, 252
function, 2
    continuous, 37
    tail, 213
future tail sub-σ-field, 212

Gaussian processes, 290
Gaussian random vector, 175
generated σ-field, 12
generated field, 13
generic points, 233
generic sequences, 252
good sets principle, 50, 62, 70

Hölder’s inequality, 131
Hamming distance, 3
    average, 3
Hilbert space, 17, 166, 167

i.o., 208
IID, 80, 110, 245
incompressible, 217
increasing fields, 64
independent and identically distributed, 80, 245
independent events, 188
independent measurements, 188
independent random variables, 188
independent, identically distributed, 110
indicator function, 38, 65
inequality
    Jensen’s, 125
    Young’s, 127
infinitely recurrent, 218
information source, xxiv
inner product, 144
inner product space, 6, 144, 166
integer sequence space, 73, 96, 99
integrable, 127
integral, 127
integrating to the limit, 134
intersection sets, 63
invariant, 152, 207
    totally, 209
invariant function, 197
invariant measure, 197
invertible, 43
isomorphic
    dynamical systems, 58
    measurable spaces, 57
    probability spaces, 57
isomorphic, metrically, 57
isomorphism, 56, 57
isomorphism mod 0, 57, 91
iterated expectation, 184

Jensen’s inequality, 125
join, 116
joining, 270

K-automorphism, 247
K-process, 212
kernel, 102
Kolmogorov extension theorem, 62, 77
Kolmogorov process, 212
Kolmogorov representation, 53
Kolmogorov, A.N., xxiv
Kronecker δ function, 3

l.i.m., 144
lag, 140
law of large numbers, xxiii, xxvii
Lebesgue decomposition theorem, 172
Lebesgue measure, 82
    on the real line, 93
Lebesgue measure on [0,1)^k, 110
Lebesgue measure space, 90
Lebesgue space, 90, 91
Lee metric, 3
left shift, 43
left tail sub-σ-field, 212
Levenshtein distance, 4
limit in the mean, 144
limit inferior, 120
limit infimum, 120, 121
limit point, 9
limit superior, 120
limit supremum, 120
limiting time average, 141
lower limit, 120
Lusin space, xxv

marginal distribution, 50
Markov
    k-step, 299
    kth order, 299
Markov chain, 190, 191, 273
Markov inequality, 131
Markov property, 190
martingale, 189
match, 4
maximal ergodic lemma or theorem, xxix
mean, 116
mean function, 175, 290
mean vector, 175
measurability, 42
measurable, 29
measurable function, 35, 157, 158
measurable scheme, 102
measurable space, 10, 157
    standard, 66, 76
measure, 20
    Borel, 82
    Lebesgue, 82
measure space, 20
    finite, 20
    sigma-finite, 20
measurement, 115
    digital, 115
    discrete, 115
measurements
    independent, 188
metric, 2
    Hamming, 3
    Lee, 3
metric space, 2
    complete, separable, 97
metrically isomorphic, 57
metrically transitive, 227
metrics
    equivalent, 8
Minkowski sum inequality, 7
Minkowski’s inequality, 6, 131, 278
mixing, 245, 246
mixture, xxxii, 24, 232
monotone class, 12, 13, 31
monotone convergence theorem, 134
monotonicity, 27
multiplication, 4

nested expectation, 184
nonsingular, 218
norm, 144
normed linear space, 4, 5
normed linear vector space, 4
null preserving, 218
null set, 25

one-sided, 2
one-sided random process, 41
open ball, 7
open set, 7
open sphere, 7
orbit, 42, 45
orthogonal, 168
outer measure, 28
output events, 36

parallelogram law, 167
partition, 24, 63, 116
partition distance, 24
past tail sub-σ-field, 212
PDF, 174
perpendicular complement, 168
point, 2
points
    frequency-typical, 233
    generic, 233
pointwise ergodic theorem, 239
Polish Borel space, 19
Polish schemes, 101
Polish space, 19, 147, 270
power set, 11, 58, 73
power spectral density, 290
pre-Hilbert space, 6
probabilistic average, 116
probability density function, 174
probability measure, xx
probability space, 19
    associated, 53
    discrete, 21
probability space, complete, 25
process
    asymptotically mean stationary, 206
process distance, viii
process distribution, xx, 51
product
    Cartesian, 6
product σ-field, 47
product distribution, 190
product field, 47
product space, 2, 6, 46
Prohorov distance, 286
projection, 166, 169
projection theorem, 166, 168
pseudometric, 2
pseudometric space associated with the probability space, 23

quantization, 119
quantizer, 43, 105, 119

Radon space, xxv
Radon-Nikodym derivative, 170
Radon-Nikodym theorem, 166, 169
random object, 35
random process, xx, 41
    two-sided, 41
    continuous parameter, 41
    continuous time, 41
    discrete parameter, 41
    discrete time, 41
    bilateral, 41
    coin flipping, 80
    directly given, 52
    double-sided, 41
    IID, 80
    one-sided, 41
    roulette, 80
    single-sided, 41
    unilateral, 41
random processes
    equivalent, 52
random variable, 35, 36
    discrete, 36, 115
random variables
    independent, 188
random vector, 41
rectangle, 47
recurrence, 215
recurrent, 215
recurrent dynamical system, 216
recurrent event, 215, 216
reference letter, 273
regular conditional probability, xxviii, 157
relative frequency, xxiii
remote σ-field, 212
restriction, 162
right tail sub-σ-field, 212
roulette, 80

sample average, 139–141
sample function, 41
sample mean, 140
sample points, 1
sample sequence, 41
sample space, 1, 10
sample waveform, 41
sampling, 45
scalar, 4
scheme, 101
    measurable, 102
schemes, 101
semi-flow, 45
seminorm, 5
separable, 16
separating class, 68
sequence space
    binary, 73
    integer, 73
sequentially compact, 8
set
    Fσ, 15
    Gδ, 15
    open, 7
shift, 43, 139
shift transformation, 43
shift-invariant, 55
sigma-field, 10
sigma-finite, 20
simple function, 115
singular, 172
size, 4
smallest σ-field, 12
space
    Euclidean, 4
    Hilbert, 17
    inner product, 6, 144
    metric, 2
    normed linear, 4
    normed linear vector, 4
    pre-Hilbert, 6
    standard, 63
    standard measurable, 76
sphere
    open, 7
standard diagonalization, 9
standard field, 66, 76
standard measurable space, 66, 76
standard probability space, 137
standard space, 63
standard spaces, xxv
state, 43
stationary, 55, 152
    block, 206
    weakly, 290
    wide-sense, 290
stationary function, 197
stationary mean, xxviii, 205
stochastic process, xxiv
strongly mixing, 245, 246
sub-σ-field, 158
subadditive, 126, 254
subadditive ergodic theorem, 253
subadditivity, 27
submartingale, 189
subspace, 70
sums, 128
superadditive, 282
supermartingale, 189
Suslin space, xxv

tail σ-field, 212
tail function, 213
tail sub-σ-field
    future, 212
    past, 212
    right, 212
Tchebychev’s inequality, 131
term-by-term, 136
theorem, process metric, 280
thin cylinders, 73
time average, 139–141
time invariant, 197
time series, xxiv
time-average mean, 140
Toeplitz, 290
topology, 8
total variation distance, 40
totally ergodic, 230, 251
totally invariant, 209
transport, 272
transportation, 272
trivial space, 11

uniformly integrable, 133
upper limit, 120

variation distance, 40, 275
version of the conditional expectation, 184
Vitali-Hahn-Saks theorem, 202

wandering set, 216
weak convergence, 268
weakly mixing, 245, 246
weakly stationary, 290

Young’s inequality, 127

zero-time mapping, 43