
Springer Texts in Statistics

Advisors: George Casella, Stephen Fienberg, Ingram Olkin


Alfred: Elements of Statistics for the Life and Social Sciences
Athreya and Lahiri: Measure Theory and Probability Theory
Berger: An Introduction to Probability and Stochastic Processes
Bilodeau and Brenner: Theory of Multivariate Statistics
Blom: Probability and Statistics: Theory and Applications
Brockwell and Davis: Introduction to Time Series and Forecasting, Second Edition
Carmona: Statistical Analysis of Financial Data in S-Plus
Chow and Teicher: Probability Theory: Independence, Interchangeability, Martingales, Third Edition
Christensen: Advanced Linear Modeling: Multivariate, Time Series, and Spatial Data—Nonparametric Regression and Response Surface Maximization, Second Edition
Christensen: Log-Linear Models and Logistic Regression, Second Edition
Christensen: Plane Answers to Complex Questions: The Theory of Linear Models, Third Edition
Creighton: A First Course in Probability Models and Statistical Inference
Davis: Statistical Methods for the Analysis of Repeated Measurements
Dean and Voss: Design and Analysis of Experiments
du Toit, Steyn, and Stumpf: Graphical Exploratory Data Analysis
Durrett: Essentials of Stochastic Processes
Edwards: Introduction to Graphical Modelling, Second Edition
Finkelstein and Levin: Statistics for Lawyers
Flury: A First Course in Multivariate Statistics
Ghosh, Delampady and Samanta: An Introduction to Bayesian Analysis: Theory and Methods
Gut: Probability: A Graduate Course
Heiberger and Holland: Statistical Analysis and Data Display: An Intermediate Course with Examples in S-PLUS, R, and SAS
Jobson: Applied Multivariate Data Analysis, Volume I: Regression and Experimental Design
Jobson: Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods
Kalbfleisch: Probability and Statistical Inference, Volume I: Probability, Second Edition
Kalbfleisch: Probability and Statistical Inference, Volume II: Statistical Inference, Second Edition
Karr: Probability
Keyfitz: Applied Mathematical Demography, Second Edition
Kiefer: Introduction to Statistical Inference
Kokoska and Nevison: Statistical Tables and Formulae
Kulkarni: Modeling, Analysis, Design, and Control of Stochastic Systems
Lange: Applied Probability
Lange: Optimization
Lehmann: Elements of Large-Sample Theory

(continued after index)


Krishna B. Athreya
Soumendra N. Lahiri

Measure Theory and Probability Theory


Krishna B. Athreya
Department of Mathematics and Department of Statistics
Iowa State University
Ames, IA 50011
[email protected]

Soumendra N. Lahiri
Department of Statistics
Iowa State University
Ames, IA 50011
[email protected]

Editorial Board

George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA

Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA

Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA

Library of Congress Control Number: 2006922767

ISBN-10: 0-387-32903-X    e-ISBN: 0-387-35434-4
ISBN-13: 978-0-387-32903-1

Printed on acid-free paper.

©2006 Springer Science+Business Media, LLC

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America. (MVY)

9 8 7 6 5 4 3 2 1

springer.com


Dedicated to our wives
Krishna S. Athreya and Pubali Banerjee
and
to the memory of
Uma Mani Athreya and Narayani Ammal


Preface

This book arose out of two graduate courses that the authors have taught during the past several years: the first on measure theory, followed by the second on advanced probability theory.

The traditional approach to a first course in measure theory, such as in Royden (1988), is to teach the Lebesgue measure on the real line, then the differentiation theorems of Lebesgue and Lp-spaces on R, and to do general measure theory at the end of the course, with one main application to the construction of product measures. This approach does have the pedagogic advantage of seeing one concrete case first before going to the general one. But it also has the disadvantage of making many students’ perspective on measure theory somewhat narrow. It leads them to think only in terms of the Lebesgue measure on the real line and to believe that measure theory is intimately tied to the topology of the real line. As students of statistics, probability, physics, engineering, economics, and biology know very well, there are mass distributions that are typically nonuniform, and hence it is useful to gain a general perspective.

This book attempts to provide that general perspective right from the beginning. The opening chapter gives an informal introduction to measure and integration theory. It shows that the notions of σ-algebra of sets and countable additivity of a set function are dictated by certain very natural approximation procedures from practical applications and that they are not just some abstract ideas. Next, the general extension theorem of Caratheodory is presented in Chapter 1. As immediate examples, the construction of the large class of Lebesgue-Stieltjes measures on the real line and Euclidean spaces is discussed, as are measures on finite and countable spaces. Concrete examples such as the classical Lebesgue measure and various probability distributions on the real line are provided. This is further developed in Chapter 6, leading to the construction of measures on sequence spaces (i.e., sequences of random variables) via Kolmogorov’s consistency theorem.

After providing a fairly comprehensive treatment of measure and integration theory in the first part (Introduction and Chapters 1–5), the focus moves on to probability theory in the second part (Chapters 6–13). The feature that distinguishes probability theory from measure theory, namely, the notion of independence and dependence of random variables (i.e., measurable functions), is carefully developed first. Then the laws of large numbers are taken up. This is followed by convergence in distribution and the central limit theorems. Next the notion of conditional expectation and probability is developed, followed by discrete parameter martingales. Although the development of these topics is based on a rigorous measure theoretic foundation, the heuristic and intuitive backgrounds of the results are emphasized throughout. Along the way, some applications of the results from probability theory to proving classical results in analysis are given. These include, for example, the density of normal numbers on (0,1) and the Weierstrass approximation theorem. These are intended to emphasize the benefits of studying both areas in a rigorous and combined fashion. The approach to conditional expectation is via the mean square approximation of the “unknown” given the “known” and then a careful approximation for the L1-case. This is a natural and intuitive approach and is preferred over the “black box” approach based on the Radon-Nikodym theorem.

The final part of the book provides a basic outline of a number of special topics. These include Markov chains including Markov chain Monte Carlo (MCMC), Poisson processes, Brownian motion, bootstrap theory, mixing processes, and branching processes. The first two parts can be used for a two-semester sequence, and the last part could serve as a starting point for a seminar course on special topics.

This book presents the basic material on measure and integration theory and probability theory in a self-contained and step-by-step manner. It is hoped that students will find it accessible, informative, and useful, and also that they will be motivated to master the details by carefully working out the text material as well as the large number of exercises. The authors hope that the presentation here is found to be clear and comprehensive without being intimidating.

Here is a quick summary of the various chapters of the book. After giving an informal introduction to the ideas of measure and integration theory, the construction of measures starting with set functions on a small class of sets is taken up in Chapter 1, where the Caratheodory extension theorem is proved and then applied to construct Lebesgue-Stieltjes measures. Integration theory is taken up in Chapter 2, where all the basic convergence theorems, including the MCT, Fatou, DCT, BCT, Egorov’s, and Scheffe’s, are proved. Included here are also the notion of uniform integrability and the classical approximation theorem of Lusin and its use in Lp-approximation by smooth functions. The third chapter presents basic inequalities for Lp-spaces, the Riesz-Fischer theorem, and elementary theory of Banach and Hilbert spaces. Chapter 4 deals with Radon-Nikodym theory via the Riesz representation on L2-spaces and its application to differentiation theorems on the real line as well as to signed measures. Chapter 5 deals with product measures and the Fubini-Tonelli theorems. Two constructions of the product measure are presented: one using the extension theorem and another via iterated integrals. This is followed by a discussion on convolutions, Laplace transforms, Fourier series, and Fourier transforms. Kolmogorov’s consistency theorem for the construction of stochastic processes is taken up in Chapter 6, followed by the notion of independence in Chapter 7. The laws of large numbers are presented in a unified manner in Chapter 8, where the classical Kolmogorov strong law as well as Etemadi’s strong law are presented, followed by Marcinkiewicz-Zygmund laws. There are also sections on renewal theory and ergodic theorems. The notion of weak convergence of probability measures on R is taken up in Chapter 9, and Chapter 10 introduces characteristic functions (the Fourier transform of probability measures), the inversion formula, and the Levy-Cramer continuity theorem. Chapter 11 is devoted to the central limit theorem and its extensions to stable and infinitely divisible laws. Chapter 12 discusses conditional expectation and probability, where an L2-approach followed by an approximation to L1 is presented. Discrete time martingales are introduced in Chapter 13, where the basic inequalities as well as convergence results are developed. Some applications to random walks are indicated as well. Chapter 14 discusses discrete time Markov chains with a discrete state space first. This is followed by discrete time Markov chains with general state spaces, where the regeneration approach for Harris chains is carefully explained and is used to derive the basic limit theorems via the iid cycles approach. There are also discussions of Feller Markov chains on Polish spaces and Markov chain Monte Carlo methods. An elementary treatment of Brownian motion is presented in Chapter 15, along with a treatment of continuous time jump Markov chains. Chapters 16–18 provide brief outlines, respectively, of bootstrap theory, mixing processes, and branching processes. There is an Appendix that reviews basic material on elementary set theory, real and complex numbers, and metric spaces.

Here are some suggestions on how to use the book.

1. For a one-semester course on real analysis (i.e., measure and integration theory), material up to Chapter 5 and the Appendix should provide adequate coverage, with Chapter 6 being optional.

2. A one-semester course on advanced probability theory for those with the necessary measure theory background could be based on Chapters 6–13, with a selection of topics from Chapters 14–18.


3. A one-semester course on a combined treatment of measure theory and probability theory could be built around Chapters 1, 2, Sections 3.1–3.2 of Chapter 3, all of Chapter 4 (Section 4.2 optional), Sections 5.1 and 5.2 of Chapter 5, Chapters 6, 7, and Sections 8.1, 8.2, 8.3 (Sections 8.5 and 8.6 optional) of Chapter 8. Such a course could be followed by another that includes some coverage of Chapters 9–12 before moving on to other areas such as mathematical statistics or martingales and financial mathematics. This will be particularly useful for graduate programs in statistics.

4. A one-semester course on an introduction to stochastic processes ora seminar on special topics could be based on Chapters 14–18.

A word on the numbering system used in the book. Statements of results (i.e., Theorems, Corollaries, Lemmas, and Propositions) are numbered consecutively within each section, in the format a.b.c, where a is the chapter number, b is the section number, and c is the counter. Definitions, Examples, and Remarks are numbered individually within each section, also in the form a.b.c, as above. Sections are referred to as a.b, where a is the chapter number and b is the section number. Equation numbers appear on the right, in the form (b.c), where b is the section number and c is the equation number. Equations in a given chapter a are referred to as (b.c) within the chapter but as (a.b.c) outside chapter a. Problems are listed at the end of each chapter in the form a.c, where a is the chapter number and c is the problem number.

In the writing of this book, material from existing books such as Apostol (1974), Billingsley (1995), Chow and Teicher (2001), Chung (1974), Durrett (2004), Royden (1988), and Rudin (1976, 1987) has been freely used. The authors owe a great debt to these books. The authors have used this material for courses taught over several years and have benefited greatly from suggestions for improvement from students and colleagues at Iowa State University, Cornell University, the Indian Institute of Science, and the Indian Statistical Institute. We are grateful to them.

Our special thanks go to Dean Isaacson, Ken Koehler, and Justin Peters at Iowa State University for their administrative support of this long project. Krishna Athreya would also like to thank Cornell University for its support.

We are most indebted to Sharon Shepard, who typed and retyped this book several times, patiently putting up with our never-ending “final” versions. Without her patient and generous help, this book could not have been written. We are also grateful to Denise Riker, who typed portions of an earlier version of this book.

John Kimmel of Springer got the book reviewed at various stages. The referee reports were very helpful and encouraging. Our grateful thanks to both John Kimmel and the referees.


We have tried hard to make this book free of mathematical and typographical errors and misleading or ambiguous statements, but we are aware that there will still be many such remaining that we have not caught. We will be most grateful to receive such corrections and suggestions for improvement. They can be e-mailed to us at [email protected] or [email protected].

On a personal note, we would like to thank our families for their patience and support. Krishna Athreya would like to record his profound gratitude to his maternal granduncle, the late Shri K. Venkatarama Iyer, who opened the door to mathematical learning for him at a crucial stage in high school; to the late Professor D. Basu of the Indian Statistical Institute, who taught him to think probabilistically; and to Professor Samuel Karlin of Stanford University, for initiating him into research in mathematics.

K. B. Athreya
S. N. Lahiri

May 12, 2006


Contents

Preface

Measures and Integration: An Informal Introduction

1 Measures
   1.1 Classes of sets
   1.2 Measures
   1.3 The extension theorems and Lebesgue-Stieltjes measures
      1.3.1 Caratheodory extension of measures
      1.3.2 Lebesgue-Stieltjes measures on R
      1.3.3 Lebesgue-Stieltjes measures on R2
      1.3.4 More on extension of measures
   1.4 Completeness of measures
   1.5 Problems

2 Integration
   2.1 Measurable transformations
   2.2 Induced measures, distribution functions
      2.2.1 Generalizations to higher dimensions
   2.3 Integration
   2.4 Riemann and Lebesgue integrals
   2.5 More on convergence
   2.6 Problems

3 Lp-Spaces
   3.1 Inequalities
   3.2 Lp-Spaces
      3.2.1 Basic properties
      3.2.2 Dual spaces
   3.3 Banach and Hilbert spaces
      3.3.1 Banach spaces
      3.3.2 Linear transformations
      3.3.3 Dual spaces
      3.3.4 Hilbert space
   3.4 Problems

4 Differentiation
   4.1 The Lebesgue-Radon-Nikodym theorem
   4.2 Signed measures
   4.3 Functions of bounded variation
   4.4 Absolutely continuous functions on R
   4.5 Singular distributions
      4.5.1 Decomposition of a cdf
      4.5.2 Cantor ternary set
      4.5.3 Cantor ternary function
   4.6 Problems

5 Product Measures, Convolutions, and Transforms
   5.1 Product spaces and product measures
   5.2 Fubini-Tonelli theorems
   5.3 Extensions to products of higher orders
   5.4 Convolutions
      5.4.1 Convolution of measures on (R, B(R))
      5.4.2 Convolution of sequences
      5.4.3 Convolution of functions in L1(R)
      5.4.4 Convolution of functions and measures
   5.5 Generating functions and Laplace transforms
   5.6 Fourier series
   5.7 Fourier transforms on R
   5.8 Plancherel transform
   5.9 Problems

6 Probability Spaces
   6.1 Kolmogorov’s probability model
   6.2 Random variables and random vectors
   6.3 Kolmogorov’s consistency theorem
   6.4 Problems

7 Independence
   7.1 Independent events and random variables
   7.2 Borel-Cantelli lemmas, tail σ-algebras, and Kolmogorov’s zero-one law
   7.3 Problems

8 Laws of Large Numbers
   8.1 Weak laws of large numbers
   8.2 Strong laws of large numbers
   8.3 Series of independent random variables
   8.4 Kolmogorov and Marcinkiewicz-Zygmund SLLNs
   8.5 Renewal theory
      8.5.1 Definitions and basic properties
      8.5.2 Wald’s equation
      8.5.3 The renewal theorems
      8.5.4 Renewal equations
      8.5.5 Applications
   8.6 Ergodic theorems
      8.6.1 Basic definitions and examples
      8.6.2 Birkhoff’s ergodic theorem
   8.7 Law of the iterated logarithm
   8.8 Problems

9 Convergence in Distribution
   9.1 Definitions and basic properties
   9.2 Vague convergence, Helly-Bray theorems, and tightness
   9.3 Weak convergence on metric spaces
   9.4 Skorohod’s theorem and the continuous mapping theorem
   9.5 The method of moments and the moment problem
      9.5.1 Convergence of moments
      9.5.2 The method of moments
      9.5.3 The moment problem
   9.6 Problems

10 Characteristic Functions
   10.1 Definition and examples
   10.2 Inversion formulas
   10.3 Levy-Cramer continuity theorem
   10.4 Extension to Rk
   10.5 Problems

11 Central Limit Theorems
   11.1 Lindeberg-Feller theorems
   11.2 Stable distributions
   11.3 Infinitely divisible distributions
   11.4 Refinements and extensions of the CLT
      11.4.1 The Berry-Esseen theorem
      11.4.2 Edgeworth expansions
      11.4.3 Large deviations
      11.4.4 The functional central limit theorem
      11.4.5 Empirical process and Brownian bridge
   11.5 Problems

12 Conditional Expectation and Conditional Probability
   12.1 Conditional expectation: Definitions and examples
   12.2 Convergence theorems
   12.3 Conditional probability
   12.4 Problems

13 Discrete Parameter Martingales
   13.1 Definitions and examples
   13.2 Stopping times and optional stopping theorems
   13.3 Martingale convergence theorems
   13.4 Applications of martingale methods
      13.4.1 Supercritical branching processes
      13.4.2 Investment sequences
      13.4.3 A conditional Borel-Cantelli lemma
      13.4.4 Decomposition of probability measures
      13.4.5 Kakutani’s theorem
      13.4.6 de Finetti’s theorem
   13.5 Problems

14 Markov Chains and MCMC
   14.1 Markov chains: Countable state space
      14.1.1 Definition
      14.1.2 Examples
      14.1.3 Existence of a Markov chain
      14.1.4 Limit theory
   14.2 Markov chains on a general state space
      14.2.1 Basic definitions
      14.2.2 Examples
      14.2.3 Chapman-Kolmogorov equations
      14.2.4 Harris irreducibility, recurrence, and minorization
      14.2.5 The minorization theorem
      14.2.6 The fundamental regeneration theorem
      14.2.7 Limit theory for regenerative sequences
      14.2.8 Limit theory of Harris recurrent Markov chains
      14.2.9 Markov chains on metric spaces
   14.3 Markov chain Monte Carlo (MCMC)
      14.3.1 Introduction
      14.3.2 Metropolis-Hastings algorithm
      14.3.3 The Gibbs sampler
   14.4 Problems

15 Stochastic Processes
   15.1 Continuous time Markov chains
      15.1.1 Definition
      15.1.2 Kolmogorov’s differential equations
      15.1.3 Examples
      15.1.4 Limit theorems
   15.2 Brownian motion
      15.2.1 Construction of SBM
      15.2.2 Basic properties of SBM
      15.2.3 Some related processes
      15.2.4 Some limit theorems
      15.2.5 Some sample path properties of SBM
      15.2.6 Brownian motion and martingales
      15.2.7 Some applications
      15.2.8 The Black-Scholes formula for stock price option
   15.3 Problems

16 Limit Theorems for Dependent Processes
   16.1 A central limit theorem for martingales
   16.2 Mixing sequences
      16.2.1 Mixing coefficients
      16.2.2 Coupling and covariance inequalities
   16.3 Central limit theorems for mixing sequences
   16.4 Problems

17 The Bootstrap
   17.1 The bootstrap method for independent variables
      17.1.1 A description of the bootstrap method
      17.1.2 Validity of the bootstrap: Sample mean
      17.1.3 Second order correctness of the bootstrap
      17.1.4 Bootstrap for lattice distributions
      17.1.5 Bootstrap for heavy tailed random variables
   17.2 Inadequacy of resampling single values under dependence
   17.3 Block bootstrap
   17.4 Properties of the MBB
      17.4.1 Consistency of MBB variance estimators
      17.4.2 Consistency of MBB cdf estimators
      17.4.3 Second order properties of the MBB
   17.5 Problems

18 Branching Processes
   18.1 Bienayme-Galton-Watson branching process
   18.2 BGW process: Multitype case
   18.3 Continuous time branching processes
   18.4 Embedding of Urn schemes in continuous time branching processes
   18.5 Problems

A Advanced Calculus: A Review
   A.1 Elementary set theory
      A.1.1 Set operations
      A.1.2 The principle of induction
      A.1.3 Equivalence relations
   A.2 Real numbers, continuity, differentiability, and integration
      A.2.1 Real numbers
      A.2.2 Sequences, series, limits, limsup, liminf
      A.2.3 Continuity and differentiability
      A.2.4 Riemann integration
   A.3 Complex numbers, exponential and trigonometric functions
   A.4 Metric spaces
      A.4.1 Basic definitions
      A.4.2 Continuous functions
      A.4.3 Compactness
      A.4.4 Sequences of functions and uniform convergence
   A.5 Problems

B List of Abbreviations and Symbols
   B.1 Abbreviations
   B.2 Symbols

References

Author Index

Subject Index


Measures and Integration: An Informal Introduction

For many students who are learning measure and integration theory for the first time, the notions of a σ-algebra of subsets of a set Ω, countable additivity of a set function λ, measurability of a function, the definition of an integral, and the interchange of limits and integration are not easy to understand and often seem not so intuitive. The goals of this informal introduction to the subject are (1) to show that the notions of σ-algebra and countable additivity are logical consequences of certain very natural approximation procedures, and not just some abstract ideas, and (2) to show that the dividends from assuming these two properties are great: they lead to a nice and natural theory that is also very powerful for handling limits. Of course, as the saying goes, the devil is in the details. After this informal introduction, the necessary details are given in the next few sections. It is hoped that after this heuristic explanation of the subject, the motivation for and the process of mastering the details on the part of the students will be forthcoming.

What is Measure Theory?

A simple answer is that it is a theory about the distribution of mass over a set S. If the mass is uniformly distributed and S is a Euclidean space Rk, it is the theory of Lebesgue measure on Rk (i.e., length in R, area in R2, volume in R3, etc.). Probability theory is concerned with the case when S is the sample space of a random experiment and the total mass is one. Consider the following example.

Imagine an open field S and a snowy night. At daybreak one goes to the field to measure the amount of snow in as many of the subsets of S as possible. Suppose now that one has the tools to measure the snow exactly on a class of subsets, such as triangles, rectangles, circular shapes, elliptic shapes, etc., no matter how small. It is natural to try to approximate oddly shaped regions by combinations of these “standard shapes” and then use a limiting process to obtain a measure for such regions. Let B denote the class of subsets of S whose measure is obtained this way and let λ(B) denote the amount of snow in each B ∈ B. Call B the class of all (snow) measurable sets and λ(B) the measure (of snow) of B for each B ∈ B. It is reasonable to expect that the following properties of B and λ(·) hold:

Properties of B

(i) A ∈ B ⇒ Ac ∈ B (i.e., if one can measure the amount of snow on A and knows the total amount on S, then one knows the amount of snow on Ac).

(ii) A1, A2 ∈ B ⇒ A1 ∪ A2 ∈ B (i.e., if one can measure the amount of snow on A1 and A2, then one can do the same for A1 ∪ A2).

(iii) If {An : n ≥ 1} ⊂ B and An ⊂ An+1 for all n ≥ 1, then lim_{n→∞} An ≡ ⋃n≥1 An ∈ B (i.e., if one can measure the amount of snow on An for each n ≥ 1 of an increasing sequence of sets, then one can do so on the limit of the An).

(iv) C ⊂ B, where C is the class of nice sets, such as triangles, squares, etc., that one started with.

Properties of λ(·)

(i) λ(A) ≥ 0 for A ∈ B (i.e., the amount of snow on any set is nonnegative!).

(ii) If A1, A2 ∈ B and A1 ∩ A2 = ∅, then λ(A1 ∪ A2) = λ(A1) + λ(A2) (i.e., the amounts of snow on two disjoint sets simply add up! This property of λ is referred to as finite additivity).

(iii) If {An : n ≥ 1} ⊂ B are such that An ⊂ An+1 for all n, then λ(lim_{n→∞} An) = λ(⋃n≥1 An) = lim_{n→∞} λ(An) (i.e., if we can approximate a set A by an increasing sequence of sets {An}n≥1 from B, then λ(A) = lim_{n→∞} λ(An). This property of λ is referred to as monotone continuity from below, or m.c.f.b. in short).

This last assumption (iii) is what guarantees that different approximations lead to consistent limits. Thus, if there are two increasing sequences {A′n}n≥1 and {A′′n}n≥1 having the same limit A, but {λ(A′n)}n≥1 and {λ(A′′n)}n≥1 have different limits, then the approximating procedures are not consistent.


It turns out that the above set of reasonable and natural assumptions leads to a very rich and powerful theory that is widely applicable.

A triplet (S, B, λ) that satisfies the above two sets of assumptions is called a measure space. The assumptions on B and λ are equivalent to the following:

On B

B(i)′: ∅, the empty set, lies in B

B(ii)′: A ∈ B ⇒ Ac ∈ B (same as (i) before)

B(iii)′: A1, A2, . . . ∈ B ⇒ ⋃i Ai ∈ B (combines (ii) and (iii) above) (closure under countable unions).

On λ

λ(i)′: λ(·) ≥ 0 (same as (i) before) and λ(∅) = 0.

λ(ii)′: λ(⋃n≥1 An) = Σn≥1 λ(An) if {An}n≥1 ⊂ B are pairwise disjoint, i.e., Ai ∩ Aj = ∅ for i ≠ j (countable additivity).

Any collection B of subsets of S that satisfies B(i)′, B(ii)′, B(iii)′ above is called a σ-algebra. Any set function λ on a σ-algebra B that satisfies λ(i)′ and λ(ii)′ above is called a measure. Thus, a measure space is a triplet (S, B, λ) where S is a nonempty set, B is a σ-algebra of subsets of S, and λ is a measure on B. Notice that the σ-algebra structure on B and the countable additivity of λ are necessary consequences of the very natural assumptions (i), (ii), and (iii) on B and λ defined at the beginning.

It is not often the case that one is given B and λ explicitly. Typically, one starts with a small collection C of subsets of S that have properties resembling intervals or rectangles, and a set function λ on C. Then, B is the smallest σ-algebra containing C, obtained from C by various operations such as countable unions, intersections, and their limits. The key properties on C that one needs are:

(i) A, B ∈ C ⇒ A ∩ B ∈ C (e.g., the intersection of intervals is an interval).

(ii) A ∈ C ⇒ Ac is a finite union of sets from C (e.g., the complement of an interval is the union of two intervals or an interval itself).

A collection C satisfying (i) and (ii) is called a semialgebra. The function λ on B is an extension of λ on C. For this extension to be a measure on B, the conditions needed are

(i) λ(A) ≥ 0 for all A ∈ C

(ii) If A1, A2, . . . ∈ C are pairwise disjoint and A = ⋃n≥1 An ∈ C, then λ(A) = Σn≥1 λ(An).


There is a result, known as the extension theorem, that says that given such a pair (C, λ), it is possible to extend λ to B, the smallest σ-algebra containing C, such that (S, B, λ) is a measure space. Actually, it does more. It constructs a σ-algebra B∗ larger than B and a measure λ∗ on B∗ such that (S, B∗, λ∗) is a larger measure space, λ∗ coincides with λ on C, and it provides nice approximation theorems. For example, the following approximation result is available:

If B ∈ B∗ with λ∗(B) < ∞, then for every ε > 0, B can be approximated by a finite union of sets from C, i.e., there exist sets A1, . . . , Ak ∈ C with k < ∞ such that λ∗(A ∆ B) < ε, where A ≡ A1 ∪ · · · ∪ Ak and A ∆ B = (A ∩ Bc) ∪ (Ac ∩ B), the symmetric difference between A and B.

That is, in principle, every (measurable) set B of finite measure (i.e., B belonging to B∗ with λ∗(B) < ∞) is nearly a finite union of (elementary) sets that belong to C. For example, if S = R and C is the class of intervals, then every measurable set of finite measure is nearly a finite union of disjoint bounded open intervals.

The following are some concrete examples of the above extension procedure.

Theorem: (Lebesgue-Stieltjes measures on R). Let F : R → R satisfy

(i) x1 < x2 ⇒ F (x1) ≤ F (x2) (nondecreasing);

(ii) F(x) = F(x+) ≡ lim_{y↓x} F(y) for all x ∈ R (i.e., F(·) is right continuous).

Let C be the class of sets of the form (a, b] or (b, ∞), −∞ ≤ a < b < ∞. Then, there exists a measure µF defined on B ≡ B(R), the smallest σ-algebra generated by C, such that

µF((a, b]) = F(b) − F(a) for all −∞ < a < b < ∞.

The σ-algebra B ≡ B(R) is called the Borel σ-algebra of R.

Corollary: There exists a measure m on B(R) such that m(I) = the length of I, for any interval I.

Proof: Take F(x) ≡ x in the above theorem.

This measure is called the Lebesgue measure on R.

Corollary: There exists a measure λ on B(R) such that

λ((a, b]) = (1/√(2π)) ∫_a^b e^(−x²/2) dx.

Proof: Take F(x) = ∫_{−∞}^x (1/√(2π)) e^(−u²/2) du, x ∈ R.


This measure is called the standard normal probability measure on R.
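As a quick numerical check of the identity µF((a, b]) = F(b) − F(a) (an illustration added here, not part of the original text), here is a minimal Python sketch; the helper name ls_measure is ours, and the standard normal cdf is taken from SciPy.

    # Sketch: the Lebesgue-Stieltjes measure of (a, b] induced by a nondecreasing,
    # right-continuous F is simply F(b) - F(a).
    from scipy.stats import norm

    def ls_measure(F, a, b):
        """Measure of the half-open interval (a, b] under the measure induced by F."""
        return F(b) - F(a)

    # F(x) = x gives Lebesgue measure: the length of the interval.
    print(ls_measure(lambda x: x, 1.0, 2.5))      # 1.5

    # F = standard normal cdf gives the standard normal probability measure.
    print(ls_measure(norm.cdf, -1.96, 1.96))      # approximately 0.95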

Theorem: (Lebesgue-Stieltjes measures on R2). Let F : R2 → R be a function satisfying the following:

(i) (Monotonicity) For x = (x1, x2)′, y = (y1, y2)′ with xi ≤ yi for i = 1, 2,

(∆F)(x, y) ≡ F(y1, y2) − F(x1, y2) − F(y1, x2) + F(x1, x2) ≥ 0.

(ii) (Continuity from above) F(x) = lim_{yi↓xi, i=1,2} F(y) for all x ∈ R2.

Let C be the class of all rectangles of the form (a, b] ≡ (a1, b1] × (a2, b2] with a = (a1, a2)′, b = (b1, b2)′ ∈ R2. Then there exists a measure µF, defined on the σ-algebra B ≡ B(R2) generated by C, such that

µF((a, b]) = (∆F)(a, b).

The above theorems have a converse that says that every measure on (Rk, B(Rk)) that is finite on bounded sets arises from some function F (called a distribution function) and is, therefore, a Lebesgue-Stieltjes measure.
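To make the two-dimensional increment concrete, here is a small Python sketch we add for illustration (rect_measure is our name): for F(x1, x2) = x1·x2 the increment (∆F)(a, b) is just the area of the rectangle (a1, b1] × (a2, b2], and for a product of standard normal cdfs it is the probability that an independent N(0,1) pair falls in the rectangle.

    # Sketch: (Delta F)(a, b) = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2).
    from scipy.stats import norm

    def rect_measure(F, a, b):
        (a1, a2), (b1, b2) = a, b
        return F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

    area = lambda x1, x2: x1 * x2                      # F(x1, x2) = x1*x2: Lebesgue measure on R2
    gauss = lambda x1, x2: norm.cdf(x1) * norm.cdf(x2) # product of standard normal cdfs

    print(rect_measure(area, (0, 0), (2, 3)))          # 6.0, the area of (0, 2] x (0, 3]
    print(rect_measure(gauss, (-1, -1), (1, 1)))       # about 0.466 = P(-1 < X <= 1)^2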

Here is another simple example of a measure space (with discrete S).

Example: Let S = {s1, s2, . . . , sk}, k ≤ ∞, and let B = P(S), the power set of S, i.e., the collection of all possible subsets of S. Let p1, p2, . . . be nonnegative numbers. Let

λ(A) ≡ Σ1≤i≤k pi IA(si),

where IA is the indicator function of the set A, defined by IA(s) = 1 if s ∈ A and 0 otherwise. It is easy to verify that (S, B, λ) is a measure space and also that every measure λ on B arises this way.
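This discrete example translates directly into a few lines of Python (a sketch we add for illustration; the function name measure is ours):

    # Sketch of the discrete measure lambda(A) = sum_i p_i * I_A(s_i) on a countable S.
    def measure(A, points, weights):
        return sum(p for s, p in zip(points, weights) if s in A)

    S = ['s1', 's2', 's3', 's4']
    p = [0.1, 0.2, 0.3, 0.4]
    A, B = {'s1', 's2'}, {'s3'}

    # Additivity on disjoint sets: lambda(A U B) = lambda(A) + lambda(B).
    print(measure(A | B, S, p))                    # 0.6
    print(measure(A, S, p) + measure(B, S, p))     # 0.6, the same value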

What is Integration Theory?

In short, it is a theory about weighted sums of functions on a set S when the weights are specified by a mass distribution λ. Here is a more detailed answer.

Let (S, B, λ) be a measure space. Suppose f : S → R is a simple function, i.e., f is such that f(S) is a finite set {a1, a2, . . . , ak}. It is reasonable to define the weighted sum of f with respect to λ as Σ1≤i≤k ai λ(Ai), where Ai = f−1({ai}). Of course, for this to be well defined, one needs Ai to be in B and λ(Ai) < ∞ for all i such that ai ≠ 0.

Notice that the quantity Σ1≤i≤k ai λ(Ai) remains the same whether the ai’s are distinct or not. Call this the integral of f with respect to λ and denote it by ∫f dλ.


If f and g are simple, then for α, β ∈ R, ∫(αf + βg) dλ = α∫f dλ + β∫g dλ. Now how should one define ∫f dλ (the integral of f with respect to λ) for a nonsimple f? The answer, of course, is to “approximate” by simple functions. Let f be a nonnegative function. To define the integral of f, one would like to approximate f by simple functions. It turns out that a necessary and sufficient condition for this is that for any a ∈ R, the set {s : f(s) ≤ a} is in B. Such a function f is called measurable with respect to B, or B-measurable, or simply measurable (if B is kept fixed throughout). Let f be a nonnegative B-measurable function. Then there exists a sequence {fn}n≥1 of simple nonnegative functions such that for each s ∈ S, {fn(s)}n≥1 is a nondecreasing sequence converging to f(s). It is now natural to define the weighted sum of f with respect to λ, i.e., the integral of f with respect to λ, denoted by ∫f dλ, as

∫f dλ = lim_{n→∞} ∫fn dλ.

An immediate question is: Is the right side the same for all such approximating sequences {fn}n≥1? The answer is yes; it is guaranteed by the very natural assumptions imposed on λ that it is finitely additive and monotone continuous from below, i.e., λ(ii) and λ(iii) (or equivalently, that λ is countably additive, i.e., λ(ii)′).

This can be strengthened to a key result known as the monotone convergence theorem, which in turn leads to two other major convergence results.

The monotone convergence theorem (MCT): Let (S, B, λ) be a measure space and let fn : S → R+, n ≥ 1, be a sequence of nonnegative B-measurable functions (not necessarily simple) such that for all s ∈ S,

(i) fn(s) ≤ fn+1(s), for all n ≥ 1, and

(ii) limn→∞ fn(s) = f(s).

Then f is B-measurable and

∫f dλ = lim_{n→∞} ∫fn dλ.

This says that the integral and the limit can be interchanged for monotone nondecreasing nonnegative B-measurable functions. Note that if fn = IAn, the indicator function of a set An, and if An ⊂ An+1 for each n, then the MCT is the same as m.c.f.b. (cf. property λ(iii)). Thus, the very natural assumption of m.c.f.b. yields a basic convergence result that makes the integration theory so elegant and powerful.
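As a concrete illustration (ours, not from the text), take f(x) = x on [0, 1) with Lebesgue measure m and the dyadic simple functions fn(x) = ⌊2^n x⌋/2^n, which increase to f. The sketch below computes ∫fn dm exactly as a finite weighted sum and shows the integrals increasing to ∫f dm = 1/2.

    # Sketch: integrals of the dyadic simple approximations f_n = floor(2^n f)/2^n of f(x) = x
    # on [0, 1), computed exactly; they increase to the integral of f, which is 1/2 (MCT).
    def integral_of_fn(n):
        # f_n equals k/2^n on [k/2^n, (k+1)/2^n), an interval of Lebesgue measure 2^(-n),
        # so its integral is the finite weighted sum over the k's.
        return sum((k / 2**n) * (1 / 2**n) for k in range(2**n))

    for n in [1, 2, 4, 8, 16]:
        print(n, integral_of_fn(n))   # 0.25, 0.375, 0.46875, ... increasing to 0.5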

To extend the definition of ∫f dλ to a real valued, B-measurable function f : S → R, one uses the simple idea that f can be decomposed as f = f+ − f−, where f+(s) = max{f(s), 0} and f−(s) = max{−f(s), 0}, s ∈ S. Since both f+ and f− are nonnegative and B-measurable, ∫f+ dλ and ∫f− dλ are both well defined. Now set

∫f dλ = ∫f+ dλ − ∫f− dλ,

provided at least one of the two terms on the right is finite. The function f is said to be integrable with respect to (w.r.t.) λ if both ∫f+ dλ and ∫f− dλ are finite or, equivalently, if ∫|f| dλ < ∞. The following is a consequence of the MCT.

Fatou’s lemma: Let {fn}n≥1 be a sequence of nonnegative B-measurable functions on a measure space (S, B, λ). Then

∫ lim inf_{n→∞} fn dλ ≤ lim inf_{n→∞} ∫fn dλ.

This in turn leads to

(Lebesgue’s) dominated convergence theorem (DCT): Let {fn}n≥1 be a sequence of B-measurable functions from a measure space (S, B, λ) to R and let g be a B-measurable nonnegative integrable function on (S, B, λ). Suppose that for each s in S,

(i) |fn(s)| ≤ g(s) for all n ≥ 1 and

(ii) limn→∞ fn(s) = f(s).

Then, f is integrable and

lim_{n→∞} ∫fn dλ = ∫f dλ = ∫ lim_{n→∞} fn dλ.
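The domination hypothesis cannot be dropped. As a numerical illustration we add here (not in the original text), on ((0, 1], Lebesgue measure) the functions fn = n·I(0,1/n] converge to 0 pointwise, yet every ∫fn dm equals 1, so only Fatou’s inequality holds; the sketch below approximates the integrals on a fine grid.

    # Sketch: without an integrable dominating g, limits and integrals need not interchange.
    import numpy as np

    x = np.linspace(1e-6, 1.0, 1_000_000)     # fine grid standing in for (0, 1]
    dx = x[1] - x[0]

    def f(n):
        return n * (x <= 1.0 / n)             # f_n = n on (0, 1/n], 0 elsewhere

    for n in [1, 10, 100]:
        print(n, float(np.sum(f(n)) * dx))    # about 1.0 for every n, although f_n -> 0 pointwise
    # So lim int f_n dm = 1 > 0 = int (lim f_n) dm: Fatou's inequality is strict here,
    # and no integrable g dominates all the f_n, so the DCT does not apply.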

Thus some very natural assumptions on B and λ lead to an interesting measure and integration theory that is quite general and that allows the interchange of limits and integrals under fairly general conditions. A systematic treatment of measure and integration theory is given in the next five chapters.


1 Measures

Section 1.1 deals with algebraic operations on subsets of a given nonempty set Ω. Section 1.2 treats nonnegative set functions on classes of sets and defines the notion of a measure on an algebra. Section 1.3 treats the extension theorem, and Section 1.4 deals with completeness of measures.

1.1 Classes of sets

Let Ω be a nonempty set and P(Ω) ≡ {A : A ⊂ Ω} be the power set of Ω, i.e., the class of all subsets of Ω.

Definition 1.1.1: A collection of sets F ⊂ P(Ω) is called an algebra if (a) Ω ∈ F, (b) A ∈ F implies Ac ∈ F, and (c) A, B ∈ F implies A ∪ B ∈ F (i.e., closure under pairwise unions).

Thus, an algebra is a class of sets containing Ω that is closed under complementation and pairwise (and hence finite) unions. It is easy to see that one can equivalently define an algebra by requiring that properties (a), (b) hold and that the property

(c)′ A, B ∈ F ⇒ A ∩ B ∈ F

holds (i.e., closure under finite intersections).


Definition 1.1.2: A class F ⊂ P(Ω) is called a σ-algebra if it is an algebra and if it satisfies

(d) An ∈ F for n ≥ 1 ⇒ ⋃n≥1 An ∈ F.

Thus, a σ-algebra is a class of subsets of Ω that contains Ω and is closed under complementation and countable unions. As pointed out in the introductory chapter, a σ-algebra can alternatively be defined as an algebra that is closed under monotone unions, as the following shows.

Proposition 1.1.1: Let F ⊂ P(Ω). Then F is a σ-algebra if and only if F is an algebra and satisfies

An ∈ F, An ⊂ An+1 for all n ⇒ ⋃n≥1 An ∈ F.

Proof: The ‘only if’ part is obvious. For the ‘if’ part, let {Bn}n≥1 ⊂ F. Then, since F is an algebra, An ≡ B1 ∪ · · · ∪ Bn ∈ F for all n. Further, An ⊂ An+1 for all n and ⋃n≥1 Bn = ⋃n≥1 An. Since by hypothesis ⋃n An ∈ F, ⋃n Bn ∈ F.

Here are some examples of algebras and σ-algebras.

Example 1.1.1: Let Ω = {a, b, c, d}. Consider the classes

F1 = {Ω, ∅, {a}}

and

F2 = {Ω, ∅, {a}, {b, c, d}}.

Then, F2 is an algebra (and also a σ-algebra), but F1 is not an algebra, since {a}c ∉ F1.
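On a finite Ω these closure properties can be checked by brute force; the following Python sketch (added here for illustration, with our own function name is_algebra) confirms that F2 passes and F1 fails.

    # Sketch: brute-force test of the algebra axioms for a family of subsets of a finite Omega.
    def is_algebra(omega, family):
        fam = {frozenset(A) for A in family}
        if frozenset(omega) not in fam:
            return False
        for A in fam:
            if frozenset(omega - A) not in fam:      # closed under complementation?
                return False
            for B in fam:
                if A | B not in fam:                 # closed under pairwise unions?
                    return False
        return True

    omega = {'a', 'b', 'c', 'd'}
    F1 = [omega, set(), {'a'}]
    F2 = [omega, set(), {'a'}, {'b', 'c', 'd'}]
    print(is_algebra(omega, F1))   # False: the complement {b, c, d} of {a} is missing
    print(is_algebra(omega, F2))   # True; Omega being finite, F2 is also a sigma-algebra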

Example 1.1.2: Let Ω be any nonempty set and let

F3 = P(Ω) ≡ {A : A ⊂ Ω}, the power set of Ω,

and

F4 = {Ω, ∅}.

Then, it is easy to check that both F3 and F4 are σ-algebras. The latter σ-algebra is often called the trivial σ-algebra on Ω (Problem 1.1).

From the definition it is clear that any σ-algebra is also an algebra, and thus F2, F3, F4 are examples of algebras, too. The following is an example of an algebra that is not a σ-algebra.

Example 1.1.3: Let Ω be a nonempty set, and let |A| denote the number of elements of a set A ⊂ Ω. Define

F5 = {A ⊂ Ω : either |A| is finite or |Ac| is finite}.


Then, note that (i) Ω ∈ F5 (since |Ωc| = |∅| = 0), and (ii) A ∈ F5 implies Ac ∈ F5 (if |A| < ∞, then |(Ac)c| = |A| < ∞, and if |Ac| < ∞, then Ac ∈ F5 trivially). Next, suppose that A, B ∈ F5. If either |A| < ∞ or |B| < ∞, then

|A ∩ B| ≤ min{|A|, |B|} < ∞,

so that A ∩ B ∈ F5. On the other hand, if both |Ac| < ∞ and |Bc| < ∞, then

|(A ∩ B)c| = |Ac ∪ Bc| ≤ |Ac| + |Bc| < ∞,

implying that A ∩ B ∈ F5. Thus, property (c)′ holds, and F5 is an algebra. However, if |Ω| = ∞, then F5 is not a σ-algebra. To see this, suppose that |Ω| = ∞ and {ω1, ω2, . . .} ⊂ Ω. Then, by definition, Ai = {ωi} ∈ F5 for all i ≥ 1, but A ≡ ⋃i≥1 A2i−1 = {ω1, ω3, . . .} ∉ F5, since |A| = |Ac| = ∞.

Example 1.1.4: Let Ω be a nonempty set and let

F6 = {A ⊂ Ω : A is countable or Ac is countable}.

Then, it is easy to show that F6 is a σ-algebra (Problem 1.3).

Suppose {Fθ : θ ∈ Θ} is a family of σ-algebras on Ω. From the definition, it follows that the intersection ⋂θ∈Θ Fθ is a σ-algebra, no matter how large the index set Θ is (Problem 1.4). However, the union of two σ-algebras may not even be an algebra (Problem 1.5). For the development of measure theory and probability theory, the concept of a σ-algebra plays a crucial role. In many instances, given an arbitrary collection of subsets of Ω, one would like to extend it to a possibly larger class that is a σ-algebra. This leads to the following definition.

Definition 1.1.3: If A is a class of subsets of Ω, then the σ-algebra generated by A, denoted by σ〈A〉, is defined as

σ〈A〉 = ⋂_{F ∈ I(A)} F,

where I(A) ≡ {F : A ⊂ F and F is a σ-algebra on Ω} is the collection of all σ-algebras containing the class A.

Note that since the power set P(Ω) contains A and is itself a σ-algebra, the collection I(A) is not empty and hence the intersection in the above definition is well defined.

Example 1.1.5: In the setup of Example 1.1.1, σ〈F1〉 = F2 (why?).

A particularly useful class of σ-algebras is that of the σ-algebras generated by the open sets of a topological space. These are called Borel σ-algebras. A topological space is a pair (S, T ) where S is a nonempty set and T is a collection of subsets of S such that (i) S ∈ T , (ii) O1, O2 ∈ T ⇒ O1 ∩ O2 ∈ T , and (iii) {Oα : α ∈ I} ⊂ T ⇒ ⋃α∈I Oα ∈ T . Elements of T are called open sets.


A metric space is a pair (S, d) where S is a nonempty set and d is a function from S × S to R+ satisfying (i) d(x, y) = d(y, x) for all x, y in S, (ii) d(x, y) = 0 iff x = y, and (iii) d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z in S. Property (iii) is called the triangle inequality. The function d is called a metric on S (cf. Section A.4).

Any Euclidean space Rn (1 ≤ n < ∞) is a metric space under any one of the following metrics:

(a) For 1 ≤ p < ∞, dp(x, y) = (Σ1≤i≤n |xi − yi|^p)^{1/p}.

(b) d∞(x, y) = max1≤i≤n |xi − yi|.

(c) For 0 < p < 1, dp(x, y) = Σ1≤i≤n |xi − yi|^p.
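For p ≥ 1 these distances are immediate to compute; here is a small Python sketch we add (the function names d_p and d_inf are ours, and the sketch covers only dp for p ≥ 1 and d∞), together with a spot check of the triangle inequality for p = 2.

    # Sketch: the metrics d_p (1 <= p < infinity) and d_infinity on R^n.
    def d_p(x, y, p):
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

    def d_inf(x, y):
        return max(abs(a - b) for a, b in zip(x, y))

    x, y, z = (0.0, 0.0), (3.0, 4.0), (1.0, 1.0)
    print(d_p(x, y, 1), d_p(x, y, 2), d_inf(x, y))        # 7.0, 5.0, 4.0
    # Triangle inequality spot check for p = 2: d(x, z) <= d(x, y) + d(y, z).
    print(d_p(x, z, 2) <= d_p(x, y, 2) + d_p(y, z, 2))    # True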

A metric space (S, d) is a topological space where a set O is open if for all x ∈ O, there is an ε > 0 such that B(x, ε) ≡ {y : d(y, x) < ε} ⊂ O.

Definition 1.1.4: The Borel σ-algebra on a topological space S (in particular, on a metric space or a Euclidean space) is defined as the σ-algebra generated by the collection of open sets in S.

Example 1.1.6: Let B(Rk) denote the Borel σ-algebra on Rk, 1 ≤ k < ∞. Then,

B(Rk) ≡ σ〈{A : A is an open subset of Rk}〉

is also generated by each of the following classes of sets:

O1 = {(a1, b1) × · · · × (ak, bk) : −∞ ≤ ai < bi ≤ ∞, 1 ≤ i ≤ k};
O2 = {(−∞, x1) × · · · × (−∞, xk) : x1, . . . , xk ∈ R};
O3 = {(a1, b1) × · · · × (ak, bk) : ai, bi ∈ Q, ai < bi, 1 ≤ i ≤ k};
O4 = {(−∞, x1) × · · · × (−∞, xk) : x1, . . . , xk ∈ Q},

where Q denotes the set of all rational numbers.

To show this, note that σ〈Oi〉 ⊂ B(Rk) for i = 1, 2, 3, 4, and hence, it is enough to show that σ〈Oi〉 ⊃ B(Rk). Let G be a σ-algebra containing O3. Observe that given any open set A ⊂ Rk, there exists a sequence of sets {Bn}n≥1 in O3 such that A = ⋃n≥1 Bn (Problem 1.9). Since G is a σ-algebra and Bn ∈ G for all n ≥ 1, A ∈ G. Thus, G is a σ-algebra containing all open subsets of Rk, and hence G ⊃ B(Rk). Hence, it follows that

B(Rk) ⊃ σ〈O1〉 ⊃ σ〈O3〉 = ⋂_{G ⊃ O3} G ⊃ B(Rk).


Next note that any interval (a, b) ⊂ R can be expressed in terms of half spaces of the form (−∞, x), x ∈ R, as

(a, b) = ⋃n≥1 [(−∞, b) \ (−∞, a + 1/n)],

where for any two sets A and B, A \ B = {x : x ∈ A, x ∉ B}. It is not difficult to show that this implies that σ〈Oi〉 = B(Rk) for i = 2, 4 (Problem 1.10).

Example 1.1.7: Let Ω be a nonempty set with |Ω| = ∞ and let F5 and F6 be as in Examples 1.1.3 and 1.1.4. Then F6 = σ〈F5〉. To see this, note that F6 is a σ-algebra containing F5, so that σ〈F5〉 ⊂ F6. To prove the reverse inclusion, let G be a σ-algebra containing F5. It is enough to show that F6 ⊂ G. Let A ∈ F6. If A is countable, say A = {ω1, ω2, . . .}, then Ai ≡ {ωi} ∈ F5 ⊂ G for all i ≥ 1 and hence A = ⋃i≥1 Ai ∈ G. On the other hand, if Ac is countable, then by the above argument, Ac ∈ G ⇒ A ∈ G.

Definition 1.1.5: A class C of subsets of Ω is a π-system or a π-class if A, B ∈ C ⇒ A ∩ B ∈ C.

Example 1.1.8: The class C of intervals in R is a π-system, whereas the class of all open discs in R2 is not.

Definition 1.1.6: A class L of subsets of Ω is a λ-system or a λ-class if (i) Ω ∈ L, (ii) A, B ∈ L, A ⊂ B ⇒ B \ A ∈ L, and (iii) An ∈ L, An ⊂ An+1 for all n ≥ 1 ⇒ ⋃n≥1 An ∈ L.

Example 1.1.9: Every σ-algebra F is a λ-system. But an algebra need not be a λ-system.

It is easily checked that if L1 and L2 are λ-systems, then L1∩L2 is also aλ-system. Recall that σ〈B〉, the σ-algebra generated by B, is the intersectionof all σ-algebras containing B and is also the smallest σ-algebra containingB. Similarly, for any B ⊂ P(Ω), the λ-system generated by B, denoted byλ〈B〉, is defined as the intersection of all λ-systems containing B. It is thesmallest λ-system containing B.

Theorem 1.1.2: (The π-λ theorem). If C is a π-system, then λ〈C〉 = σ〈C〉.

Proof: For any C, σ〈C〉 is a λ-system and σ〈C〉 contains C. Thus, λ〈C〉 ⊂σ〈C〉 for any C. Hence, it suffices to show that if C is a π-system, then λ〈C〉is a σ-algebra . Since λ〈C〉 is a λ-system, it is closed under complementationand monotone increasing unions. By Proposition 1.1.1, it is enough to showthat it is closed under intersection. Let λ1(C) ≡ A : A ∈ λ〈C〉, A ∩ B ∈λ〈C〉 for all B ∈ C. Then, λ1(C) is a λ-system and C being a π-system,λ1(C) ⊃ C. Therefore, λ1(C) ⊃ λ〈C〉. But λ1(C) ⊂ λ〈C〉. So λ1(C) = λ〈C〉.

Page 29: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

14 1. Measures

Next, let λ2(C) ≡ A : A ∈ λ〈C〉, A ∩ B ∈ λ〈C〉 for all B ∈ λ〈C〉. Thenλ2(C) is a λ-system and by the previous step C ⊂ λ2(C) ⊂ λ〈C〉. Hence,it follows that λ2(C) = λ〈C〉, i.e., λ〈C〉 is closed under intersection. Thiscompletes the proof of the theorem.

Corollary 1.1.3: If C is a π-system and L is a λ-system containing C,then L ⊃ σ〈C〉.

Remark 1.1.1: There are several equivalent definitions of λ-systems; see,for example, Billingsley (1995). A closely related concept is that of a mono-tone class; see, for example, Chung (1974).

1.2 Measures

A set function is an extended real valued function defined on a class ofsubsets of a set Ω. Measures are nonnegative set functions that, intuitivelyspeaking, measure the content of a subset of Ω. As explained in Section2 of the introductory chapter, a measure has to satisfy certain naturalrequirements, such as the measure of the union of a countable collectionof disjoint sets is the sum of the measures of the individual sets. Formally,one has the following definition.

Definition 1.2.1: Let Ω be a nonempty set and F be an algebra on Ω.Then, a set function µ on F is called a measure if

(a) µ(A) ∈ [0,∞] for all A ∈ F ;

(b) µ(∅) = 0;

(c) for any disjoint collection of sets A1, A2, . . . ,∈ F with⋃

n≥1 An ∈ F ,

µ( ⋃

n≥1

An

)=

∞∑n=1

µ(An).

As discussed in Section 2 of the introductory chapter, these conditions onµ are equivalent to finite additivity and monotone continuity from below.

Proposition 1.2.1: Let Ω be a nonempty set and F be an algebra ofsubsets of Ω and µ be a set function on F with values in [0,∞] and withµ(∅) = 0. Then, µ is a measure iff µ satisfies

(iii)′a : (finite additivity) for all A1, A2 ∈ F with A1∩A2 = ∅, µ(A1∪A2) =

µ(A1) + µ(A2), and

(iii)′b : (monotone continuity from below or, m.c.f.b., in short) for any

collection Ann≥1 of sets in F such that An ⊂ An+1 for all n ≥ 1

Page 30: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.2 Measures 15

and⋃

n≥1 An ∈ F ,

µ( ⋃

n≥1

An

)= lim

n→∞ µ(An).

Proof: Let µ be a measure on F . Since µ satisfies (iii), taking A3, A4, . . .to be ∅ yields (iii)′

a. This implies that for A and B in F , A ⊂ B ⇒µ(B) = µ(A) + µ(B \ A) ≥ µ(A), i.e., µ is monotone. To establish (iii)′

b,note that if µ(An) = ∞ for some n = n0, then µ(An) = ∞ for all n ≥ n0and µ(

⋃n≥1 An) = ∞ and (iii)′

b holds in this case. Hence, suppose thatµ(An) < ∞ for all n ≥ 1. Setting Bn = An \An−1 for n ≥ 1 (with A0 = ∅),by (iii)′

a, µ(Bn) = µ(An)−µ(An−1). Since Bnn≥1 is a disjoint collectionof sets in F with

⋃n≥1 Bn =

⋃n≥1 An, by (iii)

µ( ⋃

n≥1

An

)= µ

( ⋃n≥1

Bn

)=

∞∑n=1

µ(Bn) = limN→∞

N∑n=1

[µ(An)− µ(An−1)]

= limN→∞

µ(AN ),

and so (iii)′b holds also in this case.

Conversely, let µ satisfy µ(∅) = 0 and (iii)′a and (iii)′

b. Let Ann≥1 bea disjoint collection of sets in F with

⋃i≥1 Ai ∈ F . Let Cn =

⋃nj=1 Aj for

n ≥ 1. Since F is an algebra, Cn ∈ F for all n ≥ 1. Also, Cn ⊂ Cn+1 forall n ≥ 1. Hence,

⋃n≥1 Cn =

⋃j≥1 Aj . By (iii)′

b,

µ( ⋃

j≥1

Aj

)= µ

( ⋃n≥1

Cn

)= lim

n→∞ µ(Cn)

= limn→∞

n∑j=1

µ(Aj) (by (iii)′a)

=∞∑

j=1

µ(Aj).

Thus, (iii) holds.

Remark 1.2.1: The definition of a measure given in Definition 1.2.1 isvalid when F is a σ-algebra. However, very often one may start with ameasure on an algebra A and then extend it to a measure on the σ-algebraσ〈A〉. This is why the definition of a measure on an algebra is given here.In the same vein, one may begin with a definition of a measure on a class ofsubsets of Ω that form only a semialgebra (cf. Definition 1.3.1). As describedin the introductory chapter, such preliminary collections of sets are “nice”sets for which the measure may be defined easily, and the extension to a σ-algebra containing these sets may be necessary if one is interested in moregeneral sets. This topic is discussed in greater detail in the next section.

Page 31: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16 1. Measures

Definition 1.2.2: A measure µ is called finite or infinite according asµ(Ω) < ∞ or µ(Ω) = ∞. A finite measure with µ(Ω) = 1 is called aprobability measure. A measure µ on a σ-algebra F is called σ-finite ifthere exist a countable collection of sets A1, A2, . . . ,∈ F , not necessarilydisjoint, such that

(a)⋃n≥1

An = Ω and (b) µ(An) < ∞ for all n ≥ 1.

Here are some examples of measures.

Example 1.2.1: (The counting measure). Let Ω be a nonempty set andF3 = P(Ω) be the set of all subsets of Ω (cf. Example 1.1.2). Define

µ(A) = |A|, A ∈ F3,

where |A| denotes the number of elements in A. It is easy to check that µsatisfies the requirements (a)–(c) of a measure. This measure µ is called thecounting measure on Ω. Note that µ is finite iff Ω is finite and it is σ-finiteif Ω is countably infinite.

Example 1.2.2: (Discrete probability measures). Let ω1, ω2, . . . ,∈ Ω andp1, p2, . . . ∈ [0, 1] be such that

∑∞i=1 pi = 1. Define for any A ⊂ Ω

P (A) =∞∑

i=1

piIA(ωi),

where IA(·) denotes the indicator function of a set A, defined by IA(ω) = 0or 1 according as ω ∈ A or ω ∈ A. For any disjoint collection of setsA1, A2, . . . ∈ P(Ω),

P

( ∞⋃i=1

Ai

)=

∞∑j=1

pjI⋃∞i=1 Ai

(ωj)

=∞∑

j=1

pj

( ∞∑i=1

IAi(ωj))

=∞∑

i=1

( ∞∑j=1

pjIAi(ωj))

=∞∑

i=1

P (Ai),

where interchanging the order of summation is permissible since the sum-mands are nonnegative. This shows that P is a probability measure onP(Ω).

Page 32: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.2 Measures 17

Example 1.2.3: (Lebesgue-Stieltjes measures on R). As mentioned in theprevious chapter (cf. Section 2), a large class of measures on the Borelσ-algebra B(R) of subsets of R, known as the Lebesgue-Stieltjes measures,arise from nondecreasing right continuous functions F : R → R. For eachsuch F , the corresponding measure µF satisfies µF ((a, b]) = F (b) − F (a)for all −∞ < a < b < ∞. The construction of these µF ’s via the extensiontheorem will be discussed in the next section. Also, note that if An =(−n, n), n = 1, 2, . . ., then R =

⋃n≥1 An and µF (An) < ∞ for each n ≥ 1

(such measures are called Radon measures) and thus, µF is necessarilyσ-finite.

Proposition 1.2.2: Let µ be a measure on an algebra F , and letA, B, A1, . . . , Ak ∈ F , 1 ≤ k < ∞. Then,

(i) (Monotonicity) µ(A) ≤ µ(B) if A ⊂ B;

(ii) (Finite subadditivity) µ(A1 ∪ . . . ∪Ak) ≤ µ(A1) + . . . + µ(Ak);

(iii) (Inclusion-exclusion formula) If µ(Ai) < ∞ for all i = 1, . . . , k, then

µ(A1 ∪ . . . ∪Ak) =k∑

i=1

µ(Ai)−∑

1≤i<j<k

µ(Ai ∩Aj)

+ . . . + (−1)k−1µ(A1 ∩ . . . ∩Ak).

Proof: µ(B) = µ(A ∪ (Ac ∩ B)) = µ(A) + µ(B \ A) ≥ µ(A), by (a) and(c) of Definition 1.2.1. This proves (i).

To prove (ii), note that if either µ(A) or µ(B) is finite, then µ(A∩B) < ∞,by (i). Hence, using the countable additivity property (c), we have

µ(A ∪B) = µ(A) + µ(B \A)= µ(A) + [µ(B \A) + µ(A ∩B)]− µ(A ∩B)= µ(A) + µ(B)− µ(A ∩B). (2.1)

Hence, (ii) follows from (2.1) and by induction.To prove (iii), note that the case k = 2 follows from (2.1). Next, suppose

that (iii) holds for all sets A1, . . . , Ak ∈ F with µ(Ai) < ∞ for all i =1, . . . , k for some k = n, n ∈ N. To show that it holds for k = n + 1, notethat by (2.1),

µ

( n+1⋃i=1

Ai

)

= µ

( n⋃i=1

Ai

)+ µ(An+1)− µ

( n⋃i=1

(Ai ∩An+1))

Page 33: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

18 1. Measures

= n∑

i=1

µ(Ai)−∑

1<i<j≤n

µ(Ai ∩Aj) + · · ·+ (−1)n−1µ(A1 ∩ . . . ∩An)

+ µ(An+1)−[ n∑

i=1

µ(Ai ∩An+1)−∑

1≤i<j≤n

µ(Ai ∩Aj ∩An+1)

+ · · ·+ (−1)n−1µ(A1 ∩ . . . ∩An+1)]

=n+1∑i=1

µ(Ai)−∑

1≤i<j≤n+1

µ(Ai ∩Aj) + · · ·+ (−1)nµ

( n+1⋂j=1

Aj

).

By induction, this completes the proof of Proposition 1.2.2.

In Proposition 1.2.1, it was shown that a set function µ on an algebraF is a measure iff it is finitely additive and monotone continuous frombelow. A natural question is: if µ is a measure on F and Ann≥1 is acollection of decreasing sets in F with A ≡

⋂n≥1 An also in F , does the

relation µ(A) = limn→∞ µ(An) hold, i.e., does monotone continuity fromabove hold? The answer is positive under the assumption µ(An0) < ∞ forsome n0 ∈ N. It turns out that, in general, this assumption cannot bedropped (Problem 1.18).

Proposition 1.2.3: Let µ be a measure on an algebra F .

(i) (Monotone continuity from above) Let Ann≥1 be a sequence of setsin F such that An+1 ⊂ An for all n ≥ 1 and A ≡

⋂n≥1 An ∈ F . Also,

let µ(An0) < ∞ for some n0 ∈ N. Then,

limn→∞ µ(An) = µ(A).

(ii) (Countable subadditivity) If Ann≥1 is a sequence of sets in F suchthat

⋃n≥1 An ∈ F , then

µ

( ∞⋃n=1

An

)≤

∞∑n=1

µ(An).

Proof: To prove part (i), without loss of generality (w.l.o.g.), assume thatn0 = 1, i.e., µ(A1) < ∞. Let Cn = A1 \ An for n ≥ 1, and C∞ = A1 \ A.Then Cn and C∞ belong to F and Cn ↑ C∞. By Proposition 1.2.1 (iii)

′b,

(i.e., by the m.c.f.b. property), µ(Cn) ↑ µ(C∞) and by (iii)′a, (i.e., finite

additivity), µ(Cn) = µ(A1) − µ(An) for all 1 ≤ n < ∞, due to the factµ(A1) < ∞. This proves (i).

Page 34: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.3 The extension theorems and Lebesgue-Stieltjes measures 19

To prove part (ii), let Dn =⋃n

i=1 Ai, n ≥ 1. Then, Dn ↑ D ≡⋃

i≥1 Ai.Hence, by m.c.f.b. and finite subadditivity,

µ(D) = limn→∞ µ(Dn) ≤ lim

n→∞

n∑i=1

µ(Ai) =∞∑

n=1

µ(An).

Theorem 1.2.4: (Uniqueness of measures). Let µ1 and µ2 be two finitemeasures on a measurable space (Ω,F). Let C ⊂ F be a π-system suchthat F = σ〈C〉. If µ1(C) = µ2(C) for all C ∈ C and µ1(Ω) = µ2(Ω), thenµ1(A) = µ2(A) for all A ∈ F .

Proof: Let L ≡ A : A ∈ F , µ1(A) = µ2(A). It is easy to verify that Lis a λ-system. Since C ⊂ L, by Theorem 1.1.2, L = σ〈C〉 = F .

1.3 The extension theorems and Lebesgue-Stieltjesmeasures

As discussed earlier, in many situations, one starts with a given set func-tion µ defined on a small class C of subsets of a set Ω and then wants toextend µ to a larger class M by some approximation procedure. In thissection, a general result in this direction, known as the extension theorem,is established. This is then applied to the construction of Lebesgue-Stieltjesmeasures on Euclidean spaces. For another application, see Chapter 6.

1.3.1 Caratheodory extension of measuresDefinition 1.3.1: Let Ω be a nonempty set and let P(Ω) be the power setof Ω. A class C ⊂ P(Ω) is called a semialgebra if (i) A, B ∈ C ⇒ A∩B ∈ Cand (ii) for any A ∈ C, there exist sets B1, B2, . . . , Bk ∈ C, for some 1 ≤k < ∞, such that Bi ∩Bj = ∅ for i = j, and Ac =

⋃ki=1 Bi.

Example 1.3.1: Ω = R, C ≡ (a, b], (b,∞) : −∞ ≤ a, b < ∞.

Example 1.3.2: Ω = R, C ≡ I : I is an interval. An interval I in R isa set in R such that a, b ∈ I, a < b ⇒ (a, b) ⊂ I.

Example 1.3.3: Ω = Rk, C ≡ I1 × I2 × ×Ik : Ij is an interval in R for1 ≤ j ≤ k.

Recall that a collection A ⊂ P(Ω) is an algebra if it is closed underfinite union and complementation. It is easily verified (Problem 1.19) thatthe smallest algebra containing a semialgebra C is A(C) ≡ A : A =⋃k

i=1 Bi, Bi ∈ C for i = 1, . . . , k, k < ∞, , i.e., the class of finite unions ofsets from C.

Page 35: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

20 1. Measures

Definition 1.3.2: A set function µ on a semialgebra C, taking values inR+ ≡ [0,∞], is called a measure if (i) µ(∅) = 0 and (ii) for any sequenceof sets Ann≥1 ⊂ C with

⋃n≥1 An ∈ C, and Ai ∩ Aj = ∅ for i = j,

µ(⋃

n≥1 An) =∑∞

n=1 µ(An).

Proposition 1.3.1: Let µ be a measure on a semialgebra C. Let A ≡ A(C)be the smallest algebra generated by C. For each A ∈ A, set

µ(A) =k∑

i=1

µ(Bi),

if the set A has the representation A =⋃k

i=1 Bi for some B1, . . . , Bk ∈C, k < ∞ with Bi ∩Bj = ∅ for i = j. Then,

(i) µ is independent of the representation of A as A =⋃k

i=1 Bi;

(ii) µ is finitely additive on A, i.e., A, B ∈ A, A ∩ B = ∅ ⇒ µ(A ∪ B) =µ(A) + µ(B); and

(iii) µ is countably additive on A, i.e., if An ∈ A for all n ≥ 1, Ai∩Aj = ∅for all i = j, and

⋃n≥1 An ∈ A, then

µ( ⋃

n≥1

An

)=

∞∑n=1

µ(An).

Proof: Parts (i) and (ii) are easy to verify. Turning to part (iii), let eachn ≥ 1, An =

⋃kn

j=1 Bnj , Bnj ∈ C, Bnjknj=1 disjoint. Since

⋃n≥1 An ∈ A

then exist disjoint sets Biki=1 ⊂ C such that

⋃n≥1 An =

⋃ki=1 Bi. Now

Bi = Bi ∩( ⋃

n≥1

An

)=⋃n≥1

(Bi ∩An)

=⋃n≥1

kn⋃j=1

(Bi ∩Bnj).

Since for all i, Bi ∈ C, Bi ∩Bnj ∈ C for all j, n and µ is a measure on C

µ(Bi) =∑n≥1

kn∑j=1

µ(Bi ∩Bnj).

Thus,

µ( ⋃

n≥1

An

)=

k∑i=1

µ(Bi) =k∑

i=1

∑n≥1

( kn∑j=1

µ(Bi ∩Bnj))

Page 36: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.3 The extension theorems and Lebesgue-Stieltjes measures 21

=∑n≥1

( k∑i=1

kn∑j=1

µ(Bi ∩Bnj))

=∑n≥1

µ(An),

since

An = An ∩k⋃

i=1

Bi

=k⋃

i=1

kn⋃j=1

(Bi ∩Bnj).

Thus, the set function µ defined above is a measure on A. To go beyondA = A(C) to σ〈A〉, which is also the same as σ〈C〉 (see Problem 1.19 (ii)),the first step involves using the given set function µ on C to define a setfunction µ∗ on the class of all subsets of Ω, i.e., on P(Ω), where for anyA ∈ P(Ω), µ∗(A) is an estimate ‘from above’ of the µ-content of A.

Definition 1.3.3: Given a measure µ on a semialgebra C, the outer mea-sure induced by µ is the set function µ∗, defined on P(Ω), as

µ∗(A) ≡ inf ∞∑

n=1

µ(An) : Ann≥1 ⊂ C, A ⊂⋃n≥1

An

. (3.1)

Thus, a given set A is covered by countable unions of sets from C andthe sums of the measures on such covers are computed and µ∗(A) is thegreatest lower bound one can get in this way. It is not difficult to showthat on C and A, this is not an overestimate. That is, µ∗ = µ on C andon A, µ∗ = µ. Now suppose µ(Ω) < ∞. Let A ⊂ Ω be a set such that theoverestimates of the µ-contents of A and Ac, namely, µ∗(A) and µ∗(Ac),add up to µ(Ω), i.e.,

µ(Ω) = µ∗(Ω) = µ∗(A) + µ∗(Ac). (3.2)

Then, there is no room for error and the estimates µ∗(A) and µ∗(Ac) arein fact not overestimates at all. In this case, the set A may be considered‘exactly measurable’ by this approximation procedure. This observationmotivates the following definition.

Definition 1.3.4: A set A is said to be µ∗-measurable if

µ∗(E) = µ∗(E ∩A) + µ∗(E ∩Ac) for all E ⊂ Ω. (3.3)

In other words, an analog of (3.2) should hold in every portion E of Ω fora set A to be µ∗-measurable.

Page 37: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

22 1. Measures

It can be shown (Problem 1.20) that µ∗ defined in (3.1) satisfies:

µ∗(∅) = 0, (3.4)A ⊂ B ⇒ µ∗(A) ≤ µ∗(B), (3.5)

and for any Ann≥1 ⊂ P(Ω),

µ∗( ⋃

n≥1

An

)≤

∞∑n=1

µ∗(An) . (3.6)

Definition 1.3.5: Any set function µ∗ : P(Ω) → R+ ≡ [0,∞] satisfying(3.4)–(3.6) is called an outer measure on Ω.

The following result (due to C. Caratheodory) yields a measure spaceon Ω starting from a general outer measure µ∗ that need not arise from ameasure µ as in (3.1).

Theorem 1.3.2: Let µ∗ be an outer measure on Ω, i.e., it satisfies (3.4)–(3.6). Let M ≡ Mµ∗ ≡ A : A is µ∗-measurable, i.e., A satisfies (3.3).Then

(i) M is a σ-algebra,

(ii) µ∗ restricted to M is a measure, and

(iii) µ∗(A) = 0 ⇒ P(A) ⊂M.

Proof: From (3.3), it follows that ∅ ∈ M and that A ∈ M ⇒ Ac ∈ M.Next, it will be shown thatM is closed under finite unions. Let A1, A2 ∈M.Then, for any E ⊂ Ω,

µ∗(E) = µ∗(E ∩A1) + µ∗(E ∩Ac1) (since A1 ∈M)

= µ∗(E ∩A1 ∩A2) + µ∗(E ∩A1 ∩Ac2)

+µ∗(E ∩Ac1 ∩A2) + µ∗(E ∩Ac

1 ∩Ac2) (since A2 ∈M).

But (A1 ∩A2)∪ (A1 ∩Ac2)∪ (Ac

1 ∩A2) = A1 ∪A2. Since µ∗ is subadditive,it follows that

µ∗(E ∩ (A1∪A2)) ≤ µ∗(E ∩A1∩A2)+µ∗(E ∩A1∩Ac2)+µ∗(E ∩Ac

1∩A2).

Thusµ∗(E) ≥ µ∗(E ∩ (A1 ∪A2)) + µ∗(E ∩ (A1 ∪A2)c).

The subadditivity of µ∗ yields the opposite inequality and so, A1∪A2 ∈Mand hence, M is an algebra. To show that M is a σ-algebra, it suffices toshow that M is closed under monotone unions, i.e., An ∈ M, An ⊂ An+1for all n ≥ 1 ⇒ A ≡

⋃n≥1 An ∈ M. Let B1 = A1 and Bn = An ∩ Ac

n−1

Page 38: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.3 The extension theorems and Lebesgue-Stieltjes measures 23

for all n ≥ 2. Then, for all n ≥ 1, Bn ∈ M (since M is an algebra),⋃nj=1 Bj = An, and

⋃∞j=1 Bj = A. Hence, for any E ⊂ Ω,

µ∗(E) = µ∗(E ∩An) + µ∗(E ∩Acn)

= µ∗(E ∩An ∩Bn) + µ∗(E ∩An ∩Bcn) + µ∗(E ∩Ac

n)(since Bn ∈M)

= µ∗(E ∩Bn) + µ∗(E ∩An−1) + µ∗(E ∩Acn)

=n∑

j=1

µ∗(E ∩Bj) + µ∗(E ∩Acn) (by iteration)

≥n∑

j=1

µ∗(E ∩Bj) + µ∗(E ∩Ac) (by monotonicity).

Now letting n → ∞, and using the subadditivity of µ∗ and the fact that⋃∞j=1 Bj = A, one gets

µ∗(E) ≥∞∑

j=1

µ∗(E ∩Bj) + µ∗(E ∩Ac)

≥ µ∗(E ∩A) + µ∗(E ∩Ac).

This completes the proof of part (i).To prove part (ii), let Bnn≥1 ⊂ M and Bi ∩ Bj = ∅ for i = j. Let

Aj =⋃∞

i=j Bi, j ∈ N. Then, by (i), Aj ∈M for all j ∈ N and so

µ∗(A1) = µ∗(A1 ∩B1) + µ∗(A1 ∩Bc1) (since B1 ∈M)

= µ∗(B1) + µ∗(A2)= µ∗(B1) + µ∗(B2) + µ∗(A3) (by iteration)

=n∑

i=1

µ∗(Bi) + µ∗(An+1) (by iteration)

≥n∑

i=1

µ∗(Bi) for all n ≥ 1.

Now letting n → ∞, one has µ∗(A1) ≥∑∞

i=1 µ∗(Bi). By subadditivity ofµ∗, the opposite inequality holds and so

µ∗(A1) = µ∗( ∞⋃

i=1

Bi

)=

∞∑i=1

µ∗(Bi),

proving (ii).As for (iii), note that by monotonicity, µ∗(A) = 0, B ⊂ A ⇒ µ∗(B) = 0

and hence, for any E, µ∗(E ∩ B) = 0. Since µ∗(E) ≥ µ∗(E ∩ Bc), this

Page 39: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

24 1. Measures

implies µ∗(E) ≥ µ∗(E ∩Bc)+µ∗(E ∩B). The opposite inequality holds bythe subadditivity of µ∗. So B ∈M, and (iii) is proved.

Definition 1.3.6: A measure space (Ω,F , ν) is called complete if for anyA ∈ F with ν(A) = 0 ⇒ P(A) ⊂ F .

Thus, by part (iii) of the above theorem, (Ω,Mµ∗ , µ∗) is a completemeasure space. Now the above theorem is applied to a µ∗ that is generatedby a given measure µ on a semialgebra C via (3.1).

Theorem 1.3.3: (Caratheodory’s extension theorem). Let µ be a measureon a semialgebra C and let µ∗ be the set function induced by µ as definedby (3.1). Then,

(i) µ∗ is an outer measure,

(ii) C ⊂ Mµ∗ , and

(iii) µ∗ = µ on C, where Mµ∗ is as in Theorem 1.3.2.

Proof: The proof of (i) involves verifying (3.4)–(3.6), which is left asan exercise (Problem 1.20). To prove (ii), let A ∈ C. Let E ⊂ Ω andAnn≥1 ⊂ C be such that E ⊂

⋃n≥1 An. Then, for all i ∈ N, Ai =

(Ai ∩A)∪ (Ai ∩B1)∪ . . .∪ (Ai ∩Bk) where B1, . . . , Bk are disjoint sets inC such that

⋃kj=1 Bj = Ac. Since µ is finitely additive on C,

µ(Ai) = µ(Ai ∩A) +k∑

j=1

µ(Ai ∩Bj)

⇒∞∑

n=1

µ(An) =∞∑

n=1

µ(An ∩A) +∞∑

n=1

k∑j=1

µ(An ∩Bj)

≥ µ∗(E ∩A) + µ∗(E ∩Ac),

since An ∩ An≥1 and An ∩ Bj : 1 ≤ j ≤ k, n ≥ 1 are both countablesubcollections of C whose unions cover E∩A and E∩Ac, respectively. Fromthe definition of µ∗(E), it now follows that

µ∗(E) ≥ µ∗(E ∩A) + µ∗(E ∩Ac).

Now the subadditivity of µ∗ completes the proof of part (ii).To prove (iii), let A ∈ C. Then, by definition, µ∗(A) ≤ µ(A). If µ∗(A) =

∞, then µ(A) = ∞ = µ∗(A). If µ∗(A) < ∞, then by the definition of‘infimum,’ for any ε > 0, there exists Ann≥1 ⊂ C such that A ⊂

⋃n≥1 An

and

µ∗(A) ≤∞∑

n=1

µ(An) ≤ µ∗(A) + ε.

Page 40: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.3 The extension theorems and Lebesgue-Stieltjes measures 25

But A = A ∩ (⋃

n≥1 An) =⋃

n≥1(A ∩ An). Note that the set function µdefined in Proposition 1.3.1 is a measure on A(C) and it coincides with µon C. Since A, A∩An ∈ C for all n ≥ 1, by Proposition 1.2.3 (b) applied toµ,

µ(A) = µ(A) ≤∞∑

n=1

µ(A ∩An) ≤∞∑

n=1

µ(An) =∞∑

n=1

µ(An) ≤ µ∗(A) + ε.

(Alternately, w.l.o.g., assume that Ann≥1 are disjoint. Then A = A ∩⋃n≥1 An =

⋃n≥1(A ∩ An). Since A ∩ An ∈ C for all n, by the countable

additivity of µ on C, µ(A) =∑∞

n=1 µ(A∩An) ≤∑∞

n=1 µ(An) = µ∗(A)+ε.)

Since ε > 0 is arbitrary, this yields µ(A) ≤ µ∗(A).

Thus, given a measure µ on a semialgebra C ⊂ P(Ω), there is a completemeasure space (Ω,Mµ∗ , µ∗) such that Mµ∗ ⊃ C and µ∗ restricted to Cequals µ. For this reason, µ∗ is called an extension of µ. The measure space(Ω,Mµ∗ , µ∗) is called the Caratheodory extension of µ. Since Mµ∗ is a σ-algebra and contains C, Mµ∗ must contain σ〈C〉, the σ-algebra generatedby C, and thus, (Ω, σ〈C〉, µ∗) is also a measure space. However, the lattermay not be complete (see Section 1.4).

Now the above method is applied to the construction of Lebesgue-Stieltjes measures on R and R2.

1.3.2 Lebesgue-Stieltjes measures on R

Let F : R → R be nondecreasing. For x ∈ R, let F (x+) ≡ limy↓x F (y),and F (x−) ≡ limy↑x F (y). Set F (∞) = limx↑∞ F (x) and F (−∞) =limx↓−∞ F (y). Let

C ≡

(a, b] : −∞ ≤ a ≤ b < ∞∪

(a,∞) : −∞ ≤ a < ∞

. (3.7)

Define

µF ((a, b]) = F (b+)− F (a+),µF ((a,∞)) = F (∞)− F (a+). (3.8)

Then, it may be verified that

(i) C is a semialgebra;

(ii) µF is a measure on C. (For (ii), one needs to use the Heine-Boreltheorem. See Problems 1.22 and 1.23.)

Let (R,Mµ∗F, µ∗

F ) be the Caratheodory extension of µF , i.e., the measurespace constructed as in the above two theorems.

Page 41: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

26 1. Measures

Definition 1.3.7: Let F : R → R be nondecreasing. The (measure) space(R,Mµ∗

F, µ∗

F ) is called a Lebesgue-Stieltjes measure space and µ∗F is the

Lebesgue-Stieltjes measure generated by F .

Since σ〈C〉 = B(R), the class of all Borel sets of R, every Lebesgue-Stieltjes measure µ∗

F is also a measure on (R,B(R)). Note also that µ∗F is

finite on bounded intervals.Conversely, given any Radon measure µ on (R,B(R)), i.e., a measure on

(R,B(R)) that is finite on bounded intervals, set

F (x) =

⎧⎨⎩

µ((0, x]) if x > 00 if x = 0−µ((x, 0]) if x ≤ 0.

Then µF = µ on C. By the uniqueness of the extension (discussed later inthis section, see also Theorem 1.2.4), it follows that µ∗

F coincides with µ onB(R). Thus, every Radon measure on (R,B(R)) is necessarily a Lebesgue-Stieltjes measure.

Definition 1.3.8: (Lebesgue Measure on R). When F (x) ≡ x, x ∈ R,the measure µ∗

F is called the Lebesgue measure and the σ-algebra Mµ∗F

iscalled the class of Lebesgue measurable sets.

The Lebesgue measure will be denoted by m(·) or µL(·). Given beloware some important results on m(·).

(i) It follows from equation (3.1), that µ∗F (x) = F (x+) − F (x−) and

hence = 0 if F is continuous at x. Thus m(x) ≡ 0 on R.

(ii) By countable additivity of m(·), m(A) = 0 for any countable set A.

(iii) (Cantor set). There exists an uncountable set C such that m(C) = 0.An example is the Cantor set constructed as follows: Start with I0 =[0, 1]. Delete the open middle third, i.e.,

(13 , 2

3

). Next from the closed

intervals I11 = [0, 13 ] and I12 = [23 , 1] delete the open middle thirds,

i.e.,(

19 , 2

9

)and

(79 , 2

9

), respectively. Repeat this process of deleting

the middle third from each of the remaining closed intervals. Thusat stage n there will be 2n−1 new closed intervals and 2n−1 deletedopen intervals, each of length 1

3n . Let Un denote the union of all thedeleted open intervals at the nth stage. Then Unn≥1 are disjointopen sets. Let U ≡

⋃n≥1 Un. By countable additivity

m(U) =∞∑

n=1

m(Un) =∞∑

n=1

2n−1

3n= 1.

Let C ≡ [0, 1]−U . Since U is open and [0,1] is closed, C is nonempty.It can be shown that C ≡ x : x =

∑∞1

ai

3i , ai = 0 or 2 (Problem

Page 42: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.3 The extension theorems and Lebesgue-Stieltjes measures 27

1.33). Thus C can be mapped in (1–1) manner on to the set of allsequences δii≥1 such that δi = 0 or 2. But this set is uncountable.Since m([0, 1]) = 1, it follows that m(C) = 0. For more properties ofthe Cantor set, see Rudin (1976) and Chapter 4.

(iv) m(·) is invariant under reflection and translation. That is, for any Ein B(R),

m(−E) = m(E) and m(E + c) = m(E)

for all c in R where −E = −x : x ∈ E and E + c = y : y =x + c, x ∈ E. This follows from Theorem 1.2.4 and the fact that theclaim holds for intervals (cf. Problem 2.48).

(v) There exists a subset A ⊂ R such that A ∈ Mm. That is, there existsa non-Lebesgue measurable set. The proof of this requires the use ofthe axiom of choice (cf. A.1). For a proof see Royden (1988).

1.3.3 Lebesgue-Stieltjes measures on R2

Let F : R2 → R satisfy

F (a2, b2)− F (a2, b1)− F (a1, b2) + F (a1, b1) ≥ 0, (3.9)

andF (a2, b2)− F (a1, b1) ≥ 0, (3.10)

for all a1 ≤ a2, b1 ≤ b2. Extend F to R2 by appropriate limiting procedure.Let

C2 ≡ I1 × I2 : I1, I2 ∈ C, (3.11)

where C is as in (3.7). Next, for I1 = (a1, a2], I2 = (b1, b2], − ∞ <a1, a2, b1, b2 < ∞, set

µF (I1× I2) ≡ F (a2+, b2+)−F (a2+, b1+)− (F (a1+, b2+) + F (a1+, b1+)),(3.12)

where for any a, b ∈ R, F (a+, b+) is defined as

F (a+, b+) ≡ lima′↓a,b′↓b

F (a′, b′).

Note that by (3.10), the limit exists and hence, F (a+, b+) is well defined.Further, by (3.9), the right side of (3.12) is nonnegative. Next extend thedefinition of µF to unbounded sets in C2 by the limiting procedure:

µF (I1 × I2) = limn→∞ µF

((I1 × I2) ∩ Jn

), (3.13)

where Jn = (−n, n]×(−n, n]. Then it may be verified (Problems 1.24, 1.25)that

Page 43: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

28 1. Measures

(i) C2 is a semialgebra

(ii) µF is a measure on C2.

Let (R2,Mµ∗F, µ∗

F ) be the measure space constructed as in the above twotheorems. The measure µ∗

F is called the Lebesgue-Stieltjes measure gener-ated by F and Mµ∗

Fa Lebesgue-Stieltjes σ-algebra. Again, in this case,

Mµ∗ includes the σ-algebra σ〈C〉 ≡ B(R2) and so (R2,B(R2), µ∗F ) is also a

measure space. If F (a, b) = ab, then µF is called the Lebesgue measure onR2.

A similar construction holds for any Rk, k < ∞.

1.3.4 More on extension of measuresNext the uniqueness of the extension µ∗ and approximation of the µ∗-measure of a set in Mµ∗ by that of a set from the algebra A ≡ A(C) areconsidered. As in the case of measures defined on an algebra, a measure µon a semialgebra C ⊂ P(Ω) is said to be σ-finite if there exists a countablecollection Ann≥1 ⊂ C such that (i) µ(An) < ∞ for each n ≥ 1 and (ii)⋃

n≥1 An = Ω. The following approximation result holds.

Theorem 1.3.4: Let A ∈ Mµ∗ and µ∗(A) < ∞. Then, for each ε > 0,there exist B1, B2, . . . , Bk ∈ C, k < ∞ with Bi ∩Bj = ∅ for 1 ≤ i = j ≤ k,such that

µ∗(

Ak⋃

j=1

Bj

)< ε,

where for any two sets E1 and E2, E1 E2 is the symmetric difference ofE1 and E2, defined by E1 E2 ≡ (E1 ∩ Ec

2) ∪ (Ec1 ∩ E2).

Proof: By definition of µ∗, µ∗(A) < ∞ implies that for every ε > 0, thereexist Bnn≥1 ⊂ C such that A ⊂

⋃n≥1 Bn and

µ∗(A) ≤∞∑

n=1

µ(Bn) ≤ µ∗(A) + ε/2 < ∞.

Since Bn ∈ C, Bcn is a finite union of disjoint sets from C. W.l.o.g., it can

be assumed that Bnn≥1 are disjoint. (Otherwise, one can consider thesequence B1, B2 ∩ Bc

1, B3 ∩ Bc2 ∩ Bc

1, . . ..) Next,∑∞

n=1 µ(Bn) < ∞ impliesthat for every ε > 0, there exists k ∈ N such that

∑∞n=k+1 µ(Bn) < ε

2 .Since both A and

⋃kj=1 Bj are subsets of

⋃n≥1 Bn,

µ∗(

A[ k⋃

j=1

Bj

])≤ µ∗

(Ac ∩

[ ∞⋃j=1

Bj

])+ µ∗

( ∞⋃j=k+1

Bj

).

Page 44: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.3 The extension theorems and Lebesgue-Stieltjes measures 29

But since µ∗ is a measure on (Ω,Mµ∗), µ∗(⋃

n≥1 Bn) = µ∗(A) +µ∗((

⋃n≥1 Bn) ∩Ac). Further, since µ∗(A) < ∞,

µ∗([ ⋃

n≥1

Bn

]∩Ac

)

= µ∗( ⋃

n≥1

Bn

)− µ∗(A))

≤∞∑

n=1

µ∗(Bn)− µ∗(A), (since µ∗ is countably subadditive)

=∞∑

n=1

µ(Bn)− µ∗(A), (since µ∗ = µ on C)

2(by the choice of Bnn≥1).

Also, by the definition of k,

µ∗( ∞⋃

j=k+1

Bj

)≤

∞∑j=k+1

µ∗(Bj) =∞∑

j=k+1

µ(Bj) <ε

2.

Thus, µ∗(A [⋃k

j=1 Bj ]) < ε. This completes the proof of the theorem.

Thus, every µ∗-measurable set of finite measure is nearly a finite union ofdisjoint elements from the semialgebra C. This was enunciated by J.E. Lit-tlewood as the first of his three principles of approximation (the other twobeing: every Lebesgue measurable function is nearly continuous (cf. Theo-rem 2.5.12) and every almost everywhere convergent sequence of functionson a finite measure space is nearly uniformly convergent (Egorov’s theorem)cf. Theorem 2.5.11).

One may strengthen Theorem 1.3.4 to prove the following result on regu-larity of Radon measures on

(Rk,B(Rk)

)(cf. Problem 1.32). See also Rudin

(1987), Chapter 2.

Corollary 1.3.5: (Regularity of measures). Let µ be a Radon measure on(Rk,B(Rk)) for some k ∈ N, i.e., µ(A) < ∞ for all bounded sets A ∈ B(Rk).Let A ∈ B(Rk) be such that µ(A) < ∞. Then, for each ε > 0, there exist acompact set K and an open set G such that K ⊂ A ⊂ G and µ(G \K) < ε.

The following uniqueness result can be established by an application ofthe above approximation theorem (Theorem 1.3.4) or by applying the π-λtheorem (Corollary 1.1.3), as in Theorem 1.2.4 (Problem 1.26).

Theorem 1.3.6: (Uniqueness). Let µ be a σ-finite measure on a semi-algebra C. Let ν be a measure on the measurable space (Ω, σ〈C〉) such thatν = µ on C. Then ν = µ∗ on σ〈C〉.

Page 45: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

30 1. Measures

An application of Theorem 1.3.4 yields an useful approximation resultknown as Lusin’s theorem (see Theorem 2.1.3 in Section 2.1) for approxi-mating Borel measurable functions by continuous functions.

1.4 Completeness of measures

Recall from Definition 1.3.6 that a measure space (Ω,F , µ) is called com-plete if for any A ∈ F , µ(A) = 0, B ⊂ A ⇒ B ∈ F . That is, for anyset A in F whose µ measure is zero, all subsets of a set A are also in F .For example, the very construction of the Lebesgue-Stieltjes measure for anondecreasing F on R, discussed in Section 1.3, yields a complete measurespace (R,Mµ∗

F, µ∗

F ). The Borel σ-algebra B(R) is a sub-σ-algebra of Mµ∗F

and (R,B(R), µ∗F ) need not be complete. For example, if µF is the Lebesgue

measure, then the Cantor set C (Section 1.3.2) has measure 0 and henceM ≡ the Lebesgue σ-algebra contains the power set of C and hence hascardinality larger than that of R. It can be shown that the cardinality ofB(R) is the same as that of R (see Hewitt and Stromberg (1965)). Foranother example, if F is a degenerate distribution at 0, i.e., F (x) = 0 forx < 0 and F (x) = 1 for x ≥ 0, then Mµ∗

F= P(R), the power set of

R (Problem 1.28), and hence (R,B(R), µF ) is not complete. The same istrue for any discrete distribution function. However, it is always possibleto complete an incomplete measure space (Ω,F , µ) by adding new sets toF . This procedure is discussed below.

Theorem 1.4.1: Let (Ω,F , µ) be a measure space. Let F ≡ A : B1 ⊂ A ⊂B2 for some B1, B2 ∈ F satisfying µ(B2 \ B1) = 0. For any A ∈ F , setµ(A) = µ(B1) = µ(B2) for any pair of sets B1, B2 ∈ F with B1 ⊂ A ⊂ B2and µ(B2 \B1) = 0. Then

(i) F is a σ-algebra and F ⊂ F ,

(ii) µ is well defined,

(iii) (Ω, F , µ) is a complete measure space and µ = µ on F .

Proof:

(i) Since A ∈ F , there exist B1, B2 ∈ F , B1 ⊂ A ⊂ B2 and µ(B2\B1) =0. Clearly, Bc

2 ⊂ Ac ⊂ Bc1, and Bc

1, Bc2 ∈ F and µ(Bc

1 \Bc2) = µ(B2 \

B1) = 0 and so Ac ∈ F . Next, let An∞n=1 ⊂ F and A =

⋃n≥1 An.

Then, for each n there exist B1n and B2n in F such that B1n ⊂ An ⊂B2n and µ(B2n \B1n) = 0. Let B1 =

⋃n≥1 B1n and B2 =

⋃n≥1 B2n.

Then B1 ⊂ A ⊂ B2, B1, B2 ∈ F and B2 \ B1 ⊂⋃

n≥1(B2n \ B1n)and hence µ(B2 \ B1) ≤

∑∞n=1 µ(B2n \ B1n) = 0. Thus, A ∈ F and

hence F is a σ-algebra. Clearly, F ⊂ F since for every A ∈ F , onemay take B1 = B2 = A.

Page 46: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.5 Problems 31

(ii) Let B1 ⊂ A ⊂ B2, B′1 ⊂ A ⊂ B′

2, B1, B′1, B2, B

′2 ∈ F and µ(B2 \

B1) = 0 = µ(B′2 \B′

1). Then B1 ∪B′1 ⊂ A ⊂ B2 ∩B′

2 and (B2 ∩B′2) \

(B1 ∪B′1) = (B2 ∩B′

2) ∩ (Bc1 ∩B

′c1 ) ⊂ B2 ∩Bc

1. Thus

µ([B2 ∩B′2] \ [B1 ∪B′

1]) = 0.

Hence, µ(B2) = µ(B1) + µ(B2 \B1) = µ(B1) ≤ µ(B1 ∪B′1) = µ(B2 ∩

B′2) ≤ µ(B′

2). By symmetry µ(B′2) ≤ µ(B2) and so µ(B2) = µ(B′

2).But µ(B2) = µ(B1) and µ(B′

2) = µ(B′1) and also all four quantities

agree.

(iii) It remains to show that µ is countably additive and complete onF . Let An∞

n=1 be a disjoint sequence of sets from F and let A =⋃n≥1 An. Let B1nn≥1, B2nn≥1, B1, B2 be as in the proof of (i).

Then, the fact that An∞n=1 are disjoint implies B1n∞

n=1 are alsodisjoint. And since B1 =

⋃n≥1 B1n and µ is a measure on (Ω,F),

µ(B1) ≡∞∑

n=1

(B1n).

Also, by definition of B1n’s, µ(B1n) = µ(An) for all n ≥ 1, and by(i), µ(A) = µ(B1). Thus,

µ(A) = µ(B1) =∞∑

n=1

(B1n) =∞∑

n=1

µ(An),

establishing the countable additivity of µ.

Next, suppose that A ∈ F and µ(A) = 0. Then there exist B1, B2 ∈ Fsuch that B1 ⊂ A ⊂ B2 and µ(B2 \B1) = 0. Further, by definition ofµ, µ(B2) = µ(A) = 0. If D ⊂ A, then ∅ ⊂ D ⊂ B2 and µ(B2 \ ∅) = 0.Therefore, D ∈ F and hence (Ω, F , µ) is complete.

Finally, if A ∈ F , then take B1 = B2 = A and so, µ(A) = µ(B1) =µ(A), and thus, µ = µ on F . Hence, the proof of the theorem iscomplete.

1.5 Problems

1.1 Let Ω be a nonempty set. Show that F ≡ Ω, ∅ and G = P(Ω) ≡A : A ⊂ Ω are both σ-algebras.

1.2 Let Ω be a finite set, i.e., the number of elements in Ω is finite. LetF ⊂ P(Ω) be an algebra. Show that F is a σ-algebra.

1.3 Show that F6 in Example 1.1.4 is a σ-algebra.

Page 47: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

32 1. Measures

1.4 Let Fθ : θ ∈ Θ be a family of σ-algebras on Ω. Show that G ≡⋂θ∈Θ Fθ is also a σ-algebra.

1.5 Let Ω = 1, 2, 3, F1 = 1, 2, 3,Ω, ∅, F2 = 1, 2, 3,Ω, ∅.Verify that F1 and F2 are both algebras (in fact, σ-algebras) butF1 ∪ F2 is not an algebra.

1.6 Let Ω be a nonempty set and let A ≡ Ai : i ∈ N be a partition of Ω,i.e., Ai ∩ Aj = ∅ for all i = j and

⋃n≥1 An = Ω. Let F =

⋃i∈J Ai :

J ⊂ N where, for J = ∅,⋃

i∈J Ai ≡ ∅. Show that F is a σ-algebra.

1.7 Let Ω be a nonempty set and let B ≡ Bi : 1 ≤ i ≤ k < ∞ ⊂ P(Ω),B not necessarily a partition. Find σ〈B〉.

(Hint: For each δ = (δ1, δ2, . . . , δk), δi ∈ 0, 1 let Bδ =⋂k

i=1 Bi(δi),where Bi(0) = Bc

i and Bi(1) = Bi, i ≥ 1. Show that σ(B) = E :E =

⋃δ∈J Bδ, J ⊂ 1, 2, . . . , k.)

1.8 Let Ω ≡ 1, 2, . . . = N and Ai ≡ j : j ∈ N, j ≥ i, i ∈ N. Show thatσ〈A〉 = P(Ω) where A = Ai : i ∈ N.

1.9 (a) Show that every open set A ⊂ R is a countable union of openintervals.

(Hint: Use the fact that the set Q of all rational numbers isdense in R.)

(b) Extend the above to Rk for any k ∈ N.

(c) Strengthen (a) to assert that the open intervals in (a) can bechosen to be disjoint.

1.10 Show that in Example 1.1.6, Oj ⊂ σ〈Oi〉 for all 1 ≤ i, j ≤ 4.

1.11 For k ∈ N, let O5 ≡ (a1, b1]× . . .× (ak, bk] : −∞ < ai < bi < ∞, 1 ≤i ≤ k and O6 ≡ [a1, b1] × . . . × [ak, bk] : −∞ < ai < bi < ∞, 1 ≤i ≤ k. Show that σ〈O5〉 = σ〈O6〉 = B(Rk).

1.12 Let OS ≡ x : x ∈ Rk be the class of all singletons in Rk, k ∈ N.Show that σ〈OS〉 is properly contained in B(Rk).

(Hint: Show that σ〈OS〉 coincides with F6 of Example 1.1.4).

1.13 Let R ≡ R ∪ +∞ ∪ −∞ be the extended real line. The Borelσ-algebra on R, denoted by B(R), is defined as the σ-algebra on Rgenerated by the collection B(R) ∪ ∞ ∪ −∞.

(a) Show that B(R) =A ∪B : A ∈ B(R), B ⊂ −∞,∞

.

(b) Show, however, that the σ-algebra on R generated by B(R) isgiven by σ〈B(R)〉 =

A ∪ B : A ∈ B(R), B = −∞,∞ or

B = ∅.

Page 48: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.5 Problems 33

1.14 Let A1,A2 and A3 denote, respectively, the class of triangles, discs,and pentagons in R2. Show that σ〈Ai〉 ≡ B(R2). Thus, the σ-algebraB(R2) (and similarly B(Rk)) can be generated by starting with vari-ous classes of sets of different shapes and geometry.

1.15 Let Ω be a nonempty set and B be a σ-algebra on Ω. Let A ⊂ Ωand BA ≡ B ∩ A : B ∈ B. Show that BA is a σ-algebra on A. Theσ-algebra BA is called the trace σ-algebra of B on A.

1.16 Let Ω = R and F be the collection of all finite unions of disjointintervals of the form (a, b]∩R, −∞ ≤ a < b ≤ ∞. Show that F is analgebra but not a σ-algebra.

1.17 Let Ω be a nonempty set and Aii∈N be a sequence of subsets of Ωsuch that Ai+1 ⊂ Ai for all i ∈ N. Verify that A = Ai : i ∈ N is aπ-system and also determine λ〈A〉, the λ-system generated by A.

1.18 Let Ω ≡ N, F = P(Ω), and An = j : j ∈ N, j ≥ n, n ∈ N. Letµ be the counting measure on (Ω,F). Verify that limn→∞ µ(An) =µ(⋂

n≥1 An).

1.19 Let Ω be a nonempty set and let C ⊂ P(Ω) be a semialgebra. Let

A(C) ≡

A : A =k⋃

i=1

Bi : Bi ∈ C, i = 1, 2, . . . , k, k ∈ N

.

(a) Show that A(C) is the smallest algebra containing C.(b) Show also that σ〈C〉 = σ〈A(C)〉.

1.20 Let µ∗ be as in (3.1) of Section 1.3. Verify (3.4)–(3.6).

(Hint: Fix 0 < ε < ∞. If µ∗(An) < ∞ for all n ∈ N, then find, foreach n, a cover Anjj≥1 ⊂ C such that µ∗(An) ≤

∑∞j=1 µ(Anj)+ ε

2n .)

1.21 Prove Proposition 1.3.1.

1.22 Let F : R → R be nondecreasing. Let (a, b], (an, bn], n ∈ N be in-tervals in R such that (a, b] =

⋃n≥1(an, bn] and (an, bn] : n ≥ 1

are disjoint. Let µF (·) be as in (3.8). Show that µF ((a, b]) =∑∞n=1 µF ((an, bn]) by completing the following steps:

(a) Let G(x) ≡ F (x+) for all x ∈ R and let G(±∞) = F (±∞).Verify that G(·) is nondecreasing and right continuous on R andthat for any A in C, µF (A) = µG(A).

(b) In view of (a), assume w.l.o.g. that F (·) is right continuous.Show that for any k ∈ N,

F (b)− F (a) ≥k∑

i=1

(F (bi)− F (ai)),

Page 49: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

34 1. Measures

and conclude that

F (b)− F (a) ≥∞∑

i=1

(F (bi)− F (ai)).

(c) To prove the reverse inequality, fix η > 0. Choose c > a anddn > bn, n ≥ 1 such that such that F (c) − F (a) < η, [c, b] ⊂⋃

n≥1(an, dn) and F (dn) − F (bn) < η/2n for all n ∈ N. Next,apply the Heine-Borel theorem to the interval [c, b] and the opencover (an, dn)n≥1 and extract a finite cover (ai, di)k

i=1 for[c, b]. W.l.o.g., assume that c ∈ (a1, d1) and b ∈ (ak, dk). Nowverify that

F (b)− F (a) ≤k∑

j=1

(F (bj)− F (aj)) + 2η

≤∞∑

j=1

(F (bj)− F (aj)) + 2η.

1.23 Extend the above arguments to the case when (a, b] and (ai, bi], i ≥ 1are not necessarily bounded intervals.

1.24 Verify that C2, defined in (3.11), is a semialgebra.

1.25 (a) Verify that the limit in (3.13) exists.

(b) Extend the arguments in Problems 1.22 and 1.23 to verify thatµF of (3.12) and (3.13) is a measure on C2.

1.26 Establish Theorem 1.3.6 by completing the following:

(a) Suppose first that ν(Ω) < ∞. Verify that L ≡ A : A ∈ σ〈C〉,µ∗(A) = ν(A) is a λ system and use the π-λ theorem.

(b) Extend the above to the σ-finite case.

1.27 Prove Corollary 1.3.5 for Lebesgue measure m(·).

1.28 Let F be a discrete distribution function, i.e., F is of the form

F (x) ≡∞∑

j=1

ajI(xj ≤ x), x ∈ R,

where 0 < aj < ∞,∑

j≥1 aj = 1, xj ∈ R, j ≥ 1. Show that Mµ∗F

=P(R).

(Hint: Show that µ∗F (Ac) = 0, where A ≡ xj : j ≥ 1, and use the

fact that for any B ⊂ R, B ∩A ∈ B(R).)

Page 50: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.5 Problems 35

1.29 Let

F (x) =

⎧⎨⎩

0 for x < 0x for 0 ≤ x ≤ 11 for x > 0.

Show that Mµ∗F≡ A : A ∈ P(R), A ∩ [0, 1] ∈ M, where M is the

σ-algebra of Lebesgue measurable sets as in Definition 1.3.7.

1.30 Let F (·) = 12Φ(·)+ 1

2FP (·) where Φ(·) is the standard normal cdf, i.e.,Φ(x) ≡

∫ x

−∞1√2π

e−u2/2Au and FP (x) ≡∑∞

k=0 e−2 2k

k! I(k)(−∞,x], x ∈ R.

Let F1 = Φ, F2 = FP and F3 = F . Let A1 = (0, 1), A2 = x : x ∈R, sin x ∈ (0, 1

2 ), A3 = x : for some integers a0, a1, . . . , ak, k < ∞,∑ki=0 aix

i = 0, the set of all algebraic numbers. Compute µFi(Aj),

1 ≤ i, j ≤ 3.

1.31 Let µ be a measure on a semialgebra C ⊂ P(Ω) where Ω is a nonemptyset. Let µ∗ be the outer measure generated by µ and let Mµ∗ be theσ-algebra of µ∗-measurable sets as defined in Theorem 1.3.3.

(a) Show that for all A ⊂ Ω, there exists a B ∈ σ〈C〉 such thatA ⊂ B and µ∗(A) = µ∗(B).

(Hint: If µ∗(A) = ∞, take B to be Ω. If µ∗(A) < ∞, usethe definition of µ∗ to show that for each n ≥ 1, there ex-ists Bnjj≥1 ⊂ C such that A ⊂ Bn ≡

⋃j≥1 Bnj , µ∗(A) ≤∑∞

j=1 µ(Bnj) < µ∗(A) + 1n . Take B =

⋂n≥1 Bn.)

(b) Show that for all A ∈ Mµ∗ with µ∗(A) < ∞, there exists B ∈σ〈C〉 such that A ⊂ B and µ∗(B \A) = 0.

(Hint: Use (a) and the relation B = A ∪ (B \ A) with A andB \A = B ∩Ac in Mµ∗ .)

(c) Show that if µ is σ-finite (i.e., there exist sets Ωn, n ≥ 1 in Cwith µ(Ωn) < ∞ for all n ≥ 1 and

⋃n≥1 Ωn = Ω), then in (b),

the hypothesis that µ∗(A) < ∞ can be dropped.

(Hint: Assume w.l.o.g. that Ωnn≥1 are disjoint. Apply (b) toAn ≡ A ∩ Ωnn≥1.)

(d) Show that if µ is σ-finite, then A ∈ Mµ∗ iff there exist setsB1, B2 ∈ σ〈C〉 such that B1 ⊂ A ⊂ B2 and µ∗(B2 \B1) = 0.

(Hint: Apply (c) to both A and Ac.)

This shows that Mµ∗ is the completion of σ〈C〉 w.r.t. µ∗.

1.32 (An outline of a proof of Corollary 1.3.5). Let (R,Mµ∗F, µ∗

F ) be aLebesgue Stieltjes measure space generated by a right continuous andnondecreasing function F : R → R.

Page 51: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

36 1. Measures

(a) Show that A ∈Mµ∗F

iff there exist Borel sets B1 and B2 ∈ B(R)such that B1 ⊂ A ⊂ B2 and µ∗

F (B2 \B1) = 0.

(Hint: Take C to be the semialgebra C = (a, b] : −∞ ≤ a ≤b < ∞∪(b,∞) : −∞ ≤ b < ∞ and apply Problem 1.31 (d).)

(b) Let A ∈Mµ∗F

with µ∗F (A) < ∞. Show that for any ε > 0, there

exist a finite number of bounded open intervals Ij , j = 1, 2, . . . , k

such that µ∗F (A

⋃kj=1 Ij) < ε.

(Outline: Claim: For any B ∈ C with µF (B) < ∞, there existsan open interval I such that µ∗

F (I B) < ε.

To see this, note that if B = (a, b], − ∞ ≤ a < b < ∞,then one may choose b′ > b such that F (b′) − F (b) < ε. Now,with I = (a, b′), µ∗

F (I B) = µ∗F ((b, b′)) = µF ((b, b′)) ≤

F (b′)− F (b) < ε. If B = (b,∞) and µF (B) < ∞, then there ex-ists b′ > b such that F (∞)−F (b′−) < ε. Hence, with I = (b, b′),µ∗

F (IB) = µ∗F ([b′,∞)) = F (∞)−F (b′−) < ε. This proves the

claim.

Next, By Theorem 1.3.4, for all ε > 0, there existB1, B2, . . . , Bk ∈ C such that µ∗

F (A⋃k

j=1 Bj) < ε/2. For eachBj , find Ij , a bounded open interval such that µ∗

F (BjIj) < ε2j .

Since (A1 ∪ A2) (C1 ∪ C2) ⊂ (A1 C1) ∪ (A2 C2) for anyA1, A2, C1, C2 ⊂ Ω, it follows that

µ∗F

([ k⋃j=1

Bj

][ k⋃

j=1

Ij

])<

k∑j=1

µ∗F (Bj Ij) <

ε

2.

Hence, µ∗F (A [

⋃kj=1 Ij ]) < ε.)

(c) Let A ∈ Mµ∗F

with µ∗F (A) < ∞. Show that for every ε > 0,

there exists an open set O such that A ⊂ O and µ∗F (O \A) < ε.

(Hint: By definition of µ∗F , for every ε > 0, there ex-

ist Bjj≥1 ⊂ C such that A ⊂⋃

j≥1 Bj and µ∗F (A) ≤∑∞

j=1 µF (Bj) ≤ µ∗F (A) + ε. Now as in (b), there exist open

intervals Ij such that Bj ⊂ Ij and µ∗F (Ij \ Bj) < ε/2j for all

j ≥ 1. Then A ⊂⋃∞

j=1 Bj ⊂⋃∞

j=1 Ij ≡ O. Also, µ∗F (O) =

µ∗F (A) + µ∗

F (O \A) ⇒ µ∗F (O \A) = µ∗

F (O)− µ∗(A) < 2ε (sinceµ∗(O) ≤

∑∞j=1 µ∗(Ij) =

∑∞j=1 µ∗

F (Bj) + ε ≤ µ∗F (A) + 2ε < ∞).)

(d) Extend (c) to all A ∈Mµ∗F.

(Hint: Let Ai = A ∩ [i, i + 1], i ∈ Z. Apply (c) to Ai withεi = ε

2|i|+1 and take unions.)(e) Show that for all A ∈ Mµ∗

Fand for all ε > 0, there exist a

closed set C and an open set O such that C ⊂ A ⊂ O and

Page 52: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

1.5 Problems 37

µ∗F (O \A) ≤ ε, µ∗(A \ C) < ε.

(Hint: Apply (d) to A and Ac.)(f) Show that for all A ∈ Mµ∗

Fwith µ∗

F (A) < ∞ and for all ε > 0,there exist a closed and bounded set F ⊂ A such that µ∗

F (A \F ) < ε and an open set O with A ⊂ O such that µ∗

F (O \A) < ε.

(Hint: Apply (d) to A ∩ [−M,M ] where M is chosen so thatµ∗

F (A ∩ [−M,M ]c) < ε. Why is this possible?)Remark: Thus for all A ∈ Mµ∗

Fwith µ∗

F (A) < ∞ and for allε > 0, there exist a compact set K ⊂ A and an open set O ⊃ Asuch that µ∗

F (A \K) < ε and µ∗F (O \A) < ε. The first property

is called inner regularity of µ∗F and the second property is called

outer regularity of µ∗F .

(g) Show that for all A ∈ Mµ∗F

with µ∗F (A) < ∞ and for all ε > 0,

there exists a continuous function gε with compact support (i.e.,gε(x) is zero for |x| large) such that

µ∗F (A g−1

ε 1) < ε.

(Hint: For any bounded open interval (a, b), let η > 0 be suchthat µF ((a, a + η]) + µF ([b− η, b)) < ε. Next define

gε(x) =

⎧⎨⎩

1 if a + η ≤ x ≤ b− η0 if x ∈ (a, b)linear over [a, a + η] ∪ [b− η, b].

Then gε is continuous with compact support. Also, g−11 =[a + η, b − η] and (a, b) g−11 = (a, a + η) ∪ (b − η, b). SoµF (a, b) g−1

ε (1)| < ε, proving the claim for A = (a, b). Thegeneral case follows from (b).)

(h) Show that for all A ∈ Mµ∗F

and for all ε > 0, there exists acontinuous function gε (not necessarily with compact support)such that µ∗

F (Ag−1ε 1) < ε (i.e., drop the condition µ∗

F (A) <∞).

(Hint: Let Ak = A ∩ [k, k + 1], k ∈ Z. Find gk : R → Rcontinuous with support in

(k − 1

8 , k + 98

)such that µ∗

F (IAk=

gk) < ε2|k|+1 . Let g =

∑k∈Z gk. Note that for any x ∈ R, at

most two gk(x) = 0 and so g is continuous. Also, µ∗F (IA = g) ≤∑

k∈Z µ∗F (IAk

= gk) < ε.)

1.33 Let C be the Cantor set in [0,1] as defined in Section 1.3.2.

(a) Show that

C =

x : x =∞∑

i=1

ai

3i, ai ∈ 0, 2

.

Page 53: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

38 1. Measures

and hence that C is uncountable.

(b) Show that

C + C ≡ x + y : x, y ∈ C = [0, 2].

Page 54: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

2Integration

2.1 Measurable transformations

Oftentimes, one is not interested in the full details of a measure space(Ω,F , µ) but only in certain functions defined on Ω. For example, if Ω rep-resents the outcomes of 10 tosses of a fair coin, one may only be interestedin knowing the number of heads in the 10 tosses. It turns out that to assignmeasures (probabilities) to sets (events) involving such functions, one canallow only certain functions (called measurable functions) that satisfy some‘natural’ restrictions, specified in the following definitions.

Definition 2.1.1: Let Ω be a nonempty set and let F be a σ-algebra onΩ. Then the pair (Ω,F) is called a measurable space. If µ is a measure on(Ω,F), then the triplet (Ω,F , µ) is called a measure space. If in addition,µ is a probability measure, then (Ω,F , µ) is called a probability space.

Definition 2.1.2:

(a) Let (Ω,F) be a measurable space. Then a function f : Ω to R is called〈F ,B(R)〉-measurable (or F-measurable) if for each a in R

f−1((−∞, a])≡ ω : f(ω) ≤ a ∈ F . (1.1)

(b) Let (Ω,F , P ) be a probability space. Then a function X : Ω → R iscalled a random variable, if the event

X−1((−∞, a]) ≡ ω : X(ω) ≤ a ∈ F

Page 55: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

40 2. Integration

for each a in R, i.e., a random variable is a real valued F-measurablefunction on a probability space (Ω,F , P ).

It will be shown later that condition (1.1) on f is equivalent to thestronger condition that f−1(A) ∈ F for all Borel sets A ∈ B(R). Since forany Borel set A ∈ B(R), f−1(A) is a member of the underlying σ-algebraF , one can assign a measure to the set f−1(A) using a measure µ on (Ω,F).Note that for an arbitrary function T from Ω → R, T−1(A) need not bea member of F and hence such an assignment may not be possible. Thus,condition (1.1) on real valued mappings is a ‘natural’ requirement whiledealing with measure spaces.

The following definition generalizes (1.1) to maps between two measur-able spaces.

Definition 2.1.3: Let (Ωi,Fi), i = 1, 2 be measurable spaces. Then, amapping T : Ω1 → Ω2 is called measurable with respect to the σ-algebras〈F1,F2〉 (or 〈F1,F2〉-measurable) if

T−1(A) ∈ F1 for all A ∈ F2.

Thus, X is a random variable on a probability space (Ω,F , P ) iff X is〈F ,B(R)〉-measurable. Some examples of measurable transformations aregiven below.

Example 2.1.1: Let Ω = a, b, c, d,F2 = Ω, ∅, a, b, c, d and letF3 = the set of all subsets of Ω. Define the mappings Ti : Ω → Ω, i = 1, 2,by

T1(ω) ≡ a for ω ∈ Ω

and

T2(ω) =

a if ω = a, bc if ω = c, d.

Then, T1 is 〈F2,F3〉-measurable since for any A ∈ F3, T−11 (A) = Ω or

∅ according as a ∈ A or a ∈ A. By similar arguments, it follows thatT2 is 〈F3,F2〉-measurable. However, T2 is not 〈F2,F3〉-measurable sinceT−1

2 (a) = a, b ∈ F2.

As this simple example shows, measurability of a given mapping criticallydepends on the σ-algebras on its domain and range spaces. In general, ifT is 〈F1,F2〉-measurable, then T is 〈F1,F2〉-measurable for any σ-algebraF1 ⊃ F1 and T is 〈F1, F2〉-measurable for any F2 ⊂ F2.

Example 2.1.2: Let T : R → R be defined as

T (x) =

sin 2x if x > 01 + cos x if x ≤ 0.

Page 56: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

2.1 Measurable transformations 41

Is T measurable w.r.t. the Borel σ-algebras 〈B(R),B(R)〉? If one is toapply the definition directly, one must check that T−1(A) ∈ B(R) for allA ∈ B(R). However, finding T−1(A) for all Borel sets A is not an easy task.In many instances like this one, verification of the measurability propertyof a given mapping by directly using the definition can be difficult. Insuch situations, one may use some easy-to-verify sufficient conditions. Someresults of this type are given below.

Proposition 2.1.1: Let (Ωi,Fi), i = 1, 2, 3 be measurable spaces.

(i) Suppose that F2 = σ〈A〉 for some class of subsets A of Ω2. If T :Ω1 → Ω2 is such that T−1(A) ∈ F1 for all A ∈ A, then T is 〈F1,F2〉-measurable.

(ii) Suppose that T1 : Ω1 → Ω2 is 〈F1,F2〉-measurable and T2 : Ω2 →Ω3 is 〈F2,F3〉-measurable. Let T = T2 T1 : Ω1 → Ω3 denote thecomposition of T1 and T2, defined by T (ω1) = T2(T1(ω1)), ω1 ∈ Ω1.Then, T is 〈F1,F3〉-measurable.

Proof:

(i) Define the collection of sets

F = A ∈ F2 : T−1(A) ∈ F1.

Then,

(a) T−1(Ω2) = Ω1 ∈ F1 ⇒ Ω2 ∈ F .

(b) If A ∈ F , then T−1(A) ∈ F1 ⇒ (T−1(A))c ∈ F1 ⇒ T−1(Ac) =(T−1(A))c ∈ F1, implying Ac ∈ F .

(c) If A1, A2, . . . ,∈ F , then, T−1(Ai) ∈ F1 for all i ≥ 1. Since F1is a σ-algebra , T−1(

⋃n≥1 An) =

⋃n≥1 T−1(An) ∈ F1. Thus,⋃

n≥1 An ∈ F . (See also Problem 2.1 on de Morgan’s laws.)

Hence, by (a), (b), (c), F is a σ-algebra and by hypothesis A ⊂ F .Hence, F2 = σ〈A〉 ⊂ F ⊂ F2. Thus, F = F2 and T is 〈F1,F2〉-measurable.

(ii) Let A ∈ F3. Then, T−12 (A) ∈ F2, since T2 is 〈F2,F3〉-measurable.

Also, by the 〈F1,F2〉-measurability of T1, T−1(A) = T−11 (T−1

2 (A)) ∈F1, showing that T is 〈F1,F3〉-measurable.

Proposition 2.1.2: For any k, p ∈ N, if f : Rp → Rk is continuous, thenf is 〈B(Rp),B(Rk)〉-measurable.

Proof: Let A = A : A is an open set in Rk. Then, by the continuity off , f−1(A) is open and hence, is in B(Rp) (cf. Section A.4). Thus, f−1(A) ∈

Page 57: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

42 2. Integration

Rp for all A ∈ A. Since B(Rk) = σ〈A〉, by Proposition 2.1.1 (a), f is〈B(Rp),B(Rk)〉-measurable.

Although the converse to the above proposition is not true, a result dueto Lusin says that except on a set of small measure, f coincides with acontinuous function. This is stronger than the statement that except onset of small measure, f is close to a continuous function. For the statementand proof of Lusin’s theorem, see Theorem 2.5.12.

Proposition 2.1.3: Let f1, . . . , fk (k ∈ N) be 〈F ,B(R)〉-measurable trans-formations from Ω to R. Then,

(i) f = (f1, . . . , fk) is 〈F ,B(Rk)〉-measurable.

(ii) g = f1 + . . . + fk is 〈F ,B(R)〉-measurable.

(iii) h ≡∏k

i=1 fi is 〈F ,B(R)〉-measurable.

(iv) Let p ∈ N and let ψ : Rk → Rp be continuous. Then, ξ ≡ ψ f is〈F ,B(Rp)〉-measurable, where f = (f1, . . . , fk).

Proof: To prove (i), note that for any rectangle R = (a1, b1)×. . .×(ak, bk),

f−1(R) = ω ∈ Ω : a1 < f1(ω) < b1, . . . , ak < fk(ω) < bk

=k⋂

i=1

ω ∈ Ω : ai < f1(ω) < bi

=k⋂

i=1

f−1i (ai, bi) ∈ F ,

since each fi is 〈F ,B(R)〉-measurable. Hence, by Proposition 2.1.1 (i), fis 〈F ,B(Rk)〉-measurable. To prove (ii), note that the function g1(x) ≡x1 + . . . + xk, x = (x1, . . . , xk) ∈ Rk is continuous on Rk, and hence,by Proposition 2.1.2, is 〈B(Rk),B(R)〉-measurable. Since g = g1 f , g is〈F ,B(R)〉-measurable, by Proposition 2.1.1 (ii). The proofs of (iii) and (iv)are similar to that of (ii) and hence, are omitted.

Corollary 2.1.4: The collection of 〈F ,B(R)〉-measurable functions fromΩ to R is closed under pointwise addition and multiplication as well asunder scalar multiplication.

The proof of Corollary 2.1.4 is omitted.

In view of the above, writing the function T of Example 2.1.2 as

T (x) = (sin 2x)I(0,∞)(x) + (1 + cosx)I(−∞,0](x),

x ∈ R, the 〈B(R),B(R)〉-measurability of T follows. Note that here T is notcontinuous over R but only piecewise continuous (see also Problem 2.2).

Page 58: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

2.1 Measurable transformations 43

Next, measurability of the limit of a sequence of measurable functions is considered. Let R̄ = R ∪ {+∞, −∞} denote the extended real line and let B(R̄) ≡ σ〈B(R) ∪ {{+∞}, {−∞}}〉 denote the extended Borel σ-algebra on R̄.

Proposition 2.1.5: For each n ∈ N, let fn : Ω → R̄ be a 〈F ,B(R̄)〉-measurable function.

(i) Then, each of the functions supn∈N fn, infn∈N fn, lim supn→∞ fn, and lim infn→∞ fn is 〈F ,B(R̄)〉-measurable.

(ii) The set A ≡ {ω : limn→∞ fn(ω) exists and is finite} lies in F and the function h ≡ (limn→∞ fn) · IA is 〈F ,B(R)〉-measurable.

Proof:

(i) Let g = supn≥1 fn. To show that g is 〈F ,B(R̄)〉-measurable, it is enough to show that {ω : g(ω) ≤ r} ∈ F for all r ∈ R (cf. Problem 2.4). Now, for any r ∈ R,

{ω : g(ω) ≤ r} = ⋂∞n=1 {ω : fn(ω) ≤ r} = ⋂∞n=1 fn−1((−∞, r]) ∈ F,

since fn−1((−∞, r]) ∈ F for all n ≥ 1, by the measurability of fn.

Next note that infn≥1 fn = − supn≥1(−fn) and hence, infn≥1 fn is 〈F ,B(R̄)〉-measurable. To prove the measurability of lim supn→∞ fn, define the functions gm = supn≥m fn, m ≥ 1. Then, gm is 〈F ,B(R̄)〉-measurable for each m ≥ 1 and since gm is nonincreasing in m, infm≥1 gm ≡ lim supn→∞ fn is also 〈F ,B(R̄)〉-measurable. A similar argument works for lim infn→∞ fn.

(ii) Let h1 = lim supn→∞ fn and h2 = lim infn→∞ fn, and define h̃i = hi IR(hi), i = 1, 2. Note that h̃1 − h̃2 is 〈F ,B(R)〉-measurable. Hence,

{ω : limn→∞ fn(ω) exists and is finite}
= {ω : −∞ < lim supn→∞ fn(ω) = lim infn→∞ fn(ω) < ∞}
= {ω : −∞ < h2(ω) = h1(ω) < ∞}
= {ω : h̃1(ω) = h̃2(ω)} ∩ {ω : h1(ω) < ∞, h2(ω) > −∞}
= (h̃1 − h̃2)−1({0}) ∩ {ω : h1(ω) < ∞, h2(ω) > −∞} ∈ F.

Finally, note that h = h̃1 IA.


Definition 2.1.4: Let {fλ : λ ∈ Λ} be a family of mappings from Ω1 into Ω2 and let F2 be a σ-algebra on Ω2. Then,

σ〈fλ−1(A) : A ∈ F2, λ ∈ Λ〉

is called the σ-algebra generated by {fλ : λ ∈ Λ} (w.r.t. F2) and is denoted by σ〈fλ : λ ∈ Λ〉.

Note that σ〈fλ : λ ∈ Λ〉 is the smallest σ-algebra on Ω1 that makes all the fλ's measurable w.r.t. F2 on Ω2.

Example 2.1.3: Let f = IA for some set A ⊂ Ω1 and Ω2 = R and F2 = B(R). Then,

σ〈f〉 = σ〈A〉 = {Ω1, ∅, A, Ac}.

Example 2.1.4: Let Ω1 = Rk, Ω2 = R, F2 = B(R), and for 1 ≤ i ≤ k, let fi : Ω1 → Ω2 be defined as

fi(x1, . . . , xk) = xi, (x1, . . . , xk) ∈ Ω1 = Rk.

Then, σ〈fi : 1 ≤ i ≤ k〉 = B(Rk).

To show this, note that any measurable rectangle A1 × . . . × Ak can be written as A1 × . . . × Ak = ⋂ki=1 fi−1(Ai) and hence A1 × . . . × Ak ∈ σ〈fi : 1 ≤ i ≤ k〉 for all A1, . . . , Ak ∈ B(R). Since B(Rk) is generated by the collection of all measurable rectangles, B(Rk) ⊂ σ〈fi : 1 ≤ i ≤ k〉. Conversely, for any A ∈ B(R) and for any 1 ≤ i ≤ k, fi−1(A) = R × . . . × A × . . . × R (with A in the ith position) is in B(Rk). Therefore, σ〈fi : 1 ≤ i ≤ k〉 = σ〈fi−1(A) : A ∈ B(R), 1 ≤ i ≤ k〉 ⊂ B(Rk). Hence, σ〈fi : 1 ≤ i ≤ k〉 = B(Rk).

Proposition 2.1.6: Let {fλ : λ ∈ Λ} be an uncountable collection of maps from Ω1 to Ω2. Then for any B ∈ σ〈fλ : λ ∈ Λ〉, there exists a countable set ΛB ⊂ Λ such that B ∈ σ〈fλ : λ ∈ ΛB〉.

Proof: The proof of this result is left as an exercise (Problem 2.5).

2.2 Induced measures, distribution functions

Suppose X is a random variable defined on a probability space (Ω,F , P). Then P governs the probabilities assigned to events like X−1([a, b]), −∞ < a < b < ∞. Since X takes values in the real line, one should be able to express such probabilities only as a function of the set [a, b]. Clearly, since X is 〈F ,B(R)〉-measurable, X−1(A) ∈ F for all A ∈ B(R) and the function

PX(A) ≡ P (X−1(A)) (2.1)


is a set function defined on B(R). Is this a (probability) measure on B(R)? The following proposition answers the question more generally.

Proposition 2.2.1: Let (Ωi,Fi), i = 1, 2, be measurable spaces and let T : Ω1 → Ω2 be a 〈F1,F2〉-measurable mapping from Ω1 to Ω2. Then, for any measure µ on (Ω1,F1), the set function µT−1, defined by

µT−1(A) ≡ µ(T−1(A)), A ∈ F2 (2.2)

is a measure on F2.

Proof: It is easy to check that µT−1 satisfies the three conditions for being a measure. The details are left as an exercise (cf. Problem 2.9).

Definition 2.2.1: The measure µT−1 is called the measure induced by T (or the induced measure of T) on F2.

In particular, if µ(Ω1) = 1, then µT−1(Ω2) = 1. Hence, the set function PX defined in (2.1) is indeed a probability measure on (R,B(R)).

Definition 2.2.2: For a random variable X defined on a probability space (Ω,F , P), the probability distribution of X (or the law of X), denoted by PX (say), is the induced measure of X under P on R, as defined in (2.2).

In introductory courses on probability and statistics, one defines probabilities of events like ‘X ∈ [a, b]’ by using the probability mass function for discrete random variables and the probability density function for ‘continuous’ random variables. The measure-theoretic definition above allows one to treat both these cases as well as the case of ‘mixed’ distributions under a unified framework.

Definition 2.2.3: The cumulative distribution function (or cdf in short) of a random variable X is defined as

FX(x) ≡ PX((−∞, x]), x ∈ R. (2.3)
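As a minimal numerical illustration (in Python) of the induced law in (2.1) and its cdf in (2.3), consider X = T(U) = U2 with U uniform on (0, 1); the transformation and the Monte Carlo approximation below are only an example, and the closed form FX(t) = √t on [0, 1] for this particular T is used solely as a check.

import numpy as np

rng = np.random.default_rng(0)

# U is uniform on (0, 1); X = T(U) = U**2 has law P_X = m T^{-1} (the pushforward).
u = rng.random(100_000)
x = u ** 2

def F_X(t, sample=x):
    # Empirical cdf: the fraction of sample points <= t approximates P_X((-inf, t]).
    return np.mean(sample <= t)

for t in [0.1, 0.25, 0.5, 0.9]:
    # For this T, P_X((-inf, t]) = P(U <= sqrt(t)) = sqrt(t) on [0, 1].
    print(t, F_X(t), np.sqrt(t))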

Proposition 2.2.2: Let F be the cdf of a random variable X.

(i) For x1 < x2, F (x1) ≤ F (x2) (i.e., F is nondecreasing on R).

(ii) For x in R, F (x) = limy↓x F (y) (i.e., F is right continuous on R).

(iii) limx→−∞ F (x) = 0 and limx→+∞ F (x) = 1.

Proof: For x1 < x2, (−∞, x1] ⊂ (−∞, x2]. Since PX is a measure on B(R),

F (x1) = PX((−∞, x1]) ≤ PX((−∞, x2]) = F (x2),

proving (i).


To prove (ii), let xn ↓ x. Then, the sets (−∞, xn] ↓ (−∞, x], and PX((−∞, x1]) = P(X ≤ x1) ≤ 1. Hence, using the monotone continuity from above of the measure PX (m.c.f.a.) (cf. Proposition 1.2.3), one gets

limn→∞ F(xn) = limn→∞ PX((−∞, xn]) = PX((−∞, x]) = F(x).

Next consider part (iii). Note that if xn ↓ −∞ and yn ↑ ∞, then (−∞, xn] ↓ ∅ and (−∞, yn] ↑ (−∞,∞). Hence, part (iii) follows from the m.c.f.a. and the m.c.f.b. properties of PX (cf. Propositions 1.2.1 and 1.2.3).

Definition 2.2.4: A function F : R → R satisfying (i), (ii), and (iii) of Proposition 2.2.2 is called a cumulative distribution function (or cdf for short).

Thus, given a random variable X, its cdf FX satisfies properties (i), (ii), (iii) of Proposition 2.2.2. Conversely, given a cdf F, one can construct a probability space (Ω,F , P) and a random variable X on it such that the cdf of X is F. Indeed, given a cdf F, note that by Theorem 1.3.3 and Definition 1.3.7, there exists a (Lebesgue-Stieltjes) probability measure µF on (R,B(R)) such that µF((−∞, x]) = F(x). Now define X to be the identity map on R, i.e., let X(x) ≡ x for all x ∈ R. Then, X is a random variable on the probability space (R,B(R), µF) with probability distribution PX = µF and cdf FX = F.

In addition to (i), (ii) and (iii) of Proposition 2.2.2, it is easy to verify that for any x in R,

P(X < x) = FX(x−) ≡ limy↑x FX(y),

and hence

P(X = x) = FX(x) − FX(x−).    (2.4)

Thus, the function FX(·) has a jump at x iff P(X = x) > 0. Since a monotone function from R to R can have only jump discontinuities and only a countable number of them (cf. Problem 2.11), for any random variable X, the set {a ∈ R : P(X = a) > 0} is countable. This leads to the following definitions.

Definition 2.2.5:

(a) A random variable X is called discrete if there exists a countable set A ⊂ R such that P(X ∈ A) = 1.

(b) A random variable X is called continuous if P(X = x) = 0 for all x ∈ R.

Note that X is continuous iff FX is continuous on all of R, and X is discrete iff the sum of all the jumps of its cdf FX is one. It may also be noted that if FX is a step function, then X is discrete but not conversely. For example, consider the case where the set A in the above definition is the set of all rational numbers.

It turns out that a given cdf may be written as a weighted sum of a discrete and a continuous cdf. Let F be a cdf. Let A ≡ {x : p(x) ≡ F(x) − F(x−) > 0}. As remarked earlier, A is at most countable. Write α = ∑y∈A p(y) and let Fd(x) = ∑y∈A p(y)I(−∞,x](y), and Fc(x) = F(x) − Fd(x). It is easy to verify that Fc(·) is continuous on R. If α = 0, then F(x) = Fc(x) and F is continuous. If α = 1, then F = Fd and F is discrete. If 0 < α < 1, F(·) can be written as

F(x) = αF̃d(x) + (1 − α)F̃c(x),    (2.5)

where F̃d(x) ≡ α−1Fd(x) and F̃c(x) ≡ (1 − α)−1Fc(x) are both cdfs, with F̃d being discrete and F̃c being continuous. For a further decomposition of Fc(·) into absolutely continuous and singular continuous components, see Chapter 4.
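As a concrete numerical illustration of the decomposition (2.5) (the particular mixed distribution below is only an example), take F to be the cdf of a distribution that puts mass 1/3 at 0 and spreads the remaining mass uniformly over (0, 1]; then α = 1/3, and the following Python sketch verifies (2.5) at a few points.

import numpy as np

# Example distribution: mass 1/3 at 0, mass 2/3 uniform on (0, 1], so alpha = 1/3, A = {0}.
def F(x):
    return np.where(x < 0, 0.0, np.minimum(1.0, 1/3 + (2/3) * np.clip(x, 0.0, 1.0)))

jumps = {0.0: 1/3}                       # p(y) = F(y) - F(y-) at the atoms
alpha = sum(jumps.values())

def F_d(x):                              # sum of the jumps at points <= x
    return sum(p for y, p in jumps.items() if y <= x)

def F_c(x):                              # continuous part: F - F_d
    return F(x) - F_d(x)

for x in [-0.5, 0.0, 0.25, 1.0]:
    # F(x) = alpha * (F_d(x)/alpha) + (1 - alpha) * (F_c(x)/(1 - alpha)), as in (2.5)
    lhs = F(x)
    rhs = alpha * (F_d(x) / alpha) + (1 - alpha) * (F_c(x) / (1 - alpha))
    print(x, lhs, rhs)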

2.2.1 Generalizations to higher dimensions

Induced distributions of random vectors and the associated cdfs are briefly considered in this section. Let X = (X1, X2, . . . , Xk) be a k-dimensional random vector defined on a probability space (Ω,F , P). The probability distribution PX of X is the induced probability measure on (Rk,B(Rk)), defined by (cf. (2.1))

PX(A) ≡ P (X−1(A)) A ∈ B(Rk) . (2.6)

The cdf FX of X is now defined by

FX(x) = P (X ≤ x), x ∈ Rk , (2.7)

where for any x = (x1, x2, . . . , xk) and y = (y1, y2, . . . , yk) in Rk, x ≤ y means that xi ≤ yi for all i = 1, . . . , k.

The extension of Proposition 2.2.2 to the k-dimensional case is notationally involved. Here, an analog of Proposition 2.2.2 for the bivariate case, i.e., for k = 2, is stated.

Proposition 2.2.3: Let F be the cdf of a bivariate random vector X = (X1, X2).

(i) Then, for any x = (x1, x2) ≤ y = (y1, y2),

F (y1, y2)− F (x1, y2)− F (y1, x2) + F (x1, x2) ≥ 0. (2.8)

(ii) For any x = (x1, x2) ∈ R2,

limy1↓x1, y2↓x2 F(y1, y2) = F(x1, x2),

i.e., F is right continuous on R2.


(iii) limx1→−∞ F(x1, a) = limx2→−∞ F(a, x2) = 0 for all a ∈ R; limx1→∞, x2→∞ F(x1, x2) = 1.

(iv) For any a ∈ R, limy↑∞ F(a, y) = P(X1 ≤ a) and limy↑∞ F(y, a) = P(X2 ≤ a).

Proof: Clearly,

0 ≤ P(X ∈ (x1, y1] × (x2, y2])
= P(x1 < X1 ≤ y1, x2 < X2 ≤ y2)
= P(X1 ≤ y1, x2 < X2 ≤ y2) − P(X1 ≤ x1, x2 < X2 ≤ y2)
= P(X1 ≤ y1, X2 ≤ y2) − P(X1 ≤ y1, X2 ≤ x2) − [P(X1 ≤ x1, X2 ≤ y2) − P(X1 ≤ x1, X2 ≤ x2)]
= F(y1, y2) − F(y1, x2) − F(x1, y2) + F(x1, x2).

This proves (i).

To prove (ii), note that for any sequences yin ↓ xi, i = 1, 2, the sets An = (−∞, y1n] × (−∞, y2n] ↓ A ≡ (−∞, x1] × (−∞, x2]. Hence, by the m.c.f.a. property of a probability measure,

F(y1n, y2n) = P(X ∈ An) ↓ P(X ∈ A) = F(x1, x2).

For (iii), note that (−∞, x1n] × (−∞, a] ↓ ∅ for any sequence x1n ↓ −∞ and for any a ∈ R. Hence, again by the m.c.f.a. property,

F(x1n, a) ↓ 0 as n → ∞.

By similar arguments, F(a, x2n) ↓ 0 whenever x2n ↓ −∞. To prove the last relation in (iii), apply the m.c.f.b. property to the sets (−∞, x1n] × (−∞, x2n] ↑ R2 for x1n ↑ ∞, x2n ↑ ∞.

The proof of part (iv) is similar.
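The inclusion–exclusion identity behind part (i) is easy to check numerically. The short Python sketch below (the simulated bivariate sample is arbitrary and only for illustration) evaluates both sides of the identity with the empirical joint cdf; for the same sample the two numbers agree exactly.

import numpy as np

rng = np.random.default_rng(1)

n = 200_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)        # dependent coordinates, chosen arbitrarily

def F(a, b):
    # Empirical joint cdf F(a, b) = P(X1 <= a, X2 <= b) for the simulated sample.
    return np.mean((x1 <= a) & (x2 <= b))

(a1, a2), (b1, b2) = (-0.3, -0.2), (0.8, 1.1)   # a rectangle (a1, b1] x (a2, b2]
rect_prob = np.mean((x1 > a1) & (x1 <= b1) & (x2 > a2) & (x2 <= b2))
incl_excl = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)
print(rect_prob, incl_excl)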

Note that any function satisfying properties (i), (ii), (iii) of Proposition 2.2.3 determines a probability measure uniquely. This follows from the discussions in Section 1.3, as (1.3.9) and (1.3.10) follow from (i) and (iii) (Problem 2.12). For a general k ≥ 1, an analog of property (i) above is cumbersome to write down explicitly. Indeed, for any x ≤ y, now a sum involving 2k terms must be nonnegative. However, properties (ii), (iii), and (iv) can be extended in an obvious way to the k-dimensional case. See Problem 2.13 for a precise statement. Also, for a general k ≥ 1, functions satisfying the properties listed in Problem 2.13 uniquely determine a probability measure on (Rk,B(Rk)).

2.3 Integration

Let (Ω,F , µ) be a measure space and f : Ω → R be a measurable function. The goal of this section is to define the integral of f with respect to the measure µ and establish some basic convergence results. The integral of a nonnegative function taking only finitely many values is defined first, which is then extended to all nonnegative measurable functions by approximation from below. Finally, the integral of an arbitrary measurable function is defined using the decomposition of the function into its positive and negative parts.

Definition 2.3.1: A function f : Ω → R̄ ≡ [−∞,∞] is called simple if there exist a finite set (of distinct elements) c1, . . . , ck ∈ R̄ and sets A1, . . . , Ak ∈ F, k ∈ N, such that f can be written as

f = ∑ki=1 ci IAi.    (3.1)

Definition 2.3.2: (The integral of a simple nonnegative function). Let f : Ω → R+ ≡ [0,∞] be a simple nonnegative function on (Ω,F , µ) with the representation (3.1). The integral of f w.r.t. µ, denoted by ∫f dµ, is defined as

∫f dµ ≡ ∑ki=1 ci µ(Ai).    (3.2)

Here and in the following, the relation

0 · ∞ = 0

is adopted as a convention.

It may be verified that the value of the integral in (3.2) does not depend on the representation of f. That is, if f can be expressed as f = ∑lj=1 dj IBj for some d1, . . . , dl ∈ R+ (not necessarily distinct) and for some sets B1, . . . , Bl ∈ F, then ∑ki=1 ci µ(Ai) = ∑lj=1 dj µ(Bj), so that the value of the integral remains unchanged (Problem 2.17). Also note that for the f in Definition 2.3.2,

0 ≤ ∫f dµ ≤ ∞.
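The representation-independence just noted can be illustrated with a small finite example. The following Python sketch (the space, the weights, and the simple function below are made up for illustration) computes ∫f dµ from (3.2) for two different representations of the same simple function and obtains the same value.

# Omega = {0, 1, 2, 3} with measure mu given by point weights.
mu = {0: 0.5, 1: 1.0, 2: 2.0, 3: 0.25}

def integral_simple(rep):
    # rep is a list of (c_i, A_i) pairs; the integral is sum_i c_i * mu(A_i), as in (3.2).
    return sum(c * sum(mu[w] for w in A) for c, A in rep)

# Two representations of the same simple f: f = 3 on {0, 1}, f = 5 on {2}, f = 0 on {3}.
rep1 = [(3.0, {0, 1}), (5.0, {2})]
rep2 = [(3.0, {0}), (3.0, {1}), (5.0, {2}), (0.0, {3})]
print(integral_simple(rep1), integral_simple(rep2))   # both give 3*1.5 + 5*2 = 14.5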

The following proposition is an easy consequence of the definition and the above remark.

Proposition 2.3.1: Let f and g be two simple nonnegative functions on (Ω,F , µ). Then

(i) (Linearity) For α ≥ 0, β ≥ 0, ∫(αf + βg) dµ = α∫f dµ + β∫g dµ.

(ii) (Monotonicity) If f ≥ g a.e. (µ), i.e., µ({ω : ω ∈ Ω, f(ω) < g(ω)}) = 0, then ∫f dµ ≥ ∫g dµ.


(iii) If f = g a.e. (µ), i.e., if µ({ω : ω ∈ Ω, f(ω) ≠ g(ω)}) = 0, then ∫f dµ = ∫g dµ.

Definition 2.3.3: (The integral of a nonnegative measurable function). Let f : Ω → R+ be a nonnegative measurable function on (Ω,F , µ). The integral of f with respect to µ, also denoted by ∫f dµ, is defined as

∫f dµ = limn→∞ ∫fn dµ,    (3.3)

where {fn}n≥1 is any sequence of nonnegative simple functions such that fn(ω) ↑ f(ω) for all ω.

Note that by Proposition 2.3.1 (ii), the sequence {∫fn dµ}n≥1 is nondecreasing, and hence the right side of (3.3) is well defined. That the right side of (3.3) is the same for all such approximating sequences of functions needs to be established and is the content of the following proposition. The proof of this proposition exploits in a crucial way the m.c.f.b. and the finite additivity of the set function µ (or, equivalently, the countable additivity of µ).

Proposition 2.3.2: Let {fn}n≥1 and {gn}n≥1 be two sequences of simple nonnegative measurable functions on (Ω,F , µ) to R+ such that as n → ∞, fn(ω) ↑ f(ω) and gn(ω) ↑ f(ω) for all ω ∈ Ω. Then

limn→∞ ∫fn dµ = limn→∞ ∫gn dµ.    (3.4)

Proof: Fix N ∈ N and 0 < ρ < 1. It will now be shown that

limn→∞ ∫fn dµ ≥ ρ ∫gN dµ.    (3.5)

Suppose that gN has the representation gN ≡ ∑ki=1 di IBi. Let Dn = {ω ∈ Ω : fn(ω) ≥ ρgN(ω)}, n ≥ 1. Since fn(ω) ↑ f(ω) for all ω, Dn ↑ D ≡ {ω : f(ω) ≥ ρgN(ω)} (Problem 2.18 (b)). Also, since gN(ω) ≤ f(ω) and 0 < ρ < 1, D = Ω. Now writing fn = fn IDn + fn IDnc, it follows from Proposition 2.3.1 that

∫fn dµ ≥ ∫fn IDn dµ ≥ ρ ∫gN IDn dµ = ρ ∑ki=1 di µ(Bi ∩ Dn).    (3.6)

By the m.c.f.b. property, for each i, µ(Bi ∩ Dn) ↑ µ(Bi ∩ Ω) = µ(Bi) as n → ∞. Since the sequence {∫fn dµ}n≥1 is nondecreasing, taking limits in (3.6) yields (3.5). Next, letting ρ ↑ 1 yields limn→∞ ∫fn dµ ≥ ∫gN dµ for each N ∈ N and hence,

limn→∞ ∫fn dµ ≥ limn→∞ ∫gn dµ.

By symmetry, (3.4) follows and hence, the proof is complete.

Remark 2.3.1: It is easy to verify that Proposition 2.3.2 remains valid if {fn}n≥1 and {gn}n≥1 increase to f a.e. (µ).

Given a nonnegative measurable function f, one can always construct a nondecreasing sequence {fn}n≥1 of nonnegative simple functions such that fn(ω) ↑ f(ω) for all ω ∈ Ω in the following manner. Let {δn}n≥1 be a sequence of positive real numbers and let {Nn}n≥1 be a sequence of positive integers such that as n → ∞, δn ↓ 0, Nn ↑ ∞ and Nnδn ↑ ∞. Further, suppose that the sequence Pn ≡ {jδn : j = 0, 1, 2, . . . , Nn} is nested, i.e., Pn ⊂ Pn+1 for each n ≥ 1. Now set

fn(ω) = jδn    if jδn ≤ f(ω) < (j + 1)δn, j = 0, 1, 2, . . . , (Nn − 1),
fn(ω) = Nnδn  if f(ω) ≥ Nnδn.    (3.7)

A specific choice of δn and Nn is given by δn = 2−n, Nn = n2n.
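The construction (3.7) with δn = 2−n and Nn = n2n is easy to experiment with. The Python sketch below (the sample function f(x) = exp(x) is an arbitrary choice) evaluates fn at a few points and shows the values increasing toward f as n grows.

import numpy as np

def f_n(x, n, f):
    # The simple function in (3.7) with delta_n = 2**(-n) and N_n = n * 2**n:
    # value j*delta_n on {j*delta_n <= f < (j+1)*delta_n}, capped at N_n*delta_n = n.
    delta = 2.0 ** (-n)
    cap = float(n)
    y = f(x)
    return np.where(y >= cap, cap, np.floor(y / delta) * delta)

f = lambda x: np.exp(x)                  # an illustrative nonnegative function
x = np.array([-1.0, 0.3, 2.0, 5.0])
for n in [1, 2, 4, 8, 16]:
    print(n, f_n(x, n, f))               # nondecreasing in n, converging to exp(x)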

Thus, with the above choice of {fn}n≥1 in the definition of the Lebesgue integral ∫f dµ in (3.3), the range of f is subdivided into intervals of decreasing lengths. This is in contrast to the definition of the Riemann integral of f over a bounded interval, which is defined via subdividing the domain of f into finer subintervals.

Remark 2.3.2: In some cases it may be more appropriate to choose the approximating sequence {fn}n≥1 in a manner different from (3.7). For example, let Ω = {ωi : i ≥ 1} be a countable set, F = P(Ω), the power set of Ω, and let µ be a measure on (Ω,F). Then any function f : Ω → R+ ≡ [0,∞) is measurable and the integral ∫f dµ coincides with the sum ∑∞i=1 f(ωi)µ({ωi}), as can be seen by choosing the approximating sequence {fn}n≥1 as

fn(ωi) = f(ωi) for i = 1, 2, . . . , n,
fn(ωi) = 0 for i > n.
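For a countable space the integral thus reduces to a series, and the truncated approximating sequence above corresponds to partial sums. The Python sketch below (the weights µ({ωi}) = 2−i and f(ωi) = i are an arbitrary example) shows the truncated integrals increasing to the value of the series.

# Omega identified with {1, 2, ...}, mu({omega_i}) = 1/2**i, f(omega_i) = i.
# Then int f dmu = sum_i i / 2**i = 2.
mu = lambda i: 0.5 ** i
f = lambda i: float(i)

def integral_truncated(n):
    # integral of f_n, which agrees with f on the first n points and is 0 elsewhere
    return sum(f(i) * mu(i) for i in range(1, n + 1))

for n in [1, 5, 10, 30]:
    print(n, integral_truncated(n))      # increases to 2.0, the value of the series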

Remark 2.3.3: The integral of a nonnegative measurable function can be alternatively defined as

∫f dµ = sup{∫g dµ : g nonnegative and simple, g ≤ f}.

The equivalence of this to (3.3) is seen as follows. Clearly the right side above, say M, is greater than or equal to ∫f dµ as in (3.3). Conversely, there exists a sequence {gn}n≥1 of simple nonnegative functions with gn ≤ f for all n ≥ 1 such that limn ∫gn dµ equals the supremum M defined above. Now set hn = max{gj : 1 ≤ j ≤ n}, n ≥ 1. It can be verified that for each n ≥ 1, hn is nonnegative, simple, and satisfies hn ↑ f, and also that ∫hn dµ converges to M (Problem 2.19 (b)).

Corollary 2.3.3: Let f and g be two nonnegative measurable functions on (Ω,F , µ). Then, the conclusions of Proposition 2.3.1 remain valid for such f and g.

Proof: This follows from Proposition 2.3.1 for nonnegative simple functions and Definition 2.3.3.

The definition of the integral ∫f dµ of a nonnegative measurable function f in (3.3) makes it possible to interchange limits and integration in a fairly routine manner. In particular, the following key result is a direct consequence of the definition.

Theorem 2.3.4: (The monotone convergence theorem or MCT). Let {fn}n≥1 and f be nonnegative measurable functions on (Ω,F , µ) such that fn ↑ f a.e. (µ). Then

∫f dµ = limn→∞ ∫fn dµ.    (3.8)

Remark 2.3.4: The important difference between (3.4) and (3.8) is that in (3.8), the fn's need not be simple.

Proof: It is similar to the proof of Proposition 2.3.2. Let {gn}n≥1 be a sequence of nonnegative simple functions on (Ω,F , µ) such that gn(ω) ↑ f(ω) for all ω. By hypothesis, there exists a set A ∈ F such that µ(Ac) = 0 and for ω in A, fn(ω) ↑ f(ω). Fix k ∈ N and 0 < ρ < 1. Let Dn = {ω : ω ∈ A, fn(ω) ≥ ρgk(ω)}, n ≥ 1. Then, Dn ↑ D ≡ {ω : ω ∈ A, f(ω) ≥ ρgk(ω)}. Since gk(ω) ≤ f(ω) for all ω, it follows that D = A. Now, by Corollary 2.3.3,

∫fn dµ ≥ ∫fn IDn dµ ≥ ρ ∫gk IDn dµ for all n ≥ 1.

By m.c.f.b., ∫gk IDn dµ ↑ ∫gk IA dµ = ∫gk dµ as n → ∞, yielding

lim infn→∞ ∫fn dµ ≥ ρ ∫gk dµ

for all 0 < ρ < 1 and all k ∈ N. Letting ρ ↑ 1 first and then k ↑ ∞, from (3.3) one gets

lim infn→∞ ∫fn dµ ≥ ∫f dµ.


By monotonicity (Corollary 2.3.3),

∫fn dµ ≤ ∫f dµ for all n ≥ 1,

and so the proof is complete.

Corollary 2.3.5: Let {hn}n≥1 be a sequence of nonnegative measurable functions on a measure space (Ω,F , µ). Then

∫(∑∞n=1 hn) dµ = ∑∞n=1 ∫hn dµ.

Proof: Let fn = ∑ni=1 hi, n ≥ 1, and let f = ∑∞i=1 hi. Then, 0 ≤ fn ↑ f. By the MCT,

∫fn dµ ↑ ∫f dµ.

But by Corollary 2.3.3,

∫fn dµ = ∑ni=1 ∫hi dµ.

Hence, the result follows.

Corollary 2.3.6: Let f be a nonnegative measurable function on a measure space (Ω,F , µ). For A ∈ F, let

ν(A) ≡ ∫f IA dµ.

Then, ν is a measure on (Ω,F).

Proof: Let {An}n≥1 be a sequence of disjoint sets in F. Let hn = f IAn for n ≥ 1. Then by Corollary 2.3.5,

ν(⋃n≥1 An) = ∫f I[⋃n≥1 An] dµ = ∫f · [∑∞n=1 IAn] dµ = ∫[∑∞n=1 hn] dµ = ∑∞n=1 ∫hn dµ = ∑∞n=1 ν(An).

Remark 2.3.5: Notice that µ(A) = 0 ⇒ ν(A) = 0. In this case ν is said to be dominated by or absolutely continuous with respect to µ. The Radon-Nikodym theorem (see Chapter 4) provides a converse to this. That is, if ν and µ are two measures on a measurable space (Ω,F) such that ν is dominated by µ and µ is σ-finite, then there exists a nonnegative measurable function f such that ν(A) = ∫f IA dµ for all A in F. This f is called a Radon-Nikodym derivative (or a density) of ν with respect to µ and is denoted by dν/dµ.

Theorem 2.3.7: (Fatou's lemma). Let {fn}n≥1 be a sequence of nonnegative measurable functions on (Ω,F , µ). Then

lim infn→∞ ∫fn dµ ≥ ∫(lim infn→∞ fn) dµ.    (3.9)

Proof: Let gn(ω) ≡ inf{fj(ω) : j ≥ n}. Then {gn}n≥1 is a sequence of nonnegative, nondecreasing measurable functions on (Ω,F , µ) such that gn(ω) ↑ g(ω) ≡ lim infn→∞ fn(ω). By the MCT,

∫gn dµ ↑ ∫g dµ.

But by monotonicity, ∫fn dµ ≥ ∫gn dµ for each n ≥ 1,

and hence, (3.9) follows.

Remark 2.3.6: In (3.9), the inequality can be strict. For example, take fn = I[n,∞), n ≥ 1, on the measure space (R,B(R), m) where m is the Lebesgue measure. For another example, consider fn = nI[0,1/n], n ≥ 1, on the finite measure space ([0, 1],B([0, 1]), m).
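For the second example the strictness of (3.9) can be seen numerically: each integral equals 1 while the pointwise limit is 0 off a null set. The Python sketch below (the grid-based approximation of the integral is only a rough illustration) computes approximate values of ∫fn dm and some pointwise values of fn.

import numpy as np

# f_n = n * 1_[0, 1/n] on ([0, 1], B([0, 1]), m): each integral is 1, but f_n(x) -> 0
# for every fixed x > 0, so lim inf of the integrals (= 1) exceeds the integral of the
# lim inf (= 0), i.e., (3.9) is strict here.
def f_n(x, n):
    return np.where(x <= 1.0 / n, float(n), 0.0)

grid = np.linspace(0.0, 1.0, 1_000_001)
for n in [1, 10, 100]:
    integral = f_n(grid, n).mean() * 1.0           # Riemann-sum approximation of the integral
    print(n, integral, f_n(np.array([0.05, 0.5]), n))   # pointwise values are eventually 0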

Definition 2.3.4: (The integral of a measurable function). Let f be a real valued measurable function on a measure space (Ω,F , µ). Let f+ = f I{f≥0} and f− = −f I{f<0}. The integral of f with respect to µ, denoted by ∫f dµ, is defined as

∫f dµ = ∫f+ dµ − ∫f− dµ,

provided that at least one of the integrals on the right side is finite.

Remark 2.3.7: Note that both f+ and f− are nonnegative measurable functions and f = f+ − f− and |f| = f+ + f−. Further, the integrals ∫f+ dµ and ∫f− dµ in Definition 2.3.4 are defined via Definition 2.3.3.

Definition 2.3.5: (Integrable functions). A measurable function f on a measure space (Ω,F , µ) is said to be integrable with respect to µ if ∫|f| dµ < ∞.

Since |f| = f+ + f−, it follows that f is integrable iff both f+ and f− are integrable, i.e., ∫f+ dµ < ∞ and ∫f− dµ < ∞. In the following, whenever the integral of f or its integrability is discussed, the measurability of f will be assumed to hold.

Remark on notation: The integral ∫f dµ is also written as ∫Ω f(ω) µ(dω) and ∫Ω f(ω) dµ(ω).

Definition 2.3.6: Let f be a measurable function on a measure space (Ω,F , µ) and A ∈ F. Then the integral of f over A with respect to µ, denoted by ∫A f dµ, is defined as

∫A f dµ ≡ ∫f IA dµ,    (3.10)

provided the right side is well defined.

Definition 2.3.7: (Lp-spaces). Let (Ω,F , µ) be a measure space and 0 < p ≤ ∞. Then Lp(Ω,F , µ) is defined as

Lp(Ω,F , µ) ≡ {f : |f|p is integrable with respect to µ} = {f : ∫|f|p dµ < ∞} for 0 < p < ∞,

and

L∞(Ω,F , µ) ≡ {f : µ(|f| > K) = 0 for some K ∈ (0,∞)}.

The following is an extension of Proposition 2.3.1 to integrable functions.

Proposition 2.3.8: Let f , g ∈ L1(Ω,F , µ). Then

(i) ∫(αf + βg) dµ = α∫f dµ + β∫g dµ for any α, β ∈ R.

(ii) f ≥ g a.e. (µ) ⇒ ∫f dµ ≥ ∫g dµ.

(iii) f = g a.e. (µ) ⇒ ∫f dµ = ∫g dµ.

Proof: It is easy to verify (Problem 2.32) that if h = h1 − h2 where h1 and h2 are nonnegative functions in L1(Ω,F , µ), then h is also in L1(Ω,F , µ) and

∫h dµ = ∫h1 dµ − ∫h2 dµ.    (3.11)

Note that h ≡ αf + βg can be written as

(α+ − α−)(f+ − f−) + (β+ − β−)(g+ − g−)
= (α+f+ + α−f− + β+g+ + β−g−) − (α+f− + α−f+ + β+g− + β−g+)
= h1 − h2, say.


Since f, g ∈ L1(Ω,F , µ), it follows that h1 and h2 ∈ L1(Ω,F , µ). Further, they are nonnegative and by (3.11), h ∈ L1(Ω,F , µ) and

∫h dµ = ∫h1 dµ − ∫h2 dµ.

Now apply Proposition 2.3.1 to each of the terms on the right side and regroup the terms to get

∫h dµ = α ∫f dµ + β ∫g dµ.

Proofs of (ii) and (iii) are left as an exercise.

Remark 2.3.8: By Proposition 2.3.8, if f and g ∈ L1(Ω,F , µ), then so does αf + βg for all α, β ∈ R. Thus, L1(Ω,F , µ) is a vector space over R. Further, if one sets

‖f‖1 ≡ ∫|f| dµ

(and identifies a function f with its equivalence class under the relation f ∼ g iff f = g a.e. (µ)), then ‖ · ‖1 defines a norm on L1(Ω,F , µ) and makes it a normed linear space (cf. Chapter 3). A similar remark also holds for Lp(Ω,F , µ) for 1 < p ≤ ∞.

Next note that by Proposition 2.3.8, if f = 0 a.e. (µ), then ∫f dµ = 0. However, the converse is not true. But if f is nonnegative a.e. (µ), then the converse is true, as shown below.

Proposition 2.3.9: Let f be a measurable function on (Ω,F , µ) and let f be nonnegative a.e. (µ). Then ∫f dµ = 0 iff f = 0 a.e. (µ).

Proof: It is enough to prove the “only if” part. Let D = {ω : f(ω) > 0} and Dn = {ω : f(ω) > 1/n}, n ≥ 1. Then D = ⋃n≥1 Dn. Since f ≥ f IDn a.e. (µ),

0 = ∫f dµ ≥ ∫f IDn dµ ≥ (1/n) µ(Dn) ⇒ µ(Dn) = 0 for each n ≥ 1.

Also Dn ↑ D and so by m.c.f.b.,

µ(D) = limn→∞ µ(Dn) = 0 .

Hence, Proposition 2.3.9 follows.

A dual to the above proposition is the next one.


Proposition 2.3.10: Let f ∈ L1(Ω,F , µ). Then, |f | < ∞ a.e. (µ).

Proof: Let Cn = {ω : |f(ω)| > n}, n ≥ 1, and let C = {ω : |f(ω)| = ∞}. Then Cn ↓ C and

∫|f| dµ ≥ ∫|f| ICn dµ ≥ n µ(Cn) ⇒ µ(Cn) ≤ (∫|f| dµ)/n.

Since ∫|f| dµ < ∞, limn→∞ µ(Cn) = 0. Hence, by m.c.f.a., µ(C) = limn→∞ µ(Cn) = 0.

The next result is a useful convergence theorem for integrals.

Theorem 2.3.11: (The extended dominated convergence theorem or EDCT). Let (Ω,F , µ) be a measure space and let fn, gn : Ω → R be 〈F ,B(R)〉-measurable functions such that |fn| ≤ gn a.e. (µ) for all n ≥ 1. Suppose that

(i) gn → g a.e. (µ) and fn → f a.e. (µ);

(ii) gn, g ∈ L1(Ω,F , µ) and ∫|gn| dµ → ∫|g| dµ as n → ∞.

Then, f ∈ L1(Ω,F , µ),

limn→∞ ∫fn dµ = ∫f dµ and limn→∞ ∫|fn − f| dµ = 0.    (3.12)

Two important special cases of Theorem 2.3.11 will be stated next. When gn = g for all n ≥ 1, one has the standard version of the dominated convergence theorem.

Corollary 2.3.12: (Lebesgue's dominated convergence theorem, or DCT). If |fn| ≤ g a.e. (µ) for all n ≥ 1, ∫g dµ < ∞ and fn → f a.e. (µ), then f ∈ L1(Ω,F , µ),

limn→∞ ∫fn dµ = ∫f dµ and limn→∞ ∫|fn − f| dµ = 0.    (3.13)

Corollary 2.3.13: (The bounded convergence theorem, or BCT). Let µ(Ω) < ∞. If there exists a 0 < k < ∞ such that |fn| ≤ k a.e. (µ) and fn → f a.e. (µ), then

limn→∞ ∫fn dµ = ∫f dµ and limn→∞ ∫|fn − f| dµ = 0.    (3.14)

Proof: Take g(ω) ≡ k for all ω ∈ Ω in the previous corollary.

Proof of Theorem 2.3.11: By Fatou's lemma,

∫|f| dµ ≤ lim infn→∞ ∫|fn| dµ ≤ lim infn→∞ ∫|gn| dµ = ∫|g| dµ < ∞.


Hence, f is integrable. For proving the second part, let hn = fn + gn and γn = gn − fn, n ≥ 1. Then, {hn}n≥1 and {γn}n≥1 are sequences of nonnegative integrable functions. By Fatou's lemma and (ii),

∫(f + g) dµ = ∫lim infn→∞ hn dµ ≤ lim infn→∞ ∫hn dµ = lim infn→∞ [∫gn dµ + ∫fn dµ] = ∫g dµ + lim infn→∞ ∫fn dµ.

Similarly,

∫(g − f) dµ ≤ ∫g dµ − lim supn→∞ ∫fn dµ.

By Proposition 2.3.8, ∫(g ± f) dµ = ∫g dµ ± ∫f dµ. Hence,

∫f dµ ≤ lim infn→∞ ∫fn dµ and lim supn→∞ ∫fn dµ ≤ ∫f dµ,

yielding that limn→∞ ∫fn dµ = ∫f dµ. For the last part, apply the above argument with fn and gn replaced by f̃n ≡ |f − fn| and g̃n ≡ gn + |f|, respectively.

Theorem 2.3.14: (An approximation theorem). Let µF be a Lebesgue-Stieltjes measure on (R,B(R)). Let f ∈ Lp(R,B(R), µF), 0 < p < ∞. Then, for any δ > 0, there exist a step function h and a continuous function g with compact support (i.e., g vanishes outside a bounded interval) such that

∫|f − h|p dµF < δ,    (3.15)

∫|f − g|p dµF < δ,    (3.16)

where a step function h is a function of the form h = ∑ki=1 ci IAi with k < ∞, c1, c2, . . . , ck ∈ R and A1, A2, . . . , Ak being bounded disjoint intervals.

Proof: Let fn(·) = f(·) IBn(·) where Bn = {x : |x| ≤ n, |f(x)| ≤ n}. By the DCT, for every ε > 0, there exists an Nε such that for all n ≥ Nε,

∫|f − fn|p dµF < ε.    (3.17)


Since |fNε(·)| ≤ Nε on [−Nε, Nε], for any η > 0, there exists a simple function f̃ such that

sup{|fNε(x) − f̃(x)| : x ∈ R} < η (cf. (3.7)).    (3.18)

Next, using Problem 1.32 (b), one can show that for any η > 0, there exists a step function h (depending on η) such that

∫|f̃ − h|p dµF < η.    (3.19)

Since f − h = (f − fNε) + (fNε − f̃) + (f̃ − h),

|f − h|p ≤ Cp(|f − fNε|p + |fNε − f̃|p + |f̃ − h|p),

where Cp is a constant depending only on p. This in turn yields, from (3.17)–(3.19),

∫|f − h|p dµF ≤ Cp(ε + µF({x : |x| ≤ Nε}) ηp + η).    (3.20)

Given δ > 0, choose ε > 0 first and then η > 0 such that the right side of (3.20) above is less than δ.

Next, given any step function h and η > 0, there exists a continuous function g with compact support such that µF({x : h(x) ≠ g(x)}) < η (cf. Problem 1.32 (g)). Now (3.16) follows from (3.15).

Remark 2.3.9: The approximation (3.16) remains valid if g is restricted to the class of all infinitely differentiable functions with compact support. Further, it remains valid for 0 < p < ∞ for Lebesgue-Stieltjes measures on any Euclidean space.

Remark 2.3.10: The above approximation theorem fails for p = ∞. For example, consider the function f(x) ≡ 1 in L∞(m).

2.4 Riemann and Lebesgue integrals

Let f be a real valued bounded function on a bounded interval [a, b]. Recall the definition of the Riemann integral. Let P = {x0, x1, . . . , xn} be a finite partition of [a, b], i.e., x0 = a < x1 < x2 < · · · < xn−1 < xn = b, and let ∆ = ∆(P) ≡ max{(xi+1 − xi) : 0 ≤ i ≤ n − 1} be the diameter of P. Let Mi = sup{f(x) : xi ≤ x ≤ xi+1} and mi = inf{f(x) : xi ≤ x ≤ xi+1}, i = 0, 1, . . . , n − 1.

Definition 2.4.1: The upper- and lower-Riemann sums of f w.r.t. the partition P are, respectively, defined as

U(f, P) ≡ ∑n−1i=0 Mi · (xi+1 − xi)    (4.1)

and

L(f, P) ≡ ∑n−1i=0 mi · (xi+1 − xi).    (4.2)

It is easy to verify that if Q = {y0, y1, . . . , yk} is another partition satisfying P ⊂ Q, then U(f, P) ≥ U(f, Q) ≥ L(f, Q) ≥ L(f, P). Let 𝒫 denote the collection of all finite partitions of [a, b].
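The refinement inequalities above are easy to watch numerically. The Python sketch below (the function f(x) = x² on [0, 1] and the uniform dyadic partitions are arbitrary illustrative choices; endpoint values give Mi and mi only because f is monotone on each subinterval) computes U(f, P) and L(f, P) on a nested sequence of partitions and shows U decreasing and L increasing toward the common value 1/3.

import numpy as np

def upper_lower(f, a, b, n_intervals):
    # Upper and lower Riemann sums (4.1)-(4.2) over a uniform partition of [a, b];
    # since f is assumed monotone on each subinterval, endpoint values give M_i and m_i.
    xs = np.linspace(a, b, n_intervals + 1)
    vals = f(xs)
    widths = np.diff(xs)
    M = np.maximum(vals[:-1], vals[1:])
    m = np.minimum(vals[:-1], vals[1:])
    return np.sum(M * widths), np.sum(m * widths)

f = lambda x: x ** 2
for k in [1, 2, 4, 8, 16]:                 # each dyadic partition refines the previous one
    U, L = upper_lower(f, 0.0, 1.0, 2 ** k)
    print(2 ** k, U, L)                    # U decreases and L increases toward 1/3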

Definition 2.4.2: The upper-Riemann integral ∫̄f is defined as

∫̄f = infP∈𝒫 U(f, P)    (4.3)

and the lower-Riemann integral ∫̲f by

∫̲f = supP∈𝒫 L(f, P).    (4.4)

It can be shown (cf. Problem 2.23) that if {Pn}n≥1 is any sequence of partitions such that ∆(Pn) → 0 as n → ∞ and Pn ⊂ Pn+1 for each n ≥ 1, then U(Pn, f) ↓ ∫̄f and L(Pn, f) ↑ ∫̲f.

Definition 2.4.3: f is said to be Riemann integrable if

∫̄f = ∫̲f.    (4.5)

The common value is denoted by ∮[a,b] f.

Fix a sequence {Pn}n≥1 of partitions such that Pn ⊂ Pn+1 for all n ≥ 1 and ∆(Pn) → 0 as n → ∞. Let Pn = {xn0 = a < xn1 < xn2 < · · · < xnkn = b}. For i = 0, 1, . . . , kn − 1, let

φn(x) ≡ sup{f(t) : xni ≤ t ≤ xn,i+1},  x ∈ [xni, xn,i+1),
ψn(x) ≡ inf{f(t) : xni ≤ t ≤ xn,i+1},  x ∈ [xni, xn,i+1),

and let φn(b) = ψn(b) = 0. Then, φn and ψn are step functions on [a, b] and hence are Borel measurable. Further, since f is bounded, so are φn and ψn and hence they are integrable on [a, b] w.r.t. the Lebesgue measure m. The Lebesgue integrals of φn and ψn are given by

∫[a,b] φn dm = U(Pn, f) and ∫[a,b] ψn dm = L(Pn, f).

It can be shown (Problem 2.24) that for all x ∉ ⋃n≥1 Pn, as n → ∞,

φn(x) → φ(x) ≡ limδ↓0 sup{f(y) : |y − x| < δ}    (4.6)

and

ψn(x) → ψ(x) ≡ limδ↓0 inf{f(y) : |y − x| < δ}.    (4.7)


Thus, φ and ψ, being limits of Borel measurable functions (except possibly on a countable set), are Borel measurable. By the BCT (Corollary 2.3.13),

∫̄f = limn→∞ ∫φn dm = ∫φ dm

and

∫̲f = limn→∞ ∫ψn dm = ∫ψ dm.

Thus, f is Riemann integrable on [a, b] iff ∫φ dm = ∫ψ dm, iff ∫(φ − ψ) dm = 0. Since φ(x) ≥ f(x) ≥ ψ(x) for all x, this holds iff φ = f = ψ a.e. (m). It can be shown that f is continuous at x0 iff φ(x0) = f(x0) = ψ(x0) (Problem 2.8). Summarizing the above discussion, one gets the following theorem.

Theorem 2.4.1: Let f be a bounded function on a bounded interval [a, b]. Then f is Riemann integrable on [a, b] iff f is continuous a.e. (m) on [a, b]. In this case, f is Lebesgue integrable on [a, b] and the Lebesgue integral ∫[a,b] f dm equals the Riemann integral ∮[a,b] f, i.e., the two integrals coincide.

It should be noted that Lebesgue integrability need not imply Riemann integrability. For example, consider f(x) ≡ IQ1(x) where Q1 is the set of rationals in [0, 1] (Problem 2.25).

The functions φ and ψ defined in (4.6) and (4.7) above are called, respectively, the upper and the lower envelopes of the function f. They are semicontinuous in the sense that for each α ∈ R, the sets {x : φ(x) < α} and {x : ψ(x) > α} are open (cf. Problem 2.8).

Remark 2.4.1: The key difference in the definitions of Riemann and Lebesgue integrals is that in the former the domain of f is partitioned while in the latter the range of f is partitioned.

2.5 More on convergence

Let {fn}n≥1 and f be measurable functions from a measure space (Ω,F , µ) to R̄, the set of extended real numbers. There are several notions of convergence of {fn}n≥1 to f. The following two have been discussed earlier.

Definition 2.5.1: {fn}n≥1 converges to f pointwise if

limn→∞ fn(ω) = f(ω) for all ω in Ω.

Definition 2.5.2: {fn}n≥1 converges to f almost everywhere (µ), denoted by fn → f a.e. (µ), if there exists a set B in F such that µ(B) = 0 and

limn→∞ fn(ω) = f(ω) for all ω ∈ Bc.    (5.1)


Now consider some more notions of convergence.

Definition 2.5.3: {fn}n≥1 converges to f in measure (w.r.t. µ), denoted by fn →m f, if for each ε > 0,

limn→∞ µ(|fn − f| > ε) = 0.    (5.2)

Definition 2.5.4: Let 0 < p < ∞. Then, {fn}n≥1 converges to f in Lp(µ), denoted by fn →Lp f, if ∫|fn|p dµ < ∞ for all n ≥ 1, ∫|f|p dµ < ∞ and

limn→∞ ∫|fn − f|p dµ = 0.    (5.3)

Clearly, (5.3) is equivalent to ‖fn − f‖p → 0 as n → ∞, where for any F-measurable function g and any 0 < p < ∞,

‖g‖p = (∫|g|p dµ)min{1/p, 1}.    (5.4)

For p = 1, this is also called convergence in absolute deviation and for p = 2, convergence in mean square.

Definition 2.5.5: {fn}n≥1 converges to f uniformly (over Ω) if

limn→∞ sup{|fn(ω) − f(ω)| : ω ∈ Ω} = 0.    (5.5)

Definition 2.5.6: {fn}n≥1 converges to f in L∞(µ) if

limn→∞ ‖fn − f‖∞ = 0,    (5.6)

where for any F-measurable function g on (Ω,F , µ),

‖g‖∞ = inf{K : K ∈ (0,∞), µ(|g| > K) = 0}.    (5.7)

Definition 2.5.7: {fn}n≥1 converges to f nearly uniformly (µ) if for every ε > 0, there exists a set A ∈ F such that µ(A) < ε and on Ac, fn → f uniformly, i.e., sup{|fn(ω) − f(ω)| : ω ∈ Ac} → 0 as n → ∞.

The notion of convergence in Definition 2.5.7 is also called almost uniform convergence in some books (cf. Royden (1988)). The sequence fn ≡ nI[0,1/n] on (Ω = [0, 1],B([0, 1]), m) converges to f ≡ 0 nearly uniformly but not in L∞(m).

When µ is a probability measure, there is another useful notion of convergence, known as convergence in distribution, that is defined in terms of the induced measures {µfn−1}n≥1 and µf−1. This notion of convergence will be treated in detail in Chapter 9.


Next, the connections between some of these notions of convergence are explored.

Theorem 2.5.1: Suppose that µ(Ω) < ∞. Then, fn → f a.e. (µ) implies fn →m f.

The proof is left as an exercise (Problem 2.26). The hypothesis that ‘µ(Ω) < ∞’ in Theorem 2.5.1 cannot be dispensed with, as seen by taking fn = I[n,∞) on R with Lebesgue measure. Also, fn →m f does not imply fn → f a.e. (µ) (Problem 2.46), but the following holds.

Theorem 2.5.2: Let fn →m f. Then, there exists a subsequence {nk}k≥1 such that fnk → f a.e. (µ).

Proof: Since fn →m f, for each integer k ≥ 1, there exists an nk such that for all n ≥ nk,

µ(|fn − f| > 2−k) < 2−k.    (5.8)

W.l.o.g., assume that nk+1 > nk for all k ≥ 1. Let Ak = {|fnk − f| > 2−k}. By Corollary 2.3.5,

∫(∑∞k=1 IAk) dµ = ∑∞k=1 ∫IAk dµ = ∑∞k=1 µ(Ak),

which, by (5.8), is finite. Hence, by Proposition 2.3.10, ∑∞k=1 IAk < ∞ a.e. (µ). Now observe that ∑∞k=1 IAk(ω) < ∞ ⇒ |fnk(ω) − f(ω)| ≤ 2−k for all k large ⇒ limk→∞ fnk(ω) = f(ω). Thus, fnk → f a.e. (µ).

Remark 2.5.1: From the above result it follows that the extended dominated convergence theorem (Theorem 2.3.11) remains valid if convergence a.e. of {fn}n≥1 and of {gn}n≥1 are replaced by convergence in measure for both (Problem 2.37).

Theorem 2.5.3: Let {fn}n≥1, f be measurable functions on a measure space (Ω,F , µ). Let fn →Lp f for some 0 < p < ∞. Then fn →m f.

Proof: For each ε > 0, let An = {|fn − f| ≥ ε}, n ≥ 1. Then

∫|fn − f|p dµ ≥ ∫An |fn − f|p dµ ≥ εp µ(An).

Since fn → f in Lp, ∫|fn − f|p dµ → 0 and hence, µ(An) → 0.

It turns out that fn →m f need not imply fn →Lp f, even if {fn : n ≥ 1} ∪ {f} is contained in Lp(Ω,F , µ). For example, let fn = nI[0,1/n] and f ≡ 0 on the Lebesgue space ([0, 1],B([0, 1]), m), where m is the Lebesgue measure. Then fn →m f but ∫|fn − f| dm ≡ 1 for all n ≥ 1. However, under some additional conditions, convergence in measure does imply convergence in Lp. Here are two results in this direction.
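The counterexample just given can be made completely explicit. The short Python sketch below (plain arithmetic for this particular sequence, included only as an illustration) shows µ(|fn − f| > ε) → 0 while the L1 distance stays equal to 1.

# f_n = n * 1_[0, 1/n], f = 0 on ([0, 1], B([0, 1]), m).
# m(|f_n - f| > eps) = 1/n -> 0 (convergence in measure), yet int |f_n - f| dm = 1 for all n.
eps = 0.5
for n in [1, 10, 100, 1000]:
    measure_of_exceedance = 1.0 / n      # m({x : n * 1_[0,1/n](x) > eps}) when eps < n
    l1_distance = n * (1.0 / n)          # height n times the length of [0, 1/n]
    print(n, measure_of_exceedance, l1_distance)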

Theorem 2.5.4: (Scheffé's theorem). Let {fn}n≥1, f be a collection of nonnegative measurable functions on a measure space (Ω,F , µ). Let fn → f a.e. (µ), ∫fn dµ → ∫f dµ and ∫f dµ < ∞. Then

limn→∞ ∫|fn − f| dµ = 0.

Proof: Let gn = f − fn, n ≥ 1. Since fn → f a.e. (µ), both gn+ and gn− go to zero a.e. (µ). Further, 0 ≤ gn+ ≤ f and by hypothesis ∫f dµ < ∞. Thus, by the DCT, it follows that

∫gn+ dµ → 0.

Next, note that by hypothesis, ∫gn dµ → 0. Thus, ∫gn− dµ = ∫gn+ dµ − ∫gn dµ → 0 and hence,

∫|gn| dµ = ∫gn+ dµ + ∫gn− dµ → 0.

Corollary 2.5.5: Let {fn}n≥1, f be probability density functions on a measure space (Ω,F , µ). That is, for all n ≥ 1, ∫fn dµ = ∫f dµ = 1 and fn, f ≥ 0 a.e. (µ). If fn → f a.e. (µ), then

limn→∞ ∫|fn − f| dµ = 0.

Remark 2.5.2: The above theorem and the corollary remain valid if the convergence of fn to f a.e. (µ) is replaced by fn →m f.

Corollary 2.5.6: Let {pnk}k≥1, n = 1, 2, . . ., and {pk}k≥1 be sequences of nonnegative numbers satisfying ∑∞k=1 pnk = 1 = ∑∞k=1 pk. Let pnk → pk as n → ∞ for each k ≥ 1. Then ∑∞k=1 |pnk − pk| → 0.

Proof: Apply Corollary 2.5.5 with µ = the counting measure on (N,P(N)).
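Corollary 2.5.6 can be illustrated numerically with a familiar pair of probability sequences. In the Python sketch below (the choice pnk = Binomial(n, λ/n) probabilities and pk = Poisson(λ) probabilities is only an example; the truncation of the sum at k = 200 is a numerical convenience), the total of the absolute differences decreases toward 0 as n grows, as the corollary asserts.

from math import comb, exp, factorial

lam = 2.0

def binom_pmf(n, k):              # p_{nk}: Binomial(n, lam/n) probabilities
    p = lam / n
    return comb(n, k) * p**k * (1 - p)**(n - k) if k <= n else 0.0

def pois_pmf(k):                  # p_k: Poisson(lam) probabilities
    return exp(-lam) * lam**k / factorial(k)

for n in [5, 20, 100, 500]:
    total_abs_diff = sum(abs(binom_pmf(n, k) - pois_pmf(k)) for k in range(200))
    print(n, total_abs_diff)      # tends to 0 as n increases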

A more general result in this direction that does not require fn, f to be nonnegative involves the concept of uniform integrability. Let {fλ : λ ∈ Λ} be a collection of functions in L1(Ω,F , µ). Then for each λ ∈ Λ, by the DCT and the integrability of fλ,

aλ(t) ≡ ∫{|fλ|>t} |fλ| dµ → 0 as t → ∞.    (5.9)

The notion of uniform integrability requires that the integrals aλ(t) go to zero uniformly in λ ∈ Λ as t → ∞.


Definition 2.5.8: The collection of functions {fλ : λ ∈ Λ} in L1(Ω,F , µ) is uniformly integrable (or UI, in short) if

supλ∈Λ aλ(t) → 0 as t → ∞.    (5.10)

The following proposition summarizes some of the main properties of UI families of functions.

Proposition 2.5.7: Let {fλ : λ ∈ Λ} be a collection of µ-integrable functions on (Ω,F , µ).

(i) If Λ is finite, then {fλ : λ ∈ Λ} is UI.

(ii) If K ≡ sup{∫|fλ|1+ε dµ : λ ∈ Λ} < ∞ for some ε > 0, then {fλ : λ ∈ Λ} is UI.

(iii) If |fλ| ≤ g a.e. (µ) and ∫g dµ < ∞, then {fλ : λ ∈ Λ} is UI.

(iv) If {fλ : λ ∈ Λ} and {gγ : γ ∈ Γ} are UI, then so is {fλ + gγ : λ ∈ Λ, γ ∈ Γ}.

(v) If {fλ : λ ∈ Λ} is UI and µ(Ω) < ∞, then

supλ∈Λ ∫|fλ| dµ < ∞.    (5.11)

Proof: By hypothesis, aλ(t) ≡ ∫{|fλ|>t} |fλ| dµ → 0 as t → ∞ for each λ. If Λ is finite, this implies that sup{aλ(t) : λ ∈ Λ} → 0 as t → ∞. This proves (i).

To prove (ii), note that since 1 < |fλ|/t on the set {|fλ| > t},

supλ∈Λ aλ(t) ≤ supλ∈Λ ∫{|fλ|>t} |fλ| [|fλ|/t]ε dµ ≤ K t−ε → 0 as t → ∞.

For (iii), note that for each t ∈ R, the function ht(x) ≡ xI(t,∞)(x), x ∈ R, is nondecreasing. Hence, by the integrability of g,

supλ∈Λ aλ(t) = supλ∈Λ ∫ht(|fλ|) dµ ≤ ∫ht(g) dµ = ∫{g>t} g dµ → 0 as t → ∞.

To prove (iv), for t > 0, let a(t) = supλ∈Λ ∫ht(|fλ|) dµ and b(t) = supγ∈Γ ∫ht(|gγ|) dµ. Then, for any λ ∈ Λ and γ ∈ Γ,

∫{|fλ+gγ|>t} |fλ + gγ| dµ = ∫ht(|fλ + gγ|) dµ
≤ ∫ht(2 max{|fλ|, |gγ|}) dµ
= ∫ht(2|fλ|) I(|fλ| ≥ |gγ|) dµ + ∫ht(2|gγ|) I(|fλ| < |gγ|) dµ
≤ 2 ∫{|fλ|>t/2} |fλ| dµ + 2 ∫{|gγ|>t/2} |gγ| dµ
≤ 2[a(t/2) + b(t/2)].

By hypothesis, both a(t) and b(t) → 0 as t → ∞, thus proving (iv).

Next consider (v). Since {fλ : λ ∈ Λ} is UI, there exists a T > 0 such that

supλ∈Λ ∫hT(|fλ|) dµ ≤ 1.

Hence,

supλ∈Λ ∫|fλ| dµ = supλ∈Λ [∫{|fλ|≤T} |fλ| dµ + ∫hT(|fλ|) dµ] ≤ Tµ(Ω) + 1 < ∞.

This completes the proof of the proposition.

Remark 2.5.3: In the above proposition, part (ii) can be improved as follows: Let φ : R+ → R+ be nondecreasing and φ(x)/x ↑ ∞ as x ↑ ∞. If supλ∈Λ ∫φ(|fλ|) dµ < ∞, then {fλ : λ ∈ Λ} is UI (Problem 2.27). A converse to this result is true. That is, if {fλ : λ ∈ Λ} is UI, then there exists such a function φ. Some examples of such φ's are φ(x) = xk, k > 1, φ(x) = x(log x)β I(x > 1), β > 0, and φ(x) = exp(βx), β > 0.

In part (v) of Proposition 2.5.7, (5.11) does not imply UI. For example, consider the sequence of functions fn = nI[0,1/n], n = 1, 2, . . ., on [0, 1]. On the other hand, (5.11) with an additional condition becomes necessary and sufficient for UI.
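That fn = nI[0,1/n] is L1-bounded but not UI can be seen directly from (5.9)–(5.10): the tail integral an(t) equals 1 whenever n > t, so its supremum over n never decreases. The Python sketch below (plain arithmetic for this particular family, included only as an illustration) evaluates supn an(t) for a few values of t.

# a_n(t) = int_{|f_n| > t} |f_n| dm on ([0, 1], m), as in (5.9), for f_n = n * 1_[0, 1/n].
def a_n(n, t):
    # The set {|f_n| > t} is [0, 1/n] when n > t and empty otherwise.
    return n * (1.0 / n) if n > t else 0.0

for t in [1, 10, 100]:
    sup_a = max(a_n(n, t) for n in range(1, 10_000))
    print(t, sup_a)          # stays equal to 1 for every t, so {f_n} is not UI (cf. (5.10))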

Proposition 2.5.8: Let f ∈ L1(Ω,F , µ). Then for every ε > 0, there exists a δ > 0 such that µ(A) < δ ⇒ ∫A |f| dµ < ε.

Proof: Fix ε > 0. By the DCT, there exists a t > 0 such that ∫{|f|>t} |f| dµ < ε/2. Hence, for any A ∈ F with µ(A) ≤ δ ≡ ε/(2t),

∫A |f| dµ ≤ ∫A∩{|f|≤t} |f| dµ + ∫{|f|>t} |f| dµ ≤ tµ(A) + ∫{|f|>t} |f| dµ ≤ ε,

proving the claim.


The above proposition shows that for every f ∈ L1(Ω,F , µ), the measure (cf. Corollary 2.3.6)

ν|f|(A) ≡ ∫A |f| dµ    (5.12)

on (Ω,F) satisfies the condition that ν|f|(A) is small if µ(A) is small, i.e., for every ε > 0, there exists a δ > 0 such that µ(A) < δ ⇒ ν|f|(A) < ε. This property is referred to as the absolute continuity of the measure ν|f| w.r.t. µ.

Definition 2.5.9: Given a family {fλ : λ ∈ Λ} ⊂ L1(Ω,F , µ), the measures {ν|fλ| : λ ∈ Λ} as defined in (5.12) above are uniformly absolutely continuous w.r.t. µ (or u.a.c. (µ), in short) if for every ε > 0, there exists a δ > 0 such that

µ(A) < δ ⇒ sup{ν|fλ|(A) : λ ∈ Λ} < ε.

Theorem 2.5.9: Let {fλ : λ ∈ Λ} ⊂ L1(Ω,F , µ) and µ(Ω) < ∞. Then, {fλ : λ ∈ Λ} is UI iff supλ∈Λ ∫|fλ| dµ < ∞ and {ν|fλ|(·) : λ ∈ Λ} is u.a.c. (µ).

Proof: Let {fλ : λ ∈ Λ} be UI. Then, since µ(Ω) < ∞, L1 boundedness of {fλ : λ ∈ Λ} follows from Proposition 2.5.7 (v). To establish u.a.c. (µ), fix ε > 0. By UI, there exists an N such that

supλ∈Λ ∫{|fλ|>N} |fλ| dµ < ε/2.

Let δ = ε/(2N) and let A ∈ F be such that µ(A) < δ. Then, as in the proof of Proposition 2.5.8 above,

supλ∈Λ ∫A |fλ| dµ ≤ Nµ(A) + ε/2 < ε.

Conversely, suppose {fλ : λ ∈ Λ} is L1 bounded and u.a.c. (µ). Then, for every ε > 0, there exists a δε > 0 such that

supλ∈Λ ∫A |fλ| dµ < ε if µ(A) < δε.    (5.13)

Also, for any nonnegative f in L1(Ω,F , µ) and t > 0, ∫f dµ ≥ ∫{f≥t} f dµ ≥ tµ(f ≥ t), which implies that

µ(f ≥ t) ≤ (∫f dµ)/t.

(This is known as Markov's inequality; see Chapter 3.) Hence, it follows that

supλ∈Λ µ(|fλ| ≥ t) ≤ [supλ∈Λ ∫|fλ| dµ]/t.    (5.14)


Now given ε > 0, choose Tε such that [supλ∈Λ ∫|fλ| dµ]/Tε < δε, where δε is as in (5.13). Then, by (5.14), it follows that

supλ∈Λ ∫{|fλ|≥Tε} |fλ| dµ < ε,

i.e., {fλ : λ ∈ Λ} is UI.

Theorem 2.5.10: Let (Ω,F , µ) be a measure space with µ(Ω) < ∞, and let {fn : n ≥ 1} ⊂ L1(Ω,F , µ) be such that fn → f a.e. (µ) and f is 〈F ,B(R)〉-measurable. If {fn : n ≥ 1} is UI, then f is integrable and

limn→∞ ∫|fn − f| dµ = 0.

Remark 2.5.4: In view of Proposition 2.5.7 (iii), Theorem 2.5.10 yields convergence of ∫fn dµ to ∫f dµ under weaker conditions than the DCT, provided µ(Ω) < ∞. However, even under the restriction µ(Ω) < ∞, UI of {fn}n≥1 is a sufficient, but not a necessary, condition for convergence of ∫fn dµ to ∫f dµ (Problem 2.28). In the special case where the fn's are nonnegative (and µ(Ω) < ∞), ∫fn dµ → ∫f dµ < ∞ if and only if {fn}n≥1 are UI (Problem 2.29). On the other hand, when µ(Ω) = +∞, UI is no longer sufficient to guarantee the convergence of ∫fn dµ to ∫f dµ (Problem 2.30). Thus, the notion of UI is useful mainly for finite measures and, in particular, probability measures.

Proof: By Proposition 2.5.7 and Fatou's lemma,

∫|f| dµ ≤ lim infn→∞ ∫|fn| dµ ≤ supn≥1 ∫|fn| dµ < ∞

and hence f is integrable.

Next, for n ∈ N, t ∈ (0,∞), define the functions gn = |fn − f|, gn,t = gn I(|gn| ≤ t), and ḡn,t = gn I(|gn| > t). Since {fn}n≥1 is UI and f is integrable, by Proposition 2.5.7 (iv), {gn}n≥1 is UI. Hence, for any ε > 0, there exists tε > 0 such that

supn≥1 ∫ḡn,t dµ < ε for all t ≥ tε.    (5.15)

Next note that for any t > 0, since fn → f a.e. (µ), gn,t → 0 a.e. (µ), |gn,t| ≤ t, and ∫t dµ = tµ(Ω) < ∞. Hence, by the DCT,

limn→∞ ∫gn,t dµ = 0 for all t > 0.    (5.16)


By (5.15) and (5.16) with t = tε, we get

0 ≤ lim supn→∞ ∫|fn − f| dµ ≤ supn≥1 ∫ḡn,tε dµ + limn→∞ ∫gn,tε dµ ≤ ε

for all ε > 0. Hence, ∫|fn − f| dµ → 0 as n → ∞.

The next result concerns connections between the notions of almost everywhere convergence and almost uniform convergence.

Theorem 2.5.11: (Egorov's theorem). Let fn → f a.e. (µ) and µ(Ω) < ∞. Then fn → f nearly uniformly (µ) as in Definition 2.5.7.

Proof: For j, n, r ∈ N, define the sets

Ajr = {ω : |fj(ω) − f(ω)| ≥ r−1}, Bnr = ⋃j≥n Ajr, Cr = ⋂n≥1 Bnr, D = ⋃r≥1 Cr.

It is easy to verify that D is the set of points where fn does not converge to f. That is,

D = {ω : fn(ω) ↛ f(ω)}.

By hypothesis µ(D) = 0. This implies µ(Cr) = 0 for all r ≥ 1. Since Bnr ↓ Cr as n → ∞ and µ(Ω) < ∞, by m.c.f.a., µ(Cr) = 0 ⇒ limn→∞ µ(Bnr) = 0. So for all r ≥ 1, ε > 0, there exists kr ∈ N such that µ(Bkrr) < ε2−r. Let A = ⋃r≥1 Bkrr. Then µ(A) < ε and Ac = ⋂r≥1 (Bkrr)c. Also, for each r ∈ N, Ac ⊂ (Bkrr)c. So for any n ≥ 1 and any r ∈ N,

sup{|fn(ω) − f(ω)| : ω ∈ Ac} ≤ sup{|fn(ω) − f(ω)| : ω ∈ (Bkrr)c} ≤ 1/r if n ≥ kr.

That is, sup{|fn(ω) − f(ω)| : ω ∈ Ac} → 0 as n → ∞.

Theorem 2.5.12: (Lusin's theorem). Let F : R → R be a nondecreasing function and let µ = µ∗F be the corresponding Lebesgue-Stieltjes measure on (R,Mµ∗F). Let f : R → R̄ be 〈Mµ∗F,B(R̄)〉-measurable. Let µ({x : |f(x)| = ∞}) = 0. Then for every ε > 0, there exists a continuous function g : R → R such that µ({x : f(x) ≠ g(x)}) < ε.

Proof: Fix −∞ < a < b < ∞. Since µ([a, b]) < ∞ and µ({x : |f(x)| = ∞}) = 0, for each ε > 0, there exists K ∈ (0,∞) such that

µ({x : a ≤ x ≤ b, |f(x)| > K}) < ε/2.


Let fK(x) = f(x) I[a,b](x) I{|f|≤K}(x). Then fK : R → R is bounded, 〈Mµ∗F,B(R)〉-measurable and zero for |x| > K. Consider now the following claim: For every ε > 0, there exists a continuous function g : [a, b] → R such that

µ({x : fK(x) ≠ g(x), a ≤ x ≤ b}) < ε/2.    (5.17)

Clearly this implies that

µ({x : a ≤ x ≤ b, f(x) ≠ g(x)}) < ε.    (5.18)

Fix δ > 0. Now for each n ∈ Z, take [a, b] = [n, n + 1], ε = δ2−(|n|+2), apply (5.18) and call the resulting continuous function gn. Let g be a continuous function from R to R such that µ({x : n ≤ x ≤ n + 1, g(x) ≠ gn(x)}) < δ2−(|n|+2). This can be done by setting g(x) = gn(x) for n ≤ x ≤ (n + 1) − δn and linear on [(n + 1) − δn, n + 1] for some 0 < δn < δ2−(|n|+3). Then

µ({x : f(x) ≠ g(x)}) ≤ ∑∞n=−∞ µ({x : n ≤ x ≤ n + 1, f(x) ≠ g(x)})
≤ ∑∞n=−∞ µ({x : n ≤ x ≤ n + 1, f(x) ≠ gn(x)}) + ∑∞n=−∞ µ({x : n ≤ x ≤ n + 1, gn(x) ≠ g(x)})
< 2 ∑∞n=−∞ δ2−(|n|+2) ≤ 2δ.

So it suffices to establish (5.17). Since fK : R → R is bounded and 〈Mµ∗F,B(R)〉-measurable, for each ε > 0, there exists a simple function s(x) ≡ ∑ki=1 ci IAi(x), with Ai ⊂ [a, b], Ai ∈ Mµ∗F, {Ai : 1 ≤ i ≤ k} disjoint, µ(Ai) < ∞, and ci ∈ R for i = 1, . . . , k, such that |fK(x) − s(x)| < ε/4 for all a ≤ x ≤ b. By Theorem 1.3.4, for each Ai and η > 0, there exist a finite number of open disjoint intervals Iij = (aij, bij), j = 1, . . . , ni, such that

µ(Ai Δ ⋃nij=1 Iij) < η/(2k).

Now as in Problem 1.32 (g), there exists a continuous function gij such that

µ({x : gij(x) ≠ IIij(x)}) < η/(kni), j = 1, 2, . . . , ni, i = 1, 2, . . . , k.

Let

gi ≡ ∑nij=1 gij, 1 ≤ i ≤ k.


Then µ({x : IAi(x) ≠ gi(x)}) < η/k. Let g = ∑ki=1 ci gi. Then µ(s ≠ g) < η. Hence for every ε > 0, η > 0, there is a continuous function gε,η : [a, b] → R such that

µ({x : a ≤ x ≤ b, |fK(x) − gε,η(x)| > ε}) < η.

Now for each n ≥ 1, let

hn(·) ≡ g2−n, 2−n(·) and An = {x : a ≤ x ≤ b, |fK(x) − hn(x)| > 2−n}.

Then, µ(An) < 2−n and hence

∑∞n=1 µ(An) < ∞.

By the MCT, this implies that ∫[a,b] (∑∞n=1 IAn) dµ < ∞ and hence

∑∞n=1 IAn < ∞ a.e. (µ).

Thus hn → fK a.e. (µ) on [a, b]. By Egorov's theorem, for any ε > 0, there is a set Aε ∈ B([a, b]) such that

µ(Aεc) < ε/2 and hn → fK uniformly on Aε.

By the inner regularity (Corollary 1.3.5) of µ, there is a compact set D ⊂ Aε such that

µ(Aε\D) < ε/2.

Since hn → fK uniformly on Aε, fK is continuous on Aε and hence on D. It can be shown that there exists a continuous function g : [a, b] → R such that g = fK on D (Problem 2.8 (e)). A more general result extending a continuous function defined on a closed set to the whole space is known as Tietze's extension theorem (see Munkres (1975)). Thus µ({x : a ≤ x ≤ b, fK(x) ≠ g(x)}) < ε. This completes the proof of (5.17) and hence that of the theorem.

Remark 2.5.5: (Littlewood's principles). As pointed out in Section 1.3, Theorems 1.3.4, 2.5.11, and 2.5.12 constitute J. E. Littlewood's three principles: every Mµ∗F-measurable set is nearly a finite union of intervals; every a.e. convergent sequence is nearly uniformly convergent; and every Mµ∗F-measurable function is nearly continuous.

2.6 Problems

2.1 Let Ωi, i = 1, 2, be two nonempty sets and T : Ω1 → Ω2 be a map. Then for any collection {Aα : α ∈ I} of subsets of Ω2, show that

T−1(⋃α∈I Aα) = ⋃α∈I T−1(Aα)

and

T−1(⋂α∈I Aα) = ⋂α∈I T−1(Aα).

Further, (T−1(A))c = T−1(Ac) for all A ⊂ Ω2. (These are known as de Morgan's laws.)

2.2 Let {Ai}i≥1 be a collection of disjoint sets in a measurable space (Ω,F).

(a) Let {gi}i≥1 be a collection of 〈F ,B(R)〉-measurable functions from Ω to R. Show that ∑∞i=1 IAi gi converges on Ω and is 〈F ,B(R)〉-measurable.

(b) Let G ≡ σ〈Ai : i ≥ 1〉. Show that h : Ω → R is 〈G,B(R)〉-measurable iff h(·) is constant on each Ai.

2.3 Let f, g : Ω → R be 〈F ,B(R)〉-measurable. Set

h(ω) = (f(ω)/g(ω)) I(g(ω) ≠ 0), ω ∈ Ω.

Verify that h is 〈F ,B(R)〉-measurable.

2.4 Let g : Ω → R be such that for every r ∈ R, g−1((−∞, r]) ∈ F. Show that g is 〈F ,B(R)〉-measurable.

2.5 Prove Proposition 2.1.6.

(Hint: Show that

σ〈fλ : λ ∈ Λ〉 = ⋃L∈C σ〈fλ : λ ∈ L〉,

where C is the collection of all countable subsets of Λ.)

2.6 Let Xi, i = 1, 2, 3, be random variables on a probability space (Ω,F , P). Consider the random equation (in t ∈ R):

X1(ω)t2 + X2(ω)t + X3(ω) = 0.    (6.1)

(a) Show that A ≡ {ω ∈ Ω : Equation (6.1) has two distinct real roots} ∈ F.

(b) Let T1(ω) and T2(ω) denote the two roots of (6.1) on A. Let

fi(ω) = Ti(ω) on A, and fi(ω) = 0 on Ac,

i = 1, 2. Show that (f1, f2) is 〈F ,B(R2)〉-measurable.


2.7 Let M ≡ ((Xij)), 1 ≤ i, j ≤ k, be a (random) matrix of random variables Xij defined on a probability space (Ω,F , P).

(a) Show that Y1 ≡ det(M) (the determinant of M) and Y2 ≡ tr(M) (the trace of M) are both 〈F ,B(R)〉-measurable.

(b) Show also that Y3 ≡ the largest eigenvalue of M′M is 〈F ,B(R)〉-measurable, where M′ is the transpose of M.

(Hint: Use the result that Y3 = supx≠0 (x′M′Mx)/(x′x).)

2.8 Let f : R → R. Let f̄(x) = infδ>0 sup|y−x|<δ f(y) and f̲(x) = supδ>0 inf|y−x|<δ f(y), x ∈ R.

(a) Show that for any t ∈ R,

{x : f̄(x) < t}

is open and hence, f̄ is 〈B(R),B(R)〉-measurable.

(b) Show that for any t > 0,

{x : f̄(x) − f̲(x) < t} = ⋃r∈Q {x : f̄(x) < t + r, f̲(x) > r}

and hence is open.

(c) Show that f is continuous at some x0 in R iff f̄(x0) = f̲(x0).

(d) Show that the set Cf ≡ {x : f(·) is continuous at x} is a Gδ set, i.e., an intersection of a countable number of open sets, and hence, Cf is a Borel set.

(e) Let D be a closed set in R. Let f : D → R be continuous on D. Show that there exists a continuous g : R → R such that g = f on D.

(Hint: Note that Dc is open in R and hence it can be expressed as a countable union of disjoint open intervals {Ij = (aj, bj) : 1 ≤ j ≤ k ≤ ∞}. Note that aj, bj ∈ D for all j except possibly the j's for which aj = −∞ or bj = +∞. Let

g(x) ≡ f(x) if x ∈ D;
g(x) ≡ f(aj) + [(x − aj)/(bj − aj)](f(bj) − f(aj)) if x ∈ (aj, bj), aj, bj ∈ D;
g(x) ≡ f(bj) if aj = −∞, x ∈ (aj, bj);
g(x) ≡ f(aj) if bj = ∞, x ∈ (aj, bj).

Now verify that g has the required properties.)

Page 89: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

74 2. Integration

2.9 Prove Proposition 2.2.1 using Problem 2.1.

2.10 (a) Show that for any x ∈ R and any random variable X with cdfFX(·), P (X < x) = FX(x−) ≡ limy↑x FX(y).

(b) Show that Fc(·) in (2.5) is continuous.

2.11 Let F : R → R be nondecreasing.

(a) Show that for x ∈ R, F (x−) ≡ limy↑x F (y) and F (x+) =limy↓x F (y) exist and satisfy F (x−) ≤ F (x) ≤ F (x+).

(b) Let D ≡ x : F (x+) − F (x−) > 0. Show that D is at mostcountable.

(Hint: Show thatD =

⋃n≥1

⋃r≥1

Dn,r,

where Dn,r = x : |x| ≤ n, F (x+)− F (x−) > 1r and that each

Dn,r is finite.)

2.12 Suppose that (i) and (iii) of Proposition 2.2.3 hold. Show that forany a1 in R and −∞ < a2 ≤ b2 < ∞, F (a2, a1) ≤ F (b2, a1) andF (a1, a2) ≤ F (a1, b2), (i.e., F is monotone coordinatewise).

2.13 Let F : Rk → R be such that:

(a) for x1 = (x11, x12, . . . , x1k) and x2 = (x21, x22, . . . , x2k) withx1i ≤ x2i for i = 1, 2, . . . k,

∆F (x1, x2) ≡∑a∈A

(−1)s(a)F (a) ≥ 0,

where A ≡ a = (a1, a2, . . . , ak) : ai ∈ x1i, x2i, i = 1, 2, . . . , kand for a in A, s(a) = |i : ai = x1i, i = 1, 2, . . . , k| is the num-ber of indices i for which ai = x1i.

(b) For each i = 1, 2, . . . , k, limxi1↓−∞

F (xi) = 0.

Let Ck be the semialgebra of sets of the form A : A = A1× . . .×Ak,Ai ∈ C for all 1 ≤ i ≤ k where C is the semialgebra in R defined in(3.7). Set µF (A) ≡ ∆F (x1, x2) if A = (x11, x21] × (x12, x22] × . . . ×(x1k, x2k] is bounded and set µF (A) = limn→∞ µ(A ∩ Jn), whereJn = (−n, n]k.

(i) Show that F is coordinatewise monotone, i.e., if x =(x1, . . . , xk), y = (y1, . . . , yk) and xi ≤ yi for every i = 1, . . . , k,then

F (y) ≥ F (x).

Page 90: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

2.6 Problems 75

(ii) Show that C is a semialgebra and µF is a measure on C by usingthe Heine-Borel theorem as in Problem 1.22 and 1.23.

2.14 Let m(·) denote the Lebesgue measure on (R,B(R)). Let T : R → Rbe the map T (x) = x2. Evaluate the induced measure mT−1(A) ofthe set A, where

(a) A = [0, t], t > 0.

(b) A = (−∞, 0).

(c) A = 1, 2, 3, . . ..(d) A =

⋃∞i=1(i

2, (i + 1i2 )2).

(e) A =⋃∞

i=1(i2, (i + 1

i )2).

2.15 Consider the probability space((0, 1),B((0, 1)), m

), where m(·) is the

Lebesgue measure.

(a) Let Y1 be the random variable Y1(x) ≡ sin 2πx for x ∈ (0, 1).Find the cdf of Y1.

(b) Let Y2 be the random variable Y2(x) ≡ log x for x ∈ (0, 1). Findthe cdf of Y2.

(c) Let F : R → R be a cdf . For 0 < x < 1, let

F−11 (x) = infy : y ∈ R, F (y) ≥ x

F−12 (x) = supy : y ∈ R, F (y) ≤ x.

Let Zi be the random variable defined by

Zi = F−1i (x) 0 < x < 1, i = 1, 2.

(i) Find the cdf of Zi, i = 1, 2.

(Hint: Verify using the right continuity of F that for any0 < x < 1, t ∈ R, F (t) ≥ x ⇔ F−1

1 (x) ≤ t.)(ii) Show also that F−1

1 (·) is left continuous and F−12 (·) is right

continuous.

2.16 (a) Let (Ω,F1, µ) be a σ-finite measure space. Let T : Ω → R be〈F ,B(R)〉-measurable. Show, by an example, that the inducedmeasure µT−1 need not be σ-finite.

(b) Let (Ωi,Fi) be measurable spaces for i = 1, 2 and let T :Ω1 → Ω2 be 〈F1,F2〉-measurable. Show that any measure µon (Ω1,F1) is σ-finite if µT−1 is σ-finite on (Ω2,F2).

Page 91: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

76 2. Integration

2.17 Let (Ω,F , µ) be a measure space and let f : Ω → [0,∞] be such thatit admits two representations

f =k∑

i=1

ciIAiand f =

∑j=1

djIBj,

where ci, dj ∈ [0,∞],and Ai and Bj ∈ F for all i, j. Show that

k∑i=1

ciµ(Ai) =∑

j=1

djµ(Bj).

(Hint: Express Ai and Bj as finite unions of a common collection ofdisjoint sets in F .)

2.18 (a) Prove Proposition 2.3.1.(b) In the proof of Proposition 2.3.2, verify that Dn ↑ D.(c) Verify Remark 2.3.1.

(Hint: Let

An = ω : fn−1(ω) ≥ fn(ω), gn+1(ω) ≥ gn(ω)

A =( ⋂

n≥1

An

)⋂ω : lim

n→∞ gn(ω) = g(ω),

limn→∞ fn(ω) = f(ω)

and fn = fnIA, gn = gnIA. Verify that µ(Ac) = 0 and applyProposition 2.3.2 to fnn≥1 and gnn≥1.)

2.19 (a) Verify that fn(·) defined in (3.7) satisfies fn(ω) ↑ f(ω) for all ωin Ω.

(b) Verify that the sequence hnn≥1 of Remark 2.3.3 satisfieslimn→∞ hn = f a.e. (µ), and limn→∞

∫hndµ = M .

2.20 Apply Corollary 2.3.5 to show that for any collection aij : i, j ∈ Nof nonnegative numbers,

∞∑i=1

⎛⎝ ∞∑

j=1

aij

⎞⎠ =

∞∑j=1

( ∞∑i=1

aij

).

2.21 Let g : R → R.

(a) Recall that limt→∞ g(t) = L for some L in R if for every ε > 0,there exists a Tε < ∞ such that t ≥ Tε ⇒ |g(t)− L| < ε. Showthat limt→∞ g(t) = L for some L in R iff limn→∞ g(tn) = L forevery sequence tnn≥1 with limn→∞ tn = ∞.

Page 92: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

2.6 Problems 77

(b) Formulate and prove a similar result when limt→a g(t) = L forsome a, L ∈ R.

2.22 Let ft : t ∈ R ⊂ L1(Ω,F , µ).

(a) (The continuous version of the MCT). Suppose that ft ↑ f ast ↑ ∞ a.e. (µ) and for each t, ft ≥ 0 a.e. (µ). Show that∫

ftdµ ↑∫

fdµ.

(b) (The continuous version of the DCT). Suppose there exists anonnegative g ∈ L1(Ω,F , µ) such that for each t, |ft| ≤ g a.e.(µ) and as t → ∞, ft → f a.e. (µ). Then f ∈ L1(Ω,F , µ) and∫|ft − f |dµ → 0 and hence,

∫ftdµ →

∫fdµ, as t →∞.

2.23 Let f : [a, b] → R be bounded where −∞ < a < b < ∞. Let Pnn≥1be a sequence of partitions such that ∆(Pn) → 0. Show that as n →∞,

U(Pn, f) →∫

f and L(Pn, f) →∫

f,

where∫

f and∫f are as defined in (4.3) and (4.4), respectively.

(Hint: Given ε > 0, fix a partition P = x0 = a < x1 < . . . < xk = bsuch that

∫f < U(P, f) < f + ε. Let δ = min0≤i≤k−1(xi+1 − xi).

Choose n large such that the diameter ∆(Pn) < δ. Verify that

U(Pn, f) < U(P, f) + kB∆(Pn)

where B = sup|f(x)| : a ≤ x ≤ b and conclude that limnU(Pn, f) ≤∫f + ε.)

2.24 Establish (4.6) and (4.7).

(Hint: Show that for every x and any ε > 0, φn(x) ≤ φ(x) + ε for alln large and that for x ∈

⋃n≥1 Pn, φn(x) ≥ φ(x) for all n.)

2.25 If f(x) = IQ1(x) where Q1 = Q∩[0, 1], Q being the set of all rationals,then show that for any partition P , U(P, f) = 1 and L(P, f) = 0.

2.26 Establish Theorem 2.5.1.

(Hint: Verify that

D ≡ ω : fj(ω) → f(ω)=

⋃r≥1

⋂n≥1

⋃j≥n

Ajr,

where Ajr = |fj−f | > 1r. Show that since µ(D) = 0 and µ(Ω) < ∞,

µ(Drn) → 0 as n →∞ for each r ∈ N, where Drn =⋃

j≥n Ajr.)

Page 93: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

78 2. Integration

2.27 Let φ : R+ → R+ be nondecreasing and φ(x)x ↑ ∞ as x ↑ ∞.

Also, let fλ : λ ∈ Λ be a subset of L1(Ω,F , µ). Show that ifsupλ∈Λ

∫φ(|fλ|)dµ < ∞, then fλ : λ ∈ Λ is UI.

2.28 Let µ be the Lebesgue measure on ([−1, 1],B([−1, 1])). For n ≥ 1,define fn(x) = nI(0,n−1)(x) − nI(−n−1,0)(x) and f(x) ≡ 0 for x ∈[−1, 1]. Show that fn → f a.e. (µ) and

∫fndµ →

∫fdµ but fnn≥1

is not UI.

2.29 Let fn : n ≥ 1 ∪ f ⊂ L1(Ω,F , µ).

(a) Show that∫|fn − f |(dµ) → 0 iff fn −→m f and

∫|fn|dµ →∫

|f |dµ.(b) Show further that if µ(Ω) < ∞ then the above two are equivalent

to fn −→m f and fn UI.

2.30 For n ≥ 1, let fn(x) = n−1/2I(0,n)(x), x ∈ R, and let f(x) = 0, x ∈ R.Let m denote the Lebesgue measure on (R,B(R)). Show that fn → fa.e. (m) and fnn≥1 is UI, but

∫fndm →

∫fdm.

2.31 (Computing integrals w.r.t. the Lebesgue measure). Let f ∈L1(R,Mm, m) where (R,Mm, m) is the real line with Lebesgue σ-algebra, and Lebesgue measure, i.e., m = µ∗

F where F (x) ≡ x. Thedefinition of

∫fdm as

∫f+dm−

∫f−dm involves computing

∫f+dm

and∫

f−dm which in turn is given in terms of approximating by in-tegrals of simple nonnegative functions. This is not a very practicalprocedure. For f that happens to be continuous a.e. and bounded onfinite intervals, one can compute the Riemann integral of f over finiteintervals and pass to the limit. Justify the following steps:

(a) Let f be continuous a.e. and bounded on finite intervals andf ∈ L1(R,Mm, m). Show that for −∞ < a < b < ∞, f ∈L1([a, b],Mm, m) and∫

[a,b]fdm =

∮[a,b]

f(x)dx,

where the right side denotes the Riemann integral of f on [a, b].(b) If, in addition, f ∈ L1(R,Mm, m), then∫

R

fdm = lima→−∞b→+∞

∫[a,b]

fdm.

(c) If f is continuous a.e. and ∈ L1(R,Mm, m), then∫R

fdm = lima→−∞b→+∞c→∞

∫[a,b]

φc(f)dm

where φc(f) = f(x)I(|f(x)| ≤ c)+cI(f(x) > c)−cI(f(x) < −c).

Page 94: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

2.6 Problems 79

(d) Apply the above procedure to compute∫

Rfdm for

(i) f(x) = 11+x2 ,

(ii) f(x) = e−x2/2,

(iii) f(x) = e+xe−x2/2.

2.32 (a) Let a ∈ R. Show that if a1, a2 are nonnegative such that a =a1 − a2 then a1 ≥ a+, a2 ≥ a− and a1 − a+ = a2 − a−.

(b) Let f = f1 − f2 where f1, f2 are nonnegative and are inL1(Ω,F , µ). Show that f ∈ L1(Ω,F , µ) and

∫fdµ =

∫f1dµ −∫

f2dµ.

2.33 Let (Ω,F , µ) be a measure space. Let f : Ω× (a, b) → R be such thatfor each a < t < b, f(·, t) ∈ L1(Ω,F , µ).

(a) Suppose for each a < t < b,

(i) limh→0

f(ω, t + h) = f(ω, t) a.e. (µ).

(ii) sup|h|≤1

|f(ω, t + h)| ≤ g1(ω, t), where g1(·, t) ∈ L1(Ω,F , µ).

Show that φ(t) ≡∫Ω f(ω, t)dµ is continuous on (a, b).

(b) Suppose for each a < t < b.

(i) limh→0

f(ω, t + h)− f(ω, t)h

= g2(ω, t) exists a.e. (µ),

(ii) sup0≤|h|≤1

∣∣∣f(ω, t + h)− f(ω, t)h

∣∣∣ ≤ G(ω, t) a.e. (µ),

(iii) G(ω, t) ∈ L1(Ω,F , µ).

Show thatφ(t) ≡

∫Ω

f(ω, t)dµ

is differentiable on (a, b).

(Hint: Use the continuous version of DCT (cf. Problem 2.22).)

2.34 Let A ≡((aij)

)be an infinite matrix of real numbers. Suppose that

for each j, limi→∞ aij = aj exists in R and supi |aij | ≤ bj , where∑∞j=1 bj < ∞.

(a) Show by an application of the DCT that

limi→∞

∞∑j=1

|aij − aj | = 0.

(b) Show the same directly, i.e., without using the DCT.

Page 95: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

80 2. Integration

2.35 Using the above problem or otherwise, show that for any sequencexnn≥1 with limn→∞ xn = x in R,

limn→∞

(1 +

xn

n

)n

=∞∑

j=0

xj

j!≡ exp(x).

2.36 (a) Let fnn≥1 ⊂ L1(Ω,F , µ) such that fn → 0 in L1(µ). Showthat fnn≥1 is UI.

(b) Let fnn≥1 ⊂ Lp(Ω,F , µ), 0 < p < ∞, such that µ(Ω) < ∞,|fn|pn≥1 is UI and fn −→m f . Show that f ∈ Lp(µ) andfn → f in Lp(µ).

2.37 (a) Show that for a sequence of real numbers xnn≥1, limn→∞ xn =x ∈ R holds iff every subsequence xnj

j≥1 of xnn≥1 has afurther subsequence xnjk

k≥1 such that limk→∞ xnjk= x.

(b) Use (a) and Theorem 2.5.2 to show that the extended DCT(Theorem 2.3.11) is valid if the a.e. convergence of fnn≥1 andgnn≥1 is replaced by convergence in measure.

2.38 Let (R,Mµ∗F, µ∗

F ) be a Lebesgue-Stieltjes measure space generatedby a F : R → R nondecreasing. Let f : R → R be Mµ∗

F-measurable

such that |f | < ∞ a.e. (µ∗F ). Show that for every k ∈ N and for every

ε, η ∈ (0,∞), there exists a continuous function g : R → R such thatg(x) = 0 for |x| > k and µ∗

F (x : |x| ≤ k, |f(x)− g(x)| > η < ε.

(Hint: Complete the following:Step I: For all ε > 0, there exists Mk,ε ∈ (0,∞) such that µ∗

F (x :|x| ≤ k, |f(x)| > Mk,ε) < ε.Step II: For η > 0, there exists a simple function s(·) such thatµ∗

F (x : |x| ≤ k, |f(x)| ≤ Mk,ε, |f(x)− s(x)| > η) = 0.Step III: For δ > 0, there exists a continuous function g(·) suchthat g ≡ 0 for |x| > k and µ∗

F (x : |x| ≤ k, s(x) = g(x)) < δ.)

2.39 Recall from Corollary 2.3.6 that for g ∈ L1(Ω,F , µ) and nonnegativeνg(A) ≡

∫A

gdµ is a measure. Next for any F-measurable function h,show that h ∈ L1(νg) iff h · g ∈ L1(µ) and

∫hdνg =

∫hgdµ.

(Hint: Verify first for h simple and nonnegative, next for h nonneg-ative, and finally for any h.)

2.40 Prove the BCT, Corollary 2.3.13, using Egorov’s theorem (Theorem2.5.11).

2.41 Deduce the DCT from the BCT with the notation as in Corollary2.3.12.

(Hint: Apply the BCT to the measure space (Ω,F , νg) and functionshn = fn

g I(g > 0), h = fg I(g > 0) where νg is as in Problem 2.39.)

Page 96: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

2.6 Problems 81

2.42 (Change of variables formula). Let (Ωi,Fi), i = 1, 2 be two measur-able spaces. Let f : Ω1 → Ω2 be 〈F1,F2〉-measurable, h : Ω2 → Rbe 〈F2,B(R)〉-measurable, and µ1 be a measure on (Ω1,F1). Showthat g ≡ h f , i.e., g(ω) ≡ h(f(ω)) for ω ∈ Ω1 is in L1(µ) iff h(·) ∈L1(Ω2,F2, µ2) where µ2 = µ1f

−1 iff I(·) ∈ L1(R,B(R), µ3 ≡ µ2h

−1)

where I(·) is the identity function in R, i.e., I(x) ≡ x for all x ∈ Rand also that ∫

Ω1

gdµ1 =∫

Ω2

hdµ2 =∫

R

xdµ3.

2.43 Let φ(x) ≡ 1√2π

e−x2/2 be the standard N(0, 1) pdf on R. Let(µn, σn)n≥1 be a sequence in R × R+. Suppose µn → µ andσn → σ as n → ∞. Let fn(x) = 1

σnφ(

x−µn

σn

), f(x) = 1

σ φ(

x−µσ

)and νn(A) =

∫A

fndm, ν(A) =∫

Afdm for any Borel set A in R,

where m(·) is the Lebesgue measure on R. Using Scheffe’s theorem,verify that, as n → ∞, νn(·) → ν(·) uniformly on B(R) and that forany h : R → R, bounded and Borel measurable,∫

hdνn →∫

hdν.

2.44 Let fn(x) = cn

(1− x

n

)nI[0,n](x), x ∈ R, n ≥ 1.

(a) Find cn such that∫

fndm = 1.

(b) Show that limn→∞ fn(x) ≡ f(x) exists for all x in R and that fis a pdf

(c) For A ∈ B(R), let

νn(A) ≡∫

A

fndm and ν(A) =∫

A

fdm.

Show that νn → ν uniformly on B(R).

2.45 Let (Ω,F , µ) be a measure space and f : Ω → R be F-measurable.Suppose that

∫Ω |f |dµ < ∞ and D is a closed set in R such that for

all B ∈ F with µ(B) > 0,

1µ(B)

∫B

fdµ ∈ D.

Show that f(ω) ∈ D a.e. µ.

(Hint: Show that for x ∈ D, there exists r > 0 such that µω :|f(ω)− x| < r = 0.)

Page 97: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

82 2. Integration

2.46 Find a sequence of nonnegative continuous functions fnn≥1 on [0,1]such that

∫[0,1] fndm → 0 but fn(x)n≥1 does not converge for any

x.

(Hint: Let for m ≥ 1, 1 ≤ k ≤ m, gm,k = I[ k−1m , k

m

] and fnn≥1 be

a reordering of gn,k : 1 ≤ k ≤ n, n ≥ 1.)

2.47 Let fnn≥1 be a sequence of continuous functions from [0,1] to [0,1]such that fn(x) → 0 as n → ∞ for all 0 ≤ x ≤ 1. Show that∫ 10 fn(x)dx → 0 as n → ∞ (where the integral is the Riemann in-

tegral) by two methods: one by using BCT and one without usingBCT. Show also that if µ is a finite measure on ([0, 1],B([0, 1])), then∫[0,1] fndµ → 0 as n →∞.

2.48 (Invariance of Lebesgue measure under translation and reflection.)Let m(·) be Lebesgue measure on (R,B(R)).

(a) For any E ∈ B(R) and c ∈ R, define −E ≡ x : −x ∈ E andE + c ≡ y : y = x + c, x ∈ E. Show that

m(−E) = m(E) andm(E + c) = m(E)

for all E ∈ B(R) and c ∈ R.

(b) For any f ∈ L1(R,B(R), m) and c ∈ R, let f(x) ≡ f(−x) andfc(x) ≡ f(x + c). Show that∫

fdm =∫

fcdm =∫

fdm.

Page 98: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3Lp-Spaces

3.1 Inequalities

This section contains a number of useful inequalities.

Theorem 3.1.1: (Markov’s inequality). Let f be a nonnegative measurablefunction on a measure space (Ω,F , µ). Then for any 0 < t < ∞,

µ(f ≥ t) ≤∫

fdµ

t. (1.1)

Proof: Since f is nonnegative,∫

fdµ ≥∫(f≥t) fdµ ≥ tµ(f ≥ t).

Corollary 3.1.2: Let X be a random variable on a probability space(Ω,F , P ). Then, for r > 0, t > 0,

P (|X| ≥ t) ≤ E|X|rtr

.

Proof: Since |X| ≥ t = |X|r ≥ tr for all t > 0, r > 0, this followsfrom (1.1).

Corollary 3.1.3: (Chebychev’s inequality). Let X be a random variablewith EX2 < ∞, E(X) = µ and Var(X) = σ2. Then for any 0 < k < ∞,

P (|X − µ| ≥ kσ) ≤ 1k2 .

Page 99: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

84 3. Lp-Spaces

Proof: Follows from Corollary 3.1.2 with X replaced by X − µ and withr = 2, t = kσ.

Corollary 3.1.4: Let φ : R+ → R+ be nondecreasing. Then for any randomvariable X and 0 < t < ∞,

P (|X| ≥ t) ≤ Eφ(|X|)φ(t)

.

Proof: Use (1.1) and the fact that |X| ≥ t ⇒ φ(|X|) ≥ φ(t).

Corollary 3.1.5: (Cramer’s inequality). For any random variable X andt > 0,

P (X ≥ t) ≤ infθ>0

E(eθX)eθt

.

Proof: For t > 0, θ > 0, P (X ≥ t) = P(eθX ≥ eθt

)≤ E(eθX)

eθt , by (1.1).

Definition 3.1.1: A function φ : (a, b) → R is called convex if for all0 ≤ λ ≤ 1, a < x ≤ y < b,

φ(λx + (1− λ)y) ≤ λφ(x) + (1− λ)φ(y). (1.2)

Geometrically, this means that for the graph of y = φ(x) on (a, b), foreach fixed t ∈ (0,∞), the chord over the interval (x, x + t) turns in thecounterclockwise direction as x increases.

More precisely, the following result holds.

Proposition 3.1.6: Let φ : (a, b) → R. Then φ is convex iff for alla < x1 < x2 < x3 < b,

φ(x2)− φ(x1)x2 − x1

≤ φ(x3)− φ(x2)x3 − x2

, (1.3)

which is equivalent to

φ(x2)− φ(x1)x2 − x1

≤ φ(x3)− φ(x1)(x3 − x1)

≤ φ(x3)− φ(x2)x3 − x2

. (1.4)

Proof: Let φ be convex and a < x1 < x2 < x3 < b. Then one can writex2 = λx1 + (1− λ)x3 with λ = (x3−x2)

(x3−x1). So by (1.2),

φ(x2) = φ(λx1 + (1− λ)x3)≤ λφ(x1) + (1− λ)φ(x3)

=(x3 − x2)(x3 − x1)

φ(x1) +(x2 − x1)(x3 − x1)

φ(x3)

Page 100: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.1 Inequalities 85

which is equivalent to (1.3). Also, since

φ(x3)− φ(x1)(x3 − x1)

= λφ(x3)− φ(x2)

(x3 − x2)+ (1− λ)

φ(x2)− φ(x1)(x2 − x1)

,

(1.4) follows from (1.3).Conversely, suppose (1.4) holds for all a < x1 < x2 < x3 < b. Then given

a < x < y < b and 0 < λ < 1, set x1 = x, x2 = λx + (1− λ)y, x3 = y andapply (1.4) to verify (1.2).

The following properties of a convex function are direct consequences of(1.3). The proof is left as an exercise (Problem 3.1).

Proposition 3.1.7: Let φ : (a, b) → R be convex. Then,

(i) For each x ∈ (a, b),

φ′+(x) ≡ lim

y↓x

φ(y)− φ(x)(y − x)

, φ′−(x) ≡ lim

y↑x

φ(y)− φ(x)(y − x)

exist and are finite.

(ii) Further, φ′−(·) ≤ φ′

+(·) and both are nondecreasing on (a, b).

(iii) φ′(·) exists except on the countable set of discontinuity points of φ′+

and φ′−.

(iv) For any a < c < d < b, φ is Lipschitz on [c, d], i.e., there exists aconstant K < ∞ such that

|φ(x)− φ(y)| ≤ K|x− y| for all c ≤ x, y ≤ d.

(v) For any a < c, x < b,

φ(x)−φ(c) ≥ φ′+(c)(x− c) and φ(x)−φ(c) ≥ φ′

−(c)(x− c). (1.5)

By the mean value theorem, a sufficient condition for (1.3) and hence, forthe convexity of φ is that φ be differentiable on (a, b) and φ′ be nondecreas-ing. A further sufficient condition for this is that φ be twice differentiableon (a, b) and φ′′ be nonnegative. This is stated as

Proposition 3.1.8: Let φ be twice differentiable on (a, b) and φ′′ be non-negative on (a, b). Then φ is convex on (a, b).

Example 3.1.1: The following functions are convex in the given intervals:

(a) φ(x) = |x|p, p ≥ 1, (−∞,∞).

(b) φ(x) = ex, (−∞,∞).

Page 101: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

86 3. Lp-Spaces

(c) φ(x) = − log x, (0,∞).

(d) φ(x) = x log x, (0,∞).

Remark 3.1.1: By Proposition 3.1.7 (iii), the convexity of φ implies thatφ′ exists except on a countable set. For example, the function φ(x) = |x|is convex on R; it is differentiable at all x = 0. Similarly, it is easy toconstruct a piecewise linear convex function φ with a countable number ofpoints where φ is not differentiable.

The following is an important inequality for convex functions.

Theorem 3.1.9: (Jensen’s inequality). Let f be a measurable functionon a probability space (Ω,F , P ) with P (f ∈ (a, b)) = 1 for some interval(a, b), −∞ ≤ a < b ≤ ∞ and let φ : (a, b) → R be convex. Then

φ(∫

fdP)≤∫

φ(f)dP, (1.6)

provided∫|f |dP < ∞ and

∫|φ(f)|dP < ∞.

Remark 3.1.2: In terms of random variables, this says that for any ran-dom variable X on a probability space (Ω,F , P ) with P (X ∈ (a, b)) = 1and for any function φ that is convex on (a, b),

φ(EX) ≤ Eφ(X), (1.7)

provided E|X| < ∞ and E|φ(X)| < ∞, where for any Borel measurablefunction h, Eh(X) ≡

∫h(X)dP .

Proof of Theorem 3.1.9: Let c =∫

fdP . Applying (1.5), one gets

Y (ω) ≡ φ(f(ω))− φ(c)− φ′+(c)(f(ω)− c) ≥ 0 a.e. (P ), (1.8)

which, when integrated, yields∫

Y (ω)P (dω) ≥ 0. Since∫

(f(ω) − c)P (dω) = 0, (1.6) follows.

Remark 3.1.3: Suppose that equality holds in (1.6). Then, it follows that∫Y (ω)P (dω) = 0. By (1.8), this implies

φ(f(ω))− φ(c) = φ′+(c)(f(ω)− c) a.e. (P ).

Thus, if φ is a strictly convex function (i.e., strict inequality holds in (1.2)for all x, y ∈ (a, b) and 0 < λ < 1), then equality holds in (1.6) iff f(ω) = ca.e. (P ).

The following are easy consequences of Jensen’s inequality (Problem 3.3).

Proposition 3.1.10: Let k ≥ 1 be an integer.

Page 102: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.1 Inequalities 87

(i) Let a1, a2, , . . . , ak be real and p1, p2, , . . . , pk be positive numbers suchthat

∑ki=1 pi = 1. Then

k∑i=1

pi exp(ai) ≥ exp( k∑

i=1

piai

). (1.9)

(ii) Let b1, b2, . . . , bk be nonnegative numbers and p1, p2, . . . , pk be as in(i). Then

k∑i=1

pibi ≥k∏

i=1

bpi

i , (1.10)

and in particular,1k

k∑i=1

bi ≥( k∏

i=1

bi

) 1k

, (1.11)

i.e., the arithmetic mean of b1, . . . , bk is greater than or equal to thegeometric mean of the bi’s. Further, equality holds in (1.10) iff b1 =b2 = . . . = bk.

(iii) For any a, b real and 1 ≤ p < ∞,

|a + b|p ≤ 2p−1(|a|p + |b|p). (1.12)

Inequality (1.10) is useful in establishing the following:

Theorem 3.1.11: (Holder’s inequality). Let (Ω,F , µ) be a measure space.Let 1 < p < ∞, f ∈ Lp(Ω,F , µ) and g ∈ Lq(Ω,F , µ) where q = p

(p−1) .Then ∫

|fg|dµ ≤(∫

|f |pdµ

) 1p(∫

|g|qdµ

) 1q

, (1.13)

i.e., ‖fg‖1 ≤ ‖f‖p‖g‖q.

If ‖fg‖1 = 0, then equality holds in (1.13) iff |f |p = c|g|q a.e. (µ) for someconstant c ∈ (0,∞).

Proof: W.l.o.g. assume that∫|f |pdµ > 0 and

∫|g|qdµ > 0. Fix ω ∈ Ω. Let

p1 = 1p , p2 = 1

q , b1 = c1|f(ω)|p, and b2 = c2|g(ω)|q, where c1 = (∫|f |pdµ)−1

and c2 = (∫|g|qdµ)−1. Then applying (1.10) with k = 2 yields

c1

p|f(ω)|p +

c2

q|g(ω)|q ≥ c

1p

1 c1q

2 |f(ω)g(ω)|. (1.14)

Integrating w.r.t. µ yields

1 ≥ c1p

1 c1q

2

∫|f(ω)g(ω)|dµ(ω)

Page 103: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

88 3. Lp-Spaces

which is equivalent to (1.13).Next, equality in (1.13) implies equality in (1.14) a.e. (µ). Since 1 < p <

∞, by the last part of Proposition 3.1.10 (ii), this implies that b1 = b2 a.e.(µ), i.e., |f(ω)|p = c2c

−11 |g(ω)|q a.e. (µ).

Remark 3.1.4: (Holder’s inequality for p = 1, q = ∞). Let f ∈L1(Ω,F , µ) and g ∈ L∞(Ω,F , µ). Then |fg| ≤ |f | ‖g‖∞ a.e. (µ) and hence,

‖fg‖1 ≡∫|fg|dµ ≤ ‖f‖1‖g‖∞.

If equality holds in the above inequality, then |f |(‖g‖∞ − |g|) = 0 a.e. (µ)and hence, |g| = ‖g‖∞ on the set |f | > 0 a.e. (µ).

The next two corollaries follow directly from Theorem 3.1.11. The proofis left as an exercise (Problem 3.9).

Corollary 3.1.12: (Cauchy-Schwarz inequality). Let f , g ∈ L2(Ω,F , µ).Then ∫

|fg|dµ ≤(∫

|f |2dµ

) 12(∫

|g|2dµ

) 12

, (1.15)

i.e., ‖fg‖1 ≤ ‖f‖2‖g‖2.

Corollary 3.1.13: Let k ∈ N. Let a1, a2, . . . , ak, b1, b2, . . . , bk be realnumbers and c1, c2, . . . , ck be positive real numbers.

(i) Then, for any 1 < p < ∞,

k∑i=1

|aibi|ci ≤(

k∑i=1

|ai|pci

) 1p(

k∑i=1

|bi|qci

) 1q

, (1.16)

where q = p−1p .

(ii)k∑

i=1

|aibi|ci ≤(

k∑i=1

|ai|2ci

) 12(

k∑i=1

|bi|2ci

) 12

. (1.17)

Next, as an application of Holder’s inequality, one gets

Theorem 3.1.14: (Minkowski’s inequality). Let 1 < p < ∞ and f, g ∈Lp(Ω,F , µ). Then(∫

|f + g|pdµ

) 1p

≤(∫

|f |pdµ

) 1p

+(∫

|g|pdµ

) 1p

,

i.e., ‖f + g‖p ≤ ‖f‖p + ‖g‖q. (1.18)

Page 104: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.2 Lp-Spaces 89

Proof: Let h1 = |f + g|, h2 = |f + g|p−1. Then by (1.12),

|f + g|p ≤ 2p−1(|f |p + |g|p),

implying that h1 ∈ Lp(Ω,F , µ) and h2 ∈ Lq(Ω,F , µ), where q = p(p−1) .

Since |f + g|p = h1h2 ≤ |f |h2 + |g|h2, by Holder’s inequality,

∫|f+g|pdµ ≤

(∫|f |pdµ

) 1p(∫

hq2

) 1q

+(∫

|g|pdµ

) 1p(∫

hq2

) 1q

. (1.19)

But∫

hq2 =

∫|f + g|p and so (1.19) yields (1.18).

Remark 3.1.5: Inequality (1.18) holds for both p = 1 and p = ∞.

3.2 Lp-Spaces

3.2.1 Basic propertiesLet (Ω,F , µ) be a measure space. Recall the definition of Lp(Ω,F , µ), 0 <p ≤ ∞, from Section 2.5 as the set of all measurable functions f on (Ω,F , µ)such that ‖f‖p < ∞, where for 0 < p < ∞,

‖f‖p =(∫

|f |pdµ)min 1

p ,1

and for p = ∞,

‖f‖∞ ≡ supk : µ(|f | > k) > 0

(called the essential supremum of f). In this section and elsewhere, Lp(µ)denotes Lp(Ω,F , µ). The following proposition shows that Lp(µ) is a vectorspace over R.

Proposition 3.2.1: For 0 < p ≤ ∞,

f, g ∈ Lp(µ), a, b ∈ R ⇒ af + bg ∈ Lp(µ). (2.1)

Proof:

Case 1: 0 < p ≤ 1. For any two positive numbers x, y,(x

x + y

)p

+(

y

x + y

)p

≥ x

x + y+

y

x + y= 1.

Hence, for all x, y ∈ (0,∞)

(x + y)p ≤ xp + yp. (2.2)

Page 105: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

90 3. Lp-Spaces

It is easy to check that (2.2) continues to hold if x, y ∈ [0,∞). This yields|af + bg|p ≤ |a|p|f |p + |b|p|g|p, which, in turn, yields (2.1) by integration.

Case 2: 1 < p < ∞. By (1.12),

|af + bg|p ≤ 2p−1(|af |p + |bg|p).

Integrating both sides of the above inequality yields (2.1).

Case 3: p = ∞. By definition, there exist constants K1 < ∞ and K2 < ∞such that

µ(|f | > K1) = 0 = µ(|g| > K2).This implies that µ(|af + bg| > K) = 0 for any K > |a|K1 + |b|K2.Hence, af + bg ∈ L∞(µ) and

‖af + bg‖∞ ≤ |a| ‖f‖∞ + |b| ‖g‖∞ .

Recall that a set S with a function d : S × S → [0,∞] is called a metricspace if for all x, y, z ∈ S,

(i) d(x, y) = d(y, x) (symmetry)

(ii) d(x, y) ≤ d(x, z) + d(y, z) (triangle inequality)

(iii) d(x, y) = 0 iff x = y

and the function d(·, ·) is called a metric on S. Some examples are

(a) S = Rk with d(x, y) =(∑k

i=1 |xi − yi|2) 1

2;

(b) S = C[0, 1], the space of all continuous functions on [0, 1] withd(f, g) = sup|f(x)− g(x)| : 0 ≤ x ≤ 1;

(c) S = a nonempty set, and d(x, y) = 1 if x = y and 0 if x = y. (Thisd(·, ·) is called the discrete metric on S.)

The Lp-norm ‖ · ‖p can be used to introduce a distance notion in Lp(µ)for 0 < p ≤ ∞.

Definition 3.2.1: For f , g ∈ Lp(µ), 0 < p ≤ ∞, let

dp(f, g) ≡ ‖f − g‖p. (2.3)

Note that, for any f , g, h ∈ Lp(µ) and 1 ≤ p ≤ ∞,

(i) dp(f, g) = dp(g, f) ≥ 0 (nonnegativity and symmetry), and

(ii) dp(f, h) ≤ dp(f, g) + dp(g, h) (triangle inequality),

Page 106: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.2 Lp-Spaces 91

which follows by Minkowski’s inequality (Theorem 3.1.14) for 1 ≤ p < ∞,and by Proposition 3.2.1 for p = ∞. However, dp(f, g) = 0 implies onlythat f = g a.e. (µ). Thus, dp(·, ·) of (2.3) satisfies conditions (i) and (ii)of being a metric and it satisfies condition (iii) as well, provided any twofunctions f and g that agree a.e. (µ) are regarded as the same element ofLp(µ). This leads to the following:

Definition 3.2.2: For f , g in Lp(µ), f is called equivalent to g and iswritten as f ∼ g, if f = g a.e. (µ).

It is easy to verify that the relation ∼ of Definition 3.2.2 is an equivalencerelation, i.e., it satisfies

(i) f ∼ f for all f in Lp(µ) (reflexive)

(ii) f ∼ g ⇒ g ∼ f (symmetry)

(iii) f ∼ g, g ∼ h ⇒ f ∼ h (transitive).

This equivalence relation ∼ divides Lp(µ) into disjoint equivalence classessuch that in each class all elements are equivalent. The notion of distancebetween these classes may be defined as follows:

dp([f ], [g]) ≡ dp(f, g)

where [f ] and [g] denote, respectively, the equivalence classes of functionscontaining f and g. It can be verified that this is a metric on the set ofequivalence classes. In what follows, the equivalence class [f ] is identifiedwith the element f . With this identification, (Lp(µ), dp(·, ·)) becomes ametric space for 1 ≤ p ≤ ∞.

Remark 3.2.1: For 0 < p < 1, if one defines

dp(f, g) ≡∫|f − g|pdµ, (2.4)

then (Lp(µ), dp) becomes a metric space (with the same identification asabove of functions with their equivalence classes). The triangle inequalityfollows from (2.2).

Recall that a metric space (S, d) is called complete if every Cauchy se-quence in (S, d) converges to an element in S, i.e., if xnn≥1 is a se-quence in S such that for every ε > 0, there exists a Nε such thatn, m ≥ Nε ⇒ d(xn, xm) ≤ ε, then there exists an element x in (S, d) suchthat limn→∞ d(xn, x) = 0. The next step is to establish the completenessof Lp(µ).

Theorem 3.2.2: For 0 < p ≤ ∞, (Lp(µ), dp(·, ·)) is complete, where dp

is as in (2.3).

Page 107: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

92 3. Lp-Spaces

Proof: Let fnn≥1 be a Cauchy sequence in Lp(µ) for 0 < p < ∞. Themain steps in the proof are as follows:

(I) there exists a subsequence nkk≥1 such that fnkk≥1 converges a.e.

(µ) to a limit function f ;

(II) limk→∞

dp(fnk, f) = 0;

(III) limn→∞ dp(fn, f) = 0.

Step (I): Let εkk≥1 and δkk≥1 be sequences of positive numbersdecreasing to zero. Since fnn≥1 is Cauchy, for each k ≥ 1, there existsan integer nk such that∫

|fn − fm|pdµ ≤ εk for all n, m ≥ nk. (2.5)

W.l.o.g., let nk+1 > nk for each k ≥ 1. Then, by Markov’s inequality(Theorem 3.1.1),

µ(|fnk+1 − fnk| ≥ δk) ≤ δ−p

k

∫|fnk+1 − fnk

|pdµ ≤ δ−pk εk. (2.6)

Let Ak = |fnk+1 − fnk| ≥ δk, k = 1, 2, . . . and A = lim supk→∞ Ak ≡⋂∞

j=1⋃

k≥j Ak. If εkk≥1 and δkk≥1 satisfy

∞∑k=1

δ−pk εk < ∞, (2.7)

then by (2.6),∑∞

k=1 µ(Ak) < ∞ and hence, as in the proof of Theorem2.5.2, µ(A) = 0.

Note that for ω in Ac, |fnk+1(ω) − fnk(ω)| < δk for all k large. Thus, if∑∞

k=1 δk < ∞, then for ω in Ac, fnk(ω)k≥1 is a Cauchy sequence in R

and hence, it converges to some f(ω) in R. Setting f(ω) = 0 for ω ∈ A,one gets

limk→∞

fnk= f a.e. (µ).

A choice of εkk≥1 and δkk≥1 such that∑∞

k=1 δk < ∞ and (2.7) holdsis given by εk = 2−(p+1)k and δk = 2−k. This completes Step (I).

Step (II): By Fatou’s lemma, part (I), and (2.5), for any k ≥ 1 fixed,

εk ≥ lim infj→∞

∫|fnk

− fnk+j|pdµ ≥

∫|fnk

− f |pdµ .

Since fnk∈ Lp(µ), this shows that f ∈ Lp(µ). Now, on letting k →∞, (II)

follows.

Page 108: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.2 Lp-Spaces 93

Step (III): By triangle inequality, for any k ≥ 1 fixed,

dp(fn, f) ≤ dp(fn, fnk) + dp(fnk

, f).

By (2.5) and (II), for n ≥ nk, the right side above is ≤ 2εk, where εk = εk

if 0 < p < 1 and εk = ε1/pk if 1 ≤ p < ∞. Now letting k →∞, (III) follows.

The proof of Theorem 3.2.2 is complete for 0 < p < ∞. The case p = ∞is left as an exercise (Problem 3.14).

3.2.2 Dual spacesLet 1 ≤ p < ∞. Let g ∈ Lq(µ), where q = p

(p−1) if 1 < p < ∞ and q = ∞if p = 1. Let

Tg(f) =∫

fgdµ, f ∈ Lp(µ). (2.8)

By Holders inequality,∫|fg|dµ < ∞ and so Tg(·) is well defined. Clearly

Tg is linear, i.e.,

Tg(α1f1 + α2f2) = α1Tg(f1) + α2Tg(f2) (2.9)

for all α1, α2 ∈ R and f1, f2 ∈ Lp(µ).

Definition 3.2.3:

(a) A function T : Lp(µ) → R that satisfies (2.9) is called a linear func-tional.

(b) A linear functional T on Lp(µ) is called bounded if there is a constantc ∈ (0,∞) such that

|T (f)| ≤ c‖f‖p for all f ∈ Lp(µ).

(c) The norm of a bounded linear functional T on Lp(µ) is defined as

‖T‖ = sup|Tf | : f ∈ Lp(µ), ‖f‖p = 1

.

By Holder’s inequality (cf. Theorem 3.1.11 and Remark 3.1.4),

|Tg(f)| ≤ ‖g‖q‖f‖p for all f ∈ Lp(µ),

and hence, Tg is a bounded linear functional on Lp(µ). This implies that ifdp(fn, f) → 0, then

|Tg(fn)− Tg(f)| ≤ ‖g‖qdp(fn, f) → 0,

i.e., Tg is continuous on the metric space (Lp(µ), dp).

Page 109: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

94 3. Lp-Spaces

Definition 3.2.4: The set of all continuous linear functionals on Lp(µ) iscalled the dual space of Lp(µ) and is denoted by (Lp(µ))∗.

In the next section, it will be shown that continuity of a linear functionalon Lp(µ) implies boundedness. A natural question is whether every contin-uous linear functional T on Lp(µ) coincides with Tg for some g in Lq(µ).The answer is “yes” for 1 ≤ p < ∞, as shown by the following result.

Theorem 3.2.3: (Riesz representation theorem). Let 1 ≤ p < ∞. LetT : Lp(µ) → R be linear and continuous. Then, there exists a g in Lq(µ)such that T = Tg, i.e.,

T (f) = Tg(f) ≡∫

fgdµ for all f ∈ Lp(µ), (2.10)

where q = pp−1 for 1 < p < ∞ and q = ∞ if p = 1.

Remark 3.2.2: Such a representation is not valid for p = ∞. That is,there exists continuous linear functionals T on L∞(µ) for which there is nog ∈ L1(µ) such that T (f) =

∫fgdµ for all f ∈ L∞(µ), provided µ is not

concentrated on a finite set ω1, ω2, . . . , ωk ⊂ Ω.

For a proof of Theorem 3.2.3 and the above remark, see Royden (1988)or Rudin (1987).

Next consider the mapping from (Lp(µ))∗ and Lq(µ) defined by

φ(Tg) = g,

where Tg is as defined in (2.10). Then, φ is linear, i.e.,

φ(α1T1 + α2T2) = α1φ(T1) + α2φ(T2)

for all α1, α2 ∈ R and T1, T2 ∈ (Lp(µ))∗. Further,

‖φ(T )‖q = ‖T‖ for all T ∈ (Lp(µ))∗.

Thus, φ preserves the vector space structure of (Lp(µ))∗ and the norm. Forthis reason, it is called an isometry between (Lp(µ))∗ and Lq(µ).

3.3 Banach and Hilbert spaces

3.3.1 Banach spacesIf (Ω,F , µ) is a measure space it was seen in the previous section that thespace Lp(Ω,F , µ) of equivalence classes of functions f with

∫Ω |f |

pdµ < ∞is a vector space over R for all 1 ≤ p < ∞ and for p ≥ 1, ‖ · ‖p ≡(∫|f |pdµ)1/p satisfies

Page 110: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.3 Banach and Hilbert spaces 95

(i) ‖f + g‖ ≤ ‖f‖+ ‖g‖,

(ii) ‖αf‖ = |a| ‖f‖ for every α ∈ R,

(iii) ‖f‖ = 0 iff f = 0 a.e. (µ).

The Euclidean spaces Rk for any k ∈ N is also a vector space. Notethat for p ≥ 1, setting ‖x‖p ≡ (

∑ki=1 |xi|p)1/p if x = (x1, x2, . . . , xk),

(Rk, ‖x‖p) may be identified with a special case of Lp(Ω,F , µ), where Ω ≡1, 2, . . . , k, F = P(Ω) and µ is the counting measure.

Generalizing the above examples leads to the notion of a normed vectorspace (also called normed linear space). Recall that a vector space V overR is a nonempty set with a binary operation +, a function from V × V toV (called addition), and scalar multiplication by the real numbers, i.e., afunction from R× V → V ,

((α, v) → αv

)satisfying

(i) v1, v2 ∈ V ⇒ v1 + v2 = v2 + v1 ∈ V .

(ii) v1, v2, v3 ∈ V ⇒ (v1 + v2) + v3 = v1 + (v2 + v3).

(iii) There exists an element θ, called the zero vector, in V such thatv + θ = v for all v in V .

(iv) α ∈ R, v ∈ V ⇒ αv ∈ V .

(v) α ∈ R, v1, v2 ∈ V ⇒ α(v1 + v2) = αv1 + αv2.

(vi) α1, α2 ∈ R, v ∈ V ⇒ (α1+α2)v = α1v+α2v and α1(α2v) = (α1α2)v.

(vii) v ∈ V ⇒ 0v = θ and 1v = v.

Note that from conditions (vi) and (vii) above, it follows that for any v inV , v +(−1)v = 0 · v = θ. Thus for any v in V , (−1)v is the additive inverseand is denoted by −v. Conditions (i), (ii), and (iii) are called respectivelycommutativity, associativity, and the existence of an additive identity. ThusV under the operation + is an Abelian (i.e., commutative) group.

Definition 3.3.1: A function f from V to R+ denoted by f(v) ≡ ‖v‖ iscalled a norm if

(a) v1, v2 ∈ V ⇒ ‖v1 + v2‖ ≤ ‖v1‖+ ‖v2‖ (triangle inequality)

(b) α ∈ R, v ∈ V ⇒ ‖αv‖ = |α| ‖v‖ (scalar homogeneity)

(c) ‖v‖ = 0 iff v = θ.

A vector space V with a norm ‖ ·‖ defined on it is called a normed vectorspace or normed linear space and is denoted as (V, ‖ · ‖). Let d(v1, v2) ≡‖v1− v2‖, v1, v2 ∈ V . Then from the definition of ‖ · ‖, it follows that d is ametric on V , i.e., (V, d) is a metric space. Recall that a metric space (S, d)

Page 111: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

96 3. Lp-Spaces

is called complete if every Cauchy sequence xnn≥1 in S converges to anelement x in S.

Definition 3.3.2: A Banach space is a complete normed linear space(V, ‖ · ‖).

It was shown by S. Banach of Poland that all Lp(Ω,B, µ) spaces areBanach spaces, provided p ≥ 1 and in particular, all Euclidean spacesare Banach spaces. An example of a different kind is the space C[0, 1] ofall real valued continuous functions on [0, 1] with the usual operation ofpointwise addition and scalar multiplication, i.e., (f + g)(x) = f(x) + g(x)and (αf)(x) = α · f(x) for all α ∈ R, 0 ≤ x ≤ 1, f , g ∈ C[0, 1] where thenorm (called the supnorm) is defined by ‖f‖ = sup|f(x)| = 0 ≤ x ≤ 1.The verification of the fact that C[0, 1] with the supnorm is a Banachspace is left as an exercise (Problem 3.22). The space P of all polynomialson [0, 1] is also a normed linear space under the above norm but (P, ‖ · ‖) isnot complete (Problem 3.23). However for each n ∈ N, the space Pn of allpolynomials on [0, 1] of degree ≤ n is a Banach space under the supnorm(Problem 3.26).

Definition 3.3.3: Let V be a vector space. A subset W ⊂ V is called asubspace of V if v1, v2 ∈ W, α1, α2 ∈ R ⇒ α1v1 + α2v2 ∈ W . If (V, ‖ · ‖) isa normed vector space and W is a subspace of V , then (W, ‖ · ‖) is also anormed vector space. If W is closed in (V, ‖ · ‖), then W is called a closedsubspace of V .

Remark 3.3.1: If (V, ‖ · ‖) is a Banach space and W is a closed subspaceof V , then (W, ‖ · ‖) is also a Banach space.

3.3.2 Linear transformationsLet (Vi, ‖ · ‖i), i = 1, 2 be two normed linear spaces over R.

Definition 3.3.4: A function T from V1 to V2 is called a linear trans-formation or linear operator if α1, α2 ∈ R, x, y ∈ V1 ⇒ T (α1x + α2y) =α1T (x) + α2T (y).

Definition 3.3.5: A linear operator T from (V1, ‖ · ‖1) to (V2, ‖ · ‖2) iscalled bounded if ‖T‖ ≡ sup‖Tx‖2 : ‖x‖1 < 1 < ∞, i.e., the image ofthe unit ball in (V1, ‖ · ‖1) is contained in a ball of finite radius centered atthe zero in V2.

Here is a summary of some important results on this topic. By linearityof T , T ( x

‖x‖ ) = 1‖x‖T (x) for any x = 0. It follows that T is bounded iff

there exists k < ∞ such that for any x ∈ V1, ‖Tx‖2 ≤ k‖x‖1. Clearly,then k can be taken to be ‖T‖. Also by linearity, if T is bounded, then‖Tx1−Tx2‖ = ‖T (x1−x2)‖ ≤ ‖T‖ ‖x1−x2‖ and so the map T is continuous

Page 112: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.3 Banach and Hilbert spaces 97

(indeed, uniformly so). It turns out that if a linear operator T is continuousat some x0 in V1, then T is continuous on all of V1 and is bounded (Problem3.28 (a)).

Now let B(V1, V2) be the space of all bounded linear operators from(V1, ‖·‖1) to (V2, ‖·‖2). For T1, T2 ∈ B(V1, V2), α1, α2 in R, let (α1T1+α2T2)be defined by (α1T1 + α2T2)(x) ≡ α1T1(x) + α2T1(x) for all x in V1. Thenit can be verified that (α1T1 + α2T2) also belongs to B(V1, V2) and

‖T‖ ≡ sup‖Tx‖2 : ‖x‖1 ≤ 1 (3.1)

is a norm on B(V1, V2). Thus (B(V1, V2), ‖·‖) is also a normed linear space.If (V2, ‖ · ‖2) is complete, then it can be shown that (B(V1, V2), ‖ · ‖) is

also a Banach space (Problem 3.28 (b)). In particular, if (V2, ‖ · ‖2) is thereal line, the space (B(V1, R), ‖ · ‖) is a Banach space.

3.3.3 Dual spaces

Definition 3.3.6: The space of all bounded linear functions from (V1, ‖·‖)to R (also called bounded linear functionals), denoted by V ∗

1 , is called thedual space of V1.

Thus, for any normed linear space (V1, ‖·‖1) (that need not be complete),the dual space (V ∗

1 , ‖ ·‖) is always a Banach space, where ‖T‖ ≡ sup|Tx| :‖x‖1 < 1 for T ∈ V ∗

1 . If (V1, ‖ · ‖1) = Lp(Ω,F , µ) for some measurespace (Ω,F , µ) and 1 ≤ p < ∞, by the Riesz representation theorem (seeTheorem 3.2.3), the dual space may be identified with Lq(Ω,F , µ) whereq is the conjugate of p, i.e., 1

p + 1q = 1. However, as pointed out earlier in

Section 3.2, this is not true for p = ∞. That is, the dual of L∞(Ω,F , µ) isnot L1(Ω,F , µ) unless (Ω,F , µ) is a measure space where Ω is a finite setw1, w2, . . . , wk and F = P(Ω). An example for the p = ∞ case can beconstructed for the space ∞ of all bounded sequences of real numbers (cf.Royden (1988)).

The representation of the dual space of the Banach space C[0, 1] withsupnorm is in terms of finite signed measures (cf. Section 4.2).

Theorem 3.3.1: (Riesz ). Let T : C[0, 1] → R be linear and bounded.Then there exists two finite measures µ1 and µ2 on [0, 1] such that for anyf ∈ C[0, 1]

T (f) =∫

fdµ1 −∫

fdµ2.

For a proof see Royden (1988) or Rudin (1987) (see also Problem 3.27).

Page 113: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

98 3. Lp-Spaces

3.3.4 Hilbert spaceA vector space V over R is called a real innerproduct space if there existsa function f : V × V → R, denoted by f(x, y) ≡ 〈x, y〉 (and called theinnerproduct) that satisfies

(i) 〈x, y〉 = 〈y, x〉 for all x, y ∈ V ,

(ii) (linearity) 〈α1x1 +α2x2, y〉 = α1〈x, y〉+α2〈x2, y〉 for all α1, α2 ∈ R,x1, x2, y ∈ V ,

(iii) 〈x, x〉 ≥ 0 for all x ∈ V and 〈x, x〉 = 0 iff x = θ, the zero vector of V .

Using the fact that the quadratic function ϕ(t) = 〈x + ty, x + ty〉 =〈x, x〉+2t〈x, y〉+ t2〈y, y〉 is nonnegative for all t ∈ R, one gets the Cauchy-Schwarz inequality

|〈x, y〉| ≤√〈x, x〉〈y, y〉 for all x, y ∈ V.

Now setting ‖x‖ =√〈x, x〉 and using the Cauchy-Schwarz inequality, one

verifies that ‖x‖ is a norm on V and thus (V, ‖ ·‖) is a normed linear space.Further, the function 〈x, y〉 from V × V to R is continuous (Problem 3.29)under the norm ‖(x1, x2)‖ = ‖x1‖+ ‖x2‖, (x1, x2) ∈ V × V .

Definition 3.3.7: Let (V, 〈·, ·〉) be a real innerproduct space. It is calleda Hilbert space if (V, ‖ · ‖) is a Banach space, i.e., if it is complete.

It was seen in Section 3.2 that for any measure space (Ω,F , µ), the spaceL2(Ω,F , µ) of all equivalence classes of functions f : Ω → R satisfying∫|f |2dµ < ∞ is a complete innerproduct space with the innerproduct

〈f, g〉 =∫

fgdµ and hence a Hilbert space. It turns out that every Hilbertspace H is an L2(Ω,F , µ) for some (Ω,F , µ). (The axiom of choice or itsequivalent, the Hausdorff’s maximality principle, is required for a proof ofthis. See Rudin (1987).) This is in contrast to the Banach space case whereevery Lp(Ω,F , µ) with p ≥ 1 is a Banach space but not conversely, i.e.,every Banach space need not be an Lp(Ω,F , µ).

Next for each x in a Hilbert space H, let Tx : H → R be defined byTx(y) = 〈x, y〉. By the defining properties of 〈x, y〉 and the Cauchy-Schwarzinequality, it is easy to verify that Tx is a bounded linear function on H,i.e.,

Tx(α1y1 + α2y2) = α1Tx(y1) + α2Tx(y2) for all α1, α2 ∈ R, y1, y2 ∈ H(3.2)

and|Tx(y)| ≤ ‖x‖ ‖y‖ for all y ∈ H. (3.3)

Thus Tx ∈ H∗, the dual space. It is an important result (see Theorem 3.3.3below) that every T ∈ H∗ is equal to Tx for some x in H and ‖T‖ = ‖x‖.Thus H∗ can be identified with H.

Page 114: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.3 Banach and Hilbert spaces 99

Definition 3.3.8: Let (V, 〈·, ·〉) be an inner product space. Two vectorsx, y in V are said to be orthogonal and written as x ⊥ y if 〈x, y〉 = 0.

A collection B ⊂ V is called orthogonal if x, y ∈ B, x = y ⇒ 〈x, y〉 = 0.The collection B is called orthonormal if it is orthogonal and in additionfor all x in B, ‖x‖ = 1. Note that if x ⊥ y, then ‖x − y‖2 = 〈x − y, x −y〉 = 〈x, x〉+ 〈y, y〉 = ‖x‖2 + ‖y‖2 and so if B is an orthonormal set, thenx, y ∈ B ⇒ either x = y or ‖x−y‖ =

√2. Thus, if V is separable under the

metric d(x, y) = ‖x− y‖ (i.e., there exists a countable set D ⊂ V such thatfor every x in V and ε > 0, there exists a d ∈ D such that ‖x−d‖ < ε) andif B ⊂ V is an orthonormal system, then the open ball Sb of radius 1

2√

2around each b ∈ B satisfies Sb ∩ D : b ∈ B are disjoint and nonempty.Thus B is countable.

Now let (V, 〈·, ·〉) be a separable innerproduct space and B ⊂ V be anorthonormal system.

Definition 3.3.9: The Fourier coefficients of a vector x in V with respecton orthonormal set B is the set 〈x, b〉 : b ∈ B.

Since V is separable, B is countable. Let B = bi : i ∈ N. For a givenx ∈ V , let ci = 〈x, bi〉, i ≥ 1. Let xn ≡

∑ni=1 cibi, n ∈ N. The sequence

xnn≥1 is called the partial sum sequence of the Fourier expansion of thevector x w.r.t. the orthonormal set B.

A natural question is: when does xnn≥1 converge to x? By the linearityproperty in the definition of the innerproduct 〈·, ·〉, it follows that

0 ≤ ‖x− xn‖2 = 〈x− xn, x− xn〉 = 〈x, x〉 − 2〈x, xn〉+ 〈xn, xn〉

and

〈x, xn〉 =n∑

i=1

ci〈x, bi〉 =n∑

i=1

c2i .

Since bii≥1 are orthonormal,

‖xn‖2 = 〈xn, xn〉 =n∑

i=1

c2i = 〈x, xn〉.

Thus,

0 ≤ ‖x− xn‖2 = ‖x‖2 − ‖xn‖2 = ‖x‖2 −n∑

i=1

c2i ,

leading to

Proposition 3.3.2: (Bessel’s inequality). Let bii≥1 be orthonormal inan innerproduct space (V, 〈·, ·〉). Then, for any x in V ,

∞∑i=1

〈x, bi〉2 ≤ ‖x‖2. (3.4)

Page 115: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

100 3. Lp-Spaces

Now let (V, 〈·, ·〉) be a Hilbert space. Since for m > n,

‖xn − xm‖2 =m∑

i=n+1

〈x, bi〉2,

it follows from Bessel’s inequality that xnn≥1 is a Cauchy sequence, andsince V is complete, there is a y in V such that xn → y. This implies (by thecontinuity of 〈x, y〉) that 〈x, bi〉 = limn→∞〈xn, bi〉 = 〈y, bi〉 ⇒ 〈x−y, bi〉 = 0for all i ≥ 1. Thus, it follows that 〈x, bi〉 = 〈y, bi〉 for all i ≥ 1. The lastrelation implies y = x iff the set bii≥1 satisfies the property that

〈z, bi〉 = 0 for all i ≥ 1 ⇒ ‖z‖ = 0. (3.5)

This property is called the completeness of B. Thus B ≡ bii≥1 is acomplete orthonormal set for a Hilbert space H ≡ (V, 〈·, ·〉), iff for everyvector x,

∞∑i=1

c2i = ‖x‖2, (3.6)

where ci = 〈x, bi〉, i ≥ 1, which in turn holds iff∥∥∥∥x−n∑

i=1

cibi

∥∥∥∥→ 0 as n →∞. (3.7)

Conversely, if cii≥1 is a sequence of real numbers such that∑∞

i=1 c2i < ∞,

then the sequence xn ≡∑n

i=1 cibin≥1 is Cauchy and hence converges toan x in V . Thus the Hilbert space H can be identified with the space 2of all square summable sequences

cii≥1 :

∑∞i=1 c2

i < ∞, in the sense

that the map ϕ : x → cii≥1, where ci = 〈x, bi〉, i ≥ 1, preserves thealgebraic structure as well as the innerproduct, i.e., ϕ is a linear operatorfrom H to 2 and 〈ϕ(x), ϕ(y)〉 = 〈x, y〉 for all x, y ∈ H. Such a ϕ is calledan isometric isomorphism between H to 2. Note also that 2 is simplyL2(Ω,F , µ) where Ω ≡ N, F = P(N), and µ, the counting measure. It canbe shown (using the axiom of choice) that every separable Hilbert spacedoes possess a complete orthonormal system, i.e., an orthonormal basis.

Next some examples are given. Here, unless otherwise indicated, H de-notes the Hilbert space and B denotes an orthonormal basis of H.

Example 3.3.1:

(a) H ≡ 2 = (x1, x2, . . .) : xi ∈ R,∑∞

i=1 x2i < ∞.

B ≡ ei : i ≥ 1 where ei ≡ (0, 0, . . . , 1, 0, . . .) with 1 in the ithposition and 0 elsewhere.

(b) H ≡ L2 ([0, 2π],B([0, 2π]

), µ) where µ(A) = 1

2π m(A), m(·) beingLebesgue measure.B ≡ cos nx : n = 0, 1, 2, . . . , ∪ sin nx : n = 1, 2, . . .. (For a proof,see Chapter 5.)

Page 116: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.3 Banach and Hilbert spaces 101

(c) Let H ≡ L2(R,B(R), µ) where µ is a finite measure such that∫|x|kdµ < ∞ for all k = 1, 2, . . ..

Let B1 ≡ 1, x, x2, . . . and B be the orthonormal set generated byapplying the Gram-Schmidt procedure to B1 (see Problem 3.31). Itcan be shown that B is a basis for H (Problem 3.39). When µ is thestandard normal distribution, the elements of B are called Hermitepolynomials.

For one more example, i.e., Haar functions, see Problem 3.40.

Theorem 3.3.3: (Riesz representation). Let H be a separable Hilbertspace. Then every bounded linear functional T on H → R can be representedas T ≡ Tx0 for some x0 ∈ V , where Tx0(y) ≡ 〈y, x0〉.

Proof: Let B = bii≥1 be an orthonormal basis for H. Let ci ≡ T (bi),i ≥ 1. Then, for n ≥ 1,

n∑i=1

c2i =

n∑i=1

ciT (bi)

= T

( n∑i=1

cibi

)(by the linearity of T )

⇒∣∣∣∣

n∑i=1

c2i

∣∣∣∣ ≤ ‖T‖∥∥∥∥

n∑i=1

cibi

∥∥∥∥ = ‖T‖( n∑

i=1

c2i

)1/2

⇒n∑

i=1

c2i ≤ ‖T‖2

⇒∞∑

i=1

c2i < ∞.

Thus xn ≡∑n

i=1 cibin≥1 is Cauchy in H and hence converges to anx0 in H. By the continuity of T , for any y, Ty = limn→∞ Tyn, whereyn ≡

∑ni=1〈y, bi〉bi, n ≥ 1. But

Tyn =n∑

i=1

〈y, bi〉ci =⟨

y,

n∑i=1

bici

= 〈y, xn〉, by the linearity of T

Again by continuity of 〈y, x〉, it follows that Ty = 〈y, x0〉.

A sufficient condition for an L2(Ω,F , µ) to be separable is that thereexists an at most countable family A ≡ Ajj≥1 of sets in F such thatF = σ〈A〉 and µ(Aj) > 0 for each j. This holds for any σ-finite measure µon(Rk,B(Rk)

)(Problem 3.38).

Page 117: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

102 3. Lp-Spaces

Remark 3.3.2: Assuming the axiom of choice, the Riesz representationtheorem remains valid for any Hilbert space, separable or not (Problem3.43).

3.4 Problems

3.1 Prove Proposition 3.1.7.

(Hint: Use (1.4) repeatedly.)

3.2 Let (Ω,F , µ) be a measure space with µ(Ω) ≤ 1 and f : Ω → (a, b) ⊂R be in L1(Ω,F , µ). Let φ : (a, b) → R be convex. Show that ifc ≡

∫fdµ ∈ (a, b) and φ(f) ∈ L1(Ω,F , µ) and cφ′

+(c) ≥ 0, then

µ(Ω) φ

(∫fdµ

)≤∫

φ(f)dµ.

3.3 Prove Proposition 3.1.10.

(Hint: Apply Jensen’s inequality with Ω ≡ 1, 2, . . . , k, F = P(Ω),P (i) = pi, f(i) = ai, i = 1, 2, . . . , k, and φ(x) = ex to get (i).Deduce (ii) from (i) and Remark 3.1.3. For (iii), consider φ(x) = |x|p.)

3.4 Give an example of a convex function φ on (0, 1) with a finite numberof points where it is not differentiable. Can this be extended to thecountable case? Uncountable case?

(Hint: Note that φ′+(·) and φ′

−(·) are both monotone and hence haveat most a countable number of discontinuity points.)

3.5 Let φ : (a, b) → R be convex.

(a) Using the definition and induction, show that

φ

( n∑i=1

pixi

)≤

n∑i=1

piφ(xi)

for any n ≥ 2, x1, x2 . . . , xn in (a, b) and p1, p2, . . . , pn, a prob-ability distribution.

(b) Use (a) to prove Jensen’s inequality for any bounded φ.

3.6 Show that a function φ : R → R is convex iff

φ

(∫[0,1]

fdm

)≤∫

[0,1]φ(f)dm

for every bounded Borel measurable function f : [0, 1] → R, wherem(·) is the Lebesgue measure.

Page 118: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.4 Problems 103

3.7 Let φ be convex on (a, b) and ψ : R → R be convex and nondecreasing.Show that ψ φ is convex on (a, b).

3.8 Let X be a nonnegative random variable on some probability space.

(a) Show that (EX)(E 1X ) ≥ 1. What does this say about the cor-

relation between X and 1X ?

(b) Let f, g : R+ → R+ be Borel measurable and such thatf(x)g(x) ≥ 1 for all x in R+. Show that Ef(X)Eg(X) ≥ 1.

3.9 Prove Corollary 3.1.13 using Holder’s inequality applied to an appro-priate measure space.

3.10 Extend Holder’s inequality as follows. Let 1 < pi < ∞, andfi ∈ Lpi(Ω,F , µ), i = 1, 2, . . . , k. Suppose

∑ki=1

1pi

= 1. Show that∫ (∏ki=1 fi

)dµ ≤

∏ki=1 ‖fi‖pi

.

(Hint: Use Proposition 3.1.10 (ii).)

3.11 Verify Minkowski’s inequality for p = 1 and p = ∞.

3.12 (a) Find (Ω,F , µ), 0 < p < 1, f , g ∈ Lp(Ω,F , µ) such that

(∫|f + g|pdµ

)1/p

>(∫

|f |pdµ)1/p

+(∫

|g|pdµ)1/p

.

(b) Prove (1.18) for 0 < p < 1 with ‖f‖p =∫|f |pdµ.

3.13 Let (Ω,F , µ) be a measure space. Let Akk≥1 ⊂ F and∑∞k=1 µ(Ak) < ∞. Show that µ

(lim

k→∞Ak

)= 0, where lim

k→∞Ak =⋂∞

n=1⋃

j≥n Aj = ω : ω ∈ Aj for infinitely many j ≥ 1.

3.14 Establish Theorem 3.2.2 for p = ∞.

(Hint: For each k ≥ 1, choose nk ↑ such that ‖fnk+1 − fnk‖∞ < 2−k.

Show that there exists a set A with µ(Ac) = 0 and for ω in A,|fnk+1(ω) − fnk

(ω)| < 2−k for all k ≥ 1 and now proceed as in theproof for the case 0 < p < ∞.)

3.15 Let f , g ∈ Lp(Ω,F , µ), 0 < p < 1. Show that d(f, g) =∫|f − g|pdµ

is a metric and (Lp(Ω,F , µ), d) is complete.

3.16 Let (Ω,F , µ) be a measure space and f : Ω → R be F-measurable.Let Af = p : 0 < p < ∞,

∫|f |pdµ < ∞.

(a) Show that p1, p2 ∈ Af , p1 < p2 implies [p1, p2] ⊂ Af .

(Hint: Use∫

|f |≥1 |f |pdµ ≤

∫|f |≥1 |f |

p2dµ and∫

|f |≤1 |f |pdµ ≤∫

|f |≤1 |f |p1dµ for any p1 < p < p2.)

Page 119: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

104 3. Lp-Spaces

(b) Let ψ(p) = log∫|f |pdµ for p ∈ Af . By (a), it is known that Af

is connected, i.e., it is an interval. Show that ψ is convex in theopen interior of Af .

(Hint: Use Holder’s inequality.)

(c) Give examples to show that Af could be a closed interval, anopen interval, and semi-open intervals.

(d) If 0 < p1 < p < p2, show that

‖f‖p ≤ max‖f‖p1 , ‖f‖p2

(Hint: Use (b).)

(e) Show that if∫|f |rdµ < ∞ for some 0 < r < ∞, then ‖f‖p →

‖f‖∞ as p →∞.

(Hint: Show first that for any K > 0, µ(|f | > K) > 0 ⇒lim

p→∞‖f‖p ≥ K. If ‖f‖∞ < ∞ and µ(Ω) < ∞, use the fact that

‖f‖p ≤ ‖f‖∞(µ(Ω))1/p and reduce the general case under thehypothesis that

∫|f |pdµ < ∞ for some p to this case.)

3.17 Let X be a random variable on a probability space (Ω,F , µ). Recallthat Eh(X) =

∫h(X)dµ if h(X) ∈ L1(Ω,F , µ).

(a) Show that (E|X|p1) ≤ (E|X|p2)p1p2 for any 0 < p1 < p2 < ∞.

(b) Show that ‘=’ holds in (a) iff |X| is a constant a.e. (µ).

(c) Show that if E|X| < ∞, then E| log |X|| < ∞ and E|X|r < ∞for all 0 < r < 1, and 1

r log(E|X|r) → E log |X| as r → 0.

3.18 Let X be a nonnegative random variable.

(a) Show that EX log X ≥ (EX)(E log X).

(b) Show that√

1 + (EX)2 ≤ E(√

1 + X2) ≤ 1 + EX.

3.19 Let Ω = N, F = P(N), and let µ be the counting measure. DenoteLp(Ω,F , µ) for this case by p.

(a) Show that p is the set of all sequences xnn≥1 such that∑∞n=1 |xn|p < ∞.

(b) For the following sequences, find all p > 0 such that they belongto p:

(i) xn ≡ 1n , n ≥ 1.

(ii) xn = 1n(log(n+1))2 , n ≥ 1.

Page 120: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.4 Problems 105

3.20 For 1 ≤ p < ∞, prove the Riesz representation theorem for p. Thatis, show that if T is a bounded linear functional from p → R, thenthere exists a y = yii≥1 ∈ q such that for any x = xii≥1 in p,T (x) =

∑∞i=1 xiyi.

(Hint: Let yi = T (ei) where ei = ei(j)j≥1, ei(j) = 1 if i = j, 0 ifi = j. Use the fact |T (x)| ≤ ‖T‖ ‖x‖p to show that for each n ∈ N,(∑n

i=1 |yi|q) ≤ ‖T‖q < ∞.)

3.21 Let Ω = R, F = B(R), µ = µF where F is a cdf on R. If f(x) ≡ x2,find Af = p : 0 < p < ∞, f ∈ Lp(R,B(R), µF ) for the followingcases:

(a) F (x) = Φ(x), the N(0, 1) cdf, i.e., Φ(x) ≡ 1√2π

∫ x

−∞ e−u2/2du,x ∈ R.

(b) F (x) = 1π

∫ x

−∞1

1+u2 du, x ∈ R.

3.22 Show that C[0, 1] with the supnorm (i.e., with ‖f‖ = sup|f(x)| : 0 ≤x ≤ 1) is a Banach space.

(Hint: To verify completeness, let fnn≥1 be a Cauchy sequencein C[0, 1]. Show that for each 0 ≤ x ≤ 1, fn(x)n≥1 is a Cauchysequence in R. Let f(x) = lim

n→∞ fn(x). Now show that sup|fn(x) −f(x)| : 0 ≤ s ≤ 1 ≤ lim

m→∞ ‖fn − fm‖. Conclude that fn converges to

f uniformly on [0, 1] and that f ∈ C[0, 1].)

3.23 Show that the space (P, ‖ · ‖) of all polynomials on [0, 1] with thesupnorm is a normed linear space that is not complete.

(Hint: Let g(t) = 11−t/2 for 0 ≤ t ≤ 1. Find a sequence of polynomials

fnn≥1 in P that converge to g in supnorm.)

3.24 Show that the function f(v) ≡ ‖v‖ from a normed linear space (V, ‖·‖)to R+ is continuous.

3.25 Let (V, ‖ · ‖) be a normed linear space. Let S = v : ‖v‖ < 1. Showthat S is an open set in V .

3.26 Show that the space Pk of all polynomials on [0, 1] of degree ≤ kis a Banach space under the supnorm, i.e., under ‖f‖ = sup|f(x)|,0 ≤ s ≤ 1.

(Hint: Let pn(x) =∑k

j=0 anjxj be a sequence of elements in Pk that

converge in supnorm to some f(·). Show that an1n≥1 converges andrecursively, anin≥1 converges for each i.)

3.27 Let µ be a finite measure on [0,1]. Verify that Tµ(f) ≡∫

fdµ is abounded linear functional on C[0, 1] and that ‖Tµ‖ = µ([0, 1]).

Page 121: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

106 3. Lp-Spaces

3.28 Let (Vi, ‖ · ‖i), i = 1, 2, be two normed linear spaces over R.

(a) Let T : V1 → V2 be a linear operator. Show that if for some x0,‖Tx − Tx0‖ → 0 as x → x0, then T is continuous on V1 andhence bounded.

(b) Show that if (V2, ‖ · ‖2) is complete, then B(V1, V2) ≡ T | T :V1 → V2, T linear and bounded is complete under the operatornorm defined in (3.1).

In the following, H will denote a real Hilbert space.

3.29 (a) Use the Cauchy-Schwarz inequality to show that the functionf(x, y) = 〈x, y〉 is continuous from H ×H → R.

(b) (Parallelogram law). Show that in an innerproduct space(V, 〈·, ·〉), for any x, y ∈ V

‖x + y‖2 + ‖x− y‖2 = 2(‖x‖2 + ‖y‖2)

where ‖x‖2 = 〈x, x〉.

3.30 (a) Let Qn(x)n≥0 be defined on [0, 2π] by

Qn(x) = cn

(1 + cos x

2

)n

where cn is such that

12π

∫ 2π

0Qn(x)dx = 1.

Clearly, Qn(·) ≥ 0.

(i) Verify that for each δ > 0,

supQn(x) : δ ≤ x ≤ 2π − δ → 0 as n →∞.

(ii) Use this to show that if f ∈ C[0, 2π] and if

Pn(t) ≡ 12π

∫ 2π

0Qn(t− s)f(s)ds, n ≥ 0, (4.1)

then Pn → f uniformly on [0, 2π].

(b) Use this to give a proof of the completeness of the class C oftrigonometric functions.

(c) Show that if f ∈ L1([0, 2π]), then Pn(·) converges to f inL1([0, 2π]).

Page 122: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.4 Problems 107

(d) Let µn(·)n≥1 be a sequence of probability measures on(R,B(R)) such that for each δ > 0, µn(x : |x| ≥ δ) → 0as n → ∞. Let f : R → R be Borel measurable. Let fn(x) ≡∫

f(x−y)µn(dy), n ≥ 1. Assuming that fn(·) is well defined andBorel measurable, show that

(i) f(·) continuous at x0 and bounded ⇒ fn(x0) → f(x0).(ii) f(·) uniformly continuous and bounded on R ⇒ fn → f

uniformly on R.(iii) f ∈ Lp(R,B(R), m), 0 < p < ∞, m(·) = Lebesgue measure

⇒∫|fn − f |pdm → 0.

(iv) Show that (iii) ⇒ (c).

3.31 (Gram-Schmidt procedure). Let B ≡ bn : n ∈ N be a collection ofnonzero vector in H. Set

e1 =b1

‖b1‖e2 = b2 − 〈b2, e1〉e1,

e2 =e2

‖e2‖(provided ‖e2‖ = 0), and so on.

If ‖en‖ = 0 for some n ∈ N, then delete bn. Let E ≡ ej : 1 ≤ j < k,k ≤ ∞, be the collection of vectors reached this way.

(a) Show that E is an orthonormal system.

(b) Let HB denote the closed linear subspace generated by B, i.e.,

HB ≡

x : x ∈ H, there exists xn of the formn∑

j=1

ajbj ,

aj ∈ R, such that ‖xn − x‖ → 0

.

Show that HB is a Hilbert space and E is a basis for HB .

3.32 Let H = L2(R, B(R), µ), where µ is a probability measure. LetB ≡ 1, x, x2, . . .. Assume that

∫|x|kdµ < ∞ for all k ∈ N. Ap-

ply the Gram-Schmidt procedure in Problem 3.31 to the set B forthe following cases and evaluate e1, e2, e3.

(a) µ = Uniform [0, 1] distribution.

(b) µ = standard normal distribution.

(c) µ = Exponential (1) distribution.

The orthonormal basis E obtained this way is called Orthogonal Poly-nomials w.r.t. the given measure. (See Szego (1939).)

Page 123: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

108 3. Lp-Spaces

3.33 Let B ⊂ H be an orthonormal system. Show that for any x in H,b : 〈x, b〉 = 0 is at most countable.

(Hint: Show first that if yα : α ∈ I is a collection of nonnegativereal numbers such that for some C < ∞,

∑α∈F yα ≤ C for all F ⊂ I,

F finite, then α : yα > 0 is at most countable and apply this to theBessel inequality.)

3.34 Let B ⊂ H. Define B⊥ ≡ x : x ∈ H, 〈x, b〉 = 0, for all b ∈ B.Show that B⊥ is a closed subspace of H.

3.35 Let B ⊂ H be a closed subspace of H.

(a) Using the fact that every Hilbert space admits an orthonormalbasis, show that every x in H can be uniquely decomposed as

x = y + z (4.2)

where y ∈ B and z ∈ B⊥ and ‖x‖2 = ‖y‖2 + ‖z‖2.(b) Let PB : H → B be defined by PBx = y where x admits the

decomposition in (4.2) above. Verify that PB is a bounded linearoperator from H to B and is of norm 1 if B has at least onenonzero vector. (The operator PB is called the projection ontoB.)

(c) Verify that PB(PBx) = PBx for all x in H.

3.36 Let H be separable and xnn≥1 ⊂ H be such that ‖xn‖n≥1 isbounded by some C < ∞. Show that there exist a subsequencexnjj≥1 ⊂ xnn≥1 and an x0 in H, such that for every y in H,

〈xnj , y〉 → 〈x0, y〉.

(Hint: Fix an orthonormal basis B ≡ bnn≥1 ⊂ H. Let ani =〈xn, bi〉, n ≥ 1, i ≥ 1. Using

∑∞i=1 a2

ni ≤ C for all n and the Bolzano-Wierstrass property, show that

(a) there exists njj≥1 such that limj→∞

anji = ai exists for all i ≥ 1,∑∞i=1 a2

i < ∞,(b) lim

n→∞∑n

i=1 aibi ≡ x0 exists in H,

(c) 〈xnj, y〉 → 〈x0, y〉 for all y in H.)

3.37 Let (V, 〈·, ·〉) be an innerproduct space. Verify that 〈·, ·〉 is bilinear,i.e., for α1, α2, β1, β2 ∈ R, x1, x2, y1, y2 in V ,

〈α1x1 + α2x2, β1y1 + β2y2〉 = α1β1〈x1, y1〉+ α1β2〈x1, y2〉+ α2β1〈x2, y1〉+ α2β2〈x2, y2〉.

State and prove an extension to more than two vectors.

Page 124: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.4 Problems 109

3.38 Let (Ω,F , µ) be a measure space. Suppose that there exists an atmost countable family A ≡ Ajj≥1 ⊂ F such that F = σ〈A〉 andµ(Aj) > 0 for each j ≥ 1. Then show that for 0 < p < ∞, Lp(Ω,F , µ)is separable.

(Hint: Show first that for any A ∈ F with µ(A) < ∞, and ε > 0,there exists a countable subcollection A1 of A such that µ(AB) < εwhere B = ∪Aj : Aj ∈ A1. Now consider the class of functions∑n

i=1 ciIAi , n ≥ 1, Ai ∈ A, ci ∈ Q.)

3.39 Show that B in Example 3.3.1 (c) is a basis for H.

(Hint: Using Theorem 2.3.14 prove that the set of all polynomialsare dense in H.)

3.40 (Haar functions). For x in R let h(x) = I[0,1/2)(x) − I[1/2,1)(x).Let h00(x) ≡ I[0,1)(x) and for k ≥ 1, 0 ≤ j < 2k−1, let hkj(x) ≡2

k−12 h(2k−1x− j), 0 ≤ x < 1.

(a) Verify that the family hkj(·), k ≥ 0, 0 ≤ j < 2k−1 is anorthonormal set in L2([0, 1],B([0, 1]), m), where m(·) is Lebesguemeasure.

(b) Verify that this family is complete by completing the followingtwo proofs:

(i) Show that for indicator function f of dyadic interval of theform

[k2n ,

2n

), k < , the following identity holds:∫

f2dm =− k

2n=∑k,j

(∫fhkjdm

)2.

Now use the fact the linear combinations of such f ’s is densein L2[0, 1].

(ii) For each f ∈ L2([0, 1],B([0, 1]), m) such that f is orthog-onal to the Haar functions, F (t) ≡

∫[0,t] fdm, 0 ≤ t ≤ 1

is continuous and satisfies F ( j2n ) = 0 for all 0 ≤ j ≤ 2n,

n = 1, 2, . . . and hence F ≡ 0 implying f = 0 a.e.

3.41 Let H be a Hilbert space over R and M be a closed subspace of H.Let v0 ∈ H. Show that

min‖v − v0‖ : v ∈ M = max〈v0, u〉, u ∈ M⊥, ‖u‖ = 1,

where M⊥ is the orthogonal complement of M , i.e., M⊥ ≡ u :〈v, u〉 = 0 for all v ∈ M.

(Hint: Use Problem 3.35 (a).)

Page 125: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

110 3. Lp-Spaces

3.42 Let B be an orthonormal set in a Hilbert space H.

(a) (i) Show that for any x in H and any finite set bi : 1 ≤ i ≤k ⊂ B, k < ∞,

k∑i=1

〈x, bi〉2 ≤ ‖x‖2.

(ii) Conclude that for all x in H, Bx ≡ b : 〈x, b〉 = 0, b ∈ Bis at most countable.

(b) Show that the following are equivalent:

(i) B is complete, i.e., x ∈ H, 〈x, b〉 = 0 for all b ∈ B ⇒ x = 0.(ii) For all x ∈ B, there exists an at most countable set Bx ≡

bi : i ≥ 1 such that ‖x‖2 =∑∞

i=1〈x, bi〉2.(iii) For all x ∈ B, ε > 0, there exists a finite set

b1, b2, . . . , bk ⊂ B such that

∥∥∥x− k∑i=1

〈x, bi〉bi

∥∥∥ < ε.

(iv) If B ⊂ B1, B1 an orthonormal set in H ⇒ B = B1.

3.43 Extend Theorem 3.3.3 to any Hilbert space assuming that the axiomof choice holds.

(Hint: Using the axiom of choice or its equivalent, the Hausdorffmaximality principle, it can be shown that every Hilbert space Hadmits an orthonormal basis B (see Rudin (1987)). Now let T be abounded linear functional from H to R. Let f(b) ≡ T (b) for b in B.Verify that

∑ki=1 |f(bi)|2 ≤ ‖T‖2 for all finite collection bi : 1 ≤

i ≤ k ⊂ B. Conclude that D ≡ b : f(b) = 0 is countable. Letx0 ≡

∑b∈D f(b)b. Now use the proof of Theorem 3.3.3 to show that

T (x) ≡ 〈x, x0〉 for all x in H.)

3.44 Let (V, ‖·‖) be a normed linear space. Let Tnn≥1 and T be boundedlinear operators from V to V . The sequence Tnn≥1 is said to con-verge

(a) weakly to T if for each w in V ∗, the dual of V , and each v in V ,

w(Tn(v)) → w(T (v)),

(b) strongly if for each v in V , ‖Tnv − Tv‖ → 0,

(c) uniformly if sup‖Tnv − Tv‖ : ‖v‖ ≤ 1 → 0.

Page 126: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

3.4 Problems 111

Let Vp = Lp(R,B(R), µL), 1 ≤ p ≤ ∞. Let hnn≥1 ⊂ R be such thathn = 0, hn → 0, as n → ∞. Let (Tnf)(·) ≡ f(· + hn), Tf(·) ≡ f(·).Verify that

(i) Tnn≥1 and T are bounded linear operators on Vp, 1 ≤ p ≤ ∞,

(ii) for 1 ≤ p < ∞, Tnn≥1 converges to T weakly,

(iii) for 1 ≤ p < ∞, Tn converges to T strongly,

(iv) for 1 ≤ p < ∞, Tn does not converge to T uniformly byshowing that for all n, ‖Tn − T‖ = 1,

(v) for p = ∞, show that Tn does not converge weakly to T .

Page 127: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4Differentiation

4.1 The Lebesgue-Radon-Nikodym theorem

Definition 4.1.1: Let (Ω,F) be a measurable space and let µ and ν betwo measures on (Ω,F). The measure µ is said to be dominated by ν orabsolutely continuous w.r.t. ν and written as µ ν if

ν(A) = 0 ⇒ µ(A) = 0 for all A ∈ F . (1.1)

Example 4.1.1: Let m be the Legesgue measure on (R,B(R)) and let µbe the standard normal distribution, i.e.,

µ(A) ≡∫

A

1√2π

e−x2/2m(dx), A ∈ B(R).

Then m(A) = 0 ⇒ µ(A) = 0 and hence µ m.

Example 4.1.2: Let Z+ ≡ 0, 1, 2, . . . denote the set of all nonnegativeintegers. Let ν be the counting measure on Ω = Z+ and µ be the Poisson (λ)distribution for 0 < λ < ∞, i.e.,

ν(A) = number of elements in A

and

µ(A) =∑j∈A

e−λλj

j!

for all A ∈ P(Ω), the power set of Ω. Since ν(A) = 0 ⇔ A = ∅ ⇔ µ(A) = 0,it follows that µ ν and ν µ.

Page 128: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

114 4. Differentiation

Example 4.1.3: Let f be a nonnegative measurable function on a measurespace (Ω,F , ν). Let

µ(A) ≡∫

A

fdν for all A ∈ F . (1.2)

Then, µ is a measure on (Ω,F) and ν(A) = 0 ⇒ µ(A) = 0 for all A ∈ Fand hence µ ν.

The Radon-Nikodym theorem is a sort of converse to Example 4.1.3. Itsays that if µ and ν are σ-finite measures (see Section 1.2) on a measurablespace (Ω,F) and if µ ν, then there is a nonnegative measurable functionf on (Ω,F) such that (1.2) holds.

Definition 4.1.2: Let (Ω,F) be a measurable space and let µ and ν betwo measures on (Ω,F). Then, µ is called singular w.r.t. ν and written asµ ⊥ ν if there exists a set B ∈ F such that

µ(B) = 0 and ν(Bc) = 0. (1.3)

Note that µ is singular w.r.t. ν implies that ν is singular w.r.t. µ. Thus,the notion of singularity between two measures µ and ν is symmetric butthat of absolutely continuity is not. Note also that if µ and ν are mutuallysingular and B satisfies (1.3), then for all A ∈ F ,

µ(A) = µ(A ∩Bc) and ν(A) = ν(A ∩B). (1.4)

Example 4.1.4: Let µ be the Lebesgue measure on (R,B(R)) and ν bedefined as ν(A) = # elements in A∩Z where Z is the set of integers. Then

ν(Zc) = 0 and µ(Z) = 0

and hence (1.3) holds with B = Z. Thus µ and ν are mutually singular.

Another example is the pair m and µc on [0,1] where µc is the Lebesgue-Stieltjes measure generated by the Cantor function (cf. Section 4.5) and mis the Lebesgue measure.

Example 4.1.5: Let µ be the Lebesgue measure restricted to (−∞, 0] andν be the Exponential(1) distribution. That is, for any A ∈ B(R),

µ(A) = the Lebesgue measure of A ∩ (−∞, 0];

ν(A) =∫

A∩(0,∞)e−xdx.

Then, µ((0,∞)) = 0 and ν((−∞, 0]) = 0, and (1.3) holds with B =(−∞, 0].

Page 129: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.1 The Lebesgue-Radon-Nikodym theorem 115

Suppose that µ and ν are two finite measures on a measurable space(Ω,F). H. Lebesgue showed that µ1 can be decomposed as a sum of twomeasures, i.e.,

µ = µa + µs

where µa ν and µs ⊥ ν. The next theorem is the main result of thissection and it combines the above decomposition result of Lebesgue withthe Radon-Nikodym theorem mentioned earlier.

Theorem 4.1.1: Let (Ω,F) be a measurable space and let µ1 and µ2 betwo σ-finite measures on (Ω,F).

(i) (The Lebesgue decomposition theorem). The measure µ1 can beuniquely decomposed as

µ1 = µ1a + µ1s (1.5)

where µ1a and µ1s are σ-finite measures on (Ω,F) such that µ1a µ2and µ1s ⊥ µ2.

(ii) (The Radon-Nikodym theorem). There exists a nonnegative measur-able function h on (Ω,F) such that

µ1a(A) =∫

A

hdµ2 for all A ∈ F . (1.6)

Proof: Case 1: Suppose that µ1 and µ2 are finite measures. Let µ bethe measure µ = µ1 + µ2 and let H = L2(µ). Define a linear function T onH by

T (f) =∫

fdµ1. (1.7)

Then, by the Cauchy-Schwarz inequality applied to the functions f andg ≡ 1,

|T (f)| ≤(∫

f2dµ1

) 12 (

µ1(Ω)) 1

2

≤(∫

f2dµ

) 12 (

µ1(Ω)) 1

2.

This shows that T is a bounded linear functional on H with ‖T‖ ≤ M ≡(µ1(Ω))

12 . By the Riesz representation theorem (cf. Theorem 3.3.3 and

Remark 3.3.2), there exists a g ∈ L2(µ) such that

T (f) =∫

fgdµ (1.8)

Page 130: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

116 4. Differentiation

for all f ∈ L2(µ). Let f = IA for A in F . Then, (1.7) and (1.8) yield

µ1(A) = T (IA) =∫

A

gdµ.

But 0 ≤ µ1(A) ≤ µ(A) for all A ∈ F . Hence the function g in L2(µ) satisfies

0 ≤∫

A

gdµ ≤ µ(A) for all A ∈ F . (1.9)

Let A1 = 0 ≤ g < 1, A2 = g = 1, A3 = g ∈ [0, 1]. Then (1.9)implies that µ(A3) = 0 (see Problem 4.1). Now define the measures µ1a(·)and µ1s(·) by

µ1a(A) ≡ µ1(A ∩A1), µ1s(A) ≡ µ1(A ∩A2), A ∈ F . (1.10)

Next it will be shown that µ1a µ2 and µ1s ⊥ µ2, thus establishing (1.5).By (1.7) and (1.8), for all f ∈ H,∫

fdµ1 =∫

fgdµ =∫

fgdµ1 +∫

fgdµ2

⇒∫

f(1− g)dµ1 =∫

fgdµ2. (1.11)

Setting f = IA2 yields0 = µ2(A2).

From (1.10), since µ1s(Ac2) = 0, it follows that µ1s ⊥ µ2. Now fix n ≥ 1

and A ∈ F . Let f = IA∩A1(1 + g + . . . + gn−1). Then (1.11) implies that∫A∩A1

(1− gn)dµ1 =∫

A∩A1

g(1 + g + . . . + gn−1)dµ2.

Now letting n →∞, and using the MCT on both sides, yields

µ1a(A) =∫

A

IA1

g

(1− g)dµ2. (1.12)

Setting h ≡ g1−g IA1 completes the proof of (1.5) and (1.6).

Case 2: Now suppose that µ1 and µ2 are σ-finite. Then there exists acountable partition Dn≥1 ⊂ F of Ω such that µ1(Dn) and µ2(Dn) areboth finite for all n ≥ 1. Let µ

(n)1 (·) ≡ µ1(· ∩Dn) and µ

(n)2 ≡ µ2(· ∩Dn).

Then applying ‘Case 1’ to µ(n)1 and µ

(n)2 for each n ≥ 1, one gets measures

µ(n)1a , µ

(n)1s and a function hn such that

µ(n)1 (·) ≡ µ

(n)1a (·) + µ

(n)1s (·) (1.13)

Page 131: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.1 The Lebesgue-Radon-Nikodym theorem 117

where, for A in F , µ(n)1a (A) =

∫A

hndµ(n)2 =

∫A

hnIDndµ2 and µ

(n)1s (·) ⊥ µ

(n)2 .

Since µ1(·) =∑∞

n=1 µ(n)1 (·), it follows from (1.13) that

µ1(·) = µ1a(·) + µ1s(·), (1.14)

where µ1a(A) ≡∑∞

n=1 µ(n)1a (A) and µ1s(·) =

∑∞n=1 µ

(n)1s (·). By the MCT,

µ1a(A) =∫

A

hdµ2, A ∈ F ,

where h ≡∑∞

n=1 hnIDn .Clearly, µ1a µ2. The verification of the singularity of µ1s and µ2 is

left as an exercise (Problem 4.2).It remains to prove the uniqueness of the decomposition. Let µ1 = µa +

µs and µ1 = µ′a + µ′

s be two decompositions of µ1 where µa and µ′a are

absolutely continuous w.r.t. µ2 and µs and µ′s are singular w.r.t. µ2. By

definition, there exist sets B and B′ in F such that

µ2(B) = 0, µ2(B′) = 0, and µs(Bc) = 0, µ′s(B

′c) = 0.

Let D = B ∪ B′. Then µ2(D) = 0 and µs(Dc) ≤ µs(Bc) = 0. Similarly,µ′

s(Dc) ≤ µ′

s(B′c) = 0. Also µ2(D) = 0 implies µa(D) = 0 = µ′

a(D). Thusfor any A ∈ F ,

µa(A) = µa(A ∩Dc) and µ′a(A) = µ′

a(A ∩Dc).

Also

µs(A ∩Dc) ≤ µs(A ∩Bc) = 0µ′

s(A ∩Dc) ≤ µ′s(A ∩B′c) = 0.

Thus, µ(A ∩Dc) = µa(A ∩Dc) + µs(A ∩Dc) = µa(A ∩Dc) = µa(A) andµ(A ∩ Dc) = µ′

a(A ∩ Dc) + µ′s(A ∩ Dc) = µ′

a(A ∩ Dc) = µ′a(A). Hence,

µa(A) = µ(A∩Dc) = µ′a(A) for every A ∈ F . That is, µa = µ′

a and hence,µs = µ′

s.

Remark 4.1.1: In Theorem 4.1.1, the hypothesis of σ finiteness cannot bedropped. For example, let µ be the Lebesgue measure and ν be the countingmeasure on [0, 1]. Then µ ν but there does not exist a nonnegative F-measurable function h such that µ(A) =

∫A

hdν. To see this, if possible,suppose that for some h ∈ L1(ν), µ(A) =

∫A

hdν for all A ∈ F . Note thatµ([0, 1]) = 1 implies that

∫[0,1] hdν < ∞ and hence, that B ≡ x : h(x) > 0

is countable (Problem 4.3). But µ being the Lebesgue measure, µ(B) = 0and µ(Bc) = 1. Since by definition, h ≡ 0 on Bc, this implies 1 = µ(Bc) =∫

Bc hdν = 0, leading to a contradiction. However, if ν is σ-finite and µ ν(µ not necessarily σ-finite), then the Radon-Nikodym theorem holds, i.e.,there exists a nonnegative F-measurable function h such that

µ(A) =∫

A

hdν for all A ∈ F .

Page 132: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

118 4. Differentiation

For a proof, see Royden (1988), Chapter 11.

Definition 4.1.3: Let µ and ν be measures on a measurable space (Ω,F)and let h be a nonnegative measurable function such that

µ(A) =∫

A

hdν for all A ∈ F .

Then h is called the Radon-Nikodym derivative of µ w.r.t. ν and is writtenas

dν= h.

If µ(Ω) < ∞ and there exist two nonnegative F-measurable functions h1and h2 such that

µ(A) =∫

A

h1dν =∫

A

h2dν

for all A ∈ F , then h1 = h2 a.e. (ν) and thus the Radon-Nikodym derivativedµdν is unique up to equivalence a.e. (ν). This also extends to the case whenµ is σ-finite.

The following proposition is easy to verify (cf. Problem 4.4).

Proposition 4.1.2: Let ν, µ, µ1, µ2, . . . be σ-finite measures on a measur-able space (Ω,F).

(i) If µ1 µ2 and µ2 µ3, then µ1 µ3 and

dµ1

dµ3=

dµ1

dµ2

dµ2

dµ3a.e. (µ3).

(ii) Suppose that µ1 and µ2 are dominated by µ3. Then for any α, β ≥ 0,αµ1 + βµ2 is dominated by µ3 and

d(αµ1 + βµ2)dµ3

= αdµ1

dµ3+ β

dµ2

dµ3a.e. (µ3).

(iii) If µ ν and dµdν > 0 a.e. (ν), then ν µ and

dµ=(

)−1

a.e. (µ).

(iv) Let µnn≥1 be a sequence of measures and αnn≥1 be a sequenceof positive real numbers, i.e., αn > 0 for all n ≥ 1. Define µ =∑∞

n=1 αnµn.

Page 133: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.2 Signed measures 119

(a) Then, µ ν iff µn ν for each n ≥ 1 and in this case,

dν=

∞∑n=1

αndµn

dνa.e. (ν).

(b) µ ⊥ ν iff µn ⊥ ν for all n ≥ 1.

4.2 Signed measures

Let µ1 and µ2 be two finite measures on a measurable space (Ω,F). Let

ν(A) ≡ µ1(A)− µ2(A), for all A ∈ F . (2.1)

Then ν : F → R satisfies the following:

(i) ν(∅) = 0.

(ii) For any Ann≥1 ⊂ F with Ai ∩ Aj = ∅ for i = j, and with∑∞i=1 |ν(Ai)| < ∞,

ν(A) =∞∑

i=1

ν(Ai). (2.2)

(iii) Let

‖ν‖ ≡ sup ∞∑

i=1

|ν(Ai)| : Ann≥1 ⊂ F , Ai ∩Aj = ∅ for

i = j,⋃n≥1

An = Ω

. (2.3)

Then, ‖ν‖ is finite.

Note that (iii) holds because ‖ν‖ ≤ µ1(Ω) + µ2(Ω) < ∞.

Definition 4.2.1: A set function ν : F → R satisfying (i), (ii), and (iii)above is called a finite signed measure.

The above example shows that the difference of two finite measures isa finite signed measure. It will be shown below that every finite signedmeasure can be expressed as the difference of two finite measures.

Proposition 4.2.1: Let ν be a finite signed measure on (Ω,F). Let

|ν|(A) ≡ sup ∞∑

n=1

|ν(An)| : Ann≥1 ⊂ F , Ai ∩Aj = ∅ for i = j,

⋃n≥1

An = A

. (2.4)

Page 134: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

120 4. Differentiation

Then |ν|(·) is a finite measure on (Ω,F).

Proof: That |ν|(Ω) < ∞ follows from part (iii) of the definition. Thus it isenough to verify that |ν| is countably additive. Let Ann≥1 be a countablefamily of disjoint sets in F . Let A =

⋃n≥1 An. By the definition of |ν|, for

all ε > 0 and n ∈ N, there exists a countable family Anjj≥1 of disjointsets in F with An =

⋃j≥1 Anj such that

∑∞j=1 |ν(Anj)| > |ν|(An) − ε

2n .Hence,

∞∑n=1

∞∑j=1

|ν(Anj)| >∞∑

n=1

|ν|(An)− ε.

Note that Anjn≥1,j≥1 is a countable family of disjoint sets in F such thatA =

⋃n≥1 An =

⋃n≥1

⋃j≥1 Anj . It follows from the definition of |ν| that

|ν|(A) ≥∞∑

n=1

∞∑j=1

|ν(Anj)| >∞∑

n=1

|ν|(An)− ε.

Since this is true for for all ε > 0, it follows that

|ν|(A) ≥∞∑

n=1

|ν|(An). (2.5)

To get the opposite inequality, let Bjj≥1 be a countable family of disjointsets in F such that

⋃j≥1 Bj = A =

⋃n≥1 An. Since Bj = Bj ∩ A =⋃

n≥1(Bj ∩An) and ν satisfies (2.2),

ν(Bj) =∞∑

n=1

ν(Bj ∩An) for all j ≥ 1.

Thus∞∑

j=1

|ν(Bj)| ≤∞∑

j=1

∞∑n=1

|ν(Bj ∩An)|

=∞∑

n=1

∞∑j=1

|ν(Bj ∩An)|. (2.6)

Note that for each An, Bj ∩ Anj≥1 is a countable family of disjointsets in F such that An =

⋃j≥1(Bj ∩ An). Hence from (2.4), it fol-

lows that |ν|(An) ≥∑∞

j=1 |ν(Bj ∩ An)| and hence,∑∞

n=1 |ν|(An) ≥∑∞n=1

∑∞j=1 |ν(Bj ∩ An)|. From (2.6), it follows that

∑∞n=1 |ν|(An) ≥∑∞

j=1 |ν(Bj)|. This being true for every such family Bjj≥1, it followsfrom (2.4) that

|ν|(A) ≤∞∑

i=1

|ν|(Ai) (2.7)

Page 135: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.2 Signed measures 121

and with (2.5), this completes the proof.

Definition 4.2.2: The measure |ν| defined by (2.4) is called the totalvariation measure of the signed measure ν.

Next, define the set functions

ν+ ≡ |ν|+ ν

2, ν− ≡ |ν| − ν

2. (2.8)

It can be verified that ν+ and ν− are both finite measures on (Ω,F).

Definition 4.2.3: The measures ν+ and ν− are called the positive andnegative variation measures of the signed measure ν, respectively.

It follows from (2.8) that

ν = ν+ − ν−. (2.9)

Thus every finite signed measure ν on (Ω,F) is the difference of two finitemeasures, as claimed earlier.

Note that both ν+ and ν− are dominated by |ν| and all three measuresare finite. By the Radon-Nikodym theorem (Theorem 4.1.1), there existfunctions h1 and h2 in L1(Ω,F , |ν|) such that

dν+

d|ν| = h1 anddν−

d|ν| = h2. (2.10)

This and (2.9) imply that for any A in F ,

ν(A) =∫

A

h1d|ν| −∫

A

h2d|ν| =∫

A

hd|ν|, (2.11)

where h = h1 − h2. Thus every finite signed measure ν on (Ω,F) can beexpressed as

ν(A) =∫

A

fdµ, A ∈ F (2.12)

for some finite measure µ on (Ω,F) and some f ∈ L1(Ω,F , µ).Conversely, it is easy to verify that a set function ν defined by (2.12) for

some finite measure µ on (Ω,F) and some f ∈ L1(Ω,F , µ) is a finite signedmeasure (cf. Problem 4.6). This leads to the following:

Theorem 4.2.2:

(i) A set function ν on a measurable space (Ω,F) is a finite signed mea-sure iff there exist two finite measures µ1 and µ2 on (Ω,F) such thatν = µ1 − µ2.

Page 136: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

122 4. Differentiation

(ii) A set function ν on a measurable space (Ω,F) is a finite signedmeasure iff there exist a finite measure µ on (Ω,F) and an f ∈L1(Ω,F , µ) such that for all A in F ,

ν(A) =∫

A

f dµ.

Definition 4.2.4: Let ν be a finite signed measure on a measurable spaceon (Ω,F). A set A ∈ F is called a positive set for ν if for any B ⊂ A, B ∈ F ,ν(B) ≥ 0. A set A ∈ F is called a negative set for ν if for any B ⊂ A withB ∈ F , ν(B) ≤ 0.

Let h be as in (2.11). Let

Ω+ = ω : h(ω) ≥ 0 and Ω− = ω : h(ω) < 0. (2.13)

From (2.11), it follows that for all A in F , ν(A∩Ω+) ≥ 0 and ν(A∩Ω−) ≤ 0.Thus Ω+ is a positive set and Ω− is a negative set for ν. Furthermore,Ω+ ∪Ω− = Ω and Ω+ ∩Ω− = ∅. Summarizing this discussion, one gets thefollowing theorem.

Theorem 4.2.3: (Hahn decomposition theorem). Let ν be a finite signedmeasure on a measurable space (Ω,F). Then there exist a positive set Ω+

and a negative set Ω− for ν such that Ω = Ω+ ∪ Ω− and Ω+ ∩ Ω− = ∅.

Let Ω+ and Ω− be as in (2.13). It can be verified (Problem 4.8) that forany B ∈ F , if B ⊂ Ω+, then ν(B) = |ν|(B). By (2.11), this implies thatfor all A in F , ∫

A∩Ω+hd|ν| = |ν|(A ∩ Ω+).

It follows that h = 1 a.e. (|ν|) on Ω+. Similarly, h = −1 a.e. (|ν|) on Ω−.Thus, the measures ν+ and ν−, defined in (2.8), reduce to

ν+(A) =∫

A

(1 + h)2

d|ν|

=∫

A∩Ω+

(1 + h)2

d|ν|+∫

A∩Ω−

(1 + h)d

|ν|

= |ν|(A ∩ Ω+),

and similarlyν−(A) = |ν|(A ∩ Ω−).

Note that ν+ and ν− are both finite measures that are mutually singular.This particular decomposition of ν as

ν = ν+ − ν−

Page 137: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.2 Signed measures 123

is known as the Jordan decomposition of ν. It will now be shown thatthis decomposition is minimal and that it is unique in the class of signedmeasures with mutually singular components. Suppose there exist two finitemeasures µ1 and µ2 on (Ω,F) such that ν = µ1 − µ2. For any A ∈ F ,ν+(A) = ν(A∩Ω+) = µ1(A∩Ω+)−µ2(A∩Ω+) ≤ µ1(A∩Ω+) ≤ µ1(A) andν−(A) = −ν(A∩Ω−) = µ2(A∩Ω−)− µ1(A∩Ω−) ≤ µ2(A∩Ω−) ≤ µ2(A).Thus, ν+ ≤ µ1 and ν− ≤ µ2. Clearly, since both µ1 and ν+ are finitemeasures on (Ω,F), µ1 − ν+ is also a finite measure. Similarly, µ2 − ν− isalso a finite measure. Also, since µ1 − µ2 = ν = ν+ − ν−, it follows thatµ1−ν+ = µ2−ν− = λ, say. Thus, for any decomposition of ν as µ1−µ2 withµ1, µ2 finite measures, it holds that µ1 = ν+ + λ and µ2 = ν− + λ, whereλ is a measure on (Ω,F). Thus, ν = ν+ − ν− is a minimal decompositionin the sense that in this case λ = 0. Now suppose µ1 and µ2 are mutuallysingular, i.e., there exist Ω1,Ω2 ∈ F such that Ω1 ∩ Ω2 = ∅, Ω1 ∪ Ω2 = Ω,and µ1(Ω2) = 0 = µ2(Ω1). Since µ1 ≥ λ and µ2 ≥ λ, it follows thatλ(Ω2) = 0 = λ(Ω1). Thus λ = 0 and µ1 = ν+ and µ2 = ν−.

Summarizing the above discussion yields:

Theorem 4.2.4: Let ν be a finite signed measure on a measurable space(Ω,F) and let µ1 and µ2 be two finite measures on (Ω,F) such that ν =µ1 − µ2. Then there exists a finite measure λ such that µ1 = ν+ + λ andµ2 = ν− + λ with λ = 0 iff µ1 and µ2 are mutually singular.

Let

S ≡ ν : ν is a finite signed measure on (Ω,F).

Also, for any α ∈ R, let α+ = max(α, 0) and α− = max(−α, 0). Note thatfor ν1, ν2 in S and α1, α2 ∈ R,

α1ν1 + α2ν2 = (α+1 − α−

1 )(ν+1 − ν−

1 ) + (α+2 − α−

2 )(ν+2 − ν−

2 )= (α+

1 ν+1 + α−

1 ν−1 + α+

2 ν+2 + α−

2 ν−2 )

− (α+1 ν−

1 + α−1 ν+

1 + α+2 ν−

2 + α−2 ν+

2 )= λ1 − λ2, say,

where λ1 and λ2 are both finite measures. It now follows from Theorem4.2.2 that α1ν1 + α2ν2 ∈ S. Thus, S is a vector space over R.

Now it will be shown that ‖ν‖ ≡ |ν|(Ω) is a norm on S and that (S, ‖ · ‖)is a Banach space.

Definition 4.2.5: For a finite signed measure ν on a measurable space(Ω,F), the total variation norm ν is defined by ‖ν‖ ≡ |ν|(Ω).

Proposition 4.2.5: Let S ≡ ν : ν a finite signed measure on (Ω,F).Then, ‖ν‖ ≡ |ν|(Ω), ν ∈ S is a norm on S.

Page 138: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

124 4. Differentiation

Proof: Let ν1, ν2 ∈ S, α1, α2 ∈ R and λ = α1ν1 + α2ν2. For any A ∈ Fand any Aii≥1 ⊂ F with A =

⋃i≥1 Ai,

|λ(Ai)| ≤ |α1||ν1(Ai)|+ |α2||ν2(Ai)| for all i ≥ 1

⇒∑i≥1

|λ(Ai)| ≤ |α1|∑i≥1

|ν1(Ai)|+ |α2|∑i≥1

|ν2(Ai)|

≤ |α1||ν1|(A) + |α2||ν2|(A).

Taking supremum over all Aii≥1 yields,

|λ|(A) ≤ |α1||ν1|(A) + |α2||ν2|(·)i.e., |λ|(·) ≤ |α1| |ν1|(·) + |α2||ν2|(·),

⇒ ‖λ‖ ≡ |λ|(Ω) ≤ |α1| |ν1|(Ω) + |α2| |ν2|(Ω)

= |α1|‖ν1‖+ |α2|‖ν2‖.

Taking α1 = α2 = 1 yields

‖ν1 + ν2‖ ≤ ‖ν1‖+ ‖ν2‖,

i.e., the triangle inequality holds.Next taking α2 = 0 yields ‖α1ν1‖ ≤ |α1|‖ν1‖. To get the opposite in-

equality, note that for α1 = 0, ν1 = 1α1

α1ν1 and so ‖ν1‖ ≤ | 1α1|‖α1ν1‖.

Hence, |α1| ‖ν1‖ ≤ ‖α1ν1‖. Thus, for any α1 = 0, ‖α1ν‖ = |α1|‖ν‖. Fi-nally, ‖ν‖ = 0 ⇒ |ν|(Ω) = 0 ⇒ |ν|(A) = 0 for all A ∈ F ⇒ ν(A) =0 for all A ∈ F , i.e., ν is the zero measure. Thus ‖ · ‖ is a norm on S.

Proposition 4.2.6: (S, ‖ · ‖) is complete.

Proof: Let νnn≥1 be a Cauchy sequence in (S, ‖ · ‖). Note that for eachA ∈ F , |νn(A) − νm(A)| ≤ |νn − νm|(A) ≤ ‖νn − νm‖. Hence, for eachA ∈ F , νn(A)n≥1 is a Cauchy sequence in R and hence

ν(A) ≡ limn→∞ νn(A) exists.

It will be shown that ν(·) is a finite signed measure and ‖νn − ν‖ → 0 asn → ∞. Let Aii≥1 ⊂ F , Ai ∩ Aj = ∅ for i = j, and A =

⋃i≥1 Ai. Let

xn ≡ νn(Ai)i≥1, n ≥ 1, and let x0 ≡ ν(Ai)i≥1. Note that each xn ∈ 1where 1 ≡ x : x = xii≥1 ∈ R,

∑i≥1 |xi| < ∞. For x ∈ 1, let ‖x‖1 =∑∞

i=1 |xi|. Then ‖xn − xm‖ =∑

i≥1 |νn(Ai) − νm(Ai)| ≤ |νn − νm|(A) ≤|νn − νm|(Ω) → 0 as n, m → ∞. But 1 is complete under ‖ · ‖1. So thereexists x∗ ∈ 1 such that ‖xn−x∗‖1 → 0. Since xni ≡ νn(Ai) → ν(Ai) for alli ≥ 1, it follows that x∗

i = ν(Ai) for all i ≥ 1 and that∑∞

i=1 |ν(Ai)| < ∞.Also, for all n ≥ 1, νn(A) =

∑∞i=1 νn(Ai). Since

∑i≥1 |νn(Ai)−ν(Ai)| → 0,

νn(A) ≡∑

i≥1 νn(Ai) →∑

i≥1 ν(Ai) as n →∞. But νn(A) → ν(A). Thus,

Page 139: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.3 Functions of bounded variation 125

ν(A) =∑∞

i=1 ν(Ai). Also for any countable partition Aii≥1 ⊂ F of Ω,

∞∑i=1

|ν(Ai)| = limn→∞

∞∑i=1

|νn(Ai)| ≤ limn→∞ ‖νn‖ < ∞.

Thus |ν|(Ω) < ∞ and hence, ν ∈ S. Finally,

‖νn − ν‖ = sup ∞∑

i=1

|νn(Ai)− ν(Ai)| : Aii≥1 ⊂ F

is a disjoint partition of Ω

.

But for every countable partition Aii≥1 ⊂ F ,

∞∑i=1

|νn(Ai)− ν(Ai)| = limm→∞

∞∑i=1

|νn(Ai)− νm(Ai)| ≤ limm→∞ ‖νn − νm‖.

Thus, ‖νn − ν‖ ≤ limm→∞ ‖νn − νm‖ and hence, limn→∞ ‖νn − ν‖ ≤limn→∞ limm→∞ ‖νn − νm‖ = 0. Hence, νn → ν in S.

Definition 4.2.6: (Integration w.r.t. signed measures). Let µ be a finitesigned measure on a measurable space (Ω,F) and |µ| be its total variationmeasure as in Definition 4.2.2. Then, for any f ∈ L1(Ω,F , |µ|),

∫fdµ is

defined as ∫fdµ =

∫fdµ+ −

∫fdµ− ,

where µ+ and µ− are the positive and negative variations of µ as definedin (2.8).

Proposition 4.2.7: Let µ be a signed measure on a measurable space(Ω,F , P ). Let µ = λ1 − λ2 where λ1 and λ2 are finite measures. Letf ∈ L1(Ω,F , λ1 + λ2). Then f ∈ L1(Ω,F , |µ|) and∫

fdµ =∫

fdλ1 −∫

fdλ2. (2.14)

Proof: Left as an exercise (Problem 4.13).

4.3 Functions of bounded variation

From the construction of the Lebesgue-Stieltjes measures on (R,B(R)) dis-cussed in Chapter 1, it is seen that to every nondecreasing right continuousfunction F : R → R, there is a (Radon) measure µF on (R,B(R)) such that

Page 140: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

126 4. Differentiation

µF ((a, b]) = F (b)− F (a) for all a < b and conversely. If µ1 and µ2 are twoRadon measures and µ = µ1 − µ2, let

G(x) ≡

⎧⎨⎩

µ((0, x]) for x > 0,−µ((x, 0]) for x < 0,0 for x = 0.

=

⎧⎨⎩

F1(x)− F2(x)− (F1(0)− F2(0)) for x > 0,(F1(0)− F2(0))− (F1(x)− F2(x)) for x < 0,0 for x = 0.

Thus to every finite signed measure µ on (R,B(R)), there corresponds afunction G(·) that is the difference of two right continuous nondecreasingand bounded functions. The converse is also easy to establish. A character-ization of such a function G(·) without any reference to measures is givenbelow.

Definition 4.3.1: Let f : [a, b] → R, where −∞ < a < b < ∞. Then forany partition Q = a = x0 < x1 < x2 < . . . < xn = b, n ∈ N, the positive,negative and total variations of f with respect to Q are respectively definedas

P (f, Q) ≡n∑

i=1

(f(xi)− f(xi−1))+

N(f, Q) ≡n∑

i=1

(f(xi)− f(xi−1))−

T (f, Q) ≡n∑

i=1

|f(xi)− f(xi−1)|.

It is easy to verify that (i) if f is nondecreasing, then

P (f, Q) = T (f, Q) = f(b)− f(a) and N(f, Q) = 0

and that (ii) for any f ,

P (f, Q) + N(f, Q) = T (f, Q).

Definition 4.3.2: Let f = [a, b] → R, where −∞ < a < b < ∞. The pos-itive, negative and total variations of f over [a, b] are respectively definedas

P (f, [a, b]) ≡ supQ

P (f, Q)

N(f, [a, b]) ≡ supQ

N(f, Q)

T (f, [a, b]) ≡ supQ

T (f, Q),

Page 141: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.3 Functions of bounded variation 127

where the supremum in each case is taken over all finite partitions Q of[a, b].

Definition 4.3.3: Let f : [a, b] → R, where −∞ < a < b < ∞. Then, fis said to be of bounded variation on [a, b] if T (f, [a, b]) < ∞. The set of allsuch functions is denoted by BV [a, b].

As remarked earlier, if f is nondecreasing, then T (f, Q) = f(b)−f(a) foreach Q and hence T (f, [a, b]) = f(b)− f(a). It follows that if f = f1 − f2,where both f1 and f2 are nondecreasing, then f ∈ BV [a, b]. A naturalquestion is whether the converse is true. The answer is yes, as shown bythe following result.

Theorem 4.3.1: Let f ∈ BV [a, b]. Let f1(x) ≡ P (f, [a, x]) and f2(x) ≡N(f, [a, x]). Then f1 and f2 are nondecreasing in [a, b] and for all a ≤ x ≤b,

f(x) = f1(x)− f2(x)

Proof: That f1 and f2 are nondecreasing follows from the definition. It isenough to verify that if f ∈ BV [a, b], then

f(b)− f(a) = P (f, [a, b])−N(f, [a, b]),

as this can be applied to [a, x] for a ≤ x < b. For each finite partition Q of[a, b],

f(b)− f(a) =n∑

i=1

(f(xi)− f(xi−1))

= P (f, Q)−N(f, Q).

Thus P (f, Q) = f(b)− f(a) + N(f, Q). By taking supremum over all finitepartitions Q, it follows that

P (f, [a, b]) = f(b)− f(a) + N(f, [a, b]).

If f ∈ BV [a, b], this yields f(b)− f(a) = P (f, [a, b])−N(f, [a, b]).

Remark 4.3.1: Since T (f, Q) = P (f, Q) + N(f, Q) = 2P (f, Q)− (f(b)−f(a)), it follows that if f ∈ BV [a, b], then

T (f, [a, b]) = 2P (f, [a, b])− (f(b)− f(a))= P (f, [a, b]) + N(f, [a, b]).

Corollary 4.3.2: A function f ∈ BV [a, b] iff there exists a finite signedmeasure µ on (R,B(R)) such that f(x) = µ([a, x]), a ≤ x ≤ b.

The proof of this corollary is left as an exercise.

Page 142: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

128 4. Differentiation

Remark 4.3.2: Some observations on functions of bounded variations arelisted below.

(a) Let f = IQ where Q is the set of rationals. Then for any [a, b], a <b, P (f, [a, b]) = N(f, [a, b]) = ∞ and so f /∈ BV [a, b]. This holds forf = ID for any set D such that both D and Dc are dense in R.

(b) Let f be Lipschitz on [a, b]. That is, |f(x)− f(y)| ≤ K|x− y| for allx, y in [a, b] where K ∈ (0,∞) is a constant. Then, f ∈ BV [a, b].

(c) Let f be differentiable in (a, b) and continuous on [a, b] and f ′(·) bebounded in (a, b). Then by the mean value theorem, f is Lipschitzand hence, f is in BV [a, b].

(d) Let f(x) = x2 sin 1x , 0 < x ≤ 1, and let f(0) = 0. Then f is contin-

uous on [0, 1], differentiable on (0, 1) with f ′ bounded on (0, 1), andhence f ∈ BV [0, 1].

(e) Let g(x) = x2 sin 1x2 , 0 < x ≤ 1, g(0) = 0. Then g is continuous

on [0, 1], differentiable on (0, 1) but g′ is not bounded on (0, 1). Thisby itself does not imply that g /∈ BV [0, 1], since being Lipschitz isonly a sufficient condition. But it turns out that g /∈ BV [0, 1]. To see

this, let xn =√

1(2n+1) π

2, n = 0, 1, 2 . . .. Then

n∑i=1

|g(xi)− g(xi−1)| ≥n∑

i=1

1(2i+1) π

2and hence T (g, [0, 1]) = ∞.

(f) It is known (see Royden (1988), Chapter 4) that if f : [a, b] → Ris nondecreasing, then f is differentiable a.e. (m) on (a, b) and∫[a,b] f

′dm ≤ f(b) − f(c), where (m) denotes the Lebesgue measure.This implies that if f ∈ BV [a, b], then f is differentiable a.e. (m) on(a, b) and so,

∫[a,b] |f

′|dm ≤ T (f, [a, b]).

4.4 Absolutely continuous functions on R

Definition 4.4.1: A function F : R → R is absolutely continuous (a.c.)if for all ε > 0, there exists δ > 0 such that if Ij = [aj , bj ], j = 1, 2, . . . , k

(k ∈ N) are disjoint and∑k

j=1(bj−aj) < δ, then∑k

j=1 |F (bj)−F (aj)| < ε.

By the mean value theorem, it follows that if F is differentiable and F ′(·)is bounded, then F is a.c. Also note that F is a.c. implies F is uniformlycontinuous.

Page 143: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.4 Absolutely continuous functions on R 129

Definition 4.4.2: A function F : [a, b] → R is absolutely continuous if thefunction F , defined by

F (x) =

⎧⎨⎩

F (x) if a ≤ x ≤ b,F (a) if x < a,F (b) if x > b,

is absolutely continuous.

Thus F (x) = x is a.c. on R. Any polynomial is a.c. on any boundedinterval but not necessarily on all of R. For example, F (x) = x2 is a.c. onany bounded interval but not a.c. on R, since it is not uniformly continuouson R.

The main result of this section is the following result due to H. Lebesgue,known as the fundamental theorem of Lebesgue integral calculus.

Theorem 4.4.1: A function F : [a, b] → R is absolutely continuous iffthere is a function f : [a, b] → R such that f is Lebesgue measurable andintegrable w.r.t. m and such that

F (x) = F (a) +∫

[a,x]f dm, for all a ≤ x ≤ b (4.1)

where m is the Lebesgue measure.

Proof: First consider the “if part.” Suppose that (4.1) holds. Since∫[a,b] |f |dm < ∞, for any ε > 0, there exists a δ > 0 such that (cf. Proposi-

tion 2.5.8).

m(A) < δ ⇒∫

A

|f |dm < ε. (4.2)

Thus, if Ij = (aj , bj),⊂ [a, b], j = 1, 2, . . . , k are such that∑k

j=1(bj−aj) <δ, then

k∑j=1

|F (bj)− F (aj)| ≤∫

⋃kj=1 Ij

|f |dm < ε,

since m(⋃k

j=1 Ij) ≤∑k

j=1(bj − aj) < δ and (4.2) holds. Thus, F is a.c.Next consider the “only if part.” It is not difficult to verify (Problem

4.18) that F a.c. implies F is of bounded variation on any finite interval[a, b] and both the positive and the negative variations of F on [a, b] area.c. as well. Hence, it suffices to establish (4.1) assuming that F is a.c. andnondecreasing. Let µF be the Lebesgue-Stieltjes measure generated by Fas in Definition 4.4.2. It will now be shown that µF is absolutely continuousw.r.t. the Lebesgue measure. Fix ε > 0. Let δ > 0 be chosen so that

(aj , bj) ⊂ [a, b], j = 1, 2, . . . , k,k∑

j=1

(bj − aj) < δ ⇒k∑

j=1

|F (bj)− F (aj)| < ε.

Page 144: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

130 4. Differentiation

Let A ∈ Mm, A ⊂ (a, b), and m(A) = 0. Then, there exist a countablecollection of disjoint open intervals Ij = (aj , bj) : Ij ⊂ [a, b]j≥1 such that

A ⊂⋃j≥1

Ij and∑j≥1

(bj − aj) < δ.

Thus

µF

(A ∩

k⋃j=1

Ij

)≤ µF

( k⋃i=1

Ij

)

≤k∑

j=1

µF (Ij) =k∑

j=1

(F (bj)− F (aj)) < ε

for all k ∈ N.Since A ⊂

⋃j≥1 Ij , by the m.c.f.b. property of µF (·), µF (A) =

limk→∞ µF (A ∩⋃k

j=1 Ij) ≤ ε. This being true for any ε > 0, it fol-lows that µF (A) = 0. Since F is continuous, µF (a, b) = 0 and henceµF

((a, b)c

)= 0. Thus, µF m, i.e., µF is dominated by m. Now,

by the Radon-Nikodym theorem (cf. Theorem 4.1.1 (ii)), there existsa nonnegative measurable function f such that A ∈ Mm implies thatµF (A) =

∫A∩[a,b] f dm and, in particular, for a ≤ x ≤ b,

µF ([a, x]) = F (x)− F (a) =∫

[a,x]f dm,

i.e., (4.1) holds.

The representation (4.1) of an absolutely continuous F can be strength-ened as follows:

Theorem 4.4.2: Let F : R → R satisfy (4.1). Then F is differentiablea.e. (m) and F ′(·) = f(·) a.e. (m).

For a proof of this result, see Royden (1988), Chapter 4.

The relation between the notion of absolute continuity of a distributionfunction F : R → R and that of the associated Lebesgue-Stieltjes measureµF w.r.t. Lebesgue measure m will be discussed now.

Let F : R → R be a distribution function, i.e., F is nondecreasing andright continuous. Let µF be the associated Lebesgue-Stieltjes measure suchthat µF ((a, b]) = F (b) − F (a) for all −∞ < a < b < ∞. Recall that Fis said to be absolutely continuous on an interval [a, b] if for each ε > 0,there exists a δ > 0 such that for any finite collection of intervals Ij =(aj , bj), j = 1, 2, . . . , n, contained in [a, b],

n∑j=1

(bj − aj) < δ ⇒n∑

j=1

(F (bj)− F (aj)) < ε.

Page 145: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.4 Absolutely continuous functions on R 131

Recall also that µF is absolutely continuous w.r.t. the Lebesgue measure mif for A ∈ B(R), m(A) = 0 ⇒ µF (A) = 0. A natural question is that ifF is absolutely continuous on every interval [a, b] ⊂ R, is µF absolutelycontinuous w.r.t. m and conversely? The answer is yes.

Theorem 4.4.3: Let F : R → R be a nondecreasing function and let µF bethe associated Lebesgue-Stieltjes measure. Then F is absolutely continuouson [a, b] for all −∞ < a < b < ∞ iff µF m where m is the Lebesguemeasure on (R,B(R)).

Proof: Suppose that µF m. Then by Theorem 4.1.1, there exists anonnegative measurable function h such that

µF (A) =∫

A

hdm for all A in B(R).

Hence, for any a < b in R and a < x < b,

F (x)− F (a) ≡ µF ((a, x]) =∫

(a,x]hdm.

This implies the absolute continuity of F on [a, b] as shown in Theorem4.4.1.

Conversely, if F is absolutely continuous on [a, b] for all −∞ < a < b <∞, then as shown in the proof of the “only if” part of Theorem 4.4.1, forall −∞ < a < b < ∞, then µF (A ∩ [a, b]) = 0 if m(A ∩ [a, b]) = 0. Thus,if m(A) = 0, then for all −∞ < a < b < ∞, m(A ∩ [a, b]) = 0 and henceµF (A ∩ [a, b]) = 0 and hence µF (A) = 0, i.e., µF m.

Recall that a measure µ on (Rk,B(Rk)) is a Radon measure if µ(A) <∞ for every bounded Borel set A. In the following, let m(·) denote theLebesgue measure on Rk.

Definition 4.4.3: A Radon measure µ on (Rk,B(Rk)) is differentiable atx ∈ Rk with derivative (Dµ)(x) if for any ε > 0, there is a δ > 0 such that

∣∣∣ µ(A)m(A)

− (Dµ)(x)∣∣∣ < ε

for every open ball A such that x ∈ A and diam. (A) ≡ sup‖x−y‖ : x, y ∈A, the diameter of A, is less than δ.

Theorem 4.4.4: Let µ be a Radon measure on (Rk,B(Rk)). Then

(i) µ is differentiable a.e. (m), Dµ(·) is Lebesgue measurable, and ≥ 0a.e. (m) and for all bounded Borel sets A ∈ B(Rk),∫

A

Dµ(·)dm ≤ µ(A).

Page 146: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

132 4. Differentiation

(ii) Let µa(A) ≡∫

ADµ(·)dm, A ∈ B(Rk). Let µs(·) be the unique measure

on B(Rk) such that for all bounded Borel sets A

µs(A) = µ(A)− µa(A).

Thenµs ⊥ m and Dµs(·) = 0 a.e. (m).

For a proof, see Rudin (1987).

Remark 4.4.1: By the uniqueness of the Lebesgue decomposition, itfollows that a Radon measure µ on B(Rk) is ⊥ m iff Dµ(·) = 0 a.e. (m)and is m iff µ(A) =

∫A

Dµ(·)dm for all A ∈ B(Rk).

Let f : Rk → R+ be integrable w.r.t. m on bounded sets. Let µ(A) ≡∫A

fdm for A ∈ B(Rk). Then µ(·) is a Radon measure and that is mand hence by Theorem 4.4.4

Dµ(x) = f(x) for almost all x(m).

That is, for almost all x(m), for each ε > 0, there is a δ > 0 such that∣∣∣ 1m(A)

∫A

fdm− f(x)∣∣∣ < ε

for all open balls A such that x ∈ A and diam. (A) < δ.It turns out that a stronger result holds.

Theorem 4.4.5: For almost all x(m), for each ε > 0, there is a δ > 0such that

1m(A)

∫A

|f − f(x)|dm < ε

for all open balls A such that x ∈ A and diam. (A) < δ (see Problems 4.23,4.24).

Theorem 4.4.6: (Change of variables formula in Rk, k > 1). Let V bean open set in Rk. Let T ≡ (T1, T2, . . . , Tk) be a map from Rk → Rk suchthat for each i, Ti : Rk → R and ∂Ti(·)

∂xjexists on V for all 1 ≤ i, j ≤ k.

Suppose that the Jacobian JT (·) ≡ det((∂Ti(·)

∂xj

))is continuous and positive

on V . Suppose further that T (V ) is a bounded open set W in Rk and thatT is (1− 1) and T−1 is continuous. Then

(i) For all Borel set E ⊂ V , T (E) is a Borel set ⊂ W .

(ii) ν(·) ≡ m(T (·)) is a measure on B(W ) and ν m with

dν(·)dm

= JT (·).

Page 147: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.5 Singular distributions 133

(iii) For any h ∈ L1(W, m)∫W

hdm =∫

V

h(T (·)

)JT (·)dm.

(iv) λ(·) ≡ mT−1(·) is a measure on B(W ) and λ m with

dm= |J

(T−1(·)

)|−1.

(v) For any µ m on B(V ) the measure ψ(·) ≡ µT−1(·) is dominatedby m with

dm(·) =

( dµ

dm

)(T−1(·)

)(JT

(T−1(·)

))−1on W.

For a proof see Rudin (1987), Chapter 7.

4.5 Singular distributions

4.5.1 Decomposition of a cdfRecall that a cumulative distribution function (cdf) on R is a functionF : R → [0, 1] such that it is nondecreasing, right continuous, F (−∞) = 0,F (∞) = 1. In Section 2.2, it was shown that any cdf F on R can be writtenas

F = αFd + (1− α)Fc, (5.1)

where Fd and Fc are discrete and continuous cdfs respectively. In this sec-tion, the cdf Fc will be further decomposed into a singular continuous andabsolutely continuous cdfs.

Definition 4.5.1: A cdf F is singular if F ′ ≡ 0 almost everywhere w.r.t.the Lebesgue measure on R.

Example 4.5.1: The cdfs of Binomial, Poisson, or any integer valued ran-dom variables are singular.

It is known (cf. Royden (1988), Chapter 5) that a monotone functionF : R → R is differentiable almost everywhere w.r.t. the Lebesgue measureand its derivative F ′ satisfies∫ b

a

F ′(x)dx ≤ F (b)− F (a), (5.2)

for any −∞ < a < b < ∞.

Page 148: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

134 4. Differentiation

For x ∈ R, let Fac(x) ≡∫ x

−∞ F ′c(t)dt and Fsc(x) ≡ Fc(x) − Fac(x). If

β ≡∫∞

−∞ F ′c(t)dt = Fac(∞) = 0, then F ′

c(t) = 0 a.e. and so Fc is singularcontinuous. If β = 1, then Fc = Fac and so, Fc is absolutely continuous. If0 < α < 1 and 0 < β < 1, then F can be written as

F = αFd + βFac + γFsc, (5.3)

where β = (1−α)β, γ = (1−α)(1− β), Fac = β−1Fac, Fsc = (1− β)−1Fsc,and Fd is as in (5.1). Note that Fd, Fac, Fsc are all cdfs and α, β, γ arenonnegative numbers adding up to 1. Summarizing the above discussions,one has the following:

Proposition 4.5.1: Given any cdf F , there exist nonnegative constants α,β, γ and cdfs Fd, Fac, Fsc satisfying (a) α + β + γ = 1, and (b) Fd isdiscrete, Fac is absolutely continuous, Fsc is singular continuous, such thatthe decomposition (5.3) holds.

It can be shown that the constants α, β, and γ are uniquely determined,and that when 0 < α < 1, the decomposition (5.1) is unique, and that when0 < α, β, γ < 1, the decomposition (5.3) is unique. The decomposition(5.3) also has a probabilistic interpretation. Any random variable X canbe realized as a randomized choice over three random variables Xd, Xac,and Xsc having cdfs Fd, Fac, and Fsc, respectively, and with correspondingrandomization probabilities α, β, and γ. For more details see Problem 6.15in Chapter 6.

4.5.2 Cantor ternary setRecall the construction of the Cantor set from Section 1.3.

Let I0 = [0, 1] denote the unit interval. If one deletes the open middlethird of I0, then one gets two disjoint closed intervals I11 =

[0, 1

3

]and

I12 =[ 23 , 1]. Proceeding similarly with the closed intervals I11 and I12,

one gets four disjoint intervals I21 =[0, 1

9

], I22 =

[ 29 , 1

3

], I23 =

[ 23 , 7

9

],

I24 =[89 , 1], and so on. Thus, at each step, deleting the open middle third

of the closed intervals constructed in the previous step, one is left with 2n

disjoint closed intervals each of length 3−n after n steps. Let Cn =⋃2n

j=1 Inj

and C =⋂∞

n=1 Cn. By construction Cn+1 ⊂ Cn for each n and Cn’s areclosed sets. With m(·) denoting Lebesgue measure, one has m(C0) = 1 andm(Cn) = 2n3−n =

( 23

)n.

Definition 4.5.2: The set C ≡⋂∞

n=1 Cn is called the Cantor ternary setor simply the Cantor set.

Since m(C0) = 1, by m.c.f.a. m(C) = limn→∞ m(Cn) = limn→∞( 2

3

)n =0. Thus, the Cantor set C has zero Lebesgue measure. Next, let U1 =U11 =

( 13 , 2

3

)be the deleted interval at the first stage, U2 = U21 ∪ U22 =

Page 149: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.5 Singular distributions 135

( 19 , 2

9

)∪( 7

9 , 89

)be the union of the deleted intervals at the second stage, and

similarly, Un =⋃2n−1

j=1 Unj at stage n. Thus Cc = U =⋃∞

n=1

(⋃2n−1

j=1 Unj

)is open and m(Cc) = 1. Since C ∪ Cc = [0, 1] and Cc is open, it followsthat C is nonempty. In fact, C is uncountably infinite as will be shownnow. To do this, one needs the concept of p-nary expansion of numbers in[0,1]. Fix a positive integer p > 1. For each x in [0,1), let a1(x) = pxwhere t = n if n ≤ t < n + 1. Thus a1(x) ≤ px < a1(x) + 1 anda1(x) ∈ 0, 1, . . . , p− 1, i.e., a1(x)

p ≤ x < a1(x)p + 1

p . Thus, if kp ≤ x < k+1

p

for some k = 0, 1, 2, . . . , p− 1, then a1(x) = k. Next, let x1 ≡ x− a1(x)p and

a2(x) = p2x1. Then, x1 ∈[0, 1

p

)and

a2(x)p2 ≤ x1 = x− a1(x)

p<

a2(x)p2 +

1p2

and a2(x) ∈ 0, 1, 2, p− 1. Next, let 0 ≤ x2 ≡ x− a1(x)p − a2(x)

p2 < 1p2 and

a3(x) = p3x2 and so on. After k such iterations one gets

0 ≤ x−k∑

i=1

ai(x)pi

<1pk

where ai(x) ∈ 0, 1, 2, . . . , p−1 for all i. Since 1pk → 0 as k →∞, one gets

the p-nary expansion of x in [0,1) as

x =∞∑

i=1

ai(x)pi

, ai(x) ∈ 0, 1, 2, . . . , p− 1. (5.4)

Notice that if x = kp for some k ∈ 0, 1, 2, . . . , p − 1, then in the above

expansion a1(x) = k and ai(x) = 0 for i ≥ 2, and the expansion terminates.But since

∑∞i=1

(p−1)pi = 1, one may also write x = k

p = k−1p +

∑∞i=1

(p−1)pi ,

this being an expansion which is nonterminating and recurring. It can beshown that for all x in [0,1) of the form x =

pm for some positive integers and m, there are exactly two expansions such that one terminates and theother is nonterminating and recurring. For all other x in [0,1), the p-naryexpansion is nonterminating and nonrecurring and is unique.

The decimal expansion corresponds with the case p = 10, the binaryexpansion with the case p = 2, and the ternary expansion with the casep = 3. Here the convention of choosing only a nonterminating expansionfor each x in [0,1) is used. Thus, for example, for p = 3, 1

3 will be replacedby∑∞

i=223i so that a1

( 13

)= 0, ai

( 13

)= 2 for i ≥ 2. Similarly, for p = 3,

x = 79 = 2

3 + 19 will be replaced by 2

3 + 032 +

∑∞i=3

23i so that a1

( 79

)= 2,

a2( 7

9

)= 0, ai

( 79

)= 2 for i ≥ 3. By taking p = 2, i.e., the binary expansion,

it is seen that every x in [0,1) can be uniquely represented as x =∑∞

i=1δi(x)2i

Page 150: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

136 4. Differentiation

where δi(x) ∈ 0, 1 for all i. Thus, the interval [0,1) is in one-to-onecorrespondence with the set of all sequences of 0’s and 1’s.

It is not difficult to prove the following (Problem 4.31).

Theorem 4.5.2: A number x belongs to the Cantor set C iff in (5.4), forall i ≥ 1, ai(x) is either 0 or 2.

Corollary 4.5.3: The Cantor set C is in one-to-one correspondence withthe set of all sequences of 0’s and 1’s and hence is in one-to-one correspon-dence with the unit interval [0,1].

Remark 4.5.1: Thus the Cantor ternary set C is a closed subset of [0,1],its Lebesgue measure m(C) = 0, and its cardinality is the same as that of[0,1]. Further, it is nowhere dense, i.e., its complement U is dense in thesense that for every open interval (a, b) ⊂ [0, 1], U ∩ (a, b) is nonempty.It is also possible to get a Cantor like set Cα with (Lebesgue) measure α,0 < α < 1, by following the above iterative procedure of deleting at eachstage intervals of length that is a fraction (1−α)

3 of the full interval (Problem4.32).

4.5.3 Cantor ternary functionThe Cantor ternary function F : [0, 1] → [0, 1] is defined as follows: Forn ≥ 1, let Unj : j = 1, . . . , 2n−1 denote the set of “deleted” intervals atstep n in the definition of the Cantor set C. Define F on Cc = U by

F (x) =12

on U11 =(1

3,23

)and

=14

on U21 =(1

9,29

)=

34

on U22 =(7

9,89

)and so on. It can be checked that F is uniformly continuous on U and hasa continuous extension to I0 = [0, 1]. The extension of the function F (alsodenoted by F ) maps [0,1] onto [0,1] and is continuous and nondecreasing.Further, on U , it is differentiable with derivative F ′ ≡ 0. Since m(U c) =0, F is a singular cdf (cf. see Definition 4.5.1). It can be shown that if∑∞

i=1ai(x)

3i is the ternary expansion of x ∈ (0, 1), then

F (x) =N(x)−1∑

i=1

ai(x)2i+1 +

12N(x) (5.5)

Where N(x) = infi : i ≥ 1, ai(x) = 1. For example, x = 13 =

∑∞i=2

23i ⇒

N(x) = ∞, a1(x) = 0, ai(x) = 2 for i ≥ 2 ⇒ F (x) =∑∞

i=212i = 1

2 whilex = 4

9 = 13 + 0

32 +∑∞

i=323i ⇒ N(x) = 1 ⇒ F (x) = 1

2 .

Page 151: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.6 Problems 137

It can be shown that if δnn≥1 is a sequence of independent 0, 1 valuedrandom variables with P (δ1 = 0) = 1

2 = P (δ1 = 1), then the above F isthe cdf of the random variable X =

∑∞i=1

2δi

3i which lies in the Cantor setw.p. 1.

4.6 Problems

4.1 Show that in the proof of Theorem 4.1.1, µ(A3) = 0 where A3 = g ∈[0, 1] and g satisfies (1.9).

(Hint: Apply (1.9) separately to A31 = g > 1 and A32 = g < 0.)

4.2 Verify that µ1s, defined in (1.14) and µ2 are singular.

(Hint: For each n, by Case 1, there exists a gn in L2(Dn,Fn, µ(n))where µ(n) = µ

(n)1 + µ

(n)2 and Fn ≡ A∩Dn : A ∈ F, such that 0 ≤∫

Agndµ(n) ≤ µ(n)(A) for all A in Fn. Let A1n = w : w ∈ Dn, 0 ≤

gn(w) < 1, A2n = w : w ∈ Dn, gn(w) = 1, and A2 =⋃

n≥1 A2n.Show that µ2(A2) = 0 and µ1s(A) =

∑∞n=1 µ1n(A ∩ A2n) and hence

µ1s(Ac2) = 0.)

4.3 Let ν be the counting measure on [0, 1] and∫[0,1] hdν < ∞ for some

nonnegative function h. Show that B = x : h(x) > 0 is countable.

(Hint: Let Bn = x : h(x) > 1n. Show that Bn is a finite set for

each n ∈ N.)

4.4 Prove Proposition 4.1.2.

4.5 Find the Lebesgue decomposition of µ w.r.t. ν and the Radon-Nikodym derivative dµa

dν in the following cases where µa is the ab-solutely continuous component of µ w.r.t. ν.

(a) µ = N(0, 1), ν = Exponential(1)

(b) µ = Exponential(1), ν = N(0, 1)

(c) µ = µ1 + µ2, where µ1 = N(0, 1), µ2 = Poisson(1) and ν =Cauchy(0, 1).

(d) µ = µ1 + µ2, ν = Geometric(p), 0 < p < 1, where µ1 = N(0, 1)and µ2 = Poisson(1).

(e) µ = µ1 + µ2, ν = ν1 + ν2 where µ1 = N(0, 1), µ2 = Poisson(1),ν1 = Cauchy(0, 1) and ν2 = Geometric(p), 0 < p < 1.

(f) µ = Binomial (10, 1/2), ν = Poisson (1).

The measures referred to above are defined in Tables 4.6.1 and 4.6.2,given at the end of this section.

Page 152: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

138 4. Differentiation

4.6 Let (Ω,F , µ) be a measure space and f ∈ L1(Ω,F , µ). Let

νf (A) ≡∫

A

fdµ for all A ∈ F .

(a) Show that νf is a finite signed measure.

(b) Show that ‖ν‖ =∫Ω |f |dµ and for A ∈ F , ν+

f (A) =∫

Af+dµ,

ν−f (A) =

∫A

fdµ, and |νf |(A) =∫

A|f |(dµ).

4.7 (a) Let µ1 and µ2 be two finite measures such that both are dom-inated by a σ-finite measure ν. Show that the total varia-tion measure of the signed measure µ ≡ µ1 − µ2 is given by|µ|(A) =

∫A|h1 − h2|dν where for i = 1, 2, hi = dµi

dν .

(b) Conclude that if µ1 and µ2 are two measures on a countable setΩ ≡ ωii≥1 with F ≡ P(Ω), then |µ|(A) =

∑i∈A |µ1(ωi) −

µ2(ωi)|.(c) Show that if µn is the Binomial (n, pn) measure and µ is the

Poisson (λ) measure, 0 < λ < ∞, then as n →∞, |µn−µ|(·) → 0uniformly on P(Z+) iff npn → λ.

(Hint: Show that for each i ∈ Z+ ≡ 0, 1, 2, . . ., µn(i) →µ(i) and use Scheffe’s theorem.)

4.8 Let ν be a finite signed measure on a measurable space (Ω,F) andlet |ν| be the total variation measure corresponding to ν. Show thatfor any B ∈ Ω+, B ⊂ F ,

|ν|(B) = ν(B),

where Ω+ is as defined in (2.13).

(Hint: For any set A ⊂ Ω+,

ν(A) =∫

A

hd|ν| =∫

A∩Ω+hd|ν| ≥ 0.)

4.9 Show that the Banach space S of finite signed measures on (N,P(N))is isomorphic to 1, the Banach space of absolutely convergent se-quences xnn≥1 in R.

4.10 Let µ1 and µ2 be two probability measures on (Ω,F).

(a) Show that

‖µ1 − µ2‖ = 2 sup|µ1(A)− µ2(A)| : A ∈ F.

(Hint: For any A ∈ F , A, Ac is a partition of Ω and so ‖µ1−µ2‖ ≥ |µ1(A)− µ2(A)|+ |µ1(Ac)− µ2(Ac)| = 2|µ1(A)− µ2(A)|,

Page 153: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.6 Problems 139

since µ1 and µ2 are probability measures. For the opposite in-equality, use the Hahn decomposition of Ω w.r.t. µ1 − µ2 andthe fact ‖µ1 − µ2‖ = |(µ1 − µ2)(Ω+)|+ |(µ1 − µ2)(Ω−)|.)

(b) Show that ‖µ1 − µ2‖ is also equal to

sup∣∣∣ ∫ fdµ1 −

∫fdµ2

∣∣∣ : f ∈ B(Ω, R)

where B(Ω, R) is the collection of all F-measurable functionsfrom Ω to R such that sup|f(ω)| : ω ∈ Ω ≤ 1.

4.11 Let (Ω,F) be a measurable space.

(a) Let µnn≥1 be a sequence of finite measures on (Ω,F). Showthat there exists a probability measure λ such that µn λ.

(Hint: Consider λ(·) =∑∞

n=112n

µn(·)µn(Ω) .)

(b) Extend (a) to the case where µnn≥1 are σ-finite.

(c) Conclude that for any sequence νnn≥1 of finite signed mea-sures on (Ω,F), there exists a probability measure λ such that|νn| λ for all n ≥ 1.

4.12 Let µnn≥1 be a sequence of finite measures on a measurable space(Ω,F). Show that there exists a finite measure µ on (Ω,F) such that‖µn − µ‖ → 0 iff there is a finite measure λ dominating µ and µn,n ≥ 1 such that the Radon-Nikodym derivatives fn ≡ dµn

dλ → f ≡ dµdλ

in measure on (Ω,F , λ) and µn(Ω) → µ(Ω).

4.13 (a) Let µ1 and µ2 be two finite measures on (Ω,F). Let µ1 = µ1a +µ1s be the Lebesgue-Radon-Nikodym decomposition of µ1 w.r.t.µ2 as in Theorem 4.1.1. Show that if µ = µ1 − µ2, then for allA ∈ F ,

|µ|(A) =∫

A

|h− 1|dµ2 + µ1s(A) where h =dµ1a

dµ2

is the Radon-Nikodym derivative of µ1a w.r.t. µ2. Conclude thatif µ1 ⊥ µ2, then |µ|(·) = µ1(·) + µ2(·) and if µ1 µ2, then|µ|(A) =

∫A

∣∣dµ1dµ2

− 1∣∣dµ2.

(b) Compute |µ|(·), ‖µ‖ if µ = µ1 − µ2 for the following cases

(i) µ1 = N(0, 1), µ2 = N(1, 1)(ii) µ1 = Cauchy (0,1), µ2 = N(0, 1)(iii) µ1 = N(0, 1), µ2 = Poisson (λ).

(c) Establish Proposition 4.2.7.

Page 154: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

140 4. Differentiation

4.14 Give another proof of the completeness of (S, ‖ · ‖)) by verifying thefollowing steps.

(a) For any sequence νnn≥1 in S, there is a finite measure λ andfnn≥1 ⊂ L1(Ω,F , λ) such that

νn(A) =∫

A

fndλ for all A ∈ F , for all n ≥ 1.

(b) νnn≥1 Cauchy in S is the same as fnn≥1 Cauchy inL1(Ω,F , λ) and hence, the completeness of (S, ‖·‖)) follows fromthe completeness of L1(Ω,F , λ).

4.15 Let f, g ∈ BV [a, b].

(a) Show that P (f + g; [a, b]) ≤ P (f ; [a, b]) + P (g; [a, b]) and thatthe same is true for N(·; ·) and T (·; ·).

(b) Show that for any c ∈ R,

P (cf ; [a, b]) = |c|P (f ; [a, b])

and do the same for N(·; ·) and T (·; ·).(c) For any a < c < b, P (f ; [a, b]) = P (f ; [a, c]) + P (f ; [c, b]).

4.16 Let fnn≥1 ⊂ BV [a, b] and let limn fn(x) = f(x) for all x in[a, b]. Show that P (f ; [a, b]) ≤ limn→∞ P (fn; [a, b]) and do the samefor N(·; ·) and T (·; ·).

4.17 Let f ∈ BV [a, b]. Show that f is continuous except on an at mostcountable set.

4.18 Let F : [a, b] → R be a.c. Show that it is of bounded variation.

(Hint: By the definition of a.c., for ε = 1, there is a δ1 > 0 suchthat

∑kj=1 |aj − bj | < δ1 ⇒

∑kj=1 |F (aj)− F (bj)| < 1. Let M be an

integer > b−aδ + 1. Show that T (F, [a, b]) ≤ M .)

4.19 Let F be an absolutely continuous nondecreasing function on R. LetµF be the Lebesgue-Stieltjes measure corresponding to F . Show thatfor any h ∈ L1(R,MµF

, µF ),∫R

hdµF =∫

hfdm

where f is a nonnegative measurable function such that F (b)−F (a) =∫[a,b] fdm for any a < b.

Page 155: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.6 Problems 141

4.20 Let F : [a, b] → R be absolutely continuous with F ′(·) > 0 a.e. on[a, b], where −∞ < a < b < ∞. Let F (a) = c and F (b) = d. Let m(·)denote the Lebesgue measure on R. Show the following:

(a) (Change of variables formula). For any g : [c, d] → R andLebesgue measurable and integrable w.r.t. m∫

[c,d]gdm =

∫[a,b]

g(F )F ′dm.

(b) For any Borel set E ⊂ [a, b], F (E) is also a Borel set.(c) ν(·) ≡ m

(F (·)

)is a measure on B([a, b]) and ν m with

dm(·) = F ′(·).

(d) λ(·) ≡ mF−1(·) is a measure on B([c, d]) and λ m with

dm(·) =

(F ′(F−1(·)

))−1.

(e) For any measure µ m on B([a, b]) the measure ψ(·) ≡ µF−1(·)is dominated by m with

dm=

dm

(F−1(·)

)(F ′(F−1(·)

))−1.

(f) Establish (a) assuming that g and F ′ are both continuous notingthat both integrals reduce to Riemann integrals.

(Hint:

(i) Verify (a) for g = I[a,b], c < α < β < d and approximate by stepfunctions.

(ii) Show that F is (1−1) and F−1(·) is continuous and hence Borelmeasurable.

(iii) Show that ν(·) = µF , the Lebesgue-Stieltjes measure corre-sponding to F .

(iv) Use the fact that for any c ≤ α ≤ β ≤ d,

ψ([α, β]) = µ([γ, δ]), where γ = F−1(α), δ = F−1(β),

=∫

[γ,δ]gdm, where g =

dm

=∫

[γ,δ]

g(F−1

(F (·)

))F ′(F−1

(F (·)

))F ′(·)dm

=∫

[α,β]

g(F−1(·)

)F ′(F−1(·)

)dm by (a). )

Page 156: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

142 4. Differentiation

4.21 Let F : R → R be absolutely continuous on every finite interval.

(a) Show that the f in (4.1) can be chosen independently of theinterval [a, b].

(b) Further, if f is integrable over R, then limx→−∞ F (x) ≡ F (−∞)and limx→∞ F (x) ≡ F (∞) exist and F (x) = F (−∞) +∫(−∞,x) fdµL for all x in R.

(c) Give an example where F : R → R is a.c., but f is not integrableover R.

4.22 Let F : R → R be absolutely continuous on bounded intervals. LetIj = 1 ≤ j ≤ k ≤ ∞ be a collection of disjoint intervals such that⋃k

j=1 Ij ≡ R and on each Ij , F ′(·) > 0 a.e. or F ′(·) < 0 a.e. w.r.t. m.

(a) Show that for any h ∈ L1(R, m),∫R

hdm =∫

R

h(F (·)

)|F ′(·)|dm.

(b) Show that if µ is a measure on(R,B(R)

)dominated by m then

the measure µF−1(·) is also dominated by m and

dµF−1

dm(y) =

∑xj∈D(y)

f(xj)|F ′(xj)|

where f(·) = dµdm and D(y) = xj : xj ∈ Ij , F (xj) = y.

(c) Let µ be the N(0, 1) measure on(R,B(R)

), i.e.,

dm(x) =

1√2n

e− x22 , −∞ < x < ∞.

Let F (x) = x2. Find dµF −1

dm .

4.23 Let f : R → R be integrable w.r.t. m on bounded intervals. Showthat for almost all x0 in R (w.r.t. m),

lima↑x0b↓x0

1(b− a)

∫ b

a

|f(x)− f(x0)|dx = 0.

(Hint: For each rational r, by Theorem 4.4.4

lima↑x0b↓x0

1(b− a)

∫ b

a

|f(x)− r|dx = |f(x0)− r|.

Page 157: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.6 Problems 143

a.e. (m). Let Ar denote the set of x0 for which this fails to hold. LetA =

⋃r∈Q Ar. Then m(A) = 0. For any x0 ∈ A and any ε > 0, choose

a rational r such that |f(x0)− r| < ε and now show that

lima↑x0b↓x0

1(b− a)

∫ b

a

|f(x)− f(x0)|dx < ε. )

4.24 Use the hint to the above problem to establish Theorem 4.4.5.

4.25 Let (Ω,B) be a measurable space. Let µnn≥1 and µ be σ-finitemeasures on (Ω,B). Let for each n ≥ 1, µn = µna + µns be theLebesgue decomposition of µn w.r.t. µ with µna µ and µns ⊥ µ. Letλ =

∑n≥1 µn, λa =

∑n≥1 µna, λs =

∑n≥1 µns. Show that λa µ

and λs ⊥ µ and that λ = λa + λs is the Lebesgue decomposition of λw.r.t. µ.

4.26 Let µnn≥1 be Radon measures on(Rk,B(Rk)

)and m be the

Lebesgue measure on Rk. Show that if λ =∑∞

n=1 µn is also a Radonmeasure, then

Dλ =∞∑

n=1

Dµn a.e. (m).

(Hint: Use Theorem 4.4.4 and the uniqueness of the Lebesgue de-composition.)

4.27 Let Fn, n ≥ 1 be a sequence of nondecreasing functions from R → R.Let F (x) =

∑n≥1 Fn(x) < ∞ for all x ∈ R. Show that F (·) is

nondecreasing and

F ′(·) =∑n≥1

F ′n(·) a.e. (m).

4.28 Let E be a Lebesgue measurable set in R. The metric density of Eat x is defined as

DE(x) ≡ limδ↓0

m(E ∩ (x− δ, x + δ)

)2δ

if it exists. Show that DE(·) = IE(·) a.e. m.

(Hint: Consider the measure λE(·) ≡ m(E ∩ ·) on the Lebesgueσ-algebra. Show that λE m and find λ′

E(·) (cf. Definition 4.4.2).)

4.29 Let F , G : [a, b] → R be both absolutely continuous. Show thatH = FG is also absolutely continuous on [a, b] and that∫

[a,b]FdG +

∫[a,b]

GdF = F (b)G(b)− F (a)G(a).

Page 158: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

144 4. Differentiation

4.30 Let (Ω,F , µ) be a finite measure space. Fix 1 ≤ p < ∞. Let T :Lp(µ) → R be a bounded linear functional as defined in (3.2.10) (cf.Section 3.2). Complete the following outline of a proof of Theorem3.2.3 (Riesz representation theorem).

(a) Let ν(A) ≡ T (IA), A ∈ F . Verify that ν(·) is a signed measureon (Ω,F).

(b) Verify that |ν| µ.(c) Let g ≡ dν

dµ . Show that g ∈ Lq(µ) where q = pp−1 , 1 < p < ∞

and q = ∞ if p = 1.(d) Show that T = Tg.

4.31 Prove Theorem 4.5.2 and Corollary 4.5.3.

4.32 For 0 < α < 1, construct a Cantor like set Cα as described in Remark4.5.1.

4.33 Show that the Cantor ternary function can be expressed as in (5.5).

4.34 Let (Ω,F , µ) be a σ-finite measure space.

(a) Let G be a σ-algebra ⊂ F . Let f : Ω → R+ be 〈F ,B(R)〉-measurable. Show that there exists a g : Ω → R+ that is〈G,B(R)〉-measurable and ν(A) ≡

∫A

fdµ =∫

Agdµ for all A

in G.

(Hint: Apply Theorem 4.1.1 (b) to the measures ν and µ re-stricted to G. When µ is a probability measure, g is called theconditional expectation of f given G (cf. Chapter 12).)

(b) Now suppose G = σ〈Aii≥1〉 where Aii≥1 is a partition ofΩ ⊂ F . Determine g(·) explicitly on each Ai such that 0 <µ(Ai) < ∞.

TABLE 4.6.1. Some discrete univariate distributions.

Mean µ µ(A), A ∈ B(R)

Bernoulli (p),∑1

i=0 pi(1− p)1−iIA(i)0 < p < 1Binomial (n, p),

∑ni=0

(ni

)pi(1− p)n−iIA(i)

0 < p < 1, n ∈ N

Geometric (p),∑∞

i=0 p(1− p)i−1IA(i)0 < p < 1

Poisson (λ),∑∞

i=0 e−λ λi

i! · IA(i)0 < λ < ∞

Page 159: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

4.6 Problems 145

TABLE 4.6.2. Some standard absolutely continuous distributions. Here m denotesthe Lebesgue measure on (R, B(R)).

Measure µ µ(A), A ∈ B(R)

Uniform (a, b), m(A ∩ [a, b])/(b− a)−∞ < a < b < ∞Exponential (β),

∫A

1β exp(−x/β)I(0,∞)(x)m(dx)

β ∈ (0,∞)Gamma (α, β),

∫A

1Γ(α)βα xα−1 exp(−x/β)I(0,∞)(x)m(dx)

α, β ∈ (0,∞) where Γ(a) =∫∞0 xa−1e−xdx, a ∈ (0,∞)

Beta (α, β), 1B(α,β)

∫A

xα−1(1− x)β−1I(0,1)(x)m(dx),α, β ∈ (0,∞) where B(a, b) = Γ(a)Γ(b)/Γ(a + b), a, b ∈ (0,∞)

Cauchy (γ, σ)∫

A1

πσσ2

σ2+(x−γ)2 m(dx)γ ∈ R, σ ∈ (0,∞)

Normal (γ, σ2),∫

A1√2πσ

exp(−(x− γ)2/σ2)m(dx)γ ∈ R, σ ∈ (0,∞)

Lognormal (γ, σ2),∫

A1√2πσ

e−(log x−γ)2/2σ2

x I(0,∞)(x)m(dx)γ ∈ R, σ ∈ (0,∞)

Page 160: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5Product Measures, Convolutions, andTransforms

5.1 Product spaces and product measures

Given two measure spaces (Ωi,Fi, µi), i = 1, 2, is it possible to construct ameasure µ ≡ µ1×µ2 on a σ-algebra on the product space Ω1×Ω2 such thatµ(A × B) = µ1(A)µ2(B) for A ∈ F1 and B ∈ F2? This section is devotedto studying this question.

Definition 5.1.1: Let (Ωi,Fi), i = 1, 2 be measurable spaces.

(a) Ω1 × Ω2 ≡ (ω1, ω2) : ω1 ∈ Ω1, ω2 ∈ Ω2, the set of all ordered pairs,is called the (Cartesian) product of Ω1 and Ω2.

(b) The set A1 × A2 with A1 ∈ F1, A2 ∈ F2 is called a measurable rect-angle. The collection of measurable rectangles will be denoted by C.

(c) The product σ-algebra of F1 and F2 on Ω1×Ω2, denoted by F1,×F2,is the smallest σ-algebra generated by C, i.e.,

F1 ×F2 ≡ σ〈A1 ×A2 : A1 ∈ F1, A2 ∈ F2〉.

(d) (Ω1 × Ω2,F1 ×F2) is called the product measurable space.

Starting with the definition of µ on the class C of measurable rectanglesby

µ(A1 ×A2) = µ1(A1)µ2(A2) (1.1)

Page 161: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

148 5. Product Measures, Convolutions, and Transforms

for all A1 ∈ F1, A2 ∈ F2, one can extend it to a measure on the algebraA of all finite unions of disjoint measurable rectangles in a natural way.Indeed, the extension to A is obtained simply by assigning the µ-measureof a finite union of disjoint measurable rectangles as the sum of the µ-measures of the corresponding individual measurable rectangles. Then, bythe extension theorem (cf. Theorem 1.3.3), it can be further extended toa complete measure µ on a σ-algebra containing F1 × F2, defined in (c)above. However, to evaluate µ(A) for an arbitrary A in F1 × F2 that isnot a measurable rectangle and to evaluate

∫hdµ for arbitrary measurable

functions h, some further work is needed. For the case of the product oftwo measure spaces, here an alternate approach that yields a direct way ofcomputing these quantities is presented. The extension theorem approachwill be used for the more general case of products of finitely many measurespaces in Section 5.3. The case of products of infinitely many probabilityspaces will be discussed in Chapter 6.

Definition 5.1.2:

(a) Let A ∈ F1 ×F2. Then, for any ω1 ∈ Ω1, the set

A1ω1 ≡ ω2 ∈ Ω2 : (ω1, ω2) ∈ A (1.2)

is called the ω1-section of A and for any ω2 ∈ Ω2, the set A2ω2 ≡ω1 ∈ Ω1 : (ω1, ω2) ∈ A is called the ω2-section of A.

(b) If f : (Ω1 × Ω2) → Ω3 is a 〈F1 × F2,F3〉 measurable mapping fromΩ1 ×Ω2 into some measurable space (Ω3,F3), then the ω1-section off is the function f1ω1 : Ω2 → Ω3, given by

f1ω1(ω2) = f(ω1, ω2), ω2 ∈ Ω2. (1.3)

Similarly, one may define the ω2-sections of f by f2ω2(ω1) = f(ω1, ω2),ω1 ∈ Ω1. The following result shows that the ω1-sections of a F1 × F2-measurable function is F2-measurable when considered as a function ofω2 ∈ Ω2.

Proposition 5.1.1: Let (Ω1×Ω2,F1×F2) be a product space, A ∈ F1×F2and let f : Ω1 × Ω2 → Ω3 be a 〈F1 ×F2,F3〉-measurable function.

(i) For every ω1 ∈ Ω1, A1ω1 ∈ F2 and for every ω2 ∈ Ω2, A2ω2 ∈ F1.

(ii) For every ω1 ∈ Ω1, f1ω1 is 〈F2,F3〉-measurable and for every ω2 ∈Ω2, f2ω2 is 〈F1,F3〉-measurable.

Proof: Fix ω1 ∈ Ω1. Define the function g : Ω2 → Ω1 × Ω2 by

g(ω2) = (ω1, ω2), ω2 ∈ Ω2.

Page 162: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.1 Product spaces and product measures 149

Note that for any measurable rectangle A ≡ A1 ×A2 ∈ F1 ×F2,

A1ω1 =

A2 if ω1 ∈ A1

∅ if ω1 ∈ A1

and hence g−1(A1 ×A2) ∈ F2. Since the class of all measurable rectanglesgenerates F1×F2, for fixed ω1 in Ω1, g is 〈F2,F1×F2〉-measurable. There-fore, for A in F1×F2, A1ω1 = g−1(A) ∈ F2 and for f as given, f1ω1 = f gis 〈F2,F3〉-measurable. This proves (i) and (ii) for the ω1-sections. Theproof for the ω2-sections are similar.

Now suppose µ1 and µ2 are measures on (Ω1,F1) and (Ω2,F2), respec-tively. Then, for any set A ∈ F1×F2, for all ω1 ∈ Ω1, A1ω1 ∈ F2 and henceµ2(A1ω1) is well defined. If this were an F1-measurable function, then onemight define a set function on F1 ×F2 by

µ12(A) =∫

Ω1

µ2(A1ω1)µ1(dω1). (1.4)

And similarly, reversing the order of µ1 and µ2, one might define a secondset function

µ21(A) =∫

Ω2

µ1(A2ω2)µ2(dω2), (1.5)

provided that µ1(A2ω2) is F2-measurable. Note that for the measurablerectangles A = A1 × A2, µ12(A) = µ1(A1)µ2(A2) = µ21(A) and thus bothµ12 and µ21 coincide (with the product measure µ) on the class C of allmeasurable rectangles. This implies that if the product measure µ is uniqueon F1 × F2, and µ12 and µ21 are measures on F1 × F2, then µ12, µ21 andµ must coincide on the σ-algebra F1 × F2. Then, one can evaluate µ(A)for any set A ∈ F1 × F2 using either of the relations (1.4) or (1.5). Thefollowing result makes this heuristic discussion rigorous.

Theorem 5.1.2: Let (Ωi,Fi, µi), i = 1, 2 be σ-finite measure spaces. Then,

(i) for all A ∈ F1 × F2, the functions µ2(A1ω1) and µ1(A2ω2) are F1-and F2-measurable, respectively.

(ii) The set functions µ12 and µ21, given by (1.4) and (1.5) respectively,are measures on F1 × F2, satisfying µ12(A) = µ21(A) for all A ∈F1 ×F2.

(iii) Further, µ12 = µ21 ≡ µ is σ-finite and it is the only measure satisfying

µ(A1 ×A2) = µ1(A1)µ2(A2) for all A1 ×A2 ∈ C.

Proof: First suppose that µ1 and µ2 are finite measures. Define

L =A ∈ F1 ×F2 : µ2(A1ω1) is a 〈F1,B(R)〉-measurable function

.

Page 163: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

150 5. Product Measures, Convolutions, and Transforms

For A = Ω1×Ω2, µ2(A1ω1) ≡ µ2(Ω2) for all ω1 ∈ Ω1 and hence Ω1×Ω2 ∈ L.Next, let A, B ∈ L with A ⊂ B. Then, it is easy to check that (A\B)1ω1 =A1ω1 \ B1ω1 . Since µ2 is a finite measure and A, B ∈ L, it follows thatµ2((A \ B)1ω1) = µ2(A1ω1 \ B1ω1) = µ2(A1ω1) − µ2(B1ω1) is 〈F1,B(R)〉-measurable. Thus, A \ B ∈ L. Finally, let Bnn≥1 ⊂ L be such thatBn ⊂ Bn+1 for all n ≥ 1. Then for any ω1 ∈ Ω1, (Bn)1ω1 ⊂ (Bn+1)1ω1 forall n ≥ 1. Hence by the MCT,

∞ > µ2

(( ⋃n≥1

Bn

)1ω1

)= µ2

( ⋃n≥1

(Bn)1ω1

)= lim

n→∞ µ2((Bn)1ω1

)

for all ω1 ∈ Ω1. By Proposition 2.1.5, this implies that µ2((⋃

n≥1 Bn)1ω1

)is 〈F1,B(R)〉-measurable, and hence

⋃n≥1 Bn ∈ L. Thus, L is a λ-system.

For A = A1 × A2 ∈ C, µ2(A1ω1) = µ2(A2)IA1(ω1) and hence C ⊂ L. SinceC is a π-system, by Corollary 1.1.3, it follows that L = F1 × F2. Thus,µ2(A1ω1), considered as a function of ω1, is 〈F1,B(R)〉-measurable for allA ∈ F1 ×F2, proving (i).

Next, consider part (ii). By part (i), µ12 is a well-defined set function onF1 × F2. It is easy to check that µ12 is a measure on F1 × F2 (Problem5.1). Similarly, µ21 is a well-defined measure on F1 × F2. Since µ12(A) =µ21(A) = µ1(A1)µ2(A2) for all A = A1 × A2 ∈ C and C is a π-systemgenerating F1 × F2, it follows from Theorem 1.2.4 that µ12(A) = µ21(A)for all A ∈ F1×F2. This proves (ii) for the case where µi(Ωi) < ∞, i = 1, 2.

Next, suppose that µi’s are σ-finite. Then, there exist disjoint setsBinn≥1 ⊂ Fi, such that

⋃n≥1 Bin = Ωi and µi(Bin) < ∞ for all n ≥ 1,

i = 1, 2. Define the finite measures

µin(D) = µi(D ∩Bin), D ∈ Fi,

for n ≥ 1, i = 1, 2. The arguments above with µi replaced by µin imply thatfor any A ∈ F1×F2, µ2n(A1ω1) is 〈F1,B(R)〉-measurable for all n ≥ 1. Sinceµ2 is a measure on F2, µ2(A1ω1) =

∑∞n=1 µ2n(Aω1) and hence, considered

as a function of ω1, it is 〈F1,B(R)〉-measurable for all A ∈ F1 ×F2. Thus,the set function µ12 of (1.4) is well defined in the σ-finite case as well.Similarly, µ21 of (1.5) is a well-defined set function on F1 ×F2. Let µ

(m,n)12

and µ(m,n)21 , respectively, denote the set functions defined by (1.4) and (1.5)

with µ1 replaced by µ1m and µ2 replaced by µ2n, m ≥ 1, n ≥ 1. Then, byrepeated use of the MCT,

µ12(A) =∫

Ω1

µ2(A1ω1)µ1(dω1)

=∞∑

m=1

(∫B1m

∞∑n=1

µ2(A1ω1 ∩B2n))

µ1(dω1)

=∞∑

m=1

∞∑n=1

∫B1m

µ2(A1ω1 ∩B2n)µ1(dω1)

Page 164: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.1 Product spaces and product measures 151

=∞∑

m=1

∞∑n=1

µ(m,n)12 (A), A ∈ F1 ×F2 (1.6)

and similarly,

µ21(A) =∞∑

n=1

∞∑m=1

µ(m,n)21 (A), A ∈ F1 ×F2. (1.7)

Since µ(m,n)12 and µ

(m,n)21 are (finite) measures, it is easy to check that µ12

and µ21 are measures on F1×F2. Also, by the finite case, µ(m,n)12 (A1×A2) =

µ(m,n)21 (A1 ×A2) for all n ≥ 1, m ≥ 1 and hence

µ12(A1 ×A2) = µ21(A1 ×A2) for all A1 ×A2 ∈ C.

Next note that B1m × B2n : m ≥ 1, n ≥ 1 is a partition of Ω1 × Ω2 byF1 ×F2 sets and by (1.6) and (1.7), for all m ≥ 1, n ≥ 1,

µ12(B1m ×B2n) = µ1(B1m)µ2(B2n) = µ21(B1m ×B2n) < ∞.

Hence, µ12 and µ21 are σ-finite on F1 × F2. Since µ12 and µ21 agree onC and C is a π-system generating the product σ-algebra, it follows thatµ12(A) = µ21(A) for all A ∈ F1×F2 and it is the unique measure satisfyingµ(A1 ×A2) = µ1(A1)µ2(A2) for all A1 ×A2 ∈ C. This completes the proofof the theorem.

Remark 5.1.1: In the above theorem, the σ-finiteness condition on themeasures µ1 and µ2 cannot be dropped. For example, let Ω1 = Ω2 = [0, 1],F1 = F2 = B([0, 1]), µ1 = the Lebesgue measure and µ2 = the countingmeasure. Clearly µ2 is not σ-finite since [0, 1] is uncountable. Let A be thediagonal set in the product space [0, 1]×[0, 1], i.e., A = (ω1, ω2) ∈ Ω1×Ω2 :ω1 = ω2. Then A ∈ F1 × F2, A1ω1 = ω1 and A2ω2 = ω2. Further,µ2(A1ω1) = 1 for all ω1 and µ1(A2ω2) = 0 for all ω2. Thus µ12(A) =∫Ω1

µ2(A1ω1)µ1(dω1) = 1, but µ21(A) =∫Ω2

µ1(A2ω2)µ2(dω2) = 0, andhence, µ12(A) = µ21(A).

Remark 5.1.2: Although this approach allows one to compute the productmeasure µ1 × µ2 of a set A ∈ F1 × F2, the measure space (Ω1 × Ω2,F1 ×F2, µ1 × µ2) may not be complete even if both (Ωi,Fi, µi), i = 1, 2 arecomplete (Problem 5.2). However, the approach based on the extensiontheorem yields a product measure space that is complete. See Remark 5.2.2for further discussion on the topic.

Definition 5.1.3:

(a) The unique measure µ on F1 × F2 defined in Theorem 5.1.2 (iii) iscalled the product measure and is denoted by µ1 × µ2.

Page 165: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

152 5. Product Measures, Convolutions, and Transforms

(b) The measure space (Ω1 × Ω2,F1 × F2, µ1 × µ2) is called the productmeasure space.

Formulas (1.4) and (1.5) give two different ways of evaluating µ1×µ2(A)for an A ∈ F1×F2. In the next section, integration of a measurable functionf : Ω1 ×Ω2 → R w.r.t. the product measure µ1 × µ2 is considered and theabove approach is extended to justify evaluation of the integral iteratively.

5.2 Fubini-Tonelli theorems

Let f : Ω1 × Ω2 → R be a 〈F1 ×F2,B(R)〉-measurable function. Relations(1.4) and (1.5) suggest that the integral of f w.r.t. µ1×µ2 may be evaluatedas iterated integrals, using the formulas∫

Ω1×Ω2

f(ω1, ω2)µ1 × µ2(d(ω1, ω2)) =∫

Ω2

[∫Ω1

f(ω1, ω2)µ1(dω1)]

µ2(dω2)

(2.1)and∫

Ω1×Ω2

f(ω1, ω2)µ1 × µ2(d(ω1, ω2)) =∫

Ω1

[∫Ω2

f(ω1, ω2)µ2(dω2)]

µ1(dω1).

(2.2)

Here, the left sides of both (2.1) and (2.2) are simply the integral of fon the space Ω = Ω1×Ω2 w.r.t. the measure µ = µ1×µ2. The expressionson the right sides of (2.1) and (2.2) are, however, iterated integrals, whereintegrals of sections of f are evaluated first and then the resulting sectionalintegrals are integrated again to get the final expression. Conditions forthe validity of (2.1) and (2.2) are provided by the Fubini-Tonelli theoremsstated below.

Theorem 5.2.1: (Tonelli’s theorem). Let (Ωi,Fi, µi), i = 1, 2 be σ-finitemeasure spaces and let f : Ω1 × Ω2 → R+ be a nonnegative F1 × F2-measurable function. Then

g1(ω1) ≡∫

Ω2

f(ω1, ω2)µ2(dω2) : Ω1 → R is 〈F1,B(R)〉-measurable

(2.3)and

g2(ω2) ≡∫

Ω1

f(ω1, ω2)µ1(dω1) : Ω2 → R is 〈F2,B(R)〉-measurable.

(2.4)Further ∫

Ω1×Ω2

fdµ =∫

Ω1

g1dµ1 =∫

Ω2

g2dµ2, (2.5)

Page 166: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.2 Fubini-Tonelli theorems 153

where µ = µ1 × µ2.

Proof: If f = IA for some A in F1 × F2, the result follows from Theo-rem 5.1.2. By the linearity of integrals, the result now holds for all simplenonnegative functions f . For a general nonnegative function f , there exista sequence fnn≥1 of nonnegative simple functions such that fn(ω1, ω2) ↑f(ω1, ω2) for all (ω1, ω2) ∈ Ω1×Ω2. Write g1n(ω1) =

∫Ω1

fn(ω1, ω2)µ2(dω2).Then, g1n is F1-measurable for all n ≥ 1, g1n’s are nondecreasing, and bythe MCT,

g1(ω1) ≡∫

Ω1

f(ω1, ω2)µ2(dω2)

= limn→∞

∫fn(ω1, ω2)µ2(dω2)

= limn→∞ g1n(ω1) (2.6)

for all ω1 ∈ Ω1. Thus, by Proposition 2.1.5, g1 is 〈F1,B(R)〉-measurable.Since (2.5) holds for simple functions,

∫fndµ =

∫g1ndµ1 for all n ≥ 1.

Hence, by repeated applications of the MCT, it follows that∫fdµ = lim

n→∞

∫fndµ = lim

n→∞

∫g1ndµ1

=∫

( limn→∞ g1n)dµ1 =

∫g1dµ1.

The proofs of (2.4) and the second equality in (2.5) are similar.

Theorem 5.2.2: (Fubini’s theorem). Let (Ωi,Fi, µi), i = 1, 2 be σ-finitemeasure spaces and let f ∈ L1(Ω1×Ω2,F1×F2, µ1×µ2). Then there existsets Bi ∈ Fi, i = 1, 2 such that

(i) µi(Ωi \Bi) = 0 for i = 1, 2,

(ii) for ω1 ∈ B1, f(ω1, ·) ∈ L1(Ω2,F2, µ2),

(iii) the function

g1(ω1) ≡ ∫

Ω2f(ω1, ω2)µ2(dω2) for ω1 in B1

0 for ω1 in Bc1

is F1-measurable and∫Ω1

g1dµ1 =∫

Ω1×Ω2

fd(µ1 × µ2), (2.7)

(iv) for ω2 ∈ B2, f(·, ω2) ∈ L1(Ω1,F1, µ1),

Page 167: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

154 5. Product Measures, Convolutions, and Transforms

(v) the function

g2(ω2) ≡ ∫

Ω1f(ω1, ω2)µ1(dω1) for ω2 in B2

0 for ω2 in Bc2

is F2-measurable and∫Ω2

g2dµ2 =∫

Ω1×Ω2

fd(µ1 × µ2). (2.8)

Remark 5.2.1: An informal statement of the above theorem is as fol-lows. If f is integrable on the product space, then the sectional integrals∫Ω2

f(ω1, ·)dµ2 and∫Ω1

f(·, ω2)dµ1 are well defined a.e., and their integralsw.r.t. µ1 and µ2, respectively, are equal to the integral of f w.r.t. the prod-uct measure µ1 × µ2.

Proof: By Tonelli’s theorem∫Ω1×Ω2

|f |d(µ1 × µ2) =∫

Ω1

(∫Ω2

|f(ω1, ω2)|µ2(dω2))

µ1(dω1).

So∫Ω1×Ω2

|f |d(µ1 × µ2) < ∞ implies that µ1(Bc1) = 0 where B1 = ω1 :∫

|f(ω1, ·)|dµ2 < ∞. Also, by Tonelli’s theorem

g11(ω1) ≡∫

Ω2

f+(ω1, ·)dµ2 and g12(ω1) ≡∫

Ω2

f−(ω1, ·)dµ2

are both F1-measurable and∫Ω1

g11dµ1 =∫

Ω1×Ω2

f+d(µ1 × µ2),∫

Ω1

g12dµ1 =∫

Ω1×Ω2

f−d(µ1 × µ2).

(2.9)Since g1 defined in (iii) can be written as g1 = (g11 − g12)IB1 , g1 is F1-measurable. Also,∫

Ω1

|g1|dµ1 ≤∫

Ω1

g11dµ1 +∫

Ω1

g12dµ1

=∫

Ω1×Ω2

f+d(µ1 × µ2) +∫

Ω1×Ω2

f−d(µ1 × µ2)

< ∞.

Further, as∫Ω1×Ω2

|f |dµ1×µ2 < ∞, by (2.9), g11 and g12 ∈ L1(Ω1,F1, µ1).Noting that µ1(Bc

1) = 0, one gets∫Ω1

g1dµ1 =∫

Ω1

(g11 − g12)IB1dµ1

=∫

Ω1

g11IB1dµ1 −∫

Ω1

g12IB1dµ1

=∫

Ω1

g11dµ1 −∫

Ω1

g12dµ1

Page 168: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.2 Fubini-Tonelli theorems 155

which, by (2.9), equals∫Ω1×Ω2

f+d(µ1 × µ2) −∫Ω1×Ω2

f−d(µ1 × µ2) =∫Ω1×Ω2

fd(µ1 × µ2). Thus, (ii) and (iii) of the theorem have been estab-lished as well as (i) for i = 1. The proofs of (iv) and (v) and that of (i) fori = 2 are similar.

An application of the Fubini-Tonelli theorems gives an integration byparts formula. Let F1 and F2 be two nondecreasing right continuous func-tions on an interval [a, b]. Let µi be the Lebesgue-Stieltjes measure onB([a, b]) corresponding to Fi, i = 1, 2. The ‘integration by parts’ for-mula allows one to write

∫(a,b] F1(x)dF2(x) ≡

∫(a,b] F1dµ2 in terms of∫

(a,b] F2(x)dF1(x) ≡∫(a,b] F2dµ1.

Theorem 5.2.3: Let F1, F2 be two nondecreasing right continuous func-tions on [a, b] with no common points of discontinuity in (a, b]. Then∫

(a,b]F1(x)dF2(x) = F1(b)F2(b)− F1(a)F2(a)−

∫(a,b]

F2(x)dF1(x). (2.10)

Proof: Note that((a, b],B(a, b], µi

), i = 1, 2 are finite measure spaces.

Consider the product space((a, b]× (a, b],B((a, b]× (a, b]), µ1×µ2

). Define

the sets

A = (x, y) : a < x ≤ y ≤ bB = (x, y) : a < y ≤ x ≤ b

andC = (x, y) : a < x = y ≤ b.

For notational simplicity, write Ax and Ay for the x-section and the y-section of A, respectively, and similarly, for the sets B and C. Since F1 andF2 have no common points of discontinuity, by Theorem 5.1.2

µ1 × µ2(C) =∫

(a,b]µ2(Cx)µ1(dx)

=∫

(a,b]µ2(x)µ1(dx)

= 0. (2.11)

(see Problem 5.3). And by Theorem 5.1.2,

µ1 × µ2(A) =∫

(a,b]µ1(Ay)µ2(dy)

=∫

(a,b]µ1((a, y])µ2(dy)

=∫

(a,b][F1(y)− F1(a)]dF2(y) (2.12)

Page 169: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

156 5. Product Measures, Convolutions, and Transforms

and similarly,

µ1 × µ2(B) =∫

(a,b]µ2(Bx)µ1(dx)

=∫

(a,b][F2(x)− F2(a)]dF1(x). (2.13)

Next note that (µ1 × µ2)((a, b] × (a, b]) = µ1((a, b]) · µ2((a, b]) = [F1(b) −F1(a)][F2(b)− F2(a)]. Hence, by (2.11)–(2.13),

[F1(b)− F1(a)][F2(b)− F2(a)]= (µ1 × µ2)((a, b]× (a, b])= µ1 × µ2(A) + µ1 × µ2(B)− µ1 × µ2(C)

=∫

(a,b][F1(y)− F1(a)]dF2(y) +

∫(a,b]

[F2(x)− F2(a)]dF1(x),

which yields (2.10), thereby completing the proof of Theorem 5.2.3.

If F1 and F2 are absolutely continuous with nonnegative densities f1 andf2 w.r.t. the Lebesgue measure on (a, b], then (2.10) yields∫ b

a

F1(x)f2(x)dx = F1(b)F2(b)− F1(a)F2(a)−∫ b

a

F2(x)f1(x)dx. (2.14)

If f1 and f2 are any two Lebesgue integrable function on (a, b] that arenot necessarily nonnegative, one can decompose fi as fi = f+

i − f−i and

apply (2.14) to f+i ’s and f−

i ’s separately. Then, by linearity, it follows thatthe relation (2.14) also holds for the given f1 and f2. Thus, the standard‘integration by parts’ formula is a special case of Theorem 5.2.3. Relations(2.10) and (2.14) can be extended to unbounded intervals under suitableconditions on F1 and F2 (Problem 5.5).

Remark 5.2.2: The measure space (Ω1×Ω2,F1×F2, µ1×µ2) constructedusing the integrals in (1.4) and (1.5) is not necessarily complete. As men-tioned at the beginning of this section, the approach based on the extensiontheorem (applied to the set function defined in (1.1) on the algebra A offinite disjoint unions of measurable rectangles) does yield a complete mea-sure space (Ω1×Ω2,M, λ) such that F1×F2 ⊂M and λ(A) = (µ1×µ2)(A)for all A in F1 × F2. To compute the integral

∫fdλ of an M-measurable

function f w.r.t. λ, formula (2.5) in Tonelli’s theorem and formulas (2.7)and (2.8) in Fubini’s theorem continue to be valid with some modifications.In the following, a modified statement of Fubini’s theorem is presented.Similar modification also holds for Tonelli’s theorem (cf. Royden (1988),Chapter 12, Section 4).

Theorem 5.2.4: Let (Ωi,Fi, µi), i = 1, 2 be σ-finite measure spacesand let f ∈ L1(Ω1 × Ω2,M, µ1 × µ2). Let (Ωi, Fi, µi) be the completionof (Ωi,Fi, µi), i = 1, 2. Then there exist sets Bi in Fi, i = 1, 2 such that

Page 170: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.3 Extensions to products of higher orders 157

(i) µi(Bci ) = 0, i = 1, 2,

(ii) for ω1 ∈ B1, f(ω1, ·) is F2-measurable,

(iii)∫Ω2|f(ω1, ·)|dµ2 < ∞ for all ω1 ∈ B1,

(iv) g1(ω1) ≡ ∫

Ω2f(ω1, ·)dµ2 for ω1 in B,

0 for ω1 in Bc1

is F1-measurable,

(v)∫Ω1

g1dµ1 =∫Ω1×Ω2

fdλ.

Further, a similar statement holds for i = 2.

5.3 Extensions to products of higher orders

Definition 5.3.1: Let (Ωi,Fi), i = 1, . . . , k be measurable spaces, 2 ≤k < ∞. Then

(a) ×ki=1Ωi = Ω1 × . . .× Ωk = (ω1, . . . , ωi) : ω :∈ Ωi for i = 1, . . . , k is

called the product set of Ω1, . . . ,Ωk.

(b) A set of the form A1 × . . . × Ak with Ai ∈ Fi, 1 ≤ i ≤ k is called ameasurable rectangle. The product σ-algebra on ×k

i=1Ωi, denoted by×k

i=1Fi, is the σ-algebra generated by the collection of all measurablerectangles, i.e.,

×ki=1Fi = σ〈A1 × . . .×Ak : Ai ∈ Fi, 1 ≤ i ≤ k〉.

(c) (×ki=1Ωi,×k

i=1Fi) is called the product space or the product measurablespace. When (Ωi,Fi) = (Ω,F) for all 1 ≤ i ≤ k, the product spacewill be denoted by (Ωk,Fk).

To define the product measure, one starts with a natural set functionon C, the class of all measurable rectangles, extends it to the algebra A offinite disjoint unions of measurable rectangles by linearity, and then usesthe extension theorem (Theorem 1.3.3) to obtain a measure on the productspace. The details of this construction are now described below.

Define a set function µ on C by

µ(A) =k∏

i=1

µ(Ai) (3.1)

where A = A1 ×A2 × ×Ak with Ai ∈ Fi, i = 1, 2, . . . , k. Next extend itto the algebra A by linearity. If B ∈ A is of the form B =

⋃mj=1 Bj where

Page 171: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

158 5. Product Measures, Convolutions, and Transforms

Bj : 1, . . . , m ⊂ C are disjoint, then define µ(B) by

µ(B) =m∑

j=1

µ(Bj). (3.2)

If the set B ∈ A admits two different representations as finite unions ofmeasurable rectangles then it is not difficult to verify that the value of µ(B)remains unchanged. Next it is shown that µ is a measure on A.

Proposition 5.3.1:. Let µ be as defined by (3.1) and (3.2) above on thealgebra A. Then, it is countably additive on A.

Proof: Let Bn∞n=1 ⊂ A be disjoint such that B =

⋃n≥1 Bn also belongs

to A. It is enough to show that

µ(B) =∞∑

n=1

µ(Bn). (3.3)

Let B and Bn∞n=1 admit the representations B =

⋃i=1 Ai and Bj =⋃j

r=1 Ajr where and j are (finite) integers, Ai, Ajr ∈ C and each of thecollections Ai

i=1 and Ajrj

r=1, j ≥ 1 is disjoint. Then each Ai can bewritten as

Ai = Ai ∩[ ⋃

n≥1

Bn

]=⋃n≥1

n⋃r=1

(Ai ∩Anr).

Suppose it is shown that for each i ≥ 1,

µ(Ai) =∞∑

j=1

j∑r=1

µ(Ai ∩Ajr). (3.4)

Then, by (3.2),

µ(B) =∑

i=1

µ(Ai)

=∑

i=1

∞∑j=1

j∑r=1

µ(Ai ∩Ajr)

=∞∑

j=1

j∑r=1

∑i=1

µ(Ai ∩Ajr)

=∞∑

j=1

j∑r=1

µ(B ∩Ajr),

Page 172: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.3 Extensions to products of higher orders 159

where the last equality follows from the representation of B ∩ Ajr as⋃i=1(Ai ∩ Ajr) and the fact that Ai ∩ Ajr ∈ C for all i, j, r. Since

Ajr ⊂ Bj ⊂ B, the above yields

∞∑j=1

j∑r=1

µ(Ajr) =∞∑

j=1

µ(Bj) (by (3.2))

which establishes (3.3).Thus, it remains to show (3.4). This is implied by the following:

Let C =⋃

n≥1 Cn where Cn∞n=1 is a collection of disjoint measurable

rectangles and C is also a measurable rectangle. Then

µ(C) =∞∑

i=1

µ(Ci). (3.5)

Let C = A1 × A2 × ×Ak and Ci = Ai1 × Ai2 × ×Aik, i = 1, 2, . . ..Since C =

⋃n≥1 Cn and Cn∞

n=1 are disjoint, this implies IC(ω1, . . . , ωk) =∑∞i=1 ICi(ω1, . . . , ωk), for all (ω1, . . . , ωk) ∈ Ω1 × . . .× Ωk. That is,

k∏j=1

IAj(ωj) =

∞∑i=1

k∏j=1

IAij(ωj) (3.6)

for all (ω1, . . . , ωk) ∈ Ω1× . . .×Ωk. Integration of both sides of (3.6) w.r.t.µ1 over Ω1 yields

µ1(A1)( k∏

j=2

IAj(ωj)

)=

∞∑i=1

µ1(Ai1)k∏

j=2

IAij(ωj),

and by iterationk∏

j=1

µj(Aj) =∞∑

i=1

k∏j=1

µj(Aij).

Hence, (3.5) follows.

By the extension theorem (Theorem 1.3.3), there exists a σ-algebra Mand a measure µ such that

(i) (×ki=1Ωi,M, µ) is complete,

(ii) ×ki=1Fi ⊂M, and

(iii) µ(A) = µ(A) for all A ∈ A.

Thus, the above procedure yields an extension µ of µ on A, and this ex-tension is unique when all the µi’s are σ-finite. Further, under the hypoth-esis that µ1, . . . , µk are σ-finite, the analogs of formulas (1.4) and (1.5) for

Page 173: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

160 5. Product Measures, Convolutions, and Transforms

computing the product measure of a set via the iterated integrals are valid.More generally, the Tonelli-Fubini theorems extend to the k-fold productspaces in an obvious manner. For example, if (Ωi,Fi, µi) are σ-finite mea-sure spaces for i = 1, . . . , k and f is a nonnegative measurable function on(×k

i=1Ωi,×ki=1Fi, µ), then∫

fdµ =∫

Ωi1

∫Ωi2

· · ·∫

Ωik

f(ω1, ω2, ωk)µi1(dωi1) . . . µik(dωik

)

for any permutation (i1, i2, . . . , ik) of (1, 2, . . . , k).

Definition 5.3.2: The measure space (×ki=1Ωi,M, µ) is called the com-

plete product measure space and µ the complete product measure.

Remark 5.3.1: If (Ωi,Fi, µi) ≡ (R,L, m) where L is the Lebesgue σ-algebra and m is the Lebesgue measure, then the above extension coincideswith the (Rk,Lk, mk), where Lk is the Lebesgue σ-algebra in Rk and mk

is the k dimensional Lebesgue measure.

5.4 Convolutions

5.4.1 Convolution of measures on (R,B(R))In this section, convolution of measures on

(R,B(R)

)is discussed. From

this, one can easily obtain convolution of sequences, convolution of func-tions in L1(R) and convolution of functions with measures.

Proposition 5.4.1: Let µ and λ be two σ-finite measures on(R,B(R)

).

For any Borel set A in B(R), let

(µ ∗ λ)(A) ≡∫ ∫

IA(x + y)µ(dx)λ(dy). (4.1)

Then (µ ∗ λ)(·) is a measure on(R,B(R)

).

Proof: Let h(x, y) ≡ x + y for (x, y) ∈ R × R. Then h : R × R → Ris continuous and hence is 〈B(R) × B(R),B(R)〉-measurable. Consider themeasure (µ×λ)h−1(·) on 〈R,B(R)〉 induced by the map h and the measureµ× λ on 〈R× R,B(R)× B(R)〉. Clearly

(µ ∗ λ)(·) = (µ× λ)h−1(·)

and hence the proposition is proved.

Definition 5.4.1: For any two σ-finite measures µ and λ on(R,B(R)

),

the measure (µ ∗ λ)(A) defined in (4.1) is called the convolution of µ andλ.

Page 174: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.4 Convolutions 161

The following proposition is easy to verify.

Proposition 5.4.2: Let µ1, µ2, µ3 be σ-finite measures on(R,B(R)

).

Then

(i) (Commutativity) µ1 ∗ µ2 = µ2 ∗ µ1,

(ii) (Associativity) µ1 ∗ (µ2 ∗ µ3) = (µ1 ∗ µ2) ∗ µ3,

(iii) (Distributive) µ1 ∗ (µ2 + µ3) = µ1 ∗ µ2 + µ1 ∗ µ3,

(iv) (Identity element) µ1 ∗ δ0 = µ1

where δ0(·) is the delta measure at 0, i.e.,

δ0(A) =

1 if 0 ∈ A,0 if 0 ∈ A

for all A ∈ B(R).

Remark 5.4.1: (Extension to Rk). Definition 5.4.1 extends to measureson(Rk,B(Rk)

)for any integer k ≥ 1 as well as to any space that is a com-

mutative group under addition such as the sequence space R∞, the functionspaces C[0, 1] and C[0,∞). These are relevant in the study of stochastic pro-cesses.

Remark 5.4.2: (Sums of independent random variables). It will be seenin Chapter 7 that if X and Y are two independent random variables on aprobability space (Ω,F , P ), then PX ∗ PY = PX+Y where for any randomvariable Z on (Ω,F , P ), PZ is the distribution of Z, i.e., the probabilitymeasure on

(R,B(R)

)induced by P and Z, i.e., PZ(·) = PZ−1(·) on B(R).

Remark 5.4.3: (Extension to signed measures). Let µ and λ be two finitesigned measures on

(R,B(R)

)as defined in Section 4.2. Let µ = µ+ − µ−,

λ = λ+ − λ− be the Jordan decomposition of µ and λ, respectively. Thenthe product measure µ× λ can be defined as the signed measure

µ× λ ≡[(µ+ × λ+) + (µ− × λ−)

]−[(µ+ × λ−) + (µ− + λ+)

]≡ γ1 − γ2, say. (4.2)

This is well defined since the measures (µ+ × λ+) + (µ− × λ−) and (µ+ ×λ−)+(µ−×λ+) are both finite measures. Now the definition of convolutionof measures in (4.1) carries over to signed measures using the definition ofintegration w.r.t. signed measures discussed in Section 4.2.

Definition 5.4.2: Let µ and ν be (finite) signed measures on a measurablespace

(R,B(R)

). The convolution of µ and ν, denoted by µ∗ν is the signed

measure defined by

(µ ∗ ν)(A) =∫ ∫

IA(x + y)d(µ× ν)

Page 175: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

162 5. Product Measures, Convolutions, and Transforms

=∫ ∫

IA(x + y)dγ1 −∫ ∫

IA(x + y)dγ2 (4.3)

where γ1 and γ2 are as in (4.2).

5.4.2 Convolution of sequencesDefinition 5.4.3: Let a ≡ ann≥0 and b ≡ bnn≥0 be two sequencesof real numbers. Then the convolution of a and b denoted by a ∗ b is thesequence

(a ∗ b)(n) =n∑

j=0

ajbn−j , n ≥ 0. (4.4)

If∑∞

j=0 |aj | < ∞ and∑∞

j=0 |bj | < ∞, then a ∗ b corresponds to the convo-lution of the signed measures a and b on Z+, defined by a(i) = ai, b(i) = bi,i ∈ Z+.

Example 5.4.1:

(a) Let bn ≡ 1 for n ≥ 0. Then for any ann≥0, (a ∗ b)(n) =∑n

j=0 aj .

(b) Fix 0 < p < 1, k1, k2 positive integers. For n ∈ Z+, let

an ≡(

k1

n

)pn(1− p)k1−nI[0,k1](n)

bn ≡(

k2

n

)pn(1− p)k2−nI[0,k2](n).

Then (a ∗ b)(n) =(k1+k2

n

)pn(1− p)k2−nI[0,k1+k2](n).

(c) Fix 0 < λ1, λ2 < ∞. Let

an ≡ e−λ1λn

1

n!, n = 0, 1, 2, . . .

bn ≡ e−λ2λn

2

n!, n = 0, 1, 2, . . . .

Then (a ∗ b)(n) = e−(λ1+λ2) (λ1+λ2)n

n! , n = 0, 1, 2, . . . .

The verification of these claims is left as an exercise (Problem 5.20).

A useful technique to determine a ∗ b is the use of generating functions.This will be discussed in Section 5.5.

5.4.3 Convolution of functions in L1(R)Let L1(R) ≡ L1(R,B(R), m) where m(·) is the Lebesgue measure. In thefollowing, for f ∈ L1(R),

∫fdm will also be written as

∫f(x)dx.

Page 176: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.4 Convolutions 163

Proposition 5.4.3: Let f , g ∈ L1(R). Then for almost all x (w.r.t. m),∫|f(x− u)| |g(u)|du < ∞. (4.5)

Proof: Let k(x, u) = f(x − u)g(u). Since π(x, u) ≡ x − u is a continuousmap from R × R → R, it is 〈B(R) × B(R),B(R)〉-measurable. Also, sincef and g are Borel measurable, it follows that k : R × R → R is 〈B(R) ×B(R),B(R)〉-measurable. Also note that by the translation invariance of m,∫|f(x − u)|dx =

∫|f(x)|dx for any Borel measurable f : R → R. Hence,

by Tonelli’s theorem,∫ ∫|k(x, u)|dxdu =

∫ (∫|k(x, u)|dx

)du

=∫|g(u)|

(∫|f(x)|dx

)du

= ‖g‖1 ‖f‖1 < ∞. (4.6)

Also by Tonelli’s theorem∫ ∫|k(x, u)|dxdu =

∫ (∫|k(x, u)|du

)dx.

By (4.6), this yields ∫|k(x, u)|du < ∞ a.e. (m)

which is the same as (4.5).

Definition 5.4.4: Let f , g ∈ L1(R). Then the convolution of f and g,denoted by f ∗ g is the function defined a.e. (m) by

(f ∗ g)(x) ≡∫

f(x− u)g(u)du. (4.7)

Note that by (4.5) this is well defined.

Proposition 5.4.4: Let f , g ∈ L1(R). Then

(i) f ∗ g = g ∗ f

(ii) f ∗ g ∈ L1(R) and ‖f ∗ g‖1 ≤ ‖f‖1 ‖g‖1

(iii)∫

f ∗ g dm =( ∫

fdm)( ∫

gdm).

Proof: For (i), use the change of variable u → x − u and the translationinvariance of m. For (ii) use (4.6). For (iii), use Fubini’s theorem.

Page 177: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

164 5. Product Measures, Convolutions, and Transforms

It is easy to see that if µ and λ denote the signed measures with Radon-Nikodym derivatives f and g, respectively, w.r.t. m, then µ∗λ is the signedmeasure with Radon-Nikodym derivative f ∗ g w.r.t. m.

Example 5.4.2: Here are two examples from probability theory.

(a) (Convolutions of Uniform [0, 1] densities). Let f = g = I[0,1]. Then

(f ∗ g)(x) =

x 0 ≤ x ≤ 12− x 1 ≤ x ≤ 2.

(b) (Convolutions of N(0, 1) densities). Let f(x) ≡ g(x) ≡ 1√2π

e− x22 ,

−∞ < x < ∞. Then (f ∗ g)(x) = 1√2π

1√2

e− x24 , −∞ < x < ∞.

5.4.4 Convolution of functions and measuresLet f ∈ L1(R) and λ be a signed measure. Let µ(A) ≡

∫A

fdm, A ∈ B(R).Then it can be shown that µ∗λ is dominated by m and its Radon-Nikodymderivative is given by

f ∗ λ(x) ≡∫

f(x− u)λ(du), x ∈ R. (4.8)

Note that f ∗λ in (4.8) is well defined for any nonnegative Borel measurablef and any measure λ on

(R,B(R)

).

5.5 Generating functions and Laplace transforms

Definition 5.5.1: Let ann≥0 be a sequence in R and let ρ =(lim

n→∞ |an|1/n)−1. The power series

A(s) ≡∞∑

n=0

ansn, (5.1)

defined for all s in (−ρ, ρ), is called the generating function of ann≥0.

Note that the power series in (5.1) converges absolutely for |s| < ρ andρ is called the radius of convergence of A(s) (cf. Appendix A.2).

The usefulness of generating functions is given by the following:

Proposition 5.5.1: Let ann≥0 and bnn≥0 be two sequences in R. Letcnn≥0 be the convolution of an and bn. That is,

cn =n∑

j=0

ajbn−j , n ≥ 0.

Page 178: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.5 Generating functions and Laplace transforms 165

ThenC(s) = A(s)B(s) (5.2)

for all s in (−ρ, ρ), where A(·), B(·), and C(·) are, respectively, thegenerating function of the sequences ann≥0, bnn≥0 and cnn≥0 andρ = min(ρa, ρb) with ρa =

(lim

n→∞ |an|1/n)−1 and ρb =

(lim

n→∞ |bn|1/n)−1.

Proof: By Tonelli’s theorem applied to the counting measure on the prod-uct space (Z+ × Z+),

∞∑n=0

|s|n( n∑

j=0

|aj | |bn−j |)

=∞∑

j=0

|aj | |s|j( ∞∑

n=j

|s|n−j |bn−j |)

= A(|s|)B(|s|) < ∞ if |s| < ρ.

Now by Fubini’s theorem, (5.2) follows.

It is easy to verify the claims in Example 5.4.1 by using the above propo-sition and the uniqueness of the power series coefficients (Problem 5.20).

Proposition 5.5.2: (Renewal sequences). Let ann≥0, bnn≥0, pnn≥0be sequences in R such that

an = bn +n∑

j=0

an−jpj , n ≥ 0 (5.3)

and p0 = 0. Then

A(s) =B(s)

1− P (s)

for all s such that |s| < ρ and P (s) = 1, where ρ = min(ρb, ρp), ρb =(lim

n→∞ |bn|1/n)−1, ρp =

(lim

n→∞ |pn|1/n)−1.

For applications of this to renewal theory, see Chapter 8.

Definition 5.5.2: (Laplace transform). Let f : [0,∞) → R be Borelmeasurable. The function

(Lf)(s) ≡∫

[0,∞)esxf(x)dx, (5.4)

defined for all s in R such that∫[0,∞)

esx|f(x)|dx < ∞, (5.5)

is called the Laplace transform of f .

Page 179: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

166 5. Product Measures, Convolutions, and Transforms

It is easily seen that if (5.5) holds for some s = s0, then it does for alls < s0. The analog of Proposition 5.5.1 is the following.

Proposition 5.5.3: Let f , g ∈ L1(R). Then

L(f ∗ g)(s) = Lf(s)Lg(s) (5.6)

for all s such that (5.5) holds for both f and g.

Definition 5.5.3: (Laplace-Stieltjes transform). Let µ be a measure on(R,B(R)

)such that µ

((−∞, 0)

)= 0. The function

µ∗(s) ≡ Lµ(s) ≡∫

[0,∞)esxµ(dx)

is called the Laplace-Stieltjes transform of µ. Clearly, µ∗(s) is well definedfor all s in R. However, µ∗(s0) = ∞ implies µ∗(s) = ∞ for s ≥ s0.

Proposition 5.5.4: Let µ and λ be measures on(R,B(R)

)such that

µ((−∞, 0)

)= 0 = λ

((−∞, 0)

). Then

L(µ ∗ λ)(s) = Lµ(s)Lλ(s) for all s in R.

For an inversion formula to obtain a probability measure µ from Lµ(·),see Feller (1966), Section 13.4.

5.6 Fourier series

In this section, Lp[0, 2π] stands for Lp([0, 2π],B([0, 2π]), m

)where m(·) is

Lebesgue measure and 0 < p < ∞.

Definition 5.6.1: For f ∈ L1[0, 2π], n ≥ 0, let

an ≡ 12π

∫ 2π

0f(x) cos nx dx

bn ≡ 12π

∫ 2π

0f(x) sin nx dx, (6.1)

sn(f, x) ≡ a0 +n∑

j=1

(aj cos jx + bj sin jx). (6.2)

Then an, bn, n = 0, 1, 2, . . . are called the Fourier coefficients of f andthe sequence sn(f, x) : n = 0, 1, 2, . . . is called the partial sum sequenceof the Fourier series expansion of f .

Since C[0, 2π] ⊂ L2[0, 2π] ⊂ L1[0, 2π], the Fourier coefficients and thepartial sum sequence of the Fourier series are well defined for f in C[0, 2π]

Page 180: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.6 Fourier series 167

and f in L2[0, 2π]. J. Fourier introduced these series in the early 19thcentury to approximate certain functions exhibiting a periodic behaviorthat arose in the study of physics and mechanics and in particular in thetheory of heat conduction (see Korner (1989) and Bhatia (2003)).

A natural question is: In what sense does sn(f, x) approximate f(x)? Itturns out that one can prove the strongest results for f in C[0, 2π], lessstronger results for f in L2[0, 2π], and finally, for f in L1[0, 2π].

In the early 19th century it was believed that for any f in C[0, 2π],sn(f, x) converged to f(x) for all x in [0, 2π]. But in 1876, D. Raymondconstructed a continuous function f for which limn→∞|sn(f ;x0)| = ∞ forsome x0. However, Fejer showed in 1903 that for f in C[0, 2π], sn(f, ·) doesconverge to f uniformly in the Cesaro sense.

Theorem 5.6.1: (Fejer). Let f ∈ C[0, 2π] and sn(f, ·), n ≥ 0 be as inDefinition 3.4.1. Let

Dn(f, ·) =1n

n−1∑j=0

sj(f, ·), n ≥ 1. (6.3)

ThenDn(f, ·) → f(·) uniformly on [0, 2π] as n →∞. (6.4)

For the proof of this result, the following result of some independentinterest is needed.

Lemma 5.6.2: (Fejer). Let for m ≥ 1

Km(x) ≡ 1 +2m

m−1∑j=1

(m− j) cos jx, x ∈ R. (6.5)

Then

(i)

Km(x) =

⎧⎨⎩

1m

(sin mx

2sin x

2

)2

if x = 0

m if x = 0 .

(6.6)

(ii) For δ > 0,

supKm(x) : δ ≤ |x| ≤ 2π − δ → 0 as m →∞. (6.7)

(iii)12π

∫ 2π

0Km(x)dx = 1. (6.8)

Page 181: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

168 5. Product Measures, Convolutions, and Transforms

Proof: Clearly, (iii) follows from (6.5) on integration. Also, (ii) followsfrom (i) since for δ ≤ x ≤ 2π − δ,

Km(x) ≤ 1m

1(sin δ

2 )2→ 0 as m →∞.

To establish (i) note that using Euler’s formula (cf. Section A.3, Appendix)eιx = cos x + ι sin x, Km(x) can be written as

Km(x) =1m

(m−1)∑j=−(m−1)

(m− |j|)eιjx

=1m

2(m−1)∑j=0

(m− |j − (m− 1)|)eι(j−(m−1))x

=1m

(m−1∑k=0

eι(k− m−12 )x

)2

.

For x = 0,

Km(x) =1m

(e− ι(m−1)x

21− eιmx

1− eιx

)2,

=1m

(e− ιmx

2 − eιmx

2

e− ιx2 − e

ιx2

)2

,

=1m

( sin mx2

sin x2

)2.

For x = 0, (6.5) yields

Km(0) = 1 +2m

m−1∑j=1

(m− j)

= 1 +2m

(m− 1)m2

= m.

Proof of Theorem 5.6.1: Let f ∈ C[0, 2π]. Let an, bnn≥0, sn(f, ·),Dm(f, ·) be as in (6.1), (6.2), and (6.3). Then from the definition ofan, bn, n ≥ 0, it follows that

Dm(f, ·) =12π

∫ 2π

0Km(x− u)f(u)du

=12π

∫ 2π

0f(x− u)Km(u)du (6.9)

Page 182: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.6 Fourier series 169

where Km(·) is defined in (6.5) and f(·) is extended to all of R periodicallywith period 2π. Since f ∈ C[0, 2π], given ε > 0, there exists a δ > 0 suchthat

x, y ∈ [0, 2π], |x− y| < δ ⇒ |f(x)− f(y)| < ε.

Also from (6.7), for this δ > 0, there exist an m0 such that m ≥ m0 ⇒supKm(x) : δ ≤ x ≤ 2π−δ < ε. Now (6.9) yields, for x ∈ [0, 2π], m ≥ m0

|Dm(f, x)− f(x)| ≤ 12π

∫ 2π

0|f(x− u)− f(x)|Km(u)du

=12π

∫(δ≤u≤2π−δ)c

|f(x− u)− f(x)|Km(u)du

+12π

∫(δ≤u≤2π−δ)

|f(x− u)− f(x)|Km(u)du

≤ 12π

(ε + 2‖f‖ε2π)

where ‖f‖ = sup|f(x)| : x ∈ [0, 2π]. Thus, m ≥ m0 ⇒ sup|Dm(f, x) −f(x)| : x ∈ [0, 2π] ≤ ε (1+4π‖f‖)

2π . Since ε > 0 is arbitrary, the proof ofTheorem 5.6.1 is complete.

An immediate consequence of Theorem 5.6.1 is the completeness of thetrigonometric functions.

Theorem 5.6.3: The collection T0 ≡ cos nx : n = 0, 1, 2, . . . ∪ sin nx :n = 1, 2, . . . is a complete orthogonal system for L2[0, 2π].

Proof: By Theorem 5.6.1 for each f ∈ C[0, 2π] and ε > 0, there exists afinite linear combination Dm(f, ·) of cos nx : n = 0, 1, 2, . . . and sin nx :n = 1, 2, . . . such that∫ 2π

0|f −Dm(f, ·)|2dx

≤ 2π(sup|f(x)−Dm(f, x)| : x ∈ [0, 2π])2 < ε2.

Also, from Theorem 2.3.14 it is known that given any g ∈ L2[0, 2π], andany ε > 0, there is a f ∈ C[0, 2π] such that

∫ 2π

0 |f − g|2dx < ε2. Thus, forany g ∈ L2[0, 2π], ε > 0, there is a f ∈ C[0, 2π] and an m ≥ 1 such that

‖g −Dm(f, ·)‖2 < 2ε.

That is, the set T of all finite linear combinations of the functions in theclass T0 is dense in L2[0, 2π]. Further, it is easy to verify that h1, h2 ∈ T0,h1 = h2 implies ∫ 2π

0h1(x)h2(x)dx = 0,

Page 183: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

170 5. Product Measures, Convolutions, and Transforms

i.e., T0 is an orthogonal family. Since T0 is orthogonal and T is dense inL2[0, 2π], T0 is complete.

Definition 5.6.2: A function in T is called a trigonometric polynomial.Thus, the above theorem says that trigonometric polynomials are dense

in L2[0, 2π]. Completeness of T in L2[0, 2π] and the results of Section 3.3lead to

Theorem 5.6.4: Let f ∈ L2[0, 2π]. Let (an, bn), n = 0, 1, 2, . . . andsn(f, ·)n≥0 be the associated Fourier coefficient sequences and partial sumsequence of the Fourier series for f as in Definition 5.6.1. Then

(i) sn(f, ·) → f in L2[0, 2π],

(ii)∑∞

n=0(a2n + b2

n) = 12π

∫ 2π

0 |f |2dx where for n ≥ 0, an = an/cn, withc2n = 1

∫ 2π

0 (cos nx)2dx = 12 , and for n ≥ 1, bn = bn

dnwith d2

n =12π

∫ 2π

0 (sinnx)2dx = 12 .

(iii) Further, if f , g ∈ L2[0, 2π], then 12π

∫ 2π

0 fg dx = a0α0 +∑∞n=1

(anαn

c2n

+ bnβn

d2n

), where (an, bn) : n = 0, 1, 2, . . . and

(αn, βn) : n = 0, 1, 2, . . . are, respectively, the Fourier coefficientsof f and g.

Clearly (ii) above is a restatement of Bessel’s equality. Assertion (iii) isknown as the Parseval identity. As for convergence pointwise or almosteverywhere, A. N. Kolmogorov showed in 1926 (see Korner (1989)) thatthere exists an f ∈ L1[0, 2π] such that limn→∞|sn(f, x)| = ∞ everywhereon [0, 2π]. This led to the belief that for f ∈ C[0, 2π], the mean square con-vergence of (i) in Theorem 5.6.4 cannot be improved upon. But L. Carlesonshowed in 1964 (see Korner (1989)) that for f in L2[0, 2π], sn(f, ·) → f(·)almost everywhere. Finally, turning to L1[0, 2π], one has the following:

Theorem 5.6.5: Let f ∈ L1[0, 2π]. Let (an, bn) : n ≥ 0 be as in (6.1)and satisfy

∑∞n=0(|an|+ |bn|) < ∞. Let sn(f, ·) be as in (6.2). Then sn(f, ·)

converges uniformly on [0, 2π] and the limit coincides with f almost every-where.

Proof: Note that∑∞

n=0(|an| + |bn|) < ∞ implies that the sequencesn(f, ·)n≥0 is a Cauchy sequence in the Banach space C[0, 2π] with thesup-norm. Thus, there exists a g in C[0, 2π] such that sn(f, ·) → g uni-formly on [0, 2π]. It is easy to check that this implies that g and f havethe same Fourier coefficients. Set h = g − f . Then h ∈ L1[0, 2π] and theFourier coefficients of h are all zero. This implies that h is orthogonal tothe members of the class T , which in turn yields that h is orthogonal to all

Page 184: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.6 Fourier series 171

continuous functions in C[0, 2π], i.e.,∫ 2π

0h(x)k(x)dx = 0

for all k ∈ C[0, 2π]. Since h ∈ L1[0, 2π] and for any interval A ⊂ [0, 2π],there exists a sequence knn≥1 of uniformly bounded continuous functions,such that kn → IA a.e. (m), by the DCT,∫

h(x)IA(x)dx = limn→∞

∫h(x)kn(x)dx = 0.

This in turn implies that h = 0 a.e., i.e., g = f a.e.

Remark 5.6.1: If f ∈ L2[0, 2π], the Fourier coefficients an, bn are squaresummable and hence go to zero as n →∞. What if f ∈ L1[0, 2π]?

If f ∈ L1[0, 2π], one can assert the following:

Theorem 5.6.6: (Riemann-Lebesgue lemma). Let f ∈ L1[0, 2π]. Then

limn→∞

∫ 2π

0f(x) cos nx dx = 0 = lim

n→∞

∫ 2π

0f(x) sin nx dx.

Proof: The lemma holds if f = IA for any interval A ⊂ [0, 2π] and sincestep functions (i.e., linear combinations of indicator functions of intervals)are dense in L1[0, 2π], the lemma is proved.

It can be shown that the mapping f → (an, bn)n≥0 from L1[0, 2π] tobivariate sequences that go to zero as n → ∞ is one-to-one but not onto(Rudin (1987), Chapter 5).

Remark 5.6.2: (The complex case). Let T ≡ z : z = eιθ, 0 ≤ θ ≤ 2πbe the unit circle in the complex plane C. Every function g : T → C can beidentified with a function f on R by f(t) = g(eιt). Clearly, f(·) is periodicon R with period 2π. In the rest of this section, for 0 < p < ∞, Lp(T )will stand for the collection of all Borel measurable functions f : [0, 2π]to C such that

∫[0,2π] |f |

pdm < ∞ where m(·) is the Lebesgue measure. Atrigonometric polynomial is a function of form

f(·) ≡k∑

n=−k

αneιnx ≡ a0 +k∑

n=1

(an cos nx + bn sin nx),

k < ∞, where αn, ann≥0, and bnn≥0 are sequences of complex num-bers.

The completeness of the trigonometric polynomials proved in Theorem5.6.3 implies that the family eιnx : n = 0,±1,±2, . . . is a complete or-thonormal basis for L2(T ), which is a complex Hilbert space.

Page 185: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

172 5. Product Measures, Convolutions, and Transforms

Thus Theorem 5.6.4 carries over to this case.

Theorem 5.6.7:

(i) Let f ∈ L2(T ). Then,

k∑n=−k

αneιnx → f in L2(T )

where

αn ≡12π

∫ 2π

0f(x)e−ιnsdx, n ∈ Z.

Further,∞∑

n=−∞|αn|2 =

12π

∫ 2π

0|f |2dm.

(ii) For any sequence αnn∈Z of complex numbers such that∑∞n=−∞ |αn|2 < ∞, the sequence

fk(x) ≡

∑kn=−k αneιnx

k≥1 con-

verges in L2(T ) to a unique f such that

αn =12π

∫ 2π

0f(x)e−ιnxdx.

(iii) For any f , g ∈ L2(T ),

∞∑n=−∞

αnβn =12π

∫ 2π

0f(x)g(x)dx

where

αn = f(n) =12π

∫ 2π

0f(x)e−ιnxdx,

βn = g(n) =12π

∫ 2π

0g(x)e−ιnxdx, n ∈ Z.

Further,∞∑

n=−∞|αnβn| < ∞.

(iv) L2(T ) is isomorphic to 2(Z), the Hilbert space of all square summablesequences of complex numbers on Z.

Similarly, Theorem 5.6.5 carries over to the complex case.

Page 186: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.7 Fourier transforms on R 173

Theorem 5.6.8: Let f ∈ L1(T ). Suppose

∞∑n=−∞

|f(n)| < ∞

where

f(n) =12π

∫ 2π

0f(x)e−ιnxdx, n ∈ Z.

Then

sn(f, x) ≡n∑

j=−n

f(j)e−ιjx

converges uniformly on [0, 2π] and the limit coincides with f a.e. and hencef is continuous a.e.

5.7 Fourier transforms on R

In this section and in Section 5.8, let Lp(R) stand for

Lp(R) ≡ f : f : R → C, Borel measurable,∫

R

|f |pdm < ∞ (7.1)

where m(·) is Lebesgue measure. Also, for f ∈ L1(R),∫

Rfdm will often be

written as∫

f(x)dx. Let

C0 ≡ f : f : R → C, continuous and lim|x|→∞

f(x) = 0. (7.2)

Definition 5.7.1: For f ∈ L1(R), t ∈ R,

f(t) ≡∫

f(x)e−ιtxdx (7.3)

is called the Fourier transform of f .

Proposition 5.7.1: Let f ∈ L1(R) and f(·) be as in (7.3). Then

(i) f(·) ∈ C0.

(ii) If fa(x) ≡ f(x− a), a ∈ R, then fa(t) = eιtaf(t), t ∈ R.

Proof:

(i) For any t ∈ R, tn → t ⇒ eιtnxf(x) → eιtxf(x) for all x ∈ R and since|eιtnxf(x)| ≤ |f(x)| for all n and x, by the DCT f(tn) → f(t). Toshow that f(t) → 0 as |t| → ∞, the same proof as that of Theorem5.6.6 works. Thus, it holds if f = I[a,b], for a, b ∈ R, a < b and sincethe step functions are dense in L1(R), it holds for all f ∈ L1(R).

Page 187: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

174 5. Product Measures, Convolutions, and Transforms

(ii) This is a consequence of the translation invariance of m(·), i.e., m(A+a) = m(A) for all A ∈ B(R), a ∈ R.

The continuity of f(·) can be strengthened to differentiability if, in ad-dition to f ∈ L1(R), xf(x) ∈ L1(R). More generally, if f ∈ L1(R) andxkf(x) ∈ L1(R) for some k ≥ 1, then f(·) is differentiable k-times with allderivatives f (r)(t) → 0 as |t| → ∞ for r ≤ k (Problem 5.22).

Proposition 5.7.2: Let f , g ∈ L1(R) and f ∗ g be their convolution asdefined in (4.7). Then

f ∗ g = f g. (7.4)

Proof:

f ∗ g(t) =∫

R

e−ιtx(∫

R

f(x− u)g(u)du)dx

=∫

R

(∫R

e−ιt(x−u)f(x− u)e−ιtug(u)du)dx

=∫

R

(ft ∗ gt)(x)dx (7.5)

where ft(x) = e−ιtxf(x), gt(x) = e−ιtxg(x). Thus, by Proposition 5.4.4

f ∗ g(t) =(∫

R

ft(x)dx)(∫

R

gt(x)dx)

= f(t)g(t).

The process of recovering f from f (i.e., that of finding an inversionformula) can be developed along the lines of Fejer’s theorem (Theorem5.6.1).

Theorem 5.7.3: (Fejer’s theorem). Let f ∈ L1(R), f(·) be as in (7.3)and

ST (f, x) ≡ 12π

∫ T

−T

f(t)eιtxdt, T ≥ 0, (7.6)

DR(f, x) ≡ 1R

∫ R

0ST (f, x)dT, R ≥ 0. (7.7)

(i) If f is continuous at x0 and f is bounded on R, then

limR→∞

DR(f, x0) = f(x0). (7.8)

(ii) If f is uniformly continuous and bounded on R, then

limR→∞

DR(f, ·) = f(·) uniformly on R. (7.9)

Page 188: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.7 Fourier transforms on R 175

(iii) As R →∞,DR(f, ·) → f(·) in L1(R). (7.10)

(iv) If f ∈ Lp(R), 1 ≤ p < ∞, then as R →∞,

DR(f, ·) → f(·) in Lp(R). (7.11)

Corollary 5.7.4: (Uniqueness theorem). If f and g ∈ L1(R) and f(·) =g(·), then f = g a.e. (m).

Proof: Let h = f − g. Then h ∈ L1(R) and h(·) ≡ 0. Thus, ST (h, ·) ≡ 0and DR(h, ·) ≡ 0 where ST (h, ·) and DR(h, ·) are as in (7.6) and (7.7).Hence by Theorem 5.7.3 (iii), h = 0 a.e. (m), i.e., f = g a.e. (m).

Corollary 5.7.5: (Inversion formula). Let f ∈ L1(R) and f ∈ L1(R).Then

f(x) =12π

∫R

f(t)eιtxdx a.e. (m). (7.12)

Proof: Since f ∈ L1(R), by the DCT,

ST (f, x) → 12π

∫R

f(t)eιtxdt as T →∞

for all x in R and hence DR(f, x) has the same limit as R →∞. Now (7.12)follows from (7.10).

The following results, i.e., Lemma 5.7.6 and Lemma 5.7.7, are needed forthe proof of Theorem 5.7.3. The first one is an analog of Lemma 5.6.2.

Lemma 5.7.6: (Fejer). For R > 0, let

KR(x) ≡ 12π

1R

∫ R

0

(∫ T

−T

eιtxdt)dT. (7.13)

Then

(i)

KR(x) =

(1−cos Rx)x2 , x = 0

R2π x = 0,

(7.14)

and hence KR(·) ≥ 0.

(ii) For δ > 0, ∫|x|≥δ

KR(x)dx → 0 as R →∞. (7.15)

Page 189: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

176 5. Product Measures, Convolutions, and Transforms

(iii) ∫ ∞

−∞KR(x)dx = 1. (7.16)

Proof:

(i) KR(0) = 12πR

∫ R

0 (2T )dT = 12π R. For x = 0 and R > 0,

KR(x) =1

2πR

∫ R

0

2 sinTx

xdT

=1

2πR

2(1− cos Rx)x2 .

(ii) For δ > 0,

0 ≤∫

|x|≥δ

KR(x)dx ≤ 22πR

∫|x|≥δ

1x2 dx

=2

πR

1δ→ 0 as R →∞.

(iii) ∫ ∞

−∞KR(x)dx =

1R

∫ ∞

0

(1− cos Rx

x2

)dx

=2π

∫ ∞

0

(1− cos u

u2

)du.

Now ∫ ∞

0

1− cos u

u2 du

= limL→∞

∫ L

0

(1− cos u

u2

)du (by the MCT)

= limL→∞

∫ L

0

(∫ u

0sin x dx

) 1u2 du

= limL→∞

∫ L

0sin x

(∫ L

x

1u2 du

)dx (by Fubini’s theorem)

= limL→∞

(∫ L

0

sin x

xdx− 1

L

∫ L

0sin x dx

)

= limL→∞

∫ L

0

sin x

xdx since

∣∣∣ ∫ L

0sin x dx

∣∣∣ ≤ 1.

Page 190: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.7 Fourier transforms on R 177

Thus, ∫ ∞

0

1− cos u

u2 du =∫ ∞

0

sin x

xdx =

π

2(7.17)

(cf. Problem 5.9). Hence (iii) follows.

Lemma 5.7.7: Let f ∈ Lp(R), 0 < p < ∞. Then∫ ∞

−∞|f(x− u)− f(x)|pdx → 0 as |u| → 0. (7.18)

Proof: The lemma holds if f ∈ CK , i.e., if f is continuous on R (withvalues in C) and vanishes outside a bounded interval. By Theorem 2.3.14,such functions are dense in Lp(R). So given f ∈ Lp(R), 0 < p < ∞ andε > 0, let g ∈ CK be such that∫

|f − g|pdm < ε.

For any 0 < p < ∞, there is a 0 < cp < ∞ such that for all x, y, z ∈ (0,∞),|x + y + z|p ≤ cp(|x|p + |y|p + |z|p). Then,∫

|f(x− u)− f(x)|pdx

≤ cp

∫ (|f(x− u)− g(x− u)|p + |g(x− u)− g(x)|p

+∫|g(x)− f(x)|p

)dx

= cp

(2ε +

∫|g(x− u)− g(x)|p

)du.

Solim

|u|→0

∫|f(x− u)− f(x)|pdx ≤ cp 2ε.

Since ε > 0 is arbitrary, the lemma is proved.

Proof of Theorem 5.7.3: From (7.7)

DR(f, x) ≡ 12πR

∫ R

0

(∫ T

−T

eιtx(∫ ∞

−∞e−ιtyf(y)dy

)dt)dT

=12π

1R

∫ R

0

(∫ T

−T

(∫ ∞

−∞eιtuf(x− u)du

)dt)dT.

Now Fubini’s theorem yields

DR(f, x) =∫ ∞

−∞f(x− u)

( 12π

1R

∫ R

0

(∫ T

−T

eιtudt)dR)du

=∫ ∞

−∞f(x− u)KR(u)du (7.19)

Page 191: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

178 5. Product Measures, Convolutions, and Transforms

where KR(·) is as in (7.13).Now let f be continuous at x0 and bounded on R by Mf . Fix ε > 0 and

choose δ > 0 such that |x− x0| < δ ⇒ |f(x)− f(x0)| < ε. From (7.16) and(7.19),

DR(f, x0)− f(x0) =∫ (

f(x0 − u)− f(x0))KR(u)du

implying

|DR(f, x0)− f(x0)| ≤∫

|u|<δ

|f(x0 − u)− f(x0)|KR(u)du

+∫

|u|≥δ

|f(x0 − u)− f(x0)|KR(u)du

< ε + 2Mf

∫|u|≥δ

KR(u)du.

Now from (7.15), it follows that

limR→∞

|DR(f, x0)− f(x0)| ≤ ε,

proving (i).The proof of (ii) is similar to this and is omitted.Clearly (iv) implies (iii). To establish (iv), note that for 1 ≤ p < ∞, by

Jensen’s inequality (which applies since KR(u) ≥ 0 and∫

KR(u)du = 1),for x in R,

|DR(f, x)− f(x)|p ≤∫ ∣∣(f(x− u)− f(x)

)∣∣pKR(u)du

and hence∫|DR(f, x)− f(x)|pdx ≤

∫ (∫|f(x− u)− f(x)|pdx

)KR(u)du.

Now (7.18) and the arguments in the proof of (i) yield (iv).

5.8 Plancherel transform

If f ∈ L2(R), it need not be in L1(R) and so the Fourier transform is notdefined. However, it is possible to extend the definition using an approxi-mation procedure due to Plancherel.

Proposition 5.8.1: Let f ∈ L1(R)∩L2(R) and let f(t) ≡∫

f(x)e−ιtxdx,t ∈ R. Then, f ∈ L2(R) and∫

|f |2dm =12π

∫|f |2dm. (8.1)

Page 192: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.8 Plancherel transform 179

Proof: Let f(x) = f(−x) and g = f ∗ f . Since f and f are in L1(R), g iswell defined and is in L1(R). Further by Cauchy-Schwarz inequality,

|g(x1)− g(x2)| ≤∫|f(x1 − u)− f(x2 − u)| |f(u)|du

≤(∫

|f(x1 − u)− f(x2 − u)|2du)1/2(∫

|f(u)|2du).

By Lemma 5.7.7, the right side goes to zero uniformly as |x1 − x2| → 0.Thus g(·) is uniformly continuous. Also, by Cauchy-Schwarz inequality,

|g(x)| =∣∣∣ ∫ f(x− u)f(u)du

∣∣∣≤

(∫|f(u)|2du

)1/2(∫|f(u)|2du

)1/2

=∫|f(u)|2du,

and hence g(·) is bounded. Thus, g is continuous, bounded, and integrableon R. By Theorem 5.7.3 (Fejer’s theorem)

DR(g, 0) → g(0) as R →∞. (8.2)

Butg(0) =

∫f(u)f(−u)du =

∫|f |2dm. (8.3)

By Proposition 5.7.2, g(t) = f(t) ˆf(t). Also, note that

ˆf(t) =

∫f(x)e−ιtxdx

=∫

f(−x)e−ιtxdx

= f(t)

and hence g(t) = |f(t)|2 ≥ 0. But

DR(g, 0) =12π

1R

∫ R

0

(∫ T

−T

g(t)dt)dT

=1π

∫ R

0g(t)

(1− |t|

R

)dt.

Since g(·) ≥ 0, by the MCT,

∫ R

0g(t)

(1− |t|

R

)dt ↑

∫ ∞

0g(t)dt as R ↑ ∞.

Page 193: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

180 5. Product Measures, Convolutions, and Transforms

Thus, limR→∞

DR(g, 0) = 1π

∫∞0 g(t)dt. Since g(−t) = g(t) for all t in R,

limR→∞

DR(g, 0) =12π

∫ ∞

−∞g(t)dt =

12π

∫|f(t)|2dt. (8.4)

Clearly, (8.2)–(8.4) imply (8.1).

Proposition 5.8.2: Let f ∈ L2(R) and fn(·) ≡ fI[−n,n](·), n ≥ 1.Then, fnn≥1, fnn≥1 are both Cauchy in L2(R) and hence, convergentin L2(R).

Proof: Since for each n ≥ 1, fn ∈ L2(R) ∩ L1(R), by Proposition 5.8.1,

‖fn1 − fn2‖22 =12π

∫|fn1 − fn2 |2dm. (8.5)

Since f ∈ L2(R), fn → f in L2(R), fnn≥1 is Cauchy in L2(R). By (8.5)fnn≥1 is also Cauchy in L2(R).

Definition 5.8.1: Let f ∈ L2(R). The Plancherel transform of f , denotedby f , is defined as limn→∞ fn, where

fn(t) =∫ n

−n

e−ιtxf(x)dx (8.6)

and the limit is taken in L2(R).

Theorem 5.8.3: Let f ∈ L2(R) and f be its Plancherel transform. Then

(i) ∫|f |2dm =

12π

∫|f |2dm. (8.7)

(ii) For f ∈ L1(R) ∩ L2(R), the Plancherel transform coincides with theFourier transform.

Proof: From Propositions 5.8.1 and 5.8.2 and the definition of f ,

12π

∫|f |2dm = lim

n→∞12π

∫|fn|2dm = lim

n→∞

∫|fn|2dm =

∫|f |2dm

proving (i). If f ∈ L1(R) ∩ L2(R), fn defined in (8.6) converges as n →∞pointwise to the Fourier transform of f (by the DCT). But by Definition5.7.1, fn converges in L2(R) to the Plancherel transform of f . So (ii) follows.

It can also be shown that for f , f as in the above theorem, the followinginversion formula holds:

12π

∫ n

−n

f(t)eιtxdt → f(x) in L2(R) (8.8)

Page 194: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.9 Problems 181

and that the map f → f is a Hilbert space isomorphism of L2(R) ontoL2(R). For a proof, see Rudin (1987), Chapter 9. This in turn implies thatif f , g ∈ L2(R) with Plancherel transforms f and g, respectively, then∫

f g dm =12π

∫f ¯g dm. (8.9)

This is known as the Parseval identity.

5.9 Problems

5.1 Verify that if µ1 and µ2 are finite measures on (Ω1,F1) and (Ω2,F2),respectively, then µ12(·) and µ21(·), defined by (1.4) and (1.5), re-spectively, are measures on (Ω1 × Ω2, F1 ×F2).

(Hint: Use the MCT.)

5.2 Let (Ω,F , µ) be a complete measure space such that F = P(Ω) andfor some A0 ∈ F with A0 = ∅, µ(A0) = 0. Let B ∈ P(Ω) \ F . Thenshow that (µ× µ)(Ω×A0) = 0 but B ×A0 ∈ F × F . Conclude that(Ω×Ω,F×F , µ×µ) is not complete. (An example of such a measurespace (Ω,F , µ) is the space ([0, 1],ML, m) where ML is the Lebesgueσ-algebra on [0, 1] and m is the Lebesgue measure.)

5.3 Let µi, i = 1, 2 be two σ-finite measures on (R,B(R)). Let Di = x :µi(x) > 0, i = 1, 2.

(a) Show that D1 ∪D2 is countable.(b) Let φi(x) = µi(x) x ∈ R, i = 1, 2. Show that φi is Borel

measurable for i = 1, 2.(c) Show that

∫φ1dµ2 =

∑z∈D1∩D2

φ1(z)φ2(z).(d) Deduce (2.11) from (c).

5.4 Extend Theorem 5.2.3 as follows. Let F1, F2 be two nondecreasingright continuous functions on [a, b]. Then∫

(a,b]F1dF2 +

∫(a,b]

F2dF1 = F1(b)F2(b)− F1(a)F2(a) +∫

(a,b]φ1dµ2,

where φ1 is as in Problem 5.3.

5.5 Let Fi : R → R be nondecreasing and right continuous, i = 1, 2.Show that if limb↑∞ F1(b)F2(b) = λ1 and lima↓−∞ F1(a)F2(a) = λ2exist and are finite, then∫

R

F1dF2 +∫

R

F2dF1 = λ1 − λ2 +∫

R

φ1dµ2

where φ1 is as in Problem 5.3.

Page 195: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

182 5. Product Measures, Convolutions, and Transforms

5.6 Let (Ω,F , µ) be a σ-finite measure space and f be a nonnegativemeasurable function. Then,

∫Ω fdµ =

∫[0,∞) µ(f ≥ t)dt.

(Hint: Consider the product space of (Ω,F , µ) with (R+,B(R+), m)and apply Tonelli’s theorem to the function g(ω, t) = I(f(ω) ≥ t),after showing that g is F × B(R+))-measurable.)

5.7 Let (Ω,F , P ) be a probability space and X : Ω → R+ be a randomvariable.

(a) Show that for any h : R+ → R+ that is absolutely continuous,∫Ω

h(X)dP = h(0) +∫

[0,∞)h′(t)P (X ≥ t)dt

= h(0) +∫

(0,∞)h′(t)P (X > t)dt.

(b) Show that for any 0 < p < ∞,∫Ω

XpdP =∫

[0,∞)ptp−1P (X ≥ t)dt.

(c) Show that for any 0 < p < ∞,∫Ω

X−pdP =1

Γ(p)

∫[0,∞)

ψX(t)tp−1dt,

where Γ(p) =∫[0,∞) e−ttp−1dt, p > 0, and ψX(t) =

∫Ω e−tXdP ,

t ∈ R+.

(Hint: (a) Apply Tonelli’s theorem to the function f(t, ω) ≡h′(t)I(X(w) ≥ t) on the product measure space ([0,∞) ×Ω,B([0,∞)) × F , m × P ), where m is Lebesgue measure on(R+,B(R+)).)

5.8 Let g : R+ → R+ and f : R2 → R+ be Borel measurable. LetA = (x, y) : x ≥ 0, 0 ≤ y ≤ g(x).

(a) Show that A ∈ B(R2).

(b) Show that∫R+

(∫[0,g(x)]

f(x, y)m(dy))m(dx) =

∫fIAdm(2),

where m(2) is Lebesgue measure on (R2,B(R2)) and m(·) isLebesgue measure on R.

Page 196: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.9 Problems 183

(c) If g is continuous and strictly increasing show that the two in-tegrals in (b) equal∫

R+

(∫[g−1(y),∞)

f(x, y)m(dx))m(dy).

5.9 (a) For 1 < A < ∞, let

hA(t) =∫ A

0e−xt sin x dx, t ≥ 0.

Use integration by parts to show that

|hA(t)| ≤ 11 + t2

+ e−t

andhA(t) → 1

1 + t2as A →∞.

(b) Show using Fubini’s theorem that for 0 < A < ∞,∫ ∞

0hA(t)dt =

∫ A

0

sin x

xdx.

(c) Conclude using the DCT that

limA→∞

∫ A

0

sin x

xdx =

∫ ∞

0

11 + t2

dt.

(d) Using Theorem 4.4.1 and the fact that φ(x) ≡ tanx is a (1–1)strictly monotone map from (0, π

2 ) to (0,∞) having the inversemap ψ(·) with derivative ψ′(t) ≡ 1

1+t2 , 0 < t < ∞, concludethat ∫ ∞

0

11 + t2

dt =π

2.

5.10 Show that I ≡∫∞0 e−x2/2dx =

√π2 .

(Hint: By Tonelli’s theorem, I2 =∫∞0

∫∞0 e− (x2+y2)

2 dxdy. Now usethe change of variables x = r cos θ, y = r sin θ, 0 < r < ∞, 0 < θ <π2 .)

5.11 Let µ be a finite measure on(R,B(R)

). Let f , g : R → R+ be

nondecreasing. Show that

µ(R)∫

fg dµ ≥(∫

fdµ)(∫

gdµ).

(Hint: Consider h(x1, x2) =(f(x1) − f(x2)

)(g(x1) − g(x2)

)on R2

and integrate w.r.t. µ× µ.)

Page 197: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

184 5. Product Measures, Convolutions, and Transforms

5.12 Let µ and λ be σ-finite measures on(R,B(R)

). Recall that

ν(A) ≡ (µ ∗ λ)(A) ≡∫ ∫

IA(x + y)dµdλ, A ∈ B(R).

(a) Show that for any Borel measurable f : R → R+, f(x + y) is〈B(R)× B(R),B(R)〉-measurable from R× R → R and∫

fdν =∫ ∫

f(x + y)dµdλ.

(b) Show that

ν(A) =∫

R

µ(A− t)λ(dt), A ∈ B(R).

(c) Suppose there exist countable sets Bλ, Bµ such that µ(Bcµ) =

0 = λ(Bcλ). Show that there exists a countable set Bν such that

ν(Bcν) = 0.

(d) Suppose that µ(x) = 0 for all x in R. Show that ν(x) = 0for all x in R.

(e) Suppose that µ m with dµdm = h. Show that ν m and find

dνdm in terms of h, µ and λ.

(f) Suppose that µ m and λ m. Show that

dm=

dm∗ dλ

dm.

5.13 (Convolution of cdfs). Let Fi, i = 1, 2 be cdfs on R. Recall that a cdfF on R is a function from R → R+ such that it is nondecreasing, rightcontinuous with F (x) → 0 as x → −∞ and F (x) → 1 as x →∞.

(a) Show that (F1 ∗F2)(x) ≡∫

RF1(x−u)dF2(u) is well defined and

is a cdf on R.

(b) Show also that (F1 ∗ F2)(·) = (F2 ∗ F1)(·).(c) Suppose t ∈ R is such that

∫etxdFi(x) < ∞ for i = 1, 2. Show

that∫

etxd(F1 ∗ F2)(x) =( ∫

etxdF1(x))( ∫

etxdF2(x)).

5.14 Let f , g ∈ L1(R,B(R), m).

(a) Show that if f is continuous and bounded on R, then so is f ∗ g.

(b) Show that if f is differentiable with a bounded derivative on R,then so is f ∗ g.

(Hint: Use the DCT.)

Page 198: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.9 Problems 185

5.15 Let f ∈ L1(R), g ∈ Lp(R), 1 ≤ p ≤ ∞.

(a) Show that if 1 ≤ p < ∞, then for all x in R

∣∣∣ ∫ |f(u)g(x−u)|du∣∣∣p ≤ (∫ |f |dm

)p−1(∫|g(x−u)|p|f(u)|du

)

and hence that

(f ∗ g)(x) ≡∫

f(u)g(x− u)du

is well defined a.e. (m) and

‖f ∗ g‖p ≤ ‖f‖1 ‖g‖p

with “=” holding iff either f = 0 a.e. or g = 0 a.e.

(b) Show that if p = ∞ then

‖f ∗ g‖∞ ≤ ‖f‖1 ‖g‖∞

and “=” can hold for some nonzero f and g.

(Hint for (a): Use Jensen’s inequality with probability measure dµ =|f |dm‖f‖1

if ‖f ||1 > 0.)

5.16 Let 1 ≤ p ≤ ∞ and q = 1− 1p . Let f ∈ Lp(R), g ∈ Lq(R).

(a) Show that f ∗ g is well defined and uniformly continuous.

(b) Show that if 1 < p < ∞,

lim|x|→∞

(f ∗ g)(x) = 0.

(Hint: For (a) use Holder’s inequality and Lemma 5.7.7. For (b)approximate g by simple functions.)

5.17 Let g : R → R be infinitely differentiable and be zero outside abounded interval.

(a) Let f : R → R be Borel measurable and∫

A|f |dm < ∞ for all

bounded intervals A in R. Show that f ∗ g is well defined andinfinitely differentiable.

(b) Show that for any f ∈ L1(R), there exist a sequence gnn≥1 ofsuch functions such that f ∗ gn → f in L1(R).

5.18 For f ∈ L1(R), let fσ = f∗φσ where φσ(x) = 1√2πσ

e− x2

2σ2 , 0 < σ < ∞,x ∈ R.

Page 199: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

186 5. Product Measures, Convolutions, and Transforms

(a) Show that fσ is infinitely differentiable.

(b) Show that if f is continuous and zero outside a bounded interval,then fσ converges to g uniformly on R as σ → 0.

(c) Show that if f ∈ Lp(R), 1 ≤ p < ∞, then fσ → f in Lp(R) asσ → 0.

5.19 Let f ∈ Lp(R), 1 ≤ p < ∞ and h(x) =∫

A+xf(u)du where A is a

bounded Borel set and

A + x ≡ y : y = a + x, a ∈ A.

(a) Show that h = f ∗ g for some g bounded and with boundedsupport.

(b) Show that h(·) is continuous and that lim|x|→∞

h(x) = 0.

(Hint: For 1 < p < ∞, use Holder’s inequality and show that m((A+

x1) (A + x2))→ 0 as |x1 − x2| → 0.)

5.20 (a) Verify the claims in Examples 5.4.1 directly.

(b) Verify the same using generating functions.

5.21 Let f be a probability density on R, i.e., f is nonnegative, Borelmeasurable and

∫fdm = 1. Show that |f(t)| < 1 for all t = 0.

(Hint: If |f(t0)| = 1 for some t0 = 0, show that∫ (

1− cos t0(x− θ))

f(x)dx = 0 for some θ.)

5.22 (a) Let f ∈ L1(R) and xkf(x) ∈ L1(R) for some k ≥ 1. Show thatf(·) is k-times differentiable on R with all derivatives f (r)(t) → 0as |t| → ∞ for r ≤ k.

(b) Let f ∈ L1(R). Suppose∫|tf(t)|dt < ∞. Show that there

exists a function g : R → R such that it is differentiable,lim|x|→∞(|g(x)| + |g′(x)|) = 0 and g = f a.e. (m). Extend thisto the case where

∫|tkf(t)|dt < ∞ for some k > 1.

(Hint: Consider g(x) = 12π

∫e−ιtxf(t)dt.)

5.23 Let f(x) = 1√2π

e− x22 , x ∈ R.

(a) Show that f(·) is real valued, differentiable and satisfies the ordi-nary differential equation f ′(t) + tf(t) = 0, t ∈ R and f(0) = 1.Find f(t).

Page 200: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

5.9 Problems 187

(b) For µ in R, σ > 0, let

fµ,σ(x) =1√2πσ

e− 12 ( x−u

σ )2 .

Find fµ,σ and verify that for any (µi, σi), i = 1, 2,

fµ1,σ1 ∗ fµ2,σ2 = fµ1+µ2,σ21+σ2

2.

(Hint for (b): Use Fourier transforms and uniqueness.)

5.24 (Rate of convergence of Fourier series). Consider the function h(x) =π2 − |x| in −π ≤ x ≤ π.

(a) Find h(n) ≡ 12π

∫ π

−πe−ιnxh(x)dx, n = 0,±1,±2.

(b) Show that∑∞

n=−∞ |h(n)| < ∞.

(c) Show that Sn(h, x) ≡∑+n

j=−n h(j)eιjx converges to h(x) uni-formly on [−π, π].

(d) Verify that

sup|Sn(h, x)− h(x)| : −π ≤ x ≤ π ≤ 2π

1(n− 1)

, n ≥ 2

and|Sn(h, 0)− h(0)| ≥ 2

π

1(n + 2)

.

(Remark: This example shows that the Fourier series of a functioncan converge very slowly such as in this example where the rate ofdecay is 1

n .)

5.25 Using Fejer’s theorem (Theorem 5.6.1) prove Wierstrass’ theorem onuniform approximation of a continuous function on a bounded closedinterval by a polynomial.

(Hint: Show that on bounded intervals a trigonometric polynomialcan be approximated uniformly by a polynomial using the powerseries representation of sine and cosine functions (see Section A.3).

5.26 Evaluate

limn→∞

∫ n

−n

sin λx

xeιtxdx, 0 < λ < ∞, 0 < x < ∞.

(Hint: For 0 < λ < ∞, sin λxx ∈ L2(R) and it is the Fourier transform

of f(t) = I[−λ,λ](·). Now apply Plancherel theorem. Alternatively useFubini’s theorem and the fact lim

n→∞∫ n

−nsin y

y dy exists in R.)

Page 201: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

188 5. Product Measures, Convolutions, and Transforms

5.27 Find an example of a function f ∈ L2(R) ∩(L1(R)

)c such that itsPlancherel transform f ∈ L1(R).

(Hint: Examine Problem 5.26.)

Page 202: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

6Probability Spaces

6.1 Kolmogorov’s probability model

Probability theory provides a mathematical model for random phenomena,i.e., those involving uncertainty. First one identifies the set Ω of possibleoutcomes of (random) experiment associated with the phenomenon. Thisset Ω is called the sample space, and an individual element ω of Ω is calleda sample point. Even though the outcome is not predictable ahead of time,one is interested in the “chances” of some particular statement to be validfor the resulting outcome. The set of ω’s for which a given statement isvalid is called an event. Thus, an event is a subset of Ω. One then identifiesa class F of events, i.e., a class F of subsets of Ω (not necessarily all ofP(Ω), the power set of Ω), and then a set function P on F such that for Ain F , P (A) represents the “chance” of the event A happening. Thus, it isreasonable to impose the following conditions on F and P :

(i) A ∈ F ⇒ Ac ∈ F(i.e., if one can define the probability of an event A, then the proba-bility of A not happening is also well defined).

(ii) A1, A2 ∈ F ⇒ A1 ∪A2 ∈ F(i.e., if one can define the probabilities of A1 and A2, then the prob-ability of at least one of A1 or A2 happening is also well defined).

(iii) for all A in F , 0 ≤ P (A) ≤ 1, P (∅) = 0, and P (Ω) = 1.

Page 203: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

190 6. Probability Spaces

(iv) A1, A2 ∈ F , A1 ∩A2 = ∅ ⇒ P (A1 ∪A2) = P (A1) + P (A2)(i.e., if A1 and A2 are mutually exclusive events, then the probabil-ity of at least one of the two happening is simply the sum of theprobabilities).

The above conditions imply that F is an algebra and P is a finitelyadditive set function. Next, as explained in Section 1.2, it is natural torequire that F be closed under monotone increasing unions and P bemonotone continuous from below. That is, if Ann≥1 is a sequenceof events in F such that An implies An+1 (i.e., An ⊂ An+1) for alln ≥ 1, then the probability of at least one of the An’s happeningis well defined and is the limit of the corresponding probabilities.In other words, the following conditions on F and P must hold inaddition to (i)–(iv) above:

(v) An ∈ F , An ⊂ An+1 for all n = 1, 2, . . . ⇒⋃

n≥1 An,∈ F andP (An) ↑ P (

⋃n≥1 An).

As noted in Section 1.2, conditions (i)–(v) imply that (Ω,F , P ) is a measurespace, i.e., F is a σ-algebra and P is a measure on F with P (Ω) = 1. Thatis, (Ω,F , P ) is a probability space. This is known as Kolmogorov’s probabil-ity model for random phenomena (see Kolmogorov (1956), Parthasarathy(2005)). Here are some examples.

Example 6.1.1: (Finite sample spaces). Let Ω ≡ ω1, ω2, . . . , ωk, 1 ≤k < ∞, F ≡ P(Ω), the power set of Ω and P (A) =

∑ki=1 piIA(ωi) where

piki=1 are such that pi ≥ 0 and

∑ki=1 pi = 1. This is a probability model

for random experiments with finitely many possible outcomes.An important application of this probability model is finite population

sampling. Let U1, U2, . . . , UN be a finite population of N units or objects.These could be individuals in a city, counties in a state, etc. In a typicalsample survey procedure, one chooses a subset of size n (1 ≤ n ≤ N) fromthis population. Let Ω denote the collection of all possible subsets of size n.Here k =

(Nn

), each ωi is a sample of size n and pi is the selection probability

of ωi. The assignment of piki=1 is determined by a given sampling scheme.

For example, in simple random sampling without replacement, pi = 1k for

i = 1, 2, . . . , k.Other examples include coin tossing, rolling of dice, bridge hands, and

acceptance sampling in statistical quality control (Feller (1968)).

Example 6.1.2: (Countably infinite sample spaces). Let Ω ≡ ω1, ω2, . . .be a countable set, F = P(Ω), and P (A) ≡

∑∞i=1 piIA(ωi) where pi∞

i=1satisfy pi ≥ 0 and

∑∞i=1 pi = 1. It is easy to verify that (Ω,F , P ) is a

probability space. This is a probability model for random experiments withcountably infinite number of outcomes. For example, the experiment oftossing a coin until a “head” is produced leads to such a probability space.

Page 204: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

6.2 Random variables and random vectors 191

Example 6.1.3: (Uncountable sample spaces).

(a) (Random variables). Let Ω = R, F = B(R), P = µF , the Lebesgue-Stieltjes measure corresponding to a cdf F , i.e., corresponding to afunction F : R → R that is nondecreasing, right continuous, andsatisfies F (−∞) = 0, F (+∞) = 1. See Section 1.3. This serves as amodel for a single random variable X.

(b) (Random vectors). Let Ω = Rk, F = B(Rk), P = µF , the Lebesgue-Stieltjes measure corresponding to a (multidimensional) cdf F on Rk

where k ∈ N. See Section 1.3. This is a model for a random vector(X1, X2, . . . , Xk).

(c) (Random sequences). Let Ω = R∞ ≡ R × R × . . . be the set of allsequences xnn≥1 of real numbers. Let C be the class of all finitedimensional sets of the form A × R × R × . . ., where A ∈ B(Rk)for some 1 ≤ k < ∞. Let F be the σ-algebra generated by C. Foreach 1 ≤ k < ∞, let µk be a probability measure on B(Rk) suchthat µk+1(A × R) = µk(A) for all A ∈ B(Rk). Then there exists aprobability measure µ on F such that µ(A × R × R × . . .) = µk(A)if A ∈ B(Rk). (This will be shown later as a special case of theKolmogorov’s consistency theorem in Section 6.3.) This will be amodel for a sequence Xnn≥1 of random variables such that foreach k, 1 ≤ k < ∞, the distribution of (X1, X2, . . . , Xk) is µk.

6.2 Random variables and random vectors

Recall the following definitions introduced earlier in Sections 2.1 and 2.2.

Definition 6.2.1: Let (Ω,F , P ) be a probability space and X : Ω → Rbe 〈F ,B(R)〉-measurable, that is, X−1(A) ∈ F for all A ∈ B(R). Then, Xis called a random variable on (Ω,F , P ).

Recall that X : Ω → R is 〈F ,B(R)〉-measurable iff for all x ∈ R, ω :X(ω) ≤ x ∈ F .

Definition 6.2.2: Let X be a random variable on (Ω,F , P ). Let

FX(x) ≡ P (ω : X(ω) ≤ x), x ∈ R. (2.1)

Then FX(·) is called the cumulative distribution function (cdf) of X.

Definition 6.2.3: Let X be a random variable on (Ω,F , P ). Let

PX(A) ≡ P (X−1(A)) for all A ∈ B(R). (2.2)

Then the probability measure PX is called the probability distribution ofX.

Page 205: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

192 6. Probability Spaces

Note that PX is the measure induced by X on (R,B(R)) under P andthat the Lebesgue-Stieltjes measure µFX

on B(R) corresponding with thecdf FX of X is the same as PX .

Definition 6.2.4: Let (Ω,F , P ) be a probability space, k ∈ N and X :Ω → Rk be 〈F ,B(Rk)〉-measurable, i.e., X−1(A) ∈ F for all A ∈ B(Rk).Then X is called a (k-dimensional) random vector on (Ω,F , P ).

Let X = (X1, X2, . . . , Xk) be a random vector with components Xi,i = 1, 2, . . . , k. Then each Xi is a random variable on (Ω,F , P ). This followsfrom the fact that the coordinate projection maps from Rk to R, given by

πi(x1, x2, . . . , xk) ≡ xi, 1 ≤ i ≤ k

are continuous and hence, are Borel measurable. Conversely, if for 1 ≤ i ≤k, Xi is a random variable on (Ω,F , P ), then X = (X1, X2, . . . , Xk) is arandom vector (cf. Proposition 2.1.3).

Definition 6.2.5: Let X be a k-dimensional random vector on (Ω,F , P )for some k ∈ N. Let

FX(x) ≡ P (ω : X1(ω) ≤ x1, X2(ω) ≤ x2, . . . , Xk(ω) ≤ xk) (2.3)

for x = (x1, x2, . . . , xk) ∈ Rk. Then FX(·) is called the joint cumulativedistribution function (joint cdf) of the random vector X.

Definition 6.2.6: Let X be a k-dimensional random vector on (Ω,F , P )for some k ∈ N. Let

PX(A) = P (X−1(A)) for all A ∈ B(Rk). (2.4)

The probability measure PX is called the (joint) probability distribution ofX.

As in the case k = 1, the Lebesgue-Stieltjes measure µFXon B(Rk)

corresponding to the joint cdf FX is the same as PX .Next, let X = (X1, X2, . . . , Xk) be a random vector. Let Y =

(Xi1 , Xi2 , . . . , Xir) for some 1 ≤ i1 < i2 < . . . < ir ≤ k and some 1 ≤ r ≤ k.

Then, Y is also a random vector. Further, the joint cdf of Y can be ob-tained from FX by setting the components xj , j ∈ i1, i2, . . . , ir equalto ∞. Similarly, the probability distribution PY can be obtained from PX

as an induced measure from the projection map π(x) = (xi1 , xi2 , . . . , xir),

x ∈ Rk. For example, if (i1, i2, . . . , ir) = (1, 2, . . . , r), r ≤ k, then

FY (y1, . . . , yr) = FX(y1, . . . , yr,∞, . . . ,∞), (y1, . . . , yr) ∈ Rr

andPY (A) = PX(A× R(k−r)), A ∈ B(Rr).

Page 206: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

6.2 Random variables and random vectors 193

Definition 6.2.7: Let X = (X1, X2, . . . , Xk) be a random vector on(Ω,F , P ). Then, for each i = 1, . . . , k, the cdf FXi

and the probabilitydistribution PXi of the random variable Xi are called the marginal cdf andthe marginal probability distribution of Xi, respectively.

It is clear that the distribution of X determines the marginal distribu-tion PXi

of Xi for all i = 1, 2, . . . , k. However, the marginal distributionsPXi : i = 1, 2, . . . , k do not uniquely determine the joint distribution PX ,without additional conditions, such as independence (see Problem 6.1).

Definition 6.2.8: Let X be a random variable on (Ω,F , P ). The expectedvalue of X, denoted by EX or E(X), is defined as

EX =∫

ΩXdP, (2.5)

provided the integral is well defined. That is, at least one of the two quan-tities

∫X+dP and

∫X−dP is finite.

If X is a random variable on (Ω,F , P ) and h : R → R is Borel measur-able, then Y = h(X) is also a random variable on (Ω,F , P ). The expectedvalue of Y may be computed as follows.

Proposition 6.2.1: (The change of variable formula). Let X be a randomvariable on (Ω,F , P ) and h : R → R be Borel measurable. Let Y = h(X).Then

(i)∫

Ω|Y |dP =

∫R

|h(x)|PX(dx) =∫

R

|y|PY (dy).

(ii) If∫Ω |Y |dP < ∞, then∫

ΩY dP =

∫R

h(x)PX(dx) =∫

R

yPY (dy). (2.6)

Proof: If h = IA for A in B(R), the proposition follows from the definitionof PX . By linearity, this extends to a nonnegative and simple function hand by the MCT, to any nonnegative measurable h, and hence to anymeasurable h.

Remark 6.2.1: Proposition 6.2.1 shows that the expectation of Y can becomputed in three different ways, i.e., by integrating Y with respect to P onΩ or by integrating h(x) on R with respect to the probability distributionPX of the random variable X or by integrating y on R with respect to theprobability distribution PY of the random variable Y .

Remark 6.2.2: If the function h is nonnegative, then the relation EY =∫R

h(x)PX(dx) is valid even if EY = ∞.

Page 207: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

194 6. Probability Spaces

Definition 6.2.9: For any positive integer n, the nth moment µn of arandom variable X is defined by

µn ≡ EXn, (2.7)

provided the expectation is well defined.

Definition 6.2.10: The variance of a random variable X is defined asVar(X) = E(X − EX)2, provided EX2 < ∞.

Definition 6.2.11: The moment generating function (mgf) of a randomvariable X is defined by

MX(t) ≡ E(etX) for all t ∈ R. (2.8)

Since etX is always nonnegative, E(etX) is well defined but could beinfinity. Proposition 6.2.1 gives a way of computing the moments and themgf of X without explicitly computing the distribution of Xk or etX . Asan illustration, consider the case of a random variable X defined on theprobability space (Ω,F , P ) with Ω = H,Tn, n ∈ N, F = the power setof Ω and P = the probability distribution defined by

P (ω) = pX(ω)qn−X(ω)

where 0 < p < 1, q = 1 − p, and X(ω) = the number of H’s in ω. By thechange of variable formula, the mgf of X is given by

MX(t) ≡∫

etxPX(dx)

=n∑

r=0

etr

(n

r

)prqn−r = (pet + q)n,

since PX , the probability distribution of X, is supported on 0, 1, 2, . . . , nwith PX(r) =

(nr

)prqn−r. Note that PX is the Binomial (n, p) distribu-

tion. Here, MX(t) is computed using the distribution of X, i.e., using themiddle term in (2.6) only.

The connection between the mgf MX(·) and the moments of a randomvariable X is given in the following propositions.

Proposition 6.2.2: Let X be a nonnegative random variable and t ≥ 0.Then

MX(t) ≡ E(etX) =∞∑

n=0

tnµn

n!(2.9)

where µn is as in (2.7).

Proof: Since etX =∑∞

n=0 tn Xn

n! and X is nonnegative, (2.9) follows fromthe MCT.

Page 208: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

6.2 Random variables and random vectors 195

Proposition 6.2.3: Let X be a random variable and let MX(t) be finitefor all |t| < ε, for some ε > 0. Then

(i) E|X|n < ∞ for all n ≥ 1,

(ii) MX(t) =∞∑

n=0

tnµn

n!for all |t| < ε,

(iii) MX(·) is infinitely differentiable on (−ε,+ε) and for r ∈ N, the rthderivative of MX(·) is

M(r)X (t) =

∞∑n=0

tn

n!µn+r = E(etXXr) for |t| < ε. (2.10)

In particular,M

(r)X (0) = µr = EXr. (2.11)

Proof: Since MX(t) < ∞ for all |t| < ε,

E(e|tX|) ≤ E(etX) + E(e−tX) < ∞ for |t| < ε. (2.12)

Also, e|tX| ≥ |t|n|X|nn! for all n ∈ N and hence, (i) follows by choosing a t

in (0, ε). Next note that∣∣∣∑n

j=0(tx)j

j!

∣∣∣ ≤ e|tx| for all x in R and all n ∈ N.Hence, by (2.12) and the DCT, (ii) follows.

Turning to (iii), since MX(·) admits a power series expansion convergentin |t| < ε, it is infinitely differentiable in |t| < ε and the derivatives of MX(·)can be found by term-by-term differentiation of the power series (see Rudin(1976), Chapter 9). Hence,

M(r)X (t) =

dr

dtr

( ∞∑n=0

tnµn

n!

)

=∞∑

n=0

µn

n!dr(tn)dtr

=∞∑

n=r

µntn−r

(n− r)!

=∞∑

n=0

tn

n!µn+r .

The verification of the second equality in (2.10) is left in an exercise (seeProblem 6.4).

Remark 6.2.3: If the mgf MX(·) is finite for |t| < ε for some ε > 0, thenby part (ii) of the above proposition, MX(t) has a power series expansion

Page 209: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

196 6. Probability Spaces

in t around 0 and µn

n! is simply the coefficient of tn. For example, if X hasa N(0, 1) distribution, then for all t ∈ R,

MX(t) =∫ ∞

−∞etx 1√

2πe−x2/2dx = et2/2

=∞∑

k=0

(t2)k

k!12k

. (2.13)

Thus, µn =

0 if n is odd(2k)!k!2k if n = 2k, k = 1, 2, . . .

Remark 6.2.4: If MX(t) is finite for |t| < ε for some ε > 0, then all themoments µnn≥1 of X are determined and also its probability distribution.However, in general, the sequence µnn≥1 of moments of X need notdetermine the distribution of X uniquely.

Table 6.2.1 gives the mean, variance, and the mgf of a number of standardprobability distributions on the real line.

For future reference, some of the inequalities established in Section 3.1are specialized for random variables and collected below without proofs.

Proposition 6.2.4: (Markov’s inequality). Let X be a random variableon (Ω,F , P ). Then for any φ : R+ → R+ nondecreasing and any t > 0 withφ(t) > 0,

P (|X| ≥ t) ≤E(φ(|X|)

)φ(t)

. (2.14)

In particular,

(i) for r > 0, t > 0,

P (X ≥ t) ≤ P (|X| ≥ t) ≤ E|X|rtr

, (2.15)

(ii) for any t ≥ 0,

P (|X| ≥ t) ≤ E(eθ|X|)eθt

,

for any θ > 0 and hence

P (|X| ≥ t) ≤ infθ>0

E(eθ|X|)eθt

. (2.16)

Proposition 6.2.5: (Chebychev’s inequality). Let X be a random variablewith EX2 < ∞, EX = µ, Var(X) = σ2. Then for any k > 0,

P (|X − µ| ≥ kσ) ≤ 1k2 . (2.17)

Page 210: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

6.2 Random variables and random vectors 197

TABLE 6.2.1. Mean, variance, mgf of the distributions listed in Tables 4.6.1 and4.6.2.

Distribution Mean Variance mgf M(t)

Bernoulli (p), p p(1− p) (1− p) + pet,0 < p < 1 t ∈ R

Binomial (n, p), np np(1− p)((1− p) + pet

)n,

p ∈ (0, 1), n ∈ N t ∈ R

Geometric (p), 1p

1−pp

pet

1−(1−p)et , t ∈p ∈ (0, 1)

(−∞,− log(1− p)

)Poisson (λ) λ λ exp

(λ(et − 1)

),

λ ∈ (0,∞) t ∈ R

Uniform (a, b), a+b2

(b−a)2

12ebt−eat

(b−a)t , t ∈ R \ 0;−∞ < a < b < ∞ M(0) = 1

Exponential (β), β β2 11−βt ,

β ∈ (0,∞) t ∈ (−∞, 1β )

Gamma (α, β), αβ αβ2 (1− βt)−α,α, β ∈ (0,∞) t ∈ (−∞, 1

β )

Beta (α, β), αα+β

αβ(α+β)2(α+β+1)

[1 +

∑∞k=1

α, β ∈ (0,∞)(∏k−1

r=0α+r

α+β+r

)tk

k!

],

t ∈ R

Cauchy (γ, σ), not defined not defined ∞ for all t = 0γ ∈ R, σ ∈ (0,∞) since since

E|X| = ∞ E|X|2 = ∞

Normal (γ, σ2), γ σ2 exp(γt + t2σ2

2 ),γ ∈ R, σ ∈ (0,∞) t ∈ R

Lognormal (γ, σ2), eγ+ σ22 [e2(γ+σ2) ∞ for all t > 0

γ ∈ R, σ ∈ (0,∞) −e2γ+σ2]

Proposition 6.2.6: (Jensen’s inequality). Let X be a random variablewith P (a < X < b) = 1 for −∞ ≤ a < b ≤ ∞. Let φ : (a, b) → R be convexon (a, b). Then

Eφ(X) ≥ φ(EX), (2.18)

provided E|X| < ∞ and E|φ(X)| < ∞.

Page 211: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

198 6. Probability Spaces

Proposition 6.2.7: (Holder’s inequality). Let X and Y be random vari-ables on (Ω,F , P ) with E|X|p < ∞, E|Y |q < ∞, 1 < p < ∞, 1 < q < ∞,1p + 1

q = 1. Then

E|XY | ≤ (E|X|p)1/p(E|Y |q)1/q, (2.19)

with equality holding iff P (c1|X|p = c2|Y |q) = 1 for some 0 ≤ c1, c2 < ∞.

Proposition 6.2.8: (Cauchy-Schwarz inequality). Let X and Y be ran-dom variables on (Ω,F , P ) with E|X|2 < ∞, E|Y |2 < ∞. Then

|Cov(X, Y )| ≤√

Var(X)√

Var(Y ) , (2.20)

where Cov(X, Y ) = EXY −EXEY . If Var(X) > 0, then equality holds in(2.20) iff P (Y = aX + b) = 1 for some constants a, b in R (Problem 6.6).

Proposition 6.2.9: (Minkowski’s inequality). Let X and Y be randomvariables on (Ω,F , P ) such that E|X|p < ∞, E|Y |p < ∞ for some 1 ≤ p <∞. Then

(E|X + Y |p)1/p ≤ (E|X|p)1/p + (E|Y |p)1/p. (2.21)

Definition 6.2.12: (Product moments of random vectors). Let X =(X1, X2, . . . , Xk) be a random vector. The product moment of order r =(r1, r2, . . . , rk), with ri being a nonnegative integer for each i, is defined as

µr ≡ µr1,r2,...,rk≡ E(Xr1

1 Xr22 · · ·Xrk

k ), (2.22)

provided E|Xr11 · · ·Xrk

k | < ∞. The joint moment generating function (jointmgf) of a random vector X = (X1, X2, . . . , Xk) is defined by

MX1,...,Xk(t1, t2, . . . , tk) ≡ E(et1X1+t2X2+···+tkXk), (2.23)

for all t1, t2, . . . , tk in R.As in the case of a random variable, if the joint mgf

MX1,X2,...,Xk(t1, . . . , tk) is finite for all (t1, t2, . . . , tk) with |ti| < ε

for all i = 1, 2, . . . , k for some ε > 0, then an analog of Proposition 6.2.3holds. For example, the following assertions are valid (cf. Problem 6.4):

(i)E|Xi|n < ∞ for all i = 1, 2, . . . , k and n ≥ 1. (2.24)

(ii) For t = (t1, . . . , tk) ∈ Rk and r = (r1, r2, . . . , rk) ∈ Zk+, let

tr = tr11 tr2

2 · · · trk

k ,

r! = r1!r2! · · · rk!, andµr = EXr1

1 Xr22 · · ·Xrk

k .

Page 212: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

6.3 Kolmogorov’s consistency theorem 199

Then,

MX(t1, . . . , tk) =∑

r∈Zk+

tr

r!µr (2.25)

for all t = (t1, t2, . . . , tk) ∈ (−ε,+ε)k.

(iii) For any r = (r1, . . . , rk) ∈ Zk+,

dr

dtrMX(t)

∣∣∣t=0

= µr, (2.26)

where dr

dtr = ∂r1

∂tr11

∂r2

∂tr22

. . . ∂rk

∂trkk

.

6.3 Kolmogorov’s consistency theorem

In the previous section, the case of a single random variable and that of afinite dimensional random vector were discussed. The goal of this section isto discuss infinite families of random variables such as a random sequenceXnn≥1 or a random function X(t) : 0 ≤ t < T, 0 ≤ T ≤ ∞. For exam-ple, Xn could be the population size of the nth generation of a randomlyevolving biological population, and X(t) could be the temperature at timet in a chemical reaction over a period [0, T ]. An example from modelingof spatial random phenomenon is a collection X(s) : s ∈ S of randomvariables X(s) where S is a specified region such as the U.S., and X(s) isthe amount of rainfall at location s ∈ S during a specified month.

Let (Ω,F , P ) be a probability space and Xα : α ∈ A be a collec-tion of random variables defined on (Ω,F , P ), where A is a nonemptyset. Then for any (α1, α2, . . . , αk) ∈ Ak, 1 ≤ k < ∞, the random vector(Xα1 , Xα2 , . . . , Xαk

) has a joint probability distribution µ(α1,α2,...,αk) over(Rk,B(Rk)).

Definition 6.3.1: A (real valued) stochastic process with index set A isa family Xα : α ∈ A of random variables defined on a probability space(Ω,F , P ).

Example 6.3.1: (Examples of stochastic processes). Let Ω = [0, 1], F =B([0, 1]), P = the Lebesgue measure on [0, 1]. Let A1 = 1, 2, 3, . . ., A2 =[0, T ], 0 < T < ∞. For ω ∈ Ω, n ∈ A1, t ∈ A2, let

Xn(ω) = sin 2πnω

Yt(ω) = sin 2πtω

Zn(ω) = nth digit in the decimal expansion of ω

Vn,t(ω) = X2n(ω) + Y 2

t (ω).

Then Xn : n ∈ A1, Zn : n ∈ A1, Vn,t : (n, t) ∈ A1×A2, Yt : t ∈ A2are all stochastic processes.

Page 213: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

200 6. Probability Spaces

Note that a real valued stochastic process Xα : α ∈ A may also beviewed as a random real valued function on the set A by the identificationω → f(ω, ·), where f(ω, α) = Xα(ω) for α in A.

Definition 6.3.2: The family µ(α1,α2,...,αk)(·) ≡ P ((Xα1 , . . . , Xαk) ∈ ·):

(α1, α2, . . . , αk) ∈ Ak, 1 ≤ k < ∞ of probability distributions is calledthe family of finite dimensional distributions (fdds) associated with thestochastic process Xα : α ∈ A.

This family of finite dimensional distributions satisfies the following con-sistency conditions: For any (α1, α2, . . . , αk) ∈ Ak, 2 ≤ k < ∞, and anyB1, B2, . . . , Bk in B(R),

C1: µ(α1,α2,...,αk)(B1×· · ·×Bk−1×R) = µ(α1,α2,...,αk−1)(B1×· · ·×Bk−1);

C2: For any permutation (i1, i2, . . . , ik) of (1, 2, . . . , k),

µ(αi1 ,αi2 ,...,αik)(Bi1×Bi2×· · ·×Bik

) = µ(α1,...,αk)(B1×B2×· · ·×Bk) .

To verify C1, note that

µ(α1,α2,...,,αk)(B1 ×B2 × · · · ×Bk−1 × R)= P (Xα1 ∈ B1, Xα2 ∈ B2, . . . , Xαk−1 ∈ Bk−1, Xαk

∈ R)= P (Xα1 ∈ B1, Xα2 ∈ B2, . . . , Xαk−1 ∈ Bk−1)= µ(α1,α2,...,αk−1)(B1 ×B2 × · · · ×Bk−1).

Similarly, to verify C2, note that

µ(αi1 ,αi2 ,...,αik)(Bi1 ×Bi2 · · · ×Bik

)

= P (Xαi1∈ Bi1 , Xαi2

∈ Bi2 , . . . , Xαik∈ Bik

)= P (Xα1 ∈ B1, Xα2 ∈ B2, . . . , Xαk

∈ Bk)= µ(α1,α2,...,αk)(B1 ×B2 × · · · ×Bk).

A natural question is that given a family of probability distributions QA ≡ν(α1,α2,...,αk) : (α1, α2, . . . , αk) ∈ Ak, 1 ≤ k < ∞ on finite dimensionalEuclidean spaces, does there exist a real valued stochastic process Xα :α ∈ A such that its family of finite dimensional distributions coincideswith QA?

Kolmogorov (1956) showed that if QA satisfies C1 and C2, then such astochastic process does exist. This is known as Kolmogorov’s consistencytheorem (also known as Kolmogorov’s existence theorem).

Theorem 6.3.1: (Kolmogorov’s consistency theorem). Let A be anonempty set. Let QA ≡ ν(α1,α2,...,αk) : (α1, α2, . . . , αk) ∈ Ak, 1 ≤ k < ∞be a family of probability distributions such that for each (α1, α2, . . . , αk) ∈Ak, 1 ≤ k < ∞,

Page 214: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

6.3 Kolmogorov’s consistency theorem 201

(i) ν(α1,α2,...,αk) is a probability distribution on (Rk,B(Rk)),

(ii) C1 and C2 hold, i.e., for all B1, B2, . . . , Bk ∈ B(R), 2 ≤ k < ∞,

ν(α1,α2,...,αk)(B1 ×B2 × · · · ×Bk−1 × R)= ν(α1,α2,...,αk−1)(B1 ×B2 × · · · ×Bk−1) (3.1)

and for any permutation (i1, i2, . . . , ik) of (1, 2, . . . , k),

µ(αi1 ,αi2 ,...,αik)(Bi1 ×Bi2 × · · · ×Bik

)

= µ(α1,α2,...,αk)(B1 ×B2 × · · · ×Bk). (3.2)

Then, there exists a probability space (Ω,F , P ) and a stochastic processXA ≡ Xα : α ∈ A on (Ω,F , P ) such that QA is the family of finitedimensional distributions associated with XA.

Remark 6.3.1: Thus the above theorem says that given the family QA

satisfying conditions (i) and (ii), there exists a real valued function onA × Ω such that for each ω, f(·, ω) is a function on A and for each(α1, α2, . . . , αk) ∈ Ak, the vector

(f(α1, ω), f(α2, ω), . . . , f(αk, ω)

)is a

random vector with probability distribution ν(α1,α2,...,αk). This randomfunction point of view is useful in dealing with functionals of the formM(ω) ≡ sup f(α, ω) : α ∈ A. For example, if A1 = 1, 2, . . ., then onemight consider functionals such as limn→∞ f(n, ω), limn→∞ 1

n

∑nj=1 f(j, ω),∑∞

j=1 f(j, ω), etc. Since the random functionals are not fully determinedby f(α, ω) for finitely many α’s, it is not possible to compute probabili-ties of events defined in terms of these functionals from the knowledge ofthe finite dimensional distribution of (f(α1, ω), . . . , f(αk, ω)) for a given(α1, . . . , αk), no matter how large k is. Kolmogorov’s consistency theoremallows one to compute these probabilities given all finite dimensional dis-tributions (provided that the functionals satisfy appropriate measurabilityconditions).

Given a probability measure µ on (R,B(R)), now consider the problemof constructing a probability space (Ω,F , P ) and a random variable X onit with distribution µ. A natural solution is to set the sample space Ω to beR, the σ-algebra F to be B(R), and the probability measure P to be µ andthe random variable X to be the identity map X(ω) ≡ ω. Similarly, givena probability measure µ on (Rk,B(R)k), one can set the sample space Ω tobe Rk and the σ-algebra F to be B(Rk) and the probability measure P tobe µ and the random vector X to be the identity map.

Arguing in the same fashion, given a family QA of finite dimensionaldistributions with index set A, to construct a stochastic process Xα : α ∈A with index set A on some probability space (Ω,F , P ), it is natural to setthe sample space Ω to be RA, the collection of all real valued functions onA, F to be a suitable σ-algebra that includes all finite dimensional events,

Page 215: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

202 6. Probability Spaces

P to be an appropriate probability measure that yields QA, and X to bethe identity map.

These considerations lead to the following definitions.

Definition 6.3.3: Let A be a nonempty set. Then RA ≡ f | f : A → R,the collection of all real valued functions on A.

If A is a finite set a1, a2, . . . , ak, then RA can be identified with Rk byassociating each f ∈ RA with the vector (f(a1), f(a2), . . . , f(ak)) in Rk.If A is a countably infinite set a1, a2, a3, . . ., then RA can be similarlyidentified with R∞, the set of all sequences x1, x2, x3, . . . of real numbers.If A is the interval [0, 1], then RA is the collection of all real valued functionson [0, 1].

Definition 6.3.4: Let A be a nonempty set. A subset C ⊂ RA is called afinite dimensional cylinder set (fdcs) if there exists a finite subset A1 ⊂ A,say, A1 ≡ α1, α2, . . . , αk, 1 ≤ k < ∞ and a Borel set B in B(Rk) suchthat C = f : f ∈ RA and (f(α1), f(α2), . . . , f(αk)) ∈ B. The set B iscalled a base for C.

The collection of all finite dimensional cylinder sets will be denoted byC.

The name cylinder is motivated by the following example:

Example 6.3.2: Let A = 1, 2, 3 and C = (x1, x2, x3) : x21 + x2

2 ≤ 1.Then C is a cylinder (in the usual sense of the English word), but withinfinite height and depth. According to Definition 6.3.4, C is also a cylinderin R3 with the unit circle in R2 as its base.

Examples 6.3.3 and 6.3.4 below are examples of fdcs, whereas Example6.3.5 is an example of a set that is not a fdcs.

Example 6.3.3: Let A = 1, 2 and C = (x1, x2) : | sin 2πx1| ≤ 1√2.

Example 6.3.4: Let A = 1, 2, 3, . . . and C = (x1, x2, x3, . . .) : x2174 +

x23010 − x2

425 ≤ 10.

Example 6.3.5: Let A = 1, 2, 3, . . . and D = (x1, x2, x3, . . .) : xj ∈ Rfor all j ≥ 1 and limn→∞ 1

n

∑nj=1 xj exists is not a finite dimensional

cylinder set (Problem 6.8).

Proposition 6.3.2: Let A be a nonempty set and C be the collection ofall finite dimensional cylinder sets in RA. Then C is an algebra.

Proof: Let C1, C2 ∈ C and let

C1 = f : f ∈ RA and(f(α1), f(α2), . . . , f(αk)

)∈ B1

C2 = f : f ∈ RA and(f(β1), f(β2), . . . , f(βj)

)∈ B2

Page 216: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

6.3 Kolmogorov’s consistency theorem 203

for some A1 = α1, α2, . . . , αk ⊂ A, A2 = β1, β2, . . . , βj ⊂ A, B1 ∈B(Rk), B2 ∈ B(Rj), 1 ≤ k < ∞, 1 ≤ j < ∞. Let A3 = A1 ∪A2 = γ1, γ2, . . . , γ, where without loss of generality (γ1, γ2, . . . , γk) =(α1, α2, . . . , αk) and (γ−j+1, . . . , γ−1, γ) = (β1, β2, . . . , βj). Then C1 andC2 may be expressed as

C1 = f : f ∈ RA and(f(γ1), f(γ2), . . . , f(γ)

)∈ B1

C2 = f : f ∈ RA and(f(γ1), . . . , f(γ)

)∈ B2

where B1 = B1 ×R−k and B2 = R−j ×B2. Thus, C1 ∪C2 = f : f ∈ RA

and (f(γ1), . . . , f(γ)) ∈ B1 ∪ B2. Since both B1 and B2 lie in B(R),C1 ∪ C2 ∈ C.

Next note that, Cc1 = f : f ∈ RA and

(f(α1), . . . , f(αk)

)∈ Bc

1. SinceBc

1 ∈ B(Rk), it follows that Cc1 ∈ C. Thus, C is an algebra.

Remark 6.3.2: If A is a finite nonempty set, the collection C is also aσ-algebra.

Definition 6.3.5: Let A be a nonempty set. Let RA be the σ-algebragenerated by the collection C. Then RA is called the product σ-algebra onRA.

Remark 6.3.3: If A = 1, 2, 3, . . . ≡ N and RN is identified with theset R∞ of all sequences of real numbers, then the product σ-algebra RN

coincides with the Borel σ-algebra B(R∞) on R∞ under the metric

d(x, y) =∞∑

j=1

12j

(|xj − yj |

1 + |xj − yj |

)(3.3)

for x = (x1, x2, . . .), y = (y1, y2, . . .) in R∞ (Problem 6.9).

Definition 6.3.6: Let A be a nonempty set. For any (α1, α2, . . . , αk) ∈ Ak,1 ≤ k < ∞, the projection map π(α1,...,αk) from RA to Rk is defined by

π(α1,α2,...,αk)(f) = (f(α1), f(α2), . . . , f(αk)). (3.4)

In particular, for α ∈ A,πα(f) = f(α) (3.5)

is called a co-ordinate map.

The projection map πA1 for any arbitrary subset A1 ⊂ A may be similarlydefined. The next proposition follows from the definition of RA.

Proposition 6.3.3:

(i) For each α ∈ A, the map πα from RA to R is 〈RA,B(R)〉-measurable.

(ii) For any (α1, α2, . . . , αk) ∈ Ak, 1 ≤ k < ∞, the map π(α1,α2,...,αk)

from RA to Rk is 〈RA,B(Rk)〉-measurable.

Page 217: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

204 6. Probability Spaces

Proof of Theorem 6.3.1: Let $\Omega = \mathbb{R}^A$ and $\mathcal{F} \equiv \mathcal{R}^A$. Define a set function $P$ on $\mathcal{C}$ by

$$P(C) = \nu_{(\alpha_1,\alpha_2,\ldots,\alpha_k)}(B) \qquad (3.6)$$

for a $C$ in $\mathcal{C}$ with representation

$$C = \{\omega : \omega \in \mathbb{R}^A, (\omega(\alpha_1), \omega(\alpha_2), \ldots, \omega(\alpha_k)) \in B\}. \qquad (3.7)$$

The main steps in the proof are

(i) to show that $P(C)$ as defined in (3.6) is independent of the representation (3.7) of $C$, and

(ii) to show that $P(\cdot)$ is countably additive on $\mathcal{C}$.

Next, by the Caratheodory extension theorem (Theorem 1.3.3), there exists a unique extension of $P$ (also denoted by $P$) to $\mathcal{F}$ such that $(\Omega, \mathcal{F}, P)$ is a probability space. Defining $X_\alpha(\omega) \equiv \pi_\alpha(\omega) = \omega(\alpha)$ for $\alpha$ in $A$ yields a stochastic process $\{X_\alpha : \alpha \in A\}$ on the probability space

$$(\mathbb{R}^A, \mathcal{R}^A, P) \equiv (\Omega, \mathcal{F}, P)$$

with the family $\mathcal{Q}_A$ as its set of finite dimensional distributions. Hence, it remains to establish (i) and (ii). Let $C \in \mathcal{C}$ admit two representations:

$$C \equiv \{\omega : (\omega(\alpha_1), \omega(\alpha_2), \ldots, \omega(\alpha_k)) \in B_1\} \equiv \pi_{(\alpha_1,\alpha_2,\ldots,\alpha_k)}^{-1}(B_1)$$

and

$$C \equiv \{\omega : (\omega(\beta_1), \omega(\beta_2), \ldots, \omega(\beta_j)) \in B_2\} \equiv \pi_{(\beta_1,\beta_2,\ldots,\beta_j)}^{-1}(B_2)$$

for some $A_1 = \{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $1 \leq k < \infty$, and some $A_2 = \{\beta_1, \beta_2, \ldots, \beta_j\} \subset A$, $1 \leq j < \infty$, $B_1 \in \mathcal{B}(\mathbb{R}^k)$ and $B_2 \in \mathcal{B}(\mathbb{R}^j)$. Let $A_3 = A_1 \cup A_2 = \{\gamma_1, \gamma_2, \ldots, \gamma_\ell\}$ and w.l.o.g. let $(\gamma_1, \gamma_2, \ldots, \gamma_k) = (\alpha_1, \alpha_2, \ldots, \alpha_k)$ and $(\gamma_{\ell-j+1}, \gamma_{\ell-j+2}, \ldots, \gamma_{\ell-1}, \gamma_\ell) = (\beta_1, \beta_2, \ldots, \beta_j)$. Then $C$ may be represented as

$$C = \pi_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}^{-1}(\tilde{B}_1) = \pi_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}^{-1}(\tilde{B}_2)$$

where $\tilde{B}_1 = B_1 \times \mathbb{R}^{\ell-k}$ and $\tilde{B}_2 = \mathbb{R}^{\ell-j} \times B_2$. Note that $(\omega(\gamma_1), \ldots, \omega(\gamma_\ell)) \in \tilde{B}_1$ iff $\omega \in C$ iff $(\omega(\gamma_1), \ldots, \omega(\gamma_\ell)) \in \tilde{B}_2$, and thus

$$\tilde{B}_1 = \tilde{B}_2. \qquad (3.8)$$


Next, by the first consistency condition (3.1) and induction,

$$\nu_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}(\tilde{B}_1) = \nu_{(\alpha_1,\alpha_2,\ldots,\alpha_k)}(B_1). \qquad (3.9)$$

Also by (3.2), for $B_2$ of the form $B_{21} \times B_{22} \times \cdots \times B_{2j}$ with $B_{2i} \in \mathcal{B}(\mathbb{R})$ for all $1 \leq i \leq j$,

$$\nu_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}(\mathbb{R}^{\ell-j} \times B_2) = \nu_{(\gamma_{\ell-j+1},\ldots,\gamma_\ell,\gamma_1,\gamma_2,\ldots,\gamma_{\ell-j})}(B_2 \times \mathbb{R}^{\ell-j}).$$

Now note that

(a) $\nu_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}(\mathbb{R}^{\ell-j} \times B)$ and $\nu_{(\gamma_{\ell-j+1},\ldots,\gamma_\ell,\gamma_1,\gamma_2,\ldots,\gamma_{\ell-j})}(B \times \mathbb{R}^{\ell-j})$, considered as set functions defined for $B \in \mathcal{B}(\mathbb{R}^j)$, are probability measures on $\mathcal{B}(\mathbb{R}^j)$,

(b) they coincide on the class $\Gamma$ of sets of the form $B = B_{21} \times B_{22} \times \cdots \times B_{2j}$ with $B_{2i} \in \mathcal{B}(\mathbb{R})$ for all $i$, and

(c) the class $\Gamma$ is a π-class and it generates $\mathcal{B}(\mathbb{R}^j)$.

Hence, by the uniqueness theorem (Theorem 1.3.6),

$$\nu_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}(\mathbb{R}^{\ell-j} \times B) = \nu_{(\gamma_{\ell-j+1},\ldots,\gamma_\ell,\gamma_1,\gamma_2,\ldots,\gamma_{\ell-j})}(B \times \mathbb{R}^{\ell-j}) \qquad (3.10)$$

for all $B \in \mathcal{B}(\mathbb{R}^j)$. Again by (3.1) and induction,

$$\nu_{(\gamma_{\ell-j+1},\ldots,\gamma_\ell,\gamma_1,\gamma_2,\ldots,\gamma_{\ell-j})}(B_2 \times \mathbb{R}^{\ell-j}) = \nu_{(\gamma_{\ell-j+1},\ldots,\gamma_\ell)}(B_2) = \nu_{(\beta_1,\beta_2,\ldots,\beta_j)}(B_2). \qquad (3.11)$$

Since $\tilde{B}_2 = \mathbb{R}^{\ell-j} \times B_2$, by (3.10) and (3.11),

$$\nu_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}(\tilde{B}_2) = \nu_{(\beta_1,\beta_2,\ldots,\beta_j)}(B_2).$$

Now from (3.8) and (3.9) it follows that

$$\nu_{(\alpha_1,\ldots,\alpha_k)}(B_1) = \nu_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}(\tilde{B}_1) = \nu_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}(\tilde{B}_2) = \nu_{(\beta_1,\beta_2,\ldots,\beta_j)}(B_2),$$

thus establishing (i).

To establish (ii), it needs to be shown that

(ii)a $P(C_1 \cup C_2) = P(C_1) + P(C_2)$ if $C_1, C_2 \in \mathcal{C}$ and $C_1 \cap C_2 = \emptyset$.

(ii)b $C_n \in \mathcal{C}$, $C_n \supset C_{n+1}$ for all $n$, $\bigcap_{n \geq 1} C_n = \emptyset$ $\Rightarrow$ $P(C_n) \downarrow 0$.


Let $C_1 = \pi_{(\alpha_1,\ldots,\alpha_k)}^{-1}(B_1)$ and $C_2 = \pi_{(\beta_1,\ldots,\beta_j)}^{-1}(B_2)$ for $B_1 \in \mathcal{B}(\mathbb{R}^k)$, $B_2 \in \mathcal{B}(\mathbb{R}^j)$, $\{\alpha_1, \ldots, \alpha_k\} \subset A$ and $\{\beta_1, \ldots, \beta_j\} \subset A$, $1 \leq j, k < \infty$. As in the proof of Proposition 6.3.2, $C_1$ and $C_2$ may be represented as

$$C_i = \pi_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}^{-1}(\tilde{B}_i), \quad i = 1, 2,$$

where $\tilde{B}_i \in \mathcal{B}(\mathbb{R}^\ell)$. Since $C_1$ and $C_2$ are disjoint by hypothesis, it follows that $\tilde{B}_1$ and $\tilde{B}_2$ are disjoint. Also, since

$$P(C_i) = \nu_{(\gamma_1,\gamma_2,\ldots,\gamma_\ell)}(\tilde{B}_i), \quad i = 1, 2,$$

and $\nu_{(\gamma_1,\ldots,\gamma_\ell)}(\cdot)$ is a measure on $\mathcal{B}(\mathbb{R}^\ell)$, it follows that

$$P(C_1 \cup C_2) = \nu_{(\gamma_1,\ldots,\gamma_\ell)}(\tilde{B}_1 \cup \tilde{B}_2) = \nu_{(\gamma_1,\ldots,\gamma_\ell)}(\tilde{B}_1) + \nu_{(\gamma_1,\ldots,\gamma_\ell)}(\tilde{B}_2) = P(C_1) + P(C_2),$$

thus proving (ii)a.

To prove (ii)b, note that for any sequence $\{C_n\}_{n \geq 1} \subset \mathcal{C}$, there exists a countable set $A_1 = \{\alpha_1, \alpha_2, \ldots, \alpha_n, \ldots\}$, an increasing sequence $\{k_n\}_{n \geq 1}$ of positive integers, and a sequence of Borel sets $\{B_n\}_{n \geq 1}$ such that $B_n \in \mathcal{B}(\mathbb{R}^{k_n})$ and $C_n = \pi_{(\alpha_1,\alpha_2,\ldots,\alpha_{k_n})}^{-1}(B_n)$ for all $n \in \mathbb{N}$. Now suppose that $\{C_n\}_{n \geq 1}$ is decreasing. It will be shown that if $\lim_{n\to\infty} P(C_n) = \delta > 0$, then $\bigcap_{n \geq 1} C_n \neq \emptyset$. For each $n$, by the regularity of measures (Corollary 1.3.5), there exists a compact set $G_n \subset B_n$ such that

$$\nu_{(\alpha_1,\ldots,\alpha_{k_n})}(B_n \setminus G_n) < \frac{\delta}{2^{n+1}}.$$

Let $D_n = \pi_{(\alpha_1,\alpha_2,\ldots,\alpha_{k_n})}^{-1}(G_n)$. Then $P(C_n \setminus D_n) < \frac{\delta}{2^{n+1}}$. Let $H_n = \bigcap_{j=1}^{n} D_j$. Then $\{H_n\}_{n \geq 1}$ is decreasing and

$$P(C_n \setminus H_n) = P(C_n \cap H_n^c) = P\Big( \bigcup_{j=1}^{n} (C_n \cap D_j^c) \Big) \leq \sum_{j=1}^{n} P(C_n \setminus D_j) \leq \sum_{j=1}^{n} P(C_j \setminus D_j) \quad \text{(since $\{C_n\}_{n \geq 1}$ is decreasing)} \leq \sum_{j=1}^{n} \frac{\delta}{2^{j+1}} < \frac{\delta}{2}.$$

Since $P(C_n) \downarrow \delta > 0$, $H_n \subset C_n$, and $P(C_n \setminus H_n) < \frac{\delta}{2}$, it follows that $P(H_n) > \frac{\delta}{2}$ for all $n \geq 1$. This implies $H_n \neq \emptyset$ for each $n$. It will now be shown that $\bigcap_{n \geq 1} H_n \neq \emptyset$. Let $\{\omega_n\}_{n \geq 1}$ be a sequence of elements from $\Omega = \mathbb{R}^A$ such that for each $n$, $\omega_n \in H_n$. Then, since $\{H_n\}_{n \geq 1}$ is a decreasing sequence, for each $1 \leq j < \infty$, $\omega_n \in H_j$ for $n \geq j$. This implies that the vector $(\omega_n(\alpha_1), \omega_n(\alpha_2), \ldots, \omega_n(\alpha_{k_j})) \in G_j$ for all $n \geq j$. Since $G_1$ is compact, there exists a subsequence $\{n_{1i}\}_{i \geq 1}$ such that $\lim_{i\to\infty} \omega_{n_{1i}}(\alpha_1) = \omega(\alpha_1)$ exists. Next, since $G_2$ is compact, there exists a further subsequence $\{n_{2i}\}_{i \geq 1}$ of $\{n_{1i}\}_{i \geq 1}$ such that $\lim_{i\to\infty} \omega_{n_{2i}}(\alpha_2) = \omega(\alpha_2)$ exists. Proceeding this way and applying the usual 'diagonal method,' a subsequence $\{n_i\}_{i \geq 1}$ is obtained such that $\lim_{i\to\infty} \omega_{n_i}(\alpha_j) = \omega(\alpha_j)$ for all $1 \leq j < \infty$. Let $\omega(\alpha) = 0$ for $\alpha \notin \{\alpha_1, \alpha_2, \ldots\}$. Since for each $j$, $G_j$ is compact, $(\omega(\alpha_1), \omega(\alpha_2), \ldots, \omega(\alpha_{k_j})) \in G_j$ and hence $\omega \in H_j$. Thus, $\omega \in \bigcap_{j \geq 1} H_j \subset \bigcap_{j \geq 1} C_j$, implying $\bigcap_{j \geq 1} C_j \neq \emptyset$. The proof of the theorem is now complete.

When the index set $A$ is countable and identified with the set $\mathbb{N} \equiv \{1, 2, 3, \ldots\}$, it is possible to give a simpler formulation of the consistency conditions.

Theorem 6.3.4: Let $\{\mu_n\}_{n \geq 1}$ be a sequence of probability measures such that

(i) for each $n \in \mathbb{N}$, $\mu_n$ is a probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$,

(ii) for each $n \in \mathbb{N}$, $\mu_{n+1}(B \times \mathbb{R}) = \mu_n(B)$ for all $B \in \mathcal{B}(\mathbb{R}^n)$.

Then there exists a stochastic process $\{X_n : n \geq 1\}$ on a probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \mathbb{R}^\infty$, $\mathcal{F} = \mathcal{B}(\mathbb{R}^\infty)$ such that for each $n \geq 1$, the probability distribution $P_{(X_1,X_2,\ldots,X_n)}$ of the random vector $(X_1, X_2, \ldots, X_n)$ is $\mu_n$.

Proof: For any $\{i_1, i_2, \ldots, i_k\} \subset \mathbb{N}$, let $j_1 < j_2 < \cdots < j_k$ be the increasing rearrangement of $i_1, i_2, \ldots, i_k$. Then there exists a permutation $(r_1, r_2, \ldots, r_k)$ of $(1, 2, \ldots, k)$ such that $j_1 = i_{r_1}, j_2 = i_{r_2}, \ldots, j_k = i_{r_k}$. Now define

$$\nu_{(j_1,j_2,\ldots,j_k)}(\cdot) \equiv \mu_{j_k} \circ \pi_{j_1,j_2,\ldots,j_k}^{-1}(\cdot)$$

where $\pi_{j_1,j_2,\ldots,j_k}(x_1, \ldots, x_{j_k}) = (x_{j_1}, x_{j_2}, \ldots, x_{j_k})$ for all $(x_1, x_2, \ldots, x_{j_k}) \in \mathbb{R}^{j_k}$.

Next define

$$\nu_{(i_1,i_2,\ldots,i_k)}(B_1 \times B_2 \times \cdots \times B_k) \equiv \nu_{(j_1,j_2,\ldots,j_k)}(B_{r_1} \times B_{r_2} \times \cdots \times B_{r_k})$$

where $B_i \in \mathcal{B}(\mathbb{R})$ for all $i$, $1 \leq i \leq k$. It can be verified that this family of finite dimensional distributions

$$\mathcal{Q}_{\mathbb{N}} \equiv \{\nu_{(i_1,i_2,\ldots,i_k)}(\cdot) : \{i_1, i_2, \ldots, i_k\} \subset \mathbb{N}, \ 1 \leq k < \infty\} \qquad (3.12)$$

satisfies the consistency conditions (3.1) and (3.2) of Theorem 6.3.1 and hence the assertion follows.


Example 6.3.6: (Sequence of independent random variables). Let $\{F_n\}_{n \geq 1}$ be a sequence of cdfs on $\mathbb{R}$. Consider the problem of constructing a sequence $\{X_n\}_{n \geq 1}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that (i) for each $n \in \mathbb{N}$, $X_n$ has cdf $F_n$ and (ii) for any $n \in \mathbb{N}$ and any $\{i_1, i_2, \ldots, i_n\} \subset \mathbb{N}$, the random variables $X_{i_1}, X_{i_2}, \ldots, X_{i_n}$ are independent, i.e.,

$$P(X_{i_1} \leq x_1, X_{i_2} \leq x_2, \ldots, X_{i_n} \leq x_n) = \prod_{j=1}^{n} F_{i_j}(x_j) \qquad (3.13)$$

for all $x_1, x_2, \ldots, x_n$ in $\mathbb{R}$.

This problem can be solved by using Theorem 6.3.4. Let $\mu_n$ be the Lebesgue-Stieltjes probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ corresponding to the distribution function

$$F_{1,2,\ldots,n}(x_1, x_2, \ldots, x_n) \equiv \prod_{j=1}^{n} F_j(x_j), \quad x_1, \ldots, x_n \in \mathbb{R}.$$

It is easy to verify that the family $\{\mu_n : n \geq 1\}$ satisfies (i) and (ii) of Theorem 6.3.4. Hence, there exist a probability measure $P$ on the sequence space $\Omega \equiv \mathbb{R}^\infty$ equipped with the σ-algebra $\mathcal{F} \equiv \mathcal{B}(\mathbb{R}^\infty)$ and random variables $X_n(\omega) \equiv \pi_n(\omega) \equiv \omega(n)$, for $\omega = (\omega(1), \omega(2), \ldots)$ in $\mathbb{R}^\infty$, $n \geq 1$, such that (3.13) holds.
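In simulation terms, the same construction amounts to pushing independent Uniform(0,1) draws through the quantile functions $F_n^{-1}(u) = \inf\{x : F_n(x) \geq u\}$ (cf. Problem 7.4(d)-(e)): the $n$th output then has cdf $F_n$ and the outputs are independent. The Python sketch below is an illustration only (not from the text); the particular cdfs used are assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Quantile functions F_n^{-1} for a few illustrative cdfs (assumed examples):
# F_1 = Exp(1), F_2 = Uniform(0,1), F_3 = standard Cauchy.
quantiles = [
    lambda u: -np.log(1.0 - u),           # Exp(1)
    lambda u: u,                           # Uniform(0,1)
    lambda u: np.tan(np.pi * (u - 0.5)),   # Cauchy
]

def sample_independent(quantiles, size, rng):
    """X_n = F_n^{-1}(U_n) with independent U_n ~ Uniform(0,1); the X_n are then
    independent with the prescribed marginal cdfs, as in (3.13)."""
    U = rng.uniform(size=(len(quantiles), size))
    return np.array([q(u) for q, u in zip(quantiles, U)])

X = sample_independent(quantiles, size=100_000, rng=rng)
print(X.mean(axis=1)[:2])   # roughly 1.0 for Exp(1) and 0.5 for Uniform(0,1)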

Example 6.3.7: (Family of independent random variables). Given a family $\{F_\alpha : \alpha \in A\}$ of cdfs on $\mathbb{R}$ for some index set $A$, a construction similar to Example 6.3.6, but using Theorem 6.3.1, yields the existence of a real valued stochastic process $\{X_\alpha : \alpha \in A\}$ such that for any $\{\alpha_1, \alpha_2, \ldots, \alpha_n\} \subset A$, $1 \leq n < \infty$, the random variables $X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_n}$ are independent, i.e., (3.13) holds.

Example 6.3.8: (Markov chains). Let $Q = ((q_{ij}))$ be a $k \times k$ stochastic matrix for some $1 < k < \infty$. That is,

(a) for all $1 \leq i, j \leq k$, $q_{ij} \geq 0$, and

(b) for each $1 \leq i \leq k$, $\sum_{j=1}^{k} q_{ij} = 1$.

Let $p = (p_1, p_2, \ldots, p_k)$ be a probability vector, i.e., for all $i$, $p_i \geq 0$ and $\sum_{i=1}^{k} p_i = 1$. Consider the problem of constructing a sequence $\{X_n\}_{n \geq 1}$ of random variables such that for each $n \in \mathbb{N}$,

$$P(X_1 = j_1, X_2 = j_2, \ldots, X_n = j_n) = p_{j_1} q_{j_1 j_2} \cdots q_{j_{n-1} j_n} \qquad (3.14)$$

for $1 \leq j_i \leq k$, $i = 1, 2, \ldots, n$.


Let $\mu_n$ be the discrete probability distribution determined by the right side of (3.14), that is,

$$\mu_n(\{(j_1, j_2, \ldots, j_n)\}) = p_{j_1} q_{j_1 j_2} \cdots q_{j_{n-1} j_n}$$

for all $(j_1, \ldots, j_n)$ such that $1 \leq j_i \leq k$ for all $1 \leq i \leq n$. It is easy to verify that $\{\mu_n\}_{n \geq 1}$ satisfies the conditions of Theorem 6.3.4 and hence there exists a sequence $\{X_n\}_{n \geq 1}$ of random variables satisfying (3.14). It may be verified that (3.14) is equivalent to

$$P(X_{n+1} = j_{n+1} \mid X_1 = j_1, \ldots, X_n = j_n) = q_{j_n j_{n+1}} = P(X_{n+1} = j_{n+1} \mid X_n = j_n) \qquad (3.15)$$

for all $n \geq 1$, $1 \leq j_i \leq k$, $i = 1, 2, \ldots, n+1$, provided $P(X_1 = j_1, \ldots, X_n = j_n) > 0$ and $P(X_1 = j) = p_j$ for $1 \leq j \leq k$. This says that the conditional distribution of $X_{n+1}$ given $X_1, X_2, \ldots, X_n$ depends only on $X_n$. This property is known as the Markov property, and the sequence $\{X_n\}_{n \geq 1}$ is called a Markov chain with state space $S \equiv \{1, 2, \ldots, k\}$ and time homogeneous transition probability matrix $((q_{ij}))$.
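As an illustration (not from the text), the joint law (3.14) can be sampled sequentially: draw $X_1$ from $p$, then repeatedly draw $X_{n+1}$ from row $X_n$ of $Q$, which is exactly the Markov property (3.15). The specific $p$ and $Q$ below are assumed for the example.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-state example: initial distribution p and stochastic matrix Q.
p = np.array([0.5, 0.3, 0.2])
Q = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.2, 0.3,  0.5 ]])

def simulate_chain(p, Q, n_steps, rng):
    """Sample (X_1, ..., X_n) with P(X_1 = j) = p_j and P(X_{n+1} = j | X_n = i) = Q[i, j]."""
    states = np.arange(len(p))
    x = [rng.choice(states, p=p)]
    for _ in range(n_steps - 1):
        x.append(rng.choice(states, p=Q[x[-1]]))
    return np.array(x)

print(simulate_chain(p, Q, n_steps=20, rng=rng))

# Empirical check of (3.14): frequency of (X_1, X_2) = (0, 1) versus p_0 * q_{01}.
pairs = np.array([simulate_chain(p, Q, 2, rng) for _ in range(50_000)])
print(np.mean((pairs[:, 0] == 0) & (pairs[:, 1] == 1)), p[0] * Q[0, 1])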

When the state space $S = \{1, 2, \ldots\}$, the above construction goes over with minor notational modifications.

Next consider the case $S = \mathbb{R}$. A function $Q : \mathbb{R} \times \mathcal{B}(\mathbb{R}) \to [0, 1]$ is called a probability transition function if

(i) for each $x$ in $\mathbb{R}$, $Q(x, \cdot)$ is a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, and

(ii) for each $B$ in $\mathcal{B}(\mathbb{R})$, $Q(\cdot, B)$ is a Borel measurable function on $\mathbb{R}$.

Let $\mu$ be a probability distribution on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Using Theorem 6.3.4, it can be shown that there exists a stochastic process $\{X_n\}_{n \geq 1}$ such that

$$P(X_1 \in B_1, X_2 \in B_2, \ldots, X_n \in B_n) = \int_{B_1} \int_{B_2} \cdots \int_{B_{n-1}} Q(x_{n-1}, B_n) Q(x_{n-2}, dx_{n-1}) \cdots Q(x_1, dx_2) \mu(dx_1), \qquad (3.16)$$

where the right side of (3.16) is a well-defined probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ (Problem 6.18). Such a sequence $\{X_n\}_{n \geq 1}$ is called a Markov chain with state space $\mathbb{R}$, initial distribution $\mu$, and transition probability function $Q$. For more on Markov chains, see Chapter 14.

Example 6.3.9: (Gaussian processes). Let $A$ be a nonempty set and $\{X_\alpha : \alpha \in A\}$ be a stochastic process. Such a process is called Gaussian if for $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$ and real numbers $t_1, t_2, \ldots, t_k$, the random variable $\sum_{i=1}^{k} t_i X_{\alpha_i}$ has a univariate normal distribution (with possibly zero variance). For such a process, the functions $\mu(\alpha) \equiv E X_\alpha$ and $\sigma(\alpha, \beta) \equiv \mathrm{Cov}(X_\alpha, X_\beta)$ are called the mean and covariance functions, respectively. Since $\mathrm{Var}\big(\sum_{i=1}^{k} t_i X_{\alpha_i}\big) \geq 0$, it follows that for any $t_1, t_2, \ldots, t_k$,

$$\sum_{i=1}^{k} \sum_{j=1}^{k} t_i t_j \sigma(\alpha_i, \alpha_j) \geq 0. \qquad (3.17)$$

This property of the covariance function $\sigma(\cdot, \cdot)$ is called nonnegative definiteness.

A natural question is: Given functions $\mu : A \to \mathbb{R}$ and $\sigma : A \times A \to \mathbb{R}$ such that $\sigma$ is symmetric and satisfies (3.17), does there exist a Gaussian process $\{X_\alpha : \alpha \in A\}$ with $\mu(\cdot)$ and $\sigma(\cdot, \cdot)$ as its mean and covariance functions, respectively? The answer is yes and it follows from Theorem 6.3.1 by defining the family $\mathcal{Q}_A$ of finite dimensional distributions as follows. Let $\nu_{(\alpha_1,\alpha_2,\ldots,\alpha_k)}$ be the unique probability distribution on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ with the moment generating function

$$M_{(\alpha_1,\alpha_2,\ldots,\alpha_k)}(s_1, s_2, \ldots, s_k) = \exp\Big( \sum_{i=1}^{k} s_i \mu(\alpha_i) + \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} s_i s_j \sigma(\alpha_i, \alpha_j) \Big) \qquad (3.18)$$

for $s_1, s_2, \ldots, s_k$ in $\mathbb{R}$. If the matrix $\Sigma \equiv ((\sigma(\alpha_i, \alpha_j)))$, $1 \leq i, j \leq k$, is positive definite, i.e., it is such that in (3.17) equality holds iff $t_i = 0$ for all $i$, then $\nu_{(\alpha_1,\ldots,\alpha_k)}(\cdot)$ can be shown to be a probability measure that is absolutely continuous w.r.t. $m_k$, the Lebesgue measure on $\mathbb{R}^k$, with density

$$\frac{1}{(2\pi)^{k/2}} |\Sigma|^{-1/2} \exp\Big( -\frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \big(x_i - \mu(\alpha_i)\big) \sigma^{ij} \big(x_j - \mu(\alpha_j)\big) \Big),$$

where $((\sigma^{ij})) = \Sigma^{-1}$, the inverse of $\Sigma$, and $|\Sigma|$ is the determinant of $\Sigma$.

The verification of conditions (3.1) and (3.2) for this family is left as an exercise (Problem 6.12).
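The construction only ever uses the mean vector $(\mu(\alpha_i))$ and the covariance matrix $((\sigma(\alpha_i, \alpha_j)))$ for a finite set of indices. The following Python sketch (an illustration, not from the text) samples one such finite dimensional distribution; the index set $A = [0,1]$ and the covariance $\sigma(s,t) = \min(s,t)$ are assumed for the example, and a small ridge is added only for numerical stability of the Cholesky factorization.

import numpy as np

rng = np.random.default_rng(2)

# Assumed mean and covariance functions on A = [0, 1].
mu = lambda t: 0.0
sigma = lambda s, t: min(s, t)      # nonnegative definite (Brownian-motion covariance)

def sample_fdd(alphas, mu, sigma, rng, n_samples=5):
    """Draw from nu_{(alpha_1,...,alpha_k)}: the normal law with mean_i = mu(alpha_i)
    and Sigma_ij = sigma(alpha_i, alpha_j), via a Cholesky factor Sigma = L L^T."""
    k = len(alphas)
    mean = np.array([mu(a) for a in alphas])
    Sigma = np.array([[sigma(a, b) for b in alphas] for a in alphas])
    L = np.linalg.cholesky(Sigma + 1e-12 * np.eye(k))
    Z = rng.standard_normal((n_samples, k))
    return mean + Z @ L.T

alphas = np.linspace(0.01, 1.0, 50)     # a finite index set {alpha_1, ..., alpha_k} in A
paths = sample_fdd(alphas, mu, sigma, rng)
print(paths.shape)                       # (5, 50): five sampled finite dimensional vectors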

Remark 6.3.4: Kolmogorov's consistency theorem (Theorem 6.3.1) remains valid when the real line $\mathbb{R}$ is replaced by a complete separable metric space $S$. More specifically, let $A$ be a nonempty set and for $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $1 \leq k < \infty$, let $\nu_{(\alpha_1,\alpha_2,\ldots,\alpha_k)}(\cdot)$ be a probability measure on $(S^k, \mathcal{B}(S^k))$. If the family $\mathcal{Q}_A \equiv \{\nu_{(\alpha_1,\alpha_2,\ldots,\alpha_k)} : \{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A, \ 1 \leq k < \infty\}$ satisfies the natural analogs of (3.1) and (3.2), then there exists a probability measure $P$ on $(\Omega \equiv S^A, \mathcal{F} \equiv (\mathcal{B}(S))^A)$ and an $S$-valued stochastic process $\{X_\alpha : \alpha \in A\}$ on $(\Omega, \mathcal{F}, P)$ such that $\nu_{(\alpha_1,\alpha_2,\ldots,\alpha_k)}(\cdot) = P \circ (X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})^{-1}(\cdot)$. Here $S^A$ is the set of all $S$-valued functions on $A$, $(\mathcal{B}(S))^A$ is the σ-algebra generated by the cylinder sets of the form

$$C = \{f : f : A \to S, \ f(\alpha_i) \in B_i, \ i = 1, 2, \ldots, k\}$$

where $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $B_i \in \mathcal{B}(S)$, $1 \leq i \leq k$, $1 \leq k < \infty$, and $X_\alpha(\omega)$ is the projection map $X_\alpha(\omega) \equiv \omega(\alpha)$. The main step in the proof of Theorem 6.3.1 was to establish the countable additivity of the set function $P$ on the algebra $\mathcal{C}$ of finite dimensional cylinder sets. This in turn depended upon the fact that any probability measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ for $1 \leq k < \infty$ is regular, i.e., for every Borel set $B$ in $\mathcal{B}(\mathbb{R}^k)$ and for every $\varepsilon > 0$, there exists a compact set $G \subset B$ such that $\mu(B \setminus G) < \varepsilon$. If $S$ is a Polish space, then any probability measure on $(S^k, (\mathcal{B}(S))^k)$, $1 \leq k < \infty$, is regular (see Billingsley (1968)), and hence the main steps in the proof of Theorem 6.3.1 go through in this case.

Remark 6.3.5: (Limitations of Theorem 6.3.1). In this construction, $\Omega = \mathbb{R}^A$ is rather large and the σ-algebra $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^A$ is not large enough to include many events of interest when the index set $A$ is uncountable. In fact, it can be shown that $\mathcal{F}$ coincides with the class of all sets $G \subset \Omega$ that depend only on a countable number of coordinates of $\omega$. More precisely, the following holds.

Proposition 6.3.5: The σ-algebra

$$\mathcal{F} = \{G : G = \pi_{A_1}^{-1}(B) \text{ for some } B \text{ in } \mathcal{B}(\mathbb{R}^\infty) \text{ and } A_1 \subset A, \ A_1 \text{ countable}\}. \qquad (3.19)$$

Proof: Verify that the right side of (3.19) is a σ-algebra containing the class $\mathcal{C}$ of cylinder sets and also that it is contained in $\mathcal{F}$.

For example, if $A = [0, 1]$, then the set $C[0, 1]$ of all continuous functions from $[0, 1]$ to $\mathbb{R}$ is not a member of $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^A$. Similarly, if $M(\omega) \equiv \sup\{|\omega(\alpha)| : \alpha \in [0, 1]\}$, then the set $\{M(\omega) \leq 1\}$ is not in $\mathcal{F} = (\mathcal{B}(\mathbb{R}))^{[0,1]}$. When $A$ is an interval in $\mathbb{R}$, this difficulty can be overcome in several ways. One approach, pioneered by J.L. Doob, is the notion of separable stochastic processes (Doob (1953)). Another approach, pioneered by Kolmogorov and Skorohod, is to restrict $\Omega$ to the class of all continuous functions or functions that are right continuous and have left limits (Billingsley (1968)). For more on stochastic processes, see Chapter 15.

Independent Random Experiments

If $E_1$ and $E_2$ are two random experiments with associated probability spaces $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$, it is possible to model the experiment of performing both $E_1$ and $E_2$ independently by the product probability space $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_1 \times P_2)$ (see Chapter 5). The same idea carries over to an arbitrary collection $\{E_\alpha : \alpha \in A\}$ of random experiments. It is possible to think of a grand experiment $E$ in which all the $E_\alpha$'s are independent components by considering the product probability space

$$(\times_{\alpha \in A} \Omega_\alpha, \ \times_{\alpha \in A} \mathcal{F}_\alpha, \ \times_{\alpha \in A} P_\alpha) \equiv (\Omega, \mathcal{F}, P) \qquad (3.20)$$

where $(\Omega_\alpha, \mathcal{F}_\alpha, P_\alpha)$ is the probability space corresponding to $E_\alpha$. Here $\Omega \equiv \times_{\alpha \in A} \Omega_\alpha$ is the collection of all functions $\omega$ on $A$ such that $\omega(\alpha) \in \Omega_\alpha$, $\mathcal{F} \equiv \times_{\alpha \in A} \mathcal{F}_\alpha$ is the σ-algebra generated by finite dimensional cylinder sets of the form

$$C = \{\omega : \omega(\alpha_i) \in B_{\alpha_i}, \ i = 1, 2, \ldots, k\}, \qquad (3.21)$$

$1 \leq k < \infty$, $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $B_{\alpha_i} \in \mathcal{F}_{\alpha_i}$, and $P \equiv \times_{\alpha \in A} P_\alpha$ is the probability measure on $\mathcal{F}$ such that for $C$ of (3.21),

$$P(C) = \prod_{i=1}^{k} P_{\alpha_i}(B_{\alpha_i}). \qquad (3.22)$$

The proof of the existence of such a $P$ on $\mathcal{F}$ is an application of the extension theorem (Theorem 1.3.3). The verification of countable additivity on the class $\mathcal{C}$ of cylinder sets is not difficult. See Kolmogorov (1956).

6.4 Problems

6.1 Let $\mu_1 = \mu_2$ be the probability distribution on $\Omega = \{1, 2\}$ with $\mu_1(\{1\}) = 1/2$. Find two distinct probability distributions on $\Omega \times \Omega$ with $\mu_1$ and $\mu_2$ as the set of marginals.

6.2 Let $\Omega = (0, 1)$, $\mathcal{F} = \mathcal{B}((0, 1))$, and $P$ be the Lebesgue measure on $(0, 1)$. Let $X(\omega) = -\log \omega$, $h(x) = x^2$, and $Y = h(X)$. Find $P_X$ and $P_Y$ and evaluate $EY$ by applying the change of variables formula (Proposition 6.2.1).

6.3 In the change of variables formula, one of the three integrals is usually easier to evaluate than the other two. In this problem, in part (a) the first integral is easier to evaluate than the other two, while in part (b) the second one is easier.

(a) Let $Z \sim N(0, 1)$, $X = Z^2$, and $Y = e^{-X}$.

(i) Find the distributions $P_X$ and $P_Y$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
(ii) Compute the integrals
$$\int_{\mathbb{R}} e^{-z^2} \phi(z)\, dz, \quad \int_{\mathbb{R}} e^{-x} P_X(dx), \quad \text{and} \quad \int_{\mathbb{R}} y\, P_Y(dy),$$
where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$, $-\infty < z < \infty$. Verify that all three integrals agree.

(b) Let $X_1, X_2, \ldots, X_k$ be iid $N(0, 1)$ random variables. Let $Y = X_1 + \cdots + X_k$ and $Z = Y^2$.

(i) Find the distributions of $Y$ and $Z$.
(ii) Evaluate $\int_{\mathbb{R}^k} (x_1 + \cdots + x_k)^2 \, dP_{X_1,\ldots,X_k}(x_1, \ldots, x_k)$, $\int_{\mathbb{R}} y^2 P_Y(dy)$, and $\int_{\mathbb{R}^+} z\, P_Z(dz)$.


(c) Let $X_1, X_2, \ldots, X_k$ be independent Binomial $(n_i, p)$, $i = 1, 2, \ldots, k$, random variables. Let $Y = X_1 + \cdots + X_k$.
(i) Find the distribution $P_Y$ of $Y$.
(ii) Evaluate $\int_{\mathbb{R}^k} (x_1 + \cdots + x_k) \, dP_{X_1,\ldots,X_k}(x_1, \ldots, x_k)$ and $\int_{\mathbb{R}} y\, P_Y(dy)$.

6.4 Let $X$ be a random variable such that $M_X(t) \equiv E(e^{tX}) < \infty$ for $|t| < \varepsilon$ for some $\varepsilon > 0$.

(a) Show that $E(e^{tX} |X|^r) < \infty$ for all $r > 0$ and $|t| < \varepsilon$.

(b) Show that $M_X^{(r)}(t)$, the $r$th derivative of $M_X(t)$ for $r \in \mathbb{N}$, satisfies $M_X^{(r)}(t) = E(e^{tX} X^r)$ for $|t| < \varepsilon$.

(c) Verify (2.25).

(Hint: (a) First show that for $t_1 \in (-\varepsilon, \varepsilon)$, there exists a $t_2 \in (-\varepsilon, \varepsilon)$ such that $|t_1| < |t_2| < \varepsilon$ and, for some $C < \infty$, $e^{t_1 x}|x|^r \leq C e^{|t_2 x|}$ for all $x$ in $\mathbb{R}$.
(b) Verify that for all $x \in \mathbb{R}$, $|e^x - 1| \leq |x| e^{|x|}$. Now use (a) and the DCT to show that $M_X(t)$ is differentiable and $M_X^{(1)}(t) = E(e^{tX} X)$ for all $|t| < \varepsilon$. Now complete the proof by induction.)

6.5 Let $X$ be a random variable.

(a) Show that $\phi(r) \equiv (E|X|^r)^{1/r}$ is nondecreasing on $(0, \infty)$.
(b) Show that $\psi(r) \equiv \log E|X|^r$ is convex on $(0, r_0)$ if $E|X|^{r_0} < \infty$.
(c) Let $M = \sup\{x : P(|X| > x) > 0\}$. Show that
(i) $\lim_{r \uparrow \infty} \phi(r) = M$.
(ii) $\lim_{n\to\infty} \frac{E|X|^{n+1}}{E|X|^n} = M$.

(Hint: For $M < \infty$, note that $E|X|^r \geq (M - \varepsilon)^r P(|X| > M - \varepsilon)$ for any $\varepsilon > 0$.)

6.6 Show that if equality holds in (2.20), then there exist constants $a$ and $b$ such that $P(Y = aX + b) = 1$.

(Hint: Show that there exists a constant $a$ such that $\mathrm{Var}(Y - aX) = 0$.)

6.7 Determine C and its base B explicitly in Examples 6.3.3 and 6.3.4.

6.8 (a) Show that $D$ in Example 6.3.5 is not a finite dimensional cylinder set.

(Hint: Note that $\lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} x_j$ is not determined by the values of finitely many $x_i$'s.)


(b) Find three other such examples of sets $D$ in $\mathbb{R}^\infty$ that are not finite dimensional cylinder sets.

6.9 Establish the assertion in Remark 6.3.3 by completing the following steps:

(a) Show that the coordinate map $f_n(x) \equiv x_n$ from $\mathbb{R}^\infty$ to $\mathbb{R}$ is continuous under the metric $d$ of (3.3). (Conclude, using Example 1.1.6, that $\mathcal{R}^{\mathbb{N}} \subset \mathcal{B}(\mathbb{R}^\infty)$.)

(b) Let $\mathcal{C}_1 \equiv \{A : A = (a_1, b_1) \times \cdots \times (a_k, b_k) \times \mathbb{R}^\infty, \ -\infty \leq a_i < b_i \leq \infty, \ 1 \leq i \leq k, \text{ for some } k < \infty\}$ and $\mathcal{C}_2 \equiv \{A : A$ is an open ball in $(\mathbb{R}^\infty, d)\}$. Show that $\sigma\langle \mathcal{C}_2 \rangle \subset \sigma\langle \mathcal{C}_1 \rangle$.

(c) Show that $\sigma\langle \mathcal{C}_2 \rangle = \mathcal{B}(\mathbb{R}^\infty)$ by showing that every open set in $(\mathbb{R}^\infty, d)$ is a countable union of open balls.

6.10 Show that the family $\mathcal{Q}_{\mathbb{N}}$ defined in (3.12) satisfies the consistency conditions (3.1) and (3.2) of Theorem 6.3.1.

6.11 Verify that the family of finite dimensional distributions defined by the right side of (3.14) satisfies the conditions of Theorem 6.3.4.

6.12 Verify that the family of distributions defined in (3.18) satisfies conditions (3.1) and (3.2) of Theorem 6.3.1.

(Hint: Use the fact that for any $k \geq 1$, any $\mu = (\mu_1, \mu_2, \ldots, \mu_k) \in \mathbb{R}^k$, and any nonnegative definite $k \times k$ matrix $\Sigma \equiv ((\sigma_{ij}))_{k \times k}$, there is a unique probability distribution $\nu$ such that for any $s = (s_1, s_2, \ldots, s_k)$ in $\mathbb{R}^k$,

$$\int_{\mathbb{R}^k} \exp\Big( \sum_{i=1}^{k} s_i x_i \Big) \nu(dx) = \exp\Big( \sum_{i=1}^{k} s_i \mu_i + \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} s_i s_j \sigma_{ij} \Big).$$

Observe that this implies that for $s = (s_1, s_2, \ldots, s_k)$ in $\mathbb{R}^k$, the induced distribution (under $\nu$) on $\mathbb{R}$ by the map $g(x) = \sum_{i=1}^{k} s_i x_i$ from $\mathbb{R}^k \to \mathbb{R}$ is univariate normal with mean $\sum_{i=1}^{k} s_i \mu_i$ and variance $\sum_{i=1}^{k} \sum_{j=1}^{k} s_i s_j \sigma_{ij}$.)

6.13 Show that the set $D \equiv C[0, 1]$ of continuous functions from $[0, 1]$ to $\mathbb{R}$ is not a member of the σ-algebra $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^{[0,1]}$.

(Hint: If $D \in \mathcal{F}$, then by Proposition 6.3.5, $D$ is of the form $\pi_{A_1}^{-1}(B)$ for some $B$ in $\mathcal{B}(\mathbb{R}^\infty)$, where $A_1 \subset [0, 1]$ is countable. Show that for any such $A_1$ and $B$, there exist functions $f : [0, 1] \to \mathbb{R}$ such that $f \in \pi_{A_1}^{-1}(B)$ but $f$ is not continuous on $[0, 1]$.)


6.14 Show that $K \equiv \{\omega : \omega \in \mathbb{R}^{[0,1]}, \ \sup_{0 \leq \alpha \leq 1} |\omega(\alpha)| < 1\}$ is not in $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^{[0,1]}$.

(Hint: Observe that $\sup_{0 \leq \alpha \leq 1} |\omega(\alpha)|$ is not determined by the values of $\omega(\alpha)$ for countably many $\alpha$'s.)

6.15 Let $\{\mu_i\}_{i \geq 1}$ be a sequence of probability distributions on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and let $\mu$ be a probability distribution on $\mathbb{N}$ with $p_i \equiv \mu(\{i\})$, $i \geq 1$.

(a) Verify that $\nu(\cdot) \equiv \sum_{i \geq 1} p_i \mu_i(\cdot)$ is a probability distribution on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

(b) (i) Show that there exists a probability space $(\Omega, \mathcal{F}, P)$ and a collection of independent random variables $J, X_1, X_2, \ldots$ on $(\Omega, \mathcal{F}, P)$ such that for each $i \geq 1$, $X_i$ has distribution $\mu_i$ and $J \sim \mu$.

(ii) Let $Y = X_J$, i.e., $Y(\omega) \equiv X_{J(\omega)}(\omega)$. Show that $Y$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $Y \sim \nu$.
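A hedged Python sketch of the two-stage sampling in Problem 6.15(b) (illustrative only, not a solution): first draw $J \sim \mu$, then draw $Y$ from $\mu_J$; the resulting draws follow the mixture law $\nu = \sum_i p_i \mu_i$. The component distributions below are assumptions chosen for the example.

import numpy as np

rng = np.random.default_rng(3)

# Assumed mixing weights p_i and component samplers for mu_i.
p = np.array([0.3, 0.7])
components = [lambda r: r.normal(-2.0, 1.0),    # mu_1 = N(-2, 1)
              lambda r: r.normal(3.0, 0.5)]     # mu_2 = N(3, 0.25)

def sample_mixture(n, p, components, rng):
    """Y = X_J with J ~ mu (weights p), J independent of the X_i ~ mu_i."""
    J = rng.choice(len(p), size=n, p=p)
    return np.array([components[j](rng) for j in J])

Y = sample_mixture(10_000, p, components, rng)
print(Y.mean())   # close to 0.3*(-2) + 0.7*3 = 1.5, the mean of nu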

6.16 Let $F$ be a cdf on $\mathbb{R}$ and let $F$ be decomposed as

$$F = \alpha F_d + \beta F_{ac} + \gamma F_{sc}$$

where $\alpha, \beta, \gamma \in [0, 1]$, $\alpha + \beta + \gamma = 1$, and $F_d, F_{ac}, F_{sc}$ are discrete, absolutely continuous, and singular continuous cdfs on $\mathbb{R}$ (cf. (4.5.3)). Show that there exist independent random variables $X_1, X_2, X_3$ and $J$ on some probability space such that $X_1 \sim F_d$, $X_2 \sim F_{ac}$, $X_3 \sim F_{sc}$, $P(J = 1) = \alpha$, $P(J = 2) = \beta$, $P(J = 3) = \gamma$, and $X_J \sim F$, where $\sim$ means "has cdf".

6.17 Let $\mu$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. For each $x$ in $\mathbb{R}$, let $F(x, \cdot)$ be a cdf on $\mathbb{R}$. Let $\psi(x, t) \equiv \inf\{y : F(x, y) \geq t\}$ for $x$ in $\mathbb{R}$, $0 < t < 1$. Assume that $\psi(\cdot, \cdot) : \mathbb{R} \times (0, 1) \to \mathbb{R}$ is measurable. Let $X$ and $U$ be independent random variables on some probability space $(\Omega, \mathcal{F}, P)$ such that $X \sim \mu$ and $U \sim$ Uniform $(0, 1)$.

(a) Show that $Y = \psi(X, U)$ is a random variable.
(b) Show that $P(Y \leq y) = \int_{\mathbb{R}} F(x, y) \mu(dx)$. (The distribution of $Y$ is called a mixture of distributions with $\mu$ as the mixing distribution. This is of relevance in Bayesian statistical inference.)

6.18 (a) Let $(S_i, \mathcal{S}_i)$, $i = 1, 2$, be two measurable spaces. Let $\mu$ be a probability measure on $(S_1, \mathcal{S}_1)$ and let $Q : S_1 \times \mathcal{S}_2 \to [0, 1]$ be such that for each $x$ in $S_1$, $Q(x, \cdot)$ is a probability measure on $(S_2, \mathcal{S}_2)$ and for each $B$ in $\mathcal{S}_2$, $Q(\cdot, B)$ is $\mathcal{S}_1$-measurable. Define

$$\nu(B_1 \times B_2) \equiv \int_{B_1} Q(x, B_2)\, \mu(dx)$$

on $\mathcal{C} \equiv \{B_1 \times B_2 : B_i \in \mathcal{S}_i, \ i = 1, 2\}$. Show that $\nu$ can be extended to be a probability measure on $\sigma\langle \mathcal{C} \rangle \equiv \mathcal{S}_1 \times \mathcal{S}_2$.

(b) Let $\mu$ and $Q$ be as in Example 6.3.8 (cf. (3.16)). For each $n \geq 1$ let $\nu_n$ be a set function defined by the recursive scheme

$$\nu_1(\cdot) = \mu(\cdot),$$
$$\nu_{n+1}(A \times B) = \int_A Q(x, B)\, \nu_n(dx), \quad A \in \mathcal{B}(\mathbb{R}^n), \ B \in \mathcal{B}(\mathbb{R}).$$

Show that for each $n$, $\nu_n$ can be extended to be a probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$. (Thus the right side of (3.16) is defined to be $\nu_n(B_1 \times B_2 \times \cdots \times B_n)$.)

6.19 (Bayesian paradigm). Consider the setup in Problem 6.18 (a). Let $\lambda(B) \equiv \nu(S_1 \times B) = \int_{S_1} Q(x, B)\, \mu(dx)$ for all $B$ in $\mathcal{S}_2$.

(a) Verify that $\lambda$ is a probability measure on $(S_2, \mathcal{S}_2)$.

(b) Now fix $B_1$ in $\mathcal{S}_1$. Show that there exists a function $\bar{Q}(\cdot, B_1) : S_2 \to [0, 1]$ that is $\langle \mathcal{S}_2, \mathcal{B}(\mathbb{R}) \rangle$-measurable such that

$$\nu(B_1 \times B_2) = \int_{B_2} \bar{Q}(x, B_1)\, \lambda(dx).$$

(Hint: Apply the Radon-Nikodym theorem to the pair $\nu(B_1 \times \cdot)$ and $\lambda(\cdot)$.)

(c) Let $\Omega = S_1 \times S_2$, $\mathcal{F} = \sigma\langle \mathcal{C} \rangle$. For $\omega = (s_1, s_2)$, let $\theta(\omega) = s_1$ and $X(\omega) = s_2$. Think of $\theta$ as the parameter, $X$ as the data, $Q(\theta, \cdot)$ as the distribution of $X$ given $\theta$, $\mu(\cdot)$ as the prior distribution of $\theta$, and $\bar{Q}(x, B_1)$ as the posterior probability that $\theta$ is in $B_1$ given the data $X = x$. Compute $\bar{Q}(x, B_1)$ when $(S_i, \mathcal{S}_i) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, $i = 1, 2$, $\mu(\cdot) \sim N(0, 1)$, $Q(\theta, \cdot) \sim N(\theta, 1)$.
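For part (c), the standard conjugate-normal computation gives the posterior $\theta \mid X = x \sim N(x/2, 1/2)$ when the prior is $N(0,1)$ and $X \mid \theta \sim N(\theta, 1)$. The Python sketch below (an illustration, not from the text) checks this closed form against direct numerical integration of Bayes' formula.

import numpy as np

def posterior_prob_numeric(x, a, b, grid=np.linspace(-10, 10, 200_001)):
    """P(theta in (a, b) | X = x) by numerically integrating
    prior(theta) * likelihood(x | theta) over a fine grid and normalizing."""
    prior = np.exp(-grid**2 / 2)
    like = np.exp(-(x - grid)**2 / 2)
    w = prior * like
    mask = (grid > a) & (grid < b)
    return w[mask].sum() / w.sum()

def posterior_prob_closed(x, a, b):
    """Same probability from the closed form theta | X = x ~ N(x/2, 1/2)."""
    from math import erf, sqrt
    cdf = lambda t: 0.5 * (1 + erf((t - x / 2) / sqrt(2 * 0.5)))
    return cdf(b) - cdf(a)

x = 1.7
print(posterior_prob_numeric(x, 0.0, 1.0))
print(posterior_prob_closed(x, 0.0, 1.0))   # the two values agree to several decimals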

6.20 Let $X$ be a random variable on some probability space $(\Omega, \mathcal{F}, P)$. Recall that a random variable $X$ is

(a) discrete if there is a finite or countable set $D \equiv \{a_j : 1 \leq j \leq k \leq \infty\}$ such that $P(X \in D) = 1$,

(b) continuous if for every $x \in \mathbb{R}$, $P(X = x) = 0$, or equivalently, the cdf $F_X(\cdot)$ is continuous on all of $\mathbb{R}$,

(c) absolutely continuous if there exists a nonnegative Borel measurable function $f_X(\cdot)$ on $\mathbb{R}$ such that for any $-\infty < a < b < \infty$,

$$P(a < X \leq b) = \int_{(a,b]} f_X \, dm,$$

or equivalently the induced measure $P X^{-1} \ll m$,

(d) singular if $P X^{-1} \perp m$, or equivalently, $F_X'(\cdot) = 0$ a.e. $(m)$,

(e) singular continuous if it is singular and continuous.

Let $g : \mathbb{R} \to \mathbb{R}$ be Borel measurable and $Y = g(X)$.

(a) Show that if $X$ is discrete then so is $Y$, but not conversely.

(b) Show that if $X$ is continuous and $g$ is (1-1) on the range of $X$, then $Y$ is continuous.

(c) Show that if $X$ is absolutely continuous with pdf $f_X(\cdot)$ and $g$ is absolutely continuous on bounded intervals with $g'(\cdot) > 0$ a.e. $(m)$, then $Y$ is also absolutely continuous with pdf

$$f_Y(y) = \frac{f_X(g^{-1}(y))}{g'(g^{-1}(y))}.$$

(d) Let $X$ be as in (c) above. Suppose $g$ is absolutely continuous on bounded intervals and there exist disjoint intervals $\{I_j\}_{1 \leq j \leq k}$, $1 \leq k \leq \infty$, such that $\bigcup_{1 \leq j \leq k} I_j = \mathbb{R}$ and for each $j$, either $g'(\cdot) > 0$ a.e. $(m)$ on $I_j$ or $g'(\cdot) < 0$ a.e. $(m)$ on $I_j$. Show that $Y$ is also absolutely continuous with pdf

$$f_Y(y) = \sum_{x_j \in D(y)} \frac{f_X(x_j)}{|g'(x_j)|}$$

where $D(y) \equiv \{x_j : x_j \in I_j, \ g(x_j) = y\}$.

(e) Use (c) to compute the pdf of $Y$ when
(i) $X \sim N(0, 1)$, $g(x) = e^x$.
(ii) $X \sim N(0, 1)$, $g(x) = x^2$.
(iii) $X \sim N(0, 1)$, $g(x) = \sin 2\pi x$.
(iv) $X \sim \exp(1)$, $g(x) = e^{-x}$.
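As an illustration of case (e)(i) (not from the text): for $X \sim N(0,1)$ and $g(x) = e^x$, formula (c) gives the lognormal density $f_Y(y) = \phi(\log y)/y$ for $y > 0$. The Python sketch below compares this formula with a Monte Carlo histogram of $Y = e^X$.

import numpy as np

rng = np.random.default_rng(4)

def f_Y(y):
    """f_Y(y) = f_X(g^{-1}(y)) / g'(g^{-1}(y)) with f_X the N(0,1) density and g(x) = e^x,
    i.e. the lognormal density phi(log y) / y for y > 0."""
    return np.exp(-np.log(y)**2 / 2) / (np.sqrt(2 * np.pi) * y)

# Monte Carlo check: normalized histogram of Y = exp(X) against the formula.
Y = np.exp(rng.standard_normal(200_000))
hist, edges = np.histogram(Y, bins=60, range=(0.05, 6.0), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - f_Y(mid))))   # small: the empirical and exact densities roughly agree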

6.21 (Simple random sampling without replacement). Let $S \equiv \{1, 2, \ldots, m\}$, $1 < m < \infty$. Fix $1 \leq n \leq m$. Choose an element $X_1$ from $S$ such that the probability that $X_1 = j$ is $\frac{1}{m}$ for all $j \in S$. Next, choose an element $X_2$ from $S - \{X_1\}$ such that the probability that $X_2 = j$ is $\frac{1}{m-1}$ for $j \in S - \{X_1\}$. Continue this procedure for $n$ steps. Write the outcome as the ordered vector $\omega \equiv (X_1, X_2, \ldots, X_n)$.

(a) Identify the sample space $\Omega$, the σ-algebra $\mathcal{F}$, and the probability measure $P$ for this experiment.

(b) Show that for any permutation $\sigma$ of $\{1, 2, \ldots, n\}$, the random vector $Y_\sigma = (X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)})$ has the same distribution as $(X_1, X_2, \ldots, X_n)$.

(c) Conclude that $\{X_i\}_{1 \leq i \leq n}$ are identically distributed and that $E X_i$ and $\mathrm{Cov}(X_i, X_j)$, $i \neq j$, are independent of $i$ and $j$, and compute them.

(d) Answer the same questions (a)-(c) if the sampling is changed to with replacement, i.e., at each stage $i$, $P(X_i = j) = \frac{1}{m}$ for all $j \in S$.

(e) In (d), let $D$ be the number of distinct units in the sample. Find $E(D)$ and $\mathrm{Var}(D)$.

6.22 Let $X$ be a nonnegative random variable. Show that

$$\sqrt{1 + (EX)^2} \leq E\sqrt{1 + X^2} \leq 1 + EX.$$

(Note that $f(x) \equiv \sqrt{1 + x^2}$ is convex on $[0, \infty)$ and bounded by $1 + x$.)

6.23 Let $X$ and $Y$ be nonnegative random variables defined on a probability space $(\Omega, \mathcal{F}, P)$. Suppose $X \cdot Y \geq 1$ w.p. 1. Show that

$$EX \cdot EY \geq 1.$$

(Hint: Use Cauchy-Schwarz on $\sqrt{X}\sqrt{Y}$.)

6.24 Let $\mu$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Show that there is a random variable $X$ on the Lebesgue space $([0, 1], \mathcal{B}([0, 1]), m)$ such that $m \circ X^{-1} \equiv \mu$, where $m$ is the Lebesgue measure. Extend this to $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$, where $k$ is an integer $> 1$.

(Note: This is true for any Polish space, i.e., a complete separable metric space; see Billingsley (1968).)


7
Independence

7.1 Independent events and random variables

Although a probability space is nothing more than a measure space with the measure of the whole space equal to one, probability theory is not merely a subset of measure theory. A distinguishing and fundamental feature of probability theory is the notion of independence.

Definition 7.1.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{B_1, B_2, \ldots, B_n\} \subset \mathcal{F}$ be a finite collection of events.

(i) $B_1, B_2, \ldots, B_n$ are called independent w.r.t. $P$ if

$$P\Big( \bigcap_{j=1}^{k} B_{i_j} \Big) = \prod_{j=1}^{k} P(B_{i_j}) \qquad (1.1)$$

for all $\{i_1, i_2, \ldots, i_k\} \subset \{1, 2, \ldots, n\}$, $1 \leq k \leq n$.

(ii) $B_1, B_2, \ldots, B_n$ are called pairwise independent w.r.t. $P$ if $P(B_i \cap B_j) = P(B_i) P(B_j)$ for all $i, j$, $i \neq j$.

Note that a collection $B_1, B_2, \ldots, B_n$ of events may be independent with respect to one probability measure $P$ but not with respect to another measure $P'$. Note also that pairwise independence does not imply independence (Problem 7.1).

Definition 7.1.2: Let $(\Omega, \mathcal{F}, P)$ be a probability space. A collection of events $\{B_\alpha : \alpha \in A\} \subset \mathcal{F}$ is called independent w.r.t. $P$ if for every finite subcollection $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $1 \leq k < \infty$,

$$P\Big( \bigcap_{i=1}^{k} B_{\alpha_i} \Big) = \prod_{i=1}^{k} P(B_{\alpha_i}). \qquad (1.2)$$

Definition 7.1.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $A$ be a nonempty set. For each $\alpha$ in $A$, let $\mathcal{G}_\alpha \subset \mathcal{F}$ be a collection of events. Then the family $\{\mathcal{G}_\alpha : \alpha \in A\}$ is called independent w.r.t. $P$ if for every choice of $B_\alpha$ in $\mathcal{G}_\alpha$ for $\alpha$ in $A$, the collection of events $\{B_\alpha : \alpha \in A\}$ is independent w.r.t. $P$ as in Definition 7.1.2.

Definition 7.1.4: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{X_\alpha : \alpha \in A\}$ be a collection of random variables on $(\Omega, \mathcal{F}, P)$. Then the collection $\{X_\alpha : \alpha \in A\}$ is called independent w.r.t. $P$ if the family of σ-algebras $\{\sigma\langle X_\alpha \rangle : \alpha \in A\}$ is independent w.r.t. $P$, where $\sigma\langle X_\alpha \rangle$ is the σ-algebra generated by $X_\alpha$, i.e.,

$$\sigma\langle X_\alpha \rangle \equiv \{X_\alpha^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}. \qquad (1.3)$$

Note that the collection $\{X_\alpha : \alpha \in A\}$ is independent iff for any $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$ and $B_i \in \mathcal{B}(\mathbb{R})$ for $i = 1, 2, \ldots, k$, $1 \leq k < \infty$,

$$P(X_{\alpha_i} \in B_i, \ i = 1, 2, \ldots, k) = \prod_{i=1}^{k} P(X_{\alpha_i} \in B_i). \qquad (1.4)$$

It turns out that if (1.4) holds for all $B_i$ of the form $B_i = (-\infty, x_i]$, $x_i \in \mathbb{R}$, then it holds for all $B_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$. This follows from the proposition below.

Proposition 7.1.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $A$ be a nonempty set. Let $\mathcal{G}_\alpha \subset \mathcal{F}$ be a π-system for each $\alpha$ in $A$. Let $\{\mathcal{G}_\alpha : \alpha \in A\}$ be independent w.r.t. $P$. Then the family of σ-algebras $\{\sigma\langle \mathcal{G}_\alpha \rangle : \alpha \in A\}$ is also independent w.r.t. $P$.

Proof: Fix $2 \leq k < \infty$, $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $B_i \in \mathcal{G}_{\alpha_i}$, $i = 1, 2, \ldots, k-1$. Let

$$\mathcal{L} \equiv \Big\{ B : B \in \sigma\langle \mathcal{G}_{\alpha_k} \rangle, \ P(B_1 \cap \cdots \cap B_{k-1} \cap B) = \Big( \prod_{i=1}^{k-1} P(B_i) \Big) P(B) \Big\}. \qquad (1.5)$$

It is easy to verify that $\mathcal{L}$ is a λ-system. By hypothesis, $\mathcal{L}$ contains the π-system $\mathcal{G}_{\alpha_k}$. Hence, by the π-λ theorem (cf. Theorem 1.1.2), $\mathcal{L} = \sigma\langle \mathcal{G}_{\alpha_k} \rangle$. Iterating the above argument $k$ times completes the proof.

Corollary 7.1.2: A collection $\{X_\alpha : \alpha \in A\}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ is independent w.r.t. $P$ iff for any $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$ and any $x_1, x_2, \ldots, x_k$ in $\mathbb{R}$, the joint cdf $F_{\alpha_1,\alpha_2,\ldots,\alpha_k}$ of $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$ is the product of the marginal cdfs $F_{\alpha_i}$, i.e.,

$$F_{\alpha_1,\alpha_2,\ldots,\alpha_k}(x_1, x_2, \ldots, x_k) \equiv P(X_{\alpha_i} \leq x_i, \ i = 1, 2, \ldots, k) = \prod_{i=1}^{k} P(X_{\alpha_i} \leq x_i) = \prod_{i=1}^{k} F_{\alpha_i}(x_i). \qquad (1.6)$$

Proof: For the 'if' part, let $\mathcal{G}_\alpha \equiv \{X_\alpha^{-1}((-\infty, x]) : x \in \mathbb{R}\}$, $\alpha \in A$. Now apply Proposition 7.1.1. The 'only if' part is easy.

Remark 7.1.1: If the probability distribution of $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$ is absolutely continuous w.r.t. the Lebesgue measure $m_k$ on $\mathbb{R}^k$, then (1.6), and hence the independence of $X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k}$, is equivalent to the condition that

$$f_{\alpha_1,\alpha_2,\ldots,\alpha_k}(x_1, x_2, \ldots, x_k) = \prod_{i=1}^{k} f_{\alpha_i}(x_i) \qquad (1.7)$$

a.e. $(m_k)$, where $f_{(\alpha_1,\alpha_2,\ldots,\alpha_k)}$ is the joint density of $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$ and $f_{\alpha_i}$ is the marginal density of $X_{\alpha_i}$, $i = 1, 2, \ldots, k$. See Problem 7.18.

Proposition 7.1.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{X_1, X_2, \ldots, X_k\}$, $2 \leq k < \infty$, be a collection of random variables on $(\Omega, \mathcal{F}, P)$.

(i) Then $\{X_1, X_2, \ldots, X_k\}$ is independent iff

$$E \prod_{i=1}^{k} h_i(X_i) = \prod_{i=1}^{k} E h_i(X_i) \qquad (1.8)$$

for all bounded Borel measurable functions $h_i : \mathbb{R} \to \mathbb{R}$, $i = 1, 2, \ldots, k$.

(ii) If $X_1, X_2$ are independent and $E|X_1| < \infty$, $E|X_2| < \infty$, then

$$E|X_1 X_2| < \infty \quad \text{and} \quad E X_1 X_2 = E X_1 \, E X_2. \qquad (1.9)$$

Proof:

(i) If (1.8) holds, then taking $h_i = I_{B_i}$ with $B_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$, yields the independence of $X_1, X_2, \ldots, X_k$. Conversely, if $X_1, X_2, \ldots, X_k$ are independent, then (1.8) holds for $h_i = I_{B_i}$ for $B_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$, and hence for simple functions $h_1, h_2, \ldots, h_k$. Now (1.8) follows from the BCT.

(ii) Note that by the change of variables formula (Proposition 6.2.1),

$$E|X_1 X_2| = \int_{\mathbb{R}^2} |x_1 x_2| \, dP_{X_1,X_2}(x_1, x_2), \qquad E|X_i| = \int_{\mathbb{R}} |x_i| \, dP_{X_i}(x_i), \quad i = 1, 2,$$

where $P_{X_1,X_2}$ is the joint distribution of $(X_1, X_2)$ and $P_{X_i}$ is the marginal distribution of $X_i$, $i = 1, 2$. Also, by the independence of $X_1$ and $X_2$, $P_{X_1,X_2}$ is equal to the product measure $P_{X_1} \times P_{X_2}$. Hence, by Tonelli's theorem,

$$E|X_1 X_2| = \int_{\mathbb{R}^2} |x_1 x_2| \, dP_{X_1,X_2}(x_1, x_2) = \int_{\mathbb{R}^2} |x_1 x_2| \, dP_{X_1}(x_1)\, dP_{X_2}(x_2) = \Big( \int_{\mathbb{R}} |x_1| \, dP_{X_1}(x_1) \Big) \Big( \int_{\mathbb{R}} |x_2| \, dP_{X_2}(x_2) \Big) = E|X_1| \, E|X_2| < \infty.$$

Now using Fubini's theorem, one gets (1.9).

Remark 7.1.2: Note that the converse to (ii) above need not hold. That is, if $X_1$ and $X_2$ are two random variables such that $E|X_1| < \infty$, $E|X_2| < \infty$, $E|X_1 X_2| < \infty$, and $E X_1 X_2 = E X_1 E X_2$, then $X_1$ and $X_2$ need not be independent.

7.2 Borel-Cantelli lemmas, tail σ-algebras, and Kolmogorov's zero-one law

In this section some basic results on classes of independent events are established. These will play an important role in proving laws of large numbers in Chapter 8.

Definition 7.2.1: Let $(\Omega, \mathcal{F})$ be a measurable space and $\{A_n\}_{n \geq 1}$ be a sequence of sets in $\mathcal{F}$. Then

$$\limsup_{n\to\infty} A_n \equiv \overline{\lim}\, A_n \equiv \bigcap_{k=1}^{\infty} \Big( \bigcup_{n \geq k} A_n \Big) \qquad (2.1)$$

$$\liminf_{n\to\infty} A_n \equiv \underline{\lim}\, A_n \equiv \bigcup_{k=1}^{\infty} \Big( \bigcap_{n \geq k} A_n \Big). \qquad (2.2)$$

Proposition 7.2.1: Both $\overline{\lim}\, A_n$ and $\underline{\lim}\, A_n$ are in $\mathcal{F}$ and

$$\overline{\lim}\, A_n = \{\omega : \omega \in A_n \text{ for infinitely many } n\},$$
$$\underline{\lim}\, A_n = \{\omega : \omega \in A_n \text{ for all but a finite number of } n\}.$$

Proof: Since $\{A_n\}_{n \geq 1} \subset \mathcal{F}$ and $\mathcal{F}$ is a σ-algebra, $B_k = \bigcup_{n \geq k} A_n \in \mathcal{F}$ for each $k \in \mathbb{N}$ and hence $\overline{\lim}\, A_n \equiv \bigcap_{k=1}^{\infty} B_k \in \mathcal{F}$. Next,

$\omega \in \overline{\lim}\, A_n$
$\iff \omega \in B_k$ for all $k = 1, 2, \ldots$
$\iff$ for each $k$, there exists $n_k \geq k$ such that $\omega \in A_{n_k}$
$\iff \omega \in A_n$ for infinitely many $n$.

The proof for $\underline{\lim}\, A_n$ is similar.

In probability theory, $\overline{\lim}\, A_n$ is referred to as the event that "$A_n$ happens infinitely often (i.o.)" and $\underline{\lim}\, A_n$ as the event that "all but finitely many $A_n$'s happen."

Example 7.2.1: Let $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, and let

$$A_n = \begin{cases} [0, \frac{1}{n}] & \text{for } n \text{ odd} \\ [1 - \frac{1}{n}, 1] & \text{for } n \text{ even.} \end{cases}$$

Then $\overline{\lim}\, A_n = \{0, 1\}$ and $\underline{\lim}\, A_n = \emptyset$.

The following result on the probabilities of $\overline{\lim}\, A_n$ and $\underline{\lim}\, A_n$ is very useful in probability theory.

Theorem 7.2.2: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{A_n\}_{n \geq 1}$ be a sequence of events in $\mathcal{F}$. Then

(a) (The first Borel-Cantelli lemma). If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(\overline{\lim}\, A_n) = 0$.

(b) (The second Borel-Cantelli lemma). If $\sum_{n=1}^{\infty} P(A_n) = \infty$ and $\{A_n\}_{n \geq 1}$ are pairwise independent, then $P(\overline{\lim}\, A_n) = 1$.

Remark 7.2.1: This result is also called a zero-one law, as it asserts that for pairwise independent events $\{A_n\}_{n \geq 1}$, $P(\overline{\lim}\, A_n) = 0$ or $1$ according as $\sum_{n=1}^{\infty} P(A_n) < \infty$ or $= \infty$.
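A quick numerical illustration of the dichotomy (not from the text): take independent events with $P(A_n) = n^{-2}$ (summable) versus $P(A_n) = n^{-1}$ (non-summable). In simulation, only finitely many events of the first kind occur and the last occurrence comes early, while occurrences of the second kind keep accumulating all the way out, in line with parts (a) and (b).

import numpy as np

rng = np.random.default_rng(5)
N = 100_000
n = np.arange(1, N + 1)

def count_occurrences(probs, rng):
    """Simulate independent events A_n with P(A_n) = probs[n-1]; report the total number
    of occurrences and the index of the last occurrence (0 if none)."""
    hits = rng.uniform(size=probs.size) < probs
    idx = np.nonzero(hits)[0]
    return hits.sum(), (idx[-1] + 1 if idx.size else 0)

total_sq, last_sq = count_occurrences(1.0 / n**2, rng)   # sum P(A_n) < infinity
total_hm, last_hm = count_occurrences(1.0 / n, rng)      # sum P(A_n) = infinity
print(total_sq, last_sq)   # few occurrences, last one early (only finitely many A_n occur)
print(total_hm, last_hm)   # roughly log N occurrences, last one near N (A_n occurs i.o.)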


Proof:

(a) Let $Z_n \equiv \sum_{j=1}^{n} I_{A_j}$. Then $Z_n \uparrow Z \equiv \sum_{j=1}^{\infty} I_{A_j}$ and by the MCT, $E Z_n \equiv \sum_{j=1}^{n} P(A_j) \uparrow EZ$. Thus, $\sum_{j=1}^{\infty} P(A_j) < \infty \Rightarrow EZ < \infty \Rightarrow Z < \infty$ w.p. 1 $\Rightarrow P(Z = \infty) = 0$. But the event $\overline{\lim}\, A_n = \{Z = \infty\}$ and so (a) follows.

(b) Without loss of generality, assume $P(A_j) > 0$ for some $j$. Let $J_n = \frac{Z_n}{E Z_n}$ for $n \geq j$, where $Z_n$ is as above. Then $E J_n = 1$ and, by the pairwise independence of $\{A_n\}_{n \geq 1}$, the variance of $J_n$ is

$$\mathrm{Var}(J_n) = \frac{\sum_{j=1}^{n} P(A_j)(1 - P(A_j))}{(E Z_n)^2} \leq \frac{1}{E Z_n}.$$

If $\sum_{j=1}^{\infty} P(A_j) = \infty$, then $E Z_n = \sum_{j=1}^{n} P(A_j) \uparrow \infty$ by the MCT. Thus $E J_n \equiv 1$ and $\mathrm{Var}(J_n) \to 0$ as $n \to \infty$. By Chebychev's inequality, for all $\varepsilon > 0$,

$$P(|J_n - 1| > \varepsilon) \leq \frac{\mathrm{Var}(J_n)}{\varepsilon^2} \to 0 \quad \text{as } n \to \infty.$$

Thus, $J_n \to 1$ in probability and hence there exists a subsequence $\{n_k\}_{k \geq 1}$ such that $J_{n_k} \to 1$ w.p. 1 (cf. Theorem 2.5.2). Since $E Z_{n_k} \uparrow \infty$, this implies that $Z_{n_k} \to \infty$ w.p. 1. But $\{Z_n\}_{n \geq 1}$ is nondecreasing in $n$ and hence $Z_n \uparrow \infty$ w.p. 1. Now since $\overline{\lim}\, A_n = \{Z = \infty\}$, it follows that $P(\overline{\lim}\, A_n) = P(Z = \infty) = 1$.

Proposition 7.2.3: Let $\{X_n\}_{n \geq 1}$ be a sequence of random variables on some probability space $(\Omega, \mathcal{F}, P)$.

(a) If $\sum_{n=1}^{\infty} P(|X_n| > \varepsilon) < \infty$ for each $\varepsilon > 0$, then $P(\lim_{n\to\infty} X_n = 0) = 1$.

(b) If $\{X_n\}_{n \geq 1}$ are pairwise independent and $P(\lim_{n\to\infty} X_n = 0) = 1$, then $\sum_{n=1}^{\infty} P(|X_n| > \varepsilon) < \infty$ for each $\varepsilon > 0$.

Proof:

(a) Fix $\varepsilon > 0$. Let $A_n = \{|X_n| > \varepsilon\}$, $n \geq 1$. Then $\sum_{n=1}^{\infty} P(A_n) < \infty \Rightarrow P(\overline{\lim}\, A_n) = 0$ by the first Borel-Cantelli lemma (Theorem 7.2.2 (a)). But

$(\overline{\lim}\, A_n)^c = \{\omega :$ there exists $n(\omega) < \infty$ such that for all $n \geq n(\omega)$, $\omega \notin A_n\}$
$= \{\omega :$ there exists $n(\omega) < \infty$ such that $|X_n(\omega)| \leq \varepsilon$ for all $n \geq n(\omega)\}$
$= B_\varepsilon$, say.

Thus, $\sum_{n=1}^{\infty} P(A_n) < \infty \Rightarrow P(B_\varepsilon) = 1$. Let $B = \bigcap_{r=1}^{\infty} B_{1/r}$. Now note that

$$\Big\{ \omega : \lim_{n\to\infty} |X_n(\omega)| = 0 \Big\} = \bigcap_{r=1}^{\infty} B_{1/r}.$$

Since $P(B^c) \leq \sum_{r=1}^{\infty} P(B_{1/r}^c) = 0$, $P(B) = 1$.

(b) Let $\{X_n\}_{n \geq 1}$ be pairwise independent and suppose $\sum_{n=1}^{\infty} P(|X_n| > \varepsilon_0) = \infty$ for some $\varepsilon_0 > 0$. Let $A_n = \{|X_n| > \varepsilon_0\}$. Since $\{X_n\}_{n \geq 1}$ are pairwise independent, so are $\{A_n\}_{n \geq 1}$. By the second Borel-Cantelli lemma,

$$P(\overline{\lim}\, A_n) = 1.$$

But $\omega \in \overline{\lim}\, A_n \Rightarrow \limsup_{n\to\infty} |X_n| \geq \varepsilon_0$ and hence $P(\lim_{n\to\infty} |X_n| = 0) = 0$. This contradicts the hypothesis that $P(\lim_{n\to\infty} X_n = 0) = 1$.

Definition 7.2.2: The tail σ-algebra of a sequence of random variables $\{X_n\}_{n \geq 1}$ on a probability space $(\Omega, \mathcal{F}, P)$ is

$$\mathcal{T} = \bigcap_{n=1}^{\infty} \sigma\langle X_j : j \geq n \rangle$$

and any $A \in \mathcal{T}$ is called a tail event. Further, any $\mathcal{T}$-measurable random variable is called a tail random variable (w.r.t. $\{X_n\}_{n \geq 1}$).

Tail events are determined by the behavior of the sequence $\{X_n\}_{n \geq 1}$ for large $n$ and they remain unchanged if any finite subcollection of the $X_n$'s is dropped or replaced by another finite set of random variables. Events such as $\{\limsup_{n\to\infty} X_n < x\}$ or $\{\lim_{n\to\infty} X_n = x\}$, $x \in \mathbb{R}$, belong to $\mathcal{T}$. A remarkable result of Kolmogorov is that for any sequence of independent random variables, any tail event has probability zero or one.

Theorem 7.2.4: (Kolmogorov's 0-1 law). Let $\{X_n\}_{n \geq 1}$ be a sequence of independent random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $\mathcal{T}$ be the tail σ-algebra of $\{X_n\}_{n \geq 1}$. Then $P(A) = 0$ or $1$ for all $A \in \mathcal{T}$.

Remark 7.2.2: Note that in Proposition 7.2.3, the event $A \equiv \{\lim_{n\to\infty} X_n = 0\}$ belongs to $\mathcal{T}$, and hence, by the above theorem, $P(A) = 0$ or $1$. Thus, proving that $P(A) = 1$ is equivalent to proving $P(A) \neq 0$. Kolmogorov's 0-1 law only restricts the possible values of tail events like $A$ to 0 or 1, while the Borel-Cantelli lemmas (Theorem 7.2.2) provide a tool for ascertaining whether the value is 0 or 1. On the other hand, note that Theorem 7.2.2 requires only pairwise independence of $\{A_n\}_{n \geq 1}$, but Kolmogorov's 0-1 law requires the full independence of the sequence $\{X_n\}_{n \geq 1}$.


Proof: For $n \geq 1$, define the σ-algebras $\mathcal{F}_n$ and $\mathcal{T}_n$ by $\mathcal{F}_n = \sigma\langle X_1, \ldots, X_n \rangle$ and $\mathcal{T}_n = \sigma\langle X_{n+1}, X_{n+2}, \ldots \rangle$. Since $\{X_n : n \geq 1\}$ are independent, $\mathcal{F}_n$ is independent of $\mathcal{T}_n$ for all $n \geq 1$. Since, for each $n$, $\mathcal{T} = \bigcap_{m=n}^{\infty} \mathcal{T}_m$ is a sub-σ-algebra of $\mathcal{T}_n$, this implies $\mathcal{F}_n$ is independent of $\mathcal{T}$ for all $n \geq 1$, and hence $\mathcal{A} \equiv \bigcup_{n=1}^{\infty} \mathcal{F}_n$ is independent of $\mathcal{T}$. It is easy to check that $\mathcal{A}$ is an algebra (and hence is a π-system). Hence, by Proposition 7.1.1, $\sigma\langle \mathcal{A} \rangle$ is independent of $\mathcal{T}$. Since $\mathcal{T}$ is also a sub-σ-algebra of $\sigma\langle \mathcal{A} \rangle = \sigma\langle X_n : n \geq 1 \rangle$, this implies $\mathcal{T}$ is independent of itself. Hence for any $B \in \mathcal{T}$,

$$P(B \cap B) = P(B) \cdot P(B),$$

which implies $P(B) = 0$ or $1$.

Definition 7.2.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $X : \Omega \to \overline{\mathbb{R}}$ be an $\langle \mathcal{F}, \mathcal{B}(\overline{\mathbb{R}}) \rangle$-measurable mapping. (Recall the definition of $\mathcal{B}(\overline{\mathbb{R}})$ from (2.1.4).) Then $X$ is called an extended real-valued random variable or an $\overline{\mathbb{R}}$-valued random variable.

Corollary 7.2.5: Let $\mathcal{T}$ be the tail σ-algebra of a sequence of independent random variables $\{X_n\}_{n \geq 1}$ on $(\Omega, \mathcal{F}, P)$ and let $X$ be a $\langle \mathcal{T}, \mathcal{B}(\overline{\mathbb{R}}) \rangle$-measurable $\overline{\mathbb{R}}$-valued random variable from $\Omega$ to $\overline{\mathbb{R}}$. Then there exists $c \in \overline{\mathbb{R}}$ such that

$$P(X = c) = 1.$$

Proof: If $P(X \leq x) = 0$ for all $x \in \mathbb{R}$, then $P(X = +\infty) = 1$. Hence, suppose that $B \equiv \{x \in \mathbb{R} : P(X \leq x) \neq 0\} \neq \emptyset$. Since $\{X \leq x\} \in \mathcal{T}$ for all $x \in \mathbb{R}$, $P(X \leq x) = 1$ for all $x \in B$. Define $c = \inf\{x : x \in B\}$. Check that $P(X = c) = 1$.

An immediate implication of Corollary 7.2.5 is that for any sequence of independent random variables $\{X_n\}_{n \geq 1}$, the $\overline{\mathbb{R}}$-valued random variables $\limsup_{n\to\infty} X_n$ and $\liminf_{n\to\infty} X_n$ are degenerate, i.e., they are constants w.p. 1.

Example 7.2.2: Let $\{X_n\}_{n \geq 1}$ be a sequence of independent random variables on $(\Omega, \mathcal{F}, P)$ with $E X_n = 0$, $E X_n^2 = 1$ for all $n \geq 1$. Let $S_n = X_1 + \cdots + X_n$, $n \geq 1$, and $\Phi(x) = \int_{-\infty}^{x} (\sqrt{2\pi})^{-1} \exp(-y^2/2)\, dy$, $x \in \mathbb{R}$. If $P(S_n \leq \sqrt{n}\, x) \to \Phi(x)$ for all $x \in \mathbb{R}$, then

$$\limsup_{n\to\infty} \frac{S_n}{\sqrt{n}} = +\infty \quad \text{a.s.} \qquad (2.3)$$

To show this, let $S = \limsup_{n\to\infty} S_n/\sqrt{n}$. First it will be shown that $S$ is $\langle \mathcal{T}, \mathcal{B}(\overline{\mathbb{R}}) \rangle$-measurable. For any $m \geq 1$, define the variables $T_{m,n} = (X_{m+1} + \cdots + X_n)/\sqrt{n}$ and $S_{m,n} = (X_1 + \cdots + X_m)/\sqrt{n}$, $n > m$. Note that for any fixed $m \geq 1$, $T_{m,n}$ is $\sigma\langle X_{m+1}, \ldots \rangle$-measurable and $S_{m,n}(\omega) \to 0$ as $n \to \infty$ for all $\omega \in \Omega$. Hence, for any $m \geq 1$,

$$S = \limsup_{n\to\infty} (S_{m,n} + T_{m,n}) = \limsup_{n\to\infty} T_{m,n}$$

is $\sigma\langle X_{m+1}, X_{m+2}, \ldots \rangle$-measurable. Thus, $S$ is measurable with respect to $\mathcal{T} = \bigcap_{m=1}^{\infty} \sigma\langle X_{m+1}, X_{m+2}, \ldots \rangle$. Hence, by Theorem 7.2.4, $P(S = +\infty) \in \{0, 1\}$.

If possible, now suppose that $P(S = +\infty) = 0$. Then, by Corollary 7.2.5, there exists $c \in [-\infty, \infty)$ such that $P(S = c) = 1$. Let $A_n = \{S_n > \sqrt{n}\, x\}$, $n \geq 1$, with $x = c + 1$. Then,

$$0 < 1 - \Phi(x) = \lim_{n\to\infty} P(A_n) \leq \lim_{n\to\infty} P\Big( \bigcup_{m \geq n} A_m \Big) = P\Big( \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m \Big) = P\Big( \frac{S_n}{\sqrt{n}} > x \text{ i.o.} \Big) \leq P(S \geq c + 1) = 0,$$

a contradiction. This shows that $P(S = +\infty)$ must be 1. Also see Problem 7.16.

Remark 7.2.3: It will be shown in Chapter 11 that if $\{X_i\}_{i \geq 1}$ are independent and identically distributed (iid) random variables with $E X_1 = 0$ and $E X_1^2 = 1$, then

$$P\Big( \frac{S_n}{\sqrt{n}} \leq x \Big) \to \Phi(x) \quad \text{for all } x \text{ in } \mathbb{R}.$$

(This is known as the central limit theorem.) Indeed, a stronger result known as the law of the iterated logarithm holds, which says that for such $\{X_i\}_{i \geq 1}$,

$$\limsup_{n\to\infty} \frac{S_n}{\sqrt{2n \log\log n}} = +1, \quad \text{w.p. 1.}$$

7.3 Problems

7.1 Give an example of three events $A_1, A_2, A_3$ on some probability space such that they are pairwise independent but not independent.

(Hint: Consider iid random variables $X_1, X_2, X_3$ with $P(X_1 = 1) = \frac{1}{2} = P(X_1 = 0)$ and the events $A_1 = \{X_1 = X_2\}$, $A_2 = \{X_2 = X_3\}$, $A_3 = \{X_3 = X_1\}$.)

7.2 Let $\{X_\alpha : \alpha \in A\}$ be a collection of independent random variables on some probability space $(\Omega, \mathcal{F}, P)$. For any subset $B \subset A$, let $X_B \equiv \{X_\alpha : \alpha \in B\}$.

(a) Let $B$ be a nonempty proper subset of $A$. Show that the collections $X_B$ and $X_{B^c}$ are independent, i.e., the σ-algebras $\sigma\langle X_B \rangle$ and $\sigma\langle X_{B^c} \rangle$ are independent w.r.t. $P$.

(b) Let $\{B_\gamma : \gamma \in \Gamma\}$ be a partition of $A$ by nonempty proper subsets $B_\gamma$. Show that the family of σ-algebras $\{\sigma\langle X_{B_\gamma} \rangle : \gamma \in \Gamma\}$ is independent w.r.t. $P$.

7.3 Let $X_1, X_2$ be iid standard exponential random variables, i.e.,

$$P(X_1 \in A) = \int_{A \cap (0,\infty)} e^{-x}\, dx, \quad A \in \mathcal{B}(\mathbb{R}).$$

Let $Y_1 = \min(X_1, X_2)$ and $Y_2 = \max(X_1, X_2) - Y_1$. Show that $Y_1$ and $Y_2$ are independent. Generalize this to the case of three iid standard exponential random variables.

7.4 Let $\Omega = (0, 1)$, $\mathcal{F} = \mathcal{B}((0, 1))$, the Borel σ-algebra on $(0, 1)$, and $P$ be the Lebesgue measure on $(0, 1)$. For each $\omega \in (0, 1)$, let $\omega = \sum_{i=1}^{\infty} \frac{X_i(\omega)}{2^i}$ be the nonterminating binary expansion of $\omega$.

(a) Show that $\{X_i\}_{i \geq 1}$ are iid Bernoulli $(\frac{1}{2})$ random variables, i.e., $P(X_1 = 0) = \frac{1}{2} = P(X_1 = 1)$.

(Hint: Let $s_i \in \{0, 1\}$, $i = 1, 2, \ldots, k$, $k \in \mathbb{N}$. Show that the set $\{\omega : 0 < \omega < 1, \ X_i(\omega) = s_i, \ 1 \leq i \leq k\}$ is an interval of length $2^{-k}$.)

(b) Show that

$$Y_1 = \sum_{i=1}^{\infty} \frac{X_{2i-1}}{2^i} \qquad (3.1)$$
$$Y_2 = \sum_{i=1}^{\infty} \frac{X_{2i}}{2^i} \qquad (3.2)$$

are independent Uniform $(0, 1)$ random variables.

(c) Using the fact that the set $\mathbb{N} \times \mathbb{N}$ of lattice points $(m, n)$ is in one to one correspondence with $\mathbb{N}$ itself, construct a sequence $\{Y_i\}_{i \geq 1}$ of iid Uniform $(0, 1)$ random variables such that for each $j$, $Y_j$ is a function of $\{X_i\}_{i \geq 1}$.

(d) For any cdf $F$, show that the random variable $X(\omega) \equiv F^{-1}(\omega)$ has cdf $F$, where

$$F^{-1}(u) = \inf\{x : F(x) \geq u\} \quad \text{for } 0 < u < 1. \qquad (3.3)$$

(e) Let $\{F_i\}_{i \geq 1}$ be a sequence of cdfs on $\mathbb{R}$. Using (c), construct a sequence $\{Z_i\}_{i \geq 1}$ of independent random variables on $(\Omega, \mathcal{F}, P)$ such that $Z_i$ has cdf $F_i$, $i \geq 1$.

(f) Show that the cdf of the random variable $W \equiv \sum_{i=1}^{\infty} \frac{2 X_i}{3^i}$ is the Cantor function (cf. Section 4.5).

(g) Let $p > 1$ be a positive integer. For each $\omega \in (0, 1)$ let $\omega \equiv \sum_{i=1}^{\infty} \frac{V_i(\omega)}{p^i}$ be the nonterminating $p$-nary expansion of $\omega$. Show that $\{V_i\}_{i \geq 1}$ are iid and determine the distribution of $V_1$.
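A hedged numerical sketch of parts (a)-(b) (an illustration, not a solution): extract the first binary digits $X_i(\omega)$ of a Uniform(0,1) draw and reassemble the odd- and even-indexed digits into $Y_1$ and $Y_2$; empirically the digits behave like fair coin flips and $Y_1, Y_2$ look uniform and uncorrelated, consistent with the claimed independence.

import numpy as np

rng = np.random.default_rng(6)

def binary_digits(omega, k=52):
    """First k digits X_1(omega), ..., X_k(omega) of the binary expansion of omega in (0,1)."""
    digits = []
    for _ in range(k):
        omega *= 2
        d = int(omega)
        digits.append(d)
        omega -= d
    return np.array(digits)

def split_Y1_Y2(omega, k=52):
    """Y_1 from the odd-indexed digits, Y_2 from the even-indexed digits, as in (3.1)-(3.2)."""
    X = binary_digits(omega, k)
    Y1 = sum(x / 2**i for i, x in enumerate(X[0::2], start=1))
    Y2 = sum(x / 2**i for i, x in enumerate(X[1::2], start=1))
    return Y1, Y2

samples = np.array([split_Y1_Y2(u) for u in rng.uniform(size=20_000)])
print(samples.mean(axis=0))            # both close to 1/2, the Uniform(0,1) mean
print(np.corrcoef(samples.T)[0, 1])    # close to 0, consistent with independence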

7.5 Let $\{X_i\}_{i \geq 1}$ be a Markov chain with state space $S = \{0, 1\}$ and transition probability matrix

$$Q = \begin{pmatrix} q_0 & p_0 \\ p_1 & q_1 \end{pmatrix} \quad \text{where } p_i = 1 - q_i, \ 0 < q_i < 1, \ i = 0, 1.$$

Let $\tau_1 = \min\{j : X_j = 0\}$ and $\tau_{k+1} = \min\{j : j > \tau_k, \ X_j = 0\}$, $k = 1, 2, \ldots$. Note that $\tau_k$ is the time of the $k$th visit to the state 0.

(a) Show that $\{\tau_{k+1} - \tau_k : k \geq 1\}$ are iid random variables and independent of $\tau_1$.

(b) Show that

$$P_i(\tau_1 < \infty) = 1 \text{ and hence } P_i(\tau_k < \infty) = 1$$

for all $k \geq 2$, $i = 0, 1$, where $P_i$ denotes the probability distribution with $X_1 = i$ w.p. 1.

(Hint: Show that $\sum_{k=1}^{\infty} P(\tau_1 > k \mid X_1 = i) < \infty$ for $i = 0, 1$ and use the Borel-Cantelli lemma.)

(c) Show also that $E_i(e^{\theta_0 \tau_1}) < \infty$ for some $\theta_0 > 0$, $i = 0, 1$, where $E_i$ denotes the expectation under $P_i$.

7.6 Let $X_1$ and $X_2$ be independent random variables.

(a) Show that for any $p > 0$,

$$E|X_1 + X_2|^p < \infty \quad \text{iff} \quad E|X_1|^p < \infty \text{ and } E|X_2|^p < \infty.$$

Show that this is false if $X_1$ and $X_2$ are not independent.

(Hint: Use Fubini's theorem to conclude that $E|X_1 + X_2|^p < \infty$ implies that $E|X_1 + x_2|^p < \infty$ for some $x_2$ and hence $E|X_1|^p < \infty$.)

(b) Show that if $E(X_1^2 + X_2^2) < \infty$, then

$$\mathrm{Var}(X_1 + X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2). \qquad (3.4)$$

Show by an example that (3.4) need not imply the independence of $X_1$ and $X_2$. Show also that if $X_1$ and $X_2$ take only two values each and (3.4) holds, then $X_1$ and $X_2$ are independent.

7.7 Let $X_1$ and $X_2$ be two random variables on a probability space $(\Omega, \mathcal{F}, P)$.

(a) Show that, if

$$P(X_1 \in (a_1, b_1), \ X_2 \in (a_2, b_2)) = P(X_1 \in (a_1, b_1))\, P(X_2 \in (a_2, b_2)) \qquad (3.5)$$

for all $a_1, b_1, a_2, b_2$ in a dense set $D$ in $\mathbb{R}$, then $X_1$ and $X_2$ are independent.

(Hint: Show that (3.5) implies that the joint cdf of $(X_1, X_2)$ is the product of the marginal cdfs of $X_1$ and $X_2$ and use Corollary 7.1.2.)

(b) Let $f_i : \mathbb{R} \to \mathbb{R}$, $i = 1, 2$, be two one-one functions such that both $f_i$ and $f_i^{-1}$ are Borel measurable, $i = 1, 2$. Show that $X_1$ and $X_2$ are independent iff $f_1(X_1)$ and $f_2(X_2)$ are independent. Conclude that $X_1$ and $X_2$ are independent iff $e^{X_1}$ and $e^{X_2}$ are independent.

7.8 (a) Let $X_1$ and $X_2$ be two independent bounded random variables. Show that

$$E(p_1(X_1)\, p_2(X_2)) = (E\, p_1(X_1))(E\, p_2(X_2)) \qquad (3.6)$$

where $p_1(\cdot)$ and $p_2(\cdot)$ are polynomials.

(b) Show that if $X_1$ and $X_2$ are bounded random variables and (3.6) holds for all polynomials $p_1(\cdot)$ and $p_2(\cdot)$, then $X_1$ and $X_2$ are independent.

(Hint: Use the facts that (i) continuous functions on a bounded closed interval $[a, b]$ can be approximated uniformly by polynomials, and (ii) for any interval $(c, d) \subset [a, b]$, any random variable $X$ and $\varepsilon > 0$, there exists a continuous function $f$ on $[a, b]$ such that $E|f(X) - I_{(c,d)}(X)| < \varepsilon$, provided $P(X = c \text{ or } d) = 0$.)

7.9 Let $\{X_n\}_{n \geq 1}$ be a sequence of iid random variables on a probability space $(\Omega, \mathcal{F}, P)$. Let $R = R(\omega)$ be the radius of convergence of the power series $\sum_{n=1}^{\infty} X_n r^n$.

(a) Show that $R$ is a tail random variable.

(Hint: Note that $R = \dfrac{1}{\limsup_{n\to\infty} |X_n|^{1/n}}$.)

(b) Show that if $E(\log|X_1|)^+ = \infty$, then $R = 0$ w.p. 1, and if $E(\log|X_1|)^+ < \infty$, then $R \geq 1$ w.p. 1.

(Hint: Apply the Borel-Cantelli lemmas to $A_n = \{|X_n| > \lambda^n\}$ for each $\lambda > 1$.)

7.10 Let $\{A_n\}_{n \geq 1}$ be a sequence of events in $(\Omega, \mathcal{F}, P)$ such that

$$\sum_{n=1}^{\infty} P(A_n \cap A_{n+1}^c) < \infty$$

and $\lim_{n\to\infty} P(A_n) = 0$. Show that

$$P\big(\limsup_{n\to\infty} A_n\big) = 0.$$

Show also that $\lim_{n\to\infty} P(A_n) = 0$ can be replaced by $\lim_{n\to\infty} P\big(\bigcap_{j \geq n} A_j\big) = 0$.

(Hint: Let $B_n = A_n \cap A_{n+1}^c$, $n \geq 1$, $B = \overline{\lim}\, B_n$, $A = \overline{\lim}\, A_n$. Show that $A \cap B^c \subset \underline{\lim}\, A_n$.)

7.11 For any nonnegative random variable $X$, show that $E|X| < \infty$ iff $\sum_{n=1}^{\infty} P(|X| > \varepsilon n) < \infty$ for every $\varepsilon > 0$.

7.12 Let $\{X_i\}_{i \geq 1}$ be a sequence of pairwise independent and identically distributed random variables.

(a) Show that $\lim_{n\to\infty} \frac{X_n}{n} = 0$ w.p. 1 iff $E|X_1| < \infty$.

(Hint: $E|X_1| < \infty \iff \sum_{n=1}^{\infty} P(|X_n| > \varepsilon n) < \infty$ for all $\varepsilon > 0$.)

(b) Show that $E(\log|X_1|)^+ < \infty$ iff $(|X_n|)^{1/n} \to 1$ w.p. 1.

7.13 Let $\{X_i\}_{i \geq 1}$ be a sequence of identically distributed random variables and let $M_n = \max\{|X_j| : 1 \leq j \leq n\}$.

(a) If $E|X_1|^\alpha < \infty$ for some $\alpha \in (0, \infty)$, then show that

$$\frac{M_n}{n^{1/\alpha}} \to 0 \quad \text{w.p. 1.} \qquad (3.7)$$

(Hint: Fix $\varepsilon > 0$. Let $A_n = \{|X_n| > \varepsilon n^{1/\alpha}\}$. Apply the first Borel-Cantelli lemma.)

(b) Show that if $\{X_i\}_{i \geq 1}$ are iid satisfying (3.7) for some $\alpha > 0$, then $E|X_1|^\alpha < \infty$.

(Hint: Apply the second Borel-Cantelli lemma.)

7.14 Let $X_1$ and $X_2$ be independent random variables with distributions $\mu_1$ and $\mu_2$. Let $Y = X_1 + X_2$.

(a) Show that the distribution $\mu$ of $Y$ is the convolution $\mu_1 * \mu_2$ as defined by

$$(\mu_1 * \mu_2)(A) = \int_{\mathbb{R}} \mu_1(A - x)\, \mu_2(dx)$$

(cf. Problem 5.12).

(b) Show that if $X_1$ has a continuous distribution then so does $Y$.

(c) Show that if $X_1$ has an absolutely continuous distribution then so does $Y$ and that the density function of $Y$ is given by

$$\Big( \frac{d\mu}{dm} \Big)(x) \equiv f_Y(x) = \int f_{X_1}(x - u)\, \mu_2(du)$$

where $f_{X_1}(x) \equiv \big( \frac{d\mu_1}{dm} \big)(x)$ is the probability density of $X_1$.

(d) Show that $Y$ has a discrete distribution iff both $X_1$ and $X_2$ are discrete.

7.15 (AR(1) series). Let $\{X_n\}_{n \geq 0}$ be a sequence of random variables such that for some $\rho \in \mathbb{R}$,

$$X_{n+1} = \rho X_n + \varepsilon_{n+1}, \quad n \geq 0,$$

where $\{\varepsilon_n\}_{n \geq 1}$ are iid and independent of $X_0$.

(a) Show that if $|\rho| < 1$ and $E(\log|\varepsilon_1|)^+ < \infty$, then

$$\tilde{X}_n \equiv \sum_{j=0}^{n} \rho^j \varepsilon_{j+1} \quad \text{converges w.p. 1}$$

to a limit $X_\infty$, say.

(b) Show that under the hypothesis of (a), for any bounded continuous function $h : \mathbb{R} \to \mathbb{R}$ and for any distribution of $X_0$,

$$E h(X_n) \to E h(X_\infty).$$

(Hint: Show that for each $n \geq 1$, $X_n - \rho^n X_0$ and $\tilde{X}_{n-1}$ have the same distribution.)
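A small simulation sketch of parts (a)-(b) (illustrative only, with assumed parameters): with $|\rho| < 1$ and, say, $N(0,1)$ innovations, the recursion forgets its initial condition and the distribution of $X_n$ stabilizes; here $E h(X_n)$ with $h(x) = \cos x$ settles to essentially the same value for two very different starting points.

import numpy as np

rng = np.random.default_rng(7)
rho, n_steps, n_paths = 0.8, 200, 50_000   # assumed parameters for the illustration

def ar1_final(x0, rho, n_steps, n_paths, rng):
    """Run X_{n+1} = rho * X_n + eps_{n+1} with iid N(0,1) innovations; return X_n."""
    x = np.full(n_paths, float(x0))
    for _ in range(n_steps):
        x = rho * x + rng.standard_normal(n_paths)
    return x

h = np.cos
for x0 in (0.0, 25.0):                      # two very different initial conditions
    xn = ar1_final(x0, rho, n_steps, n_paths, rng)
    print(x0, h(xn).mean())                 # E h(X_n) is nearly the same in both cases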

7.16 Establish the following generalization of Example 7.2.2. Let $\{X_n\}_{n \geq 1}$ be a sequence of independent random variables on some probability space $(\Omega, \mathcal{F}, P)$. Suppose there exist sequences $\{a_n\}_{n \geq 1}$, $\{x_n\}_{n \geq 1}$ such that $a_n \uparrow \infty$, $x_n \uparrow \infty$, and for each $k < \infty$, $\lim_n P(S_n \leq a_n x_k) \equiv F(x_k)$ exists and is $< 1$. Show that $\limsup_{n\to\infty} \frac{S_n}{a_n} = +\infty$ a.s.

7.17 (a) Let $\{X_i\}_{i=1}^{n}$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $P \circ (X_1, X_2, \ldots, X_n)^{-1}(\cdot)$ be dominated by the product measure $\mu \times \mu \times \cdots \times \mu$, where $\mu$ is a σ-finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with Radon-Nikodym derivative $f(x_1, x_2, \ldots, x_n)$. Show that $\{X_i\}_{i=1}^{n}$ are independent w.r.t. $P$ iff $f(x_1, x_2, \ldots, x_n) \equiv \prod_{i=1}^{n} h_i(x_i)$ for all $(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$, where for each $i$, $h_i : \mathbb{R} \to \mathbb{R}$ is Borel measurable.

(b) Use (a) to show that if $(X_1, X_2)$ has an absolutely continuous distribution with density $f(x_1, x_2)$, then $X_1$ and $X_2$ are independent iff

$$f(x_1, x_2) = f_1(x_1) f_2(x_2)$$

where $f_i(\cdot)$ is the density of $X_i$.

(c) Using (a) or otherwise, conclude that if $X_i$, $i = 1, 2$, are both discrete random variables, then $X_1$ and $X_2$ are independent iff

$$P(X_1 = a, X_2 = b) = P(X_1 = a)\, P(X_2 = b)$$

for all $a$ and $b$.

7.18 Let $\{X_n\}_{n \geq 1}$ be a sequence of independent random variables such that for $n \geq 1$, $P(X_n = 1) = \frac{1}{n} = 1 - P(X_n = 0)$. Show that $X_n \to_p 0$ but not w.p. 1.

7.19 Let $(\Omega, \mathcal{F}, P)$ be a probability space.

(a) Suppose there exist events $A_1, A_2, \ldots, A_k$ that are independent with $0 < P(A_i) < 1$, $i = 1, 2, \ldots, k$. Show that $|\Omega| \geq 2^k$, where for any set $A$, $|A|$ is the number of elements in $A$.

(b) Let $\{X_i\}_{i=1}^{k}$ be independent random variables such that $X_i$ takes $n_i$ distinct values with positive probability. Show that $|\Omega| \geq \prod_{i=1}^{k} n_i$.

(c) Show that there exists a probability space $(\Omega, \mathcal{F}, P)$ with $|\Omega| = 2^k$ and $k$ independent events $A_1, A_2, \ldots, A_k$ in $\mathcal{F}$ such that $0 < P(A_i) < 1$, $i = 1, 2, \ldots, k$.

7.20 (a) Let $\Omega \equiv \{(x_1, x_2) : x_1, x_2 \in \mathbb{R}, \ x_1^2 + x_2^2 \leq 1\}$ be the unit disc in $\mathbb{R}^2$. Let $\mathcal{F} \equiv \mathcal{B}(\Omega)$, the Borel σ-algebra on $\Omega$, and $P$ be the normalized Lebesgue measure, i.e., $P(A) \equiv \frac{m(A)}{\pi}$, $A \in \mathcal{F}$. For $\omega = (x_1, x_2)$ let

$$X_1(\omega) = x_1, \quad X_2(\omega) = x_2,$$

and $(R(\omega), \theta(\omega))$ be the polar representation of $\omega$. Show that the random variables $R$ and $\theta$ are independent but $X_1$ and $X_2$ are not.

(b) Formulate and establish an extension of the above to the unit sphere in $\mathbb{R}^3$.

7.21 Let $X_1, X_2, X_3$ be iid random variables such that $P(X_1 = x) = 0$ for all $x \in \mathbb{R}$.

(a) Show that for any permutation $\sigma$ of $(1, 2, 3)$,

$$P\big(X_{\sigma(1)} > X_{\sigma(2)} > X_{\sigma(3)}\big) = \frac{1}{3!}.$$

(b) Show that for any $i = 1, 2, 3$,

$$P\Big(X_i = \max_{1 \leq j \leq 3} X_j\Big) = \frac{1}{3}.$$

(c) State and prove a generalization of (a) and (b) to random variables $\{X_i : 1 \leq i \leq n\}$ such that the joint distribution of $(X_1, X_2, \ldots, X_n)$ is the same as that of $(X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)})$ for any permutation $\sigma$ of $\{1, 2, \ldots, n\}$ and $P(X_1 = x) = 0$ for all $x \in \mathbb{R}$.

7.22 Let f , g : R → R be monotone nondecreasing. Show that for anyrandom variable X

Ef(X)g(X) ≥ Ef(X)Eg(X)

provided all the expectations exist.

(Hint: Let Y be independent of X with same distribution. Note thatZ =

(f(X)− f(Y )

)(g(X)− g(Y )

)≥ 0 w.p. 1.)

7.23 Let X1, X2, . . . , Xn be random variables on some probability space(Ω,F , P ). Show that if P (X1, X2, . . . , Xn)−1(·) mn, the Lebesguemeasure on Rn then for each i, PX−1

i (·) m. Give an ex-ample to show that the converse is not true. Show also that if

Page 248: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

7.3 Problems 235

P (X1, X2, . . . , Xn)−1(·) mn then X1, X2, . . . , Xn are indepen-dent iff f(X1,X2,...,Xn)(x1, x2, . . . , xn) =

∏ni=1 fXi

(xi) where the f ’sare the respective pdfs.

Page 249: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8Laws of Large Numbers

When measuring a physical quantity such as the mass of an object, itis commonly believed that the average of several measurements is morereliable than a single one. Similarly, in applications of statistical inferencewhen estimating a population mean µ, a random sample X1, X2, . . . , Xnof size n is drawn from the population, and the sample average Xn ≡1n

∑ni=1 Xi is used as an estimator for the parameter µ. This is based on

the idea that as n gets large, Xn will be close to µ in some suitable sense. Inmany time-evolving physical systems f(t) : 0 ≤ t < ∞, where f(t) is anelement in the phase space S, “time averages” of the form 1

T

∫ T

0 h(f(t))dt(where h is a bounded function on S) converge, as T gets large, to the“space average” of the form

∫Sh(x)π(dx) for some appropriate measure π

on S. The above three are examples of a general phenomenon known as thelaw of large numbers. This chapter is devoted to a systematic developmentof this topic for sequences of independent random variables and also tosome important refinements of the law of large numbers.

8.1 Weak laws of large numbers

Let Znn≥1 be a sequence of random variables on a probability space(Ω,F , P ). Recall that the sequence Znn≥1 is said to converge in proba-bility to a random variable Z if for each ε > 0,

limn→∞ P (|Zn − Z| ≥ ε) = 0. (1.1)

Page 250: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

238 8. Laws of Large Numbers

This is written as Zn −→p Z. The sequence Znn≥1 is said to convergewith probability one or almost surely (a.s.) to Z if there exists a set A in Fsuch that

P (A) = 1 and for all ω in A, limn→∞ Zn(ω) = Z(ω). (1.2)

This is written as Zn → Z w.p. 1 or Zn → Z a.s.

Definition 8.1.1: A sequence Xnn≥1 of random variables on a probabil-ity space (Ω,F , P ) is said to obey the weak law of large numbers (WLLN)with normalizing sequences of real numbers ann≥1 and bnn≥1 if

Sn − an

bn−→p 0 as n →∞ (1.3)

where Sn =∑n

i=1 Xi for n ≥ 1.

The following theorem says that if Xnn≥1 is a sequence of iid randomvariables with EX2

1 < ∞, then it obeys the weak law of large numbers withan = nEX1 and bn = n.

Theorem 8.1.1: Let Xnn≥1 be a sequence of iid random variables suchthat EX2

1 < ∞. Then

Xn ≡X1 + . . . + Xn

n−→p EX1. (1.4)

Proof: By Chebychev’s inequality, for any ε > 0,

P (|Xn − EX1| > ε) ≤ Var(Xn)ε2

=1ε2· σ2

n, (1.5)

where σ2 = Var(X1). Since σ2

nε2 → 0 as n →∞, (1.4) follows.

Corollary 8.1.2: Let Xnn≥1 be a sequence of iid Bernoulli (p) randomvariables, i.e., P (X1 = 1) = p = 1− P (X1 = 0). Let

pn =#i : 1 ≤ i ≤ n, Xi = 1

n, n ≥ 1, (1.6)

where for a finite set A, #A denotes the number of elements in A. Thenpn −→p p.

Proof: Check that EX1 = p and pn = Xn.

This says that one can estimate the probability p of getting a “head” ofa coin by tossing it n times and calculating the proportion of “heads.” Thisis also the basis of public opinion polls. Since the proof of Theorem 8.1.1depended only on Chebychev’s inequality, the following generalization isimmediate (Problem 8.1).

Page 251: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.1 Weak laws of large numbers 239

Theorem 8.1.3: Let Xnn≥1 be a sequence of random variables on aprobability space such that

(i) EX2n < ∞ for all n ≥ 1,

(ii) EXiXj = (EXi)(EXj) for all i = j(i.e., Xnn≥1 are uncorrelated),

(iii) 1n2

∑ni=1 σ2

i → 0 as n →∞, where σ2i = Var(Xi), i ≥ 1.

ThenXn − µn −→p 0 (1.7)

where µn ≡ 1n

∑ni=1 EXi.

Corollary 8.1.4: Let Xnn≥1 satisfy (i) and (ii) of the above theoremand let the sequence σ2

nn≥1 be bounded. Let µn ≡ 1n

∑ni=1 EXi → µ as

n →∞. Then Xn −→p µ.

An Application to Real Analysis

Let f : [0, 1] → R be a continuous function. K. Weierstrass showed thatf can be approximated uniformly over [0, 1] by polynomials. S.N. Bernsteinconstructed a special class of such polynomials. A proof of Bernstein’s resultusing the WLLN (Theorem 8.1.1) is given below.

Theorem 8.1.5: Let f : [0, 1] → R be a continuous function. Let

Bn,f (x) ≡n∑

r=0

f( r

n

)(n

r

)xr(1− x)n−r, 0 ≤ x ≤ 1 (1.8)

be the Bernstein polynomial of order n for the function f . Then

limn→∞ sup

|f(x)−Bn,f (x)| : 0 ≤ x ≤ 1

= 0.

Proof: Since f is continuous on the closed and bounded interval [0, 1], itis uniformly continuous and hence for any ε > 0, there exists a δε > 0 suchthat

|x− y| < δε ⇒ |f(x)− f(y)| < ε. (1.9)

Fix x in [0, 1]. Let Xnn≥1 be a sequence of iid Bernoulli (x) randomvariables. Let pn be as in (1.6). Then Bn,f (x) = Ef(pn). Hence,

|f(x)−Bn,f (x)| ≤ E|f(pn)− f(x)|

= E|f(pn)− f(x)|I(|pn − x| < δε)

+ E

|f(pn)− f(x)|I(|pn − x| ≥ δε)

≤ ε + 2‖f‖P (|pn − x| ≥ δε)

Page 252: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

240 8. Laws of Large Numbers

where ‖f‖ = sup|f(x)| : 0 ≤ x ≤ 1. But by Chebychev’s inequality,

P (|pn − x| ≥ δε) ≤ 1δ2ε

Var(pn)

=x(1− x)

nδ2ε

≤ 14nδ2

ε

for all 0 ≤ x ≤ 1.

Thus, sup|f(x)−Bn,f (x)| : 0 ≤ x ≤ 1

≤ ε + 2‖f‖ 1

4nδ2ε. Letting n →∞

first and then ε ↓ 0 completes the proof.

8.2 Strong laws of large numbers

.Definition 8.2.1: A sequence Xnn≥1 of random variables on a probabil-ity space (Ω,F , P ) is said to obey the strong law of large numbers (SLLN)with normalizing sequences of real numbers ann≥1 and bnn≥1 if

Sn − an

bn→ 0 as n →∞ w.p. 1, (2.1)

where Sn =∑n

i=1 Xi for n ≥ 1.

The following theorem says that if Xnn≥1 is a sequence of iid randomvariables with EX4

1 < ∞, then the strong law of large numbers holds withan = nEX1 and bn = n. This result is referred to as Borel’s SLLN.

Theorem 8.2.1: (Borel’s SLLN ). Let Xnn≥1 be a sequence of iid ran-dom variables such that EX4

1 < ∞. Then

Xn ≡X1 + X2 + . . . + Xn

n→ EX1 w.p. 1. (2.2)

Proof: Fix ε > 0 and let An ≡ |Xn − EX1| ≥ ε, n ≥ 1. To establish(2.2), by Proposition 7.2.3 (a), it suffices to show that

∞∑n=1

P (An) < ∞. (2.3)

By Markov’s inequality

P (An) ≤ E|Xn − EX1|4ε4

. (2.4)

Let Yi = Xi − EX1 for i ≥ 1. Since the Xi’s are independent, it is easy tocheck that

E|Xn − EX1|4 =1n4 E

(( n∑i=1

Yi

)4)

Page 253: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.2 Strong laws of large numbers 241

=1n4

(nEY 4

1 + 3n(n− 1)(EY 21 )2)

= O(n−2).

By (2.4) this implies (2.3).

The following two results are easy consequences of the above theorem.

Corollary 8.2.2: Let Xnn≥1 be a sequence of iid random variables thatare bounded, i.e., there exists a C < ∞ such that P (|X1| ≤ C) = 1. Then

Xn → EX1 w.p. 1.

Corollary 8.2.3: Let Xnn≥1 be a sequence of iid Bernoulli(p) randomvariables. Then

pn ≡#i : 1 ≤ i ≤ n, Xi = 1

n→ p w.p. 1. (2.5)

An application of the above result yields the following theorem on theuniform convergence of the empirical cdf to the true cdf.

Theorem 8.2.4: (Glivenko-Cantelli). Let Xnn≥1 be a sequence of iidrandom variables with a common cdf F (·). Let Fn(·), the empirical cdf basedon X1, X2, . . . , Xn, be defined by

Fn(x) ≡ 1n

n∑j=1

I(Xj ≤ x), x ∈ R. (2.6)

Then,∆n ≡ sup

x|Fn(x)− F (x)| → 0 w.p. 1. (2.7)

Remark 8.2.1: Note that by applying Corollary 8.2.3 to the sequence ofBernoulli random variables Yn ≡ I(Xn ≤ x)n≥1, one may conclude thatFn(x) → F (x) w.p. 1 for each fixed x. So the main thrust of this theoremis the uniform convergence on R of Fn to F w.p. 1. It can be shown that(2.7) holds for sequences Xnn≥1 that are identically distributed and onlypairwise independent. The proof is based on Etemadi’s SLLN (Theorem8.2.7) below.

The proof of Theorem 8.2.4 makes use of the following two lemmas.

Lemma 8.2.5: (Scheffe’s theorem: A generalized version). Let (Ω,F , µ) bea measure space and fnn≥1 and f be nonnegative µ-integrable functionssuch that as n → ∞, (i) fn → f a.e. (µ) and (ii)

∫fndµ →

∫fdµ. Then∫

|f − fn|dµ → 0 as n →∞.

Page 254: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

242 8. Laws of Large Numbers

Proof: See Theorem 2.5.4.

For any bounded monotone function H: R → R, define

H(∞) ≡ limx↑∞

H(x), H(−∞) ≡ limx↓−∞

H(x).

Lemma 8.2.6: (Polya’s theorem). Let Gnn≥1 and G be a collection ofbounded nondecreasing functions on R → R such that G(·) is continuouson R and

Gn(x) → G(x) for all x in D ∪ −∞,+∞,

where D is dense in R. Then ∆n ≡ sup|Gn(x)−G(x)| : x ∈ R → 0. Thatis, Gn → G uniformly on R.

Proof: Fix ε > 0. By the definitions of G(∞) and G(−∞), there exist C1and C2 in D such that

G(C1)−G(−∞) < ε, and G(∞)−G(C2) < ε. (2.8)

Since G(·) is continuous, it is uniformly continuous on [C1, C2] and so thereexists a δ > 0 such that

x, y ∈ [C1, C2], |x− y| < δ ⇒ |G(x)−G(y)| < ε. (2.9)

Also, there exist points a1 = C1 < a2 < . . . < ak = C2, 1 < k < ∞, in Dsuch that

max(ai+1 − ai) : 1 ≤ i ≤ k − 1 < δ.

Let a0 = −∞, ak+1 = ∞. By the convergence of Gn(·) to G(·), on D ∪−∞,∞,

∆n1 ≡ max|Gn(ai)−G(ai)| : 0 ≤ i ≤ k + 1 → 0 (2.10)

as n → ∞. Now note that for any x in [ai, ai+1], 1 ≤ i ≤ k − 1, by themonotonicity of Gn(·) and G(·), and by (2.9) and (2.10),

Gn(x)−G(x) ≤ Gn(ai+1)−G(ai)≤ Gn(ai+1)−G(ai+1) + G(ai+1)−G(ai)≤ ∆n1 + ε,

and similarly,Gn(x)−G(x) ≥ −∆n1 − ε.

Thussup|Gn(x)−G(x)| : a1 ≤ x ≤ ak ≤ ∆n1 + ε. (2.11)

Page 255: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.2 Strong laws of large numbers 243

For x < a1, by (2.8) and (2.10),

|Gn(x)−G(x)| ≤ |Gn(x)−Gn(−∞)|+ |Gn(−∞)−G(−∞)|+ |G(−∞)−G(x)|

≤ (Gn(a1)−Gn(−∞)) + |Gn(−∞)−G(−∞)|+ ε

≤ |Gn(a1)−G(a1)|+ |G(a1)−G(−∞)|+ 2|Gn(−∞)−G(−∞)|+ ε

≤ 3∆n1 + 2ε.

Similarly, for x > ak,

|Gn(x)−G(x)| ≤ 3∆n1 + 2ε.

Combining the above with (2.11) yields

∆n ≤ 3∆n1 + 2ε.

By (2.10),lim sup

n→∞∆n ≤ 2ε,

and ε > 0 being arbitrary, the proof is complete.

Proof of Theorem 8.2.4: First note that ∆n = supx∈Q |Fn(x) − F (x)|and hence, it is a random variable. Let B ≡ bj : j ∈ J be the set of jumpdiscontinuity points of F with the corresponding jump sizes pj : j ∈ J,where J is a subset of N. Let p =

∑j∈J pj .

Note that

Fn(x) =1n

n∑i=1

I(Xi ≤ x)

=1n

n∑i=1

I(Xi ≤ x, Xi ∈ B) +1n

n∑i=1

I(Xi ≤ x, Xi ∈ B)

= Fnd(x) + Fnc(x), say. (2.12)

Then, Fnd(x) =∑

j∈J pnjI(bj ≤ x), where

pnj =#i : 1 ≤ i ≤ n, Xi = bj

n.

Let pn =∑

j∈J pnj = 1n ·#i : 1 ≤ i ≤ n, Xi ∈ B. By Corollary 8.2.3, for

each j ∈ J ,pnj → pj w.p. 1 and pn → p w.p. 1.

Since B is countable, there exists a set A0 in F such that P (A0) = 1 andfor all ω in A0, pnj → pj for all j ∈ J and

∑j∈J pnj = pn → p =

∑j∈J pj .

Page 256: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

244 8. Laws of Large Numbers

By Lemma 8.2.5 (applied with µ being the counting measure on the set J),it follows that for ω in A0, ∑

j∈J

|pnj − pj | → 0. (2.13)

Let Fd(x) ≡∑

j∈J pjI(bj ≤ x), x ∈ R. Then,

supx∈R

|Fnd(x)− Fd(x)| ≤∑j∈J

|pnj − pj |, (2.14)

which → 0 as n →∞ for all ω in A0, by (2.13).Next let,

Fc(x) ≡ F (x)− Fd(x), x ∈ R.

Then, it is easy to check that, Fc(·) is continuous and nondecreasing on R,Fc(−∞) = 0 and Fc(∞) = 1− p.

Again, by Corollary 8.2.3, there exists a set A1 in F such that P (A1) = 1and for all ω in A1,

Fnc(x) → Fc(x)

for all rational x in R and

Fnc(∞) ≡ 1− pn → 1− p = Fc(∞).

Also, Fnc(−∞) = 0 = Fc(−∞). So by Lemma 8.2.6, with D = Q, for ω inA1,

supx∈R

|Fnc(x)− Fc(x)| → 0 as n →∞. (2.15)

Since P (A0 ∩A1) = 1, the theorem follows from (2.12)–(2.15).

Borel’s SLLN for iid random variables requires that E|X1|4 < ∞. Kol-mogorov (1956) improved on this significantly by using his “3-series”theorem and reduced the moment condition to E|X1| < ∞. More re-cently, Etemadi (1981) N. improved this further by assuming only thatthe Xnn≥1 are pairwise independent and identically distributed withE|X1| < ∞. More precisely, he proved the following.

Theorem 8.2.7: (Etemadi’s SLLN ). Let Xnn≥1 be a sequence ofpairwise independent and identically distributed random variables withE|X1| < ∞. Then

Xn → EX1 w.p. 1. (2.16)

Proof: The main steps in the proof are

(I) reduction to the nonnegative case,

Page 257: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.2 Strong laws of large numbers 245

(II) proof of convergence of Yn along a geometrically growing subsequenceusing the Borel-Cantelli lemma and Chebychev’s inequality, whereYn is the average of certain truncated versions of X1, . . . , Xn, andextending the convergence from the geometric subsequence to thefull sequence.

Step I: Since the Xnn≥1 are pairwise independent and identically dis-tributed with E|X1| < ∞, it follows that X+

n n≥1 and X−n n≥1 are both

sequences of pairwise independent and identically distributed nonnegativerandom variables with EX+

1 < ∞ and EX−1 < ∞. Also, since

Xn =1n

n∑i=1

Xi =(

1n

n∑i=1

X+i

)−(

1n

n∑i=1

X−i

),

it is enough to prove the theorem under the assumption that the Xi’s arenonnegative.

Step II: Now let Xi’s be nonnegative and let

Yi = XiI(Xi ≤ i), i ≥ 1.

Then,

∞∑i=1

P (Xi = Yi) =∞∑

i=1

P (Xi > i)

=∞∑

i=1

P (X1 > i) ≤∞∑

i=1

∫ i

i−1P (X1 > t)dt

=∫ ∞

0P (X1 > t)dt

= EX1 < ∞.

Hence, by the Borel-Cantelli lemma,

P (Xi = Yi, infinitely often) = 0.

This implies that w.p. 1, Xi = Yi for all but finitely many i’s and hence, itsuffices to show that

Yn ≡1n

n∑i=1

Yi → EX1 w.p. 1. (2.17)

Next, EYi = E(XiI(Xi ≤ i) = E(X1I(X1 ≤ i)) → EX1 (by the MCT)and hence

EYn =1n

n∑i=1

EYi → EX1 as n →∞. (2.18)

Page 258: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

246 8. Laws of Large Numbers

Suppose for the moment that for each fixed 1 < ρ < ∞, it is shown that

Ynk→ EX1 as k →∞ w.p. 1 (2.19)

where nk = ρk = the greatest integer less than or equal to ρk, k ∈ N.Then, since the Yi’s are nonnegative, for any n and k satisfying ρk ≤ n <ρk+1, one gets

1n

nk∑i=1

Yi ≤ Yn =1n

n∑j=1

Yj ≤1n

nk+1∑i=1

Yi

=⇒ nk

nYnk

≤ Yn ≤nk+1

nYnk+1

=⇒ 1ρ

Ynk≤ Yn ≤ ρYnk+1 .

From (2.19), it follows that

EX1 ≤ lim infn→∞ Yn ≤ lim sup

n→∞Yn ≤ ρEX w.p. 1.

Since this is true for each 1 < ρ < ∞, by taking ρ = 1 + 1r for r = 1, 2, . . .,

it follows that

EX1 ≤ lim infn→∞ Yn ≤ lim sup

n→∞Yn ≤ EX1 w.p. 1,

establishing (2.17).It now remains to prove (2.19). By (2.18), it is enough to show that

Ynk− EYnk

→ 0 as k →∞, w.p. 1. (2.20)

By Chebychev’s inequality and the pairwise independence of the variablesYnn≥1, for any ε > 0,

P (|Ynk− EYnk

| > ε) ≤ 1ε2

Var(Ynk) =

1ε2

1n2

k

nk∑i=1

Var(Yi)

≤ 1ε2

1n2

k

nk∑i=1

EY 2i .

Thus,

∞∑k=1

P (|Ynk− EYnk

| > ε) ≤ 1ε2

∞∑k=1

1n2

k

nk∑i=1

EY 2i

=1ε2

∞∑i=1

EY 2i

( ∑k:nk≥i

1n2

k

). (2.21)

Page 259: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.2 Strong laws of large numbers 247

Since nk = ρk > ρk−1 for 1 < ρ < ∞,

∑k:nk≥i

1n2

k

≤∑

k:ρk−1≥i

1ρ(k−1)2 ≤

C1

i2(2.22)

for some constant C1, 0 < C1 < ∞.Next, since the Xi’s are identically distributed,

∞∑i=1

EY 2i

i2=

∞∑i=1

EX21I(0 ≤ X1 ≤ i)

i2

=∞∑

i=1

i∑j=1

EX21I(j − 1 < X1 ≤ j)

i2

=∞∑

j=1

(EX2

1I(j − 1 < X1 ≤ j)) ∞∑

i=j

i−2

≤∞∑

j=1

(jEX1I(j − 1 < X1 ≤ j)

)· C2j

−1

= C2EX1 < ∞, (2.23)

for some constant C2, 0 < C2 < ∞.Now (2.21)–(2.23) imply that

∞∑k=1

P (|Ynk− EYnk

| > ε) < ∞

for each ε > 0. By the Borel-Cantelli lemma and Proposition 7.2.3 (a),(2.20) follows and the proof is complete.

The following corollary is immediate from the above theorem.

Corollary 8.2.8: (Extension to the vector case). Let Xn =(Xn1, . . . , Xnk)n≥1 be a sequence of k-dimensional random vectors de-fined on a probability space (Ω,F , P ) such that for each i, 1 ≤ i ≤ k,the sequence Xnin≥1 are pairwise independent and identically distributedwith E|X1i| < ∞. Let µ = (EX11, EX12, . . . , EX1k) and f : Rk → R becontinuous at µ. Then

(i) Xn ≡ (Xn1, Xn2, . . . , Xnk) → µ w.p. 1, where Xni = 1n

∑nj=1 Xji for

1 ≤ i ≤ k.

(ii) f(Xn) → f(µ) w.p. 1.

Example 8.2.1: Let (Xn, Yn), n = 1, 2, . . . be a sequence of bivariate iidrandom vectors with EX2

1 < ∞, EY 21 < ∞. Then the sample correlation

Page 260: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

248 8. Laws of Large Numbers

coefficient ρn, defined by,

ρn ≡( 1

n

∑ni=1 XiYi − XnYn

)√( 1

n

∑ni=1(Xi − Xn)2

) ( 1n

∑ni=1(Yi − Yn)2

)is a strongly consistent estimator of the population correlation coefficient ρ,defined by,

ρ =Cov(X1, Y1)√

Var(X1)Var(Y1),

i.e., ρn → ρ w.p. 1. This follows from the above corollary by taking f :R5 → R to be

f(t1, t2, t3, t4, t5) =

⎧⎪⎨⎪⎩

t5 − t1t2√(t3 − t21)(t4 − t22)

, for t3 > t21, t4 > t22

0, otherwise,

and the vector (Xn1, Xn2, . . . , Xn5) to be

Xn1 = Xn, Xn2 = Yn, Xn3 = X2n, Xn4 = Y 2

n , Xn5 = XnYn.

Corollary 8.2.9: (Extension to the pairwise m-dependent case). LetXnn≥1 be a sequence of random variables on a probability space (Ω,F , P )such that for an integer m, 1 ≤ m < ∞, and for each i, 1 ≤ i ≤ m, the ran-dom variables Xi, Xi+m, Xi+2m, . . . are identically distributed and pair-wise independent with E|Xi| < ∞. Then

Xn →1m

m∑i=1

EXi w.p. 1.

The proof is left as an exercise (Problem 8.2). For an application of theabove result to a discussion on normal numbers, see Problem 8.15.

Example 8.2.2: (IID Monte Carlo). Let (S,S, π) be a probability space,f ∈ L1(S,S, π) and λ =

∫Sfdπ. Let Xnn≥1 be a sequence of iid S-

valued random variables with distribution π. Then, the IID Monte Carloapproximation to λ is defined as

λn ≡1n

n∑i=1

f(Xi).

Note that by the SLLN, λn → λ w.p. 1.

An extension of this to the case where Xii≥1 is a Markov chain, knownas Markov chain Monte Carlo (MCMC), is discussed in Chapter 14.

Page 261: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.3 Series of independent random variables 249

8.3 Series of independent random variables

Let Xnn≥1 be a sequence of independent random variables on a proba-bility space (Ω,F , P ). The goal of this section is to investigate the conver-gence of the infinite series

∑∞n=1 Xn, i.e., that of the partial sum sequence,

Sn =∑n

i=1 Xi, n ≥ 1.The main result of this section is Kolmogorov’s 3-series theorem (The-

orem 8.3.5). The following two inequalities play a fundamental role in theproof of this theorem and also have other important applications.

Theorem 8.3.1: Let Xj : 1 ≤ j ≤ n be a collection of independentrandom variables. Let Si =

∑ij=1 Xj for 1 ≤ i ≤ n.

(i) (Kolmogorov’s first inequality). Suppose that EXj = 0v and EX2j <

∞, 1 ≤ j ≤ n. Then, for 0 < λ < ∞,

P(

max1≤i≤n

|Si| ≥ λ)≤ Var(Sn)

λ2 . (3.1)

(ii) (Kolmogorov’s second inequality). Suppose that there exists a constantC ∈ (0,∞) such that P (|Xj − EXj | ≤ C) = 1 for 1 ≤ j ≤ n. Then,for any 0 < λ < ∞,

P(

max1≤i≤n

|Si| ≤ λ)≤ (2C + 4λ)2

Var(Sn).

Proof: Let A = max1≤i≤n |Si| ≥ λ and let

A1 = |S1| ≥ λ,Aj = |S1| < λ, |S2| < λ, . . . , |Sj−1| < λ, |Sj | ≥ λ

for j = 2, . . . , n. Then A1, . . . , An are disjoint,⋃n

j=1 Aj = A and P (A) =∑nj=1 P (Aj). Since EXj = 0 for all j,

Var(Sn) = ES2n ≥ E(S2

nIA) =n∑

j=1

E(S2nIAj

)

=n∑

j=1

E[(

(Sn − Sj)2 + S2j + 2(Sn − Sj)Sj

)IAj

]

≥n∑

j=1

E(S2j IAj

) + 2n−1∑j=1

E((Sn − Sj)SjIAj

). (3.2)

Note that since X1, . . . , Xn are independent, (Sn−Sj) ≡∑n

i=j+1 Xi andSjIAj

are independent for 1 ≤ j ≤ n− 1. Hence,

E((Sn − Sj)SjIAj

)= E(Sn − Sj)E(SjIAj

) = 0.

Page 262: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

250 8. Laws of Large Numbers

Also on Aj , S2j ≥ λ2. Therefore, by (3.2),

Var(Sn) ≥n∑

j=1

λ2P (Aj) = λ2P (A).

This establishes (i). For a proof of (ii), see Chung (1974), p. 117.

Remark 8.3.1: Recall that Chebychev’s inequality asserts that for λ > 0,P (|Sn| ≥ λ) ≤ Var(Sn)

λ2 and thus Kolmogorov’s first inequality is signifi-cantly stronger. Kolmogorov’s first inequality has an extension known asDoob’s maximal inequality to a class of dependent random variables, calledmartingales (see Chapter 13). The next inequality is due to P. Levy.

Definition 8.3.1: For any random variable X, a real number c is calleda median of X if

P (X < c) ≤ 12≤ P (X ≤ c). (3.3)

Such a c always exists. It can be verified that c0 ≡ infx : P (X ≤ x) ≥ 12

is a median. Note that if c is a median of X and α is a real number, then αcis a median of αX and α+c is a median of α+X. Further, if P (|X| ≥ α) < 1

2for some α > 0, then any median c of X satisfies |c| ≤ α (Problem 8.4).

Theorem 8.3.2: (Levy’s inequality). Let Xj, j = 1, . . . , n be independentrandom variables. Let Sj =

∑nj=1 Xi, and cj,n be a median of (Sn−Sj) for

1 ≤ j ≤ n, where cn,n is set equal to 0. Then, for any 0 < λ < ∞,

(i) P(

max1≤j≤n

(Sj − cj,n) ≥ λ)≤ 2P (Sn ≥ λ) ;

(ii) P(

max1≤j≤n

|Sj − cj,n| ≥ λ)≤ 2P (|Sn| ≥ λ).

Proof: Let

Aj = Sj − Sn ≤ cj,n for 1 ≤ j ≤ n,

B =

max1≤j≤n

(Sj − cj,n) ≥ λ

,

B1 = S1 − c1,n ≥ λBj = Si − ci,n < λ for 1 ≤ i ≤ j − 1, Sj − cj,n ≥ λ,

for j = 2, . . . , n. Then B1, . . . , Bn are disjoint and⋃n

j=1 Bj = B. SinceX1, . . . , Xn are independent, Aj and Bj are independent for each j =1, 2, . . . , n. Also for each j, Aj = Sj − cj,n ≤ Sn, and hence on Aj ∩Bj ,Sn ≥ λ holds. Thus,

P (Sn ≥ λ) ≥n∑

j=1

P (Aj ∩Bj)

Page 263: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.3 Series of independent random variables 251

=n∑

j=1

P (Aj)P (Bj)

≥ 12

P

( n⋃j=1

Bj

)

=12

P (B),

proving part (i). Now part (ii) follows by applying part (i) to both Xini=1

and −Xini=1.

Recall that if Ynn≥1 is a sequence of random variables, then Ynn≥1converges w.p. 1 implies that Ynn≥1 converges in probability as well. Aremarkable result of P. Levy is that if Snn≥1 is the sequence of partialsums of independent random variables and Snn≥1 converges in proba-bility, then Snn≥1 must converge w.p. 1 as well. The proof of this usesLevy’s inequality proved above.

Theorem 8.3.3: Let Xnn≥1 be a sequence of independent random vari-ables. Let Sn =

∑nj=1 Xj for 1 ≤ n < ∞ and let Snn≥1 converge in

probability to a random variable S. Then Sn → S w.p. 1.

Proof: Recall that a sequence xnn≥1 of real numbers converges iff it isCauchy iff δn ≡ sup|xk − x| : k, ≥ n → 0 as n →∞. Let

∆n ≡ sup|Sk − S| : k, ≥ n and∆n ≡ sup|Sk − Sn| : k ≥ n.

Then, ∆n ≤ 2∆n and ∆n is decreasing in n. Suppose it is shown that

∆n −→p 0. (3.4)

Then, ∆n −→p 0 and hence there is a subsequence nkk≥1 such that∆nk

→ 0 as k → ∞ w.p. 1. Since ∆n is decreasing in n, this implies that∆n → 0 w.p. 1. Thus it suffices to establish (3.4). Fix 0 < ε < 1. Let

Sn, = Sn+ − Sn for ≥ 1,

∆n,k = max|Sn,| : 1 ≤ ≤ k, k ≥ 1.

Note that for each n ≥ 1, ∆n,kk≥1 is a nondecreasing sequence,lim

k→∞∆n,k = ∆n and hence, for any n ≥ 1,

P (∆n > ε) = limk→∞

P (∆n,k > ε). (3.5)

Levy’s inequality (Theorem 8.3.2) will now be used to bound P (∆n,k > ε)uniformly in k. Since Sn −→p S, for any η > 0, there exists an n0 ≥ 1 suchthat for all n ≥ n0,

P (|Sn − S| > η) < η.

Page 264: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

252 8. Laws of Large Numbers

This implies that for all k ≥ ≥ n0,

P (|Sk − S| > 2η) < 2η. (3.6)

If 0 < η < 14 , then the medians of Sk − S for k ≥ ≥ n0 are bounded

by 2η. Hence, for n ≥ n0 and k ≥ 1, applying Levy’s inequality (i.e., theabove theorem) to Xi : n + 1 ≤ i ≤ n + k,

P (∆n,k > ε) = P(

max1≤j≤k

|Sn,j | > ε)

≤ P(

max1≤j≤k

|Sn,j − cn+j,n+k| ≥ ε− 2η)

≤ 2P (|Sn,k| ≥ ε− 2η).

Now, choosing 0 < η < ε4 , (3.6) yields P (∆n,k > ε) < 4η < ε for all n ≥ n0,

k ≥ 1. Then, by (3.5), P (∆n > ε) ≤ ε for all n ≥ n0. Hence, (3.4) holds.

The following result on convergence of infinite series of independent ran-dom variables is an immediate consequence of the above theorem.

Theorem 8.3.4: (Khinchine-Kolmogorov’s 1-series theorem). LetXnn≥1 be a sequence of independent random variables on a probabilityspace (Ω,F , P ) such that EXn = 0 for all n ≥ 1 and

∑∞n=1 EX2

n < ∞.Then Sn ≡

∑nj=1 Xj converges in mean square and almost surely, as

n →∞.

Proof: For any n, k ∈ N,

E(Sn − Sn+k)2 = Var(Sn − Sn+k) =n+k∑

j=n+1

Var(Xj) =n+k∑

j=k+1

EX2j ,

by independence. Since∑∞

n=1 EX2n < ∞, Snn≥1 is a Cauchy sequence in

L2(Ω,F , P ) and hence converges in mean square to some S in L2(Ω,F , P ).This implies that Sn −→p S, and by the above theorem Sn → S w.p. 1.

Remark 8.3.2: It is possible to give another proof of the above theoremusing Kolmogorov’s inequality. See Problem 8.5.

Theorem 8.3.5: (Kolmogorov’s 3-series theorem). Let Xnn≥1 be asequence of independent random variables on a probability space (Ω,F , P )and let Sn =

∑ni=1 Xi, n ≥ 1. Then the sequence Snn≥1 converges w.p.

1 iff the following 3-series converge for some 0 < c < ∞:

(i)∑∞

i=1 P (|Xi| > c) < ∞,

(ii)∑∞

i=1 E(Yi) converges,

(iii)∑∞

i=1 Var(Yi) < ∞,

Page 265: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.3 Series of independent random variables 253

where Yi = XiI(|Xi| ≤ c), i ≥ 1.

Proof: (Sufficiency). By (i) and the Borel-Cantelli lemma, P (Xi =Yi i.o.) = P (|Xi| > c i.o.) = 0. Hence Snn≥1 converges w.p. 1 iff Tnn≥1converges w.p. 1, where Tn =

∑ni=1 Yi, n ≥ 1. By (iii) and the 1-series the-

orem, the sequence ∑n

i=1(Yi −EYi)n≥1 converges w.p. 1. Hence, by (ii),Tnn≥1 converges w.p. 1 and hence Snn≥1 converges w.p. 1.

(Necessity). Suppose Snn≥1 converges w.p. 1. Fix 0 < c < ∞ and letYi = XiI(|Xi| ≤ c), i ≥ 1. Since Snn≥1 converges w.p. 1, Xn → 0 w.p.1. Hence, w.p. 1, |Xi| ≤ c for all but a finite number of i’s. If Ai ≡ Xi =Yi = |Xi| > c, then by the second Borel-Cantelli lemma,

∞∑i=1

P (Ai) < ∞, establishing (i).

To establish (ii) and (iii), the following construction and the second in-equality of Kolmogorov will be used. Without loss of generality, assumethat there is another sequence Xnn≥1 of random variables on the sameprobability space (Ω,F , P ) such that (a) Xnn≥1 are independent, (b)Xnn≥1 is independent of Xnn≥1, and (c) for each n ≥ 1, Xn =d Xn,i.e., Xn and Xn have the same distribution. Let

Yi = XiI(|Xi| ≤ c),Zi = Yi − Yi, i ≥ 1,

Tn ≡n∑

i=1

Yi,

Tn ≡n∑

i=1

Yi,

and

Rn ≡n∑

i=1

Zi, n ≥ 1.

Since Sn ≡∑n

i=1 Xin≥1 converges w.p. 1, and Xi = Yi for all but afinite number of i, Tnn≥1 converges w.p. 1. Since Yin≥1 and Yin≥1

have the same distribution on R∞, Tnn≥1 converges w.p. 1. Thus thedifference sequence Rnn≥1 converges w.p. 1.

Next, note that Znn≥1 are independent random variables with mean 0and Znn≥1 are uniformly bounded by 2c. Applying Kolmogorov’s secondinequality (Theorem 8.3.1 (b)) to Zj : m < j ≤ m + n yields

P

(max

m<j≤m+n|Rj −Rm| ≤ ε

)≤ (2c + 4ε)2∑m+n

i=m+1 Var(Zi)(3.7)

for all m ≥ 1, n ≥ 1, 0 < ε < ∞.

Page 266: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

254 8. Laws of Large Numbers

Let ∆m ≡ maxm<j |Rj −Rm|. Let n →∞ in (3.7) to conclude that

P (∆m ≤ ε) ≤ (2c + 4ε)2∑∞i=m+1 Var(Zi)

.

Now suppose (iii) does not hold. Then, since Yi and Yi are iid, Var(Zi) =2Var(Yi) for all i ≥ 1, and thus

∑∞i=m+1 Var(Zi) = ∞ and hence P (∆m ≤

ε) = 0 for each m ≥ 1, 0 < ε < ∞. This implies that P (∆m > ε) = 1 foreach ε > 0 and hence that ∆m = ∞ w.p. 1 for all m ≥ 1. This contradictsthe convergence w.p. 1 of the sequence Rnn≥1. Hence (iii) holds.

By the 1-series theorem, ∑n

i=1(Yi − EYi)n≥1 converges w.p. 1. Since∑n

i=1 Yin≥1 converges w.p. 1,∑∞

i=1 EYi converges, establishing (ii). Thiscompletes the proof of necessity part and of the theorem.

Remark 8.3.3: To go from the convergence w.p. 1 of Rnn≥1 to (iii), itsuffices to show that if (iii) fails, then for each 0 < A < ∞, P (|Rn| ≤ A) →0 as n →∞. This can be established without the use of (3.7) but using thecentral limit theorem (to be proved later in Chapter 11), which shows thatif Var(Rn) →∞, then

P

(Rn√

Var(Rn)≤ x

)→ Φ(x) ≡ 1√

∫ x

−∞e−t2/2dt,

for all x in R. (Also see Billingsley (1995), p. 290.)

8.4 Kolmogorov and Marcinkiewz-Zygmund SLLNs

For a sequence of independent and identically distributed random variablesXnn≥1, Kolmogorov showed that Xnn≥1 obeys the SLLN with bn = niff E|X1| < ∞. Marcinkiewz and Zygmund generalized this result andproved a class of SLLNs for Xnn≥1 when E|X|p < ∞ for some p ∈ (0, 2).The proof uses Kolmogorov’s 3-series theorem and some results from realanalysis. This approach is to be contrasted with Etemadi’s proof of theSLLN, which uses a decomposition of the random variables Xnn≥1 intopositive and negative parts and uses monotonicity of the sum to establishalmost sure convergence along a subsequence by an application of the Borel-Cantelli lemma. The alternative approach presented in this section is alsouseful for proving SLLNs for sums of independent random variables thatare not necessarily identically distributed.

The next three are preparatory results for Theorem 8.4.4.

Lemma 8.4.1: (Abel’s summation formula). Let ann≥1 and bnn≥1be two sequences of real numbers. Then, for all n ≥ 2,

n∑j=1

ajbj = Anbn −n−1∑j=1

Aj(bj+1 − bj) (4.1)

Page 267: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.4 Kolmogorov and Marcinkiewz-Zygmund SLLNs 255

where Ak =∑k

j=1 aj, k ≥ 1.

Proof: Let A0 = 0. Then, aj = Aj −Aj−1, j ≥ 1. Hence,

n∑j=1

ajbj =n∑

j=1

(Aj −Aj−1)bj =n∑

j=1

Ajbj −n∑

j=1

Aj−1bj

=n∑

j=1

Ajbj −n−1∑j=1

Ajbj+1,

yielding (4.1).

Lemma 8.4.2: (Kronecker’s lemma). Let ann≥1 and bnn≥1 be se-quences of real numbers such that 0 < bn ↑ ∞ as n → ∞ and

∑∞j=1 aj

converges. Then,

1bn

n∑j=1

ajbj −→ 0 as n →∞. (4.2)

Proof: Let Ak =∑k

j=1 aj , A ≡∑∞

j=1 aj = limk→∞ Ak and Rk = A−Ak,k ≥ 1. Then, by Lemma 8.4.1 for n ≥ 2,

n∑j=1

ajbj = Anbn −n−1∑j=1

Aj(bj+1 − bj)

= Anbn −n−1∑j=1

(A−Rj)(bj+1 − bj)

= Anbn −A

n−1∑j=1

(bj+1 − bj) +n−1∑j=1

Rj(bj+1 − bj)

= Anbn −Abn + Ab1 +n−1∑j=1

Rj(bj+1 − bj)

= −Rnbn + Ab1 +n−1∑j=1

Rj(bj+1 − bj). (4.3)

Since∑∞

n=1 an converges, Rn → 0 as n →∞. Hence, given any ε > 0, thereexists N = Nε > 1 such that |Rn| ≤ ε for all n ≥ N . Since 0 < bn ↑ ∞, forall n > N ,

∣∣∣∣b−1n

n−1∑j=1

Rj(bj+1 − bj)∣∣∣∣

Page 268: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

256 8. Laws of Large Numbers

≤ b−1n

N−1∑j=1

|Rj | |bj+1 − bj |+ ε b−1n

n−1∑j=N

(bj+1 − bj)

= b−1n

N−1∑j=1

|Rj | |bj+1 − bj |+ ε.

Now letting n →∞ and then letting ε ↓ 0, yields

lim supn→∞

∣∣∣∣b−1n

n−1∑j=1

Rj(bj+1 − bj)∣∣∣∣ = 0.

Hence, (4.2) follows from (4.3).

Lemma 8.4.3: For any random variable X,∞∑

n=1

P (|X| > n) ≤ E|X| ≤∞∑

n=0

P (|X| > n). (4.4)

Proof: For n ≥ 1, let An = n − 1 < |X| ≤ n. Define the randomvariables

Y =∞∑

n=1

(n− 1) IAn and Z =∞∑

n=1

n IAn .

Then, it is clear that Y ≤ |X| ≤ Z, so that

EY ≤ E|X| ≤ EZ. (4.5)

Note that

EY =∞∑

n=1

(n− 1)P (An)

=∞∑

n=2

n−1∑j=1

P (An)

=∞∑

j=1

∞∑n=j+1

P (n− 1 < |X| ≤ n)

=∞∑

j=1

P (|X| > j).

Similarly, one can show that EZ =∑∞

j=0 P (|X| > j). Hence, (4.4) follows.

Theorem 8.4.4: (Marcinkiewz-Zygmund SLLNs). Let Xnn≥1 be a se-quence of identically distributed random variables and let p ∈ (0, 2). WriteSn =

∑ni=1 Xi, n ≥ 1.

Page 269: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.4 Kolmogorov and Marcinkiewz-Zygmund SLLNs 257

(i) If Xnn≥1 are pairwise independent and

Sn − nc

n1/pconverges w.p. 1 (4.6)

for some c ∈ R, then E|X1|p < ∞.

(ii) Conversely, if E|X1|p < ∞ and Xnn≥1 are independent, then (4.6)holds with c = EX1 for p ∈ [1, 2) and with any c ∈ R for p ∈ (0, 1).

Corollary 8.4.5: (Kolmogorov’s SLLN ). Let Xnn≥1 be a sequence ofiid random variables. Then,

Sn − nc

n→ 0 w.p. 1

for some c ∈ R iff E|X1| < ∞, in which case, c = EX1.

Thus, Kolmogorov’s SLLN corresponds with the special case p = 1 ofTheorem 8.4.4. Note that compared with the WLLN and Borel’s SLLNof Sections 8.1 and 8.2, Kolmogorov’s SLLN presents a significant im-provement in the moment condition, i.e., it assumes the finiteness of onlythe first absolute moment. Further, both the Kolmogorov’s SLLN and theMarcinkiewz-Zygmund SLLN are proved under minimal moment condi-tions, since the corresponding moment conditions are shown to be neces-sary.

Proof of Theorem 8.4.4: (i) Suppose that (4.6) holds for some c ∈ R.Then,

Xn

n1/p=

Sn − Sn−1

n1/p

=Sn − nc

n1/p− Sn−1 − (n− 1)c

n1/p+

c

n1/p

→ 0 as n →∞, a.s.

Hence, P (|Xn/n1/p| > 1 i.o.) = 0. By the second Borel-Cantelli lemmaand by the pairwise independence of Xnn≥1, this implies

∞∑n=1

P

(|Xn|n1/p

> 1)

< ∞,

i.e.,∞∑

n=1

P(|X1|p > n

)< ∞.

Hence, by Lemma 8.4.3, E|X1|p < ∞.

Page 270: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

258 8. Laws of Large Numbers

To prove (ii), suppose that E|X1|p < ∞ for some p ∈ (0, 2). For 1 ≤p < 2, w.l.o.g. assume that EX1 = 0. Next, define the variables Zn =XnI(|Xn|p ≤ n), n ≥ 1. Then, by Lemma 8.4.3,

∞∑n=1

P (Xn = Zn)

=∞∑

n=1

P (|Xn|p > n) =∞∑

n=1

P (|X1|p > n) ≤ E|X1|p < ∞.

Hence, by the Borel-Cantelli lemma,

P (Xn = Zn i.o.) = 0. (4.7)

Note that, in view of (4.7), (4.6) holds with c = 0 if and only if

n1/pn∑

i=1

Zi → 0 as n →∞, w.p. 1. (4.8)

Note that for any j ∈ N, θ > 1 and β ∈ (−∞, 0)\−1,∞∑

n=j

n−θ ≤ j−θ +∞∑

n=j+1

∫ n

n−1x−θdx

= j−θ +1

θ − 1· j−(θ−1)

≤ θ

θ − 1· j−(θ−1) (4.9)

and similarly,

j∑n=1

nβ ≤[β + j(β+1)]/(β + 1)

≤ β

β + 1I(β < −1) +

jβ+1

β + 1I(−1 < β < 0). (4.10)

Now,

∞∑n=1

Var(Zn/n1/p)

≤∞∑

n=1

EX21I(|X1|p ≤ n) · n−2/p

=∞∑

n=1

n∑j=1

EX21I(j − 1 < |X1|p ≤ j) · n−2/p

Page 271: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.4 Kolmogorov and Marcinkiewz-Zygmund SLLNs 259

=∞∑

j=1

( ∞∑n=j

n−2/p

)· EX2

1I(j − 1 < |X1|p ≤ j)

≤ 22− p

∞∑j=1

j−( 2p −1) · EX2

1I((j − 1) < |X1|p ≤ j

)(by (4.9))

≤ 22− p

∞∑j=1

j−( 2p −1) · E|X1|pI(j − 1 < |X1|p ≤ j) · (j1/p)2−p

=2

2− pE|X1|p < ∞.

Hence, by Theorem 8.3.4,∑∞

n=1(Zn − EZn)/n1/p converges w.p. 1. ByKronecker’s lemma (viz. Lemma 8.4.2),

n−1/pn∑

j=1

(Zj − EZj) → 0 as n →∞, w.p. 1. (4.11)

Now consider the case p = 1. In this case, E|X1| < ∞ and by the DCT,EZn = EX1I(|X1| ≤ n) → EX1 = 0 as n → ∞. Hence, n−1∑n

i=1 EZi →0. Part (ii) of the theorem now follows from (4.8) and (4.11) for p = 1.

Next consider the case p ∈ (0, 2), p = 1. Using (4.9) and (4.10), one canshow (cf. Problem 8.12) that

n−1/pn∑

j=1

EZj → 0 as n →∞. (4.12)

Hence, by (4.8), (4.11), and (4.12), one gets (4.6) with c = 0 for p ∈(0, 2)\1. Finally, note that for p ∈ (0, 1), and for any c ∈ R,

Sn − nc

n1/p=

Sn

n1/p− nc

n1/p

→ 0 as n →∞, a.s.,

whenever Sn/n1/p → 0 as n → ∞, w.p. 1. Hence, (4.6) holds with anarbitrary c ∈ R for p ∈ (0, 1). This completes the proof of part (ii) forp ∈ (0, 2)\1 and hence of the theorem.

The next result gives a SLLN for independent random variables that arenot necessarily identically distributed.

Theorem 8.4.6: Let Xnn≥1 be a sequence of independent random vari-ables. If

∑∞n=1 E|Xn|αn/nαn < ∞ for some αn ∈ [1, 2], n ≥ 1, then

n−1n∑

j=1

(Xj − EXj) → 0 as n →∞, w.p. 1. (4.13)

Page 272: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

260 8. Laws of Large Numbers

Proof: W.l.o.g. suppose that EXn = 0 for all n ≥ 1. Let Yn =XnI(|Xn| ≤ n)/n. Note that |EYn| = |n−1(EXn − EXnI(|Xn| > n))| =n−1|EXnI(|Xn| > n)|, n ≥ 1. Since 1 ≤ αn ≤ 2,

∞∑n=1

P (|Xn| > n) + |EYn|

≤ 2∞∑

n=1

n−1E|Xn|I(|Xn| > n)

≤ 2∞∑

n=1

E|Xn|αn/nαn < ∞

and∞∑

n=1

Var(Yn) ≤∞∑

n=1

n−2EX2nI(|Xn| ≤ n)

≤∞∑

n=1

n−αnEXαnn < ∞.

Hence, by Kolmogorov’s 3-series theorem,∑∞

n=1(Xn/n) converges w.p. 1.Now the theorem follows from Lemma 8.4.2.

Corollary 8.4.7: Let Xnn≥1 be a sequence of independent random vari-ables such that for some α ∈ [1, 2],

∑∞n=1(n

−αE|Xn|α) < ∞. Then (4.13)holds.

8.5 Renewal theory

8.5.1 Definitions and basic propertiesLet Xnn≥0 be a sequence of nonnegative random variables that are in-dependent and, for i ≥ 1, identically distributed with cdf F . Let Sn =∑n

i=0 Xi for n ≥ 0. Imagine a system where a component in operation attime t = 0 lasts X0 units of time and then is replaced by a new one thatlasts X1 units of time, which, at failure, is replaced by yet another new onethat lasts X2 units of time and so on. The sequence Snn≥0 representsthe sequence of epochs when ‘renewal’ takes place and is called a renewalsequence. Assume that P (X1 = 0) < 1. Then, since P (X1 < ∞) = 1, it fol-lows that for each n, P (Sn < ∞) = 1 and limn→∞ Sn = ∞ w.p. 1 (Problem8.16). Now define the counting process N(t) : t ≥ 0 by the relation

N(t) = k if Sk−1 ≤ t < Sk for k = 0, 1, 2, . . . (5.1)

where S−1 = 0. Thus N(t) counts the number of renewals up to time t.

Page 273: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.5 Renewal theory 261

Definition 8.5.1: The stochastic process N(t) : t ≥ 0 is called a renewalprocess with lifetime distribution F . The renewal sequence Snn≥0 and therenewal process N(t) : t ≥ 0 are called nondelayed or standard if X0 hasthe same distribution as X1 and are called delayed otherwise.

Since P (X1 ≥ 0) = 1, Snn≥0 is nondecreasing in n and for each t ≥ 0,the event N(t) = k = Sk−1 ≤ t < Sk belongs to the σ-algebra σ〈Xj :0 ≤ j ≤ k〉 and hence N(t) is a random variable. Using the nontrivialityhypothesis that P (X1 = 0) < 1, it is shown below that for each t > 0, therandom variable N(t) has finite moments of all order.

Proposition 8.5.1: Let P (X1 = 0) < 1. Then there exists 0 < λ < 1 (notdepending on t) and a constant C(t) ∈ (0,∞) such that

P (N(t) > k) ≤ C(t)λk for all k > 0. (5.2)

Proof: For t > 0, k ∈ N,

P (N(t) > k) = P (Sk ≤ t)= P

(e−θSk ≥ e−θt

)for θ > 0

≤ eθtE(e−θSk

)(by Markov’s inequality)

= eθtE(e−θX0

)(E(e−θX1

))k

.

By BCT, limθ↑∞ E(e−θX1) = P (X1 = 0) < 1. Hence, there exists a θ largesuch that λ ≡ E(e−θX1) is less than one, thus, completing the proof.

Corollary 8.5.2: There exists an s0 > 0 such that the moment generatingfunction (m.g.f.) E(esN(t)) < ∞ for all s < s0 and t ≥ 0.

Proof: From (5.2), for any t > 0, it follows that P(N(t) = k

)= O(λk) as

k → ∞ for some 0 < λ < 1 and hence E(esN(t)

)=∑∞

k=0(es)kP

(N(t) =

k)

< ∞ for any s such that esλ < 1, i.e., for all s < s0 ≡ − log λ.

From (5.1), it follows that for t > 0,

SN(t)−1 ≤ t < SN(t)

⇒(N(t)− 1

N(t)

) SN(t)−1

(N(t)− 1)≤ t

N(t)≤(SN(t)

N(t)

). (5.3)

Let A be the event that Sn

n → EX1 as n → ∞ and let B be the eventthat N(t) →∞ as t →∞. Since Sn →∞ w.p. 1, it follows that P (B) = 1.Also, by the SLLN, P (A) = 1. On the event C = A ∩B, it holds that

SN(t)

N(t)→ EX1 as t →∞.

Page 274: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

262 8. Laws of Large Numbers

This together with (5.3) yields the following result.

Proposition 8.5.3: Suppose that P (X1 = 0) < 1. Then,

limt→∞

N(t)t

=1

EX1w.p. 1. (5.4)

Definition 8.5.2: The function U(t) ≡ EN(t) for the nondelayed processis called the renewal function.

An explicit expression for U(·) is given by (5.13) below.Next consider the convergence of EN(t)/t. By (5.4) and Fatou’s lemma,

one gets

lim inft→∞

EN(t)t

≥ 1EX1

. (5.5)

It turns out that the lim inft→∞ in (5.5) can be replaced by limt→∞ and≥ by equality. To do this it suffices to show that the family N(t)

t : t ≥ kis uniformly integrable for some k < ∞. This can be done by showingE(N(t)

t )2 is bounded in t (see Chung (1974), Chapter 5). An alternateapproach is to bound the lim sup. For this one can use an identity knownas Wald’s equation (see also Chapter 13).

8.5.2 Wald’s equationLet Xjj≥1 be independent random variables with EXj = 0 for all j ≥ 1.Also, let S0 = 0, Sn =

∑nj=1 Xj , n ≥ 1.

Definition 8.5.3: A positive integer valued random variable N is calleda stopping time with respect to Xjj≥1 if for every j ≥ 1, the eventN = j ∈ σ〈X1, . . . , Xj〉. A stopping time N is called bounded if thereexists a K < ∞ such that P (N ≤ K) = 1.

Example 8.5.1: N ≡ minn :∑n

j=1 Xj ≥ 25 is a stopping time w.r.t.Xjj≥1, but M ≡ maxn :

∑nj=1 Xj ≥ 25 is not.

Proposition 8.5.4: Let Xjj≥1 be independent random variables withEXj = 0. Let N be a bounded stopping time w.r.t. Xjj≥1. Then

E(|SN |) < ∞ and ESN = 0.

Proof: Let K ∈ N be such that P (N ≤ K) = 1. Then |SN | ≤∑K

j=1 |Xi|and hence E|SN | < ∞. Next, SN =

∑Kj=1 XjI(N ≥ j) and hence

ESN =K∑

j=1

E(XjI(N ≥ j)

).

Page 275: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.5 Renewal theory 263

But the event N ≥ j = N ≤ j − 1c ∈ σ〈X1, X2, . . . , Xj−1〉. SinceXj is independent of σ〈X1, X2, . . . , Xj−1〉,

E(XjI(N ≥ j)

)= 0 for 1 ≤ j ≤ K.

Thus ESN = 0.

Corollary 8.5.5: Let Xjj≥1 be iid random variables with E|X1| < ∞.Let N be a bounded stopping time w.r.t. Xjj≥1. Then

ESN = (EN)EX1.

Corollary 8.5.6: Let Xjj≥1 be iid nonnegative random variable withE|X1| < ∞. Let N be a stopping time w.r.t. Xjj≥1. Then

ESN = (EN)EX1.

Proof: Let Nk = N ∧ k, k = 1, 2, . . .. Then Nk is a bounded stoppingtime. By Corollary 8.5.5,

E(SNk) = (ENk)EX1.

Let k ↑ ∞. Then 0 ≤ SNk↑ SN and Nk ↑ N . By the MCT, ESNk

↑ ESN

and ENk ↑ EN .

Theorem 8.5.7: (Wald’s equation). Let Xjj≥1 be iid random variableswith E|X1| < ∞. Let N be a stopping time w.r.t. Xjj≥1 such that EN <∞. Then

ESN = (EN)EX1.

Proof: Let Tn =∑n

j=1 |Xj |, n ≥ 1. Let Nk = N ∧ k, k = 1, 2, . . .. Thenby Corollary 8.5.5,

E(SNk) = (ENk)EX1.

Also, |SNk| ≤ TNk

and

ETNk= (ENk)E|X1|.

Further, as k →∞, Nk → N , SNk→ SN , TNk

→ TN , and

ETNk→ ETN = (EN)E|X1| < ∞.

So, by the extended DCT (Theorem 2.3.11)

ESNk→ ESN

i.e., (ENk)EX1 → ESN

i.e., ESN = (EN)EX1.

Page 276: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

264 8. Laws of Large Numbers

8.5.3 The renewal theoremsIn this section, two versions of the renewal theorem will be proved. For this,the notation and concepts introduced in Sections 8.5.1 and 8.5.2 will be usedwithout further explanation. Note that for each t > 0 and j = 0, 1, 2, . . .,the event N(t) = j = Sj−1 ≤ t < Sj belongs to σ〈X0, . . . , Xj〉.Thus, by Wald’s equation (Theorem 8.5.7 above)

E(SN(t)) =(EN(t)

)EX1 + EX0.

Let m ∈ (0,∞) and Xi = minXi, m, i ≥ 0. Let Snn≥0 and N(t)t≥0be the associated renewal sequence and renewal process, respectively.Again, by Wald’s equation,

E(SN(t)

)=(EN(t)

)EX1 + EX0.

But since SN(t)−1 ≤ t < SN(t), it follows that SN(t) ≤ t + m and hence

(EN(t))EX1 + EX0 ≤ t + m.

This yields

lim supt→∞

EN(t)t

≤ 1EX1

.

Clearly, for all t > 0, N(t) ≥ N(t) and hence

lim supt→∞

EN(t)

t≤ 1

EX1. (5.6)

Since this is true for each m ∈ (0,∞) and by the MCT, EX1 → EX1 asm →∞, it follows that

lim supt→∞

EN(t)t

≤ 1EX1

.

Combining this with (5.5) leads to the following result.

Theorem 8.5.8: (The weak renewal theorem). Let N(t) : t ≥ 0 be arenewal process with distribution F . Let µ =

∫[0,∞) xdF (x) ∈ (0,∞). Then,

limt→∞

EN(t)t

=1µ

. (5.7)

The above result is also valid when µ = ∞ when 1µ is interpreted as zero.

Definition 8.5.4: A random variable X (and its probability distribution)is called arithmetic (or lattice) if there exists a ∈ R and d > 0 such that X−a

d

Page 277: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.5 Renewal theory 265

is integer valued. The largest such d is called the span of (the distributionof) X.

Definition 8.5.5: A random variable X (and its distribution distribution)is called nonarithmetic (or nonlattice) if it is not arithmetic.

The weak renewal theorem (Theorem 8.5.8) implies that EN(t) = t/µ+o(t) as t →∞. This suggests that E

(N(t + h)−N(t)

)= (t + h)/µ− t/µ +

o(t) = h/µ + o(t). A strengthening of the above result is as follows.

Theorem 8.5.9: (The strong renewal theorem). Let N(t) : t ≥ 0 be arenewal process with a nonarithmetic distribution F with a finite positivemean µ. Then, for each h > 0,

limt→∞ E

(N(t + h)−N(t)

)=

h

µ. (5.8)

Remark 8.5.1: Since

N(t) =k−1∑j=0

(N(t− j)−N(t− j − 1)

)+ N(t− k)

where k ≤ t < k + 1, the weak renewal theorem follows from the strongrenewal theorem.

The following are the “arithmetic versions” of Theorems 8.5.8 and 8.5.9.Let Xii≥0 be independent positive integer valued random variables suchthat Xii≥1 are iid with distribution pjj≥1. Let Sn =

∑nj=0 Xj , n ≥ 0,

S−1 = 0. Let Nn = k if Sk−1 ≤ n < Sk, k = 0, 1, 2, . . .. Let

un = P (there is a renewal at time n)= P (Sk = n for some k ≥ 0).

Theorem 8.5.10: Let µ =∑∞

j=1 jpj ∈ (0,∞). Then

1n

n∑j=0

uj →1µ

as n →∞. (5.9)

Theorem 8.5.11: Let µ =∑∞

j=1 jpj ∈ (0,∞) and g.c.d. k : pk > 0 = 1.Then

un →1µ

as n →∞. (5.10)

For proofs of Theorems 8.5.9 and 8.5.11, see Feller (1966) for an analyticproof or Lindvall (1992) for a proof using the coupling method. The proofof Theorem 8.5.10 is similar to that of Theorem 8.5.8.

Page 278: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

266 8. Laws of Large Numbers

8.5.4 Renewal equationsThe above strong renewal theorems have many applications. These are viawhat are known as renewal equations.

Let F (·) be a cdf such that F (0) = 0. Let B0 ≡ f | f : [0,∞) → R, fis Borel measurable and bounded on bounded intervals.

Definition 8.5.6: A function a(·) is said to satisfy the renewal equationwith distribution F (·) and forcing function b(·) ∈ B0 if a ∈ B0 and

a(t) = b(t) +∫

(0,t]a(t− u)dF (u) for t ≥ 0. (5.11)

Theorem 8.5.12: Let F be a cdf such that F (0) = 0 and let b(·) ∈ B0.Then there is a unique solution a0(·) ∈ B0 to (5.11) given by

a0(t) =∫

[0,t]b(t− u)U(du) (5.12)

where U(·) is the Lebesgue-Stieltjes measure induced by the nondecreasingfunction

U(t) ≡∞∑

n=0

F (n)(t), (5.13)

with F (n)(·), n ≥ 0 being defined by the relations

F (n)(t) =∫

(0,t]F (n−1)(t− u)dF (u), t ∈ R, n ≥ 1,

F (0)(t) =

1 if t ≥ 00 t < 0.

It will be shown below that the function U(·) defined in (5.13) is therenewal function EN(t) as in Definition 8.5.2.

Proof: For any function b ∈ B0 and any nondecreasing right continuousfunction G : [0,∞) → R, let

(b ∗G)(t) ≡∫

[0,t]b(t− u)dG(u).

Then since F (0) = 0, the equation (5.11) can be rewritten as

a = b + a ∗ F. (5.14)

Let Xii≥1 be iid random variables with cdf F . Then it is easy to verifythat F (n)(t) = P (Sn ≤ t), where S0 = 0, and Sn =

∑ni=1 Xi for n ≥ 1. Let

Page 279: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.5 Renewal theory 267

N(t) : t ≥ 0 be as defined by (5.1). Then, for t ∈ (0,∞),

EN(t) =∞∑

j=1

P(N(t) ≥ j

)=

∞∑j=1

P (Sj−1 ≤ t) =∞∑

n=0

F (n)(t) = U(t).

By Proposition 8.5.1, U(t) < ∞ for all t > 0 and is nondecreasing.Since b ∈ B0 for each 0 < t < ∞, a0 defined by (5.12) is well-defined. Bydefinition a0 = b ∗ U and by (5.13), a0 satisfies (5.14) and hence (5.11). Ifa1 and a2 from B0 are two solutions to (5.14) then a ≡ a1 − a2 satisfies

a = a ∗ F

and hencea = a ∗ F (n) for all n ≥ 1.

This implies

M(t) ≡ sup|a(u)| : 0 ≤ u ≤ t ≤ M(t)F (n)(t).

But F (n)(t) → 0 as n → ∞. Hence |a| = 0 on (0, t] for each t. Thusa0 = b ∗ U is the unique solution to (5.11).

The discrete or arithmetic analog of the renewal equation (5.11) is asfollows. Let Xii≥1 be iid positive integer valued random variables withdistribution pjj≥1. Let S0 = 0, and Sn =

∑ni=1 Xi for n ≥ 1. Let un =

P (Sj = n for some j ≥ 0). Then, u0 = 1 and un satisfies un =∑n

j=1 pjun−j

for n ≥ 1. For any sequence bjj≥0, the equation

an = bn +n∑

j=1

an−jpj , n = 0, 1, 2, . . . (5.15)

is called the discrete renewal equation. As in the general case, it can beshown (Problem 8.17 (a)) that the unique solution to (5.15) is given by

an =n∑

j=0

bn−juj . (5.16)

The following convergence results are easy to establish from Theorem 8.5.11(Problem 8.17 (b)).

Theorem 8.5.13: (The key renewal theorem, discrete case). Let pjj≥1be aperiodic, i.e., g.c.d. k : pk > 0 = 1 and µ ≡

∑∞j=1 jpj ∈ (0,∞). Let

unn≥0 be the renewal sequence associated with pjj≥1. That is, u0 = 1and un =

∑nj=1 pjun−j for n ≥ 1. Let bjj≥0 be such that

∑∞j=1 |bj | < ∞.

Let ann≥0 satisfy a0 = b0 and

an = bn +∞∑

j=1

an−jpj n ≥ 1. (5.17)

Page 280: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

268 8. Laws of Large Numbers

Then an =∑∞

j=0 bjun−j, n ≥ 0 and limn→∞ an =

∞∑j=0

bj.

The nonarithmetic analog of the above is as follows.

Definition 8.5.7: A function b(·) ∈ B0 is directly Riemann integrable (dri)on [0,∞) iff (i) for all h > 0,

∑∞n=0 sup|b(u)| : nh ≤ u ≤ (n + 1)h < ∞,

and (ii) limh→0∑∞

n=0 h(mn(h)−mn(h))

= 0 where

mn(h) = supb(u) : nh ≤ u ≤ (n + 1)hmn(h) = infb(u) : nh ≤ u ≤ (n + 1)h.

Theorem 8.5.14: (The key renewal theorem, nonarithmetic case).Let F (·) be a nonarithmetic distribution with F (0) = 0 and µ =∫[0,∞) udF (u) < ∞. Let U(·) =

∑∞n=0 F (n)(·) be the renewal function asso-

ciated with F . Let b(·) ∈ B0 be directly Riemann integrable.Then the unique solution to the renewal equation

a = b + a ∗ F (5.18)

is given by a = b ∗ U and

limt→∞ a(t) =

c(b)µ

(5.19)

where c(b) ≡ limh→0

∞∑n=0

hmn(h).

Remark 8.5.2: A sufficient condition for b(·) to be dri is that it is Rie-mann integrable on bounded intervals and that there exists a nonincreasingintegrable function h(·) on [0,∞) and a constant C such that |b(·)| ≤ Ch(·)(Problem 8.18 (b)).

8.5.5 ApplicationsHere are two important applications of the above two theorems to a classof stochastic processes known as regenerative processes.

Definition 8.5.8:

(a) A sequence of random variables Ynn≥0 is called regenerative if thereexists a renewal sequence Tjj≥0 such that the random cycles andcycle length variables ηj =

(Yi : Tj ≤ i < Tj+1, Tj+1 − Tj

)for

j = 0, 1, 2, . . . are iid.

(b) A stochastic process Y (t) : t ≥ 0 is called regenerative if thereexists a renewal sequence Tjj≥0 such that the random cycles and

Page 281: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.5 Renewal theory 269

cycle length variables ηj ≡ Y (t) : Tj ≤ t < Tj+1, Tj+1 − Tj forj = 0, 1, 2, . . . are iid.

(c) In both (a) and (b), the sequence Tjj≥0 are called the regenerationtimes.

Example 8.5.2: Let Ynn≥0 be a countable state space Markov chain(see Chapter 14) that is irreducible and recurrent. Fix a state ∆. Let

T0 = minn : n > 0, Yn = ∆Tj+1 = minn : n > Tj , Yn = ∆, n ≥ 0.

Then Ynn≥0 is regenerative (Problem 8.19).

Example 8.5.3: Let Y (t) : t ≥ 0 be a continuous time Markov chain (seeChapter 14) with a countable state space that is irreducible and recurrent.Fix a state ∆. Let

T0 = inft : t > 0, Y (t) = ∆Tj+1 = inft : t > Tj , Y (t) = ∆.

Then Y (t) : t ≥ 0 is regenerative (Problem 8.19).

Theorem 8.5.15: Let Ynn≥0 be a regenerative sequence of random vari-ables with some state space (S,S) where S is a σ-algebra on S with regener-ation times Tjj≥0. Let f : S → R be bounded and 〈S,B(R)〉-measurable.Let

an ≡ Ef(Yn+T0),bn ≡ Ef(YT0+n)I(T1 > T0 + n). (5.20)

Let µ = E(T1− T0) ∈ (0,∞) and g.c.d. j : pj ≡ P (T1− T0 = j) > 0 = 1.Then

(i) an →∫

S

f(y)π(dy)

where π(A) ≡ 1µ E(∑T1−1

j=T0IA(Yj)

), A ∈ S.

(ii) In particular,

‖P (Yn ∈ ·)− π(·)‖ → 0 as n →∞, (5.21)

where ‖ · ‖ denotes the total variation norm.

Proof: By the regenerative property, ann≥1 satisfies the renewal equa-tion

an = bn +n∑

j=0

an−jpj

Page 282: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

270 8. Laws of Large Numbers

and hence, part (i) of the theorem follows from Theorem 8.5.13 and thefact

∑∞n=0 bn = µπ(A).

To prove (ii) note that an ≡ Ef(Yn) = E(f(Yn)I(T0 > n)) +∑nj=0 an−jP (T0 = j) and by DCT limn→∞ an = limn→∞ an.It is not difficult to show that for any two probability measures µ and ν

on (S,S), the total variation norm

‖µ− ν‖ = sup∣∣∣ ∫ fdµ−

∫fdν

∣∣∣ : f ∈ B(S, R)

where B(S, R) = f : f : S → R, F measurable, sup|f(s)| : s ∈ S ≤ 1(Problem 4.10 (b)). Thus,

‖P (Yn+T0 ∈ ·)− π(·)‖

≤ sup∣∣∣Ef(Yn0+T )−

∫fdπ

∣∣∣ : f ∈ B(S, R)

. (5.22)

Now, for any f ∈ B(S, R) and any integer K ≥ 1, from Theorem 8.5.13,∣∣∣Ef(Yn0+T )−∫

fdπ∣∣∣

≤K∑

j=0

bj

∣∣∣un−j −1µ

∣∣∣+ 2∞∑

j=(K+1)

P (T1 − T0 > j) ≡ δn, say (5.23)

where bj is defined in (5.20). Since E(T1 − T0) < ∞, given ε > 0, thereexists a K such that

∞∑j=(K+1)

P (T1 − T0 > j) < ε/2.

By Theorem (8.5.11), un → 1µ . Thus, in (5.23), lim δn ≤ ε and so from

(5.22), (ii) follows.

Theorem 8.5.16: Let Y (t) : t ≥ 0 be a regenerative stochastic processwith state space (S,S) where S is a σ-algebra on S. Let f : S → R bebounded and 〈S,B(R)〉-measurable. Let

a(t) = Ef(YT0+t), t ≥ 0,

b(t) ≡ Ef(YT0+t)I(T1 > T0 + t), t ≥ 0.

Let µ = E(T1 − T0) ∈ (0,∞) and the distribution of T1 − T0 be nonarith-metic. Then

(i) a(t) →∫

S

f(y)π(dy)

where π(A) = 1µ E( ∫ T

T0IA(Y (u))du

), A ∈ S.

Page 283: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.6 Ergodic theorems 271

(ii) In particular,

‖P (Yt ∈ ·)− π(·)‖ → 0 as t →∞ (5.24)

where ‖ · ‖ is the total variation norm.

The proof of this is similar to that of the previous theorem but usesTheorem 8.5.14.

8.6 Ergodic theorems

8.6.1 Basic definitions and examplesThe law of large numbers proved in Section 8.2 states that if Xii≥1are pairwise independent and identically distributed and if h(·) is a Borelmeasurable function, then

the time average, i.e.,1n

n∑i=1

h(Xi)

→ Eh(X1), i.e., space average w.p. 1 (6.1)

as n →∞, provided E|h(X1)| < ∞.The goal of this section is to investigate how far the independence as-

sumption can be relaxed.

Definition 8.6.1: (Stationary sequences). A sequence of random variablesXii≥1 on a probability space (Ω,F , P ) is called strictly stationary if foreach k ≥ 1 the joint distribution of (Xi+j : j = 1, 2, . . . , k) is the same forall i ≥ 0.

Example 8.6.1: Xii≥1 iid.

Example 8.6.2: Let Xii≥1 be iid. Fix 1 ≤ < ∞. Let h : R → R bea Borel function and Yi = h(Xi, Xi+1, . . . , Xi+−1), i ≥ 1. Then Yii≥1 isstrictly stationary.

Example 8.6.3: Let Xii≥1 be a Markov chain with a stationary dis-tribution π. If X1 ∼ π then Xii≥1 is strictly stationary (see Chapter14).

It will be shown that if Xii≥1 is a strictly stationary sequence that isnot a mixture of two other strictly stationary sequences, then (6.1) holds.This is known as the ergodic theorem (Theorem 8.6.1 below).

Definition 8.6.2: (Measure preserving transformations). Let (Ω,F , P )be a probability space and T : Ω → Ω be 〈F ,F〉 measurable. Then, T is

Page 284: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

272 8. Laws of Large Numbers

called P -preserving (or simply measure preserving on (Ω,F , P )) if for allA ∈ F , P (T−1(A)) = P (A). That is, the random point T (ω) has the samedistribution as ω.

Let X be a real valued random variable on (Ω,F , P ). Let Xi ≡X(T (i−1)(ω)) where T (0)(ω) = ω, T (i)(ω) = T (T (i−1)(ω)), i ≥ 1. ThenXii≥1 is a strictly stationary sequence.

It turns out that every strictly stationary sequence arises this way. LetXii≥1 be a strictly stationary sequence defined on some probability space(Ω,F , P ). Let P be the probability measure induced by X ≡ Xi(ω)i≥1

on(Ω ≡ R∞, F ≡ B(R∞)

)where R∞ is the space of all sequences of

real numbers and B(R∞) is the σ-algebra generated by finite dimensionalcylinder sets of the form x : (xj : j = 1, 2, . . . , k) ∈ Ak, 1 ≤ k < ∞, Ak ∈B(Rk). Let T : R∞ → R∞ be the unilateral (one sided) shift to the right,i.e., T

((xi)i≥1

)= (xi)i≥2. Then T is measure preserving on (Ω, F , P ). Let

Y1(ω) = x1, and Yi(ω) = Y1(T i−1ω) = xi for i ≥ 2 if ω = (x1, x2, x3, . . .).Then Yii≥1 is a strictly stationary sequence on (Ω, F , P ) and has thesame distribution as Xii≥1.

Example 8.6.4: Let Ω = [0, 1], F = B([0, 1]), P = Lebesgue measure.Let Tω ≡ 2ω mod 1, i.e.,

Tω =

⎧⎨⎩

2ω if 0 ≤ ω < 12

2ω − 1 if 12 ≤ ω < 1

0 ω = 1.

Then T is measure preserving since P (ω : a < Tω < b) = (b− a) for all0 < a < b < 1 (Problem 8.20).

This example is an equivalent version of the iid sequence δii≥1 ofBernoulli (1/2) random variables. To see this, let ω =

∑∞i=1

δi(ω)2i be the bi-

nary expansion of ω. Then δii≥1 is iid Bernoulli (1/2) and Tω = 2ω mod1 =

∑∞i=2

δi(ω)2i−1 (cf. Problem 7.4). Thus T corresponds with the unilateral

shift to right on the iid sequence δii≥1. For this reason, T is called theBernoulli shift.

Example 8.6.5: (Rotation). Let Ω = (x, y) : x2 + y2 = 1 be the unitcircle. Fix θ0 in [0, 2π). If ω = (cos θ, sin θ), θ in [0, 2π) set Tω =

(cos(θ +

θ0), sin(θ + θ0)). That is, T rotates any point ω on Ω counterclockwise

through an angle θ0. Then T is measure preserving w.r.t. the Uniformdistribution on [0, 2π].

Definition 8.6.3: Let (Ω,F , P ) be a probability space and T : Ω → Ωbe a 〈F ,F〉 measurable map. A set A ∈ F is T-invariant if A = T−1A.A set A ∈ F is almost T -invariant w.r.t. P if P (A T−1A) = 0 whereA1A2 = (A1 ∩Ac

2)∪ (Ac1 ∩A2) is the symmetric difference of A1 and A2.

Page 285: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.6 Ergodic theorems 273

It can be shown that A is almost T -invariant w.r.t. P iff there exists aset A′ that is T -invariant and P (AA′) = 0 (Problem 8.21).

Examples of T -invariant sets are A1 = ω : T jω ∈ A0 for infinitely manyi ≥ 1 where A0 ∈ F ; A2 =

ω : 1

n

∑nj=1 h(T jω) converges as n → ∞

where h : Ω → R is a F measurable function. On the other hand, the eventx : x1 ≤ 0 is not shift invariant in

(R∞,B(R∞)

)nor is it almost shift

invariant if P corresponds to the iid case with a nondegenerate distribution.The collection I of T -invariant sets is a σ-algebra and is called the in-

variant σ-algebra. A function h : Ω → R is I-measurable iff h(ω) = h(Tω)for all ω (Problem 8.22).

Definition 8.6.4: A measure preserving transformation T on a probabilityspace (Ω,F , P ) is ergodic or irreducible (w.r.t. P ) if A is T -invariant impliesP (A) = 0 or 1.

Definition 8.6.5: A stationary sequence of random variables Xii≥1is ergodic if the unilateral shift T is ergodic on the sequence space(R∞,B(R∞), P ) where P is the measure on R∞ induced by Xii≥1.

Example 8.6.6: Consider the above sequence space. Then A ∈ F is in-variant with respect to the unilateral shift implies that A is in the tailσ-algebra T ≡

⋂∞n=1 σ(Xj(ω), j ≥ n) (Problem 8.23). If Xii≥1 are inde-

pendent then by the Kolmogorov’s zero-one law, A ∈ T implies P (A) = 0or 1. Thus, if Xii≥1 are iid then it is ergodic.

On the other hand, mixtures of iid sequences are not ergodic as seenbelow.

Example 8.6.7: Let Xii≥1 and Yii≥1 be two iid sequences with dif-ferent distributions. Let δ be Bernoulli (p), 0 < p < 1 and independent ofboth Xii≥1 and Yii≥1. Let Zi ≡ δXi + (1− δ)Yi, i ≥ 1. Then Zii≥1is a stationary sequence and is not ergodic (Problem 8.24).

The above example can be extended to mixtures of irreducible positiverecurrent discrete state space Markov chains (Problem 8.25 (a)). Anotherexample is Example 8.6.5, i.e., rotation of the circle when θ is rational(Problem 8.25 (b)).

Remark 8.6.1: There is a simple example of a measure preserving trans-formation T that is ergodic but T 2 is not. Let Ω = ω1, ω2, ω1 = ω2. LetTω1 = ω2, Tω2 = ω1, P be the distribution P (ω1) = P (ω2) = 1

2 . ThenT is ergodic but T 2 is not (Problem 8.26).

Page 286: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

274 8. Laws of Large Numbers

8.6.2 Birkhoff’s ergodic theoremTheorem 8.6.1: Let (Ω,F , P ) be a probability space, T : Ω → Ω be ameasure preserving ergodic map on (Ω,F , P ) and X ∈ L1(Ω,F , P ). Then

1n

n−1∑j=0

X(T jω) → EX ≡∫

ΩXdP (6.2)

w.p. 1 and in L1 as n →∞.

Remark 8.6.2: A more general version is without the assumption ofT being ergodic. In this case, the right side of (6.2) is a random vari-able Y (ω) that is T -invariant, i.e., Y (ω) = Y (T (ω)) w.p. 1 and satisfies∫

AXdP =

∫A

Y dP for all T -invariant sets A. This Y is called the condi-tional expectation of X given I, the σ-algebra of invariant sets (Chapter13).

For a proof of this version, see Durrett (2004).The proof of Theorem 8.6.1 depends on the following inequality.

Lemma 8.6.2: (Maximal ergodic inequality). Let T be measure preservingon a probability space (Ω,F , P ) and X ∈ L1(Ω,F , P ). Let S0(ω) = 0,Sn(ω) =

∑n−1j=0 X(T jω), n ≥ 1, Mn(ω) = maxSj(ω) : 0 ≤ j ≤ n. Then

E(X(ω)I

(Mn(ω) > 0

))≥ 0.

Proof: By definition of Mn(ω), Sj(ω) ≤ Mn(ω) for 1 ≤ j ≤ n. Thus

X(ω) + Mn(Tω) ≥ X(ω) + Sj(Tω) = Sj+1(ω).

Also, since Mn(Tω) ≥ 0,

X(ω) ≥ X(ω)−Mn(Tω) = S1(ω)−Mn(Tω).

Thus X(ω) ≥ maxSj(ω) : 1 ≤ j ≤ n

− Mn(Tω). For ω such

that Mn(ω) > 0, Mn(ω) = maxSj(ω) : 1 ≤ j ≤ n

and hence

X(ω) ≥ Mn(ω) − Mn(Tω). Also, since X ∈ L1(Ω,F , P ) it follows thatMn ∈ L1(Ω,F , P ) for all n ≥ 1. Taking expectations yields

E(X(ω)I

(Mn(ω) > 0

))≥ E

(Mn(ω)−Mn(Tω)I

(Mn(ω) > 0

))≥ E

(Mn(ω)−Mn(Tω)I

(Mn(ω) ≥ 0

))(since Mn(Tω) ≥ 0)

= E(Mn(ω)−Mn(Tω)

)= 0,

since T is measure preserving.

Page 287: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.6 Ergodic theorems 275

Remark 8.6.3: Note that the measure preserving property of T is usedonly at the last step.

Proof of Theorem 8.6.1: W.l.o.g. assume that EX = 0. Let Z(ω) ≡lim supn→∞

Sn(ω)n . Fix ε > 0 and set Aε ≡ ω : Z(ω) > ε. It will be shown

that P (Aε) = 0. Clearly, Aε is T invariant. Since T is ergodic, P (Aε) = 0 or1. Suppose P (Aε) = 1. Let Y (ω) = X(ω)−ε. Let Mn,Y (ω) ≡ maxSj,Y (ω) :0 ≤ j ≤ n where S0,Y (ω) ≡ 0, Sj,Y (ω) ≡

∑j−1k=0 Y (T kω), j ≥ 1. Then by

Lemma 8.6.2 applied to Y (ω)

E(Y (ω)I

(Mn,Y (ω) > 0

))≥ 0.

But Bn ≡ ω : Mn,Y (ω) > 0 = ω : sup1≤j≤n1j Sj,Y (ω) > 0. Clearly,

Bn ↑ B ≡ ω : sup1≤j<∞1j Sj,Y (ω) > 0. Since 1

j Sj,Y (ω) = 1j Sj(ω) − ε

for j ≥ 1, B ⊃ Aε and since P (Aε) = 1, it follows that P (B) = 1. Also|Y | ≤ |X| + ε ∈ L1(Ω,F , P ). So by the bounded convergence theorem,0 ≤ E(Y IBn) → E(Y IB) = EY = 0 − ε < 0, which is a contradic-tion. Thus P (Aε) = 0. This being true for every ε > 0 it follows thatP (limn→∞

Sn(ω)n ≤ 0) = 1. Applying this to −X(ω) yields

P(

limn→∞

Sn(ω)n

≥ 0)

= 1

and hence P(limn→∞

Sn(ω)n = 0

)= 1.

To prove L1-convergence, note that applying the above to X+ and X−

yields

fn(ω) ≡ 1n

n∑i=1

X+(T iω) → EX+(ω) w.p. 1.

Since T is measure preserving∫

fn(ω)dP = EX+(ω) forall n. So by Scheffe’s theorem (Lemma 8.2.5),

∫|fn(ω) −

EX+(ω)|dP → 0, i.e., E∣∣ 1n

∑ni=1 X+(T iω)− EX+

∣∣ → 0. Similarly,E∣∣ 1n

∑ni=1 X−(T iω)− EX−∣∣→ 0. This yields L1 convergence.

Corollary 8.6.3: Let Xii≥1 be a stationary ergodic sequence of Rk

valued random variables on some probability space (Ω,F , P ). Let h : Rk →R be Borel measurable and let E|h(X1, X2, . . . , Xk)| < ∞. Then

1n

n∑i=1

h(Xi, Xi+1, . . . , Xi+k−1) → Eh(X1, X2, . . . , Xk) w.p. 1.

Proof: Consider the probability space Ω = (Rk)∞, F ≡ B((Rk)∞) and

P the probability measure induced by the map ω → (Xi(ω))i≥1 and theunilateral shift map T on Ω defined by T (xi)i≥1 = (xi)i≥2. Then T is

Page 288: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

276 8. Laws of Large Numbers

measure preserving and ergodic. So the corollary follows from Theorem8.6.1.

Remark 8.6.4: This corollary is useful in statistical time series analysis.If Xii≥1 is a real valued stationary ergodic sequence, then the mean m ≡EX1, variance Var(X1), and covariance Cov(X1, X2) can all be estimatedconsistently by the corresponding sample functions

1n

n∑i=1

Xi,1n

n∑i=1

X2i −

(1n

n∑i=1

Xi

)2

, and

1n

n∑i=1

XiXi+1 −(

1n

n∑i=1

Xi

)2

.

Further, the joint distribution of (X1, X2, . . . , Xk) for any k ≥ 1, canbe estimated consistently by the corresponding empirical measure, i.e.,Ln(A1, A2, . . . , Ak) ≡ 1

n

∑ni=1 I(Xi+k ∈ Ak, j = 1, 2, . . . , k), which con-

verges toP (X1 ∈ A1, X2 ∈ A2, . . . , Xk ∈ Ak) w.p. 1

where Ai ∈ B(R), i = 1, 2, . . . , k.

The next three results (Theorems 8.6.4–8.6.6) are consequences and ex-tensions of the ergodic theorem, Theorem 8.6.1. For proofs, see Durrett(2004).

The first one is the following result on the behavior of the log-likelihoodfunction of a stationary ergodic sequence of random variables with a finiterange.

Theorem 8.6.4: (Shannon-McMillan-Breiman theorem). Let Xii≥1 bea stationary ergodic sequence of random variables with values in a finiteset S ≡ a1, a2, . . . , ak. For each n, x1, x2, . . . , xn in S, let

p(xn | xn−1, xn−2, . . . , x1) = P (Xn = xn | Xj = xj , 1 ≤ j ≤ n− 1)

≡ P (Xj = xj : 1 ≤ j ≤ n)P (Xj = xj : 1 ≤ j ≤ n− 1)

whenever the denominator is positive and let p(x1, x2, . . . , xn) = P (X1 =x1, X2 = x2, . . . , Xn = xn). Then

limn→∞

1n

log p(X1, X2, . . . , Xn) = −H exists w.p. 1

where H ≡ limn→∞ E(− log p(Xn | Xn−1, Xn−2, . . . , X1)

)is called the

entropy rate of Xii≥1.

Remark 8.6.5: In the iid case this is a consequence of the strong law oflarge numbers, and H can be identified as

∑kj=1(− log pj)pj where pj =

Page 289: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.6 Ergodic theorems 277

P (X1 = aj), 1 ≤ j ≤ k. This is called the Kolmogorov-Shannon entropy ofthe distribution pj : 1 ≤ j ≤ k.

If Xii≥1 is a stationary ergodic Markov chain, then again it is a con-sequence of the strong law of large numbers, and H can be identified with

E(− log p(X2 | X1)

)=

k∑i=1

πi

k∑j=1

(− log pij)pij

where π ≡ πi : 1 ≤ i ≤ k is the stationary distribution and P ≡((pij)

)is

the transition probability matrix of the Markov chain Xii≥1. See Problem8.27.

A more general version of the ergodic Theorem 8.6.1 is the following.

Theorem 8.6.5: (Kingman’s subadditive ergodic theorem). Let Xm,n :0 ≤ m < nn≥1 be a collection of random variables such that

(i) X0,m + Xm,n ≥ X0,n for all 0 ≤ m < n, n ≥ 1.

(ii) For all k ≥ 1, Xnk,(n+1)kn≥1 is a stationary sequence.

(iii) The sequence Xm,m+k, k ≥ 1 has a distribution that does not de-pend on m ≥ 0.

(iv) EX+0,1 < ∞ and for all n, EX0,n

n ≥ γ0, where γ0 > −∞.

Then

(i) limn→∞

EX0,n

n = infn≥1

EX0,n

n ≡ γ.

(ii) limn→∞

X0,n

n ≡ X exists w.p. 1 and in L1, and EX = γ.

(iii) If Xnk,(n+1)kn≥1 is ergodic for each k ≥ 1, then X ≡ γ w.p. 1.

A nice application of this is a result on products of random matrices.

Theorem 8.6.6: Let Aii≥1 be a stationary sequence of k × k randommatrices with nonnegative entries. Let αm,n(i, j) be the (i, j)th entry inAm+1, · · · , An. Suppose E| log α1,2(i, j)| < ∞ for all i, j. Then

(i) limn→∞

1n log α0,n(i, j) = η exists w.p. 1.

(ii) For any m, limn→∞

1n log ‖Am+1 · · · , An‖ = η w.p. 1, where for any k×k

matrix B ≡ ((bij)), ‖B‖ = max∑k

j=1 |bij | : 1 ≤ i ≤ k.

Page 290: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

278 8. Laws of Large Numbers

Remark 8.6.6: A concept related to ergodicity is that of mixing. A mea-sure preserving transformation T on a probability space (Ω,F , P ) is mixingif for all A, B ∈ B

limn→∞

∣∣P (A ∩ T−nB)− P (A)P (T−nB)∣∣ = 0.

A stationary sequence of random variables Xii≥1 is mixing if the unilat-eral shift on the sequence space R∞ induced by Xii≥1 is mixing. If T ismixing and A is T -invariant, then taking B = A in the above yields

P (A) = P 2(A)

i.e., P (A) = 0 or 1. Thus, if T is mixing, then T is ergodic. Conversely, ifT is ergodic, then by Theorem 8.6.1, for any B in B

1n

n∑j=1

IB(T jω) → P (B) w.p. 1.

Integrating both sides over A w.r.t. P yields 1n

∑nj=1 P (A ∩ T−jB) →

P (A)P (B), i.e., T is mixing in an average sense, i.e., the Cesaro sense. Asufficient condition for a stationary sequence to be mixing is that the tailσ-algebra be trivial. If Xii≥1 is a stationary irreducible Markov chainwith a countable state space, then it is mixing iff it is aperiodic.

For proofs of the above results, see Durrett (2004).

8.7 Law of the iterated logarithm

Let Xnn≥1 be a sequence of iid random variables with EX1 = 0, EX21 =

1. The SLLN asserts that the sample mean Xn = 1n

∑ni=1 Xi → 0 w.p. 1.

The central limit theorem (to be proved later) asserts that for all −∞ <a < b < ∞, P (a ≤

√nXn ≤ b) → Φ(b) − Φ(a) where Φ(·) is the standard

Normal cdf. This suggests that Sn =∑n

i=1 Xi is of the order magnitude√

nfor large n. This raises the question of how large does Sn√

nget as a function

of n. It turns out that it is of the order√

2n log log n. More precisely, thefollowing holds:

Theorem 8.7.1: (Law of the iterated logarithm). Let Xi(ω)i≥1 be iidrandom variables on a probability space (Ω,F , P ) with mean zero and vari-ance one. Let S0(ω) = 0, Sn(ω) =

∑ni=1 Xi(ω), n ≥ 1. For each ω, let

A(ω) be the set of limit points of

Sn(ω)√2n log log n

n≥1

. Then Pω : A(ω) =

[−1,+1] = 1.

For a proof, see Durrett (2004).

Page 291: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.8 Problems 279

A deep generalization of the above was obtained by Strassen (1964).

Theorem 8.7.2: Under the setup of Theorem 8.7.1, the following holds:Let Yn( j

n ;ω) = Sj(ω)√2n log log n

, j = 0, 1, 2, . . . , n and Yn(t, ω) be the functionobtained by linearly interpolating the above values on [0, 1]. For each ω, letB(ω) be the set of limit points of Yn(·, ω)n≥1 in the function space C[0, 1]of all continuous functions on [0, 1] with the supnorm. Then

Pω : B(ω) = K = 1

where K ≡f : f : [0, 1] → R, f is continuously differentiable, f(0) = 0

and 12

∫ 10 (f ′(t))2dt ≤ 1

.

8.8 Problems

8.1 Prove Theorem 8.1.3 and Corollary 8.1.4.

(Hint: Use Chebychev’s inequality.)

8.2 Let Xnn≥1 be a sequence of random variables on a probabilityspace (Ω,F , P ) such that for some m ∈ N and for each i = 1, . . . , m,Xi, Xi+m, Xi+2m, . . . are identically distributed and pairwise inde-pendent. Furthermore, suppose that E(|X1|+ · · ·+ |Xm|) < ∞. Showthat

Xn −→1m

m∑i=1

EXi, w.p. 1.

(Hint: Reduce the problem to nonnegative Xn’s and apply Theorem8.2.7 for each i = 1, . . . , m.)

8.3 Let f be a bounded measurable function on [0,1] that is continuousat 1

2 . Evaluate limn→∞

∫ 10

∫ 10 · · ·

∫ 10 f(

x1+x2+···+xn

n

)dx1dx2 . . . dxn.

8.4 Show that if P (|X| > α) < 12 for some real number α, then any

median of X must lie in the interval [−α, α].

8.5 Prove Theorem 8.3.4 using Kolmogorov’s first inequality (Theorem8.3.1 (a)).

(Hint: Apply Theorem 8.3.1 to ∆n,k defined in the proof of Theorem8.3.3 to establish (3.4).)

8.6 Let Xnn≥1 be a sequence of iid random variables with E|X1|α < ∞for some α > 0. Derive a necessary and sufficient condition on αfor almost sure convergence of the series

∑∞n=1 Xn sin 2πnt for all

t ∈ (0, 1).

Page 292: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

280 8. Laws of Large Numbers

8.7 Show that for any given sequence of random variables Xnn≥1, thereexists a sequence of real numbers ann≥1 ⊂ (0,∞) such that Xn

an→ 0

w.p. 1.

8.8 Let Xnn≥1 be a sequence of independent random variables with

P (Xn = 2) = P (Xn = nβ) = an, P (Xn = an) = 1− 2an

for some an ∈ (0, 13 ) and β ∈ R. Show that

∑∞n=1 Xn converges if and

only if∑∞

n=1 an < ∞.

8.9 Let Xnn≥1 be a sequence of iid random variables with E|X1|p = ∞for some p ∈ (0, 2). Then P (lim sup

n→∞|n−1/p∑n

i=1 Xi| = ∞) = 1.

8.10 For any random variable X and any r ∈ (0,∞), E|X|r < ∞ iff∑∞n=1 nr−1(log n)rP (|X| > n log n) < ∞.

(Hint: Check that∑m

n=1 nr−1(log n)r ∼ r−1mr(log m)r as m →∞.)

8.11 Let Xnn≥1 be a sequence of independent random variables withEXn = 0, EX2

n = σ2n, s2

n =∑n

j=1 σ2j →∞. Then, show that for any

a > 12 ,

s−2n (log s2

n)−an∑

i=1

Xi → 0 w.p. 1.

8.12 Show that for p ∈ (0, 2), p = 1, (4.12) holds.

(Hint: For p ∈ (1, 2),∑∞

n=1 |EZn/n1/p| ≤∑∞

n=1 E|X1|I(|X1| >

n)n−1/p =∑∞

j=1∑j

n=1 n−1/p · E|X1|I(j < |X1|p ≤ j + 1) ≤p

p−1E|X1|p < ∞, by (4.10). For p ∈ (0, 1),∑∞

n=1 |EZn/n1/p| ≤∑∞j=1(

∑∞n=j n−1/p)E|X1|I(j − 1 < |X1|p ≤ j) ≤ 1

1−pE|X1|p, by(4.9).)

8.13 Let Yi = xiβ + εi, i ≥ 1 where εnn≥1 is a sequence of iid randomvectors, xnn≥1 is a sequence of constants, and β ∈ R is a constant(the regression parameter). Let βn =

∑ni=1 xiYi/

∑ni=1 x2

i denote (theleast squares) estimator of β. Let n−1∑n

i=1 x2i → c ∈ (0,∞) and

Eε1 = 0.

(a) If E|ε1|1+δ < ∞ for some δ ∈ (0,∞), then show that

βn −→ β as n →∞, w.p. 1. (8.1)

(b) Suppose sup|xi| : i ≥ 1 < ∞ and E|ε1| < ∞. Show that (8.1)holds.

Page 293: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.8 Problems 281

8.14 (Strongly consistent estimation.) Let Xii≥1 be random variables onsome probability space (Ω,F , P ) such that (i) for some integer m ≥ 1the collections Xi : i ≤ n and Xi : i ≥ n + m are independentfor each n ≥ 1, and (ii) the distribution of Xi+j ; 0 ≤ j ≤ k isindependent of i, for all k ≥ 0.

(a) Show that for every ≥ 1 and h : R → R withE|h(X1, X2, . . . , X)| < ∞, there are functions fn : Rn →Rn≥1 such that fn(X1, X2, . . . , Xn) → λ ≡ Eh(X1, X2, . . . , X)w.p. 1. In this case, one says λ is estimable from Xii≥1 in astrongly consistent manner.

(b) Now suppose the distribution µ(·) of X1 is a mixture of theform µ =

∑ki=1 αiµi. Suppose there exist disjoint Borel sets

Ai1≤i≤k in R such that µi(Ai) = 1 for each i. Show thatall the αi’s as well as λi ≡

∫hi(x)dµi where hi ∈ L1(µi) are

estimable from Xii≥1 in a strongly consistent manner.

8.15 (Normal numbers). Recall that in Section 4.5 it was shown that forany positive integer p > 1 and for any 0 ≤ ω ≤ 1, it is possible towrite ω as

ω =∞∑

i=1

Xi(ω)pi

(8.2)

where for each i, Xi(ω) ∈ 0, 1, 2, . . . , p−1. Recall also that such anexpansion is unique except for ω of the form q/pn, q = 1, 2, . . . , pn−1,n ≥ 1 in which case there are exactly two expansions, one of which isrecurring. In what follows, for such ω’s the recurrent expansion willbe the one used in (8.2). A number ω in [0,1] is called normal w.r.t.the integer p if for every finite pattern a1a2 . . . ak where k ≥ 1 is apositive integer and ai ∈ 0, 1, 2, . . . , p− 1 for 1 ≤ i ≤ k the relativefrequency 1

n

∑ni=1 δi(ω) where

δi(ω) =

1 if Xi+j(ω) = aj+1, j = 0, 1, 2, . . . , k − 10 otherwise

converges to p−k as n →∞. A number ω in [0,1] is called absolutelynormal if it is normal w.r.t. p for every integer p > 1. Show thatthe set A of all numbers ω in [0,1] that are absolutely normal hasLebesgue measure one.

(Hint: Note that in (8.2), the function Xi(ω)i≥1 are iid randomvariables. Now use Problem 8.14 repeatedly.)

8.16 Show that for the renewal sequence Sn∞n=0, if P (X1 > 0) > 0, then

limn→∞ Sn = ∞ w.p. 1.

Page 294: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

282 8. Laws of Large Numbers

8.17 (a) Show that ann≥0 of (5.16) is the unique solution to (5.15) byusing generating functions (cf. Section 5.5).

(b) Deduce Theorems 8.5.13 and 8.5.14 from Theorems 8.5.11 and8.5.12, respectively.

(Hint: For Theorems 8.5.13 use the DCT , and for Theorem8.5.14, show first that

k∑n=0

mn(h)(U((n + 1)h)− U(nh)

)≤ a(kh)

≤k∑

n=0

mn(h)(U((n + 1)h)− U(nh)

).)

8.18 (a) Let b(·) : [0,∞) → R be dri. Show that b(·) is Riemann inte-grable on every bounded interval. Conclude that if b(·) is dri itmust be continuous almost everywhere w.r.t. Lebesgue measure.

(b) Let b(·) : [0,∞) → R be Riemann integrable on [0, K] for eachK < ∞. Let h(·) : [0,∞) → R+ be nonincreasing and integrablew.r.t. Lebesgue measure and |b(·)| ≤ h(·) on [0,∞). Show thatb(·) is dri.

8.19 Verify that the sequence Ynn≥0 in Example 8.5.2 and the processY (t) : t ≥ 0 in Example 8.5.3 are both regenerative.

8.20 Show that the map T in Example 8.6.4 in Section 8.6 is measurepreserving.

(Hint: Show that for 0 < a < b < 1, P(ω : Tω ∈ (a, b)

)= (b− a).)

8.21 Let T be a measure preserving map on a probability space (Ω,F , P ).Show that A is almost T -invariant w.r.t. P iff there exists a set A1such that A1 = T−1A1 and P (AA1) = 0.

(Hint: Consider A1 =⋃∞

n=0 T−nA. )

8.22 Show that a function h : Ω → R is I-measurable iff h(ω) = h(Tω)for all ω where I is the σ-algebra of T -invariant sets.

8.23 Consider the sequence space(R∞,B(R∞)

). Show that A ∈ B(R∞)

is invariant w.r.t. the unilateral shift T implies that A is in the tailσ-algebra.

8.24 In Example 8.6.7 of Section 8.6, show that Zii≥1 is a stationarysequence that is not ergodic.

(Hint: Assuming it is ergodic, derive a contradiction using the er-godic Theorem 8.6.1.)

Page 295: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.8 Problems 283

8.25 (a) Extend Example 8.6.7 to the Markov chain case with two disjointirreducible positive recurrent subsets.

(b) Show that in Example 8.6.5, if θ0 is rational, then T is notergodic.

8.26 (a) Verify that in Remark 8.6.1, T is ergodic but T 2 is not.(b) Construct a Markov chain with four states for which T is ergodic

but T 2 is not.

8.27 In Remark 8.6.5, prove the Shannon-McMillan-Breiman theorem di-rectly for the Markov chain case.

(Hint: Express p(X1, X2, . . . , Xn) as(

n−1∏i=1

pXiXi+1

)p(X1).)

8.28 Let Xii≥1 be iid Bernoulli (1/2) random variables. Let

W1 =∞∑

i=1

2X2i

4i

W2 =∞∑

i=1

X2i−1

4i.

(a) Show that W1 and W2 are independent.(b) Let A1 = ω : ω ∈ (0, 1) such that in the expansion of ω in

base 4 only the digits 0 and 2 appear and A2 = ω : ω ∈(0, 1) such that in the expansion of ω in base 4 only the digits0 and 1 appear. Show that m(A1) = m(A2) = 0 where m(·)is Lebesgue measure and hence that the distribution of W1 andW2 are singular w.r.t. m(·).

(c) Let W ≡ W1 + W2. Then show that W has uniform (0,1) distri-bution.

(Hint: For (b) use the SLLN.)

Remark: This example shows that the convolution of two singularprobability measures can be absolutely continuous w.r.t. Lebesguemeasure.

8.29 Let Xnn≥1 be a sequence of pairwise independent and identicallydistributed random variables with P (X1 ≤ x) = F (x), x ∈ R. Fix0 < p < 1. Suppose that F (ζp + ε) > p for all ε > 0 where

ζp = F−1(p) ≡ infx : F (x) ≥ p.

Show that ζn ≡ F−1n (p) ≡ infx : Fn(x) ≥ p converges to ζp w.p. 1

where Fn(x) ≡ n−1∑ni=1 I(Xi ≤ x), x ∈ R is the empirical distribu-

tion function of X1, . . . , Xn.

Page 296: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

284 8. Laws of Large Numbers

8.30 Let Xii≥1 be random variables such that EX2i < ∞ for all i ≥ 1.

Suppose 1n

∑ni=1 EXi → 0 and an ≡ 1

n2

∑nj=0(n − j)v(j) → 0 as

n →∞ where v(j) = supi

∣∣Cov(Xi, Xi+j)∣∣.

(a) Show that Xn −→p 0.

(b) Suppose further that∑∞

n=1 an < ∞. Show that Xn → 0 w.p. 1.

(c) Show that as n →∞, v(n) → 0 implies an → 0 but the converseneed not hold.

8.31 Let Xii≥1 be iid random variables with cdf F (·). Let Fn(x) ≡1n

∑ni=1 I(Xi ≤ x) be the empirical cdf. Suppose xn → x0 and F (·)

is continuous at x0. Show that Fn(xn) → F (x0) w.p. 1.

8.32 Let p be a positive integer > 1. Let δii≥1 be iid random variablewith distribution P (δ1 = j) = pj , 0 ≤ j ≤ p−1, pj ≥ 0,

∑p−10 pj = 1.

Let X =∑∞

i=1δi

pi . Show that

(a) P (X ∈ (0, 1)) = 1.

(b) FX(x) ≡ P (X ≤ x) is continuous and strictly increasing in (0,1)if 0 < pj < 1 for any 0 ≤ j ≤ p− 1.

(c) FX(·) is absolutely continuous iff pj = 1j for all 0 ≤ j ≤ p− 1 in

which case FX(x) ≡ x, 0 ≤ x ≤ 1.

8.33 (Random AR-series). Let Xnn≥0 be a sequence of random variablessuch that

Xn+1 = ρn+1Xn + εn+1, n ≥ 0

where the sequence (ρn, εn)n≥1 are iid and independent of X0.

(a) Show that if E(log |ρ1|) < 0 and E(log |ε1|)+ < ∞ then

Xn ≡n∑

j=0

ρ1ρ2 . . . ρj , εj+1 converges w.p. 1.

(b) Show that under the hypothesis of (a), for any bounded contin-uous function h : R → R and for any distribution of X0

Eh(Xn) → Eh(X∞).

(Hint: Show by SLLN that there is a 0 < λ < 1 such thatρ1, ρ2, . . . , ρj = 0(λj) w.p. 1 as j → ∞ and by Borel-Cantelli|εj | = 0(λ′j) for some λ′ > 0 λ′λ < 1.)

8.34 (Iterated random functions). Let (S, ρ) be a complete separable met-ric space. Let (G,G) be a measurable space. Let f : G × S → S be

Page 297: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

8.8 Problems 285

〈G × B(S),B(S)〉 measurable function. Let (Ω,F , P ) be a probabil-ity space and θii≥1 be iid G-valued random variables on (Ω,F , P ).Let X0 be an S-valued random variable on (Ω,F , P ) independent ofθii≥1. Define Xnn≥0 by the random iteration scheme,

X0(x, ω) ≡ x

Xn+1(x, ω) = f(θn+1(ω), Xn(x, ω)

)n ≥ 0.

(a) Show that for each n ≥ 0, the map Xn = S× Ω → S is 〈B(S)×F ,B(S)〉 measurable.

(b) Let fn(x) ≡ fn(x, ω) ≡ f(θn(ω), x). Let Xn(x, ω) =f1(f2, . . . , fn(x)). Show that for each x and n, Xn(x, ω) andXn(x, ω) have the same distribution.

(c) Now assume that for all ω, f(θ1(ω), x) is Lipschitz from S to S,i.e.,

i(ω) ≡ supx =y

d(f(θi(ω), x), f(θi(ω), y))d(x, y)

< ∞.

Show that i(ω) is a random variable on (Ω,F , P ), i.e. that i(·) :Ω → R+ is 〈F ,B(R)〉 measurable.

(d) Suppose that E| log 1(ω)| < ∞ and E log 1(ω) < 0,E| log d(f(θ1, x), x)| < ∞ for all x. Show that limn Xn(x, ω) =X∞(ω) exists w.p. 1 and is independent of x w.p. 1.

(Hint: Use Borel-Cantelli to show that for each x,Xn(x, ω)n≥1 is Cauchy in (S, ρ).)

(e) Under the hypothesis in (d) show that for any bounded contin-uous h : S → R and for any x ∈ S, limn→∞ Eh(Xn(x, ω)) =Eh(X∞(ω)).

(f) Deduce the results in Problems 7.15 and 8.33 as special cases.

8.35 (Extension of Gilvenko-Cantelli (Theorem 8.2.4) to the multivari-ate case). Let Xnn≥1 be a sequence of pairwise independentand identically distributed random vectors taking values in Rk withcdf F (x) ≡ P

(X11 ≤ x1, X12 ≤ x2, . . . , X1k ≤ xk

)where X1 =

(X11, X12, . . . , X1k) and x = (x1, x2, . . . , xk) ∈ R. Let Fn(x) ≡1n

∑ni=1 I(Xi ≤ x) be the empirical cdf based on Xi1≤i≤n. Show

that sup|Fn(x)− F (x)| : x ∈ R → 0 w.p. 1.

(Hint: First prove an extension of Polya’s theorem (Lemma 8.2.6) tothe multivariate case.)

Page 298: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

9Convergence in Distribution

9.1 Definitions and basic properties

In this section, the notion of ‘convergence in distribution’ of a sequence ofrandom variables is discussed. The importance and usefulness of this no-tion lie in the following observation: if a sequence of random variables Xn

converges in distribution to a random variable X, then one may approx-imate the probabilities P (Xn ∈ A) by P (X ∈ A) for large n for a largeclass of sets A ∈ B(R). In many situations, exact evaluation of P (Xn ∈ A)is a more difficult task than the evaluation of P (X ∈ A). As a result, onemay work with the limiting value P (X ∈ A) instead of P (Xn ∈ A), whenn is large. As an example, consider the following problem from statisticalinference. Let Y1, Y2, . . . be a collection of iid random variables with a fi-nite second moment. Suppose that one is interested in finding the observedlevel of significance or the p-value for a statistical test of the hypothesesH0 : µ = 0 against an alternative H1 : µ = 0 about the population meanµ. If the test statistic Yn = n−1∑n

i=1 Yi is used and the test rejects H0for large values of |

√nYn|, then the p-value of the test can be found using

the function ψn(a) ≡ P0(|√

nYn| > a), a ∈ [0,∞), where P0 denotes thejoint distribution of Ynn≥1 under µ = 0. Note that here, finding ψn(·) isdifficult, as it depends on the joint distribution of Y1, . . . , Yn. If, however,one knows that under µ = 0,

√nYn converges in distribution to a normal

random variable Z (which is in fact guaranteed by the central limit the-orem, see Chapter 11), then one may approximate ψn(a) by P (|Z| > a),which can be found, e.g., by using a table of normal probabilities.

Page 299: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

288 9. Convergence in Distribution

The formal definition of ‘convergence in distribution’ is given below.

Definition 9.1.1: Let Xn, n ≥ 0 be a collection of random variables andlet Fn denote the cdf of Xn, n ≥ 0. Then, Xnn≥1 is said to converge indistribution to X0, written as Xn −→d X0, if

limn→∞ Fn(x) = F0(x) for every x ∈ C(F0) (1.1)

where C(F0) = x ∈ R : F0 is continuous at x.

Definition 9.1.2: Let µnn≥0 be probability measures on (R,B(R)).Then µnn≥1 is said to converge to µ0 weakly or in distribution, denotedby µn −→d µ0, if (1.1) holds with Fn(x) ≡ µn

((−∞, x]

), x ∈ R, n ≥ 0.

Unlike the notions of convergence in probability and convergence almostsurely, the notion of convergence in distribution does not require that therandom variables Xn, n ≥ 0 be defined on a common probability space.Indeed, for each n ≥ 0, Xn may be defined on a different probability space(Ωn,Fn, Pn) and Xnn≥1 may converge in distribution to X0. In sucha context, the notions of convergence of Xnn≥1 to X0 in probabilityor almost surely are not well defined. Definition 9.1.1 requires only thecdfs of Xn’s to converge to that of X0 at each x ∈ C(F0) ⊂ R, but doesnot require the (almost sure or in probability) convergence of the randomvariables Xn’s themselves.

Example 9.1.1: For n ≥ 1, let Xn ∼ Uniform (0, 1n ), i.e., Xn has the cdf

Fn(x) =

⎧⎨⎩

0 if x ≤ 0nx if 0 < x < 1

n1 if x ≥ 1

n

and let X0 be the degenerate random variable taking the value 0 withprobability 1, i.e., the cdf of X0 is

F0(x) =

0 if x < 01 if x ≥ 0.

Note that the function F0(x) is discontinuous only at x = 0. Hence,C(F0) = R\0. It is easy to verify that for every x = 0,

Fn(x) → F0(x) as n →∞.

Hence, Xn −→d X0.

Example 9.1.2: Let ann≥1 and bnn≥1 be sequences of real numberssuch that 0 < bn < ∞ for all n ≥ 1. Let Xn ∼ N(an, bn), n ≥ 1. Then, thecdf of Xn is given by

Fn(x) = Φ(x− an

bn

), x ∈ R (1.2)

Page 300: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

9.1 Definitions and basic properties 289

where Φ(x) =∫ x

−∞ φ(t)dt and φ(x) = (2π)−1/2 exp(−x2/2), x ∈ R. IfX0 ∼ N(a0, b0) for some a0 ∈ R, b0 ∈ [0,∞), then using (1.2), one canshow that Xn −→d X0 if and only if an → a0 and bn → b0 as n → ∞(Problem 9.8).

Next some simple implications of Definition 9.1.1 are considered.

Proposition 9.1.1: If Xn −→p X0, then Xn −→d X0.

Proof: Let Fn denote the cdf of Xn, n ≥ 0. Fix x ∈ C(F0). Then, for anyε > 0,

P (Xn ≤ x) ≤ P (X0 ≤ x + ε) + P (Xn ≤ x, X0 > x + ε)≤ P (X0 ≤ x + ε) + P (|Xn −X0| > ε) (1.3)

and similarly,

P (Xn ≤ x) ≥ P (X0 ≤ x− ε)− P (|Xn −X0| > ε). (1.4)

Hence, by (1.3) and (1.4),

F0(x− ε)− P (|Xn −X0| > ε) ≤ Fn(x) ≤ F0(x + ε) + P (|Xn −X0| > ε).

Since Xn −→p X0, letting n →∞, one gets

F0(x− ε) ≤ lim infn→∞ Fn(x) ≤ lim sup

n→∞Fn(x) ≤ F0(x + ε) (1.5)

for all ε ∈ (0,∞). Note that as x ∈ C(F0), F0(x−) = F0(x). Hence, lettingε ↓ 0 in (1.5), one has limn→∞ Fn(x) = F0(x). This proves the result.

As pointed out before, the converse of Proposition 9.1.1 is false in general.The following is a partial converse. The proof follows from the definitionsof convergence in probability and convergence in distribution and is left asan exercise (Problem 9.1).

Proposition 9.1.2: If Xn −→d X0 and P (X0 = c) = 1 for some c ∈ R,then Xn −→p c.

Theorem 9.1.3: Let Xn, n ≥ 0 be a collection of random variables withrespective cdfs Fn, n ≥ 0. Then, Xn −→d X0 if and only if there exists adense set D in R such that

limn→∞ Fn(x) = F0(x) for all x ∈ D. (1.6)

Proof: Since C(F0)c has at most countably many points, the ‘only if’ partfollows. To prove the ‘if’ part, suppose that (1.6) holds. Fix x ∈ C(F0).Then, there exist sequences xnn≥1, ynn≥1 in D such that xn ↑ x andyn ↓ x as n →∞. Hence, for any k, n ∈ N,

Fn(xk) ≤ Fn(x) ≤ Fn(yk).

Page 301: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

290 9. Convergence in Distribution

By (1.6), for every k ∈ N,

F0(xk) = limn→∞ Fn(xk) ≤ lim inf

n→∞ Fn(x)

≤ lim supn→∞

Fn(x) ≤ limn→∞ Fn(yk) = F0(yk). (1.7)

Since x ∈ C(F0), limk→∞ F0(xk) = F0(x) = limk→∞ F0(y). Hence, by(1.7), limn→∞ Fn(x) exists and equals F0(x). This completes the proof ofTheorem 9.1.3.

Theorem 9.1.4: (Polya’s theorem). Let Xn, n ≥ 0 be random variableswith respective cdfs Fn, n ≥ 0. If F0 is continuous on R, then

supx∈R

∣∣Fn(x)− F0(x)∣∣→ 0 as n →∞.

Proof: This is a special case of Lemma 8.2.6 and uses the following propo-sition.

Proposition 9.1.5: If a cdf F is continuous on R, then it is uniformlycontinuous on R.

The proof of Proposition 9.1.5 is left as an exercise (Problem 9.2).

Theorem 9.1.6: (Slutsky’s theorem). Let Xnn≥1 and Ynn≥1 be twosequences of random variables such that for each n ≥ 1, (Xn, Yn) is definedon a probability space (Ωn,Fn, Pn). If Xn −→d X and Yn −→p a for somea ∈ R, then

(i) Xn + Yn −→d X + a,

(ii) XnYn −→d aX, and

(iii) Xn/Yn −→d X/a, provided a = 0.

Proof: Only a proof of part (i) is given here. The other parts may beproved similarly. Let F0 denote the cdf of X. Then, the cdf of X + a isgiven by F (x) = F0(x − a), x ∈ R. Fix x ∈ C(F ). Then, x − a ∈ C(F0).For any ε > 0 (as in the derivations of (1.3) and (1.4)),

P (Xn + Yn ≤ x) ≤ P (|Yn − a| > ε) + P (Xn + a− ε ≤ x) (1.8)

and

P (Xn + Yn ≤ x) ≥ P (Xn + a + ε ≤ x)− P (|Y − a| > ε). (1.9)

Now fix ε > 0 such that x− a− ε, x− a + ε ∈ C(F0). This is possible sinceR\C(F0) is countable. Then, from (1.8) and (1.9), it follows that

lim supn→∞

P (Xn + Yn ≤ x)

Page 302: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

9.2 Vague convergence, Helly-Bray theorems, and tightness 291

≤ limn→∞

[P ((Yn − a) > ε) + P (Xn ≤ x− a + ε)

]= F0(x− a + ε) (1.10)

and similarly,

lim infn→∞ P (Xn + Yn ≤ x) ≥ F0(x− a− ε). (1.11)

Now letting ε → 0+ in such a way that x − a ± ε ∈ C(F0), from (1.10)and (1.11), it follows that

F0((x− a)−) ≤ lim infn→∞ P (Xn + Yn ≤ x)

≤ lim supn→∞

P (Xn + Yn ≤ x)

≤ F0(x− a).

Since x− a ∈ C(F0), (i) is proved.

9.2 Vague convergence, Helly-Bray theorems, andtightness

One version of the Bolzano-Weirstrass theorem from real analysis statesthat if A ⊂ [0, 1] is an infinite set, then there exists a sequence xnn≥1 ⊂ Asuch that limn→∞ xn ≡ x exists in [0, 1]. Note that x need not be in Aunless A is closed. There is an analog of this for sub-probability measureson (R,B(R)), i.e., for measures µ on (R,B(R)) such that µ(R) ≤ 1. First,one needs a definition of convergence of sub-probability measures.

Definition 9.2.1: Let µnn≥1, µ be sub-probability measures on(R,B(R)). Then µnn≥1 is said to converge to µ vaguely, denoted byµn −→v µ, if there exists a set D ⊂ R such that D is dense in R and

µn((a, b]) → µ((a, b]) as n →∞ for all a, b ∈ D. (2.1)

Example 9.2.1: Let Xnn≥1, X be random variables such that Xn

converges to X in distribution, i.e.,

Fn(x) ≡ P (Xn ≤ x) → F (x) ≡ P (X ≤ x) (2.2)

for all x ∈ C(F ), the set of continuity points of F . Since the complementof C(F ) is at most countable, (2.2) implies that µn −→v µ where µn(·) ≡P (Xn ∈ ·) and µ(·) ≡ P (X ∈ ·).

Remark 9.2.1: It follows from above that if µnn≥1, µ are probabilitymeasures, then

µn −→d µ ⇒ µn −→v µ. (2.3)

Page 303: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

292 9. Convergence in Distribution

Conversely, it is not difficult to show that (Problem 9.4) if µn −→v µand µn and µ are probability measures, then µn −→d µ.

Example 9.2.2: Let µn be the probability measure corresponding to theUniform distribution on [−n, n], n ≥ 1. It is easy to show that µn −→v µ0,where µ0 is the measure that assigns zero mass to all Borel sets. This showsthat if µn −→v µ, then µn(R) need not converge to µ(R). But if µn(R) doesconverge to µ(R) and µ(R) > 0 and if µn −→v µ, then it can be shownthat µ′

n −→d µ′ where µ′n = µn

µn(R) and µ′ = µµ(R) .

Theorem 9.2.1: (Helly’s selection theorem). Let A be an infinite collec-tion of sub-probability measures on (R,B(R)). Then, there exist a sequenceµnn≥1 ⊂ A and a sub-probability measure µ such that µn −→v µ.

Proof: Let D ≡ rnn≥1 be a countable dense set in R (for example, onemay take D = Q, the set of rationals or D = Dd, the set of all dyadicrationals of the form j/2n : j an integer, n a positive integer). Let foreach x, A(x) ≡ µ((−∞, x]) : µ ∈ A. Then A(x) ⊂ [0, 1] and so by theBolzano-Weirstrass theorem applied to the set A(r1), one gets a sequenceµ1nn≥1 ⊂ A such that limn→∞ F1n(r1) ≡ F (r1) exists, where F1i(x) ≡µ1i((−∞, x]), x ∈ R. Next, applying the Bolzano-Weirstrass theorem toF1n(r2)n≥1 yields a further subsequence µ2nn≥1 ⊂ µ1nn≥1 ⊂ Asuch that limn→∞ F2n(r2) ≡ F (r2) exists, where F2i(x) ≡ µ2i((−∞, x]),i ≥ 1. Continuing this way, one obtains a sequence of nested subsequencesµjnn≥1, j = 1, 2, . . . such that for each j, limn→∞ Fjn(rj) ≡ F (rj) exists.In particular, for the subsequence µnnn≥1,

limn→∞ Fnn(rj) = F (rj) (2.4)

exists for all j. Now set

F (x) ≡ infF (r) : r > x, r ∈ D. (2.5)

Then, F (·) is a nondecreasing right continuous function on R (Problem 9.5)and it equals F (·) on D. Let µ be the Lebesgue-Stieltjes measure generatedby F . Since Fnn(x) ≤ 1 for all n and x, it follows that F (x) ≤ 1 for all xand hence that µ is a sub-probability measure. Suppose it is shown that(2.4) also implies that

limn→∞ Fnn(x) = F (x) (2.6)

for all x ∈ CF , the set of continuity points of F . Then, all a, b ∈ CF ,µnn((a, b]) ≡ Fnn(b) − Fnn(a) → F (b) − F (a) = µ((a, b]) and hence thatµn −→v µ. To establish (2.6), fix x ∈ CF and ε > 0. Then, there is a δ > 0such that for all x−δ < y < x+δ, F (x)−ε < F (y) < F (x)+ε. This impliesthat there exist x − δ < r < x < r′ < x + δ, r, r′ ∈ D and F (x) − ε <F (r) ≤ F (x) ≤ F (r′) < F (x) + ε. Since Fnn(r) ≤ Fnn(x) ≤ Fnn(r′), it

Page 304: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

9.2 Vague convergence, Helly-Bray theorems, and tightness 293

follows that

F (x)− ε ≤ limn→∞

Fnn(x) ≤ limn→∞ Fnn(x) ≤ F (x) + ε,

establishing (2.6).

Next, some characterization results on vague convergence and conver-gence in distribution will be established. These can then be used to definethe notions of convergence of sub-probability measures on more generalmetric spaces.

Theorem 9.2.2: (The first Helly-Bray theorem or the Helly-Bray theoremfor vague convergence). Let µnn≥1 and µ be sub-probability measures on(R,B(R)). Then µn −→v µ iff∫

fdµn →∫

fdµ (2.7)

for all f ∈ C0(R) ≡ g | g : R → R is continuous and lim|x|→∞ g(x) = 0.

Proof: Let µn −→v µ and let f ∈ C0(R). Given ε > 0, choose K largesuch that |f(x)| < ε for |x| > K. Since µn −→v µ, there exists a dense setD ⊂ R such that µn((a, b]) → µ((a, b]) for all a, b ∈ D. Now choose a, b ∈ Dsuch that a < −K and b > K. Since f is uniformly continuous on [a, b] andD is dense in R, there exist points x0 = a < x1 < x2 < · · · < xm = b in Dsuch that supxi≤x≤xi+1

|f(x)− f(xi)| < ε for all 0 ≤ i < m. Now

∫fdµn =

∫(−∞,a]

fdµn +m−1∑i=0

∫(xi,xi+1]

fdµn +∫

(b,∞)fdµn

and so∣∣∣∣∫

fdµn −m−1∑i=0

f(xi)µn((xi, xi+1])∣∣∣∣ < 2ε + ε · µn((a, b]) < 3ε.

A similar approximation holds for∫

fdµ. Since µn, µ are sub-probabilitymeasures, it follows that

∣∣∣∣∫

fdµn −∫

fdµ

∣∣∣∣ < 6ε + ‖f‖m∑

i=0

∣∣µn((xi, xi+1])− µ((xi, xi+1])∣∣,

where ‖f‖ = sup|f(x)| : x ∈ R. Letting n →∞ and noting that µn −→v

µ and ximi=0 ⊂ D, one gets

lim supn→∞

∣∣∣∣∫

fdµn −∫

fdµ

∣∣∣∣ ≤ 6ε.

Page 305: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

294 9. Convergence in Distribution

Since ε > 0 is arbitrary, (2.7) follows and the proof of the “only if” part iscomplete.

To prove the converse, let D be the set of points x : µ(x) = 0. Fixa, b ∈ D, a < b. Let ε > 0. Let f1 be the function defined by

f1(x) =

⎧⎨⎩

1 if a ≤ x ≤ b0 if x < a− ε or x > b + εlinear on a− ε ≤ x < a, b ≤ x ≤ b + ε.

Then, f1 ∈ C0(R) and by (2.7),∫f1dµn →

∫f1dµ.

But µn((a, b]) ≤∫

f1dµn and∫

f1dµ ≤ µ((a − ε, b + ε]). Thus,lim supn→∞ µn((a, b]) ≤ µ((a − ε, b + ε]). Letting ε ↓ 0 and noting thata, b ∈ D, one gets

lim supn→∞

µn((a, b]) ≤ µ((a, b]). (2.8)

A similar argument with f2 = 1 on [a + ε, b − ε] and 0 for x ≤ a and ≥ band linear in between, yields

lim infn→∞ µn((a, b]) ≥ µ((a, b]).

This with (2.8) completes the proof of the “if” part.

Theorem 9.2.3: (The second Helly-Bray theorem or the Helly-Bray the-orem for weak convergence). Let µnn≥1, µ be probability measures on(R,B(R)). Then, µn −→d µ iff∫

fdµn →∫

fdµ (2.9)

for all f ∈ CB(R) ≡ g | g : R → R, g is continuous and bounded.

Proof: Let µn −→d µ. Let ε > 0 and f ∈ CB(R) be given. Choose K largesuch that µ((−K,K]) > 1− ε. Also, choose a < −K and b > K such thatµ(a) = µ(b) = 0, a, b ∈ D. Let a = x0 < x1 < < xm = b be chosenso that x0, . . . , xm ∈ D and

supxi≤x≤xi+1

|f(x)− f(xi)| < ε

for all i = 1, . . . , m−1. Since∫

fdµn−∫

fdµ =∫(−∞,a] fdµn−

∫(−∞,a] fdµ+∑m−1

i=1 (∫(xi,xi+1]

fdµn−∫(xi,xi+1]

fdµ)+∫(b,∞) fdµn−

∫(b,∞) fdµ, it follows

that ∣∣∣∣∫

fdµn −∫

fdµ

∣∣∣∣ < ‖f‖[(

µn((−∞, a]) + µ((−∞, a]))

Page 306: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

9.2 Vague convergence, Helly-Bray theorems, and tightness 295

+m−1∑i=0

∣∣µn((xi, xi+1])− µ((xi, xi+1])∣∣

+ µn((b,∞)) + µ((b,∞))].

Since, a, b, x0, x1, . . . , xm ∈ D,

lim supn→∞

∣∣∣∣∫

fdµn −∫

fdµ

∣∣∣∣ ≤ ‖f‖2(1− µ((a, b])) ≤ ‖f‖2ε.

Since ε > 0 is arbitrary, the “only if” part is proved.Next consider the “if” part. Since C0(R) ⊂ CB(R), (2.9) and Theorem

9.2.2 imply that µn −→v µ. As noted in Remark 9.2.1, if µnn≥1, µ areprobability measures then µn −→v µ iff µn −→d µ. So the proof is complete.

Definition 9.2.2:

(a) A sequence of probability measures µnn≥1 on (R,B(R)) is calledtight if for any ε > 0, there exists M = Mε ∈ (0,∞) such that

supn≥1

µn

([−M,M ]c

)< ε. (2.10)

(b) A sequence of random variables Xnn≥1 is called tight or stochasti-cally bounded if the sequence of probability distributions µnn≥1 ofXnn≥1 is tight, i.e., given any ε > 0, there exists M = Mε ∈ (0,∞)such that

supn≥1

P (|Xn| > M) < ε. (2.11)

Remark 9.2.3: In Definition 9.2.2 (b), the random variables Xn, n ≥ 1need not be defined on a common probability space. If Xn is defined on aprobability space (Ωn,Fn, Pn), n ≥ 1, then (2.11) needs to be replaced by

supn≥1

Pn(|Xn| > M) < ε.

Example 9.2.3: Let Xn ∼ Uniform(n, n + 1). Then, for any given M ∈(0,∞),

P (|Xn| > M) ≥ P (Xn > M) = 1 for all n > M.

Consequently, for any M ∈ (0,∞),

supn≥1

P (|Xn| > M) = 1

and the sequence Xnn≥1 cannot be stochastically bounded.

Page 307: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

296 9. Convergence in Distribution

Example 9.2.4: For n ≥ 1, let

Xn ∼ Uniform(an, 2 + an), (2.12)

where an = (−1)n. Then, Xnn≥1 is stochastically bounded. Indeed,|Xn| ≤ 3 for all n ≥ 1 and therefore, for any ε > 0, (2.11) holds withM = 3. Note that in this example, the sequence Xnn≥1 does not con-verge in distribution to a random variable X. From (2.12), it follows thatas k →∞,

X2k −→d Uniform(1, 3), X2k−1 −→d Uniform(−1, 1). (2.13)

Examples 9.2.3 and 9.2.4 highlight two important characteristics of atight sequence of random variables or probability measures. First, the no-tion of tightness of probability measures or random variables is analogousto the notion of boundedness of a sequence of real numbers. For a sequenceof bounded real numbers xnn≥1, all the xn’s must lie in a bounded inter-val [−M,M ], M ∈ (0,∞). For a sequence of random variables Xnn≥1,the condition of tightness requires that given ε > 0 arbitrarily small, thereexists an M = Mε in (0,∞) such that for each n, Xn lies in [−M,M ] withprobability at least 1− ε. Thus, for a tight sequence of random variables,no positive mass can escape to ±∞, which is contrary to what happenswith the random variables Xnn≥1 of Example 9.2.3.

The second property illustrated by Example 9.2.4 is that like a boundedsequence of real numbers, a tight or stochastically bounded sequence ofrandom variables may not converge in distribution, but has one or moreconvergent subsequences (cf. (2.13)). Indeed, the notion of tightness canbe characterized by this property, as shown by the following result. Forconsistency with the other results in this section, it is stated in terms ofprobability measures instead of random variables.

Theorem 9.2.4: Let µnn≥1 be a sequence of probability measureon (R,B(R)). The sequence µnn≥1 is tight iff given any subsequenceµnii≥1 of µnn≥1, there exists a further subsequence µmii≥1 ofµnii≥1 and a probability measure µ on (R,B(R)) such that

µmi−→d µ as i →∞. (2.14)

Proof: Suppose that µnn≥1 is tight. Given any subsequence µnii≥1

of µnn≥1, by Helly’s selection theorem (Theorem 9.2.1), there exists asub-probability measure µ and a further subsequence µmii≥1 of µnii≥1such that

µmi−→v µ. (2.15)

Next, fix ε ∈ (0, 1). Since µnn≥1 is tight, there exists M ∈ (0,∞) suchthat

supn≥1

µn

([−M,M ]c

)< ε. (2.16)

Page 308: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

9.2 Vague convergence, Helly-Bray theorems, and tightness 297

By (2.15) and (2.16), there exists a, b ∈ D, a < −M , b > M such that

µ((a, b]

)= lim

i→∞µmi

((a, b]

)≥ lim inf

i→∞µmi

([−M,M ]

)= 1− lim sup

n→∞µmi

([−M,M ]

)≥ 1− ε.

Since ε ∈ (0, 1) is arbitrary, this shows that µ is a probability measure andhence, the ‘only if’ part is proved. Next, consider the ‘if part.’ Supposeµnn≥1 is not tight. Then, there exists ε0 ∈ (0, 1) such that for all M ∈(0,∞),

supn≥1

µn

([−M,M ]c

)> ε0.

Hence, for each k ∈ N, there exists nk ∈ N such that

µnk

([−k, k]c

)≥ ε0. (2.17)

Since any finite collection of probability measures on (R,B(R)) is tight,it follows that µnk

: k ∈ N is a countable infinite set. Hence, by thehypothesis, there exist a subsequence µmi

i≥1 in µnk: k ∈ N and a

probability measure µ such that

µmi −→d µ as i →∞. (2.18)

Let a, b ∈ R be such that µ(a) = 0 = µ(b) and µ((a, b]c

)< ε0/2. By

(2.18), there exists i0 ≥ 1 such that for all i ≥ i0,

µmi

((a, b]c

)< µ

((a, b]c

)+ ε0/2 < ε0.

Since (a, b]c ⊃ [−k, k]c for all k > max|a|, |b| and µmi : i ≥ i0 ⊂ µnk:

k ∈ N, this contradicts (2.17). Hence, µnn≥1 is tight.

Theorem 9.2.5: Let µnn≥1, µ be probability measures on (R,B(R)). Ifµn −→d µ, then µnn≥1 is tight.

Proof: Fix ε ∈ (0,∞). Then, there exists a, b ∈ R such that µ(a) = 0 =µ(b) and µ

((a, b]c

)< ε/2. Since µn −→d µ, there exists n0 ≥ 1 such that

for all n ≥ n0, ∣∣µn

((a, b]

)− µ((a, b]

)∣∣ < ε/2.

Thus, for all n ≥ n0,

µn

((a, b]c

)≤ µ

((a, b]c

)+ ε/2 < ε. (2.19)

Also, for each n = 1, . . . , n0, there exist Mi ∈ (0,∞) such that

µi

([−Mi, Mi]c

)< ε. (2.20)

Page 309: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

298 9. Convergence in Distribution

Let M = maxMi : 0 ≤ i ≤ n0, where M0 = max|a|, |b|. Then by (2.19)and (2.20),

supn≥1

µn

([−M,M ]c

)< ε.

Thus, µnn≥1 is tight.

An easy consequence of Theorems 9.2.4 and 9.2.5 is the following char-acterization of weak convergence.

Theorem 9.2.6: Let µnn≥1 be a sequence of probability measures on(R,B(R)). Then µn −→d µ iff µnn≥1 is tight and all weakly convergentsubsequences of µnn≥1 converge to the same limiting probability measureµ.

Proof: If µn −→d µ, then any weakly convergent subsequence of µnn≥1converges to µ and by Theorem 9.2.5, µnn≥1 is tight. Hence, the ‘only if’part follows. To prove the ‘if’ part, suppose that µnn≥1 is tight and thatall weakly convergent subsequences of µnn≥1 converges to µ. Let Fnn≥1and F denote the cdfs corresponding to µnn≥1 and µ, respectively. Ifpossible, suppose that µnn≥1 does not converge in distribution to µ.Then, by definition, there exists x0 ∈ R with µ

(x0

)= 0 such that Fn(x0)

does not converge to F (x0) as n → ∞. Then, there exist ε0 ∈ (0, 1) and asubsequence nii≥1 such that

∣∣Fni(x0)− F (x0)

∣∣ ≥ ε0 for all i ≥ 1. (2.21)

Since µnn≥1 is tight, there exists a subsequence mii≥1 ⊂ nii≥1 anda probability measure µ0 such that

µmi−→d µ0 as i →∞. (2.22)

By hypothesis, µ0 = µ. Hence µ0(x0) = µ(x0

)= 0 and by (2.22),

Fmi(x0) → F (x0) as i →∞,

contradicting (2.21). Therefore, µn −→d µ.

For another proof of the ‘if’ part, see Problem 9.6.

Note that by Slutsky’s theorem, if Xn −→d X and Yn −→p 0, thenXnYn −→p 0. The following result gives a refinement of this.

Proposition 9.2.7: If Xnn≥1 is stochastically bounded and Yn −→p 0,then XnYn −→p 0.

The proof is left as an exercise (Problem 9.7).

Page 310: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

9.3 Weak convergence on metric spaces 299

9.3 Weak convergence on metric spaces

The Helly-Bray theorems proved above suggest the following definitions ofvague convergence and convergence in distribution for measures on metricspaces. Recall that (S, d) is called a metric space, if S is a nonempty setand d is a function from S× S → [0,∞) such that

(i) d(x, y) = d(y, x) for all x, y ∈ S,

(ii) d(x, y) = 0 iff x = y for all x, y ∈ S,

(iii) d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ S.

A common example of a metric space is given by S = Rk and d(x, y),the Euclidean distance. A set G ⊂ S is open if for all x ∈ G, there existsan ε > 0 such that for any y in S, d(x, y) < ε ⇒ y ∈ G. The set

B(x, ε) = y : d(x, y) < ε

is called the open ball of radius ε with center at x, x ∈ S, ε > 0. Recall thatf : S → R is continuous if f−1((a, b)) is open for every −∞ < a < b < ∞.A family G of open sets in S is called an open cover for a set B ⊂ S if foreach x ∈ B, there exists a G ∈ G such that x ∈ G. A set K ⊂ S is calledcompact if given any open cover G for K, there is a finite subfamily G1 ⊂ Gsuch that G1 is an open cover for K.

Let S be the Borel σ-algebra on S, i.e., let S be the σ-algebra generatedby the open sets in S. A measure on the measurable space (S,S) is oftensimply referred to as a measure on (S, d).

Definition 9.3.1: Let µnn≥1 and µ be sub-probability measures on ametric space (S, d), i.e., µnn≥1 and µ are measures on (S,S) such thatµn(S) ≤ 1 for all n ≥ 1 and µ(S) ≤ 1. Then µnn≥1 converges vaguely toµ (written as µn −→v µ) if ∫

fdµn →∫

fdµ (3.1)

for all f ∈ C0(S), where C0(S) ≡ f | f : S → R, f is continuous andfor every ε > 0, there exists a compact set K such that |f(x)| < ε for allx ∈ K.

Definition 9.3.2: Let µnn≥1 and µ be probability measures on a metricspace (S, d). Then µnn≥1 converges in distribution or converges weaklyto µ (written as µn −→d µ) if∫

fdµn →∫

fdµ (3.2)

for all f ∈ CB(S) ≡ f | f : S → R, f is continuous and bounded.

Page 311: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

300 9. Convergence in Distribution

Recall that a sequence {xn}n≥1 in a metric space (S, d) is called Cauchy if for every ε > 0, there exists Nε such that n, m > Nε ⇒ d(xn, xm) < ε. A metric space (S, d) is complete if every Cauchy sequence {xn}n≥1 in S converges in S, i.e., given a Cauchy sequence {xn}n≥1, there exists an x in S such that d(xn, x) → 0 as n → ∞.

Example 9.3.1: For any k ∈ N, Rk with the Euclidean metric is complete, but the set of all rational vectors Qk with the Euclidean metric d(x, y) ≡ ‖x − y‖ is not complete. The set C[0, 1] of all continuous functions on [0, 1] is complete with the supremum metric d(f, g) = sup{|f(u) − g(u)| : 0 ≤ u ≤ 1}, but the set of all polynomials on [0, 1] is not complete under the same metric.

Recall that a set D is called dense in (S, d) if B(x, ε) ∩ D ≠ ∅ for all x ∈ S and for all ε > 0, where B(x, ε) is the open ball with center at x and radius ε. Also, (S, d) is called separable if there exists a countable dense set D ⊂ S.

Definition 9.3.3: A metric space (S, d) is called Polish if it is complete and separable.

Example 9.3.2: All Euclidean spaces, with the Euclidean metric as well as with the Lp metric for 1 ≤ p ≤ ∞, are Polish. The space C[0, 1] of continuous functions on [0,1] with the supremum metric is Polish. All Lp-spaces over measure spaces with a σ-finite measure and a countably generated σ-algebra, 1 ≤ p < ∞, are Polish (cf. Chapter 3).

The following theorem gives several equivalent conditions for weak convergence of probability measures on a Polish space.

Theorem 9.3.1: Let (S, d) be Polish and let {µn}n≥1, µ be probability measures. Then the following are equivalent:

(i) µn −→d µ.

(ii) For any open set G, lim inf_{n→∞} µn(G) ≥ µ(G).

(iii) For any closed set C, lim sup_{n→∞} µn(C) ≤ µ(C).

(iv) For all B ∈ S such that µ(∂B) = 0,

lim_{n→∞} µn(B) = µ(B),

where ∂B is the boundary of B, i.e., ∂B = {x : for all ε > 0, B(x, ε) ∩ B ≠ ∅ and B(x, ε) ∩ Bc ≠ ∅}.

(v) For every uniformly continuous and bounded function f : S → R, ∫ f dµn → ∫ f dµ.


The proof uses the following fact.

Proposition 9.3.2: For every open set G in a metric space (S, d), there exists a sequence {fn}n≥1 of bounded continuous functions from S to [0,1] such that as n ↑ ∞, fn(x) ↑ IG(x) for all x ∈ S.

Proof: Let Gn ≡ {x : d(x, Gc) > 1/n}, where for any set A in (S, d), d(x, A) ≡ inf{d(x, y) : y ∈ A}. Then, since G is open, d(x, Gc) > 0 for all x in G. Thus Gn ↑ G. Let, for each n ≥ 1,

fn(x) ≡ d(x, Gc) / [d(x, Gc) + d(x, Gn)],  x ∈ S.    (3.3)

Check that (Problem 9.10) for each n, fn(x) is continuous on S, fn(x) = 1 on Gn and 0 on Gc, and 0 ≤ fn(x) ≤ 1 for all x in S. Further, fn(·) ↑ IG(·).
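
To make the construction in (3.3) concrete, the following Python sketch (an added illustration, not from the text) evaluates fn for the open set G = (0, 1) in R with the usual metric, where d(x, Gc) and d(x, Gn) have closed forms; the values rise toward the indicator of G as n grows.

import numpy as np

def d_to_complement(x):               # d(x, G^c) for G = (0, 1)
    return np.clip(np.minimum(x, 1.0 - x), 0.0, None)

def d_to_Gn(x, n):                    # d(x, G_n), where G_n = (1/n, 1 - 1/n)
    a, b = 1.0 / n, 1.0 - 1.0 / n
    return np.maximum.reduce([a - x, x - b, np.zeros_like(x)])

def f_n(x, n):
    num = d_to_complement(x)
    return num / (num + d_to_Gn(x, n))    # denominator is positive for every x when n >= 3

xs = np.array([-0.5, 0.01, 0.1, 0.5, 0.9, 0.99, 1.5])
for n in [4, 10, 100]:
    print(n, np.round(f_n(xs, n), 3))     # rises toward 1 on (0,1), stays 0 off [0,1]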

Proof of Theorem 9.3.1: (i) ⇒ (ii): Let G be open. Choose {fn}n≥1 as in Proposition 9.3.2. Then for j ∈ N,

µn(G) ≥ ∫ fj dµn  ⇒  lim inf_{n→∞} µn(G) ≥ lim inf_{n→∞} ∫ fj dµn = ∫ fj dµ

(by (i)). But lim_{j→∞} ∫ fj dµ = µ(G), by the bounded convergence theorem. Hence (ii) holds.

(ii) ⇔ (iii): Suppose (ii) holds. Let C be closed. Then G = Cc is open. So by (ii),

lim inf_{n→∞} µn(Cc) ≥ µ(Cc)  ⇒  lim sup_{n→∞} µn(C) ≤ µ(C),

since µn and µ are probability measures. Thus, (iii) holds. Similarly, (iii) ⇒ (ii).

(iii) ⇒ (iv): For any B ∈ S, let B0 and B̄ denote, respectively, the interior and the closure of B. That is, B0 = {y : B(y, ε) ⊂ B for some ε > 0} and B̄ = {y : for some {xn}n≥1 ⊂ B, lim_{n→∞} xn = y}. Then, for any n ≥ 1,

µn(B0) ≤ µn(B) ≤ µn(B̄)

and by (ii) and (iii),

µ(B0) ≤ lim inf_{n→∞} µn(B) ≤ lim sup_{n→∞} µn(B) ≤ µ(B̄).

But ∂B = B̄ \ B0 and so µ(∂B) = 0 implies µ(B0) = µ(B̄). Thus, lim_{n→∞} µn(B) = µ(B).

(iv) ⇒ (i): This will be proved for the case where S is the real line. For the general Polish case, see Billingsley (1968). Let F(x) ≡ µ((−∞, x]) and Fn(x) ≡ µn((−∞, x]), x ∈ R, n ≥ 1. Let x be a continuity point of F. Then µ({x}) = 0. Since ∂B = {x} for B = (−∞, x], (iv) yields

Fn(x) = µn((−∞, x]) → µ((−∞, x]) = F(x).


Thus, µn −→d µ. By Theorem 9.2.3, (i) holds and hence (v) holds.

(v) ⇒ (i): Note that in the proof of Theorem 9.2.2, the approximating functions f1 and f2 were both uniformly continuous. Hence, the assertion follows from Theorem 9.2.2 and Remark 9.2.1. This completes the proof of Theorem 9.3.1.

The following example shows that the inequality can be strict in (ii) and (iii) of the above theorem.

Example 9.3.3: Let X be a random variable. Set Xn = X + 1/n, Yn = X − 1/n, n ≥ 1. Since Xn and Yn both converge to X w.p. 1, the distributions of Xn and Yn converge to that of X.

Now suppose that there is a value x0 such that P(X = x0) > 0. Then,

µn((−∞, x0)) ≡ P(Xn < x0) = P(X < x0 − 1/n) → P(X < x0) = µ((−∞, x0)),

µn((−∞, x0]) = P(Xn ≤ x0) = P(X ≤ x0 − 1/n) → P(X < x0) < µ((−∞, x0]),

and

νn((−∞, x0)) ≡ P(Yn < x0) = P(X < x0 + 1/n) → P(X ≤ x0) > P(X < x0) = µ((−∞, x0)).

Note that µn and νn both converge in distribution to µ. However, for the closed set (−∞, x0],

lim sup_{n→∞} µn((−∞, x0]) < µ((−∞, x0]),

and for the open set (−∞, x0),

lim inf_{n→∞} νn((−∞, x0)) > µ((−∞, x0)).

Remark 9.3.1: Convergent sequences of probability distributions arise in a natural way in parametric families in mathematical statistics. For example, let µ(·; θ) denote the normal distribution with mean θ and variance 1. Then, θn → θ ⇒ µn(·) ≡ N(θn, 1) −→d N(θ, 1) ≡ µ(·). Similarly, let θ = (λ, Σ), where λ ∈ Rk and Σ is a k × k positive definite matrix. Let µ(·; θ) be the k-variate normal distribution with mean λ and variance-covariance matrix Σ. Then, µ(·; θ) is continuous in θ in the sense that if θn → θ in the Euclidean metric, then µ(·; θn) −→d µ(·; θ). Most parametric families in mathematical statistics possess this continuity property.

Definition 9.3.4: Let {µn}n≥1 be a sequence of probability measures on (S, S), where S is a Polish space and S is the Borel σ-algebra on S. Then {µn}n≥1 is called tight if for any ε > 0, there exists a compact set K such that

sup_{n≥1} µn(Kc) < ε.    (3.4)

A sequence of S-valued random variables {Xn}n≥1 is called tight or stochastically bounded if the sequence {µXn}n≥1 is tight, where µXn is the probability distribution of Xn on (S, S).

If S = Rk, k ∈ N, and {Xn}n≥1 is a sequence of k-dimensional random vectors, then, by Definition 9.3.4, {Xn}n≥1 is tight if and only if for every ε > 0, there exists M ∈ (0, ∞) such that

sup_{n≥1} P(‖Xn‖ > M) < ε,    (3.5)

where ‖ · ‖ denotes the usual Euclidean norm on Rk. Furthermore, if Xn = (Xn1, . . . , Xnk), n ≥ 1, then the tightness of {Xn}n≥1 is equivalent to the tightness of the k sequences of random variables {Xnj}n≥1, j = 1, . . . , k (Problem 9.9).
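
A minimal numerical illustration of criterion (3.5) (added here, not in the original; the laws and the cutoff M are arbitrary choices) contrasts a tight sequence of random vectors with one whose mean drifts off to infinity:

import numpy as np

rng = np.random.default_rng(1)

def sup_tail(sample_fn, M, ns, reps=20000):
    # crude Monte Carlo estimate of sup_n P(||X_n|| > M) over the listed n
    return max(np.mean(np.linalg.norm(sample_fn(n, reps), axis=1) > M) for n in ns)

tight     = lambda n, reps: rng.standard_normal((reps, 2))        # same N(0, I_2) law for all n
not_tight = lambda n, reps: rng.standard_normal((reps, 2)) + n    # mean drifts off to infinity

ns = [1, 10, 100]
print(sup_tail(tight, 10.0, ns))      # essentially 0: one M works for every n
print(sup_tail(not_tight, 10.0, ns))  # close to 1: no fixed M can work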

An analog of Theorem 9.2.4 holds for probability measures on (S, S) when S is Polish.

Theorem 9.3.3: (Prohorov-Varadarajan theorem). Let {µn}n≥1 be a sequence of probability measures on (S, S), where S is a Polish space and S is the Borel σ-algebra on S. Then {µn}n≥1 is tight iff given any subsequence {µni}i≥1 ⊂ {µn}n≥1, there exist a further subsequence {µmi}i≥1 of {µni}i≥1 and a probability measure µ on (S, S) such that

µmi −→d µ as i → ∞.    (3.6)

For a proof of this result, see Section 1.6 of Billingsley (1968). This result is useful for proving weak convergence in function spaces (e.g., see Chapter 11, where a functional central limit theorem is stated).

9.4 Skorohod's theorem and the continuous mapping theorem

If {Xn}n≥1 is a sequence of random variables that converges to a random variable X in probability, then Xn does converge in distribution to X (cf. Proposition 9.1.1). Here is another proof of this fact using Theorem 9.2.3. Let f : R → R be bounded and continuous. Then Xn → X in probability implies that f(Xn) → f(X) in probability (Problem 9.13), and by the BCT,

∫ f dµn = Ef(Xn) → Ef(X) = ∫ f dµ,

where µn(·) = P(Xn ∈ ·), n ≥ 1, and µ(·) = P(X ∈ ·). Hence, µn −→d µ. In particular, it follows that if Xn → X w.p. 1, then Xn −→d X. Skorohod's theorem is a sort of converse to this. If µn −→d µ, then there exist random variables Xn, n ≥ 1, and X such that Xn has distribution µn, n ≥ 1, X has distribution µ, and Xn → X w.p. 1.

Theorem 9.4.1: (Skorohod's theorem). Let {µn}n≥1, µ be probability measures on (R, B(R)) such that µn −→d µ. Let

Xn(ω) ≡ sup{t : µn((−∞, t]) < ω},
X(ω) ≡ sup{t : µ((−∞, t]) < ω}

for 0 < ω < 1. Then Xn and X are random variables on ((0, 1), B((0, 1)), m), where m is the Lebesgue measure. Furthermore, Xn has distribution µn, n ≥ 1, X has distribution µ, and Xn → X w.p. 1.

Proof: For any cdf F(·), let F−1(u) ≡ sup{t : F(t) < u}. Then for any u ∈ (0, 1) and t ∈ R, it can be verified that F−1(u) ≤ t ⇔ F(t) ≥ u, and hence, if U is a Uniform (0,1) random variable (Problem 9.11),

P(F−1(U) ≤ t) = P(U ≤ F(t)) = F(t),

implying that F−1(U) has cdf F(·).

This shows that Xn, n ≥ 1, and X have the asserted distributions. It remains to show that

Xn(ω) → X(ω) w.p. 1.

Fix ω ∈ (0, 1) and let y < X(ω) be such that µ({y}) = 0. Now y < X(ω) ⇒ µ((−∞, y]) < ω. Since µn −→d µ and µ({y}) = 0, µn((−∞, y]) → µ((−∞, y]) and so µn((−∞, y]) < ω for large n. This implies that Xn(ω) ≥ y for large n and hence lim inf_{n→∞} Xn(ω) ≥ y. Since this is true for all y < X(ω) with µ({y}) = 0, and since the set of all such y's is dense in R, it follows that

lim inf_{n→∞} Xn(ω) ≥ X(ω) for all ω in (0, 1).

Next fix ε > 0 and let y > X(ω + ε) be such that µ({y}) = 0. Then µ((−∞, y]) ≥ ω + ε. Since µ({y}) = 0, µn((−∞, y]) → µ((−∞, y]). Thus, for large n, µn((−∞, y]) ≥ ω. This implies that Xn(ω) ≤ y for large n and hence that lim sup_{n→∞} Xn(ω) ≤ y. Since this is true for all y > X(ω + ε) with µ({y}) = 0, it follows that

lim sup_{n→∞} Xn(ω) ≤ X(ω + ε) for every ε > 0

and hence that

lim sup_{n→∞} Xn(ω) ≤ X(ω+).

Thus it has been shown that for all 0 < ω < 1,

X(ω) ≤ lim inf_{n→∞} Xn(ω) ≤ lim sup_{n→∞} Xn(ω) ≤ X(ω+).

Since X(ω) is a nondecreasing function on (0, 1), it has at most a countable number of discontinuities and so

lim_{n→∞} Xn(ω) = X(ω) w.p. 1.

An immediate consequence of the above theorem is the continuity of convergence in distribution under continuous transformations.

Theorem 9.4.2: (The continuous mapping theorem). Let {Xn}n≥1, X be random variables such that Xn −→d X. Let f : R → R be Borel measurable such that P(X ∈ Df) = 0, where Df is the set of discontinuities of f. Then f(Xn) −→d f(X). In particular, this holds if f : R → R is continuous.

Remark 9.4.1: It can be shown that for any f : R → R, the set Df = {x : f is discontinuous at x} ∈ B(R) (Problem 9.12). Thus, {X ∈ Df} ∈ F, and P(X ∈ Df) is well defined.

Proof: By Skorohod's theorem, there exist random variables {X̃n}n≥1, X̃ defined on the Lebesgue space (Ω = (0, 1), B((0, 1)), m = Lebesgue measure) such that X̃n =d Xn for n ≥ 1, X̃ =d X, and

X̃n → X̃ w.p. 1.

Let A = {ω : X̃n(ω) → X̃(ω)} and B = {ω : X̃(ω) ∉ Df}. Then P(A) = 1 = P(B) and so, for ω ∈ A ∩ B,

f(X̃n(ω)) → f(X̃(ω)).

Thus, f(X̃n) → f(X̃) w.p. 1 and hence f(Xn) =d f(X̃n) −→d f(X̃) =d f(X).

Another easy consequence of Skorohod's theorem is the Helly-Bray Theorem 9.2.3. Since X̃n → X̃ w.p. 1 and f is a bounded continuous function, f(X̃n) → f(X̃) w.p. 1 and so, by the bounded convergence theorem,

Ef(X̃n) → Ef(X̃).

Since X̃n =d Xn for n ≥ 1 and X̃ =d X, this is the same as saying that

Ef(Xn) → Ef(X).

That is, ∫ f dµn → ∫ f dµ, where µn(·) = P(Xn ∈ ·), n ≥ 1, and µ(·) = P(X ∈ ·).

Remark 9.4.2: Skorohod's theorem is valid for any Polish space. Suppose that S is a Polish space and {µn}n≥1 and µ are probability measures on (S, S), where S is the Borel σ-algebra on S, such that µn −→d µ. Then there exist random variables Xn and X defined on the Lebesgue space ((0, 1), B((0, 1)), m = the Lebesgue measure) such that for all n ≥ 1, Xn has distribution µn, X has distribution µ, and Xn → X w.p. 1. For a proof, see Billingsley (1968).

9.5 The method of moments and the moment problem

9.5.1 Convergence of moments

Let {Xn}n≥1 and X be random variables such that Xn converges to X in distribution. Suppose for some k > 0, E|Xn|^k < ∞ for each n ≥ 1. A natural question is: When does this imply E|X|^k < ∞ and lim_{n→∞} E|Xn|^k = E|X|^k?

By Skorohod's theorem, one can assume w.l.o.g. that Xn → X w.p. 1. Then the results from Section 2.5 yield the following.

Theorem 9.5.1: Let {Xn}n≥1 and X be a collection of random variables such that Xn −→d X. Then, for each 0 < k < ∞, the following are equivalent:

(i) E|Xn|^k < ∞ for each n ≥ 1, E|X|^k < ∞, and E|Xn|^k → E|X|^k.

(ii) {|Xn|^k}n≥1 is uniformly integrable, i.e., for every ε > 0, there exists an Mε ∈ (0, ∞) such that

sup_{n≥1} E(|Xn|^k I(|Xn| > Mε)) < ε.

Remark 9.5.1: Recall that a sufficient condition for the uniform integrability of {|Xn|^k}n≥1 is that

sup_{n≥1} E|Xn|^r < ∞ for some r ∈ (k, ∞).

Example 9.5.1: Let Xn have the distribution P(Xn = 0) = 1 − 1/n, P(Xn = n) = 1/n for n = 1, 2, . . .. Then Xn −→d 0 but EXn = 1 does not go to 0. Note that {Xn}n≥1 is not uniformly integrable here.


Remark 9.5.2: In Theorem 9.5.1, under hypothesis (ii), it follows that

E|Xn|^r → E|X|^r for all real numbers r ∈ (0, k)

and

EXn^p → EX^p for all positive integers p, 0 < p ≤ k.

9.5.2 The method of moments

Suppose that {Xn}n≥1 are random variables such that lim_{n→∞} EXn^k = mk < ∞ exists for all integers k = 0, 1, 2, . . .. Does there exist a random variable X such that Xn −→d X? The answer is 'yes' provided that the moments {mk}k≥1 determine the distribution of the random variable X uniquely.

Theorem 9.5.2: (Frechet-Shohat theorem). Let {Xn}n≥1 be a sequence of random variables such that for each k ∈ N, lim_{n→∞} EXn^k ≡ mk exists and is finite. If the sequence {mk}k≥1 uniquely determines the distribution of a random variable X, then Xn −→d X.

Proof: Suppose that for some subsequence {nj}j≥1, the probability distributions {µnj}j≥1 of {Xnj}j≥1 converge vaguely to some µ. Since {EXnj²}j≥1 is a bounded sequence, {µnj}j≥1 is tight. Hence µ must be a probability distribution and, by Theorem 9.5.1, the moments of µ must coincide with {mk}k≥1. Since the sequence {mk}k≥1 determines the distribution uniquely, µ is unique and is the unique vague limit point of {µn}n≥1, and by Theorem 9.2.6, µn −→d µ. So if X is a random variable with distribution µ, then Xn −→d X.

The above "method of moments" used to be a tool for proving convergence in distribution, e.g., for proving asymptotic normality of the Binomial (n, p) distribution. Since it requires the existence of all moments, this method is too restrictive and is of limited use. However, the question of when the moments determine a distribution is an interesting one and is discussed next.

9.5.3 The moment problem

Suppose {mk}k≥1 is a sequence of real numbers such that there is at least one probability measure µ on (R, B(R)) such that for all k ∈ N,

mk = ∫ x^k µ(dx).

Does the sequence {mk}k≥1 determine µ uniquely? This is a part of the Hamburger moment problem, which includes seeking conditions under


which a given sequence {mk}k≥1 is the moment sequence of a probability distribution.

The answer to the uniqueness question posed above is 'no,' as the following example shows.

Example 9.5.2: Let Y be a standard normal random variable and let X = exp(Y). Then X is said to have the log-normal distribution (which is a misnomer, as a more appropriate name would be something like expo-normal). Then X has the probability density function

f(x) = (1/√(2π)) (1/x) exp(−[log x]²/2) for x > 0, and f(x) = 0 otherwise.    (5.1)

Consider now the family of functions

fα(x) = f(x)(1 + α sin(2π log x))

with |α| ≤ 1. It is clear that fα(x) ≥ 0. Further, it is not difficult to check that for any α ∈ [−1, 1],

∫ x^r fα(x) dx = ∫ x^r f(x) dx

for all r = 0, 1, 2, . . .. Thus, the sequence of moments mk ≡ ∫ x^k f(x) dx does not determine the log-normal distribution (5.1).
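
A quick numerical check of Example 9.5.2 (an added sketch using scipy's quadrature; the substitution y = log x and the value α = 0.5 are choices made here for convenience) shows the moments of f and fα agreeing up to integration error:

import numpy as np
from scipy.integrate import quad

phi = lambda y: np.exp(-y ** 2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
# r-th moments of f and f_alpha from (5.1), written in the variable y = log x
m      = lambda r:    quad(lambda y: np.exp(r * y) * phi(y), -np.inf, np.inf)[0]
m_pert = lambda r, a: quad(lambda y: np.exp(r * y) * phi(y) * (1 + a * np.sin(2 * np.pi * y)),
                           -np.inf, np.inf)[0]

for r in range(4):
    print(r, m(r), m_pert(r, 0.5))   # the two moment sequences agree up to quadrature error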

A sufficient condition for uniqueness is Carleman's condition:

∑_{k=1}^{∞} m_{2k}^{−1/(2k)} = ∞.    (5.2)

For a proof, see Feller (1966) or Shohat and Tamarkin (1943).

Remark 9.5.3: A special case of the above is when

lim sup_{k→∞} m_{2k}^{1/(2k)} = r ∈ [0, ∞).    (5.3)

In particular, if {mk}k≥1 is a moment sequence, then within the class of probability distributions µ that have bounded support and have {mk}k≥1 as their moment sequence, µ is uniquely determined. This is so since if M ≡ sup{x : µ([−x, x]) < 1}, then (Problem 9.27)

m_{2k}^{1/(2k)} → M as k → ∞.    (5.4)

More generally, if µ is a probability distribution on R such that ∫ e^{tx} dµ(x) < ∞ for all |t| < δ for some δ > 0, then all its moments are finite and (5.2) holds, and hence µ is uniquely determined by its moments (Problem 9.28). In particular, the normal and Gamma distributions are determined by their moments.

Remark 9.5.4: If {mk}k≥1 is the moment sequence of a distribution µ concentrated on [0, ∞), the problem of determining µ uniquely is known as the Stieltjes moment problem. If X is a random variable with distribution µ, let Y = δ√X, where δ is independent of X and takes the two values −1, +1 with equal probability. Then Y has a symmetric distribution and for all k ≥ 1,

E|Y|^{2k} = E|X|^k.

The distribution of Y is uniquely determined (and hence that of √X and hence that of X) if

lim sup_{k→∞} (EY^{2k})^{1/(2k)} / (2k) < ∞,

i.e.,

lim sup_{k→∞} m_k^{1/(2k)} / (2k) < ∞.

9.6 Problems

9.1 If Xn −→d X0 and P (X0 = c) = 1 for some c ∈ R, then Xn −→p c.

9.2 If a cdf is continuous on R, then it is uniformly continuous on R.

(Hint: Use the facts that (i) given any ε > 0, there exists M ∈ (0,∞)such that F (−x)+1−F (x) < ε for all x > M , and (ii) F is uniformlycontinuous on [−M,M ].)

9.3 Prove parts (ii) and (iii) of Theorem 9.1.6.

9.4 Let {µn}n≥1, µ be probability measures on (R, B(R)) such that µn −→v µ. Show that µn −→d µ.

9.5 Show that the function F(·), defined in (2.5), is nondecreasing and right continuous, and that the function F̃(x) ≡ inf{F(r) : r ≥ x, r ∈ D} is nondecreasing and left continuous.

9.6 Give another proof of the 'if' part of Theorem 9.2.6 by using Theorem 9.2.1 and showing that for any f : R → R continuous and bounded and any subsequence {ni}i≥1, there exists a further subsequence {mj}j≥1 such that amj ≡ ∫ f dµmj → a ≡ ∫ f dµ and hence an ≡ ∫ f dµn → a.

9.7 If {Xn}n≥1 is stochastically bounded and Yn −→p 0, then show that XnYn −→p 0.


9.8 (a) Let Xn ∼ N(an, bn) for n ≥ 0, where bn > 0 for n ≥ 1, b0 ∈ [0, ∞), and an ∈ R for all n ≥ 0.

(i) Show that if an → a0, bn → b0 as n → ∞, then Xn −→d X0.
(ii) Show that if Xn −→d X0 as n → ∞, then an → a0 and bn → b0.

(Hint: First show that {bn}n≥1 is bounded, then that {an}n≥1 is bounded, and finally that a0 and b0 are the only limit points of {an}n≥1 and {bn}n≥1, respectively.)

(b) For n ≥ 1, let Xn ∼ N(an, Σn), where an ∈ Rk and Σn is a k × k positive definite matrix, k ∈ N. Then {Xn}n≥1 is stochastically bounded if and only if {‖an‖}n≥1 and {‖Σn‖}n≥1 are bounded.

9.9 Let {Xjn}n≥1, j = 1, . . . , k, k ∈ N, be sequences of random variables. Let Xn = (X1n, . . . , Xkn), n ≥ 1. Show that the sequence of random vectors {Xn}n≥1 is tight in Rk iff for each 1 ≤ j ≤ k, the sequence of random variables {Xjn}n≥1 is tight in R.

9.10 Let (S, d) be a metric space.

(a) For any set A ⊂ S, let

d(x, A) ≡ inf{d(x, y) : y ∈ A}.

Show that for each A, d(·, A) is continuous on S.

(b) Let fn(·) be as in (3.3). Show that fn(·) is continuous on S and fn(·) ↑ IG(·).

(Hint: Note that d(x, Gc) + d(x, Gn) > 0 for all x in S.)

9.11 For any cdf F, let F−1(u) ≡ sup{t : F(t) < u}, 0 < u < 1. Show that for any 0 < u0 < 1 and t0 in R,

F−1(u0) ≤ t0 ⇔ F(t0) ≥ u0.

(Hint: For ⇒, use the right continuity of F, and for ⇐, use the definition of sup.)

9.12 For a function f : Rk → R (k ∈ N), define Df = {x ∈ Rk : f is discontinuous at x}. Show that Df ∈ B(Rk).

9.13 If Xn −→p X and f : R → R is continuous, then f(Xn) −→p f(X).

9.14 (The Delta method). Let {Xn}n≥1 be a sequence of random variables and {an}n≥1 ⊂ (0, ∞) be a sequence of constants such that an → ∞ as n → ∞ and

an(Xn − θ) −→d Z


for some random variable Z and some θ ∈ R. Let H : R → R be a function that is differentiable at θ with derivative c. Show that

an(H(Xn) − H(θ)) −→d cZ.

(Hint: By Taylor's expansion, for any x ∈ R,

H(x) = H(θ) + c(x − θ) + R(x)(x − θ),

where R(x) → 0 as x → θ. Now use Problem 9.7 and Slutsky's theorem.)

9.15 Let X be a random variable with P(X = c) > 0 for some c ∈ R. Give examples of two sequences {Xn}n≥1 and {Yn}n≥1 satisfying Xn −→d X and Yn −→d X such that

lim_{n→∞} P(Xn ≤ c) = P(X ≤ c)

but

lim_{n→∞} P(Yn ≤ c) ≠ P(X ≤ c).

(Hint: Take Xn =d X, n ≥ 1, and Yn =d X + 1/n, n ≥ 1, say.)

9.16 Let {µn}n≥1, µ be probability measures on (R, B(R)) such that

∫ f dµn → ∫ f dµ for all f ∈ F

for some collection F of functions from R to R specified below. Does µn −→d µ if

(a) F = {f | f : R → R, f is bounded and continuously differentiable on R with a bounded derivative}?

(b) F = {f | f : R → R, f is bounded and infinitely differentiable on R}?

(c) F ≡ {f | f is a polynomial with real coefficients}, and ∫ |x|^k µ(dx) + ∫ |x|^k µn(dx) < ∞ for all n, k ∈ N?

9.17 For any two cdfs F, G on R, define

dL(F, G) = inf{ε > 0 : G(x − ε) − ε < F(x) < G(x + ε) + ε for all x ∈ R}.    (6.1)

Verify that dL defines a metric on the collection of all probability distributions on (R, B(R)). The metric dL is called the Levy metric.

9.18 Let {µn}n≥1, µ be probability measures on (R, B(R)), with the corresponding cdfs {Fn}n≥1 and F. Show that µn −→d µ iff

dL(Fn, F) → 0 as n → ∞.


9.19 (a) Show that for any two cdfs F, G on R,

dL(F, G) ≤ dK(F, G),    (6.2)

where

dK(F, G) = sup_{x∈R} |F(x) − G(x)|    (6.3)

(dK is called the Kolmogorov distance or metric between F and G).

(b) Give examples where (i) equality holds in (6.2), and (ii) strict inequality holds in (6.2).

9.20 Let {µn}n≥1, µ be probability measures on (R, B(R)) such that µn −→d µ. Let {fa : a ∈ R} be a collection of bounded functions from R to R such that µ(Dfa) = 0 for all a ∈ R and |fa(x) − fb(x)| ≤ h(x)|b − a| for all a, b ∈ R, for some h : R → (0, ∞) with µ(Dh) = 0 and ∫ |h| dµ < ∞. Show that

sup_{a∈R} |∫ fa dµn − ∫ fa dµ| → 0 as n → ∞.

9.21 Let {Xn}n≥1, X be k-dimensional random vectors such that Xn −→d X. Let {An}n≥1 be a sequence of r × k matrices of real numbers and {bn}n≥1 ⊂ Rr, r ∈ N. Define Yn = AnXn + bn and Zn = AnXnXn^T, where Xn^T denotes the transpose of Xn. Suppose that An → A and bn → b. Show that

(a) Yn −→d Y, where Y =d AX + b,
(b) Zn −→d Z, where Z =d AXX^T.

(Note: Here convergence in distribution of a sequence of r × k matrix-valued random variables may be interpreted by considering the corresponding rk-dimensional random vectors obtained by concatenating the rows of the r × k matrix side-by-side and using the definition of convergence in distribution for random vectors.)

9.22 Let µn, µ be probability measures on a countable set D ≡ {aj}j≥1 ⊂ R. Let pnj = µn({aj}), j ≥ 1, n ≥ 1, and pj = µ({aj}). Show that, as n → ∞, µn −→d µ iff for all j, pnj → pj, iff ∑_j |pnj − pj| → 0.

9.23 Let Xn ∼ Binomial(n, pn), n ≥ 1. Suppose npn → λ, 0 < λ < ∞. Show that Xn −→d X, where X ∼ Poisson(λ).

9.24 (a) Let Xn ∼ Geo(pn), i.e., P(Xn = r) = qn^{r−1} pn, r ≥ 1, where 0 < pn < 1 and qn = 1 − pn. Show that, as n → ∞, if pn → 0 then

pnXn −→d X,    (6.4)

where X ∼ Exponential (1).


(b) Fix a positive integer k. Let, for n ≥ 1,

pnr = (r−1 choose k−1) pn^k qn^{r−k}, r ≥ k,

where 0 < pn < 1, qn = 1 − pn.

(i) Verify that for each n, {pnr}r≥k is a probability distribution, i.e., ∑_{r=k}^{∞} pnr = 1.
(ii) Let Yn be a random variable with distribution P(Yn = r) = pnr, r ≥ k. Show that, as n → ∞, if pn → 0 then {pnYn}n≥1 converges in distribution and identify the limit.

9.25 Let {Fn}n≥1 and {Gn}n≥1 be two sequences of cdfs on R such that, as n → ∞, Fn −→d F, Gn −→d G, where F and G are cdfs on R.

(a) Show that for each n ≥ 1,

Hn(x) ≡ (Fn ∗ Gn)(x) ≡ ∫_R Fn(x − y) dGn(y)

is a cdf on R.

(b) Show that, as n → ∞, Hn −→d H, where H = F ∗ G, by direct calculation and by Skorohod's theorem (i.e., Theorem 9.4.1) and Problem 7.14.

9.26 Let Yn have the discrete uniform distribution on the integers 1, 2, . . . , n. Let Xn ≡ Yn/n and let X be a Uniform (0,1) random variable. Show that Xn −→d X using three different methods as follows:

(a) the Helly-Bray theorem,
(b) the method of moments,
(c) using the cdfs.

9.27 Establish (5.4) in Remark 9.5.3.

(Hint: Show that for any ε > 0, m_{2k}^{1/(2k)} ≥ (M − ε) (µ({x : |x| > M − ε}))^{1/(2k)}.)

9.28 Let µ be a probability distribution on R such that φ(t0) ≡ ∫ e^{t0|x|} dµ(x) < ∞ for some t0 > 0. Show that Carleman's condition (5.2) is satisfied.

(Hint: Show that by Cramer's inequality (Corollary 3.1.5),

m_{2k} ≤ 2k φ(t0) ∫_0^∞ x^{2k−1} e^{−t0 x} dx = φ(t0) 2k t0^{−2k} (2k − 1)!,

and then use Stirling's approximation: 'n! ∼ √(2π) n^{n+1/2} e^{−n} as n → ∞' (Feller (1968)).)


9.29 (Continuity theorem for mgfs). Let {Xn}n≥1 and X be random variables such that for some δ > 0, the mgfs MXn(t) ≡ E(e^{tXn}) and MX(t) ≡ E(e^{tX}) are finite for all |t| < δ. Further, let MXn(t) → MX(t) for all |t| < δ. Show that Xn −→d X.

(Hint: Show first that {Xn}n≥1 is tight and use the fact that, by Remark 9.5.3, the distribution of X is determined by MX(·).)

9.30 Let Xn ∼ Binomial(n, pn). Suppose npn → ∞. Let Yn = (Xn − npn)/√(npn(1 − pn)), n ≥ 1. Show that Yn −→d Y, where Y ∼ N(0, 1).

(Hint: Use Problem 9.29.)

9.31 Use the continuity theorem for mgfs to establish (6.4) and the convergence in distribution of {pnYn}n≥1 in Problem 9.24 (b)(ii).

9.32 Let {Xj, Vj : j ≥ 1} be a collection of random variables on some probability space (Ω, F, P) such that P(Vj ∈ N) = 1 for all j, Vj → ∞ w.p. 1, and Xj −→d X. Suppose that for each j, the random variable Vj is independent of the sequence {Xn}n≥1. Show that XVj −→d X.

(Hint: Verify that for any bounded continuous function h : R → R,

|Eh(XVj) − Eh(X)| ≤ 2‖h‖ P(Vj ≤ N) + ∆N P(Vj > N),

where ∆N = sup_{k>N} |Eh(Xk) − Eh(X)| and ‖h‖ = sup{|h(x)| : x ∈ R}.)

9.33 Let Xn −→d X and xn → x as n → ∞. If P(X = x) = 0, then show that P(Xn ≤ xn) → P(X ≤ x).

9.34 (Weyl's equidistribution property). Let 0 < α < 1 be an irrational number. Let µn(·) be the measure defined by µn(A) ≡ (1/n) ∑_{j=0}^{n−1} IA(jα mod 1), A ∈ B([0, 1]). Show that µn −→d Uniform (0,1).

(Hint: Verify that ∫ f dµn → ∫_0^1 f(x) dx for all f of the form f(x) = e^{ι2πkx}, k ∈ Z, and then approximate a bounded continuous function f by trigonometric polynomials (cf. Section 5.6).)
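
For readers who want a quick empirical look at Problem 9.34 (this sketch is not part of the exercise; α = √2 − 1 is an arbitrary irrational choice), the histogram of jα mod 1 flattens out as n grows:

import numpy as np

alpha = np.sqrt(2) - 1                      # an (arbitrary) irrational number in (0, 1)
for n in [10, 100, 10000]:
    pts = (np.arange(n) * alpha) % 1.0      # j * alpha mod 1, j = 0, ..., n-1
    counts, _ = np.histogram(pts, bins=10, range=(0.0, 1.0))
    print(n, counts / n)                    # each cell's frequency approaches 1/10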

9.35 (a) Let {Xi}i≥1 be iid random variables with the Uniform (0,1) distribution. Let Mn = max_{1≤i≤n} Xi. Show that n(1 − Mn) −→d Exponential (1).

(b) Let {Xi}i≥1 be iid random variables such that λ ≡ sup{x : P(X1 ≤ x) < 1} < ∞, P(X1 = λ) = 0, and P(λ − x < X1 < λ) ∼ c x^α L(x) as x ↓ 0, where α > 0, c > 0, and L(·) is slowly varying at 0, i.e., lim_{x↓0} L(cx)/L(x) = 1 for all 0 < c < ∞. Let Mn = max_{1≤i≤n} Xi. Show that Yn ≡ n^{1/α}(λ − Mn) converges in distribution as n → ∞ and identify the limit.


9.36 Let {Xi}i≥1 be iid positive random variables such that P(X1 < x) ∼ c x^α L(x) as x ↓ 0, where c, α, and L(·) are as in Problem 9.35. Let X1n ≡ min_{1≤i≤n} Xi. Find {an}n≥1 ⊂ R+ such that Zn ≡ an X1n converges in distribution to a nondegenerate limit and identify the distribution. Specialize this to the cases where

(a) X1 has a pdf fX(·) such that lim_{x↓0} fX(x) = fX(0+) exists and is positive,

(b) X1 has a Beta (a, b) distribution.


10 Characteristic Functions

10.1 Definition and examples

Characteristic functions play an important role in studying (asymptotic) distributional properties of random variables, particularly for sums of independent random variables. The main uses of characteristic functions are (1) to characterize the probability distribution of a given random variable, and (2) to establish convergence in distribution of a sequence of random variables and to identify the limit distribution.

Definition 10.1.1:

(i) The characteristic function of a random variable X is defined as

φX(t) = E exp(ιtX), t ∈ R,    (1.1)

where ι = √−1.

(ii) The characteristic function of a probability measure µ on (R, B(R)) is defined as

µ̂(t) = ∫ exp(ιtx) µ(dx), t ∈ R.    (1.2)

(iii) Let F be a cdf on R. Then the characteristic function of F is defined as µ̂F(·), where µF is the Lebesgue-Stieltjes measure corresponding to F.
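
Definition (1.1) translates directly into a short computation. The sketch below (added for illustration; the choice X ∼ N(0,1) and the sample size are arbitrary) compares a Monte Carlo estimate of φX(t) = E exp(ιtX) with the known closed form e^{−t²/2} for the standard normal law.

import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(200000)
for t in [0.0, 0.5, 1.0, 2.0]:
    phi_hat = np.mean(np.exp(1j * t * x))     # Monte Carlo estimate of E exp(i t X)
    print(t, phi_hat, np.exp(-t ** 2 / 2))    # exact value exp(-t^2/2) for X ~ N(0,1)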


Note that the integrands in (1.1) and (1.2) are complex valued. Here and elsewhere, for any f1, f2 ∈ L1(Ω, F, µ), the integral of (f1 + ιf2) with respect to µ is defined as

∫ (f1 + ιf2) dµ = ∫ f1 dµ + ι ∫ f2 dµ.    (1.3)

Thus, the characteristic function of X is given by φX(t) = (E cos tX) + ι(E sin tX), t ∈ R. Since the functions cos tx and sin tx are bounded for every t ∈ R, φX(t) is well defined for all t ∈ R. Furthermore, φX(0) = 1 and for any t ∈ R,

|φX(t)| = {(E cos tX)² + (E sin tX)²}^{1/2} ≤ {E(cos tX)² + E(sin tX)²}^{1/2} ≤ 1.    (1.4)

If equality holds in (1.4), i.e., if |φX(t0)| = 1 for some t0 ≠ 0, then the random variable is necessarily discrete, as shown by the following proposition.

Proposition 10.1.1: Let X be a random variable with characteristic function φX(·). Then the following are equivalent:

(i) |φX(t0)| = 1 for some t0 ≠ 0.

(ii) There exist a ∈ R and h ≠ 0 such that

P(X ∈ {a + jh : j ∈ Z}) = 1.    (1.5)

Proof: Suppose that (i) holds. Since |φX(t0)| = 1, there exists a0 ∈ R such that

φX(t0) = e^{ιa0}, i.e., e^{−ιa0} φX(t0) = 1.

Let a = a0/t0. Since the characteristic function of (X − a) is given by e^{−ιat} φX(t), it follows that E exp(ιt0(X − a)) = 1. Equating the real parts, one gets

E cos t0(X − a) = 1.    (1.6)

Since |cos θ| ≤ 1 for all θ and cos θ = 1 if and only if θ = 2πn for some n ∈ Z, (1.6) implies that

P(t0(X − a) ∈ {2πj : j ∈ Z}) = 1.    (1.7)

Therefore, (ii) holds with h = 2π/|t0| and a = a0/t0.

For the converse, note that with pj = P(X = a + jh), j ∈ Z,

φX(t) = ∑_{j∈Z} exp(ιt(a + jh)) pj, t ∈ R,


and hence |φX(2π/h)| = 1.

Definition 10.1.2: A random variable X satisfying (1.5) for some a ∈ R and h > 0 is called a lattice random variable. In this case, the distribution of X is also called lattice or arithmetic. If X is a nondegenerate lattice random variable, then the largest h > 0 for which (1.5) holds is called the span (of the probability distribution or of the characteristic function) of X.

An inspection of the proof of Proposition 10.1.1 shows that for a lattice random variable X with span h > 0, its characteristic function satisfies the relation

|φX(2πj/h)| = 1 for all j ∈ Z.    (1.8)

In particular, this implies that lim sup_{|t|→∞} |φX(t)| = 1. The next result shows that characteristic functions of random variables with absolutely continuous cdfs exhibit a very different limit behavior.

Proposition 10.1.2: Let X be a random variable with cdf F and characteristic function φX. If F is absolutely continuous, then

lim_{|t|→∞} |φX(t)| = 0.    (1.9)

Proof: Since F is absolutely continuous, the probability distribution µX of X has a density, say f, w.r.t. the Lebesgue measure m on R, and

φX(t) = ∫ exp(ιtx) f(x) dx, t ∈ R.

Fix ε ∈ (0, ∞). Since f ∈ L1(R, B(R), m), by Theorem 2.3.14, there exists a step function fε = ∑_{j=1}^{k} cj I_{(aj,bj)} with 1 ≤ k < ∞ and aj, bj, cj ∈ R for j = 1, . . . , k, such that

∫ |f − fε| dm < ε/2.    (1.10)

Next note that for any t ≠ 0,

|∫ exp(ιtx) fε(x) dx| = |∑_{j=1}^{k} cj ∫_{aj}^{bj} exp(ιtx) dx| ≤ ∑_{j=1}^{k} |cj| · 2/|t|.    (1.11)

Hence, by (1.10) and (1.11), it follows that

|φX(t)| = |∫ exp(ιtx) f(x) dx| ≤ ∫ |f − fε| dx + |∫ exp(ιtx) fε(x) dx| < ε/2 + ε/2

for all |t| > tε, where tε = 4 ∑_{j=1}^{k} |cj| / ε. Thus (1.9) holds.

Note that the above proof shows that for any f ∈ L1(m), the Fourier transform

f̂(t) ≡ ∫ e^{ιtx} f(x) dx, t ∈ R,

satisfies lim_{|t|→∞} f̂(t) = 0. This is known as the Riemann-Lebesgue lemma (cf. Proposition 5.7.1).
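
The contrast between Propositions 10.1.1 and 10.1.2 is easy to see numerically. In the added sketch below (the Poisson(2) and standard exponential choices are arbitrary), |φ(t)| returns to 1 along multiples of 2π for a lattice variable but decays for an absolutely continuous one.

import numpy as np

lam = 2.0
phi_poisson = lambda t: np.exp(lam * (np.exp(1j * t) - 1))   # Poisson(2): lattice with span 1
phi_expo    = lambda t: 1.0 / (1.0 - 1j * t)                 # standard exponential: abs. continuous

for t in [np.pi, 2 * np.pi, 10 * np.pi, 20 * np.pi]:
    print(round(t, 2), abs(phi_poisson(t)), abs(phi_expo(t)))
# |phi_poisson(t)| equals 1 whenever t is a multiple of 2*pi; |phi_expo(t)| decays to 0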

Next, some basic results on smoothness properties of the characteristic function are presented.

Proposition 10.1.3: Let X be a random variable with characteristic function φX(·). Then φX(·) is uniformly continuous on R.

Proof: For t, h ∈ R,

|φX(t + h) − φX(t)| = |E{exp(ι(t + h)X) − exp(ιtX)}| = |E exp(ιtX) · (e^{ιhX} − 1)| ≤ E|e^{ιhX} − 1| ≡ E∆(h), say,

where ∆(h) ≡ |exp(ιhX) − 1|. It is easy to check that |∆(h)| ≤ 2 and lim_{h→0} ∆(h) = 0 w.p. 1 (in fact, everywhere). Hence, by the BCT, E∆(h) → 0 as h → 0. Therefore,

lim_{h→0} sup_{t∈R} |φX(t + h) − φX(t)| ≤ lim_{h→0} E∆(h) = 0    (1.12)

and hence, φX(·) is uniformly continuous on R.

Theorem 10.1.4: Let X be a random variable with characteristic function φX(·). If E|X|^r < ∞ for some r ∈ N, then φX(·) is r-times continuously differentiable and

φX^{(r)}(t) = E{(ιX)^r exp(ιtX)}, t ∈ R.    (1.13)

For proving the theorem, the following bound on the function exp(ιx) is useful.

Lemma 10.1.5: For any x ∈ R, r ∈ N,

|exp(ιx) − ∑_{k=0}^{r−1} (ιx)^k / k!| ≤ min{ |x|^r / r!, 2|x|^{r−1} / (r−1)! }.    (1.14)


Proof: Note that for any x ∈ R and for any r ∈ N,

d^r/dx^r [exp(ιx)] = [d^r/dx^r cos x] + ι[d^r/dx^r sin x] = ι^r exp(ιx).    (1.15)

Hence, by (1.15) and Taylor's expansion (applied to the functions sin x and cos x of a real variable x), for any x ∈ R and r ∈ N,

exp(ιx) = ∑_{k=0}^{r−1} (ιx)^k / k! + [(ιx)^r / (r−1)!] ∫_0^1 (1 − u)^{r−1} exp(ιux) du.    (1.16)

Hence, for any x ∈ R and any r ∈ N,

|exp(ιx) − ∑_{k=0}^{r−1} (ιx)^k / k!| ≤ |x|^r / r!.    (1.17)

Also, for r ≥ 2, using (1.17) with r replaced by r − 1, one gets

|exp(ιx) − ∑_{k=0}^{r−1} (ιx)^k / k!| ≤ |exp(ιx) − ∑_{k=0}^{r−2} (ιx)^k / k!| + |x|^{r−1} / (r−1)! ≤ 2|x|^{r−1} / (r−1)!.    (1.18)

Hence, by (1.17) and (1.18), (1.14) holds for all x ∈ R, r ∈ N with r ≥ 2. For r = 1, (1.14) follows from (1.17) and the bound sup_x |exp(ιx) − 1| ≤ 2.

Lemma 10.1.5 gives two upper bounds on the difference between the function exp(ιx) and its (r − 1)th order Taylor expansion around x = 0. For small values of |x|, the first bound (i.e., |x|^r / r!) is more accurate, whereas for large values of |x|, the other bound (i.e., 2|x|^{r−1} / (r−1)!) is more accurate.

Proof of Theorem 10.1.4: Let µ denote the probability distribution of X on (R, B(R)). Suppose that E|X| < ∞. First it will be shown that φX(·) is differentiable with φX^{(1)}(t) = E{ιX exp(ιtX)}, t ∈ R. Fix t ∈ R. For any h ∈ R, h ≠ 0,

h^{−1}[φX(t + h) − φX(t)]
= ∫_R exp(ιtx) [(exp(ιhx) − 1)/h] µ(dx)
= ∫_R exp(ιtx) [(exp(ιhx) − 1)/h − ιx] µ(dx) + ∫_R ιx exp(ιtx) µ(dx)
≡ ∫ ψh(x) µ(dx) + ∫_R ιx exp(ιtx) µ(dx), say.    (1.19)

By Lemma 10.1.5 (with r = 2),

|ψh(x)| ≤ min{ |h| |x|² / 2, 2|x| } for all x ∈ R, h ≠ 0.    (1.20)

Hence, lim_{h→0} ψh(x) = 0 for each x ∈ R. Also, |ψh(x)| ≤ 2|x| and ∫ |x| µ(dx) = E|X| < ∞. Hence, by the DCT,

lim_{h→0} ∫ ψh(x) µ(dx) = 0,

and therefore, from (1.19), it follows that φX(·) is differentiable at t with φX^{(1)}(t) = ∫ ιx exp(ιtx) µ(dx) = E{ιX exp(ιtX)}.

Next suppose that the assertion of the theorem is true for some r ∈ N. To prove it for r + 1, note that for t ∈ R and h ≠ 0,

h^{−1}[φX^{(r)}(t + h) − φX^{(r)}(t)] − E{(ιX)^{r+1} exp(ιtX)} = ∫ (ιx)^r ψh(x) µ(dx),    (1.21)

where ψh(x) is as in (1.19). Now using the bound (1.20) on ψh(x), the DCT, and the condition E|X|^{r+1} < ∞, one can show that the right side of (1.21) goes to zero as h → 0. By induction, this completes the proof of the theorem.

Proposition 10.1.6: Let X and Y be two independent random variables. Then

φ_{X+Y}(t) = φX(t) · φY(t), t ∈ R.    (1.22)

Proof: Follows from (1.3), Proposition 7.1.3, and the independence of X and Y.

For a complex number z = a + ιb, a, b ∈ R, let z̄ = a − ιb denote the complex conjugate of z and let

Re(z) = a and Im(z) = b    (1.23)

respectively denote the real and the imaginary parts of z.

Corollary 10.1.7: Let X be a random variable with characteristic function φX. Then φ̄X, |φX|², and Re(φX) are characteristic functions, where Re(φX)(t) = Re(φX(t)), t ∈ R.

Proof: φ̄X(t) = E exp(−ιtX) = E exp(ιt(−X)), t ∈ R ⇒ φ̄X is the characteristic function of −X. Next, let Y be an independent copy of X. Then, by (1.22), φ_{X−Y}(t) = |φX(t)|², t ∈ R.


Finally, Re(φX)(t) = (1/2)(φX(t) + φ̄X(t)) = ∫ exp(ιtx) µ̃(dx), t ∈ R, where

µ̃(A) = 2^{−1}[P(X ∈ A) + P(−X ∈ A)], A ∈ B(R).

Definition 10.1.3: A function φ : R → C, the set of complex numbers, is said to be nonnegative definite if for any k ∈ N, t1, t2, . . . , tk ∈ R, and α1, α2, . . . , αk ∈ C,

∑_{i=1}^{k} ∑_{j=1}^{k} αi ᾱj φ(ti − tj) ≥ 0.    (1.24)

Proposition 10.1.7: Let φ(·) be the characteristic function of a random variable X. Then φ is nonnegative definite.

Proof: Check that for k, ti, αi as in Definition 10.1.3,

∑_{i=1}^{k} ∑_{j=1}^{k} αi ᾱj φ(ti − tj) = E( |∑_{j=1}^{k} αj e^{ιtjX}|² ).

A converse to the above is known as the Bochner-Khinchine theorem, which states that if φ : R → C is nonnegative definite, continuous, and φ(0) = 1, then φ is the characteristic function of a random variable X. For a proof, see Chung (1974).

Another criterion for a function φ : R → C to be a characteristic function is due to Polya. For a proof, see Chung (1974).

Proposition 10.1.8: (Polya's criterion). Let φ : R → C satisfy φ(0) = 1, φ(t) ≥ 0, φ(t) = φ(−t) for all t ∈ R, and let φ(·) be nonincreasing and convex on [0, ∞). Then φ is a characteristic function.

10.2 Inversion formulas

Let F be a cdf and φ be its characteristic function. In this section, two inversion formulas to get the cdf F from φ are presented. The first one is from Feller (1966), and the second one is more standard.

Unless otherwise mentioned, for the rest of this section, X will be a random variable with cdf F and characteristic function φX, and N a standard normal random variable independent of X.

Lemma 10.2.1: Let g : R → R be a Borel measurable bounded function vanishing outside a bounded set and let ε ∈ (0, ∞). Then

Eg(X + εN) = (1/2π) ∫ ∫ g(x) φX(t) e^{−ιtx} e^{−ε²t²/2} dt dx.    (2.1)


Proof: The integrand on the right is bounded by e^{−ε²t²/2} |g(x)| and so is integrable on R × R with respect to the Lebesgue measure on (R², B(R²)). Further, φX(t) = ∫ e^{ιty} dF(y) and e^{−t²/2} = (1/√(2π)) ∫ e^{ιtx} e^{−x²/2} dx, t ∈ R. By repeated applications of Fubini's theorem and the above two identities, the right side of (2.1) becomes

(1/(√(2π) ε)) ∫ g(x) ∫ ( ∫ (ε/√(2π)) e^{ιt(y−x)} e^{−ε²t²/2} dt ) dF(y) dx

[set s = εt]

= (1/(√(2π) ε)) ∫ g(x) ∫ ( ∫ (1/√(2π)) e^{ιs(y−x)/ε} e^{−s²/2} ds ) dF(y) dx

= ∫ g(x) ( (1/(√(2π) ε)) ∫ e^{−(y−x)²/(2ε²)} dF(y) ) dx.

Since X and N are independent and N has an absolutely continuous distribution w.r.t. the Lebesgue measure, X + εN also has an absolutely continuous distribution with density

f_{X+εN}(x) = (1/(√(2π) ε)) ∫ e^{−(y−x)²/(2ε²)} dF(y), x ∈ R.

Thus, the right side of (2.1) reduces to

∫ g(x) f_{X+εN}(x) dx = Eg(X + εN).

Corollary 10.2.2: Let g : R → R be continuous and let g(x) = 0 for all |x| > K, for some K, 0 < K < ∞. Then

Eg(X) = ∫ g(x) dF(x) = lim_{ε→0+} (1/2π) ∫ ∫ g(x) e^{−ιtx} φX(t) e^{−ε²t²/2} dt dx.    (2.2)

Proof: This follows from Lemma 10.2.1, the fact that X + εN → X w.p. 1 as ε → 0, and the BCT.

Corollary 10.2.3: (Feller's inversion formula). Let a and b, −∞ < a < b < ∞, be two continuity points of F. Then

F(b) − F(a) = lim_{ε→0+} ∫_a^b ( (1/2π) ∫ e^{−ιtx} φX(t) e^{−ε²t²/2} dt ) dx.    (2.3)

Proof: This follows from Lemma 10.2.1 and Theorem 9.4.2, since the function g(x) = 1 for a ≤ x ≤ b and 0 otherwise is continuous except at a and b, which are continuity points of F.


Corollary 10.2.4: If φX(t) is integrable w.r.t. the Lebesgue measure m on R, then F is absolutely continuous with density w.r.t. m given by

f(x) = (1/2π) ∫ e^{−ιtx} φX(t) dt, x ∈ R.    (2.4)

Proof: If φX is integrable, then

(1/2π) ∫ φX(t) e^{−ιtx} e^{−ε²t²/2} dt

is bounded by (2π)^{−1} ∫ |φX(t)| dt for all x ∈ R, and it converges to (2π)^{−1} ∫ e^{−ιtx} φX(t) dt as ε → 0+ for each x ∈ R. Hence, by the BCT and Corollary 10.2.3, for any a, b, −∞ < a < b < ∞, that are continuity points of F,

F(b) − F(a) = ∫_a^b [ (1/2π) ∫ φX(t) e^{−ιtx} dt ] dx.

Since F has at most countably many discontinuity points and F is right continuous, the above relation holds for all −∞ < a < b < ∞.

Remark 10.2.1: The integrability of φX in Corollary 10.2.4 is only a sufficient condition. The standard exponential distribution has characteristic function (1 − ιt)^{−1}, which is not integrable, but the distribution is absolutely continuous.

Corollary 10.2.5: (Uniqueness). The characteristic function φX determines F uniquely.

Proof: Since a cdf F is uniquely determined by its values on the set of its continuity points, this corollary follows from Corollary 10.2.3.

A more standard inversion formula is the following.

Theorem 10.2.6: Let F be a cdf on R and φ(t) ≡ ∫ e^{ιtx} dF(x), t ∈ R, be its characteristic function.

(i) For any a < b, a, b ∈ R, that are continuity points of F,

lim_{T→∞} (1/2π) ∫_{−T}^{T} [(e^{−ιta} − e^{−ιtb}) / (ιt)] φ(t) dt = µF((a, b)),    (2.5)

where µF is the Lebesgue-Stieltjes measure generated by F.

(ii) For any a ∈ R,

µF({a}) = lim_{T→∞} (1/2T) ∫_{−T}^{T} e^{−ιta} φ(t) dt.    (2.6)


A multivariate extension of part (i) and its proof are given in Section 10.4. See also Problem 10.4. For a proof of part (ii), see Problem 10.5 or see Chung (1974) or Durrett (2004).

Remark 10.2.2: (Inversion formula for integer-valued random variables). If X is integer valued with pk = P(X = k), k ∈ Z, then its characteristic function is the Fourier series

φ(t) = ∑_{k∈Z} pk e^{ιtk}, t ∈ R.    (2.7)

Since ∫_{−π}^{π} e^{ιtj} dt = 2π if j = 0 and = 0 otherwise, multiplying both sides of (2.7) by e^{−ιtk}, integrating over t ∈ (−π, π), and using the DCT, one gets

pk = (1/2π) ∫_{−π}^{π} φ(t) e^{−ιtk} dt, k ∈ Z.    (2.8)

As a corollary to part (ii) of Theorem 10.2.6, one can deduce a criterion for a distribution to be continuous. Let µ be a probability distribution and let {pj} be its atoms, if any. Let α = ∑_j pj². Let X and Y be two independent random variables with distribution µ and characteristic function φ(·). Then Z = X − Y has characteristic function |φ(·)|² and, by Theorem 10.2.6, part (ii),

P(Z = 0) = lim_{T→∞} (1/2T) ∫_{−T}^{T} |φ(t)|² dt.

But P(Z = 0) = α. Hence, it follows that

∑_j pj² = lim_{T→∞} (1/2T) ∫_{−T}^{T} |φ(t)|² dt.    (2.9)

Corollary 10.2.7: A distribution is continuous iff

lim_{T→∞} (1/2T) ∫_{−T}^{T} |φ(t)|² dt = 0.    (2.10)

Some consequences of the uniqueness result (cf. Corollary 10.2.5) are the following.

Corollary 10.2.8: For a random variable X, X and −X have the same distribution iff the characteristic function φX(t) of X is real valued for all t ∈ R.

Proof: If φX(t) is real, then

φX(t) = ∫ (cos tx) dF(x) for all t ∈ R,


where F is the cdf of X. So

φX(t) = φX(−t) = E(e^{−ιtX}) = E(e^{ιt(−X)}).    (2.11)

Since the characteristic function of −X coincides with φX(t), the 'if' part follows.

To prove the 'only if' part, suppose that X and −X have the same distribution. Then, as in (2.11),

φX(t) = φ_{−X}(t) = φX(−t) = φ̄X(t),

where for any complex number z = a + ιb, a, b ∈ R, z̄ ≡ a − ιb denotes its conjugate. Hence, φX(t) is real for all t ∈ R.

Example 10.2.1: The standard Cauchy distribution has density

f(x) = (1/π) · 1/(1 + x²), −∞ < x < ∞.    (2.12)

Its characteristic function is given by

φ(t) = (1/π) ∫ e^{ιtx} / (1 + x²) dx = e^{−|t|}, t ∈ R.    (2.13)

To see this, let X1 and X2 be two independent copies of the standard exponential distribution. Since φ_{X1}(t) = (1 − ιt)^{−1}, t ∈ R, Y ≡ X1 − X2 has characteristic function

φY(t) = |φ_{X1}(t)|² = (1 + t²)^{−1}, t ∈ R.

Since φY is integrable, the density of Y is

fY(y) = (1/2π) ∫ 1/(1 + u²) e^{−ιuy} du, y ∈ R.

But by the convolution formula, fY(y) = ∫_{x>−y} e^{−x} e^{−(y+x)} dx = ∫_0^∞ e^{−x} e^{−(y+x)} 1_{(0,∞)}(y + x) dx = (1/2) e^{−|y|}, y ∈ R. So

∫ 1/(1 + u²) e^{ιuy} du = π e^{−|y|}, y ∈ R,

proving (2.13).

10.3 Levy-Cramer continuity theorem

Characteristic functions are very useful in determining distributions, moments, and establishing various identities involving distributions. But by


far their most important use is in establishing convergence in distribution. This is the content of a continuity theorem established by Paul Levy and H. Cramer. It says that the map ψ taking a distribution F to its characteristic function φ is a homeomorphism. That is, if Fn −→d F, then φn → φ, and conversely. Here, the notion of convergence of φn to φ is that of uniform convergence on bounded intervals. The following result deals with the 'if' part.

Theorem 10.3.1: Let {Fn, n ≥ 1} and F be cdfs with characteristic functions {φn, n ≥ 1} and φ, respectively. Let Fn −→d F. Then, for each 0 < K < ∞,

sup_{|t|≤K} |φn(t) − φ(t)| → 0 as n → ∞.

That is, φn converges to φ uniformly on bounded intervals.

Proof: By Skorohod's theorem, there exist random variables Xn, X defined on the Lebesgue space ([0, 1], B([0, 1]), m), where m(·) is the Lebesgue measure, such that Xn ∼ Fn, X ∼ F, and Xn → X w.p. 1. Now, for any t ∈ R,

|φn(t) − φ(t)| = |E(e^{ιtXn} − e^{ιtX})| ≤ E|1 − e^{ιt(X−Xn)}| ≤ E( |1 − e^{ιt(X−Xn)}| 1(|X − Xn| ≤ ε) ) + 2P(|Xn − X| > ε).

Hence,

sup_{|t|≤K} |φn(t) − φ(t)| ≤ ( sup_{|u|≤Kε} |1 − e^{ιu}| ) + 2P(|Xn − X| > ε).

Given K and δ > 0, choose ε ∈ (0, ∞) small enough that

sup_{|u|≤Kε} |1 − e^{ιu}| < δ.

Since for all ε > 0, P(|Xn − X| > ε) → 0 as n → ∞, it follows that

lim_{n→∞} sup_{|t|≤K} |φn(t) − φ(t)| = 0.

The Levy-Cramer theorem is a converse to the above theorem. That is, if φn → φ uniformly on bounded intervals, then Fn −→d F. Actually, it is a stronger result than this converse. It says that it is enough to know that φn converges pointwise to a limit φ that is continuous at 0. Then φ is the characteristic function of some distribution F and Fn −→d F. The key to this is that under the given hypotheses, the family {Fn}n≥1 is tight.


The next result relates the tail behavior of a probability measure to the behavior of its characteristic function near the origin, which in turn will be used to establish the tightness of {Fn}n≥1.

Lemma 10.3.2: Let µ be a probability measure on R with characteristic function φ. Then, for each δ > 0,

µ({x : |x|δ ≥ 2}) ≤ (1/δ) ∫_{−δ}^{δ} (1 − φ(u)) du.

Proof: Fix δ ∈ (0, ∞). Then, using Fubini's theorem and the fact that 1 − (sin x)/x ≥ 0 for all x, one gets

∫_{−δ}^{δ} (1 − φ(u)) du = ∫ ( ∫_{−δ}^{δ} (1 − e^{ιux}) du ) µ(dx)
= ∫ [ 2δ − 2 (sin δx)/x ] µ(dx)
= 2δ ∫ [ 1 − (sin δx)/(δx) ] µ(dx)
≥ 2δ ∫_{{x : |xδ| ≥ 2}} ( 1 − 1/|xδ| ) µ(dx)
≥ δ µ({x : |x|δ ≥ 2}).

Lemma 10.3.3: Let {µn}n≥1 be a sequence of probability measures with characteristic functions {φn}n≥1. Let lim_{n→∞} φn(t) ≡ φ(t) exist for |t| ≤ δ0 for some δ0 > 0. Let φ(·) be continuous at 0. Then {µn}n≥1 is tight.

Proof: For any 0 < δ < δ0, by the BCT,

(1/δ) ∫_{−δ}^{δ} (1 − φn(t)) dt → (1/δ) ∫_{−δ}^{δ} [1 − φ(t)] dt.

Also, by continuity of φ at 0,

(1/δ) ∫_{−δ}^{δ} [1 − φ(t)] dt → 0 as δ → 0.

Thus, given ε > 0, there exist a δε ∈ (0, δ0) and an Mε ∈ (0, ∞) such that for all n ≥ Mε,

(1/δε) ∫_{−δε}^{δε} (1 − φn(t)) dt < ε.

By Lemma 10.3.2, this implies that for all n ≥ Mε,

µn({x : |x| ≥ 2/δε}) < ε.


Now choose Kε > 2/δε such that

µj({x : |x| ≥ Kε}) < ε for 1 ≤ j ≤ Mε.

Then,

sup_{n≥1} µn({x : |x| ≥ Kε}) < ε,

and hence, {µn}n≥1 is tight.

Theorem 10.3.4: (Levy-Cramer continuity theorem). Let {µn}n≥1 be a sequence of probability measures on (R, B(R)) with characteristic functions {φn}n≥1. Let lim_{n→∞} φn(t) ≡ φ(t) exist for all t ∈ R and let φ be continuous at 0. Then φ is the characteristic function of a probability measure µ and µn −→d µ.

Proof: By Lemma 10.3.3, {µn}n≥1 is tight. Let {µnj}j≥1 be any subsequence of {µn}n≥1 that converges vaguely to a limit µ. By tightness, µ is a probability measure and, by Theorem 10.3.1, lim_{j→∞} φnj(t) is the characteristic function of µ. That is, φ is the characteristic function of µ. Since φ determines µ uniquely, all vague limit points of {µn}n≥1 coincide with µ and hence, by Theorem 9.2.6, µn −→d µ.

This theorem will be used extensively in the next chapter on central limit theorems. For the moment, some easy applications are given.

Example 10.3.1: (Convergence of Binomial to Poisson). Let {Xn}n≥1 be a sequence of random variables such that Xn ∼ Binomial(Nn, pn) for all n ≥ 1. Suppose that as n → ∞, Nn → ∞, pn → 0, and Nnpn → λ, λ ∈ (0, ∞). Then

Xn −→d X where X ∼ Poisson(λ).    (3.1)

To prove (3.1), note that the characteristic function φn of Xn is

φn(t) = (pn e^{ιt} + 1 − pn)^{Nn} = (1 + pn(e^{ιt} − 1))^{Nn} = (1 + (Nnpn/Nn)(e^{ιt} − 1))^{Nn}, t ∈ R.

Next recall the fact that if {zn}n≥1 is a sequence of complex numbers such that lim_{n→∞} zn = z exists, then

(1 + n^{−1} zn)^n → e^z as n → ∞.    (3.2)

So φn(t) → e^{λ(e^{ιt}−1)} for all t ∈ R. Since φ(t) ≡ e^{λ(e^{ιt}−1)}, t ∈ R, is the characteristic function of a Poisson (λ) random variable, (3.1) follows.


A direct proof of (3.1) consists of showing that for each j = 0, 1, 2, . . .,

P(Xn = j) ≡ (Nn choose j) pn^j (1 − pn)^{Nn−j} → P(X = j) = e^{−λ} λ^j / j!.

Example 10.3.2: (Convergence of Binomial to Normal). Let Xn ∼ Binomial(Nn, pn) for all n ≥ 1. Suppose that as n → ∞, Nn → ∞ and sn² ≡ Nnpn(1 − pn) → ∞. Then

Zn ≡ (Xn − Nnpn)/sn −→d N(0, 1).    (3.3)

To prove (3.3), note that the characteristic function φn of Zn is

φn(t) = [pn exp(ιt(1 − pn)/sn) + (1 − pn) exp(−ιtpn/sn)]^{Nn} ≡ [1 + zn(t)/Nn]^{Nn}, say,

where zn(t) = Nn[(pn e^{ιt(1−pn)/sn} + (1 − pn) e^{−ιtpn/sn}) − 1]. By (3.2), it suffices to show that for all t ∈ R,

zn(t) → −t²/2 as n → ∞.

By Lemma 10.1.5, for any x real,

|e^{ιx} − 1 − ιx − (ιx)²/2| ≤ |x|³/3!.    (3.4)

Since sn → ∞, for any t ∈ R, with pn(t) ≡ tpn/sn and qn(t) ≡ t(1 − pn)/sn, one has

zn(t) = Nn[pn exp(ιqn(t)) + (1 − pn) exp(−ιpn(t)) − 1]
= Nn[pn{e^{ιqn(t)} − 1 − ιqn(t)} + (1 − pn){e^{−ιpn(t)} − 1 + ιpn(t)}]
= Nn[(pn/2)(ιqn(t))² + ((1 − pn)/2)(−ιpn(t))²] + Nn O(pn(1 − pn)|t|³/sn³)
= −t²/2 + o(1) as n → ∞.

This is known as the De Moivre-Laplace CLT in the case Nn = n, pn = p, 0 < p < 1. The original proof was based on Stirling's approximation.


Example 10.3.3: (Convergence of Poisson to Normal). Let {Xn}n≥1 be a sequence of random variables such that for n ≥ 1, Xn ∼ Poisson(λn), λn ∈ (0, ∞). Let Yn = (Xn − λn)/√λn, n ≥ 1. If λn → ∞ as n → ∞, then

Yn −→d N(0, 1).    (3.5)

To prove (3.5), note that the characteristic function φn of Yn is

φn(t) = exp(−ιt√λn) exp(λn[exp(ιt/√λn) − 1]) = exp(λn[exp(ιt/√λn) − 1 − ιt/√λn]),

t ∈ R. Now using (3.4) again, it is easy to show that for each t ∈ R,

λn(exp(ιt/√λn) − 1 − ιt/√λn) → −t²/2 as n → ∞.

Hence, (3.5) follows.

10.4 Extension to Rk

Definition 10.4.1:

(a) Let X = (X1, . . . , Xk) be a k-dimensional random vector (k ∈ N). The characteristic function of X is defined as

φX(t) = E exp(ιt · X) = E exp(ι ∑_{j=1}^{k} tj Xj),    (4.1)

t = (t1, . . . , tk) ∈ Rk, where t · x = ∑_{j=1}^{k} tj xj denotes the inner product of the two vectors t = (t1, . . . , tk), x = (x1, . . . , xk) ∈ Rk.

(b) For a probability measure µ on (Rk, B(Rk)), its characteristic function is defined as

φ(t) = ∫_{Rk} exp(ιt · x) µ(dx).    (4.2)

Note that for a linear combination L ≡ a1X1 + · · · + akXk, a1, . . . , ak ∈ R, of a set of random variables X1, . . . , Xk, all defined on a common probability space, the characteristic functions of L and X = (X1, . . . , Xk) are related by the identity

φL(λ) = E exp(ιλ ∑_{j=1}^{k} aj Xj) = φX(λa), λ ∈ R,    (4.3)

where a = (a1, . . . , ak). Thus, the characteristic function of a random vector X = (X1, . . . , Xk) is determined by the characteristic functions of all its linear combinations and vice versa. It will now be shown that, as in the one-dimensional case, the characteristic function of a random vector X uniquely determines its probability distribution. The following is a multivariate version of Theorem 10.2.6.

Theorem 10.4.1: Let X = (X1, . . . , Xk) be a random vector with characteristic function φX(·) and let A = (a1, b1] × · · · × (ak, bk] be a rectangle in Rk with −∞ < ai < bi < ∞ for all i = 1, . . . , k. If P(X ∈ ∂A) = 0, then

P(X ∈ A) = lim_{T→∞} (1/(2π)^k) ∫_{−T}^{T} · · · ∫_{−T}^{T} { ∏_{j=1}^{k} hj(tj) } φX(t1, . . . , tk) dt1 . . . dtk,    (4.4)

where ∂A denotes the boundary of A and where hj(tj) ≡ (exp(−ιtjaj) − exp(−ιtjbj))(ιtj)^{−1} for tj ≠ 0 and hj(0) = (bj − aj), 1 ≤ j ≤ k.

Proof: Consider the product space Ω = [−T, T]^k × Rk with the corresponding Borel σ-algebra F = B([−T, T]^k) × B(Rk) and the product measure µ = µ1 × µ2, where µ1 is the Lebesgue measure on ([−T, T]^k, B([−T, T]^k)) and µ2 is the probability distribution of X on (Rk, B(Rk)). Since the function

f(t, x) ≡ ∏_{j=1}^{k} hj(tj) exp(ιt · x), (t, x) ∈ Ω,

is integrable w.r.t. the product measure µ, by Fubini's theorem,

IT ≡ ∫_{−T}^{T} · · · ∫_{−T}^{T} { ∏_{j=1}^{k} hj(tj) } φX(t1, . . . , tk) dt1 . . . dtk

= ∫_{Rk} ∫_{−T}^{T} · · · ∫_{−T}^{T} ∏_{j=1}^{k} hj(tj) exp(ιtjxj) dt1 . . . dtk µ2(dx)

= ∫_{Rk} ∏_{j=1}^{k} [ ∫_{−T}^{T} {exp(ιtj(xj − aj)) − exp(ιtj(xj − bj))}/(ιtj) dtj ] µ2(dx)

= ∫_{Rk} ∏_{j=1}^{k} [ 2 ∫_0^T sin(tj(xj − aj))/tj dtj − 2 ∫_0^T sin(tj(xj − bj))/tj dtj ] µ2(dx),    (4.5)

using (1.3) and the fact that sin θ/θ and cos θ/θ are respectively even and odd functions of θ. It can be shown that (Problem 10.8)

lim_{T→∞} ∫_0^T (sin t)/t dt = π/2.    (4.6)

Hence, by the change of variables theorem, it follows that, for any c ∈ R,

lim_{T→∞} ∫_0^T sin(tc)/t dt = 0 if c = 0, π/2 if c > 0, and −π/2 if c < 0,    (4.7)

and

sup_{T>0, c∈R} | ∫_0^T sin(tc)/t dt | = sup_{T>0} | ∫_0^T (sin u)/u du | ≡ K < ∞.    (4.8)

This implies that as T → ∞, the integrand in (4.5) converges to the function ∏_{j=1}^{k} gj(xj) for each x ∈ Rk, where

gj(y) = π if y ∈ {aj, bj}, 2π if y ∈ (aj, bj), and 0 if y ∈ (−∞, aj) ∪ (bj, ∞).    (4.9)

Hence, by (4.5), (4.8), (4.9), and the BCT,

lim_{T→∞} IT = ∫_{Rk} ∏_{j=1}^{k} gj(xj) µ2(dx).

By the boundary condition P(X ∈ ∂A) = 0, the right side above equals (2π)^k P(X ∈ (a1, b1) × · · · × (ak, bk)), proving the theorem.

Remark 10.4.1: The inversion formula (2.3) can also be extended to themultivariate case.

Corollary 10.4.2: A probability measure on (Rk,B(Rk)) is uniquely de-termined by its characteristic function.

Proof: Let µ and ν be probability measures on (Rk,B(Rk)) with the samecharacteristic function φ(·), i.e.,

φ(t) =∫

exp(ιt · x)µ(dx) =∫

exp(ιt · x)ν(dx),

Page 345: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

10.4 Extension to Rk 335

t ∈ Rk. Let A = A : A = (a1, b1] × · · · × (ak, bk], −∞ < ai < bi < ∞,i = 1, . . . , k, µ(∂A) = 0 = ν(∂A). It is easy to verify that A is a π-class.Since there are only countably many rectangles (a1, b1]×· · ·× (ak, bk] withµ(∂A) + ν(∂A) = 0, A generates B(Rk). But, by Theorem 10.4.1,

µ(A) = ν(A)

= limT→∞

(2π)−k

∫ T

−T

· · ·∫ T

−T

k∏j=1

hj(tj)

φ(t1, . . . , tk)dt1, . . . , dtk

for all A ∈ A. Hence, by Theorem 1.2.4, µ(B) = ν(B) for all B ∈ B(Rk),i.e., µ = ν.

Corollary 10.4.3: A probability measure µ on (Rk,B(Rk)) is determinedby its values assigned to the collection of half-spaces H ≡ H : H = x ∈Rk : a · x ≤ c, a ∈ Rk, c ∈ R.

Proof: Let X be the identity mapping on Rk. Then, for any H = x ∈Rk : a · x ≤ c, X ∈ H = a ·X ≤ c. Thus, the values µ(H) : H ∈ Hdetermine the probability distributions (and hence, the characteristic func-tions) of all linear combinations of X. Consequently, by (4.3), it determinesthe characteristic function of X. By Corollary 10.4.2, this determines µuniquely.

Theorem 10.4.4: Let Xnn≥1, X be k-dimensional random vectors.Then Xn −→d X iff

φXn(t) → φX(t) for all t ∈ Rk. (4.10)

Proof: Suppose that Xn −→d X. Then, (4.10) follows from the continuousmapping theorem for weak convergence (cf. Theorem 9.4.2). Conversely,suppose (4.10) holds. Let X

(j)n and X(j) denote the jth components of Xn

and X, respectively, j = 1, . . . , k. By (4.10), for any j ∈ 1, . . . , k,

limn→∞ E exp(ιλX(j)

n ) = E exp(ιλX(j)) for all λ ∈ R.

Hence, by Theorem 10.3.4

X(j)n −→d X(j) for all j = 1, . . . , k. (4.11)

This implies that the sequence of random vectors Xnn≥1 is tight (Prob-lem 9.9). Hence, by Theorem 9.3.3, given any subsequence nii≥1, thereexists a further subsequence n′

ii≥1 ⊂ nii≥1 and a random vector X0such that Xn′

i−→d X0 as i →∞. By the ‘only if’ part, this implies

φXn′i(t) → φX0(t) as i →∞,

Page 346: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

336 10. Characteristic Functions

for all t ∈ Rk. Thus, φX0(·) = φX(·) and by the uniqueness of character-istic functions, X0 =d X. Thus, all convergent subsequences of Xnn≥1have the same limit. By arguments similar to the proof of Theorem 9.2.6,Xn −→d X. This completes the proof of the theorem.

Theorem 10.4.4 shows that as in the one-dimensional case, the (point-wise) convergence of the characteristic functions of a sequence of k-dimensional random vectors Xnn≥1 to a given characteristic function isequivalent to convergence in distribution of the sequence Xnn≥1. Sincethe characteristic function of a random vector is determined by the char-acteristic functions of all its linear combinations, this suggests that onemay also be able to establish convergence in distribution of a sequence ofrandom vectors by considering the convergence of the sequences of linearcombinations that are one-dimensional random variables. This is indeedtrue as shown by the following result.

Theorem 10.4.5: (Cramer-Wold device). Let Xnn≥1 be a sequence ofk-dimensional random vectors and let X be a k-dimensional random vector.Then, Xn −→d X iff for all a ∈ Rk,

a ·Xn −→d a ·X. (4.12)

Proof: Suppose Xn −→d X. Then, for any a ∈ Rk, the function h(x) =a ·x, x ∈ Rk is continuous on Rk. Hence, (4.12) follows from Theorem 9.4.2.

Conversely, suppose that (4.12) holds for all a ∈ Rk. By (4.3) and The-orem 10.3.1, this implies that as n →∞

φXn(a) = φa·Xn

(1)→ φa·X(1) = φX(a),

for all a ∈ Rk. Hence, by Theorem 10.4.4, Xn −→d X.

Recall that a set of random variables X1, . . . , Xk defined on a commonprobability space are independent iff the joint cdf of X1, . . . , Xk is theproduct of the marginal cdfs of the Xi’s. A similar characterization ofindependence can be given in terms of the characteristic functions, as shownby the following result. The proof is left as an exercise (Problem 10.16).

Proposition 10.4.6: Let X1, . . . , Xk, (k ∈ N) be a collection of randomvariables defined on a common probability space. Then, X1, . . . , Xk are in-dependent iff

φ(X1,...,Xk)(t1, . . . , tk) =k∏

j=1

φXj (tj)

for all t1, . . . , tk ∈ R.

Page 347: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

10.5 Problems 337

10.5 Problems

10.1 Let Xnn≥1 and Ynn≥1 be two sequences of random variables suchthat for each n ≥ 1, Xn and Yn are defined on a common probabilityspace and Xn and Yn are independent. If Xn −→d X and Yn −→d Y ,then show that

Xn + Yn −→d X0 + Y0 (5.1)

where X0 =d X, Y0 =d Y (cf. Section 2.2) and X0 and Y0 are indepen-dent. Show by an example that (5.1) is false without the independencehypothesis.

10.2 Give an example of a nonlattice discrete distribution F on R sup-ported by only a three point set.

10.3 Let F be an absolutely continuous cdf on R with density f and withcharacteristic function φ. Show that if f has a derivative f (1) ∈ L1(R),then

lim|t|→∞

|tφ(t)| = 0.

Generalize this result when f is r-times differentiable and the jthderivative f (j) lie in L1(R) for j = 1, . . . , r.

10.4 Let F be a cdf on R with characteristic function φ. Show that for anya < b, a, b ∈ R,

limT→∞

12π

∫ T

−T

[exp(−ιta)− exp(−ιtb)

](ιt)−1φ(t)dt

= µF ((a, b)) +12µF (a, b), (5.2)

where µF denotes the Lebesgue-Stieltjes measure corresponding toF .

(Hint: Use (4.7) and the arguments in the proof of Theorem 10.4.1.)

10.5 Let φ be a characteristic function of a cdf F and let µF denote theLebesgue-Stieltjes measure corresponding to F .

(a) Show that for any a ∈ R and T ∈ (0,∞),

∫ T

−T

exp(−ιta)φ(t)dt

= 2TµF (a)

+∫

x =a

exp(ιT (x− a))− exp(−ιT (x− a))T (x− a)

µF (dx).

(5.3)

Page 348: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

338 10. Characteristic Functions

(b) Conclude from (5.3) that for any a ∈ R,

F (a)− F (a−) = limT→∞

12T

∫ T

−T

exp(−ιta)φ(t)dt. (5.4)

10.6 Let F be a cdf on R with characteristic function φ. If |φ| ∈ L2(R),then show that F is continuous.

(Hint: Use Corollary 10.1.7.)

10.7 Let Fnn≥1, F be cdfs with characteristic functions φnn≥1, φ,respectively. Suppose that Fn −→d F .

(a) Give an example to show that φn may not converge to φ uni-formly over all of R.

(Hint: Try φn(t) ≡ e− t2n .)

(b) Let µnn≥1 and µ denote the Lebesgue-Stieltjes measurescorresponding to Fnn≥1 and F , respectively. Suppose thatµnn≥1 and µ are dominated by a σ-finite measure λ on(R,B(R)) with Radon-Nikodym derivatives fn = dµn

dλ , n ≥ 1and f = dµ

dλ . If fn −→ f a.e. (λ), then show that

supt∈R

|φn(t)− φ(t)| → 0 as n →∞. (5.5)

10.8 Let G(x, a) = (1 + a2)−1(1− e−axa sin x + cos x), x ∈ R, a ∈ R.

(a) Show that for any a > 0, x0 ≥ 0,∫ x0

0(sinx) e−axdx = G(a, x0). (5.6)

(Hint: Consider the derivatives of the left and the right sides of(5.6) w.r.t. x0.)

(b) Use Fubini’s theorem to justify that for all T > 0,∫ T

0

∫ ∞

0(sinx) e−axdadx =

∫ ∞

0

∫ T

0(sinx) e−axdxda. (5.7)

(c) Use (5.6), (5.7) and the identity that for x > 0,∫∞0 e−axda = 1

xto conclude that for any T > 0∫ T

0

sin x

xdx =

∫ ∞

0G(a, T )da. (5.8)

(d) Use the DCT and the fact that∫∞0 (1+a2)−1da = π

2 to concludethat the limit of the right side of (5.8) exists and equals π

2 .

Page 349: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

10.5 Problems 339

10.9 Let F1, F2, and F3 be three cdfs on R. Then show by an examplethat F1 ∗ F2 = F1 ∗ F3 does not imply that F1 = F2. Here Fi ∗ Fj

denotes the convolution of Fi and Fj , 1 ≤ i, j ≤ 3.

(Hint: For F1, consider a cdf whose characteristic function φ has abounded support.)

10.10 Let µ be a probability measure on R with characteristic function φ.Prove that

∫ ∞

−∞[1− Re(φ(t))]t−2dt =

∫|x|µ(dx).

10.11 Let φ be the characteristic function of a random variable X. If |φ(t)| =1 = |φ(αt)| for some t = 0 and α ∈ R irrational, then there existsx0 ∈ R such that P (X = x0) = 1.

(Hint: Use Proposition 10.1.1.)

10.12 Show that for any characteristic function φ, t ∈ R : |φ(t)| = 1 iseither 0 or countably infinite or all of R.

10.13 Let Xnn≥1 be a sequence of iid random variables with a nonde-generate distribution F . Suppose that there exist an ∈ (0,∞) andbn ∈ R such that

a−1n

( n∑j=1

Xj − bn

)−→d Z (5.9)

for some nondegenerate random variable Z.

(a) Show thatan →∞ as n →∞.

(Hint: If an → a ∈ R, then E exp(ιtaZ) = limn→∞E exp

(ιt[∑n

j=1 Xj − bn

])= 0 for all except countably many

t ∈ R, which leads to a contradiction.)

(b) Show that as n →∞

bn − bn−1 = o(an) andan

an−1→ 1.

(Hint: Use (a) to show that(∑n−1

j=1 Xj − bn

)/an −→d Z and

by (5.9),(∑n−1

j=1 Xj − bn−1

)/an−1 −→d Z.)

Page 350: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

340 10. Characteristic Functions

10.14 Show that for every T ∈ (0,∞), there exist two distinct characteristicfunctions φ1T and φ2T satisfying

φ1T (t) = φ2T (t) for all |t| ≤ T.

(Hint: Let φ1(t) = e−|t|, t ∈ R and for any T ∈ (0,∞), define aneven function φ2T (·) by

φ2T (t) =

⎧⎪⎨⎪⎩

φ1(t) for 0 ≤ t ≤ T

φ1(T ) + (t− T )(−φ1(T )) T ≤ t < T + 1

0 t > T.

Now use Polya’s criterion.)

10.15 Show that φα(t) = exp(−|t|α), t ∈ R, α ∈ (0,∞) is a characteristicfunction for 0 ≤ α ≤ 2.

10.16 Prove Proposition 10.4.6.

(Hint: The ‘only if’ part follows from (4.2) and Proposition 7.1.3.The ‘if part’ follows by using the inversion formulas of Theorems10.4.1 and 10.2.6 and the characterization of independence in termsof cdfs (Corollary 7.1.2).)

10.17 Let Xnn≥0 be a collection of random variables with characteristicfunctions φnn≥0. Suppose that

∫|φn(t)|dt < ∞ for all n ≥ 0 and

that φn(·) → φ0(·) in L1(R) as n →∞. Show that

supB∈B(R)

∣∣P (Xn ∈ B)− P (X0 ∈ B)∣∣→ 0

as n →∞.

10.18 Let φ(·) be a characteristic function on R such that φ(t) → 0 as|t| → ∞. Let X be a random variable with φ as its characteristicfunction. For each n ≥ 1, let Xn = k

n if kn ≤ X < k+1

n , k = 0,±1,±2, . . .. Show that if φn(t) ≡ E(eιtXn), then φn(t) → φ(t) foreach t ∈ R but for each n ≥ 1,

sup|φn(t)− φ(t)| : t ∈ R

= 1.

10.19 Let δii≥1 be iid random variables with distribution

P (δ1 = 1) = P (δ1 = −1) = 1/2.

Let Xn =∑n

i=1δi

2i and X = limn→∞ Xn.

(a) Find the characteristic function of Xn.

Page 351: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

10.5 Problems 341

(b) Show that the characteristic function of X is φX(t) ≡ sin tt .

10.20 Let Xkk≥1 be iid random variables with pdf f(x) = 12 e−|x|, x ∈ R.

Show that∑∞

k=11k Xk converges w.p. 1 and compute its characteristic

function.

(Hint: Note that the characteristic function of the standard Cauchy(0,1) distribution is e−|t|.)

10.21 Establish an extension of formula (2.3) to the multivariate case.

Page 352: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11Central Limit Theorems

11.1 Lindeberg-Feller theorems

The central limit theorem (CLT) is one of the oldest and most useful resultsin probability theory. Empirical findings in applied sciences, dating back tothe 17th century, showed that the averages of laboratory measurements onvarious physical quantities tended to have a bell-shaped distribution. TheCLT provides a theoretical justification for this observation. Roughly speak-ing, it says that under some mild conditions, the average of a large numberof iid random variables is approximately normally distributed. A versionof this result for 0–1 valued random variables was proved by DeMoivreand Laplace in the early 18th century. An extension of this result to theaverages of iid random variables with a finite second moment was done inthe early 20th century. In this section, a more general set up is considered,namely, that of the limit behavior of the row sums of a triangular array ofindependent random variables.

Definition 11.1.1: For each n ≥ 1, let Xn1, . . . , Xnrn be a collectionof random variables defined on a probability space (Ωn,Fn, Pn) such thatXn1, . . . , Xnrn

are independent. Then, Xnj : 1 ≤ j ≤ rnn≥1 is called atriangular array of independent random variables.

Let Xnj : 1 ≤ j ≤ rnn≥1 be a triangular array of independent randomvariables. Define the row sums

Page 353: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

344 11. Central Limit Theorems

Sn =rn∑

j=1

Xnj , n ≥ 1. (1.1)

Suppose that EX2nj < ∞ for all j, n. Write s2

n = Var(Sn) =∑rn

j=1 Var(Xnj), n ≥ 1. The following condition, introduced by Lindeberg,plays an important role in establishing convergence of

(Sn−ESn

sn

)to a stan-

dard normal random variable in distribution.

Definition 11.1.2: Let Xnj : 1 ≤ j ≤ rnn≥1 be a triangular array ofindependent random variables such that

EXnj = 0, 0 < EX2nj ≡ σ2

nj < ∞ for all 1 ≤ j ≤ rn, n ≥ 1. (1.2)

Then, Xnj : 1 ≤ j ≤ rnn≥1 is said to satisfy the Lindeberg condition iffor every ε > 0,

limn→∞ s−2

n

rn∑j=1

EX2njI(|Xnj | > εsn) = 0, (1.3)

where s2n =

∑rn

j=1 σ2nj , n ≥ 1.

Example 11.1.1: Let Xnn≥1 be a sequence of iid random variableswith EX1 = µ and Var(X1) = σ2 ∈ (0,∞). Consider the centered andscaled sample mean

Tn =√

n(Xn − µ)σ

, n ≥ 1, (1.4)

where Xn = n−1∑nj=1 Xj . Note that Tn can be written as the row sum of

a triangular array of independent random variables:

Tn =n∑

j=1

Xnj , (1.5)

where Xnj = (Xj − µ)/σ√

n, 1 ≤ j ≤ n, n ≥ 1. Clearly, Xnj : 1 ≤j ≤ nn≥1 satisfies (1.2) with σ2

nj = EX2nj = 1

nσ2 Var(X1) = 1/n for all1 ≤ j ≤ n, and hence, s2

n =∑n

j=1 σ2nj = 1 for all n ≥ 1. Now, for any ε > 0,

s−2n

n∑j=1

EX2njI(|Xnj | > εsn)

=n∑

j=1

E(Xj − µ

σ√

n

)2I(∣∣∣Xj − µ

σ√

n

∣∣∣ > ε)

= n[ 1σ2n

E(X1 − µ)2I

(|X1 − µ| > εσ

√n)]

= σ−2E(X1 − µ)2I

(|X1 − µ| > εσ

√n)→ 0 as n →∞,

Page 354: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.1 Lindeberg-Feller theorems 345

by the DCT, since E(X1−µ)2 < ∞. Thus, the triangular array Xnj : 1 ≤j ≤ n of (1.5) satisfies the Lindeberg condition (1.3).

The main result of this section is the following CLT for scaled row sumsof a triangular array of independent random variables.

Theorem 11.1.1: (Lindeberg CLT). Let Xnj : 1 ≤ j ≤ rnn≥1 be atriangular array of independent random variables satisfying (1.2) and theLindeberg condition (1.3). Then,

Sn

sn−→d N(0, 1) (1.6)

where Sn =∑rn

j=1 Xnj and s2n = Var(Sn) =

∑rn

j=1 σ2nj.

As a direct consequence of Theorem 11.1.1 and Example 11.1.1, one getsthe more familiar version of the CLT for the sample mean of iid randomvariables.

Corollary 11.1.2: (CLT for iid random variables). Let Xnn≥1 be asequence of iid random variables with EX1 = µ and Var(X1) = σ2 ∈(0,∞). Then, √

n(Xn − µ) −→d N(0, σ2), (1.7)

where Xn = n−1∑nj=1 Xnj, n ≥ 1.

For proving the theorem, the following simple inequality will be used.

Lemma 11.1.3: For any m ∈ N and for any complex numbers z1, . . . , zm,ω1, . . . , ωm, with |zj | ≤ 1, |ωj | ≤ 1 for all j = 1, . . . , m,∣∣∣∣

m∏j=1

zj −m∏

j=1

ωj

∣∣∣∣ ≤m∑

j=1

|zj − ωj |. (1.8)

Proof: Inequality (1.8) follows from the identity

m∏j=1

zj −m∏

j=1

ωj =m∏

j=1

zj −(m−1∏

j=1

zj

)ωm

+(m−1∏

j=1

zj

)ωm −

(m−2∏j=1

zj

)ωm−1ωm

+ · · ·+ z1

m∏j=2

ωj −m∏

j=1

ωj .

Proof of Theorem 11.1.1: W.l.o.g., suppose that s2n = 1 for all n ≥ 1.

(Otherwise, setting Xnj ≡ Xnj/sn, 1 ≤ j ≤ rn, n ≥ 1, it is easy to check

Page 355: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

346 11. Central Limit Theorems

that for the triangular array Xnj : 1 ≤ j ≤ rnn≥1, the variance of the nthrow sum s2

n ≡∑rn

j=1 Var(Xnj) = 1 for all n ≥ 1, the Lindeberg conditionholds, and s−1

n

∑rn

j=1 Xnj −→d N(0, 1) iff (1.6) holds.) Then, by Theorem10.3.4, it is enough to show that

limn→∞ E exp(ιtSn) = e−t2/2 for all t ∈ R. (1.9)

For any ε > 0,

∆n ≡ maxEX2

nj : 1 ≤ j ≤ rn

≤ max

EX2

njI(|Xnj | > ε) + EX2njI(|Xnj | ≤ ε) : 1 ≤ j ≤ rn

rn∑j=1

EX2njI(|Xnj | > ε) + ε2

= o(1) + ε2 as n →∞, by the Lindeberg condition (1.3).

Hence,∆n → 0 as n →∞. (1.10)

Fix t ∈ R. Let φnj(·) denote the characteristic function of Xnj , 1 ≤ j ≤ rn,n ≥ 1. Note that by (1.10), there exists n0 ∈ N such that for all n ≥ n0,I1n ≡ max|1 − t2σ2

nj/2| : 1 ≤ j ≤ rn ≤ 1. Next, noting that s2n =∑rn

j=1 σ2nj = 1, by Lemma 11.1.3, for all n ≥ n0,∣∣∣E exp(ιtSn)− e−t2/2

∣∣∣≤

∣∣∣∣rn∏

j=1

φnj(t)−rn∏

j=1

(1−

t2σ2nj

2

)∣∣∣∣+∣∣∣∣

rn∏j=1

(1−

t2σ2nj

2

)−

rn∏j=1

exp(−t2σ2nj/2)

∣∣∣∣≤

rn∑j=1

∣∣∣∣φnj(t)−[1−

t2σ2nj

2

]∣∣∣∣+

rn∑j=1

∣∣∣∣ exp(−t2σ2nj/2)−

[1−

t2σ2nj

2

]∣∣∣∣≡ I2n + I3n, say. (1.11)

It will now be shown that

limn→∞ Ikn = 0 for k = 2, 3.

First consider I2n. Since∣∣ exp(ιx)− [1 + ιx + (ιx)2/2]

∣∣ ≤ min|x|3/3!, |x|2for all x ∈ R (cf. Lemma 10.1.5) and EXnj = 0 for all 1 ≤ j ≤ rn, for any

Page 356: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.1 Lindeberg-Feller theorems 347

ε > 0, by the Lindeberg condition, one gets

I2n ≡rn∑

j=1

∣∣∣φnj(t)−[1−

t2σ2nj

2

]∣∣∣=

n∑j=1

∣∣∣E exp(ιtXnj)−[1 + ιtEXnj +

(ιt)2

2!EX2

nj

]∣∣∣≤

rn∑j=1

E min |tXnj |3

3!, |tXnj |2

≤rn∑

j=1

E|tXnj |3I(|Xnj | ≤ ε) +rn∑

j=1

E(tXnj)2I(|Xnj | > ε)

≤ |t|3εrn∑

j=1

EX2nj + t2

rn∑j=1

EX2njI(|Xnj | > ε)

≤ |t|3ε + t2 · o(1) as n →∞. (1.12)

Since ε ∈ (0,∞) is arbitrary, I2n → 0 as n →∞.Next, consider I3n. Note that for any x ∈ R,

|ex − 1− x| =∣∣∣∣

∞∑k=2

xk/k!∣∣∣∣ ≤ x2

∞∑k=2

|x|k−2

k!≤ x2e|x|.

Hence, using (1.10) and the fact that s2n = 1, one gets

I3n ≤rn∑

j=1

( t2σ2nj

2

)2exp(t2σ2

nj/2)

≤ t4 exp(t2∆n/2)[ rn∑

j=1

σ2nj ·∆n

]

= t4 exp(t2∆n/2)∆n

→ 0 as n →∞. (1.13)

Now (1.9) follows from (1.11), (1.12), and (1.13). This completes the proofof the theorem.

Oftentimes, verification of the Lindeberg condition (1.3) becomes diffi-cult as one has to find the truncated second moments of Xnj ’s. A simplersufficient condition for the CLT is provided by Lyapounov’s condition.

Definition 11.1.3: A triangular array Xnj : 1 ≤ j ≤ rnn≥1 of inde-pendent random variables satisfying (1.2) is said to satisfy Lyapounov’scondition if there exists a δ ∈ (0,∞) such that

limn→∞ s−(2+δ)

n

rn∑j=1

E|Xnj |2+δ = 0, (1.14)

Page 357: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

348 11. Central Limit Theorems

where s2n =

∑rn

j=1 EX2nj .

Note that by Markov’s inequality, if a triangular array Xnj : 1 ≤ j ≤rnn≥1 satisfies Lyapounov’s condition (1.14), then for any ε ∈ (0,∞),

s−2n

rn∑j=1

EX2njI(|Xnj | > εsn)

≤ s−2n

rn∑j=1

E|Xnj |2(|Xnj |/εsn)δ

→ 0 as n →∞.

Thus, Xnj : 1 ≤ j ≤ rnn≥1 satisfies the Lindeberg condition (1.3). Thisobservation leads to the following result.

Corollary 11.1.4: (Lyapounov’s CLT). Let Xnj : 1 ≤ j ≤ rnn≥1 bea triangular array of independent random variables satisfying (1.2) andLyapounov’s condition (1.14). Then, (1.6) holds, i.e.,

Sn

sn−→d N(0, 1).

It is clear that Lyapounov’s condition is only a sufficient but not a nec-essary condition for the validity of the CLT. In contrast, under some reg-ularity conditions on the triangular array Xnj : 1 ≤ j ≤ rnn≥1, whichessentially says that the individual random variables Xnj ’s are ‘uniformlysmall’, the Lindeberg condition is also a necessary condition for the CLT.This converse is due to W. Feller.

Theorem 11.1.5: (Feller’s theorem). Let Xnj : 1 ≤ j ≤ rnn≥1 be atriangular array of independent random variables satisfying (1.2) such thatfor any ε > 0,

limn→∞ max

1≤j≤rn

P (|Xnj | > εsn) = 0, (1.15)

where s2n =

∑rn

j=1 EX2nj. Let Sn =

∑rn

j=1 Xnj. If, in addition,

Sn

sn−→d N(0, 1), (1.16)

then Xnj : 1 ≤ j ≤ rnn≥1 satisfies the Lindeberg condition.

A triangular array Xnj : 1 ≤ j ≤ rnn≥1 satisfying (1.15) is called anull array. Thus, the converse of Theorem 11.1.1 holds for null arrays. Itmay be noted that there exist non-null arrays for which (1.16) holds butthe Lindeberg condition fails (Problem 11.9).

Proof: W.l.o.g., suppose that s2n = 1 for all n ≥ 1. Next fix ε ∈ (0,∞).

Then, setting t0 = 4/ε, and noting that 1 =∑rn

j=1 EX2nj and cos x ≥

Page 358: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.1 Lindeberg-Feller theorems 349

1− x2/2 for all x ∈ R, one gets

rn∑j=1

(E cos t0Xnj − 1

)+

t202

=rn∑

j=1

E( t20X

2nj

2− 1 + cos t0Xnj

)

≥rn∑

j=1

E( t20X

2nj

2− 1 + cos t0Xnj

)I(|Xnj | > ε

)

≥rn∑

j=1

E( t20X

2nj

2− 2)I(|Xnj | > ε

)

≥( t20

2− 2

ε2

) rn∑j=1

EX2njI(|Xnj | > ε

)

=6ε2

rn∑j=1

EX2njI(|Xnj | > ε

).

Hence, the Lindeberg condition would hold if it is shown that for all t ∈ R,

rn∑j=1

(E cos tXnj − 1

)+

t2

2→ 0 as n →∞

⇔ exp( rn∑

j=1

(E cos tXnj − 1

))→ e−t2/2 as n →∞. (1.17)

Let φnj(t) = E exp(ιtXnj), t ∈ R denote the characteristic function of Xnj ,1 ≤ j ≤ rn, n ≥ 1. Note that E cos tXnj = Re(φnj(t)), where recall thatfor any complex number z, Re(z) denotes the real part of z, i.e., Re(z) = aif z = a + ιb, a, b ∈ R. Since the function h(z) = |z| is continuous on Cand | exp(φnj(t))| = exp(E cos tXnj), it follows that (1.17) holds if, for allt ∈ R,

exp( rn∑

j=1

(φnj(t)− 1))→ e−t2/2 as n →∞.

However, by (1.16), E exp(ιtSn) =∏rn

j=1 φnj(t) → e−t2/2 for all t ∈ R.Hence, it is enough to show that

I1n(t) ≡[

exp( rn∑

j=1

[φnj(t)− 1])−

rn∏j=1

φnj(t)]

→ 0 as n →∞ for all t ∈ R. (1.18)

Page 359: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

350 11. Central Limit Theorems

Note that for any ε ∈ (0,∞), by (1.15) and the inequality |eιx − 1| ≤min2, |x| for all x ∈ R, one has

|φnj(t)− 1| =∣∣E( exp(ιtXnj)− 1

)∣∣≤ E min|tXnj |, 2≤ 2P (|Xnj | > ε) + |t|ε

uniformly in j = 1, . . . , rn. Hence, letting n →∞ and then ε ↓ 0, by (1.15),one gets

I2n(t) ≡ max1≤j≤rn

|φnj(t)− 1| = o(1) as n →∞ (1.19)

for all t ∈ R. Further, by the inequality |eιx − 1− ιx| ≤ |x|2/2, x ∈ R,

rn∑j=1

|φnj(t)− 1| =rn∑

j=1

∣∣E exp(ιtXnj)− 1− E(ιtXnj)∣∣

≤ t2

2

rn∑j=1

EX2nj =

t2

2s2

n =t2

2(1.20)

uniformly in t ∈ R, n ≥ 1. Now fix t ∈ R. Then, by (1.19), there existsn0 ∈ N such that for all n ≥ n0, max1≤j≤rn

|φn(t) − 1| ≤ 1. Hence, bythe arguments in the proof of Lemma 11.1.3, and by the inequalities |ez| ≤∑∞

k=0 |z|k/k! = e|z|, and |ez − 1− z| =∣∣∑∞

k=2 zk/k!∣∣ ≤ |z|2 exp(|z|), z ∈ C,

for all n ≥ n0, one has

I1n(t) =∣∣∣∣

rn∏j=1

exp([φnj(t)− 1]

)−

rn∏j=1

φnj(t)∣∣∣∣

≤rn∑

j=1

∣∣ exp([φnj(t)− 1]

)− φnj(t)

∣∣ · rn−j∏k=1

∣∣ exp([φnj(t)− 1]

)∣∣≤

rn∑j=1

∣∣ exp([φnj(t)− 1]

)− φnj(t)

∣∣ · exp( rn∑

j=1

∣∣φnj(t)− 1∣∣)

≤rn∑

j=1

∣∣ exp([φnj(t)− 1]

)− φnj(t)

∣∣ · exp( t2

2

)

=rn∑

j=1

∣∣ exp([φnj(t)− 1]

)− 1− [φnj(t)− 1]

∣∣ · exp( t2

2

)

≤rn∑

j=1

∣∣φnj(t)− 1∣∣2 · exp

(1 +

t2

2

)

≤ max1≤j≤rn

|φnj(t)− 1|( rn∑

j=1

|φnj(t)− 1|)

exp(1 +

t2

2

)

Page 360: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.1 Lindeberg-Feller theorems 351

≤ I2n(t) ·[t2 · exp

(1 +

t2

2

)/2]

→ 0 as n →∞,

by (1.19) and (1.20). Hence, (1.18) holds. This completes the proof of thetheorem.

The following example is an application of the Lindeberg CLT for prov-ing asymptotic normality of the least squares estimator of a regressionparameter.

Example 11.1.2: Let

Yj = xjβ + εj , j = 1, 2, . . . (1.21)

be a simple linear regression model, where xnn≥1 is a given sequence ofreal numbers, β ∈ R is the regression parameter and εnn≥1 is a sequenceof iid random variables with Eε1 = 0 and Eε21 ≡ σ2 ∈ (0,∞). The leastsquares estimator of β based on Y1, . . . , Yn is given by

βn =n∑

j=1

xjYj/a2n, n ≥ 1,

where a2n =

∑nj=1 x2

j . Suppose that the sequence xnn≥1 satisfies

max1≤j≤n

|xj |/an → 0 as n →∞. (1.22)

Then,an(βn − β) −→d N(0, σ2). (1.23)

To prove (1.23), note that by (1.21),

an(βn − β) = an

[ n∑j=1

xjYj − a2nβ

]/a2

n

= a−1n

n∑j=1

xjεj ≡n∑

j=1

Xnj , say (1.24)

where Xnj = xjεj/an, 1 ≤ j ≤ n, n ≥ 1. Note that EXnj = 0, EX2nj <

∞ and s2n ≡

∑nj=1 EX2

nj =∑n

j=1 x2jEε2j/a2

n = σ2. Thus, Xnj : 1 ≤j ≤ nn≥1 is a triangular array of independent random variables satisfying(1.2). Next, let mn = max|xj |/an : 1 ≤ j ≤ n, n ≥ 1. Then, by (1.22),for any δ ∈ (0,∞),

s−2n

n∑j=1

EX2njI(|Xnj | > δsn

)

Page 361: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

352 11. Central Limit Theorems

= σ−2a−2n

n∑j=1

x2jEε2jI(|xjεj/an| > δσ)

≤ σ−2a−2n

n∑j=1

x2j · Eε21I(mn · |ε1| > δσ)

= σ−2Eε21I(|ε1| > δσ ·m−1

n

)→ 0 as n →∞ by the DCT.

Thus, Xnj : 1 ≤ j ≤ nn≥1 satisfies the Lindeberg condition (1.3) andhence, by Theorem 11.1.1,∑n

j=1 Xnj

σ−→d N(0, 1),

which, in view of (1.24), implies (1.23).

The next result gives a multivariate generalization of Theorem 11.1.1.

Theorem 11.1.6: (A multivariate version of the Lindeberg CLT). Foreach n ≥ 1, let Xnj : 1 ≤ j ≤ rn be a collection of independent k-dimensional random vectors satisfying

EXnj = 0, 1 ≤ j ≤ rn andrn∑

j=1

EX ′njXnj = Ik,

where Ik denotes the identity matrix of order k and for any vector x, x′

denotes its transpose. Suppose that for every ε ∈ (0,∞),

limn→∞

rn∑j=1

E‖Xnj‖2I(‖Xnj‖ > ε) = 0.

Then,rn∑

j=1

Xnj −→d N(0, Ik).

The proof is a consequence of Theorem 11.1.1 and the Cramer-Wolddevice (cf. Theorem 10.4.5) and is left as an exercise (Problem 11.17).

11.2 Stable distributions

If Xnn≥1 is a sequence of iid N(µ, σ2) random variables, then for eachk ≥ 1, Sk ≡

∑ki=1 Xi has a N(kµ, kσ2) distribution. Similarly, if Xnn≥1

Page 362: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.2 Stable distributions 353

is a sequence of iid Cauchy (µ, σ) random variables, then for each k ≥ 1,Sk ≡

∑ki=1 Xi has a Cauchy (kµ, kσ) distribution. Thus, in both cases,

for each k ≥ 1, there exist constants ak and bk such that Sk has the samedistribution as akX1 + bk (Problem 11.21).

Definition 11.2.1: A nondegenerate random variable X is called stableif the above property holds, i.e., for each k ∈ N, there exist constants ak

and bk such thatSk =d akX1 + bk, (2.1)

where X1, X2, . . . are iid random variables with the same distribution asX, and Sk =

∑ki=1 Xi. In this case, the distribution FX of X is called a

stable distribution.

There are two characterizations of stable distributions.

Theorem 11.2.1: A nondegenerate distribution F is stable iff there existsa sequence of iid random variable Ynn≥1 and constants ann≥1 andbnn≥1 such that

(∑ni=1 Yi − bk

)/ak converges in distribution to F .

Theorem 11.2.2: A nondegenerate distribution F is stable iff its charac-teristic function φ(t) admits the representation

φ(t) = exp(ιtc− b|t|α(1 + ιλsgn(t)ωα(t))

)(2.2)

where ι =√−1, −1 ≤ λ ≤ 1, 0 < α ≤ 2, 0 ≤ b < ∞, and the functions

ωα(t) and sgn(·) are defined as

ωα(t) =

tan πα2 if α = 1

2π log |t| if α = 1 (2.3)

and

sgn(t) =

⎧⎨⎩

1 if t > 0−1 if t < 00 if t = 0 .

Remark 11.2.1: When α = 2, φ(t) in (2.2) reduces to exp(ιtc−bt2), whichis the characteristic function of a normal random variable with mean c andvariance 2b.

Remark 11.2.2: When α = 1, λ = 0, φ(t) reduces to exp(ιtc−b|t|), whichis the characteristic function of a Cauchy (c, b) distribution.

Remark 11.2.3: Since |φ(t)| is integrable, the distribution F must beabsolutely continuous. Apart from the normal and Cauchy distributions,for α = 1/2, λ = 1, the density is given by

f(x) =1√2π

1x3/2 exp

(− 1

2x

), x > 0. (2.4)

Page 363: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

354 11. Central Limit Theorems

For an explicit expression for the density of F , in other cases, see Feller(1966), Section 17.6.

Definition 11.2.2: The parameter α in (2.2) is called the index of thestable distribution.

Remark 11.2.4: The parameter λ is related to the behavior of the ratioof the right tail of the distribution to the left tail through the relation

limx→∞

1− F (x)F (−x)

=1 + λ

(1− λ), (2.5)

where for λ = 1, the ratio on the right side of (2.5) is defined to be +∞.

Definition 11.2.3: A function L : (0,∞) → (0,∞) is called slowly varyingat ∞ if

limx→∞

L(cx)L(x)

= 1 for all 0 < c < ∞. (2.6)

A function f : (0,∞) → (0,∞) is called regularly varying at ∞ withindex α ∈ R, α = 0 if f(x) = xαL(x) for all x ∈ (0,∞) where L(·)is slowly varying at ∞. The functions L1(t) = log t, L2(t) = log(log t),L3(t) = (log t)2 are slowly varying at ∞ but the function L4(t) = tp is notso for p = 0.

There is a companion result to Theorem 11.2.1 giving necessary and suffi-cient conditions for convergence of normalized sums of iid random variablesto a stable distribution.

Theorem 11.2.3: Let F be a nondegenerate stable distribution with indexα, 0 < α < 2. Then in order that a sequence Ynn≥1 of iid randomvariables admits a sequence of constants ann≥1 and bnn≥1 such that∑n

i=1 Yi − bn

an−→d F, (2.7)

it is necessary and sufficient that

limx→∞

P (Y1 > x)P (|Y1| > x)

≡ θ ∈ [0, 1] (2.8)

exists andP (|Y1| > x) = x−αL(x), (2.9)

where L(·) is a slowly varying function at ∞. If (2.8) and (2.9) hold, thenthe normalizing constants ann≥1 and bnn≥1 may be chosen to satisfy

na−αn L(an) → 1 and bn = nEY1I(|Y1| ≤ an). (2.10)

Page 364: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.2 Stable distributions 355

Remark 11.2.5: The analog of Theorem 11.2.3 for the case α = 2, i.e.,for the normal distribution is the following.

Theorem 11.2.4: Let Ynn≥1 be iid random variables. In order thatthere exist constants ann≥1 and bnn≥1 such that

∑ni=1 Yi − bn

an−→d N(0, 1), (2.11)

it is necessary and sufficient that

x2P (|Y1| > x)EY 2

1 I(|Y1| ≤ x)→ 0 as x →∞. (2.12)

Remark 11.2.6: Note that condition (2.12) holds if EY 21 < ∞. However,

if P (|Y1| > x) ∼ Cx2 as x → ∞, then EY 2

1 = ∞ and the classical CLT (cf.Corollary 11.1.2) fails. However, in this case, (2.12) holds and

∑ni=1 Yi is

asymptotically normal with a suitable centering and scaling (different from√n) (Problem 11.20).

Here, only the proof of Theorem 11.2.1 will be given. Further, a proofof Theorem 11.2.3, sufficiency part, is also outlined. For the rest, see Feller(1966) or Gnedenko and Kolmogorov (1968). For proving Theorem 11.2.1,the following result is needed.

Theorem 11.2.5: (Khinchine’s theorem on convergence of types). LetWnn≥1 be a sequence of random variables such that for some sequencesαnn≥1 ⊂ [0,∞) and βnn≥1 ⊂ R, both Wn and αnWn + βn convergein distribution to nondegenerate distributions G and H on R, respectively.Then limn→∞ αn = α and limn→∞ βn = β exist with 0 < α < ∞ and−∞ < β < ∞.

Proof: Let W ′nn≥1 be a sequence of random variables such that for

each n ≥ 1, Wn and W ′n have the same distribution and Wn and W ′

n

are independent. Then Yn ≡ Wn − W ′n and Zn ≡ αn(Wn − W ′

n) bothconvergence in distribution to nondegenerate limits, say G and H. IndeedG = G∗G and H = H ∗H, where ∗ denotes convolution. This implies thatαnn≥1 cannot have 0 or ∞ as limit points. Also if 0 < α ≤ α′ < ∞ aretwo limit points of αnn≥1, then H(x) = G( x

α ) = G( xα′ ) for all x. Since

G(·) is nondegenerate, α must equal α′ and so limn→∞ αn exists in (0,∞).This implies that limn→∞ βn exists in R.

Proof of Theorem 11.2.1: The ‘only if’ part follows from the definitionof F being stable, since one can take Ynn≥1 to be iid with distributionF .

Page 365: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

356 11. Central Limit Theorems

For the ‘if part,’ let Ynn≥1 be iid random variables such that thereexists constants ann≥1 and bnn≥1 such that as n →∞∑n

i=1 Yi − bn

an−→d F.

To show that F is stable, fix an integer r ≥ 1. Let Xnn≥1 be iid randomvariables with distribution F . Then as k →∞,∑kr

i=1 Yi − bkr

akr−→d X1.

Also, the left side above equals

r−1∑j=0

(∑(j+1)ki=jk+1 Yi − bk

ak

)ak

akr+

rbk − bkr

akr= αkr

( r−1∑j=0

ηjk

)+ βkr, say.

where αkr = ak

akr, ηjk =

(∑(j+1)ki=jk+1 Yi−bk

ak

)and βkr = rbk−bkr

akr. Since ηjk :

j = 0, 1, . . . , r − 1 are independent and for each j, ηjk −→d Xj+1 ask →∞, it follows that as k →∞,

Wk =r−1∑j=0

ηjk −→dr−1∑j=0

Xj+1 =r∑

j=1

Xj .

Also, as k →∞,αkrWk + βkr −→d X1.

Since F is nondegenerate, both X1 and∑r

j=1 Xj are nondegenerate randomvariables. Thus, as k →∞, Wk and αkrWk + βkr converge in distributionto nondegenerate random variables. Thus, by Khinchine’s theorem on con-vergence types proved above, it follows that αkr → α′

r and βkr → β′r,

0 < α′r < ∞ and −∞ < β′

r < ∞. This yields that for each r ∈ N that∑rj=1 Xj has the same distribution as 1

α′r(X1 − β′

r), i.e., X1 is stable.

Proof of the sufficiency part of Theorem 11.2.3: (Outline). Theproof is based on the continuity theorem. The characteristic function ofTn ≡ Sn−bn

anis

φn(t) = E(eιt Sn−bn

an

)=(φ( t

an

)e−ιtb′

n/an

)n

≡(1 +

1n

hn(t))n

where b′n = bn/n and hn(t) = n

(φ( t

an)e−ιtb′

n/an − 1). Let G(·) be the cdf

of Y1. Then

hn(t) = n

∫ (eιt(y−b′

n)/an − 1)dG(y) =

∫ (eιtx − 1

)µn(dx)

Page 366: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.2 Stable distributions 357

where µn(A) = nP (Y1 ∈ b′n + anA), A ∈ B(R). If A = (u, v], 0 < u < v <

∞, then

nP(Y1 ∈ b′

n + anA)

= nP(anu + b′

n < Y1 ≤ anv + b′n

)= nP

(Y1 > anu + b′

n

)− nP

(Y1 > anv + b′

n

).

By (2.8)–(2.10),

nP (Y1 > anx) =(

P (Y1 > anx)P (Y1 > an)

)nP (Y1 > an) → θx−α for x > 0.

By using (2.10), it can be show that

b′n

an=

EY1I(|Y1| ≤ an)an

= O(a1−α

n L(an)an

)= o(1) as n →∞.

Hence, it follows that

nP (Y1 > anu + b′n)− nP (Y1 > anv + b′

n) → θ(u−α − v−α).

Similarly, for A = (−v,−u],

nP (Y1 ∈ b′n + anA) → (1− θ)(u−α − v−α).

This suggests that hn(t) should approach

θα

∫ ∞

0(eιtx − 1)x−(α+1)dx + (1− θ)α

∫ 0

−∞(eιtx − 1)|x|−(α+1)dx.

But there are integrability problems for |x|−(α+1) near 0 and so a morecareful analysis is needed. It can be shown that

limn→∞ hn(t) = ιtc + θα

∫ ∞

0

(eιtx − 1− ιtx

1 + x2

)x(α+1)dx

+ (1− θ)α∫ 0

−∞

(eιtx − 1− ιtx

1 + x2

)|x|−(α+1)dx

where c is a constant. The right side is continuous at t = 0 and so, theresult follows by the continuity theorem. For details, see Feller (1966).

Remark 11.2.7: By the necessity part of Theorem 11.2.3, every stabledistribution F must satisfy

1− F (x) = θx−αL(x)F (−x) = (1− θ)x−αL(x)

Page 367: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

358 11. Central Limit Theorems

for large x where 0 ≤ θ ≤ 1 and L(·) is slowing varying at∞ and 0 < α < 2.This implies that F has moments of order p such that α > p. Distributionssatisfying the above tail condition are called heavy tailed and arise in manyapplications. The Pareto distribution in economics is an example of a heavytail distribution.

Remark 11.2.8: One way to generate heavy tailed distributions is asfollows. If Y is a positive random variable such that there exist 0 < c < ∞and 0 < p < ∞ satisfying

P (Y < y) ∼ cyp as y ↓ 0,

then the random variable X = Y −q has the property

P (X > x) = P (Y < x−1/q) ∼ cx−p/q as x →∞.

If p < 2q, then X has heavy tails. Thus if Ynn≥1 are iid Gamma(1,2), thenn−1∑n

i=1 Y −1i converges in distribution to a one sided Cauchy distribution

(Problem 11.15).

Definition 11.2.4: Let F and G be two probability distributions on R.Then G is said to belong to the domain of attraction of F if there exist asequence of iid random variables Ynn≥1 with distribution G and constantsann≥1 and bnn≥1 such that∑n

i=1 Yi − bn

an−→d F.

Theorem 11.2.1 says that the only nondegenerate distributions F that ad-mit a nonempty domain of attraction are the stable distributions.

11.3 Infinitely divisible distributions

Definition 11.3.1: A random variable X (and its distribution) is calledinfinitely divisible if for each integer k ∈ N, there exist iid random variablesXk1, Xk2, . . . , Xkk such that

∑kj=1 Xkj has the same distribution as X.

Examples include constants (degenerate distributions), normal, Poisson,Cauchy, and Gamma distributions. But distributions with bounded supportcannot be infinitely divisible unless they are degenerate. In fact, if X isinfinitely divisible satisfying P (|X| ≤ M) = 1 for some M < ∞, thenthe Xki’s in the above definition must satisfy P

(|Xk1| < M

k

)= 1 and

so Var(Xk1) ≤ EX2k1 ≤ M2

k2 implying Var(X) = kVar(Xk1) ≤ M2

k for eachk ≥ 1. Hence Var(X) must be zero, and the random variable X is a constantw.p. 1.

The following results are easy to establish.

Page 368: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.3 Infinitely divisible distributions 359

Theorem 11.3.1: (a) If X and Y are independent and infinitely divisible,then X + Y is also infinitely divisible. (b) If Xn is infinitely divisible foreach n ∈ N and Xn −→d X, then X is infinitely divisible.

Proof: (a) Follows from the definition. (b) For each k ≥ 1 and n ≥ 1,there exist iid random variables Xnk1, Xnk2, . . . , Xnkk such that Xn and∑k

j=1 Xnkj have the same distribution. Now fix k ≥ 1. Then for any y > 0,(P (Xnk1 > y)

)k = P (Xnkj > y for all j = 1, ..., k) ≤ P (Xn > ky)

and similarly, (P (Xnk1 < −y)

)k ≤ P (Xn ≤ ky).

Since Xn −→d X, the distributions of Xnn≥1 are tight and so areXnk1∞

n=1. So if Fk is a weak limit point of Xnk1∞n=1 and if Ykjk

j=1 areiid with distribution Fk, then X and

∑kj=1 Ykj have the same distribution

and so X is infinitely divisible.

A large class of infinitely divisible distributions are generated by thecompound Poisson family.

Definition 11.3.2: Let Ynn≥1 be iid random variables and let N bea Poisson (λ) random variable, independent of the Ynn≥1. The randomvariable X ≡

∑Ni=1 Yi is said to have a compound Poisson distribution.

Theorem 11.3.2: A compound Poisson distribution is infinitely divisible.

Proof: Let X be a random variable as in Definition 11.3.2. For eachk ≥ 1, let Nik

i=1 be iid Poisson random variables with mean λk that are

independent of Ynn≥1. Let

Xkj =Tj+1∑

i=Tj+1

Yi, 1 ≤ j ≤ k

where T1 = 0, Tj =∑j−1

i=1 Ni, 2 ≤ j ≤ k. Then Xkjkj=1 are iid and∑k

j=1 Xkj and X are identically distributed and so X is infinitely divisible.

Although the converse to the above is not valid, it is known that everyinfinitely divisible distribution is the limit of a sequence of centered andscaled compound Poisson distributions. This is a consequence of a deep re-sult giving an explicit formula for the characteristic function of an infinitelydivisible distribution which is stated below. For a proof of this result (statedbelow), see Feller (1966) and Chung (1974) or Gnedenko and Kolmogorov(1968).

Theorem 11.3.3: (Levy-Khinchine representation theorem). Let X be aninfinitely divisible random variable. Then its characteristic function φ(t) ≡

Page 369: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

360 11. Central Limit Theorems

E(eιtX) is of the form

φ(t) = exp(

ιtc− βt2

2+∫

R

(eιtx − 1− ιtx

1 + x2

)µ(dx)

),

where c ∈ R, β > 0 and µ is a measure on (R,B(R)) such that µ(0) = 0and

∫|x|≤1 x2µ(dx) < ∞ and µ(x : |x| > 1) < ∞.

Corollary 11.3.4: Stable distributions are infinitely divisible.

Proof: The normal distribution corresponds to the case µ(·) ≡ 0 andβ > 0. For nonnormal stable laws with index α < 2, set β = 0 and µ(dx) =θx−(α+1)dx for x > 0 and (1− θ)|x|−(α+1)dx for x < 0.

Corollary 11.3.5: Every infinitely divisible distribution is the limit ofcentered and scaled compound Poisson distributions.

Proof: Since the normal distribution can be obtained as a (weak) limit ofcentered and scaled Poisson distributions, it is enough to consider the casewhen β = 0, c = 0. Let µn(A) = µ(A ∩ x : |x| > n−1), A ∈ B(R) and let

φn(t) = exp(∫ (

eιtx − 1− ιtx

1 + x2

)µn(dx)

)

= exp(

λn

[ ∫(eιtx − 1)µn(dx)− ιtcn

])

where

µn(A) = µn(A)/µn(R), A ∈ B(R), λn = µn(R), and

cn =∫

x

1 + x2 µn(dx).

Thus, φn(·) is a compound Poisson characteristic function centered at cn,with Poisson parameter λn and with the compounding distribution µn.By the DCT, φn(t) → φ(t) for each t ∈ R. Hence by the Levy-Cramercontinuity theorem, the result follows.

Another characterization of infinitely divisible distributions is similar tothat of stable distributions. Recall that a stable distribution is one that isthe limit of normalized sums of iid random variables and conversely.

Theorem 11.3.6: A random variable X is infinitely divisible iff it is thelimit in distribution of a sequence Xnn≥1 where for each n, Xn is thesum of n iid random variables Xnjn

j=1.

Thus X is infinitely divisible iff it is the limit in distribution of the rowsums of a triangular array of random variables where in each row, all therandom variables are iid.

Page 370: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.4 Refinements and extensions of the CLT 361

Proof: The ‘only if’ part follows from the definition. For the ‘if’ part, fixk ≥ 1. Then Xk·n can be written as

Xk·n =k∑

j=1

Yjn,

where Yjn =∑jn

r=(j−1)n+1 Xk·n,r j = 1, 2, . . . , k. By hypothesis, Xk·n −→d

X. Now, Yjnkj=1 are iid and it can be shown, as in the proof of Theorem

11.3.1, that for each i = 1, . . . , k, Yin∞n=1 are tight and hence, converges in

distribution to a limit Yi through a subsequence, and that X and∑k

i=1 Yi

have the same distribution. Thus, X is infinitely divisible.

11.4 Refinements and extensions of the CLT

This section is devoted to studying some refinements and generalizationsof the basic CLT results, such as the rate of convergence in the CLT, Edge-worth expansions and large deviations for sums of iid random variables,and also a generalization of the basic CLT to a functional version.

11.4.1 The Berry-Esseen theoremLet X1, X2, . . . be a sequence of iid random variables with EX1 = µ andVar(X1) = σ2 ∈ (0,∞). Then, Corollary 11.1.2 and Polya’s theorem implythat

∆n ≡ supx∈R

∣∣∣∣P(Sn − nµ

σ√

n≤ x

)− Φ(x)

∣∣∣∣→ 0 as n →∞, (4.1)

where Sn = X1 + · · · + Xn, n ≥ 1, and Φ(·) is the cdf of the N(0, 1)distribution. A natural question that arises in this context is “how fast does∆n go to zero?” Berry (1941) and Esseen (1942) independently proved that∆n = O(n−1/2) as n → ∞, provided E|X1|3 < ∞. This result is referredto as the Berry-Esseen theorem.

Theorem 11.4.1: (The Berry-Esseen theorem). Let Xnn≥1 be a se-quence of iid random variables with EX1 = µ, Var(X1) = σ2 ∈ (0,∞) andE|X1|3 < ∞. Then, for all n ≥ 1,

∆n ≡ supx∈R

∣∣∣∣P(Sn − nµ

σ√

n≤ x

)− Φ(x)

∣∣∣∣ ≤ C · E|X1 − µ|3σ3√

n(4.2)

where C ∈ (0,∞) is a constant.

The value of the constant C ∈ (0,∞) does not depend on n and onany characteristics of the distribution of X1. Indeed, the proof of Theorem

11.4.1 below shows that C ≤√

2π ·[

52 + 12

π

]< 5.05.

Page 371: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

362 11. Central Limit Theorems

The following result plays an important role in the proof of Theorem11.4.1.

Lemma 11.4.2: (A smoothing inequality). Let F be a cdf on R with∫xdF (x) = 0 and characteristic function ζ(t) =

∫exp(ιtx)dF (x), t ∈ R.

Let G : R → R be a differentiable function with derivative g such thatlim|x|→∞

(F (x) − G(x)

)= 0. Suppose that

∫(1 + |x|)|g(x)|dx < ∞,∫∞

−∞ xrg(x)dx = 0 for r = 0, 1 and |g(x)| ≤ C0 for all x ∈ R, for someC0 ∈ (0,∞). Then, for any T ∈ (0,∞),

supx∈R

∣∣∣F (x)−G(x)∣∣∣ ≤ 1

π

∫ T

−T

|ζ(t)− ξ(t)||t| dt +

24C0

πT(4.3)

where ξ(t) =∫∞

−∞ exp(ιtx)g(x)dx, t ∈ R.

For a proof of Lemma 11.4.2, see Feller (1966).

The next lemma deals with an expansion of the logarithm of the charac-teristic function of X, in a neighborhood of zero. Let z = reiθ, r ∈ (0,∞),θ ∈ [0, 2π) be the polar representation of a nonzero complex number z.Then, the (principal branch of the) complex logarithm of z is defined as

log z = log r + iθ. (4.4)

The function log z is infinitely differentiable on the set z ∈ C : z = reiθ, r ∈(0,∞), 0 ≤ θ < 2π and has a convergent Taylor’s series expansion around1 on the unit disc:

log(1 + z) =∞∑

k=1

zk/k for |z| < 1. (4.5)

Lemma 11.4.3: Let Y be a random variable with EY = 0, σ2 = EY 2 ∈(0,∞), ρ = E|Y |3 < ∞ and characteristic function φY (t) = E exp(ιtY ),t ∈ R. Then, for all t ∈

[− 1

σ , 1σ

],

∣∣∣∣ log φY (t) +t2σ2

2

∣∣∣∣ ≤ 512|t|3ρ (4.6)

and ∣∣∣∣ log φY (t)−[ (ιt)2

2!σ2 +

(ιt)3

3!EY 3

]∣∣∣∣≤ E

(min

|tY |3

3,(tY )4

24

)+

t4σ4

4. (4.7)

Page 372: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.4 Refinements and extensions of the CLT 363

Proof: Note that by Lemma 10.1.5,

∣∣φY (t)− 1∣∣ = ∣∣E( exp(ιtY )− 1− ιtY

)∣∣ ≤ t2EY 2

2≤ 1

2(4.8)

whenever |t| ≤ σ−1. In particular, log φY (t) is well defined for all t ∈[− σ−1, σ−1

].

By (4.5), (4.8), and Lemma 10.1.5, for |t| ≤ σ−1,∣∣∣∣ log φY (t) +t2σ2

2

∣∣∣∣=

∣∣∣∣ log[1 +

(φY (t)− 1

)]+

t2σ2

2

∣∣∣∣≤

∣∣∣∣φY (t)−[1− t2σ2

2

]∣∣∣∣+∞∑

k=2

∣∣φY (t)− 1∣∣k/k

≤ E

∣∣∣∣(tY )2 ∧ |tY |33!

∣∣∣∣+ 12

( t2σ2

2

)2 ∞∑k=2

(12

)k−2

≤ |t|3ρ6

+t4σ4

4.

Now using the bounds |tσ| ≤ 1 and σ3 = (EY 2)3/2 ≤ E|Y |3 = ρ, onegets (4.6). The proof of (4.7) is similar and hence, it is left as an exercise(Problem 11.27).

Proof of Theorem 11.4.1: W.l.o.g., set µ = 0 and σ = 1. Then,X1, X2, . . . are iid zero mean, unit variance random variables. Let X =d X1,ρ = E|X|3 and φX(·) denote the characteristic function of X. It is easy tocheck that the conditions of Lemma 11.4.2 hold with F (x) = P

(Sn√

n≤ x

),

G(x) = Φ(x), x ∈ R, and C0 = 1√2π

. Hence, by Lemma 11.4.2, withT =

√n/ρ,

∆n ≤1π

∫ T

−T

∣∣φnX( t√

n)− e−t2/2

∣∣|t| dt +

24ρ

π√

2πn. (4.9)

By Lemma 11.4.3 (with Y = X1−µσ and t replaced by t√

n),

rn(t) ≡∣∣∣n log φX

( t√n

)+

t2

2

∣∣∣= n

∣∣∣ log φX

( t√n

)+( t√

n

)2 σ2

2

∣∣∣≤ 5

12· ρ|t|3√

n(4.10)

for all |t| ≤√

n, n ≥ 1.

Page 373: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

364 11. Central Limit Theorems

Since ρ = E|X1|3 ≥ (EX21 )3/2 = σ3 = 1, |T | ≤

√n. Hence, using the

inequality |ez − 1| ≤ |z|e|z| for all z ∈ C and (4.10), one gets∣∣∣φnX

( t√n

)− e−t2/2

∣∣∣=

∣∣∣ exp(n · log φX

( t√n

)+

t2

2

)− 1∣∣∣ · exp

(− t2

2

)

≤ |rn(t)| exp(|rn(t)|

)· exp

(− t2

2

)≤ 5ρ

12√

n|t|3 exp

(− t2

2

[1− 5ρ|t|

6√

n

])

≤ 5ρ

12√

n|t|3 exp

(− t2

12

)(4.11)

for all ρ|t|√n≤ 1, i.e., for all |t| ≤ T , n ≥ 1. Since

∫∞−∞ t2 exp

(− t2

12

)dt = 6

√2π,

the theorem follows from (4.9) and (4.11) with C =√

[ 52 + 12

π

].

A striking feature of Theorem 11.4.1 is that the upper bound on ∆n in(4.2) is valid for all n ≥ 1. Also, under the conditions of Theorem 11.4.1,the rate O( 1√

n) in (4.2) is the best possible in the sense that there exist

random variables for which ∆n is bounded below by a constant multiple of1√n

(cf. Problem 11.29). Edgeworth expansions of the cdf of Sn−nµσ

√n

, to bedeveloped in the next section, can be used to show that for certain randomvariables X1 satisfying additional moment and symmetry conditions, ∆n

may go to zero at a faster rate. (For example, consider X1 ∼ N(µ, σ2).)For iid sequences Xnn≥1 with E|X1|2+δ < ∞ for some δ ∈ (0, 1],

Theorem 11.4.1 can be strengthened to show that ∆n decreases at the rateO(n−δ/2) as n →∞ (cf. Chow and Teicher (1997), Chapter 9).

11.4.2 Edgeworth expansionsRecall from Chapter 10 that a random variable X1 is called lattice if thereexist a ∈ R and h ∈ (0,∞) such that

P(X1 ∈ a + ih : i ∈ Z

)= 1. (4.12)

The largest h satisfying (4.12) is called the span of (the distribution of)X1. A random variable X1 is called nonlattice if it is not a lattice randomvariable. From Proposition 10.1.1, it follows that X1 is nonlattice iff∣∣E exp(ιtX1)

∣∣ < 1 for all t = 0. (4.13)

The next result gives an Edgeworth expansion for the cdf of Sn−nµσ

√n

with

an error of order o(n−1/2) for nonlattice random variables.

Page 374: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.4 Refinements and extensions of the CLT 365

Theorem 11.4.4: Let Xnn≥1 be a sequence of iid random variables withEX1 = µ, Var(X1) = σ2 ∈ (0,∞) and E|X1|3 < ∞. Suppose, in addition,that X1 is nonlattice, i.e., it satisfies (4.13). Then,

supx∈R

∣∣∣∣P(Sn − nµ

σ√

n≤ x

)−[Φ(x)− 1√

n· µ3

6σ3 (x2 − 1)φ(x)]∣∣∣∣

= o(n−1/2) as n →∞, (4.14)

where φ(x) = 1√2π

e−x2/2, x ∈ R and µ3 = E(X1 − µ)3.

The function

en,2(x) ≡ Φ(x)− 1√n· µ3

6σ3 (x2 − 1), x ∈ R (4.15)

is called a second order Edgeworth expansion for Tn ≡ Sn−nµσ

√n

. The abovetheorem shows that the cdf of the normalized sum Tn can be approximatedby the second order Edgeworth expansion with accuracy o(n−1/2). It canbe shown that if E|X1|4 < ∞ and X1 satisfies Cramer’s condition:

lim sup|t|→∞

∣∣E exp(ιtX1)∣∣ < 1, (4.16)

then the bound on the right side of (4.14) can be improved to O(n−1). Notethat for a symmetric random variable X1, having a finite fourth momentand satisfying (4.16), the second term in en,2(x) is zero and the rate ofnormal approximation becomes O(n−1). Higher order Edgeworth expan-sions for Tn can be derived using (4.16) and arguments similar to those inthe proof of Theorem 11.4.4, but the form of the expansion becomes morecomplicated. See Petrov (1975), Bhattacharya and Rao (1986), and Hall(1992) for detailed accounts of the Edgeworth expansion theory.

Proof of Theorem 11.4.4: W.l.o.g., let µ = 0 and σ = 1. In Lemma11.4.2, take F (x) = P (Tn ≤ x), and G(x) = en,2(x), x ∈ R. Then, it is easyto verify that the conditions of Lemma 11.4.2 hold with g(x) = gn(x) ≡φ(x) + µ3

6√

n(x3 − 3x)φ(x), x ∈ R. Using repeated differentiation on both

sides of the identity (inversion formula):

e−x2/2√

2π=

12π

∫ ∞

−∞e−ιtx · e−t2/2dt, x ∈ R,

one can show that

−(x3 − 3x)e−x2/2√

2π=

d3

dx3

(e−x2/2√

)=

12π

∫ ∞

−∞e−ιtx(−ιt)3e−t2/2dt,

x ∈ R. Hence,

ξn(t) ≡ ξ(t) =∫

eιtxgn(x)dx = e−t2/2[1 +

µ3

6√

n(ιt)3

], t ∈ R. (4.17)

Page 375: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

366 11. Central Limit Theorems

Next, let ε ∈ (0, 1) be given. Then, set T = c√

n where

cε ≡ c = 24 · sup(

1 +|µ3|6|x3 − 3x|

)φ(x) : x ∈ R

/ε.

Then, by Lemma 11.4.2 and (4.17),

∆n,2 ≡ supx∈R

∣∣∣∣P(Sn − nµ

σ√

n≤ x

)− en,2(x)

∣∣∣∣≤ 1

π

∫ c√

n

−c√

n

∣∣∣φnX

(t√n

)− ξ(t)

∣∣∣|t| dt +

ε√n

, (4.18)

where φX(t) = E exp(ιtX), t ∈ R and X =d X1. Let ρ = E|X1|3. Let M ∈(1,∞) be such that E|X1|3I(|X1| > M) ≤ ε/2. Then, setting δ = ε

2Mρ , it

follows that E|X1|4 |t|√nI(|X1| ≤ M) ≤ MδE|X1|3 ≤ ε/2 for all |t| ≤ δ

√n.

Hence, for all |t| ≤ δ√

n, by (4.7) of Lemma 11.4.3,

rn,2(t) ≡ n

∣∣∣∣ log φX

( t√n

)−[ (ιt)2

2n+

µ3

6

( ιt√n

)3]∣∣∣∣≤ n ·

[∣∣∣ t√n

∣∣∣3 E(

min |X1|3

3,|X1|424

|t|√n

)+

t4

4n2

]

≤ |t|33√

n

[E(|X1|3I(|X1| > M)

)+ E

(|X1|4

|t|√n

I(|X1| ≤ M))

+|t|√n

34

]

≤ |t|3ε√n

. (4.19)

Also, note that for any complex numbers z, w,

|ez − 1− w| ≤ |ez − ew|+ |ew − 1− w|

≤[|z − w|+ 1

2|w|2

]exp

(|z| ∨ |w|

). (4.20)

Hence, by (4.10), (4.19), and (4.20), it follows that for all |t| ≤ δ√

n,

∣∣∣φnX

( t√n

)− ξn(t)

∣∣∣=

∣∣∣∣ exp(n log φX

( t√n

)+

t2

2

)− 1− µ3

6√

n(ιt)3

∣∣∣∣e−t2/2

≤[rn,2(t) +

12

∣∣∣ µ3

6√

n(ιt)3

∣∣∣2] exp(rn(t) ∨

∣∣∣ µ3

6√

nt3∣∣∣)e−t2/2.

Page 376: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

11.4 Refinements and extensions of the CLT 367

Since rn(t) ∨ |µ3t|36√

n≤ 5

12 |t|2 for all |t| ≤ δ√

n,

∫ δ√

n

−δ√

n

∣∣φnX( t√

n)− ξn(t)

∣∣|t| dt

≤∫ δ

√n

−δ√

n

[ ε√n· t2 +

µ23

72n· |t|5

]exp

(− t2

12

)dt

≤ C1 ·ε√n

(4.21)

for some C1 ∈ (0,∞). By (4.13),

sup∣∣∣φX

( t√n

)∣∣∣n : δ√

n < |t| < c√

n≤ θn

for some θ ∈ (0, 1). Hence,∫δ√

n<|t|<c√

n

∣∣φnX

(t√n

)− ξn(t)

∣∣|t| dt

≤ 2θn log(c/δ) +1

δ√

n

∫|t|>δ

√n

e−t2/2(1 +

ρ√n|t|3)dt,

= O(θn1 ) as n →∞ (4.22)

for some θ1 = θ1(δ) ∈ (0, 1). Since ε > 0 is arbitrary, the result follows from(4.18), (4.21), and (4.22).

This section concludes with an analog of Theorem 11.4.4 for lattice random variables. Note that for X_1 satisfying P(X_1 ∈ {a + jh : j ∈ Z}) = 1, the normalized sample sum T_n takes values in the lattice $\{\frac{na - n\mu}{\sigma\sqrt{n}} + \frac{jh}{\sigma\sqrt{n}} : j \in \mathbb{Z}\}$. Hence, the cdf of T_n is a step function. The second order Edgeworth expansion for T_n is no longer a smooth function; the effect of the jumps of the cdf of T_n is now accounted for by adding a discontinuous function to the expansion e_{n,2}. Let
$$ Q(x) = x - \lfloor x\rfloor - \tfrac{1}{2}, \quad x \in \mathbb{R}, \qquad (4.23) $$
where ⌊x⌋ denotes the largest integer not exceeding x. It is easy to check that Q(x) is a periodic function of period 1 with values in [−1/2, 1/2], Q(x) is right continuous, and it has jumps of size 1 at the integer values. The second order Edgeworth expansion for T_n in the lattice case involves the function Q, as shown below.

Theorem 11.4.5: Let {X_n}_{n≥1} be a sequence of iid lattice random variables with span h > 0. Suppose that E|X_1|³ < ∞ and σ² = Var(X_1) ∈ (0,∞). Then,
$$ \sup_{x\in\mathbb{R}}\Big|P\Big(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le x\Big) - e_{n,2}(x)\Big| = o(n^{-1/2}) \quad\text{as } n\to\infty, \qquad (4.24) $$
where µ = EX_1, S_n = X_1 + ··· + X_n, n ≥ 1,
$$ e_{n,2}(x) = \Phi(x) - \frac{1}{\sqrt{n}}\Big[\frac{\mu_3}{6\sigma^3}(x^2 - 1) + \frac{h}{\sigma}\, Q\Big(\frac{\sqrt{n}\,x\sigma - n x_0}{h}\Big)\Big]\phi(x), \quad x \in \mathbb{R}, \qquad (4.25) $$
Q(x) is as in (4.23), φ(x) = (2π)^{-1/2} exp(−x²/2), x ∈ R, and x_0 is a real number satisfying P(X_1 = µ + x_0) > 0.

For a proof of Theorem 11.4.5 and for related results, see Esseen (1945) and Bhattacharya and Rao (1986).

11.4.3 Large deviations

Let X_1, X_2, . . . be iid random variables with EX_1 = µ. The SLLN implies that for any x > µ,
$$ P(\bar X_n > x) \to 0 \quad\text{as } n\to\infty, \qquad (4.26) $$
where $\bar X_n = n^{-1}\sum_{i=1}^n X_i$, n ≥ 1. If the X_i's were distributed as N(µ, σ²) for some µ ∈ R and σ² ∈ (0,∞), then the left side of (4.26) would equal $1 - \Phi\big(\frac{\sqrt{n}(x-\mu)}{\sigma}\big)$. Now note that
$$ -\log\Big[1 - \Phi\Big(\frac{\sqrt{n}(x-\mu)}{\sigma}\Big)\Big] \sim -\log\Big[\exp\Big(-\frac{n c_1^2}{2}\Big)\Big/\big(\sqrt{n}\, c_1 \sqrt{2\pi}\big)\Big] \sim \frac{n c_1^2}{2}, $$
where c_1 = (x − µ)/σ ∈ (0,∞). Large deviation bounds on the probability P(X̄_n > x) assert that a similar behavior holds, on the (negative) logarithmic scale, for many distributions other than the normal.

The main result of this section is the following.

Theorem 11.4.6: Let {X_n}_{n≥1} be a sequence of iid nondegenerate random variables with
$$ \phi(t) \equiv E e^{tX_1} < \infty \quad\text{for all } t > 0. \qquad (4.27) $$
Let µ = EX_1. Then, for all x ∈ (µ, θ),
$$ \lim_{n\to\infty} n^{-1}\log P(\bar X_n \ge x) = -\gamma(x), \qquad (4.28) $$
where
$$ \gamma(x) = \sup_{t>0}\{tx - \log\phi(t)\} \quad\text{and}\quad \theta = \sup\{x \in \mathbb{R} : P(X_1 \le x) < 1\}. \qquad (4.29) $$
Note that under (4.27), EX_1^+ < ∞ and hence µ ≡ EX_1 is well defined, with µ ∈ [−∞, ∞). For proving the theorem, the following results are needed.

Lemma 11.4.7: Let X_1 be a nondegenerate random variable satisfying (4.27). Let µ = EX_1 and let γ(x), θ be as in (4.29). Then,

(i) the function φ(t) is infinitely differentiable on (0,∞), with φ^{(r)}(t), the rth derivative of φ(t), given by
$$ \phi^{(r)}(t) = E\big(X_1^r\, e^{tX_1}\big), \quad t \in (0,\infty),\ r \in \mathbb{N}, \qquad (4.30) $$
(ii) $\lim_{t\downarrow 0}\phi(t) = 1$, $\lim_{t\downarrow 0}\phi^{(1)}(t) = \mu$, and $\lim_{t\to\infty}\phi^{(1)}(t)/\phi(t) = \theta$,

(iii) for every x ∈ (µ, θ), there exists a unique solution a_x ∈ R to the equation
$$ x = \phi'(a_x)/\phi(a_x) \qquad (4.31) $$
such that γ(x) = x a_x − log φ(a_x).

Proof: Let F denote the cdf of X_1. (i) Note that for any h ≠ 0,
$$ h^{-1}\big[\phi(t+h) - \phi(t)\big] = \int_{-\infty}^{\infty} \frac{e^{hx} - 1}{h}\, e^{tx}\, dF(x). $$
As h → 0, the integrand converges to $x e^{tx}$ for all x, t. Also, $\big|\frac{e^{hx}-1}{h}\big| \le \sum_{k\ge 1} |h^{k-1} x^k|/k! \le |x|\, e^{|hx|}$ for all h, x. Hence, for any x ∈ R, t ∈ (0,∞), and 0 < |h| < t/2, the integrand is bounded above by
$$ |x|\, e^{|hx|}\, e^{tx} = |x|\, e^{(t-|h|)x} I_{(-\infty,0)}(x) + |x|\, e^{(t+|h|)x} I(x > 0) \le |x|\, e^{-t|x|/2} I_{(-\infty,0)}(x) + |x|\, e^{3tx} I_{(0,\infty)}(x) \equiv g(x), \text{ say}. \qquad (4.32) $$
Since $\int g(x)\, dF(x) < \infty$, by the DCT it follows that
$$ \lim_{h\to 0}\frac{\phi(t+h) - \phi(t)}{h} \ \text{exists and equals}\ \int x\, e^{tx}\, dF(x) $$
for all t ∈ (0,∞). Thus, φ(t) is differentiable on (0,∞) with φ^{(1)}(t) = E X_1 e^{tX_1}, t ∈ (0,∞). Now, using induction and similar arguments, one can complete the proof of part (i) (Problem 11.34).

Next consider (ii). Since $e^{tx} \le I_{(-\infty,0]}(x) + e^{x} I_{(0,\infty)}(x)$ for all x ∈ R, t ∈ (0, 1), by the DCT, the first relation follows. For the second, note that
$$ |x|\, e^{tx} I_{(-\infty,0)}(x) \uparrow |x|\, I_{(-\infty,0]}(x) \quad\text{as } t \downarrow 0 \qquad (4.33) $$
and $|x|\, e^{tx} \le |x|\, e^{x}$ for all 0 < t ≤ 1, x > 0. Hence, applying the MCT for x ∈ (−∞, 0] and the DCT for x ∈ (0,∞), one obtains the second limit. Derivation of the third limit is left as an exercise (Problem 11.35).


To prove part (iii), fix x ∈ (µ, θ) and let γ_x(t) = tx − log φ(t), t ≥ 0, so that γ(x) = sup_{t>0} γ_x(t). Then, for t ∈ (0,∞),
$$ \gamma_x^{(1)}(t) = x - \frac{\phi^{(1)}(t)}{\phi(t)}, \qquad \gamma_x^{(2)}(t) = -\Big[\frac{\phi^{(2)}(t)}{\phi(t)} - \Big(\frac{\phi^{(1)}(t)}{\phi(t)}\Big)^2\Big] = -\mathrm{Var}(Y_t), \qquad (4.34) $$
where Y_t is a random variable with cdf $P(Y_t \le y) = \int_{-\infty}^{y} e^{tu}\, dF(u)\big/\phi(t)$, y ∈ R. Since X_1 is nondegenerate, so is Y_t (for any t > 0) and hence Var(Y_t) > 0. As a consequence, γ_x(·) is strictly concave on (0,∞), and its supremum is attained at a solution of the equation $\gamma_x^{(1)}(t) = 0$, i.e., at t = a_x satisfying (4.31), so that γ(x) = x a_x − log φ(a_x). That such a solution exists and is unique follows from part (ii) and the facts that x > µ, $\lim_{t\downarrow 0}\phi^{(1)}(t)/\phi(t) = \mu$ (by (ii)), and that $\phi^{(1)}(t)/\phi(t)$ is continuous and strictly increasing on (0,∞) (since for any t ∈ (0,∞), the derivative of $\phi^{(1)}(t)/\phi(t)$ equals Var(Y_t), which is positive). This proves part (iii).

Lemma 11.4.8: Let {X_n}_{n≥1} be as in Theorem 11.4.6. For t ∈ (0,∞), let {Y_{t,n}}_{n≥1} be a sequence of iid random variables with cdf
$$ P(Y_{t,1} \le y) = \int_{-\infty}^{y} e^{tu}\, dF(u)\big/\phi(t), \quad y \in \mathbb{R}, $$
where F is the cdf of X_1. Let ν_n and λ_n denote the probability distributions of S_n ≡ X_1 + ··· + X_n and T_{n,t} = Y_{t,1} + ··· + Y_{t,n}, n ≥ 1, respectively. Then, for each n ≥ 1,
$$ \nu_n \ll \lambda_n \quad\text{and}\quad \frac{d\nu_n}{d\lambda_n}(x) = e^{-tx}\phi(t)^n, \quad x \in \mathbb{R}. \qquad (4.35) $$
Proof: The proof is by induction. Clearly, the assertion holds for n = 1. Next, suppose that (4.35) is true for some r ∈ N and let n = r + 1. Then, for any A ∈ B(R),
$$ \begin{aligned} \nu_n(A) &= P\big(X_1 + \cdots + X_n \in A\big) = \int_{-\infty}^{\infty} P\big(X_1 + \cdots + X_{n-1} \in A - x\big)\, dF(x) \\ &= \int_{-\infty}^{\infty}\int_{A-x} \Big[\frac{d\nu_{n-1}}{d\lambda_{n-1}}(u)\Big]\, d\lambda_{n-1}(u)\, dF(x) = \int_{-\infty}^{\infty}\int_{A-x} e^{-tu}\phi(t)^{n-1}\, d\lambda_{n-1}(u)\, dF(x) \\ &= [\phi(t)]^n \int_{-\infty}^{\infty}\int_{A-x} e^{-t(u+x)}\, d\lambda_{n-1}(u)\, d\lambda_1(x) = [\phi(t)]^n \int_A e^{-tu}\,\big(\lambda_{n-1} * \lambda_1\big)(du), \end{aligned} $$


where ∗ denotes convolution. Since λn−1 ∗ λ1 = λn, the result follows.

Proof of Theorem 11.4.6: Fix x ∈ (µ, θ). Note that by Markov's inequality, for any t > 0, n ≥ 1,
$$ P(\bar X_n \ge x) = P\big(e^{t\bar X_n} \ge e^{tx}\big) \le e^{-tx}\, E\big(e^{t\bar X_n}\big) = \exp\big(-tx + n\log\phi(t/n)\big). $$
Hence,
$$ n^{-1}\log P\big(\bar X_n \ge x\big) \le -x\cdot\frac{t}{n} + \log\phi\Big(\frac{t}{n}\Big) \quad\text{for all } t > 0,\ n \ge 1 $$
$$ \Rightarrow\quad \limsup_{n\to\infty} n^{-1}\log P\big(\bar X_n \ge x\big) \le \inf_{t>0}\big\{-xt + \log\phi(t)\big\} = -\gamma(x). \qquad (4.36) $$
This yields the upper bound. Next it will be shown that
$$ \liminf_{n\to\infty} n^{-1}\log P\big(\bar X_n \ge x\big) \ge -\gamma(x). \qquad (4.37) $$
To that end, let {Y_{t,n}}_{n≥1}, ν_n, and λ_n be as in Lemma 11.4.8. Also, let a_x be as in (4.31). Then, for any y > x, t ∈ (a_x, ∞), and n ≥ 1, by Lemma 11.4.8,
$$ P\big(\bar X_n \ge x\big) = \nu_n\big([nx,\infty)\big) = \int_{[nx,\infty)} e^{-tu}\phi(t)^n\, \lambda_n(du) \ge \int_{[nx,ny]} e^{-tu}\phi(t)^n\, \lambda_n(du) \ge \phi(t)^n e^{-tny}\, \lambda_n\big([nx, ny]\big). \qquad (4.38) $$
Note that $EY_{t,1} = \int u\, d\lambda_1(u) = \phi^{(1)}(t)/\phi(t)$. Since φ^{(1)}(·)/φ(·) is strictly increasing and continuous on (0,∞), given y > x, there exists a t = t_y ∈ (a_x, ∞) such that
$$ y > \frac{\phi^{(1)}(t)}{\phi(t)} > \frac{\phi^{(1)}(a_x)}{\phi(a_x)} = x. \qquad (4.39) $$
By the WLLN, for any y > x and t satisfying (4.39),
$$ \lambda_n\big([nx, ny]\big) = P\Big(x \le \frac{Y_{t,1} + \cdots + Y_{t,n}}{n} \le y\Big) \to 1 \quad\text{as } n\to\infty. $$
Hence, from (4.38), it follows that
$$ \liminf_{n\to\infty} n^{-1}\log P\big(\bar X_n \ge x\big) \ge -ty + \log\phi(t) $$


for all y > x and all t ∈ (a_x, ∞) satisfying (4.39). Now, letting t ↓ a_x first and then y ↓ x, one gets (4.37). This completes the proof of Theorem 11.4.6.

Remark 11.4.1: If (4.27) holds and θ < ∞, then
$$ P\big(\bar X_n \ge \theta\big) = \big[P(X_1 = \theta)\big]^n, \quad\text{so that}\quad \lim_{n\to\infty} n^{-1}\log P\big(\bar X_n \ge \theta\big) = \log P(X_1 = \theta). $$
In this case, (4.28) holds for x = θ with γ(θ) = −log P(X_1 = θ). For x > θ, (4.28) holds with γ(x) = +∞.

Remark 11.4.2: Suppose that there exists a t_0 ∈ (0,∞) such that, instead of (4.27), the following condition holds:
$$ \phi(t) \begin{cases} = +\infty & \text{for all } t > t_0, \\ < \infty & \text{for all } t \in (0, t_0), \end{cases} $$
and φ'(t)/φ(t) increases to a finite limit θ_0 as t ↑ t_0. Then θ must be +∞. In this case, it can be shown that (4.28) holds for all x ∈ (µ, θ_0) (with the given definition of γ(x)) and that (4.28) holds for all x ∈ [θ_0, ∞) with γ(x) ≡ t_0 x − log φ(t_0). See Theorem 9.6, Chapter 1, Durrett (2004).
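As a quick numerical check of Theorem 11.4.6 (ours, not from the text), the rate function γ(x) can be computed in closed form for Bernoulli(p) variables, γ(x) = x log(x/p) + (1 − x) log((1 − x)/(1 − p)) for x ∈ (p, 1), and compared with −n^{-1} log P(X̄_n ≥ x); the parameter values below are illustrative.

```python
# Sketch: Cramer's rate function vs. exact binomial tail probabilities.
import numpy as np
from scipy.stats import binom

p, x = 0.3, 0.5
gamma_x = x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

for n in (50, 200, 1000, 5000):
    k = int(np.ceil(n * x))
    tail = binom.sf(k - 1, n, p)          # P(S_n >= k) = P(Xbar_n >= x)
    print(n, -np.log(tail) / n, "  vs  gamma(x) =", gamma_x)
```

The normalized log-probabilities approach γ(x) slowly (the error is of order (log n)/n), which is consistent with the logarithmic-scale statement of (4.28).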

11.4.4 The functional central limit theorem

Let {X_i}_{i≥1} be iid random variables with EX_1 = 0, EX_1² = 1. Let S_0 = 0, S_n = ∑_{i=1}^n X_i, n ≥ 1. The central limit theorem says that as n → ∞,
$$ W_n \equiv \frac{S_n}{\sqrt{n}} \longrightarrow^d N(0, 1). $$
Now let
$$ W_n\Big(\frac{j}{n}\Big) = \frac{1}{\sqrt{n}}\, S_j, \quad j = 0, 1, 2, \ldots, n, \qquad (4.40) $$
and for any j/n ≤ t < (j+1)/n, j = 0, 1, 2, . . . , n − 1, let
$$ W_n(t) = W_n\Big(\frac{j}{n}\Big) + \frac{\big(t - \frac{j}{n}\big)\big(W_n\big(\frac{j+1}{n}\big) - W_n\big(\frac{j}{n}\big)\big)}{\frac{1}{n}} \qquad (4.41) $$
be the function obtained by linear interpolation of {W_n(j/n) : 0 ≤ j ≤ n} on [0, 1]. Then W_n(·) is a random element of the metric space S ≡ C[0, 1] of all real valued continuous functions on [0, 1] with the supremum metric ρ(f, g) ≡ sup{|f(t) − g(t)| : 0 ≤ t ≤ 1}. Let µ_n(·) be the probability distribution induced on C[0, 1] by W_n(·).
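Before turning to the limit, here is a short simulation sketch of the construction (4.40)-(4.41) (ours; the step distribution and names are illustrative): a random-walk path is rescaled and linearly interpolated to produce an element of C[0, 1].

```python
# Sketch: build the interpolated process W_n(.) from one simulated random walk.
import numpy as np

def wn_path(steps):
    """Return a function t -> W_n(t) obtained by linear interpolation of S_j/sqrt(n)."""
    n = len(steps)
    s = np.concatenate(([0.0], np.cumsum(steps)))      # S_0, ..., S_n
    knots_t = np.arange(n + 1) / n                      # j / n
    knots_w = s / np.sqrt(n)                            # W_n(j/n)
    return lambda t: np.interp(t, knots_t, knots_w)     # piecewise linear in t

rng = np.random.default_rng(0)
W = wn_path(rng.standard_normal(1000))
t = np.linspace(0.0, 1.0, 5)
print(np.round(W(t), 3))    # values of the interpolated path at a few time points
```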


By an application of the multivariate CLT it can be shown that for any k ∈ N and any 0 ≤ t_1 ≤ t_2 ≤ ··· ≤ t_k ≤ 1, the joint distribution of (W_n(t_1), W_n(t_2), . . . , W_n(t_k)) converges to a k-variate normal distribution with mean vector (0, 0, . . . , 0) and covariance matrix Σ ≡ ((σ_{ij})), where σ_{ij} = t_i ∧ t_j. It turns out that a C[0, 1]-valued random variable W(·), called the standard Brownian motion on [0, 1] (SBM[0, 1]), can be defined such that for any k ≥ 1 and any 0 ≤ t_1 ≤ t_2 ≤ ··· ≤ t_k ≤ 1, (W(t_1), . . . , W(t_k)) has a k-variate normal distribution with mean vector (0, 0, . . . , 0) and covariance matrix Σ as above (see Chapter 15). Thus,
$$ \big(W_n(t_1), \ldots, W_n(t_k)\big) \longrightarrow^d \big(W(t_1), \ldots, W(t_k)\big). $$
If µ(·) is the probability distribution of W(·) on C[0, 1], then the above suggests that µ_n →_d µ. This is indeed true and is known as a functional central limit theorem. See Billingsley (1968, 1995) for more details.

Theorem 11.4.9: (Functional central limit theorem). Let {X_i}_{i≥1} be iid random variables with EX_1 = 0, EX_1² = 1. Let S_0 = 0, S_n = ∑_{i=1}^n X_i, n ≥ 1, and for all n ≥ 1, let W_n(·) be the C[0, 1] random element obtained by linear interpolation of W_n(j/n) ≡ S_j/√n, j = 0, 1, 2, . . . , n, as defined in (4.41). Then

(i) there exists a C[0, 1]-valued random variable W(·) such that
$$ W_n(\cdot) \longrightarrow^d W(\cdot) \quad\text{in } C[0, 1], $$
(ii) W(·) is a Gaussian process with zero as its mean function and the covariance function C(s, t) = s ∧ t, 0 ≤ s, t ≤ 1.

Proof: (An outline). There are three steps.

Step I: The sequence of probability distributions {µ_n(·)}_{n≥1} on C[0, 1] induced by {W_n(·)}_{n≥1} is tight (as defined in Chapter 9).

Step II: For any random element W(·) of C[0, 1], the family of finite dimensional joint distributions of (W(t_1), . . . , W(t_k)), with k ranging over N and 0 ≤ t_1 ≤ t_2 ≤ . . . ≤ t_k ≤ 1, determines the probability distribution of W(·) in C[0, 1].

Step III: By Step I, every subsequence {µ_{n_j}(·)} of {µ_n(·)} has a further subsequence {µ_{n_{j_k}}(·)} that converges to some probability distribution µ on C[0, 1]. (This is a generalization of Helly's selection theorem, known as the Prohorov-Varadarajan theorem.) But then µ has the same finite dimensional distributions as SBM[0, 1] and hence, by Step II, the measure µ is independent of the subsequence. Thus µ_n →_d µ. For a full proof, see Billingsley (1968, 1995).

Corollary 11.4.10: For k ∈ N, let T : C[0, 1] → R^k be a continuous function from the metric space (C[0, 1], ρ) to R^k. Then
$$ T\big(W_n(\cdot)\big) \longrightarrow^d T\big(W(\cdot)\big). $$
Some examples of such continuous functions on the metric space C[0, 1] are
$$ T_1(f) \equiv \sup\{f(x) : 0 \le x \le 1\}, \quad T_2(f) \equiv \inf\{f(x) : 0 \le x \le 1\}, \quad\text{and}\quad T_3(f) \equiv \sup\{|f(x)| : 0 \le x \le 1\}. \qquad (4.42) $$

An application of Corollary 11.4.10 to the above choices of T yields

Corollary 11.4.11: Let {X_i}_{i≥1} be iid random variables with EX_1 = 0, EX_1² = 1, and let
$$ M_{n1} = \max_{0\le j\le n} S_j, \qquad M_{n2} = \min_{0\le j\le n} S_j, \qquad M_{n3} = \max_{0\le j\le n} |S_j|. $$
Then,
$$ \frac{1}{\sqrt{n}}\,(M_{n1}, M_{n2}, M_{n3}) \longrightarrow^d (M_1, M_2, M_3), $$
where
$$ M_1 = \sup\{W(x) : 0 \le x \le 1\}, \quad M_2 = \inf\{W(x) : 0 \le x \le 1\}, \quad M_3 = \sup\{|W(x)| : 0 \le x \le 1\}, $$
and where W(·) is the standard Brownian motion on [0, 1], as defined in Theorem 11.4.9.

Corollary 11.4.11 is useful in statistical inference in obtaining approximations to the sampling distributions of the statistics (M_{n1}, M_{n2}, M_{n3}) for large n. The exact distribution of (M_1, M_2, M_3) can be obtained by using the reflection principle as discussed in Chapter 15.
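An informal Monte Carlo illustration of Corollary 11.4.11 (ours, not from the text): by the reflection principle, M_1 = sup_{0≤t≤1} W(t) has cdf 2Φ(x) − 1, and the simulated distribution of M_{n1}/√n should be close to it; the ±1 step distribution and sample sizes are illustrative.

```python
# Sketch: distribution of max_j S_j / sqrt(n) vs. the limit 2*Phi(x) - 1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps = 1000, 5000
steps = rng.choice([-1.0, 1.0], size=(reps, n))          # EX = 0, EX^2 = 1
paths = np.cumsum(steps, axis=1)
mn1 = np.maximum(paths.max(axis=1), 0.0) / np.sqrt(n)    # include S_0 = 0

for x in (0.5, 1.0, 1.5, 2.0):
    print(x, (mn1 <= x).mean(), 2 * norm.cdf(x) - 1)
```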

11.4.5 Empirical process and Brownian bridge

Let {U_i}_{i≥1} be iid Uniform [0, 1] random variables. Let $F_n(t) = \frac{1}{n}\sum_{i=1}^n I(U_i \le t)$, 0 ≤ t ≤ 1, be the empirical cdf of U_1, U_2, . . . , U_n.

Clearly, F_n(·) is a step function on [0, 1] with jumps of size 1/n at U_{n1} < U_{n2} < ··· < U_{nn}, where (U_{n1}, U_{n2}, . . . , U_{nn}) is the increasing rearrangement of (U_1, U_2, . . . , U_n), i.e., the n order statistics based on (U_1, U_2, . . . , U_n).

Let Y_n(t) be the function obtained by linearly interpolating √n(F_n(t) − t), 0 ≤ t ≤ 1, from its values at the jump points (U_{n1}, U_{n2}, . . . , U_{nn}). Then Y_n(·) is a random element of the metric space (C[0, 1], ρ), the space of all real valued continuous functions on [0, 1] with the supremum metric ρ, where ρ(f, g) = sup{|f(t) − g(t)| : 0 ≤ t ≤ 1}. Let {B(t) : 0 ≤ t ≤ 1} be the standard Brownian motion, i.e., a C[0, 1]-valued random variable that is also a Gaussian process with mean zero and covariance function c(s, t) = s ∧ t, 0 ≤ s, t ≤ 1. Let B_0(t) ≡ B(t) − tB(1), 0 ≤ t ≤ 1. Then B_0(·) is a random element of C[0, 1]. It is also a Gaussian process with mean zero and covariance function c_0(s, t) = s ∧ t − st − st + st = s ∧ t − st.

Theorem 11.4.12: Y_n(·) →_d B_0(·) in C[0, 1].

The proof is similar to that of Theorem 11.4.9. See Billingsley (1968, 1995) for details.

Definition 11.4.1: The process {B_0(t) : 0 ≤ t ≤ 1} is called the Brownian bridge and the process {Y_n(t) : 0 ≤ t ≤ 1} is called the empirical process based on U_1, U_2, . . . , U_n.

Now recall that by the Glivenko-Cantelli theorem (cf. Chapter 8),
$$ \sup\{|F_n(t) - t| : 0 \le t \le 1\} \to 0. $$
Since T(f) ≡ sup{|f(x)| : 0 ≤ x ≤ 1} is a continuous map from C[0, 1] to R⁺, Theorem 11.4.12 leads to

Corollary 11.4.13: $\sqrt{n}\,\sup\{|F_n(t) - t| : 0 \le t \le 1\} \longrightarrow^d \sup\{|B_0(t)| : 0 \le t \le 1\}$.

This, in turn, can be used to find the asymptotic distribution of the Kolmogorov-Smirnov statistic (see (4.43) below).

Let {X_i}_{i≥1} be iid random variables with a continuous distribution function F(·). Let $F_n(x) \equiv \frac{1}{n}\sum_{i=1}^n I(X_i \le x)$ be the empirical distribution function based on (X_1, X_2, . . . , X_n). Let U_i = F(X_i), i = 1, 2, . . . , n. Since F is continuous, {U_i}_{i≥1} are iid Uniform (0, 1) random variables. It can be shown that the Kolmogorov-Smirnov statistic
$$ KS(F_n) \equiv \sqrt{n}\,\sup\{|F_n(x) - F(x)| : x \in \mathbb{R}\} \qquad (4.43) $$
has the same distribution as sup{|Y_n(t)| : 0 ≤ t ≤ 1} and hence
$$ KS(F_n) \longrightarrow^d M_0 \equiv \sup\{|B_0(t)| : 0 \le t \le 1\}. $$
This can be used to test the hypothesis that F(·) is the cdf of {X_i}_{i≥1} by rejecting it if KS(F_n) is large. To decide what is large, the distribution of KS(F_n) can be approximated by that of M_0. Thus, if the significance level is α, 0 < α < 1, then one determines a value C_0 such that P(M_0 > C_0) = α and rejects the hypothesis that F is the distribution of {X_i}_{i≥1} if KS(F_n) > C_0, and accepts it otherwise.
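A minimal sketch of this test (ours; the standard-normal null hypothesis and sample size are illustrative assumptions). The critical value C_0 is obtained from the classical series P(M_0 ≤ x) = 1 − 2∑_{k≥1}(−1)^{k−1} e^{−2k²x²} by root-finding.

```python
# Sketch of the Kolmogorov-Smirnov test based on (4.43) and the limit M_0.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def kolmogorov_cdf(x, terms=100):
    """P(M_0 <= x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 x^2)."""
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * x**2))

alpha = 0.05
c0 = brentq(lambda x: kolmogorov_cdf(x) - (1 - alpha), 0.3, 3.0)   # critical value C_0

rng = np.random.default_rng(2)
x = np.sort(rng.standard_normal(200))            # data; H0: F = standard normal cdf
n = len(x)
ecdf_hi = np.arange(1, n + 1) / n                # F_n at the order statistics
ecdf_lo = np.arange(0, n) / n                    # F_n just below them
ks = np.sqrt(n) * np.max(np.maximum(ecdf_hi - norm.cdf(x), norm.cdf(x) - ecdf_lo))
print("KS =", round(ks, 3), " C0 =", round(c0, 3), " reject H0:", ks > c0)
```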

11.5 Problems

11.1 Show that the triangular array {X_{nj} : 1 ≤ j ≤ n}_{n≥1}, with X_{nj} as in (1.24), is a null array, i.e., satisfies (1.15), iff (1.22) holds.

11.2 Construct an example of a triangular array {X_{nj} : 1 ≤ j ≤ r_n}_{n≥1} of independent random variables such that for any 1 ≤ j ≤ r_n, n ≥ 1, E|X_{nj}|^α = ∞ for all α ∈ (0,∞), but there exist sequences {a_n}_{n≥1} ⊂ (0,∞) and {b_n}_{n≥1} ⊂ R such that
$$ \frac{S_n - b_n}{a_n} \longrightarrow^d N(0, 1). $$

11.3 Let {X_n}_{n≥1} be a sequence of independent random variables with
$$ P(X_n = \pm 1) = \frac{1}{2} - \frac{1}{2\sqrt{n}}, \qquad P(X_n = \pm n^2) = \frac{1}{2\sqrt{n}}, \quad n \ge 1. $$
Find constants {a_n}_{n≥1} ⊂ (0,∞) and {b_n}_{n≥1} ⊂ R such that
$$ \frac{\sum_{j=1}^n X_j - b_n}{a_n} \longrightarrow^d N(0, 1). $$

11.4 Let {X_n}_{n≥1} be a sequence of independent random variables such that for some α ≥ 1/2,
$$ P(X_n = \pm n^{\alpha}) = \frac{n^{1-2\alpha}}{2} \quad\text{and}\quad P(X_n = 0) = 1 - n^{1-2\alpha}, \quad n \ge 1. $$
Let S_n = ∑_{j=1}^n X_j and s_n² = Var(S_n).

(a) Show that for all α ∈ [1/2, 1),
$$ \frac{S_n}{s_n} \longrightarrow^d N(0, 1). \qquad (5.1) $$
(b) Show that (5.1) fails for α ∈ [1,∞).

(c) Show that for α > 1, {S_n}_{n≥1} converges to a random variable S w.p. 1 and that s_n → ∞.


11.5 Let {X_{nj} : 1 ≤ j ≤ r_n}_{n≥1} be a triangular array of independent zero mean random variables satisfying the Lindeberg condition. Show that
$$ E\Big[\max_{1\le j\le r_n}\frac{X_{nj}^2}{s_n^2}\Big] \to 0 \quad\text{as } n\to\infty, $$
where $s_n^2 = \sum_{j=1}^{r_n}\mathrm{Var}(X_{nj})$, n ≥ 1.

11.6 Let {X_n}_{n≥1} be a sequence of random variables. Let S_n = ∑_{j=1}^n X_j and $s_n^2 = \sum_{j=1}^n EX_j^2 < \infty$, n ≥ 1. If s_n² → ∞, then show that
$$ \lim_{n\to\infty} s_n^{-2}\sum_{j=1}^n EX_j^2\, I(|X_j| > \varepsilon s_n) = 0 \ \text{for all } \varepsilon > 0 \iff \lim_{n\to\infty} s_n^{-2}\sum_{j=1}^n EX_j^2\, I(|X_j| > \varepsilon s_j) = 0 \ \text{for all } \varepsilon > 0. $$
(Hint: Verify that for any δ > 0, $\sum_{j : s_j < \delta s_n} EX_j^2 \le \delta^2 s_n^2$.)

11.7 For a sequence of random variables Xnn≥1 and for a real numberr > 2, show that

limn→∞ s−r

n

n∑j=1

E|Xj |rI(|Xj | > εsn) = 0 for all ε ∈ (0,∞),

⇐⇒ limn→∞ s−r

n

n∑j=1

E|Xj |r = 0, (5.2)

where s2n =

∑nj=1 EX2

j .

11.8 Let Xnn≥1 be a sequence of zero mean independent random vari-ables satisfying (5.2) for r = 4.

(a) Show thatlim

n→∞ E(s−1n Sn)k = EZk

for all k = 2, 3, 4, where Z ∼ N(0, 1).

(b) Show that(

Sn

sn

)4n≥1 is uniformly integrable.

(c) Show that limn→∞ Eh

(Sn

sn

)= Eh(Z) where h(·) : R → R is contin-

uous and h(x) = O(|x|4) as |x| → ∞.

11.9 Let Xnn≥1 be a sequence of independent random variables suchthat

P (Xn = ±1) =14, P (Xn = ±n) =

14n2

andP (Xn = 0) =

12(1− 1

n2 ), n ≥ 1.


(a) Show that the triangular array Xnj : 1 ≤ j ≤ nn≥1 withXnj = Xj/

√n, 1 ≤ j ≤ n, n ≥ 1 does not satisfy the Lindeberg

condition.(b) Show that there exists σ ∈ (0,∞) such that

Sn√n−→d N(0, σ2).

Find σ2.

11.10 Let Xjj≥1 be independent random variables such that Xj has Uni-form [−j, j] distribution. Show that the Lindeberg-Feller conditionholds for the triangular array Xnj = Xj/n3/2, 1 ≤ j ≤ n, n ≥ 1.

11.11 (CLT for random sums). Let Xii≥1 be iid random variables withEX1 = 0, EX2

1 = 1. Let Nnn≥1 be a sequence of positive integervalued random variables such that Nn

n −→p c, 0 < c < ∞. Show thatSNn√

Nn−→d N(0, 1).

(Hint: Use Kolmogorov’s first inequality (cf. 8.3.1) to show thatP( |SNn−Snc|√

n> λ, |Nn − nc| < nε

)≤ ε

λ2 for any ε > 0, λ > 0.)

11.12 Let N(t) : t ≤ 0 be the renewal process as defined in (5.1) of Section8.5. Assume EX1 = µ ∈ (0,∞), EX2

1 < ∞. Show that

N(t)− t/µ√t

−→d N(0, σ2) (5.3)

for some 0 < σ2 < ∞. Find σ2.

(Hint: Use

SN(t) −N(t)µ√N(t)

≤ t−N(t)µ√N(t)

≤SN(t)+1 − (N(t) + 1)µ√

N(t)+

µ√N(t)

and the fact that N(t)t → 1

µ w.p. 1.)

11.13 Let N(t) : t ≥ 0 be as in the above problem. Give another proof of(5.3) by using the CLT for Snn≥1 and the relation P (N(t) > n) =P (Sn < t) for all t, n.

11.14 Let Xjj≥1 be iid random variables with distribution P (X1 = 1) =1/2 = P (X1 = −1). Show that there exist positive integer valuedrandom variables rkk≥1 such that rk → ∞ w.p. 1, but Srk√

rkdoes

not converge in distribution.

(Hint: Let r1 = minn : Sn√n

> 1 and for k ≥ 1, define recursively

rk+1 = minn : n > rk, Sn√n

> k + 1.)


11.15 (CLT for sample quantiles). Let {X_i}_{i≥1} be iid random variables. Let 0 < p < 1 and let Y_n ≡ F_n^{-1}(p) = inf{x : F_n(x) ≥ p}, where $F_n(x) \equiv \frac{1}{n}\sum_{i=1}^n I(X_i \le x)$ is the empirical cdf based on X_1, X_2, . . . , X_n. Assume that the cdf F(x) ≡ P(X_1 ≤ x) is differentiable at F^{-1}(p) ≡ inf{x : F(x) ≥ p} and that λ_p ≡ F'(F^{-1}(p)) > 0. Then show that $\sqrt{n}\,(Y_n - F^{-1}(p)) \longrightarrow^d N(0, \sigma^2)$, where σ² = p(1 − p)/λ_p².

(Hint: Use the identity P(Y_n ≤ x) = P(F_n(x) ≥ p) for all x and p.)

11.16 (A coupon collector's problem). For each n ∈ N, let {X_{ni}}_{i≥1} be iid random variables such that P(X_{n1} = j) = 1/n, 1 ≤ j ≤ n. Let T_{n0} = 1, T_{n1} = min{j : j > 1, X_{nj} ≠ X_{n1}}, and T_{n(i+1)} = min{j : j > T_{ni}, X_{nj} ∉ {X_{nT_{nk}} : 0 ≤ k ≤ i}}. That is, T_{ni} is the first time the sample has (i + 1) distinct elements. Suppose k_n ↑ ∞ is such that k_n/n → θ, 0 < θ < 1. Show that for some a_n, b_n,
$$ \frac{T_{n,k_n} - a_n}{b_n} \longrightarrow^d N(0, 1). $$
(Hint: Let Y_{nj} = T_{nj} − T_{n(j−1)}, j = 1, 2, . . . , (n − 1). Show that for each n, {Y_{nj}}_{j=1,2,...} are independent, with Y_{nj} having a geometric distribution with parameter (1 − j/n). Now apply Lyapounov's criterion to the triangular array {Y_{nj} : 1 ≤ j ≤ k_n}.)

11.17 Prove Theorem 11.1.6.

11.18 Let Xnn≥1 be a sequence of iid random variables with EXn = 0and EX2

n = σ2 ∈ (0,∞). Let Sn =∑n

j=1 Xj , n ≥ 1. For each k ∈ N,find the limit distribution of the k-dimensional vector(s).

(a)(

Sn√n, S2n−Sn√

n, . . . ,

Snk−Sn(k−1)√n

),

(b)(

Sna1√n

,Sna2√

n, . . . ,

Snak√n

), where 0 < a1 < a2 < · · · < ak < ∞ are

given real numbers,

(c)(

S2n√n

, S3n−Sn√n

, . . . ,S(k+1)n−S(k−1)n√

n

).

11.19 For any random variable X, show that EX2 < ∞ implies

y2P (|x| > y)E(X2I(|x| ≤ y))

→ 0

as y →∞. Give an example to show that the converse is false.

(Hint: Consider a random variable X with pdf f(x) = c1|x|−3 for|x| > 2.)


11.20 Let Xnn≥1 be a sequence of iid random variables with commondistribution

P (X1 ∈ A) =∫

A

|x|−3I(|x| > 1)dx, A ∈ B(R).

Find sequences ann≥1 ⊂ (0,∞) and bnn≥1 ⊂ R such that

Sn − bn

an−→d N(0, 1)

where Sn =∑n

j=1 Xj , n ≥ 1.

11.21 Show using characteristic functions that if X1, X2, . . . , Xk are iidCauchy (µ, σ2) random variables, then Sk ≡

∑ki=1 Xi has a Cauchy

(kµ, kσ) distribution.

11.22 Show that if a random variable Y1 has pdf f as in (2.4), then (2.9)holds with α = 1/2.

11.23 If Ynn≥1 are iid Gamma (1,2), then n−1∑ni=1 Y −1

i −→d W , whereW has pdf fW (w) ≡

( 2π

11+w2

)· I(0,∞)(w).

11.24 Let X be a nonnegative random variable such that P (X ≤ x) ∼xαL(x) as x ↓ 0 for some α > 0 and L(·) slowly varying at 0. LetY = X−β , β > 0. Show that

P (Y > y) ∼ y−γL(y) as y ↑ ∞

for some γ > 0 and L(·) slowly varying at ∞.

11.25 Let Xii≥1 be iid Beta (m, n) random variables. Let Yi = X−βi ,

β > 0, i ≥ 1. Show that there exist sequences ann≥1 and bnn≥1

such that∑n

i=1 Yi−an

bn−→d a stable law of order γ for some γ in (0, 2].

11.26 Let Xii≥1 be iid Uniform [0, 1] random variables.

(a) Show that for each 0 < β < 12 , there exist constants µ and σ2

such that

1σ√

n

( n∑i=1

X−βi − nµ

)−→d N(0, 1).

(b) Show that for each 12 < β < 1, there exist a constant 0 < γ < 2

and sequences ann≥1 and bnn≥1 such that

1bn

( n∑i=1

X−βi − an

)−→d a stable law of order γ.


11.27 Prove (4.7).

(Hint: Use (4.5) and Lemma 10.1.5.)

11.28 (a) Show that the Gamma (α, β) distribution is infinitely divisible,0 < α, β < ∞.

(b) Let µ be a finite measure on(R, β(R)

). Show that φ(t) ≡

exp ∫

(eιtu − 1)µ(du)

is the characteristic function of an in-finitely divisible distribution.

11.29 Let Xnn≥1 be iid random variables with P (X1 = 0) = P (X1 =1) = 1

2 . Show that there exists a constant C1 ∈ (0,∞) such that

∆n ≥ C1n−1/2 for all n ≥ 1,

where ∆n is as in (4.1).

11.30 Let X1 be a random variable such that the absolutely continuouscomponent βFac(·) in the decomposition (4.5.3) of the cdf F of X1 isnonzero. Show that X1 satisfies Cramer’s condition (4.13).

(Hint: Use the Riemann-Lebesgue lemma.)

11.31 (Berry-Esseen theorem for sample quantile). Let Xnn≥1 be a col-lection of iid random variables with cdf F (·). Let 0 < p < 1 andYn = F−1

n (p), where Fn(x) = n−1∑ni=1 I(X1 ≤ x), x ∈ R. Suppose

that F (·) is twice differentiable in a neighborhood of ξp ≡ F−1(p)and F ′(ξp) ∈ (0,∞). Show that

supx∈R

∣∣∣P(√n(Yn − ξp)/σp ≤ x)− Φ(x)

∣∣∣ = O(n−1/2) as n →∞

where σp = p(1− p)/(F ′(ξp)

)2.(Hint: Use the identity P (Yn ≤ x) = P (Fn(x) ≥ p) for all x andp, apply Theorem 11.4.1 to Fn(x) for

√n|x − ξp| ≤ log n, and use

monotonicity of cdfs for√

n|x − ξp| > log n. See Lahiri (1992) formore details. Also, see Reiss (1974) for a different proof.)

11.32 (A moderate deviation bound). Let Xnn≥1 be a sequence of iidrandom variables with EX1 = µ, Var(X) = σ2 ∈ (0,∞) andE|X1|3 < ∞. Show that

P(√

n∣∣Xn − µ

∣∣ > σ√

log n)

= O(n−1/2) as n →∞.

(Hint: Apply Theorem 11.4.1.)

It can be shown that the bound on the right side is indeed o(n−1/2)as n → ∞. For a more general version of this result, see Gotze andHipp (1978).


11.33 Show that the values of the functions en,2(x) of (4.14) and en,2(x) of(4.24) are not necessarily nonnegative for all x ∈ R.

11.34 Complete the proof of Lemma 11.4.7 (i).

(Hint: Suppose that for some r ∈ N, φ is r-times differentiable withits rth derivative given by (4.30). Then, for t ∈ (0,∞),

h−1[φ(r)(t + h)− φ(r)(t)]

=∫ ∞

−∞

ehx − 1h

· xretxF (dx)

and the integrand is bounded by the integrable function |x|rg(x) forall x ∈ R, 0 < |h| < t/2, where g(·) is as in (4.32). Now apply theDCT.)

11.35 Under the conditions of Lemma 11.4.7, show that

limt→∞

φ(1)(t)φ(t)

= θ.

(Hint: Consider the cases ‘θ ∈ R’ and ‘θ = ∞’ separately.)

11.36 Find the function γ(x) of (4.28) in each of the following cases:

(a) X1 ∼ N(µ, σ2),

(b) X1 ∼ Gamma (α, β),

(c) X1 ∼ Uniform (0, 1).

11.37 Verify that the functions Ti, i = 1, 2, 3 defined by (4.42) are continu-ous on C[0, 1].


12 Conditional Expectation and Conditional Probability

12.1 Conditional expectation: Definitions and examples

This section motivates the definition of conditional expectation for random variables with a finite variance through a mean square error prediction problem. The definition is then extended to integrable random variables by an approximation argument (cf. Definition 12.1.3). The more standard approach of proving the existence of conditional expectation by the use of the Radon-Nikodym theorem is also outlined.

Let (X, Y ) be a bivariate random vector. A standard problem in regres-sion analysis is to predict Y having observed X. That is, to find a functionh(X) that predicts Y . A common criterion for measuring the accuracy ofsuch a predictor is the mean squared error E(Y − h(X))2. Under the as-sumption that E|Y |2 < ∞, it can be shown that there exists a uniqueh0(X) that minimizes the mean squared error.

Theorem 12.1.1: Let (X, Y) be a bivariate random vector with EY² < ∞. Then there exists a Borel measurable function h_0 : R → R with E(h_0(X))² < ∞ such that
$$ E\big(Y - h_0(X)\big)^2 = \inf\big\{E\big(Y - h(X)\big)^2 : h(X) \in H_0\big\}, \qquad (1.1) $$
where H_0 = {h(X) | h : R → R is Borel measurable and E(h(X))² < ∞}.

Proof: Let H be the space of all Borel measurable functions of (X, Y) that have a finite second moment. Let H_0 be the subspace of all Borel measurable functions of X that have a finite second moment. It is known that H_0 is a closed subspace of H (Problem 12.1) and that for any Z in H, there exists a unique Z_0 in H_0 such that
$$ E(Z - Z_0)^2 = \min\big\{E(Z - Z_1)^2 : Z_1 \in H_0\big\}. $$
Further, Z_0 is the unique random variable (up to equivalence w.p. 1) such that
$$ E(Z - Z_0)Z_1 = 0 \quad\text{for all } Z_1 \in H_0. \qquad (1.2) $$
A proof of this fact is given at the end of this section in Theorem 12.1.6. If one takes Z to be Y, then this Z_0 is the desired h_0(X).

Remark 12.1.1: The random variable Z_0 in (1.2) is known as the projection of Y onto H_0.

It is known that for any random variable Y with EY² < ∞, the constant c that minimizes E(Y − c)² over all c ∈ R is c = EY, the expected value of Y. By analogy with this, one is led to the following definition.

Definition 12.1.1: For any bivariate random vector (X, Y) with EY² < ∞, the conditional expectation of Y given X, denoted by E(Y|X), is the function h_0(X) of Theorem 12.1.1. Note that h_0(X) is determined up to equivalence w.p. 1. Any such h_0(X) is called a version of E(Y|X).

From (1.2) in the proof of Theorem 12.1.1, by taking Z = Y and Z_1 = I_B(X), one finds that Z_0 = h_0(X) satisfies
$$ E\,Y I_A = E\,h_0(X)\, I_A \qquad (1.3) $$
for every event A of the form X^{-1}(B) with B ∈ B(R). Conversely, it can be shown that (1.3) implies (1.2) by the usual approximation procedure (Problem 12.1). From (1.3), the function h_0(X) is determined w.p. 1, so one can take (1.3) to be the definition of h_0(X). In statistics, the function E(Y|X) is called the regression of Y on X.

The function h0(x) can be determined explicitly in the following twospecial cases.

If X is a discrete random variable with values x_1, x_2, . . ., then (1.3) implies, by taking A = {X = x_i}, that
$$ h_0(x_i) = \frac{E\big(Y I(X = x_i)\big)}{P(X = x_i)}, \quad i = 1, 2, \ldots. \qquad (1.4) $$
Similarly, if (X, Y) has an absolutely continuous distribution with joint probability density f(x, y), it can be shown that w.p. 1, E(Y|X) = h_0(X), where
$$ h_0(x) = \frac{\int y f(x, y)\, dy}{f_X(x)} \qquad (1.5) $$
if f_X(x) > 0 and 0 otherwise, where $f_X(x) = \int f(x, y)\, dy$ is the probability density function of X (Problem 12.2).
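For concreteness, here is a small numerical sketch of (1.4) (ours; the finite joint pmf is made up): E(Y|X = x_i) is computed as E(Y I(X = x_i))/P(X = x_i), and the iterated expectation recovers EY.

```python
# Sketch of (1.4) for a finite joint pmf P(X = xs[i], Y = ys[j]); values are illustrative.
import numpy as np

xs = np.array([0.0, 1.0])
ys = np.array([-1.0, 0.0, 2.0])
p = np.array([[0.10, 0.20, 0.10],
              [0.15, 0.05, 0.40]])

for i, x in enumerate(xs):
    px = p[i].sum()                          # P(X = x)
    h0 = (p[i] * ys).sum() / px              # E(Y I(X = x)) / P(X = x)
    print(f"E(Y | X = {x}) = {h0:.4f}")

ey_direct = (p * ys).sum()
ey_iterated = sum(p[i].sum() * ((p[i] * ys).sum() / p[i].sum()) for i in range(len(xs)))
print("EY =", ey_direct, "  via iterated expectation =", round(ey_iterated, 10))
```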

The definition of E(Y|X) can be generalized to the case when X is a vector and, more generally, as follows.

Theorem 12.1.2: Let (Ω, F, P) be a probability space and G ⊂ F a σ-algebra. Let H ≡ L²(Ω, F, P) and H_0 = L²(Ω, G, P). Then for any Y ∈ H, there exists a Z_0 ∈ H_0 such that
$$ E(Y - Z_0)^2 = \inf\{E(Y - Z)^2 : Z \in H_0\} \qquad (1.6) $$
and this Z_0 is determined w.p. 1 by the condition
$$ E(Y I_A) = E(Z_0 I_A) \quad\text{for all } A \in \mathcal{G}. \qquad (1.7) $$
The proof is similar to that of Theorem 12.1.1.

Definition 12.1.2: The random variable Z_0 in (1.7) is called the conditional expectation of Y given G and is written as E(Y|G).

When G = σ⟨X⟩, the σ-algebra generated by a random variable X, E(Y|G) reduces to E(Y|X) of Definition 12.1.1. The following properties of E(Y|G) are easy to verify by using the defining equation (1.7) (Problem 12.3).

Proposition 12.1.3: Let Y and G be as in Theorem 12.1.2.

(i) Y ≥ 0 w.p. 1 ⇒ E(Y|G) ≥ 0 w.p. 1.

(ii) Y_1, Y_2 ∈ H ⇒ E((αY_1 + βY_2)|G) = αE(Y_1|G) + βE(Y_2|G) for any α, β ∈ R.

(iii) Y_1 ≥ Y_2 w.p. 1 ⇒ E(Y_1|G) ≥ E(Y_2|G) w.p. 1.

Using a natural approximation procedure, it is possible to extend the definition of E(Y|G) to all random variables with just a finite first moment, i.e., E|Y| < ∞. This is done in the following result.

Theorem 12.1.4: Let (Ω, F, P) be a probability space and G ⊂ F a sub-σ-algebra. Let Y : Ω → R be an F-measurable random variable with E|Y| < ∞. Then there exists a random variable Z_0 : Ω → R that is G-measurable, satisfies E|Z_0| < ∞, and is uniquely determined (up to equivalence w.p. 1) by
$$ E(Y I_A) = E(Z_0 I_A) \quad\text{for all } A \in \mathcal{G}. \qquad (1.8) $$

Proof: Since Y can be written as Y = Y⁺ − Y⁻, it is enough to consider the case Y ≥ 0 w.p. 1. Let Y_n = min{Y, n} for n = 1, 2, . . .. Then EY_n² < ∞ and, by Theorem 12.1.2, Z_n ≡ E(Y_n|G) is well defined, G-measurable, and satisfies
$$ E(Y_n I_A) = E(Z_n I_A) \quad\text{for all } A \in \mathcal{G}. \qquad (1.9) $$
Since 0 ≤ Y_n ≤ Y_{n+1}, by Proposition 12.1.3, 0 ≤ Z_n ≤ Z_{n+1} w.p. 1, and so there exists a set B ∈ G such that P(B) = 0 and, on B^c, {Z_n}_{n≥1} is nondecreasing and nonnegative. Let Z_0 = lim_{n→∞} Z_n on B^c and 0 on B. Then Z_0 is G-measurable. Applying the MCT to both sides of (1.9), one gets
$$ E(Y I_A) = E(Z_0 I_A) \quad\text{for all } A \in \mathcal{G}. $$
This proves the existence of a G-measurable Z_0 satisfying (1.8). The uniqueness follows from the fact that if Z_0 and Z'_0 are G-measurable with E|Z_0| < ∞, E|Z'_0| < ∞, and
$$ E\,Z_0 I_A = E\,Z'_0 I_A \quad\text{for all } A \in \mathcal{G}, $$
then Z_0 = Z'_0 w.p. 1 (Problem 12.3).

Remark 12.1.2: An alternative to the proof of Theorem 12.1.4 above leading to the definition of E(Y|G) is via the Radon-Nikodym theorem. Here is an outline of this proof. Let Y be a nonnegative random variable with E|Y| < ∞. Set µ_Y(A) ≡ E(Y I_A) for all A ∈ G. Then µ_Y is a measure on (Ω, G) and it is dominated by P_G, the restriction of P to G. By the Radon-Nikodym theorem, there is a G-measurable function Z such that
$$ E(Y I_A) = \mu_Y(A) = \int_A Z\, dP_{\mathcal{G}} = E\,Z I_A. $$
Extension to the case when Y is real-valued with E|Y| < ∞ is via the decomposition Y = Y⁺ − Y⁻.

Remark 12.1.3: The arguments in the proof of Theorem 12.1.4 (and Problem 12.3) show that the conclusion of the theorem holds for any nonnegative random variable Y, whether or not EY is finite.

Definition 12.1.3: Let Y be an F-measurable random variable on a probability space (Ω, F, P) such that either Y is nonnegative or E|Y| < ∞. A random variable Z_0 that is G-measurable and satisfies (1.8) is called the conditional expectation of Y given G and is written as E(Y|G).

The following are some important consequences of (1.8):

(i) If Y is G-measurable, then E(Y|G) = Y.

(ii) If G = F, then E(Y|G) = Y.

(iii) If G = {∅, Ω}, then E(Y|G) = EY.

(iv) By taking A to be Ω in (1.8), EY = E(E(Y|G)).

Furthermore, Proposition 12.1.3 extends to this case. When G = σ⟨X⟩ with X discrete, (1.4) holds provided E|Y| < ∞.

Part (iv) is useful in computing EY without explicitly determining the distribution of Y. Suppose E(Y|X) = m(X) and Em(X) is easy to compute, but finding the distribution of Y is not. Then EY can still be computed as Em(X). For example, let (X, Y) have a bivariate distribution with pdf
$$ f_{X,Y}(x, y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\cdot\dfrac{1}{\sqrt{2\pi}\,|x|}\; e^{-\frac{(y-x)^2}{2x^2}}\; e^{-\frac{(x-1)^2}{2}} & \text{if } x \ne 0, \\[1mm] 0 & \text{if } x = 0, \end{cases} $$
(x, y) ∈ R². In this case, evaluating f_Y(y) is not easy. On the other hand, it can be verified that for each x ≠ 0,
$$ m(x) \equiv \frac{\int y\, f_{X,Y}(x, y)\, dy}{f_X(x)} = x, \quad\text{and that}\quad f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-1)^2}{2}}. $$
Thus, EY = EX = 1. For more examples of this type, see Problem 12.29.
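A quick Monte Carlo sanity check of this example (ours, not from the text): sample X ~ N(1, 1) and, given X = x, Y ~ N(x, x²); the sample mean of Y should be close to EX = 1 even though f_Y is awkward to write down.

```python
# Monte Carlo check that EY = E[E(Y|X)] = EX = 1 in the example above.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.normal(loc=1.0, scale=1.0, size=n)
y = rng.normal(loc=x, scale=np.abs(x))     # conditional draw Y | X = x ~ N(x, x^2)

print("sample mean of Y:", y.mean())        # should be close to 1
print("sample mean of X:", x.mean())
```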

The next proposition lists some useful properties of the conditional expectation.

Proposition 12.1.5: Let (Ω, F, P) be a probability space and let Y be an F-measurable random variable with E|Y| < ∞. Let G_1 ⊂ G_2 ⊂ F be two sub-σ-algebras contained in F.

(i) Then
$$ E(Y|\mathcal{G}_1) = E\big(E(Y|\mathcal{G}_2)\,\big|\,\mathcal{G}_1\big). \qquad (1.10) $$
(ii) For any bounded G_1-measurable random variable U,
$$ E(YU|\mathcal{G}_1) = U\,E(Y|\mathcal{G}_1). \qquad (1.11) $$
Proof: (i) Let A ∈ G_1, Z_1 = E(Y|G_1), and Z_2 = E(Y|G_2). Then E(Y I_A) = E(Z_1 I_A) by the definition of Z_1. Since G_1 ⊂ G_2, A ∈ G_2 and, again by the definition of Z_2, E(Y I_A) = E(Z_2 I_A). Thus
$$ E(Z_2 I_A) = E(Z_1 I_A) \quad\text{for all } A \in \mathcal{G}_1, $$
and by the definition of E(Z_2|G_1), it follows that Z_1 = E(Z_2|G_1), proving (i).


(ii) Let Z_1 = E(Y|G_1). If U = I_B for some B ∈ G_1, then for any A ∈ G_1, A ∩ B ∈ G_1 and, by (1.8),
$$ E\,Y I_B I_A = E\,Y I_{A\cap B} = E(Z_1 I_{A\cap B}) = E(Z_1 I_B \cdot I_A). $$
So in this case E(YU|G_1) = Z_1 U. By linearity (Proposition 12.1.3 (ii)), this extends to all U that are simple and G_1-measurable. For any bounded G_1-measurable U, there exists a sequence of bounded, G_1-measurable, simple random variables {U_n}_{n≥1} that converge to U uniformly. Hence, for any A ∈ G_1 and n ≥ 1,
$$ E\,Y U_n I_A = E\,Z_1 U_n I_A. $$
The bounded convergence theorem applied to both sides yields
$$ E\,Y U I_A = E\,Z_1 U I_A. $$
Since Z_1 and U are both G_1-measurable, (ii) follows.

Remark 12.1.4: If the random variable U in Proposition 12.1.5 is G_1-measurable and E|YU| < ∞, then part (ii) of the proposition still holds. The proof needs a more careful approximation (see Billingsley (1995), pp. 447).

An Approximation Theorem

Theorem 12.1.6: Let H be a real Hilbert space and M be a nonempty closed convex subset of H. Then for every v ∈ H, there is a unique u_0 ∈ M such that
$$ \|v - u_0\| = \inf\{\|v - u\| : u \in \mathcal{M}\}, \qquad (1.12) $$
where ‖x‖² = ⟨x, x⟩, with ⟨x, y⟩ denoting the inner product in H.

Proof: Let δ = inf{‖v − u‖ : u ∈ M}. Then δ ∈ [0,∞). By definition, there exists a sequence {u_n}_{n≥1} ⊂ M such that ‖v − u_n‖ → δ. Also note that in any inner-product space the parallelogram law holds, i.e., for any x, y ∈ H,
$$ \|x + y\|^2 + \|x - y\|^2 = 2\big(\|x\|^2 + \|y\|^2\big). $$
Thus
$$ \|2v - (u_n + u_m)\|^2 + \|u_n - u_m\|^2 = 2\big(\|v - u_n\|^2 + \|v - u_m\|^2\big). \qquad (1.13) $$
Since M is convex, (u_n + u_m)/2 ∈ M, implying that ‖v − (u_n + u_m)/2‖² ≥ δ². This, with (1.13), implies that
$$ \limsup_{m,n\to\infty} \|u_n - u_m\|^2 = 0, $$


making {u_n}_{n≥1} a Cauchy sequence. Since H is a Hilbert space, there exists a u_0 ∈ H such that {u_n}_{n≥1} converges to u_0. Also, since M is closed, u_0 ∈ M. Since ‖v − u_n‖ → δ, it follows that ‖v − u_0‖ = δ.

To show the uniqueness, let u'_0 ∈ M also satisfy ‖v − u'_0‖ = δ. Then, as in (1.13),
$$ \Big\|v - \frac{u_0 + u'_0}{2}\Big\|^2 + \Big\|\frac{u_0 - u'_0}{2}\Big\|^2 = \delta^2, $$
implying ‖u_0 − u'_0‖ = 0.

Remark 12.1.5: The above theorem holds if M is a closed subspace of H.
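A small numerical sketch of the projection characterization (1.2)/(1.12) in the finite-dimensional Hilbert space R^n (ours; the matrix and names are illustrative): the least-squares projection u_0 of v onto a subspace M satisfies the orthogonality condition ⟨v − u_0, z⟩ = 0 for every z ∈ M.

```python
# Sketch: orthogonal projection onto a closed subspace of R^n.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 2))            # columns span the subspace M
v = rng.standard_normal(6)

coef, *_ = np.linalg.lstsq(A, v, rcond=None)
u0 = A @ coef                              # projection of v onto M

print("residual norm :", np.linalg.norm(v - u0))
print("orthogonality :", A.T @ (v - u0))   # approximately zero componentwise
```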

12.2 Convergence theorems

From Proposition 12.1.3, it is seen that E(Y |G) is monotone and linearin Y , suggesting that it behaves like an ordinary expectation. A naturalquestion is whether under appropriate conditions, the basic convergenceresults extend to conditional expectations (CE). The answer is ‘yes,’ asshown by the following results.

Theorem 12.2.1: (Monotone convergence theorem for CE). Let (Ω, F, P) be a probability space and G ⊂ F a sub-σ-algebra of F. Let {Y_n}_{n≥1} be a sequence of nonnegative F-measurable random variables such that 0 ≤ Y_n ≤ Y_{n+1} w.p. 1. Let Y ≡ lim_{n→∞} Y_n w.p. 1. Then
$$ \lim_{n\to\infty} E(Y_n|\mathcal{G}) = E(Y|\mathcal{G}) \quad\text{w.p. 1}. \qquad (2.1) $$
Proof: By Proposition 12.1.3 (i), Z_n ≡ E(Y_n|G) is monotone nondecreasing in n, w.p. 1, and so Z ≡ lim_{n→∞} Z_n exists w.p. 1. By the MCT, for all A ∈ G,
$$ E(Y I_A) = \lim_{n\to\infty} E\,Y_n I_A = \lim_{n\to\infty} E(Z_n I_A) = E(Z I_A). $$
Thus, Z = E(Y|G) w.p. 1, proving (2.1).

Theorem 12.2.2: (Fatou's lemma for CE). Let {Y_n}_{n≥1} be a sequence of nonnegative random variables on a probability space (Ω, F, P) and let G be a sub-σ-algebra of F. Then
$$ \liminf_{n\to\infty} E(Y_n|\mathcal{G}) \ge E\big(\liminf_{n\to\infty} Y_n \,\big|\, \mathcal{G}\big). \qquad (2.2) $$
Proof: Let Ỹ_n = inf_{j≥n} Y_j. Then {Ỹ_n}_{n≥1} is a nondecreasing sequence of nonnegative random variables and lim_{n→∞} Ỹ_n = liminf_{n→∞} Y_n. By the previous theorem,
$$ \lim_{n\to\infty} E(\tilde Y_n|\mathcal{G}) = E\big(\liminf_{n\to\infty} Y_n \,\big|\, \mathcal{G}\big). \qquad (2.3) $$
Also, since Ỹ_n ≤ Y_j for each j ≥ n,
$$ E(\tilde Y_n|\mathcal{G}) \le E(Y_j|\mathcal{G}) \quad\text{for each } j \ge n \text{ w.p. 1}, $$
implying that E(Ỹ_n|G) ≤ inf_{j≥n} E(Y_j|G) w.p. 1. The right side converges to liminf_{n→∞} E(Y_n|G) w.p. 1, so (2.2) follows from (2.3).

It is easy to deduce from Fatou’s lemma the following result (Problem12.4).

Theorem 12.2.3: (Dominated convergence theorem for CE). Let {Y_n}_{n≥1} and Y be random variables on a probability space (Ω, F, P) and let G be a sub-σ-algebra of F. Suppose that lim_{n→∞} Y_n = Y w.p. 1 and that there exists a random variable Z such that |Y_n| ≤ Z w.p. 1 and EZ < ∞. Then
$$ \lim_{n\to\infty} E(Y_n|\mathcal{G}) = E(Y|\mathcal{G}) \quad\text{w.p. 1}. \qquad (2.4) $$
Theorem 12.2.4: (Jensen's inequality for CE). Let φ : (a, b) → R be convex for some −∞ ≤ a < b ≤ ∞. Let Y be a random variable on a probability space (Ω, F, P) such that P(Y ∈ (a, b)) = 1 and E|φ(Y)| < ∞. Let G be a sub-σ-algebra of F. Then
$$ \phi\big(E(Y|\mathcal{G})\big) \le E\big(\phi(Y)\,\big|\,\mathcal{G}\big). \qquad (2.5) $$

Proof: By the convexity of φ on (a, b), for any c, x ∈ (a, b),
$$ \phi(x) - \phi(c) - (x - c)\,\phi'_-(c) \ge 0, \qquad (2.6) $$
where φ'_-(c) is the left derivative of φ at c. Taking c = E(Y|G) and x = Y in (2.6), one gets
$$ Z \equiv \phi(Y) - \phi\big(E(Y|\mathcal{G})\big) - \big(Y - E(Y|\mathcal{G})\big)\,\phi'_-\big(E(Y|\mathcal{G})\big) \ge 0. \qquad (2.7) $$
Since E(φ(E(Y|G))|G) = φ(E(Y|G)), by (1.11),
$$ E\Big[\big(Y - E(Y|\mathcal{G})\big)\,\phi'_-\big(E(Y|\mathcal{G})\big)\,\Big|\,\mathcal{G}\Big] = \phi'_-\big(E(Y|\mathcal{G})\big)\, E\big[\big(Y - E(Y|\mathcal{G})\big)\,\big|\,\mathcal{G}\big] = 0. $$
Also, from (2.7), E(Z|G) ≥ 0 and hence
$$ E\big(\phi(Y)\,\big|\,\mathcal{G}\big) \ge \phi\big(E(Y|\mathcal{G})\big). $$

The following inequalities are a direct consequence of Theorem 12.2.4.

Corollary 12.2.5: Let Y be a random variable on a probability space (Ω, F, P) and let G be a sub-σ-algebra of F.

(i) If EY² < ∞, then E(Y²|G) ≥ (E(Y|G))².

(ii) If E|Y|^p < ∞ for some p ∈ [1,∞), then E(|Y|^p | G) ≥ |E(Y|G)|^p.

Definition 12.2.1: Let EY² < ∞. The conditional variance of Y given G, denoted by Var(Y|G), is defined as
$$ \mathrm{Var}(Y|\mathcal{G}) = E(Y^2|\mathcal{G}) - \big(E(Y|\mathcal{G})\big)^2. \qquad (2.8) $$
This leads to the following formula for a decomposition of the variance of Y, known as the Analysis of Variance formula.

Theorem 12.2.6: Let EY² < ∞. Then
$$ \mathrm{Var}(Y) = \mathrm{Var}\big(E(Y|\mathcal{G})\big) + E\big(\mathrm{Var}(Y|\mathcal{G})\big). \qquad (2.9) $$

Proof: Var(Y) = E(Y − EY)². But Y − EY = Y − E(Y|G) + E(Y|G) − EY. Also, by (1.11),
$$ E\Big(\big[Y - E(Y|\mathcal{G})\big]\big[E(Y|\mathcal{G}) - EY\big]\,\Big|\,\mathcal{G}\Big) = \big[E(Y|\mathcal{G}) - EY\big]\, E\Big(\big[Y - E(Y|\mathcal{G})\big]\,\Big|\,\mathcal{G}\Big) = \big[E(Y|\mathcal{G}) - EY\big]\cdot 0 = 0. $$
Thus, $E\big(\big[Y - E(Y|\mathcal{G})\big]\big[E(Y|\mathcal{G}) - EY\big]\big) = 0$ and so
$$ E(Y - EY)^2 = E\big(Y - E(Y|\mathcal{G})\big)^2 + E\big(E(Y|\mathcal{G}) - EY\big)^2. \qquad (2.10) $$
Now, noting that $E\big[Y E(Y|\mathcal{G})\big] = E\big[E\big(Y E(Y|\mathcal{G})\,\big|\,\mathcal{G}\big)\big] = E\big[\big(E(Y|\mathcal{G})\big)^2\big]$, one gets
$$ \begin{aligned} E\big(Y - E(Y|\mathcal{G})\big)^2 &= EY^2 - 2E\big[Y E(Y|\mathcal{G})\big] + E\big(E(Y|\mathcal{G})\big)^2 = EY^2 - E\big[\big(E(Y|\mathcal{G})\big)^2\big] \\ &= E\big[E(Y^2|\mathcal{G})\big] - E\big[\big(E(Y|\mathcal{G})\big)^2\big] = E\big(\mathrm{Var}(Y|\mathcal{G})\big). \end{aligned} $$
Also, since E[E(Y|G)] = EY,
$$ E\big(E(Y|\mathcal{G}) - EY\big)^2 = \mathrm{Var}\big(E(Y|\mathcal{G})\big). $$

Hence, (2.9) follows from (2.10).

Remark 12.2.1: E(Var(Y|G)) is called the variance within and Var(E(Y|G)) is the variance between. The above proof also shows that
$$ E(Y - Z)^2 = E\big(Y - E(Y|\mathcal{G})\big)^2 + E\big(E(Y|\mathcal{G}) - Z\big)^2 \qquad (2.11) $$
for any random variable Z that is G-measurable. This is used to prove the Rao-Blackwell theorem in mathematical statistics (Lehmann and Casella (1998)) (Problem 12.27).
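A hedged numerical illustration of (2.9) (ours; the two-stage model is made up): with G = σ⟨X⟩, the total variance of Y should match Var(E(Y|X)) + E(Var(Y|X)).

```python
# Monte Carlo illustration of the ANOVA formula (2.9) with G = sigma<X>:
# X ~ Poisson(5) and, given X = k, Y ~ N(k, k + 1), so that
# E(Y|X) = X and Var(Y|X) = X + 1 (an illustrative hierarchical model).
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
x = rng.poisson(5, size=n).astype(float)
y = rng.normal(loc=x, scale=np.sqrt(x + 1.0))

var_total = y.var()
var_between = x.var()            # Var(E(Y|X)) = Var(X)
var_within = (x + 1.0).mean()    # E(Var(Y|X)) = E(X + 1)

print(var_total, "~", var_between + var_within)   # both close to 11
```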


12.3 Conditional probability

Let (Ω,F , P ) be a probability space and let G ⊂ F be a sub-σ-algebra.

Definition 12.3.1: For B ∈ F, the conditional probability of B given G, denoted by P(B|G), is defined as
$$ P(B|\mathcal{G}) = E(I_B|\mathcal{G}). \qquad (3.1) $$
Thus Z ≡ P(B|G) is a G-measurable function such that
$$ P(A \cap B) = E(Z I_A) \quad\text{for all } A \in \mathcal{G}. \qquad (3.2) $$
Since 0 ≤ P(A ∩ B) ≤ P(A) for all A ∈ G, it follows that 0 ≤ P(B|G) ≤ 1 w.p. 1. It is easy to check that, w.p. 1,
$$ P(\Omega|\mathcal{G}) = 1 \quad\text{and}\quad P(\emptyset|\mathcal{G}) = 0. $$
Also, by linearity, if B_1, B_2 ∈ F with B_1 ∩ B_2 = ∅, then
$$ P(B_1 \cup B_2|\mathcal{G}) = P(B_1|\mathcal{G}) + P(B_2|\mathcal{G}) \quad\text{w.p. 1}. $$
This suggests that w.p. 1, P(B|G) is countably additive as a set function in B; that is, that there exists a set A_0 ∈ G with P(A_0) = 0 such that for all ω ∉ A_0, the map B ↦ P(B|G)(ω) is countably additive. However, this is not true in general. Although for a given collection {B_n}_{n≥1} of disjoint sets in F there is an exceptional set A_0 with P(A_0) = 0 such that for ω ∉ A_0,
$$ P\Big(\bigcup_{n\ge 1} B_n \,\Big|\, \mathcal{G}\Big)(\omega) = \sum_{n=1}^{\infty} P(B_n|\mathcal{G})(\omega), $$
this A_0 depends on {B_n}_{n≥1}, and as the collection varies, the exceptional sets can form an uncountable collection whose union may not be contained in a set of probability zero.

Definition 12.3.2: Let (Ω, F, P) be a probability space and G a sub-σ-algebra of F. A function µ : F × Ω → [0, 1] is called a regular conditional probability on F given G if

(i) for all B ∈ F, µ(B, ·) = P(B|G) w.p. 1, and

(ii) for all ω ∈ Ω, µ(·, ω) is a probability measure on (Ω, F).

If a regular conditional probability (r.c.p.) µ(·, ·) exists on F given G, then the conditional expectation of Y given G can be computed as
$$ E(Y|\mathcal{G})(\omega) = \int Y(\omega')\,\mu(d\omega', \omega) \quad\text{w.p. 1} $$
for all Y such that E|Y| < ∞. The proof of this is via standard approximation using simple random variables (Problem 12.15).

A sufficient condition for the existence of an r.c.p. is provided by the following result.

Theorem 12.3.1: Let (Ω, F, P) be a probability space. Let S be a Polish space and S its Borel σ-algebra. Let X be an S-valued random variable on (Ω, F). Then for any σ-algebra G ⊂ F, there is a regular conditional probability on σ⟨X⟩ given G, where σ⟨X⟩ = {X^{-1}(D) : D ∈ S}.

Proof: (for S = R). Let Q = {r_j} be the set of rationals and let F(r_j, ω) = P(X ≤ r_j|G)(ω) w.p. 1. Then there is a set A_0 ∈ G such that P(A_0) = 0 and, for ω ∉ A_0, F(r, ω) is monotone nondecreasing on Q. For x ∈ R, set
$$ F(x, \omega) \equiv \begin{cases} \sup\{F(r, \omega) : r \in \mathbb{Q},\ r \le x\} & \text{if } \omega \notin A_0, \\ F_0(x) & \text{if } \omega \in A_0, \end{cases} $$
where F_0(x) is a fixed cdf (say, F_0 = Φ, the standard normal cdf). Then F(x, ω) is a cdf in x for each ω and, for each x, F(x, ·) is G-measurable.

Let µ(·, ω) be the Lebesgue-Stieltjes measure induced by F(·, ω). Then it can be checked using the π-λ theorem (Theorem 1.1.2) that µ(·, ·) is a regular conditional probability on σ⟨X⟩ given G (Problem 12.16).

Remark 12.3.1: When F = σ〈X〉, the regular conditional probability onF given G is also called the regular conditional probability distribution ofX given G.

Remark 12.3.2: For a proof for the general Polish case, see Durrett (2004)and Parthasarathy (1967).

12.4 Problems

12.1 Let (X, Y ) be a bivariate random vector with EY 2 < ∞. Let H =L2(R2,B(R), PX,Y ) and H0 = h(X) | h : R → R is Borel measurableand Eh(X)2 < ∞. Suppose that for some h(X) ∈ H0,

EY IA = Eh(X)IA for all A ∈ σ〈X〉.

Show that E(Y − h(X))Z1 = 0 for all Z1 ∈ H0. Show also that H0 isa closed subspace of H.

(Hint: For any Z1 ∈ H0, there exists a sequence of simple randomvariables Wnn≥1 ⊂ H0 such that |Wn| ≤ |Z1| and Wn → Z1 a.s.Now, apply the DCT. For the second part, use the fact that f : Ω → Ris σ〈X〉-measurable iff there is a Borel measurable function h : R → Rsuch that f = h(X).)


12.2 Let (X, Y ) be a bivariate random vector that has an absolutely con-tinuous distribution on (R2,B(R2)) w.r.t. the Lebesgue measure withdensity f(x, y). Suppose that E|Y | < ∞. Show that a version ofE(Y |X) is given by h0(X), where, with fX(x) =

∫f(x, y)dy,

h0(x) =

∫yf(x,y)dyfX(x) if fX(x) > 0

0 otherwise.

(Hint: Verify (1.8) for all A ∈ σ〈X〉.)

12.3 Let Z1 and Z2 be two random variables on a probability space(Ω,G, P ).

(a) Suppose that E|Z1| < ∞, E|Z2| < ∞ and

EZ1IA = EZ2IA for all A ∈ G. (4.1)

Show that P (Z1 = Z2) = 1.(b) Suppose that Z1 and Z2 are nonnegative and (4.1) holds. Show

that P (Z1 = Z2) = 1.(c) Prove Proposition 12.1.3.

(Hint: (a) Consider (4.1) with A1 = Z1 − Z2 > 0 and A2 =Z1 − Z2 < 0 and conclude that P (A1) = 0 = P (A2).(b) Let A1n = Z1 ≤ n, Z2 ≤ n, Z1 − Z2 > 0 and A2n = Z1 ≤n, Z2 ≤ n, Z1−Z2 < 0, n ≥ 1. Then, by (4.1), P (A1n) = 0 = P (A2n)for all n ≥ 1. But A1 =

⋃n≥1 Ain, i = 1, 2, . . . , where Ai’s are as

above.)

12.4 Prove Theorem 12.2.3.

12.5 Let Xi be a ki-dimensional random vector, ki ∈ N, i = 1, 2 such thatX1 and X2 are independent. Let h : Rk1+k2 → [0,∞) be a Borelmeasurable function. Show that

E(h(X1, X2) | X1

)= g(X1) (4.2)

where g(x) = Eh(x, X2), x ∈ Rk1 . Show that (4.2) is also valid for areal valued h with E

∣∣h(X1, X2)∣∣ < ∞.

(Hint: Let k = k1 + k2, Ω = Rk, F = B(Rk), P = PX1 ×PX2 . Verify(1.8) for all A ∈ A1 × Rk2 : A1 ∈ B(Rk1) ≡ σ〈X1〉.)

12.6 Let X be a random variable on a probability space (Ω,F , P ) withEX2 < ∞ and let G ⊂ F be a sub-σ-field.

(a) Show that for any A ∈ G,∣∣∣ ∫A

E(X|G)dP∣∣∣ ≤ (∫

A

E(X2|G)dP)1/2

. (4.3)


(b) Show that (4.3) is valid for all A ∈ F .

12.7 Let f : (Rk,B(Rk), P ) → (R,B(R)) be an integrable function where

P (A) = 2−k

∫A

exp(−

k∑i=1

|xi|)

dx1, . . . , dxk, A ∈ B(Rk).

For each of the following cases, find a version of E(f |G) and justifyyour answer:

(a) G = σ〈A ∈ B(Rk) : A = −A〉,(b) G = σ〈(j1, j1 + 1]× · · · × (jk, jk + 1] : j1, . . . , jk ∈ Z〉,(c) G = σ〈B × 0 : B ∈ B(Rk−1)〉.

12.8 Let (Ω,F , P ) be a probability space and G = ∅, B, Bc,Ω for someB ∈ F with P (B) ∈ (0, 1). Determine P (A|G) for A ∈ F .

12.9 Let Xn : n ∈ Z be a collection of independent random variableswith E|X0| < ∞. Show that

(a) E(X0 | X1, . . . , Xn) = EX0 for any n ∈ N,(b) E(X0 | X−n, . . . , X−1) = EX0 for any n ∈ N,(c) E(X0 | X1, X2, . . .) = EX0 = E(X0 | . . . , X−2, X−1).

12.10 Let X be a random variable on a probability space (Ω,F , P ) withE|X| < ∞ and let C be a π-system such that σ〈C〉 = G ⊂ F . Supposethat there exists a G-measurable function f : Ω → R such that∫

A

fdP =∫

A

XdP for all A ∈ C.

Show that f = E(X|G).

12.11 Let X and Y be integrable random variables on (Ω,F , P ) and let Cbe a semi-algebra, C ⊂ F . Suppose that

∫A

XdP ≤∫

AY dP for all

A ∈ C. Show thatE(X|G) ≤ E(Y |G)

where G = σ〈C〉.

12.12 Let X, Y ∈ L2(Ω,F , P ). If E(X|Y ) = Y and E(Y |X) = X, thenP (X = Y ) = 1.

(Hint: Show that E(X − Y )2 = EX2 − EY 2.)

12.13 Let Xnn≥1, X be a collection of random variables on (Ω,F , P ) andlet G be a sub-σ-algebra of F . If limn→∞ E|Xn −X|r = 0 for somer ≥ 1, then

limn→∞ E

∣∣E(Xn|G)− E(X|G)∣∣r = 0.


12.14 Let X, Y ∈ L2(Ω,F , P ) and let G be a sub-σ-algebra of F . Show that

EY E(X|G) = EXE(Y |G).

12.15 Let Y be an integrable random variable on (Ω,F , P ) and let µ be ar.c.p. on F given G. Show that h(ω) ≡

∫Y (ω1)µ(dω1, ω), ω ∈ Ω is a

version of E(Y |G).

(Hint: Prove this first for Y = IA, A ∈ F . Extend to simple functionsby linearity. Use the DCT for CE for the general case.)

12.16 Complete the proof of Theorem 12.3.1 for S = R.

12.17 Let (Ω,F , P ) be a probability space, G be a sub-σ-algebra of F , andlet Ann≥1 ⊂ F be a collection of disjoint sets. Show that

P

( ⋃n≥1

An|G)

=∞∑

n=1

P (An|G).

Definition 12.4.1: Let G be a σ-algebra and let Gλ : λ ∈ Λ be acollection of subsets of F in a probability space (Ω,F , P ). Then, Gλ : λ ∈Λ is called conditionally independent given G if for any λ1, . . . , λk ∈ Λ,k ∈ N,

P(A1 ∩ · · · ∩Ak|G

)=

k∏i=1

P (Ai|G)

for all A1 ∈ G1, . . . , Ak ∈ Gk. A collection of random variables Xλ : λ ∈ Λon (Ω,F , P ) is called conditionally independent given G if σ〈Xλ〉 : λ ∈ Λis conditionally independent given G.

12.18 Let G1,G2,G3 be three sub-σ-algebras of F . Recall that Gi ∨ Gj =σ〈Gi ∪ Gj〉, 1 ≤ i = j ≤ 3. Show that G1 and G2 are conditionallyindependent given G3 iff

P (A1|G2 ∨ G3) = P (A1|G3) for all A1 ∈ G,

iff E(X|G2 ∨ G3) = E(X|G3)

for every X ∈ L1(Ω,G1 ∨ G3, P ).

12.19 Let G1,G2,G3 be sub-σ-algebra of F . Show that if G1 ∨ G3 is inde-pendent of G2, then G1 and G2 are conditionally independent givenG3.

12.20 Give an example where

E(E(Y |X1) | X2

)= E

(E(Y |X2) | X1

).


12.21 Let X be an Exponential (1) random variable. For t > 0, let Y1 =minX, t and Y2 = maxX, t. Find E(X|Yi) i = 1, 2.

(Hint: Verify that σ〈Y1〉 is the σ-algebra generated by the collectionX−1(A) : A ∈ B(R), A ⊂ [0, t) ∪ X−1[t,∞).)

12.22 Let (X, Y ) be a bivariate random vector with a joint pdf w.r.t. theLebesgue measure f(x, y). Show that E(X|X +Y ) = h(X +Y ) where

h(z) =(∫

xf(x, z − x)dx∫f(x, z − x)dx

)I(0,∞)

(∫f(x, z − x)dx

).

12.23 Let Xii≥1 be iid random variables with E|X1| < ∞. Show that forany n ≥ 1,

E(X1 | (X1 + X2 + · · ·+ Xn)

)=

X1 + X2 + · · ·+ Xn

n.

(Hint: Show that E(Xi | (X1 + · · ·+Xn)

)is the same for all 1 ≤ i ≤

n.)

Definition 12.4.2: A finite collection of random variables Xi : 1 ≤i ≤ n on a probability space (Ω,F , P ) is said to be exchangeable if forany permutation (i1, i2, . . . , in) of (1, 2, . . . , n), the joint distribution of(Xi1 , Xi2 , . . . , Xin

) is the same as that of (X1, X2, . . . , Xn). A sequenceof radom variables Xii≥1 on a probability space (Ω,F , P ) is said to beexchangeable if for any finite n, the collection Xi : 1 ≤ i ≤ n is exchange-able.

12.24 Let Xi : 1 ≤ i ≤ n+1 be a finite collection of random variables suchthat conditional Xn+1, X1, X2, . . . , Xn are iid. Show that Xi : 1 ≤i ≤ n is exchangeable.

12.25 Let Xi : 1 ≤ i ≤ n be exchangeable. Suppose E|X1| < ∞. Showthat

E(X1 | (X1 + · · ·+ Xn)

)=

X1 + X2 + · · ·+ Xn

n.

12.26 Let (X1, X2, X3) be random variables such that

P (X2 ∈ · | X1) = p1(X1, ·) andP (X3 ∈ · | X1, X2) = p2(X2, ·)

where for each i = 1, 2, pi(x, ·) is a probability transition function onR as defined in Example 6.3.8. Suppose pi(x, ·) admits a pdf fi(x, ·)i = 1, 2, . . .. Show that

P (X1 ∈ · | X2, X3) = P (X1 ∈ · | X2).

(This says that if X1, X2, X3 has the Markov property, then so doesX3, X2, X1.)


12.27 (Rao-Blackwell theorem). Let Y ∈ L2(Ω,F , P ) and G ⊂ F be a sub-σ-algebra. Show that there exists Z ∈ L2(Ω,G, P ) such that EZ =EY and Var(Z) ≤ Var(Y ).

(Hint: Consider Z = E(Y |G).)

12.28 Let (X, Y ) have an absolutely continuous bivariate distribution withdensity fX,Y (x, y). Show that there is a regular conditional probabil-ity on σ〈Y 〉 given σ〈X〉 and that this probability measure induces anabsolutely continuous distribution on R. Find its density.

12.29 Suppose, in the above problem,

fX,Y(x, y) = (1/σ(x)) φ( (y − m(x))/σ(x) ) g(x),

where m(·), σ(·), φ(·), and g(·) are all Borel measurable functions on R to R with σ, φ, and g being nonnegative and φ and g being probability densities.

(a) Find the marginal probability densities fX(·) and fY(·) of X and Y, respectively. Set up the integrals for EX and EY.

(b) Using the conditioning argument in Proposition 12.1.5, show that

EY = ∫ m(x) g(x) dx + ( ∫ u φ(u) du )( ∫ σ(x) g(x) dx )

(assuming that all the integrals are well defined).

(c) Find similar expressions for EY² and E(e^{tY}).

12.30 Let X, Y , Z ∈ L1(Ω,F , P ). Suppose that

E(X|Y ) = Z, E(Y |Z) = X, E(Z|X) = Y.

Show that X = Y = Z w.p. 1.

12.31 Let X, Y ∈ L2(Ω, F, P). Suppose E|Y|⁴ < ∞. Show that

min{ E|X − (a + bY + cY²)|² : a, b, c ∈ R }
= max{ E(XZ) : Z ∈ L2(Ω, F, P), EZ = 0, E(ZY) = 0, E(ZY²) = 0, EZ² = 1 }.

12.32 Let X ∈ L2(Ω, F, P) and let G be a sub-σ-algebra of F.

(a) Show that

min{ E(X − Y)² : Y ∈ L2(Ω, G, P) } = max{ (E(XZ))² : EZ² = 1, E(Z | G) = 0 }.

(b) Find a random variable Z such that E(Z | G) = 0 w.p. 1 and ρ ≡ corr(X, Z) is maximized.


13 Discrete Parameter Martingales

13.1 Definitions and examples

This section deals with a class of stochastic processes called martingales. Martingales arise in a natural way in many problems in probability and statistics. They provide a framework, more general than that of independent random variables, in which results such as the SLLN, the CLT, and other convergence theorems can be established. Much of the discrete parameter martingale theory was developed by the great American mathematician J. L. Doob, whose book (Doob (1953)) has been very influential.

Definition 13.1.1: Let (Ω, F, P) be a probability space and let N = {1, . . . , n0}, n0 ≤ ∞, be a nonempty subset of the natural numbers {1, 2, . . .}.

(a) A collection {Fn : n ∈ N} of sub-σ-algebras of F is called a filtration if Fn ⊂ Fn+1 for all 1 ≤ n < n0.

(b) A collection of random variables {Xn : n ∈ N} is said to be adapted to the filtration {Fn : n ∈ N} if Xn is Fn-measurable for all n ∈ N.

(c) Given a filtration {Fn : n ∈ N} and random variables {Xn : n ∈ N}, the collection {(Xn, Fn) : n ∈ N} is called a martingale if

(i) {Xn : n ∈ N} is adapted to {Fn : n ∈ N},
(ii) E|Xn| < ∞ for all n ∈ N, and
(iii) for all 1 ≤ n < n0,

E(Xn+1 | Fn) = Xn. (1.1)


When N = N, there is no maximum element in N. In this case, Definition 13.1.1 is to be interpreted by setting n0 = +∞ in parts (a) and (c)(iii). A similar convention applies to Definition 13.1.2 below. Also, recall that equalities and inequalities involving conditional expectations are interpreted as holding w.p. 1.

If {(Xn, Fn) : n ∈ N} is a martingale, then {Xn : n ∈ N} is also said to be a martingale w.r.t. (the filtration) {Fn : n ∈ N}. Also, {Xn : n ∈ N} is called a martingale if it is a martingale w.r.t. some filtration. Observe that if {Xn : n ∈ N} is a martingale w.r.t. any given filtration {Fn : n ∈ N}, it is also a martingale w.r.t. the natural filtration {Xn : n ∈ N}, where Xn = σ〈X1, . . . , Xn〉, n ∈ N. Clearly, {Xn : n ∈ N} is adapted to {Xn : n ∈ N}. To see that E(Xn+1 | Xn) = Xn for all 1 ≤ n < n0, note that Xn ⊂ Fn for all n ∈ N and hence,

E(Xn+1 | Xn) = E( E(Xn+1 | Fn) | Xn ) = E(Xn | Xn) = Xn. (1.2)

Thus, (Xn,Xn) : n ∈ N is a martingale.A classic interpretation of martingales in the context of gambling is given

as follows. Let Xn represent the fortune of a gambler at the end of thenth play and let Fn be the information available to the gambler up toand including the nth play. Then, Fn contains the knowledge of all eventslike Xj ≤ r for r ∈ R, j ≤ n, making Xn measurable w.r.t. Fn. AndCondition (iii) in Definition 13.1.1 (c) says that given all the informationup until the end of the nth play, the expected fortune of the gambler at theend of the (n+1)th play remains unchanged. Thus a martingale representsa fair game. In situations where the game puts the gambler in a favorable orunfavorable position, one may express that by suitably modifying condition(iii), yielding what are known as sub- and super-martingales, respectively.

Definition 13.1.2: Let {Fn : n ∈ N} be a filtration and {Xn : n ∈ N} be a collection of random variables in L1(Ω, F, P) adapted to {Fn : n ∈ N}. Then {(Xn, Fn) : n ∈ N} is called a sub-martingale if

E(Xn+1 | Fn) ≥ Xn for all 1 ≤ n < n0, (1.3)

and a super-martingale if

E(Xn+1 | Fn) ≤ Xn for all 1 ≤ n < n0. (1.4)

Suppose that {(Xn, Fn) : n ∈ N} is a sub-martingale. Then A ∈ Fn implies that A ∈ Fn+1 ⊂ . . . ⊂ Fn+k for every k ≥ 1, n + k ∈ N, and hence, by (1.3),

∫_A Xn dP ≤ ∫_A E(Xn+1 | Fn) dP = ∫_A Xn+1 dP ≤ · · · ≤ ∫_A Xn+k dP. (1.5)

Therefore, E(Xn+k | Fn) ≥ Xn and, by taking A = Ω in (1.5), EXn+k ≥ EXn. Thus, the expected values of a sub-martingale are nondecreasing. For a martingale, by (1.2), equality holds at every step of (1.5) and hence,

E(Xn+k | Fn) = Xn, EXn+k = EXn (1.6)

for all k ≥ 1, n, n + k ∈ N. Thus, in a fair game, the expected fortune of the gambler remains constant over time.

Here are some examples.

Example 13.1.1: (Random walk). Let Z1, Z2, . . . be a sequence of iid random variables on a probability space (Ω, F, P) with finite mean µ = EZ1 and let Fn = σ〈Z1, . . . , Zn〉, n ≥ 1. Let Xn = Z1 + · · · + Zn, n ≥ 1. Then, for all n ≥ 1, σ〈Xn〉 ⊂ Fn and E|Xn| < ∞. Also,

E(Xn+1 | Fn) = E( Z1 + · · · + Zn+1 | Z1, . . . , Zn )
= Z1 + · · · + Zn + EZn+1 (by independence)
= Xn + µ,

so that E(Xn+1 | Fn) = Xn if µ = 0, > Xn if µ > 0, and < Xn if µ < 0.

Thus, {(Xn, Fn) : n ≥ 1} is a martingale if µ = 0, a sub-martingale if µ ≥ 0, and a super-martingale if µ ≤ 0.

Example 13.1.2: (Random walk, continued). Let {Zn}n≥1 and {Fn}n≥1 be as in Example 13.1.1 and let EZ1² < ∞. Let Yn = ∑_{i=1}^{n} (Zi − µ)² and Ỹn = Yn − nσ², where σ² = Var(Z1). Then, check that {(Yn, Fn) : n ≥ 1} is a sub-martingale and {(Ỹn, Fn) : n ≥ 1} is a martingale (Problem 13.3).

Example 13.1.3: (Doob’s martingale). Let Z be an integrable randomvariable and let Fn : n ≥ 1 be a filtration both defined on a probabilityspace (Ω,F , P ). Define

Xn = E(Z|Fn), n ≥ 1. (1.7)

Then, clearly, Xn is integrable and Fn-measurable for all n ≥ 1. Also,

E(Xn+1|Fn) = E(E(Z|Fn+1) | Fn)= E(Z|Fn) = Xn.


Thus, (Xn,Fn) : n ≥ 1 is a martingale.

Example 13.1.4: (Generation of a martingale from a given sequence of random variables). Let {Yn}n≥1 ⊂ L1(Ω, F, P) be an arbitrary collection of integrable random variables and let Fn = σ〈Y1, . . . , Yn〉, n ≥ 1. For n ≥ 1, define

Xn = ∑_{j=1}^{n} [ Yj − E(Yj | Fj−1) ], (1.8)

where F0 ≡ {∅, Ω}.

Then, for each n ≥ 1, Xn is integrable and Fn-measurable. Also, for n ≥ 1,

E(Xn+1 | Fn) = ∑_{j=1}^{n+1} E( [Yj − E(Yj | Fj−1)] | Fn )
= ∑_{j=1}^{n} [Yj − E(Yj | Fj−1)] + [E(Yn+1 | Fn) − E(Yn+1 | Fn)]
= Xn.

Hence {(Xn, Fn) : n ≥ 1} is a martingale. Thus, one can construct a martingale sequence starting from an arbitrary sequence of integrable random variables. When {Yn}n≥1 are iid with EY1 = 0, (1.8) yields Xn = ∑_{j=1}^{n} Yj and one gets the martingale sequence of Example 13.1.1.

Example 13.1.5: (Branching process). Let {ξnk : n ≥ 1, k ≥ 1} be a double array of iid nonnegative integer valued random variables with Eξnk = µ ∈ (0, ∞). One may think of ξnk as the number of offspring of the kth individual at time n in an evolving population. Let Zn denote the size of the population at time n, n ≥ 0. If one considers the evolution of the population starting with a single individual at time n = 0, then

Z0 = 1 and Zn = ∑_{k=1}^{Zn−1} ξnk, n ≥ 1.

The sequence Z0, Z1, . . . is called a branching process (cf. Chapter 18).

Let Fn = σ〈Z1, . . . , Zn〉, n ≥ 1. Then, for n ≥ 1,

E(Zn+1 | Fn) = E( ∑_{k=1}^{Zn} ξn+1,k | Zn ) = µ Zn (1.9)

and therefore, {(Zn, Fn) : n ≥ 1} is a martingale, sub-martingale or super-martingale according as µ = 1, µ ≥ 1 or µ ≤ 1. One can define a new


sequence {Xn}n≥1 from {Zn}n≥1 such that {Xn}n≥1 is a martingale w.r.t. {Fn} for all values of µ ∈ (0, ∞). Let

Xn = µ^{−n} Zn, n ≥ 1. (1.10)

Then, using (1.9), it is easy to check that {(Xn, Fn) : n ≥ 1} is a martingale.
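A quick sketch of Example 13.1.5 (not from the text; Poisson(µ) offspring is a convenient illustrative choice because the total offspring of Zn−1 individuals is then Poisson(µ Zn−1)): sample averages of Xn = µ^{−n} Zn stay near EXn = 1, as the martingale property requires.

    import numpy as np

    rng = np.random.default_rng(2)
    mu, n_gen, reps = 1.5, 12, 20_000

    Z = np.ones(reps, dtype=np.int64)        # Z_0 = 1 in every replicate
    for n in range(1, n_gen + 1):
        # the sum of Z_{n-1} iid Poisson(mu) offspring counts is Poisson(mu * Z_{n-1})
        Z = rng.poisson(mu * Z)
        X = Z / mu**n                        # X_n = mu^{-n} Z_n
        print(n, X.mean())                   # stays near E X_n = 1, up to Monte Carlo error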

Example 13.1.6: (Likelihood ratio). Let Y1, Y2, . . . be a collection of random variables on a probability space (Ω, F, P) and let Fn = σ〈Y1, . . . , Yn〉, n ≥ 1. Let Q be another probability measure on F. Suppose that under both P and Q, the joint distributions of (Y1, . . . , Yn) are absolutely continuous w.r.t. the Lebesgue measure λn on Rⁿ, n ≥ 1. Denote the corresponding densities by pn(y1, . . . , yn) and qn(y1, . . . , yn), n ≥ 1, and for simplicity, suppose that pn(y1, . . . , yn) is everywhere positive. Then, a likelihood ratio for discriminating between the probability measures P and Q on the basis of the observations Y1, . . . , Yn is given by

Xn = qn(Y1, . . . , Yn)/pn(Y1, . . . , Yn), n ≥ 1.

A higher value of Xn is supposed to provide "evidence" in favor of Q (over P) as the "true" probability measure determining the distribution of (Y1, . . . , Yn). It will now be shown that {Xn}n≥1 is a martingale w.r.t. {Fn}, n ≥ 1, under P. Clearly, Xn is Fn-measurable for all n ≥ 1 and

∫ |Xn| dP = ∫_Ω [qn(Y1, . . . , Yn)/pn(Y1, . . . , Yn)] dP
= ∫_{Rⁿ} [qn(y1, . . . , yn)/pn(y1, . . . , yn)] pn(y1, . . . , yn) dλn
= ∫_{Rⁿ} qn(y1, . . . , yn) dλn = Q(Ω) = 1 < ∞,

so that Xn is integrable (w.r.t. P) for all n ≥ 1. Noting that the sets in the σ-algebra Fn are given by {(Y1, . . . , Yn) ∈ B} for B ∈ B(Rⁿ), one has, for any n ≥ 1,

∫_{{(Y1,...,Yn)∈B}} Xn+1 dP = ∫_{{(Y1,...,Yn+1)∈B×R}} [qn+1(Y1, . . . , Yn+1)/pn+1(Y1, . . . , Yn+1)] dP
= ∫_{B×R} [qn+1(y1, . . . , yn+1)/pn+1(y1, . . . , yn+1)] pn+1(y1, . . . , yn+1) dλn+1
= ∫_B qn(y1, . . . , yn) dλn
= ∫_B [qn(y1, . . . , yn)/pn(y1, . . . , yn)] pn(y1, . . . , yn) dλn
= ∫_{{(Y1,...,Yn)∈B}} Xn dP,

implying that E(Xn+1 | Fn) = Xn, n ≥ 1. This shows that {(Xn, Fn) : n ≥ 1} is a martingale under P for any arbitrary sequence of random variables {Yn}n≥1. However, {(Xn, Fn) : n ≥ 1} may not be a martingale under Q.
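A minimal simulation of Example 13.1.6 (not from the text; taking p the N(0,1) density and q the N(θ,1) density is an illustrative assumption): under P the likelihood ratio Xn has E_P Xn = 1 for every n, even though its typical value decays.

    import numpy as np

    rng = np.random.default_rng(3)
    theta, n_max, reps = 0.5, 25, 100_000
    Y = rng.normal(0.0, 1.0, size=(reps, n_max))    # data generated under P (standard normal)

    # log of q(y)/p(y) for q = N(theta, 1), p = N(0, 1) is theta*y - theta^2/2
    log_ratio = theta * Y - 0.5 * theta ** 2
    X = np.exp(np.cumsum(log_ratio, axis=1))        # X_n = prod_{i<=n} q(Y_i)/p(Y_i)

    for n in (1, 10, 25):
        print(n, X[:, n - 1].mean())                # each close to 1: martingale under P
    print(np.median(X[:, -1]))                      # but the typical value of X_25 is far below 1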

Example 13.1.7: (Radon-Nikodym derivatives). Let Ω = (0, 1], F = B((0, 1]), the Borel σ-algebra on (0, 1], and let P denote the Lebesgue measure on (0, 1]. For n ≥ 1, let Fn be the σ-algebra generated by the partition {((k − 1)2^{−n}, k2^{−n}] : k = 1, . . . , 2ⁿ} of (0, 1] by dyadic intervals. Let ν be a finite measure on (Ω, F). Let νn be the restriction of ν to Fn and Pn be the restriction of P to Fn, for each n ≥ 1. As Fn consists of all disjoint unions of the intervals ((k − 1)2^{−n}, k2^{−n}], 1 ≤ k ≤ 2ⁿ, Pn(A) = 0 for A ∈ Fn iff A = ∅. Consequently, νn(A) = 0 whenever Pn(A) = 0, A ∈ Fn, i.e., νn ≪ Pn. Let Xn denote the Radon-Nikodym derivative of νn w.r.t. Pn, given by

Xn = ∑_{k=1}^{2ⁿ} [ ν( ((k − 1)2^{−n}, k2^{−n}] ) 2ⁿ ] I_{((k−1)2^{−n}, k2^{−n}]}. (1.11)

Clearly, Xn is Fn-measurable and P-integrable. It is easy to check that E(Xn+1 | Fn) = Xn for all n ≥ 1. Hence {(Xn, Fn) : n ≥ 1} is a martingale on (Ω, F, P). Note that the absolute continuity of νn w.r.t. Pn (on Fn) holds for each 1 ≤ n < ∞ even though the measure ν may not be absolutely continuous w.r.t. P on F = B((0, 1]).

Proposition 13.1.1: (Convex functions of martingales and sub-martingales). Let φ : R → R be a convex function and let N =1, 2, . . . n0 ⊂ N be a nonempty subset.

(i) If (Xn,Fn) : n ∈ N is a martingale with E|φ(Xn)| < ∞ for alln ∈ N , then (φ(Xn),Fn) : n ∈ N is a sub-martingale.

(ii) If (Xn,Fn) : n ∈ N is a sub-martingale, E|φ(Xn)| < ∞ for all n ∈N , and in addition, φ is nondecreasing, then (φ(Xn),Fn) : n ∈ Nis a sub-martingale.

Proof: By the conditional Jensen’s inequality (Theorem 12.2.4) for all1 ≤ n < n0,

E(φ(Xn+1)|Fn) ≥ φ(E(Xn+1|Fn)). (1.12)

Parts (i) and (ii) follow from (1.12) on using the martingale and sub-martingale properties of Xnn∈N , respectively.

Proposition 13.1.2: (Doob's decomposition of a sub-martingale). Let {(Xn, Fn) : n ∈ N} be a sub-martingale for some N = {1, . . . , n0} ⊂ N. Then, there exist two sets of random variables {Yn : n ∈ N} and {Zn : n ∈ N} satisfying Xn = Yn + Zn, n ∈ N, such that


(i) {(Yn, Fn) : n ∈ N} is a martingale,

(ii) for all n ∈ N, Zn+1 ≥ Zn ≥ 0 w.p. 1 and Zn is Fn−1-measurable, where F0 = {∅, Ω}, and

(iii) if {Xn : n ∈ N} are L1-bounded, i.e., M ≡ max{E|Xn| : n ∈ N} < ∞, then so are {Yn : n ∈ N} and {Zn : n ∈ N}.

Proof: Define the difference variables ∆n by

∆1 = X1, and ∆n = Xn − Xn−1, n ≥ 2, n ∈ N.

Note that Xn = ∑_{j=1}^{n} ∆j, n ∈ N, and E(∆n | Fn−1) ≥ 0 for all n ≥ 2, n ∈ N. Now, set

Y1 = ∆1, Yn = Xn − ∑_{j=2}^{n} E(∆j | Fj−1), n ≥ 2, n ∈ N,

and

Z1 = 0, Zn = ∑_{j=2}^{n} E(∆j | Fj−1), n ≥ 2, n ∈ N.

Check that the requirements (i) and (ii) above hold. To prove the L1-boundedness, notice that by (ii), for any n ≥ 2, n ∈ N,

E|Zn| = EZn = E[ ∑_{j=2}^{n} E(∆j | Fj−1) ] = ∑_{j=2}^{n} E(∆j) = EXn − EX1 ≤ 2M. (1.13)

Also, Xn = Yn + Zn for all n ∈ N implies that

|Yn| ≤ |Xn| + |Zn|, n ∈ N. (1.14)

Hence, (iii) follows.

13.2 Stopping times and optional stopping theorems

In the following (and elsewhere), 'n ≥ 1' is used as an alternative for the statement 'n ∈ N', or equivalently for 1 ≤ n < ∞.

Definition 13.2.1: Let (Ω, F, P) be a probability space, {Fn}n≥1 be a filtration, and T be an F-measurable random variable taking values in the set N̄ ≡ N ∪ {∞} = {1, 2, . . .} ∪ {∞}.


(a) T is called a stopping time w.r.t. {Fn}n≥1 if

{T = n} ∈ Fn for all n ≥ 1. (2.1)

(b) T is called a finite or proper stopping time w.r.t. {Fn}n≥1 (under P) if

P(T < ∞) = 1. (2.2)

Given a filtration {Fn}n≥1, define the σ-algebra

F∞ = σ〈 ⋃_{n≥1} Fn 〉. (2.3)

Since {T = +∞}ᶜ = ⋃_{n≥1} {T = n} ∈ F∞, (2.1) is equivalent to '{T = n} ∈ Fn for all 1 ≤ n ≤ ∞'. It is also easy to check (Problem 13.7) that T is a stopping time w.r.t. {Fn}n≥1 iff

{T ≤ n} ∈ Fn for all n ≥ 1 (2.4)

iff {T > n} ∈ Fn for all n ≥ 1. However, the condition

'{T ≥ n} ∈ Fn for all n ≥ 1' (2.5)

does not always imply that T is a stopping time w.r.t. {Fn}n≥1 (cf. Problem 13.7). Note that for a stopping time T w.r.t. {Fn}n≥1,

{T ≥ n} = {T ≤ n − 1}ᶜ ∈ Fn−1 for n ≥ 2,

and {T ≥ 1} = Ω. Since Fn−1 ⊂ Fn for all n ≥ 2, (2.5) is a weaker requirement than T being a stopping time w.r.t. {Fn}n≥1.

Proposition 13.2.1: Let T be a stopping time w.r.t. {Fn}n≥1 and let F∞ be as in (2.3). Define

FT = { A ∈ F∞ : A ∩ {T = n} ∈ Fn for all n ≥ 1 }. (2.6)

Then, FT is a σ-algebra.

Proof: Left as an exercise (Problem 13.8).

If T is a stopping time w.r.t. {Fn}n≥1, then for any m ∈ N, {T = m} ∩ {T = n} = ∅ ∈ Fn for all n ≠ m, and {T = m} ∩ {T = n} = {T = m} ∈ Fm for n = m. Thus, {T = m} ∈ FT for all m ≥ 1 and hence, σ〈T〉 ⊂ FT. But the reverse inclusion may not hold, as shown below.

Example 13.2.1: Let T ≡ m for some given integer m ≥ 1. Then, T is a stopping time w.r.t. any filtration {Fn}n≥1. For this T,

A ∈ FT ⇒ A ∩ {T = m} ∈ Fm ⇒ A ∈ Fm,

so that FT ⊂ Fm. Conversely, suppose A ∈ Fm. Then, A ∩ {T = n} = ∅ ∈ Fn for all n ≠ m, and A ∩ {T = m} = A ∈ Fm for n = m. Thus, Fm = FT. But σ〈T〉 = {Ω, ∅}.

Example 13.2.2: Let {Xn}n≥1 be a sequence of random variables adapted to a filtration {Fn}n≥1 and let {Bn}n≥1 be a sequence of Borel sets in R. Define the random variable

T = inf{ n ≥ 1 : Xn ∈ Bn }. (2.7)

Then, T(ω) < ∞ if Xn(ω) ∈ Bn for some n ∈ N and T(ω) = +∞ if Xn(ω) ∉ Bn for all n ∈ N. Since, for any n ≥ 1,

{T = n} = { X1 ∉ B1, . . . , Xn−1 ∉ Bn−1, Xn ∈ Bn } ∈ Fn,

T is a stopping time w.r.t. {Fn}n≥1. Now, define a new random variable XT by

XT = Xm if T = m, and XT = lim sup_{n→∞} Xn if T = ∞. (2.8)

Then, XT ∈ R̄ and for any n ≥ 1 and r ∈ R,

{XT ≤ r} ∩ {T = n} = {Xn ≤ r} ∩ {T = n} ∈ Fn.

Also, {XT = ±∞} ∩ {T = n} = {Xn = ±∞} ∩ {T = n} = ∅ for all n ≥ 1. Hence, it follows that XT is 〈FT, B(R̄)〉-measurable.

Example 13.2.3: Let {Yn}n≥1 be a sequence of iid random variables with EY1 = µ. Let Xn = Y1 + · · · + Yn, n ≥ 1, denote the random walk corresponding to {Yn}n≥1. For x > 0, let

Tx = inf{ n ≥ 1 : Xn > nµ + x√n }. (2.9)

Then, Tx is the first time the sequence of partial sums {Xn}n≥1 exceeds the level nµ + x√n, and it is a special case of (2.7) with Bn = (nµ + x√n, ∞), n ≥ 1. Consequently, Tx is a stopping time w.r.t. Fn = σ〈Y1, Y2, . . . , Yn〉, n ≥ 1. Note that if EY1² < ∞, by the law of the iterated logarithm (cf. Section 8.7),

lim sup_{n→∞} (Xn − nµ)/√(2σ²n log log n) = 1 w.p. 1,

i.e., Xn > nµ + C√(n log log n) infinitely often w.p. 1

for some constant C > 0. Thus, P(Tx < ∞) = 1 and hence, Tx is a finite stopping time. This random variable Tx arises in sequential probability ratio tests (SPRT) for testing hypotheses on the mean of a (normal) population. See Woodroofe (1982), Chapter 3.


Definition 13.2.2: Let {Fn}n≥0 be a filtration in a probability space (Ω, F, P). A betting sequence w.r.t. {Fn}n≥0 is a sequence {Hn}n≥1 of nonnegative random variables such that for each n ≥ 1, Hn is Fn−1-measurable. The following result says that there is no betting scheme that can beat a gambling system, i.e., convert a fair one into a favorable one or the other way around.

Theorem 13.2.2: (Betting theorem). Let {Fn}n≥0 be a filtration in a probability space. Let {Hn}n≥1 be a betting sequence w.r.t. {Fn}n≥0. For an adapted sequence {Xn, Fn}n≥0, let {Yn}n≥0 be defined by Y0 = X0, Yn = Y0 + ∑_{j=1}^{n} (Xj − Xj−1) Hj, n ≥ 1. Let E|(Xj − Xj−1) Hj| < ∞ for j ≥ 1. Then,

(i) {Xn, Fn}n≥0 a martingale ⇒ {Yn, Fn}n≥0 is also a martingale,

(ii) {Xn, Fn}n≥0 a sub-martingale ⇒ {Yn, Fn}n≥0 is also a sub-martingale.

Proof: Clearly, for all n ≥ 1, E|Yn| < ∞ and Yn is Fn-measurable. Further,

E(Yn+1 | Fn) = Yn + E( (Xn+1 − Xn) Hn+1 | Fn ) = Yn + Hn+1 E( (Xn+1 − Xn) | Fn ),

since Hn+1 is Fn-measurable. Now the theorem follows from the defining properties of {Xn, Fn}n≥0.
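A sketch of Theorem 13.2.2 (not from the text; the "double after a loss" rule is an arbitrary predictable strategy chosen for illustration): multiplying the increments of a fair ±1 game by bets that depend only on the past leaves the mean wealth unchanged.

    import numpy as np

    rng = np.random.default_rng(4)
    n_steps, reps = 100, 20_000
    xi = rng.choice([-1.0, 1.0], size=(reps, n_steps))   # fair +/-1 increments; X_n = xi_1 + ... + xi_n
    X = np.cumsum(xi, axis=1)

    # a predictable betting sequence: H_1 = 1, and for j >= 2 the bet H_j
    # depends only on the outcome of play j-1 (double the stake after a losing step)
    H = np.ones((reps, n_steps))
    H[:, 1:] = np.where(xi[:, :-1] < 0, 2.0, 1.0)

    Y = np.cumsum(H * xi, axis=1)                        # Y_n = sum_{j<=n} H_j (X_j - X_{j-1})
    print(X[:, -1].mean(), Y[:, -1].mean())              # both near 0: the transformed game is still fair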

The above theorem leads to the following results, known as Doob's optional stopping theorems.

Theorem 13.2.3: (Doob's optional stopping theorem I). Let {Xn, Fn}n≥0 be a sub-martingale. Let T be a stopping time w.r.t. {Fn}n≥0. Let X̃n ≡ X_{T∧n}, n ≥ 0. Then {X̃n, Fn}n≥0 is also a sub-martingale and hence EX̃n ≥ EX0 for all n ≥ 1.

Proof: For any A ∈ B(R) and n ≥ 0,

X̃n⁻¹(A) = {ω : X̃n ∈ A} = ( ⋃_{j=1}^{n} {ω : Xj ∈ A, T = j} ) ∪ {ω : Xn ∈ A, T > n}.

Since T is a stopping time w.r.t. {Fn}n≥0, the right side above belongs to Fn for each n ≥ 0. Next, |X̃n| ≤ ∑_{j=1}^{n} |Xj| and hence E|X̃n| < ∞.

Finally, let Hj = 1 if j ≤ T and 0 if j > T. Since for all j ≥ 1, {ω : Hj = 1} = {ω : T ≤ j − 1}ᶜ ∈ Fj−1, {Hj}j≥1 is a betting sequence w.r.t. {Fn}n≥0. Also, X̃n = X0 + ∑_{j=1}^{n} (Xj − Xj−1) Hj. Now the betting theorem (Theorem 13.2.2) implies the present theorem.


Remark 13.2.1: If {Xn, Fn}n≥0 is a martingale, then both {Xn, Fn}n≥0 and {−Xn, Fn}n≥0 are sub-martingales, and hence the above theorem implies that if {Xn, Fn}n≥0 is a martingale, then so is {X̃n, Fn}n≥0, and hence EX̃n = EX_{T∧n} = EX0 = EXn for all n ≥ 1.

This suggests the question: if P(T < ∞) = 1, then on letting n → ∞, does EX̃n → EXT? Consider the following example. Let {Xn}n≥0 denote the symmetric simple random walk on the integers with X0 = 0. Let T = inf{n : n ≥ 1, Xn = 1}. Then P(T < ∞) = 1 and EX̃n = EX_{T∧n} = EX0 = 0, but XT = 1 w.p. 1, so EX̃n does not converge to EXT = 1. So, clearly, some additional hypothesis is needed.
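A numerical companion to this example (a sketch, not from the text): for the symmetric simple random walk stopped at the first visit to 1, the means E X_{T∧n} stay at 0 for every n, while X_T = 1 on the paths that have already stopped.

    import numpy as np

    rng = np.random.default_rng(5)
    n_max, reps = 1000, 20_000
    steps = rng.integers(0, 2, size=(reps, n_max), dtype=np.int8) * 2 - 1   # +/-1 with prob 1/2
    S = np.cumsum(steps, axis=1, dtype=np.int32)                            # symmetric simple random walk

    hit = (S == 1)
    # T = first n with S_n = 1; paths that have not hit 1 by n_max get T = n_max + 1
    T = np.where(hit.any(axis=1), hit.argmax(axis=1) + 1, n_max + 1)

    for n in (10, 100, 1000):
        stopped = S[np.arange(reps), np.minimum(T, n) - 1]   # X_{T ∧ n}
        print(n, stopped.mean())                             # stays near E X_0 = 0 for every n
    print((T <= n_max).mean())                               # most paths have already reached 1, where X_T = 1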

Theorem 13.2.4: (Doob's optional stopping theorem II). Let {Xn, Fn}n≥0 be a martingale. Let T be a stopping time w.r.t. {Fn}n≥0. Suppose P(T < ∞) = 1 and there is a 0 < K < ∞ such that for all n ≥ 0,

|X_{T∧n}| ≤ K w.p. 1.

Then EXT = EX0.

Proof: Since P(T < ∞) = 1, X_{T∧n} → XT w.p. 1 and |XT| ≤ K < ∞, and hence E|XT| < ∞. Thus, E|XT − X_{T∧n}| ≤ 2K P(T > n) → 0.

Remark 13.2.2: Since

X_{T∧n} = XT I(T ≤ n) + Xn I(T > n)

and, using the martingale property and {T = j} ∈ Fj,

E[ XT I(T ≤ n) ] = ∑_{j=0}^{n} E(Xj : T = j) = ∑_{j=0}^{n} E(Xn : T = j) = E(Xn : T ≤ n) = EX0 − E(Xn : T > n),

it follows that if E( |Xn| I(T > n) ) → 0 as n → ∞ and P(T < ∞) = 1, then EXT = EX0.

A stronger version of Doob’s optional stopping theorem is given belowin Theorem 13.2.6.

Proposition 13.2.5: Let S and T be two stopping times w.r.t. {Fn}n≥1 with S ≤ T. Then, FS ⊂ FT.

Proof: For any A ∈ FS and n ≥ 1,

A ∩ {T = n} = A ∩ {T = n} ∩ {S ≤ n} = [ ⋃_{k=1}^{n} A ∩ {S = k} ] ∩ {T = n} ∈ Fn,

since A ∩ {S = k} ∈ Fk for all 1 ≤ k ≤ n. Thus, A ∈ FT, proving the result.

Theorem 13.2.6: (Doob's optional stopping theorem III). Let {Xn, Fn}n≥1 be a sub-martingale and let S and T be two finite stopping times w.r.t. {Fn}n≥1 such that S ≤ T. If XS and XT are integrable and if

lim inf_{n→∞} E|Xn| I(T > n) = 0, (2.10)

then
E(XT | FS) ≥ XS a.s. (2.11)

If, in addition, {Xn}n≥1 is a martingale, then equality holds in (2.11).

Thus, Theorem 13.2.6 shows that if a martingale (or a sub-martingale) is stopped at random time points S and T with S ≤ T, then under very mild conditions, {(XS, FS), (XT, FT)} continues to have the martingale (sub-martingale, respectively) property.

Proof: To show (2.11), it is enough to show that

∫_A (XT − XS) dP ≥ 0 for all A ∈ FS. (2.12)

Fix A ∈ FS. Let {nk}k≥1 be a subsequence along which the "lim inf" is attained in (2.10). Let Tk = min{T, nk} and Sk = min{S, nk}, k ≥ 1. The proof of (2.12) involves showing that

∫_A (X_{Tk} − X_{Sk}) dP ≥ 0 for all k ≥ 1 (2.13)

and

lim_{k→∞} ∫_A [ (XT − XS) − (X_{Tk} − X_{Sk}) ] dP = 0. (2.14)

Consider (2.13). Since Sk ≤ Tk ≤ nk,

X_{Tk} − X_{Sk} = ∑_{n=Sk+1}^{Tk} (Xn − Xn−1) = ∑_{n=2}^{nk} (Xn − Xn−1) I(Sk + 1 ≤ n ≤ Tk). (2.15)

Note that for all 2 ≤ n ≤ nk, {Tk ≥ n} = {Tk ≤ n − 1}ᶜ = {T ≤ n − 1}ᶜ ∈ Fn−1 and {Sk + 1 ≤ n} = {Sk ≤ n − 1} = {S ≤ n − 1} ∈ Fn−1. Also, since A ∈ FS, Bn ≡ A ∩ {Sk + 1 ≤ n ≤ Tk} = (A ∩ {Sk + 1 ≤ n}) ∩ {Tk ≥ n} ∈ Fn−1 for all 2 ≤ n ≤ nk. Hence, by the sub-martingale property of {Xn}n≥1, from (2.15),

∫_A (X_{Tk} − X_{Sk}) dP = ∑_{n=2}^{nk} ∫_{A∩{Sk+1≤n≤Tk}} (Xn − Xn−1) dP = ∑_{n=2}^{nk} ∫_{Bn} [ E(Xn | Fn−1) − Xn−1 ] dP ≥ 0. (2.16)

This proves (2.13). To prove (2.14), note that by (2.10), the integrability of XS and XT, and the DCT,

lim_{k→∞} | ∫_A [ (XT − XS) − (X_{Tk} − X_{Sk}) ] dP |
≤ lim_{k→∞} ∫ [ |XT − X_{Tk}| + |XS − X_{Sk}| ] dP
≤ lim_{k→∞} [ ∫_{{T>nk}} (|XT| + |X_{nk}|) dP + ∫_{{S>nk}} (|XS| + |X_{nk}|) dP ]
≤ lim_{k→∞} [ ∫_{{T>nk}} |XT| dP + 2 ∫_{{T>nk}} |X_{nk}| dP + ∫_{{S>nk}} |XS| dP ] = 0,

since {S > nk} ⊂ {T > nk} and {T > nk} ↓ ∅ as k → ∞. This proves the theorem for the case when {Xn}n≥1 is a sub-martingale. When {Xn}n≥1 is a martingale, equality holds in the last line of (2.16), which implies equality in (2.13) and hence in (2.12). This completes the proof.

Remark 13.2.3: If there exists a t0 < ∞ such that P (T ≤ t0) = 1, then(2.10) holds.

Corollary 13.2.7: Let Xn,Fnn≥1 be a sub-martingale and let T be afinite stopping time w.r.t. Fnn≥1 such that E|XT | < ∞ and (2.10) holds.Then,

EXT ≥ EX1. (2.17)

If, in addition, Xnn≥1 is a martingale, then equality holds in (2.17).

Proof: Follows from Theorem 13.2.6 by setting S ≡ 1.

Corollary 13.2.8: Let Xn,Fnn≥1 be a sub-martingale. Let Tnn≥1 bea sequence of stopping times such that

(i) for all n ≥ 1, Tn ≤ Tn+1 w.p. 1,

(ii) for all n ≥ 1, there exist a nonrandom tn ∈ (0,∞) such that P (Tn ≤tn) = 1.


Let Gn ≡ FTn, Yn ≡ XTn

, n ≥ 1. Then Yn,Gnn≥1 is a sub-martingale. IfXn,Fnn≥1 is a martingale, then Yn,Gnn≥1 is a martingale.

Proof: Use Theorem 13.2.6 and Remark 13.2.3.

Corollary 13.2.9: Let Xn,Fnn≥1 be a sub-martingale. Let T be a stop-ping time. Let

Tn = minT, n, n ≥ 1.

Then XTn,FTn

n≥1 is a sub-martingale.

Note that this is a stronger version of Theorem 13.2.3 as FTn ⊂ Fn forall n ≥ 1.

Theorem 13.2.10: (Doob's maximal inequality). Let {Xn, Fn}n≥1 be a sub-martingale and let Mm = max{X1, . . . , Xm}, m ∈ N. Then, for any m ∈ N and x ∈ (0, ∞),

P(Mm > x) ≤ E[ X_m^+ I(Mm > x) ]/x ≤ E X_m^+ / x. (2.18)

Proof: Fix m ≥ 1, x > 0. Define a random variable S by

S = inf{ k : 1 ≤ k ≤ m, Xk > x } on A, and S = m on Aᶜ,

where A = {Xk > x for some 1 ≤ k ≤ m} = {Mm > x}. Then it is easy to check that S is a stopping time w.r.t. {Fn}n≥1 and S ≤ m. Set T ≡ m. Then (2.10) holds and hence, by Theorem 13.2.6, {(XS, FS), (Xm, Fm)} is a sub-martingale.

Note that A = {Mm > x} = ⋃_{k=1}^{m} {Mm > x, S = k} = ⋃_{k=1}^{m} {XS > x, S = k} = {XS > x} ∈ FS. Hence, by Markov's inequality,

P(A) = P(XS > x) ≤ (1/x) ∫_{{XS>x}} XS dP ≤ (1/x) ∫_A Xm dP ≤ (1/x) ∫_A X_m^+ dP ≤ E X_m^+ / x.

Remark 13.2.4: An alternative proof of (2.18) is as follows. Let A1 = {X1 > x}, Ak = {X1 ≤ x, X2 ≤ x, . . . , Xk−1 ≤ x, Xk > x} for k = 2, . . . , m. Then ⋃_{k=1}^{m} Ak = A ≡ {Mm > x} and Ak ∈ Fk for all k. Now, for x > 0,

P(Mm > x) = ∑_{k=1}^{m} P(Ak) ≤ (1/x) ∑_{k=1}^{m} E(Xk I_{Ak}).

By the sub-martingale property of {Xn, Fn}n≥1,

E(Xk I_{Ak}) ≤ E(Xm I_{Ak}) for k ≤ m.

Thus,

P(Mm > x) ≤ (1/x) E( Xm ∑_{k=1}^{m} I_{Ak} ) ≤ (1/x) E(Xm I_A) ≤ (1/x) E(X_m^+ I_A) ≤ (1/x) E(X_m^+).

Theorem 13.2.11: (Doob's Lp-maximal inequality for sub-martingales). Let {Xn, Fn}n≥1 be a sub-martingale and let Mn = max{Xj : 1 ≤ j ≤ n}. Then, for any p ∈ (1, ∞),

E(M_n^+)^p ≤ ( p/(p − 1) )^p E(X_n^+)^p ≤ ∞. (2.19)

Proof: If E(X_n^+)^p = ∞, then (2.19) holds trivially. Let E(X_n^+)^p < ∞. For p > 1, φ(x) = (x^+)^p is a convex nondecreasing function on R. Hence, {(X_n^+)^p, Fn}n≥1 is a sub-martingale and E(X_j^+)^p ≤ E(X_n^+)^p < ∞ for all j ≤ n. Since (M_n^+)^p ≤ ∑_{j=1}^{n} (X_j^+)^p, this implies that E(M_n^+)^p < ∞.

For any nonnegative random variable Y and p > 0, by Tonelli's theorem,

E Y^p = p E( ∫_0^Y x^{p−1} dx ) = p E( ∫_0^∞ x^{p−1} I(Y > x) dx ) = ∫_0^∞ p x^{p−1} P(Y > x) dx.

Thus,

E(M_n^+)^p = ∫_0^∞ p x^{p−1} P(M_n^+ > x) dx = ∫_0^∞ p x^{p−1} P(Mn > x) dx.

By Theorem 13.2.10, for x > 0,

P(Mn > x) ≤ (1/x) E( X_n^+ I(Mn > x) ),

and hence

E(M_n^+)^p ≤ ∫_0^∞ p x^{p−2} E( X_n^+ I(Mn > x) ) dx
= ( p/(p − 1) ) E( X_n^+ (M_n^+)^{p−1} )
≤ ( p/(p − 1) ) ( E(X_n^+)^p )^{1/p} ( E(M_n^+)^{(p−1)q} )^{1/q}

(by Hölder's inequality), where q is the conjugate of p, i.e., q = p/(p − 1).

Thus,

( E(M_n^+)^p )^{1/p} ≤ ( p/(p − 1) ) ( E(X_n^+)^p )^{1/p},

proving (2.19).

Corollary 13.2.12: Let Xn,Fnn≥1 be a martingale and let Mn =sup|Xj | : 1 ≤ j ≤ n. Then, for p ∈ (1,∞),

EMpn ≤

(p

p− 1

)p

E(|Xn|p

).

Proof: Since |Xn|,Fnn≥1 is a sub-martingale, this follows from Theo-rem 13.2.11.

Theorem 13.2.13: (Doob's L log L maximal inequality for sub-martingales). Let {Xn, Fn}n≥1 be a sub-martingale and Mn = max{Xj : 1 ≤ j ≤ n}. Then

E M_n^+ ≤ ( e/(e − 1) ) ( 1 + E(X_n^+ log X_n^+) ), (2.20)

where 0 log 0 is interpreted as 0.

Proof: As in the proof of Theorem 13.2.11,

E M_n^+ = ∫_0^∞ P(M_n^+ > x) dx ≤ 1 + ∫_1^∞ (1/x) E( X_n^+ I(M_n^+ > x) ) dx = 1 + E( X_n^+ log M_n^+ ). (2.21)

For x > 0, y > 0,

x log y = x log x + x log(y/x).

Now x log(y/x) = y φ(x/y), where φ(t) ≡ −t log t, t > 0. It can be verified that φ(t) attains its maximum 1/e at t = 1/e. Thus

x log y ≤ x log x + y/e.

So

E( X_n^+ log M_n^+ ) ≤ E( X_n^+ log X_n^+ ) + E M_n^+ / e. (2.22)

If E( X_n^+ log X_n^+ ) = ∞, (2.20) is trivially true. If E( X_n^+ log X_n^+ ) < ∞, then, as in the proof of Theorem 13.2.11, it can be shown that E M_n^+ < ∞. Hence, the theorem follows from (2.21) and (2.22).

A special case of Theorem 13.2.10 is the maximal inequality of Kolmogorov (cf. Section 8.3), as shown by the following example.

Example 13.2.4: Let {Yn}n≥1 be a sequence of independent random variables with EYn = 0 and E|Yn|^α < ∞ for all n ≥ 1, for some α ∈ (1, ∞). Let Sn = Y1 + · · · + Yn, n ≥ 1. Then φ(x) ≡ |x|^α is a convex function, and hence, by Proposition 13.1.1, Xn ≡ φ(Sn) = |Sn|^α, n ≥ 1, is a sub-martingale w.r.t. {Fn = σ〈Y1, . . . , Yn〉}, n ≥ 1. Now, by Theorem 13.2.10, for any x > 0 and m ≥ 1,

P( max_{1≤n≤m} |Sn| > x ) = P( max_{1≤n≤m} Xn > x^α ) ≤ x^{−α} E X_m^+ ≤ x^{−α} E|Sm|^α. (2.23)

Kolmogorov's inequality corresponds to the case α = 2.

Another application of the optional stopping theorem yields the following useful result.

Theorem 13.2.14: (Wald's lemmas). Let {Yn}n≥1 be a sequence of iid random variables and let {Fn}n≥1 be a filtration such that

(i) Yn is Fn-measurable and (ii) Fn and σ〈Yk : k ≥ n + 1〉 are independent, for all n ≥ 1. (2.24)

Also, let T be a finite stopping time w.r.t. {Fn}n≥1 with E|T| < ∞. Let Sn = Y1 + · · · + Yn, n ≥ 1. Then,

(a) E|Y1| < ∞ implies
E S_T = (EY1)(ET). (2.25)

(b) EY1² < ∞ implies
E( S_T − T·EY1 )² = Var(Y1) E(T). (2.26)

Proof: W.l.o.g., suppose that EY1 = 0. Then {Sn, Fn}n≥1 is a martingale. By Corollary 13.2.7, (2.25) would follow if one showed that (2.10) holds with Xn = Sn and that E|S_T| < ∞. Since |Sn| ≤ ∑_{i=1}^{n} |Yi| ≤ ∑_{i=1}^{T} |Yi| on the set {T ≥ n}, both these conditions would hold if E( ∑_{i=1}^{T} |Yi| ) < ∞. Now, by the MCT, the independence of Yi and {T ≥ i} = {T ≤ i − 1}ᶜ ∈ Fi−1 for i ≥ 2, and the fact that {T ≥ 1} = Ω, it follows that

E ∑_{i=1}^{T} |Yi| = E( ∑_{i=1}^{∞} |Yi| I(i ≤ T) ) = ∑_{i=1}^{∞} E|Yi| I(i ≤ T) = E|Y1| ∑_{i=1}^{∞} P(T ≥ i) = E|Y1| E|T| < ∞. (2.27)

This proves (a).

To prove (b), set σ² = Var(Y1) and note that EY1 = 0 implies {S_n² − nσ², Fn}n≥1 is a martingale. Let Tn = T ∧ n, n ≥ 1. Then, Tn is a bounded stopping time w.r.t. {Fn}n≥1 and hence, by Theorem 13.2.6,

E[ S_{Tn}² − (ETn) σ² ] = E( S_1² − σ² ) = 0 for all n ≥ 1. (2.28)

Thus, (2.26) holds with T replaced by Tn. Since T is a finite stopping time, Tn ↑ T < ∞ w.p. 1 and therefore, S_{Tn} → S_T as n → ∞, w.p. 1. Now applying Fatou's lemma and the MCT, from (2.28), one gets

E S_T² ≤ lim inf_{n→∞} E S_{Tn}² = lim inf_{n→∞} (ETn) σ² = (ET) σ². (2.29)

Also, note that for any n ≥ 1,

E( S_T² − S_{Tn}² ) = E( S_T² − S_n² ) I(T > n)
= E[ (S_T − S_n)² + 2 S_n (S_T − S_n) ] I(T > n)
≥ 2 E S_n (S_T − S_n) I(T > n)
= 2 E S_n ( S_{T1n} − S_n )
= 2 E[ S_n · E( S_{T1n} − S_n | Fn ) ], (2.30)

where T1n = max{T, n}. Since E T1n ≤ ET + n < ∞, and {T1n > k} = {T > k} for all k > n, the conditions of Theorem 13.2.6 hold with Xn = Sn, S = n and T = T1n. Hence, E( S_{T1n} − S_n | Fn ) = 0 a.s. and, by (2.30), E S_T² ≥ E S_{Tn}² for all n ≥ 1. Now letting n → ∞ and using (2.28), one gets E S_T² ≥ (ET) σ², as in (2.29). This completes the proof of (b).

This section is concluded with the statement of an inequality relating the pth moment of a martingale to the (p/2)th moment of its squared variation.

Theorem 13.2.15: (Burkholder's inequality). Let {Xn, Fn}n≥1 be a martingale sequence. Let ξj = Xj − Xj−1, j ≥ 1, with X0 = 0. Then for any p ∈ [2, ∞), there exist positive constants Ap and Bp such that

E|Xn|^p ≤ Ap E( ∑_{i=1}^{n} ξi² )^{p/2}

and

E|Xn|^p ≤ Bp { E( ∑_{i=1}^{n} E(ξi² | Fi−1) )^{p/2} + ∑_{i=1}^{n} E|ξi|^p }.

For a proof, see Chow and Teicher (1997).

13.3 Martingale convergence theorems

The martingale (or sub- or super-martingale) property of a sequence of random variables {Xn}n≥1 implies, under some mild additional conditions, a remarkable regularity, namely, that {Xn}n≥1 converges w.p. 1 as n → ∞. For example, any nonnegative super-martingale converges w.p. 1. Also, any sub-martingale {Xn}n≥1 for which {E|Xn|}n≥1 is bounded converges w.p. 1. Further, if {E|Xn|^p}n≥1 is bounded for some p ∈ (1, ∞), then Xn converges w.p. 1 and in Lp as well.

The proof of these assertions depends crucially on an ingenious inequality due to Doob. Recall that one way to prove that a sequence of real numbers {xn}n≥1 converges as n → ∞ is to show that it does not oscillate too much as n → ∞. That is, for all a < b, the number of times the sequence goes from below a to above b is finite. This number is referred to as the number of upcrossings from a to b. Doob's upcrossing lemma (see Theorem 13.3.1 below) shows that for a sub-martingale, the mean number of upcrossings can be bounded above. First, a formal definition of the number of upcrossings of a given sequence {xj : 1 ≤ j ≤ n} of real numbers from level a to level b, with a < b, is given.

Let

N1 = min{ j : 1 ≤ j ≤ n, xj ≤ a },
N2 = min{ j : N1 < j ≤ n, xj ≥ b },

and, define recursively,

N_{2k−1} = min{ j : N_{2k−2} < j ≤ n, xj ≤ a },
N_{2k} = min{ j : N_{2k−1} < j ≤ n, xj ≥ b }.

If any of these sets on the right side is empty, all subsequent ones will be empty as well and the corresponding Nk's will not be well defined. If N1 or N2 is not well defined, then set U({xj}_{j=1}^{n}; a, b), the number of upcrossings of the interval (a, b) by {xj}_{j=1}^{n}, equal to zero. Otherwise, let Nℓ be the last one that is well defined and set U({xj}_{j=1}^{n}; a, b) = ℓ/2 if ℓ is even and (ℓ − 1)/2 if ℓ is odd.

Theorem 13.3.1: (Doob's upcrossing lemma). Let {Xj, Fj}_{j=1}^{n} be a sub-martingale and let a < b be real numbers. Let Un ≡ U({Xj}_{j=1}^{n}; a, b). Then

E Un ≤ [ E(Xn − a)^+ − E(X1 − a)^+ ]/(b − a) ≤ [ E X_n^+ + |a| ]/(b − a). (3.1)

Proof: Consider first the special case when Xj ≥ 0 w.p. 1 for all j ≥ 1 and a = 0. Let N0 = 1. Let

Ñj = Nj if j = 2k, k ≤ Un, or if j = 2k − 1, k ≤ Un; and Ñj = n otherwise.

If j is odd and j + 1 ≤ 2Un, then

X_{Ñj+1} ≥ b > 0.

If j is odd and j + 1 ≥ 2Un + 2, then

X_{Ñj+1} = Xn = X_{Ñj}.

Thus ∑_{j odd} ( X_{Ñj+1} − X_{Ñj} ) ≥ b Un. It is easy to check that the Ñj are stopping times. By Theorem 13.2.6,

E( X_{Ñj+1} − X_{Ñj} ) ≥ 0 for j = 0, 1, . . . , n − 1.

Thus,

E(Xn − X1) = E( ∑_{j=0}^{n−1} ( X_{Ñj+1} − X_{Ñj} ) ) ≥ b E Un + E( ∑_{j even} ( X_{Ñj+1} − X_{Ñj} ) ) ≥ b E Un. (3.2)

Hence, both inequalities of (3.1) hold for the special case.

Now for the general case, let Yj ≡ (Xj − a)^+, 1 ≤ j ≤ n. Then {Yj, Fj}_{j=1}^{n} is a nonnegative sub-martingale and U({Yj}_{j=1}^{n}; 0, b − a) ≡ U({Xj}_{j=1}^{n}; a, b). Thus, from (3.2),

E Un ≤ E(Yn − Y1)/(b − a) = [ E((Xn − a)^+) − E((X1 − a)^+) ]/(b − a),

proving the first inequality of (3.1). The second inequality follows by noting that (x − a)^+ ≤ x^+ + |a| for any x, a ∈ R.

The first convergence theorem is an easy consequence of the above the-orem.

Theorem 13.3.2: Let {Xn, Fn}n≥1 be a sub-martingale such that

sup_{n≥1} E X_n^+ < ∞.

Then {Xn}n≥1 converges to a finite limit X∞ w.p. 1 and E|X∞| < ∞.

Proof: Let

A = { ω : lim inf_{n→∞} Xn < lim sup_{n→∞} Xn },

and for a < b, let

A(a, b) = { ω : lim inf_{n→∞} Xn < a < b < lim sup_{n→∞} Xn }.

Then, A = ∪ A(a, b), where the union is taken over all rationals a, b with a < b. To establish convergence of {Xn}n≥1 it suffices to show that P(A(a, b)) = 0 for each a < b, as this implies P(A) = 0. Fix a < b and let Un = U({Xj}_{j=1}^{n}; a, b). For ω ∈ A(a, b), Un → ∞ as n → ∞. On the other hand, by the upcrossing lemma,

E Un ≤ [ E X_n^+ + |a| ]/(b − a),

and by hypothesis, sup_{n≥1} E X_n^+ < ∞, implying that

sup_{n≥1} E Un < ∞.

By the MCT, E[ lim_{n→∞} Un ] = lim_{n→∞} E Un, and hence

lim_{n→∞} Un < ∞ w.p. 1.

Thus, P(A(a, b)) = 0 for all a < b, and hence lim_{n→∞} Xn = X∞ exists w.p. 1. By Fatou's lemma,

E|X∞| ≤ lim inf_{n→∞} E|Xn| ≤ sup_{n≥1} E|Xn|.

But E|Xn| = 2E(X_n^+) − EXn ≤ 2E X_n^+ − EX1, as {Xn, Fn}n≥1 a sub-martingale implies EXn ≥ EX1. Thus, sup_{n≥1} E X_n^+ < ∞ implies sup_{n≥1} E|Xn| < ∞. So, E|X∞| < ∞ and hence |X∞| < ∞ w.p. 1.

Corollary 13.3.3: Let {Xn, Fn}n≥1 be a nonnegative super-martingale. Then {Xn}n≥1 converges to a finite limit w.p. 1.

Proof: Since {−Xn, Fn}n≥1 is a nonpositive sub-martingale, sup_{n≥1} E(−Xn)^+ = 0 < ∞. By Theorem 13.3.2, {−Xn}n≥1 converges to a finite limit w.p. 1.

Corollary 13.3.4: Every nonnegative martingale converges w.p. 1.

A natural question is whether, when a sub-martingale converges w.p. 1 to a finite limit, it also converges in L1, or in Lp for p > 1. It turns out that if a sub-martingale is Lp bounded for some p > 1, then it converges in Lp. But this is false for p = 1, as the following examples show.

Example 13.3.1: (Gambler's ruin problem). Let {Sn}n≥1 be the simple symmetric random walk, i.e., Sn = ∑_{i=1}^{n} ξi, n ≥ 1, where {ξn}n≥1 is a sequence of iid random variables with P(ξ1 = 1) = 1/2 = P(ξ1 = −1). Let

N = inf{ n : n ≥ 1, Sn = 1 }.

As noted earlier, N is a finite stopping time and {Sn}n≥1 is a martingale. Let Xn = S_{N∧n}, n ≥ 1. Then, by the optional sampling theorem, {Xn}n≥1 is a martingale. Clearly, lim_{n→∞} Xn ≡ X∞ = S_N = 1 exists w.p. 1. But EXn ≡ 0 while EX∞ = 1, and so Xn does not converge to X∞ in L1.

Example 13.3.2: Suppose that {ξn}n≥1 is a sequence of iid nonnegative random variables with Eξ1 = 1. Let Xn = ∏_{i=1}^{n} ξi, n ≥ 1. Then {Xn}n≥1 is a nonnegative martingale and hence converges w.p. 1 to X∞, say. If P(ξ1 = 1) < 1, it can be shown that X∞ = 0 w.p. 1. Thus, Xn does not converge to X∞ in L1. In particular, {Xn}n≥1 is not UI (Problem 13.19).

Example 13.3.3: If {Zn}n≥0 is a branching process with offspring distribution {pj}j≥0 and mean m = ∑_{j=1}^{∞} j pj, then Xn ≡ Zn/mⁿ (cf. (1.10)) is a nonnegative martingale and hence lim_n Xn = X∞ exists w.p. 1. It is known that Xn converges to X∞ in L1 iff m > 1 and ∑_{j=1}^{∞} (j log j) pj < ∞ (cf. Chapter 18). See also Athreya and Ney (2004).

Theorem 13.3.5: Let Xn,Fnn≥1 be a sub-martingale. Then the follow-ing are equivalent:

(i) There exists a random variable X∞ in L1 such that Xn → X∞ in L1.

(ii) Xnn≥1 is uniformly integrable.

Proof: Clearly, (i) ⇒ (ii) for any sequence of integrable random variablesXnn≥1. Conversely, if (ii) holds, then E|Xn|n≥1 is bounded and henceby Theorem 13.3.2, Xn → X∞ w.p. 1 and by uniform integrability, Xn →X∞ in L1, i.e., (i) holds.


Remark 13.3.1: Let (ii) of Theorem 13.3.5 hold. For any A ∈ Fn andm > n, by the sub-martingale property

E(XnIA) ≤ E(XmIA).

By uniform integrability, for any A ∈ F ,

EXnIA → EX∞IA as n →∞.

This implies that Xn,Fnn≥1 ∪ X∞,F∞ is a sub-martingale, whereF∞ = σ〈

⋃n≥1 Fn〉. That is, the sub-martingale is closable at right. Further,

EXn → EX∞. Conversely, it can be shown that if there exists a randomvariable X∞, measurable w.r.t. F∞, such that

(a) E|X∞| < ∞,

(b) Xn,Fnn≥1 ∪ X∞,F∞ is a sub-martingale, and

(c) EXn → EX∞,

then by (a) and (b), Xnn≥1 is uniformly integrable and (i) of Theorem13.3.5 holds.

Corollary 13.3.6: If Xn,Fnn≥1 is a martingale, then it is closable atright iff Xnn≥1 is uniformly integrable iff Xn converges in L1.

This follows from the previous remark since for a martingale, EXn isconstant for 1 ≤ n ≤ ∞.

Remark 13.3.2: A sufficient condition for a sequence Xn∞n≥1 of

random variables to be uniformly integrable is that there exists a ran-dom variable M such that EM < ∞ and |Xn| ≤ M w.p. 1 for alln ≥ 1. Suppose that Xnn≥1 is a nonnegative sub-martingale andM = supn≥1 Xn = limn→∞ Mn where Mn = sup1≤j≤n Xj . By the MCT,EM = limn→∞ EMn. But by Doob’s L log L maximal inequality (Theorem13.2.13),

EMn ≤e

e− 1

[1 + E

(Xn(log Xn)+

)],

for all n ≥ 1. Thus, if Xnn≥1 is a nonnegative sub-martingale andsupn≥1 E(Xn(log Xn)+) < ∞, then EM < ∞ and hence Xnn≥1 is uni-formly integrable and converges w.p. 1 and in L1. Similarly, if Xnn≥1 isa martingale such that supn≥1 E(|Xn|(log |Xn|)+) < ∞, then Xnn≥1 isuniformly integrable.

L1 Convergence of the Doob Martingale

Definition 13.3.1: Let X be a random variable on a probability space(Ω,F , P ) and Fnn≥1 a filtration in F . Let E|X| < ∞ and Xn ≡


E(X|Fn), n ≥ 1. Then Xn,Fnn≥1 is called a Doob martingale (cf. Ex-ample 13.1.3).

For a Doob martingale, E|Xn| ≤ E|X| and it can be shown that {Xn}n≥1 is uniformly integrable (Problem 13.20). Hence, lim_{n→∞} Xn exists w.p. 1 and in L1, and equals E(X | F∞), where F∞ = σ〈⋃_{n≥1} Fn〉. This may be summarized as:

Theorem 13.3.7: Let {Fn}n≥1 be a filtration and let X be an F∞-measurable random variable with E|X| < ∞. Then

E(X | Fn) → X w.p. 1 and in L1.

Corollary 13.3.8: Let Fnn≥1 be a filtration and F∞ = σ〈⋃

n≥1 Fn〉.

(i) For any A ∈ F∞, one has

P (A|Fn) → IA w.p. 1.

(ii) For any random variable X with E|X| < ∞,

E(X|Fn) → E(X|F∞) w.p. 1.

Proof: Take X = IA for (i) and in Theorem 13.3.7, replace X by E(X|F∞)for (ii).

Kolmogorov’s zero-one law (Theorem 7.2.4) is an easy consequence ofthis. If ξnn≥1 are independent random variables and A is a tail eventand Fn ≡ σ〈ξj : 1 ≤ j ≤ n〉, then P (A|Fn) = P (A) for each n and henceP (A) = IA w.p. 1, i.e., P (A) = 0 or 1.

Theorem 13.3.9: (Lp convergence of sub-martingales, p > 1). Let {Xn, Fn}n≥1 be a nonnegative sub-martingale. Let 1 < p < ∞ and sup_{n≥1} E|Xn|^p < ∞. Then lim_{n→∞} Xn = X∞ exists w.p. 1 and in Lp, and {(Xn, Fn)}n≥1 ∪ {X∞, F∞} is an Lp-bounded sub-martingale.

Proof: By Doob's Lp-maximal inequality (Theorem 13.2.11), for any n ≥ 1,

E M_n^p ≤ ( p/(p − 1) )^p E X_n^p ≤ ( p/(p − 1) )^p sup_{m≥1} E X_m^p, (3.3)

where Mn = max{Xj : 1 ≤ j ≤ n}. Let M = lim_{n→∞} Mn. Then (3.3) yields

E M^p < ∞.

This makes {|Xn|^p}n≥1 uniformly integrable. Also, sup_{n≥1} E|Xn|^p < ∞ and p > 1 imply sup_n E|Xn| < ∞, and hence lim_{n→∞} Xn = X∞ exists w.p. 1 as a finite limit. The uniform integrability of {|Xn|^p}n≥1 implies Lp convergence (cf. Problem 2.36). The closability also follows as in Remark 13.3.1.

Corollary 13.3.10: Let Xn,Fnn≥1 be a martingale. Let 1 < p < ∞and supn≥1 E|Xn|p < ∞. Then the conclusions of Theorem 13.3.9 hold.

Proof: Since Yn ≡ |Xn|,Fnn≥1 is a nonnegative sub-martingale, Theo-rem 13.3.9 applies.

Reversed Martingales

Definition 13.3.2: Let Xn,Fnn≤−1 be an adapted family with(Ω,F , P ) as the underlying probability space, i.e., for n < m, Fn ⊂ Fm ⊂ Fand Xn is Fn-measurable for each n ≤ −1. Such a sequence is called a re-versed martingale if

(i) E|Xn| < ∞ for all n ≤ −1,

(ii) E(Xn+1|Fn) = Xn for all n ≤ −1.

The definitions of reversed sub- and super-martingales are similar.

Reversed martingales are well behaved since they are closed at right.

Theorem 13.3.11: Let {Xn, Fn}n≤−1 be a reversed martingale. Then

(a) lim_{n→−∞} Xn = X−∞ exists w.p. 1 and in L1,

(b) X−∞ = E(X−1 | F−∞), where F−∞ ≡ ⋂_{n≤−1} Fn.

Proof: Fix a < b. For n ≤ −1, let Un be the number of (a, b) upcrossings of {Xj : n ≤ j ≤ −1}. Then by Doob's upcrossing lemma (Theorem 13.3.1),

E Un ≤ E(X−1 − a)^+/(b − a).

Let U = lim_{n→−∞} Un. Letting n → −∞, by the MCT, it follows that

E U < ∞.

Thus, U < ∞ w.p. 1. This being true for every a < b, one may conclude, as in Theorem 13.3.2, that P( lim inf_{n→−∞} Xn < lim sup_{n→−∞} Xn ) = 0. So lim_{n→−∞} Xn = X−∞ exists w.p. 1. Also, by Jensen's inequality, {Xn}n≤−1 is uniformly integrable. So Xn → X−∞ in L1, proving (a).

To prove (b), note that for any A ∈ F−∞, by uniform integrability,

∫_A X−∞ dP = lim_{n→−∞} ∫_A Xn dP = ∫_A X−1 dP, by the martingale property.


Corollary 13.3.12: (The strong law of large numbers for iid random variables). Let {ξn}n≥1 be a sequence of iid random variables with E|ξ1| < ∞. Then, n⁻¹ ∑_{i=1}^{n} ξi → Eξ1 as n → ∞, w.p. 1.

Proof: For k ≥ 1, let Sk = ξ1 + · · · + ξk and

F−k = σ〈Sk, ξk+1, ξk+2, . . .〉.

Let Xn ≡ E(ξ1 | Fn), n ≤ −1. By the independence of {ξi}i≥1, for any n ≤ −1, with k = −n,

Xn = E(ξ1 | σ〈Sk〉).

Also, by symmetry, for any k ≥ 1,

E(ξ1 | σ〈Sk〉) = E(ξj | σ〈Sk〉) for 1 ≤ j ≤ k.

Thus, Xn = (1/k) ∑_{j=1}^{k} E(ξj | σ〈Sk〉) = Sk/k, for all k = −n ≥ 1. It is easy to check that {Xn, Fn}n≤−1 is a reversed martingale and so, by Theorem 13.3.11,

lim_{n→−∞} Xn = lim_{k→∞} Sk/k exists w.p. 1 and in L1.

By Kolmogorov's zero-one law, lim_{k→∞} Sk/k is a tail random variable, and so a constant, which by L1 convergence must equal Eξ1.

13.4 Applications of martingale methods

13.4.1 Supercritical branching processes

Recall Example 13.1.5 on branching processes. Assume that it is supercritical, i.e., µ = Eξ11 > 1, and that σ² = Var(ξ11) < ∞.

Proposition 13.4.1: Let Xn = µ^{−n} Zn be the martingale defined in (1.10). Then {Xn}n≥1 is an L2-bounded martingale.

Proof: Let vn = Var(Xn), n ≥ 1. Then

vn+1 = Var( E(Xn+1 | Fn) ) + E( Var(Xn+1 | Fn) ) = Var(Xn) + E(Zn σ²)/µ^{2(n+1)} = vn + σ² µ^{−2} µ^{−n}, n ≥ 1.

Thus, vn+1 = σ² µ^{−2} ∑_{j=0}^{n} µ^{−j}. Since µ > 1, {vn}n≥1 is bounded. Now, since EXn ≡ 1, {Xn}n≥1 is L2-bounded.

A direct consequence of Proposition 13.4.1 and Corollary 13.3.10 is that lim_{n→∞} Xn = X∞ exists w.p. 1 and in mean square.


13.4.2 Investment sequences

Let Xn be the value of a portfolio at (the end of) the nth period. Suppose the returns on the investment are random and satisfy

E(Xn+1 | X0, X1, . . . , Xn) ≤ ρn+1 Xn, n ≥ 1,

where ρn+1 is a strictly positive random variable that is Fn-measurable, with Fn ≡ σ〈X1, X2, . . . , Xn〉. Let ρ1 ≡ 1 and

Zn = Xn / ∏_{j=1}^{n} ρj, n ≥ 1.

Then {Zn, Fn}n≥1 is a nonnegative super-martingale and hence it converges w.p. 1 to a limit Z, with EZ ≤ EZ1 = EX1. This implies that {Xn}n≥1 converges w.p. 1 on the event A ≡ { ∏_{j=1}^{n} ρj converges }.

13.4.3 A conditional Borel-Cantelli lemma

Let {An}n≥1 be a sequence of events in a probability space (Ω, F, P) and let {Fn}n≥1 be a filtration in F. Let An ∈ Fn for all n ≥ 1 and pn = P(An | Fn−1), n ≥ 1, where F0 is the trivial σ-algebra {Ω, ∅}. Let δn = I_{An} and Xn = ∑_{j=1}^{n} (δj − pj), n ≥ 1. Then {Xn, Fn}n≥1 is a martingale. Let Vj = Var(δj | Fj−1) = pj(1 − pj), j ≥ 1, sn = ∑_{j=1}^{n} Vj and s̄n = max{sn, 1}, n ≥ 1. Since Vn is Fn−1-measurable, so are sn and s̄n. Let Yn = ∑_{j=1}^{n} (δj − pj)/s̄j, n ≥ 1. Then {Yn, Fn}n≥1 is a martingale.

Clearly, EYn = 0 and by the martingale property,

E Yn² = Var(Yn) = ∑_{j=1}^{n} Var( (δj − pj)/s̄j ) = ∑_{j=1}^{n} E( Vj/s̄j² ) = E( ∑_{j=1}^{n} Vj/s̄j² ).

But V1 = s1 and Vj = sj − sj−1 for j ≥ 2, and so Vj/s̄j² ≤ ∫_{s̄_{j−1}}^{s̄_j} (1/t²) dt and hence,

∑_{j=1}^{∞} Vj/s̄j² ≤ ∫_1^∞ (1/t²) dt = 1.

So sup_{n≥1} E Yn² ≤ 1. Thus, {Yn}n≥1 converges to some Y w.p. 1 and in L2.


If sn → ∞, then by Kronecker's lemma (cf. Lemma 8.4.2),

(1/s̄n) ∑_{j=1}^{n} (δj − pj) → 0  ⇒  ( ∑_{j=1}^{n} δj / ∑_{j=1}^{n} pj − 1 ) · ( ∑_{j=1}^{n} pj / s̄n ) → 0.

But ∑_{j=1}^{n} pj ≥ sn and hence

∑_{j=1}^{n} δj / ∑_{j=1}^{n} pj → 1 w.p. 1 on the event B ≡ {sn → ∞}. (4.1)

Next it is claimed that on Bᶜ ≡ {lim_{n→∞} sn < ∞}, lim_{n→∞} Xn = X exists and is finite w.p. 1. To prove the claim, fix 0 < t < ∞. Let Nt = inf{n : sn+1 > t}. Since sn+1 is Fn-measurable, Nt is a stopping time and, by the optional stopping theorem I (Theorem 13.2.3), {Zn ≡ X_{Nt∧n}}n≥1 is a martingale. By Doob's L2-maximal inequality,

E( sup_{1≤j≤n} Zj² ) ≤ 4 E(Zn²).

Also, it is easy to verify that {Xn² − sn}n≥1 is a martingale and, by the optional sampling theorem (Theorem 13.2.4),

E( X²_{n∧Nt} − s_{n∧Nt} ) = 0.

Thus, E Zn² = E s_{n∧Nt} ≤ t and hence, for each t, lim_{n→∞} Zn exists w.p. 1 and in L2. Thus, lim_{n→∞} X_{Nt∧n} exists w.p. 1 for each t. But, on Bᶜ, Nt = ∞ for all large t. So lim_{n→∞} Xn = X exists w.p. 1 on Bᶜ. This proves the claim.

It follows that on Bᶜ ∩ { ∑_{j=1}^{∞} pj = ∞ },

∑_{j=1}^{n} δj / ∑_{j=1}^{n} pj − 1 = Xn / ∑_{j=1}^{n} pj → 0.

Also, since B ≡ {sn → ∞} is a subset of { ∑_{j=1}^{∞} pj = ∞ } and it has been shown in (4.1) that

∑_{j=1}^{n} δj / ∑_{j=1}^{n} pj → 1 w.p. 1 on B,

it follows that

∑_{j=1}^{n} δj / ∑_{j=1}^{n} pj → 1 w.p. 1 on { ∑_{j=1}^{∞} pj = ∞ }.


Summarizing the above, one gets the following result.

Theorem 13.4.2: (A conditional Borel-Cantelli lemma). Let {An}n≥1 be a sequence of events in a probability space (Ω, F, P) and {Fn}n≥1 be a filtration such that An ∈ Fn for all n ≥ 1. Let pn = P(An | Fn−1) for n ≥ 2, p1 = P(A1). Then on the event B0 ≡ { ∑_{j=1}^{∞} pj = ∞ },

∑_{j=1}^{n} I_{Aj} / ∑_{j=1}^{n} pj → 1 w.p. 1,

and in particular, infinitely many An's happen w.p. 1 on B0.
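A sketch of Theorem 13.4.2 in the simplest adapted setting (not from the text; taking An = {Un ≤ pn} for independent uniforms Un and the deterministic choice pn = 1/n, so that pn = P(An | Fn−1) and ∑ pn = ∞, is an illustrative assumption): the ratio ∑ I_{Aj} / ∑ pj drifts toward 1.

    import numpy as np

    rng = np.random.default_rng(9)
    n_max, reps = 200_000, 20
    n = np.arange(1, n_max + 1)
    p = 1.0 / n                                 # p_n = P(A_n | F_{n-1}) here; sum of p_n diverges

    delta = (rng.random((reps, n_max)) <= p).astype(float)   # indicators of A_n = {U_n <= p_n}
    ratio = delta.cumsum(axis=1) / p.cumsum()                # (sum_{j<=n} I_{A_j}) / (sum_{j<=n} p_j)

    for k in (1000, 10_000, 200_000):
        r = ratio[:, k - 1]
        print(k, r.mean(), r.std())   # mean near 1; the spread shrinks only slowly, like 1/sqrt(log n)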

13.4.4 Decomposition of probability measures

The almost sure convergence of a nonnegative martingale yields the following theorem on the Lebesgue decomposition of two probability measures on a given measurable space.

Theorem 13.4.3: Let (Ω, F) be a measurable space and {Fn}n≥1 be a filtration with Fn ⊂ F for all n ≥ 1. Let P and Q be two probability measures on (Ω, F) such that for each n ≥ 1, Pn ≡ the restriction of P to Fn is absolutely continuous w.r.t. Qn ≡ the restriction of Q to Fn, with Radon-Nikodym derivative Xn = dPn/dQn. Let F∞ ≡ σ〈⋃_{n≥1} Fn〉 and X ≡ lim sup_{n→∞} Xn. Then for any A ∈ F∞,

P(A) = ∫_A X dQ + P(A ∩ {X = ∞}) ≡ Pa(A) + Ps(A), say, (4.2)

and Pa ≪ Q and Ps ⊥ Q.

Proof: For 1 ≤ k ≤ n, let Mk,n = max_{k≤j≤n} Xj. Let Mk = sup_{n≥k} Mk,n = lim_{n→∞} Mk,n. Then X ≡ lim sup_{n→∞} Xn = lim_{k→∞} Mk. Fix 1 ≤ k0, N < ∞ and A ∈ Fk0. Then for n ≥ k ≥ k0, Bk,n ≡ A ∩ {Mk,n ≤ N} ∈ Fn and hence

P(Bk,n) = ∫ Xn I_{Bk,n} dQ. (4.3)

As n → ∞, Mk,n ↑ Mk and so I_{Bk,n} ↓ I_{Bk}, where Bk ≡ A ∩ {Mk ≤ N} = ⋂_{n≥k} Bk,n. Also, since {Xn, Fn}n≥1 is a nonnegative martingale under the probability measure Q, lim_{n→∞} Xn exists w.p. 1 (under Q) and hence coincides with X. Thus, by the bounded convergence theorem applied to (4.3),

P(Bk) = ∫ X I_{Bk} dQ.

Now let N → ∞ and use the MCT to conclude that

P(A ∩ {Mk < ∞}) = ∫_A X I_{{Mk < ∞}} dQ.

Since {Mk < ∞} ↑ {X < ∞}, another application of the MCT yields

P(A ∩ {X < ∞}) = ∫_A X I_{{X < ∞}} dQ. (4.4)

Since Q(X < ∞) = 1 and

P(A) = P(A ∩ {X < ∞}) + P(A ∩ {X = ∞}),

(4.2) is proved for all A ∈ Fk0 and hence also for all A ∈ ⋃_{k≥1} Fk. Since ⋃_{k≥1} Fk is an algebra, (4.2) holds for all A ∈ F∞ ≡ σ〈⋃_{k≥1} Fk〉.

To prove the second part, simply note that Q(A) = 0 implies Pa(A) = 0, and Q(X = ∞) = 0.

Remark 13.4.1: The right side of (4.2) provides the Lebesgue decompo-sition of P w.r.t. Q.

Corollary 13.4.4:

(i) P ≪ Q iff EQ X = 1 iff P(X = ∞) = 0.

(ii) P ⊥ Q iff Q(X = 0) = 1 iff P(X = ∞) = 1.

Proof: Follows easily from (4.2).

For some applications of Corollary 13.4.4 to branching processes seeAthreya (2000).

Corollary 13.4.5: Let Xn,Fnn≥1 be a nonnegative martingale on aprobability space (Ω,F , Q), with EQX1 = 1. Let P be defined on

⋃n≥1 Fn

by P (A) = EQXnIA if A ∈ Fn. Then P is well defined. Suppose furtherthat P admits an extension to a probability measure on (Ω,F∞) whereF∞ ≡ σ(

⋃n≥1 Fn). Then (4.2) holds.

Proof: If A ∈ Fn, then A ∈ Fm for any m > n. But EQXmIA = EQXnIA

by the martingale property, and so P is well defined on⋃

n≥1 Fn. The restof the corollary follows from the theorem.

Remark 13.4.2: It can be shown that if Fn ≡ σ〈Xj : 1 ≤ j ≤ n〉, then Pdoes admit an extension to F∞ (cf. see Athreya (2000)).

Corollary 13.4.6: Let (Ω, F) be a measurable space. For each n ≥ 1, let An ≡ {An1, An2, . . . , An kn} ⊂ F be a partition of Ω. Let An ⊂ σ〈An+1〉 for all n ≥ 1. Let P and Q be two probability measures on (Ω, F) such that Q(Ani) > 0 for all n and i. Let F = σ〈⋃_{n≥1} An〉. Let

Xn ≡ ∑_{i=1}^{kn} [ P(Ani)/Q(Ani) ] I_{Ani}.

Then {Xn, Fn}n≥1, with Fn = σ〈An〉, is a martingale on (Ω, F, Q) and P satisfies the decomposition (4.2).

The proof of Corollary 13.4.6 is left as an exercise (Problem 13.22).

Remark 13.4.3: This yields the Lebesgue decomposition of P w.r.t. Qwhen F is countably generated, i.e., when there exists a countable collectionA of subsets of Ω such that F = σ〈A〉. In particular, this holds if Ω = Rk

and F ≡ B((Rk)

)for k ∈ N.

13.4.5 Kakutani's theorem

Theorem 13.4.7: (Kakutani's theorem). Let P and Q be the probability distributions on (R∞, B(R∞)) of the sequences of independent random variables {Xj}j≥1 and {Yj}j≥1, respectively. Let, for each j ≥ 1, the distribution of Xj be dominated by that of Yj. Then

either P ≪ Q or P ⊥ Q. (4.5)

Proof: Let fj be the density of λj w.r.t. µj, where λj(·) = P(Xj ∈ ·) and µj(·) = Q(Yj ∈ ·). Let Ω = R∞, F = (B(R))∞. Then P = ∏_{j≥1} λj, Q = ∏_{j≥1} µj. Let ξn(ω) ≡ ω(n), the nth coordinate of ω = (ω1, ω2, . . .) ∈ Ω, and Fn ≡ σ〈ξj : 1 ≤ j ≤ n〉. Also let Pn be the restriction of P to Fn and Qn that of Q to Fn. Then Pn ≪ Qn with probability density

Ln = dPn/dQn = ∏_{j=1}^{n} fj(ξj).

Since {lim sup_{n→∞} Ln < ∞} is a tail event, by the independence of {ξj}j≥1 under P and Kolmogorov's zero-one law, P(lim sup_{n→∞} Ln < ∞) = 0 or 1. Now, by Corollary 13.4.4, (4.5) follows.

Remark 13.4.4: It can be shown that P ≪ Q or P ⊥ Q according as ∏_{j=1}^{∞} E√(fj(Yj)) > 0 or = 0. For a proof, see Durrett (2004).

Remark 13.4.5: If {Xj}j≥1 are iid and {Yj}j≥1 are also iid, then either P = Q or P ⊥ Q. This is because fj = f1 for all j and EQ√f1 ≤ (EQ f1)^{1/2} ≤ 1, and so either EQ√f1 < 1 or EQ√f1 = 1. In the latter case f1 ≡ 1, since

EQ(√f1)² = 1 = EQ(√f1) ⇒ VarQ(√f1) = 0.

Remark 13.4.6: The above result can be extended to Markov chains.Let P and Q be two irreducible stochastic matrices on a countable setand let Q be positive recurrent. Also, let Px0 denote the distribution of aMarkov chain Xnn≥1 starting at x0 and with transition probability P ,and similarly, let Qy0 denote the distribution of a Markov chain Ynn≥1


starting at y0 and with transition probability Q. Then

either Px0 ⊥ Qy0 or P = Q. (4.6)

The proof of this is left as an exercise (Problem 13.23).

13.4.6 de Finetti's theorem

Let {X_n}_{n≥1} be a sequence of exchangeable random variables on a probability space (Ω, F, P), i.e., for each n ≥ 1, the distribution of (X_{σ(1)}, X_{σ(2)}, ..., X_{σ(n)}) is the same as that of (X_1, X_2, ..., X_n) whenever (σ(1), σ(2), ..., σ(n)) is a permutation of (1, 2, ..., n). Then there is a σ-algebra G ⊂ F such that for each n ≥ 1,

P(X_i ∈ B_i, i = 1, 2, ..., n | G) = ∏_{i=1}^{n} P(X_i ∈ B_i | G)   (4.7)

for all B_1, ..., B_n ∈ B(R).

This is known as de Finetti's theorem. For a proof, see Durrett (2004) and Chow and Teicher (1997). The theorem says that, conditioned on G, the {X_i}_{i≥1} are iid random variables with distribution P(X_1 ∈ · | G). The converse, i.e., that if (4.7) holds for some σ-algebra G ⊂ F then the sequence {X_i}_{i≥1} is exchangeable, is not difficult to verify (Problem 13.26).

13.5 Problems

13.1 Let Ω be a nonempty set and let {A_j}_{j≥1} be a countable partition of Ω. For n ≥ 1, let F_n be the σ-algebra generated by {A_j}_{j=1}^{n}.

(a) Show that {F_n}_{n≥1} is a filtration.

(b) Find F_∞ = σ⟨⋃_{n≥1} F_n⟩.

13.2 Let Ω be a nonempty set. For each n ≥ 1, let π_n ≡ {A_{nj} : j = 1, 2, ..., k_n} be a partition of Ω. Suppose that for each n and j, A_{nj} is a union of sets of π_{n+1}. Let F_n ≡ σ⟨π_n⟩ for n ≥ 1.

(a) Show that {F_n}_{n≥1} is a filtration.

(b) Suppose Ω = [0, 1) and π_n ≡ {[(j−1)/2^n, j/2^n) : j = 1, 2, ..., 2^n}. Show that F_∞ = σ⟨⋃_{n≥1} F_n⟩ is the Borel σ-algebra B([0, 1)).

13.3 Let {(Y_n, F_n) : n ≥ 1} and {(Ỹ_n, F_n) : n ≥ 1} be as in Example 13.1.2. Verify that {(Y_n, F_n) : n ≥ 1} is a sub-martingale and {(Ỹ_n, F_n) : n ≥ 1} is a martingale.

13.4 Give an example of a random variable T and two filtrations {F_n}_{n≥1} and {G_n}_{n≥1} such that T is a stopping time w.r.t. the filtration {F_n}_{n≥1} but not w.r.t. {G_n}_{n≥1}.

13.5 Let T_1 and T_2 be stopping times w.r.t. a filtration {F_n}_{n≥1}. Verify that min(T_1, T_2), max(T_1, T_2), T_1 + T_2, and T_1² are stopping times w.r.t. {F_n}_{n≥1}. Give an example to show that √T_1 and T_1 − 1 need not be stopping times w.r.t. {F_n}_{n≥1}.

13.6 Let T be a random variable taking values in {1, 2, 3, ...}. Show that there is a filtration {F_n}_{n≥1} w.r.t. which T is a stopping time.

13.7 Let {F_n}_{n≥1} be a filtration.

(a) Show that T is a stopping time w.r.t. {F_n}_{n≥1} iff {T ≤ n} ∈ F_n for all n ≥ 1.

(b) Show by an example that if a random variable T satisfies {T ≥ n} ∈ F_n for all n ≥ 1, it need not be a stopping time w.r.t. {F_n}_{n≥1}.

(Hint: Consider a T of the form T = inf{k : k ≥ 1, X_{k+1} ∈ A} and F_n = σ⟨X_j : j ≤ n⟩.)

13.8 Show that F_T defined in (2.6) is a σ-algebra.

13.9 Let {X_n}_{n≥1} be a sequence of random variables. Let G_n = σ⟨X_j : 1 ≤ j ≤ n⟩. Let {F_n}_{n≥1} be a filtration such that G_n ⊂ F_n for each n ≥ 1.

(a) Show that if {X_n, F_n}_{n≥1} is a martingale, then so is {X_n, G_n}_{n≥1}.

(b) Show by an example that the converse need not be true.

(c) Let {X_n, F_n}_{n≥1} be a martingale. Let 1 ≤ k_1 < k_2 < k_3 < ··· be a sequence of integers. Let Y_n ≡ X_{k_n}, H_n ≡ F_{k_n}, n ≥ 1. Show that {Y_n, H_n}_{n≥1} is also a martingale.

13.10 A branching random walk is a branching process with a random walk attached to it. Individuals reproduce according to a branching process and the offspring move away from the parent by a random distance. If X_n ≡ {x_{n1}, x_{n2}, ..., x_{nZ_n}} denotes the position vector of the Z_n individuals in the nth generation and the individual at location x_{ni} produces ρ_{ni} offspring, then each of them chooses a new position by moving a random distance from x_{ni}, and these distances are assumed to be iid. Let η_{nij} be the random distance moved by the jth offspring of the individual at x_{ni}. Then the position vector of the (n+1)st generation is given by

X_{n+1} = {x_{ni} + η_{nij} : j = 1, ..., ρ_{ni}, i = 1, 2, ..., Z_n} ≡ {x_{n+1,k} : k = 1, 2, ..., Z_{n+1}}, say,

where

Z_{n+1} = population size of the (n+1)st generation = ∑_{i=1}^{Z_n} ρ_{ni}.

Let the offspring distribution be {p_k}_{k≥0} and let the jump size distribution be denoted by F(·). Assume that the η's are real valued, and also that the collections {ρ_{ni}}_{i≥1, n≥0} and {η_{nij}}_{i≥1, j≥1, n≥0} are all independent, with the ρ's being iid with distribution {p_k}_{k≥0} and the η's iid with distribution F. Fix θ ∈ R. For n ≥ 0, let

Z_n(θ) ≡ ∑_{i=1}^{Z_n} e^{θ x_{ni}}  and  Y_n(θ) = Z_n(θ) (ρ φ(θ))^{−n},

where ρ = ∑_{k=0}^{∞} k p_k and φ(θ) = E(e^{θ η_{111}}) = ∫ e^{θx} dF(x). Assume 0 < φ(θ) < ∞ and 0 < ρ < ∞.

(a) Verify that {Y_n(θ)}_{n≥0} is a martingale w.r.t. an appropriate filtration {F_n}_{n≥0}.

(b) Show that

Var(Z_{n+1}(θ)) = Var(Z_n(θ) ρ φ(θ)) + (E Z_n(2θ)) (ρ ψ(θ) + (φ(θ))² σ²),

where ψ(θ) = φ(2θ) − (φ(θ))² and σ² is the variance of the distribution {p_k}_{k≥0}.

(c) State a sufficient condition on ρ, σ², ψ(·), φ(·), and θ for {Y_n(θ)} to be L²-bounded.

13.11 Let {η_j}_{j≥1} be adapted to a filtration {F_j}_{j≥1}. Let E(η_j | F_{j−1}) = 0 and V_j = E(η_j² | F_{j−1}) for j ≥ 2. Let s_n² = ∑_{j=2}^{n} V_j, n ≥ 2.

(a) Verify that {Y_n ≡ ∑_{j=2}^{n} η_j/s̄_j, F_n}_{n≥1} is a martingale, where s̄_j = max(s_j, 1).

(b) Show that Var(Y_n) = E(∑_{j=2}^{n} V_j/s̄_j²).

(c) Show that ∑_{j=2}^{∞} V_j/s̄_j² ≤ ∫_1^∞ (1/t²) dt + 1.

(d) Conclude that Y_n converges w.p. 1 and in L².

(e) Now suppose that s_n → ∞ w.p. 1. Show that (1/s_n) ∑_{j=1}^{n} η_j → 0 w.p. 1.

(Hint: Use Kronecker's lemma (cf. Chapter 8).)

13.12 Let {ξ_i}_{i≥1} be iid random variables with distribution P(ξ_1 = 1) = 1/2 = P(ξ_1 = −1). Let S_0 = 0, S_n = ∑_{i=1}^{n} ξ_i, n ≥ 1. Let −a < 0 < b be integers and T = T_{−a,b} = inf{n : n ≥ 1, S_n = −a or b}. Show, using Wald's lemmas (Theorem 13.2.14), that

(a) P(S_T = −a) = b/(b + a).

(b) ET = ab.

(c) Extend the above arguments to find Var(T).

(Hint: Consider T ∧ n first and then let n ↑ ∞.)

13.13 Use Problem 13.12 to conclude that for any positive integer b,

P(T_b < ∞) = 1, but ET_b = ∞,

where for any integer i,

T_i = inf{n : n ≥ 1, S_n = i}.

(Hint: Use the relation T_b = lim_{i→∞} T_{−i,b}.)

13.14 Let {ξ_i}_{i≥1} be iid random variables with distribution

P(ξ_i = 1) = p = 1 − P(ξ_i = −1), 0 < p < 1, p ≠ 1/2.

Let S_0 = 0, S_n = ∑_{i=1}^{n} ξ_i, n ≥ 1. Let ψ(x) = (q/p)^x, x ∈ R, where q = 1 − p.

(a) Show that {X_n = ψ(S_n)}_{n≥0} is a martingale w.r.t. the filtration F_n = σ⟨ξ_1, ..., ξ_n⟩, n ≥ 1, and F_0 = {∅, Ω}.

(b) Let T_{−a,b} = inf{n : n ≥ 1, S_n = −a or b}, for positive integers a and b. Show that P(T_{−a,b} < ∞) = 1.

(Hint: Use the strong law of large numbers.)

(c) Use (a) to show that for positive integers a, b,

θ ≡ P(T_{−a} < T_b) = (ψ(b) − 1)/(ψ(b) − ψ(−a)),

where for any integer i, T_i = inf{n : n ≥ 1, S_n = i}.

(d) Show that ET_{−a,b} = (b − θ(b + a))/(p − q).

(e) Show that if p > q, then ET_{−a} = ∞ and ET_b = b/(p − q).

13.15 Let {X_n}_{n≥0} be a Markov chain with state space S = {1, 2, 3, ...} and transition probability matrix P = ((p_ij)). That is, for each n ≥ 1,

P(X_0 = i_0, X_1 = i_1, ..., X_n = i_n) = P(X_0 = i_0) p_{i_0 i_1} ··· p_{i_{n−1} i_n}

for all i_0, i_1, i_2, ..., i_n ∈ S. Let h : S → R and ρ ∈ R be such that

∑_{j=1}^{∞} |h(j)| p_{ij} < ∞ for all i

and

∑_{j=1}^{∞} h(j) p_{ij} = ρ h(i) for all i.

(a) Verify that {X_n}_{n≥0} has the Markov property, namely, for all n ≥ 0,

P(X_{n+1} = i_{n+1} | X_n = i_n, X_{n−1} = i_{n−1}, ..., X_0 = i_0) = P(X_{n+1} = i_{n+1} | X_n = i_n) = p_{i_n i_{n+1}}.

(b) Verify that {Y_n ≡ h(X_n) ρ^{−n}}_{n≥0} is a martingale w.r.t. the filtration F_n ≡ σ⟨X_0, X_1, ..., X_n⟩.

(c) Suppose ρ = 1 and h is bounded below and attains its lower bound. Suppose also that {X_n}_{n≥0} is irreducible and recurrent, that is, P(X_n = j for some n ≥ 1 | X_0 = i) = 1 for all i, j. Show that h is a constant function.

(Hint: Use the optional stopping theorem II.)

13.16 Let {Y_j}_{j≥1} be a sequence of random variables such that P(|Y_j| ≤ 1) = 1 for all j ≥ 1 and E(Y_j | F_{j−1}) = 0 for j ≥ 2, where F_j = σ⟨Y_1, Y_2, ..., Y_j⟩. Let X_n = ∑_{j=1}^{n} Y_j, n ≥ 1, and let τ be a stopping time w.r.t. {F_n}_{n≥1} with Eτ < ∞. Show that E|X_τ| < ∞ and EX_τ = EX_1.

13.17 Let θ, {ξ_j}_{j≥1} be random variables such that E|θ| < ∞ and {ξ_j}_{j≥1} are iid with mean zero. For j ≥ 1, let X_j = θ + ξ_j and F_j = σ⟨X_1, X_2, ..., X_j⟩. Show that Y_j ≡ E(θ | F_j) → θ w.p. 1.

(Hint: Use the convergence result for a Doob martingale and apply the SLLN to X̄_n = θ + n^{−1} ∑_{j=1}^{n} ξ_j to show that θ is F_∞-measurable.)


13.18 Let {X_n, F_n}_{n≥1} be a martingale sequence such that EX_n² < ∞ for all n ≥ 1. Let Y_1 = X_1 and Y_j = X_j − X_{j−1}, j ≥ 2. Let V_j = E(Y_j² | F_{j−1}) for j ≥ 2 and A_n = ∑_{j=2}^{n} V_j, n ≥ 2. Verify that

(a) A_n is F_{n−1}-measurable and nondecreasing in n.

(b) {X_n² − A_n, F_n}_{n≥2} is a martingale.

(c) X_n² = (X_n² − A_n) + A_n is the Doob decomposition of the sub-martingale {X_n², F_n} (Proposition 13.1.2).

13.19 Show that the random variable X_∞ defined in Example 13.3.2 is zero w.p. 1.

(Hint: Show that E log ξ_1 < 0 and use the SLLN.)

13.20 Show that the Doob martingale in Definition 13.3.1 is uniformly integrable.

(Hint: Show that for any λ > 0, λ_0 > 0,

E(|X_n| I(|X_n| > λ)) ≤ E(|X| I(|X_n| > λ)) ≤ E(|X| I(|X| > λ_0)) + λ_0 P(|X_n| > λ).)

13.21 Consider the following urn scheme due to Polya. Let an urn contain w_0 white and b_0 black balls at time n = 0. A ball is drawn from the urn at random and returned to the urn together with one more ball of the color drawn. This procedure is repeated for all n ≥ 1. Let W_n and B_n denote the numbers of white and black balls in the urn after n draws. Let Z_n = W_n/(W_n + B_n), n ≥ 0, and let F_n = σ⟨Z_0, Z_1, ..., Z_n⟩.

(a) Show that {(Z_n, F_n)}_{n≥0} is a martingale.

(b) Conclude that Z_n converges w.p. 1 and in L¹ to a random variable Z.

(c) Show that for any k ∈ N, lim_{n→∞} EZ_n^k exists and evaluate the limit. Deduce that Z has a Beta(w_0, b_0) distribution, i.e., its pdf is

f_Z(z) ≡ ((w_0 + b_0 − 1)!/((w_0 − 1)!(b_0 − 1)!)) z^{w_0 − 1}(1 − z)^{b_0 − 1} I_{[0,1]}(z).

(d) Generalize (a) to the case when at the nth stage a random number α_n of balls of the color drawn are added, where {α_n}_{n≥1} is any sequence of nonnegative integer valued random variables.
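For readers who want to see Problem 13.21 in action, the following simulation sketch (an illustration only, not a solution) tracks the fraction of white balls and compares a few moments of Z_n with the Beta(w_0, b_0) moments of part (c); the values w_0 = 2, b_0 = 3 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
w0, b0, n_draws, n_paths = 2, 3, 2000, 20_000

# Run n_paths independent urns in parallel; w and b hold the current ball counts.
w = np.full(n_paths, float(w0))
b = np.full(n_paths, float(b0))
for _ in range(n_draws):
    white = rng.random(n_paths) < w / (w + b)   # draw a ball; True means it was white
    w += white                                  # return it together with one more of its color
    b += ~white
Z = w / (w + b)                                 # Z_n, close to its a.s. limit Z for large n

# Beta(w0, b0) moments: E Z^k = prod_{r=0}^{k-1} (w0 + r) / (w0 + b0 + r).
for k in (1, 2, 3):
    beta_moment = np.prod([(w0 + r) / (w0 + b0 + r) for r in range(k)])
    print(k, round(float((Z ** k).mean()), 3), round(float(beta_moment), 3))
```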

13.22 Prove Corollary 13.4.6.

(Hint: Argue as in Example 13.1.7.)

13.23 Prove the last equation (4.6) of Section 13.4.

(Hint: Show, using the strong law for the Q chain, that under Q the martingale X_n converges to 0 w.p. 1, where X_n is the Radon-Nikodym derivative of P_{x_0}((X_0, ..., X_n) ∈ ·) w.r.t. Q_{x_0}((X_0, ..., X_n) ∈ ·).)

13.24 Let {F_n}_{n≥0} be a filtration ⊂ F, where (Ω, F, P) is a probability space. Let {Y_n}_{n≥0} ⊂ L¹(Ω, F, P). Suppose

Z ≡ sup_{n≥1} |Y_n| ∈ L¹(Ω, F, P) and lim_{n→∞} Y_n ≡ Y exists w.p. 1.

Show that E(Y_n | F_n) → E(Y | F_∞) w.p. 1.

(Hint: Fix m ≥ 1 and let V_m = sup_{n≥m} |Y_n − Y|. Show that

lim_n E(|Y_n − Y| | F_n) ≤ lim_{n→∞} E(V_m | F_n) = E(V_m | F_∞).

Now show that E(V_m | F_∞) → 0 as m → ∞.)

13.25 Let {X_t, F_t : t ∈ I ≡ Q ∩ (0, 1)} be a martingale, i.e., for all t_1 < t_2 in I,

E(X_{t_2} | F_{t_1}) = X_{t_1}.

Show that for each t in I,

lim_{s↑t, s∈I} X_s and lim_{s↓t, s∈I} X_s

both exist w.p. 1 and in L¹ and equal X_t w.p. 1.

13.26 Let {X_i}_{i≥1} be random variables on a probability space (Ω, F, P). Suppose that (4.7) holds for some σ-algebra G ⊂ F. Show that {X_i}_{i≥1} are exchangeable.

13.27 Let {X_n}_{n≥0}, {Y_n}_{n≥0} be martingales in L²(Ω, F, P) w.r.t. the same filtration {F_n}_{n≥1}. Let X_0 = Y_0 = 0. Show that

E(X_n Y_n) = ∑_{k=1}^{n} E[(X_k − X_{k−1})(Y_k − Y_{k−1})], n ≥ 1,

and, in particular,

E(X_n²) = ∑_{k=1}^{n} E(X_k − X_{k−1})².

13.28 Let {X_n, F_n}_{n≥1} be a martingale in L²(Ω, F, P). Suppose 0 ≤ b_n ↑ ∞ is such that ∑_{j=2}^{∞} E(X_j − X_{j−1})²/b_j² < ∞. Show that X_n/b_n → 0 w.p. 1.

(Hint: Consider the sequence Y_n ≡ ∑_{j=2}^{n} (X_j − X_{j−1})/b_j, n ≥ 2. Verify that {Y_n, F_n}_{n≥2} is an L²-bounded martingale and use Kronecker's lemma (cf. Chapter 8).)

13.29 Let f ∈ L¹([0, 1], B([0, 1]), m), where m(·) is Lebesgue measure on [0, 1]. Let {H_k(·)}_{k≥1} be the Haar functions defined by

H_1(t) ≡ 1;

H_2(t) ≡ 1 for 0 ≤ t < 1/2, and −1 for 1/2 ≤ t < 1;

H_{2^n+1}(t) = 2^{n/2} for 0 ≤ t < 2^{−(n+1)}, = −2^{n/2} for 2^{−(n+1)} ≤ t < 2^{−n}, and = 0 otherwise, n = 1, 2, ...;

H_{2^n+j}(t) = H_{2^n+1}(t − (j−1)/2^n), j = 1, 2, ..., 2^n.

Let a_k ≡ ∫_0^1 f(t) H_k(t) dt, k = 1, 2, ....

(a) Verify that {X_n(t) ≡ ∑_{k=1}^{n} a_k H_k(t)}_{n≥1} is a martingale w.r.t. the natural filtration.

(b) Show that X_n converges w.p. 1 and in L¹ to f.

13.30 Let {X_n}_{n≥1} be a sequence of nonnegative random variables on some probability space (Ω, F, P) such that E(X_{n+1} | F_n) ≤ X_n + Y_n, where F_n ≡ σ⟨X_1, ..., X_n⟩ and {Y_n}_{n≥1} is a sequence of nonnegative constants such that ∑_{n=1}^{∞} Y_n < ∞. Show that {X_n}_{n≥1} converges w.p. 1.

13.31 Let {τ_j}_{j≥1} be independent exponential random variables with Eτ_j = 1/λ_j, j ≥ 1, such that ∑_{j=1}^{∞} 1/λ_j² < ∞. Let T_0 = 0, T_n = ∑_{j=1}^{n} τ_j, n ≥ 1, and s_n = ∑_{j=1}^{n} 1/λ_j. Show that {X_n ≡ T_n − s_n}_{n≥1} converges w.p. 1 and in mean square.

(Hint: Show that {X_n}_{n≥1} is an L²-bounded martingale.)

14 Markov Chains and MCMC

14.1 Markov chains: Countable state space

14.1.1 Definition

Let S = {a_j : j = 1, 2, ..., K}, K ≤ ∞, be a finite or countable set. Let P = ((p_ij))_{K×K} be a stochastic matrix, i.e., p_ij ≥ 0 for all i, j and ∑_{j=1}^{K} p_ij = 1 for every i, and let μ = {μ_j : 1 ≤ j ≤ K} be a probability distribution, i.e., μ_j ≥ 0 for all j and ∑_{j=1}^{K} μ_j = 1.

Definition 14.1.1: A sequence {X_n}_{n=0}^{∞} of S-valued random variables on some probability space (Ω, F, P) is called a Markov chain with stationary transition probabilities P = ((p_ij)), initial distribution μ, and state space S if

(i) X_0 ∼ μ, i.e., P(X_0 = a_j) = μ_j for all j, and

(ii) P(X_{n+1} = a_j | X_n = a_i, X_{n−1} = a_{i_{n−1}}, ..., X_0 = a_{i_0}) = p_ij for all a_i, a_j, a_{i_{n−1}}, ..., a_{i_0} ∈ S and n = 0, 1, 2, ...,

i.e., the sequence is memoryless: given X_n, X_{n+1} is independent of {X_j : j ≤ n − 1}. More generally, given the present (X_n), the past (X_j : j ≤ n − 1) and the future (X_j : j > n) are stochastically independent (Problem 14.1).

A few questions arise:

Question 1: Does such a sequence {X_n}_{n=0}^{∞} exist for every μ and P, and if so, how does one generate it?

The answer is yes. There are two approaches, namely, (i) using Kolmogorov's consistency theorem and (ii) an iid random iteration scheme.

Question 2: How does one describe the finite time behavior, i.e., the joint distribution of (X_0, X_1, ..., X_n) for any n ∈ N?

One may use the Markov property repeatedly to obtain the joint distribution.

Question 3: What can one say about the long-term behavior? One can ask questions like:

(a) Does the trajectory n → X_n converge as n → ∞?

(b) Does the distribution of X_n converge as n → ∞?

(c) Does the law of large numbers hold for a suitable class of functions f, i.e., do the limits lim_{n→∞} (1/n) ∑_{j=1}^{n} f(X_j) exist w.p. 1?

(d) Do stationary distributions exist? (A probability distribution π = {π_i}_{i∈S} is called a stationary distribution for a Markov chain {X_n}_{n≥0} if, whenever X_0 has distribution π, X_n also has distribution π for all n ≥ 1.)

The key to answering these questions is the concepts of communication, irreducibility, aperiodicity, and, most importantly, recurrence. The main tools are the laws of large numbers, renewal theory, and coupling.

14.1.2 Examples

Example 14.1.1: (IID sequence). Let {X_n}_{n=0}^{∞} be a sequence of iid S-valued random variables with distribution μ = {μ_j}. Then {X_n}_{n=0}^{∞} is a Markov chain with initial distribution μ and transition probabilities given by p_ij = μ_j for all i, i.e., all rows of P are identical. It is also easy to prove the converse, i.e., if all rows of P are identical, then {X_n}_{n=1}^{∞} are iid and independent of X_0.

To answer Question 3 in this case, note that P(X_n = a_j) = μ_j for all n, and thus X_n converges in distribution. But the trajectories do not converge. However, the law of large numbers holds and μ is the unique stationary distribution.

Example 14.1.2: (Random walks). Let S = Z, the set of integers. Let {ε_n}_{n≥1} be iid with distribution {p_j}_{j∈Z}, i.e., P(ε_1 = j) = p_j for j ∈ Z. Let X_0 be a Z-valued random variable independent of {ε_n}_{n≥1}. For n ≥ 0, define

X_{n+1} = X_n + ε_{n+1} = X_{n−1} + ε_n + ε_{n+1} = ··· = X_0 + ∑_{j=1}^{n+1} ε_j.

In this case, with probability one, the trajectories of X_n go to +∞ (respectively, −∞) if E(ε_1) > 0 (respectively, < 0). If E(ε_1) = 0, then the trajectories fluctuate infinitely often, provided p_0 ≠ 1.

Example 14.1.3: (Branching processes). Let S = Z_+ = {0, 1, 2, ...}. Let {p_j}_{j=0}^{∞} be a probability distribution. Let {ξ_{ni}}_{i∈N, n∈Z_+} be iid random variables with distribution {p_j}_{j=0}^{∞}. Let Z_0 be a Z_+-valued random variable independent of {ξ_{ni}}. Let

Z_{n+1} = ∑_{i=1}^{Z_n} ξ_{ni} for n ≥ 0.

If p_0 = 0 and p_1 < 1, then Z_n → ∞ w.p. 1. If p_0 > 0, then P(Z_n → ∞) + P(Z_n → 0) = 1. Also, P(Z_n → 0 | Z_0 = 1) = q is the smallest solution in [0, 1] of the equation

q = f(q) = ∑_{j=0}^{∞} p_j q^j.

Thus q = 1 iff m ≡ ∑_{j=1}^{∞} j p_j = f′(1) ≤ 1 (see Chapter 18 also).
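As a quick numerical companion to Example 14.1.3 (an illustration under an assumed Poisson offspring law, not part of the text), one can compare the simulated extinction frequency with the smallest root of q = f(q):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 1.5                          # offspring mean; Poisson(m) offspring distribution (an assumption)

def goes_extinct(max_gen=200, cap=10_000):
    z = 1
    for _ in range(max_gen):
        if z == 0:
            return True
        if z > cap:              # a population this large essentially never dies out; count as survival
            return False
        z = int(rng.poisson(m, size=z).sum())
    return z == 0

q_hat = np.mean([goes_extinct() for _ in range(2000)])

# Smallest root of q = f(q) = exp(m(q-1)); iterating f from 0 converges monotonically to it.
q = 0.0
for _ in range(500):
    q = float(np.exp(m * (q - 1.0)))
print(f"simulated extinction frequency ~ {q_hat:.3f}, smallest fixed point q = {q:.3f}")
```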

Example 14.1.4: (Birth and death chains). Again take S = Z_+. Define P by

p_{i,i+1} = α_i, p_{i,i−1} = β_i = 1 − α_i for i ≥ 1, and p_{0,1} = α_0, p_{0,0} = β_0 = 1 − α_0.

The population increases by one with probability α_i and decreases by one with probability 1 − α_i.

Example 14.1.5: (Iterated function systems). Let G := {h_i : h_i : S → S, i = 1, 2, ..., L}, L ≤ ∞. Let μ = {p_i}_{i=1}^{L} be a probability distribution. Let {f_n}_{n=1}^{∞} be iid such that P(f_n = h_i) = p_i, 1 ≤ i ≤ L. Let X_0 be an S-valued random variable independent of {f_n}_{n=1}^{∞}. Then the iid random iteration scheme

X_1 = f_1(X_0), X_2 = f_2(X_1), ..., X_{n+1} = f_{n+1}(X_n) = f_{n+1}(f_n(··· (f_1(X_0)) ···))

is a Markov chain with transition probability matrix

p_ij = P(f_1(i) = j) = ∑_{r=1}^{L} p_r I(h_r(i) = j).

Remark 14.1.1: Any discrete state space Markov chain can be generated in this way (see II in Section 14.1.3 below).


14.1.3 Existence of a Markov chain

I. Kolmogorov's approach. Let Ω = S^{Z_+} = {ω : ω ≡ {x_n}_{n=0}^{∞}, x_n ∈ S for all n} be the set of all sequences {x_n}_{n=0}^{∞} with values in S. Let F_0 consist of all finite dimensional subsets of Ω of the form

A = {ω : ω = {x_n}_{n=0}^{∞}, x_j = a_j, 0 ≤ j ≤ m},

where m < ∞ and a_j ∈ S for all j = 0, 1, ..., m. Let F be the σ-algebra generated by F_0. Fix μ and P. For A as above, let

λ_{μ,P}(A) := μ_{a_0} p_{a_0 a_1} p_{a_1 a_2} ··· p_{a_{m−1} a_m}.

Then it can be shown, using the extension theorem from Chapter 2 or Kolmogorov's consistency theorem of Chapter 6, that λ_{μ,P} can be extended to a probability measure on F. Let X_n(ω) = x_n, if ω = {x_n}_{n=0}^{∞}, be the coordinate projection. Then {X_n}_{n=0}^{∞} is a sequence of S-valued random variables on (Ω, F, λ_{μ,P}) that is a Markov chain with initial distribution μ and transition probability P. A typical element ω = {x_n}_{n=0}^{∞} of Ω is called a sample path or a sample trajectory.

The following are examples of events (sets) in F that are not finite dimensional:

A_1 = {ω : lim_{n→∞} (1/n) ∑_{j=1}^{n} h(x_j) exists} for a given h : S → R,

A_2 = {ω : the set of limit points of {x_n}_{n=0}^{∞} is {a, b}}.

Thus, it is essential to go to (Ω, F, λ_{μ,P}) to discuss events involving asymptotic (long-term) behavior, i.e., behavior as n → ∞.

II. IIDRM approach (iteration of iid random maps). Let P = ((p_ij))_{K×K} be a stochastic matrix. Let f : S × [0, 1] → S be defined by

f(a_i, u) = a_j if p_{i1} + p_{i2} + ··· + p_{i(j−1)} ≤ u < p_{i1} + p_{i2} + ··· + p_{ij}, j = 1, 2, ..., K,   (1.1)

with the convention that an empty sum is 0 (so f(a_i, u) = a_1 if 0 ≤ u < p_{i1}). Let U_1, U_2, ... be iid Uniform [0, 1] random variables. Let f_n(·) := f(·, U_n). Then for each n, f_n maps S to S. Also, {f_n}_{n=1}^{∞} are iid.


Let X_0 be independent of {U_i}_{i=1}^{∞} with X_0 ∼ μ. Then the sequence {X_n}_{n=0}^{∞} defined by

X_{n+1} = f_{n+1}(X_n) = f(X_n, U_{n+1})

is a Markov chain with initial distribution μ and transition probability P. The underlying probability space on which X_0 and {U_i}_{i=1}^{∞} are defined can be taken to be the Lebesgue space ([0, 1], B([0, 1]), m), where m is the Lebesgue measure.

Finite Time Behavior of X_n

For each n ∈ N,

P(X_0 = a_0, X_1 = a_1, ..., X_n = a_n) = (∏_{j=1}^{n} P(X_j = a_j | X_{j−1} = a_{j−1}, ..., X_0 = a_0)) P(X_0 = a_0) = (∏_{j=1}^{n} p_{a_{j−1} a_j}) μ_{a_0}.   (1.2)

Thus, the joint distribution for any finite n is determined by μ and P. In particular,

P(X_n = a_n | X_0 = a_0) = ∑ ∏_{j=1}^{n} p_{a_{j−1} a_j} = (P^n)_{a_0 a_n},   (1.3)

where the sum in the middle term runs over all a_1, a_2, ..., a_{n−1} and P^n is the nth power of P. So the behavior of the distribution of X_n can be studied via that of P^n for large n. But this analytic approach is not as comprehensive as the probabilistic one, via the concept of recurrence, which will be described next.
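Numerically, (1.2)-(1.3) say that the law of X_n is the row vector μP^n; a short sketch (with an arbitrary illustrative 3-state matrix) that computes it:

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.5, 0.5]])
mu = np.array([1.0, 0.0, 0.0])        # start in state 0

dist = mu.copy()
for _ in range(50):
    dist = dist @ P                    # after the loop, dist = mu P^50, the law of X_50
print("P(X_50 = j):     ", dist.round(4))
print("(P^50) row 0:    ", np.linalg.matrix_power(P, 50)[0].round(4))   # same thing, via (1.3)
```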

14.1.4 Limit theory

Let {X_n}_{n=0}^{∞} be a Markov chain with state space S = {1, 2, ..., K}, K ≤ ∞, and transition probability matrix P = ((p_ij))_{K×K}.

Definition 14.1.2: (Hitting times). For any set A ⊂ S, the hitting time T_A is defined as

T_A = inf{n : X_n ∈ A, n ≥ 1},

i.e., it is the first time after 0 that the chain enters A. The random variable T_A is also called the first passage time for A or the first entrance time for A. Note that T_A is a stopping time (cf. Chapter 13) w.r.t. the filtration {F_n ≡ σ⟨X_j : 0 ≤ j ≤ n⟩}_{n≥0}.

By convention, inf ∅ = ∞. If A = {i}, write T_{{i}} = T_i for notational simplicity.

A concept of fundamental importance is recurrence of a state.

Definition 14.1.3: (Recurrence). A state i is recurrent or transient according as

P_i(T_i < ∞) = 1 or < 1,

where P_i denotes the probability distribution of {X_n}_{n=0}^{∞} with X_0 = i with probability 1. Thus i is recurrent iff

f_ii ≡ P(X_n = i for some 1 ≤ n < ∞ | X_0 = i) = 1.   (1.4)

Definition 14.1.4: (Null and positive recurrence). A recurrent state i is called null recurrent if E_i(T_i) = ∞ and positive recurrent if E_i(T_i) < ∞, where E_i refers to expectation w.r.t. the probability distribution P_i.

Example 14.1.6: (Frog in the well). Let S = {1, 2, ...} and let P = ((p_ij)) be given by

p_{i,i+1} = α_i and p_{i,1} = 1 − α_i for all i ≥ 1, 0 < α_i < 1.

Then

P_1(T_1 > r) = α_1 α_2 ··· α_r.

So P_1(T_1 < ∞) = 1 iff ∏_{i=1}^{r} α_i → 0 as r → ∞ iff ∑_{i=1}^{∞} (1 − α_i) = ∞. Further, 1 is positive recurrent iff ∑_{r=1}^{∞} (∏_{i=1}^{r} α_i) < ∞. Thus, if α_i = 1 − 1/(2i²), then 1 is transient; but if α_i ≡ α, 0 < α < 1 for all i, then 1 is positive recurrent. If α_i = 1 − 1/(ci), c > 1, then 1 is null recurrent (Problem 14.2).

Example 14.1.7: (Simple random walk). Let S = Z, the set of all integers, and p_{i,i+1} = p, p_{i,i−1} = q, 0 < p = 1 − q < 1. Then it can be shown by using the SLLN that for p ≠ 1/2, 0 is transient. But it is harder to show that for p = 1/2, 0 is null recurrent (see Corollary 14.1.5 below).

The next result says that after each return to i, the Markov chain starts afresh.

Proposition 14.1.1: (The strong Markov property). For any i ∈ S, any initial distribution μ of X_0, and any k < ∞ and a_1, ..., a_k in S,

P_μ(X_{T_i+j} = a_j, j = 1, 2, ..., k, T_i < ∞) = P_μ(T_i < ∞) P_i(X_j = a_j, j = 1, 2, ..., k).

Proof: For any n ∈ N,

P_μ(X_{T_i+j} = a_j, 1 ≤ j ≤ k, T_i = n)
= P_μ(X_{n+j} = a_j, 1 ≤ j ≤ k, X_n = i, X_r ≠ i, 1 ≤ r ≤ n − 1)
= P_μ(X_{n+j} = a_j, 1 ≤ j ≤ k | X_n = i, X_r ≠ i, 1 ≤ r ≤ n − 1) · P_μ(X_n = i, X_r ≠ i, 1 ≤ r ≤ n − 1)
= P_i(X_j = a_j, 1 ≤ j ≤ k) P_μ(X_n = i, X_r ≠ i, 1 ≤ r ≤ n − 1)
= P_i(X_j = a_j, 1 ≤ j ≤ k) P_μ(T_i = n).

Summing both sides over n yields the result.

The strong Markov property leads to the important and useful technique of breaking up the time evolution of a Markov chain into iid cycles. This, combined with the law of large numbers, yields the basic convergence results.

Definition 14.1.5: (IID cycles). Let i be a state. Let T_i^{(0)} = 0 and, for k ≥ 0,

T_i^{(k+1)} = inf{n : n > T_i^{(k)}, X_n = i} if T_i^{(k)} < ∞, and T_i^{(k+1)} = ∞ if T_i^{(k)} = ∞,   (1.5)

i.e., T_i^{(1)}, T_i^{(2)}, ... are the successive return times to state i.

Proposition 14.1.2: Let i be a recurrent state. Then P_i(T_i^{(k)} < ∞) = 1 for all k ≥ 1.

Proof: By the definition of recurrence, the claim is true for k = 1. If it is true for some k ≥ 1, then

P_i(T_i^{(k+1)} < ∞) = ∑_{j=k}^{∞} P_i(T_i^{(k+1)} < ∞, T_i^{(k)} = j)
= ∑_{j=k}^{∞} P_i(T_i^{(1)} < ∞) P_i(T_i^{(k)} = j)   (by the Markov property)
= P_i(T_i^{(1)} < ∞) P_i(T_i^{(k)} < ∞) = 1.

Let η_r = ({X_j : T_i^{(r)} ≤ j ≤ T_i^{(r+1)} − 1}; T_i^{(r+1)} − T_i^{(r)}) for r = 0, 1, 2, ....

The η_r's are called cycles or excursions.

Theorem 14.1.3: Let i be a recurrent state. Under P_i, the cycles {η_r}_{r=0}^{∞} are iid as random vectors with a random number of components. More precisely, for any k ∈ N,

P_i(η_r = (x_{r0}, x_{r1}, ..., x_{rj_r}), T_i^{(r+1)} − T_i^{(r)} = j_r, r = 0, 1, ..., k) = ∏_{r=0}^{k} P_i(η_0 = (x_{r0}, x_{r1}, ..., x_{rj_r}), T_i^{(1)} = j_r)   (1.6)

for any x_{r0}, x_{r1}, ..., x_{rj_r}, j_r, r = 0, 1, ..., k.

Proof: Use the strong Markov property repeatedly (Problem 14.7).

Proposition 14.1.4: For any state i, let N_i ≡ ∑_{n=1}^{∞} I_{{i}}(X_n) be the total number of visits to i. Then,

(i) i recurrent ⇒ P(N_i = ∞ | X_0 = i) = 1.

(ii) i transient ⇒ P(N_i = j | X_0 = i) = f_ii^j (1 − f_ii) for j = 0, 1, 2, ..., where f_ii = P(T_i < ∞ | X_0 = i) is the probability of returning to i starting from i; i.e., under P_i, N_i has a geometric distribution with parameter (1 − f_ii).

Proof: Follows from the strong Markov property (Proposition 14.1.1).

Corollary 14.1.5: A state i is recurrent iff

E_i N_i ≡ E(N_i | X_0 = i) = ∑_{n=1}^{∞} p_ii^{(n)} = ∞.   (1.7)

Proof: If i is recurrent, then P_i(N_i = ∞) = 1 and so E_i N_i = ∞. If i is transient, then E_i N_i = ∑_{j=0}^{∞} j f_ii^j (1 − f_ii) = f_ii/(1 − f_ii) < ∞. Also, by the monotone convergence theorem,

E(N_i | X_0 = i) = ∑_{n=1}^{∞} E(δ_{X_n i} | X_0 = i) = ∑_{n=1}^{∞} p_ii^{(n)}.

An Application

For the simple symmetric random walk (SSRW) on Z, 0 is recurrent. Indeed,

p_00^{(n)} = 0 if n is odd, and p_00^{(2k)} = ((2k)!/(k! k!)) (1/2)^{2k}.   (1.8)

By Stirling's formula,

(2k)!/(k! k!) ∼ √(2π) e^{−2k} (2k)^{2k+1/2} / (√(2π) e^{−k} k^{k+1/2})² = 4^k/√(πk),

and hence

p_00^{(2k)} ∼ 1/√(πk).

Thus ∑_{k=1}^{∞} p_00^{(2k)} = ∞, implying that 0 is recurrent.

By a similar argument, it can be shown that for the simple symmetric random walk on Z², the integer lattice in R², the origin is recurrent, but on Z^d for d ≥ 3 the origin is transient (Problem 14.3).
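The asymptotics in (1.8) are easy to check numerically; the sketch below (illustrative only) uses the recursion p_00^{(2k)} = p_00^{(2k−2)} · (2k−1)/(2k) and confirms both the 1/√(πk) decay and the divergence of the partial sums that makes 0 recurrent.

```python
from math import comb, pi, sqrt

# Exact values versus the Stirling approximation 1/sqrt(pi*k).
for k in (10, 100, 1000):
    exact = comb(2 * k, k) / 4.0 ** k
    print(k, round(exact, 5), round(1.0 / sqrt(pi * k), 5))

# Partial sums of p_00^(2k): they grow like 2*sqrt(K/pi), hence diverge (Corollary 14.1.5).
K = 100_000
p, s = 1.0, 0.0
for k in range(1, K + 1):
    p *= (2 * k - 1) / (2 * k)       # p_00^(2k) from p_00^(2k-2)
    s += p
print("sum up to K =", K, ":", round(s, 1), "   2*sqrt(K/pi) =", round(2 * sqrt(K / pi), 1))
```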

Corollary 14.1.6: If the state space S is finite, then at least one state must be recurrent.

Proof: Let S ≡ {1, 2, ..., K}, K < ∞. Since n = ∑_{i=1}^{K} ∑_{j=1}^{n} δ_{X_j i}, there exists an i_0 such that, as n → ∞, ∑_{j=1}^{n} δ_{X_j i_0} → ∞ with positive probability. Since a transient state is visited only finitely often w.p. 1 (Proposition 14.1.4), i_0 must be recurrent.

Definition 14.1.6: (Communication). A state i leads to j (written i → j) if p_ij^{(n)} > 0 for some n ≥ 1. A pair of states i, j are said to communicate if i → j and j → i, i.e., if there exist n ≥ 1, m ≥ 1 such that p_ij^{(n)} > 0 and p_ji^{(m)} > 0.

Definition 14.1.7: (Irreducibility). A Markov chain with state space S ≡ {1, 2, ..., K}, K ≤ ∞, and transition probability matrix P ≡ ((p_ij)) is irreducible if every pair i, j in S communicate.

Definition 14.1.8: A state i is absorbing if p_ii = 1.

It is easy to show that if i is absorbing and j → i, j ≠ i, then j is transient (Problem 14.4).

Proposition 14.1.7: (Solidarity property). Let i be recurrent and i → j. Then f_ji = 1 and j is recurrent, where f_ji ≡ P(T_i < ∞ | X_0 = j).

Proof: By the (strong) Markov property,

1 − f_ii = P(T_i = ∞ | X_0 = i) ≥ P(T_j < ∞, T_i = ∞ | X_0 = i) = P(T_i = ∞ | X_0 = j) P(T_j < T_i | X_0 = i) = (1 − f_ji) f*_ij,

where f*_ij = P(visiting j before visiting i | X_0 = i) (intuitively, one way of not returning to i, starting from i, is to visit j first and then never return to i). Now i recurrent and i → j yield 1 − f_ii = 0 and f*_ij > 0 (Problem 14.4), and so 1 − f_ji = 0, i.e., f_ji = 1. Thus, starting from j, the chain visits i w.p. 1. From i, it keeps returning to i infinitely often. In each of these excursions, the probability f*_ij of visiting j is positive, and since there are infinitely many such excursions and they are iid, the chain does visit j in one of these excursions w.p. 1. That is, j is recurrent.

An alternative proof using Corollary 14.1.5 is also possible (Problem 14.5).

Proposition 14.1.8: In an irreducible Markov chain with finite state space, all states are recurrent.

Proof: By Corollary 14.1.6, there is at least one recurrent state i_0. By irreducibility and the solidarity property, all states are recurrent.

Remark 14.1.2: A stronger result holds, namely, that for a finite state space irreducible Markov chain, all states are positive recurrent (Problem 14.6).


Theorem 14.1.9: (A law of large numbers). Let i be positive recurrent. Let

N_n(j) = #{k : 0 ≤ k ≤ n, X_k = j}, j ∈ S,   (1.9)

be the number of visits to j during {0, 1, ..., n}, and let L_n(j) ≡ N_n(j)/(n+1) be the empirical distribution at time n. Let X_0 = i with probability 1. Then

L_n(j) → V_ij/E_i(T_i) w.p. 1,   (1.10)

where V_ij = E_i(∑_{k=0}^{T_i−1} δ_{X_k, j}) is the mean number of visits to j during {0, 1, ..., T_i − 1} starting from i. In particular, L_n(i) → 1/E_i T_i w.p. 1.

Proof: For each n, let k ≡ k(n) be such that T_i^{(k)} ≤ n < T_i^{(k+1)}. Then

N_{T_i^{(k)}}(j) ≤ N_n(j) ≤ N_{T_i^{(k+1)}}(j),

i.e.,

∑_{r=0}^{k} ξ_r ≤ N_n(j) ≤ ∑_{r=0}^{k+1} ξ_r,

where ξ_r = #{l : T_i^{(r)} ≤ l < T_i^{(r+1)}, X_l = j} is the number of visits to j during the rth excursion. Since V_ij ≡ E_i ξ_1 ≤ E_i T_i < ∞, by the SLLN, with probability 1,

(1/k(n)) ∑_{r=0}^{k(n)} ξ_r → E_i(ξ_1) = V_ij,

and (1/k) T_i^{(k)} → E_i(T_i). Since n lies between T_i^{(k(n))} and T_i^{(k(n)+1)}, it follows that n/k(n) → E_i T_i. Since

(k/n) (1/k) ∑_{r=0}^{k} ξ_r ≤ N_n(j)/n ≤ ((k+1)/n) (1/(k+1)) ∑_{r=0}^{k+1} ξ_r,

it follows that L_n(j) = (n/(n+1)) · (N_n(j)/n) → E_i(ξ_1)/E_i(T_i).

Note that the above proof works for any initial distribution μ such that P_μ(T_i < ∞) = 1 and, further, the limit of L_n(j) is independent of any such μ. Thus, if (S, P) is irreducible and one state i is positive recurrent, then for any initial distribution μ, P_μ(T_i < ∞) = 1.

Note finally that the proof of Theorem 14.1.9 can be adapted to yield a criterion for transience, null recurrence, and positive recurrence of a state. Thus, the following holds.

Corollary 14.1.10: Fix a state i. Then


(i) i is transient iff lim_{n→∞} N_n(i) exists and is finite w.p. 1 for any initial distribution iff

lim_{n→∞} E_i N_n(i) = ∑_{k=0}^{∞} p_ii^{(k)} < ∞.

(ii) i is null recurrent iff ∑_{k=0}^{∞} p_ii^{(k)} = ∞ and

lim_{n→∞} E_i L_n(i) = lim_{n→∞} (1/n) ∑_{k=0}^{n} p_ii^{(k)} = 0.

(iii) i is positive recurrent iff

lim_{n→∞} E_i L_n(i) = lim_{n→∞} (1/n) ∑_{k=0}^{n} p_ii^{(k)} > 0.

(iv) Let (S, P) be irreducible and let one state i be positive recurrent. Then, for any j and any initial distribution μ,

L_n(j) → 1/(E_j T_j) ∈ (0, ∞) w.p. 1.

Thus, for the simple symmetric random walk on the integers,

p_00^{(2k)} ∼ c/√k as k → ∞, 0 < c < ∞,

and hence

∑_{n=0}^{∞} p_00^{(n)} = ∞ and (1/n) ∑_{k=0}^{n} p_00^{(k)} → 0.

Thus 0 is null recurrent.

It is not difficult to show that if j leads to i, i.e., if f_ji ≡ P_j(T_i < ∞) is positive, then the number ξ_1 of visits to j between consecutive visits to i has all moments finite (Problem 14.9).
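A simulation sketch of the law of large numbers above (the 3-state matrix is an arbitrary illustrative choice): the occupation frequency of state 0 should match the reciprocal of the observed mean return time to 0.

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.2, 0.8, 0.0],
              [0.3, 0.4, 0.3],
              [0.5, 0.0, 0.5]])
n = 100_000
x, counts = 0, np.zeros(3)
return_times, last_zero = [], 0
for t in range(1, n + 1):
    x = rng.choice(3, p=P[x])
    counts[x] += 1
    if x == 0:
        return_times.append(t - last_zero)   # a completed excursion length, a copy of T_0
        last_zero = t

print("L_n(j)                 :", (counts / n).round(4))
print("1 / (mean return to 0) :", round(1.0 / float(np.mean(return_times)), 4))   # ~ L_n(0)
```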

Using the SLLN, Theorem 14.1.9 can be extended as follows to cover both the null and positive recurrent cases.

Theorem 14.1.11: Let i be a recurrent state. Then, for any j and any initial distribution μ such that P_μ(T_i < ∞) = 1,

L_n(j) ≡ (1/(n+1)) ∑_{k=0}^{n} δ_{X_k j} → π_ij ≡ V_ij/(E_i T_i) w.p. 1   (1.11)

as n → ∞, where V_ij < ∞. (If E_i T_i = ∞, then π_ij = 0 for all j.)

Corollary 14.1.12: Let (S, P) be irreducible and let one state be recurrent. Then, for any j and any initial distribution,

L_n(j) → c_j as n → ∞, w.p. 1,   (1.12)

where c_j = 1/(E_j T_j) if E_j T_j < ∞ and c_j = 0 otherwise.

The Basic Ergodic Theorem

Taking expectations in Theorem 14.1.11 and using the bounded convergence theorem leads to

Corollary 14.1.13: Let i be recurrent. Then, for any initial distribution μ with P_μ(T_i < ∞) = 1,

E_μ(L_n(j)) = (1/(n+1)) ∑_{k=0}^{n} P_μ(X_k = j) → V_ij/E_i(T_i) := π_ij as n → ∞.   (1.13)

Theorem 14.1.14: Let i be positive recurrent. Let π_j := π_ij for j ∈ S. Then {π_j}_{j∈S} is a stationary distribution for P, i.e., ∑_j π_j = 1 and ∑_{l∈S} π_l p_lj = π_j for all j.

Proof:

(1/(n+1)) ∑_{k=0}^{n} p_ij^{(k)} = (1/(n+1)) δ_ij + (1/(n+1)) ∑_{k=1}^{n} p_ij^{(k)}
= (1/(n+1)) δ_ij + (1/(n+1)) ∑_{k=1}^{n} ∑_l p_il^{(k−1)} p_lj
= (1/(n+1)) δ_ij + ∑_l ((1/(n+1)) ∑_{k=1}^{n} p_il^{(k−1)}) p_lj.

Taking the limit as n → ∞ and using Fatou's lemma yields

π_j ≥ ∑_l π_l p_lj.   (1.14)

If strict inequality were to hold for some j_0, then adding over j would yield

∑_j π_j > ∑_j ∑_l π_l p_lj = ∑_l π_l (∑_j p_lj) = ∑_l π_l.

Since ∑_{j∈S} V_ij = E_i T_i, ∑_j π_j = 1, so there cannot be strict inequality in (1.14) for any j.

Therefore, the following has been established.


Theorem 14.1.15: Let i be a positive recurrent state. Let

π_j := E_i(∑_{k=0}^{T_i−1} δ_{X_k, j}) / E_i(T_i).

Then

(i) {π_j}_{j∈S} is a stationary distribution.

(ii) For any j and any initial distribution μ with P_μ(T_i < ∞) = 1,

(a) L_n(j) ≡ (1/(n+1)) #{k : X_k = j, 0 ≤ k ≤ n} → π_j w.p. 1 (P_μ),

(b) (1/(n+1)) ∑_{k=0}^{n} P_μ(X_k = j) → π_j.

In particular, if j = i, we have

(1/(n+1)) ∑_{k=0}^{n} p_ii^{(k)} → π_i = 1/E_i(T_i).   (1.15)

Now let i be a positive recurrent state and let j be such that i → j. Then V_ij > 0, and by the solidarity property, f_ji = 1 and j is recurrent. Now taking μ = δ_j in (ii) above and using Corollary 14.1.13 leads to the conclusions that

(1/(n+1)) ∑_{k=0}^{n} p_jj^{(k)} → π_j > 0   (1.16)

and

π_j = 1/(E_j T_j).   (1.17)

Thus, j is positive recurrent.

Now Theorem 14.1.15 leads to the basic ergodic theorem for Markov chains.

Theorem 14.1.16: Let (S, P) be irreducible and let one state be positive recurrent. Then

(i) all states are positive recurrent,

(ii) π ≡ {π_j ≡ (E_j T_j)^{−1} : j ∈ S} is a stationary distribution for (S, P),

(iii) for any initial distribution μ and any j ∈ S,

(a) (1/(n+1)) ∑_{k=0}^{n} P_μ(X_k = j) → π_j, i.e., π is the unique limiting distribution (in the average sense) and hence the unique stationary distribution,

(b) L_n(j) ≡ (1/(n+1)) ∑_{k=0}^{n} δ_{X_k j} → π_j w.p. 1 (P_μ).
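For a finite irreducible chain, the conclusions of Theorem 14.1.16 can be checked directly: solve πP = π with ∑_j π_j = 1 and compare with the Cesaro averages of the matrix powers. The matrix below is an arbitrary illustrative choice.

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0],
              [0.3, 0.4, 0.3],
              [0.5, 0.0, 0.5]])
K = P.shape[0]

# Solve the linear system (P^T - I) pi = 0 together with the normalization sum(pi) = 1.
A = np.vstack([P.T - np.eye(K), np.ones(K)])
b = np.concatenate([np.zeros(K), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print("pi =", pi.round(4))

# Cesaro average of the matrix powers P^0, ..., P^n: every row should approach pi.
n = 500
Pk, avg = np.eye(K), np.zeros((K, K))
for _ in range(n + 1):
    avg += Pk
    Pk = Pk @ P
avg /= (n + 1)
print("Cesaro average rows:\n", avg.round(4))
```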

There is a converse to the above result. To develop it, note first that if j is a transient state, then the total number N_j of visits to j is finite w.p. 1 (for any initial distribution μ), and hence L_n(j) → 0 w.p. 1; taking expectations, for any i,

(1/(n+1)) ∑_{k=0}^{n} p_ij^{(k)} → 0 as n → ∞.

Now suppose that π ≡ {π_j : j ∈ S} is a stationary distribution for (S, P). Then, for all j, π_j = ∑_{i∈S} π_i p_ij, and hence

π_j = ∑_{i∈S, r∈S} π_r p_ri p_ij = ∑_{r∈S} π_r p_rj^{(2)},

and, by induction,

π_j = ∑_{i∈S} π_i p_ij^{(k)} for all k ≥ 0,

implying

π_j = ∑_{i∈S} π_i ((1/(n+1)) ∑_{k=0}^{n} p_ij^{(k)}).

Now if j is transient, then for any i

(1/(n+1)) ∑_{k=0}^{n} p_ij^{(k)} → 0 as n → ∞,

and so, by the bounded convergence theorem,

π_j = lim_{n→∞} ∑_{i∈S} π_i ((1/(n+1)) ∑_{k=0}^{n} p_ij^{(k)}) = ∑_{i∈S} π_i · 0 = 0.

Thus, π_j > 0 implies that j is recurrent. For j recurrent, it follows from arguments similar to those used to establish Theorem 14.1.15 that

(1/(n+1)) ∑_{k=0}^{n} p_ij^{(k)} → f_ij (1/(E_j T_j)).

Thus, π_j = (∑_{i∈S} π_i f_ij)(1/(E_j T_j)). But ∑_{i∈S} π_i f_ij ≥ π_j f_jj = π_j > 0. So E_j T_j < ∞, i.e., j is positive recurrent. Summarizing the above discussion leads to

Proposition 14.1.17: Let π ≡ {π_j : j ∈ S} be a stationary distribution for (S, P). Then π_j > 0 implies that j is positive recurrent.

It is now possible to state a converse to Theorem 14.1.16.

Theorem 14.1.18: Let (S, P) be irreducible and admit a stationary distribution π ≡ {π_j : j ∈ S}. Then

(i) all states are positive recurrent,

(ii) π ≡ {π_j : π_j = 1/(E_j T_j), j ∈ S} is the unique stationary distribution,

(iii) for any initial distribution μ and for all j ∈ S,

(a) (1/(n+1)) ∑_{k=0}^{n} P_μ(X_k = j) → π_j,

(b) (1/(n+1)) ∑_{k=0}^{n} δ_{X_k j} → π_j w.p. 1 (P_μ).

In summary, for an irreducible Markov chain (S, P) with a countable state space, a stationary distribution π exists iff all states are positive recurrent, in which case π is unique, and for any initial distribution μ the distribution at time n converges to π in the Cesaro sense (i.e., in average) and the law of large numbers (LLN) holds. For the finite state space case, irreducibility suffices (Problem 14.6).

If h : S → R is a function such that ∑_{j∈S} |h(j)| π_j < ∞, then the LLN can be strengthened to

(1/(n+1)) ∑_{k=0}^{n} h(X_k) → ∑_{j∈S} h(j) π_j w.p. 1 (P_μ)

for any μ (Problem 14.10). In particular, if A is any subset of S, then

L_n(A) ≡ (1/(n+1)) ∑_{k=0}^{n} I_A(X_k) → π(A) ≡ ∑_{j∈A} π_j

w.p. 1 (P_μ) for any μ.

An important question that remains is whether the convergence of P_μ(X_n = j) to π_j can be strengthened from the average sense to the full sense, i.e., from the convergence to π_j of (1/(n+1)) ∑_{k=0}^{n} P_μ(X_k = j) to the convergence to π_j of P_μ(X_n = j) as n → ∞. For this, the additional hypothesis needed is aperiodicity.

Definition 14.1.9: For any state i, the period d_i of the state i is

d_i = g.c.d.{n : n ≥ 1, p_ii^{(n)} > 0}.

Further, i is called aperiodic if d_i = 1.

Example 14.1.8: Let S = {0, 1, 2} and

P = [ 0 1 0 ; 1/2 0 1/2 ; 0 1 0 ].

Then d_i = 2 for all i.

Note that in this example, since (S, P) is finite and irreducible, it has a unique stationary distribution, given by π = (1/4, 1/2, 1/4), and

(1/(n+1)) ∑_{k=0}^{n} p_00^{(k)} → 1/4 as n → ∞,

but p_00^{(2n+1)} = 0 for each n and p_00^{(2n)} → 1/2 as n → ∞. This suggests that aperiodicity will be needed. It turns out that if (S, P) is irreducible, the period d_i is the same for all i (Problem 14.5).
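A short numerical companion to Example 14.1.8 (illustrative only): P^n(0,0) oscillates between 0 and 1/2, while the running Cesaro averages approach π_0 = 1/4.

```python
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

Pk, running_sum = np.eye(3), 1.0          # running_sum starts with the k = 0 term p_00^(0) = 1
for n in range(1, 41):
    Pk = Pk @ P
    running_sum += Pk[0, 0]
    if n >= 37:
        print(f"n={n:2d}   P^n(0,0)={Pk[0,0]:.3f}   Cesaro average={running_sum / (n + 1):.3f}")
```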

Theorem 14.1.19: Let (S, P) be irreducible, positive recurrent, and aperiodic (i.e., d_i = 1 for all i). Let {X_n}_{n≥0} be an (S, P) Markov chain. Then, for any initial distribution μ, lim_{n→∞} P_μ(X_n = i) exists and equals π_i for all i, where π ≡ {π_j : j ∈ S} is the unique stationary distribution.

There are many proofs known for this result; two of them are outlined below. The first uses the discrete renewal theorem and the second uses a coupling argument.

Proof 1: (via the discrete renewal theorem). Fix a state i. Recall that p_ii^{(n)} = P(X_n = i | X_0 = i), n ≥ 0, and f_ii^{(n)} = P(T_i = n | X_0 = i), n ≥ 1. Using the Markov property, for n ≥ 1,

p_ii^{(n)} = P(X_n = i | X_0 = i) = ∑_{k=1}^{n} P(X_n = i, T_i = k | X_0 = i)
= ∑_{k=1}^{n} P(T_i = k | X_0 = i) P(X_n = i | X_k = i)
= ∑_{k=1}^{n} f_ii^{(k)} p_ii^{(n−k)}.

Let a_n ≡ p_ii^{(n)}, n ≥ 0, and p_n ≡ f_ii^{(n)}, n ≥ 1. Then {p_n}_{n≥1} is a probability distribution and {a_n}_{n≥0} satisfies the discrete renewal equation

a_n = b_n + ∑_{k=0}^{n} a_{n−k} p_k, n ≥ 0,

where b_n = δ_{n0} and p_0 = 0. It can be shown that d_i = 1 iff

g.c.d.{k : k ≥ 1, p_k > 0} = 1.

Further, E_i T_i = ∑_{k=1}^{∞} k p_k < ∞, by the assumption of positive recurrence. It now follows from the discrete renewal theorem (see Section 8.5) that lim_{n→∞} a_n exists and equals

(∑_{j=0}^{∞} b_j) / (∑_{k=1}^{∞} k p_k) = 1/(E_i T_i) = π_i.

Proof 2: (Using coupling arguments). Let {X_n}_{n≥0} and {Y_n}_{n≥0} be independent (S, P) Markov chains such that Y_0 has distribution π and X_0 has distribution μ. Then {Z_n = (X_n, Y_n)}_{n≥0} is a Markov chain with state space S × S and transition probability P × P ≡ ((p_{(i,j),(k,l)} = p_ik p_jl)). Further, it can be shown (see Hoel, Port and Stone (1972)) that

(a) {π_{i,j} ≡ π_i π_j : (i, j) ∈ S × S} is a stationary distribution for {Z_n},

(b) since (S, P) is irreducible and aperiodic, the pair (S × S, P × P) is irreducible.

Since (S × S, P × P) is irreducible and admits a stationary distribution, it is necessarily recurrent, and so from any initial distribution the first passage time T_D of the diagonal D ≡ {(i, i) : i ∈ S} is finite with probability one. Thus, T_c ≡ min{n : n ≥ 1, X_n = Y_n} is finite w.p. 1. The random variable T_c is called the coupling time. Let

X̃_n = X_n for n ≤ T_c, and X̃_n = Y_n for n > T_c.

Then it can be verified that {X_n}_{n≥0} and {X̃_n}_{n≥0} are identically distributed Markov chains. Thus

P(X_n = j) = P(X̃_n = j) = P(X̃_n = j, T_c < n) + P(X̃_n = j, T_c ≥ n)

and

P(Y_n = j) = P(Y_n = j, T_c < n) + P(Y_n = j, T_c ≥ n),

implying that

|P(X_n = j) − P(Y_n = j)| ≤ 2P(T_c ≥ n).

Since P(T_c ≥ n) → 0 as n → ∞ and, by the stationarity of π, P(Y_n = j) = π_j for all n and j, it follows that for any j,

lim_{n→∞} P(X_n = j) exists and equals π_j.
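The coupling inequality above is easy to see in a simulation sketch (the aperiodic 3-state chain below is an arbitrary choice): two independent copies, one started at state 0 and one started from π, meet after a random coupling time T_c, and the tail P(T_c ≥ n) bounds the distance between the two marginal laws.

```python
import numpy as np

rng = np.random.default_rng(4)
P = np.array([[0.2, 0.8, 0.0],
              [0.3, 0.4, 0.3],
              [0.5, 0.0, 0.5]])

# Stationary distribution (left eigenvector for eigenvalue 1), used as the law of Y_0.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

def coupling_time():
    x, y = 0, rng.choice(3, p=pi)
    for n in range(1, 10_000):
        x = rng.choice(3, p=P[x])
        y = rng.choice(3, p=P[y])
        if x == y:
            return n
    return np.inf

T = np.array([coupling_time() for _ in range(5_000)])
for n in (1, 2, 5, 10):
    print(f"P(T_c >= {n}) ~ {(T >= n).mean():.3f}")
```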


In order to obtain results on rates of convergence for |P(X_n = j) − π_j|, one needs more assumptions on the distribution of the return time T_i or the coupling time T_c. For results on this, the books of Hoel et al. (1972), Meyn and Tweedie (1993), and Lindvall (1992) are good sources. It can be shown that in the irreducible case, if for some i, P_i(T_i > n) = O(λ_1^n) for some 0 < λ_1 < 1, then ∑_{j∈S} |P_i(X_n = j) − π_j| = O(λ_2^n) for some λ_1 < λ_2 < 1. In particular, this geometric convergence holds for the finite state space irreducible case.

The main results of this section are summarized below.

Theorem 14.1.20: Let {X_n}_{n≥0} be a Markov chain with a countable state space S = {0, 1, 2, ..., K}, K ≤ ∞, and transition probability matrix P ≡ ((p_ij)). Let (S, P) be irreducible. Then

(a) All states are recurrent iff for some i in S, ∑_{n=1}^{∞} p_ii^{(n)} = ∞.

(b) All states are positive recurrent iff for some i in S,

lim_{n→∞} (1/n) ∑_{k=0}^{n} p_ii^{(k)} exists and is strictly positive.

(c) There exists a stationary probability distribution π iff there exists a positive recurrent state.

(d) If there exists a stationary distribution π ≡ {π_j : j ∈ S}, then

(i) it is unique, all states are positive recurrent, and π_j = (E_j T_j)^{−1} for all j ∈ S,

(ii) for all i, j ∈ S, (1/(n+1)) ∑_{k=0}^{n} p_ij^{(k)} → π_j as n → ∞,

(iii) for any initial distribution and any j ∈ S, (1/(n+1)) ∑_{k=0}^{n} δ_{X_k j} → π_j w.p. 1,

(iv) if ∑_{j∈S} |h(j)| π_j < ∞, then (1/(n+1)) ∑_{k=0}^{n} h(X_k) → ∑_{j∈S} h(j) π_j w.p. 1,

(v) if, in addition, d_i = 1 for some i ∈ S, then d_j = 1 for all j ∈ S and, for all i, j,

p_ij^{(n)} → π_j as n → ∞.

14.2 Markov chains on a general state space

14.2.1 Basic definitions

Let {X_n}_{n≥0} be a sequence of random variables with values in some space S that is not necessarily finite or countable. The Markov property says that, conditioned on X_n, X_{n−1}, ..., X_0, the distribution of X_{n+1} depends only on X_n and not on the past {X_j : j ≤ n − 1}. When S is not countable, to make this notion of the Markov property precise, one needs the following setup.

Let (S, S) be a measurable space. Let (Ω, F, P) be a probability space and {X_n(ω)}_{n≥0} be a sequence of maps from Ω to S such that for each n, X_n is (F, S)-measurable. Let F_n ≡ σ⟨X_j : 0 ≤ j ≤ n⟩ be the sub-σ-algebra of F generated by {X_j : 0 ≤ j ≤ n}. In what follows, for any sub-σ-algebra Y of F, let P(· | Y) denote the conditional probability given Y, as defined in Chapter 12.

Definition 14.2.1: The sequence {X_n}_{n≥0} is a Markov chain if for all A ∈ S,

P(X_{n+1} ∈ A | F_n) = P(X_{n+1} ∈ A | σ⟨X_n⟩) w.p. 1, for all n ≥ 0,   (2.1)

for any initial distribution of X_0, where σ⟨X_n⟩ is the sub-σ-algebra of F generated by X_n.

It is easy to verify that (2.1) holds for all A ∈ S iff for any bounded measurable h from (S, S) to (R, B(R)),

E(h(X_{n+1}) | F_n) = E(h(X_{n+1}) | σ⟨X_n⟩) w.p. 1 for all n ≥ 0   (2.2)

for any initial distribution of X_0.

Another equivalent formulation, which makes the Markov property symmetric with respect to time, is the following; it says that, given the present, the past and the future are independent.

Proposition 14.2.1: A sequence of random variables {X_n}_{n≥0} satisfies (2.1) iff for any {A_j}_{j=0}^{n+k} ⊂ S,

P(X_j ∈ A_j, j = 0, 1, 2, ..., n − 1, n + 1, ..., n + k | σ⟨X_n⟩)
= P(X_j ∈ A_j, j = 0, 1, 2, ..., n − 1 | σ⟨X_n⟩) P(X_j ∈ A_j, j = n + 1, ..., n + k | σ⟨X_n⟩)

w.p. 1.

The proof is somewhat involved but not difficult. The countable state space case is easier (Problem 14.1).

An important tool for studying Markov chains is the notion of a transition probability function.

Definition 14.2.2: A function P : S × S → [0, 1] is called a transition probability function on S if

(i) for all x in S, P(x, ·) is a probability measure on (S, S), and

(ii) for all A ∈ S, P(·, A) is an S-measurable function from (S, S) to [0, 1].

Under some general conditions guaranteeing the existence of regular conditional probabilities, the right side of (2.1) can be expressed as P_n(X_n, A), where P_n(·, ·) is a transition probability function on S. In such a case, yet another formulation of the Markov property is in terms of the joint distributions of X_0, X_1, ..., X_n for any finite n.

Proposition 14.2.2: A sequence {X_n}_{n≥0} satisfies (2.1) iff for any n ∈ N and A_0, A_1, ..., A_n ∈ S,

P(X_j ∈ A_j, j = 0, 1, 2, ..., n) = ∫_{A_0} ··· ∫_{A_{n−2}} ∫_{A_{n−1}} P_{n−1}(x_{n−1}, A_n) P_{n−2}(x_{n−2}, dx_{n−1}) ··· P_1(x_0, dx_1) μ_0(dx_0),

where μ_0(A) = P(X_0 ∈ A), A ∈ S.

The proof is by induction and is left as an exercise (Problem 14.16). In what follows, it will be assumed that such transition functions exist.

Definition 14.2.3: A sequence of S-valued random variables {X_n}_{n≥0} is called a Markov chain with transition function P(·, ·) if (2.1) holds and the right side equals P(X_n, A) for all n ∈ N.

14.2.2 Examples

Example 14.2.1: (IID sequence). Let {X_n}_{n≥0} be iid S-valued random variables with distribution μ. Then {X_n}_{n≥0} is a Markov chain with transition function P(x, A) ≡ μ(A) and initial distribution μ.

Example 14.2.2: ((Additive) random walk in R^k). Let {η_j}_{j≥1} be iid R^k-valued random variables with distribution ν. Let X_0 be an R^k-valued random variable independent of {η_j}_{j≥1} and with distribution μ. Let

X_{n+1} = X_n + η_{n+1} = X_0 + ∑_{j=1}^{n+1} η_j, n ≥ 0.

Then {X_n}_{n≥0} is an R^k-valued Markov chain with transition function P(x, A) ≡ ν(A − x) and initial distribution μ.

Example 14.2.3: (Multiplicative random walk on R_+). Let {η_n}_{n≥1} be iid nonnegative random variables with distribution ν and let X_0 be a nonnegative random variable with distribution μ, independent of {η_n}_{n≥1}. Let X_{n+1} = X_n η_{n+1}, n ≥ 0. Then {X_n}_{n≥0} is a Markov chain with state space R_+ and transition function

P(x, A) = ν(x^{−1}A) if x > 0, and P(0, A) = I_A(0),

and initial distribution μ.

This is a model for the value of a stock portfolio subject to random growth rates. Clearly, the above iteration scheme leads to X_n = X_0 ∏_{i=1}^{n} η_i, which, when normalized appropriately, leads to what is known as the geometric Brownian motion model in the financial mathematics literature.

Example 14.2.4: (AR(1) time series). Let ρ ∈ R and let ν be a probability measure on (R, B(R)). Let {η_j}_{j≥1} be iid with distribution ν and let X_0 be a random variable independent of {η_j}_{j≥1} with distribution μ. Let

X_{n+1} = ρX_n + η_{n+1}, n ≥ 0.

Then {X_n}_{n≥0} is an R-valued Markov chain with transition function P(x, A) ≡ ν(A − ρx) and initial distribution μ.
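A quick simulation sketch of Example 14.2.4 (illustrative only; the choices ρ = 0.8 and N(0, 1) innovations are assumptions, in which case the stationary law is N(0, 1/(1 − ρ²))): the marginal distribution of X_n forgets its initial condition geometrically fast.

```python
import numpy as np

rng = np.random.default_rng(5)
rho, n, paths = 0.8, 200, 50_000

x = np.full(paths, 10.0)                  # start far from stationarity
for _ in range(n):
    x = rho * x + rng.standard_normal(paths)   # X_{n+1} = rho X_n + eta_{n+1}

print("sample mean/var of X_n:", round(float(x.mean()), 3), round(float(x.var()), 3))
print("stationary mean/var   :", 0.0, round(1.0 / (1.0 - rho**2), 3))
```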

Example 14.2.5: (Random AR(1) vector time series). Let {(A_i, b_i)}_{i≥1} be iid such that A_i is a k × k random matrix and b_i is a k × 1 random vector. Let μ be a probability measure on (R^k, B(R^k)). Let X_0 be an R^k-valued random variable independent of {(A_i, b_i)}_{i≥1} and with distribution μ. Let

X_{n+1} = A_{n+1} X_n + b_{n+1}, n ≥ 0.

Then {X_n}_{n≥0} is an R^k-valued Markov chain with transition function P(x, B) ≡ P(A_1 x + b_1 ∈ B) and initial distribution μ.

Example 14.2.6: (Waiting time chain). Let {η_i}_{i≥1} be iid real valued random variables with distribution ν and let X_0 be independent of {η_i}_{i≥1} with distribution μ. Let

X_{n+1} = max{X_n + η_{n+1}, 0}.

Then {X_n}_{n≥0} is a nonnegative valued Markov chain with transition function P(x, A) ≡ P(max{x + η_1, 0} ∈ A) and initial distribution μ. In the queuing theory context, if η_n represents the difference between the nth interarrival time and service time, then X_n represents the waiting time at the nth arrival.

All the above are special cases of the following:

Example 14.2.7: (Iterated function system (IFS)). Let (S, S) be a measurable space and let (Ω, F, P) be a probability space. Let {f_i(x, ω)}_{i≥1} be such that for each i, f_i : S × Ω → S is (S × F, S)-measurable and the stochastic processes {f_i(·, ω)}_{i≥1} are iid. Let X_0 be an S-valued random variable on (Ω, F, P) with distribution μ, independent of {f_i(·, ·)}_{i≥1}. Let

X_{n+1}(ω) = f_{n+1}(X_n(ω), ω), n ≥ 0.

Then {X_n}_{n≥0} is an S-valued Markov chain with transition function P(x, A) ≡ P(f_1(x, ω) ∈ A) and initial distribution μ.

It turns out that when S is a Polish space, with S the Borel σ-algebra, and P(·, ·) is a transition function on S, any S-valued Markov chain {X_n}_{n≥0} with transition function P(·, ·) can be generated by an IFS as in Example 14.2.7. For a proof, see Kifer (1988) and Athreya and Stenflo (2003). When {f_i}_{i≥1} are iid such that f_1 has only finitely many choices {h_j}_{j=1}^{k}, where each h_j is an affine contraction on R^p, then the Markov chain {X_n} converges in distribution to some π(·). Further, the limit point set of {X_n} coincides w.p. 1 with the support M of the limit distribution π(·). This has been exploited by Barnsley and others to solve the inverse problem: given a compact set M in R^p, find an IFS {h_j}_{j=1}^{k} of affine contractions so that, by generating the Markov chain {X_n}, one can get an approximate picture of M. This is called data compression or image generation by Markov chain Monte Carlo. See Barnsley (1992) for details. More generally, when the f_i are Lipschitz maps, the following holds.

Theorem 14.2.3: Let {f_i(·, ·)}_{i≥1} be iid Lipschitz maps on S. Assume

(i) E|log s(f_1)| < ∞ and E log s(f_1) < 0, where

s(f_1) = sup_{x ≠ y} d(f_1(x, ω), f_1(y, ω)) / d(x, y)

and d(·, ·) is the metric on S, and

(ii) for some x_0, E(log d(f_1(x_0, ω), x_0))^+ < ∞.

Then,

(i) the backward iterations X̂_n(x, ω) ≡ f_1(f_2(... f_{n−1}(f_n(x, ω), ω) ...)) converge w.p. 1 to a random variable X(ω) that is independent of x,

(ii) for all x, the forward iterations X_n(x, ω) ≡ f_n(f_{n−1}(... (f_1(x, ω), ω) ...)) converge in distribution to X(ω).


That (ii) is a consequence of (i) is clear, since for each n and x, X̂_n(x, ω) and X_n(x, ω) are identically distributed.

The proof of (i) involves showing that {X̂_n} is a Cauchy sequence in S w.p. 1 (Problem 14.17).
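A one-dimensional sketch of Theorem 14.2.3 with random affine maps f_i(x) = A_i x + b_i (so s(f_i) = |A_i|; the distributions below are arbitrary choices with E log s(f_1) < 0): for one fixed realization of the maps, the backward compositions stabilize as n grows, while the forward compositions keep moving and converge only in distribution.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 80
A = rng.uniform(-0.9, 0.9, size=N)      # one fixed realization of the iid maps f_1, ..., f_N
b = rng.standard_normal(N)

def backward(x0, n):                     # f_1(f_2(... f_n(x0) ...)): apply f_n first, f_1 last
    x = x0
    for j in reversed(range(n)):
        x = A[j] * x + b[j]
    return x

def forward(x0, n):                      # f_n(f_{n-1}(... f_1(x0) ...))
    x = x0
    for j in range(n):
        x = A[j] * x + b[j]
    return x

for n in (10, 20, 40, 80):
    print(n, round(backward(5.0, n), 6), round(forward(5.0, n), 6))
```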

14.2.3 Chapman-Kolmogorov equations

Let P(·, ·) be a transition function on (S, S). For each n ≥ 0, define a sequence of functions {P^{(n)}(·, ·)}_{n≥0} by the iteration scheme

P^{(n+1)}(x, A) = ∫_S P^{(n)}(y, A) P(x, dy), n ≥ 0,   (2.3)

where P^{(0)}(x, A) ≡ I_A(x). It can be verified by induction that for each n, P^{(n)}(·, ·) is a transition probability function.

Definition 14.2.4: P^{(n)}(·, ·) defined by (2.3) is called the n-step transition function generated by P(·, ·).

It is easy to show by induction that if X_0 = x w.p. 1, then

P(X_n ∈ A) = P^{(n)}(x, A) for all n ≥ 0   (2.4)

(Problem 14.18). This leads to the Chapman-Kolmogorov equations.

Proposition 14.2.4: Let P(·, ·) be a transition probability function on (S, S) and let P^{(n)}(·, ·) be defined by (2.3). Then for any n, m ≥ 0,

P^{(n+m)}(x, A) = ∫ P^{(n)}(y, A) P^{(m)}(x, dy).   (2.5)

Proof: The analytic verification is straightforward by induction. One can also verify this probabilistically using the Markov property. Indeed, from (2.4) the left side of (2.5) is

P_x(X_{n+m} ∈ A) = E_x(P(X_{n+m} ∈ A | F_m)) = E_x(P^{(n)}(X_m, A))   (by the Markov property)
= right-hand side of (2.5),

where E_x, P_x denote expectation and the probability distribution of {X_n}_{n≥0} when X_0 = x w.p. 1.

From the above, one sees that the study of the limit behavior of the distribution of X_n as n → ∞ can be reduced analytically to the study of the n-step transition probabilities. This in turn can be done in terms of the operator P on the Banach space B(S, R) of bounded measurable real valued functions from S to R (with the sup norm), defined by

(Ph)(x) ≡ E_x h(X_1) ≡ ∫ h(y) P(x, dy).   (2.6)

It is easy to verify that P is a positive bounded linear operator on B(S, R) of norm one. The Chapman-Kolmogorov equation (2.4) is equivalent to saying that E_x h(X_n) = (P^n h)(x). Thus, analytically, the study of the limit distribution of {X_n}_{n≥0} can be reduced to that of the sequence {P^n}_{n≥0} of powers of the operator P. However, probabilistic approaches, via the notions of Harris irreducibility and recurrence when applicable, and via the notion of Feller continuity when S is a Polish space, are more fruitful and will be developed below.

14.2.4 Harris irreducibility, recurrence, and minorization

14.2.4.1 Definition of irreducibility

Recall that a Markov chain $\{X_n\}_{n\ge 0}$ with a discrete state space S and transition probability matrix P ≡ ((p_ij)) is irreducible if for every i, j in S, i leads to j, i.e.,
$$P(X_n = j \text{ for some } n \ge 1 \mid X_0 = i) \equiv P_i(T_j < \infty) \equiv f_{ij} > 0,$$
where $T_j = \min\{n : n \ge 1, X_n = j\}$ is the time of the first visit to j. To generalize this to the case of general state spaces, one starts with the notion of the first entrance time or hitting time (also called the first passage time).

Definition 14.2.5: For any A ∈ S, the first entrance time to A is defined as
$$T_A \equiv \begin{cases} \min\{n \ge 1 : X_n \in A\} & \text{if } X_n \in A \text{ for some } n \ge 1,\\[2pt] \infty & \text{otherwise}. \end{cases}$$

Since the event $\{T_A = 1\} = \{X_1 \in A\}$ and, for k ≥ 2, $\{T_A = k\} = \{X_1 \notin A, X_2 \notin A, \ldots, X_{k-1} \notin A, X_k \in A\}$ is an element of $\mathcal F_k = \sigma\langle X_j : j \le k\rangle$, $T_A$ is a stopping time w.r.t. the filtration $\{\mathcal F_n\}_{n\ge 1}$ (cf. Chapter 13).

Definition 14.2.6: Let φ be a nonzero σ-finite measure on (S, S). The Markov chain $\{X_n\}_{n\ge 0}$ (or, equivalently, its transition function P(·, ·)) is said to be φ-irreducible (or irreducible in the sense of Harris with reference measure φ) if for any A in S,
$$\varphi(A) > 0 \ \Rightarrow\ L(x, A) \equiv P_x(T_A < \infty) > 0 \tag{2.7}$$
for all x in S. This says that if a set A in S is considered important by the measure φ (i.e., φ(A) > 0), then A is reached with positive probability by the chain $\{X_n\}_{n\ge 0}$ starting from any x in S. If $G(x, A) \equiv \sum_{n=1}^{\infty} P^n(x, A)$ is the Green's function associated with P, then (2.7) is equivalent to
$$\varphi(A) > 0 \ \Rightarrow\ G(x, A) > 0 \quad \text{for all } x \in S, \tag{2.8}$$
i.e., φ(·) is dominated by G(x, ·) for all x in S.

14.2.4.2 Examples

Example 14.2.7: If S is countable and φ is the counting measure on S, then the irreducibility of a Markov chain $\{X_n\}_{n\ge 0}$ with state space S is the same as φ-irreducibility.

Example 14.2.8: If $\{X_n\}_{n\ge 0}$ are iid with distribution ν, then the chain is ν-irreducible.

Example 14.2.9: It can be verified that the random walk with jump distribution ν (Example 14.2.2) is φ-irreducible with the Lebesgue measure as φ if ν has a nonzero absolutely continuous component with a positive density on some open interval (Problem 14.19 (a)).

Example 14.2.10: The AR(1) chain with η1 having a nontrivial absolutely continuous component can be shown to be φ-irreducible for some φ. On the other hand, the AR(1) chain
$$X_{n+1} = \frac{X_n}{2} + \frac{1}{2}\eta_n, \quad n \ge 0,$$
where $\{\eta_n\}_{n\ge 1}$ are iid Bernoulli(1/2) random variables, is not φ-irreducible for any φ. In general, if $\{X_n\}_{n\ge 0}$ is a Markov chain such that Xn has a discrete distribution for each n and has a limit distribution that is nonatomic, then $\{X_n\}_{n\ge 0}$ cannot be φ-irreducible for any φ (Problem 14.19 (b)).
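The Bernoulli AR(1) chain above is easy to probe numerically. The sketch below (not from the text; assumes NumPy, and the run lengths are arbitrary choices) starts the chain at 0 and checks that each Xn lives on the dyadic rationals k/2^n (a discrete, n-dependent set), even though the empirical distribution looks like Uniform[0, 1], which is nonatomic; this is exactly the situation in which φ-irreducibility fails.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_chain(n, x0=0.0):
    x = x0
    for _ in range(n):
        x = x / 2.0 + rng.integers(0, 2) / 2.0   # eta ~ Bernoulli(1/2)
    return x

n = 20
samples = np.array([run_chain(n) for _ in range(5000)])
# Each X_n is supported on the dyadic rationals k / 2**n (a discrete set) ...
print(np.all(samples * 2**n == np.round(samples * 2**n)))
# ... yet the distribution of X_n is already close to Uniform[0,1] (nonatomic).
print(samples.mean(), samples.var())   # roughly 1/2 and 1/12
```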

The waiting time chain (Example 14.2.6) is irreducible w.r.t. φ ≡ δ0, the delta measure at 0, if P(η1 < 0) > 0 (Problem 14.20).

It can be shown that if $\{X_n\}_{n\ge 0}$ is φ-irreducible for some σ-finite φ, then there exists a probability measure ψ such that $\{X_n\}_{n\ge 0}$ is ψ-irreducible and ψ is maximal, in the sense that if $\{X_n\}_{n\ge 0}$ is φ′-irreducible for some φ′, then φ′ is dominated by ψ. See Nummelin (1984).

14.2.4.3 Harris recurrence

A Markov chain $\{X_n\}_{n\ge 0}$ that is Harris irreducible with reference measure φ is said to be Harris recurrent if
$$A \in \mathcal S,\ \varphi(A) > 0 \ \Rightarrow\ P_x(T_A < \infty) = 1 \ \text{for all } x \text{ in } S. \tag{2.9}$$
Recall that irreducibility requires only that $P_x(T_A < \infty)$ be > 0.

When S is countable and φ is the counting measure, this reduces to the usual notion of irreducibility and recurrence. If S is not countable but has a singleton ∆ such that $P_x(T_\Delta < \infty) = 1$ for all x in S, then the chain $\{X_n\}_{n\ge 0}$ is Harris recurrent with respect to the measure φ(·) ≡ δ∆(·), the delta measure at ∆. The waiting time chain (Example 14.2.6) has such a ∆ in 0 if Eη1 < 0 (Problem 14.20). If such a recurrent singleton ∆ exists, then the sample paths of $\{X_n\}_{n\ge 0}$ can be broken into iid excursions by looking at the chain between consecutive returns to ∆. This in turn allows a complete extension of the basic limit theory from the countable case to this special case. In general, such a singleton may not exist. For example, for the AR(1) sequence with $\{\eta_i\}_{i\ge 1}$ having a continuous distribution, $P_x(X_n = x$ for some $n \ge 1) = 0$ for all x. However, it turns out that for Harris recurrent chains such a singleton can be constructed via the regeneration theorem below, established independently by Athreya and Ney (1978) and Nummelin (1978).

14.2.5 The minorization theorem

A remarkable result of the subject is that when S is countably generated, a Harris recurrent chain can be embedded in a chain that has a recurrent singleton. This is achieved via the minorization theorem and the fundamental regeneration theorem below.

Theorem 14.2.5: (The minorization theorem). Let (S, S) be such that S is countably generated. Let $\{X_n\}_{n\ge 0}$ be a Markov chain with state space (S, S) and transition function P(·, ·) that is Harris irreducible with reference measure φ(·). Then the following minorization hypothesis holds.

(i) (Hypothesis M). For every B0 ∈ S such that φ(B0) > 0, there exist a set A0 ⊂ B0, an integer n0 ≥ 1, a constant 0 < α < 1, and a probability measure ν on (S, S) such that φ(A0) > 0 and, for all x in A0,
$$P^{n_0}(x, A) \ge \alpha\,\nu(A) \quad \text{for all } A \in \mathcal S.$$

(ii) (The C-set lemma). For any set B0 ∈ S such that φ(B0) > 0, there exist a set A0 ⊂ B0, an n0 ≥ 1, and a constant 0 < α′ < 1 such that for all x, y in A0,
$$p^{n_0}(x, y) \ge \alpha',$$
where $p^{n_0}(x, \cdot)$ is the Radon-Nikodym derivative of the absolutely continuous component of $P^{n_0}(x, \cdot)$ w.r.t. φ(·).

The proof of the C-set lemma is a nice application of the martingale convergence theorem (see Orey (1971)). The proof of Theorem 14.2.5 (i) using the C-set lemma is easy and is left as an exercise (Problem 14.21).


14.2.6 The fundamental regeneration theorem

Theorem 14.2.6: Let $\{X_n\}_{n\ge 0}$ be a Markov chain with state space (S, S) and transition function P(·, ·). Suppose there exist a set A0 ∈ S, a constant 0 < α < 1, and a probability measure ν(·) on (S, S) such that for all x in A0,
$$P(x, A) \ge \alpha\,\nu(A) \quad \text{for all } A \in \mathcal S. \tag{2.10}$$
Suppose, in addition, that for all x in S,
$$P_x(T_{A_0} < \infty) = 1. \tag{2.11}$$
Then, for any initial distribution µ, there exists a sequence of random times $\{T_i\}_{i\ge 1}$ such that under $P_\mu$ the excursions
$$\eta_j \equiv \big(\{X_{T_j + r} : 0 \le r < T_{j+1} - T_j\},\ T_{j+1} - T_j\big), \quad j = 1, 2, \ldots,$$
are iid with $X_{T_j} \sim \nu(\cdot)$.

Proof: For x in A0, let
$$Q(x, \cdot) = \frac{P(x, \cdot) - \alpha\,\nu(\cdot)}{1 - \alpha}. \tag{2.12}$$
Then (2.10) implies that for x in A0, Q(x, ·) is a probability measure on (S, S). For each x in A0 and n ≥ 0, let $\eta_{n+1}$, $\delta_{n+1}$ and $Y_{n+1,x}$ be independent random variables such that $P(\eta_{n+1} \in \cdot) = \nu(\cdot)$, $\delta_{n+1}$ is Bernoulli(α), and $P(Y_{n+1,x} \in \cdot) = Q(x, \cdot)$. Then, given Xn = x in A0, $X_{n+1}$ can be chosen to be
$$X_{n+1} = \begin{cases} \eta_{n+1} & \text{if } \delta_{n+1} = 1\\ Y_{n+1,x} & \text{if } \delta_{n+1} = 0 \end{cases} \tag{2.13}$$
to ensure that $X_{n+1}$ has distribution P(x, ·). Indeed, for x in A0,
$$P(\eta_{n+1} \in \cdot,\ \delta_{n+1} = 1) + P(Y_{n+1,x} \in \cdot,\ \delta_{n+1} = 0) = \alpha\,\nu(\cdot) + (1-\alpha)\,Q(x, \cdot) = P(x, \cdot).$$
Thus, each time the chain enters A0, there is a probability α that the position at the next time will have distribution ν(·), independent of x ∈ A0 as well as of all past history, i.e., the chain starts afresh with distribution ν(·). Now if (2.11) also holds, then for any x in S, by the Markov property, the chain enters A0 infinitely often w.p. 1 (Px).

Let τ1 < τ2 < τ3 < · · · denote the times of successive visits to A0. Since $X_{\tau_i} \in A_0$, by the above construction (cf. (2.13)), there is a probability α > 0 that $X_{\tau_i + 1}$ will be distributed as ν(·), independently of $\{X_j : j \le \tau_i\}$ and of τi. By comparison with coin tossing, this implies that for any x, w.p. 1 (Px), there is a finite index i0 such that $\delta_{\tau_{i_0}+1} = 1$ and hence $X_{\tau_{i_0}+1}$ will be distributed as ν(·), independent of all the history of the chain $\{X_n\}_{n\ge 0}$, including that of the $\delta_{\tau_i+1}$ and $Y_{\tau_i+1}$, i ≤ i0, up to the time $\tau_{i_0}$. That is, at $\tau_{i_0}+1$ the chain starts afresh with distribution ν(·). Thus, it follows that for any initial distribution µ there is a random time T such that $X_T$ is distributed as ν(·) and is independent of all history up to T − 1. More precisely, for any µ, $P_\mu(T < \infty) = 1$ and
$$P_\mu(X_j \in A_j,\ j = 0, 1, \ldots, n+k,\ T = n+1) = P_\mu(X_j \in A_j,\ j = 0, 1, \ldots, n,\ T = n+1) \times P_\nu(X_0 \in A_{n+1}, X_1 \in A_{n+2}, \ldots, X_{k-1} \in A_{n+k}).$$
Since this is true for any µ, it is true for µ = ν, and hence the theorem follows.
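The split-chain construction (2.12)-(2.13) is concrete enough to simulate. The sketch below (not from the text; assumes NumPy) uses a chain where the minorization can be checked by hand: $X_{n+1} = \rho X_n + \eta_{n+1}$ with ρ = 1/2 and η Uniform(0, 1), for which, for x in A0 = [0, 1], the one-step density is 1 on [ρx, ρx + 1] and hence dominates (1/2) times the Uniform(0.5, 1) density. All numerical choices, and the helper names sample_nu, sample_Q, step, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, alpha = 0.5, 0.5          # X_{n+1} = rho*X_n + eta, eta ~ Uniform(0,1)

def sample_nu(rng):
    # Minorizing measure nu = Uniform(0.5, 1): P(x, .) >= alpha * nu(.) for x in [0, 1].
    return rng.uniform(0.5, 1.0)

def sample_Q(x, rng):
    # Residual kernel (2.12): for x in [0,1], density 2 on [rho*x, 0.5] and on [1, rho*x + 1],
    # i.e. uniform on the left piece with prob (1 - x) and on the right piece with prob x.
    if rng.uniform() < 1.0 - x:
        return rng.uniform(rho * x, 0.5)
    return rng.uniform(1.0, rho * x + 1.0)

def step(x, rng):
    # One step of the split chain (2.13); returns (new state, regeneration flag).
    if 0.0 <= x <= 1.0:                       # x in A0: use the split representation
        if rng.uniform() < alpha:             # delta = 1: start afresh from nu
            return sample_nu(rng), True
        return sample_Q(x, rng), False        # delta = 0: draw from Q(x, .)
    return rho * x + rng.uniform(), False     # outside A0: the ordinary kernel

x, regen_times = 0.3, []
for n in range(1, 20001):
    x, regen = step(x, rng)
    if regen:
        regen_times.append(n)

gaps = np.diff(regen_times)
print(len(regen_times), gaps.mean())   # many iid excursions; gaps have a finite mean
```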

A consequence of the above theorem is the following.

Theorem 14.2.7: Suppose that in Theorem 14.2.6, instead of (2.10) and (2.11), the following holds: there exists an n0 ≥ 1 such that for all x in A0 and A in S,
$$P^{n_0}(x, A) \ge \alpha\,\nu(A) \tag{2.14}$$
and, for all x in A0,
$$P_x(X_{n n_0} \in A_0 \text{ for some } n \ge 1) = 1, \tag{2.15}$$
and
$$P_x(T_{A_0} < \infty) = 1 \quad \text{for all } x \text{ in } S. \tag{2.16}$$
Let $Y_n \equiv X_{n n_0}$, n ≥ 0 (where $n n_0$ stands for the product of n and n0). Then, for any initial distribution µ, there exist random times $\{T_i\}_{i\ge 1}$ such that under $P_\mu$ the excursions
$$\eta_j \equiv \big(\{Y_{T_j + r} : 0 \le r < T_{j+1} - T_j\},\ T_{j+1} - T_j\big), \quad j = 1, 2, \ldots,$$
are iid with $Y_{T_j} \sim \nu(\cdot)$.

Proof: For any initial distribution µ such that µ(A0) = 1, the theorem follows from Theorem 14.2.6, since (2.14) and (2.15) are the same as (2.10) and (2.11) for the transition function $P^{n_0}(\cdot,\cdot)$ and the chain $\{Y_n\}_{n\ge 0}$. By (2.16), for any other µ, $P_\mu(T_{A_0} < \infty) = 1$.

Given a realization of the Markov chain $\{Y_n \equiv X_{n n_0}\}_{n\ge 0}$, it is possible to construct a realization of the full Markov chain $\{X_n\}_{n\ge 0}$ by "filling the gaps" $\{X_j : k n_0 + 1 \le j \le (k+1)n_0 - 1\}$ as follows: given $X_{k n_0} = x$, $X_{(k+1)n_0} = y$, generate an observation from the conditional distribution of $(X_1, X_2, \ldots, X_{n_0 - 1})$ given $X_0 = x$, $X_{n_0} = y$. This leads to the following.

Theorem 14.2.8: Under the setup of Theorem 14.2.7, the "excursions"
$$\eta_j \equiv \big(\{X_{n_0 T_j + k} : 0 \le k < n_0(T_{j+1} - T_j)\},\ T_{j+1} - T_j\big), \quad j \ge 1,$$
are identically distributed and are one-dependent, i.e., for each r ≥ 1 the collections $\{\eta_1, \eta_2, \ldots, \eta_r\}$ and $\{\eta_{r+2}, \eta_{r+3}, \ldots\}$ are independent.


Proof: Note that applying the regeneration method to the sequence $\{Y_n\}_{n\ge 0}$ and then "filling the gaps" leads to the common portion
$$\{X_{(T_j - 1)n_0 + r} : 0 \le r \le n_0\},$$
conditioned on the given values $X_{(T_j - 1)n_0}$ and $X_{T_j n_0}$. This makes two successive excursions $\eta_{j-1}$ and $\eta_j$ dependent. But the Markov property renders $\eta_j$ and $\eta_{j+2}$ independent. This yields the one-dependence of $\{\eta_j\}_{j\ge 1}$.

By the C-set lemma and the minorization Theorem 14.2.5, φ-recurrence yields the hypothesis of Theorem 14.2.7.

Theorem 14.2.9: Let $\{X_n\}_{n\ge 0}$ be a φ-recurrent Markov chain with state space (S, S), where S is countably generated. Then there exist an A0 in S, an n0 ≥ 1, a constant 0 < α < 1, and a probability measure ν such that (2.14), (2.15), and (2.16) hold and hence the conclusions of Theorem 14.2.8 hold.

Thus, φ-recurrence implies that the Markov chain $\{X_n\}_{n\ge 0}$ is regenerative (defined fully below). This makes the law of large numbers for iid random variables available to such chains. The limit theory of regenerative sequences developed in Section 8.5 is reviewed below and, by the above results, such a theory holds for φ-recurrent chains.

14.2.7 Limit theory for regenerative sequences

Definition 14.2.7: Let (Ω, F, P) be a probability space and (S, S) a measurable space. A sequence of random variables $\{X_n\}_{n\ge 0}$ defined on (Ω, F, P) with values in (S, S) is called regenerative if there exists a sequence of random times 0 < T1 < T2 < T3 < · · · such that the excursions $\eta_j \equiv \big(\{X_n : T_j \le n < T_{j+1}\},\ T_{j+1} - T_j\big)$, j ≥ 1, are iid, i.e.,
$$P\big(T_{j+1} - T_j = k_j,\ X_{T_j + \ell} \in A_{\ell, j},\ 0 \le \ell < k_j,\ j = 1, 2, \ldots, r\big) = \prod_{j=1}^{r} P\big(T_2 - T_1 = k_j,\ X_{T_1 + \ell} \in A_{\ell, j},\ 0 \le \ell < k_j\big) \tag{2.17}$$
for all $k_1, k_2, \ldots, k_r \in \mathbb N$ and $A_{\ell, j} \in \mathcal S$, $0 \le \ell < k_j$, $j = 1, \ldots, r$.

Example 14.2.11: Any Markov chain $\{X_n\}_{n\ge 0}$ with a countable state space S that is irreducible and recurrent is regenerative, with $\{T_i\}_{i\ge 1}$ being the times of successive returns to a given state ∆.

Example 14.2.12: Any Harris recurrent chain satisfying the minorization condition (2.10) is regenerative, by Theorem 14.2.6.

Example 14.2.13: The waiting time chain (Example 14.2.6) with Eη1 < 0 is regenerative, with $\{T_i\}_{i\ge 1}$ being the times of successive returns of $\{X_n\}_{n\ge 0}$ to zero.


Example 14.2.14: An example of a non-Markov sequence $\{X_n\}_{n\ge 0}$ that is regenerative is a semi-Markov chain. Let $\{y_n\}_{n\ge 0}$ be a Harris recurrent Markov chain satisfying (2.10). Given $\{y_n = a_n\}_{n\ge 0}$, let $\{L_n\}_{n\ge 0}$ be independent positive integer valued random variables. Set
$$X_j = \begin{cases} y_0 & 0 \le j < L_0\\ y_1 & L_0 \le j < L_0 + L_1\\ y_2 & L_0 + L_1 \le j < L_0 + L_1 + L_2\\ \vdots & \end{cases}$$
Then $\{X_n\}_{n\ge 0}$ is called a semi-Markov chain with embedded Markov chain $\{y_n\}_{n\ge 0}$ and sojourn times $\{L_n\}_{n\ge 0}$. It is regenerative if $\{T_i\}_{i\ge 1}$ are defined by $T_i = \sum_{j=0}^{N_i - 1} L_j$, where $\{N_i\}_{i\ge 1}$ are the successive regeneration times for $\{y_n\}$ as in Theorem 14.2.7.

Theorem 14.2.10: Let $\{X_n\}_{n\ge 0}$ be a regenerative sequence with regeneration times $\{T_i\}_{i\ge 1}$. Let $\tilde\pi(A) \equiv E\big(\sum_{j=T_1}^{T_2 - 1} I_A(X_j)\big)$ for A ∈ S. Suppose $\tilde\pi(S) \equiv E(T_2 - T_1) < \infty$. Let $\pi(\cdot) = \tilde\pi(\cdot)/\tilde\pi(S)$. Then

(i) $\displaystyle \frac{1}{n}\sum_{j=0}^{n} f(X_j) \to \int f\,d\pi$ w.p. 1 for any f ∈ L1(S, S, π).

(ii) $\displaystyle \mu_n(\cdot) \equiv \frac{1}{n}\sum_{j=0}^{n} P(X_j \in \cdot) \to \pi(\cdot)$ in total variation.

(iii) If the distribution of T2 − T1 is aperiodic, then P(Xn ∈ ·) → π(·) in total variation.

Proof: To prove (i) it suffices to consider nonnegative f. For each n, let Nn = k if Tk ≤ n < Tk+1. Let
$$Y_i = \sum_{j=T_i}^{T_{i+1}-1} f(X_j),\ i \ge 1, \qquad Y_0 = \sum_{j=0}^{T_1 - 1} f(X_j).$$
Then
$$Y_0 + \sum_{i=1}^{N_n - 1} Y_i \ \le\ \sum_{i=0}^{n} f(X_i) \ \le\ Y_0 + \sum_{i=1}^{N_n} Y_i. \tag{2.18}$$

By the SLLN, $\frac{1}{N_n}\sum_{i=1}^{N_n - 1} Y_i$ and $\frac{1}{N_n}\sum_{i=1}^{N_n} Y_i$ converge to $EY_1$ w.p. 1, and $\frac{N_n}{n} \to \big(E(T_2 - T_1)\big)^{-1}$. It follows from (2.18) that
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n} f(X_i) = \frac{EY_1}{E(T_2 - T_1)} = \frac{\int f\,d\tilde\pi}{\tilde\pi(S)} = \int f\,d\pi,$$


establishing (i).

By taking f = I_A and using the BCT, one concludes from (i) that $\mu_n(A) \to \pi(A)$ for every A in S. Since µn and π are probability measures, this implies that µn → π in total variation, proving (ii).

To prove (iii), note that for any bounded measurable f, $a_n \equiv E f(X_{T_1 + n})$ satisfies
$$a_n = E\big(f(X_{T_1+n}) I(T_2 - T_1 > n)\big) + \sum_{r=1}^{n} E\big(f(X_{T_1+n}) I(T_2 - T_1 = r)\big) = b_n + \sum_{r=1}^{n} E\big(f(X_{T_2 + n - r})\big)P(T_2 - T_1 = r) = b_n + \sum_{r=1}^{n} a_{n-r}\,p_r,$$
where $p_r = P(T_2 - T_1 = r)$. Now (iii) follows from the discrete renewal theorems of Section 8.5 (which apply since $E(T_2 - T_1) < \infty$ and T2 − T1 has an aperiodic distribution).
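As a quick numerical check of Theorem 14.2.10 (not part of the text; assumes NumPy), one can take a countable-state chain whose regeneration times are the returns to a fixed state, as in Example 14.2.11, and compare the excursion-ratio formula for π(A) with the long-run occupation frequency. The reflecting random walk and the set A below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.4   # up-probability of a reflecting walk on {0,1,2,...}; down-probability 0.6

def step(x, rng):
    if x == 0:
        return 1 if rng.uniform() < p else 0
    return x + 1 if rng.uniform() < p else x - 1

A = {0, 1}                        # estimate pi(A) for this set
x, n_steps = 0, 200000            # start at the regeneration state 0
visits_per_excursion, excursion_lengths = [], []
count, length, freq_A = 0, 0, 0
for _ in range(n_steps):
    count += x in A
    freq_A += x in A
    length += 1
    x = step(x, rng)
    if x == 0:                    # an excursion between successive visits to 0 ends
        visits_per_excursion.append(count)
        excursion_lengths.append(length)
        count, length = 0, 0

# Theorem 14.2.10: pi(A) = E(sum of I_A over one excursion) / E(excursion length),
# and the long-run frequency of A converges to the same limit.
print(np.mean(visits_per_excursion) / np.mean(excursion_lengths), freq_A / n_steps)
```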

Remark 14.2.1: Since the strong law is valid for any m-dependent (m < ∞) and stationary sequence of random variables, Theorem 14.2.10 remains valid even if the excursions $\{\eta_j\}_{j\ge 1}$ are m-dependent.

14.2.8 Limit theory of Harris recurrent Markov chains

The minorization theorem, the fundamental regeneration theorem, and the limit theorem for regenerative sequences, i.e., Theorems 14.2.5, 14.2.6, 14.2.7, and 14.2.10, are the essential components of a limit theory for Harris recurrent Markov chains that parallels the limit theory for discrete state space irreducible recurrent Markov chains.

Definition 14.2.8: A probability measure π on (S, S) is called stationary for a transition function P(·, ·) if
$$\pi(A) = \int_S P(x, A)\,\pi(dx) \quad \text{for all } A \in \mathcal S.$$
Note that if X0 ∼ π, then Xn ∼ π for all n ≥ 1, justifying the term "stationary."

Theorem 14.2.11: Let $\{X_n\}_{n\ge 0}$ be a Harris recurrent Markov chain with state space (S, S) and transition function P(·, ·). Let S be countably generated. Suppose there exists a stationary probability measure π. Then,

(i) π is unique.


(ii) (The law of large numbers). For all f ∈ L1(S, S, π) and all x ∈ S,
$$\frac{1}{n}\sum_{j=0}^{n-1} f(X_j) \to \int f\,d\pi \quad \text{w.p. 1 } (P_x).$$

(iii) (Convergence of n-step probabilities). For all x ∈ S,
$$\mu_{n,x}(\cdot) \equiv \frac{1}{n}\sum_{j=0}^{n-1} P_x(X_j \in \cdot) \to \pi(\cdot) \quad \text{in total variation.}$$

Proof: By Harris recurrence and the minorization Theorem 14.2.5, there exist a set A0 ∈ S, a constant 0 < α < 1, an integer n0 ≥ 1, and a probability measure ν such that
$$\text{for all } x \in A_0,\ A \in \mathcal S, \quad P^{n_0}(x, A) \ge \alpha\,\nu(A) \tag{2.19}$$
and
$$\text{for all } x \text{ in } S, \quad P_x(T_{A_0} < \infty) = 1. \tag{2.20}$$
For simplicity of exposition, assume that n0 = 1. (The general case n0 > 1 can be reduced to this by considering the transition function $P^{n_0}$.)

Let the sequence $\{\eta_n, \delta_n, Y_{n,x}\}_{n\ge 1}$ and the regeneration times $\{T_i\}_{i\ge 1}$ be as in Theorem 14.2.6. Recall that the first regeneration time T1 can be defined as
$$T_1 = \min\{n : n > 0,\ X_{n-1} \in A_0,\ \delta_n = 1\} \tag{2.21}$$
and the succeeding ones by
$$T_{i+1} = \min\{n : n > T_i,\ X_{n-1} \in A_0,\ \delta_n = 1\}, \tag{2.22}$$

and that the $X_{T_i}$ are distributed as ν, independently of the past. For n ≥ 1, let Nn = k if Tk ≤ n < Tk+1. By Harris recurrence, for all x in S, Nn → ∞ w.p. 1 (Px) and, by the SLLN, for all x in S,
$$\frac{N_n}{n} \to \frac{1}{E_\nu T_1} \quad \text{w.p. 1 } (P_x),$$
and hence, by the BCT,
$$\frac{E_x N_n}{n} \to \frac{1}{E_\nu T_1}.$$

On the other hand, for any k ≥ 1 and x ∈ S,
$$P_x(\text{a regeneration occurs at } k) = P_x(X_{k-1} \in A_0,\ \delta_k = 1) = P_x(X_{k-1} \in A_0)\,\alpha.$$
Thus
$$\frac{E_x N_n}{n} = \frac{1}{n}\sum_{k=1}^{n} P_x(X_{k-1} \in A_0)\,\alpha,$$
and hence, for all x in S,
$$\frac{1}{n}\sum_{j=0}^{n-1} P_x(X_j \in A_0) \to \frac{1}{\alpha\,E_\nu T_1}.$$

Now let π be a stationary measure for P(·, ·). Then
$$\pi(A_0) = \int P_x(X_j \in A_0)\,\pi(dx) \quad \text{for all } j = 0, 1, 2, \ldots,$$
and hence
$$n\,\pi(A_0) = \int \sum_{j=0}^{n-1} P_x(X_j \in A_0)\,\pi(dx). \tag{2.23}$$

Since $G(x, A_0) \equiv \sum_{j=0}^{\infty} P_x(X_j \in A_0) > 0$ for all x in S by Harris recurrence (Harris irreducibility will do for this), it follows that π(A0) > 0. Dividing both sides of (2.23) by n and letting n → ∞ yields
$$\pi(A_0) = \frac{1}{\alpha\,E_\nu T_1},$$
and hence $E_\nu T_1 < \infty$. Since $E_\nu T_1 \equiv E(T_2 - T_1) < \infty$, by Theorem 14.2.10, for all x in S and A ∈ S,
$$\frac{1}{n}\sum_{j=0}^{n-1} P_x(X_j \in A) \to \frac{E_\nu\big(\sum_{j=0}^{T_1 - 1} I_A(X_j)\big)}{E_\nu(T_1)}.$$

Integrating the left side with respect to π yields that for any A ∈ S,
$$\pi(A) = \frac{E_\nu\big(\sum_{j=0}^{T_1 - 1} I_A(X_j)\big)}{E_\nu(T_1)},$$
thus establishing the uniqueness of π, i.e., part (i) of Theorem 14.2.11. The other two parts follow from the regeneration Theorem 14.2.6 and the limit Theorem 14.2.10.

Remark 14.2.2: Under the assumption n0 = 1 that was made at the beginning of the proof, it also follows that
$$P_x(X_j \in \cdot) \to \pi(\cdot) \tag{2.24}$$
in total variation. This also holds if the g.c.d. of the n0's for which there exist A0, α, ν satisfying (2.19) is one.

Remark 14.2.3: A necessary and sufficient condition for the existence of a stationary distribution for a Harris recurrent chain is that there exist a set A0 and α, ν, n0 satisfying (2.19) and (2.20) and
$$E_\nu T_{A_0} < \infty.$$


A more general result than Theorem 14.2.11 is the following, which was motivated by applications to Markov chain Monte Carlo methods.

Theorem 14.2.12: Let $\{X_n\}_{n\ge 0}$ be a Markov chain with state space (S, S) and transition function P(·, ·). Suppose (2.19) holds for some (A0, α, ν, n0). Suppose π is a stationary probability measure for P(·, ·) such that
$$\pi\big(\{x : P_x(T_{A_0} < \infty) > 0\}\big) = 1. \tag{2.25}$$

Then, for π-almost all x,

(i) $P_x(T_{A_0} < \infty) = 1$.

(ii) $\displaystyle \mu_{n,x}(\cdot) = \frac{1}{n}\sum_{j=0}^{n-1} P_x(X_j \in \cdot) \to \pi(\cdot)$ in total variation.

(iii) For any f ∈ L1(S, S, π),
$$\frac{1}{n}\sum_{j=0}^{n} f(X_j) \to \int f\,d\pi \quad \text{w.p. 1 } (P_x).$$

(iv) $\displaystyle \frac{1}{n}\sum_{j=0}^{n} E_x f(X_j) \to \int f\,d\pi.$

If, further, the g.c.d. of $\{m : \text{there exists } \alpha_m > 0 \text{ such that for all } x \text{ in } A_0,\ P^m(x, \cdot) \ge \alpha_m \nu(\cdot)\}$ is 1, then (ii) can be strengthened to
$$P_x(X_n \in \cdot) \to \pi(\cdot) \quad \text{in total variation.}$$

The key difference between Theorems 14.2.11 and 14.2.12 is that the latter does not require Harris recurrence, which is often difficult to verify. On the other hand, the conclusions of Theorem 14.2.12 are valid only for π-almost all x, unlike for all x in Theorem 14.2.11. In MCMC applications, the existence of a stationary measure is given (it is the 'target distribution'), and the minorization condition is easier to verify, as is the milder form of the irreducibility condition (2.25). (Harris irreducibility would require $P_x(T_{A_0} < \infty) > 0$ for all x in S.) For a proof of Theorem 14.2.12 and applications to MCMC, see Athreya, Doss and Sethuraman (1996).

Example 14.2.15: (AR(1) time series) (Example 14.2.4). Suppose η1 has an absolutely continuous component and that |ρ| < 1. Then the chain is Harris recurrent, admits as its stationary probability distribution π(·) the distribution of $\sum_{j=0}^{\infty} \rho^j \eta_j$, and $P_x(X_n \in \cdot) \to \pi(\cdot)$ in total variation for any x.

Example 14.2.16: (Waiting time chain) (Example 14.2.6). If Eη1 < 0, the state 0 is recurrent and hence the Markov chain is Harris recurrent. Also, a stationary distribution π exists. It is known that π is the same as the distribution of $M_\infty \equiv \sup_{j\ge 0} S_j$, where $S_0 = 0$, $S_j = \sum_{i=1}^{j} \eta_i$, j ≥ 1, with $\{\eta_i\}_{i\ge 1}$ iid.


14.2.9 Markov chains on metric spaces

14.2.9.1 Feller continuity

Let (S, d) be a metric space and S be the Borel σ-algebra on S. Let P(·, ·) be a transition function and let $\{X_n\}_{n\ge 0}$ be a Markov chain with state space (S, S) and transition function P(·, ·).

Definition 14.2.9: The transition function P(·, ·) is called Feller continuous (or simply Feller) if $x_n \to x$ in S implies $P(x_n, \cdot) \longrightarrow^d P(x, \cdot)$, i.e.,
$$(Pf)(x_n) \equiv \int f(y)\,P(x_n, dy) \ \to\ (Pf)(x) \equiv \int f(y)\,P(x, dy)$$
for all bounded continuous f : S → R. In terms of the Markov chain, this says that
$$E\big(f(X_1) \mid X_0 = x_n\big) \to E\big(f(X_1) \mid X_0 = x\big) \quad \text{if } x_n \to x.$$

Example 14.2.17: Let (Ω, F, P) be a probability space and h : S × Ω → S be jointly measurable with h(·, ω) continuous w.p. 1. Let P(x, A) ≡ P(h(x, ω) ∈ A) for x ∈ S, A ∈ S. Then P(·, ·) is a Feller continuous transition function. Indeed, for any bounded continuous f : S → R,
$$(Pf)(x) \equiv \int f(y)\,P(x, dy) = E f\big(h(x, \omega)\big).$$

Now, $x_n \to x$

⇒ $h(x_n, \omega) \to h(x, \omega)$ w.p. 1

⇒ $f\big(h(x_n, \omega)\big) \to f\big(h(x, \omega)\big)$ w.p. 1 (by continuity of f)

⇒ $E f\big(h(x_n, \omega)\big) \to E f\big(h(x, \omega)\big)$ (by the bounded convergence theorem).

The first five examples of Section 14.2.4 fall into this category. If h is discontinuous, then P(·, ·) need not be Feller (Problem 14.22). That P(·, ·) is a transition function requires only that h(·, ·) be jointly measurable (Problem 14.23).

14.2.9.2 Stationary measures

A general method of finding a stationary measure for a Feller transition function P(·, ·) is to consider weak or vague limits of the occupation measures
$$\mu_{n,\lambda}(A) \equiv \frac{1}{n}\sum_{j=0}^{n-1} P_\lambda(X_j \in A),$$
where λ is the initial distribution.

Theorem 14.2.13: Fix an initial distribution λ. Suppose a probability measure µ is a weak limit point of $\{\mu_{n,\lambda}\}_{n\ge 1}$, that is, for some $n_1 < n_2 < n_3 < \cdots$, $\mu_{n_k,\lambda} \longrightarrow^d \mu$. Assume P(·, ·) is Feller. Then µ is a stationary probability measure for P(·, ·).

Proof: Let f : S → R be continuous and bounded. Then
$$\int f(y)\,\mu_{n_k,\lambda}(dy) \to \int f(y)\,\mu(dy).$$


But the left side equals
$$\frac{1}{n_k}\sum_{j=0}^{n_k - 1} E_\lambda f(X_j) = \frac{1}{n_k}E_\lambda f(X_0) + \frac{1}{n_k}\sum_{j=1}^{n_k - 1} E_\lambda f(X_j) = \frac{1}{n_k}E_\lambda f(X_0) + \frac{1}{n_k}\sum_{j=1}^{n_k - 1} E_\lambda (Pf)(X_{j-1})$$
(since, by the Markov property, for j ≥ 1, $E_\lambda f(X_j) = E_\lambda (Pf)(X_{j-1})$)
$$= \frac{1}{n_k}E_\lambda f(X_0) + \frac{1}{n_k}\sum_{j=0}^{n_k - 1} E_\lambda (Pf)(X_j) - \frac{1}{n_k}E_\lambda (Pf)(X_{n_k - 1}).$$

The first and third terms on the right side go to zero, since f is bounded and nk → ∞. The second term goes to $\int (Pf)(y)\,\mu(dy)$, since by the Feller property Pf is a bounded continuous function. Thus,
$$\int_S f(y)\,\mu(dy) = \int_S (Pf)(y)\,\mu(dy) = \int_S \Big(\int_S f(z)\,P(y, dz)\Big)\mu(dy) = \int_S f(z)\,(\mu P)(dz),$$
where
$$(\mu P)(A) \equiv \int_S P(y, A)\,\mu(dy).$$
This being true for all bounded continuous f, it follows that µ = µP, i.e., µ is stationary for P(·, ·).

A more general result is the following.

Theorem 14.2.14: Let λ be an initial distribution. Let µ be a subprobability measure (i.e., µ(S) ≤ 1) such that for some $n_1 < n_2 < n_3 < \cdots$, $\mu_{n_k,\lambda}$ converges vaguely to µ, i.e., $\int f\,d\mu_{n_k,\lambda} \to \int f\,d\mu$ for all f : S → R continuous with compact support. Suppose there exists an approximate identity $\{g_n\}_{n\ge 1}$ for S, i.e., for all n, gn is a continuous function from S to [0, 1] with compact support and, for every x in S, gn(x) ↑ 1 as n → ∞. Then µ is stationary for P(·, ·), i.e.,
$$\mu(A) = \int_S P(x, A)\,\mu(dx) \quad \text{for all } A \in \mathcal S.$$


For a proof, see Athreya (2004).

If S = R^k for some k < ∞, then S admits an approximate identity. Conditions ensuring that there is a vague limit point µ with µ(S) > 0 are provided by the following.

Theorem 14.2.15: Suppose there exist a set A0 ∈ S, a function V : S → [0, ∞), and numbers 0 < α, M < ∞ such that
$$(PV)(x) \equiv E_x V(X_1) \le V(x) - \alpha \quad \text{for } x \notin A_0 \tag{2.26}$$
and
$$\sup_{x \in A_0}\big(E_x V(X_1) - V(x)\big) \equiv M < \infty. \tag{2.27}$$
Then, for any initial distribution λ,
$$\liminf_{n\to\infty} \mu_{n,\lambda}(A_0) \ge \frac{\alpha}{\alpha + M}. \tag{2.28}$$

Proof: For j ≥ 1,
$$E_\lambda V(X_j) - E_\lambda V(X_{j-1}) = E_\lambda\big((PV)(X_{j-1}) - V(X_{j-1})\big) = E_\lambda\big(\big((PV)(X_{j-1}) - V(X_{j-1})\big)I_{A_0}(X_{j-1})\big) + E_\lambda\big(\big((PV)(X_{j-1}) - V(X_{j-1})\big)I_{A_0^c}(X_{j-1})\big) \le M\,P_\lambda(X_{j-1} \in A_0) - \alpha\,P_\lambda(X_{j-1} \notin A_0).$$
Adding over j = 1, 2, . . . , n yields
$$\frac{1}{n}\big(E_\lambda V(X_n) - E_\lambda V(X_0)\big) \le -\alpha + (\alpha + M)\,\mu_{n,\lambda}(A_0).$$
Since V(·) ≥ 0, letting n → ∞ yields (2.28).
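The drift condition (2.26)-(2.27) and the occupation bound (2.28) are easy to check numerically for a simple chain. The sketch below (not from the text; assumes NumPy) uses a reflecting random walk on {0, 1, 2, ...} with V(x) = x and A0 = {0}, for which the drift constants can be computed by hand; all numerical choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 0.4, 0.6                    # up / down probabilities of a reflecting walk

# Drift (Lyapunov) function V(x) = x with A0 = {0}:
#   x >= 1: E_x V(X_1) - V(x) = p - q = -0.2   -> (2.26) holds with alpha = 0.2
#   x  = 0: E_0 V(X_1) - V(0) = p    =  0.4    -> (2.27) holds with M = 0.4
alpha, M = q - p, p
bound = alpha / (alpha + M)        # (2.28): the occupation measure of A0 is >= 1/3 here

x, n, time_at_0 = 0, 100000, 0
for _ in range(n):
    time_at_0 += (x == 0)
    if x == 0:
        x = 1 if rng.uniform() < p else 0
    else:
        x = x + 1 if rng.uniform() < p else x - 1

print(time_at_0 / n, ">=", bound)  # empirical occupation of A0 versus the bound
```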

Definition 14.2.10: A metric space (S, d) has the vague compactness property if, given any collection $\{\mu_\alpha : \alpha \in I\}$ of subprobability measures, there is a sequence $\{\alpha_j\} \subset I$ such that $\mu_{\alpha_j}$ converges vaguely to a subprobability measure µ. It is known, by Helly's selection theorem, that all Euclidean spaces have this property. It is also known that any Polish space, i.e., a complete separable metric space, has this property (see Billingsley (1968)).

Combining the above two results yields the following:

Theorem 14.2.16: Let P(·, ·) be a Feller transition function on a metric space (S, d) that admits an approximate identity and has the vague compactness property. Suppose there exist a closed set A0, a function V : S → [0, ∞), and numbers 0 < α, M < ∞ such that (2.26) and (2.27) hold. Then there exists a stationary probability measure for P(·, ·).

Proof: Fix an initial distribution λ. Then the family $\{\mu_{n,\lambda}\}_{n\ge 1}$ has a subsequence $\{\mu_{n_k,\lambda}\}_{k\ge 1}$ and a subprobability measure µ such that $\mu_{n_k,\lambda} \to \mu$ vaguely. Since A0 is closed, this implies
$$\limsup_{k\to\infty} \mu_{n_k,\lambda}(A_0) \le \mu(A_0).$$
Thus $\mu(A_0) \ge \limsup_k \mu_{n_k,\lambda}(A_0) \ge \liminf_k \mu_{n_k,\lambda}(A_0) \ge \frac{\alpha}{\alpha + M}$, by Theorem 14.2.15. This yields µ(S) > 0. By Theorem 14.2.14, µ is stationary for P. So $\bar\mu(\cdot) \equiv \mu(\cdot)/\mu(S)$ is a probability measure that is stationary for P.

Example 14.2.18: Consider a Markov chain generated by the iteration of iid random logistic maps
$$X_{n+1} = C_{n+1} X_n (1 - X_n), \quad n \ge 0,$$
with S = [0, 1], $\{C_n\}_{n\ge 1}$ iid with values in [0, 4], and X0 independent of $\{C_n\}_{n\ge 1}$. Assume $E \log C_1 > 0$ and $E|\log(4 - C_1)| < \infty$. Then there exists a stationary probability measure π such that π((0, 1)) = 1. This follows from Theorem 14.2.16 by showing that if V(x) = |log x|, then there exist A0 = [a, b] ⊂ (0, 1) and constants 0 < α, M < ∞ such that (2.26) and (2.27) hold. For details, see Athreya (2004).
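A quick simulation (not part of the text; assumes NumPy) makes the conclusion of Example 14.2.18 plausible: with C uniform on [1.5, 4], an illustrative choice for which E log C1 > 0 and E|log(4 − C1)| < ∞, the occupation measure of the chain concentrates inside (0, 1) and the path never collapses to the endpoints.

```python
import numpy as np

rng = np.random.default_rng(5)

# Random logistic maps X_{n+1} = C_{n+1} X_n (1 - X_n) with C_n ~ Uniform(1.5, 4).
n_burn, n_keep = 1000, 100000
x = 0.3
for _ in range(n_burn):
    x = rng.uniform(1.5, 4.0) * x * (1.0 - x)

samples = np.empty(n_keep)
for i in range(n_keep):
    x = rng.uniform(1.5, 4.0) * x * (1.0 - x)
    samples[i] = x

# The occupation measure approximates a stationary pi with pi((0,1)) = 1:
print(samples.min() > 0.0, samples.max() < 1.0, samples.mean())
hist, edges = np.histogram(samples, bins=10, range=(0.0, 1.0), density=True)
print(np.round(hist, 2))      # a rough picture of the stationary density
```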

14.2.9.3 Convergence questions

If $\{X_n\}_{n\ge 0}$ is a Feller Markov chain (i.e., its transition function P(·, ·) is Feller continuous), what can one say about the convergence of the distribution of Xn as n → ∞?

If P(·, ·) admits a unique stationary probability measure π and the family $\{\mu_{n,\lambda} : n \ge 1\}$ is tight for a given λ, then one can conclude from Theorem 14.2.13 that every weak limit point of this family has to be π; hence π is the only weak limit point and, for this λ, $\mu_{n,\lambda} \longrightarrow^d \pi$. To go from this to the convergence of $P_\lambda(X_n \in \cdot)$ to π(·), one needs extra conditions to rule out periodic behavior.

Since the occupation measure $\mu_{n,\lambda}(A)$ is the mean of the empirical measure
$$L_n(A) \equiv \frac{1}{n}\sum_{j=0}^{n-1} I_A(X_j),$$
a natural question is what one can say about the convergence of the empirical measure. This is important for the statistical estimation of π.

When the chain is Harris recurrent, it was shown in the previous section that for each x and each A ∈ S,
$$L_n(A) \to \pi(A) \quad \text{w.p. 1 } (P_x).$$


For a Feller chain admitting a unique stationary measure π, one can appeal to the ergodic theorem to conclude that for each A in S, Ln(A) → π(A) w.p. 1 (Px) for π-almost all x in S. Further, if S is Polish, then one can show that for π-almost all x, Ln(·) →d π(·) w.p. 1 (Px).

14.3 Markov chain Monte Carlo (MCMC)

14.3.1 Introduction

Let π be a probability measure on a measurable space (S, S). Let h(·) : S → R be S-measurable with ∫|h| dπ < ∞, and let λ = ∫ h dπ. The effort required to compute λ depends on the complexity of the function h(·) as well as that of the measure π. Clearly, a first approach is to go back to the definition of ∫ h dπ and use numerical approximation, such as approximating h(·) by a sequence of simple functions and evaluating π(·) on the sets involved in these simple functions. However, in many situations this may not be feasible, especially if the measure π(·) is specified only up to a constant that is not easy to evaluate. Such is often the case in Bayesian statistics, where π is the posterior distribution $\pi_{\theta|X}$ of the parameter θ given the data X, whose density is proportional to f(X|θ)ν(dθ), f(X|θ) being the density of X given θ and ν(dθ) the prior distribution of θ. In such situations, objects of interest are the posterior mean, variance, and other moments, as well as the posterior probability of θ being in some set of interest. In these problems, the main difficulty lies in the evaluation of the normalizing constant C(X) = ∫ f(X|θ)ν(dθ). However, it may be possible to generate a sequence of random variables $\{X_n\}_{n\ge 1}$ such that the distribution of Xn gets close to π in a suitable sense and a law of large numbers
$$\frac{1}{n}\sum_{i=1}^{n} h(X_i) \to \lambda = \int h\,d\pi$$
holds for a large class of h with ∫|h| dπ < ∞.

A method that has become very useful in Bayesian statistics in the past twenty years or so (with the advent of high speed computing) is that of generating a Markov chain $\{X_n\}_{n\ge 1}$ with stationary distribution π. This method has its origins in the important paper of Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953). For the adaptation of this method to image processing problems, see Geman and Geman (1984).

This method is now known as Markov chain Monte Carlo, or MCMC for short. For the basic limit theory of Markov chains, see Section 14.2. For proofs of the claims in the rest of this section and further details on MCMC, see the book of Robert and Casella (1999). In the rest of this section, two of the widely used MCMC algorithms are discussed: the Metropolis-Hastings algorithm and the Gibbs sampler.


14.3.2 Metropolis-Hastings algorithm

Let π be a probability measure on a measurable space (S, S). Let π be dominated by a σ-finite measure µ with density f(·). For each x, let q(y|x) be a probability density in y w.r.t. µ; that is, q(y|x) is jointly measurable as a function from (S × S, S × S) → R+ and, for each x, ∫_S q(y|x) µ(dy) = 1. Such a q(·|·) is called the instrumental or proposal distribution.

The Metropolis-Hastings algorithm generates a Markov chain {Xn} using the densities f(·) and q(·|·) in two steps as follows:

Step 1: Given Xn = x, first generate a random variable Yn with density q(·|x).

Step 2: Then set Xn+1 = Yn with probability p(x, Yn), and Xn+1 = Xn with probability 1 − p(x, Yn), where
$$p(x, y) \equiv \min\Big\{\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)},\ 1\Big\}. \tag{3.1}$$
Thus, the value Yn is "accepted" as Xn+1 with probability p(x, Yn), and if it is rejected the chain stays where it was, i.e., at Xn.

Implicit in the above definition is that the state space of the Markov chain {Xn} is the set $A_f \equiv \{x : f(x) > 0\}$. It is also assumed that for all x, y in Af, q(y|x) > 0. The transition function P(x, A) of this Markov chain is given by
$$P(x, A) = I_A(x)\,(1 - r(x)) + \int_A p(x, y)\,q(y|x)\,\mu(dy), \tag{3.2}$$
where
$$r(x) = \int_S p(x, y)\,q(y|x)\,\mu(dy).$$

It turns out that the measure π(·) is a stationary measure for this Markov chain {Xn}. Indeed, for any A ∈ S,
$$\int_S P(x, A)\,\pi(dx) = \int_S P(x, A)\,f(x)\,\mu(dx) = \int_S I_A(x)(1 - r(x))\,f(x)\,\mu(dx) + \int_S\!\int_A p(x, y)\,q(y|x)\,f(x)\,\mu(dy)\,\mu(dx). \tag{3.3}$$

By the definition of p(x, y), the identity
$$q(y|x)\,f(x)\,p(x, y) = p(y, x)\,q(x|y)\,f(y) \tag{3.4}$$
holds for all x, y. Thus the second integral in (3.3) is (using Tonelli's theorem)
$$= \int_A \Big(\int_S p(y, x)\,q(x|y)\,\mu(dx)\Big) f(y)\,\mu(dy)$$


$$= \int_A r(y)\,f(y)\,\mu(dy).$$
Thus, the right side of (3.3) is
$$\int_S I_A(x)\,f(x)\,\mu(dx) \equiv \pi(A),$$
verifying stationarity.

From the results of Section 14.2, it follows that if the transition function P(·, ·) is Harris irreducible w.r.t. some reference measure φ, then (since it admits π as a stationary measure) the law of large numbers holds, i.e., for any h ∈ L1(π),
$$\frac{1}{n}\sum_{j=0}^{n-1} h(X_j) \to \int h\,d\pi \quad \text{as } n \to \infty$$
w.p. 1 for any initial distribution. Thus, a good MCMC approximation to λ = ∫ h dπ is $\hat\lambda_n \equiv \frac{1}{n}\sum_{j=0}^{n-1} h(X_j)$. A sufficient condition for irreducibility is that q(y|x) > 0 for all (x, y) in Af × Af.

Summarizing the above discussion leads to

Theorem 14.3.1: Let π be a probability measure on a measurable space (S, S) with probability density f(·) w.r.t. a σ-finite measure µ. Let $A_f = \{x : f(x) > 0\}$. Let q(y|x) be a measurable function from Af × Af → (0, ∞) such that ∫_S q(y|x) µ(dy) = 1 for all x in Af. Let $\{X_n\}_{n\ge 0}$ be a Markov chain generated by the Metropolis-Hastings algorithm (3.1). Then, for any h ∈ L1(π),
$$\frac{1}{n}\sum_{j=0}^{n-1} h(X_j) \to \int h\,d\pi \quad \text{as } n \to \infty \quad \text{w.p. 1} \tag{3.5}$$
for any (initial) distribution of X0.

The Metropolis-Hastings algorithm does not need full knowledge of the target density f(·) of π(·). The function f(·) enters the algorithm only through the function p(x, y), which involves only the ratio f(y)/f(x) and q(·|·); hence this algorithm can be implemented even if f is known only up to a multiplicative constant. This is often the case in Bayesian statistics. Also, the choice of q(·|·) depends on f(·) only through the condition that q(y|x) > 0 on Af × Af. Thus, the Metropolis-Hastings algorithm has wide applicability. Two special cases of this algorithm are given below.
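The following minimal sketch (not from the text; assumes NumPy) implements the two-step algorithm above with a symmetric random-walk proposal, for which the ratio q(x|y)/q(y|x) in (3.1) cancels, and a target known only up to a constant. The unnormalized density f_unnormalized, the step size, and the run length are illustrative assumptions; the chain average estimates ∫ h dπ as in (3.5).

```python
import numpy as np

rng = np.random.default_rng(6)

def f_unnormalized(x):
    # Target density known only up to a multiplicative constant (illustrative choice).
    return np.exp(-abs(x) ** 3 / 3.0)

def metropolis_hastings(n, x0=0.0, step=1.0):
    # Random-walk proposal q(y|x) = g(y - x) with g symmetric (here Normal(0, step^2)).
    x = x0
    out = np.empty(n)
    for i in range(n):
        y = x + step * rng.normal()
        accept_prob = min(f_unnormalized(y) / f_unnormalized(x), 1.0)   # (3.1)
        if rng.uniform() < accept_prob:
            x = y                      # accept the proposed value
        out[i] = x                     # on rejection the chain stays at x
    return out

chain = metropolis_hastings(50000)
# MCMC estimates of integral of h dpi for h(x) = x and h(x) = x^2:
print(chain.mean(), (chain ** 2).mean())
```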

14.3.2.1 Independent Metropolis-Hastings

Let q(y|x) ≡ g(y), where g(·) is a probability density such that g(y) > 0 if f(y) > 0.


Suppose $\sup\{f(y)/g(y) : f(y) > 0\} \equiv M < \infty$. Then, in addition to the law of large numbers (3.5) of Theorem 14.3.1, it holds that for any initial value of X0,
$$\|P(X_n \in \cdot) - \pi(\cdot)\| \le 2\Big(1 - \frac{1}{M}\Big)^{n},$$
where ‖·‖ is the total variation norm. Thus, the distribution of Xn converges in total variation at a geometric rate. For a proof, see Robert and Casella (1999).

14.3.2.2 Random-walk Metropolis-Hastings

Here the state space is the real line or a subset of some Euclidean space. Let q(y|x) = g(y − x), where g(·) is a probability density such that g(y − x) > 0 for all x, y with f(x) > 0, f(y) > 0. This ensures irreducibility, and hence the law of large numbers (3.5) holds. A sufficient condition for geometric convergence of the distribution of Xn in the real line case is the following:

(a) The density f(·) is symmetric about 0 and is asymptotically log-concave, i.e., for some α > 0 and x0 > 0,
$$\log f(x) - \log f(y) \ge \alpha|y - x|$$
for all y < x < −x0 or x0 < x < y.

(b) The density function g(·) is positive and symmetric.

For further special cases and more results, see Robert and Casella (1999).

14.3.3 The Gibbs sampler

Suppose π is the probability distribution of a bivariate random vector (X, Y). A Markov chain $\{Z_n\}_{n\ge 0}$ with π as its stationary distribution can be generated using only the families of conditional distributions Q(·, y) of X | Y = y and P(·, x) of Y | X = x, for all x, y, generated by the joint distribution π of (X, Y). This Markov chain is known as the Gibbs sampler. The algorithm is as follows:

Step 1: Start with some initial value X0 = x0. Generate Y0 according to the conditional distribution P(Y ∈ · | X0 = x0) = P(·, x0).

Step 2: Next, generate X1 according to the conditional distribution P(X1 ∈ · | Y0 = y0) = Q(·, y0).

Step 3: Now generate Y1 as in Step 1, but with conditioning value X1.

Step 4: Now generate X2 as in Step 2, but with conditioning value Y1, and so on.


Thus, starting from X0, one generates successively Y0, X1, Y1, X2, Y2, . . .. Clearly, the sequences $\{X_n\}_{n\ge 0}$, $\{Y_n\}_{n\ge 0}$ and $\{Z_n \equiv (X_n, Y_n)\}_{n\ge 0}$ are all Markov chains. It is also easy to verify that the marginal distribution πX of X, the marginal distribution πY of Y, and the distribution π are, respectively, stationary for the {Xn}, {Yn}, and {Zn} chains. Indeed, if X0 ∼ πX, then Y0 ∼ πY and hence X1 ∼ πX. Similarly one can verify the other two claims. Recall that a sufficient condition for the law of large numbers (3.5) to hold is irreducibility. A sufficient condition for irreducibility, in turn, is that the chain $\{Z_n\}_{n\ge 0}$ has a transition function R(z, ·) that, for each z = (x, y), is absolutely continuous with respect to some fixed dominating measure on R².
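A minimal sketch of the bivariate Gibbs sampler (not from the text; assumes NumPy) is given below for a target where both conditionals are available in closed form: a bivariate normal with standard marginals and correlation ρ, an illustrative choice. At stationarity the recorded pairs have the target marginals and correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
rho = 0.8    # target pi = bivariate normal, zero means, unit variances, correlation rho

def gibbs(n, x0=0.0):
    # Alternately draw Y | X = x ~ N(rho*x, 1 - rho^2) and X | Y = y ~ N(rho*y, 1 - rho^2),
    # i.e. the conditional distributions P(., x) and Q(., y) of the target.
    sd = np.sqrt(1.0 - rho ** 2)
    xs, ys = np.empty(n), np.empty(n)
    x = x0
    for i in range(n):
        y = rho * x + sd * rng.normal()   # Steps 1 and 3: Y given the current X
        x = rho * y + sd * rng.normal()   # Steps 2 and 4: X given the new Y
        xs[i], ys[i] = x, y
    return xs, ys

xs, ys = gibbs(50000)
# The pairs (X_n, Y_n) have pi as their stationary distribution:
print(np.round([xs.mean(), xs.var(), np.corrcoef(xs, ys)[0, 1]], 3))
```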

The above algorithm is easily generalized to cover the k-variate case (k > 2). Let (X1, X2, . . . , Xk) be a random vector with distribution π. For any vector x = (x1, x2, . . . , xk), let x(i) = (x1, x2, . . . , xi−1, xi+1, . . . , xk), and let Pi(· | x(i)) be the conditional distribution of Xi given X(i) = x(i). Now generate a Markov chain $\{Z_n \equiv (Z_{n1}, Z_{n2}, \ldots, Z_{nk}),\ n \ge 0\}$ as follows:

Step 1: Start with some initial values Z0j = z0j, j = 1, 2, . . . , k − 1. Generate Z0k from the conditional distribution Pk(· | Xj = z0j, j = 1, 2, . . . , k − 1).

Step 2: Next, generate Z11 from the conditional distribution P1(· | Xj = z0j, j = 2, . . . , k − 1, Xk = Z0k).

Step 3: Next, generate Z12 from the conditional distribution P2(· | X1 = Z11, Xj = z0j, j = 3, . . . , k − 1, Xk = Z0k), and so on, until (Z11, Z12, . . . , Z1,k−1) is generated.

Now go back to Step 1 to generate Z1k, and repeat Steps 2 and 3, and so on. The sequence $\{Z_n\}_{n\ge 0}$ is called the Gibbs sampler Markov chain for the distribution π.

The sufficient condition for irreducibility given earlier for the 2-variate case carries over to the k-variate case. For more on the Gibbs sampler, see Robert and Casella (1999).

14.4 Problems

14.1 (a) Show using Definition 14.1.1 that when the state space S is countable, for any n, conditioned on Xn = an, the events $\{X_{n+j} = a_{n+j}, 1 \le j \le k\}$ and $\{X_j = a_j : 0 \le j \le n-1\}$ are independent for all choices of k and $\{a_j\}_{j=0}^{n+k}$. Thus, conditioned on the "present" Xn = an, the "past" $\{X_j : j \le n-1\}$ and the "future" $\{X_j : j \ge n+1\}$ are two families of independent random variables with respect to the conditional probability measure P(· | Xn = an), provided P(Xn = an) > 0.


(b) Prove Proposition 14.2.2 using induction on n (cf. Chapter 6).

14.2 In Example 14.1.1 (Frog in the well), verify that

(a) if $\alpha_i \equiv 1 - \frac{1}{ci}$, c > 1, i ≥ 1, then 1 is null recurrent,

(b) if αi ≡ α, 0 < α < 1, then 1 is positive recurrent, and

(c) if $\alpha_i \equiv 1 - \frac{1}{2i^2}$, then 1 is transient.

14.3 Consider the SSRW on Z², where the transition probabilities are $p_{(i,j),(i',j')} = \frac14$ if $(i', j') \in \{(i+1, j), (i-1, j), (i, j+1), (i, j-1)\}$ and zero otherwise. Verify that for n = 2k,
$$p^{(2k)}_{(0,0),(0,0)} = \frac{1}{4^{2k}}\binom{2k}{k}^2 \sim \frac{1}{\pi}\frac{1}{k},$$
and conclude that (0, 0) is null recurrent. Extend this calculation to the SSRW on Z³ and conclude that (0, 0, 0) is transient.

14.4 Show that if i is absorbing and j → i, then j is transient, by showing that if j → i, then $f^*_{ji} = P(T_i < T_j \mid X_0 = j) > 0$ and $1 - f_{jj} \ge f^*_{ji}$.

14.5 (a) Let i be recurrent and i → j. Show that j is recurrent using Corollary 14.1.5.

(Hint: Show that there exist n0 and m0 such that for all n, $p^{(n_0+n+m_0)}_{jj} \ge p^{(n_0)}_{ji}\,p^{(n)}_{ii}\,p^{(m_0)}_{ij}$ with $p^{(n_0)}_{ji} > 0$ and $p^{(m_0)}_{ij} > 0$.)

(b) Let i and j communicate. Show that di = dj.

14.6 Show that in a finite state space irreducible Markov chain (S, P), all states are positive recurrent, by showing

(a) that for any i, j in S, there exists r ≤ K such that $p^{(r)}_{ij} > 0$, where K is the number of states in S,

(b) that for any i in S, there exist 0 < α < 1 and c < ∞ such that $P_i(T_i > n) \le c\alpha^n$.

Give an alternate proof by showing that if S is finite, then for any initial distribution µ the sequence of occupation measures
$$\mu_n(\cdot) \equiv \frac{1}{n+1}\sum_{j=0}^{n} P_\mu(X_j \in \cdot)$$
has a subsequence that converges to a probability distribution π that is stationary for (S, P).

14.7 Prove Theorem 14.1.3 using the Markov property and induction.


14.8 Adapt the proof of Theorem 14.1.9 to show that for any i, j,
$$\frac{1}{n}\sum_{k=1}^{n} p^{(k)}_{ij} \to \frac{f_{ij}}{E_j T_j}$$
if j is positive recurrent, and to 0 otherwise. Conclude that in the finite state space case there must be at least one state that is positive recurrent.

14.9 Show that if j → i, then $\zeta_1 \equiv \sum_{r=0}^{T_i - 1} \delta_{X_r, j}$, the number of visits to j before visiting i, satisfies $P_i(\zeta_1 > n) \le c\alpha^n$ for some 0 < c < ∞, 0 < α < 1, and all n ≥ 1.

14.10 Adapt the proof of Theorem 14.1.9 to establish the following laws of large numbers. Let (S, P) be irreducible and positive recurrent with stationary distribution π.

(a) Let h : S → R be such that $\sum_{j\in S} |h(j)|\pi_j < \infty$. Then, for any initial distribution µ,
$$\frac{1}{n+1}\sum_{j=0}^{n} h(X_j) \to \sum_{j\in S} h(j)\,\pi_j \quad \text{w.p. 1},$$
by first verifying that
$$E_i\Big(\Big|\sum_{j=0}^{T_i - 1} h(X_j)\Big|\Big) < \infty.$$

(b) Let g : S × S → R be such that $\sum_{i,j\in S} |g(i,j)|\pi_i p_{ij} < \infty$. Then, for any initial distribution µ,
$$\frac{1}{n+1}\sum_{j=0}^{n} g(X_j, X_{j+1}) \to \sum_{i,j\in S} g(i,j)\,\pi_i\,p_{ij} \quad \text{w.p. 1}.$$

(c) Fix two disjoint subsets A and B of S. Evaluate the long run proportion of transitions from A to B.

(d) Extend (b) to conclude that the tail sequence $Z_n \equiv \{X_{n+j} : j \ge 0\}$ of the Markov chain $\{X_n\}_{n\ge 0}$ converges as n → ∞, in the sense of finite dimensional distributions, to the strictly stationary sequence $\{\tilde X_n\}_{n\ge 0}$ which is the Markov chain (S, P) with initial distribution π.

14.11 Let $\{X_n\}_{n\ge 0}$ be a Markov chain that is irreducible and has at least two states. Show that w.p. 1 the trajectories {Xn} do not converge, i.e., w.p. 1, limn→∞ Xn does not exist.

(Hint: Show that w.p. 1 the set of limit points of $\{X_n : n \ge 0\}$ coincides with S.)


14.12 Let $\{X_n\}_{n\ge 0}$ be a Markov chain with state space S and tr. pr. P ≡ ((p_ij)). A probability distribution π ≡ {πj : j ∈ S} is said to satisfy the condition of detailed balance or time reversibility with respect to (S, P) if for all i, j, $\pi_i p_{ij} = \pi_j p_{ji}$.

(a) Show that such a π is necessarily a stationary distribution.

(b) For the birth and death chain (Example 14.1.4), find a condition in terms of the birth and death rates $\{\alpha_i, \beta_i\}_{i\ge 0}$ for the existence of a probability distribution π that satisfies the condition of detailed balance.

14.13 (Absorption probabilities and times). Let 0 be an absorbing state. For any i ≠ 0, let θi = fi0 ≡ Pi(T0 < ∞) and ηi = Ei T0. Show using the Markov property that for every i ≠ 0,
$$\theta_i = p_{i0} + \sum_{j\ne 0} \theta_j\,p_{ij}, \qquad \eta_i = 1 + \sum_{j\ne 0} \eta_j\,p_{ij}.$$
Apply this to the Gambler's ruin problem with S = {0, 1, 2, . . . , N}, N < ∞, p00 = 1, pNN = 1, $p_{i,i+1} = p$, $p_{i,i-1} = 1 - p$, 0 < p < 1, 1 ≤ i ≤ N − 1, and find the probability of, and the expected waiting time for, ruin (absorption at 0) starting from an initial fortune of i, 1 ≤ i ≤ N − 1.

14.14 (Renewal theory via Markov chains). Let $\{X_j\}_{j\ge 1}$ be iid positive integer valued random variables. Let S0 = 0, $S_n = \sum_{j=1}^{n} X_j$, n ≥ 1, let N(n) = k if Sk ≤ n < Sk+1, k = 0, 1, 2, . . ., be the number of renewals up to time n, and let $A_n = n - S_{N(n)}$ be the age of the current unit at time n.

(a) Show that $\{A_n\}_{n\ge 0}$ is a Markov chain and find its state space S and transition probabilities.

(b) Assuming that EX1 < ∞, verify that
$$\pi_j = \frac{P(X_1 > j)}{E X_1}, \quad j = 0, 1, 2, \ldots,$$
is the unique stationary distribution.

(c) Assuming that X1 has an aperiodic distribution and Theorem 14.1.18 holds, show that the discrete renewal theorem holds.

14.15 Prove Proposition 14.2.1 for the countable state space case.

14.16 Prove Proposition 14.2.2.


14.17 Establish assertion (i) of Theorem 14.2.3.

(Hint: Show that $d\big(\hat X_n(x,\omega), \hat X_{n+1}(x,\omega)\big) \le \big(\prod_{i=1}^{n} s(f_i)\big)\,d\big(x, f_{n+1}(x,\omega)\big)$ and use Borel-Cantelli to show that the right side is $O(\lambda^n)$ w.p. 1 for some 0 < λ < 1; show similarly that $d\big(\hat X_n(x,\omega), \hat X_n(y,\omega)\big) = O(\lambda^n)$ w.p. 1 for any x, y.)

14.18 Show that if P(·, ·) is the transition function of a Markov chain $\{X_n\}_{n\ge 0}$, then for any n ≥ 0, $P_x(X_n \in A) = P^{(n)}(x, A)$, where $P^{(n)}(\cdot,\cdot)$ is defined by the iteration
$$P^{(n+1)}(x, A) = \int_S P^{(n)}(y, A)\,P(x, dy),$$
with $P^{(0)}(x, A) = I_A(x)$.

(Hint: Use induction and the Markov property.)

14.19 (a) Let $\{X_n\}_{n\ge 0}$ be a random walk defined by the iteration scheme $X_{n+1} = X_n + \eta_{n+1}$, where $\{\eta_n\}_{n\ge 1}$ are iid random variables independent of X0. Assume that ν(·) = P(η1 ∈ ·) has an absolutely continuous component with a density that is strictly positive a.e. on an open interval around 0. Show that $\{X_n\}_{n\ge 0}$ is Harris irreducible w.r.t. the Lebesgue measure on R. Show that if, in addition, Eη1 = 0, then {Xn} is Harris recurrent as well.

(b) Use Theorem 14.2.11 to establish the second claim in Example 14.2.10.

14.20 Show that the waiting time chain (Example 14.2.6) defined by $X_{n+1} = \max\{X_n + \eta_{n+1}, 0\}$, where $\{\eta_n\}_{n\ge 1}$ are iid, is irreducible with reference measure φ(·) ≡ δ0(·), the delta measure at 0, provided P(η1 < 0) > 0. Show further that it is φ-recurrent if Eη1 < 0.

14.21 Prove Theorem 14.2.5 (i) using the C-set lemma.

14.22 Find an h : [0, 1] × [0, 1] → [0, 1] such that h(x, y) is discontinuous in x for almost all y in [0, 1], and conclude that the function P(x, A) = P(h(x, Y) ∈ A), where Y is a Uniform[0, 1] random variable, need not be Feller.

14.23 Let (Ω, F, P) be a probability space and (S, S) a measurable space. Let h : S × Ω → S be jointly measurable. Show that P(x, A) ≡ P(h(x, ω) ∈ A) is a transition function.

14.24 (a) Let $\{X_n\}_{n\ge 0}$ be an irreducible Markov chain with state space S ≡ {0, 1, 2, . . .}. Suppose V : S → [0, ∞) is such that for some K < ∞, $E_x V(X_1) \le V(x)$ for all x > K, and that $\lim_{x\to\infty} V(x) = \infty$. Show that $\{X_n\}_{n\ge 0}$ is recurrent.


(Hint: Let $\{\tilde X_n\}_{n\ge 0}$ be a Markov chain with state space S ≡ {0, 1, 2, . . .} and transition probabilities the same as those of $\{X_n\}_{n\ge 0}$, except that the states 0, 1, 2, . . . , K are absorbing. Verify that $\{V(\tilde X_n)\}_{n\ge 0}$ is a nonnegative super-martingale and hence that $\{\tilde X_n\}_{n\ge 0}$ is bounded w.p. 1. Now conclude that there must exist a state x that is visited infinitely often by $\{X_n\}_{n\ge 0}$.)

(b) Consider the reflecting nonhomogeneous random walk on S ≡ {0, 1, 2, . . .} such that
$$p_{ij} = \begin{cases} p_i & \text{if } j = i + 1\\ 1 - p_i & \text{if } j = i - 1 \end{cases}$$
with p0 = 1, 0 < pi ≤ qi for all i ≥ k0 and some 1 ≤ k0 < ∞, and 0 < pi < 1 for all i ≥ 1. Show that $\{X_n\}_{n\ge 0}$ is irreducible and recurrent.

14.25 Let $\{X_n\}_{n\ge 0}$ be an irreducible and recurrent Markov chain with a countable state space S. Let V : S → R+ be such that $E_x V(X_1) \le V(x)$ for all x in S. Show that V(·) is constant on S.

14.26 Let $\{C_n\}_{n\ge 1}$ be iid random variables with values in [0, 4]. Let $\{X_n\}_{n\ge 0}$ be a Markov chain with values in [0, 1] defined by the random iteration scheme
$$X_{n+1} = C_{n+1} X_n (1 - X_n), \quad n \ge 0.$$

(a) Show that if $E\log C_1 < 0$, then $X_n = O(\lambda^n)$ w.p. 1 for some 0 < λ < 1.

(b) Show also that if $E\log C_1 < 0$ and $0 < \mathrm{Var}(\log C_1) < \infty$, then there exist sequences $\{a_n\}_{n\ge 1}$ and $\{b_n\}_{n\ge 1}$ such that
$$\frac{\log X_n - a_n}{b_n} \longrightarrow^d N(0, 1).$$


15
Stochastic Processes

This chapter gives a brief discussion of two special classes of real-valued stochastic processes {X(t) : t ≥ 0} in continuous time [0, ∞): continuous time Markov chains with a discrete state space (including Poisson processes), and Brownian motion. These are very useful in many areas of application, such as queuing theory and mathematical finance.

15.1 Continuous time Markov chains

15.1.1 Definition

Consider a physical system that can be in one of a finite or countable number of states 0, 1, 2, . . . , K, K ≤ ∞. Assume that the system evolves in continuous time in the following manner. In each state the system stays a random length of time that is exponentially distributed and then jumps to a new state with a probability distribution that depends only on the current state and not on the past history. Thus, if the state of the system at the time of the nth transition is denoted by yn, n = 0, 1, 2, . . ., then $\{y_n\}_{n\ge 0}$ is a Markov chain with state space S ≡ {0, 1, 2, . . . , K}, K ≤ ∞, and some transition probability matrix P ≡ ((p_ij)). If yn = in, then the system stays in state in a random length of time Ln, called the sojourn time, such that conditional on $\{y_n = i_n\}_{n\ge 0}$, $\{L_n\}_{n\ge 0}$ are independent exponential random variables with Ln having mean $\lambda_{i_n}^{-1}$. Now set the state of the


system X(t) at time t ≥ 0 by
$$X(t) = \begin{cases} y_0 & 0 \le t < L_0\\ y_1 & L_0 \le t < L_0 + L_1\\ \vdots & \\ y_n & L_0 + L_1 + \cdots + L_{n-1} \le t < L_0 + L_1 + \cdots + L_n.\end{cases} \tag{1.1}$$

Then {X(t) : t ≥ 0} is called a continuous time Markov chain with state space S, jump probabilities P ≡ ((p_ij)), waiting time parameters {λi : i ∈ S}, and embedded Markov chain $\{y_n\}_{n\ge 0}$. To make sure that there are only a finite number of transitions in finite time, i.e.,
$$\sum_{i=0}^{\infty} L_i = \infty \quad \text{w.p. 1},$$
one needs to impose the nonexplosion condition
$$\sum_{n=0}^{\infty} \frac{1}{\lambda_{y_n}} = \infty \quad \text{w.p. 1} \tag{1.2}$$
(Problem 15.1). Clearly, a sufficient condition for this is that λi < ∞ for all i ∈ S and

$\{y_n\}_{n\ge 0}$ is an irreducible recurrent Markov chain.

It can be verified using the "memorylessness" property of the exponential distribution (Problem 15.2) that {X(t) : t ≥ 0} has the Markov property, i.e., for any 0 ≤ t0 ≤ t1 ≤ t2 ≤ · · · ≤ tr < ∞ and i0, i1, . . . , ir in S,
$$P\big(X(t_r) = i_r \mid X(t_j) = i_j,\ 0 \le j \le r-1\big) = P\big(X(t_r) = i_r \mid X(t_{r-1}) = i_{r-1}\big). \tag{1.3}$$
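The construction (1.1) translates directly into a simulation: draw the embedded chain and the exponential sojourn times. The minimal sketch below (not from the text; assumes NumPy) uses a hypothetical 3-state jump matrix P and waiting time parameters λi, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical 3-state continuous time Markov chain: jump matrix P (zero diagonal)
# and waiting-time parameters lambda_i (sojourn in state i is Exponential(lambda_i)).
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])
lam = np.array([1.0, 2.0, 0.5])

def simulate(t_max, y0=0):
    # Return the jump times and visited states of X(t) on [0, t_max], following (1.1).
    t, y = 0.0, y0
    times, states = [0.0], [y0]
    while True:
        t += rng.exponential(1.0 / lam[y])       # sojourn time L_n in the current state
        if t >= t_max:
            return times, states
        y = rng.choice(3, p=P[y])                # next state drawn from the embedded chain
        times.append(t)
        states.append(y)

times, states = simulate(20.0)
print(list(zip(np.round(times, 2), states))[:8])  # first few (jump time, state) pairs
```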

15.1.2 Kolmogorov's differential equations

The functions
$$p_{ij}(t) \equiv P\big(X(t) = j \mid X(0) = i\big) \tag{1.4}$$
are called transition functions. To determine these functions from the jump probabilities {p_ij} and the waiting time parameters {λi}, one uses the Chapman-Kolmogorov equations
$$p_{ij}(t+s) = \sum_{k\in S} p_{ik}(t)\,p_{kj}(s), \quad t, s \ge 0, \tag{1.5}$$
which are an immediate consequence of the Markov property (1.3) and the definition (1.4). In addition to (1.5), one has the continuity condition
$$\lim_{t\downarrow 0} p_{ij}(t) = \delta_{ij}. \tag{1.6}$$


Under the nonexplosion hypothesis (1.2), it can be shown (Chung (1967), Feller (1966), Karlin and Taylor (1975)) that the pij(t) are differentiable as functions of t and satisfy Kolmogorov's forward and backward differential equations
$$p'_{ij}(t) = \sum_{k} p_{ik}(t)\,p'_{kj}(0) \quad \text{(forward)} \tag{1.7a}$$
$$p'_{ij}(t) = \sum_{k} p'_{ik}(0)\,p_{kj}(t) \quad \text{(backward)}. \tag{1.7b}$$
Further, $a_{kj} \equiv p'_{kj}(0)$ can be shown to be $\lambda_k p_{kj}$ for k ≠ j and $-\lambda_k$ for k = j. The matrix A ≡ ((a_ij)) is called the infinitesimal matrix or generator of the process {X(t) : t ≥ 0}. If the state space S is finite, i.e., K < ∞, then P(t) ≡ ((p_ij(t))) can be shown to be
$$P(t) = \exp(At) \equiv \sum_{n=0}^{\infty} \frac{A^n}{n!}\,t^n. \tag{1.8}$$
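For a finite state space, (1.8) can be evaluated directly; the naive sketch below (not from the text; assumes NumPy, with a hypothetical 3-state generator whose rows sum to zero) truncates the power series, which is adequate for small matrices and moderate t, and then checks that the rows of P(t) sum to one and that the semigroup property (1.5) holds.

```python
import numpy as np

# Hypothetical generator A: a_ij = lambda_i * p_ij for i != j, a_ii = -lambda_i,
# so each row sums to zero.
A = np.array([[-1.0,  0.7,  0.3],
              [ 1.0, -2.0,  1.0],
              [ 0.2,  0.3, -0.5]])

def transition_matrix(A, t, terms=60):
    # Truncated power series for P(t) = exp(At) = sum_n (At)^n / n!  (cf. (1.8)).
    M = A * t
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ M / n
        result = result + term
    return result

P1 = transition_matrix(A, 1.0)
print(np.round(P1, 4))
print(P1.sum(axis=1))                                      # each row sums to 1
print(np.allclose(transition_matrix(A, 2.0), P1 @ P1))     # semigroup property (1.5)
```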

15.1.3 Examples

Example 15.1.1: (Birth and death process). Here
$$p_{i,i+1} = \frac{\alpha_i}{\alpha_i + \beta_i},\ i \ge 0, \qquad p_{i,i-1} = \frac{\beta_i}{\alpha_i + \beta_i},\ i \ge 1, \qquad \lambda_i = \alpha_i + \beta_i,\ i \ge 0,$$
where $\{\alpha_i, \beta_i\}_{i\ge 0}$ are nonnegative numbers, αi being the birth rate and βi the death rate. This has the meaning that given X(t) = i, for small h > 0, X(t + h) goes up to (i + 1) with probability αi h + o(h), goes down to (i − 1) with probability βi h + o(h), or stays at i with probability 1 − (αi + βi)h + o(h). In this case the forward and backward equations become
$$p'_{ij}(t) = \alpha_{j-1}\,p_{i,j-1}(t) + \beta_{j+1}\,p_{i,j+1}(t) - (\alpha_j + \beta_j)\,p_{ij}(t),$$
$$p'_{ij}(t) = \alpha_i\,p_{i+1,j}(t) + \beta_i\,p_{i-1,j}(t) - (\alpha_i + \beta_i)\,p_{ij}(t),$$
with initial condition $p_{ij}(0) = \delta_{ij}$.

(a) Pure birth process. A special case of the above is when βi ≡ 0 for all i. Here pij(t) = 0 if j < i, X(t) is a nondecreasing function of t, and the jumps are of size one.

A further special case of this is when αi ≡ α for all i. In this case, the process waits in each state a random length of time with mean $\alpha^{-1}$ and


jumps one step higher. It can be verified that in this case the solution of Kolmogorov's differential equations (1.7a) and (1.7b) is given by
$$p_{ij}(t) = e^{-\alpha t}\,\frac{(\alpha t)^{j-i}}{(j-i)!}. \tag{1.9}$$
From this it is easy to conclude that {X(t) : t ≥ 0} is a Lévy process, i.e., it has stationary and independent increments: for 0 = t0 ≤ t1 ≤ t2 ≤ · · · ≤ tr < ∞, the random variables $Y_j = X(t_j) - X(t_{j-1})$, j = 1, 2, . . . , r, are independent and the distribution of Yj depends only on (tj − tj−1). Further, in this case X(t) − X(0) has a Poisson distribution with mean αt. This {X(t) : t ≥ 0} is called a Poisson process with intensity parameter α.
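A quick simulation check of the last statement (not from the text; assumes NumPy; the values of α and t are illustrative assumptions): counting the jumps of the constant-rate pure birth process up to time t should produce a random variable whose mean and variance are both close to αt, as a Poisson(αt) variable requires.

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, t = 2.0, 5.0

def poisson_count(alpha, t, rng):
    # Count the jumps of a pure birth process with constant rate alpha up to time t:
    # accumulate Exponential(alpha) sojourn times until they exceed t.
    n, s = 0, rng.exponential(1.0 / alpha)
    while s < t:
        n += 1
        s += rng.exponential(1.0 / alpha)
    return n

counts = np.array([poisson_count(alpha, t, rng) for _ in range(20000)])
# X(t) - X(0) should be Poisson(alpha * t): mean and variance both near alpha*t = 10.
print(counts.mean(), counts.var())
```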

Another special case is the linear birth and death process, where αi = iα, βi = iβ for i = 0, 1, 2, . . .. The pure death process has parameters αi ≡ 0 for i ≥ 0. A number of queuing processes can be modeled as birth and death processes and, more generally, as continuous time Markov chains. For example, an M/M/s queuing system is one in which customers arrive at a service facility at the jump times of a Poisson process (with parameter α) and there are s servers, the service time at each server being exponential with the same mean (= β⁻¹). The number X(t) of customers in the system at time t evolves as a birth and death process with parameters αi ≡ α for i ≥ 0 and βi = iβ for 0 ≤ i ≤ s, βi = sβ for i > s.

Example 15.1.2: (Markov branching processes). Here X(t) is the population size in a process where each particle lives a random length of time with exponential distribution with mean α⁻¹ and on death creates a random number of new particles with offspring distribution $\{p_j\}_{j\ge 0}$, all particles evolving independently of each other. This implies that λi = iα, i ≥ 0, $p_{ij} = p_{j-i+1}$ for j ≥ i − 1 and $p_{ij} = 0$ for j < i − 1, i ≥ 1, and p00 = 1. Thus 0 is an absorbing barrier. The random variable T ≡ inf{t : t > 0, X(t) = 0} is called the extinction time. It can be shown that
$$\sum_{j=0}^{\infty} p_{ij}(t)\,s^j = \Big(\sum_{j=0}^{\infty} p_{1j}(t)\,s^j\Big)^{i} \quad \text{for } i \ge 0,\ 0 \le s \le 1, \tag{1.10}$$

and also that

F (s, t) ≡( ∞∑

j=0

p1j(t)sj

)

satisfies the differential equation

∂F

∂t(s, t) = u(s)

∂sF (s, t) (forward equation) (1.11)

∂F

∂t(s, t) = u

(F (s, t)

)(backward equation) (1.12)

Page 499: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

15.1 Continuous time Markov chains 491

with F (s, 0) ≡ s

where u(s) ≡ α

( ∞∑j=0

pjsj − s

). (1.13)

Further, if q ≡ P (T < ∞ | X(0) = 1) is the extinction probability, the q isthe smallest solution in [0,1] of the equation q =

∑∞j=0 pjq

j (cf. Chapter18). (See Athreya and Ney (2004), Chapter III, p. 106.)

Example 15.1.3: (Compound Poisson processes). Let Lii≥0 and ξii≥1be two independent sequences of random variables such that Lii≥0 areiid exponential with mean α−1 and ξii≥1 are iid integer valued randomvariables with distribution pj. Let X(t) = k if L0 + · · · + Lk ≤ t <L0 + · · ·+ Lk+1. Let

X(t) =

⎧⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎩

0 0 ≤ t < L01 L0 ≤ t < L0 + L1...k L0 + · · ·+ Lk−1 ≤ t < L0 + · · ·+ Lk,...

Let

Y (t) =X(t)∑i=1

ξi, t ≥ 0. (1.14)

Then Y (t) : t ≥ 0 is a continuous time Markov chain with state spaceS ≡ 0,±1,±2, . . ., jump probabilities pij = P (ξ1 = j − i) = pj−i. It isalso a Levy process. It is called a compound Poisson process with jump rateα and jump distribution pj. If p1 ≡ 1 this reduces to the Poisson processcase.

15.1.4 Limit theoremsTo investigate what happens to pij(t) ≡ P (X(t) = j | X(0) = i) as t →∞,one needs to assume that the embedded chain ynn≥0 is irreducible andrecurrent. This implies that for any i0 the random variable

T = mint : t > L0, X(t) = i0

is finite w.p. 1. Further, the process, starting from i0, returns to i0 infinitelyoften and hence by the Markov property is regenerative in the sense thatthe excursions between consecutive returns to i0 are iid. One can use this,laws of large numbers and renewal theory (cf. Section 8.5) to arrive at thefollowing:

Page 500: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

492 15. Stochastic Processes

Theorem 15.1.1: Let P = pij be irreducible and recurrent and 0 <λi < ∞ for all i in S. Let there exist a probability distribution πi suchthat ∑

j∈S

aijπj = 0 for all i (1.15)

where

aij = λipij i = j

= −λi i = j.

Then

(i) πj is stationary for pij(t), i.e.,∑

i∈S πipij(t) = πj for all j,t ≥ 0,

(ii) for all i, jlim

t→∞ pij(t) = πj , (1.16)

and hence πj is the unique stationary distribution,

(iii) for any function h : S → R, such that∑

j∈S |h(j)|πj < ∞,

limt→∞

1t

∫ t

0h(X(u)

)du =

∑j∈S

h(j)πj w.p. 1 (1.17)

for any initial distribution of X(0).

Note that (1.16) holds without any assumption of aperiodicity on P ≡((pij)

).

A sufficient condition for a probability distribution πj to be a station-ary distribution is the so-called detailed balance condition

πkakj = πjajk. (1.18)

One can use this for birth and death chains on a finite state space S ≡0, 1, 2, . . . , N, N < ∞ to conclude that the stationary distribution isgiven by

πn =αn−1αn−2 . . . α0

βnβn−1 . . . β1π0 (1.19)

provided αi > 0 for all 0 ≤ i ≤ N − 1, βi > 0 for all 1 ≤ i ≤ N andαN = 0, β0 = 0. A necessary and sufficient condition for equilibrium, i.e.,the existence of a stationary distribution when N = ∞ is

∞∑n=1

αn−1 . . . α0

βn . . . β1< ∞. (1.20)

Page 501: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

15.2 Brownian motion 493

This yields in the M/M/s case with arrival rate α and service rate β (i.e.,αi ≡ α, for i ≥ 0, βi ≡ iβ for 1 ≤ i ≤ s, = sβ for i > s) the necessary andsufficient condition for the equilibrium, that the traffic intensity

ρ ≡ α

sβ< 1, (1.21)

i.e., the mean number of arrivals per unit time, be less than the meannumber of the persons served per unit time. For further discussion andresults, see the books of Karlin and Taylor (1975) and Durrett (2001).

15.2 Brownian motion

Definition 15.2.1: A real valued stochastic process B(t) : t ≥ 0 iscalled standard Brownian motion (SBM) if it satisfies

(i) B(0) = 0,

(ii) B(t) has N(0, t) distribution, for each t ≥ 0,

(iii) it is a Levy process, i.e., it has stationary independent increments.

It follows that B(t) : t ≥ 0 is a Gaussian process (i.e., the finitedimensional distributions are Gaussian) with mean function m(t) ≡ 0 andcovariance function c(s, t) = min(s, t). It can be shown that the trajectoriesare continuous w.p. 1. Thus, Brownian motion is a Gaussian process, hascontinuous trajectories and has stationary independent increments (andhence is Markovian). These features make it a very useful process as abuilding block for many real world phenomena such as the movement ofpollen (which was studied by the English Botanist, Robert Brown, andhence the name Brownian motion) movement of a tagged particle in aliquid subject to the bombardment of the molecules of the liquid (studiedby Einstein and Slomuchowski) and the fluctuations in stock market prices(studied by the French Economist Bachelier).

15.2.1 Construction of SBMLet ηii≥1 be iid N(0, 1) random variables on some probability space(Ω,F , P ). Let φi(·)i≥1 be the sequence of Haar functions on [0, 1] definedby the doubly indexed collection

H00(t) ≡ 1

H11(t) =

1 on [0, 12 )

−1 on [12 , 1]

Page 502: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

494 15. Stochastic Processes

and for n ≥ 1

Hn,j(t) = 2n−1

2 for t in[ (j − 1)

2n,

j

2n

)= −2

n−12 for t in

[ j

2n,j + 12n

]= 0 otherwise

j = 1, 3, . . . , 2n−1.

It is known that this family is a complete orthonormal basis for L2([0, 1]).Let

BN (t, ω) ≡N∑

i=1

ηi(ω)∫ t

0φi(u)du. (2.1)

Then, for each N , BN (t, ω) : 0 ≤ t ≤ 1 is a Gaussian processon (Ω,F , P ) with mean function mN (t) ≡ 0 and covariance functioncN (s, t) =

∑Ni=1

( ∫ t

0 φi(u)du)( ∫ s

0 φi(u)du)

and the property that the tra-jectories t → BN (t, ω) are continuous in t for each ω in Ω.

It can be shown (Problem 15.11) that w.p. 1 the sequence BN (·, ω)N≥1is a Cauchy sequence in the Banach space C[0, 1] of continuous real val-ued functions on [0, 1] with supremum metric. Hence, BN (·, ω)N≥1 con-verges w.p. 1 to a limit element B(·, ω) which will be a Gaussian processwith continuous trajectories and mean and covariance functions m(t) ≡ 0and c(s, t) =

∑∞i=1

( ∫ t

0 φi(u)du)( ∫ s

0 φi(u)du)

=∫ t

0 I[0,t](u)I[0,s](u)du =min(s, t) respectively. (See Section 2.3 of Karatzas and Shreve (1991).)Thus,

B(t, ω) ≡∞∑

i=1

ηi(ω)∫ t

0φi(u)du (2.2)

is a well-defined stochastic process for 0 ≤ t ≤ 1 that has all the propertiesclaimed above and is called SBM on [0,1]. Let B(j)(t, ω) : 0 ≤ t ≤ 1j≥1be iid copies of B(t, ω) : 0 ≤ t ≤ 1 as defined as above. Now set

B(t, ω) ≡

⎧⎪⎪⎪⎪⎨⎪⎪⎪⎪⎩

B(1)(t, ω), 0 ≤ t ≤ 1B(1)(1, ω) + B(2)(t− 1, ω), 1 ≤ t ≤ 2...B(n, ω) + B(n+1)(t− n, ω), n ≤ t ≤ n + 1, n = 1, 2, . . .

(2.3)Then B(t, ω) : t ≥ 0 satisfies

(i) B(0, ω) = 0,

(ii) t → B(t, ω) is continuous in t for all ω,

(iii) it is Gaussian with mean function m(t) ≡ 0 and covariance functionc(s, t) ≡ min(s, t),

i.e., it is SBM on [0,∞). From now on the symbol ω may be suppressed.

Page 503: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

15.2 Brownian motion 495

15.2.2 Basic properties of SBM(i) Scaling properties

Fix c > 0 and set

Bc(t) ≡1√cB(ct), t ≥ 0. (2.4)

Then, Bc(t)t≥0 is also an SBM. This is easily verified by notingthat Bc(0) = 0, Bc(t) ∼ N(0, t), Cov(Bc(t), Bc(s)) = 1

c minct, cs =min(t, s) and that Bc(·) is a Levy process and the trajectories arecontinuous w.p. 1.

(ii) ReflectionIf B(·) is SBM, then so is −B(·). This follows from the symmetryof the mean zero Gaussian distribution.

(iii) Time inversionLet

B(t) =

tB(1t ) for t > 0

0 for t = 0.(2.5)

Then B(t) : t ≥ 0 is also an SBM. The facts that B(t) : t > 0is a Gaussian process with mean and covariance function sameas SBM and the trajectories are continuous in the open interval(0,∞) are straightforward to verify. It only remains to verify thatlimt→0 B(t) = 0 w.p. 1. Fix 0 < t1 < t2. Then B(t) : t1 ≤ t ≤ t2is a Gaussian process with mean function 0 and covariance functionmin(s, t) and has continuous trajectories, i.e., it has the same distri-bution as B(t) : t1 ≤ t ≤ t2. Thus X1 ≡ sup|B(t)| : t1 ≤ t ≤ t2has the same distribution as X1 ≡ sup|B(t)| : t1 ≤ t ≤ t2. Sinceboth converge as t1 ↓ 0 to X2(t2) ≡ supB(t) : 0 < t ≤ t2 andX2(t2) ≡ supB(t) : 0 < t ≤ t2, respectively, these two have thesame distribution. Again, since X2(t2) and X2(t2) both converge ast2 ↓ 0 to X2 ≡ limt↓0|B(t)| and X2 ≡ limt↓0|B(t)|, respectively, X2and X2 have the same distribution. But X2 = 0 w.p. 1 since B(t) iscontinuous in [0,∞). Thus X2 = 0 w.p. 1, i.e., limt→0 B(t) = 0 w.p.1.

(iv) Translation invariance (after a fixed time t0)Fix t0 > 0 and set

Bt0(t) = B(t + t0)−B(t0), t ≥ 0. (2.6)

Then Bt0(t)t≥0 is also an SBM. This follows from the stationaryindependent increments property.

(v) Translation invariance (after a stopping time T0)A random variable T (ω) with values in [0,∞) is called a stopping

Page 504: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

496 15. Stochastic Processes

time w.r.t. the SBM B(t) : t ≥ 0 if for each t in [0,∞) the eventT ≤ t is in the σ-algebra Ft ≡ σ(B(s) : s ≤ t) generated by thetrajectory B(s) for 0 ≤ s ≤ t. Examples of stopping times are

Ta = mint : t ≥ 0, B(t) ≥ a (2.7)

for 0 < a < ∞Ta,b = mint : t > 0, B(t) ∈ (a, b) (2.8)

where a < 0 < b.Let T0 be a stopping time w.r.t. SBM B(t) : t ≥ 0. Let

BT0(t) ≡ B(T0 + t)−B(T0) : t ≥ 0. (2.9)

Then BT0(t)t≥0 is again an SBM.Here is an outline of the proof.

(a) T0 deterministic is covered by (4) above.(b) If T0 takes only countably many values, say ajj≥1, then it

is not difficult to show that conditioned on the event T0 = aj ,the process BT0(t) ≡ B(T0 + t) − B(T0) is SBM. Thus theunconditional distribution of BT0(t) : t ≥ 0 is again an SBM.

(c) Next given a general stopping time T0, one can approximateit by a sequence Tn of stopping times where for each n, Tn isdiscrete. By continuity of trajectories, BT0(t) : t ≥ 0 has thesame distribution as the limit of BTn

(t) : t ≥ 0 as n →∞.

A consequence of the above two properties is that SBM has theMarkov and the strong Markov properties. That is, for each fixedt0, the distribution of B(t), t ≥ t0 given B(s) : s ≤ t0 depends onlyon B(t0) (Markov property) and for each stopping time T0, the dis-tribution of B(t) : t ≥ T0 given B(s) : s ≤ T0 depends only on B(T0)(strong Markov property).

(vi) The reflection principleFix a > 0 and let Ta = inft : B(t) ≥ a where B(t) : t ≥ 0 is

SBM. For any t > 0, a > 0,

P (Ta ≤ t) = P (Ta ≤ t, B(t) > a)+ P (Ta ≤ t, B(t) < a).

Now, by continuity of the trajectory, B(Ta) = a on Ta ≤ t. Thus

P(Ta ≤ t, B(t) < a

)= P

(Ta ≤ t, B(t) < B(Ta)

)= P

(Ta ≤ t, B(t)−B(Ta) < 0

)= P

(Ta ≤ t, B(t)−B(Ta) > 0

)= P

(Ta ≤ t, B(t) > a

).

Page 505: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

15.2 Brownian motion 497

To see this note that by (4), B(Ta + h) − B(Ta) : h ≥ 0 is inde-pendent of Ta and has the same distribution as an SBM and hence−(B(Ta + h)−B(Ta)

): h ≥ 0

is also independent of Ta and has

the same distribution as an SBM. Thus,

P (Ta ≤ t) = 2P(Ta ≤ t, B(t) > a

)= 2P

(B(t) > a

)= 2

(1− Φ

( a√t

))(2.10)

where Φ(·) is the standard N(0, 1) cdf. The above argument is knownas the reflection principle as it asserts that the path

B(t) ≡

B(t) , t ≤ Ta

B(Ta)−(B(t)−B(Ta)

), t > Ta

(2.11)

obtained by reflecting the original path on the line y = a from thepoint (Ta, a) for t > Ta yields a path that has the same distributionas the original path. Thus the probability density function of Ta is

fTa(t) = 2φ( a√

t

)12

a

t3/2

=1√2π

e− a22t

a

t3/2 (2.12)

implying that ET pa < ∞ for p < 1/2 and ∞ for p ≥ 1/2. Also, by the

strong Markov property the process Ta : a ≥ 0 is a process withstationary independent increments, i.e., a Levy process. It is also astable process of order 1/2.

One can use this calculation of P (Ta ≤ t) to show that the probabil-

ity that the SBM crosses zero in the interval (t1, t2) is 2π arcsin

√t1t2

(Problem 15.12).

If M(t) ≡ supB(s) : 0 ≤ s ≤ t then for a > 0

P(M(t) > a

)= P (Ta ≤ t)

= 2P(B(t) > a

)= P

(|B(t)| > a

)(2.13)

it follows that M(t) has the same distribution as |B(t)| and hencehas finite moments of all order. In fact,

E(eθM(t)) < ∞ for all θ > 0.

Page 506: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

498 15. Stochastic Processes

15.2.3 Some related processes(i) Let B(t) : t ≥ 0 be a SBM. For µ in (−∞,∞) and σ > 0, the

process Bµ,σ(t) ≡ µt + σB(t), t ≥ 0 is called Brownian motion withconstant drift µ and constant diffusion σ.

(ii) Let B0(t) = B(t) − tB(1), 0 ≤ t ≤ 1. The process B0(t) : 0 ≤t ≤ 1 is called the Brownian bridge. It is a Gaussian process withmean function 0 and covariance min(s, t) − st and has continuoustrajectories that vanish both at 0 and 1.

(iii) Let Y (t) = e−tB(e2t), −∞ < t < ∞. Then Y (t) : t ≥ 0is a Gaussian process with mean function 0 and covariance func-tion c(s, t) = e−(s+t)e+2s = es−t if s < t. This process is calledthe Ornstein-Uhlenbeck process. It is to be noted that for each t,Y (t) ∼ N(0, 1) and in fact Y (t) : −∞ < t < ∞ is a strictly sta-tionary process and is a Markov process as well.

15.2.4 Some limit theoremsLet ξii≥1 be iid random variables with Eξ1 = 0, Eξ2

1 = 1. Let S0 = 0,Sn =

∑ni=1 ξi, n ≥ 1. Let Bn(j/n) = 1√

nSj , j = 0, 1, 2, . . . , n and Bn(t) :

0 ≤ t ≤ 1 be obtained by linear interpolation from the values at j/nfor j = 0, 1, 2, . . . , n. Then for each n, Bn(t) : 0 ≤ t ≤ 1 is a randomcontinuous trajectory and hence is a random element of the metric spaceof continuous real valued functions on [0,1] that are zero at zero with themetric

ρ(f, g) ≡ sup |f(t)− g(t)| : 0 ≤ t ≤ 1. (2.14)

Let µn ≡ PB−1n be the induced probability measure on C[0, 1]. The fol-

lowing is a generalization of the central limit theorem as noted in Chapter11.

Theorem 15.2.1: (Donsker). In the space (C[0, 1], ρ) the sequence ofprobability measures µn ≡ PB−1

n n≥1 converges weakly to µ, the probabil-ity distribution of the SBM. That is, for any bounded continuous functionh from C[0, 1] to R,

∫hdµn → fhdµ.

For a proof, see Billingsley (1968).

Corollary 15.2.2: For any continuous functional T on (C[0, 1], ρ) to Rk,k < ∞, the distribution of T (Bn) converges to that of T (B). In particular,the joint distribution of

(max

0≤j≤n

Sj√n, max0≤j≤n

|Sj |√n

)converges weakly to that of(

max0≤u≤1

B(u), max0≤u≤1

|B(u)|).

Page 507: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

15.2 Brownian motion 499

There are similar limit theorems asserting the convergence of the em-pirical processes to the Brownian bridge with applications to the limitdistribution of the Kolmogorov-Smirnov statistics (see Billingsley (1968)).

Theorem 15.2.3: (Laws of large numbers).

limt→∞

B(t)t

= 0 w.p. 1. (2.15)

Proof: By the time inversion property (2.5)

limt→0

B(t) = 0 w.p. 1.

But limt→0

B(t) = limt→0

tB(1/t) = limτ→∞

B(τ)τ

.

Theorem 15.2.4: (Kallianpur-Robbins). Let f : R → R be integrablewith respect to Lebesgue measure. Then

1√t

∫ t

0f(B(u))du −→d

(∫ ∞

0f(u)du

)Z (2.16)

where Z is a random variable with density π√z(1−z)

in [0,1].

This is a special case of the Darling-Kac formula for Markov processesthat can be established here using the regenerative property of SBM dueto the fact that starting from 0, SBM will hit level 1 at same time T1and from there hit level 0 at a later time τ1. And this can be repeated toproduce a sequence of times 0, τ1, τ2, . . . such that the excursions B(t) :τi ≤ t < τi+1i≥1 are iid. The sequence τii≥1 is a renewal sequence withlife time distribution τ1 having a regularly varying tail of order 1/2 andhence infinite mean. One can appeal now to results from renewal theory tocomplete the proof (see Feller (1966) and Athreya (1986)).

15.2.5 Some sample path properties of SBMThe sample paths t → B(t, ω) of the SBM are continuous w.p. 1. It turnsout that they are not any more smooth than this. For example, they arenot differentiable nor are they of bounded variation on finite intervals. Itwill be shown now that w.p. 1 Brownian sample paths are not differentiableany where and the quadratic variation over any finite interval is finite andnonrandom. (See also Karatzas and Shreve (1991).)

(i) Nondifferentiability of B(·, ω) in [0,1]Let An,k =

ω : sup

|t−s|≤3/n

|B(t,ω)−B(s,ω)||t−s| ≤ k for some 0 ≤ s ≤ 1

.

Let Zr,n =∣∣B( (r+1)

n

)− B

(rn

)∣∣, r = 0, 1, 2, n − 1. Let Bn,k =ω :

Page 508: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

500 15. Stochastic Processes

max(Zr,n, Zr+1,n, Zr+2,n

)≤ 6k

n for some r. It can be verified that

An,k ⊂ Bn,k. Now

P (Bn,k) ≤n−1∑r=0

P(

max(Zr,n, Zr+1,n, Zr+2,n

)≤ 6k

n

)

≤ n(P(|Z0,n| ≤

6k

n

))3

≤ nP( |Z0n|

1√n

≤( 6k√

n

))3

≤ n(Const√

n

)3as n →∞,

since Z0n1√n

∼ N(0, 1). Thus for each k < ∞, P (An,k) ≤ Const√n

. This

implies∞∑

n=1

P (An3,k) < ∞.

So by the Borel-Cantelli lemma, only finitely many An3,k can happenw.p. 1. The event A ≡ ω : B(t, ω) is differentiable for at least one tin [0, 1] is contained in C ≡

⋃k≥1ω : ω ∈ An3,k for infinitely many

n and so P (A) ≤ P (C) = 0.

(ii) Finite quadratic variation of SBMLet

ηn,j = B(j2−n)−B((j − 1)2−n

), j = 1, 2, . . . , 2n

∆n ≡2n∑

j=1

η2nj . (2.17)

Then

E∆n =2n∑

j=1

12n

= 1.

Also by independence and stationarity of increments

Var(∆n) =2n∑

j=1

322n

=32n

.

Thus P (|∆n−1| > ε) ≤ Var(∆n)ε2 for any ε > 0. This implies, by Borel

Cantelli, ∆n → 1 w.p. 1. By definition the quadratic variation is

∆ ≡ sup n∑

j=0

∣∣B(tj , ω)−B(tj−1, ω)∣∣2 : all finite partitions

(t0, t1, . . . , tn) of [0, 1]

. (2.18)

Page 509: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

15.2 Brownian motion 501

It is easy to verify that ∆ = limn ∆n. Thus ∆ = 1 w.p. 1. It followsthat w.p. 1 the Brownian motion paths are not of bounded variation.By the scaling property of SBM, it follows that the quadratic variationof SBM over [0, t] is t w.p. 1 for any t > 0.

15.2.6 Brownian motion and martingalesThere are three natural martingales associated with Brownian motion.

Theorem 15.2.5: Let B(t) : t ≥ 0 be SBM. Then

(i) (Linear martingale) B(t) : t ≥ 0 is a martingale.

(ii) (Quadratic martingale) B2(t)− t : t ≥ 0 is a martingale.

(iii) (Exponential martingale) For any θ real, eθB(t)− θ22 t : t ≥ 0 is a

martingale.

Proof: (i) and (ii). Since B(t) ∼ N(0, t),

E|B(t)| < ∞ and E|B(t)|2 < ∞.

By the stationary independent increments property for any t ≥ 0, s ≥ 0,

E(B(t + s) | B(u) : u ≤ t

)= E

((B(t + s)−B(t)

)| B(u) : u ≤ t

)+ B(t)

= 0 + B(t) = B(t) establishing (i).

Next,

E(B2(t + s) | B(u) : u ≤ t

)= E

((B(t + s)−B(t)

)2 | B(u) : u ≤ t)

+ B2(t) + 2E((

B(t + s)−B(t))B(t) | B(u) : u ≤ t

)= s + B2(t) + 0

and hence

E(B2(t + s)− (t + s) | B(u) : u ≤ t

)= B2(t)− t, establishing (ii).

(iii)

E(eθB(t+s)− θ2

2 (t+s)∣∣∣ B(u) : u ≤ t

)= E

(eθ(B(t+s)−B(t))− θ2

2 s∣∣∣ B(u) : u ≤ t

)eθB(t)− θ2

2 t.

Again by using the fact that B(t+s)−B(t) given(B(u) : u ≤ t

)is N(0, s),

the first term on the right side becomes 1 proving (iii).

Page 510: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

502 15. Stochastic Processes

15.2.7 Some applicationsThe martingales in Theorem 15.2.5 combined with the optional stoppingtheorems of Chapter 13 yield the following applications.

(i) Exit probabilitiesLet B(·) be SBM. Fix a < 0 < b. Let Ta,b = mint : t > 0, B(t) =

a or b. From (i) and the optional sampling theorem, for any t > 0;

EB(Ta,b ∧ t) = EB(0) = 0. (2.19)

Also, by continuity, B(Ta,b ∧ t) → B(Ta,b). By bounded convergencetheorem, this implies

EB(Ta,b) = 0 (2.20)

i.e., a p + b(1 − p) = 0 where p = P (Ta < Tb) = P (B(·) reaches abefore b). Thus, p = b

(b−a) .

(ii) Mean exit timeFrom (ii) and the optional sampling theorem

E(B2(Ta,b ∧ t)− (Ta,b ∧ t)

)= 0

i.e.,EB2(Ta,b ∧ t) = E(Ta,b ∧ t). (2.21)

By using the bounded convergence theorem on the left and the mono-tone convergence theorem on the right, one may conclude

EB2(Ta,b) = ETa,b

i.e.,

ETa,b = pa2 + (1− p)b2

= a2 b

(b− a)+ b2 (−a)

(b− a)= (−ab). (2.22)

(iii) The distribution of Ta,b

From (iii) and the optimal sampling theorem

E(eθB(Ta,b∧t)− θ2

2 Ta,b

)= 1.

By the bounded convergence theorem, this implies

E(eθB(Ta,b)− θ2

2 Ta,b

)= 1. (2.23)

Page 511: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

15.2 Brownian motion 503

In particular, if b = −a this reduces to

1 = E(eθa− θ2

2 Ta,−a : Ta < T−a

)+ E

(e−θae− θ2

2 Ta,−a : Ta > T−a

)=

(eθa + e−θa

)12E(e− θ2

2 Ta,−a

),

since by symmetry

E(e− θ2

2 Ta,−a : Ta < T−a

)= E

(e− θ2

2 Ta,−a : Ta > T−a

).

Thus, for λ ≥ 0

E(e−λTa,−a

)= 2(e√

2λa + e−√2λa)−1

. (2.24)

Similarly, it can be shown that for λ ≥ 0, a > 0

E(e−λTa

)= e−√

2λa. (2.25)

15.2.8 The Black-Scholes formula for stock price optionLet X(t) denote the price of one unit of a stock S at time t. Due to fluc-tuations in the market place, it is natural to postulate that X(t) : t ≥ 0is a stochastic process. To build an appropriate model consider the dis-crete time case first. If Xn denotes the unit price at time n, it is naturalto postulate that Xn+1 = Xnyn+1 where yn+1 represents the effects ofthe market fluctuation in the time interval [n, n + 1). This leads to theformula Xn = X0y1y2 · · · yn. If one assumes that ynn≥1 are sufficientlyindependent, then

Xn = X0 enµ+

n∑

i=1(log yi−µ)

is, by the central limit theorem, approximately Gaussian, leading one toconsider a model of the form

X(t) = X(0)eµt+σB(t) (2.26)

where B(t) : t ≥ 0 is SBM. Thus, log X(t) − log X(0) : t ≥ 0 ispostulated to be a Brownian motion with drift µ and diffusion σ. In thelanguage of finance, µ is called the growth rate and σ the volatility rate.

The so-called European option allows one to buy the stock at a futuretime t0 for a unit price of K dollars at time 0. If X(t0) < K then one hasthe option of not buying, whereas if X(t0) ≥ K, then one can buy it atK dollars and sell it immediately at the market price X(t0) and realize aprofit of X(t0)−K. Thus the net revenue from this option is

X(t0) =

0 if X(t0) ≤ K

X(t0)−K if X(t0) > K.(2.27)

Page 512: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

504 15. Stochastic Processes

Since the value of money depreciates over time, say at rate r, the netrevenue’s value at time 0 is X(t0)e−t0r. So a fair price for this Europeanoption is

p0 = EX(t0)e−t0r

= E(X(t0)−K)+e−t0r. (2.28)

Here the constants µ, σ, K, t0, r are assumed known. The goal is to com-pute p0. This becomes feasible if one makes the natural assumption of noarbitrage. That is, the discounted value of the stock, i.e., X(t)e−rt, evolvesas a martingale. This is a reasonable assumption as otherwise (if it is advan-tageous) then everybody will want to take advantage of it and start buyingthe stock, thereby driving the price down and making it unprofitable.

Thus, in effect, this assumption says that

X(t)e−rt ≡ X(0)eµt+σB(t)−rt

evolves as a martingale. But recall that if B(·) is an SBM then for any θ

real, eθB(t)− θ22 t evolves as a martingale. Thus, µ, σ, r should satisfy the

condition −σ2

2 = (µ − r). With this assumption, the fair price for thisEuropean option with µ, σ, r, K, t0 given is

p0 = E(e−t0r

(X0 eσB(t0)+µt0 −K

)+)=

1√2πt0

e−t0r

∫X0 eσy+µt0>K

(X0 eσy+µt0 −K

)e− y2

2t0 dy. (2.29)

This is known as the Black-Scholes formula.For more detailed discussions on Brownian motion including the develop-

ment of Ito stochastic integration and diffusion processes via a martingaleformulation, the books of Stroock and Varadhan (1979) and Karlin andTaylor (1975) should be consulted. See also Karatzas and Shreve (1991).

15.3 Problems

15.1 Let Ljj≥0 be as in Section 15.1.1. Show that for any θ ≥ 0

(a) E(e−θ

∑nj=0 Lj

)= E

(n∏

j=0

λyj

θ+λyj

)

(b) E(e−θ

∑∞j=0 Lj

)= 0 for all θ > 0 iff

∞∑j=0

1λyj

= ∞ w.p. 1 assum-

ing 0 < λi < ∞ for all i.

Page 513: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

15.3 Problems 505

15.2 Let L be an exponential random variable. Verify that for any x > 0,u > 0

P (L > x + u | L > x) = P (L > u).

(This is referred to as the “lack of memory” property.)

15.3 Solve the Kolmogorov’s forward and backward equations for the fol-lowing special cases of birth and death processes:

(a) Poisson process: αi ≡ α, βi ≡ 0,(b) Yule process: αi ≡ iα, βi ≡ 0,(c) On-off process: α0 = α, αi = 0, i ≥ 1,

β1 = β, βi = 0, i = 0, 2, . . . ,

(d) M/M/1 queue: αi = α, i ≥ 0, βi = β, i ≥ 1, β0 = 0,(e) M/M/s queue: αi = α, i ≥ 0, βi = iβ, 1 ≤ i ≤ s and = sβ,

i > s,β0 = 0,

(f) Pure death process: βi ≡ β, i ≥ 1, β0 = 0, αi = 0, i ≥ 0.

15.4 Find the stationary distributions when they exist for the processes inProblem 15.3.

15.5 Consider 2 independent M/M/1 queues with arrival rate λ, servicerate µ (Case I), and one M/M/1 queue with arrival rate 2λ and servicerate 2µ (Case II). Assume λ < µ. Show that in the stationary statethe mean number in the system Case I is larger than in Case 2 andtheir ratio approaches 2 as ρ = λ

µ ↑ 1.

15.6 Show that for any finite state space irreducible CTMC X(t) : t ≥ 0with all λi ∈ (0,∞), there is a unique stationary distribution.

15.7 (M/M/∞ queue). This is a birth and death process such that αn ≡α, βn = nβ, n ≥ 0, 0 < α, β < ∞. Show that this process has astationary distribution that is Poisson with mean ρ = λ

µ .

15.8 (a) Let X(t)t≥0 be a Poisson process with rate λ. Let L be anexponential random variable with mean µ−1 and independentof X(t)t≥0. Let N(t) = X(t+L)−X(t). Find the distributionof N(t).

(b) Let Y (t)t≥0 be also a Poisson process with rate µ and indepen-dent of X(t)t≥0 in (a). Let T and T ′ be two successive ‘eventepochs’ for the Y (t)t≥0 process. Let N = X(T ′)−X(T ). Findthe distribution of N .

(c) Let X(t)t≥0 be as in (a). Let τ0 = 0 < τ1 < τ2 < · · · be thesuccessive event epochs of X(t)t≥0. Find the joint distributionof (τ1, τ2, . . . , τn) conditioned on the event N(t) = 1 for some0 < t < ∞.

Page 514: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

506 15. Stochastic Processes

15.9 Let X(t) : t ≥ 0 be a Poisson process with rate λ. Suppose ateach event epoch of the Poisson process an experiment is performedthat results in one of k possible outcomes ai : 1 ≤ i ≤ k withprobability distribution pi : 1 ≤ i ≤ k. Let Xi(t) = outcomes ai in[0, t]. Assume the experiments are iid. Show that Xi(t) : t ≥ 0 areindependent Poisson processes with rate λpi for 1 ≤ i ≤ k.

15.10 Let X(t) : t ≥ 0 be a Poisson process with rate λ, 0 < λ < ∞.Let ξii≥1 be a sequence of iid random variables independent ofX(t) : t ≥ 0 with values in a measurable space (S,S). For eachA ∈ S define

N(A, t) ≡N(t)∑j=1

I(ξj ∈ A), t ≥ 0.

(a) Verify that for each A ∈ S, N(A, t)t≥0 is a Poisson processand find its rate.

(b) Show that if A1, A2 ∈ S, A1 ∩ A2 = S, then the two Poissonprocesses N(Ai, t)t≥0, i = 1, 2 are independent.

(c) Show that for each t > 0, N(A, t) : A ∈ S is a Poissonrandom field on S, i.e., for each A, N(A, t) is Poisson and forA1, A2, . . . , Ak pairwise disjoint elements of S, N(Ai, t)k

1 areindependent.

(d) Show that N(·, t)t≥0 is a process with stationary independentincrements that is Poisson random measure valued.

15.11 Let BN (·, ω) be as in (2.1). Show that BN (·, ω)N≥1 is Cauchy inthe Banach space C[0, 1] with sup norm by completing the followingsteps:

(a) If ξnj(t, ω) = Znj(ω)Snj(t) then

(i) ‖ξnj(·, ω)‖ ≡ sup|ξnj(t, ω)| : 0 ≤ t ≤ 1 = |Znj(ω)|2− (n+1)2

(ii)sup

2n−1∑j=1

|ξnj(t, ω) : 0 ≤ t ≤ 1

= (max|Znj(ω)| : 1 ≤ j ≤ 2n − 1)2− (n+1)2 ,

(b) for any sequence ηii≥1 of random variables with supi E(eηi) <∞, w.p. 1, ηi ≤ 2 log i for all large i,

(c) w.p. 1 there is a C < ∞ such that

max|Znj(ω)| : 1 ≤ j ≤ 2n − 1 ≤ Cn.

Page 515: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

15.3 Problems 507

(d)

∞∑n=1

2n−1∑j=1

‖ξnj(·, ω)‖ < ∞ w.p. 1.

15.12 Show that if B(·) is SBM

P(B(t) = 0 for some t in (t1, t2)

)=

arcsin√

t2t1

.

(Hint: Conditioned on B(t1) = x = 0, the required probability equalsP (T|x| ≤ t2 − t1) = 2

(1 − Φ( |x|√

t2−t1))

and hence the unconditional

probability is E2(1− Φ( |B(t1)|√

t2−t1)).)

15.13 Use the reflection principle to find P (M(t) ≥ x, B(t) ≤ y) for x > ywhere M(t) = maxB(u) : 0 ≤ u ≤ t and B(·) is SBM.

15.14 For a < 0 < b < c find P (Tb < Ta < Tc) where Tx = mint : t >0, B(t) = x where B(·) is SBM.

15.15 Let B0(t) ≡ B(t)−tB(1), 0 ≤ t ≤ 1 (where B(t) : t ≥ 0 is SBM) bethe Brownian bridge. Find the distribution of X(t) ≡ (1+ t)B0

(t

1+t

),

t ≥ 0.

(Hint: X is a Gaussian process. Find its mean and covariance func-tions.)

15.16 Let B(·) be SBM. Let Mn = sup|B(t) − B(n)| : n − 1 ≤ t ≤ n,n = 1, 2, . . ..

(a) Show that Mn

n → 0 w.p. 1 as n →∞.

(Hint: Show Mnn≥1 are iid and EM1 < ∞.)

(b) Using this show that B(t)t → 0 w.p. 1 as t →∞ and give another

proof of the time inversion result 15.2.3.

15.17 Use the exponential martingale to find E(e−λT ) where T = inft :t ≥ 0, B(t) ≥ α + βt, λ > 0, α > 0, β > 0 and B(·) SBM.

15.18 Let Y (t) : −∞ < t < ∞ be the Ornstein-Uhlenbeck process asdefined in 15.2.4. Let f : R → R be Borel measurable and E|f(Z)| <∞ where Z ∼ N(0, 1). Evaluate lim

t→∞1t

∫ t

0 f(Y (u)

)du.

(Hint: Show that Y (·) is a regenerative stochastic process.)

Page 516: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16Limit Theorems for DependentProcesses

16.1 A central limit theorem for martingales

Let Xnn≥1 be a sequence of random variables on (Ω,F , P ), and letFnn≥1 be a filtration, i.e., a sequence of σ-algebras on Ω such thatFn ⊂ Fn+1 ⊂ F for all n ≥ 1. From Chapter 13, recall that Xn,Fnn≥1is called a martingale if Xn is Fn-measurable for each n ≥ 1 and E(Xn+1 |Fn) = Xn for each n ≥ 1.

Given a martingale Xn,Fnn≥1, define

Y1 = X1 − EX1,

Yn = Xn −Xn−1, n ≥ 1.

Note that each Yn is Fn-measurable and

E(Yn | Fn−1) = 0 for all n ≥ 1, (1.1)

where F0 = Ω, ∅.

Definition 16.1.1: Let Ynn≥1 be a collection of random variableson a probability space (Ω,F , P ) and let Fnn≥1 be a filtration. Then,Yn,Fnn≥1 is called a martingale difference array (mda) if Yn is Fn-measurable for each n ≥ 1 and (1.1) holds.

For example, if Ynn≥1 is a sequence of zero mean independent randomvariables, then Yn,Fnn≥1 is a mda w.r.t. the natural filtration Fn =σ〈Y1, . . . , Yn〉, n ≥ 1. Other examples of mda’s can be constructed from the

Page 517: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

510 16. Limit Theorems for Dependent Processes

examples given in Chapter 13. The main result of this section shows thatfor square-integrable mda’s satisfying a Lindeberg-type condition, the CLTholds. For more on limit theorems for mdas, see Hall and Heyde (1980).

Theorem 16.1.1: For each n ≥ 1, let Yni,Fnii≥1 be a mda on (Ω,F , P )with EY 2

ni < ∞ for all i ≥ 1 and let τn be a finite stopping time w.r.t.Fnii≥1. Suppose that for some constant σ2 ∈ (0,∞),

τn∑i=1

E(Y 2

ni

∣∣ Fn,i−1

)−→p σ2 as n →∞ (1.2)

and that for each ε > 0,

∆n(ε) ≡τn∑i=1

E(Y 2

niI(|Yni| > ε)∣∣ Fn,i−1

)−→p 0 as n →∞. (1.3)

Then,τn∑i=1

Yni −→d N(0, σ2). (1.4)

Proof: First the theorem will be proved under the additional conditionthat

τn = mn for all n ≥ 1 for some nonrandom sequence ofpositive integers mnn≥1 (1.5)

and that for some c ∈ (0,∞),

mn∑i=1

E(Y 2

ni

∣∣ Fn,i−1

)≤ c w.p. 1. (1.6)

Let σ2ni = E

(Y 2

ni | Fn,i−1), i ≥ 1, n ≥ 1. Also, write m for mn to ease the

notation. Since σ2ni is Fn,i−1-measurable, for any t ∈ R,∣∣∣∣E exp

(ιt

m∑j=1

Ynj

)− exp

(− σ2t2/2

)∣∣∣∣≤

∣∣∣∣E exp(

ιtm∑

j=1

Ynj

)− E exp

(ιt

m−1∑j=1

Ynj

)exp

(− t2σ2

nm/2)∣∣∣∣

+ · · ·+∣∣∣∣E exp

(ιtYn1

)exp

(−

m∑j=2

t2σ2nj/2

)

−[

exp(−

m∑j=1

t2σ2nj/2

)]∣∣∣∣

Page 518: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.1 A central limit theorem for martingales 511

+∣∣∣∣E exp

(−

m∑j=1

t2σ2nj/2

)− exp(−t2σ2/2)

∣∣∣∣≤

m∑k=1

E∣∣∣E exp(ιtYnk)

∣∣ Fn,k−1

− exp(−t2σ2

nk/2)∣∣∣

+∣∣∣∣E exp

(−

m∑j=1

t2σ2nj/2

)− exp(−t2σ2/2)

∣∣∣∣≡ I1n + I2n, say. (1.7)

By (1.2), (1.5), and the BCT,

I2n → 0 as n →∞. (1.8)

To estimate I1n, note that for any 1 ≤ k ≤ n,

E

exp(ιtYnk)∣∣ Fn,k−1

= 1 + ιtE

(Ynk | Fn,k−1

)+

(ιt)2

2E(Y 2

nk | Fn,k−1)

+ θnk(t)

= 1− t2

2σ2

nk + θnk(t) (1.9)

and

exp(− t2σ2

nk/2)

= 1− t2

2σ2

nk + γnk, say. (1.10)

It is easy to verify that

|θnk| ≤ E

[min

(tYnk)2,

|tYnk|36

∣∣∣ Fn,k−1

]

and|γnk| ≤

(t2σ2

nk

)2 exp(t2σ2

nk/2)/8.

Hence, by (1.3), (1.6), (1.9), and (1.10), for any ε in (0,1),

I1n ≤m∑

k=1

E|θnk|+ |γnk|

≤ t2m∑

k=1

E∣∣∣EY 2

nkI(|Ynk| > ε)∣∣ Fn,k−1

∣∣∣+ |t|3ε ·

m∑k=1

E∣∣∣E(Y 2

nk | Fn,k−1)∣∣∣+ m∑

k=1

E

t4σ4nk exp(t2c/2)

≤ t2E ∆n(ε) + |t|3 · ε · E[ m∑

k=1

σ2nk

]

+ t4 exp(t2c/2)E

[( m∑k=1

σ2nk

)max

l≤k≤mσ2

nk

].

Page 519: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

512 16. Limit Theorems for Dependent Processes

Note that for any ε > 0,

E max1≤k≤m

σ2nk ≤ ε2 + E

[max

1≤k≤mE

Y 2nkI(|Ynk| > ε

) ∣∣ Fn,k−1

]≤ ε2 + E∆n(ε).

Hence, by (1.3), (1.6), and the BCT, for any ε ∈ (0, 1),

lim supn→∞

I1n ≤ lim supn→∞

[t2 E∆n(ε) + |t|3ε · c

+ t4cet2c/2

ε2 + E∆n(ε)]

≤ c1(t) ε

for some c1(t) ∈ (0,∞), not depending on ε. Thus implies that

limn→∞ I1n = 0. (1.11)

Clearly (1.7), (1.8), and (1.11) yield (1.4), whenever (1.5) and (1.6) aretrue.

Next, suppose that condition (1.6) is not assumed a priori but (1.5)holds true. Fix c > σ2 and define the sets Bnk =

∑ki=1 σ2

ni ≤ c, and the

variables Ynk = YnkIBnk, k ≥ 1, n ≥ 1. Note that Bnk ∈ Fn,k−1 and hence,

E(Ynk | Fn,k−1

)= IBnk

E(Ynk | Fn,k−1

)= 0,

andσ2

nk ≡ E(Y 2

nk | Fn,k−1)

= IBnkσ2

nk, (1.12)

for all k ≥ 1. In particular,Ynk,Fn,k

is a mda. Since Bn,k−1 ⊃ Bnk for

all k, by the definitions of the sets Bnk’s, it follows that

m∑k=1

σ2nk =

m∑k=1

σ2nk IBnk

=m∑

k=1

σ2nk IBnm +

m−1∑k=1

σ2nk

(IBn,m−1 − IBnm

)+ · · ·+ σ2

n1(IBn1 − IBn2

)≤ cIBnm

+ c(IBn,m−1 − IBnm

)+ · · ·+

(cIBn1 − IBn2

)≤ c. (1.13)

Thus, the mdaYnk,Fnk

k≥1 satisfies (1.6). Next note that by (1.2) and

(1.5),P(Bc

nm

)→ 0 as n →∞. (1.14)

Page 520: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.2 Mixing sequences 513

Also, by (1.12),∑m

k=1 σ2nk =

∑mk=1 σ2

nk on Bnm. Hence, it follows that

m∑k=1

σ2nk −→p σ2,

i.e., the mdaYnk,Fnk

satisfies (1.2). Further, the inequality “|Ynk| ≤

|Ynk|” and the fact that the function h(x) = x2I(|x| > ε), x > 0 is nonde-creasing jointly imply that (1.3) holds for

Ynk,Fnk

. Hence, by the case

already proved,m∑

k=1

Ynk −→d N(0, σ2). (1.15)

But∑m

k=1 Ynk =∑m

k=1 Ynk on Bnm. Hence, by (1.14),

m∑k=1

Ynk −→d N(0, σ2), (1.16)

and therefore, the CLT holds without the restriction in (1.6).Next consider relaxing the restrictions in (1.5) (and (1.6)). Since P (τn <

∞) = 1, there exist positive integers mn such (Problem 16.2) that

P(τn > mn

)→ 0 as n →∞. (1.17)

Next defineYnk = Ynk I(τn ≥ k), k ≥ 1, n ≥ 1. (1.18)

It is easy to check (Problem 16.3) that Ynk,Fnk is a mda, and thatYnk,Fnk satisfies (1.2) and (1.3) with τn replaced by mn (Problem 16.4).Hence, by the previous case already proved,

mn∑k=1

Ynk −→d N(0, σ2).

Next note that (cf. (4.1), Proclem 16.4),

τn∑k=1

Ynk −mn∑k=1

Ynk −→p 0 as n →∞. (1.19)

Hence, (1.4) holds and the proof of the theorem is complete.

16.2 Mixing sequences

This section deals with a class of dependent processes, called the mixingprocesses, where the degree of dependence decreases as the distance (in

Page 521: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

514 16. Limit Theorems for Dependent Processes

time) between two given sets of random variables goes to infinity. The‘degree of dependence’ is measured by various mixing coefficients, whichare defined in Section 16.2.1 below. Some basic properties of the mixingcoefficients are presented in Section 16.2.2. Limit theorems for sums ofmixing random variables are given in Section 16.2.3.

16.2.1 Mixing coefficientsDefinition 16.2.1: Let (Ω,F , P ) be a probability space and G1, G2 besub-σ-algebras of F .

(a) The α-mixing or strong mixing coefficient between G1 and G2 is de-fined as

α(G1,G2) ≡ sup∣∣P (A ∩B)− P (A)P (B)

∣∣ : A ∈ G1, B ∈ G2. (2.1)

(b) The β-mixing coefficient or the coefficient of absolute regularity be-tween G1 and G2 is defined as

β(G1,G2) ≡12

supk∑

i=1

∑j=1

∣∣P (Ai ∩Bj)− P (Ai)P (Bj)∣∣, (2.2)

where the supremum is taken over all finite partitions A1, . . . , Akand B1, . . . , B of Ω by sets Ai ∈ G1 and Bj ∈ G2, 1 ≤ i ≤ k,1 ≤ j ≤ , , k ∈ N.

(c) The ρ-mixing coefficient or the coefficient of maximal correlation be-tween G1 and G2 is defined as

ρ(G1,G2) ≡ supρX1,X2 : Xi ∈ L2(Ω,Gi, P ), i = 1, 2 (2.3)

where ρX1,X2 ≡Cov(X1,X2)√

Var(X1)Var(X2)is the correlation coefficient of X1

and X2.

It is easy to check (Problem 16.5 (a) and (d)) that all three mixing coeffi-cients take values in the interval [0, 1] and that ρ(G1,G2) = sup

|EX1X2| :

Xi ∈ L2(Ω,Gi, P )EXi = 0, EX21 = 1, i = 1, 2

. When the σ-algebras G1

and G2 are independent, these coefficients equal zero, and vice versa. Thus,nonzero values of the mixing coefficients give various measures of the degreeof dependence between G1 and G2. It is easy to check (Problem 16.5 (c))that

α(G1,G2) ≤ β(G1,G2) and α(G1,G2) ≤ ρ(G1,G2). (2.4)

However, no ordering between the β(G1,G2) and ρ(G1,G2) exists, in general(Problem 16.6). There are two other mixing coefficients that are also oftenused in the literature. These are given by the φ-mixing coefficient

φ(G1,G2) ≡ sup∣∣P (A)− P (A | B)

∣∣ : B ∈ G1, P (B) > 0, A ∈ G2, (2.5)

Page 522: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.2 Mixing sequences 515

and the Ψ-mixing coefficient

Ψ(G1,G2) ≡ supA∈G∗

1 ,B∈G∗2

|P (A ∩B)− P (A)P (B)|P (A)P (B)

, (2.6)

where P (A | B) = P (A∩B)/P (B) for P (B) > 0, and where G∗i = A : A ∈

Gi, P (A) > 0, i = 1, 2. It is easy to check that Ψ(G1,G2) ≥ φ(G1,G2) ≥β(G1,G2).

Definition 16.2.2: Let Xii∈Z be a (doubly-infinite) sequence of randomvariables on a probability space (Ω,F , P ). Then, the strong- or α-mixingcoefficient of Xii∈Z, denoted by αX(·), is defined by

αX(n) ≡ supi∈Z

α(σ〈Xj : j ≤ i, j ∈ Z

〉, σ〈

Xj : j ≥ i+n, j ∈ Z

〉), n ≥ 1,

(2.7)where the α(·, ·) on the right side of (2.7) is as defined in (2.1). The processXii∈Z is called strongly mixing or α-mixing if

limn→∞ αX(n) = 0. (2.8)

The other mixing coefficients of Xii∈Z (e.g., βX(·), ρX(·), etc.) are definedsimilarly.

For a one-sided sequence Xii≥1, the α-mixing coefficient Xii≥1 isdefined by replacing Z on the right side of (2.7) by N on all three occur-rences. A similar modification is needed for the other mixing coefficients.When there is no chance of confusion, the coefficients αX(·), βX(·), . . . , etc.,will be written as α(·), β(·), . . . , etc., to ease the notation.

Another important notion of ‘weak’ dependence is given by the following:

Definition 16.2.3: Let m ∈ Z+ be an integer and Xii∈Z be a collectionof random variables on (Ω,F , P ). Then, Xii∈Z is called m-dependent iffor every k ∈ Z, Xi : i ≤ k, i ∈ Z and Xi : i > k + m, i ∈ Z areindependent.

Example 16.2.1: If εii∈Z is a collection of independent random vari-ables and Xi = (εi + εi+1), i ∈ Z, then εii∈Z is 0-dependent and Xii∈Z

is 1-dependent. It is easy to see that if Xii∈Z is m-dependent for somem ∈ Z+, then α∗

X(n) = 0 for all n > m, where α∗X ∈ αX , βX , ρX , φX ,ΨX.

Therefore, m-dependence of Xii∈Z implies that the process Xii∈Z isα∗

X -mixing. In this sense, the condition of m-dependence is the strongestand the condition of α-mixing is the weakest among all weak dependenceconditions introduced here.

Example 16.2.2: Let εii∈Z be a collection of iid random variables withEε1 = 0, Eε21 < ∞ and let

Xn =∑i∈Z

aiεn−i, n ∈ Z (2.9)

Page 523: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

516 16. Limit Theorems for Dependent Processes

where ai ∈ R and ai = 0(exp(−ci)

)as i → ∞, c ∈ (0,∞). If ε1 has an

integrable characteristic function, then Xii∈Z is strongly mixing (Chanda(1974), Gorodetskii (1977), Withers (1981), Athreya and Pantula (1986)).

Example 16.2.3: Let Xii∈Z be a zero mean stationary Gaussian pro-cess. Suppose that Xii∈Z has spectral density f : (−π, π) → [0,∞), i.e.,

EX0Xk =∫ π

−π

eιkxf(x)dx, k ∈ Z. (2.10)

Then, αX(n) ≤ ρX(n) ≤ 2πα(n), n ≥ 1 and, therefore, Xii∈Z is α-mixingiff it is ρ-mixing (Ibragimov and Rozanov (1978), Chapter 4). Further,Xii∈Z is α-mixing iff the spectral density f admits the representation

f(t) =∣∣p(eιt)

∣∣2 exp(u(eιt) + v(eιt)

), (2.11)

where p(·) is a polynomial, u and v are continuous real-valued functions onthe unit circle in the complex plane, and v is the conjugate function of u.It is also known that if the Gaussian process Xii∈Z is φ-mixing, then it isnecessarily m-dependent for some m ∈ Z+. Thus, for Gaussian processes,the condition of α-mixing is as strong as ρ-mixing and the conditions of φ-mixing and Ψ-mixing are equivalent to m-dependence. See Ibragimov andRozanov (1978) for more details.

16.2.2 Coupling and covariance inequalitiesThe mixing coefficients can be seen as measures of deviations from indepen-dence. The idea of coupling is to construct independent copies of a givenpair of random vectors on a suitable probability space such that the Eu-clidean distance between these copies admits a bound in terms of the mix-ing coefficient between the (σ-algebras generated by the) random vectors.Thus, coupling gives a geometric interpretation of the mixing coefficients.The first result is for β-mixing random vectors.

Theorem 16.2.1: (Berbee’s theorem). Let (X, Y ) be a random vector ona probability space (Ω0,F0, P0) such that X takes values in Rd and Y inRs, d, s ∈ N. Then, there exist an enlarged probability space (Ω,F , P ) anda s-dimensional random vector Y ∗ such that

(i) (X, Y , Y ∗) are defined on (Ω,F , P ),

(ii) Y ∗ is independent of X under P and (X, Y ) have the same distribu-tion under P and P0,

(iii) P (Y = Y ∗) = β(σ〈X〉, σ〈Y 〉).

Page 524: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.2 Mixing sequences 517

Proof: See Corollary 4.2.5 of Berbee (1979).

A weaker version of the above result is available for α-mixing randomvariables, where the difference between Y and its independent copy admitsa bound in terms of the α-mixing coefficient.

Theorem 16.2.2: (Bradley’s theorem). In Theorem 16.2.1, assume s =1 and 0 < E|Y |γ < ∞ for some 0 < γ < ∞. Then, for all 0 < y ≤(E|Y |γ)1/γ ,

P(|Y − Y ∗| ≥ y

)≤ 18

[α(σ〈X〉, σ〈Y 〉

)]2γ/1+2γ

(E|Y |γ)1/1+2γy−γ/(1+2γ). (2.12)

Proof: See Theorem 3 of Bradley (1983).

Next, some bounds on the covariance between mixing random variablesare established. These will be useful for deriving limit theorems for sumsof mixing random variables. For a random variable X, define the function

QX(u) = inft : P (|X| > t) ≤ u

, u ∈ (0, 1). (2.13)

Thus, QX(u) is the quantile function of |X| at (1− u).

Theorem 16.2.3: (Rio’s inequality). Let X and Y be two random vari-ables with

∫ 10 QX(u)QY (u)du < ∞. Then,

∣∣Cov(X, Y )∣∣ ≤ 2

∫ 2α

0QX(u)QY (u)du (2.14)

where α = α(σ〈X〉, σ〈Y 〉

).

Proof: By Tonelli’s theorem,

EX+Y + = E

[(∫ X+

0du)(∫ Y +

0du)]

= E

(∫ ∞

0

∫ ∞

0I(X+ > u)I(Y + > v)dudv

)

=∫ ∞

0

∫ ∞

0P (X > u, Y > v)dudv

and similarly, EX+ =∫∞0 P (X > u)du. Hence, by (2.1), it follows that∣∣Cov(X+, Y +)

∣∣=

∣∣∣∣∫ ∞

0

∫ ∞

0

[P (X > u, Y > v)− P (X > u)P (Y > v)

]dudv

∣∣∣∣≤

∫ ∞

0

∫ ∞

0min

α, P (X > u), P (Y > v)

dudv. (2.15)

Page 525: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

518 16. Limit Theorems for Dependent Processes

Next note that for any real numbers a, b, c, d

(α ∧ a ∧ c) + (α ∧ a ∧ d) + (α ∧ b ∧ c) + (α ∧ b ∧ d)≤ [2(α ∧ a)] ∧ (c + d) + [2(α ∧ b)] ∧ (c + d)≤ 2[2α ∧ (a + b) ∧ (c + d)]. (2.16)

Now using (2.15), (2.16), and the identity Cov(X, Y ) = Cov(X+, Y +) +Cov(X−, Y −)− Cov(X+, Y −)− Cov(X−, Y +), one gets∣∣Cov(X, Y )

∣∣≤ 2

∫ ∞

0

∫ ∞

0min

2α, P (|X| > u), P (|Y | > v)

dudv. (2.17)

Hence, it is enough to show that the right sides of (2.14) and (2.17) agree. Tothat end, let U be a Uniform (0,1) random variable and define (W1, W2) =(0, 0)I(U ≥ 2α) +

(QX(U), QY (U)

)I(U < 2α). Then

EW1W2 =∫ 2α

0QX(u)QY (u)du. (2.18)

On the other hand, noting that QX(a) > t iff P (|X| > t) > a, one has

EW1W2 =∫ ∞

0

∫ ∞

0P(W1 > u, W2 > v

)dudv

=∫ ∞

0

∫ ∞

0P(U < 2α, QX(U) > u, QY (U) > v

)dudv

=∫ ∞

0

∫ ∞

0min

2α, P (|X| > u), P (|Y | > v)

dudv.

Hence, the theorem follows from (2.17), (2.18), and the above identity.

Corollary 16.2.4: Let X and Y be two random variables withα(σ〈X〉, σ〈Y 〉) = α ∈ [0, 1].

(i) (Davydov’s inequality). Suppose that E|X|p < ∞, E|Y |q < ∞ forsome p, q ∈ (1,∞) with 1

p + 1q < 1. Then, E|XY | < ∞ and

∣∣Cov(X, Y )∣∣ ≤ 2r(2α)1/r

(E|X|p

)1/p(E|Y |q

)1/q, (2.19)

where 1r = 1−

( 1p + 1

q

).

(ii) If P(|X| ≤ c1) = 1 = P (|Y | ≤ c2) for some constants c1, c2 ∈ (0,∞),

then ∣∣Cov(X, Y )∣∣ ≤ 4c1c2α. (2.20)

Page 526: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.3 Central limit theorems for mixing sequences 519

Proof: Let a = (E|X|p)1/p and b = (E|Y |q)1/q. W.l.o.g., suppose thata, b ∈ (0,∞). Then, by Markov’s inequality, for any 0 < u < 1,

P(|X| > au−1/p

)≤ E|X|p

/(au−1/p)p = u

and hence, QX(u) ≤ au−1/p. Similarly, QY (u) ≤ bu−1/q, 0 < u < 1. Hence,by Theorem 16.2.3,

∣∣Cov(X, Y )∣∣ ≤ 2

∫ 2α

0ab u−1/p−1/qdu

= 2ab(2α)1−1/p−1/q/(

1− 1p− 1

q

).

which is equivalent to (2.19).The proof of (2.20) is a direct consequence of Rio’s inequality and the

bounds QX(u) ≤ c1 and QY (u) ≤ c2 for all 0 < u < 1.

16.3 Central limit theorems for mixing sequences

In this section, CLTs for sequences of random variables satisfying differentmixing conditions are proved.

Proposition 16.3.1: Let Xii∈Z be a collection of random variables withstrong mixing coefficient α(·).

(i) Suppose that∑∞

n=1 α(n) < ∞ and for some c ∈ (0,∞), P (|Xi| ≤c) = 1 for all i. Then,

∞∑n=1

Cov(X1, Xn+1) converges absolutely. (3.1)

(ii) Suppose that∑∞

n=1 α(n)δ/2+δ < ∞ and supi∈Z E|Xi|2+δ < ∞ forsome δ ∈ (0,∞). Then, (3.1) holds.

Proof: A direct consequence of Corollary 16.2.4.

Next suppose that the collection of random variables Xii∈Z is station-ary and that Var(X1) +

∑∞n=1

∣∣Cov(X1, X1+n)∣∣ < ∞. Then by the DCT,

nVar(Xn) = n−1Var( n∑

i=1

Xi

)

= n−1[ n∑

i=1

Var(Xi) + 2∑

1≤i<j≤n

Cov(Xi, Xj)]

Page 527: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

520 16. Limit Theorems for Dependent Processes

= n−1[n Var(X1) + 2

n−1∑i=1

n−i∑k=1

Cov(Xi, Xi+k)]

= n−1[n Var(X1) + 2

n−1∑k=1

(n− k)Cov(X1, X1+k)]

−→ σ2∞ ≡ Var(X1) + 2

∞∑k=1

Cov(X1, X1+k) as n →∞. (3.2)

In particular, under the conditions of part (i) or part (ii) of Proposition16.3.1,

limn→∞ Var(

√n Xn) exists and equals σ2

∞.

In general, it is not guaranteed that σ2∞ > 0. Indeed, it is not difficult

to construct an example of a stationary strong mixing sequence Xnn≥1such that σ2

∞ = 0 (Problem 16.8). However, in addition to the conditions ofProposition 16.3.1, if it is assumed that σ2

∞ > 0, then a CLT for√

n(Xn −EX1) holds in the stationary case; see Corollary 16.3.3 and 16.3.6 below.

A classical method of proving the CLT (and other limit theorems) formixing random variables is based on the idea of blocking, introduced byS. N. Bernstein. Intuitively, the ‘blocking’ approach can be described asfollows: Suppose, µ = EX1 = 0. First, write the sum

∑ni=1 Xi in terms of

alternating sums of ‘big blocks’ Bi’s (of length ‘p’ say) and ‘little blocks’Li’s (of length ‘q’ say) as

n∑i=1

Xi =(X1 + · · ·+ Xp

)+(Xp+1 + · · ·+ Xp+q

)+(Xp+q+1 + · · ·+ X2p+q

)+ · · ·

= B1 + L1 + B2 + L2 + · · ·+ (BK + LK) + Rn,

where the last term Rn is the excess (if any) over the last complete pair ofbig- and little-blocks (BK , LK). Next, group together the Bi’s and Li’s towrite

1√n

n∑i=1

Xi =1√n

K∑j=1

Bj +1√n

K∑j=1

Lj + Rn/√

n. (3.3)

If q p n, then, the number of Xi’s in∑K

j=1 Lj and in Rn are of smallerorder than n, the total number of Xi’s. Using this, one can show that thecontribution of the last two terms in (3.3) to the limit is negligible, i.e.,

1√n

( K∑j=1

Lj + Rn

)−→p 0.

To handle the first term, 1√n

∑Kj=1 Bi, note that the Bj ’s are functions of

disjoint collections of Xj ’s that are separated by a distance of q or more.

Page 528: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.3 Central limit theorems for mixing sequences 521

By letting q →∞ suitably and using the mixing condition, one can replacethe Bj ’s by their independent copies, and appeal to the Lindeberg CLT forsums of independent random variables to conclude that

1√n

K∑j=1

Bj −→d N(0, σ2

∞).

Although the blocking approach is described here for stationary randomvariables, with minor modifications, it is applicable to certain nonstationarysequences as shown below.

Theorem 16.3.2: Let Xnn≥1 be a sequence of random variables (notnecessarily stationary) with strong mixing coefficient α(·). Suppose thatthere exist constants σ2

0, c ∈ (0,∞) such that

P (|Xi| ≤ c) = 1 for all i ∈ N, (3.4)

γn ≡ supj≥1

∣∣∣∣n−1 Var( j+n−1∑

i=j

Xi

)− σ2

0

∣∣∣∣→ 0 as n →∞, (3.5)

and that ∞∑n=1

α(n) < ∞. (3.6)

Then, √n(Xn − µn

)−→d N(0, σ2

0) as n →∞ (3.7)

where µn = EXn, and Xn = n−1∑ni=1 Xi, n ≥ 1.

An important special case of Theorem 16.3.2 is the following:

Corollary 16.3.3: If Xnn≥1 is a sequence of stationary bounded randomvariables with

∑∞n=1 α(n) < ∞, and if σ2

∞ of (3.2) is positive, then, withµ = EX1, √

n(Xn − µ

)−→d N(0, σ2

∞) as n →∞. (3.8)

Proof: For stationary random variables, (3.5) holds with σ20 = σ2

∞ (cf.(3.2)). Hence, the Corollary follows from Theorem 16.3.2.

For proving the theorem, the following auxiliary result will be used.

Lemma 16.3.4: Suppose that the conditions of Theorem 16.3.2 hold.Then,

supm≥1

E

[m+n−1∑i=m

(Xi − EXi)]4

= o(n3) as n →∞.

Page 529: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

522 16. Limit Theorems for Dependent Processes

Proof: W.l.o.g., let EXi = 0 for all i. Note that for any m ∈ N,

E

( n+m−1∑j=m

Xj

)4

=∑

j

EX4j + 6

∑i<j

EX2i X2

j + 4∑i =j

EX3i Xj

+ 6∑

i =j =k

EX2i XjXk +

∑i =j =k =

EXiXjXkX

≡ I1n + · · ·+ I5n, say, (3.9)

where the indices i, j, k, in the above sums lie between m and m + n− 1.By (3.4),∣∣I1n

∣∣+ ∣∣I2n

∣∣+ ∣∣I3n

∣∣ ≤ n · c4 + 7n(n− 1)c4 ≤ 7n2c4. (3.10)

By Corollary 16.2.4 (ii), noting that EXi = 0 for all i,∣∣I4n

∣∣ ≤ 12∑

i<j<k

∣∣EX2i XjXk

∣∣+ ∣∣EXiX2j Xk

∣∣+ ∣∣EXiXjX2k

∣∣

≤ 12∑

i<j<k

4c4α(j − k) + 4c4α(j − k) + 4c4α(j − i)

≤ 144c4n2[ n−1∑

r=1

α(r)]. (3.11)

Similarly, ∣∣I5n

∣∣ ≤ (4!)∑

i<j<k<

∣∣EXiXjXkX

∣∣≤ 24

∑i<j<k<

4c4[α(j − i) ∧ α(− k)]

= 96c4n−3∑i=1

n−2∑j=i+1

n−1∑k=j+1

n−k∑r=1

[α(j − i) ∧ α(r)

]

= 96c4n−3∑i=1

n−2−i∑s=1

n−1∑k=i+s+1

n−k∑r=1

[α(s) ∧ α(r)

]

≤ 96c4n2n∑

s=1

n∑r=1

[α(s) ∧ α(r)

]

= 96c4n2[ n∑

s=1

α(s) + 2n−1∑s=1

n∑r=s+1

α(r)]

≤ 192c4n2n∑

r=1

rα(r)

Page 530: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.3 Central limit theorems for mixing sequences 523

≤ 192c4n2[n1/2

∞∑r=1

α(r) + n∑

r≥√n

α(r)]. (3.12)

Since∑

r≥√n α(r) = o(1) and the bounds in (3.9)–(3.12) do not depend on

m, Lemma 16.3.4 follows.

Proof of Theorem 16.3.2: W.l.o.g., let EXj = 0 for all j. Let p ≡ pn,q ≡ qn = n1/2, n ≥ 1 be integers satisfying

q/p + p/n = o(1) as n →∞, (3.13)

when a choice of pn will be specified later. Let r = p+ q and K = n/r.As outlined earlier, for i = 1, . . . , K, let

Bi =(i−1)r+p∑

j=(i−1)r+1

Xj ,

Li =ir∑

j=(i−1)r+p+1

Xj ,

and let

Rn =n∑

j=1

Xj −K∑

i=1

(Bi + Li)

respectively denote the sums over the ‘big blocks,’ the ‘little blocks,’ andthe ‘excess’ (cf. (3.3)). Thus

√nXn =

1√n

K∑i=1

Bi +1√n

K∑i=1

Li +1√n

Rn. (3.14)

Since Rn is a sum over (n − Kr) ≤ r consecutive Xj ’s, by Corollary16.2.4 (ii),

E(Rn/

√n)2 ≤ n−1

[ n∑j=Kr+1

EX2j +

∑i =j

∣∣EXiXj

∣∣]

≤ 8c2n−1[(n−Kr) + (n−Kr)

∞∑k=1

α(k)]

= O( r

n

)→ 0 as n →∞. (3.15)

Next note that the variables Li and Li+k depend on two disjoint sets ofXj ’s that are separated by a distance of

[(i+k−1)r+p−ir

]= (k−1)r+p >

Page 531: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

524 16. Limit Theorems for Dependent Processes

kp. Hence, by (3.5), Corollary 16.2.4 (ii) and the monotonicity of α(·),

E

( K∑i=1

Li/√

n

)2

≤ n−1[ K∑

i=1

EL2i + 2

K−1∑i=1

K−i∑k=1

∣∣ELiLi+k

∣∣]

≤ n−1[Kq(σ2

0 + γq

)+ 2K

K−1∑k=1

4q2c2α(kp)]

≤ n−1Kq

[(σ2

0 + γq

)+ 8qc2

K−1∑k=1

p∑j=1

α(kp− j)/

p

]

= O

(n−1Kq + n−1Kq2p−1

∞∑j=0

α(j))

= O(q

p+(q

p

)2)→ 0 as n →∞, (3.16)

as nKp → 1 as n →∞.

Next consider the term 1√n

∑Ki=1 Bi. Note that α

(σ〈Bj〉, σ〈Bi : i ≥

j + 1〉)≤ α(q) for all 1 ≤ j < K. By applying Corollary 16.2.4 (ii) to the

real and imaginary parts, one gets for any t ∈ R,∣∣∣∣E exp(

ιtK∑

j=1

Bj/√

n

)−

K∏j=1

E exp(ιtBj/

√n)∣∣∣∣

≤∣∣∣∣E exp

(ιt

K∑j=1

Bj/√

n

)− E exp

(ιtB1/

√n)E exp

(ιt

K∑j=2

Bj/√

n

)∣∣∣∣+ · · ·+

∣∣∣∣K−2∏j=1

E exp(ιtBj/

√n)

E exp(ιt(BK−1 + BK)/

√n)

−K∏

j=1

E exp(ιtBj/

√n)∣∣∣∣

≤ 16 α(q) ·K

= O(qα(q)

n

pq

)→ 0 as n →∞. (3.17)

The last step follows by noting that∑∞

n=1 α(n) < ∞⇒ nα(n) → 0 as n →∞. Let B1, . . . , BK be independent random variables such that Bi =d Bi,1 ≤ i ≤ K. Note that by (3.5),∣∣∣∣ 1n

K∑i=1

EB2i − σ2

0

∣∣∣∣

Page 532: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.3 Central limit theorems for mixing sequences 525

≤ 1n

K∑i=1

∣∣EB2i − pσ2

0

∣∣+ ∣∣n−1Kp− 1∣∣σ2

0

≤ Kp

nγp + n−1|Kp− n|σ2

0 → 0 as n →∞. (3.18)

Next, for k ∈ N, let Γ1(k) = supE(∑m+k−1

i=m Xi

)4k−3 : m ≥ 1

and

Γ∗1(k) = supΓ1(j) : j ≥ k. By Lemma 16.3.4, Γ∗

1(k) ↓ 0 as k → ∞. Nowset p =

⌊n1/2

(Γ∗

1(q))−1/3 ∧ log n

⌋. Then, it is easy to check that (3.13)

holds and that n−1p2Γ∗1(p) ≤ n−1

(n1/2Γ∗

1(q)−1/3

)2Γ∗1(q) ≤ Γ∗

1(q)1/3 → 0

and n →∞. Hence,

K∑i=1

E(Bi/

√n)4 = n−2

K∑i=1

EB4i

≤ n−2Kp3Γ∗1(p) → 0 as n →∞. (3.19)

Using (3.18) and (3.19), it is easy to check that the triangular arrayof independent random variables

B1/

√n, . . . , BK/

√n

n≥1 satisfies theLyapounov’s condition and hence,

1√n

K∑i=1

Bi −→d N(0, σ20) as n →∞.

By (3.17), this implies that

1√n

K∑i=1

Bi −→d N(0, σ20). (3.20)

Theorem 16.3.2 now follows from (3.14), (3.15), (3.16), and (3.20).

For unbounded strongly mixing random variables, a CLT can be provedunder moment and mixing conditions similar to Proposition 16.3.1 (ii).The key idea here is to use Theorem 16.3.2 for the sum of a truncated partof the unbounded random variables and then show that the sum over theremaining part is negligible in the limit.

Theorem 16.3.5: Let Xnn≥1 be a sequence of random variables(not necessarily stationary) such that for some δ ∈ (0,∞), ζδ ≡supn≥1

(E|Xn|2+δ

)1/2+δ< ∞ and

∞∑n=1

α(n)δ/2+δ < ∞. (3.21)

Page 533: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

526 16. Limit Theorems for Dependent Processes

Suppose that there exist M0 ∈ (0,∞) and a function τ(·) : (M0,∞) →(0,∞) such that for all M > M0,

γn(M) ≡ supj≥1

∣∣∣∣n−1Var( j+n−1∑

i=j

XiI(|Xi| < M))−τ(M)

∣∣∣∣→ 0 as n →∞.

(3.22)If τ(M) → σ2

0 as M →∞ and σ20 ∈ (0,∞), then (3.7) holds.

Proof of Theorem 16.3.5: W.l.o.g., let EXi = 0 for all i ∈ N. LetM ∈ (1,∞). For i ≥ 1, define

Yi,M = XiI(|Xi| ≤ M) and Zi,M = XiI(|Xi| > M),

Yi,M = Yi,M − EYi,M , Zi,M = Zi,M − EZi,M .

Then,

√n(Xn − µ) =

1√n

n∑i=1

Yi,M +1√n

n∑i=1

Zi,M ,

≡ Sn,M + Tn,M , say. (3.23)

Note that, with a = M δ/4 and Jn = 1, . . . , n, by Cauchy-Schwarzinequality and Corollary 16.2.4 (i),

ET 2n,M

= n−1∑

|i−j|≤a,i,j∈Jn

∣∣EZi,MZj,M

∣∣+ n−1∑

|i−j|>a,i,j∈Jn

∣∣EZi,MZj,M

∣∣

≤ (2a + 1) supEZ2

i,M : i ≥ 1

+ 2∞∑

k=a

4 + 2δδ

α(k)δ/2+δζ2δ

≤ 4(2a + 1)M−δζ2+δδ +

8 + 4δδ

ζ2δ

∞∑k=a

α(k)δ/2+δ.

Hence,lim

M↑∞supn≥1

ET 2n,M = 0. (3.24)

Since τ(M) → σ20 ∈ (0,∞), there exists M1 ∈ (M0,∞) such that for all

M > M1, τ(M) ∈ (0,∞). Hence, by Theorem 16.3.2 and (3.22), for allM > M1,

Sn,M −→d N(0, τ(M)

)as n →∞. (3.25)

Next note that for any ε ∈ (0,∞) and x ∈ R,

P(√

n Xn ≤ x)

≤ P(Sn,M + Tn,M ≤ x,

∣∣Tn,M

∣∣ < ε)

+ P(∣∣Tn,M

∣∣ > ε)

≤ P(Sn,M ≤ x + ε

)+ P

(∣∣Tn,M

∣∣ > ε)

(3.26)

Page 534: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.3 Central limit theorems for mixing sequences 527

and similarly,

P(√

n Xn ≤ x)≥ P

(Sn,M ≤ x− ε

)− P

(∣∣Tn,M

∣∣ > ε). (3.27)

Hence, for any ε > 0, M > M1, and x ∈ R, letting n → ∞ in (3.26) and(3.27), by (3.25) one gets

Φ( x− ε

τ(M)1/2

)− lim sup

n→∞P(∣∣Tn,M

∣∣ > ε)

≤ lim infn→∞ P

(√n Xn ≤ x

)≤ lim sup

n→∞P(√

n Xn ≤ x)

≤ Φ( x + ε

τ(M)1/2

)+ lim sup

n→∞P(∣∣Tn,M

∣∣ > ε).

Since τ(M) → σ20 as M → ∞, letting M ↑ ∞ first and then ε ↓ 0, and

using (3.24), one gets

limn→∞ P

(√n Xn ≤ x

)= Φ(x/σ0) for all x ∈ R.

This completes the proof of Theorem 16.3.5.

A direct consequence of Theorem 16.3.5 is the following result:

Corollary 16.3.6: Let Xnn≥1 be a sequence of stationary random vari-ables with EX1 = µ ∈ R, E|X1|2+δ < ∞ and

∑∞n=1 α(n)δ/2+δ < ∞ for

some δ ∈ (0,∞). If σ2∞ of (3.2) is positive, then with µ = EX1,√

n(Xn − µ

)−→d N(0, σ2

∞).

Proof of Corollary 16.3.6: W.l.o.g., let µ = 0. Clearly, under the sta-tionarity of Xnn≥1, ζ2+δ

δ = E|X1|2+δ < ∞. Let a = M δ/4, Yi,M , andYi,M be as in the proof of Theorem 16.3.5. Also, let τ(M) = Var(Yi,M ) +2∑∞

k=1 Cov(Y1,M , Y1+k,M ), M ≥ 1. Since EX21I(|X1| > M) ≤ M−δζδ, one

gets ∣∣τ(M)− σ2∞∣∣

≤∣∣EY 2

1,M − EX21

∣∣+ 2a−1∑k=1

∣∣EY1,MY1+k,M − EX1X1+k

∣∣+ 2

∞∑k=a

∣∣EY1,MY1+k,M

∣∣+ ∣∣EX1X1+k

∣∣

≤ 2a−1∑k=0

[E(Y1,M −X1

)21/2EY 2

1+k,M

1/2

+(

EX21)1/2(

E∣∣Y1+k,M −X1+k

∣∣2)1/2]

Page 535: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

528 16. Limit Theorems for Dependent Processes

+ 4∞∑

k=a

4 + 2δδ

α(k)δ/2+δ ζ2δ

≤ 4a ζδ

EX2

1I(|X1| > M

)1/2+ 4

4 + 2δδ

ζ2δ

[ ∞∑k=a

α(k)δ/2+δ

]

→ 0 as M →∞. (3.28)

Further, by Corollary 16.3.1 (ii), the stationarity of Xnn≥1, and the argu-ments in the derivation of (3.2), (3.22) follows. Corollary 16.3.6 now followsfrom Theorem 16.3.5.

In the case that the sequence Xnn≥1 is stationary, further refinementsof Corollaries 16.3.3 and 16.3.6 are possible. The following result, due toDoukhan, Massart and Rio (1994) uses a different method of proof toestablish the CLT under stationarity. To describe it, for any nonincreasingsequence of positive real numbers ann≥1, define the function a(t) and itsinverse a−1(t), respectively, by

a(t) =∞∑

n=1

an I(n− 1 < t ≤ n), t > 0,

anda−1(u) = inft : a(t) ≤ u, u ∈ (0,∞).

Theorem 16.3.7: Let Xnn≥1 be a sequence of stationary random vari-ables with EX1 = µ ∈ R, EX2

1 < ∞ and strong mixing coefficient α(·).Suppose that ∫ 1

0α−1(u)QX1(u)2du < ∞, (3.29)

where QX1(u) = inft : P (|X1| > t) ≤ u, 0 < u < 1 is as in (2.13). Then,0 ≤ σ2

∞ < ∞. If σ2∞ > 0, then

√n(Xn − µ) −→d N(0, σ2

∞).

Proof: See Doukhan et al. (1994).

Note that if P (|X1| ≤ c) = 1 for some c ∈ (0,∞), then QX1(u) ≤ c for allu ∈ (0, 1) and (3.29) holds iff

∫ 10 α−1(u)du < ∞ iff

∑∞n=1 α(n) < ∞. Thus,

Theorem 16.3.7 yields Corollary 16.3.3 as a special case. Similarly, it can beshown that if E|X1|2+δ < ∞ and

∑∞n=1 α(n)δ/2+δ < ∞, then (3.29) holds

and, therefore, Theorem 16.3.7 also yields Corollary 16.3.6 as a special case.For another example, suppose α(n) = O

(exp(−cn)

)as n → ∞ for some

c ∈ (0,∞). Then, (3.29) holds if

EX21 log

(1 + |X1|

)< ∞. (3.30)

Page 536: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.4 Problems 529

In this case, the logarithmic factor in (3.30) cannot be dropped. Moreprecisely, it is known that finiteness of the second moment alone (with ar-bitrarily fast rate of decay of the strong mixing coefficient) is not enough toguarantee the CLT (Herrndorf (1983)) for strong mixing random variables.

For the ρ-mixing random variables, the following result is known. A pro-cess Xnn≥1 is called second order stationary if

EX2n < ∞, EXn = EX1 and EXnXn+k = EX1X1+k (3.31)

for all n ≥ 1, k ≥ 0.

Theorem 16.3.8: Let Xnn≥1 be a sequence of ρ-mixing, second orderstationary random variables with EX2

1 < ∞ and EX1 = µ. Suppose that∞∑

n=1

ρ(2n) < ∞. (3.32)

Then, σ2∞ ≡ Var(X1) + 2

∑∞k=1 Cov(X1, X1+k) converges and

Var(√

n Xn

)→ σ2

∞.

If σ2∞ ∈ (0,∞), then

√n(Xn − µ

)−→d N(0, σ2

∞).

Proof: See Peligrad (1982).

Thus, unlike the strong mixing case, in the ρ-mixing case the CLT holdsunder finiteness of the second moment, provided (3.32) holds and σ2

∞ > 0.

16.4 Problems

16.1 Deduce (1.16) from (1.15).

16.2 Let Xnn≥1 be a sequence of random variables. Show that thereexists a sequence of integers mn ∈ (0,∞) such that

(a) P(|Xn| > mn

)→ 0 as n →∞.

(b) P(|Xn| > mn i.o.) = 0.

16.3 Show that Ynk,Fnk of (1.18) is a mda.

(Hint: τn ≥ k = τn ≤ k − 1c ∈ Fn,k−1.)

16.4 Show that Ynk,Fnk of (1.18) satisfies (1.2) and (1.3) with τn re-placed by mn.

(Hint: Verify that for any Borel measurable g : R → R and ε > 0,

P(∣∣∣ τn∑

k=1

g(Ynk)−mn∑k=1

g(Ynk)∣∣∣ > ε

)≤ P (τn > mn).) (4.1)

Page 537: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

530 16. Limit Theorems for Dependent Processes

16.5 Let α∗(G1,G2) denote one of the coefficients α(G1,G2), β(G1,G2), andρ(G1,G2).

(a) Show that α∗(G1,G2) ∈ [0, 1].

(b) Show that α∗(G1,G2) = 0 iff G1 and G2 are independent.

(c) Show that α(G1,G2) ≤ minβ(G1,G2), ρ(G1,G2)

.

(d) Show that ρ(G1,G2) = sup|EX1X2|, Xi ∈ L2(Ω,Gi, P ), EXi =

0, EX21 = 1, i = 1, 2

.

16.6 Find examples of σ-algebras G1 and G2 such that

(a) β(G1,G2) < ρ(G1,G2);

(b) ρ(G1,G2) < β(G1,G2).

16.7 If Gi, Fi, i = 1, 2 are σ-algebras and (G1 ∨ F1) is independent of(G2 ∨ F2), then show that

α(G1 ∨ G2,F1 ∨ F2) ≤ α(G1,F1) + α(G2,F2). (4.2)

16.8 Let εii∈Z be a sequence of iid random variables with ε1 ∼Uniform(−1, 1). Let Xi = (εi − εi+1)/2, i ∈ Z. Show that

limn→∞ Var(

√n Xn) = 0. (4.3)

Also, find an ⊂ R such that of an

∑ni=1 Xi −→d T for some non-

degenerate random variable T . Find the distribution of T .

16.9 Let Xii∈Z be a sequence of iid random variables and Yii∈Z bea sequence of stationary random variables with strong mixing co-efficient α(·) such that Xii∈Z and Yii∈Z are independent. LetZi = h(Xi, Yi), i ∈ Z, where h : R2 → R is Borel measurable. Findthe limit distribution of 1√

n

∑ni=1(Zi − EZi) when

(i) h(x, y) = x + y, (x, y) ∈ R2, EX21 < ∞ and

E|Y1|2+δ +∞∑

n=1

α(n)δ/2+δ < ∞ for some δ ∈ (0,∞); (4.4)

(ii) h(x, y) = log(x2 + y2), (x, y) ∈ R2, and E(|X1|δ + |Y1|δ) < ∞and α(n) = O(n−δ) for some δ ∈ (0,∞);

(iii) h(x, y) = y1+x2 , (x, y) ∈ R, and (4.4) holds.

16.10 Let Xnn≥1 be a sequence of stationary random variables withstrong mixing coefficient α(·). Let E|X1|2 log(1 + |X1|) < ∞ andα(n) = O

(exp(−cn)

)as n → ∞, for some c ∈ (0,∞). Show that

(3.29) holds.

Page 538: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

16.4 Problems 531

16.11 Let Yi = xiβ + εi, i ≥ 1 be a simple linear regression model whereβ ∈ R is the regression parameter, xi ∈ R is nonrandom and εii≥1is a sequence of stationary random variables with strong mixing coef-ficient α(·) such that Eε1 = 0, E|ε1|2+δ < ∞ and

∑∞n=1 α(n)δ/2+δ <

∞, for some δ ∈ (0,∞). Suppose that for all h ∈ Z+,

n−h∑i=1

xixi+h

/ n∑i=1

x2i → γ(h) (4.5)

for some γ(h) ∈ R and n−1∑ni=1 |xi|2+δ = O(1) as n → ∞. Let

βn =∑n

i=1 xiyi

/∑ni=1 x2

i denote the least squares estimator of β.Show that ( n∑

i=1

x2i

)1/2

(βn − β) −→d N(0, σ2∞)

for some σ2∞ ∈ [0,∞). Find σ2

∞. (Note that by definition, Z ∼ N(0, 0)if P (Z = 0) = 1.)

16.12 Let Yi = xiβ′ + εi, i ≥ 1, where xi, β ∈ Rp, (p ∈ N) and εii≥1

satisfies the conditions of Problem 16.11. Suppose that for all h ∈ Z+,

n−1n−h∑i=1

x′ixi+h → Γ(h)

for some p × p matrix Γ(h), with Γ(h) nonsingular, and thatmax‖xi‖ : 1 ≤ i ≤ n = O(1) and n → ∞. Find the limit distribu-

tion of√

n(βn − β), where βn =(∑n

i=1 x′ixi

)−1∑ni=1 xiYi, n ≥ 1.

16.13 Let Xii∈Z be a first order stationary autoregressive process

Xi = ρXi−1 + εi, i ∈ Z, (4.6)

where |ρ| < 1 and εii∈Z is a sequence of iid random variables withEε1 = 0, Eε21 < ∞.

(a) Show that Xi =∑∞

j=0 ρjεi−j , i ∈ Z.

(b) Find the limit distribution of√n(ρn − ρ)

n≥1,

where ρn =∑n−1

i=1 XiXi+1/∑n

i=1 X2i , n ≥ 1.

(Hint: Use Theorem 16.1.1.)

Page 539: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17The Bootstrap

17.1 The bootstrap method for independentvariables

17.1.1 A description of the bootstrap methodThe bootstrap method, introduced in statistics by Efron (1979), is a pow-erful tool for solving many statistical inference problems. Let X1, . . . , Xn

be iid random variables with common cdf F . Let θ = θ(F ) be a param-eter and θn be an estimator of θ based on observations X1, . . . , Xn, i.e.,θn = tn(X1, . . . , Xn) for some measurable function tn(·) of the randomvariables X1, . . . , Xn. The parameter θ is called a ‘level 1’ parameter andthe parameters related to the distribution of θn are called ‘level 2’ param-eters (cf. Lahiri (2003)). For example, Var(θn) or a median of (the distri-bution of) θn are ‘level 2’ parameters. The bootstrap is a general methodfor estimating such ‘level 2’ parameters.

To describe the bootstrap method, let

Rn = rn(X1, . . . , Xn;F ) (1.1a)

be a random variable that is a known function of X1, . . . , Xn and F . Anexample of Rn is given by

Rn =√

n(Xn − µ)/σ, (1.1b)

the normalized sample mean, where Xn = n−1∑ni=1 Xi is the sample mean,

µ = EX1 and σ2 = Var(X1). The objective here is to approximate (esti-

Page 540: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

534 17. The Bootstrap

mate) the unknown distribution of Rn and its functionals. Let Fn be anestimator of F based on X1, . . . , Xn. For example, one may take Fn to bethe empirical distribution function (edf) of X1, . . . , Xn, given by

Fn(x) = n−1n∑

i=1

I(Xi ≤ x), x ∈ R. (1.1c)

Next, given X1, . . . , Xn, generate conditionally iid random variablesX∗

1 , . . . , X∗m with common distribution Fn, where m denotes the bootstrap

sample size. Define the bootstrap version R∗m,n of Rn by replacing the Xi’s

with X∗i ’s, F with Fn, and n with m in the definition of Rn. Then,

R∗m,n = rm

(X∗

1 , . . . , X∗m; Fn

). (1.2)

Let Gn denote the distribution of Rn. The bootstrap estimator of Gn isgiven by the conditional distribution Gm,n (say) of R∗

m,n given X1, . . . , Xn.Further, the bootstrap estimator of a functional φn = φ(Gn) is given by

φm,n ≡ φ(Gm,n).

For example, the bootstrap estimator of the kth moment E(Rn)k of Rn isgiven by E∗(R∗

n)k, k ∈ N, where E∗ denotes the conditional expectationgiven Xn, n ≥ 1. This follows from the above ‘plug-in’ method appliedto the functional φ(G) ≡

∫xkdG(x), as E(Rn)k =

∫xkdGn(x) = φ(Gn)

and E∗(R∗m,n)k =

∫xkdGm,n(x) = φ(Gm,n). Similarly, the bootstrap es-

timator of the α-quantile (0 < α < 1) of Rn is given by the α-quantileof (the conditional distribution of) R∗

m,n. In general, closed form expres-sions for bootstrap estimators are not available. One may use Monte Carlosimulation to evaluate them numerically. See Hall (1992) for more details.

Note that Gm,n and hence, the bootstrap estimators φm,n = φ(Gm,n)depend on the choice of the initial estimator Fn of F . The most commonchoice of Fn is given by Fn of (1.1), in which case the bootstrap variablesX∗

1 , . . . , X∗m can be equivalently generated by simple random sampling with

replacement from X1, . . . , Xn. The following example illustrates the con-struction of the bootstrap version for Rn = Rn as in (1.1b).

Example 17.1.1: For the normalized sample mean Rn =√

n(Xn−µ)/σ,its bootstrap version based on a resample of size m from Fn is given by

R∗m,n =

√n(X∗

m − µn

)/σn, (1.3)

where X∗m = X∗

m,n = m−1∑mi=1 X∗

i is the bootstrap sample mean,µn =

∫xdFn(x) is the (conditional) mean of X∗

1 under Fn and σ2n is the

conditional variance of X∗1 under Fn. For Fn = Fn, it is easy to check that

µn = n−1∑ni=1 Xi = Xn and σ2 = 1

n

∑ni=1(Xi − Xn)2 ≡ s2

n, the sample

Page 541: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.1 The bootstrap method for independent variables 535

variance of X1, . . . , Xn. Thus, the ‘ordinary’ bootstrap version of Rn (withFn = Fn) is given by

R∗m,n =

√n(X∗

m − Xn)/sn. (1.4)

Some questions that arise naturally in this context are: Does the bootstrapdistribution of R∗

m,n give a valid approximation to the distribution of Rn?How good is the approximation generated by the bootstrap? How does thequality of approximation depend on the resample size or on the underlyingcdf F? Some of these issues are addressed next. Write P∗ to denote theconditional probability given X1, X2, . . . and supx to denote supremumover x ∈ R.

17.1.2 Validity of the bootstrap: Sample mean

Theorem 17.1.1: Let Xnn≥1 be a sequence of iid random variables withEX1 = µ and Var(X1) = σ2 ∈ (0,∞). Let Rn =

√n(Xn − µ)/σ and let

R∗n,n be its ‘ordinary’ bootstrap version given by (1.4) with m = n. Then

∆n ≡ supx

∣∣P (Rn ≤ x)− P∗(R∗n,n ≤ x)

∣∣→ 0 as n →∞, a.s. (1.5)

Proof: W.l.o.g., suppose that µ = 0 and σ = 1. By the CLT, Rn ≡√nXn −→d N(0, 1). Hence, it is enough to show that as n →∞,

∆1n ≡ supx

∣∣P∗(√

n(X∗n − Xn) ≤ snx

)− Φ(x)

∣∣ = o(1) a.s.,

where Φ(·) denotes the cdf of the N(0, 1) distribution. By the Berry-Esseentheorem,

∆1n ≤ (2.75)E∗|X∗

1 − Xn|3√ns3

n

≤ 23 · (2.75)√n

E∗|X∗1 |3

s3n

.

By the SLLN, s2n → σ2 ∈ (0,∞) and by the Marcinkiewz-Zygmund SLLN

1√n

E∗|X∗1 |3 =

1n3/2

n∑i=1

|Xi|3 → 0 as n →∞ a.s.

Hence, ∆1n = o(1) as n →∞ a.s. and the theorem is proved.

One can also give a more direct proof of Theorem 17.1.1 using theLindeberg-Feller CLT (Problem 17.1).

Theorem 17.1.1 shows that the bootstrap approximation to the distribu-tion of the normalized sample mean is valid under the same conditions thatguarantee the CLT. Under additional moment conditions, a more precise

Page 542: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

536 17. The Bootstrap

bound on the order of ∆n can be given. If E|X1|3 < ∞, then by the SLLN,n−1∑n

j=1 |Xj |3 → E|X1|3 as n → ∞, a.s. and hence, the last step in theproof of Theorem 17.1.1, it follows that ∆n = O(n−1/2) a.s. Thus, in thiscase, the rate of bootstrap approximation to the distribution of Rn is asgood as that of the normal approximation to the distribution of Rn. Animportant result of Singh (1981) shows that the error rate of bootstrap ap-proximation is indeed smaller, provided the distribution of X1 is nonlattice(cf. Chapter 11), i.e.,

|E exp(ιtX1)| = 1 for all t ∈ R \ 0.

A precise statement of this result is given in Theorem 17.1.2 below.

17.1.3 Second order correctness of the bootstrap

Theorem 17.1.2: Let Xnn≥1 be a sequence of iid nonlattice randomvariables with E|X1|3 < ∞. Also, let Rn, R∗

n,n, and ∆n be as in Theorem17.1.1. Then,

∆n = o(n−1/2) as n →∞, a.s. (1.6)

Proof: Only an outline of the proof will be given here. By Theorem 11.4.4on Edgeworth expansions, it follows that

supx∈R

∣∣∣P (Rn ≤ x)−[Φ(x)− µ3

6√

nσ3 (x2− 1)φ(x)]∣∣∣ = o(n−1/2) as n →∞,

(1.7)where µ3 = E(X1 − µ)3 and φ(x) = (2π)−1/2 exp(−x2/2), x ∈ R. It canbe shown that the conditional distribution of R∗

n,n given X1, . . . , Xn, alsoadmits a similar expansion that is valid almost surely:

supx∈R

∣∣∣P∗(Rn ≤ x)−[Φ(x)− µ3,n

6√

nσ3n

(x2 − 1)φ(x)]∣∣∣

= o(n−1/2) as n →∞, a.s., (1.8)

where µ3,n = E∗(X∗1 − Xn)3. By the SLLN, as n →∞, µ3,n → µ3 a.s. and

σ2n → σ2 a.s. Hence,

∆n ≤∣∣∣ µ3,n

σ3n

− µ3

σ3

∣∣∣ supx∈R

|x2 − 1|φ(x) 1

6√

n+ o(n−1/2) a.s.

= o(n−1/2) a.s.

Under the conditions of Theorem 17.1.2, the bootstrap approximationto the distribution of Rn outperforms the classical normal approxima-tion, since the latter has an error of order O(n−1/2). In the literature,

Page 543: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.1 The bootstrap method for independent variables 537

this property is referred to as the second order correctness (s.o.c.) of thebootstrap. It can be shown that under additional conditions, the order of∆n is O

(n−1√log log n

)a.s., and therefore the bootstrap gives an accurate

approximation even for small sample sizes where the asymptotic normalapproximation is inadequate. For iid random variables with a smooth cdfF , s.o.c. of the bootstrap has been established in more complex problems,such as in the case of the studentized sample mean Tn ≡

√n(Xn − µ)/sn.

For a detailed description of the second and higher order properties of thebootstrap for independent random variables, see Hall (1992).

17.1.4 Bootstrap for lattice distributionsNext consider the case where the underlying cdf F does not satisfy thenonlatticeness condition of Theorem 17.1.2. Then,∣∣E exp(ιtX1)

∣∣ = 1 for some t = 0

and it can be shown (cf. Chapter 10) that X1 takes all its values in a latticeof the form a + jh : j ∈ Z, a ∈ R, h > 0. The smallest h > 0 satisfying

P(X1 ∈ a + jh : j ∈ Z

)= 1

is called the (maximal) span of F . For the normalized sample mean of lat-tice random variables, s.o.c. of the bootstrap fails under standard metrics,such as the sup-norm metric and the Levy metric. Recall that for any twodistribution functions F and G on R,

dL(F, G) = infε > 0 : F (x− ε)− ε < G(x) < F (x+ ε)+ ε for all x ∈ R(1.9)

defines the Levy metric on the set of probability distribution on R.

Theorem 17.1.3: Let Xnn≥1 be a sequence of iid lattice random vari-ables with span h ∈ (0,∞) and let Rn, R∗

n,n, and ∆n be as in Theorem17.1.1. If E(X4

1 ) < ∞, then

lim supn→∞

√n ∆n =

h√2πσ2

a.s. (1.10)

andlim sup

n→∞

√n dL(Gn, Gn) =

h

σ(1 +√

2π)a.s. (1.11)

where Gn(x) = P (Rn ≤ x) and Gn(x) = P∗(R∗n,n ≤ x), x ∈ R.

Thus, the above theorem shows that for lattice random variables, thebootstrap approximation to the distribution of Rn may not be better thanthe normal approximation in the supremum and the Levy metric. In Theo-rem 17.1.3, relation (1.10) is due to Singh (1981) and (1.11) is due to Lahiri(1994).

Page 544: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

538 17. The Bootstrap

Proof: Here again, an outline of the proof is given. By Theorem 11.4.5,

supx

∣∣P (Rn ≤ x)− ξn(x)∣∣ = o(n−1/2) (1.12)

where ξn(x) = Φ(x) + µ36σ3

√n(1 − x2)φ(x) + h

σ√

nQ([

√nxσ − nx0]/h)φ(x),

x ∈ R, Q(x) = 12 − x + x, and x0 ∈ R is such that P (X1 = x0 + µ) > 0.

Also, by Theorem 1 of Singh (1981),

supx

∣∣P∗(R∗n,n ≤ x)− ξn(x)

∣∣ = o(n−1/2) as n →∞, a.s., (1.13)

where ξn(x) = Φ(x) + µ3,n

6s3n

√n(1 − x2)φ(x) + h

σ√

nQ(√

nxsn/h), x ∈ R andµ3,n = E∗(X2

1 − Xn)3. Hence, by (1.12) and (1.13),

σ√

n

h·∆n = sup

x∈R

∣∣∣Q([√

nxσ − nx0]/h)−Q(√

nxsn/h)∣∣∣φ(x) + o(1) a.s.

Since Q(x) ∈ [−12 , 1

2 ] for all x and φ(0) = 1√2π

, it is clear that

lim supn→∞

σ√

n

h∆n ≤

1√2π

a.s. (1.14)

To get the lower bound, suppose that for almost all realizations ofX1, X2, . . . , and for any given ε, δ ∈ (0, 1), there is a sequence xnn≥1 ∈(0, ε) such that

√nxnsn/h ∈ Z and 〈[

√nxnσ − nx0]/h〉 ∈ (1− δ, 1) i.o., (1.15)

where 〈y〉 denotes the fractional part of y, i.e., 〈y〉 = y−y, y ∈ R. Then,for each n satisfying (1.15),

σ√

n

h∆n ≥

∣∣∣Q([√

nxnσ − nx0]/h)− 12

∣∣∣φ(xn) + o(1)

≥ infy∈(1−δ,1)

∣∣∣Q(y)− 12

∣∣∣ · infx∈(0,ε)

φ(x) + o(1). (1.16)

Since limy→1− Q(y) = −12 , (1.14) and (1.16) together yield (1.10). The

existence of the sequence xnn≥1 satisfying (1.15) can be established byusing the LIL. For an outline of the main arguments, see Problem 17.4.

Next consider (1.11). Let cn = ‖Gn−Gn‖∞+n−1/2. Since dL(Gn, Gn) ≤‖Gn − Gn‖, dL(Gn, Gn) = inf0 < ε < cn : Gn(x − ε) − ε < Gn(x) <Gn(x + ε) + ε for all x ∈ R. Using (1.12) and (1.13), it can be shown thatfor 0 < ε < cn,

Gn(x) < Gn(x + ε) + ε for all x ∈ R

⇔(Q(x)−Q(xτn + (

√n εσ − nx0)/h)

)φ(hx/sn

√n) (1.17)

< h−1εσ√

n(1 + φ(hx/sn

√n)) + o(1) for all x ∈ R,

Page 545: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.1 The bootstrap method for independent variables 539

where τn = σ/sn and the o(1) term is uniform over x ∈ R. Similarly, for0 < ε < cn,

Gn(x− ε)− ε < Gn(x) for all x ∈ R

⇔(Q(x)−Q(xτn − (

√n εσ + nx0)/h)

)φ(hx/sn

√n) (1.18)

> h−1εσ√

n(1 + φ(hx/sn

√n))

+ o(1) for all x ∈ R.

It is easy to check that (1.17) and (1.18) are satisfied if ε = h(1 +√

2π)σ√

n−1 + o(n−1/2). Hence,

lim supn→∞

√ndL(Gn, Gn) ≤ h

(σ(1 +

√2π))−1 a.s.

To prove the reverse inequality, note that if there exists a η > 0 suchthat, with εn = dL(Gn, Gn) + n−1,(

Q(x)−Q(τnx + (√

n εnσ − nx0)/h))φ(hx/sn

√n) ≥ (1− η)

√2π

for some xεR, then

h−1εnσ√

n(1 + φ(hx/sn

√n))

> (1− η)/√

2π + o(1)

⇒ h−1σεn

√n(1 + φ(0)

)> (1− η)

√2π + o(1)

⇒ εn > (1− η)h(1 +√

2π)−1(σ√

n)−1 + o(n−1/2)

⇒ dL(Gn, Gn) > (1− η)h(1 +√

2π)−1(σ√

n)−1 + o(n−1/2).

Thus, it is enough to show that for almost all sample sequences, there existsa sequence xnn≥1 ∈ R such that with εn = d(Gn, Gn) + n−1,

limn→∞

(Q(xn)−Q(τnxn + (

√n εnσ − nx0)/h)

)φ(xnh/sn

√n)

=1√2π

. (1.19)

Since limn→∞√

n(τn − 1) = ∞ a.s., for almost all sample sequences, thereexists a subsequence nm such that lim supm→∞

√nm(τnm

−1) = ∞, andτnm

> 1 for all m. Let an denote the fractional part of(√

n εnσ − nx0)/h.

Then 0 ≤ an < 1. Define a sequence of integers xnn≥1 by

xn = (1− an)/(τn − 1) − 1, n ≥ 1.

Then, it is easy to see that

(1− anm)− 2(τnm

− 1) ≤ xnm(τnm

− 1) ≤ (1− anm)− (τnm

− 1),

and

limm→∞ Q(xnm

τnm+ anm

) = limm→∞ Q

(xnm

(τnm− 1)− 1− anm

))

= −1/2.

(1.20)Now, one can use (1.20) to prove (1.19). This completes the proof.

Page 546: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

540 17. The Bootstrap

17.1.5 Bootstrap for heavy tailed random variablesNext consider the case where the Xi’s are heavy tailed, i.e., the Xi’s haveinfinite variance. More specifically, let Xnn≥1 be iid with common cdf Fsuch that as x →∞,

1− F (x) ∼ x−2L(x)F (−x) ∼ cx−2L(x) (1.21)

for some α ∈ (1, 2), c ∈ (0,∞) and for some function L(·) : (0,∞) → (0,∞)that is slowly varying at ∞, i.e., L(·) is bounded on every bounded intervalof (0,∞) and

limy→∞ L(ty)/L(y) = 1 for all t ∈ (0,∞). (1.22)

Since α ∈ (1, 2), E|X1| < ∞, but EX21 = ∞. Thus, the Xi’s have infinite

variance but finite mean. In this case, it is known (cf. Chapter 11) that forsome sequence ann≥1 of normalizing constants

Tn ≡n(Xn − µ)

an−→d Wα, (1.23)

where Xn = n−1∑ni=1 Xi, µ = EX1 and Wα has a stable distribution of

order α. The characteristic function of Wα is given by

φα(t) = exp(∫

h(t, x)dλα(x)), t ∈ R, (1.24)

where h(t, x) =(eιtx − 1 − ιt

), x, t ∈ R and where λα(·) is a measure on(

R,B(R))

satisfying

λα

([x,∞)

)= x−α and λα

((−∞,−x]

)= cx−α, x > 0. (1.25)

The normalizing constants ann≥1 ⊂ (0,∞) in (1.23) are determined bythe relation

nP (X1 > an) ≡ na−2n L(an) → 1 as n →∞. (1.26)

To apply the bootstrap, let X∗1 , . . . , X∗

m denote a random sample of sizem, drawn with replacement form X1, . . . , Xn. Then, the bootstrap ver-sion of Tn is given by

T ∗m,n = m(X∗

m − Xn)/am (1.27)

where X∗m = m−1∑m

i=1 X∗i is the bootstrap sample mean. The main result

of this section shows that L(T ∗m,n | X1, . . . , Xn), the conditional distribu-

tion of T ∗m,n given X1, . . . , Xn has a random limit for the usual choice of the

Page 547: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.1 The bootstrap method for independent variables 541

resample size, m = n and, hence, is an “inconsistent estimator” of L(Tn)the distribution of Tn. In contrast, consistency of the bootstrap approxi-mation holds if m = o(n), i.e., the resample size is of smaller order than thesample size. To describe the random limit in the m = n case, the notion ofa random Poisson measure is needed, which is introduced in the followingdefinition.

Definition 17.1.1: M(·) is called a random Poisson measure on(R,B(R)

)with mean measure λα(·) of (1.25), if M(A) : A ∈ B(R) is a collectionof random variables defined on some probability space (Ω,F , P ) such thatfor each w ∈ Ω, M(·)(w) is a measure on

(R,B(R)

), and for any disjoint

finite collection of sets A1, . . . , Am ∈ B(R), m ∈ N, M(A1), . . . , M(Am)are independent Poisson random variables and M(Ai) ∼ Poisson

(λα(Ai)

),

i = 1, . . . , m.

Theorem 17.1.4: Let Xnn≥1 be a sequence of iid random variablessatisfying (1.21) for some α ∈ (1, 2) and let Tn, T ∗

m,n be as in (1.23) and(1.27), respectively.

(i) If m = o(n) as n →∞, then

supx

∣∣P∗(T ∗m,n ≤ x)− P (Tn ≤ x)

∣∣ −→p 0 as n →∞. (1.28)

(ii) Suppose that m = n for all n ≥ 1. Then, for any t ∈ R,

E∗ exp(ιtT ∗

n,n

)−→d exp

(∫h(t, x)dM(x)

), (1.29)

where M(·) is the random Poisson measure defined above andh(t, x) = (eιtx − 1− ιt).

Remark 17.1.1: Part (ii) of Theorem 17.1.4 shows that the bootstrapfails to provide a valid approximation to the distribution of the normalizedsample mean of heavy tailed random variables when m = n. Failure of thebootstrap in the heavy tail case was first proved by Athreya (1987a), whoestablished weak convergence of the finite dimensional vectors

(P∗(T ∗

n,n ≤x1), . . . , P∗(T ∗

n,n ≤ xm))

based on a slightly different bootstrap versionT ∗

n,n of Tn, where T ∗n,n = n(X∗ − Xn)/X(n) and X(n) = maxX1, . . . , Xn.

Extensions of these results are given in Athreya (1987b) and Arcones andGine (1989, 1991). A necessary and sufficient condition for the validity ofthe bootstrap for the sample mean with resample size m = n is that X1belongs to the domain of attraction of the normal distribution (Gine andZinn (1989)).

Proof of Theorem 17.1.4: W.l.o.g., let µ = 0. Let

φm,n(t) = E∗ exp(ιt(X∗

1 − Xn)/am

)

Page 548: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

542 17. The Bootstrap

= n−1n∑

j=1

exp(ιt[X1 − Xn]/am

), t ∈ R.

Then,

E∗ exp(ιtT ∗

m,n

)=

(φm,n(t)

)m=

[1− 1

m

m(1− φm,n(t))

]m. (1.30)

First consider part (ii). Note that for any sequence of complex numbersznn≥1 satisfying, zn → z ∈ C,

(1 +

zn

n

)n/ezn → 1 as n →∞. (1.31)

Hence, by (1.30) (with m = n) and (1.31), (ii) would follow if

n(φn,n(t)− 1

)−→d

∫h(x, t)Mα(dx). (1.32)

Since E∗(X∗1 − Xn) = 0, it follows that

n(φn,n(t)− 1

)= n

[E∗ exp

(ιt[X∗

1 − Xn]/an

)− 1− ιtE∗

([X∗

1 − Xn]/an

)]=

∫h(x, t)dMn(x), (1.33)

where Mn(A) =∑n

j=1 I([Xj − Xn]/an ∈ A

), A ∈ B(R). Note that for a

given t ∈ R, the function h(·, t) is continuous on R, |h(x, t)| = O(x2) asx → 0 and |h(x, t)| = O(|x|) as |x| → ∞. Hence, to prove (1.32), it isenough to show that for any ε > 0,

limη↓0

lim supn→∞

P(∫

|x|≤η

x2dMn(x) > ε)

= 0, (1.34)

limη↓0

lim supn→∞

P(∫

η|x|>1|x|dMn(x) > ε

)= 0, (1.35)

and for any disjoint intervals I1, . . . , Ik whose closures I1, . . . , Ik are con-tained in R \ 0,

Mn(I1), . . . , Mn(Ik)−→d

Mα(I1), . . . , Mα(Ik)

. (1.36)

Let An = |Xn| ≤ n−3/4an. Since nXn/an −→d Wα,

P (Acn) → 0 as n →∞. (1.37)

Page 549: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.1 The bootstrap method for independent variables 543

Consider (1.34) and fix η ∈ (0,∞). Then, on the set An, for n1/2 > η,

∫|x|≤η

x2dMn(x) = a−2n

n∑j=1

(Xj − Xn)2I(|Xj − Xn| ≤ anη)

≤ 2a−2n

n∑j=1

X2j I(|Xj | ≤ (|Xn|+ anη)

)+ 2a−2

n n · X2n

≤ 2a−2n

n∑j=1

X2j I(|Xj | ≤ 2ηan) + 2n−1/2.

Hence,

lim supn→∞

P(∫

|x|≤η

x2dMn(x) > ε)

≤ lim supn→∞

[P(∫

|x|≤η

x2dMn(x) > ε∩An

)+ P (Ac

n)]

≤ lim supn→∞

P(2a−2

n

n∑j=1

X2j I(|Xj | ≤ 2ηan) >

ε

2

)

≤ lim supn→∞

na−2n EX2

1I(|X1| ≤ 2ηan). (1.38)

By (1.21) and (1.26), na−2n EX2

1I(|X1| ≤ 2ηan) is asymptotically equivalentto C1 · na−2

n (ηan)2−αL(2ηan) for some C1 ∈ (0,∞) (not depending on η).Hence, by (1.21) and (1.22), the right side of (1.38) is bounded above by4ε · C1 · η2−α, which → 0 as η ↓ 0. Hence, (1.34) follows.

By similar arguments,

lim supn→∞

P(∫

η|x|>1|x|dMn(x) > ε

)

≤ lim supn→∞

na−1n E|X1|I(|X1| > η−1an/2)

≤ const. ηα−1,

which → 0 as η ↓ 0. Hence, (1.35) follows.Next consider (1.36). Let M†

n(A) ≡∑n

j=1 I(Xj/an ∈ A), A ∈ B(R).Then, for any a < b, on the set An,

M†n

([a + εn, b− εn]

)≤ Mn

([a, b]

)≤ M†

n

([a− εn, b + εn]

)for εn = n−3/4. Further, by (1.21), (1.22), for any x = 0,

EM †n

([x− εn, x + εn]

)= nP

(X1 ∈ an[x− εn, x + εn]

)→ 0 as n →∞.

Page 550: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

544 17. The Bootstrap

Hence, it is enough to establish (1.36) with Mn(·) replaced by M†n(·). Since(

M†n(I1), . . . , M†

n(Ik), n −∑k

j=1 M†n(Ij)

)has a multinomial distribution

with parameters(n, p1n, . . . , pkn, 1−

∑kj=1 pjn

)with pjn = P (X1/an ∈ Ij),

and since, by (1.21), npjn → λα(Ij) for Ij ⊂ R \ 0, for j = 1, 2, . . . , k thelast assertion follows. This completes the proof of part (ii).

Next consider part (i). Note that Wα has an absolutely continuous dis-tribution w.r.t. the Lebesgue measure (as the characteristic function of Wα

is integrable on R). Since Tn −→d Wα, it is enough to show that

supx

∣∣∣P∗(T ∗m,n ≤ x)− P (Wα ≤ x)

∣∣∣ −→p 0 as n →∞. (1.39)

Using the fact that “a sequence of random variables Yn converges to 0 inprobability if and only if given any subsequence ni, there exists a furthersubsequence mi ⊂ ni such that Ymi

→ 0 a.s.”, one can show that(1.39) holds iff

E∗ exp(ιtT ∗m,n) −→p E exp(ιtWα) as n →∞ (1.40)

for each t ∈ R (Problem 17.8 (c)).By (1.24), (1.30), and (1.31), it is enough to show that for each t ∈ R,

m(φm,n(t)− 1) −→p

∫h(t, x)dλα(x) as n →∞. (1.41)

The left side of (1.41) can be written as

m

n

n∑i=1

h

(Xi − Xn

am, t

)≡∫

h(x, t)dMn(x), say

where Mn(A) = mn

∑ni=1 I

([Xi − Xn]/am ∈ A

), A ∈ B(R). The proof of

(1.41) now proceeds along the lines of the proof of (1.32), where −→d in(1.36) is replaced by −→p. Note that for m = o(n), by (1.21) and (1.22),

an

am· m

n∼( n

m

)1/α−1 L(n)L(m)

→ 0 as n →∞. (1.42)

By arguments similar to the proof of (1.34), on An, for n large,∫|x|≤η

x2dMn(x)

=m

na−2

m

n∑j=1

(Xj − Xn)2I(|Xj − Xn| ≤ amη)

≤ 2m

na−2

m

n∑j=1

X2j I(|Xj | ≤ 2amη) + 2

m

n

a2n

a2m

n−3/2.

Page 551: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.2 Inadequacy of resampling single values under dependence 545

Also, the expected value of the first term on the right side above equals2ma−2

m EX21I(|X1| ≤ 2amη), which is asymptotically equivalent to const.

η2−α. Hence, it follows that (1.34) holds with Mn replaced by Mn. Similarly,one can establish (1.35) for Mn. Hence, to prove (1.41), it remains to showthat (cf. (1.36)) for any interval I1 whose closure is contained in R \ 0,

Mn(I1) −→p λα(I1). (1.43)

By (1.23) and (1.42), Xn/am = nXn

an

an

am

1n −→p 0 as n → ∞. Hence, by

arguments as in the proof of (1.36), it is enough to show that

M†n(I1) −→p λα(I1), (1.44)

where M†n(A) = m

n

∑ni=1 I(Xi/am ∈ A), A ∈ B(R). In view of (1.21) and

(1.26), this can be easily verified by showing

EM†n(I1) → λα(I1) and Var

(M†

n(I1))

= O(m

n

)as n →∞, (1.45)

and hence → 0 as n →∞. Hence, part (i) of Theorem 17.1.4 follows.

17.2 Inadequacy of resampling single values underdependence

The resampling scheme, introduced by Efron (1979) for iid random vari-ables, may not produce a reasonable approximation under dependence. Anexample to this effect was given by Singh (1981), which is described next.Let Xnn≥1 be a sequence of stationary m-dependent random variablesfor some m ∈ N, with EX1 = µ and EX2

1 < ∞.Recall that Xnn≥1 is called m-dependent for some integer m ≥ 0

if X1, . . . , Xk and Xk+m+1,... are independent for all k ≥ 1. Thus,any sequence of independent random variables εnn≥1 is 0-dependent andif Xn ≡ εn + 0.5εn+1, n ≥ 1, then Xnn≥1 is 1-dependent. For an m-dependent sequence Xnn≥1 with finite variances, it can be checked that

σ2∞ ≡ lim

n→∞ nVar(Xn) = Var(X1) + 2m∑

i=1

Cov(X1, X1+i),

where Xn = n−1∑ni=1 Xi. If σ2

∞ ∈ (0,∞), then by the CLT for m-dependent variables (cf. Chapter 16),

√n(Xn − µ) −→d N(0, σ2

∞). (2.1)

Now consider the bootstrap approximation to the distribution of therandom variable Tn =

√n(Xn−µ) under Efron (1979)’s resampling scheme.

Page 552: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

546 17. The Bootstrap

For simplicity, assume that the resample size equals the sample size, i.e.,from (X1, . . . , Xn), an equal number of bootstrap variables X∗

1 , . . . , X∗n are

generated by resampling one a single observation at a time. Then, thebootstrap version T ∗

n,n of Tn is given by

T ∗n,n =

√n(X∗

n − Xn),

where X∗n = n−1∑n

i=1 X∗i . The conditional distribution of T ∗

n,n still con-verges to a normal distribution, but with a “wrong” variance, as shownbelow.

Theorem 17.2.1: Let Xnn≥1 be a sequence of stationary m-dependentrandom variables with EX1 = µ and σ2 = Var(X1) ∈ (0,∞), where m ∈Z+. Then

supx

∣∣P∗(T ∗n,n ≤ x)− Φ(x/σ)

∣∣ = o(1) as n →∞, a.s. (2.2)

Note that, if m ≥ 1, then σ2 need not equal σ2∞.

For proving the theorem, the following result will be needed.

Lemma 17.2.2: Let Xnn≥1 be a sequence of stationary m-dependentrandom variables. Let f : R → R be a Borel measurable function withE|f(X1)|p < ∞ for some p ∈ (0, 2), such that Ef(X1) = 0 if p ≥ 1. Then,

n−1/pn∑

i=1

f(Xi) → 0 as n →∞, a.s.

Proof: This is most easily proved by splitting the given m-dependent se-quence Xnn≥1 into m + 1 iid subsequences Yjii≥1, j = 1, . . . , m + 1,defined by Yji = Xj+(i−1)(m+1), and then applying the standard resultsfor iid random variables to Yjii≥1’s (cf. Liu and Singh (1992)). For1 ≤ j ≤ m + 1, let Ajn = 1 ≤ i ≤ n : j + (i − 1)(m + 1) ≤ n andlet Njn denote the size of the set Ajn. Note that Njn/n → (m + 1)−1 asn → ∞ for all 1 ≤ j ≤ m + 1. Then, by the Marcinkiewz-Zygmund SLLN(cf. Chapter 8) applied to each of the sequences of iid random variablesYjii≥1, j = 1, . . . , m + 1, one gets

n−1/pn∑

i=1

f(Xi)

=m+1∑j=1

[N

−1/pjn

∑i∈Ajn

f(Yji)]· (Njn/n)1/p → 0 as n →∞, a.s.

This completes the proof of Lemma 17.2.2.

Page 553: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.3 Block bootstrap 547

Proof of Theorem 17.2.1: Note that conditional on X1, . . . , Xn,X∗

1 , . . . , X∗n are iid random variables with E∗X∗

1 = Xn, Var∗(X∗1 ) = s2

n,and E∗|X∗

1 |3 = n−1∑ni=1 |Xi|3 < ∞. Hence, by the Berry-Esseen theorem,

supx

∣∣P∗(T ∗n,n ≤ x)− Φ(x/sn)

∣∣ ≤ (2.75)E∗|X∗

1 − Xn|3√n s3

n

. (2.3)

By Lemma 17.2.2, w.p. 1,

Xn → µ, n−1n∑

i=1

X2i → EX2

1 , and n−3/2n∑

i=1

|Xi|3 → 0 (2.4)

as n →∞. Hence, the theorem follows from (2.3) and (2.4).

The following result is an immediate consequence of Theorem 17.2.1 and(2.1).

Corollary 17.2.3: Under the conditions of Theorem 17.2.1, if σ2∞ = 0

and∑m

i=1 Cov(X1, X1+i) = 0, then for any x = 0,

limn→∞

∣∣P∗(T ∗n,n ≤ x)− P (Tn ≤ x)

∣∣ = ∣∣Φ(x/σ)− Φ(x/σ∞)∣∣ = 0 a.s.

Thus, for all x = 0, the bootstrap estimator P∗(T ∗n,n ≤ x) of P (Tn ≤ x)

based on Efron (1979)’s resampling scheme has an error that tends to anonzero number in the limit. As a result, the bootstrap estimator of P (Tn ≤x) is not consistent. By resampling individual Xi’s, Efron (1979)’s resam-pling scheme ignores the dependence structure of the sequence Xnn≥1completely, and thus, fails to account for the lag-covariance terms (viz.,Cov(X1, X1+i), 1 ≤ i ≤ m) in the asymptotic variance.

17.3 Block bootstrap

A new type of resampling scheme that is applicable to a wide class of depen-dent random variables is given by the block bootstrap methods. Althoughthe idea of using blocks in statistical inference problems for time series isvery common (cf. Brillinger (1975)), the development of a suitable versionof resampling based on blocks has been slow. In a significant breakthrough,Kunsch (1989) and Liu and Singh (1992) independently formulated a blockresampling scheme, called the moving block bootstrap (MBB). In contrastwith resampling a single observation at a time, the MBB resamples blocksof consecutive observations at a time, thereby preserving the dependencestructure of the original observations within each block. Further, by allow-ing the block size to grow to infinity, the MBB is able to reproduce thedependence structure of the underlying process asymptotically. Essentially

Page 554: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

548 17. The Bootstrap

the same idea of block resampling was implicit in the works of Hall (1985)and Carlstein (1986). A description of the MBB is given next.

Let Xii≥1 be a sequence of stationary random variables and letX1, . . . , Xn ≡ Xn denote the observations. Let be an integer sat-isfying 1 ≤ < n. Define the blocks B1 = (X1, . . . , X), B2 =(X2, . . . , X+1), . . . ,BN = (XN , . . . , Xn), where N = n − + 1. For sim-plicity, suppose that divides n. Let b = n/. To generate the MBB sam-ples, one selects b blocks at random with replacement from the collectionB1, . . . ,BN. Since each resampled block has elements, concatenatingthe elements of the b resampled blocks serially yields b · = n bootstrapobservations X∗

1 , . . . , X∗n. Note that if one sets = 1, then the MBB reduces

to the bootstrap method of Efron (1979) for iid data. However, for a validapproximation in the dependent case, it is typically required that both and b go to ∞ as n →∞, i.e.,

−1 + n−1 = o(1) as n →∞. (3.1)

Next suppose that the random variable of interest is of the form Tn =tn(Xn; θ(Pn)

), where Pn denotes the joint distribution of X1, . . . , Xn and

θ(Pn) is a functional of Pn. The MBB version of Tn based on blocks of size is defined as

T ∗n = tn

(X∗

1 , . . . , X∗n; θ(Pn)

),

where Pn = L(X∗1 , . . . , X∗

n|Xn), the conditional joint distribution ofX∗

1 , . . . , X∗n, given Xn, and where the dependence on is suppressed to

ease the notation.To illustrate the construction of T ∗

n in a specific example, suppose thatTn is the centered and scaled sample mean Tn = n1/2(Xn − µ). Then theMBB version of Tn is given by

T ∗n = n1/2(X∗

n − µn), (3.2)

where X∗n is the sample mean of the bootstrap observations and where

µn = E∗(X∗n)

= N−1N∑

i=1

(Xi + · · ·+ Xi+−1)/

= N−1[ N∑

i=

Xi +−1∑i=1

i/(Xi + Xn−i+1)], (3.3)

which is different from Xn when is > 1.

17.4 Properties of the MBB

Let Xii∈Z be a sequence of stationary random variables with EX1 = µ,EX2

1 < ∞ and strong mixing coefficient α(·). In this section, consistency of

Page 555: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.4 Properties of the MBB 549

the MBB estimators of the variance and of the distribution of the samplemean will be proved.

17.4.1 Consistency of MBB variance estimatorsFor Tn =

√n(Xn − µ), the MBB variance estimator Var∗(T ∗

n) has thedesirable property that it can be expressed by simple, closed-form formulasinvolving the observations. This is possible because of the linearity of thebootstrap sample mean in the resampled observations. Let Ui = (Xi + · · ·+Xi+−1)/ denote the average of the block (Xi, . . . , Xi+−1), i ≥ 1. Then,using the independence of the resampled blocks, one gets

Var∗(T ∗n) =

[1N

N∑i=1

U2i − µ2

n

], (4.1)

where N = n − + 1 and when µn = N−1∑ni=1 Ui is as in (3.3). Under

the conditions of Corollary 16.3.6, the asymptotic variance of Tn is givenby the infinite series

σ2∞ ≡ lim

n→∞ Var(Tn) =∞∑

i=−∞EZ1Z1+i, (4.2)

where Zi = Xi − µ, i ∈ Z. The following result proves consistency of theMBB estimator of the ‘level 2’ parameter Var(Tn) or, equivalently, of σ2

∞.

Theorem 17.4.1: Suppose that there exists a δ > 0 such that E|X1|2+δ <∞ and that

∑∞n=1 α(n)δ/2+δ < ∞. If, in addition, −1 + n−1 = o(1) as

n →∞, thenVar∗(T ∗

n) −→p σ2∞ as n →∞. (4.3)

Theorem 17.4.1 shows that under mild moment and strong mixing condi-tions on the process Xii∈Z, the bootstrap variance estimators Var∗(T ∗

n),are consistent for a wide range of bootstrap block sizes , so long as tendsto infinity with n but at a rate slower than n. Thus, block sizes given by = log log n or = n1−ε, 0 < ε < 1, are all admissible block lengths for theconsistency of Var∗(T ∗

n).For proving the theorem, the following lemma from Lahiri (2003) is

needed.

Lemma 17.4.2: Let f : R → R be a Borel measurable function andlet Xii∈Z be a (possibly nonstationary) sequence of random vectors withstrong mixing coefficient α(·). Define ‖f‖∞ = sup|f(x)| : x ∈ R andζ2+δ,n = max

(E|f(U1i)|2+δ

)1/(2+δ) : 1 ≤ i ≤ N, δ > 0, where U1i ≡√

Ui = (Xi + · · · + Xi+−1)/√

, i ≥ 1. Let ain : i ≥ 1, n ≥ 1 ⊂ [−1, 1]be a collection of real numbers. Then, there exist constants C1 and C2(δ)

Page 556: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

550 17. The Bootstrap

(not depending on f(·), , n, and ain’s), such that for any 1 < < n/2 andany n > 2,

Var( N∑

i=1

ainf(U1i))

≤ min

C1‖f‖2∞n[1 +

∑1≤k≤n/

α(k)],

C2(δ)ζ22+δ,nn

[1 +

∑1≤k≤n/

α(k)δ/(2+δ)]

.

(4.4)

Proof: Let

S(j) =2j∑

i=2(j−1)+1

ainf(U1i), 1 ≤ j ≤ J, (4.5)

where J = N/2 and let S(J + 1) =∑n

i=1 ainf(U1i)−∑J

j=1 S(j). Also,

let∑(1) and

∑(2) respectively denote summation over even and odd j ∈1, . . . , J + 1. Note that for any 1 ≤ j, j + k ≤ J + 1, k ≥ 2, the randomvariables S(j) and S(j+k) depend on disjoint sets of Xi’s that are separatedby (k−1)2− observations in between. Hence, noting that |S(j)| ≤ 2‖f‖∞for all 1 ≤ j ≤ J + 1, by Corollary 16.2.4, one gets |Cov(S(j), S(j + k))| ≤4α((k − 1)2− )(4‖f‖∞)2 for all k ≥ 2, j ≥ 1. Hence,

Var( n∑

i=1

ainf(U1i))

= Var(∑(1)

S(j) +∑(2)

S(j))

≤ 2[Var(∑(1)

S(j))

+ Var(∑(2)

S(j))]

≤ 2[ J+1∑

i=1

ES(j)2 + (J + 1) ·∑

1≤k≤J/2

α((2k − 1)2−

)· 4(4‖f‖∞)2

]

≤ 2[( n

2+ 1)42‖f‖2∞ + 64(n)‖f‖2∞

J∑k=1

α(k)]

≤ C1‖f‖2∞ ·[n + (n)

J∑k=1

α(k)]. (4.6)

This yields the first term in the upper bound. The second termcan be derived similarly by using the inequalities

∣∣Cov(S(j), S(j +

k))∣∣ ≤ C(δ)

(ES(j)2+δ

)1/(2+δ)(ES(j + k)2+δ

)α((k − 1)2 −

)δ/(2+δ) and(ES(j)2+δ

)1/(2+δ) ≤ 2ζ2+δ,n and is left as an exercise.

Page 557: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.4 Properties of the MBB 551

Proof of Theorem 17.4.1: W.l.o.g., let µ = 0. Then, EX2n = O(n−1) as

n →∞. Hence, by (3.3) and Lemma 17.4.2

nE∣∣µn − Xn

∣∣2 ≤ nE

∣∣∣1− n

N

∣∣∣ |Xn|

+ (N)−1∣∣∣∣

∑i=1

(− i)(Xi + Xn−i+1)∣∣∣∣2

≤ 2

(/N)2nE∣∣Xn

∣∣2 + 2nN−2[E

∣∣∣∣∑

i=1

(i/)Xi

∣∣∣∣2

+ E

∣∣∣∣∑

i=1

(i/)X−i

∣∣∣∣2]

= O([/n]2

)+ O

([/n]

)= O(/n) as n →∞. (4.7)

This implies

Eµ2

n ≤ ·

2E∣∣X∣∣2 + 2E

∣∣Xn − µn

∣∣2= O(/n). (4.8)

Hence, it remains to show that N−1∑Ni=1 U2

i −→p σ2∞ as n → ∞. Let

Vin = U21iI(|U1i| ≤ (n/)1/8

)and Win = U2

1i − Vin, 1 ≤ i ≤ N . Then, byLemma 17.4.2,

E

∣∣∣∣N−1N∑

i=1

(Vin − EVin

)∣∣∣∣2

≤ const.(n/)1/2[n + n

∑1≤k<n/

α(k)]/

N2

≤ const.(n/)−1/2[1 +

∑k≥1

α(k)]

= o(1) as n →∞. (4.9)

Next, note that by definition, U11 =√

X. Further, under the conditionsof Theorem 17.4.1,

√nXn −→d N(0, σ2

∞), by Corollary 16.3.6. Hence, bythe EDCT,

limn→∞ EW11 = lim

n→∞ E|U11|2I(|U11| > (n/)1/8

)= 0. (4.10)

Therefore, |EV1n − σ2∞| ≤ E|U11|2I(|U11|8 > n/) + |EU2

11 − σ2∞| = o(1).

Hence, for any ε > 0, by (4.9), (4.10), and Markov’s inequality,

limn→∞ P

(∣∣∣∣N−1N∑

i=1

U2i − σ2

∣∣∣∣ > 3ε

)

Page 558: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

552 17. The Bootstrap

≤ limn→∞ P

(∣∣∣∣N−1N∑

i=1

(Vin − EVin)∣∣∣∣+ ∣∣EV1n − σ2

∞∣∣

+∣∣∣∣N−1

N∑i=1

Win

∣∣∣∣ > 3ε

)

≤ limn→∞ P

(∣∣∣∣N−1N∑

i=1

(Vin − EVin)∣∣∣∣ > ε

)

+ limn→∞ P

(∣∣∣∣N−1N∑

i=1

Win

∣∣∣∣ > ε

)

≤ limn→∞ ε−2E

∣∣∣∣N−1N∑

i=1

(Vin − EVin)∣∣∣∣2

+ limn→∞ ε−1E|W11|

= 0.

This proves Theorem 17.4.1.

17.4.2 Consistency of MBB cdf estimators

The main result of this section is the following:

Theorem 17.4.3: Let Xii∈Z be a sequence of stationary random vari-ables. Suppose that there exists a δ ∈ (0,∞) such that E|X1|2+δ < ∞ and∑∞

n=1 α(n)δ/(2+δ) < ∞. Also, suppose that σ2∞ =

∑i∈Z Cov(X1, X1+i) ∈

(0,∞) and that −1 + n−1 = o(1) and n →∞. Then,

supx∈Rd

∣∣P∗(T ∗n ≤ x)− P (Tn ≤ x)

∣∣ −→p 0 as n →∞, (4.11)

where Tn =√

n(Xn − µ) and T ∗n =

√n(X∗

n − µn) is the MBB version ofTn based on blocks of size .

Theorem 17.4.3 shows that like the MBB variance estimators, the MBBdistribution function estimator is consistent for a wide range of values ofthe block length parameter . Indeed, the conditions on presented in bothTheorems 17.4.1 and 17.4.2 are also necessary for consistency of these boot-strap estimators. If remains bounded, then the block bootstrap methodsfail to capture the dependence structure of the original data sequence andconverge to a wrong normal limit as was noted in Corollary 17.2.3. On theother hand, if goes to infinite at a rate comparable with the sample sizen (violating the condition n−1 = o(1) as n → ∞), then it can be shown(cf. Lahiri (2001)) that the MBB estimators converge to certain randomprobability measures.

Page 559: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.4 Properties of the MBB 553

Proof: Since Tn converges in distribution to N(0, σ2∞), it is enough to

show that

supx

∣∣P∗(T ∗n ≤ x)− Φ(x/σ∞)

∣∣ −→p 0 as n →∞, (4.12)

where Φ(·) denotes the cdf of the standard normal distribution. LetU∗

i = (X∗(i−1)+1 + · · · + X∗

i1)/, 1 ≤ i ≤ b denote the average ofthe ith resampled MBB block. Also, let an = (n/)1/4, n ≥ 1 and let∆n(a) = b−1∑b

i=1 E∗∣∣U∗

i − µn

∣∣2I(√∣∣U∗

i − µn

∣∣ > 2a), a > 0. Note that

conditional on X1, . . . , Xn, U∗1 , . . . , U∗

b are iid random vectors with

P∗(U∗

1 = Uj

)=

1N

for j = 1, . . . , N, (4.13)

where N = n− + 1 and Uj =(Xj + · · ·+ Xj+−1

)/, 1 ≤ j ≤ N . Hence,

by (4.8) and the EDCT, for any ε > 0,

P(∆n(an) > ε

)≤ ε−1E∆n(an)

= ε−1E

E∗∣∣U∗

1 − µn

∣∣2I(√∣∣U∗

1 − µn

∣∣ > 2an

)= ε−1E

∣∣U11 −√

µn

∣∣2I(∣∣U11 −√

µn

∣∣ > 2an

)≤ 4ε−1

[E|U11|2I

(|U11| > an

)+ E|µn|2

]→ 0 as n →∞.

Thus,∆n(an) −→p 0 as n →∞. (4.14)

To prove (4.12), it is enough to show that for any subsequence nkk≥1,there is a further subsequence nkii≥1 ⊂ nkk≥1 (for notational simplic-ity, nki

is written as nki) such that

limi→∞

supx

∣∣P∗(T ∗nki

≤ x)− Φ(x/σ∞)∣∣ = 0 a.s. (4.15)

Fix a subsequence nkk≥1. Then, by (4.14) and Theorem 17.4.1, thereexists a subsequence nkii≥1 of nkk≥1 such that as i →∞,

Var∗(T ∗

nki

)→ σ2

∞ a.s. and ∆nki(anki

) → 0 a.s. (4.16)

Note that T ∗n =

∑bi=1(U

∗i − µn)

√/b is a sum of conditionally iid random

vectors (U∗1 − µn)

√/b, . . . , (U∗

b − µn)√

/b, which, by (4.16), satisfy Lin-deberg’s condition along the subsequence nki, a.s. Hence, by the CLT forindependent random vectors, the conditional distribution of T ∗

nkiconverges

to N(0, σ2∞) as i → ∞, a.s. Hence, by Polya’s theorem (cf. Chapter 8),

(4.15) follows.

Page 560: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

554 17. The Bootstrap

17.4.3 Second order properties of the MBBUnder appropriate regularity conditions, second order correctness (s.o.c.)(cf. Section 17.1.3) of the MBB is known for the normalized and studentizedsample mean. As in the independent case, the proof is based on Edgeworthexpansions for the given pivotal quantities and their block bootstrap ver-sions. Derivation of the Edgeworth expansion under dependence is rathercomplicated. In a seminal work, Gotze and Hipp (1983) developed someconditioning argument and established an asymptotic expansion for thenormalized sum of weakly dependent random vectors. S.o.c. of the MBBcan be established under a similar set of regularity conditions, stated next.

Let Xii∈Z be a sequence of stationary random variables on a probabil-ity space (Ω,F , P ) and let Dj : j ∈ Z be a collection of sub-σ-algebrasof F . For −∞ ≤ a ≤ b ≤ ∞, let Db

a = σ〈∪Dj : j ∈ Z, a ≤ j ≤ b〉. Thefollowing conditions will be used:

(C.1) σ2∞ ≡ lim

n→∞ n−1 Var( n∑

i=1Xi

)∈ (0,∞).

(C.2) There exists δ ∈ (0, 1) such that for all n, m = 1, 2, . . . with m > δ−1,there exists a Dn+m

n−m-measurable random vector X‡n,m satisfying

E∣∣Xn −X‡

n,m

∣∣ ≤ δ−1 exp(−δm).

(C.3) There exists δ ∈ (0, 1) such that for all i ∈ Z, m ∈ N, A ∈ Di−∞, and

B ∈ D∞i+m, ∣∣P (A ∩B)− P (A)P (B)

∣∣ ≤ δ−1 exp(−δm).

(C.4) There exists δ ∈ (0, 1) such that for all m, n, k = 1, 2, . . ., andA ∈ Dn+k

n−k

E∣∣P (A|Dj : j = n)−P (A|Dj : 0 < |j−n| ≤ m+k)

∣∣ ≤ δ−1 exp(−δm).

(C.5) There exists δ ∈ (0, 1) such that for all m, n = 1, 2, . . . with δ−1 <m < n and for all t ∈ Rd with |t| ≥ δ,

E∣∣E exp(ιt · [Xn−m + · · ·+ Xn+m]) | Dj : j = n

∣∣ ≤ exp(−δ).

Condition (C.4) is a strong-mixing condition on the underlying auxil-iary sequence of σ-algebras Dj ’s, that requires the σ-algebras Dj ’s to bestrongly mixing at an exponential rate. For Edgeworth expansions for thenormalized sample mean under polynomial mixing rates, see Lahiri (1996).Condition (C.3) connects the strong mixing condition on the σ-fields Dj ’sto the weak-dependence structure of the random vectors Xj ’s. If, for allj ∈ Z, one sets Dj = σ〈Xj〉, the σ-field generated by Xj , then Condition

Page 561: Springer Texts in Statistics - Tanujit Chakraborty's Blog...Durrett: Essentials of Stochastic Processes Edwards: Introduction to Graphical Modelling, Second Edition Finkelstein and

17.4 Properties of the MBB 555

(C.3) is trivially satisfied with X‡n,m = Xn for all m. However, this choice

of Dj is not always the most useful one for the verification of the rest ofthe conditions.

Condition (C.4) is an approximate Markov-type property, which triviallyholds if Xj is Dj-measurable and Xii∈Z is itself a Markov chain of a finiteorder. Finally, (C.5) is a version of the Cramer condition in the weaklydependent case. Note that if Xj ’s are iid and the σ-algebras Dj ’s are chosenas Dj = σ〈Xj〉, j ∈ Z, then Condition (C.5) is equivalent to requiring thatfor some δ ∈ (0, 1),

1 > e−δ ≥ E∣∣Eexp(ιt ·Xn) | Xj : j = n

∣∣=

∣∣E exp(ιt ·X1)∣∣ for all |t| ≥ δ, (4.17)

which, in turn, is equivalent to the standard Cramer condition

lim sup|t|→∞

∣∣E exp(ιt ·X1)∣∣ < 1. (4.18)

However, for weakly dependent stationary Xj ’s, the standard Cramer con-dition on the marginal distribution of X1 is not enough to ensure a smoothEdgeworth expansion for the normalized sample mean (cf. Gotze and Hipp(1983)).

Conditions (C.2)–(C.5) have been verified for different classes of depen-dent processes, such as, (i) linear processes with iid innovations, (ii) smoothfunctions of Gaussian processes, and (iii) Markov processes, etc. See Gotzeand Hipp (1983) and Lahiri (2003), Chapter 6, for more details.

The next theorem shows that the MBB is s.o.c. for a range of block sizesunder some moment condition and under the regularity conditions listedabove.

Theorem 17.4.4: Let {X_i}_{i∈Z} be a collection of stationary random variables satisfying Conditions (C.1)–(C.5). Let R_n = √n(X̄_n − µ)/σ_∞ and let R*_n be the MBB version of R_n based on blocks of length ℓ, where µ = EX_1 and σ²_∞ is as in (C.1). Suppose that for some δ ∈ (0, 1/3), E|X_1|^{35+δ} < ∞ and

δ n^δ ≤ ℓ ≤ δ^{−1} n^{1/3} for all n ≥ δ^{−1}.    (4.19)

Then

sup_x |P*(R*_n ≤ x) − P(R_n ≤ x)| = O_p(n^{−1}ℓ + n^{−1/2}ℓ^{−1}).    (4.20)

Proof: See Theorem 6.7 (b) of Lahiri (2003).

The moment condition can be reduced considerably if the error bound on the right side of (4.20) is replaced by o(n^{−1/2}) only, and if the range of ℓ values in (4.19) is restricted further (cf. Lahiri (1991)).


Remark 17.4.1: It is worth mentioning that the error bound on the right side of (4.20) is optimal. Indeed, for mean-squared-error (MSE) optimal estimation of the probability P(R_n ≤ x), x ≠ 0, the optimal block size is of the form ℓ_0 ∼ C_1 n^{1/4} for some constant C_1 depending on the joint distribution of the X_i's (cf. Hall, Horowitz and Jing (1995)). This rate can also be deduced by minimizing the expression on the right side of (4.20) as a function of ℓ. On the other hand, the MSE-optimal block size for estimating variance-type functionals is of the form C_1 n^{1/3}. See Chapter 7, Lahiri (2003), for more details.
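To make the resampling scheme behind Theorem 17.4.4 concrete, here is a minimal sketch of the moving block bootstrap for the centered and scaled sample mean. This code is illustrative and not part of the text: the AR(1) data-generating process and the block length of order n^{1/3} are arbitrary choices, and for simplicity the resample mean is centered at the overall sample mean rather than at E*(X̄*_n).

```python
import numpy as np

def mbb_sample_mean(x, ell, n_boot=2000, rng=None):
    """Moving block bootstrap for the sample mean of a stationary series x.

    Returns bootstrap replicates of sqrt(m) * (resample mean - sample mean),
    where each resample is built from k = n // ell overlapping blocks."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = n // ell                        # number of blocks per resample
    starts = np.arange(n - ell + 1)     # all overlapping block start indices
    xbar = x.mean()
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.choice(starts, size=k, replace=True)
        resample = np.concatenate([x[i:i + ell] for i in idx])
        m = len(resample)               # m = k * ell, close to n
        reps[b] = np.sqrt(m) * (resample.mean() - xbar)
    return reps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, phi = 500, 0.5                   # illustrative AR(1) data
    eps = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = eps[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    ell = int(n ** (1 / 3))             # block length of the order n^(1/3)
    reps = mbb_sample_mean(x, ell, rng=1)
    print("bootstrap estimate of Var(sqrt(n) * Xbar_n):", reps.var())
```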

17.5 Problems

17.1 (a) Apply the Lindeberg-Feller CLT to the triangular array {(X*_1, X*_2, . . . , X*_n)}_{n≥1}, given a realization of {X_j}_{j≥1}, to give a direct proof of Theorem 17.1.1.

(b) Show that under the condition of Theorem 17.1.1, (1.5) holds iff for each x ∈ R, P(R_n ≤ x) − P*(R*_n ≤ x) → 0 as n → ∞, a.s.

(Hint: Use Polya's theorem (cf. Chapter 8).)

17.2 Suppose that {X_n}_{n≥1} is a sequence of iid random variables with EX_1² < ∞. Let R_n = √n(X̄_n − µ)/σ and R*_{m,n} = √m(X̄*_m − X̄_n)/s_n be as in (1.4). Show that for any sequence m_n → ∞,

sup_x |P(R_n ≤ x) − P*(R*_{m_n,n} ≤ x)| → 0 as n → ∞ a.s.

17.3 Suppose that {X_n}_{n≥1} is a sequence of iid random variables with µ = EX_1, σ² = Var(X_1) ∈ (0, ∞). Let F_n be an estimator of F such that µ_n = ∫ x dF_n(x) ∈ R and σ²_n = ∫ x² dF_n(x) − (µ_n)² ∈ (0, ∞). Let R_n = √n(X̄_n − µ)/σ and T_n = √n(X̄_n − µ), and let R*_n and T*_n = √n(X̄*_n − µ_n) be the bootstrap versions of R_n and T_n, respectively, based on a resample of size m = n from F_n.

(a) Suppose that for some δ ∈ (0, 1/2),

E*|X*_1 − µ_n|² I(|X*_1 − µ_n| > n^δ σ_n)/σ²_n = o(1) a.s.    (5.1)

Show that

sup_x |P(R_n ≤ x) − P*(R*_n ≤ x)| → 0 as n → ∞, a.s.

(b) Show that if F_n(· − µ_n) −→_d F(· − µ) a.s. and σ²_n → σ² a.s., then

sup_x |P(T_n ≤ x) − P*(T*_n ≤ x)| → 0 as n → ∞, a.s.

(Hint: First verify (5.1).)

17.4 Show that under the conditions of Theorem 17.1.3, (1.15) holds.

(Hint: Set x_n = j_n h/(s_n √n) and η_n = ⟨n x_0/h⟩, where j_n ∈ Z is to be chosen. Check that ⟨√n x_n σ/h − η_n⟩ = ⟨j_n(σ/s_n − 1) − η_n⟩. By the LIL, (σ/s_n − 1) > a_n i.o., a.s., where a_n = n^{−1/2}(log log n)^{1/4}. Now choose any j_n ∈ (σ/s_n − 1)^{−1}(1 − δ + η_n, 1 + η_n) ∩ Z.)

17.5 Let {X_n}_{n≥1} be a sequence of iid random variables with X_1 ∼ N(θ, 1), θ ∈ R. For n ≥ 1, let θ_n = F_n^{−1}(1/2), the sample median of X_1, . . . , X_n, X̄_n = n^{−1}∑_{i=1}^n X_i, the sample mean of X_1, . . . , X_n, and Y²_n = max{1, n^{−1}∑_{i=1}^n (X_i − X̄_n)²}, where F_n is the empirical cdf of X_1, . . . , X_n. Suppose that X*_1, . . . , X*_n are (conditionally) iid with X*_1 ∼ N(θ_n, Y²_n) and let X̄*_n = n^{−1}∑_{i=1}^n X*_i be the (parametric) bootstrap sample mean. Define

T_n = √n(X̄_n − θ), T*_n = √n(X̄*_n − θ_n) and T**_n = √n(X̄*_n − X̄_n).

(a) Show that

lim_{n→∞} sup_x |P(T_n ≤ x) − P*(T*_n ≤ x)| = 0 a.s.

(b) Show that

lim_{n→∞} sup_x |P*(T**_n ≤ x) − Φ(Y_n^{−1}[x − W_n])| = 0 a.s.

for some random variables W_n such that W_n −→_d W.

(c) Find the distribution of W in part (b).

(Hint: Use the Bahadur representation (cf. Bahadur (1966)) for sample quantiles.)

(d) Show that Y²_n −→_p 1 as n → ∞.

(e) Conclude that there exists ε > 0 such that

lim_{n→∞} P(sup_{x∈R} |P(T_n ≤ x) − P*(T**_n ≤ x)| > ε) > ε.

(Thus, T**_n is an incorrect parametric bootstrap version of T_n.)

17.6 Let {X_n}_{n≥1} be a sequence of iid random variables with E(X_1) = µ, Var(X_1) = σ² ∈ (0, ∞) and E|X_1|⁴ < ∞. Let X*_1, . . . , X*_n denote the nonparametric bootstrap sample drawn randomly with replacement from {X_1, . . . , X_n} and X̄*_n = n^{−1}∑_{i=1}^n X*_i. Let

T_n = √n(X̄_n − µ) and T*_n = √n(X̄*_n − X̄_n).

Show that there exists a constant K ∈ (0, ∞) such that

lim sup_{n→∞} [√n/√(log log n)] sup_{x∈R} |P(T_n ≤ x) − P*(T*_n ≤ x)| = K a.s.

Find K. (This shows that the bootstrap approximation for T_n is less accurate than that for the normalized sample mean.)

17.7 (a) Let F and G be two cdfs on R and let F be continuous. Show that

sup_x |F(x) − G(x)| = sup_x |F(x−) − G(x−)|.

(b) Let {X_n}_{n≥1} be a sequence of iid random variables with EX_1 = µ, Var(X_1) = σ² ∈ (0, ∞) and E|X_1|³ < ∞. Let R_n = √n(X̄_n − µ)/σ, and let R*_n be its bootstrap version based on a resample of size n from the edf F_n.

(i) Show that

sup_x |P(R_n < x) − Φ(x)| = o(n^{−1/2}) as n → ∞,

where Φ(·) denotes the cdf of the N(0, 1) distribution.

(ii) Show that if the distribution of X_1 is nonlattice, then

sup_x |P(R_n < x) − P*(R*_n < x)| = o(n^{−1/2}) a.s.

17.8 Let R*_{m,n} be the bootstrap version of a random variable R_n, n ≥ 1, and let R_n −→_d R_∞, where R_∞ has a continuous distribution on R. Let φ_{m,n}(t) = E* exp(ιtR*_{m,n}) and φ_n(t) = E exp(ιtR_n), 1 ≤ n ≤ ∞.

(a) Show that if

sup_x |P*(R*_{m,n} ≤ x) − P(R_n ≤ x)| −→_p 0 as n → ∞,

then for every t ∈ R, φ_{m,n}(t) −→_p φ_∞(t) as n → ∞.

(b) Suppose that there exists a sequence {h_n}_{n≥1} such that

sup_{t∈R} |φ_{m,n}(t) − φ_{m,n}(t + h_n)| −→_p 0 as n → ∞.    (5.2)

Then, the converse to (a) holds.

(c) Let W_α and T*_{m,n} be as in (1.23) and (1.27), respectively. Show that (1.40) implies (1.39).

(Hint: (b) Let Y_n(t) ≡ |φ_{m,n}(t) − φ_∞(t)|, t ∈ R, and let Q = {q_1, q_2, . . .} be an enumeration of the rationals in R. Given a subsequence {n_{0j}}_{j≥1}, extract a subsequence {n_{1j}}_{j≥1} ⊂ {n_{0j}}_{j≥1} such that Y_{n_{1j}}(q_1) → 0 as j → ∞, a.s. Next extract {n_{2j}} ⊂ {n_{1j}} such that Y_{n_{2j}}(q_2) → 0 as j → ∞, a.s., and continue this for each q_k ∈ Q. Let n_j ≡ n_{jj}, j ≥ 1. Then, there exists a set A with P(A) = 1, and on A,

Y_{n_j}(q_k) → 0 as j → ∞ for all q_k ∈ Q.

Now use (5.2) to show that w.p. 1, Y_{n'_j}(t) → 0 as j → ∞ for all t ∈ R, for some {n'_j} ⊂ {n_j}.

(c) Use the inequality

sup_t |E exp(ιtX) − E exp(ι(t + h)X)| ≤ hE|X|

and the fact that under (1.21), |X_{(n)}| + |X_{(1)}| = O(n^β) as n → ∞, a.s. for any β > 1/α.)

17.9 Verify (1.45) in the proof of Theorem 17.1.4.

17.10 (Athreya (1987a)). Let {X_n}_{n≥1} be a sequence of iid random variables satisfying (1.21) and let T_n be as in (1.23). Let X*_1, . . . , X*_n be conditionally iid random variables with cdf F_n of (1.1). Define an alternative bootstrap version of T_n as

T*_{n,n} = n(X̄*_n − X̄_n)/X_{(n)}

where X̄*_n = n^{−1}∑_{i=1}^n X*_i, X̄_n = n^{−1}∑_{i=1}^n X_i and X_{(n)} = max{X_1, . . . , X_n}. (Here the scaling constants a_n are replaced by the bootstrap analog of (1.26), i.e., by n(1 − F_n(a_n)) → 1, which yields a_n = X_{(n)}.) Let

φ_n(t) = E* exp(ιtT*_{n,n}), t ∈ R.

Show that φ_n(t) converges in distribution to a random limit as n → ∞. Identify the limit.

17.11 Suppose that {X_n}_{n≥1} satisfies the conditions of Theorem 17.1.4 and that T*_{m,n} is as in (1.27). Show that for any t_1, . . . , t_k ∈ R, k ∈ N,

(φ_n(t_1), . . . , φ_n(t_k)) −→_d (exp(∫ h(t_1, x) dM(x)), . . . , exp(∫ h(t_k, x) dM(x))),

where φ_n(t) ≡ E* exp(ιtT*_{m,n}), t ∈ R.


17.12 Show that (4.17) and (4.18) are equivalent.

17.13 Let {X_i}_{i∈Z} be a sequence of stationary random variables. Let T_n = √n(X̄_n − µ) and T*_n = √n(X̄*_n − µ_n), where X̄*_n is the MBB sample mean based on blocks of length ℓ, with ℓ^{−1} + ℓ n^{−1} = o(1) as n → ∞. Suppose that X_1 has finite moments of all orders and that {X_i} is m-dependent for some m ∈ Z_+. Assume that the Berry-Esseen theorem holds for X̄_n.

(a) Show that

∆_n(ℓ) ≡ sup_x |P(T_n ≤ x) − P*(T*_n ≤ x)| = O_p((ℓ n^{−1})^{1/2} + ℓ^{−1}).

(b) Find the limit distribution of n^{1/3}∆_n(ℓ) for ℓ = n^{1/3}, where ∆_n(ℓ) is as in part (a).

17.14 Suppose that {X_i}_{i∈Z}, X̄*_n, ℓ, etc. are as in Problem 17.13. Let

R_n = √n(X̄_n − µ)/σ_n and R**_n = √n(X̄*_n − X̄_n)/σ_n(ℓ), where σ²_n = n Var(X̄_n) and σ²_n(ℓ) = n Var*(X̄*_n).

(a) Show that

∆_n(ℓ) ≡ sup_x |P(R_n ≤ x) − P*(R**_n ≤ x)| = O_p(n^{−1/2}ℓ^{1/2}).

(b) Find the limit distribution of n^{1/2}ℓ^{−1/2}∆_n(ℓ).

17.15 Let {X_i}_{i∈Z}, X̄*_n, ℓ be as in Problem 17.13.

(a) Find the leading terms in the expansion of MSE(σ²_n(ℓ)) ≡ E(σ²_n(ℓ) − σ²_n)², where σ²_n(ℓ) = n Var*(X̄*_n) and σ²_n = n Var(X̄_n).

(b) Find the MSE-optimal block size for estimating σ²_n.

17.16 Let {X_i}_{i∈Z}, T_n, T*_n, ℓ be as in Problem 17.13. For α ∈ (0, 1), let

t_{α,n} = the α-quantile of T_n = inf{x : x ∈ R, P(T_n ≤ x) ≥ α}

and let t̂_{α,n}(ℓ) be the α-quantile of T*_n = inf{x : x ∈ R, P*(T*_n ≤ x) ≥ α}.

(a) Show that

[t̂_{α,n}(ℓ) − t_{α,n}] −→_p 0 as n → ∞.

(b) Suppose that ℓ = n^{1/3} and write t̂_{α,n} = t̂_{α,n}(n^{1/3}). Find a sequence {a_n}_{n≥1} such that

a_n(t̂_{α,n} − t_{α,n}) −→_d Z

for some nondegenerate random variable Z. Identify the distribution of Z.


18
Branching Processes

The study of the growth and development of many species of animals,plants, and other organisms over time may be approached in the followingmanner. At some point in time, a set of individuals called ancestors (orthe zeroth generation) is identified. These produce offspring, and the col-lection of offspring of all the ancestors constitutes the first generation. Theoffspring of these first generation individuals constitute the second genera-tion and so on. If one specifies the rules by which the offspring productiontakes place, then one could study the behavior of the long-time evolutionof such a process, called a branching process. Questions of interest are thelong-term survival or extinction of such a process, the growth rate of sucha population, fluctuation of population sizes, effects of control mechanisms,etc. In this chapter, several simple mathematical models and some resultsfor these will be discussed. Many of the assumptions made, especially theone about independence of lines of descent, are somewhat idealistic andunrealistic. Nevertheless, the models do prove useful in answering somegeneral questions, since many results about long-term behavior stay valideven when the model deviates somewhat from the basic assumptions.

It is worth noting that the models described below are relevant and ap-plicable not only to population dynamics as mentioned above but also toany evolution that has a tree-like structure, such as the process of electronmultiplication, gamma ray radiation, growth of sentences in context-freegrammars, algorithmic steps, etc. Thus, the theory of branching processeshas found applications in cosmic ray showers, data structures, combina-torics, and molecular biology, especially DNA sequencing. Some referencesto these applications may be found in Athreya and Jagers (1997).


18.1 Bienaymé-Galton-Watson branching process

One of the earliest and simplest models is the Bienaymé-Galton-Watson (BGW) branching process. In this model, it is assumed that the offspring of each individual is a random variable with a probability distribution {p_j}_{j≥0} and that the offspring of different individuals are stochastically independent. Thus, if Z_0 is the size of the zeroth generation, then Z_1 = ∑_{i=1}^{Z_0} ζ_{0,i}, where {ζ_{0,i} : i = 1, 2, . . .} are independent random variables with the same distribution {p_j}_{j≥0} and also independent of Z_0. Here ζ_{0,i} is to be thought of as the number of offspring of the ith individual in the zeroth generation. Similarly, if Z_n denotes the nth generation population, then Z_{n+1} = ∑_{i=1}^{Z_n} ζ_{ni}, where {ζ_{ni} : i = 1, 2, . . .} are iid random variables with distribution {p_j}_{j≥0} and also independent of Z_0, Z_1, . . . , Z_n. This implies that the lines of descent initiated by different individuals of a given generation evolve independently of each other (Problem 18.3). This may not be very realistic when there is competition for limited resources such as space and food in the habitat. The long-term behavior of the sequence {Z_n}_{n≥0} is crucially dependent on the parameter m ≡ ∑_{j=0}^∞ j p_j, the mean offspring size. The BGW branching process {Z_n}_{n≥0} with offspring distribution {p_j}_{j≥0} is called supercritical, critical, and subcritical according as m > 1, = 1, or < 1, respectively. In what follows, the case when p_i = 1 for some i is excluded. The main results are the following; see Athreya and Ney (2004) for details and full proofs.
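As an aside before the theorems, here is a minimal simulation sketch of the defining recursion Z_{n+1} = ∑_{i=1}^{Z_n} ζ_{ni}. The code is not from the text, and the Poisson offspring law with mean 1.2 is an arbitrary illustrative choice.

```python
import numpy as np

def simulate_bgw(n_gen, offspring_sampler, z0=1, rng=None):
    """Simulate Z_0, Z_1, ..., Z_{n_gen} of a BGW branching process.

    offspring_sampler(size, rng) must return `size` iid offspring counts."""
    rng = np.random.default_rng(rng)
    z = [z0]
    for _ in range(n_gen):
        zn = z[-1]
        if zn == 0:                     # 0 is absorbing: extinction
            z.append(0)
            continue
        z.append(int(offspring_sampler(zn, rng).sum()))
    return z

if __name__ == "__main__":
    # supercritical example: Poisson offspring with mean m = 1.2
    poisson = lambda size, rng: rng.poisson(1.2, size)
    print(simulate_bgw(20, poisson, rng=0))
```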

Theorem 18.1.1: (Extinction probability).

(a) m ≤ 1 ⇒ Zn = 0 for all large n with probability one (w.p. 1).

(b) m > 1 ⇒ P(Z_n → ∞ as n → ∞ | Z_0 = 1) = 1 − q > 0, where q ≡ P(Z_n = 0 for all large n | Z_0 = 1) is the smallest root in [0, 1] of

s = f(s) ≡ ∑_{j=0}^∞ p_j s^j.    (1.1)

Theorem 18.1.1 says that if m ≤ 1, then the process becomes extinct in finite time w.p. 1, whereas if m > 1, then the extinction probability q (with Z_0 = 1) is strictly less than one, and when the process does not become extinct, it grows to infinity.

Proof: Since {Z_n}_{n≥0} is a Markov chain with state space Z_+ ≡ {0, 1, 2, 3, . . .} and 0 is an absorbing barrier, all nonzero states are transient. Thus P(Z_n = 0 for some n ≥ 1) + P(Z_n → ∞ as n → ∞) = 1.

Also,

q ≡ P(Z_n = 0 for some n ≥ 1 | Z_0 = 1)
  = lim_{n→∞} P(Z_n = 0 | Z_0 = 1)
  = lim_{n→∞} f_n(0),

where f_n(s) ≡ E(s^{Z_n} | Z_0 = 1), 0 ≤ s ≤ 1. But by the definition of Z_n,

f_{n+1}(s) = E(s^{Z_{n+1}} | Z_0 = 1)
           = E(E(s^{Z_{n+1}} | Z_n) | Z_0 = 1)
           = E((f(s))^{Z_n} | Z_0 = 1)
           = f_n(f(s)).

Iterating this yields f_n(s) = f^{(n)}(s), n ≥ 0, where f^{(n)}(s) is the nth iterate of f(s). By continuity of f(s) on [0, 1],

q = lim_{n→∞} f_n(0) = lim_{n→∞} f(f_n(0)) = f(q).

If q′ is any other solution of (1.1) in [0, 1], then q′ = f^{(n)}(q′) ≥ f^{(n)}(0) for each n and hence q′ ≥ lim_{n→∞} f^{(n)}(0) = q.

This establishes (1.1). It is not difficult to show, by the convexity of the function f(s) on [0, 1], that q = 1 if m ≡ f′(1) ≤ 1 and p_0 ≤ q < 1 if m > 1.
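The proof also suggests a simple numerical scheme, since q = lim_{n→∞} f^{(n)}(0): iterate f starting from 0. A short sketch (not from the text; the offspring law p_0 = 0.2, p_1 = 0.3, p_2 = 0.5, so m = 1.3, is an illustrative choice):

```python
def extinction_probability(p, n_iter=200):
    """Smallest root of s = f(s) for offspring distribution p = (p_0, p_1, ...),
    computed as the limit of the iterates f^(n)(0)."""
    f = lambda s: sum(pj * s**j for j, pj in enumerate(p))
    s = 0.0
    for _ in range(n_iter):
        s = f(s)
    return s

# p_0 = 0.2, p_1 = 0.3, p_2 = 0.5  =>  m = 1.3 > 1, and the smallest root of
# 0.5*s**2 - 0.7*s + 0.2 = 0 is q = 0.4
print(extinction_probability([0.2, 0.3, 0.5]))  # approximately 0.4
```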

The following results are refinements of Theorem 18.1.1.

Theorem 18.1.2: (Supercritical case). Let m > 1, Z_0 = z_0, 1 ≤ z_0 < ∞, and W_n ≡ Z_n/m^n. Then

(a) {W_n}_{n≥0} is a nonnegative martingale and hence converges w.p. 1 to a limit W.

(b) ∑_{j=1}^∞ (j log j)p_j < ∞ ⇒ P(W = 0) = q^{z_0}, EW = z_0, and W has a strictly positive density on (0, ∞).

(c) ∑_{j=1}^∞ (j log j)p_j = ∞ ⇒ P(W = 0) = 1.

(d) There always exists a sequence {C_n}_{n≥0} such that lim_{n→∞} Z_n/C_n ≡ W exists and is finite w.p. 1, P(W = 0) = q, and C_{n+1}/C_n → m.

This theorem says that in a supercritical BGW process, on the event of nonextinction the process grows exponentially fast, confirming the assertion of the economist and demographer Malthus of the 19th century.

Proof: Since {Z_n}_{n≥0} is a Markov chain,

E(Z_{n+1} | Z_0, Z_1, . . . , Z_n) = E(Z_{n+1} | Z_n) = Z_n m,

implying E(W_{n+1} | W_0, W_1, . . . , W_n) = W_n, proving (a).

Parts (b) and (c) are known as the Kesten-Stigum theorem; for a full proof see Athreya and Ney (2004), where a proof of part (d) is also given. For a weaker version of (b), see Problem 18.2.

Theorem 18.1.3: (Critical case). Let m = 1 and Z_0 = z_0, 1 ≤ z_0 < ∞. Let 0 < σ² = ∑_{j=1}^∞ j²p_j − 1 < ∞. Then

lim_{n→∞} P(Z_n/n ≤ x | Z_n ≠ 0) = 1 − exp(−2x/σ²).    (1.2)

Thus, in the critical case, conditioned on nonextinction by time n, the population in the nth generation is of order n, which when divided by n is distributed approximately as an exponential with mean σ²/2.

Theorem 18.1.4: (Subcritical case). Let m < 1 and Z_0 = z_0, 1 ≤ z_0 < ∞. Then

lim_{n→∞} P(Z_n = j | Z_n > 0) = π_j    (1.3)

exists for all j ≥ 1 and ∑_{j=1}^∞ π_j = 1. Furthermore, ∑_{j=1}^∞ jπ_j < ∞ if and only if ∑_{j=1}^∞ (j log j)p_j < ∞.

For proofs of Theorems 18.1.3 and 18.1.4, see Athreya and Ney (2004). In the supercritical case with p_0 = 0, it is possible to estimate consistently the mean m and the variance σ² of the offspring distribution from observing the population size sequence {Z_n}_{n≥0}, but the whole distribution {p_j}_{j≥0} is not identifiable. However, if the entire tree is available, then {p_j}_{j≥0} is identifiable.
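For instance, a commonly used consistent estimator of m on the survival set is the ratio (Harris-type) estimator m̂_n = (Z_1 + · · · + Z_n)/(Z_0 + · · · + Z_{n−1}). The following short sketch (not from the text) computes it from an observed trajectory of generation sizes; the trajectory shown is hypothetical.

```python
def harris_estimator(z):
    """Ratio estimator of the offspring mean m from observed generation
    sizes z = [Z_0, Z_1, ..., Z_n]; consistent on the survival set."""
    den = sum(z[:-1])
    return sum(z[1:]) / den if den > 0 else float("nan")

# hypothetical observed trajectory of generation sizes
print(harris_estimator([1, 2, 3, 4, 6, 7, 10]))  # about 1.39
```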

18.2 BGW process: Multitype case

The model discussed in the previous section has a natural generalization. Consider a population with k types of individuals, 1 < k < ∞. Assume that a type i individual can produce offspring of all types with a probability distribution that may depend on i but is independent of the past history as well as of the other individuals in the same generation. This ensures that the lines of descent initiated by different individuals in a given generation are independent, and those initiated by individuals of the same type are iid as well. The dichotomy of extinction or infinite growth continues to hold. The analog of the key parameter m of the single type case is the maximal eigenvalue ρ of the mean matrix M ≡ ((m_{ij}))_{k×k}, where m_{ij} is the mean number of type j offspring of an individual of type i.


Theorem 18.2.1: (Extinction probability). Let M ≡ ((m_{ij})), the mean matrix, be such that for some n_0 > 0, M^{n_0} >> 0, i.e., all the entries of M^{n_0} are strictly positive. Let ρ be the maximal eigenvalue of M. Then

(a) ρ ≤ 1 ⇒ P(the population is extinct in finite time) = 1, for any initial condition.

(b) ρ > 1 ⇒ P(the population is extinct in finite time) = 1 − P(the population grows to infinity) < 1 for all initial conditions other than 0 ancestors.

Furthermore, if q_i = P(extinction starting with one ancestor of type i), then the vector q = (q_1, q_2, . . . , q_k) is the smallest root in [0, 1]^k of the equation s = f(s), where s = (s_1, s_2, . . . , s_k) and f(s) = (f_1(s), . . . , f_k(s)), with f_i(s) being the generating function of the offspring distribution of a type i individual.

The following refinements of Theorem 18.2.1 are the analogs of Theorems 18.1.2, 18.1.3, and 18.1.4. It is known that for the maximal eigenvalue ρ of M, there exist strictly positive eigenvectors u = (u_1, u_2, . . . , u_k) and v = (v_1, v_2, . . . , v_k) such that uM = ρu and Mv^T = ρv^T (where the superscript T stands for transpose), normalized to satisfy ∑_{i=1}^k u_i = 1 and ∑_{i=1}^k u_i v_i = 1. Let Z_n = (Z_{n1}, Z_{n2}, . . . , Z_{nk}) denote the vector of population sizes of individuals of the k types in the nth generation.

Theorem 18.2.2: (Supercritical case). Let 1 < ρ < ∞. Let

W_n = (v · Z_n)/ρ^n = (∑_{i=1}^k v_i Z_{ni})/ρ^n.    (2.1)

Then {W_n} is a nonnegative martingale and hence converges w.p. 1 to a limit W, and Z_n/(v · Z_n) → u on the event of nonextinction.

This theorem says that when the process does not die out it does go to ∞, and the sizes of individuals of different types grow at the same rate and are aligned in the direction of the vector u. There are also appropriate analogs of Theorem 18.1.2 (b), (c), and (d).

If ζ is a k-vector such that ζ · u ≠ 0, then the sequence Y_n ≡ ζ · Z_n/ρ^n converges w.p. 1 to (ζ · u)W. If ζ · u = 0, then ρ^n is not the right normalization for ζ · Z_n. The problem of the correct normalization for such vectors, as well as the proofs of Theorems 18.2.2–18.2.4, are discussed in Chapter 5 of Athreya and Ney (2004).

Theorem 18.2.3: (Critical case). Let Z_0 = z_0 ≠ 0 and ρ = 1. Let all the offspring distributions possess finite second moments. Then

lim_{n→∞} P((v · Z_n)/n ≤ x | Z_n ≠ 0) = 1 − e^{−λx}    (2.2)

for some 0 < λ < ∞, and

lim_{n→∞} P(‖Z_n/(v · Z_n) − u‖ > ε | Z_n ≠ 0) = 0    (2.3)

for each ε > 0, where ‖ · ‖ is the Euclidean distance.

Theorem 18.2.4: (Subcritical case). Let ρ < 1 and Z_0 = z_0 ≠ 0. Then

lim_{n→∞} P(Z_n = j | Z_n ≠ 0) = π_j    (2.4)

exists for all vectors j = (j_1, j_2, . . . , j_k) ≠ 0 and ∑_j π_j = 1.

18.3 Continuous time branching processes

The models discussed in the past two sections deal with branching processesin which individuals live for exactly one unit of time and are replaced by arandom number of offspring. A model in which individuals live a randomlength of time and then produce a random number of offspring is discussedbelow.

Assume that each individual lives a random length of time with distribution function G(·) and then produces a random number of offspring with distribution {p_j}_{j≥0}; these two random quantities are independent and, further, independently distributed over all individuals in the population. Let Z(t) denote the population size and Y(t) the set of ages of all individuals alive at time t. Assume G(0) = 0 and m ≡ ∑_{j=1}^∞ j p_j < ∞. The offspring mean m continues to be the key parameter.

When G(·) is exponential, the process {Z(t) : t ≥ 0} is a continuous time Markov chain. In the general case, the vector process {(Z(t), Y(t)) : t ≥ 0} is a Markov process.

Theorem 18.3.1: (Extinction probability). Let q = P [Z(t) = 0 for somet > 0] when Z(0) = 1 and Y (0) = 0. Then m ≤ 1 ⇒ q = 1 andm > 1 ⇒ q < 1 and P [Z(t) →∞] = 1− P [Z(t) = 0 for some t > 0].

The following refinements of the above Theorem 18.3.1 are analogs ofTheorems 18.1.2–18.1.4.

When m > 1, the effect of random lifetimes is expressed through the Malthusian parameter α defined by

m ∫_{(0,∞)} e^{−αt} dG(t) = 1.    (3.1)
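Numerically, α is the unique positive root of φ(a) ≡ m ∫_{(0,∞)} e^{−at} dG(t) − 1 when m > 1, since φ is strictly decreasing with φ(0) = m − 1 > 0. A root-finding sketch (not from the text; the Gamma(2, 1) lifetime distribution is an illustrative choice, and availability of scipy is assumed):

```python
import numpy as np
from scipy import integrate, optimize, stats

def malthusian_alpha(m, lifetime_pdf):
    """Solve m * int_0^inf exp(-a t) g(t) dt = 1 for a, where g is the
    lifetime density of G; requires m > 1 so that the root is positive."""
    def phi(a):
        val, _ = integrate.quad(lambda t: np.exp(-a * t) * lifetime_pdf(t), 0, np.inf)
        return m * val - 1.0
    hi = 1.0
    while phi(hi) > 0:          # phi(0) = m - 1 > 0; grow hi until phi(hi) < 0
        hi *= 2.0
    return optimize.brentq(phi, 0.0, hi)

if __name__ == "__main__":
    g = stats.gamma(a=2.0).pdf          # Gamma(shape=2, scale=1) lifetimes
    alpha = malthusian_alpha(2.0, g)    # offspring mean m = 2
    print(alpha)                        # about sqrt(2) - 1
    print(2.0 / (1.0 + alpha) ** 2)     # closed-form check: should be 1
```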

The reproductive age value V(x) of an individual of age x is

V(x) ≡ m (∫_{[x,∞)} e^{−αt} dG(t)) [1 − G(x)]^{−1}.    (3.2)


If m(t) ≡ EZ(t) when one starts with one individual of age 0, then m(·) satisfies the integral equation

m(t) = 1 − G(t) + m ∫_{(0,t)} m(t − u) dG(u).    (3.3)

It can be shown using renewal theory (cf. Section 8.5) that m(t)e^{−αt} converges to a finite positive constant (Problem 18.6).

Theorem 18.3.2: (Supercritical case). Let m > 1. Then

(a) W(t) ≡ e^{−αt} ∑_{i=1}^{Z(t)} V(x_i), where x_1, . . . , x_{Z(t)} are the ages of the individuals alive at t, is a nonnegative martingale and converges w.p. 1 to a limit W, where V(·) is as in (3.2),

(b) ∑_{j=1}^∞ (j log j)p_j = ∞ ⇒ W = 0 w.p. 1,

(c) ∑_{j=1}^∞ (j log j)p_j < ∞ ⇒ P(W = 0 | Y_0 = 0) = q and E(W | Y_0 = x) = V(x),

(d) on the event of nonextinction, w.p. 1 the empirical age distribution at time t, A(x, t) ≡ (number of individuals alive at t with age ≤ x)/(number of individuals alive at t), converges in distribution as t → ∞ to the steady-state age distribution

A(x) ≡ (∫_0^x e^{−αy}[1 − G(y)] dy) / (∫_0^∞ e^{−αy}[1 − G(y)] dy),    (3.4)

(e) ∑_{j=1}^∞ (j log j)p_j < ∞ ⇒ Z(t)e^{−αt} converges w.p. 1 to W [∫_0^∞ V(x) dA(x)]^{−1}, where W is as in (a).

Theorem 18.3.3: (Critical case). Let m = 1 and σ² = ∑_j (j − 1)² p_j < ∞. Assume lim_{t→∞} t²(1 − G(t)) = 0 and ∫_0^∞ t dG(t) = µ. Then, for any initial Z_0 ≠ 0,

(a) lim_t P[Z(t)/t ≤ x | Z(t) > 0] = 1 − exp[(−2µ/σ²)x], 0 < x < ∞.

(b) lim_t P[sup_x |A(x, t) − A(x)| > ε | Z(t) > 0] = 0 for any ε > 0, where A(·, t) and A(x) are as in Theorem 18.3.2 (d) with α = 0.

Theorem 18.3.4: (Subcritical case). Let m < 1. Then for any initial Z_0 ≠ 0 and G(·) nonlattice (cf. Chapter 10),

lim_{t→∞} P[Z(t) = j | Z(t) > 0] = π_j    (3.5)

exists for all j ≥ 1 and ∑_{j=1}^∞ π_j = 1.


18.4 Embedding of Urn schemes in continuous time branching processes

It turns out that many urn schemes can be embedded in continuous time branching processes. The case of Polya's urn is discussed below.

Recall that Polya's urn scheme is the following. Let an urn have an initial composition of R_0 red and B_0 black balls. A draw consists of taking a ball at random from the urn, noting its color, and returning it to the urn with one more ball of the color drawn. Let (R_n, B_n) denote the composition after n draws. Clearly, R_n + B_n = R_0 + B_0 + n for all n ≥ 0, and {(R_n, B_n)}_{n≥0} is a Markov chain.

Let {Z_i(t) : t ≥ 0}, i = 1, 2, be two independent continuous time branching processes with unit exponential lifetimes and offspring distribution of binary splitting, i.e., p_2 = 1, and with Z_1(0) = R_0, Z_2(0) = B_0. Let τ_0 = 0 < τ_1 < τ_2 < . . . < τ_n < . . . denote the successive times of death of an individual in the combined population. Then the sequence {(Z_1(τ_n), Z_2(τ_n))}_{n≥0} has the same distribution as {(R_n, B_n)}_{n≥0}.

To establish this claim, by the Markov property of {(Z_1(t), Z_2(t))}_{t≥0}, it suffices to show that (Z_1(τ_1), Z_2(τ_1)) has the same distribution as (R_1, B_1). It is easy to show that if {η_i : i = 1, 2, . . . , n} are independent exponential random variables with parameters λ_i, i = 1, 2, . . . , n, then η ≡ min{η_i : 1 ≤ i ≤ n} is an Exponential(∑_{i=1}^n λ_i) random variable and P(η = η_i) = λ_i/(∑_{j=1}^n λ_j) (Problem 18.8). This, in turn, leads to the fact that at time τ_1, the probability that a split takes place in {Z_1(t) : t ≥ 0} is Z_1(0)/(Z_1(0) + Z_2(0)). At this split, the parent is lost but is replaced by two new individuals, resulting in a net addition of one more individual, establishing the claim.
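A small simulation sketch of both sides of this claim (not from the text): the Polya urn chain simulated directly, and the embedded pair of binary-splitting populations with Exp(1) lifetimes observed at the successive death times, using the memoryless property so that the minimum of z residual Exp(1) clocks is Exp(z).

```python
import numpy as np

def polya_urn(r0, b0, n_draws, rng):
    """Direct simulation of Polya's urn: (R_n, B_n) after n draws."""
    r, b = r0, b0
    for _ in range(n_draws):
        if rng.random() < r / (r + b):
            r += 1
        else:
            b += 1
    return r, b

def embedded_urn(r0, b0, n_deaths, rng):
    """Two independent binary-splitting populations with Exp(1) lifetimes,
    observed at the successive death times tau_1 < tau_2 < ....  The minimum
    of z iid Exp(1) clocks is Exp(z), so the next death falls in population 1
    iff its Exp(z1) clock rings before population 2's Exp(z2) clock."""
    z1, z2 = r0, b0
    for _ in range(n_deaths):
        t1 = rng.exponential(1.0 / z1)   # time to next death in population 1
        t2 = rng.exponential(1.0 / z2)   # time to next death in population 2
        if t1 < t2:
            z1 += 1    # the dying individual is replaced by two offspring
        else:
            z2 += 1
    return z1, z2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    draws = [polya_urn(2, 3, 100, rng)[0] for _ in range(5000)]
    emb = [embedded_urn(2, 3, 100, rng)[0] for _ in range(5000)]
    print(np.mean(draws), np.mean(emb))   # the two averages should be close
```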

The same reasoning yields the embedding of the following general urn scheme. Let X_n = (X_{n1}, . . . , X_{nk}) be the vector of the composition of an urn at time n, where X_{ni} is the number of balls of color i. Assume that given (X_0, X_1, . . . , X_n), X_{n+1} is generated as follows.

Pick a ball at random from the urn. If it happens to be of color i, then return it to the urn along with a random number ζ_{ij} of balls of color j, j = 1, 2, . . . , k, where the joint distribution of ζ_i ≡ (ζ_{i1}, ζ_{i2}, . . . , ζ_{ik}) depends on i, i = 1, 2, . . . , k. Now set X_{n+1} = X_n + ζ_i.

The embedding is done as follows. Consider a continuous time multitype branching process {Z(t) : t ≥ 0} with Exponential(1) lifetimes in which the offspring distribution of the ith type is the same as that of ζ̃_i ≡ ζ_i + δ_i, where ζ_i is as above and δ_i is the ith unit vector. For i = 1, 2, . . . , k, let {Z_i(t) : t ≥ 0} be a branching process that evolves as {Z(t) : t ≥ 0} above but has initial size Z_i(0) ≡ (0, 0, . . . , X_{0i}, 0, . . . , 0). Let 0 = τ_0 < τ_1 < τ_2 < . . . denote the times at which deaths occur in the process obtained by pooling all the k processes. Then {(Z_i(τ_n) : i = 1, 2, . . . , k)}, n = 0, 1, 2, . . ., has the same distribution as {(X_{ni}, i = 1, 2, . . . , k)}, n ≥ 0.


This embedding has been used to prove limit theorems for urn models. See Athreya and Ney (2004), Chapter 5, for details. For applications to clinical trials, see Rosenberger (2002).

18.5 Problems

18.1 Show that for any probability distribution {p_j}_{j≥0}, f(s) = ∑_{j=0}^∞ p_j s^j is convex on [0, 1]. Show also that there exists a q ∈ [0, 1) such that f(q) = q iff m = f′(1−) > 1.

18.2 Assume ∑_{j=1}^∞ j²p_j < ∞.

(a) Let v_n = V(Z_n | Z_0 = 1). Show that v_{n+1} = V(Z_n m | Z_0 = 1) + E(Z_n σ² | Z_0 = 1), where m = E(Z_1 | Z_0 = 1) and σ² = V(Z_1 | Z_0 = 1), and hence v_{n+1} = m²v_n + σ²m^n.

(b) Conclude from (a) that sup_n EW²_n < ∞, where W_n = Z_n/m^n.

(c) Using the fact that {W_n}_{n≥0} is a martingale, show that if ∑ j²p_j < ∞ then W_n converges w.p. 1 and in L² to a random variable W such that E(W | Z_0 = 1) = 1.

18.3 By definition, the sequence {Z_n}_{n≥0} of population sizes satisfies the random iteration scheme

Z_{n+1} = ∑_{i=1}^{Z_n} ζ_{ni}

where {ζ_{ni} : i = 1, 2, . . . ; n = 1, 2, . . .} is a doubly infinite sequence of iid random variables with distribution {p_j}.

(a) (Independence of lines of descent). Establish the property that for any k ≥ 0, if Z_0 = k then {Z_n}_{n≥0} has the same distribution as {∑_{j=1}^k Z^{(j)}_n}_{n≥0}, where {Z^{(j)}_n}_{n≥0}, j ≥ 1, are iid copies of {Z_n}_{n≥0} with Z_0 = 1.

(b) In the context of Theorem 18.1.2, show that if Z_0 = 1 then W ≡ lim W_n can be represented as

W = (1/m) ∑_{j=1}^{Z_1} W^{(j)}

where Z_1, W^{(j)}, j = 1, 2, . . ., are all independent with Z_1 having distribution {p_j}_{j≥0} and {W^{(j)}}_{j≥1} iid with the same distribution as W.

(c) Let α ≡ ∑_{a_j∈D} P(W = a_j), where D ≡ {a_j} is the set of values such that P(W = a_j) > 0. Show using (b) that α = f(α) and conclude that if α < 1, then α = q, and hence that if α < 1, then the distribution of W conditional on {W > 0} must be continuous.

(d) Let β be the singular component of the distribution of W in its Lebesgue decomposition. Show using (b) that β satisfies β ≤ f(β) and hence that if β < 1, then β = P(W = 0) and the distribution of W conditional on {W > 0} must be absolutely continuous.

(e) Let p_0 = 0. Show that the distribution of W is of pure type, i.e., it is either purely discrete, purely singular continuous, or purely absolutely continuous.

18.4 (a) Show using Problem 18.3 (b) that if W has a lattice distribution with span d, then d must satisfy d = md and hence d = ∞. Conclude that if P(W = 0) < 1, then the distribution of W on {W > 0} must be nonlattice.

(b) Let p_0 = 0 and P(W = 0) = 0. Use (a) to conclude that the characteristic function φ(t) ≡ E(e^{ιtW}) of W satisfies sup_{1≤|t|≤m} |φ(t)| < 1.

(c) Let p_0 = 0. Show that for any 0 ≤ s_0 < 1 and ε > 0, f^{(n)}(s_0) = O(ε^n).

(Hint: By the mean value theorem, f^{(n)}(s) = ∏_{j=0}^{n−1} f′(f_j(s)). Now use f′(0) = p_0 and f_j(s) → 0 as j → ∞.)

(d) Let p_0 = 0, P(W = 0) = 0. Show that for any n ≥ 1, φ(m^n t) = f^{(n)}(φ(t)) and hence ∫_{−∞}^∞ |φ(u)| du < ∞. Conclude that the distribution of W is absolutely continuous.

18.5 In the multitype case, for the martingale defined in (2.1), show that {W_n}_{n≥0} is L² bounded if E_i Z²_{1j} < ∞ for all i, j, where E_i denotes expectation when one starts with one individual of type i.

18.6 Let m(·) satisfy the integral equation (3.3).

(a) Show that m_α(t) ≡ m(t)e^{−αt} satisfies the renewal equation

m_α(t) = (1 − G(t))e^{−αt} + ∫_{(0,t]} m_α(t − u) dG_α(u)

where G_α(t) ≡ m ∫_0^t e^{−αu} dG(u), t ≥ 0.

(b) Use the key renewal theorem of Section 8.5 to conclude that lim_{t→∞} m_α(t) exists and identify the limit.


(c) Assuming ∑_{j=1}^∞ j²p_j < ∞, show using the key renewal theorem of Section 8.5 that {W(t) : t ≥ 0} of Theorem 18.3.2 is L² bounded.

18.7 Consider an M/G/1 queue with Poisson arrivals and general servicetime. Let Z1 be the number of customers that arrive during the servicetime of the first customer. Call these first generation customers. LetZ2 be the number of customers that arrive during the time it takes toserve all the Z1 customers. Call these second generation customers.For n ≥ 1, let Zn+1 denote the number of customers that arrive duringthe time it takes to serve all Zn of the nth generation customers.

(a) Show that Znn≥0 is a BGW branching process as in Section18.1.

(b) Find the offspring distribution pj∞0 and its mean m in terms

of the rate parameter λ of the Poisson arrival process and theservice time distribution G(·).

(c) Show that the queue size goes to ∞ with positive probability iffm > 1.

(d) Set up a functional equation for the moment generating functionof the busy period U , i.e., the time interval between when thefirst service starts and when the server is idle for the first time.

18.8 Let {η_i : i = 1, 2, . . . , n} be independent exponential random variables with Eη_i = λ_i^{−1}, i = 1, 2, . . . , n. Let η ≡ min{η_i : 1 ≤ i ≤ n}. Show that η has an exponential distribution with Eη = (∑_{i=1}^n λ_i)^{−1} and that P(η = η_j) = λ_j/(∑_{i=1}^n λ_i).

18.9 Using the embedding outlined in Section 18.4 for the Polya urn scheme, show that Y_n ≡ R_n/(R_n + B_n) → Y w.p. 1 and that Y can be represented as

Y = (∑_{i=1}^{R_0} X_i) / (∑_{j=1}^{R_0+B_0} X_j)

where {X_i}_{i≥1} are iid Exponential(1) random variables. Conclude that Y has a Beta(R_0, B_0) distribution.


Appendix A
Advanced Calculus: A Review

This Appendix is a brief review of elementary set theory, real numbers, limits, sequences and series, continuity, differentiability, Riemann integration, complex numbers, exponential and trigonometric functions, and metric spaces. For proofs and further details, see Rudin (1976) and Royden (1988).

A.1 Elementary set theory

This section reviews the following: sets, set operations, product sets (fi-nite and infinite), equivalence relation, axiom of choice, countability, anduncountability.

Definition A.1.1: A set is a collection of objects.

It is typically defined as a collection of objects with a common definingproperty. For example, the collection of even integers can be written asE ≡ n : n is an even integer. In general, a set Ω with defining property pis written as

Ω = ω : ω has property p.

The individual elements are denoted by the small letters ω, a, x, s, t, etc.,and the sets by capital letters Ω, A, X, S, T , etc.

Example A.1.1: The closed interval [0, 1] ≡ x : x a real number, 0 ≤x ≤ 1.


Example A.1.2: The set of positive rationals ≡ x : x = mn , m and n

positive integers.

Example A.1.3: The set of polynomials in x of degree 10 ≡ P (x) :P (x) =

∑10j=0 ajx

j , aj real, j = 0, . . . , 10, a10 = 0.

Example A.1.4: The set of all polynomials in x ≡ P (x) : P (x) =∑nj=0 ajx

j , n a nonnegative integer, aj real, j = 0, 1, 2, . . . , n.

A.1.1 Set operations

Definition A.1.2: Let A be a set. A set B is called a subset of A andwritten as B ⊂ A if every element of B is also an element of A. Two sets Aand B are the same and written as A = B if each is a subset of the other.A subset A ⊂ Ω is called empty and denoted by ∅ if there exists no ω in Ωsuch that ω ∈ A.

Using the mathematical notation ∈ and ⇒, one writes

B ⊂ A if x ∈ B ⇒ x ∈ A.

Here ‘∈’ means “belongs to” and ⇒ means “implies.”

Example A.1.5: Let N be the set of natural numbers, i.e., N =1, 2, 3, . . .. Let E be the set of even natural numbers, i.e., E = n : n ∈ N,n = 2k for some k ∈ N. Then E ⊂ N.

Example A.1.6: Let A = [0, 1] and B be the set of x in A such thatx2 < 1

4 . Then B = x : 0 ≤ x < 12 ⊂ A.

Definition A.1.3: (Intersection and union). Let A1, A2 be subsets of aset Ω. Then A1 union A2, written as A1 ∪A2,is the set defined by

A1 ∪A2 = ω : ω ∈ A1 or ω ∈ A2 or both.

Similarly, A1 intersection A2, written as A1 ∩A2, is the set defined by

A1 ∩A2 = ω : ω ∈ A1 and ω ∈ A2.

Example A.1.7: Let Ω ≡ N ≡ 1, 2, 3, . . .,

A1 = ω : ω = 3k for some k ∈ N andA2 = ω : ω ≡ 5k for some k ∈ N.

Then A1 ∪ A2 = ω : ω is divisible by at least one of the two integers 3and 5, A1 ∩A2 = ω : ω is divisible by both 3 and 5.


Definition A.1.4: Let Ω and I be nonempty sets. Let Aα : α ∈ I be acollection of subsets of Ω. Then I is called the index set.

The union of Aα : α ∈ I is defined as⋃α∈I

Aα ≡ ω : ω ∈ Aα for some α ∈ I.

The intersection of Aα : α ∈ I is defined as⋂α∈I

Aα ≡ ω : ω ∈ Aα for every α ∈ I.

Definition A.1.5: (Complement of a set). Let A ⊂ Ω. Then the comple-ment of the set A, written as Ac (or A), is defined by Ac ≡ ω : ω /∈ A.

Example A.1.8: If Ω = N and A is the set of all integers that are divisibleby 2, then Ac is the set of all odd integers, i.e., Ac = 1, 3, 5, 7, . . ..

Proposition A.1.1: (DeMorgan’s law). For any Aα : a ∈ I of subsetsof Ω, (∪α∈IAα)c = ∩α∈IA

cα, (∩α∈IAα)c = ∪α∈IA

cα.

Proof: To show that two sets A and B are the same, it suffices to showthat

ω ∈ A ⇒ ω ∈ B and ω ∈ B ⇒ ω ∈ A.

Let ω ∈ (∪α∈IAα)c. Then ω /∈ ∪α∈IAα

⇒ ω /∈ Aα for any α ∈ I

⇒ ω ∈ Acα for each α ∈ I

⇒ ω ∈⋂

α∈I

Acα.

Thus (∪α∈IAα)c ⊂ ∩α∈IAcα. The opposite inclusion and the second iden-

tity are similarly proved.

Definition A.1.6: (Product sets). Let Ω1 and Ω2 be two nonempty sets.Then the product set of Ω1 and Ω2, denoted by Ω ≡ Ω1 × Ω2, consists ofall ordered pairs (ω1, ω2) such that ω1 ∈ Ω1, ω2 ∈ Ω2.

Note that if Ω1 = Ω2 and ω1 = ω2, then the pair (ω1, ω2) is not the sameas (ω2, ω1), i.e., the order is important.

Example A.1.9: Ω1 = [0, 1], Ω2 = [2, 3]. Then Ω1 × Ω2 = (x, y) : 0 ≤x ≤ 1, 2 ≤ y ≤ 3.

Definition A.1.7: (Finite products). If Ωi, i = 1, 2 . . . , k are nonemptysets, then

Ω = Ω1 × Ω2 × . . .× Ωk


is the set of all ordered k vectors (ω1, ω2, . . . , ωk) where ωi ∈ Ωi. If Ωi = Ω1

for all 1 ≤ i ≤ k, then Ω1 × Ω2 × . . .× Ωk is written as Ω(k)1 or Ωk

1 .

Definition A.1.8: (Infinite products). Let Ωα : α ∈ I be an infinitecollection of nonempty sets. Then ×α∈IΩα, the product set, is defined asf : f is a function defined on I such that for each α, f(α) ∈ Ωα. IfΩα = Ω for all α ∈ I, then ×α∈IΩα is also written as ΩI .

It is a basic axiom of set theory, known as the axiom of choice (A.C.), thatthis space is nonempty. That is, given an arbitrary collection of nonemptysets, it is possible to form a parliament with one representative from eachset.

For a long time it was thought this should follow from the other axiomsof set theory. But it is shown in Cohen (1966) that it is an independentaxiom. That is, both the A.C. and its negation are consistent with the restof the axioms of set theory.

There are several equivalent versions of A.C. These are Zorn’s lemma,Hausdorff’s maximality principle, the ‘Principle of Well Ordering,’ andTukey’s lemma. For a proof of these equivalences, see Hewitt and Stromberg(1965).

Definition A.1.9: (Functions, countability and uncountability). A func-tion f is a correspondence between the elements of a set X and anotherset Y and is written as f : X → Y . It satisfies the condition that for eachx, there is a unique y in Y that corresponds to it and is denoted as

y = f(x).

The set X is called the domain of f and the set f(X), defined as, f(X) ≡y : there exists x in X such that f(x) = y is called the range of f . It ispossible that many x’s may correspond to the same y and also there mayexist y in Y for which there is no x such that f(x) = y. If f(X) is all of Y ,then the map is called onto. If for each y in f(X), there is a unique x in Xsuch that f(x) = y, then f is called (1–1) or one-to-one. If f is one-to-oneand onto, then X and Y are said to have the same cardinality.

Definition A.1.10: Let f : X → Y be (1–1) and onto. Then, for each y inY , there is a unique element x in X such that f(x) = y. This x is denotedas f−1(y). Note that in this case, g(y) ≡ f−1(y) is a (1–1) onto map fromY to X and is called the inverse of f .

Example A.1.10: Let X = N ≡ 1, 2, 3, . . .. Let Y = n : n = 2k, k ∈ Nbe the set of even integers. Then the map f(x) = 2x is a (1–1) onto mapfrom X to Y .

Example A.1.11: Let X be N and let P be the set of all prime numbers.Then X and P have the same cardinality.


Definition A.1.11: A set X is finite if there exists n ∈ N such that Xand Y ≡ 1, 2, . . . , n have the same cardinality, i.e., there exists a (1–1)onto map from Y to X. A set X is countable if X and N have the samecardinality, i.e., there exists a (1–1) onto map from N to X. A set X isuncountable if it is not finite or countable.

Example A.1.12: The set 0, 1, 2, . . . , 9 is finite, the set Nk (k ∈ N) iscountable and NN is uncountable (Problem A.6).

Definition A.1.12: Let Ω be a nonempty set. Then the power set of Ω,denoted by P(Ω), is the collection of all subsets of Ω, i.e.,

P(Ω) ≡ A : A ⊂ Ω.

Remark A.1.1: P(N) is an uncountable set (Problem A.5).

A.1.2 The principle of induction

The set N of natural numbers has the well ordering property that every nonempty subset A of N has a smallest element s such that (i) s ∈ A and (ii) a ∈ A ⇒ a ≥ s. This property is one of the basic postulates in the definition of N. The principle of induction is a consequence of the well ordering property. It says the following:

Let P (n) : n ∈ N be a collection of propositions (or statements). Sup-pose that

(i) P (1) is true.

(ii) For each n ∈ N, P (n) true ⇒ P (n + 1) true.

Then, P (n) is true for all n ∈ N.

See Problem A.9 for some examples.

A.1.3 Equivalence relations

Definition A.1.13:

(a) Let Ω be a nonempty set. Let G be a nonempty subset of Ω × Ω.Write x ∼ y if (x, y) ∈ G and call it a relation defined by G.

(b) A relation defined by G is an equivalence relation if

(i) (reflexive) for all x in Ω, x ∼ x, i.e., (x, x) ∈ G;

(ii) (symmetric) x ∼ y ⇒ y ∼ x, i.e., (x, y) ∈ G ⇒ (y, x) ∈ G;

(iii) (transitive) x ∼ y, y ∼ z ⇒ x ∼ z, i.e., (x, y) ∈ G, (y, z) ∈ G ⇒(x, z) ∈ G.


Example A.1.13: Let Ω = Z, the set of all integers, G = (m, n) : m−nis divisible by 3. Thus, m ∼ n if (m − n) is a multiple of 3. It is easy toverify that this is an equivalence relation.

Definition A.1.14: (Equivalence classes). Let Ω be a nonempty set. LetG define an equivalence relation on Ω. For each x in Ω, the set [x] ≡ y :x ∼ y is called the equivalence class generated by x.

Proposition A.1.2: Let C be the set of all equivalence classes in Ω gen-erated by an equivalence relation defined by G. Then

(i) C1, C2 ∈ C ⇒ C1 = C2 or C1 ∩ C2 = ∅.

(ii)⋃

C∈CC = Ω.

Proof:

(i) Suppose C1 ∩C2 = ∅. Then there exist x1, x2, y such that C1 = [x1],C2 = [x2] and y ∈ C1 ∩ C2. This implies x1 ∼ y, x2 ∼ y. But bysymmetry y ∼ x2 and this implies by transitivity that x1 ∼ x2, i.e.,x2 ∈ C1 implying C2 ⊂ C1. Similarly, C1 ⊂ C2, i.e., C1 = C2.

(ii) For each x in Ω, (x, x) ∈ G and so [x] is not empty and x ∈ [x].

The above proposition says that every equivalence relation on Ω leads toa decomposition of Ω into equivalence classes that are disjoint and whoseunion is all of Ω. In the example given above, the set Z of all integers canbe decomposed to three equivalence classes Cj ≡ n : n = 3m + j for somem ∈ Z, j = 0, 1, 2.

A.2 Real numbers, continuity, differentiability, and integration

A.2.1 Real numbers

This section reviews the following: integers, rationals, real numbers; algebraic, order, and completeness axioms; Archimedean property, denseness of rationals.

There are at least two approaches to defining the real number system.

Approach 1. Start with the natural numbers N, construct the set Z of allintegers (N∪0∪ (−N)), and next, the set Q of rationals and then the setR of real numbers either as the set of all Cauchy sequences of rationals oras Dedekind cuts. The step going from Q to R via Cauchy sequences is alsoavailable for completing any incomplete metric space (see Section A.4).


Approach 2. Define the set of real numbers R as a set that satisfies threesets of axioms. The first set is algebraic involving addition and multiplica-tion. The second set is on ordering that, with the first, makes R an orderedfield (see Royden (1988) for a definition). The third set is a single axiomknown as the completeness axiom. Thus R is defined as a complete orderedfield.

The algebraic axioms say that there are two binary operations knownas addition (+) and multiplication (·) that render R a field. See Royden(1988) for the nine axioms for this set.

The order axiom says that there is a set P ⊂ R, to be called positivenumbers such that

(i) x, y ∈ P ⇒ x · y ∈ P, x + y ∈ P

(ii) x ∈ P ⇒ −x /∈ P

(iii) x ∈ R ⇒ x = 0 or x ∈ P or −x ∈ P.

The set Q of rational numbers is an ordered field (i.e., it satisfies thealgebraic and order axioms). But Q does not satisfy the completeness axiom(see below).

Given P, one can define an order on R by defining x < y (read x lessthan y) to mean y − x ∈ P. Since for all x, y in R, (x − y) is either 0 or(x− y) ∈ P or (y − x) ∈ P, it follows that for all x, y in R, either x = y orx < y or x > y. This is called total or linear order.

Definition A.2.1: (Upper and lower bounds).

(a) Let A ⊂ R. A real number M is an upper bound for A if a ∈ A ⇒a ≤ M and m is a lower bound for A if a ∈ A ⇒ a ≥ m.

(b) The supremum of a set A, denoted by sup A or the least upper bound(l.u.b.) of A, is defined by the following conditions:

(i) x ∈ A ⇒ x ≤ sup A,

(ii) K < sup A ⇒ there exists x ∈ A such that K < x.

The completeness axiom says that if A ⊂ R has an upper bound in R, then there exists an M in R such that M = sup A. That is, every set A that is bounded above in R has a l.u.b. in R. The ordered field of rationals Q does not possess this property. One well-known example is the set

A = {r : r ∈ Q, r² < 2}.

Then A is bounded above in Q but has no l.u.b. in Q (Problem A.11).

Next some consequences of the completeness axiom are discussed.

Proposition A.2.1: (Axiom of Eudoxus and Archimedes (AOE)). For all x in R, there exists a natural number n such that n > x.


Proof: If x ≤ 1, take n = 2. If x > 1, let Sx ≡ k : k ∈ N, k ≤ x. Then Sx

is not empty and is bounded above. By the completeness axiom, there is areal number y that is the l.u.b. of Sx. Thus y − 1

2 is not an upper boundfor Sx and so there exists k0 ∈ Sx such that y − 1

2 < k0. This implies that(k0 + 1) > y− 1

2 + 1 = y + 12 > y and so (k0 + 1) /∈ Sx. By the linear order

in R, (k0 + 1) > x and so (k0 + 1) is the desired integer.

Corollary A.2.2: For any x, y ∈ R with x < y, there is a r in Q suchthat x < r < y.

Proof: Let z = (y− x)−1. Then there is an integer k such that 0 < z < k(by AOE.) Again by AOE, there is a positive integer n such that n > yk.Let S = n : n ∈ N, n > yk. Since S = ∅, it has a smallest element (by thewell ordering property of N) say, p. Then p−1 < yk < p, i.e., p−1

k < y < pk .

Since 1k < 1

z = (y − x) and pk > y, it follows that p−1

k > x. Now taker = p−1

k .

Remark A.2.1: This property is often stated as: The set Q of rationalsis dense in the set R of real numbers.

Definition A.2.2: The set R of extended real numbers is the set consistingof R and two elements identified as +∞ (plus infinity) and −∞ (negativeinfinity) with the following definition of addition (+) and multiplication(·). For any x in R, x + ∞ = ∞, x − ∞ = −∞, x · ∞ = ∞ if x > 0,x · ∞ = −∞ if x < 0, 0 · ∞ = 0, ∞ + ∞ = ∞, −∞ − ∞ = −∞,∞ · (±∞) = ±∞, (−∞) · (±∞) = ∓∞. But ∞−∞ is not defined. Theorder property on R is defined by extending that on R with the additionalcondition x ∈ R ⇒ −∞ < x < +∞. Finally, if A ⊂ R does not have anupper bound in R, then supA is defined as +∞ and if A ⊂ R does nothave a lower bound in R, then inf A is defined as −∞.

A.2.2 Sequences, series, limits, limsup, liminf

Definition A.2.3: Let {x_n}_{n≥1} be a sequence of real numbers.

(i) For a real number a, lim_{n→∞} x_n = a if for every ε > 0, there exists a positive integer N_ε such that n ≥ N_ε ⇒ |x_n − a| < ε.

(ii) lim_{n→∞} x_n = ∞ if for any K in R, there exists an integer N_K such that n ≥ N_K ⇒ x_n > K.

(iii) lim_{n→∞} x_n = −∞ if lim_{n→∞} (−x_n) = ∞.

(iv) lim sup_{n→∞} x_n ≡ inf_{n≥1} (sup_{j≥n} x_j).

(v) lim inf_{n→∞} x_n ≡ sup_{n≥1} (inf_{j≥n} x_j).
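A quick illustrative computation (not from the text): for x_n = (−1)^n + 1/n the limit does not exist, yet the limsup and liminf are finite:

```latex
% x_n = (-1)^n + 1/n oscillates, so lim x_n does not exist, but
\limsup_{n\to\infty} x_n = \inf_{n\ge 1}\Bigl(\sup_{j\ge n} x_j\Bigr) = 1,
\qquad
\liminf_{n\to\infty} x_n = \sup_{n\ge 1}\Bigl(\inf_{j\ge n} x_j\Bigr) = -1 .
```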


Definition A.2.4: (Cauchy sequences). A sequence xnn≥1 ⊂ R is calleda Cauchy sequence if for every ε > 0, there is Nε such that n, m ≥ Nε ⇒|xm − xn| < ε.

Proposition A.2.3: If xnn≥1 ⊂ R is convergent in R (i.e.,limn→∞ xn = a exists in R), then xnn≥1 is Cauchy. Conversely, ifxnn≥1 ⊂ R is Cauchy, then there exists an a ∈ R such that limn→∞ xn =a.

The proof is based on the use of the l.u.b. axiom (Problem A.14).

Definition A.2.5: Let xnn≥1 be a sequence of real numbers. For n ≥ 1,sn ≡

∑nj=1 xj is called the nth partial sum of the series

∑∞j=1 xj . The series∑∞

j=1 xj is said to converge to s in R if limn→∞ sn = s. If limn→∞ sn = ±∞,then the series

∑∞j=1 xj is said to diverge to ±∞.

Note that if xj ≥ 0 for all j, then either limn→∞ sn = s ∈ R, orlimn→∞ sn = ∞.

Example A.2.1: (Geometric series). Fix 0 < r < 1. Let x_n = r^n, n ≥ 0. Then s_n = 1 + r + . . . + r^n = (1 − r^{n+1})/(1 − r) and ∑_{j=0}^∞ r^j converges to s = 1/(1 − r).

Example A.2.2: Consider the series ∑_{j=1}^∞ 1/j^p, 0 < p < ∞. It can be shown that this converges for p > 1 and diverges to ∞ for 0 < p ≤ 1.

Definition A.2.6: The series∑∞

j=1 xj converges absolutely if the series∑∞j=1 |xj | converges in R.

There exist series∑∞

j=1 xj that converge but not absolutely. For example,∑∞j=1

(−1)j

j . For further material on convergence properties of series, suchas tests for convergence, rates of convergence, etc., see Rudin (1976).

Definition A.2.7: (Power series). Let ann≥0 be a sequence of realnumbers. For x ∈ R, the series

∑∞n=0 anxn is called a power series. If the

series∑∞

n=0 anxn converges for all x in B ⊂ R, the power series∑∞

n=0 anxn

is said to be convergent on B.

Proposition A.2.4: Let {a_n}_{n≥0} be a sequence of real numbers. Let ρ = (lim sup_{n→∞} |a_n|^{1/n})^{−1}. Then

(i) |x| < ρ ⇒ ∑_{n=0}^∞ |a_n x^n| converges.

(ii) |x| > ρ ⇒ ∑_{n=0}^∞ |a_n x^n| diverges to +∞.

Proof of this is left as an exercise (Problem A.15).

Definition A.2.8: ρ ≡ (lim sup_{n→∞} |a_n|^{1/n})^{−1} is called the radius of convergence of the power series ∑_{n=0}^∞ a_n x^n.
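Two illustrative computations of ρ (not from the text):

```latex
% a_n = 2^n: |a_n|^{1/n} = 2 for every n, so rho = 1/2 and
% the series sum_n 2^n x^n converges absolutely exactly for |x| < 1/2.
a_n = 2^{n} \;\Rightarrow\; \rho = \Bigl(\limsup_{n\to\infty} 2\Bigr)^{-1} = \tfrac12 .
% a_n = 1/n!: (n!)^{-1/n} -> 0, so rho = +infty (the exponential series
% converges for every real x).
a_n = \tfrac{1}{n!} \;\Rightarrow\; \limsup_{n\to\infty} (n!)^{-1/n} = 0
\;\Rightarrow\; \rho = +\infty .
```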


A.2.3 Continuity and differentiability

Definition A.2.9: Let f : A → R, A ⊂ R. Then

(a) f is continuous at x_0 in A if for every ε > 0, there exists a δ > 0 such that x ∈ A, |x − x_0| < δ implies |f(x) − f(x_0)| < ε. (Here, δ may depend on ε and x_0.)

(b) f is continuous on B ⊂ A if it is continuous at every x_0 in B.

(c) f is uniformly continuous on B ⊂ A if for every ε > 0, there exists a δ_ε > 0 such that sup{|f(x) − f(y)| : x, y ∈ B, |x − y| < δ_ε} < ε.

Some properties of continuous functions are listed below.

Proposition A.2.5:

(i) (Sums, products, and ratios of continuous functions). Let f, g : A → R, A ⊂ R. Let f and g be continuous on B ⊂ A. Then

(a) f + g, f − g, and α · f for any α ∈ R are all continuous on B.

(b) f(x)/g(x) is continuous at x_0 in B, provided g(x_0) ≠ 0.

(ii) (Continuous functions on a closed bounded interval). Let f be continuous on a closed and bounded interval [a, b]. Then

(a) f is bounded, i.e., sup{|f(x)| : a ≤ x ≤ b} < ∞,

(b) it achieves its maximum and minimum, i.e., there exist x_0, y_0 in [a, b] such that f(x_0) ≥ f(x) ≥ f(y_0) for all x in [a, b], and f attains all values in [f(y_0), f(x_0)], i.e., for all ℓ ∈ [f(y_0), f(x_0)] there exists z ∈ [a, b] such that f(z) = ℓ. Thus, f maps bounded closed intervals onto bounded closed intervals.

(c) f is uniformly continuous on [a, b].

(iii) (Composition of functions). Let f : A → R, g : B → R be continuous on A and B, respectively. Let f(A) ⊂ B, i.e., for any x in A, f(x) ∈ B. Let h(x) = g(f(x)) for x in A. Then h : A → R is continuous.

(iv) (Uniform limits of continuous functions). Let {f_n}_{n≥1} be a sequence of functions continuous on A to R, A ⊂ R. If sup{|f_n(x) − f(x)| : x ∈ A} → 0 as n → ∞ for some f : A → R, i.e., f_n converges to f uniformly on A, then f is continuous on A.

Remark A.2.2: The function f(x) ≡ x is clearly continuous on R. Now byProposition A.2.5 (i) and (iv), it follows that all polynomials are continuouson R, and hence, so are their uniform limits. Weierstrass’ approximationtheorem is a sort of converse to this. That is, every continuous function ona closed and bounded interval is the uniform limit of polynomials. Moreprecisely, one has the following:


Theorem A.2.6: Let f : [a, b] → R be continuous. Then for any ε > 0there is a polynomial p(x) =

∑n0 ajx

j, aj ∈ R, j = 0, 1, 2, . . . , n such thatsup|f(x)− p(x)| : x ∈ [a, b] < ε.

It should be noted that a power series A(x) ≡∑∞

0 anxn is the uniformlimits of polynomials on [−λ, λ] for any 0 < λ < ρ ≡

(limn→∞|an|1/n

)−1

and hence is continuous on (−ρ, ρ).

Definition A.2.10: Let f : (a, b) → R, (a, b) ⊂ R. The function f is saidto be differentiable at x0 ∈ (a, b) if

limh→0

f(x0 + h)− f(x0)h

≡ f ′(x0) exists in R.

A function is differentiable in (a, b) if it is differentiable at each x in (a, b).

Some important consequences of differentiability are listed below.

Proposition A.2.7: Let f, g : (a, b) → R, (a, b) ⊂ R. Then

(i) f differentiable at x_0 in (a, b) implies f is continuous at x_0.

(ii) (Mean value theorem). f differentiable on (a, b) and f continuous on [a, b] imply that for some a < c < b, f(b) − f(a) = (b − a)f′(c).

(iii) (Maxima and minima). f differentiable at x_0 and, for some δ > 0, f(x) ≤ f(x_0) for all x ∈ (x_0 − δ, x_0 + δ) imply that f′(x_0) = 0.

(iv) (Sums, products and ratios). f, g differentiable at x_0 implies that for any α, β in R, (αf + βg) and fg are differentiable at x_0 with

(αf + βg)′(x_0) = αf′(x_0) + βg′(x_0), (fg)′(x_0) = f′(x_0)g(x_0) + f(x_0)g′(x_0),

and if g(x_0) ≠ 0, then f/g is differentiable at x_0 with

(f/g)′(x_0) = [f′(x_0)g(x_0) − f(x_0)g′(x_0)]/(g(x_0))².

(v) (Chain rule). If f is differentiable at x_0 and g is differentiable at f(x_0), then h(x) ≡ g(f(x)) is differentiable at x_0 with h′(x_0) = g′(f(x_0))f′(x_0).

(vi) (Differentiability of power series). Let A(x) ≡ ∑_{n=0}^∞ a_n x^n be a power series with radius of convergence ρ ≡ (lim sup_{n→∞} |a_n|^{1/n})^{−1} > 0. Then A(·) is differentiable infinitely many times on (−ρ, ρ) and, for x in (−ρ, ρ),

d^k A(x)/dx^k = ∑_{n=k}^∞ n(n − 1) · · · (n − k + 1) a_n x^{n−k}, k ≥ 1.


Remark A.2.3: It should be noted that the converse to (i) in the above proposition does not hold. For example, the function f(x) = |x| is continuous at x0 = 0 but is not differentiable at x0. Indeed, Weierstrass showed that there exists a function f : [0, 1] → R that is continuous on [0, 1] but is not differentiable at any x in (0, 1).

Also note that the mean value theorem implies that if f′(·) ≥ 0 on (a, b), then f is nondecreasing on (a, b).

Definition A.2.11: (Taylor series). Let f be a map from I ≡ (a − η, a + η) to R for some a ∈ R, η > 0. Suppose f is n times differentiable in I, for each n ≥ 1. Let a_n = f^{(n)}(a)/n!. Then the power series ∑_{n=0}^{∞} a_n(x − a)^n = ∑_{n=0}^{∞} [f^{(n)}(a)/n!](x − a)^n is called the Taylor series of f at a.

Remark A.2.4: Let f be as in Definition A.2.11. Taylor's remainder theorem says that for any x in I and any n ≥ 1, if f is (n + 1) times differentiable in I, then

|f(x) − ∑_{j=0}^{n} a_j (x − a)^j| ≤ [|f^{(n+1)}(y_n)| / (n + 1)!] |x − a|^{n+1}

for some y_n in I. Thus, if for some ε > 0, sup_{|y−a|<ε} |f^{(k)}(y)| ≡ λ_k satisfies λ_k/k! → 0 as k → ∞, then the Taylor series satisfies

sup_{|x−a|<ε} |∑_{j=0}^{n} (x − a)^j f^{(j)}(a)/j! − f(x)| → 0 as n → ∞.
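For a concrete instance of this sufficient condition (illustration only, not from the text), take f = cos and a = 0: every derivative is bounded by 1, so λ_k/k! → 0 and the Taylor polynomials converge uniformly on bounded intervals around 0. A quick numerical sketch in Python, assuming NumPy:

```python
import numpy as np
from math import factorial

# f = cos, a = 0: |f^(k)| <= 1 for every k, so the Taylor polynomials
# converge to cos uniformly on any bounded interval around 0.
x = np.linspace(-2.0, 2.0, 401)
derivs_at_0 = [1.0, 0.0, -1.0, 0.0]            # f^(j)(0) for cos repeats with period 4
for n in (2, 6, 10, 14):
    taylor = sum(derivs_at_0[j % 4] / factorial(j) * x**j for j in range(n + 1))
    print(n, np.max(np.abs(taylor - np.cos(x))))   # sup-norm error shrinks rapidly
```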

A.2.4 Riemann integration

Let f : [a, b] → R, [a, b] ⊂ R. Let sup{|f(x)| : a ≤ x ≤ b} < ∞. For any partition P ≡ {a = x0 < x1 < x2 < · · · < xk = b}, let

U(P, f) ≡ ∑_{i=1}^{k} M_i(P)∆_i(P) and L(P, f) ≡ ∑_{i=1}^{k} m_i(P)∆_i(P)

where

M_i(P) = sup{f(x) : x_{i−1} ≤ x ≤ x_i}, m_i(P) = inf{f(x) : x_{i−1} ≤ x ≤ x_i}, ∆_i(P) = (x_i − x_{i−1})


for i = 1, 2, . . . , k.

Definition A.2.12: The upper and lower Riemann integrals of f over [a, b] are defined, respectively, as

inf{U(P, f) : P a partition of [a, b]} and sup{L(P, f) : P a partition of [a, b]}.

Definition A.2.13: Let f : [a, b] → R, [a, b] ⊂ R and let sup{|f(x)| : a ≤ x ≤ b} < ∞. The function f is Riemann integrable on [a, b] if its upper and lower Riemann integrals over [a, b] are equal, and the common value is denoted by ∫_a^b f(x)dx.

The following are some important results on Riemann integration.

Proposition A.2.8:

(i) Let f : [a, b] → R, [a, b] ⊂ R be continuous. Then f is Riemann integrable and

∫_a^b f(x)dx = lim_{n→∞} ∑_{i=1}^{k_n} f(x_{ni}) ∆_{ni},

where P_n ≡ {x_{ni} : 0 ≤ i ≤ k_n} is any sequence of partitions of [a, b] such that ∆_n ≡ max{(x_{ni} − x_{n,i−1}) : 1 ≤ i ≤ k_n} → 0 as n → ∞ and ∆_{ni} ≡ (x_{ni} − x_{n,i−1}).

(ii) If f, g are Riemann integrable on [a, b], then αf + βg (α, β ∈ R) and fg are Riemann integrable on [a, b] and

∫_a^b (αf + βg)(x)dx = α ∫_a^b f(x)dx + β ∫_a^b g(x)dx.

(iii) Let f be Riemann integrable on [a, b] and [b, c], −∞ < a < b < c < ∞. Then f is Riemann integrable on [a, c] and

∫_a^c f(x)dx = ∫_a^b f(x)dx + ∫_b^c f(x)dx.

(iv) (Fundamental theorem of Riemann integration: Part I). Let f be Riemann integrable on [a, b]. Then it is Riemann integrable on [a, c] for all a ≤ c ≤ b. Let F(x) ≡ ∫_a^x f(u)du, a ≤ x ≤ b. Then

(a) F(·) is continuous on [a, b].

(b) If f(·) is continuous at x0 ∈ (a, b), then F(·) is differentiable at x0 and F′(x0) = f(x0).

(v) (Fundamental theorem: Part II). If F : [a, b] → R is differentiable on (a, b), f : [a, b] → R is continuous on [a, b] and F′(x) = f(x) for all a < x < b, then F(c) = F(a) + ∫_a^c f(x)dx, a ≤ c ≤ b.

(vi) (Integration by parts). If f and g are continuous and differentiable on (a, b) with f′ and g′ continuous on (a, b), then

∫_c^d f(x)g′(x)dx + ∫_c^d f′(x)g(x)dx = f(d)g(d) − f(c)g(c) for all a < c < d < b.

(vii) (Interchange of limits and integration). Let fn, f be continuous on [a, b] to R. Let fn converge to f uniformly on [a, b]. Then

Fn(c) ≡ ∫_a^c fn(x)dx → F(c) ≡ ∫_a^c f(x)dx

uniformly on [a, b].
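The following Python sketch (an illustration only, not part of the text; NumPy assumed) checks (i) and the fundamental theorem numerically for the hypothetical choice f(x) = x² on [0, 1], whose integral is 1/3: Riemann sums over refining partitions approach 1/3, and a central difference of F(x) = ∫_0^x f(u)du at x0 = 0.7 returns approximately f(0.7) = 0.49.

```python
import numpy as np

f = lambda x: x**2                         # continuous on [0, 1]; exact integral is 1/3

# Proposition A.2.8 (i): Riemann sums over partitions with shrinking mesh converge.
for k in (10, 100, 1000):
    x = np.linspace(0.0, 1.0, k + 1)       # uniform partition with mesh 1/k
    riemann = np.sum(f(x[:-1]) * np.diff(x))
    print(k, abs(riemann - 1.0 / 3.0))     # error shrinks with the mesh

# Fundamental theorem (iv)(b): F(x) = int_0^x f(u) du should satisfy F'(x0) = f(x0).
def F(t, m=20000):
    u = np.linspace(0.0, t, m + 1)
    return np.sum(0.5 * (f(u[:-1]) + f(u[1:])) * np.diff(u))   # trapezoidal sums

h, x0 = 1e-5, 0.7
print((F(x0 + h) - F(x0 - h)) / (2 * h), f(x0))   # both values ≈ 0.49
```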

A.3 Complex numbers, exponential and trigonometric functions

Definition A.3.1: The set C of complex numbers is defined as the set R × R of all ordered pairs of real numbers endowed with addition and multiplication as follows:

(a1, b1) + (a2, b2) = (a1 + a2, b1 + b2),
(a1, b1) · (a2, b2) = (a1a2 − b1b2, a1b2 + a2b1).   (3.1)

It can be verified that C satisfies the field axioms (Royden (1988)). By defining ι to be the element (0, 1), one can write the elements of C in the form (a, b) = a + ιb and do addition and multiplication with the rule of replacing ι² by −1. Clearly, the set R of real numbers can be identified with the set {(a, 0) : a ∈ R}. The set {(0, b) : b ∈ R} is called the set of purely imaginary numbers.

Definition A.3.2: For any complex number z = (a, b) in C,

Re(z), called the real part of z, is a;
Im(z), called the imaginary part of z, is b;
z̄, called the complex conjugate of z, is z̄ = a − ιb;
|z|, called the absolute value of z, is |z| = √(a² + b²).   (3.2)


Clearly, z z̄ = z̄ z = |z|² and any z ≠ 0 can be written as z = |z|ω where |ω| = 1. A set A ⊂ C is bounded if sup{|z| : z ∈ A} < ∞. If d(z1, z2) ≡ |z1 − z2|, then (C, d) is a complete separable metric space (see Section A.4). Clearly, zn = (an, bn) converges to z = (a, b) in this metric iff Re(zn) = an → Re(z) = a and Im(zn) = bn → Im(z) = b.

Definition A.3.3: The exponential function is a map from C to C defined by

exp(z) ≡ ∑_{n=0}^{∞} zⁿ/n!   (3.3)

where the right side is defined as the limit of the partial sum sequence {∑_{n=0}^{m} zⁿ/n!}_{m≥0}, which exists since ∑_{n=0}^{∞} |z|ⁿ/n! < ∞ for each z ∈ C. In fact, the convergence of the partial sum sequence is uniform on every bounded subset A of C. This in turn implies that the exponential function is continuous on C.

Notation: exp(z) will also be written as e^z.

Definition A.3.4: The number e¹ ≡ ∑_{n=0}^{∞} 1/n! will be called e. It is not difficult to show that e = lim_{n→∞}(1 + 1/n)ⁿ. By definition of e^z, e⁰ = 1.

Definition A.3.5: The cosine and sine functions from R → R are defined by

cos t ≡ Re(e^{ιt}) = ∑_{k=0}^{∞} (−1)^k t^{2k}/(2k)!,
sin t ≡ Im(e^{ιt}) = ∑_{k=0}^{∞} (−1)^k t^{2k+1}/(2k + 1)!.

Thus, one gets Euler's formula e^{ιt} = cos t + ι sin t, for all t ∈ R.
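A small numerical check (illustration only, not from the text) of the defining series (3.3), Euler's formula, and the multiplicative property stated in Theorem A.3.1 (i) below, written in Python:

```python
from math import factorial, cos, sin

def exp_series(z, m=60):
    # partial sum of the defining series (3.3); 60 terms is ample for moderate |z|
    return sum(z**n / factorial(n) for n in range(m))

t = 2.3
print(abs(exp_series(1j * t) - complex(cos(t), sin(t))))           # ≈ 0: Euler's formula
z1, z2 = 0.4 + 1.1j, -0.7 + 0.3j
print(abs(exp_series(z1) * exp_series(z2) - exp_series(z1 + z2)))  # ≈ 0: e^z1 e^z2 = e^(z1+z2)
```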

The following are some important and useful properties of the exponential and the cosine and sine (also called trigonometric) functions.

Theorem A.3.1:

(i) For all z1, z2 in C, e^{z1} e^{z2} = e^{z1+z2}, and e^z ≠ 0 for any z.

(ii) e^z is differentiable for all z (i.e., (e^z)′ ≡ lim_{h→0} (e^{z+h} − e^z)/h exists) and (e^z)′ = e^z.

(iii) The function e^x : R → R+ is strictly increasing with e^x ↑ ∞ as x ↑ ∞ and e^x ↓ 0 as x ↓ −∞.

(iv) For t in R, (cos t)² + (sin t)² = 1, (cos t)′ = − sin t, (sin t)′ = cos t.


(v) There is a smallest positive number, called π, such that e^{ιπ/2} = ι.

(vi) e^z is a periodic function: e^z = e^{z+ι2π} for all z.

Proof:

(i) Since ∑_{n=0}^{∞} |z|ⁿ/n! converges for all z ∈ C,

e^{z1} · e^{z2} = lim_{m→∞} (∑_{n=0}^{m} z1ⁿ/n!)(∑_{n=0}^{m} z2ⁿ/n!)
= lim_{m→∞} ∑_{0≤r,s≤m} z1^r z2^s/(r! s!)
= lim_{m→∞} ∑_{k=0}^{2m} (1/k!) ∑_{r=0}^{k} [k!/(r!(k − r)!)] z1^r z2^{k−r}
= lim_{m→∞} ∑_{k=0}^{2m} (z1 + z2)^k/k! = e^{z1+z2}.

Since e^z · e^{−z} = e⁰ = 1, e^z ≠ 0 for any z ∈ C.

(ii) Fix z ∈ C. For any h ∈ C, h ≠ 0, by (i),

(e^{z+h} − e^z)/h = e^z (e^h − 1)/h.

But

|(e^h − 1)/h − 1| ≤ (∑_{k=2}^{∞} |h|^k/k!) (1/|h|) ≤ |h| ∑_{k=0}^{∞} 1/k!   if |h| ≤ 1.

Thus

lim_{h→0} [(e^h − 1)/h − 1] = 0

and

lim_{h→0} (e^{z+h} − e^z)/h = e^z, i.e., (ii) holds.

(iii) That the map t → e^t is strictly increasing on [0, ∞) and that e^t ↑ ∞ as t ↑ ∞ is clear from the definition of e^z. Since e^{−t} e^t = e⁰ = 1, e^{−t} = 1/e^t for all t ∈ R, so that e^t is strictly increasing on all of R and e^t ↓ 0 as t ↓ −∞.


(iv) From the definition of e^z, it follows that the complex conjugate of e^z is e^{z̄} and, in particular, for t real the conjugate of e^{ιt} is e^{−ιt}; hence

|e^{ιt}|² = e^{ιt} e^{−ιt} = e^{−ιt+ιt} = e⁰ = 1.

Thus, for all t real |e^{ιt}| = 1 and since e^{ιt} = cos t + ι sin t, it follows that

|e^{ιt}|² = (cos t)² + (sin t)² = 1.

Also from (ii), for t ∈ R,

(e^{ιt})′ = (cos t)′ + ι(sin t)′ = ιe^{ιt} = − sin t + ι cos t,

yielding (cos t)′ = − sin t, (sin t)′ = cos t, proving (iv).

(v) By definition,

cos 2 = ∑_{k=0}^{∞} (−1)^k 2^{2k}/(2k)!.

Now a_k ≡ 2^{2k}/(2k)! satisfies a_{k+1} < a_k for k ≥ 1 and hence ∑_{k≥3}(−1)^k a_k = −(a_3 − a_4) − (a_5 − a_6) − · · · < 0. Thus,

cos 2 < 1 − 2²/2! + 2⁴/4! = −1/3 < 0.

Since cos t is a continuous function on R with cos 0 = 1 and cos 2 < 0, there exists a smallest t0 > 0 such that cos t0 = 0, defined by t0 = inf{t : t > 0, cos t = 0}. Set π = 2t0. Since cos(π/2) = 0, (iv) implies (sin(π/2))² = 1; since sin t > 0 for 0 < t ≤ 2 (group the terms of the sine series in pairs), sin(π/2) = 1 and hence e^{ιπ/2} = ι.

(vi) Clearly, e^{ιπ/2} = ι implies that e^{ιπ} = −1, e^{ι2π} = 1 and e^{ι2πk} = 1 for all integers k. Since e^{ι2π} = 1, it follows that e^z = e^{z+ι2π} for all z ∈ C.

It is now possible to prove from the above definition various results involving π that one learns in calculus. For example, the arc length of the unit circle {z : |z| = 1} is 2π and ∫_{−∞}^{∞} 1/(1 + x²) dx = π, etc. (Problems A.19 and A.20). The following assertions about e^z can be proved with some more effort.

Theorem A.3.2:

(i) e^z = 1 iff z = 2πιk for some integer k.

(ii) The map t → e^{ιt} from R is onto the unit circle.

(iii) For any ω ∈ C, ω ≠ 0, there is a z ∈ C such that ω = e^z.


For a proof of this theorem as well as more details on Theorem A.3.1, see Rudin (1987).

Theorem A.3.3: (Orthogonality of {e^{ι2πnt}}_{n∈Z}). For any n ∈ Z, ∫_0^1 e^{ι2πnt} dt = 0 if n ≠ 0 and = 1 if n = 0.

Proof: Since (e^{ιt})′ = ιe^{ιt},

(e^{ι2πnt})′ = ι2πn e^{ι2πnt}, n ∈ Z,

and so for n ≠ 0,

∫_0^1 e^{ι2πnt} dt = [1/(ι2πn)] ∫_0^1 (e^{ι2πnt})′ dt = [1/(ι2πn)] (e^{ι2πn} − 1) = 0.

Corollary A.3.4: The family {cos 2πnt : n = 0, 1, 2, . . .} ∪ {sin 2πnt : n = 1, 2, . . .} is orthogonal in L2[0, 1] (Problem A.22), i.e., for any two f, g in this family with f ≠ g, ∫_0^1 f(x)g(x)dx = 0.
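As a sanity check (illustrative only, not part of the text), the orthogonality relations can be verified numerically with a Riemann sum over an equispaced grid of [0, 1]; a short Python sketch, assuming NumPy:

```python
import numpy as np

def inner(n, m, k=100000):
    # Riemann-sum approximation of ∫_0^1 e^{ι2πnt} · conj(e^{ι2πmt}) dt
    t = np.arange(k) / k
    return np.mean(np.exp(2j * np.pi * n * t) * np.conj(np.exp(2j * np.pi * m * t)))

print(abs(inner(3, 3)))   # ≈ 1: each e^{ι2πnt} has unit norm in L2[0, 1]
print(abs(inner(3, 5)))   # ≈ 0: distinct frequencies are orthogonal (Theorem A.3.3)
```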

A.4 Metric spaces

A.4.1 Basic definitions

This section reviews the following: metric spaces, Cauchy sequences, completeness, functions, continuity, compactness, convergence of sequences of functions, and uniform convergence.

Definition A.4.1: Let S be a nonempty set. Let d : S × S → R+ = [0, ∞) be such that

(i) d(x, y) = d(y, x) for any x, y in S.

(ii) d(x, z) ≤ d(x, y) + d(y, z) for any x, y, z in S.

(iii) d(x, y) = 0 iff x = y.

Such a d is called a metric on S and the pair (S, d) a metric space. Property (ii) is called the triangle inequality.

Example A.4.1: Let Rk ≡ {(x1, . . . , xk) : xi ∈ R, 1 ≤ i ≤ k} be the k-dimensional Euclidean space. For 1 ≤ p < ∞ and x = (x1, . . . , xk), y = (y1, y2, . . . , yk) ∈ Rk, let

d_p(x, y) = (∑_{i=1}^{k} |xi − yi|^p)^{1/p},


and d_∞(x, y) = max{|xi − yi| : 1 ≤ i ≤ k}.

It can be shown that d_p(·, ·) is a metric on Rk for all 1 ≤ p ≤ ∞ (Problem A.24).
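A short Python sketch (illustration only, assuming NumPy) that evaluates d1, d2 and d∞ for a pair of points of R³ and spot-checks the triangle inequality on random triples:

```python
import numpy as np

def d_p(x, y, p):
    # the metric d_p of Example A.4.1; p = np.inf gives the max metric d_infinity
    return np.linalg.norm(np.asarray(x) - np.asarray(y), ord=p)

x, y = (1.0, -2.0, 0.5), (0.0, 1.0, 2.0)
for p in (1, 2, np.inf):
    print(p, d_p(x, y, p))

# spot-check the triangle inequality d_p(a, c) <= d_p(a, b) + d_p(b, c)
rng = np.random.default_rng(0)
for _ in range(1000):
    a, b, c = rng.normal(size=(3, 4))
    assert all(d_p(a, c, p) <= d_p(a, b, p) + d_p(b, c, p) + 1e-12 for p in (1, 2, np.inf))
```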

Definition A.4.2: A sequence {xn}n≥1 in a metric space (S, d) converges to an x in S if for every ε > 0, there is an Nε such that n ≥ Nε ⇒ d(xn, x) < ε, and one writes lim_{n→∞} xn = x.

Definition A.4.3: A sequence {xn}n≥1 in a metric space (S, d) is Cauchy if for all ε > 0, there exists Nε such that n, m ≥ Nε ⇒ d(xn, xm) < ε.

Definition A.4.4: A metric space (S, d) is complete if every Cauchy sequence {xn}n≥1 converges to some x in S.

Example A.4.2:

(a) Let S = Q, the set of rationals, and d(x, y) ≡ |x − y|. Then (Q, d) is a metric space that is not complete.

(b) Let S = R and d(x, y) = |x − y|. Then (R, d) is complete (cf. Proposition A.2.3).

(c) Let S = Rk. Then (Rk, dp) is complete for every 1 ≤ p ≤ ∞, where dp is as in Example A.4.1.

Remark A.4.1: (Completion of an incomplete metric space). Let (S, d) be a metric space. Let S̃ be the set of all Cauchy sequences in S. Identify each x in S with the Cauchy sequence {xn ≡ x}n≥1. Define a function from S̃ × S̃ to R+ by

d̃({xn}n≥1, {yn}n≥1) = lim sup_{n→∞} d(xn, yn).

It is easy to verify that d̃ is symmetric and satisfies the triangle inequality. Define s1 = {xn}n≥1 and s2 = {yn}n≥1 to be equivalent (write {xn} ∼ {yn}) if d̃(s1, s2) = 0. Let S̄ be the set of all equivalence classes in S̃ and define d̄(c1, c2) ≡ d̃(s1, s2), where c1, c2 are equivalence classes and s1, s2 are arbitrary elements of c1 and c2, respectively.

It can now be verified that (S̄, d̄) is a complete metric space and (S, d) is embedded in (S̄, d̄) by identifying each x in S with the equivalence class containing the sequence {xn ≡ x}n≥1.

Definition A.4.5: A metric space (S, d) is separable if there exists a subset D ⊂ S that is countable and dense in S, i.e., for each x in S and ε > 0, there is a y in D such that d(x, y) < ε.

Example A.4.3: By the Archimedean property, Q is dense in R. Similarly Qk, the set of all k-vectors with components from Q, is dense in Rk.


Definition A.4.6: A metric space (S, d) is called Polish if it is complete and separable.

Example A.4.4: (Rk, dp) in Example A.4.2 is Polish.

A.4.2 Continuous functions

Let (S, d) and (T, ρ) be two metric spaces. Let f : S → T be a map from S to T.

Definition A.4.7:

(a) f is continuous at p in S if for each ε > 0, there exists δ > 0 such that d(x, p) < δ ⇒ ρ(f(x), f(p)) < ε. (Here the δ may depend on ε and p.)

(b) f is continuous on a set B ⊂ S if it is continuous at every p ∈ B.

(c) f is uniformly continuous on B if for each ε > 0, there exists δ > 0 such that for each pair x, y in B, d(x, y) < δ ⇒ ρ(f(x), f(y)) < ε.

Definition A.4.8: Let (S, d) be a metric space.

(a) A set O ⊂ (S, d) is open if x ∈ O ⇒ there exists δ > 0 such that d(x, y) < δ ⇒ y ∈ O. That is, at every point x in O, an open ball Bx(δ) ≡ {y : d(x, y) < δ} of positive radius δ is a subset of O.

(b) A set C ⊂ (S, d) is closed if Cc is open.

Theorem A.4.1: Let (S, d) and (T, ρ) be metric spaces. A map f : S → T is continuous on S iff for each O open in T, f^{−1}(O) is open in S.

Proof is left as an exercise (Problem A.28).

A.4.3 Compactness

Definition A.4.9: A collection of open sets {Oα : α ∈ I} is an open cover for a set B ⊂ (S, d) if for each x ∈ B, there exists α ∈ I such that x ∈ Oα.

Example A.4.5: Let B = (0, 1). Then the collection {(α − α/2, α + (1 − α)/2) : α ∈ Q ∩ (0, 1)} is an open cover for B.

Definition A.4.10: Let (S, d) be a metric space. A set K ⊂ S is called compact if given any open cover {Oα : α ∈ I} for K, there exists a finite subcollection {Oαi : αi ∈ I, i = 1, 2, . . . , n}, n < ∞, that is an open cover for K.

Example A.4.6: The set B = (0, 1) is not compact, as the open cover in Example A.4.5 above does not admit a finite subcover.


The next result is the well-known Heine-Borel theorem.

Theorem A.4.2:

(i) For any −∞ < a < b < ∞, the closed interval [a, b] is compact in R.

(ii) Any K ⊂ R is compact iff it is bounded and closed.

For a proof, see Rudin (1976). From Theorem A.4.1, it is seen that the inverse image of an open set under a continuous function is open, but the forward image may not have this property. But the following is true.

Theorem A.4.3: Let (S, d) and (T, ρ) be two metric spaces and let f : (S, d) → (T, ρ) be continuous. Let K ⊂ S be compact. Then f(K) is compact.

The proof is left as an exercise (Problem A.35).

A.4.4 Sequences of functions and uniform convergence

Definition A.4.11: Let (S, d) and (T, ρ) be two metric spaces and let {fn}n≥1 be a sequence of functions from (S, d) to (T, ρ). The sequence {fn}n≥1 is said to:

(a) converge pointwise to f on a set A ⊂ S if lim_{n→∞} fn(x) = f(x) for each x in A;

(b) converge uniformly to f on a set A ⊂ S if for each ε > 0, there exists Nε > 0 (depending on ε and A) such that n ≥ Nε ⇒ ρ(fn(x), f(x)) < ε for all x in A.

A consequence of uniform convergence is the preservation of the continuity property.

Theorem A.4.4: Let (S, d) and (T, ρ) be two metric spaces and let {fn}n≥1 be a sequence of functions from (S, d) to (T, ρ). Let, for each n ≥ 1, fn be continuous on A ⊂ S. Let {fn}n≥1 converge to f uniformly on A. Then f is continuous on A.

Proof: The proof is based on the "break up into three parts" idea. By the triangle inequality,

ρ(f(x), f(y)) ≤ ρ(f(x), fn(x)) + ρ(fn(x), fn(y)) + ρ(fn(y), f(y)).

Fix x in A. By the uniform convergence on A, sup{ρ(fn(u), f(u)) : u ∈ A} → 0 as n → ∞. So for each ε > 0, there exists Nε < ∞ such that n ≥ Nε ⇒ ρ(fn(u), f(u)) < ε/3 for all u in A. Now since fNε(·) is continuous on A, there exists a δ > 0 (depending on Nε and x) such that d(x, y) < δ, y ∈ A ⇒ ρ(fNε(y), fNε(x)) < ε/3. Thus, y ∈ A, d(x, y) < δ ⇒ ρ(f(x), f(y)) < 2ε/3 + ε/3 = ε.
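The standard example fn(x) = xⁿ on [0, 1] (treated again in Problem A.36) shows why uniformity is needed: the pointwise limit is discontinuous at 1, and the convergence is uniform only on subintervals [0, b] with b < 1. A brief numerical sketch of this, illustration only, in Python with NumPy assumed:

```python
import numpy as np

# f_n(x) = x^n on [0, 1]: the pointwise limit (0 for x < 1, 1 at x = 1) is not
# continuous, and the convergence is not uniform on [0, 1]; on [0, 0.9] the
# convergence is uniform and the (identically zero) limit is continuous.
x_full = np.linspace(0.0, 1.0, 1001)
x_sub = np.linspace(0.0, 0.9, 1001)
limit = (x_full == 1.0).astype(float)          # pointwise limit on the full grid
for n in (5, 20, 80):
    print(n,
          np.max(np.abs(x_full**n - limit)),   # stays near 1: no uniform convergence
          np.max(x_sub**n))                    # tends to 0: uniform convergence on [0, 0.9]
```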


A.5 Problems

A.1 Express the following sets in the form {x : x has property p}.

(a) The set A of all integers which when divided by 7 leave a remainder ≤ 3.

(b) The set B of all functions from [0, 1] to R with at most two discontinuity points.

(c) The set C of all students at a given university who are graduate students with at least one course in mathematics at the graduate level.

(d) The set D of all algebraic numbers. (A number x is called an algebraic number if it is the root of a polynomial with rational coefficients.)

(e) The set E of all possible sequences whose elements are either 0 or 1.

A.2 Give an example of sets A1, A2 such that A1 ∩A2 = A1 ∪A2.

A.3 Let I = [0, 1], Ω = R and for α ∈ R, Aα = (α − 1, α + 1), the open interval {x : α − 1 < x < α + 1}.

(a) Show that ∪_{α∈I} Aα = (−1, 2) and ∩_{α∈I} Aα = (0, 1).

(b) Suppose J = {x : x ∈ I, x is rational}. Find ∪_{x∈J} Ax and ∩_{x∈J} Ax.

A.4 With Ω ≡ N ≡ {1, 2, 3, . . .}, find Ac in the following cases:

(a) A = {ω : ω is divisible by 2 or 3 or both}. If ω ∈ Ac, what can be said about its prime factors?

(b) A = {ω : ω is divisible by 15 and 16}.

(c) A = {ω : ω is a perfect square}.

A.5 Show that X ≡ {0, 1}^N, the set of all sequences {ωi}_{i∈N} where each ωi ∈ {0, 1}, is uncountable. Conclude that P(N) is uncountable.

A.6 Show that if Ωi is countable for each i ∈ N, then for each k ∈ N, ×_{i=1}^{k} Ωi is countable and ∪_{i∈N} Ωi is also countable, but ×_{i∈N} Ωi is not countable.

A.7 Show that the set of all polynomials in x with integer coefficients is countable.

A.8 Show that the well ordering property implies the principle of induction.

A.9 Apply the principle of induction to establish the following:


(a) For each n ∈ N, ∑_{j=1}^{n} j² = n(n + 1)(2n + 1)/6.

(b) For each n ∈ N, x1, x2, . . . , xk ∈ R,

(i) (The binomial formula). (x1 + x2)^n = ∑_{r=0}^{n} C(n, r) x1^r x2^{n−r}, where C(n, r) = n!/(r!(n − r)!).

(ii) (The multinomial formula).

(x1 + x2 + . . . + xk)^n = ∑ [n!/(r1! r2! . . . rk!)] x1^{r1} x2^{r2} . . . xk^{rk},

where the summation extends over all (r1, r2, . . . , rk) of nonnegative integers with 0 ≤ ri ≤ n and ∑_{i=1}^{k} ri = n.

A.10 Verify that on R, the relation x ∼ y if x − y is rational is an equivalence relation, but the relation x ∼ y if x − y is irrational is not.

A.11 Show that the set A = {r : r ∈ Q, r² < 2} is bounded above in Q but has no l.u.b. in Q.

A.12 Show that for any two sequences {xn}n≥1, {yn}n≥1 ⊂ R,

lim inf_{n→∞} xn + lim inf_{n→∞} yn ≤ lim inf_{n→∞} (xn + yn) ≤ lim sup_{n→∞} (xn + yn) ≤ lim sup_{n→∞} xn + lim sup_{n→∞} yn.

A.13 Verify that lim_{n→∞} xn = a ∈ R iff lim inf_{n→∞} xn = lim sup_{n→∞} xn = a.

A.14 Establish Proposition A.2.3.

(Hint: First show that a Cauchy sequence is bounded and then show that lim inf_{n→∞} xn = lim sup_{n→∞} xn.)

A.15 (a) Prove Proposition A.2.4 by comparison with the geometric series.

(b) Show that for integer k ≥ 1, the power series ∑_{n=k}^{∞} n(n − 1) · · · (n − k + 1) a_n x^{n−k} has the same radius of convergence as ∑_{n=0}^{∞} a_n x^n.

A.16 Show that the series ∑_{j=2}^{∞} 1/(j(log j)^p) converges for p > 1 and diverges for p ≤ 1.

A.17 Find the radius of convergence, ρ, for the power series A(x) ≡ ∑_{n=0}^{∞} a_n x^n where

(a) an = n(n + 1), n ≥ 0.

(b) an = n^p, n ≥ 0, p ∈ R.


(c) an = 1/n!, n ≥ 0 (where 0! = 1).

A.18 (a) Find the Taylor series at a = 0 for the function f(x) = 1/(1 − x) in I ≡ (−1, +1) and show that it converges to f(x) on I.

(b) Find the Taylor series of 1 + x + x² in I = (1, 3), centered at 2.

(c) Let f(x) = e^{−1/x²} if |x| < 1, x ≠ 0, and f(x) = 0 if x = 0.

(i) Show that f is infinitely differentiable at 0 and compute f^{(j)}(0) for all j ≥ 1.

(ii) Show that the Taylor series at a = 0 converges but not to f on (−1, 1).

A.19 Let S = {z : z ∈ C, |z| = 1} be the unit circle. Using the parameterization t → e^{ιt} = (cos t + ι sin t) from [0, 2π] to S, show that the arc length of S (i.e., the circumference of the unit circle) is 2π.

A.20 Set φ(t) = sin t / cos t for −π/2 < t < π/2. Verify that φ′ = 1 + φ² and that φ : (−π/2, π/2) → (−∞, ∞) is strictly monotone increasing and onto. Conclude that

∫_{−∞}^{∞} 1/(1 + x²) dx = ∫_{−π/2}^{π/2} φ′(t)/(1 + (φ(t))²) dt = π.

A.21 Using the property that e^{ιπ/2} = ι, verify that for all t in R

cos(π/2 − t) = sin t,  sin(π/2 − t) = cos t,
cos(π + t) = − cos t,  sin(π + t) = − sin t,
cos(2π + t) = cos t,  sin(2π + t) = sin t.

Also show that cos t is a strictly decreasing map from [0, π] onto [−1, 1] and that sin t is a strictly increasing map from [−π/2, π/2] onto [−1, 1].

A.22 Using (i) of Theorem A.3.1, express cos(t1 + t2), sin(t1 + t2) in terms of cos ti, sin ti, i = 1, 2, and in turn use this to prove Corollary A.3.4 from Theorem A.3.3.

A.23 Verify that pn(z) ≡ (1 + z/n)ⁿ converges to e^z uniformly on bounded sets in C.

A.24 (a) Verify that for p = 1, p = 2 and p = ∞, dp is a metric on Rk.

(b) Show that for fixed x and y, ϕ(p) ≡ dp(x, y) is continuous in p on [1, ∞].


(c) Draw the open unit ball Bp ≡ {x : x ∈ R², dp(x, 0) < 1} in R² for p = 1, 2 and ∞.

A.25 Let S = C[0, 1] be the set of all real valued continuous functions on [0, 1]. Now let

d1(f, g) = ∫_0^1 |f(x) − g(x)| dx   (area metric),
d2(f, g) = (∫_0^1 |f(x) − g(x)|² dx)^{1/2}   (least square metric),
d∞(f, g) = sup{|f(x) − g(x)| : 0 ≤ x ≤ 1}   (sup metric).

Show that all these are metrics on S.

A.26 Let S = R^∞ ≡ {{xn}n≥1 : xn ∈ R for all n ≥ 1} be the space of all sequences of real numbers. Let

d({xn}n≥1, {yn}n≥1) = ∑_{j=1}^{∞} [|xj − yj| / (1 + |xj − yj|)] (1/2^j).

Show that (S, d) is a Polish space.

A.27 If sk = {xkn}n≥1 and s = {xn}n≥1 are elements of S = R^∞ as in Problem A.26, verify that as k → ∞, sk → s iff xkn → xn for all n ≥ 1.

A.28 Establish Theorem A.4.1.

A.29 Let S = C[0, 1] and dp(f, g) ≡ (∫_0^1 |f(t) − g(t)|^p dt)^{1/p} for 1 ≤ p < ∞ and d∞(f, g) = sup{|f(t) − g(t)| : t ∈ [0, 1]}.

(a) Let f(t) ≡ 1. Let fn(t) ≡ 1 for 0 ≤ t ≤ 1 − 1/n, and fn(t) = n(1 − t) for 1 − 1/n ≤ t ≤ 1. Show that dp(fn, f) → 0 for 1 ≤ p < ∞ but d∞(fn, f) does not converge to 0.

(b) Fix f ∈ C[0, 1]. Let gn(t) = f(t) for 0 ≤ t ≤ 1 − 1/n, and gn(t) = f(1 − 1/n) + (f(1) − f(1 − 1/n)) n(t + 1/n − 1) for 1 − 1/n ≤ t ≤ 1. Show that dp(gn, f) → 0 for all 1 ≤ p ≤ ∞.

A.30 Show that if xnn≥1 is a convergent sequence in a metric space (S, d),then it is Cauchy.

A.31 Verify (b) of Example A.4.2 from the axioms of real numbers (cf. Proposition A.2.3). Verify (c) of the same example from (b).

A.32 Let S = C[0, 1] and d be the supremum metric, i.e.,

d(f, g) = sup{|f(x) − g(x)| : 0 ≤ x ≤ 1}.

By approximating any continuous function with piecewise linear functions with rational end points and rational values, show that (S, d) is Polish, i.e., it is complete and separable.


A.33 Show that the function f(x) = x² is continuous on R, uniformly so on any bounded set B ⊂ R, but not uniformly on R.

A.34 Show that unions of open sets are open and the intersection of any two open sets is open. Give an example to show that the intersection of an infinite number of open sets need not be open.

A.35 Prove Theorem A.4.3.

A.36 Let fn(x) = xⁿ and g(x) ≡ 0 on R. Then {fn}n≥1 converges pointwise to g on (−1, 1), uniformly on [a, b] for −1 < a < b < 1, but not uniformly on (0, 1).

A.37 Let {fn}n≥1, f ∈ C[0, 1]. Let {fn}n≥1 converge to f uniformly on [0, 1]. Show that lim_{n→∞} ∫_0^1 |fn(x) − f(x)| dx = 0 and lim_{n→∞} ∫_0^1 fn(x) dx = ∫_0^1 f(x) dx.

A.38 Give a proof of Proposition A.2.7 (vi) (term by term differentiability of a power series) using Proposition A.2.8 (iv) (the fundamental theorem of Riemann integration).


Appendix B
List of Abbreviations and Symbols

B.1 Abbreviations

a.c. absolutely continuous (functions)
a.e. almost everywhere
AR(1) autoregressive process of order one
a.s. almost sure(ly)
BCT bounded convergence theorem
BGW Bienaymé-Galton-Watson
cdf cumulative distribution function
CE conditional expectation
CLT central limit theorem
CTMC continuous time Markov chain
DCT dominated convergence theorem
EDCT extended dominated convergence theorem
fdds finite dimensional distributions
iff if and only if
IFS iterated function system
iid independent and identically distributed
IIIDRM iterations of iid random maps
i.o. infinitely often
LIL law of the iterated logarithm
LLN laws of large numbers


MBB moving block bootstrap
m.c.f.a. monotone continuity from above
m.c.f.b. monotone continuity from below
MCMC Markov chain Monte Carlo
MCT monotone convergence theorem
o.n.b. orthonormal basis
r.c.p. regular conditional probability
SBM standard Brownian motion
SLLN strong law of large numbers
s.o.c. second order correctness
SSRW simple symmetric random walk
UI uniform integrability
WLLN weak law of large numbers
w.p. 1 with probability one
w.r.t. with respect to

w.l.o.g. without loss of generality

B.2 Symbols

µ ≪ ν absolute continuity of a measure
−→d convergence in distribution
−→p convergence in probability
(·) ∗ (·) convolution of measures, functions, etc.
(·)∗ extension of a measure
a ∼ b a and b are equivalent (under an equivalence relation)
an ∼ bn an/bn → 1 as n → ∞
⌊a⌋ the integer part of a, i.e., ⌊a⌋ = k if k ≤ a < k + 1, k ∈ Z, a ∈ R
⌈a⌉ the smallest integer not less than a, i.e., ⌈a⌉ = k + 1 if k < a ≤ k + 1, k ∈ Z, a ∈ R
Ā closure of A
Ac complement of a set A
∂A boundary of A
A∆B symmetric difference of two sets A and B, i.e., A∆B = (A ∩ Bc) ∪ (Ac ∩ B)
B(S) Borel σ-algebra on a metric space S such as S = R, Rk, R∞
B(S, R) ≡ {f | f : S → R, F-measurable, sup{|f(s)| : s ∈ S} ≤ 1}
B(x, ε), Bx(ε) open ball of radius ε with center at x in a metric space (S, d), i.e., {y : d(x, y) < ε}


C the set of all complex numbers
C[a, b] = {f | f : [a, b] → R, f continuous}
CB(R) ≡ {f | f : R → R, f bounded and continuous}
Cc(R) ≡ {f | f : R → R, continuous and f ≡ 0 outside a bounded interval}
C0(R) ≡ {f | f : R → R, continuous and lim_{|x|→∞} f(x) = 0}
C0(S) = {f | f : S → R, f continuous and for every ε > 0, there exists a compact set Kε such that |f(x)| < ε for x ∉ Kε}
C(F), CF the set of all continuity points of a cdf F
δij Kronecker delta, i.e., δij = 1 if i = j and = 0 if i ≠ j
δx the probability distribution putting mass one at x
dµ/dν Radon-Nikodym derivative of µ w.r.t. ν
E(Y |G) conditional expectation of Y given G
H⊥ orthogonal complement of a subspace H of a Hilbert space
ι √−1
IA(·) the indicator function of a set A
Ik the identity matrix of order k
λ⟨A⟩ λ-class generated by a class of sets A
Lp(Ω, F, µ) = {f | f : Ω → F, F-measurable, ∫ |f|^p dµ < ∞}, with F = R or C (F = C in Sections 5.6, 5.7 only)
Lp(R) = Lp(R, B(R), m)
m the Lebesgue measure
µF Lebesgue-Stieltjes measure corresponding to F
µ ⊥ ν singularity of measures µ and ν
N the set of natural numbers
∅ the null set
Φ(·) standard normal cdf, i.e., Φ(x) ≡ (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2} du, −∞ < x < ∞
P(A|G) probability of A given G
Pλ(·) probability distribution of a Markov chain with initial distribution λ
P(Ω) the power set of Ω = {A : A ⊂ Ω}
Px(·) same as Pλ with λ = δx
(Ω, F, P) generic probability space
(Ω, F, µ) generic measure space
Q the set of all rationals
R the set of real numbers, (−∞, ∞)
R+ the set of nonnegative real numbers, [0, ∞)
R̄ the set of all extended real numbers, [−∞, ∞]
R̄+ = [0, ∞]


(S, d) a metric space S with a metric d
σ⟨A⟩ σ-algebra generated by a class of sets A
σ⟨fa : a ∈ A⟩ σ-algebra generated by a collection of mappings {fa : a ∈ A}
T tail σ-algebra
|z| = √(a² + b²), the absolute value of a complex number z = a + ιb, a, b ∈ R
Re(z) = a, the real part of a complex number z = a + ιb, a, b ∈ R
Im(z) = b, the imaginary part of a complex number z = a + ιb, a, b ∈ R
Z the set of all integers = {0, ±1, ±2, . . .}
Z+ the set of all nonnegative integers = {0, 1, 2, . . .}


References

Arcones, M. A. and Gine, E. (1989), ‘The bootstrap of the mean with arbi-trary bootstrap sample size’, Ann. Inst. H. Poincare Probab. Statist.25(4), 457–481.

Arcones, M. A. and Gine, E. (1991), ‘Additions and correction to: “Thebootstrap of the mean with arbitrary bootstrap sample size” [Ann.Inst. H. Poincare Probab. Statist. 25(4) (1989), 457–481]’, Ann. Inst.H. Poincare Probab. Statist. 27(4), 583–595.

Athreya, K. B. (1986), ‘Darling and Kac revisited’, Sankhya A 48(3), 255–266.

Athreya, K. B. (1987a), ‘Bootstrap of the mean in the infinite variancecase’, Ann. Statist. 15(2), 724–731.

Athreya, K. B. (1987b), Bootstrap of the mean in the infinite variance case,in ‘Proceedings of the 1st World Congress of the Bernoulli Society’,Vol. 2, VNU Sci. Press, Utrecht, pp. 95–98.

Athreya, K. B. (2000), ‘Change of measures for Markov chains and thel log l theorem for branching processes’, Bernoulli 6, 323–338.

Athreya, K. B. (2004), ‘Stationary measures for some Markov chain modelsin ecology and economics’, Econom. Theory 23(1), 107–122.

Athreya, K. B., Doss, H. and Sethuraman, J. (1996), ‘On the convergenceof the Markov chain simulation method’, Ann. Statist. 24(1), 69–100.


Athreya, K. B. and Jagers, P., eds (1997), Classical and Modern Branch-ing Processes, Vol. 84 of The IMA Volumes in Mathematics and itsApplications, Springer-Verlag, New York.

Athreya, K. B. and Ney, P. (1978), ‘A new approach to the limit theory ofrecurrent Markov chains’, Trans. Amer. Math. Soc. 245, 493–501.

Athreya, K. B. and Ney, P. E. (2004), Branching Processes, Dover Pub-lications, Inc, Mineola, NY. (Reprint of Band 196, Grundlehren derMathematischen Wissenschaften, Springer-Verlag, Berlin).

Athreya, K. B. and Pantula, S. G. (1986), ‘Mixing properties of Harrischains and autoregressive processes’, J. Appl. Probab. 23(4), 880–892.

Athreya, K. B. and Stenflo, O. (2003), ‘Perfect sampling for Doeblin chains’,Sankhya A 65(4), 763–777.

Bahadur, R. R. (1966), ‘A note on quantiles in large samples’, Ann. Math.Statist. 37, 577–580.

Barnsley, M. F. (1992), Fractals Everywhere, 2nd edn, Academic Press,New York.

Berbee, H. C. P. (1979), Random Walks with Stationary Increments andRenewal Theory, Mathematical Centre, Amsterdam.

Berry, A. C. (1941), ‘The accuracy of the Gaussian approximation to thesum of independent variates’, Trans. Amer. Math. Soc. 48, 122–136.

Bhatia, R. (2003), Fourier Series, 2nd edn, Hindustan Book Agency, NewDelhi, India.

Bhattacharya, R. N. and Rao, R. R. (1986), Normal Approximation andAsymptotic Expansions, Robert E. Krieger, Melbourne, FL.

Billingsley, P. (1968), Convergence of Probability Measures, John Wiley,New York.

Billingsley, P. (1995), Probability and Measure, 3rd edn, John Wiley, NewYork.

Bradley, R. C. (1983), ‘Approximation theorems for strongly mixing ran-dom variables’, Michigan Math. J. 30(1), 69–81.

Brillinger, D. R. (1975), Time Series. Data Analysis and Theory, Holt,Rinehart and Winston, Inc, New York.

Carlstein, E. (1986), ‘The use of subseries values for estimating the vari-ance of a general statistic from a stationary sequence’, Ann. Statist.14(3), 1171–1179.


Chanda, K. C. (1974), ‘Strong mixing properties of linear stochastic pro-cesses’, J. Appl. Probab. 11, 401–408.

Chow, Y.-S. and Teicher, H. (1997), Probability Theory: Independence, In-terchangeability, Martingales, Springer-Verlag, New York.

Chung, K. L. (1967), Markov Chains with Stationary Transition Probabil-ities, 2nd edn, Springer-Verlag, New York.

Chung, K. L. (1974), A Course in Probability Theory, 2nd edn, AcademicPress, New York.

Cohen, P. (1966), Set Theory and the Continuum Hypothesis, Benjamin,New York.

Doob, J. L. (1953), Stochastic Processes, John Wiley, New York.

Doukhan, P., Massart, P. and Rio, E. (1994), ‘The functional central limittheorem for strongly mixing processes’, Ann. Inst. H. Poincare Probab.Statist. 30, 63–82.

Durrett, R. (2001), Essentials of Stochastic Processes, Springer-Verlag, NewYork.

Durrett, R. (2004), Probability: Theory and Examples, 3rd edn, DuxburyPress, San Jose, CA.

Efron, B. (1979), ‘Bootstrap methods: Another look at the jackknife’, Ann.Statist. 7(1), 1–26.

Esseen, C.-G. (1942), ‘Rate of convergence in the central limit theorem’,Ark. Mat. Astr. Fys. 28A(9).

Esseen, C.-G. (1945), ‘Fourier analysis of distribution functions. a mathe-matical study of the Laplace-Gaussian law’, Acta Math. 77, 1–125.

Etemadi, N. (1981), ‘An elementary proof of the strong law of large num-bers’, Z. Wahrsch. Verw. Gebiete 55(1), 119–122.

Feller, W. (1966), An Introduction to Probability Theory and Its Applica-tions, Vol. II, John Wiley, New York.

Feller, W. (1968), An Introduction to Probability Theory and Its Applica-tions, Vol. I, 3rd edn, John Wiley, New York.

Geman, S. and Geman, D. (1984), ‘Stochastic relaxation, Gibbs distribu-tions and the Bayesian restoration of images’, IEEE Trans. PatternAnalysis Mach. Intell. 6, 721–741.

Gine, E. and Zinn, J. (1989), ‘Necessary conditions for the bootstrap of themean’, Ann. Statist. 17(2), 684–691.


Gnedenko, B. V. and Kolmogorov, A. N. (1968), Limit Distributionsfor Sums of Independent Random Variables, Revised edn, Addison-Wesley, Reading, MA.

Gorodetskii, V. V. (1977), ‘On the strong mixing property for linear se-quences’, Theory Probab. 22, 411–413.

Gotze, F. and Hipp, C. (1978), ‘Asymptotic expansions in the centrallimit theorem under moment conditions’, Z. Wahrsch. Verw. Gebiete42, 67–87.

Gotze, F. and Hipp, C. (1983), ‘Asymptotic expansions for sums of weaklydependent random vectors’, Z. Wahrsch. Verw. Gebiete 64, 211–239.

Hall, P. (1985), ‘Resampling a coverage pattern’, Stochastic Process. Appl.20, 231–246.

Hall, P. (1992), The Bootstrap and Edgeworth Expansion, Springer-Verlag,New York.

Hall, P. G. and Heyde, C. C. (1980), Martingale Limit Theory and ItsApplications, Academic Press, New York.

Hall, P., Horowitz, J. L. and Jing, B.-Y. (1995), ‘On blocking rules for thebootstrap with dependent data’, Biometrika 82, 561–574.

Herrndorf, N. (1983), ‘Stationary strongly mixing sequences not satisfyingthe central limit theorem’, Ann. Probab. 11, 809–813.

Hewitt, E. and Stromberg, K. (1965), Real and Abstract Analysis, Springer-Verlag, New York.

Hoel, P. G., Port, S. C. and Stone, C. J. (1972), Introduction to StochasticProcesses, Houghton-Mifflin, Boston, MA.

Ibragimov, I. A. and Rozanov, Y. A. (1978), Gaussian Random Processes,Springer-Verlag, Berlin.

Karatzas, I. and Shreve, S. E. (1991), Brownian Motion and StochasticCalculus, 2nd edn, Springer-Verlag, New York.

Karlin, S. and Taylor, H. M. (1975), A First Course in Stochastic Processes,Academic Press, New York.

Kifer, Y. (1988), Random Perturbations of Dynamical Systems, Birkhauser,Boston, MA.

Kolmogorov, A. N. (1956), Foundations of the Theory of Probability, 2ndedn, Chelsea, New York.


Korner, T. W. (1989), Fourier Analysis, Cambridge University Press, NewYork.

Kunsch, H. R. (1989), ‘The jackknife and the bootstrap for general station-ary observations’, Ann. Statist. 17, 1217–1261.

Lahiri, S. N. (1991), ‘Second order optimality of stationary bootstrap’,Statist. Probab. Lett. 11, 335–341.

Lahiri, S. N. (1992), ‘Edgeworth expansions for m-estimators of a regressionparameter’, J. Multivariate Analysis 43, 125–132.

Lahiri, S. N. (1994), ‘Rates of bootstrap approximation for the mean oflattice variables’, Sankhya A 56, 77–89.

Lahiri, S. N. (1996), ‘Asymptotic expansions for sums of random vectorsunder polynomial mixing rates’, Sankhya A 58, 206–225.

Lahiri, S. N. (2001), ‘Effects of block lengths on the validity of block re-sampling methods’, Probab. Theory Related Fields 121, 73–97.

Lahiri, S. N. (2003), Resampling Methods for Dependent Data, Springer-Verlag, New York.

Lehmann, E. L. and Casella, G. (1998), Theory of Point Estimation,Springer-Verlag, New York.

Lindvall, T. (1992), Lectures on Coupling Theory, John Wiley, New York.

Liu, R. Y. and Singh, K. (1992), Moving blocks jackknife and bootstrapcapture weak dependence, in R. Lepage and L. Billard, eds, ‘Exploringthe Limits of the Bootstrap’, John Wiley, New York, pp. 225–248.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. andTeller, E. (1953), ‘Equations of state calculations by fast computingmachines’, J. Chem. Physics 21, 1087–1092.

Meyn, S. P. and Tweedie, R. L. (1993), Markov Chains and StochasticStability, Springer-Verlag, New York.

Munkres, J. R. (1975), Topology, A First Course, Prentice Hall, EnglewoodCliffs, NJ.

Nummelin, E. (1978), ‘A splitting technique for Harris recurrent Markovchains’, Z. Wahrsch. Verw. Gebiete 43(4), 309–318.

Nummelin, E. (1984), General Irreducible Markov Chains and NonnegativeOperators, Cambridge University Press, Cambridge.

Orey, S. (1971), Limit Theorems for Markov Chain Transition Probabilities,Van Nostrand Reinhold, London.


Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Aca-demic Press, San Diego, CA.

Parthasarathy, K. R. (2005), Introduction to Probability and Measure,Vol. 33 of Texts and Readings in Mathematics, Hindustan BookAgency, New Delhi, India.

Peligrad, M. (1982), ‘Invariance principles for mixing sequences of randomvariables’, Ann. Probab. 10(4), 968–981.

Petrov, V. V. (1975), Sums of Independent Random Variables, Springer-Verlag, New York.

Reiss, R.-D. (1974), ‘On the accuracy of the normal approximation forquantiles’, Ann. Probab. 2, 741–744.

Robert, C. P. and Casella, G. (1999), Monte Carlo Statistical Methods,Springer-Verlag, New York.

Rosenberger, W. F. (2002), ‘Urn models and sequential design’, SequentialAnal. 21(1–2), 1–41.

Royden, H. L. (1988), Real Analysis, 3rd edn, Macmillan Publishing Co.,New York.

Rudin, W. (1976), Principles of Mathematical Analysis, International Seriesin Pure and Applied Mathematics, 3rd edn, McGraw-Hill Book Co.,New York.

Rudin, W. (1987), Real and Complex Analysis, 3rd edn, McGraw-Hill BookCo., New York.

Shohat, J. A. and Tamarkin, J. D. (1943), The problem of moments,in ‘American Mathematical Society Mathematical Surveys’, Vol. II,American Mathematical Society, New York.

Singh, K. (1981), ‘On the asymptotic accuracy of Efron’s bootstrap’, Ann.Statist. 9, 1187–1195.

Strassen, V. (1964), ‘An invariance principle for the law of the iteratedlogarithm’, Z. Wahrsch. Verw. Gebiete 3, 211–226.

Stroock, D. W. and Varadhan, S. (1979), Multidimensional Diffusion Pro-cesses, Band 233, Grundlehren der Mathematischen Wissenschaften,Springer-Verlag, Berlin.

Szego, G. (1939), Orthogonal Polynomials, Vol. 23 of American Mathemat-ical Society Colloquium Publications, American Mathematical Society,Providence, RI.


Withers, C. S. (1981), ‘Conditions for linear processes to be strong-mixing’,Z. Wahrsch. Verw. Gebiete 57, 477–480.

Woodroofe, M. (1982), Nonlinear Renewal Theory in Sequential Analysis,SIAM, Philadelphia, PA.


Author Index

Arcones, M. A., 541Athreya, K. B., 428, 460, 464,

472, 475, 499, 516, 541,559, 561, 564, 565, 569

Bahadur, R. R., 557Barnsley, M. F., 460Berbee, H.C.P., 517Berry, A. C., 361Bhatia, R. P., 167Bhattacharya, R. N., 365, 368Billingsley, P., 14, 211, 254, 301,

306, 373, 375, 475, 498Bradley, R. C., 517Brillinger, D. R., 547

Carlstein, E., 548Casella, G., 391, 477, 480Chanda, K. C., 516Chow, Y. S., 364, 417, 430Chung, K. L., 14, 250, 323, 359,

489Cohen, P., 576

Doob, J. L., 211, 399

Doss, H., 472Doukhan, P., 528Durrett, R., 274, 278, 372, 393,

429, 493

Efron, B., 533, 545Esseen, C. G., 361, 368Etemadi, N., 244

Feller, W., 166, 265, 308, 313,323, 354, 357, 359, 362,489, 499

Geman, D., 477Geman, S., 477Gine, E., 541Gnedenko, B. V., 355, 359Gorodetskii, V. V., 516Gotze, F., 554

Hall, P. G., 365, 510, 534, 548,556

Herrndorf, N., 529Hewitt, E., 30, 576Heyde, C. C., 510Hipp, C., 554


Hoel, P. G., 455Horowitz, J. L., 556

Ibragimov, I. A., 516

Jagers, P., 561Jing, B. Y., 556

Karatzas, I., 494, 499, 504Karlin, S., 489, 493, 504Kifer, Y., 460Kolmogorov, A. N., 170, 200, 225,

244, 355, 359Korner, T. W., 167, 170Kunsch, H. R., 547

Lahiri, S. N., 533, 537, 549, 552,556

Lehmann, E. L., 391Lindvall, T., 265, 456Liu, R. Y., 547

Massart, P., 528Metropolis, N., 477Meyn, S. P., 456Munkres, J. R., 71

Ney, P., 464, 564, 565, 569Nummelin, E., 463, 464

Orey, S., 464

Pantula, S. G., 516Parthasarathy, K. R., 393Peligrad, M., 529Petrov, V. V., 365Port, S. C., 455

Rao, R. Ranga, 365, 368Reiss, R. D., 381Rio, E., 528Robert, C. P., 477, 480Rosenberger, W. F., 569Rosenbluth, A. W., 477Rosenbluth, M. N., 477Royden, H. L., 27, 62, 94, 97, 118,

128, 130, 156, 573, 579

Rozanov, Y. A., 516Rudin, W., 27, 94, 97, 132, 181,

195, 581, 590, 593

Sethuraman, J., 472Shreve, S. E., 494, 499, 504Shohat, J. A., 308Singh, K., 536, 537, 545, 547Stenflo, O., 460Stone, C. J., 455Strassen, V., 279, 576Stromberg, K., 30Stroock, D. W., 504Szego, G.,107

Tamarkin, 308Taylor, H. M., 489, 493, 504Teicher, H., 364, 417, 430Teller, A. H., 477Teller, E., 477Tweedie, R. L., 456

Varadhan, S.R.S., 504

Withers, C. S., 516Woodroofe, M., 407

Zinn, J., 541


Subject Index

λ-system, 13π-λ theorem, 13π-system, 13, 220σ-algebra, 10, 44

Borel, 12, 299product, 147, 157, 203tail, 225trace, 33trivial, 10

Abel’s summation formula, 254absolute continuity, 53, 113, 128,

319absorption probability, 484algebra, 9analysis of variance formula, 391arithmetic distribution (See dis-

tribution)

BCT (See theorem)Bahadur representation, 557Banach space, 96Bernouilli shift, 272Bernstein polynomial, 239betting sequence, 408

binary expansion, 135Black-Scholes formula, 504block bootstrap method, 547bootstrap method, 533Borel-Cantelli lemma, 245

conditional, 427first, 223second, 223, 232

branching process, 402, 431, 561Bienyeme-Galton-Watson,

562critical, 562, 564, 565subcritical, 562, 564, 566supercritical, 562, 565

Brownian bridge, 375, 498Brownian motion, 373

laws of large numbers, 499nondifferentiability, 499reflection, 495reflection principle, 496scaling properties, 495standard, 493, 507time inversion, 495translation invariance, 495


cardinality, 576Carleman’s condition, 308Cauchy sequence, 91, 581central limit theorem, 343, 510,

519α-mixing, 521, 525, 527, 528ρ-mixing, 529functional, 373iid, 345Lindeberg, 345, 521, 535Lyapounov’s, 348martingale, 510multivariate, 352random sums, 378sample quantiles, 379

change of variables, 81, 132, 141,193

Chapman-Kolmogorov equation,461, 488

characteristic function (See func-tion)

completemeasure space, 160metric space, 91, 96, 124, 237orthogonal basis, 171orthogonal set, 100orthogonal system, 169

complexconjugate, 322, 586logarithm, 362numbers, 586

conditionalexpectation, 386independence, 396probability, 392variance, 391

consistency, 281convergence

almost everywhere, 61almost surely, 238in distribution, 287, 288, 299in measure, 62pointwise, 61in probability, 237, 289of moments, 306

radius of, 164, 581of types, 355weak, 288, 299with probability one, 238vague, 291, 299

convexfunction, 84

convolution ofcdfs, 184functions, 163sequence, 162signed measures, 161

correlation coefficient, 248Cramer-Wold device, 336, 352coupling, 455, 516Cramer’s condition, 365cumulative distribution function,

45, 46, 133, 191joint, 221marginal, 221singular, 133

cycles, 445

DCT (See theorem)de Morgan’s laws, 72, 575Delta method, 310detailed balance condition, 492differentiation, 118, 130, 583

chain rule of, 583directly Riemann integrable, 268distribution

arithmetic, 264, 319Cauchy, 353compound Poisson, 359, 360initial, 439, 456nonarithmetic, 265discrete univariate, 144finite dimensional, 200lattice, 264, 319, 364nonlattice, 265, 364, 536Pareto, 358stationary, 440, 451, 469, 492

domain of attraction, 358, 541Doob’s decomposition, 404dual space, 94


EDCT (See theorem)Edgeworth expansions, 364, 365,

536, 538, 554empirical distribution function,

241, 375equivalence relation, 577ergodic, 273

Birkhoff’s (See theorem)Kingman’s (See theorem)(maximal) inequality, 274sequence, 273

essential supremum, 89exchangeable, 397exit probability, 502extinction probability, 491, 562,

565

Feller continuity, 473filtration, 399first passage time, 443, 462Fourier

coefficients, 99, 166, 170series, 187transform (See transforms)

functionabsolutely continuous, 129Cantor, 114Cantor ternary, 136characteristic, 317, 332continuous, 41, 299, 582, 592differentiable, 320, 583exponential, 587generating, 164Greens, 463Haar, 109, 493integrable, 54regularly varying, 354simple, 49slowly varying, 354transition, 458, 488trigonometric, 587

Gaussian process, 209Gibbs sampler, 480growth rate, 503

Harris irreducible (Seeirreducible)

Harris recurrence (Seerecurrence)

heavy tails, 358, 540Hilbert space, 98, 388hitting time, 443, 462

independence, 208, 211, 219, 336pairwise, 219, 224, 244, 247,

257inequalities

Bessel’s, 99Burkholder’s, 416Cauchy-Schwarz, 88, 98, 198Chebychev’s, 83, 196Cramer’s, 84Davydov’s, 518Doob’s Lp-maximal, 413Doob’s L log L maximal, 414Doob’s maximal, 412Doob’s upcrossing, 418Holder’s, 87, 88, 198Jensen’s, 86, 197, 390Kolmogorov’s first, 249Kolmogorov’s second, 249Levy’s, 250Markov’s, 83, 196Minkowski’s, 88, 198Rio’s, 517smoothing, 362

infinitely divisible distributions,358, 360

inner-product, 98integration

by parts formula, 155, 586Lebesgue, 51, 54, 61Riemann, 51, 59, 60, 61, 585

inversion formula, 175, 324, 325,326, 338

irreducible, 273, 447Harris, 462

isometry, 94isomorphism, 100iterated function system, 441, 460


Jordan decomposition, 123

Kolmogorov-Smirnov statistic,375, 499

Lp-norm, 89space, 89

large deviation, 368law of large numbers, 237, 448,

470strong (See SLLN)weak (See WLLN)

law of the iterated logarithm,278, 538

lemmaC-set, 464Fatou’s, 7, 54, 389Kronecker’s, 255, 433Riemann-Lebesgue, 171, 320Wald’s, 415

liminf, 223, 580, 595limits, 580limsup, 222, 580, 595Lindeberg condition, 344linear operator, 96Lyapounov’s condition, 347

MBB, 547MCMCMCT (See theorem)m-dependence, 515, 545Malthusian parameter, 566map

co-ordinate, 203projection, 203

Markov chain, 208, 439, 457absorbing, 447aperiodic, 454communicating, 447embedded, 488periodic, 453semi-, 468solidarity, 447

Markov process

branching, 490generator, 489

Markov property, 397, 488strong, 444, 497

martingale, 399, 501, 509, 563difference array, 509Doob’s, 401reverse, 423sub, 400super, 400

meanarithmetic, 87geometric, 87

mean matrix, 564measure, 14, 20

complete, 24, 30counting, 16finite, 16induced, 45Lebesgue, 26Lebesgue-Stieltjes, 17, 26,

28, 58, 59, 325negative variation, 121occupation, 476, 482outer, 22positive variation, 121probability, 16product, 151Radon, 29, 131regularity of, 29signed, 119singular, 114space, 39total variation, 121uniqueness of, 29

measurablerectangle, 147space, 39

median, 250method of moments, 307metric space, 12, 299, 590, 597

complete, 91, 96, 124, 300,591

discrete, 90Kolmogorov, 312


Levy, 311, 537Polish, 300, 306, 592separable, 300, 591supremum, 300, 537

Metropolis-Hastings algorithm,478

mixing, 278α, 514, 515β, 514φ, 514Ψ, 515ρ, 514coefficient, 514process, 513strongly, 514, 515

moment, 198convergence of (See

convergence)moment generating function, 194,

198, 314moment problem, 307, 309Monte-Carlo, 534

iid, 248Markov chain (See MCMC)

nonexplosion condition, 488nonnegative definite, 323norm, 95

total variation, 123, 269, 271normal numbers, 248, 281

absolutely, 281normed vector space, 95null array, 348

order statistics, 375orthogonal

basis, 100polynomials, 107

orthogonality, 99orthonormality, 99

parameterlevel 1, 533level 2, 533

Parseval identity, 170

partition, 32Polya’s criterion, 323polynomial

trigonometric, 170power series, 581

differentiability of, 583probabiilty

distribution, 45, 191space, 39

processbirth and death, 489branching (See branching

process)compound Poisson, 491empirical, 375Gaussian, 493, 516Levy, 491, 493, 497Markov (See Markov

process)Ornstein-Uhlenbeck, 498,

507Poisson, 490, 505regenerative, 268, 467, 491renewal (See renewal)stable, 497Yule, 505

product space, 147projection, 108, 384

quadratic variation, 500queues

M/G/1, 571M/M/∞, 505M/M/1, 505busy period, 571

Radon-Nikodym derivative, 118,404

random map, 442random Poisson measure, 541random variable, 39, 191

extended real-valued, 226tail, 225

random vector, 192random walk, 401, 431


multiplicative, 459simple symmetric, 446

recurrence, 444, 446, 456Harris, 463null, 444, 449positive, 444

regenerationtimes, 269

regression, 280, 351, 384, 531regular conditional probability,

392renewal

equation, 266function, 262, 266process, 261, 484, 567sequence, 260theorem, weak, 264theorem, strong, 265theorem, key, 267, 268

SLLN, 240, 259, 424Borel’s, 240Etemadi’s, 244Kolmogorov’s, 257Marcinkiewz-Zygmund, 256

sample path, 442sample space, 189second order correctness, 537, 554second order stationary, 529semialgebra, 3, 19

σ-finite, 28set, 573

Cantor, 37, 134compact, 592complement, 575cylinder, 202empty, 574function, 14intersection, 574Lebesgue measurable, 26power, 9product, 147, 575union, 574

slowly varying function (See func-tion)

sojourn time, 468, 487span, 319, 364stable distribution, 353, 360stationary, 271Stirling’s approximation, 313stochastic processes, 199stochastically bounded (See

tight)stopping time, 262, 406, 462

bounded, 262, 263finite, 406

sub-martingale (See martingale)subspace, 96super-martingale (See martin-

gale)symmetric difference, 4

ternary expansion, 135theorem

a.s. convergence, sub-martingale, 419

Berbee’s, 516Berry-Esseen, 361betting, 408binomial, 595Birkhoff’s, 274Bochner-Khinchine, 323Bolzano-Weirstrass, 291bounded convergence, 57Bradley’s, 517continuous mapping, 305de Finetti’s, 430dominated convergence, 7,

57, 77, 390Donsker, 498Doob’s optional stopping,

408, 409, 410Egorov’s, 69, 71extended dominated conver-

gence, 57extension, 24, 157, 204Feller’s, 348Frechet-Shohat, 307Fubini’s, 153, 222Glivenko-Cantelli, 241, 285


Hahn decomposition, 122Heine-Borel, 593Helly’s, 292, 296, 373, 475Helly-Bray, 293, 305Kakutani’s, 429Kallianpur-Robbins, 499Kesten-Stigum, 564Khinchine, 355Khinchine-Kolmogorov’s 1-

series, 252Kingman’s, 277Kolmogorov’s consistency,

200, 210Kolmogorov’s 3-series, 249,

252Lp convergence, sub-

martingales, 422Lebesgue decomposition,

115Levy-Cramer, 330, 360Levy-Khinchine, 359Lusin’s, 69mean value, 583minorization, 464monotone convergence, 6,

52, 77, 389multinomial, 595Polya’s, 242, 285, 290, 361,

556Prohorov-Varadarajan, 303,

373Radon-Nikodym, 115Rao-Blackwell, 391, 398regeneration, 465Riesz representation, 94, 97,

101, 144Scheffe’s, 64, 241Shannon-McMillan-

Breiman, 276Skorohod’s, 304, 306Slutsky’s, 290, 298Taylor’s, 584Tonelli’s, 152, 222

tight, 295, 303, 307time reversibility, 484

topological space, 11transformation

linear, 96measure preserving, 271

transformsFourier, 173, 320Laplace, 165Laplace-Stieltjes, 166Plancherel, 180

transient, 444, 449transition function (See function)transition probability, 439, 456

function, 209matrix, 209

triangular array, 343

uniform integrability, 65, 306upcrossing inequality (See

inequalities)urn schemes, 568

Polya’s, 568

vague compactness, 475variation

bounded, 127negative, 126positive, 126total, 126

vector space, 95volatility rate, 503

WLLN, 238waiting time, 459, 472Wald’s equation, 263Weyl’s equi-distribution

property, 314

zero-one law, 223Kolmogorov, 225, 422


Springer Texts in Statistics (continued from page ii)

Lehmann and Romano: Testing Statistical Hypotheses, Third EditionLehmann and Casella: Theory of Point Estimation, Second EditionLindman: Analysis of Variance in Experimental DesignLindsey: Applying Generalized Linear ModelsMadansky: Prescriptions for Working StatisticiansMcPherson: Applying and Interpreting Statistics: A Comprehensive Guide,

Second EditionMueller: Basic Principles of Structural Equation Modeling: An

Introduction to LISREL and EQSNguyen and Rogers: Fundamentals of Mathematical Statistics: Volume I:

Probability for StatisticsNguyen and Rogers: Fundamentals of Mathematical Statistics: Volume II:

Statistical InferenceNoether: Introduction to Statistics: The Nonparametric WayNolan and Speed: Stat Labs: Mathematical Statistics Through ApplicationsPeters: Counting for Something: Statistical Principles and PersonalitiesPfeiffer: Probability for ApplicationsPitman: ProbabilityRawlings, Pantula and Dickey: Applied Regression AnalysisRobert: The Bayesian Choice: From Decision-Theoretic Foundations to

Computational Implementation, Second EditionRobert and Casella: Monte Carlo Statistical MethodsRose and Smith: Mathematical Statistics with MathematicaRuppert: Statistics and Finance: An IntroductionSantner and Duffy: The Statistical Analysis of Discrete DataSaville and Wood: Statistical Methods: The Geometric ApproachSen and Srivastava: Regression Analysis: Theory, Methods, and

ApplicationsShao: Mathematical Statistics, Second EditionShorack: Probability for StatisticiansShumway and Stoffer: Time Series Analysis and Its Applications:

With R Examples, Second EditionSimonoff: Analyzing Categorical DataTerrell: Mathematical Statistics: A Unified IntroductionTimm: Applied Multivariate AnalysisToutenburg: Statistical Analysis of Designed Experiments, Second EditionWasserman: All of Nonparametric StatisticsWasserman: All of Statistics: A Concise Course in Statistical InferenceWeiss: Modeling Longitudinal DataWhittle: Probability via Expectation, Fourth EditionZacks: Introduction to Reliability Analysis: Probability Models and

Statistical Methods