
Probability and Measure Theory

Amit Rajaraman

Summer 2020

We primarily use [2] as reference for these notes.

Contents

0 Notation

1 Measure Theory
  1.1 Classes of Sets
  1.2 Measure
  1.3 Measurable Maps
  1.4 Outer Measure
  1.5 The Approximation and Extension Theorems
  1.6 Important Examples of Measures

2 Introduction to Probability
  2.1 Basic Definitions
  2.2 Important Examples of Random Variables
  2.3 The Product Measure

3 Independence
  3.1 Independent Events
  3.2 Independence of Random Variables
  3.3 The Convolution
  3.4 Kolmogorov's 0-1 Law

4 Generating Functions
  4.1 Definitions and Basics
  4.2 The Poisson Approximation
  4.3 Branching Processes

5 The Integral
  5.1 Set Up to Define the Integral
  5.2 The Integral and some Properties
  5.3 Monotone Convergence and Fatou's Lemma
  5.4 Miscellaneous

6 Moments
  6.1 Parameters of Random Variables
  6.2 The Weak Law of Large Numbers
  6.3 The Strong Law of Large Numbers
  6.4 Speed of Convergence in the Strong Law of Large Numbers


§0. Notation

N denotes the set {1, 2, . . .} and N0 denotes the set {0, 1, 2, . . .}.

For x ∈ R and r ∈ N,

    (x choose r) = x(x − 1) · · · (x − r + 1) / r!

is the generalised binomial coefficient (a small numerical sketch of it appears at the end of this section).

For n ∈ N, we denote {1, 2, . . . , n} as [n] and {0, 1, 2, . . . , n} as [n]0.

For a ∈ Rn, we denote the ith coordinate of a by ai for each i = 1, 2, . . . , n.

For a, b ∈ Rn, we write a < b if ai < bi for each i = 1, 2, . . . , n.

Let (an)n∈N be a sequence of reals. Then

    lim sup_{n→∞} an = lim_{n→∞} (sup_{m≥n} am) = inf_{n∈N} (sup_{m≥n} am),

    lim inf_{n→∞} an = lim_{n→∞} (inf_{m≥n} am) = sup_{n∈N} (inf_{m≥n} am).

Let A and B be two sets. We denote by A △ B = (A \ B) ∪ (B \ A)

the symmetric difference of A and B.

If sets A and B are disjoint, we represent their union as A ⊎ B (similar to the ⊕ notation in linear algebra).
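As promised above, here is a small Python sketch (ours, not part of the original notes) of the generalised binomial coefficient; the function name gen_binom is our own.

```python
from functools import reduce

def gen_binom(x: float, r: int) -> float:
    """Generalised binomial coefficient x(x-1)...(x-r+1)/r! for real x and integer r >= 0."""
    falling = reduce(lambda acc, i: acc * (x - i), range(r), 1.0)   # x(x-1)...(x-r+1)
    factorial = reduce(lambda acc, i: acc * i, range(1, r + 1), 1)  # r!
    return falling / factorial

assert gen_binom(5, 2) == 10.0  # agrees with the usual binomial coefficient
print(gen_binom(-3, 2))         # 6.0; a negative upper index appears in the negative binomial distribution
```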


§1. Measure Theory

Before beginning a rigorous study of probability theory, it is necessary to understand some parts of basic measure theory.

1.1. Classes of Sets

Let Ω be a non-empty set and A ⊆ 2Ω, where 2Ω is the power set of Ω. Then

Definition 1.1. A is called

• ∩-closed (closed under intersections) or a π-system if A ∩B ∈ A for all A,B ∈ A.

• σ-∩-closed (closed under countable intersections) if ⋂_{i=1}^∞ Ai ∈ A for any choice of countably many sets A1, A2, . . . ∈ A.

• ∪-closed (closed under unions) if A ∪B ∈ A for all A,B ∈ A.

• σ-∪-closed (closed under countable unions) if ⋃_{i=1}^∞ Ai ∈ A for any choice of countably many sets A1, A2, . . . ∈ A.

• \-closed (closed under differences) if A \B ∈ A for all A,B ∈ A.

• closed under complements if Ac = Ω \A ∈ A for all A ∈ A.

Theorem 1.1. Let A be closed under complements. Then A is ∪-closed (respectively σ-∪-closed) if and only if A is ∩-closed (respectively σ-∩-closed).

The above is relatively straightforward to prove using De Morgan’s Laws.

Theorem 1.2. Let A be \-closed. Then

(a) A is ∩-closed,

(b) if A is σ-∪-closed, then A is σ-∩-closed.

(c) Any countable union of sets in A can be expressed as a countable union of pairwise disjoint sets in A.

Proof.

(a) For A,B ∈ A, A ∩B = A \ (A \B) ∈ A.

(b) Let A1, A2, . . . ∈ A. Then

    ⋂_{i=1}^∞ Ai = ⋂_{i=1}^∞ (A1 ∩ Ai) = ⋂_{i=1}^∞ (A1 \ (A1 \ Ai)) = A1 \ ⋃_{i=1}^∞ (A1 \ Ai) ∈ A.

(c) Let A1, A2, . . . ∈ A. We then have

    ⋃_{i=1}^∞ Ai = A1 ⊎ (A2 \ A1) ⊎ ((A3 \ A2) \ A1) ⊎ · · ·

The result follows.


This equivalence between ∩-closure and ∪-closure when the class is \-closed is apparent from De Morgan's laws.

Definition 1.2 (Algebra). A class of sets A ⊆ 2Ω is called an algebra if

(i) Ω ∈ A,

(ii) A is \-closed, and

(iii) A is ∪-closed.

Definition 1.3 (σ-algebra). A class of sets A ⊆ 2Ω is called a σ-algebra if

(i) Ω ∈ A,

(ii) A is closed under complements, and

(iii) A is σ-∪-closed.

σ-algebras are also known as σ-fields. Note that any σ-algebra is an algebra (but the converse is not true).

Theorem 1.3. A class of sets A ⊆ 2Ω is an algebra if and only if

(a) Ω ∈ A,

(b) A is closed under complements, and

(c) A is ∩-closed.

The proof of the above is left as an exercise to the reader.

Definition 1.4 (Ring). A class of sets A ⊆ 2Ω is called a ring if

(i) ∅ ∈ A,

(ii) A is \-closed, and

(iii) A is ∪-closed.

Further, a ring is a σ-ring if it is σ-∪-closed.

Definition 1.5 (Semiring). A class of sets A ⊆ 2Ω is called a semiring if

(i) ∅ ∈ A,

(ii) for any A,B ∈ A, A \B is a finite union of mutually disjoint sets in A, and

(iii) A is ∩-closed.

Definition 1.6 (λ-system). A class of sets A ⊆ 2Ω is called a λ-system (or Dynkin’s λ-system) if

(i) Ω ∈ A,

(ii) for any A,B ∈ A with B ⊆ A, A \B ∈ A, and

(iii) ⊎_{i=1}^∞ Ai ∈ A for any choice of countably many pairwise disjoint sets A1, A2, . . . ∈ A.

Among the above classes of sets, σ-algebras in particular are extremely important as we shall use them when defining probability.

Theorem 1.4.

(a) Every σ-algebra is also a λ-system, an algebra and a σ-ring.


(b) Every σ-ring is a ring, and every ring is a semiring.

(c) Every algebra is a ring. An algebra on a finite set Ω is a σ-algebra.

Proof.

(a) Let A be a σ-algebra. Then for any A,B ∈ A, A \ B = (Ac ∪ B)c ∈ A and A ∩ B = (Ac ∪ Bc)c ∈ A, that is, A is \-closed and ∩-closed. The result follows.

(b) Let A be a ring. Then theorem 1.2 implies that A is ∩-closed. The result follows.

(c) Let A be an algebra. As A is \-closed and Ω ∈ A, we have ∅ = Ω \ Ω ∈ A and thus A is a ring. If Ω is finite, then A is finite. Thus any countable union of sets in A is in fact a finite union and the result follows.

Definition 1.7. Let A1, A2, . . . be subsets of Ω. Then

    lim inf_{n→∞} An := ⋃_{i=1}^∞ ⋂_{j=i}^∞ Aj and lim sup_{n→∞} An := ⋂_{i=1}^∞ ⋃_{j=i}^∞ Aj

are respectively called the limit inferior and limit superior of the sequence (An)n∈N.

The above may be rewritten as

    A_* := lim inf_{n→∞} An = {ω ∈ Ω : |{n ∈ N : ω ∉ An}| < ∞},

    A^* := lim sup_{n→∞} An = {ω ∈ Ω : |{n ∈ N : ω ∈ An}| = ∞}.

That is, A_* is the set of elements that fail to appear in only finitely many of the An, and A^* is the set of elements that appear in infinitely many of the An. This implies that A_* ⊆ A^*. (Why is the opposite inclusion not necessarily true?)
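As a toy illustration, the following Python sketch (ours) computes truncated versions of these sets for the alternating sequence An = {0} for even n and {0, 1} for odd n; the truncation at N terms only approximates the genuinely infinite definitions.

```python
# lim inf / lim sup of sets, truncated at N terms.
N = 1000
A = [{0} if n % 2 == 0 else {0, 1} for n in range(1, N + 1)]

liminf = set.union(*(set.intersection(*A[i:]) for i in range(N - 1)))  # union over i of the tail intersections
limsup = set.intersection(*(set.union(*A[i:]) for i in range(N - 1)))  # intersection over i of the tail unions

print(liminf, limsup)  # {0} {0, 1}: the element 1 lies in A^* but not in A_*
```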

Definition 1.8 (Indicator function). Let A be a subset of Ω. The indicator function on A is defined by

    1A(x) = 1 if x ∈ A, and 1A(x) = 0 if x ∉ A.

With the above notation, it may be shown that

    1_{A_*} = lim inf_{n→∞} 1_{An} and 1_{A^*} = lim sup_{n→∞} 1_{An}.

If A ⊆ 2Ω is a σ-algebra and if An ∈ A for every n ∈ N, then A_* ∈ A and A^* ∈ A. This is clear from the fact that σ-algebras are closed under countable unions and intersections. Proving the above statements is left as an exercise to the reader.

Theorem 1.5. Let I be some index set and Ai be a σ-algebra for each i ∈ I. Then the intersection A_I = ⋂_{i∈I} Ai is also a σ-algebra.

Proof. We can prove this by using the three conditions in the definition of a σ-algebra.

(i) Since Ω ∈ Ai for every i ∈ I, Ω ∈ AI .

(ii) Let A ∈ AI . Then A ∈ Ai for each i ∈ I and thus Ac ∈ Ai for each i ∈ I. Therefore, Ac ∈ AI .

(iii) Let A1, A2, . . . ∈ A_I. Then An ∈ Ai for each n ∈ N and i ∈ I. Thus A = ⋃_{n=1}^∞ An ∈ Ai for each i as well. The result follows.


A similar statement holds for λ-systems.

Theorem 1.6. Let E ⊆ 2Ω. Then there exists a smallest σ-algebra σ(E) with E ⊆ σ(E), namely

    σ(E) = ⋂ {A ⊆ 2Ω : A is a σ-algebra and E ⊆ A}.

σ(E) is called the σ-algebra generated by E and E is called a generator of σ(E).

Proof. 2Ω is a σ-algebra that contains E so the intersection is non-empty. By theorem 1.5, σ(E) is a σ-algebra.
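For a finite Ω, σ(E) can be computed by brute force: start from E together with ∅ and Ω, and repeatedly close under complements and pairwise unions (for finite Ω, countable unions reduce to finite ones). A minimal Python sketch of this idea (ours, assuming Ω finite):

```python
from itertools import combinations

def generated_sigma_algebra(omega: frozenset, gens) -> set:
    """Smallest sigma-algebra on a *finite* set omega containing every set in gens."""
    sigma = {frozenset(), omega} | {frozenset(g) for g in gens}
    while True:
        new = {omega - a for a in sigma} | {a | b for a, b in combinations(sigma, 2)}
        if new <= sigma:  # nothing new: sigma is closed under complements and unions
            return sigma
        sigma |= new

omega = frozenset({1, 2, 3, 4})
print(sorted(map(sorted, generated_sigma_algebra(omega, [{1}]))))
# [[], [1], [1, 2, 3, 4], [2, 3, 4]] -- the sigma-algebra generated by {{1}}
```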

Similar to the above, δ(E) is defined as the λ-system generated by E .

We always have the following:

1. E ⊆ σ(E).

2. If E1 ⊆ E2, then σ(E1) ⊆ σ(E2).

3. A is a σ-algebra if and only if σ(A) = A.

Similar statements hold for λ-systems. Further, δ(E) ⊆ σ(E). This is to be expected as σ-algebras have more "structure" than λ-systems.

Theorem 1.7 (∩-closed λ-system). Let D ⊆ 2Ω be a λ-system. Then D is a π-system if and only if D is a σ-algebra.

Proof. If D is a σ-algebra, then it is obviously a π-system. Let D be a π-system. Then

(a) As D is a λ-system, Ω ∈ D.

(b) Let A ∈ D. Since Ω ∈ D and D is a λ-system, Ac = Ω \A ∈ D.

(c) Let A,B ∈ D. We have A ∩B ∈ D. We now have A \B = A \ (A ∩B) ∈ D, that is, D is \-closed.

Let A1, A2, . . . ∈ D. Then by theorem 1.2, there exist pairwise disjoint B1, B2, . . . ∈ D such that

    ⋃_{i=1}^∞ Ai = ⊎_{i=1}^∞ Bi ∈ D.

This completes the proof.

Theorem 1.8 (Dynkin’s π-λ theorem). If E ⊆ 2Ω is a π-system, then δ(E) = σ(E).

Proof. We already have δ(E) ⊆ σ(E). We must now prove the reverse inclusion. We shall show that δ(E) is a π-system. For each E ∈ δ(E), let

    DE = {A ∈ δ(E) : A ∩ E ∈ δ(E)}.

To show that δ(E) is a π-system, it suffices to show that δ(E) ⊆ DE for all E ∈ δ(E). We shall first show that DE is a λ-system for each E ∈ E by checking each of the conditions in definition 1.6.

(a) We clearly have Ω ∈ DE as Ω ∩ E = E.

(b) For any A,B ∈ DE with A ⊆ B,

(B \A) ∩ E = (B ∩ E) \ (A ∩ E) ∈ δ(E).

(c) Let A1, A2, . . . ∈ DE be mutually disjoint sets. Then

    (⊎_{i=1}^∞ Ai) ∩ E = ⊎_{i=1}^∞ (Ai ∩ E) ∈ δ(E).


Now since DE is a λ-system and E ⊆ DE (Why?), we get δ(E) ⊆ DE for each E ∈ E; that is, A ∩ E ∈ δ(E) for all A ∈ δ(E) and E ∈ E. But then E ⊆ DA for each A ∈ δ(E), so by the same argument δ(E) ⊆ DA, which is precisely the statement that δ(E) is ∩-closed.
Now that we have shown that δ(E) is a π-system, the result follows by theorem 1.7.

Definition 1.9 (Topology). Let Ω ≠ ∅ be an arbitrary set. A class of sets τ ⊆ 2Ω is called a topology on Ω if

(i) ∅,Ω ∈ τ ,

(ii) τ is ∩-closed, and

(iii) for any F ⊆ τ, ⋃_{A∈F} A ∈ τ.

In the above case, the pair (Ω, τ) is called a topological space. The sets A ∈ τ are called open and the sets A ⊆ Ω with Ac ∈ τ are called closed.

Note that in contrast with σ-algebras, topologies are closed under only finite intersections but are also closed under arbitrary unions.

For example, consider the natural topology on R which consists of all open intervals in R and any arbitrary union of them.

Definition 1.10 (Borel σ-algebra). Let (Ω, τ) be a topological space. The σ-algebra

B(Ω) = B(Ω, τ) = σ(τ)

that is generated by the open sets is called the Borel σ-algebra on Ω. The elements A ∈ B(Ω, τ) are called Borel sets or Borel measurable sets.

A Borel σ-algebra that we shall often encounter is B(Rn) for n ∈ N. Consider the following classes of sets:

A1 = {A ⊆ Rn : A is open}
A2 = {A ⊆ Rn : A is closed}
A3 = {A ⊆ Rn : A is compact}
A4 = {(a, b) : a, b ∈ Qn and a < b}
A5 = {(a, b] : a, b ∈ Qn and a < b}
A6 = {[a, b) : a, b ∈ Qn and a < b}
A7 = {[a, b] : a, b ∈ Qn and a < b}
A8 = {(−∞, b) : b ∈ Qn}
A9 = {(−∞, b] : b ∈ Qn}
A10 = {(a,∞) : a ∈ Qn}
A11 = {[a,∞) : a ∈ Qn}

It may be proved that B(Rn) is generated by any of the classes of sets A1,A2, . . . ,A11.

For A ∈ B(R), we represent by B(R)|A the restriction of B(R) to A. It may be proved that this is equal to B(A), the σ-algebra generated by the open subsets of A.

1.2. Measure

Definition 1.11. Let A ⊆ 2Ω and let µ : A → [0,∞] be a set function. We say that µ is

(i) monotone if for any A,B ∈ A, A ⊆ B implies that µ(A) ≤ µ(B),

(ii) additive if for any choice of finitely many mutually disjoint sets A1, . . . , An ∈ A with ⊎_{i=1}^n Ai ∈ A,

    µ(⊎_{i=1}^n Ai) = ∑_{i=1}^n µ(Ai),


(iii) σ-additive if for any choice of countably many mutually disjoint sets A1, A2, . . . ∈ A with ⊎_{i=1}^∞ Ai ∈ A,

    µ(⊎_{i=1}^∞ Ai) = ∑_{i=1}^∞ µ(Ai),

(iv) subadditive if for any choice of finitely many sets A, A1, . . . , An ∈ A with A ⊆ ⋃_{i=1}^n Ai, we have

    µ(A) ≤ ∑_{i=1}^n µ(Ai), and

(v) σ-subadditive if for any choice of countably many sets A, A1, A2, . . . ∈ A with A ⊆ ⋃_{i=1}^∞ Ai, we have

    µ(A) ≤ ∑_{i=1}^∞ µ(Ai).

Definition 1.12. Let A be a semiring and µ : A → [0,∞] be a set function with µ(∅) = 0. µ is called a

(i) content if µ is additive,

(ii) premeasure if µ is σ-additive, and

(iii) measure if µ is σ-additive and A is a σ-algebra.

Theorem 1.9 (Properties of contents). Let A be a semiring and µ be a content on A. Then

(a) If A is a ring, then µ(A ∪B) + µ(A ∩B) = µ(A) + µ(B) for any A,B ∈ A.

(b) µ is monotone. If A is a ring, then µ(B) = µ(A) + µ(B \A) for any A,B ∈ A with A ⊆ B.

(c) µ is subadditive. If µ is σ-additive, then it is also σ-subadditive.

(d) If A is a ring, then

    ∑_{n=1}^∞ µ(An) ≤ µ(⋃_{n=1}^∞ An)

for any choice of countably many mutually disjoint sets A1, A2, . . . ∈ A with ⋃_{i=1}^∞ Ai ∈ A.

Proof.

(a) Note that A ∪ B = A ⊎ (B \ A) and B = (A ∩ B) ⊎ (B \ A). As µ is additive,

    µ(A ∪ B) = µ(A) + µ(B \ A) and µ(B) = µ(A ∩ B) + µ(B \ A).

The result follows.

(b) Let A ⊆ B. If B \ A ∈ A (which is true in the case of a ring), we have B = A ⊎ (B \ A) and thus

    µ(B) = µ(A) + µ(B \ A).

If A is just a semiring, then there exist n ∈ N and mutually disjoint sets C1, C2, . . . , Cn ∈ A such that

    B \ A = ⊎_{i=1}^n Ci,

so that µ(B) = µ(A) + ∑_{i=1}^n µ(Ci). In either case, we have µ(A) ≤ µ(B).


(c) Let A, A1, A2, . . . , An ∈ A such that A ⊆ ⋃_{i=1}^n Ai. Let B1 = A1 and for each k = 2, 3, . . . , n, let

    Bk = Ak \ (⋃_{i=1}^{k−1} Ai).

Note that any two Bi are disjoint. As µ is additive and monotone, we have

    µ(A) ≤ µ(⋃_{i=1}^n Ai) = µ(⊎_{i=1}^n Bi) = ∑_{i=1}^n µ(Bi) ≤ ∑_{i=1}^n µ(Ai).

We can similarly prove that if µ is σ-additive, then it is σ-subadditive.

(d) Let A = ⋃_{i=1}^∞ Ai ∈ A. Since µ is additive and monotone,

    ∑_{i=1}^m µ(Ai) = µ(⊎_{i=1}^m Ai) ≤ µ(A) for any m ∈ N.

The result follows.

Note that if equality holds in the fourth part of the above theorem, µ is a premeasure.

Definition 1.13 (Finite content). Let A be a semiring. A content µ on A is called

(i) finite if µ(A) <∞ for all A ∈ A and

(ii) σ-finite if there exists a sequence of sets Ω1, Ω2, . . . ∈ A such that Ω = ⋃_{i=1}^∞ Ωi and µ(Ωi) < ∞ for every i ∈ N.

Definition 1.14. Let A,A1, A2, . . . be sets. We write

(i) An ↑ A if A1 ⊆ A2 ⊆ A3 ⊆ · · · and ⋃_{i=1}^∞ Ai = A. In this case, we say that An increases to A.

(ii) An ↓ A if A1 ⊇ A2 ⊇ A3 ⊇ · · · and ⋂_{i=1}^∞ Ai = A. In this case, we say that An decreases to A.

For example, if An = (−1/n, 1/n) for n ∈ N, then An ↓ {0}.

Definition 1.15 (Continuity of contents). Let µ be a content on the ring A. µ is called

(i) lower semicontinuous if limn→∞ µ(An) = µ(A) for any A ∈ A and sequence (An)n∈N such that An ↑ A,

(ii) upper semicontinuous if lim_{n→∞} µ(An) = µ(A) for any A ∈ A and sequence (An)n∈N such that An ↓ A and µ(An) < ∞ for some n (and hence for all larger n),

(iii) ∅-continuous if (ii) holds for A = ∅.

Theorem 1.10. Let µ be a content on the ring A. Consider the following properties:

(a) µ is σ-additive (and hence a premeasure).

(b) µ is σ-subadditive.

(c) µ is lower semicontinuous.

(d) µ is ∅-continuous.


(e) µ is upper semicontinuous.

Then (a)⇐⇒ (b)⇐⇒ (c) =⇒ (d)⇐⇒ (e). If µ is finite, then all five statements are equivalent.

Proof.

• (a) =⇒ (b) (σ-additivity implies σ-subadditivity).

This follows from theorem 1.9(c).

• (b) =⇒ (a) (σ-subadditivity implies σ-additivity).

This follows from theorem 1.9(d).

• (a) =⇒ (c) (σ-additivity implies lower semicontinuity).

Let µ be a premeasure and A ∈ A. Let A1, A2, . . . ∈ A such that An ↑ A and let A0 = ∅. Then

    µ(A) = ∑_{i=1}^∞ µ(Ai \ Ai−1) = lim_{n→∞} ∑_{i=1}^n µ(Ai \ Ai−1) = lim_{n→∞} µ(An).

• (c) =⇒ (a) (lower semicontinuity implies σ-additivity).

Let B1, B2, . . . ∈ A be mutually disjoint and let B = ⊎_{n=1}^∞ Bn ∈ A. Let An = ⊎_{i=1}^n Bi for each n ∈ N. Then An ↑ B, so

    µ(B) = lim_{n→∞} µ(An) = lim_{n→∞} ∑_{i=1}^n µ(Bi) = ∑_{i=1}^∞ µ(Bi).

Thus µ is σ-additive.

• (d) =⇒ (e) (∅-continuity implies upper semicontinuity).

Let A, A1, A2, . . . ∈ A with An ↓ A and µ(A1) < ∞. Define Bn = An \ A ∈ A for each n ∈ N. Then Bn ↓ ∅. Thus

    lim_{n→∞} (µ(An) − µ(A)) = lim_{n→∞} µ(Bn) = 0

and the result is proved.

• (e) =⇒ (d) (upper semicontinuity implies ∅-continuity).

This is obvious.

• (c) =⇒ (d) (lower semicontinuity implies ∅-continuity).

Let A1, A2, . . . ∈ A with An ↓ ∅ and µ(A1) < ∞. Then A1 \ An ∈ A for all n ∈ N and A1 \ An ↑ A1. Thus

    µ(A1) = lim_{n→∞} (µ(A1) − µ(An)).

Since µ(A1) < ∞, lim_{n→∞} µ(An) = 0 and the result is proved.

• (d) =⇒ (c) (∅-continuity implies lower semicontinuity) if µ is finite.

Let A, A1, A2, . . . ∈ A with An ↑ A. Then A \ An ↓ ∅ and

    lim_{n→∞} (µ(A) − µ(An)) = lim_{n→∞} µ(A \ An) = 0.

The result follows.

Definition 1.16 (Measurable spaces).

(i) A pair (Ω,A) consisting of a nonempty set Ω and a σ-algebra A ⊆ 2Ω is called a measurable space. The sets A ∈ A are called measurable sets. If Ω is countable and A = 2Ω, then the space (Ω, 2Ω) is called discrete.

(ii) A triple (Ω,A, µ) is called a measure space if (Ω,A) is a measurable space and µ is a measure on A.


1.3. Measurable Maps

In measure theory, the homomorphisms (structure-preserving maps between objects) are studied as measurable maps.

Definition 1.17 (Measurable map). Let (Ω,A) and (Ω′,A′) be measurable spaces. A map X : Ω → Ω′ is called A−A′-measurable (or just measurable) if

    X−1(A′) ∈ A for any A′ ∈ A′.

In this case, we write X : (Ω,A) → (Ω′,A′).

Theorem 1.11 (Generated σ-algebra). Let (Ω′,A′) be a measurable space and Ω be a nonempty set. Let X : Ω → Ω′ be a map. Then

    X−1(A′) := {X−1(A′) : A′ ∈ A′}

is the smallest σ-algebra with respect to which X is measurable. We call X−1(A′) the σ-algebra generated by X and denote it as σ(X).

Proof. Let X be measurable with respect to some σ-algebra A. Then X−1(A′) ∈ A for any A′ ∈ A′, that is, σ(X) ⊆ A. Let us now prove that σ(X) is a σ-algebra by checking each of the axioms in definition 1.3.

1. As Ω′ ∈ A′ and X−1(Ω′) = Ω, Ω ∈ σ(X).

2. Let A ∈ σ(X) and A′ ∈ A′ such that X−1(A′) = A. Then as A′ is closed under complements,

Ω \A = X−1(Ω′) \X−1(A′) = X−1(Ω′ \A′) ∈ σ(X).

Therefore, σ(X) is closed under complements.

3. Let A1, A2, . . . ∈ σ(X) and A′1, A′2, . . . ∈ A′ such that Ai = X−1(A′i) for each i ∈ N. Then as A′ is σ-∪-closed,

    ⋃_{i∈N} Ai = ⋃_{i∈N} X−1(A′i) = X−1(⋃_{i∈N} A′i) ∈ σ(X).

Therefore, σ(X) is a σ-algebra.

Theorem 1.12. Let (Ω,A) and (Ω′,A′) be measurable spaces and X : Ω → Ω′ be a map. Let E′ ⊆ A′ be a class of sets. Then σ(X−1(E′)) = X−1(σ(E′)).

Proof. We have X−1(E′) ⊆ X−1(σ(E′)) = σ(X−1(σ(E′))), where the equality holds as X−1(σ(E′)) is a σ-algebra by theorem 1.11. This implies that

    σ(X−1(E′)) ⊆ X−1(σ(E′)).

To establish the reverse inclusion, consider

    A′0 = {A′ ∈ σ(E′) : X−1(A′) ∈ σ(X−1(E′))}.

We shall show that A′0 is a σ-algebra.

(a) Clearly, Ω′ ∈ A′0 as Ω ∈ σ(X−1(E ′)) and Ω′ ∈ σ(E ′).

(b) Let A′ ∈ A′0. Then

    X−1((A′)c) = (X−1(A′))c ∈ σ(X−1(E′)),

and thus A′0 is closed under complements.

(c) Let A′1, A′2, . . . ∈ A′0. Then

    X−1(⋃_{i=1}^∞ A′i) = ⋃_{i=1}^∞ X−1(A′i) ∈ σ(X−1(E′)).

Thus, A′0 is σ-∪-closed.


Now, note that E′ ⊆ A′0, and so σ(E′) ⊆ A′0 as A′0 is a σ-algebra; together with A′0 ⊆ σ(E′), this implies that A′0 = σ(E′), and thus X−1(σ(E′)) ⊆ σ(X−1(E′)). This proves the result.

Corollary 1.13. Let (Ω,A) and (Ω′,A′) be measurable spaces and X : Ω → Ω′ be a map. Let E′ ⊆ A′ be a class of sets. Then X is A−σ(E′)-measurable if and only if X−1(E′) ⊆ A. In particular, if σ(E′) = A′, then X is A−A′-measurable if and only if X−1(E′) ⊆ A.

Theorem 1.14. Let (Ω,A), (Ω′,A′), and (Ω′′,A′′) be measurable spaces and let X : Ω → Ω′ and X′ : Ω′ → Ω′′ be measurable. Then Y = X′ ∘ X : Ω → Ω′′ is A−A′′-measurable.

Proof. This is due to the fact that

    Y−1(A′′) = X−1((X′)−1(A′′)) ⊆ X−1(A′) ⊆ A.

The above theorem just states that the composition of two measurable maps is measurable.

Theorem 1.15 (Measurability of Continuous Maps). Let (Ω, τ) and (Ω′, τ ′) be topological spaces and let f : Ω→ Ω′

be a continuous map. Then f is B(Ω)− B(Ω′)-measurable.

Proof. As B(Ω′) = σ(τ′), by corollary 1.13 it is enough to show that f−1(A′) ∈ σ(τ) for all A′ ∈ τ′. However, since f is continuous, f−1(A′) ∈ τ for all A′ ∈ τ′, so the result follows.

Theorem 1.16. Let X1, X2, . . . be measurable maps (Ω,A) → (R,B(R)). Then inf_{n∈N} Xn, sup_{n∈N} Xn, lim inf_{n→∞} Xn, and lim sup_{n→∞} Xn are also measurable.

Proof. For any x ∈ R,

    (inf_{n∈N} Xn)−1([−∞, x)) = ⋃_{n=1}^∞ (Xn)−1([−∞, x)) ∈ A.

The first part of the result follows by corollary 1.13. The proof for sup_{n∈N} Xn is similar.
For lim sup_{n→∞} Xn, consider the sequence (Yn)n∈N where Yn = sup_{m≥n} Xm. Each Yn is measurable. Then since lim sup_{n→∞} Xn = inf_{n∈N} Yn is measurable, the result follows.

Definition 1.18 (Simple Function). Let (Ω,A) be a measurable space. A map f : Ω → R is called simple if there exists some n ∈ N, mutually disjoint sets A1, . . . , An ∈ A, and α1, . . . , αn ∈ R such that

    f = ∑_{i=1}^n αi 1Ai.

Definition 1.19. Let f, f1, f2, . . . be maps Ω → R such that

    f1(ω) ≤ f2(ω) ≤ · · · and lim_{n→∞} fn(ω) = f(ω) for all ω ∈ Ω.

We then write fn ↑ f. Similarly, we write fn ↓ f if (−fn) ↑ (−f).

Theorem 1.17. Let (Ω,A) be a measurable space and let f : Ω→ [0,∞] be measurable. Then

(a) There exists a sequence (fn)n∈N of non-negative simple functions such that fn ↑ f .

(b) There are measurable sets A1, A2, . . . ∈ A and α1, α2, . . . ∈ [0,∞) such that f = ∑_{i=1}^∞ αi 1Ai.

Proof.

(a) For n ∈ N0, define

    fn = min{n, 2^{−n} ⌊2^n f⌋}.

Each fn is measurable. (Why?) Since it can take at most n·2^n + 1 distinct values, each fn is simple. Clearly, fn ↑ f.


(b) Let fn be the same as above. For n ∈ N and i ∈ [2^n], define

    Bn,i = {ω : fn(ω) − fn−1(ω) = i 2^{−n}} and βn,i = i 2^{−n}.

Then fn − fn−1 = ∑_{i=1}^{2^n} βn,i 1Bn,i. Changing the enumeration from (n, i) to m, we get some (αm)m∈N and (Am)m∈N such that

    f = f0 + ∑_{n=1}^∞ (fn − fn−1) = ∑_{m=1}^∞ αm 1Am.
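The dyadic truncation in part (a) is easy to see numerically. A small Python sketch (ours) of fn = min{n, 2^{−n}⌊2^n f⌋} for the function f(x) = x², evaluated at a single point:

```python
import math

def f(x):
    return x * x

def fn(x, n):
    # f_n = min(n, 2^{-n} * floor(2^n * f(x))): simple (finitely many values),
    # non-negative, and increasing pointwise to f.
    return min(n, math.floor(2**n * f(x)) / 2**n)

x = 1.3  # f(x) = 1.69
for n in (1, 2, 4, 8, 16):
    print(n, fn(x, n))  # 1, 1.5, 1.6875, 1.6875, 1.6899... -> increases to 1.69
```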

1.4. Outer Measure

Lemma 1.18. Let (Ω,A, µ) be a σ-finite measure space and E ⊆ A be a π-system that generates A. Assume there exists a sequence Ω1, Ω2, . . . ∈ E such that ⋃_{i=1}^∞ Ωi = Ω and µ(Ωi) < ∞ for all i ∈ N. Then µ is uniquely determined by the values µ(E), E ∈ E.
If µ(Ω) = 1 (that is, if µ is a probability measure), then the existence of the sequence (Ωn)n∈N is not required.

Proof. Let ν be a σ-finite measure on (Ω,A) such that µ(E) = ν(E) for all E ∈ E .

Let E ∈ E with µ(E) <∞. Consider

    DE = {A ∈ A : µ(A ∩ E) = ν(A ∩ E)}.

We claim that DE is a λ-system. We shall prove this by checking each of the conditions of definition 1.6.

(a) Clearly, Ω ∈ DE .

(b) Let A, B ∈ DE with B ⊆ A. Then

    µ((A \ B) ∩ E) = µ(A ∩ E) − µ(B ∩ E) (using theorem 1.9)
                   = ν(A ∩ E) − ν(B ∩ E)
                   = ν((A \ B) ∩ E).

That is, (A \ B) ∈ DE.

(c) Let A1, A2, . . . ∈ DE be mutually disjoint sets. Then

    µ((⊎_{i=1}^∞ Ai) ∩ E) = ∑_{i=1}^∞ µ(Ai ∩ E) = ∑_{i=1}^∞ ν(Ai ∩ E) = ν((⊎_{i=1}^∞ Ai) ∩ E).

Therefore, ⊎_{i=1}^∞ Ai ∈ DE and DE is a λ-system.

As E ⊆ DE (Why?), δ(E) ⊆ DE. Since E is a π-system, theorem 1.8 implies that

    A ⊇ DE ⊇ δ(E) = σ(E) = A.

Hence DE = A. Therefore, µ(A ∩ E) = ν(A ∩ E) for any A ∈ A and E ∈ E with µ(E) < ∞.
Now, let Ω1, Ω2, . . . ∈ E be a sequence such that ⋃_{i=1}^∞ Ωi = Ω and µ(Ωi) < ∞ for all i ∈ N. Let E0 = ∅ and En = ⋃_{i=1}^n Ωi for each n ∈ N. Note that

    En = ⊎_{i=1}^n (Ec_{i−1} ∩ Ωi).


Therefore for any A ∈ A and n ∈ N,

    µ(A ∩ En) = ∑_{i=1}^n µ((A ∩ Ec_{i−1}) ∩ Ωi) = ∑_{i=1}^n ν((A ∩ Ec_{i−1}) ∩ Ωi) = ν(A ∩ En).

Now, since En ↑ Ω (so that A ∩ En ↑ A) and µ, ν are lower semicontinuous (by theorem 1.10),

    µ(A) = lim_{n→∞} µ(A ∩ En) = lim_{n→∞} ν(A ∩ En) = ν(A).

This proves the result.

The second part of the lemma is trivial as E ∪ {Ω} is a π-system that generates A. Hence one can choose the constant sequence En = Ω, n ∈ N.

Definition 1.20 (Outer Measure). A function µ∗ : 2Ω → [0,∞] is called an outer measure if

(i) µ∗(∅) = 0,

(ii) µ∗ is monotone, and

(iii) µ∗ is σ-subadditive.

Lemma 1.19. Let A ⊆ 2Ω be an arbitrary class of sets with ∅ ∈ A and let µ be a nonnegative set function on A with µ(∅) = 0. For A ⊆ Ω, define the set of countable coverings of A with sets F ∈ A,

    U(A) = {F ⊆ A : F is countable and A ⊆ ⋃_{F∈F} F}.

Define

    µ∗(A) = inf{∑_{F∈F} µ(F) : F ∈ U(A)},

where inf ∅ = ∞. Then µ∗ is an outer measure. If µ is σ-subadditive, then µ∗(A) = µ(A) for all A ∈ A.

Proof. Let us check each of the three conditions in the definition of an outer measure.

(a) Since ∅ ∈ A, we have {∅} ∈ U(∅) and hence µ∗(∅) = 0.

(b) If A ⊆ B, then every covering of B also covers A, so U(B) ⊆ U(A), and hence µ∗(A) ≤ µ∗(B).

(c) Let A, A1, A2, . . . ⊆ Ω such that A ⊆ ⋃_{i=1}^∞ Ai. We claim that µ∗(A) ≤ ∑_{i=1}^∞ µ∗(Ai).

Without loss of generality, assume that µ∗(Ai) < ∞ and hence U(Ai) ≠ ∅ for all i ∈ N. Fix some ε > 0. Now, for every n ∈ N, we may choose a covering Fn ∈ U(An) such that

    ∑_{F∈Fn} µ(F) ≤ µ∗(An) + ε2^{−n}.

Then F = ⋃_{n=1}^∞ Fn ∈ U(A) and

    µ∗(A) ≤ ∑_{F∈F} µ(F) ≤ ∑_{n=1}^∞ ∑_{F∈Fn} µ(F) ≤ ∑_{n=1}^∞ µ∗(An) + ε.

As ε > 0 was arbitrary, this proves the first part of the result.


To prove the next part of the result, first note that since {A} ∈ U(A), we have µ∗(A) ≤ µ(A). If µ is σ-subadditive, then for any F ∈ U(A),

    ∑_{F∈F} µ(F) ≥ µ(A).

It follows that µ∗(A) ≥ µ(A).

Definition 1.21 (µ∗-measurable sets). Let µ∗ be an outer measure. A set A ∈ 2Ω is called µ∗-measurable if

µ∗(A ∩ E) + µ∗(Ac ∩ E) = µ∗(E) for any E ∈ 2Ω.

We write M(µ∗) = {A ⊆ Ω : A is µ∗-measurable}.

Lemma 1.20. A ∈M(µ∗) if and only if

µ∗(A ∩ E) + µ∗(Ac ∩ E) ≤ µ∗(E) for any E ∈ 2Ω.

Proof. As µ∗ is subadditive, we trivially have

µ∗(A ∩ E) + µ∗(Ac ∩ E) ≥ µ∗(E) for any E ∈ 2Ω.

The result follows.

Lemma 1.21. M(µ∗) is an algebra.

Proof. We shall check the conditions given in the definition of an algebra (definition 1.2).

(a) We clearly have Ω ∈M(µ∗).

(b) By definition, M(µ∗) is closed under complements.

(c) We must check that M(µ∗) is closed under intersections. Let A, B ∈ M(µ∗) and E ⊆ Ω. Then

    µ∗((A ∩ B) ∩ E) + µ∗((A ∩ B)c ∩ E)
        = µ∗((A ∩ B) ∩ E) + µ∗((A ∩ Bc ∩ E) ∪ (Ac ∩ B ∩ E) ∪ (Ac ∩ Bc ∩ E))
        ≤ µ∗(A ∩ (B ∩ E)) + µ∗(A ∩ (Bc ∩ E)) + µ∗(Ac ∩ (B ∩ E)) + µ∗(Ac ∩ (Bc ∩ E))
        = µ∗(B ∩ E) + µ∗(Bc ∩ E)    (since A ∈ M(µ∗))
        = µ∗(E).    (since B ∈ M(µ∗))

This proves the result.

Lemma 1.22. An outer measure µ∗ is σ-additive on M(µ∗).

Proof. Let A, B ∈ M(µ∗) with A ∩ B = ∅. Then

    µ∗(A ∪ B) = µ∗(A ∩ (A ∪ B)) + µ∗(Ac ∩ (A ∪ B)) = µ∗(A) + µ∗(B).

That is, µ∗ is additive (and thus a content) on M(µ∗). Since µ∗ is σ-subadditive, theorem 1.10 gives the required result.

Lemma 1.23. If µ∗ is an outer measure, M(µ∗) is a σ-algebra.


Proof. We have already shown that M(µ∗) is an algebra (and thus a π-system). Using theorem 1.7, it is sufficient to show that M(µ∗) is a λ-system.
Let A1, A2, . . . ∈ M(µ∗) be mutually disjoint sets and let A = ⊎_{i=1}^∞ Ai. Further, for each n ∈ N, let Bn = ⊎_{i=1}^n Ai. We must show that A ∈ M(µ∗).
For any E ⊆ Ω and n ∈ N, we have

    µ∗(E ∩ Bn+1) = µ∗((E ∩ Bn+1) ∩ Bn) + µ∗((E ∩ Bn+1) ∩ Bcn)
                 = µ∗(E ∩ Bn) + µ∗(E ∩ An+1).

By a simple induction, it follows that

    µ∗(E ∩ Bn) = ∑_{i=1}^n µ∗(E ∩ Ai).

Since µ∗ is monotone, we have

    µ∗(E) = µ∗(E ∩ Bn) + µ∗(E ∩ Bcn)
          ≥ µ∗(E ∩ Bn) + µ∗(E ∩ Ac)
          = ∑_{i=1}^n µ∗(E ∩ Ai) + µ∗(E ∩ Ac).

Letting n → ∞ and using the fact that µ∗ is σ-subadditive, we have

    µ∗(E) ≥ ∑_{i=1}^∞ µ∗(E ∩ Ai) + µ∗(E ∩ Ac) ≥ µ∗(E ∩ A) + µ∗(E ∩ Ac).

Therefore, A ∈ M(µ∗) and this completes the proof.

1.5. The Approximation and Extension Theorems

Theorem 1.24 (Approximation Theorem for Measures). Let A ⊆ 2Ω be a semiring and let µ be a measure on σ(A) that is σ-finite on A. For any A ∈ σ(A) with µ(A) < ∞ and any ε > 0, there exist n ∈ N and mutually disjoint sets A1, A2, . . . , An ∈ A such that µ(A △ ⋃_{i=1}^n Ai) < ε.

Proof. Consider the outer measure µ∗ as defined in lemma 1.19. Note that by lemma 1.19 and lemma 1.18, µ and µ∗ are equal on σ(A). By the definition of µ∗, there exists a covering B1, B2, . . . ∈ A of A such that

    µ(A) ≥ ∑_{i=1}^∞ µ(Bi) − ε/2.

Since µ(A) < ∞, there exists some n ∈ N such that ∑_{i=n+1}^∞ µ(Bi) < ε/2. Now, let D = ⋃_{i=1}^n Bi and E = ⋃_{i=n+1}^∞ Bi.

We have

    A △ D = (D \ A) ∪ (A \ D)
          ⊆ (D \ A) ∪ (A \ (D ∪ E)) ∪ E
          ⊆ (A △ (D ∪ E)) ∪ E.

This together with the fact that A ⊆ ⋃_{i=1}^∞ Bi implies that

    µ(A △ D) ≤ µ(A △ (D ∪ E)) + µ(E)
             ≤ µ(⋃_{i=1}^∞ Bi) − µ(A) + ε/2
             ≤ ε.

Now define A1 = B1 and, for each i ≥ 2, Ai = Bi \ ⋃_{j=1}^{i−1} Bj. By definition, A1, A2, . . . are mutually disjoint. This proves the result.


The following theorem allows us to "extend" measures from a semiring to the σ-algebra generated by it. This allows us to define measures over an entire σ-algebra by defining its values over just a semiring that generates it.

Theorem 1.25 (Measure Extension Theorem). Let A be a semiring and let µ : A → [0,∞] be an additive, σ-subadditive and σ-finite set function with µ(∅) = 0. Then there is a unique σ-finite measure µ̃ : σ(A) → [0,∞] such that µ̃(A) = µ(A) for all A ∈ A.

Proof. Since A is a π-system, if such a µ̃ exists, it is uniquely defined due to lemma 1.18.
We shall explicitly construct a function that satisfies the given conditions. In order to do so, define as in lemma 1.19

    µ∗(A) = inf{∑_{F∈F} µ(F) : F ∈ U(A)} for any A ⊆ Ω.

By lemma 1.19, µ∗ is an outer measure and µ∗(A) = µ(A) for any A ∈ A.

We first claim that A ⊆ M(µ∗).
To prove this, let A ∈ A and E ⊆ Ω with µ∗(E) < ∞. Fix some ε > 0. Then by the definition of µ∗, there exists a sequence E1, E2, . . . ∈ A such that

    E ⊆ ⋃_{i=1}^∞ Ei and ∑_{i=1}^∞ µ(Ei) ≤ µ∗(E) + ε.

For each n, define Bn = En ∩ A. Since A is a semiring, there exist for each n some mn ∈ N and mutually disjoint sets Cn,1, Cn,2, . . . , Cn,mn ∈ A such that

    En \ A = En \ Bn = ⊎_{i=1}^{mn} Cn,i.

Then we have that

    E ∩ A ⊆ ⋃_{n=1}^∞ Bn,
    E ∩ Ac ⊆ ⋃_{n=1}^∞ ⊎_{i=1}^{mn} Cn,i, and
    En = Bn ⊎ ⊎_{i=1}^{mn} Cn,i.

This implies that

    µ∗(E ∩ A) + µ∗(E ∩ Ac) ≤ ∑_{n=1}^∞ µ(Bn) + ∑_{n=1}^∞ ∑_{i=1}^{mn} µ(Cn,i) (by the definition of µ∗ as an infimum over coverings)
                            = ∑_{n=1}^∞ (µ(Bn) + ∑_{i=1}^{mn} µ(Cn,i))
                            = ∑_{n=1}^∞ µ(En) (since µ is additive)
                            ≤ µ∗(E) + ε.

Lemma 1.20 implies that A ∈ M(µ∗), that is, A ⊆ M(µ∗). This in turn implies that σ(A) ⊆ M(µ∗). Define the required function by µ̃ : σ(A) → [0,∞], A ↦ µ∗(A). By lemma 1.22, µ̃ is σ-additive. Since µ is σ-finite, µ̃ is σ-finite as well. This proves the result.


1.6. Important Examples of Measures

Now that we have the Measure Extension Theorem, we may introduce the Lebesgue-Stieltjes measure, a very useful measure on (R,B(R)), which is given as follows.

Definition 1.22 (Lebesgue-Stieltjes Measure). Let F : R → R be monotone increasing and right continuous. The measure µF on (R,B(R)) defined by

µF ((a, b]) = F (b)− F (a) for all a, b ∈ R such that a < b

is called the Lebesgue-Stieltjes measure with distribution function F .

The Lebesgue-Stieltjes measure is well-defined due to the Measure Extension Theorem (theorem 1.25).

To see this more clearly, let A = {(a, b] : a, b ∈ R and a ≤ b}. It may be checked that A is a semiring. Further, σ(A) = B(R). Now, define the function µF : A → [0,∞) by (a, b] ↦ F(b) − F(a). Clearly µF(∅) = 0 and the function is additive. It remains to check that µF is σ-subadditive.

Let (a, b], (a1, b1], (a2, b2], . . . ∈ A such that (a, b] ⊆ ⋃_{i=1}^∞ (ai, bi]. Fix some ε > 0 and choose aε ∈ (a, b) such that

    F(aε) − F(a) < ε/2, i.e. µF((a, b]) − µF((aε, b]) < ε/2.

It is possible to choose such an aε due to the right continuity of F. Also, for each k ∈ N, choose bk,ε > bk such that

    F(bk,ε) − F(bk) < ε2^{−k−1}, i.e. µF((ak, bk,ε]) − µF((ak, bk]) < ε2^{−k−1}.

We now have

[aε, b ] ⊆ (a, b] ⊆∞⋃i=1

(ak, bk] ⊆∞⋃k=1

(ak, bk,ε]

Due to the compactness of [aε, b], there then exists some k0 ∈ N such that

(aε, b ] ⊆k0⋃k=1

(ak, bk,ε].

This implies that

    µF((a, b]) ≤ ε/2 + µF((aε, b])
              ≤ ε/2 + ∑_{k=1}^{k0} µF((ak, bk,ε])
              ≤ ε/2 + ∑_{k=1}^{k0} (µF((ak, bk]) + ε2^{−k−1})
              ≤ ε + ∑_{k=1}^∞ µF((ak, bk]).

As this is true for any choice of ε, µF is σ-subadditive.

Then the extension of µF uniquely to a σ-finite measure is guaranteed by theorem 1.25. This measure is known asthe Lebesgue-Stieltjes measure.
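As a concrete illustration (ours, not from the notes), once F is fixed, the measure of a finite disjoint union of half-open intervals can be computed directly from µF((a, b]) = F(b) − F(a); here we take F to be the distribution function of exp1 purely as an example:

```python
import math

def F(x: float) -> float:
    """A monotone increasing, right continuous function: the exp_1 distribution function."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def mu_F(intervals):
    """Lebesgue-Stieltjes measure of a finite disjoint union of half-open intervals (a, b]."""
    return sum(F(b) - F(a) for a, b in intervals)

print(mu_F([(0.0, 1.0)]))              # 1 - e^{-1} ≈ 0.632
print(mu_F([(0.0, 1.0), (2.0, 3.0)]))  # additivity over disjoint intervals
```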

The measure that results when F is the identity function is referred to as the Lebesgue measure on R. Similar to this, we can define the Lebesgue measure in general as follows.

Definition 1.23 (Lebesgue Measure). There exists a unique measure λn on (Rn,B(Rn)) such that for all a, b ∈ Rn with a < b,

    λn((a, b]) = ∏_{i=1}^n (bi − ai).

λn is called the Lebesgue measure on (Rn,B(Rn)) or the Lebesgue-Borel measure.


Let E be a finite nonempty set and Ω = E^N. If ω1, ω2, . . . , ωn ∈ E, we define

    [ω1, ω2, . . . , ωn] = {ω′ ∈ Ω : ω′i = ωi for i ∈ [n]}.

This represents the set of all sequences whose first n elements are ω1, ω2, . . . , ωn.

Theorem 1.26 (Finite Products of Measures). Let n ∈ N and µ1, µ2, . . . , µn be Lebesgue-Stieltjes measures on (R,B(R)). Then there exists a unique σ-finite measure µ on (Rn,B(Rn)) such that for all a, b ∈ Rn with a < b,

    µ((a, b]) = ∏_{i=1}^n µi((ai, bi]).

We call µ the product measure of µ1, µ2, . . . , µn and denote it by ⊗_{i=1}^n µi.

The proof of the above is similar to the argument for the Lebesgue-Stieltjes measure above: we choose intervals (a, bε] and so on such that µ((a, bε]) < µ((a, b]) + ε. Such a bε exists due to the right continuity of each of the distribution functions Fi corresponding to the µi.


§2. Introduction to Probability

2.1. Basic Definitions

Definition 2.1 (Probability Space). Let (Ω,A, µ) be a measure space. If µ(Ω) = 1, then (Ω,A, µ) is called a probability space and µ is called a probability measure.

In the above definition, Ω is called the sample space, A is called the event space (and its elements are called events), and µ is called the probability function.

While the above definition may appear to be completely unrelated to the intuitive notion of probability we have, the following example should hopefully make the meanings clear.

Consider a coin toss. The sample space Ω has two elements, H (for heads) and T (for tails). The event space A then has four elements: ∅, {H}, {T}, and {H,T}. These events have associated probabilities 0, 1/2, 1/2, and 1 respectively. Note that {H,T} represents the event that either a heads or a tails occurs.

The requirement that the event space be a σ-algebra is quite natural as well. The event Ω corresponds to saying that something happens, which must occur with certainty. Closedness under complements means that if we have an event A, we should have another event that corresponds to A not occurring. Finally, σ-∪-closedness corresponds to the occurrence of at least one of the events we are taking the union of.

• Uniform distribution. Let Ω be a finite nonempty set. Consider the function µ : 2Ω → [0, 1] given by

    µ(A) = |A| / |Ω| for each A ⊆ Ω.

This defines a probability measure on 2Ω. This function µ is called the uniform distribution on Ω and is denoted UΩ. As the reader might expect, it represents the case where any element of Ω is equally likely to occur. The resulting probability space (Ω, 2Ω, UΩ) is called a Laplace space.

• Dirac measure. Let ω ∈ Ω and δω(A) = 1A(ω). Then δω is a probability measure on any σ-algebra A ⊆ 2Ω. δω is called the Dirac measure for the point ω.

The Dirac measure is useful in constructing discrete probability distributions.

Let Ω be a countable non-empty set and A = 2Ω. Further let (pω)ω∈Ω be non-negative numbers. The map given by A ↦ µ(A) = ∑_{ω∈A} pω defines a σ-finite measure on 2Ω. p = (pω)ω∈Ω is called the weight function of µ, and pω is called the weight of µ at the point ω.
In the case where ∑_{ω∈Ω} pω = 1, µ is a probability measure. Then the vector (pω)ω∈Ω is called a probability vector.

Given any probability space, we typically use Pr or P to denote the probability measure, and probabilities Pr[·] or P[·] are always written with (square) brackets.

Definition 2.2 (Probability Distribution Function). A right continuous monotonically increasing function F : R→[0, 1] such that limx→−∞ F (x) = 0 and limx→∞ F (x) = 1 is called a (proper) probability distribution function, oftenabbreviated as p.d.f. If we instead have limx→∞ F (x) ≤ 1, F is called a (possibly) defective p.d.f. If µ is a probabilitymeasure on (R,B(R)), then the function Fµ given by x 7→ µ((−∞, x]) is called the distribution function of µ.

A probability measure is uniquely determined by its distribution function. (Why?)

Definition 2.3 (Random Variable). Let (Ω,A,P) be a probability space, (Ω′,A′) a measurable space, and X : Ω → Ω′ be measurable. Then X is called a random variable with values in (Ω′,A′). If (Ω′,A′) = (R,B(R)), then X is called a real random variable. For A′ ∈ A′, we often denote

    P[X−1(A′)] as P[X ∈ A′] and X−1(A′) as {X ∈ A′}.

In particular, we let {X ≥ 0} = X−1([0,∞)) and define {X ≤ b} and other such events similarly.


As we shall primarily deal with real random variables in our study of probability, we often drop the “real” and referto them as just random variables.

Definition 2.4. Let X be a random variable with underlying probability space (Ω,A,P).

(i) The probability measure PX = P ∘ X−1 is called the distribution of X.

(ii) For a real random variable X, the map FX given by x ↦ P[X ≤ x] is called the distribution function of PX (or X). If µ = PX, we write X ∼ µ and say that X has distribution µ.

(iii) A family (Xi)i∈I of random variables is called identically distributed if PXi = PXj for all i, j ∈ I. We write X =ᴰ Y if PX = PY (D for distribution).

The distribution of a random variable essentially gives us a probability corresponding to each element of Ω′. Two random variables being identically distributed means that they are essentially the same, in the sense that a die labelled from 1 through 6 is the same as one labelled from a through f.

Theorem 2.1. For any p.d.f. F , there exists a real random variable X with FX = F .

Proof. We shall explicitly construct a probability space (Ω,A,P) and random variable X : Ω → R such that FX = F.
One choice that might come to mind is to take (Ω,A) = (R,B(R)), X : R → R as the identity function, and P as the Lebesgue-Stieltjes measure with distribution function F.
While this choice of ours works, let us attempt to construct another more "standard" choice that is perhaps more enlightening. Let Ω = (0, 1), A = B(R)|Ω and P be the Lebesgue measure on (Ω,A). This is standard in the sense that given any F, we construct a random variable over the same probability space. Define the left continuous inverse of F as

    F−1(t) = inf{x ∈ R : F(x) ≥ t} for t ∈ (0, 1).

Note that F−1(t) ≤ x if and only if F(x) ≥ t. In particular,

    {t : F−1(t) ≤ x} = (0, F(x)] ∩ (0, 1),

and so F−1 : (Ω,A) → (R,B(R)) is measurable. Thus

    P[{t : F−1(t) ≤ x}] = F(x).

This implies that F−1 is the random variable we wish to construct.

Note that the above implies that there is a bijection between probability distribution functions and distributionfunctions corresponding to random variables.
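The construction in this proof is precisely the inverse transform sampling method used to simulate random variables in practice: draw t uniformly from (0, 1) and output F−1(t). A minimal Python sketch (ours), for the p.d.f. F(x) = 1 − e^{−x} of exp1, whose left continuous inverse has the closed form −log(1 − t):

```python
import math
import random

def F_inv(t: float) -> float:
    # Left continuous inverse of F(x) = 1 - exp(-x):
    # F^{-1}(t) = inf{x : F(x) >= t} = -log(1 - t) for t in (0, 1).
    return -math.log(1.0 - t)

random.seed(0)
samples = [F_inv(random.random()) for _ in range(100_000)]

# Empirical check: the fraction of samples <= 1 should be close to F(1) = 1 - e^{-1} ≈ 0.632.
print(sum(x <= 1.0 for x in samples) / len(samples))
```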

Definition 2.5. If a distribution function F : Rn → [0, 1] is of the form

    F(x) = ∫_{−∞}^{x1} dt1 ∫_{−∞}^{x2} dt2 · · · ∫_{−∞}^{xn} dtn f(t1, t2, . . . , tn) for x ∈ Rn

for some integrable function f : Rn → [0,∞), then f is called the density of the distribution.

It is often easier to describe continuous probability distributions either in terms of their density or the corresponding probability distribution function. For example, if a random variable corresponds to picking a number uniformly at random from [0, 1], then its density is equal to 1 on [0, 1] (and 0 elsewhere).

2.2. Important Examples of Random Variables

We now give several important examples of random variables that we shall encounter several times in our study of probability.


1. Bernoulli Distribution.

Let p ∈ [0, 1] and P[X = 1] = p, P[X = 0] = 1 − p. Then PX is called the Bernoulli distribution with parameter p and is denoted Berp. More formally,

    Berp = (1 − p)δ0 + pδ1.

Its distribution function is

    FX(x) = 0 for x < 0,  1 − p for 0 ≤ x < 1,  and 1 for x ≥ 1.

Note that the above can be likened to the outcome of a weighted coin, with heads and tails corresponding to0 and 1.

The distribution PY of Y = 2X − 1 is called the Rademacher distribution with parameter p. More formally,

Radp = (1− p)δ−1 + pδ1.

Rad1/2 is simply called the Rademacher distribution.

2. Binomial Distribution.

Let p ∈ [0, 1] and n ∈ N. Let X : Ω → [n]0 be such that for each valid k,

    P[X = k] = (n choose k) p^k (1 − p)^{n−k}.

Then PX is called the binomial distribution with parameters n and p and is denoted bn,p. More formally,

    bn,p = ∑_{k=0}^n (n choose k) p^k (1 − p)^{n−k} δk.

3. Geometric Distribution.

Let p ∈ (0, 1] and X : Ω → N0 be such that for each n ∈ N0,

    P[X = n] = p(1 − p)^n.

Then PX is called the geometric distribution with parameter p and is denoted γp or b−1,p. More formally,

    γp = ∑_{n=0}^∞ p(1 − p)^n δn.

The geometric distribution γp represents the waiting time for a success in a series of independent random experiments, each of which succeeds with a probability p.

4. Negative Binomial Distribution.

Let r > 0 and p ∈ (0, 1]. We denote by

    b−r,p = ∑_{k=0}^∞ (−r choose k) (−1)^k p^r (1 − p)^k δk

the negative binomial distribution or Pascal distribution with parameters r and p. Note that r need not be an integer. The negative binomial distribution b−r,p represents the waiting time for the rth success in a series of independent random experiments, each of which succeeds with a probability p. Based on this intuition, we expect there to be some relation between the geometric distribution and the negative binomial distribution; we shall explain this relation later in the notes. (A numerical sketch of these weights appears after this list.)


5. Poisson Distribution.

Let λ ∈ [0,∞) and X : Ω → N0 be such that for each n ∈ N0,

    P[X = n] = e^{−λ} λ^n / n!.

Then PX = Poiλ is called the Poisson distribution with parameter λ.

6. Hypergeometric Distribution.

Consider a basket with B ∈ N black balls and W ∈ N white balls. If we draw n ∈ N balls from the basket, some simple combinatorics shows that the probability of drawing (exactly) b ∈ [n]0 black balls is given by the hypergeometric distribution with parameters B, W, n:

    HypB,W;n({b}) = (B choose b)(W choose n−b) / (B+W choose n).

In general, if we have k colours with Bi balls of colour i for each i, the probability of drawing exactly bi balls of colour i for each i is given by the generalised hypergeometric distribution:

    HypB1,B2,...,Bk;n({(b1, b2, . . . , bk)}) = (B1 choose b1)(B2 choose b2) · · · (Bk choose bk) / (B1+B2+···+Bk choose n),

where n = b1 + b2 + · · · + bk.

7. Gaussian Normal Distribution.

Let µ ∈ R and σ2 > 0. Let X be a real random variable such that for x ∈ R,

    P[X ≤ x] = (1/√(2πσ2)) ∫_{−∞}^x exp(−(t − µ)^2 / (2σ2)) dt.

Then PX is called the Gaussian normal distribution (or just normal distribution) with parameters µ and σ2 and is denoted Nµ,σ2. In particular, N0,1 is the standard normal distribution. (A sketch of how this distribution function is evaluated numerically appears after this list.)

8. Exponential Distribution.

Let θ > 0 and X be a nonnegative random variable such that for each x ≥ 0,

    P[X ≤ x] = P[X ∈ [0, x]] = ∫_0^x θe^{−θt} dt.

Then PX is called the exponential distribution with parameter θ and is denoted expθ.
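As promised above, here are two small Python sketches (ours, not from the notes). The first evaluates the weights of the negative binomial distribution b−r,p through the generalised binomial coefficient from the notation section and checks that they sum to one for a non-integer r; taking r = 1 recovers the geometric distribution γp = b−1,p.

```python
import math
from functools import reduce

def gen_binom(x: float, k: int) -> float:
    """Generalised binomial coefficient x(x-1)...(x-k+1)/k! for real x."""
    return reduce(lambda acc, i: acc * (x - i), range(k), 1.0) / math.factorial(k)

def neg_binom_weight(r: float, p: float, k: int) -> float:
    # Weight of b_{-r,p} at k: (-r choose k) (-1)^k p^r (1-p)^k.
    return gen_binom(-r, k) * (-1) ** k * p**r * (1 - p) ** k

r, p = 2.5, 0.4  # r need not be an integer
print(sum(neg_binom_weight(r, p, k) for k in range(200)))  # ≈ 1.0
print(neg_binom_weight(1.0, p, 3), p * (1 - p) ** 3)       # r = 1: the geometric weight p(1-p)^3
```

The second evaluates the normal distribution function. Rather than integrating the density directly, one can use the standard identity P[X ≤ x] = (1 + erf((x − µ)/√(2σ2)))/2:

```python
import math

def normal_cdf(x: float, mu: float = 0.0, sigma2: float = 1.0) -> float:
    """P[X <= x] for X with the N_{mu, sigma2} distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * sigma2)))

print(normal_cdf(0.0))                     # 0.5 by symmetry
print(normal_cdf(1.0) - normal_cdf(-1.0))  # ≈ 0.6827, the one-sigma probability
```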

2.3. The Product Measure

Let E be a finite set and Ω = E^N. Let (pe)e∈E be a probability vector. Define

    A = {[ω1, . . . , ωn] : ω1, . . . , ωn ∈ E and n ∈ N}

and a content µ on A by

    µ([ω1, ω2, . . . , ωn]) = ∏_{i=1}^n pωi.

We wish to extend µ to a measure on σ(A). Similar to how we proved the existence of the Lebesgue-Stieltjes measure (definition 1.22), we use a compactness argument to show that µ is σ-subadditive.
Let A, A1, A2, . . . ∈ A such that A ⊆ ⋃_{i=1}^∞ Ai. We claim that there exists n ∈ N such that A ⊆ ⋃_{i=1}^n Ai.
For each n ∈ N, let Bn = A \ ⋃_{i=1}^n Ai. We assume that Bn ≠ ∅ for all n ∈ N and derive a contradiction.


Due to the pigeonhole principle, there exists some ω1 ∈ E such that [ω1] ∩ Bn ≠ ∅ for infinitely many n ∈ N. Since B1 ⊇ B2 ⊇ · · · , we have that

    [ω1] ∩ Bn ≠ ∅ for all n ∈ N.

Similarly, there exist ω2, ω3, . . . ∈ E such that

    [ω1, . . . , ωk] ∩ Bn ≠ ∅ for all k, n ∈ N.

Each Bn is a disjoint union of sets Cn,1, . . . , Cn,mn ∈ A. Thus for each n ∈ N, there is some in ∈ [mn] such that

    [ω1, ω2, . . . , ωk] ∩ Cn,in ≠ ∅ for infinitely many k ∈ N.

As [ω1] ⊇ [ω1, ω2] ⊇ · · · , this implies that

    [ω1, ω2, . . . , ωk] ∩ Cn,in ≠ ∅ for all k ∈ N.

As Cn,in ∈ A, for fixed n and large enough k, we have

    [ω1, ω2, . . . , ωk] ⊆ Cn,in.

This implies that ω = (ω1, ω2, . . .) ∈ Cn,in ⊆ Bn for every n ∈ N. This in turn implies that ⋂_{n=1}^∞ Bn ≠ ∅, which yields a contradiction since ⋂_{n=1}^∞ Bn = A \ ⋃_{i=1}^∞ Ai = ∅.

Therefore, A ⊆ ⋃_{i=1}^n Ai for some n ∈ N. Since µ is known to be (finitely) subadditive, we have

    µ(A) ≤ ∑_{i=1}^n µ(Ai) ≤ ∑_{i=1}^∞ µ(Ai),

which is the required result.

Definition 2.6 (Product Measure). Let E be a finite nonempty set and Ω = E^N. Let (pe)e∈E be a probability vector. There then is a unique probability measure µ on σ(A) = B(Ω) (where A is defined as above) such that

    µ([ω1, ω2, . . . , ωn]) = ∏_{i=1}^n pωi for all ωi ∈ E and n ∈ N.

µ is called the product measure or Bernoulli measure on Ω with weights (pe)e∈E and is denoted by (∑_{e∈E} pe δe)^{⊗N}. The σ-algebra σ(A) is called the product σ-algebra on Ω and is denoted by (2^E)^{⊗N}.

We explain the above more intuitively in the following section.


§3. Independence

3.1. Independent Events

In the following, let (Ω,A,P) be a probability space and the sets A ∈ A be events.

Definition 3.1. Two events A and B are said to be independent if

Pr[A ∩B] = Pr[A] Pr[B].

For example, if we roll a die twice, the event of rolling a 6 the first time is independent of the event of rolling a 6 the second time.
Here, Ω = [6]^2, A = 2Ω and the probability distribution is P = UΩ. Our claim may be verified as follows. Towards showing that the outcome of the first roll is independent of that of the second, let Ã, B̃ ⊆ [6], A = Ã × [6] and B = [6] × B̃. We must show that Pr[A] Pr[B] = Pr[A ∩ B]. This is seen as follows:

    Pr[A] = |A|/36 = |Ã|/6,
    Pr[B] = |B|/36 = |B̃|/6,
    Pr[A ∩ B] = |A ∩ B|/36 = |Ã||B̃|/36 = Pr[A] Pr[B].

While in the above example it is intuitively clear that the two events must be independent, we can have less obvious examples as well. For example, the event that the sum of the two rolls is odd and the event that the first roll gives at most a three are independent. We leave it to the reader to verify this claim.
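A quick computational check of this claim, enumerating all 36 equally likely outcomes (a Python sketch of ours):

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two die rolls, uniform on 36 outcomes

A = {w for w in omega if (w[0] + w[1]) % 2 == 1}  # the sum of the two rolls is odd
B = {w for w in omega if w[0] <= 3}               # the first roll is at most three

pr = lambda E: len(E) / len(omega)
print(pr(A & B), pr(A) * pr(B))  # both 0.25, so the two events are independent
```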

We extend this definition of two independent events to any number of independent events as follows.

Definition 3.2 (Independence of Events). Let I be an index set and (Ai)i∈I be a family of events. The family (Ai)i∈I is called independent if for any finite subset J ⊆ I, the following holds:

    Pr[⋂_{j∈J} Aj] = ∏_{j∈J} Pr[Aj].

Note that pairwise independence does not guarantee overall independence. For example, consider (X, Y, Z) chosen uniformly from {(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)}. Then X, Y, Z are pairwise independent but they are not overall independent.
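A direct check of this example by enumeration (a Python sketch of ours), with the events {X = 1}, {Y = 1}, {Z = 1}:

```python
from itertools import combinations

points = [(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)]  # uniform, probability 1/4 each

def pr(pred):
    return sum(1 for w in points if pred(w)) / len(points)

events = [lambda w, i=i: w[i] == 1 for i in range(3)]  # {X = 1}, {Y = 1}, {Z = 1}

for e, f in combinations(events, 2):
    assert pr(lambda w: e(w) and f(w)) == pr(e) * pr(f)  # pairwise: 1/4 = (1/2)(1/2)

print(pr(lambda w: all(e(w) for e in events)),         # 0.0 ...
      pr(events[0]) * pr(events[1]) * pr(events[2]))   # ... but the product is 0.125
```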

Let us now return to the product measure defined in definition 2.6, which can be understood intuitively as follows. If E is a finite set of outcomes, consider the probability space comprising Ω = E^N, the σ-algebra

    A = σ({[ω1, . . . , ωn] : ω1, . . . , ωn ∈ E and n ∈ N}),

and the product measure P = (∑_{e∈E} pe δe)^{⊗N}. This basically represents repeatedly conducting the experiment of choosing an outcome from E. Let Ãi ⊆ E for each i ∈ N and let Ai be the event that Ãi occurs in the ith experiment, given by

    Ai = {ω ∈ Ω : ωi ∈ Ãi} = ⊎_{(ω1,...,ωi)∈E^{i−1}×Ãi} [ω1, . . . , ωi].

Intuitively, the family (Ai)i∈N should be independent, since the outcome of one of the conducted experiments does not depend on the outcomes of the other experiments.


We shall check this. Let J ⊆ N be finite and let n = max J. For j ∈ J, let B̃j = Ãj and Bj = Aj, and for j ∈ [n] \ J, let B̃j = E and Bj = Ω. Then

    Pr[⋂_{j∈J} Aj] = Pr[⋂_{j=1}^n Bj]
                   = Pr[{ω ∈ Ω : ωj ∈ B̃j for each j ∈ [n]}]
                   = ∑_{e1∈B̃1} · · · ∑_{en∈B̃n} ∏_{j=1}^n pej
                   = ∏_{j=1}^n (∑_{e∈B̃j} pe)
                   = ∏_{j∈J} (∑_{e∈Ãj} pe).

In particular, as this is true for |J| = 1, we have for any fixed i ∈ N,

    Pr[Ai] = ∑_{e∈Ãi} pe.

Substituting the above, we have

    Pr[⋂_{j∈J} Aj] = ∏_{j∈J} (∑_{e∈Ãj} pe) = ∏_{j∈J} Pr[Aj].

This proves the result. We state the result for future reference as follows.

Theorem 3.1. Let E be a finite set. Consider the probability space (Ω,A,P) where Ω = E^N,

    A = σ({[ω1, . . . , ωn] : ω1, . . . , ωn ∈ E and n ∈ N}),

and P = (∑_{e∈E} pe δe)^{⊗N}. For each i ∈ N, let Ãi ⊆ E and let Ai be the event that Ãi occurs in the ith experiment, given by

    Ai = {ω ∈ Ω : ωi ∈ Ãi} = ⊎_{(ω1,...,ωi)∈E^{i−1}×Ãi} [ω1, . . . , ωi].

Then the family (Ai)i∈N is independent.

Note that if events A and B are independent, then the events A^c and B are independent as well. This can be described more precisely as follows.

Theorem 3.2. Let I be an index set and (A_i)_{i∈I} be a family of events. Define B^0_i = A_i and B^1_i = A^c_i for each i ∈ I. Then the following statements are equivalent.

(a) The family (A_i)_{i∈I} is independent.

(b) There is some α ∈ {0, 1}^I such that the family (B^{α_i}_i)_{i∈I} is independent.

(c) For all α ∈ {0, 1}^I, the family (B^{α_i}_i)_{i∈I} is independent.

We leave the proof of the above to the reader.

Now, recall the limit superior defined in definition 1.7. The limit superior represents the event that a particular event occurs infinitely often (for example, the event that we roll a 4 infinitely many times when we roll a die a countably infinite number of times). This is formalized in the following.


Theorem 3.3 (Borel-Cantelli Lemma). Let A_1, A_2, . . . be events and let A* = lim sup_{n→∞} A_n. Then

(a) If ∑_{n=1}^∞ Pr[A_n] < ∞, then Pr[A*] = 0.

(b) If (A_n)_{n∈N} is independent and ∑_{n=1}^∞ Pr[A_n] = ∞, then Pr[A*] = 1.

Proof. By theorem 1.10, P is upper semicontinuous, lower semicontinuous and σ-subadditive.

(a) As P is upper semicontinuous and σ-subadditive,

Pr[A*] = Pr[⋂_{n=1}^∞ ⋃_{m=n}^∞ A_m] = lim_{n→∞} Pr[⋃_{m=n}^∞ A_m] ≤ lim_{n→∞} ∑_{m=n}^∞ Pr[A_m] = 0,

where the final equality holds because the tail sums of a convergent series vanish. The result follows.

(b) As P is lower semicontinuous and the family (A_n)_{n∈N} is independent, we have

Pr[(A*)^c] = Pr[⋃_{n=1}^∞ ⋂_{m=n}^∞ A^c_m] = lim_{n→∞} Pr[⋂_{m=n}^∞ A^c_m] = lim_{n→∞} ∏_{m=n}^∞ (1 − Pr[A_m]).

Now for any n ∈ N, as log(1 − x) ≤ −x,

∏_{m=n}^∞ (1 − Pr[A_m]) = exp(∑_{m=n}^∞ log(1 − Pr[A_m])) ≤ exp(−∑_{m=n}^∞ Pr[A_m]) = 0.

The result follows.

A saying the reader might have come across is that if a monkey is left with a typewriter for an infinite amount of time, it will eventually type the complete works of Shakespeare. This is in fact a consequence of the Borel-Cantelli Lemma, and is often referred to as the Infinite Monkey Theorem!
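The dichotomy in the lemma is easy to see numerically. Here is an illustrative Python sketch (ours; the truncation at N experiments is a practical compromise, not part of the statement), contrasting a summable case p_n = 1/n^2 with a non-summable independent case p_n = 1/n:

import random

random.seed(0)
N = 100_000

for label, p in [("1/n^2", lambda n: 1 / n**2), ("1/n", lambda n: 1 / n)]:
    # Independent events A_n with Pr[A_n] = p(n); record which ones occur.
    hits = [n for n in range(1, N + 1) if random.random() < p(n)]
    print(f"p_n = {label}: {len(hits)} occurrences up to N={N}, last at n={hits[-1]}")
# Typically: only a couple of hits for 1/n^2 (sum ~ pi^2/6), all with small n,
# but about log N ≈ 11.5 hits, stretching all the way out, for 1/n.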

We now extend the definition of independence of a family of events as follows.

Definition 3.3 (Independence of classes of events). Let I be an index set and E_i ⊆ A for all i ∈ I. The family (E_i)_{i∈I} is called independent if for any finite J ⊆ I and any choice of E_j ∈ E_j for j ∈ J, the family of events (E_j)_{j∈J} is independent.

For example, if we roll a die an infinite number of times, for each i ∈ N, consider the class of events given by E_i = {{ω ∈ Ω : ω_i ∈ A} : A ⊆ [6]}, where Ω = [6]^N. Then the family (E_i)_{i∈N} is independent.


Theorem 3.4. Let I be an index set and for each i ∈ I, let E_i ⊆ A. Then

(a) Let I be finite. If Ω ∈ E_i for each i, then (E_i)_{i∈I} is independent if and only if the family of events (E_i)_{i∈I} is independent for any choice of E_i ∈ E_i, i ∈ I.

(b) If E_i ∪ {∅} is ∩-closed for each i, then (E_i)_{i∈I} is independent if and only if (σ(E_i))_{i∈I} is independent.

(c) Let K be an arbitrary set and let (I_k)_{k∈K} be mutually disjoint subsets of I. If (E_i)_{i∈I} is independent, then (⋃_{i∈I_k} E_i)_{k∈K} is also independent.

Proof.

(a) The forward implication is obvious from the definition. To prove the backward implication, for J ⊆ I and j ∈ I \ J, choose E_j = Ω.

(b) The backward implication is obvious. Let us now prove the forward implication.

First, we claim that for any J ⊆ J′ ⊆ I where J′ is finite,

Pr[⋂_{i∈J′} E_i] = ∏_{i∈J′} Pr[E_i]

for any choice of E_i ∈ σ(E_i) if i ∈ J and E_i ∈ E_i if i ∈ J′ \ J.

We shall prove the above claim by induction on |J|. If |J| = 0, then the claim is true as (E_i)_{i∈I} is independent. Now assume that the claim is true for all J ⊆ I with |J| = n and all finite J′ ⊇ J. Fix such a J. Let j ∈ I \ J. Define J̃ = J ∪ {j} and choose some finite J′ ⊇ J̃. We shall show that the claim is true if we replace J with J̃, thus proving the inductive step.

Fix E_i ∈ σ(E_i) for each i ∈ J and E_i ∈ E_i for each i ∈ J′ \ J̃. Consider measures µ, ν on (Ω, A) such that

µ : E_j ↦ Pr[⋂_{i∈J′} E_i] and ν : E_j ↦ ∏_{i∈J′} Pr[E_i].

By the induction hypothesis, µ(E_j) = ν(E_j) for all E_j ∈ E_j ∪ {∅, Ω}. As E_j ∪ {∅} is ∩-closed, lemma 1.18 implies that µ(E_j) = ν(E_j) for all E_j ∈ σ(E_j). This proves our claim. Setting J = J′ yields the required result.

(c) This is trivial, as in definition 3.3 we only need to consider J ⊆ I such that |J ∩ I_k| ≤ 1 for any k ∈ K.

3.2. Independence of Random Variables

Let I be an index set and (Ω, A, P) be a probability space. For each i ∈ I, let (Ω_i, A_i) be a measurable space and X_i : (Ω, A) → (Ω_i, A_i) be a random variable with generated σ-algebra σ(X_i).

Definition 3.4. The family (X_i)_{i∈I} of random variables is said to be independent if the family (σ(X_i))_{i∈I} of generated σ-algebras is independent.

We say that a family (X_i)_{i∈I} of random variables is i.i.d. (independent and identically distributed) if the family (X_i)_{i∈I} is independent and P_{X_i} = P_{X_j} for any i, j ∈ I.

The meaning of independence of random variables might be clearer from the following restructuring of the definition. A family (X_i)_{i∈I} of random variables is independent if and only if for any finite set J ⊆ I and any choice of A_j ∈ A_j, j ∈ J, we have

Pr[⋂_{j∈J} {X_j ∈ A_j}] = ∏_{j∈J} Pr[X_j ∈ A_j].


For example, let us flip a coin four times. Let X be the number of heads that show up in the first two tosses and let Y be the number of tails that show up in the next two tosses. Then X and Y are independent.

For each i ∈ I, let (Ω′_i, A′_i) be another measurable space and assume that f_i : (Ω_i, A_i) → (Ω′_i, A′_i) is a measurable map. If (X_i)_{i∈I} is independent, then (f_i ∘ X_i)_{i∈I} is independent as well. This is a consequence of the fact that f_i ∘ X_i is σ(X_i)–A′_i-measurable.

Theorem 3.5. For each i ∈ I, let E_i be a π-system that generates A_i. If (X_i^{−1}(E_i))_{i∈I} is independent, then (X_i)_{i∈I} is independent.

Proof. By theorem 1.12, X_i^{−1}(E_i) is a π-system that generates X_i^{−1}(A_i) = σ(X_i). The result follows from theorem 3.4(b).

Theorem 3.6. Let E be a finite set and (p_e)_{e∈E} a probability vector on E. Then there exists a probability space (Ω, A, P) and an independent family (X_n)_{n∈N} of E-valued random variables on (Ω, A, P) such that Pr[X_n = e] = p_e for each e ∈ E and n ∈ N.

Proof. We shall prove this by constructing the required probability space (Ω, A, P). Let Ω = E^N and A = σ({[ω_1, ω_2, . . . , ω_k] : ω_i ∈ E for each i ∈ [k] and k ∈ N}). Let P = (∑_{e∈E} p_e δ_e)^{⊗N} be the product measure. For each n ∈ N, define the random variable X_n : Ω → E by ω ↦ ω_n, where ω_n represents the nth coordinate of ω. As a consequence of theorem 3.1, (X_n)_{n∈N} is independent, and the result is proved.

Definition 3.5. Let I be an index set and for each i ∈ I, let X_i be a real random variable. For any J ⊆ I, let F_{(X_j)_{j∈J}} : R^J → [0, 1] be given by

x ↦ Pr[X_j ≤ x_j for each j ∈ J] = Pr[⋂_{j∈J} {X_j ≤ x_j}].

This function is called the joint distribution function of (X_j)_{j∈J} and is denoted F_J. The probability measure P_{(X_j)_{j∈J}} on R^J is called the joint distribution of (X_j)_{j∈J}.

Theorem 3.7. A family (X_i)_{i∈I} of random variables is independent if and only if for every finite J ⊆ I and x ∈ R^J,

F_J(x) = ∏_{j∈J} F_j(x_j).

Proof. The forward implication is obvious. For the backward implication, note that the class of sets {(−∞, a] : a ∈ R} is a ∩-closed generator of B(R), and the given condition is equivalent to saying that the events {X_j ∈ (−∞, x_j]} are independent. The claim then follows from theorem 3.5.

Corollary 3.8. If, in addition to the conditions of the previous theorem, each F_J has a continuous density function f_J = f_{(X_j)_{j∈J}}, then the family (X_i)_{i∈I} is independent if and only if for any finite J ⊆ I and x ∈ R^J,

f_J(x) = ∏_{j∈J} f_j(x_j).

Theorem 3.9. Let K be an arbitrary set and let I_k, k ∈ K, be mutually disjoint sets. Define I = ⋃_{k∈K} I_k. If a family (X_i)_{i∈I} of random variables is independent, then the family of σ-algebras (σ(X_i : i ∈ I_k))_{k∈K} is independent.

Proof. For k ∈ K, let

S_k = {⋂_{j∈I_k} A_j : A_j ∈ σ(X_j), |{j ∈ I_k : A_j ≠ Ω}| < ∞}.


Note that S_k is a π-system and σ(S_k) = σ(X_j : j ∈ I_k). Theorem 3.4(b) implies that it suffices to show that (S_k)_{k∈K} is independent, and by definition 3.3, we may even assume that K is finite.

For each k ∈ K, let B_k ∈ S_k and let J_k ⊆ I_k be finite such that B_k = ⋂_{j∈J_k} A_j for some A_j ∈ σ(X_j). Define J = ⋃_{k∈K} J_k. Then

Pr[⋂_{k∈K} B_k] = Pr[⋂_{j∈J} A_j] = ∏_{j∈J} Pr[A_j] = ∏_{k∈K} Pr[⋂_{j∈J_k} A_j] = ∏_{k∈K} Pr[B_k].

3.3. The Convolution

Definition 3.6. Let µ and ν be probability measures on (Z, 2^Z). The convolution µ ∗ ν is defined as the probability measure on (Z, 2^Z) given by

(µ ∗ ν)({n}) = ∑_{m=−∞}^∞ µ({m}) ν({n − m}).

We define the nth convolution power recursively by µ^{∗1} = µ and µ^{∗n} = µ^{∗(n−1)} ∗ µ.

Theorem 3.10. If X and Y are independent Z-valued random variables, then P_{X+Y} = P_X ∗ P_Y.

Proof. For any n ∈ Z, we have

P_{X+Y}[{n}] = Pr[X + Y = n] = Pr[⨄_{m=−∞}^∞ {X = m} ∩ {Y = n − m}] = ∑_{m=−∞}^∞ Pr[X = m] Pr[Y = n − m] = ∑_{m=−∞}^∞ P_X[{m}] P_Y[{n − m}] = (P_X ∗ P_Y)[{n}].

Given the above theorem, we can generalise the convolution as follows.

Definition 3.7 (Convolution of Probability Measures). Let X and Y be independent random variables on R^n such that µ = P_X and ν = P_Y. The convolution µ ∗ ν is defined as P_{X+Y}.

We define µ^{∗k} for k ∈ N recursively as in the first case, with µ^{∗0} = δ_0.

For example, let λ, µ ∈ [0, ∞). Consider independent random variables X, Y such that X ∼ Poi_µ and Y ∼ Poi_λ. Then for n ∈ N_0,

Pr[X + Y = n] = e^{−µ} e^{−λ} ∑_{m=0}^n (µ^m/m!)(λ^{n−m}/(n − m)!) = e^{−(µ+λ)} (µ + λ)^n/n!,

using the binomial theorem in the last step. Thus Poi_λ ∗ Poi_µ = Poi_{λ+µ}.
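This identity is easy to confirm numerically. A short Python sketch (ours; the truncation level N is an arbitrary choice) convolves truncated Poisson pmfs and compares the result with the pmf of Poi_{λ+µ}:

import math
import numpy as np

lam, mu, N = 1.5, 2.3, 60

def poi_pmf(rate, n_max):
    return np.array([math.exp(-rate) * rate**n / math.factorial(n) for n in range(n_max + 1)])

# Convolving the pmfs (theorem 3.10) should reproduce Poi_{lam+mu} on {0, ..., N}.
conv = np.convolve(poi_pmf(lam, N), poi_pmf(mu, N))[: N + 1]
direct = poi_pmf(lam + mu, N)
print(np.max(np.abs(conv - direct)))  # of the order of machine precision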


3.4. Kolmogorov’s 0-1 Law

The Borel-Cantelli Lemma was an example of a so-called 0-1 law. In this subsection, we study another such 0-1 law.

Definition 3.8 (Tail σ-algebra). Let I be a countably infinite index set and (A_i)_{i∈I} be a family of σ-algebras. Then

T((A_i)_{i∈I}) = ⋂_{J⊆I, |J|<∞} σ(⋃_{j∈I\J} A_j)

is called the tail σ-algebra of (A_i)_{i∈I}. If (A_i)_{i∈I} is a family of events, we define

T((A_i)_{i∈I}) = T(({∅, A_i, A^c_i, Ω})_{i∈I}).

If (X_i)_{i∈I} is a family of random variables, we define

T((X_i)_{i∈I}) = T((σ(X_i))_{i∈I}).

If the meaning is clear from context, we represent the tail σ-algebra as just T. Intuitively, the above means that we consider those events that are independent of the values of any finite subfamily of (X_i)_{i∈I}. This shall be made clearer as follows.

Theorem 3.11. Let J_1, J_2, . . . ⊆ I be finite sets with J_n ↑ I. Then

T((A_i)_{i∈I}) = ⋂_{n=1}^∞ σ(⋃_{m∈I\J_n} A_m).

If I = N, then this says that

T((A_i)_{i∈N}) = ⋂_{n=1}^∞ σ(⋃_{m=n}^∞ A_m).

Proof. It is obvious from the definition that T((A_i)_{i∈I}) is a subset of the expression on the right.

Let J ⊆ I be a finite set. As J_n ↑ I, there exists some N ∈ N such that J ⊆ J_N. We then have

⋂_{n=1}^∞ σ(⋃_{m∈I\J_n} A_m) ⊆ ⋂_{n=1}^N σ(⋃_{m∈I\J_n} A_m) = σ(⋃_{m∈I\J_N} A_m) ⊆ σ(⋃_{m∈I\J} A_m).

Noting that the expression on the left does not depend on J and taking the intersection over all finite J implies the reverse inclusion, and the result follows.

If we interpret I = N as a set of times, the above theorem essentially says that any event in T is independent of the first finitely many time points.

Now, it is not immediately clear whether T even contains any nontrivial events (events other than ∅ and Ω).

For starters, if A_1, A_2, . . . are events, then A* = lim sup_{n→∞} A_n and A_* = lim inf_{n→∞} A_n are both in T((A_i)_{i∈N}).

To see this for A_*, define B_n = ⋂_{m=n}^∞ A_m for n ∈ N. We then have that B_n ↑ A_* and B_n ∈ σ(⋃_{m=n_0}^∞ A_m) for any n ≥ n_0. This implies that A_* ∈ σ(⋃_{m=n_0}^∞ A_m) for any n_0 ∈ N and thus, A_* ∈ T.

For A*, we must show that for any m ∈ N, lim sup_{n→∞} A_n ∈ σ(⋃_{n=m}^∞ σ(A_n)). This is true as lim sup_{n→∞} A_n = lim sup_{n→∞} A_{n+m}, and each A_{n+m} is in σ(⋃_{n=m}^∞ σ(A_n)).


Let (X_n)_{n∈N} be real random variables. Then the Cesàro limits

lim inf_{n→∞} (1/n) ∑_{i=1}^n X_i and lim sup_{n→∞} (1/n) ∑_{i=1}^n X_i

are T((X_n)_{n∈N})-measurable.

Theorem 3.12 (Kolmogorov’s 0-1 Law). Let I be a countably infinite index set and (A_i)_{i∈I} be an independent family of σ-algebras. Then the tail σ-algebra is P-trivial, that is,

Pr[A] ∈ {0, 1} for any A ∈ T((A_i)_{i∈I}).

Proof. Assume w.l.o.g. that I = N. For each n ∈ N, let

F_n = {⋂_{j=1}^n A_j : A_j ∈ A_j for each j ∈ [n]}.

Let F = ⋃_{n=1}^∞ F_n. Note that F is a semiring.

Further, for any m ∈ N and A_m ∈ A_m, we have A_m ∈ F. This implies that σ(⋃_{i=1}^∞ A_i) ⊆ σ(F). We also have

F_m ⊆ σ(⋃_{i=1}^m A_i) ⊆ σ(⋃_{i=1}^∞ A_i).

This implies that σ(F) = σ(⋃_{i=1}^∞ A_i).

Let A ∈ T((A_n)_{n∈N}) and ε > 0. By theorem 1.24, there exist n_0 ∈ N and mutually disjoint sets F_1, F_2, . . . , F_{n_0} ∈ F such that

Pr[A △ ⋃_{i=1}^{n_0} F_i] < ε.

Let F = ⋃_{i=1}^{n_0} F_i. There must be some n ∈ N such that F_1, . . . , F_{n_0} ∈ F_n. This implies that F ∈ σ(⋃_{i=1}^n A_i). By the definition of the tail σ-algebra, A ∈ σ(⋃_{i=n+1}^∞ A_i), so A must be independent of F. Therefore,

ε > Pr[A \ F] = Pr[A] − Pr[A] Pr[F] = Pr[A](1 − Pr[F]) ≥ Pr[A](1 − Pr[A] − ε),

where the last step uses Pr[F] ≤ Pr[A] + Pr[F \ A] ≤ Pr[A] + ε. As this is true for any ε > 0, 0 = Pr[A](1 − Pr[A]) and the result is proved.

Corollary 3.13. Let (A_n)_{n∈N} be an independent family of events. Then

Pr[lim sup_{n→∞} A_n] and Pr[lim inf_{n→∞} A_n] are in {0, 1}.

The above can be inferred from the fact that the lim sup and lim inf lie in the tail σ-algebra. It also follows from the Borel-Cantelli Lemma.

Corollary 3.14. Let (X_n)_{n∈N} be an independent family of real random variables. Then X_* = lim inf_{n→∞} X_n and X* = lim sup_{n→∞} X_n are almost surely constant, that is, there exist x_*, x* ∈ [−∞, ∞] such that Pr[X_* = x_*] = 1 and Pr[X* = x*] = 1.

The above follows from the fact that for any x ∈ R, {X_* < x} ∈ T((X_n)_{n∈N}) and {X* > x} ∈ T((X_n)_{n∈N}).


§4. Generating Functions

It is a common theme in mathematics to determine relations between objects that are of interest and objects that are easy to compute with. In probability theory, probability generating functions, Laplace transforms and characteristic functions fall in the former category, while the mean, median and variance of random variables fall in the latter.

4.1. Definitions and Basics

Definition 4.1 (Probability Generating Function). Let X be an N_0-valued random variable. The probability generating function (abbreviated pgf) of P_X (or X) is the map ψ_{P_X} = ψ_X : [0, 1] → [0, 1] given by (where 0^0 = 1)

z ↦ ∑_{n=0}^∞ Pr[X = n] z^n.

From the properties of a power series, we have the following result, which we do not prove.

Theorem 4.1.

(a) ψ_X is continuous on [0, 1] and infinitely often continuously differentiable on (0, 1). For n ∈ N,

lim_{z↑1} ψ_X^{(n)}(z) = ∑_{k=n}^∞ Pr[X = k] · k(k − 1) · · · (k − n + 1),

where both sides can equal ∞.

(b) P_X is uniquely determined by ψ_X.

(c) For any r ∈ (0, 1), ψ_X is uniquely determined by countably many values ψ_X(x_i), where x_i ∈ [0, r] for each i ∈ N. If the series defining ψ_X converges for some z > 1, then this statement is also true for any r ∈ (0, z), and

lim_{z↑1} ψ_X^{(n)}(z) = ψ_X^{(n)}(1) < ∞ for all n ∈ N.

In this case, ψ_X is uniquely determined by the values ψ_X^{(n)}(1), n ∈ N.

While this definition of a pgf may seem quite arbitrary, it is useful when we want to add random variables, as is evident from the following theorem.

Theorem 4.2. Let X_1, X_2, . . . , X_n be independent N_0-valued random variables. Then

ψ_{X_1+X_2+···+X_n} = ∏_{i=1}^n ψ_{X_i}.

Proof. We shall prove the claim for n = 2; the general result follows inductively. For any z ∈ [0, 1],

ψ_{X_1}(z) · ψ_{X_2}(z) = (∑_{n=0}^∞ Pr[X_1 = n] z^n)(∑_{n=0}^∞ Pr[X_2 = n] z^n)
= ∑_{n=0}^∞ z^n (∑_{m=0}^n Pr[X_1 = m] Pr[X_2 = n − m])
= ∑_{n=0}^∞ z^n (∑_{m=0}^n Pr[{X_1 = m} ∩ {X_2 = n − m}])
= ∑_{n=0}^∞ z^n Pr[X_1 + X_2 = n]
= ψ_{X_1+X_2}(z).
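Before turning to examples, here is a small computational illustration (ours, not from the notes): multiplying pgfs amounts to convolving their coefficient vectors, which recovers the familiar triangular distribution of the sum of two fair dice.

import numpy as np

# pgf of a fair die as a coefficient vector: psi(z) = (z + z^2 + ... + z^6)/6.
die = np.array([0] + [1 / 6] * 6)

# Multiplying pgfs = convolving coefficient vectors (theorem 4.2).
two_dice = np.convolve(die, die)
print({n: round(p, 4) for n, p in enumerate(two_dice) if p > 0})
# {2: 0.0278, 3: 0.0556, ..., 7: 0.1667, ..., 12: 0.0278}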


For example,

• For some m, n ∈ N and p ∈ [0, 1], let X ∼ b_{n,p} and Y ∼ b_{m,p}. Then

ψ_X(z) = ∑_{i=0}^n (n choose i) p^i (1 − p)^{n−i} z^i = (pz + 1 − p)^n.

If X and Y are independent, then

ψ_{X+Y}(z) = ψ_X(z) · ψ_Y(z) = (pz + 1 − p)^{m+n}.

Therefore, X + Y ∼ b_{m+n,p}, which is not immediately apparent otherwise. (Note that this also implies that b_{m,p} ∗ b_{n,p} = b_{m+n,p}.)

• Let p ∈ (0, 1] and let X_1, . . . , X_n ∼ γ_p be independent random variables. Define Y = X_1 + · · · + X_n. For any z ∈ [0, 1],

ψ_{X_1}(z) = ∑_{i=0}^∞ p(z(1 − p))^i = p/(1 − z(1 − p)).

Then

ψ_Y(z) = p^n/(1 − z(1 − p))^n = ∑_{i=0}^∞ p^n (−n choose i)(−1)^i (1 − p)^i z^i = ∑_{i=0}^∞ b_{−n,p}({i}) z^i.

Therefore, γ_p^{∗n} = b_{−n,p}. Note that this matches the intuition we introduced while defining the geometric and negative binomial distributions: the waiting time for the nth success is equivalent to waiting for a single success n times, that is, X_1 + · · · + X_n.

• We can also show that for λ, µ ∈ [0, ∞),

ψ_{Poi_λ}(z) = e^{λ(z−1)}.

This implies that Poi_λ ∗ Poi_µ = Poi_{λ+µ} (recall that we had also proved this earlier by manually calculating the convolution).

• For r, s ∈ (0, ∞) and p ∈ (0, 1], we have b_{−r,p} ∗ b_{−s,p} = b_{−(r+s),p}. We leave it to the reader to prove this.

4.2. The Poisson Approximation

Theorem 4.3. Let µ, µ_1, µ_2, . . . be probability measures on (N_0, 2^{N_0}) with generating functions ψ, ψ_1, ψ_2, . . .. Then the following are equivalent.

(a) lim_{n→∞} µ_n({k}) = µ({k}) for all k ∈ N_0.

(b) lim_{n→∞} µ_n(A) = µ(A) for all A ⊆ N_0.

(c) lim_{n→∞} ψ_n(z) = ψ(z) for all z ∈ [0, 1].

(d) lim_{n→∞} ψ_n(z) = ψ(z) for all z ∈ [0, η) for some η ∈ (0, 1).

If any of the above is true, we write lim_{n→∞} µ_n = µ and say that (µ_n)_{n∈N} converges weakly to µ.

Proof.


• (a) =⇒ (b). Fix ε > 0. Choose some N ∈ N such that

µ({N + 1, N + 2, . . .}) < ε/4

and sufficiently large n_0 ∈ N such that

∑_{k=0}^N |µ_n({k}) − µ({k})| < ε/4 for all n ≥ n_0.

Note that for any n ≥ n_0, µ_n({N + 1, N + 2, . . .}) < ε/2. Therefore, for any A ⊆ N_0 and n ≥ n_0,

|µ_n(A) − µ(A)| ≤ µ_n({N + 1, N + 2, . . .}) + µ({N + 1, N + 2, . . .}) + ∑_{k∈A∩{0,1,...,N}} |µ_n({k}) − µ({k})| ≤ ε.

This proves the result.

• (b) =⇒ (a). This is trivial.

• (a) ⇐⇒ (c) ⇐⇒ (d). This follows from theorem 4.1.

When we are dealing with the binomial distribution for large n, it is usually inconvenient to calculate the terms. However, in the case where n is very large and p is very small such that np = λ is of reasonable magnitude, we can approximate b_{n,p} by Poi_λ. This is made rigorous as follows.

Let (p_{n,k})_{n,k∈N} be such that p_{n,k} ∈ [0, 1],

λ = lim_{n→∞} ∑_{k=1}^∞ p_{n,k} ∈ (0, ∞) and lim_{n→∞} ∑_{k=1}^∞ p_{n,k}^2 = 0.

An example of such a family would be p_{n,k} = λ/n for k ≤ n and p_{n,k} = 0 otherwise. For each n ∈ N, let (X_{n,k})_{k∈N} be a family of independent random variables such that X_{n,k} ∼ Ber_{p_{n,k}}. For n, k ∈ N, define

S_n = ∑_{l=1}^∞ X_{n,l} and S_n^k = ∑_{l=1}^k X_{n,l}.

Theorem 4.4 (Poisson Approximation). With the above notation, the distributions (P_{S_n})_{n∈N} converge weakly to Poi_λ.

Proof. For any z ∈ [0, 1],

ψ_{S_n}(z) = ∏_{l=1}^∞ (p_{n,l} z + 1 − p_{n,l}) = exp(∑_{l=1}^∞ log(p_{n,l} z + 1 − p_{n,l})).

Now, as |log(1 − x) + x| ≤ x^2 for |x| < 1/2, we have

|∑_{l=1}^∞ log(p_{n,l} z + 1 − p_{n,l}) − (z − 1) ∑_{l=1}^∞ p_{n,l}| ≤ ∑_{l=1}^∞ p_{n,l}^2.

Taking the limit as n → ∞, we have

lim_{n→∞} ψ_{S_n}(z) = lim_{n→∞} exp((z − 1) ∑_{l=1}^∞ p_{n,l}) = e^{λ(z−1)}.

As ψ_{Poi_λ}(z) = e^{λ(z−1)}, this completes the proof.

For example, let p_{n,k} be λ/n if k ≤ n and 0 otherwise. Then by the Poisson approximation, lim_{n→∞} b_{n,λ/n} = Poi_λ.
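The rate of this approximation is also easy to observe numerically. The following Python sketch (ours; log-space pmfs are used purely to avoid over/underflow) computes the total-variation distance between b_{n,λ/n} and Poi_λ:

import math

lam = 3.0

def log_binom_pmf(n, p, k):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log_poi_pmf(rate, k):
    return -rate + k * math.log(rate) - math.lgamma(k + 1)

for n in (10, 100, 1000):
    p = lam / n
    # Total-variation distance on {0, ..., n}; the Poisson mass beyond n is negligible.
    tv = 0.5 * sum(abs(math.exp(log_binom_pmf(n, p, k)) - math.exp(log_poi_pmf(lam, k)))
                   for k in range(n + 1))
    print(n, round(tv, 5))  # the distance shrinks as n grows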

4.3. Branching Processes

Let T, X_1, X_2, . . . be N_0-valued random variables and let S = ∑_{i=1}^T X_i. Note that S is measurable (and thus a random variable) as

{S = k} = ⋃_{i=0}^∞ {T = i} ∩ {X_1 + · · · + X_i = k}.

Theorem 4.5. With the above notation, if T, X_1, X_2, . . . are independent and X_1, X_2, . . . are all identically distributed, then ψ_S = ψ_T ∘ ψ_{X_1}.

Proof. We have

ψ_S(z) = ∑_{k=0}^∞ Pr[S = k] z^k
= ∑_{k=0}^∞ ∑_{i=0}^∞ Pr[T = i] Pr[X_1 + · · · + X_i = k] z^k
= ∑_{i=0}^∞ Pr[T = i] (ψ_{X_1}(z))^i
= (ψ_T ∘ ψ_{X_1})(z).

Let p_0, p_1, . . . ∈ [0, 1] be such that ∑_{i=0}^∞ p_i = 1. Let (X_{n,i})_{n,i∈N_0} be an independent family of random variables such that Pr[X_{n,i} = k] = p_k for all k, n, i.

Let Z_0 = 1 and Z_n = ∑_{i=1}^{Z_{n−1}} X_{n,i} for each n ∈ N. This can be interpreted as the number of members in each generation of a family where the number of offspring each person has is random and given by X_{n,i}.

Definition 4.2. (Z_n)_{n∈N_0} is called a Galton-Watson process or branching process with offspring distribution (p_k)_{k∈N_0}.

Branching processes are easier to study with the assistance of generating functions. For z ∈ [0, 1], let

ψ(z) = ∑_{k=0}^∞ p_k z^k.

Recursively define ψ_1 = ψ and ψ_{n+1} = ψ_n ∘ ψ for all n ∈ N.

Theorem 4.6. ψ_{Z_n} = ψ_n for all n ∈ N.

Proof. We have ψ_{Z_1} = ψ_1 (by definition). By theorem 4.5, ψ_{Z_{n+1}} = ψ_{Z_n} ∘ ψ. The result follows inductively.

Now, let us study the probability that the family dies out. Denote by q_n = Pr[Z_n = 0] the probability that Z is extinct by time n. Clearly, q_n is monotone increasing in n. In the limiting case, we have the following.

Definition 4.3. Let (Z_n)_{n∈N_0} be a branching process. We define its extinction probability by

q = lim_{n→∞} Pr[Z_n = 0].


A natural question to ask is: under what condition does the family definitely die out, that is, q = 1? We clearly have q ≥ p_0, since q_n is monotone increasing and q = lim_{n→∞} q_n. If p_0 = 0, then Z_n is monotone increasing in n and, as a result, q = 0.

Theorem 4.7 (Extinction Probability of a Branching Process). Let (Z_n)_{n∈N_0} be a branching process with offspring distribution (p_k)_{k∈N_0} such that p_1 ≠ 1. Then

(a) {r ∈ [0, 1] : ψ(r) = r} = {q, 1}.

(b) The following holds:

q < 1 ⇐⇒ lim_{z↑1} ψ′(z) > 1 ⇐⇒ ∑_{k=1}^∞ k p_k > 1.

Proof.

(a) Let F = {r ∈ [0, 1] : ψ(r) = r}. Clearly, 1 ∈ F as ψ(1) = 1. Note that q_n = ψ_n(0) = ψ(q_{n−1}) for all n ∈ N. As ψ is continuous,

ψ(q) = ψ(lim_{n→∞} q_n) = lim_{n→∞} ψ(q_n) = lim_{n→∞} q_{n+1} = q.

Thus q ∈ F.

We next claim that q = min F. Let r ∈ F. Then r ≥ 0 = q_0. If r ≥ q_n for some n, then as ψ is monotone increasing, r = ψ(r) ≥ ψ(q_n) = q_{n+1}. Inductively, r ≥ q_n for every n ∈ N, and the claim follows. We complete the remainder of the proof in the second part.

(b) The second equivalence follows by theorem 4.1. For the first equivalence, consider the following two cases.

• lim_{z↑1} ψ′(z) ≤ 1. Since ψ is strictly convex, it follows that ψ(z) > z for all z ∈ [0, 1), and so F = {1}. By the first part of the proof, q = 1.

• lim_{z↑1} ψ′(z) > 1. Since ψ is strictly convex and ψ(0) ≥ 0, there is some unique r ∈ (0, 1) such that ψ(r) = r. Then F = {r, 1} and, by the first part, q = min F = r < 1.

This completes the proof.
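Since q_n = ψ(q_{n−1}) ↑ q, the extinction probability can be computed numerically by iterating ψ from 0. A small Python sketch (ours; the offspring distribution is an arbitrary illustrative choice):

# Hypothetical offspring distribution: p_0 = 1/4, p_1 = 1/4, p_2 = 1/2.
# Mean offspring = 5/4 > 1, so theorem 4.7 predicts q < 1.
p = [0.25, 0.25, 0.5]

def psi(z):
    return sum(pk * z**k for k, pk in enumerate(p))

q = 0.0
for _ in range(200):   # q_n = psi(q_{n-1}) increases to the extinction probability
    q = psi(q)
print(q)               # -> 0.5, the smaller fixed point of psi(r) = r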


§5. The Integral

In the following, we assume (Ω, A, µ) to be a measure space. We denote by E the vector space of simple functions on (Ω, A) and by E+ = {f ∈ E : f ≥ 0} the cone of nonnegative simple functions. If

f = ∑_{i=1}^n α_i 1_{A_i}

for some n ∈ N, mutually disjoint sets A_1, . . . , A_n ∈ A and α_1, . . . , α_n ∈ (0, ∞), then this representation is said to be a normal representation of f.

5.1. Set Up to Define the Integral

Lemma 5.1. Let f = ∑_{i=1}^m α_i 1_{A_i} and f = ∑_{j=1}^n β_j 1_{B_j} be two normal representations of f ∈ E+. Then

∑_{i=1}^m α_i µ(A_i) = ∑_{j=1}^n β_j µ(B_j).

Proof. Clearly, if α_i ≠ 0 for some i, then A_i ⊆ ⋃_{j=1}^n B_j. A similar result holds for the B_j. Thus,

∑_{i=1}^m α_i µ(A_i) = ∑_{i=1}^m ∑_{j=1}^n α_i µ(A_i ∩ B_j).

If µ(A_i ∩ B_j) ≠ 0, then f(ω) = α_i = β_j for any ω ∈ A_i ∩ B_j. Therefore,

∑_{i=1}^m ∑_{j=1}^n α_i µ(A_i ∩ B_j) = ∑_{j=1}^n ∑_{i=1}^m β_j µ(A_i ∩ B_j) = ∑_{j=1}^n β_j µ(B_j).

Definition 5.1. Define I : E+ → [0, ∞] by

I(f) = ∑_{i=1}^m α_i µ(A_i)

if f has the normal representation f = ∑_{i=1}^m α_i 1_{A_i}.

The above definition makes sense due to the previous lemma.

Lemma 5.2. Let f, g ∈ E+ and α ≥ 0. Then

(a) I(αf) = αI(f),

(b) I(f + g) = I(f) + I(g), and

(c) If f ≤ g, then I(f) ≤ I(g).

We leave the proof of this lemma as an exercise to the reader.

Definition 5.2 (Integral). If f : Ω → [0, ∞] is measurable, then we define the integral of f with respect to µ by

∫ f dµ = sup{I(g) : g ∈ E+, g ≤ f}.

Note that by lemma 5.2(c), I(f) = ∫ f dµ for any f ∈ E+. That is, the integral is an extension of I from E+ to the set of non-negative measurable functions. We extend this to general measurable functions in the next subsection.

Let f, g : Ω → R. Similar to how we write f ≤ g if f(ω) ≤ g(ω) for all ω ∈ Ω, we write f ≤ g almost everywhere if there exists some set N ∈ A such that µ(N) = 0 and f(ω) ≤ g(ω) for all ω ∈ Ω \ N.


Theorem 5.3. Let f, g, f_1, f_2, . . . be measurable maps Ω → [0, ∞]. Then

(a) If f ≤ g, then ∫ f dµ ≤ ∫ g dµ.

(b) If f_n ↑ f, then ∫ f_n dµ ↑ ∫ f dµ.

(c) If α, β ∈ [0, ∞], then

∫ (αf + βg) dµ = α ∫ f dµ + β ∫ g dµ,

where we take ∞ · 0 = 0.

Proof.

(a) This is obvious from the definition of the integral.

(b) By (a),

lim_{n→∞} ∫ f_n dµ = sup_{n∈N} ∫ f_n dµ ≤ ∫ f dµ.

We must now show that ∫ f dµ ≤ sup_{n∈N} ∫ f_n dµ. Let g ∈ E+ with g ≤ f. It is enough to show that sup_{n∈N} ∫ f_n dµ ≥ ∫ g dµ. Fix some t ∈ (0, 1). For each n ∈ N, define

A_n = {ω ∈ Ω : f_n(ω) ≥ t g(ω)}.

Note that each A_n is measurable (why?) and that A_n ⊆ A_{n+1} for each n ∈ N.

We first claim that ⋃_{n=1}^∞ A_n = Ω. We prove this as follows. For any ω ∈ Ω,

• If f(ω) ≤ t g(ω), then g(ω) = f(ω) = 0, so ω ∈ A_n for every n ∈ N.

• If f(ω) > t g(ω), then there exists some n ∈ N such that f_n(ω) > t g(ω). It follows that ω ∈ A_n.

Therefore, Ω ⊆ ⋃_{n=1}^∞ A_n. Since the reverse inclusion is obviously true, we have ⋃_{n=1}^∞ A_n = Ω.

Now,

∫ f_n dµ ≥ ∫ t g 1_{A_n} dµ.

Taking the limit as n → ∞ and then t → 1, we have

lim_{n→∞} ∫ f_n dµ ≥ ∫ g dµ.

This completes the proof.

(c) By theorem 1.17, there exist sequences (f_n)_{n∈N} and (g_n)_{n∈N} in E+ such that f_n ↑ f and g_n ↑ g. Then by lemma 5.2 and (b),

∫ (αf + βg) dµ = lim_{n→∞} ∫ (αf_n + βg_n) dµ = α lim_{n→∞} ∫ f_n dµ + β lim_{n→∞} ∫ g_n dµ = α ∫ f dµ + β ∫ g dµ.

It is important to note that (b) in the above only works if the sequence of functions is non-decreasing. We do not have a similar result for a non-increasing sequence. To see this, consider (with µ the Lebesgue measure on R) the functions f_1, f_2, . . . : R → R given by f_n = 1_{[n,∞)} for each n ∈ N.

5.2. The Integral and some Properties

We have now introduced enough to define the integral for measurable functions in general.


Definition 5.3 (Integrals of Measurable Functions). Let f : Ω → R be measurable. We call f µ-integrable if ∫ |f| dµ < ∞ and write

L^1(µ) = L^1(Ω, A, µ) = {f : Ω → R : f is µ-integrable}.

For f ∈ L^1(µ), we define the integral of f with respect to µ by

∫ f(ω) µ(dω) = ∫ f dµ = ∫ f^+ dµ − ∫ f^− dµ.

For A ∈ A, we define

∫_A f dµ = ∫ (f 1_A) dµ.

Elements of L^1(P) are also sometimes called absolutely integrable functions.

Theorem 5.4. Let f : Ω → [0, ∞] be measurable.

(a) f = 0 almost everywhere if and only if ∫ f dµ = 0.

(b) If ∫ f dµ < ∞, then f < ∞ almost everywhere.

Proof.

(a) Let us first prove the forward implication. Let N = {ω ∈ Ω : f(ω) > 0}, so that µ(N) = 0. Then f ≤ ∞ · 1_N. As n 1_N ↑ ∞ 1_N, by theorem 5.3,

0 ≤ ∫ f dµ ≤ lim_{n→∞} ∫ n 1_N dµ = 0.

For the backward implication, let A_n = {ω ∈ Ω : f(ω) ≥ 1/n}. Then A_n ↑ N and for any n ∈ N,

0 = ∫ f dµ ≥ ∫ (1/n) 1_{A_n} dµ = µ(A_n)/n.

This implies that µ(A_n) = 0 for any n ∈ N and therefore, µ(N) = 0.

(b) Let A = {ω ∈ Ω : f(ω) = ∞}. Then for any n ∈ N,

µ(A) = ∫ 1_A dµ ≤ (1/n) ∫ f 1_{{f≥n}} dµ ≤ (1/n) ∫ f dµ.

Taking the limit as n → ∞, we get µ(A) = 0.

We now expand some of the properties that we proved earlier for non-negative measurable functions to measurable functions in general.

Theorem 5.5. Let f, g ∈ L^1(µ).

(a) (Monotonicity). If f ≤ g almost everywhere, then ∫ f dµ ≤ ∫ g dµ. In particular, if f = g almost everywhere, then ∫ f dµ = ∫ g dµ.

(b) (Triangle Inequality). |∫ f dµ| ≤ ∫ |f| dµ.

(c) (Linearity). If α, β ∈ R, then αf + βg ∈ L^1(µ) and

∫ (αf + βg) dµ = α ∫ f dµ + β ∫ g dµ.

Proof.


(a) Since f ≤ g almost everywhere, f^+ ≤ g^+ and f^− ≥ g^− almost everywhere. It is enough to show that ∫ f^+ dµ ≤ ∫ g^+ dµ and ∫ f^− dµ ≥ ∫ g^− dµ. Let us prove the former; the latter can be shown similarly.

Let h = g^+ − f^+. As h^− = 0 almost everywhere, by theorem 5.4, ∫ h dµ = ∫ h^+ dµ ≥ 0. The result follows.

(b) We have

|∫ f dµ| = |∫ f^+ dµ − ∫ f^− dµ| ≤ ∫ (f^+ + f^−) dµ = ∫ |f| dµ.

(c) To show linearity, it suffices to show that for any α ∈ [0, ∞),

• ∫ (f + g) dµ = ∫ f dµ + ∫ g dµ,

• ∫ αf dµ = α ∫ f dµ, and

• ∫ (−f) dµ = − ∫ f dµ.

These are easily shown by splitting each function into its positive and negative parts and using theorem 5.3 wherever necessary.

Definition 5.4. Let f : Ω → [0, ∞) be measurable. Define the measure ν by

ν(A) = ∫ (1_A f) dµ for A ∈ A.

Then ν is said to have density f with respect to µ. We also denote ν by fµ.

In showing that ν is a measure using theorem 1.10, finite additivity follows from the finite additivity of the integral, and lower semicontinuity follows from theorem 5.9.

Theorem 5.6. Let f : Ω → [0, ∞) and g : Ω → R be measurable. Then g ∈ L^1(fµ) if and only if gf ∈ L^1(µ). In this case,

∫ g d(fµ) = ∫ (gf) dµ.

We omit the proof of the above. It may be shown by first assuming g to be an indicator function, then extending this to simple functions, non-negative measurable functions and measurable functions in turn.

Definition 5.5. Let f : Ω → R be measurable and p ∈ [1, ∞). Define

‖f‖_p = (∫ |f|^p dµ)^{1/p}

and

‖f‖_∞ = inf{k ≥ 0 : µ({|f| ≥ k}) = 0}.

Further, for any p ∈ [1, ∞], we define the vector space

L^p(µ) = {f : Ω → R : f is measurable and ‖f‖_p < ∞}.

Theorem 5.7. The map ‖·‖_1 is a seminorm on L^1(µ), that is, for any f, g ∈ L^1(µ) and α ∈ R,

‖αf‖_1 = |α| · ‖f‖_1,
‖f + g‖_1 ≤ ‖f‖_1 + ‖g‖_1,
‖f‖_1 ≥ 0, with equality if and only if f = 0 almost everywhere.


Proof. The first and third (in)equalities follow from theorem 5.5(c) and theorem 5.4(a) respectively. The second inequality follows from the fact that |f + g| ≤ |f| + |g|. We leave the details of the proof to the reader.

In fact, ‖·‖_p is a seminorm on L^p(µ) for any p ∈ [1, ∞]. The proofs of the first and third (in)equalities are similarly straightforward; the proof of the second, however, requires Minkowski's inequality.

Theorem 5.8. Let µ(Ω) < ∞ and 1 ≤ p′ ≤ p ≤ ∞. Then L^p(µ) ⊆ L^{p′}(µ) and further, the canonical inclusion L^p(µ) ↪ L^{p′}(µ) given by f ↦ f is continuous.

Proof. Let us first take the case where p = ∞. For any f ∈ L^∞(µ), since |f| ≤ ‖f‖_∞ almost everywhere,

∫ |f|^{p′} dµ ≤ ∫ ‖f‖_∞^{p′} dµ = µ(Ω) ‖f‖_∞^{p′} < ∞.

It follows that for any f, g ∈ L^∞(µ),

‖f − g‖_{p′} ≤ (µ(Ω))^{1/p′} ‖f − g‖_∞,

and so the inclusion map is continuous.

Let us next take the case where p is finite. Then for any f ∈ L^p(µ), since |f|^{p′} ≤ 1 + |f|^p,

∫ |f|^{p′} dµ ≤ ∫ (1 + |f|^p) dµ = µ(Ω) + ∫ |f|^p dµ < ∞.

Now, for any f, g ∈ L^p(µ), let c = ‖f − g‖_p. Then

|f − g|^{p′} = |f − g|^{p′} 1_{{|f−g|≤c}} + |f − g|^{p′} 1_{{|f−g|>c}} ≤ c^{p′} + c^{p′−p} |f − g|^p.

This implies that

‖f − g‖_{p′} ≤ c (∫ (1 + c^{−p}|f − g|^p) dµ)^{1/p′} = ‖f − g‖_p (µ(Ω) + 1)^{1/p′}.

This completes the proof.

5.3. Monotone Convergence and Fatou’s Lemma

Under what conditions can we exchange the limit and the integral? We answered this question in part in theorem 5.3(b). Over the course of this subsection, we attempt to answer it more completely.

Theorem 5.9 (Monotone Convergence, Beppo Levi Theorem). Let f_1, f_2, . . . ∈ L^1(µ) and f : Ω → R be measurable. Assume that f_n ↑ f almost everywhere. Then

lim_{n→∞} ∫ f_n dµ = ∫ f dµ,

where both sides can equal ∞.

Proof. Let N be a set such that µ(N) = 0 and f_n(ω) ↑ f(ω) for all ω ∈ N^c. For each n ∈ N, define

f′_n = (f_n − f_1) 1_{N^c} and f′ = (f − f_1) 1_{N^c}.


Note that f′_n ↑ f′ and each of these functions is non-negative. By theorem 5.3(b), ∫ f′_n dµ ↑ ∫ f′ dµ. Then by theorem 5.5(a),

lim_{n→∞} ∫ f_n dµ = lim_{n→∞} (∫ f′_n dµ + ∫ f_1 dµ) = ∫ f′ dµ + ∫ f_1 dµ = ∫ f dµ.

Theorem 5.10 (Fatou’s Lemma). Let f ∈ L^1(µ) and let f_1, f_2, . . . be measurable with f_n ≥ f almost everywhere for all n ∈ N. Then

∫ lim inf_{n→∞} f_n dµ ≤ lim inf_{n→∞} ∫ f_n dµ.

Proof. Considering f_n − f for each n ∈ N, we may assume each f_n to be non-negative almost everywhere. For each m ∈ N, consider g_m = inf_{n≥m} f_n. Note that g_m ↑ lim inf_{n→∞} f_n. Thus, using theorem 5.5(a) and theorem 5.9,

∫ lim inf_{n→∞} f_n dµ = ∫ lim_{m→∞} g_m dµ = lim_{m→∞} ∫ g_m dµ ≤ lim inf_{n→∞} ∫ f_n dµ.

In the above theorem, we require an integrable f for the statement to hold. This f is called a "minorant".

An example to show this is the following, known as the Petersburg game. Consider a gamble in a casino where you double your bet with probability p ≤ 1/2 and lose it with probability 1 − p. We gamble over and over. This can be modelled by the probability space (Ω, A, P) where Ω = {−1, 1}^N, A = (2^{{−1,1}})^{⊗N}, and P = ((1 − p)δ_{−1} + pδ_1)^{⊗N}. Let us denote by D_n : Ω → {−1, 1} the outcome of the nth game.

If the player bets b_i dollars on the ith game, then his total profit after the nth game is S_n = ∑_{i=1}^n b_i D_i. Now, let the gambler adopt the following strategy: he bets H_n dollars on the nth game, where he bets 1 dollar on the first game, stops playing once he wins, and doubles his bet after each loss. That is,

H_n = 0 if D_i = 1 for some i < n, and H_n = 2^{n−1} otherwise.

The probability of no win until the nth game is (1 − p)^n. Therefore, P[S_n = 1 − 2^n] = (1 − p)^n and P[S_n = 1] = 1 − (1 − p)^n. The expected gain is

∫ S_n dP = (1 − 2^n)(1 − p)^n + (1 − (1 − p)^n) = 1 − (2(1 − p))^n ≤ 0.

Define

S = −∞ if −1 = D_1 = D_2 = · · ·, and S = 1 otherwise.

Then lim_{n→∞} S_n = S almost surely. However, lim_{n→∞} ∫ S_n dP ≤ 0 < 1 = ∫ S dP, since S = 1 almost surely.

By Fatou's Lemma, this is only possible if there is no integrable minorant f for (S_n)_{n∈N}. Indeed, letting S̃ = inf{S_n : n ∈ N},

P[S̃ = 1 − 2^{n−1}] = P[D_1 = D_2 = · · · = D_{n−1} = −1 and D_n = 1] = p(1 − p)^{n−1}.

Therefore,

∫ S̃ dP = ∑_{n=1}^∞ (1 − 2^{n−1}) p (1 − p)^{n−1} = −∞.


Also, in Fatou's Lemma, we cannot simply replace the lim inf with a lim sup, which can be seen from the following example. Let X be drawn uniformly at random from [0, 1], and let X_n be the nth binary digit of X (we need not worry about outcomes with two binary expansions, as they almost surely do not occur). We have E[X_n] = 1/2 for all n, but lim sup_{n→∞} X_n = 1 almost surely, so ∫ lim sup_{n→∞} X_n dP = 1 > 1/2 = lim sup_{n→∞} ∫ X_n dP.

5.4. Miscellaneous

It may be shown that if f : I → R is Riemann integrable on I = [a, b], then f is Lebesgue integrable on I with integral

∫_I f dλ = ∫_a^b f(x) dx.

The converse is not true; that is, a function that is Lebesgue integrable need not be Riemann integrable. An example of this is 1_Q.

Theorem 5.11. Let f : Ω → R be measurable and f ≥ 0 almost everywhere. Then

∑_{n=1}^∞ µ({f ≥ n}) ≤ ∫ f dµ ≤ ∑_{n=0}^∞ µ({f > n}).

Proof. Define f_1 = ⌊f⌋ and f_2 = ⌈f⌉. Clearly, f_1 ≤ f ≤ f_2 and so ∫ f_1 dµ ≤ ∫ f dµ ≤ ∫ f_2 dµ. We then have

∫ f_1 dµ = ∑_{k=1}^∞ k µ({f_1 = k}) = ∑_{k=1}^∞ ∑_{n=1}^k µ({f_1 = k}) = ∑_{n=1}^∞ ∑_{k=n}^∞ µ({f_1 = k}) = ∑_{n=1}^∞ µ({f_1 ≥ n}) = ∑_{n=1}^∞ µ({f ≥ n}).

Similarly, for the second inequality,

∫ f_2 dµ = ∑_{n=1}^∞ µ({f_2 ≥ n}) = ∑_{n=1}^∞ µ({f > n − 1}) = ∑_{n=0}^∞ µ({f > n}).

This proves the result.

Theorem 5.12. Let f : Ω → R be measurable and f ≥ 0 almost everywhere. Then

∫ f dµ = ∫_0^∞ µ({f ≥ t}) dt.

Proof. Define g by g(t) = µ({f ≥ t}). We may assume that g(t) < ∞ for all t > 0.

For ε > 0 and k ∈ N, let g_ε = min(g, g(ε)), f_ε = f 1_{{f≥ε}}, f_{ε,k} = 2^k f_ε, and

α_{ε,k} = 2^{−k} ∑_{n=1}^∞ µ({f_ε ≥ 2^{−k} n}).

Note that α_{ε,k} → ∫_0^∞ g_ε(t) dt as k → ∞ (why?). Using theorem 5.11 on f_{ε,k},

α_{ε,k} = 2^{−k} ∑_{n=1}^∞ µ({f_{ε,k} ≥ n}) ≤ ∫ f_ε dµ ≤ 2^{−k} ∑_{n=0}^∞ µ({f_{ε,k} > n}) = 2^{−k} ∑_{n=0}^∞ µ({f_ε > n 2^{−k}}) ≤ α_{ε,k} + 2^{−k} g(ε).

Taking the limit as k → ∞, equality must hold everywhere, so we get

∫_0^∞ g_ε(t) dt = ∫ f_ε dµ.

Taking the limit as ε ↓ 0 completes the proof.
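This "layered" representation of the integral is easy to see empirically. A Monte Carlo sketch in Python (ours; the truncation at T = 20 and the grid step are arbitrary numerical choices), for X ∼ exp_1 so that both sides equal 1:

import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.exponential(1.0, size=200_000))

dt = 0.01
t = np.arange(0.0, 20.0, dt)                 # truncate the integral at T = 20
tail = 1.0 - np.searchsorted(x, t) / x.size  # empirical P(X >= t) on the grid
print(x.mean(), (tail * dt).sum())           # both come out close to 1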


§6. Moments

When describing random variables, quantities like the expectation, median and variance are often used. They describe the behaviour of the variable on average, and how much it deviates from that average.

6.1. Parameters of Random Variables

In the following, let (Ω,A,P) be a probability space.

Definition 6.1. Let X be a real random variable.

(i) If X ∈ L^1(P), then X is called integrable and we call

E[X] = ∫ X dP

the expectation or mean of X. If E[X] = 0, we call X centered. The expectation of X is defined by the same expression even if only X^+ or X^− is integrable.

(ii) Let n ∈ N and X ∈ L^n(P). Then for any k ∈ [n],

m_k = E[X^k] and M_k = E[|X|^k]

are called the kth moments and kth absolute moments of X respectively.

(iii) If X ∈ L^2(P), X is called square integrable and

Var[X] = E[X^2] − E[X]^2

is called the variance of X. The number σ = √Var[X] is called the standard deviation of X (this makes sense, as we prove in theorem 6.4(a) that Var[X] ≥ 0). We sometimes write Var[X] = ∞ if E[X^2] = ∞.

(iv) If X, Y ∈ L^2(P), we define the covariance of X and Y by

Cov[X, Y] = E[(X − E[X])(Y − E[Y])].

X and Y are called uncorrelated if Cov[X, Y] = 0 and correlated otherwise.

We now give some basic properties of the expectation. All of them follow from the corresponding properties of the integral.

The mean represents the average value of the random variable, the variance gives a measure of the deviation or dispersion of the random variable about the mean, and the covariance gives a measure of how related two random variables are.

Theorem 6.1 (Properties of Expectation). Let X, Y, X_n, Z_n (n ∈ N) be real integrable random variables on (Ω, A, P). Then

(a) If P_X = P_Y, then E[X] = E[Y].

(b) For any c ∈ R, cX ∈ L^1(P) and X + Y ∈ L^1(P). Further,

E[cX] = c E[X] and E[X + Y] = E[X] + E[Y].

(c) If X ≥ 0 almost surely, then E[X] = 0 if and only if X = 0 almost surely.

(d) If X ≤ Y almost surely, then E[X] ≤ E[Y]. Further, E[X] = E[Y] if and only if X = Y almost surely.

(e) |E[X]| ≤ E[|X|].

(f) If X_n ≥ 0 almost surely for each n ∈ N, then E[∑_{n=1}^∞ X_n] = ∑_{n=1}^∞ E[X_n].


(g) If Z_n ↑ Z for some Z, then E[Z] = lim_{n→∞} E[Z_n].

Note that as a consequence of part (b) of the above,

Cov[X, Y] = E[XY] − E[X]E[Y].

Setting X = Y, we have Cov[X, X] = Var[X].

In the above, we did not use anything other than the properties of the integral itself. When we involve probability proper, namely independence, we get the following result.

Theorem 6.2 (Independent Variables are Uncorrelated). Let X, Y ∈ L^1(P) be independent. Then XY ∈ L^1(P) and further, E[XY] = E[X]E[Y]. Note that this implies Cov[X, Y] = 0.

Proof. Let us first take the case where X and Y each take finitely many values. Then Z = XY takes finitely many values as well, and so Z ∈ L^1(P). We now have

E[Z] = ∑_{z∈R\{0}} z Pr[Z = z]
= ∑_{x∈R\{0}} ∑_{y∈R\{0}} xy Pr[X = x, Y = y]
= (∑_{x∈R\{0}} x Pr[X = x])(∑_{y∈R\{0}} y Pr[Y = y])
= E[X]E[Y].

Now, take the case where X and Y are each non-negative (they may take an infinite number of values). Using theorem 1.17(a), let (X_n)_{n∈N} and (Y_n)_{n∈N} be sequences of simple functions such that X_n ↑ X and Y_n ↑ Y. By theorem 5.9,

E[XY] = lim_{n→∞} E[X_n Y_n] = lim_{n→∞} E[X_n]E[Y_n] = (lim_{n→∞} E[X_n])(lim_{n→∞} E[Y_n]) = E[X]E[Y].

The general result, where X and Y need not be non-negative, is easily proved by splitting X as X^+ − X^− and Y as Y^+ − Y^−.

The converse, however, is not true. That is, uncorrelated variables need not be independent. For example, let X take the values 0 and 1 with probability 1/2 each, and let Y, independent of X, take the values −1 and 1 with probability 1/2 each. Then XY and X are uncorrelated but not independent.
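Again, enumeration makes this concrete. A Python sketch (ours) of the example above:

from fractions import Fraction
from itertools import product

# X in {0, 1} and Y in {-1, 1}, independent and uniform: each pair has probability 1/4.
pairs = list(product((0, 1), (-1, 1)))

def E(h):
    return Fraction(sum(h(x, y) for x, y in pairs), len(pairs))

def pr(ev):
    return Fraction(sum(1 for x, y in pairs if ev(x, y)), len(pairs))

# Cov[XY, X] = E[X^2 Y] - E[XY] E[X] = 0: uncorrelated.
assert E(lambda x, y: x * y * x) - E(lambda x, y: x * y) * E(lambda x, y: x) == 0

# But not independent: Pr[XY = 1 and X = 0] = 0 while Pr[XY = 1] Pr[X = 0] = 1/8.
assert pr(lambda x, y: x * y == 1 and x == 0) == 0
assert pr(lambda x, y: x * y == 1) * pr(lambda x, y: x == 0) == Fraction(1, 8)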

Theorem 6.3 (Wald’s Identity). Let T, X_1, X_2, . . . ∈ L^1(P) be independent random variables such that X_1, X_2, . . . are identically distributed, and let T take values only in N_0. Define S_T = ∑_{i=1}^T X_i. Then S_T ∈ L^1(P) and E[S_T] = E[T]E[X_1].

Proof. For each n ∈ N, let S_n = ∑_{i=1}^n X_i. Then we can write S_T as

S_T = ∑_{n=0}^∞ 1_{{T=n}} S_n.

We may prove that S_T ∈ L^1(P) by performing the following calculation using |S_T| instead of S_T.


Since 1_{{T=n}} and S_n are independent for each n ∈ N, we now have

E[S_T] = E[∑_{n=0}^∞ 1_{{T=n}} S_n]
= ∑_{n=0}^∞ E[1_{{T=n}}] E[S_n]
= ∑_{n=0}^∞ Pr[T = n] E[∑_{i=1}^n X_i]
= ∑_{n=0}^∞ n Pr[T = n] E[X_1]
= E[T]E[X_1].
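Wald's identity is straightforward to check by simulation. A Python sketch (ours; the particular laws are arbitrary illustrative choices): T uniform on {0, . . . , 10}, so E[T] = 5, and X_i ∼ exp_2, so E[X_1] = 1/2; the identity predicts E[S_T] = 5/2.

import random

random.seed(2)

def sample_S_T():
    T = random.randint(0, 10)                              # T, independent of the X_i
    return sum(random.expovariate(2.0) for _ in range(T))  # S_T = X_1 + ... + X_T

n = 200_000
print(sum(sample_S_T() for _ in range(n)) / n)  # close to 2.5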

Let us now discuss some properties of the variance.

Theorem 6.4 (Properties of Variance). Let X ∈ L^2(P). Then

(a) Var[X] = E[(X − E[X])^2] ≥ 0,

(b) Var[X] = 0 if and only if X = E[X] almost surely, and

(c) the map f : R → R given by x ↦ E[(X − x)^2] attains its minimum at x_0 = E[X], taking value f(x_0) = Var[X].

Proof.

(a) This is clear from the fact that Var[X] = Cov[X, X] = E[(X − E[X])^2].

(b) This follows as E[(X − E[X])^2] = 0 if and only if (X − E[X])^2 = 0 almost surely.

(c) Expanding the expression for f as E[X^2] + x^2 − 2x E[X], we see that f(x) = Var[X] + (x − E[X])^2. The result is clear.

Theorem 6.5. Let X_1, . . . , X_m, Y_1, . . . , Y_n ∈ L^2(P) and α_1, . . . , α_m, β_1, . . . , β_n, d, e ∈ R. Then

Cov[d + ∑_{i=1}^m α_i X_i, e + ∑_{j=1}^n β_j Y_j] = ∑_{1≤i≤m, 1≤j≤n} α_i β_j Cov[X_i, Y_j].

Proof. By theorem 6.1,

Cov[d + ∑_{i=1}^m α_i X_i, e + ∑_{j=1}^n β_j Y_j] = E[(∑_{i=1}^m α_i(X_i − E[X_i]))(∑_{j=1}^n β_j(Y_j − E[Y_j]))]
= ∑_{1≤i≤m, 1≤j≤n} α_i β_j E[(X_i − E[X_i])(Y_j − E[Y_j])]
= ∑_{1≤i≤m, 1≤j≤n} α_i β_j Cov[X_i, Y_j].

The above can be stated more concisely as:


The map Cov : L^2(P) × L^2(P) → R is a positive semidefinite symmetric bilinear form, and Cov[X, Y] = 0 if Y is almost surely constant.

Breaking up the individual terms,

• Cov[X, Y] = 0 if Y is almost surely constant means that the d and e in the equation do not contribute to the covariance.

• "Symmetric bilinear form" means that Cov[X, Y] = Cov[Y, X], Cov[X_1 + X_2, Y] = Cov[X_1, Y] + Cov[X_2, Y], and Cov[αX, Y] = α Cov[X, Y] for all X, Y, X_1, X_2 ∈ L^2(P) and α ∈ R.

• "Positive semidefinite" means that Cov[X, X] ≥ 0 for all X ∈ L^2(P). This is because Cov[X, X] = Var[X] ≥ 0.

Corollary 6.6. For any X, X_1, . . . , X_m ∈ L^2(P) and α ∈ R, Var[αX] = α^2 Var[X] and

Var[∑_{i=1}^m X_i] = ∑_{i=1}^m Var[X_i] + ∑_{1≤i,j≤m, i≠j} Cov[X_i, X_j].

In particular, for uncorrelated X_1, . . . , X_m,

Var[∑_{i=1}^m X_i] = ∑_{i=1}^m Var[X_i].

The latter identity is known as the Bienaymé formula.

Theorem 6.7 (Cauchy-Schwarz Inequality). Let X, Y ∈ L^2(P). Then

(Cov[X, Y])^2 ≤ Var[X] Var[Y].

Proof. If Var[Y] = 0, the statement is trivial. If Var[Y] ≠ 0,

0 ≤ Var[X − (Cov[X, Y]/Var[Y]) Y] Var[Y]
= (Var[X] + (Cov[X, Y]/Var[Y])^2 Var[Y] − 2 (Cov[X, Y]/Var[Y]) Cov[X, Y]) Var[Y]
= Var[X] Var[Y] − Cov[X, Y]^2.

Our choice of Cov[X, Y]/Var[Y] in the above is not arbitrary. The quantity

ρ = Cov[X, Y]/√(Var[X] Var[Y])

gives a measure of the correlation of X and Y, and is sometimes called the correlation coefficient. It shows how linearly dependent X − E[X] and Y − E[Y] are. The Cauchy-Schwarz Inequality applied to X − E[X] and Y − E[Y] implies that −1 ≤ ρ ≤ 1. Now, let us try to estimate Y − E[Y] by a linear combination c(X − E[X]) + d. We shall try to minimize

E[(Y − E[Y] − c(X − E[X]) − d)^2] = Var[Y] + c^2 Var[X] − 2c Cov[X, Y] + d^2 = Var[Y] + c^2 Var[X] − 2cρ√(Var[X] Var[Y]) + d^2.

We must clearly take d = 0. The minimum of the resulting expression is attained at c = ρ√(Var[Y]/Var[X]), and the minimum expectation is Var[Y](1 − ρ^2). The closer |ρ| is to 1, the more linearly dependent Y − E[Y] and X − E[X] are. Here, by linearly dependent, we mean that there exist a_1, a_2 not both 0 such that a_1(X − E[X]) + a_2(Y − E[Y]) = 0 almost surely. Further, |ρ| = 1 if and only if X − E[X] and Y − E[Y] are linearly dependent.

It follows that equality holds in the Cauchy-Schwarz inequality if and only if there exist a, b, c ∈ R, not all 0, such that aX + bY + c = 0 almost surely.

Theorem 6.8 (Blackwell-Girshick Equation). Let T, X_1, X_2, . . . ∈ L^2(P) be independent random variables such that X_1, X_2, . . . are identically distributed, and let T take values only in N_0. Define S_T = ∑_{i=1}^T X_i. Then

Var[S_T] = E[X_1]^2 Var[T] + E[T] Var[X_1].

Proof. For each n ∈ N, let S_n = ∑_{i=1}^n X_i. Then, writing S_T as we did in the proof of Wald's Identity,

E[S_T^2] = E[∑_{n=0}^∞ 1_{{T=n}} S_n^2]
= ∑_{n=0}^∞ E[1_{{T=n}}] · E[S_n^2]
= ∑_{n=0}^∞ Pr[T = n](Var[S_n] + E[S_n]^2)
= ∑_{n=0}^∞ Pr[T = n](n Var[X_1] + n^2 E[X_1]^2)
= E[T] Var[X_1] + E[T^2] E[X_1]^2.

Now, Wald's identity implies

Var[S_T] = E[T] Var[X_1] + E[T^2] E[X_1]^2 − (E[T]E[X_1])^2 = E[T] Var[X_1] + Var[T] E[X_1]^2.

We now state the expectations and variances of some of the probability distributions that we mentioned earlier in section 2.2.

1. Let p ∈ [0, 1] and X ∼ Ber_p. Then E[X] = p and Var[X] = p(1 − p).

2. Let p ∈ [0, 1], n ∈ N and X ∼ b_{n,p}. Then E[X] = np and Var[X] = np(1 − p). This can easily be shown by noting that X is equal in distribution to the sum of n i.i.d. random variables, each of which has distribution Ber_p.

3. Let µ ∈ R, σ^2 > 0 and X ∼ N_{µ,σ^2}. Then E[X] = µ and Var[X] = σ^2.

4. Let θ > 0 and X ∼ exp_θ. Then E[X] = θ^{−1} and Var[X] = θ^{−2}.

6.2. The Weak Law of Large Numbers

We earlier mentioned that variance gives a measure of how much a random variable deviates from the expectation. This is made rigorous by the Chebyshev inequality.

Theorem 6.9 (Markov Inequality). Let X be a real random variable and f : [0, ∞) → [0, ∞) be monotone increasing. Then for any ε > 0 with f(ε) > 0,

Pr[|X| ≥ ε] ≤ E[f(|X|)]/f(ε).


Proof. We have

E[f(|X|)] ≥ E[f(|X|) 1_{{f(|X|)≥f(ε)}}] ≥ E[f(ε) 1_{{f(|X|)≥f(ε)}}] ≥ f(ε) Pr[|X| ≥ ε].

The Markov inequality forms a basis for most probability-related inequalities we encounter.

Corollary 6.10 (Chebyshev Inequality). Let X ∈ L^2(P) and ε > 0. Then

Pr[|X − E[X]| ≥ ε] ≤ ε^{−2} Var[X].

Proof. Using the Markov inequality with f : x ↦ x^2 and X − E[X] in place of X gives the required result.

This shows that the variance quantifies how much a random variable deviates from its expectation.

Definition 6.2 (Convergence of Random Variables). Let (Xn)n∈N be a sequence of random variables.

(i) (X_n)_{n∈N} is said to converge in probability towards a random variable X if for all ε > 0,

lim_{n→∞} Pr[|X_n − X| > ε] = 0.

(ii) (X_n)_{n∈N} is said to converge almost surely towards a random variable X if

Pr[lim_{n→∞} X_n = X] = 1.

To see that these two notions are not the same, consider a sequence of independent random variables (X_n)_{n∈N} where X_n takes the value 1 with probability 1/n and 0 otherwise. Then (X_n)_{n∈N} converges in probability to the zero random variable, but (by the Borel-Cantelli Lemma) does not converge almost surely.

Given a random variable X, we expect the arithmetic mean of a large number of independent observations of X to be close to E[X] (assuming, of course, that its variance is finite). Making this property more general, we define the following.

Definition 6.3 (Laws of Large Numbers). Let (X_n)_{n∈N} be a sequence of real random variables in L^1(P) and let S_n = ∑_{i=1}^n (X_i − E[X_i]).

(i) We say that (X_n)_{n∈N} satisfies the weak law of large numbers if for any ε > 0,

lim_{n→∞} Pr[|S_n/n| < ε] = 1.

(ii) We say that (X_n)_{n∈N} satisfies the strong law of large numbers if

Pr[lim_{n→∞} |S_n/n| = 0] = 1.

We sometimes abbreviate "law of large numbers" to LLN. The weak law is equivalent to saying that (S_n/n)_{n∈N} converges in probability to the zero random variable, and the strong law is equivalent to saying that (S_n/n)_{n∈N} converges almost surely to the zero random variable.

The weak law says that for any nonzero (specified) margin, the average of a sufficiently large number of observations will be within the margin of the expectation with high probability.


On the other hand, the strong law says that almost surely, the sequence of sample means converges to the expectation.

Note that as a consequence, if (X_n)_{n∈N} satisfies the strong law of large numbers, it must also satisfy the weak law of large numbers.

Theorem 6.11. Let X_1, X_2, . . . ∈ L^2(P) be uncorrelated random variables with V = sup_{n∈N} Var[X_n] < ∞. Then (X_n)_{n∈N} satisfies the weak law of large numbers. Further, for any ε > 0,

Pr[|S_n/n| ≥ ε] ≤ V/(ε^2 n) for all n ∈ N.

Proof. Without loss of generality, assume E[X_i] = 0 for all i ∈ N. Then the Bienaymé formula implies that

Var[S_n/n] = (1/n^2) ∑_{i=1}^n Var[X_i] ≤ V/n.

This shows how, the more random variables we add, the more the average "concentrates" around the mean 0. Now, by Chebyshev's inequality,

Pr[|S_n/n| ≥ ε] ≤ V/(ε^2 n).

Taking the limit as n → ∞ gives that (X_n)_{n∈N} satisfies the weak LLN.

In general, if we wish to show the existence of an object that has some properties, we give a constructive proof. Equipped with probability, we now have a non-constructive method to prove the existence of something, commonly known as the probabilistic method of proof. We do this by considering some distribution over all objects, and showing that the probability of a randomly selected object having the required properties is non-zero, thus implying the existence of the required object. An example of this is the following.

Corollary 6.12 (Weierstrass’ Approximation Theorem). Let f : [0, 1] → R be a continuous real map. Then there exist polynomials f_n of degree at most n (n ∈ N) such that

‖f_n − f‖_∞ → 0 as n → ∞.

Proof. For n ∈ N, define f_n : [0, 1] → R, known as the Bernstein polynomial of order n, by

f_n(x) = ∑_{k=0}^n f(k/n) (n choose k) x^k (1 − x)^{n−k}.

As f is continuous on the compact interval [0, 1], it is uniformly continuous. Fixing some ε > 0, there exists δ > 0 such that

|f(x) − f(y)| < ε for all x, y such that |x − y| < δ.

Now, let p ∈ [0, 1], let X_1, X_2, . . . be i.i.d. random variables with distribution Ber_p, and let S_n = ∑_{i=1}^n X_i ∼ b_{n,p}. Note that

E[f(S_n/n)] = f_n(p).

Uniform continuity implies that

|f(S_n/n) − f(p)| ≤ ε + 2‖f‖_∞ 1_{{|S_n/n−p|≥δ}}.

Therefore, by Chebyshev's inequality,

|f_n(p) − f(p)| ≤ E[|f(S_n/n) − f(p)|] ≤ ε + 2‖f‖_∞ Pr[|S_n/n − p| ≥ δ] ≤ ε + 2‖f‖_∞ · p(1 − p)/(nδ^2) ≤ ε + ‖f‖_∞/(2nδ^2).

The result follows.
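The proof is constructive enough to implement directly. A Python sketch (ours; the target function and grid are arbitrary illustrative choices), approximating f(x) = |x − 1/2| by its Bernstein polynomials:

import math

f = lambda x: abs(x - 0.5)

def bernstein(f, n, x):
    # f_n(x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k), as in the proof above.
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))

grid = [i / 100 for i in range(101)]
for n in (10, 50, 200):
    err = max(abs(bernstein(f, n, x) - f(x)) for x in grid)
    print(n, round(err, 4))  # the sup-norm error on the grid shrinks as n grows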


Corollary 6.13. Let n ∈ N and p_1, . . . , p_n ∈ [0, 1]. Let X_1, . . . , X_n be independent random variables such that X_i ∼ Ber_{p_i} for each i ∈ [n]. Define S_n = ∑_{i=1}^n X_i and m = E[S_n]. Then for any δ > 0,

Pr[S_n ≥ (1 + δ)m] ≤ (e^δ/(1 + δ)^{1+δ})^m

and

Pr[S_n ≤ (1 − δ)m] ≤ exp(−δ^2 m/2).

Proof. For some λ > 0, consider f : [0, ∞) → [0, ∞) given by f(x) = e^{λx}. By the Markov Inequality,

Pr[S_n ≥ (1 + δ)m] ≤ E[e^{λS_n}]/e^{λm(1+δ)} = e^{−λm(1+δ)} ∏_{i=1}^n E[e^{λX_i}].

Now, for each i ∈ [n],

E[e^{λX_i}] = 1 + p_i(e^λ − 1) ≤ e^{p_i(e^λ − 1)}.

Therefore, since m = p_1 + · · · + p_n,

Pr[S_n ≥ (1 + δ)m] ≤ e^{−λm(1+δ)} ∏_{i=1}^n e^{p_i(e^λ−1)} = e^{−λm(1+δ)} e^{m(e^λ−1)}.

We can now optimize over λ to find the minimum of the expression on the right. Setting λ = ln(1 + δ), we obtain

Pr[S_n ≥ (1 + δ)m] ≤ (e^δ/(1 + δ)^{1+δ})^m.

The second bound can be obtained similarly, by optimizing the following inequality over λ > 0:

Pr[S_n ≤ (1 − δ)m] = Pr[e^{−λS_n} ≥ e^{−λm(1−δ)}] ≤ E[e^{−λS_n}]/e^{−λm(1−δ)}.

More generally, if X is the sum of n independent random variables X_1, . . . , X_n, then

Pr[X ≥ a] ≤ min_{t>0} e^{−ta} ∏_i E[e^{tX_i}]

and

Pr[X ≤ a] ≤ min_{t>0} e^{ta} ∏_i E[e^{−tX_i}].

The above inequalities are known as the Chernoff Bounds.
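It is instructive to compare such a bound with the exact tail it controls. A Python sketch (ours; the parameters are arbitrary, and the log-space pmf is used only to avoid over/underflow) for the upper-tail bound of corollary 6.13 with S_n ∼ b_{n,p}:

import math

n, p, delta = 1000, 0.1, 0.5
m = n * p  # E[S_n]

def binom_pmf(k):
    return math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))

exact = sum(binom_pmf(k) for k in range(math.ceil((1 + delta) * m), n + 1))
bound = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** m
print(exact, bound)  # the bound exceeds the exact tail, as it must: valid but not tight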

6.3. The Strong Law of Large Numbers

There are many versions of the Strong Law of Large Numbers; the one we present here is due to Etemadi. Namely, if X_1, X_2, . . . ∈ L^1(P) are pairwise independent and identically distributed, then (X_n)_{n∈N} follows the strong LLN. Before we prove this, we start with some lemmas. Throughout, define µ = E[X_1] and S_n = X_1 + · · · + X_n.


Lemma 6.14. For n ∈ N, let Y_n = X_n 1_{{|X_n|≤n}} and T_n = Y_1 + · · · + Y_n. Then (X_n)_{n∈N} follows the strong law of large numbers if T_n/n → µ almost surely as n → ∞.

Proof. By theorem 5.11,

∑_{n=1}^∞ Pr[|X_n| > n] ≤ ∑_{n=1}^∞ Pr[|X_1| ≥ n] ≤ E[|X_1|] < ∞.

Now, due to the Borel-Cantelli Lemma,

Pr[X_n ≠ Y_n for infinitely many n] = 0.

Thus there almost surely exists some (random) n_0 such that X_n = Y_n for all n ≥ n_0. Therefore, for all n ≥ n_0,

(T_n − S_n)/n = (T_{n_0} − S_{n_0})/n → 0 as n → ∞.

Since T_n/n → µ almost surely, S_n/n → µ almost surely, and our proof is complete.

Lemma 6.15. For all x > 0, 2x ∑_{n>x} n^{−2} ≤ 4.

Proof. For any m ∈ N,

∑_{n=m}^∞ n^{−2} ≤ m^{−2} + ∫_m^∞ t^{−2} dt = m^{−2} + m^{−1} ≤ 2/m.

The result follows by applying this with m the smallest integer exceeding x, so that 2x ∑_{n>x} n^{−2} ≤ 4x/m ≤ 4.

Lemma 6.16. With the above notation,

∑_{n=1}^∞ E[Y_n^2]/n^2 ≤ 4 E[|X_1|].

Proof. By theorem 5.12,

E[Y_n^2] = ∫_0^∞ Pr[Y_n^2 ≥ t] dt.

Simplifying by substituting t = x^2,

E[Y_n^2] = ∫_0^∞ 2x Pr[|Y_n| ≥ x] dx ≤ ∫_0^n 2x Pr[|X_1| ≥ x] dx.

For m ∈ N, define the function f_m by

f_m(x) = (∑_{n=1}^m n^{−2} 1_{{x<n}}) · 2x Pr[|X_1| > x],

and let f_m ↑ f. Note that f ≤ 4 Pr[|X_1| > x] by lemma 6.15. Finally, by the Monotone Convergence Theorem,

∑_{n=1}^∞ E[Y_n^2]/n^2 ≤ ∑_{n=1}^∞ n^{−2} ∫_0^n 2x Pr[|X_1| ≥ x] dx = ∫_0^∞ (∑_{n=1}^∞ n^{−2} 1_{{x<n}}) 2x Pr[|X_1| > x] dx ≤ 4 ∫_0^∞ Pr[|X_1| > x] dx = 4 E[|X_1|].

We may now prove the main result of this subsection.


Theorem 6.17 (Etemadi’s Strong Law of Large Numbers [1]). Let X_1, X_2, . . . ∈ L^1(P) be pairwise independent and identically distributed random variables. If E[|X_1|] < ∞, then (X_n)_{n∈N} satisfies the strong law of large numbers.

Proof. Since $(X_n^+)_{n \in \mathbb{N}}$ and $(X_n^-)_{n \in \mathbb{N}}$ satisfy the conditions of the theorem, it suffices to consider the case where $X_n \ge 0$ for all $n \in \mathbb{N}$. Fix some $\varepsilon > 0$ and $\alpha > 1$, and let $k_n = \lfloor \alpha^n \rfloor$ for each $n$. Then by Chebyshev's Inequality,

\begin{align*}
\sum_{n=1}^{\infty} \Pr\!\left[ \left| \frac{T_{k_n} - \mathbb{E}[T_{k_n}]}{k_n} \right| > \varepsilon \right]
&\le \varepsilon^{-2} \sum_{n=1}^{\infty} \frac{\operatorname{Var}[T_{k_n}]}{k_n^2}
= \varepsilon^{-2} \sum_{n=1}^{\infty} \frac{1}{k_n^2} \sum_{i=1}^{k_n} \operatorname{Var}[Y_i] \\
&= \varepsilon^{-2} \sum_{i=1}^{\infty} \operatorname{Var}[Y_i] \sum_{n : k_n \ge i} \frac{1}{k_n^2}
\le \frac{1}{\varepsilon^2 (1 - \alpha^{-2})} \sum_{i=1}^{\infty} \frac{\operatorname{Var}[Y_i]}{i^2}
\le \frac{1}{\varepsilon^2 (1 - \alpha^{-2})} \sum_{i=1}^{\infty} \frac{\mathbb{E}[Y_i^2]}{i^2}.
\end{align*}
(Here $\operatorname{Var}[T_{k_n}] = \sum_{i=1}^{k_n} \operatorname{Var}[Y_i]$ uses the pairwise independence of the $Y_i$.)

By lemma 6.16, this is finite. Applying the Borel-Cantelli Lemma and letting $\varepsilon \downarrow 0$ along a countable sequence, we get
\[
\lim_{n \to \infty} \frac{T_{k_n} - \mathbb{E}[T_{k_n}]}{k_n} = 0 \text{ almost surely.}
\]

By the Monotone Convergence Theorem, $\mathbb{E}[Y_n] \xrightarrow{n \to \infty} \mathbb{E}[X_1]$. This in turn gives that $\mathbb{E}[T_{k_n}]/k_n \xrightarrow{n \to \infty} \mathbb{E}[X_1]$ and therefore
\[
\lim_{n \to \infty} \frac{T_{k_n}}{k_n} = \mathbb{E}[X_1] \text{ almost surely.}
\]
Since the $X_n$ are nonnegative, $(T_n)_{n \in \mathbb{N}}$ is nondecreasing, so for $k_n \le \ell \le k_{n+1}$,
\[
\frac{k_n}{k_{n+1}} \cdot \frac{T_{k_n}}{k_n} \le \frac{T_\ell}{\ell} \le \frac{k_{n+1}}{k_n} \cdot \frac{T_{k_{n+1}}}{k_{n+1}}.
\]
As $k_{n+1}/k_n \xrightarrow{n \to \infty} \alpha$, letting $\alpha \downarrow 1$ along a countable sequence yields $\lim_{n \to \infty} T_n/n = \mathbb{E}[X_1]$ almost surely. Lemma 6.14 completes the proof.
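Before turning to applications, here is a quick numerical sanity check of the statement, a minimal sketch assuming NumPy (the distribution and sample sizes are our own illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000

# Exponential(1) samples: nonnegative and integrable, with E[X_1] = 1,
# so S_n / n should settle toward 1 almost surely.
X = rng.exponential(scale=1.0, size=N)
running_mean = np.cumsum(X) / np.arange(1, N + 1)

for n in (10, 1_000, 100_000, 1_000_000):
    print(f"n = {n:>9}: S_n / n = {running_mean[n - 1]:.4f}")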

We now give a more substantial application of the above.

Let $f \in \mathcal{L}^1([0, 1])$ be a function $[0, 1] \to \mathbb{R}$, and say we want to evaluate $I = \int_0^1 f(x) \, dx$. Let $X_1, X_2, \ldots$ be independent random variables uniformly distributed in $[0, 1]$, and define
\[
I_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i).
\]

The strong law of large numbers implies that $I_n \xrightarrow{n \to \infty} I$ almost surely. This method of integration is known as Monte Carlo integration. If $f$ is a sufficiently nice function, the usual numerical methods tend to give better estimates. However, if $[0, 1]$ is replaced with $G \subseteq \mathbb{R}^d$ for very large $d$, Monte Carlo integration tends to be useful, since its accuracy does not degrade with the dimension the way grid-based quadrature does. Note that although $I_n$ converges to $I$, the theorem itself gives no estimate of how fast it converges.
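A minimal sketch of the method in a moderately high dimension, assuming NumPy (the integrand is a toy choice of ours with a known closed-form integral, included only so the error can be measured):

import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 1_000_000

# Integrand on [0,1]^d with a known integral:
# f(x) = prod_j cos(x_j), whose integral over the unit cube is (sin 1)^d.
def f(x):
    return np.prod(np.cos(x), axis=1)

X = rng.random((n, d))          # n uniform samples in [0,1]^d
I_n = f(X).mean()               # Monte Carlo estimate
I = np.sin(1.0) ** d            # exact value, for comparison

print(f"estimate: {I_n:.6f}, exact: {I:.6f}, error: {abs(I_n - I):.2e}")

Under a finite-variance assumption, the central limit theorem suggests the error decays on the order of $n^{-1/2}$ regardless of $d$; the strong LLN alone, as noted above, gives no rate.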

6.4. Speed of Convergence in the Strong Law of Large Numbers

We discussed the speed of convergence in the weak LLN in theorem 6.11. In the strong LLN, we get good estimates if we assume the existence of higher moments. We begin with the following result, which can be seen as an analogue of the Chebyshev Inequality.


Theorem 6.18 (Kolmogorov's Inequality). Let $n \in \mathbb{N}$ and $X_1, \ldots, X_n$ be independent random variables with $\mathbb{E}[X_i] = 0$ and $\operatorname{Var}[X_i] < \infty$ for all valid $i$. Let $S_k = X_1 + \cdots + X_k$ for $k \in [n]$. Then for any $t > 0$,
\[
\Pr\!\left[ \max\{S_k : k \in [n]\} \ge t \right] \le \frac{\operatorname{Var}[S_n]}{t^2 + \operatorname{Var}[S_n]}
\quad \text{and} \quad
\Pr\!\left[ \max\{|S_k| : k \in [n]\} \ge t \right] \le t^{-2} \operatorname{Var}[S_n].
\]

The latter inequality in particular is referred to as Kolmogorov's Inequality.

Proof. Let $\tau$ be the first time at which the partial sum exceeds $t$, that is,
\[
\tau = \min\{k \in [n] : S_k \ge t\},
\]
and let $A_k = \{\tau = k\}$ for each $k$. Further, let
\[
A = \left\{ \max\{S_k : k \in [n]\} \ge t \right\} = \biguplus_{k=1}^{n} A_k.
\]

Fix some $c \ge 0$. Note that $(S_k + c)\mathbb{1}_{A_k}$ is $\sigma(X_1, \ldots, X_k)$-measurable and $(S_n - S_k)$ is $\sigma(X_{k+1}, \ldots, X_n)$-measurable. Theorem 3.9 implies that these two random variables are independent, and thus
\[
\mathbb{E}[(S_k + c)\mathbb{1}_{A_k}(S_n - S_k)] = \mathbb{E}[(S_k + c)\mathbb{1}_{A_k}]\,\mathbb{E}[S_n - S_k] = 0.
\]

We now have
\begin{align*}
\operatorname{Var}[S_n] + c^2 = \mathbb{E}[(S_n + c)^2]
&\ge \sum_{k \in [n]} \mathbb{E}[(S_n + c)^2 \mathbb{1}_{A_k}] \\
&= \sum_{k \in [n]} \mathbb{E}\!\left[ \left( (S_k + c)^2 + 2(S_k + c)(S_n - S_k) + (S_n - S_k)^2 \right) \mathbb{1}_{A_k} \right] \\
&= \sum_{k \in [n]} \mathbb{E}\!\left[ \left( (S_k + c)^2 + (S_n - S_k)^2 \right) \mathbb{1}_{A_k} \right] \\
&\ge \sum_{k \in [n]} \mathbb{E}[(S_k + c)^2 \mathbb{1}_{A_k}] \\
&\ge \mathbb{E}[(t + c)^2 \mathbb{1}_A] = (t + c)^2 \Pr[A].
\end{align*}

That is,
\[
\Pr[A] \le \frac{\operatorname{Var}[S_n] + c^2}{(t + c)^2}.
\]
Setting $c = \operatorname{Var}[S_n]/t$ gives the first inequality.

For the second inequality, let
\[
\tau = \min\{k \in [n] : |S_k| \ge t\},
\]
$A_k = \{\tau = k\}$, and $A = \{\tau \le n\}$. Repeating the steps of the first inequality, but with $c = 0$, gives the required result.
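A quick simulation check of the second inequality, a minimal sketch assuming NumPy (the walk and the threshold are our own illustrative choices):

import numpy as np

rng = np.random.default_rng(3)
n, t, trials = 100, 15.0, 50_000

# Independent centered steps with Var = 1, so Var[S_n] = n.
X = rng.choice([-1.0, 1.0], size=(trials, n))
S = np.cumsum(X, axis=1)        # partial sums S_1, ..., S_n per trial

max_abs = np.abs(S).max(axis=1)
empirical = np.mean(max_abs >= t)
bound = n / t**2                # t^{-2} Var[S_n]

print(f"Pr[max|S_k| >= t] ~ {empirical:.4f}, Kolmogorov bound: {bound:.4f}")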

Now that we have Kolmogorov’s inequality, we may establish some more results.

Theorem 6.19. Let $X_1, X_2, \ldots$ be independent random variables with $\mathbb{E}[X_n] = 0$ for all $n \in \mathbb{N}$ and $V = \sup_{n \in \mathbb{N}} \operatorname{Var}[X_n] < \infty$. Then for any $\varepsilon > 0$,
\[
\limsup_{n \to \infty} \frac{|S_n|}{n^{1/2} (\log n)^{(1/2) + \varepsilon}} = 0 \text{ almost surely.}
\]


Proof. Let $\ell(n) = n^{1/2} (\log n)^{(1/2) + \varepsilon}$ and $k_n = 2^n$ for $n \in \mathbb{N}$. For sufficiently large $n$ and $k \in \mathbb{N}$ such that $k_{n-1} \le k \le k_n$, we have $|S_k|/\ell(k) \le 2|S_k|/\ell(k_n)$ (Why?). It then suffices to show that for any $\delta > 0$,
\[
\limsup_{n \to \infty} \frac{\max\{|S_k| : k \le k_n\}}{\ell(k_n)} \le \delta \text{ almost surely.}
\]

For each $n \in \mathbb{N}$, define $A_{n,\delta} = \left\{ \max\{|S_k| : k \le k_n\} > \delta \ell(k_n) \right\}$. Then by Kolmogorov's Inequality,
\[
\sum_{n=1}^{\infty} \Pr[A_{n,\delta}] \le \sum_{n=1}^{\infty} V k_n \left( \delta \ell(k_n) \right)^{-2}
= \frac{V}{\delta^2 (\log 2)^{1 + 2\varepsilon}} \sum_{n=1}^{\infty} \frac{1}{n^{1 + 2\varepsilon}} < \infty.
\]

Applying the Borel-Cantelli Lemma to $(A_{n,\delta})_{n \in \mathbb{N}}$ gives $\Pr\left[ \limsup_{n \to \infty} A_{n,\delta} \right] = 0$; letting $\delta \downarrow 0$ along a countable sequence then yields the result.
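For convenience, here is the one-line computation behind the middle equality above (with $k_n = 2^n$):
\[
\ell(k_n)^2 = k_n \left( \log k_n \right)^{1 + 2\varepsilon} = 2^n (n \log 2)^{1 + 2\varepsilon},
\qquad \text{so} \qquad
\frac{V k_n}{\delta^2 \ell(k_n)^2} = \frac{V}{\delta^2 (\log 2)^{1 + 2\varepsilon}\, n^{1 + 2\varepsilon}}.
\]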

Readers who are familiar with asymptotic notation (from complexity theory, say) can think of this as just saying that, almost surely, $|S_n| = o\!\left( n^{1/2} (\log n)^{(1/2) + \varepsilon} \right)$.

A stronger result that we do not prove here, the Law of the Iterated Logarithm, states that for iid square-integrable centered random variables $X_1, X_2, \ldots$,
\[
\limsup_{n \to \infty} \frac{|S_n|}{\sqrt{2n \operatorname{Var}[X_1] \log(\log n)}} = 1 \text{ almost surely.}
\]
This result is significantly stronger: it says that, almost surely, $|S_n|$ is $O\!\left( \sqrt{2n \operatorname{Var}[X_1] \log(\log n)} \right)$ and comes arbitrarily close to this envelope infinitely often.


References

[1] N. Etemadi. An elementary proof of the strong law of large numbers. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 55(1):119–122, 1981. doi:10.1007/BF01013465.

[2] Achim Klenke. Probability Theory: A Comprehensive Course. Springer, 2nd edition, 2006.