
Advanced Probability and Statistics

MAFS5020, HKUST

Kani Chen (Instructor)


PART I

PROBABILITY THEORY


Chapter 0. Review of classical probability calculations through examples.

We first review some basic concepts about probability spaces through examples. Then we summarize the structure of a probability space and present the axioms and theory.

Example 0.1. De Mere’s Problem. (for the purpose of reviewing the discrete probability space.)

The Chevalier de Mere was a French nobleman and gambler of the 17th century. He was puzzled by the apparent equality of the probabilities of two events: at least one ace turns up in four rolls of a die, and at least one double-ace turns up in 24 rolls of two dice. His reasoning:

In one roll of a die there is a 1/6 chance of getting an ace, so in 4 rolls there is a 4 × (1/6) = 2/3 chance of getting at least one ace.

In one roll of two dice there is a 1/36 chance of getting a double-ace, so in 24 rolls there is a 24 × (1/36) = 2/3 chance of getting at least one double-ace.

De Mere turned to Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665) for help, and the two mathematicians/physicists gave the right answer:

P(at least one ace in 4 rolls of a die) = 1 − (1 − 1/6)^4 = 0.518.

Four rolls make a favorable bet; three rolls do not.

P(at least one double-ace in 24 rolls of two dice) = 1 − (1 − 1/36)^24 = 0.491.

Twenty-five rolls make a favorable bet; 24 rolls still make an unfavorable bet.
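As a quick numerical check (our own sketch, not part of the original notes), the two probabilities follow directly from the complement rule:

```python
# Sketch: exact probabilities in De Mere's problem via the complement rule.
p_one_die = 1 - (1 - 1/6) ** 4       # at least one ace in 4 rolls of one die
p_two_dice = 1 - (1 - 1/36) ** 24    # at least one double-ace in 24 rolls of two dice
print(round(p_one_die, 3), round(p_two_dice, 3))   # 0.518 0.491
```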

Historical remark: (Cardano's mistake) Probability theory has an infamous birthplace: the gambling room. The earliest publication on probability dates back to "Liber de Ludo Aleae" (The Book on Games of Chance) by Gerolamo Cardano (1501-1576), an Italian mathematician/physician/astrologer/gambler of the time. Cardano made the important discovery of the product law for independent events, and it is believed that the word "probability" was first coined and used by Cardano. Even though he made several serious mistakes that seem elementary nowadays, he is considered a pioneer who first systematically computed probabilities of events. Pascal, Fermat, Bernoulli and de Moivre are among the many other prominent developers of probability theory. The axioms of probability were first formally and rigorously formulated by Kolmogorov in the last century.

Exercise 0.1'. Galileo's problem. Galileo (1564-1642), the famous physicist and astronomer, was also involved in a calculation of probabilities of a similar nature. Italian gamblers used to bet on the total number of spots in a roll of three dice. The question: is the chance of 9 total dots the same as that of 10 total dots? There are altogether 6 combinations giving a total of 9 dots (126, 135, 144, 234, 225, 333) and 6 combinations giving a total of 10 dots (145, 136, 226, 235, 244, 334). This can give the false impression that the two chances are equal. Galileo gave the correct answer: 25/216 for 9 and 27/216 for 10. The key point is to lay out all 6^3 = 216 outcomes/elements of the probability space and realize that each of these 216 outcomes has the same chance 1/216. The chance of an event is then the sum of the probabilities of the outcomes in the event.
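Galileo's counts can be verified by brute-force enumeration of the 216 equally likely outcomes (a minimal sketch of ours):

```python
from itertools import product

# Enumerate all 6^3 = 216 equally likely outcomes of three dice.
totals = [a + b + c for a, b, c in product(range(1, 7), repeat=3)]
print(totals.count(9), totals.count(10))   # 25 27, i.e., probabilities 25/216 and 27/216
```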

Example 0.2. The St. Petersburg Paradox. (for the purpose of reviewing the concept of expectation)

A gambler pays an entry fee of M dollars to play the following game: a fair coin is tossed repeatedly until the first head occurs, and the gambler wins 2^{n−1} dollars, where n is the total number of tosses. Question: what is the "fair" amount of M?

Here n is a random number with P(n = k) = 2^{-k} for k = 1, 2, .... Therefore the "expected winning" is

E(2^{n−1}) = ∑_{k=1}^∞ 2^{k−1} × (1/2^k) = ∑_{k=1}^∞ 1/2 = ∞.


Notice that here we have used the expectation of a function of a random variable. It appears that a "fair", but indeed naive, M should be ∞. However, by common sense, this game, despite its infinite expected payoff, should not be worth an infinite entry fee.

Daniel Bernoulli (1700-1782), a Dutch-born Swiss mathematician, provided one solution in 1738. In his own words: "The determination of the value of an item must not be based on the price, but rather on the utility it yields. There is no doubt that a gain of one thousand ducats is more significant to the pauper than to a rich man, though both gain the same amount." Using a utility function, e.g., as suggested by Bernoulli himself, the logarithmic function u(x) = log(x) (known as log utility), the expected utility of the payoff (for simplicity assuming an initial wealth of zero) becomes finite:

E(U) = ∑_{k=1}^∞ u(2^{k−1}) P(n = k) = ∑_{k=1}^∞ log(2^{k−1})/2^k = log(2) = u(2) < ∞.

(This particular utility function suggests that the game is worth about as much as 2 dollars.)
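Both the divergent expected payoff and the finite expected log-utility can be checked numerically; the sketch below (our own illustration) truncates the sums at a large K:

```python
import math

# Partial sums of the expected payoff and of the expected log-utility.
K = 60
expected_payoff = sum(2 ** (k - 1) * 2 ** (-k) for k in range(1, K + 1))          # = K/2, grows without bound
expected_log_utility = sum(math.log(2 ** (k - 1)) * 2 ** (-k) for k in range(1, K + 1))
print(expected_payoff)                       # 30.0 for K = 60; diverges as K grows
print(expected_log_utility, math.log(2))     # both approximately 0.6931 = log(2)
```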

Before Bernoulli's publication of 1738, another Swiss mathematician, Gabriel Cramer, had already found parts of this idea (also motivated by the St. Petersburg Paradox), stating that "the mathematicians estimate money in proportion to its quantity, and men of good sense in proportion to the usage that they may make of it."

Example 0.3. The dice game called “craps”. (This is to review conditional probability).

Two dice are rolled repeatedly; let the total number of dots on the n-th roll be Z_n. If Z_1 is 2, 3 or 12, it is an immediate loss of the game. If Z_1 is 7 or 11, it is an immediate win. Otherwise, the two dice are rolled until either Z_1 occurs again, meaning a win, or 7 occurs, meaning a loss. What is the chance of winning this game?

Solution. Write

P(Win) = ∑_{k=2}^{12} P(Win and Z_1 = k) = ∑_{k=2}^{12} P(Win | Z_1 = k) P(Z_1 = k)

= P(Z_1 = 7) + P(Z_1 = 11) + ∑_{k=4,5,6,8,9,10} P(Win | Z_1 = k) P(Z_1 = k)

= 6/36 + 2/36 + ∑_{k=4,5,6,8,9,10} P(A_k | Z_1 = k) P(Z_1 = k)

= 2/9 + ∑_{k=4,5,6,8,9,10} P(A_k) P(Z_1 = k),

where A_k is the event that, starting from the second roll, k dots occur before 7 dots (an event independent of Z_1). Now,

P(A_k) = ∑_{j=2}^{12} P(A_k ∩ {Z_2 = j}) = ∑_{j=2}^{12} P(A_k | Z_2 = j) P(Z_2 = j)

= P(Z_2 = k) + ∑_{j ≠ k, j ≠ 7} P(starting from the 3rd roll, k occurs before 7) P(Z_2 = j)

= P(Z_2 = k) + P(A_k)(1 − P(Z_2 = k) − P(Z_2 = 7)).

As a result, P(A_k) = P(Z_1 = k)/[P(Z_1 = k) + P(Z_1 = 7)].

And,

P(Win) = 2/9 + ∑_{k=4,5,6,8,9,10} P(Z_1 = k)^2/[P(Z_1 = k) + P(Z_1 = 7)] = 0.492929.
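The figure 0.4929 = 244/495 can be reproduced exactly; the snippet below is a minimal sketch of ours using exact rational arithmetic:

```python
from fractions import Fraction

# Exact win probability of craps: immediate win on 7 or 11, loss on 2, 3, 12;
# otherwise win if the point Z1 recurs before a 7.
p = {k: Fraction(6 - abs(k - 7), 36) for k in range(2, 13)}   # distribution of the sum of two dice
p_win = p[7] + p[11] + sum(p[k] ** 2 / (p[k] + p[7]) for k in (4, 5, 6, 8, 9, 10))
print(p_win, float(p_win))   # 244/495, i.e. 0.49292929...
```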


Example 0.4. Jailer’s reasoning. (Bayes Probabilities)

Three men, A, B and C, are in jail; one is to be executed and the other two to be freed. C, being anxious, asked the jailer to tell him which of A and B would be freed. The jailer, pondering for a while, answered: "For your own interest, I will not tell you, because, if I do, your chance of being executed would rise from 1/3 to 1/2." What is wrong with the jailer's reasoning?

Solution. Let AF (BF) be the event that the jailer says A (B) is to be freed. Let AE, BE, CE be the events that A, B, C, respectively, is to be executed. Then P(CE) = 1/3, but, by the Bayes formula,

P(CE | AF) = P(AF | CE) P(CE)/[P(AF | AE) P(AE) + P(AF | BE) P(BE) + P(AF | CE) P(CE)]

= (0.5 × 1/3)/(0 × 1/3 + 1 × 1/3 + 0.5 × 1/3)

= 1/3 = P(CE).

Likewise P (CE|BF ) = P (CE). So the “rise of probability” is false.
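A small simulation (our own sketch, assuming as above that the jailer answers at random when C is the one to be executed) confirms that P(CE | AF) stays at 1/3:

```python
import random

random.seed(0)
trials, says_A_freed, says_A_freed_and_C_executed = 100000, 0, 0
for _ in range(trials):
    executed = random.choice("ABC")
    # The jailer names one of A, B who will be freed; if C is executed he picks at random.
    if executed == "A":
        answer = "B"
    elif executed == "B":
        answer = "A"
    else:
        answer = random.choice("AB")
    if answer == "A":
        says_A_freed += 1
        says_A_freed_and_C_executed += executed == "C"
print(says_A_freed_and_C_executed / says_A_freed)   # approximately 1/3
```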

Example 0.5. Buffon’s needle (Continuous random variables).

Randomly drop a needle of length 1 cm onto a surface ruled with many parallel straight lines that are 1 cm apart. What is the chance that the needle touches one of the lines?

Solution. Let x be the distance from the center of the needle to the nearest line. Let θ be the smaller angle between the needle and the lines. Then the needle crosses a line if and only if x/sin(θ) ≤ 0.5.

It follows from the randomness of the drop that θ and x are independent and uniformly distributed on [0, π/2] and [0, 1/2], respectively. Therefore,

P(x/sin(θ) ≤ 0.5) = (4/π) ∫_0^{π/2} ∫_0^{sin(θ)/2} dx dθ = (2/π) ∫_0^{π/2} sin(θ) dθ = 2/π.

The chance is 2/π.
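A Monte Carlo sketch (ours, assuming the uniform/independence structure above) recovers 2/π ≈ 0.6366:

```python
import math
import random

random.seed(0)
n = 1_000_000
# x <= sin(theta)/2 is exactly the crossing condition derived above.
hits = sum(random.uniform(0, 0.5) <= 0.5 * math.sin(random.uniform(0, math.pi / 2)) for _ in range(n))
print(hits / n, 2 / math.pi)   # both approximately 0.6366
```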


Chapter 1. σ-algebra, measure, probability space and random variables.

This chapter lays the necessary rigorous foundation for probability as a mathematical theory. It begins with sets, relations among sets, measurement of sets, and functions defined on the sets.

Example 1.1. (A prototype of a probability space.) Drop a needle blindly on the interval [0, 1]. The needle hits the interval [a, b], a sub-interval of [0, 1], with chance b − a. Suppose A is any subset of [0, 1]. What is the chance, or length, of A?

Here we interpret the largest set Ω = [0, 1] as the "universe". Note that not all subsets are "nice" in the sense that their volume/length can be properly assigned. So we first focus our attention on a certain class of "nice" subsets.

To begin with, the "basic" subsets are all the sub-intervals of [0, 1], denoted [a, b] with 0 ≤ a ≤ b ≤ 1. Denote by B the collection of all subsets of [0, 1] that are generated by the basic sets after finitely many set operations. B is called an algebra of Ω.

It can be proved that any set in B is a finite union of disjoint intervals (closed, open or half-closed).

Still, B is not rich enough. For example, it does not contain the set of all rational numbers. More importantly, limits of sets in B are often not in B. This is a serious restriction for mathematical analysis.

Let A be the collection of all subsets of [0, 1] that are generated by the "basic" sets after countably many set operations. A is called the Borel σ-algebra of Ω. Sets in A are called Borel sets. Limits of sets in A are still in A. (Ω, A) is a measurable space.

Borel measure: any set A in A can be assigned a volume, denoted µ(A), such that

(i). µ([a, b]) = b − a;
(ii). µ(A) = lim µ(A_n) for any sequence of Borel sets A_n ↑ A.

Lebesgue measure (1901): the completion of the Borel σ-algebra, obtained by adding all subsets of sets of Borel measure 0; the completed σ-algebra is denoted F. Sets with measure 0 are called null sets.

Why should Borel measure or Lebesgue measure exist in general?

Caratheodory's extension theorem: a (σ-finite) measure on an algebra B extends to the σ-algebra A = σ(B).

Ω = [0, 1] (the universe).

B: an algebra (finite set operations) generated by subintervals.

A: the Borel σ-algebra, is a σ-algebra, generated by subintervals.

F : completion of A, a σ-algebra, generated by A and null sets.

(Ω,B, µ) does not form a probability space,

(Ω,A, µ) forms a probability space.

(Ω,F , µ) forms a probability space.

Sets and set operations:

Consider Ω as the "universe" (beyond which there is nothing). Write Ω = {ω}; ω denotes a member of the set, called an element. Let A and B be two subsets of Ω, called "events".

The set operations are:

intersection: ∩, A ∩B: both A and B (happens).

union: ∪, A ∪B: either A or B (happens).

complement: Ac = Ω \A: everything except for A, or A does not happen.


minus: A \B = A ∩Bc: A but not B.

An elementary theorem about set operation is

De Morgan's identities:

(∪_{j=1}^∞ A_j)^c = ∩_{j=1}^∞ A_j^c, (∩_{j=1}^∞ A_j)^c = ∪_{j=1}^∞ A_j^c.

In particular, (A ∪B)c = (Ac ∩Bc), i.e., (A ∩B)c = (Ac ∪Bc).

Remark. Intersection can be generated by complement and union; and union can be generated by complement and intersection.

Relation: A ⊂ B, if ω ∈ A ensures ω ∈ B.

A sequence of sets {A_n : n ≥ 1} is called increasing (decreasing) if A_n ⊂ A_{n+1} (A_n ⊃ A_{n+1}).

A = B if and only if A ⊂ B and B ⊂ A.

Indicator functions. (A very useful tool to translate set operation into numerical operation)

The relations and operations of sets are equivalent to those of their indicator functions. For any subset A ⊂ Ω, define its indicator function as

1_A(ω) = 1 if ω ∈ A, and 0 otherwise.

The indicator function is a function defined on Ω.

Set operations vs. function operations:

A ⊂ B ⇐⇒ 1A ≤ 1B.

A ∩B ⇐⇒ 1A × 1B = 1A∩B = min(1A, 1B).

Ac = Ω \A ⇐⇒ 1− 1A = 1Ac .

A ∪ B ⇐⇒ 1_{A∪B} = max(1_A, 1_B); in particular, 1_{A∪B} = 1_A + 1_B if A ∩ B = ∅.
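These identities are easy to check numerically by representing events as 0/1 vectors over a finite universe (a small illustrative sketch of ours, not from the notes):

```python
import numpy as np

omega = np.arange(10)                   # a small finite "universe"
ind_A = (omega < 6).astype(int)         # indicator of A = {0,...,5}
ind_B = (omega % 2 == 0).astype(int)    # indicator of B = even numbers
assert np.all(ind_A * ind_B == np.minimum(ind_A, ind_B))           # intersection
assert np.all(np.maximum(ind_A, ind_B) == (ind_A + ind_B >= 1))    # union
assert np.all(1 - ind_A == (omega >= 6).astype(int))               # complement
```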

Set limits.

There are two limits of sets: the upper limit and the lower limit.

lim sup A_n ≡ ∩_{n=1}^∞ ∪_{k=n}^∞ A_k = {A_n occurs infinitely often}, and 1_{lim sup A_n} = lim sup 1_{A_n}.

ω ∈ lim supAn if and only if ω belongs to infinitely many An.

Lower limit.

lim inf A_n ≡ ∪_{n=1}^∞ ∩_{k=n}^∞ A_k = {A_n occurs for all but finitely many n}, and 1_{lim inf A_n} = lim inf 1_{A_n}.

ω ∈ lim inf An if and only if ω belongs to all but finitely many An.

We say the set limit of A1, A2, ... exists if their lower limit is the same as the upper limit.

Algebra and σ-algebra


A is a non-empty collection (set) of subsets of Ω.

Definition. A is called an algebra if

(i). Ac ∈ A if A ∈ A;

(ii). A ∪B ∈ A if A,B ∈ A.

A is called a σ-algebra if, in addition, (ii) is strengthened to

(iii). ∪∞n=1An ∈ A if An ∈ A for n ≥ 1.

An algebra is closed under (finitely many) set operations; Ω ∈ A and ∅ ∈ A.

A σ-algebra is closed under countably many set operations.

(Ω,A) is called a measurable space, if A is a σ-algebra of Ω.

Measure, measure space and probability space.

A, containing ∅, is a non-empty collection (set) of subsets of Ω, and µ is a nonnegative set function on A.

µ is called a measure, if

(i). µ(∅) = 0.

(ii). µ(A) = ∑_{n=1}^∞ µ(A_n) if A = ∪_{n=1}^∞ A_n, where A, A_1, A_2, ... are all in A and A_1, A_2, ... are disjoint.

(Ω,A, µ) is called a measure space, if µ is a measure on A and A is a σ-algebra of Ω.

(Ω,A, P ) is called a probability space if (Ω,A, P ) is a measure space and P (Ω) = 1.

For probability space (Ω,A, P ), Ω is called sample space, every A in A is an event, and P (A) is theprobability of the event, the chance that it happens.

Random variable (r.v.).

Loosely speaking, given a probability space (Ω, F, P), a random variable (r.v.) X is defined as a real-valued function on Ω satisfying a certain measurability condition: viewing X = X(ω) as a mapping from Ω to R, the real line, X^{-1}(B) must be in F for all Borel sets B. (Borel sets on the real line form the σ-algebra generated by the intervals, i.e., the sets generated by countably many operations on intervals.)

A random variable X defined on a probability space (Ω, A, P) is a function on Ω such that X^{-1}(B) ∈ A for every interval B in [−∞, ∞], where X^{-1}(B) = {ω : X(ω) ∈ B}. (We need to be able to identify its probability.)

X−1(B) is called the inverse image of B.

X = X(·) can be viewed as a map or transformation from (Ω, A) to (R, B), where R = [−∞, ∞] and B is the σ-algebra generated by the intervals in R.

X is a measurable map/transformation since X−1(B) ∈ A for every B ∈ B (DIY.)

Because A is a σ-algebra, the upper and lower limits of a sequence of r.v.s X_n are again r.v.s, and the algebraic operations +, −, ×, / of r.v.s are still r.v.s.

Measurable map and random vectors.

f(·) is called a measurable map/transformation/function from a measurable space (Ω, A) to another measurable space (S, S) if f^{-1}(B) ∈ A for every B ∈ S, i.e., {ω : f(ω) ∈ B} ∈ A.

X is called a random vector of dimension p if it is a measurable map from a probability space (Ω, A, P) to (R^p, B^p), where B^p is the Borel σ-algebra of the p-dimensional space R^p = [−∞, ∞]^p.

Proposition 1.1 If X = (X_1, ..., X_p) is a random vector of dimension p on a probability space (Ω, A, P), and f(·) is a measurable function from (R^p, B^p) to (R, B), then f(X) is a random variable.


Proof. For any Borel set B ∈ B,

{ω : f(X(ω)) ∈ B} = {ω : X(ω) ∈ f^{-1}(B)} ∈ A

since f−1(B) ∈ Bp.

Proposition 1.2 If X_1, X_2, ... are r.v.s, then so are

inf_n X_n, sup_n X_n, lim inf_n X_n and lim sup_n X_n.

Proof. Let the probability space be (Ω,A, P ). For any x,

{ω : inf_n X_n(ω) ≥ x} = ∩_n {ω : X_n(ω) ≥ x} ∈ A;

{ω : sup_n X_n(ω) ≤ x} = ∩_n {ω : X_n(ω) ≤ x} ∈ A;

{lim inf_n X_n > x} = ∪_n {inf_{k≥n} X_k > x} ∈ A;

{lim sup_n X_n < x} = ∪_n {sup_{k≥n} X_k < x} ∈ A.

Therefore, inf_n X_n, sup_n X_n, lim inf_n X_n and lim sup_n X_n are r.v.s.

Proposition 1.3 Suppose X is a map from a measurable space (Ω, A) to another measurable space (S, S). If X^{-1}(C) ∈ A for every C ∈ C, where S = σ(C), then X is a measurable map, i.e., X^{-1}(S) ∈ A for every S ∈ S. In particular, when (S, S) = ([−∞, ∞], B), X^{-1}([−∞, x]) ∈ A for every x is enough to ensure that X is a r.v..

Proof. Note that σ(C), the σ-algebra generated by C, is defined as the smallest σ-algebra containing C. Set B* = {B ∈ S : X^{-1}(B) ∈ A}. We first show that B* is a σ-algebra. Observe that

(i). for any B ∈ B∗, X−1(B) ∈ A and, therefore, X−1(Bc) = (X−1(B))c ∈ A;

(ii). for any Bn ∈ B∗, X−1(Bn) ∈ A and X−1(∪nBn) = ∪nX−1(Bn) ∈ A.

Consequently, B∗ is a σ-algebra. Since C ⊂ B∗ ⊂ S, it follows that B∗ = S.


DIY Exercises:

Exercise 1.1 ⋆⋆ Show that 1_{lim inf A_n} = lim inf 1_{A_n}, and prove De Morgan's identities.

Exercise 1.2 ⋆⋆ Show that the so-called "countable additivity" or "σ-additivity" (P(∪_n A_n) = ∑_n P(A_n) for countably many disjoint A_n ∈ A) is equivalent to "finite additivity" plus "continuity" (if A_n ↓ ∅, then P(A_n) → 0).

Exercise 1.3 ⋆⋆⋆ (Completion of a probability space) Let (Ω, F, P) be a probability space. Define

F̄ = {A : P(A \ B) + P(B \ A) = 0 for some B ∈ F},

and for each A ∈ F̄ define P̄(A) = P(B) for the B given above. Prove that (Ω, F̄, P̄) is also a probability space. (Hint: one needs to show that F̄ is a σ-algebra and that P̄ is a probability measure.)

Exercise 1.4 ⋆⋆⋆ If X_1 and X_2 are two r.v.s, so is X_1 + X_2. (Hint: cite Propositions 1.1 and 1.3.)


Chapter 2. Distribution, expectation and inequalities.

Expectation, also called the mean, of a random variable is often referred to as the location or center of the random variable or its distribution. To avoid some non-essential trivialities, unless otherwise stated, the random variables will usually be assumed to take finite values; those taking the values −∞ and ∞ are considered r.v.s in the extended sense.

(i). Distribution.

Recall that, given a probability space (Ω, F, P), a random variable (r.v.) X is defined as a real-valued function on Ω satisfying a certain measurability condition. The cumulative distribution function of X is then

F(t) = P(X ≤ t) = P({ω ∈ Ω : X(ω) ≤ t}) = P(X^{-1}((−∞, t])), t ∈ (−∞, ∞).

F (·) is then a right-continuous function defined on the real line (−∞,∞).

Remark. The distribution function of a single r.v. may be considered a complete profile/description of the r.v.. The distribution function F(·) defines a probability measure on (−∞, ∞). This is the induced measure, induced by the random variable as a map from the probability measure P on (Ω, F, P) to ((−∞, ∞), B, F). In this sense, the original probability space is often left unspecified, or seemingly irrelevant, when dealing with one single random variable.

We call a random variable a discrete random variable if it takes a countable number of values, and a continuous random variable if the chance that it takes any particular value is 0. In statistics, a continuous random variable is often, by default, given a density function. In general, a continuous random variable may not have a density function (with respect to Lebesgue measure); an example is the Cantor distribution.

For two random variables X and Y , their joint c.d.f. is

F_{X,Y}(t, s) = P(X ≤ t and Y ≤ s) = P(X^{-1}((−∞, t]) ∩ Y^{-1}((−∞, s])), t, s ∈ (−∞, ∞).

The joint c.d.f. can be extended to any finite number of variables in a straightforward fashion. If the (joint) c.d.f. is differentiable, its (mixed partial) derivative is called the (joint) density.

(ii). Expectation.

Definitions. For a nonnegative random variable X with c.d.f. F, its expectation is defined as

E(X) ≡ ∫_0^∞ x dF(x).

In general, let X^+ = X 1_{X≥0} and X^- = −X 1_{X≤0}; then

E(X) ≡ E(X^+) − E(X^-).

If E(X^+) = ∞ = E(X^-), E(X) does not exist.

A more fundamental definition of the expectation is through the Lebesgue integral: for nonnegative X,

E(X) ≡ ∫_Ω X(ω) dP(ω), which is formally

lim_{m→∞} ∑_{k=0}^∞ (k/2^m) P(k/2^m < X ≤ (k+1)/2^m).

If X takes the value ∞ with positive probability, then E(X^+) = ∞. Note that X having a finite mean is equivalent to E|X| < ∞, and the mean of X failing to exist is the same as E(X^+) = E(X^-) = ∞.
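The dyadic approximating sums above can be evaluated numerically for a concrete distribution; the sketch below (our own illustration, with hypothetical helper names) approximates E(X) for X ~ Exponential(1), whose mean is 1:

```python
import math

def dyadic_mean(cdf, m, k_max=200000):
    """Approximate E(X) for nonnegative X by sum_k (k/2^m) P(k/2^m < X <= (k+1)/2^m)."""
    h = 2.0 ** (-m)
    return sum(k * h * (cdf((k + 1) * h) - cdf(k * h)) for k in range(k_max))

exp_cdf = lambda x: 1 - math.exp(-x)       # c.d.f. of Exponential(1)
print(dyadic_mean(exp_cdf, 4), dyadic_mean(exp_cdf, 8))   # increases toward 1 as m grows
```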


The expectation defined above is mathematically an integral or summation with respect to the probability measure induced by the random variable. In layman's words, it is the weighted "average" of the values taken by the r.v., weighted by chances which sum up to 1.

Some basic properties of expectation:

(1). E(f(X)) = ∫ f(x) dF(x), where F is the c.d.f. of X.

(2). If P (X ≤ Y ) = 1, then E(X) ≤ E(Y ). If P (X = Y ) = 1 then E(X) = E(Y ).

(3). E(X) is finite if and only if E(|X |) is finite.

(4). (Linearity) E(aX + bY ) = aE(X) + bE(Y ).

(5). If a ≤ X ≤ b, then a ≤ E(X) ≤ b.

(iii). Some typical distributions of random variables.

(1.) Commonly used discrete distributions:

Bernoulli: X ∼ Bin(1, p). P (X = 1) = p = 1− P (X = 0). E(X) = p and var(X) = p(1− p).

Binomial: X ∼ Bin(n, p). X = ∑_{i=1}^n x_i where the x_i are iid Bin(1, p) (the number of successes in n Bernoulli trials).

P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}, k = 0, 1, ..., n.

E(X) = np. var(X) = np(1− p).

Poisson: X ∼ P(λ). E(X) = var(X) = λ.

P(X = k) = (λ^k/k!) e^{−λ}, k = 0, 1, 2, ...

Key fact: B(n, p)→ P(λ) if n→∞, np→ λ. (Law of rare events.)

Geometric: X ∼ G(p): time to the first success in a series of Bernoulli trials.

P(X = k) = (1 − p)^{k−1} p, k = 1, 2, ...

E(X) = 1/p, var(X) = (1 − p)/p^2.

Negative binomial: X ∼ NB(p, r): time to the r-th success in a series of Bernoulli trials. Therefore X = ∑_{j=1}^r ξ_j where the ξ_j are iid ∼ G(p).

P(X = k) = \binom{k−1}{r−1} p^r (1 − p)^{k−r}, k = r, r + 1, ...

E(X) = r/p and var(X) = r(1 − p)/p^2.

Hyper-geometric: X ∼ HG(r, n, m): the number of black balls when r balls are taken without replacement from an urn containing n black balls and m white balls.

P(X = k) = \binom{n}{k} \binom{m}{r−k} / \binom{n+m}{r}, k = 0 ∨ (r − m), ..., r ∧ n.

E(X) = rn/(m + n) and var(X) = rnm(n + m − r)/[(n + m)^2 (n + m − 1)].

(2) Commonly used continuous distributions:

Uniform: X ∼ Unif[a, b]. Density: f(x) = (b − a)^{-1} 1_{x ∈ [a,b]}.


E(X) = (a + b)/2 and var(X) = (b − a)^2/12.

Normal: X ∼ N(µ, σ^2). E(X) = µ and var(X) = σ^2. (Central limit theorem.)

f(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)}, x ∈ (−∞, ∞).

Exponential: X ∼ E(λ). Density:

f(x) = (1/λ) e^{−x/λ}, x > 0.

E(X) = λ and var(X) = λ^2. No memory: (X − t) | X ≥ t ∼ E(λ).

Gamma: Γ(α, γ). Density:

f(x) = (1/(Γ(α) γ^α)) x^{α−1} e^{−x/γ}, x > 0.

E(λ) = Γ(1, λ), χ^2_n = Γ(n/2, 2). A sum of independent Γ(α_i, γ) r.v.s follows Γ(∑_i α_i, γ).

Beta: B(α, β). Density:

f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}, x ∈ [0, 1].

ξ/(ξ + η) ∼ B(α, β), where ξ ∼ Γ(α, γ) and η ∼ Γ(β, γ) are independent. X_{(k)} ∼ B(k, n − k + 1) for the k-th smallest of X_1, ..., X_n iid ∼ Unif[0, 1].

Cauchy: density f(x) = 1/[π(1 + x^2)]. Symmetric about 0, but the expectation and variance do not exist.

χ^2_n (with d.f. n): sum of squares of n i.i.d. standard normal r.v.s. χ^2_2 is E(2).

t_n (with d.f. n): ξ/√(η/n), where ξ ∼ N(0, 1), η ∼ χ^2_n, and ξ and η are independent.

F_{m,n} (with d.f. (m, n)): (ξ/m)/(η/n), where ξ ∼ χ^2_m, η ∼ χ^2_n, and ξ and η are independent.

(iv). Some basic inequalities:

Inequalities are extremely useful tools in the theoretical development of probability theory. For simplicity of notation, we use ‖X‖_p, also called the L_p norm when p ≥ 1, to denote [E(|X|^p)]^{1/p} for a r.v. X. In what follows, X and Y are two random variables.

(1) the Jensen inequality: Suppose ψ(·) is a convex function and X and ψ(X) have finite expectation. Then ψ(E(X)) ≤ E(ψ(X)).

Proof. Convexity implies that for every a there exists a constant c such that ψ(x) − ψ(a) ≥ c(x − a). Let a = E(X) and x = X; the right-hand side then has mean 0, and taking expectations gives Jensen's inequality.

(2). the Markov inequality: For any a > 0, P(|X| ≥ a) ≤ E(|X|)/a.

Proof. aP(|X| ≥ a) = E(a 1_{|X| ≥ a}) ≤ E(|X| 1_{|X| ≥ a}) ≤ E(|X|).

(3). the Chebyshev (Tchebychev) inequality: for a > 0,

P(|X − E(X)| ≥ a) ≤ var(X)/a^2.

Proof. The inequality trivially holds if var(X) = ∞. Assume var(X) < ∞; then E(X) is finite and Y ≡ (X − E(X))^2 is well defined. It follows from the Markov inequality that

P(|X − E(X)| ≥ a) = P(Y ≥ a^2) ≤ E(Y)/a^2 = var(X)/a^2.
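A quick numerical sanity check of the Markov and Chebyshev bounds on a concrete distribution (our own sketch, using an Exponential(1) sample):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # E|X| = 1, var(X) = 1
a = 3.0
print((np.abs(x) >= a).mean(), np.mean(np.abs(x)) / a)         # Markov: P(|X| >= a) <= E|X|/a
print((np.abs(x - x.mean()) >= a).mean(), x.var() / a ** 2)    # Chebyshev: P(|X-EX| >= a) <= var/a^2
```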


(4). the Hölder inequality: for 1/p + 1/q = 1 with p > 1 and q > 1,

E|XY| ≤ ‖X‖_p ‖Y‖_q.

Proof. Observe that for any two nonnegative numbers a and b, ab ≤ a^p/p + b^q/q. (This is a result of the concavity of the log function; please DIY.) Let a = |X|/‖X‖_p and b = |Y|/‖Y‖_q and take expectations on both sides. The Hölder inequality follows.

(5). the Schwarz inequality:

E(|XY|) ≤ [E(X^2) E(Y^2)]^{1/2}.

Proof. A special case of the Holder inequality.

(6). the Minkowski inequality: for p ≥ 1,

‖X + Y ‖p ≤ ‖X‖p + ‖Y ‖p.

Proof. If p = 1, the inequality is trivial. Assume p > 1 and let q = p/(p − 1), so that 1/p + 1/q = 1. By the Hölder inequality,

E[|X| |X + Y|^{p−1}] ≤ ‖X‖_p ‖|X + Y|^{p−1}‖_q = ‖X‖_p (E[|X + Y|^{(p−1)q}])^{1/q} = ‖X‖_p (E[|X + Y|^p])^{(p−1)/p}.

Likewise,

E[|Y| |X + Y|^{p−1}] ≤ ‖Y‖_p (E[|X + Y|^p])^{(p−1)/p}.

Summing up the above two inequalities leads to

E(|X + Y|^p) ≤ (‖X‖_p + ‖Y‖_p)(E[|X + Y|^p])^{(p−1)/p},

and the Minkowski inequality follows.

Remark. Jensen’s inequality is a powerful tool. For example, straightforward applications include

[E(|X|)]^p ≤ E(|X|^p), for p ≥ 1,

which implies

‖X‖_p ≤ ‖X‖_q, for 0 < p < q.

Moreover,

E(log(|X|)) ≤ log(E(|X|)), and, if E(X) exists,

E(e^X) ≥ e^{E(X)}.

These inequalities are all very commonly used. For example, the validity of maximum likelihood estimation essentially rests on the fact that

E log(f_θ(X)/f_{θ_0}(X)) ≤ log E(f_θ(X)/f_{θ_0}(X)) = log(∫ (f_θ(x)/f_{θ_0}(x)) f_{θ_0}(x) dx) = log(∫ f_θ(x) dx) = log(1) = 0,

which is a result of Jensen's inequality. Here f_θ(·) is a parametric family of densities for X, with θ_0 being the true value of θ.

The Markov inequality, despite its simplicity, will be used frequently to control the order of a sequence of random variables, especially when coupled with the technique of truncation. The Chebyshev inequality is powerful: as an example, it directly proves the weak law of large numbers.


The Schwarz inequality shows that covariance is an inner product and, furthermore, that the space of mean-0 r.v.s with finite variances forms a Hilbert space. The Minkowski inequality is the triangle inequality for the L_p norm, without which ‖·‖_p could not be a norm.

DIY Exercises.

Exercise 2.1. ⋆ Suppose X is a r.v. taking values in the rational numbers of [0, 1]; specifically, P(X = q_i) = p_i > 0, where q_1, q_2, ... enumerates the rational numbers in [0, 1]. Show that the c.d.f. of X is continuous at irrational numbers and discontinuous at rational numbers.

Exercise 2.2. ⋆⋆⋆ Show var(X+) ≤ var(X) and var(min(X, c)) ≤ var(X) where c is any constant.

Exercise 2.3. ⋆ ⋆ ⋆ (Generalizing Jensen’s inequality). Suppose g(·) is a convex function and X isa random variable with finite mean. Then, for any constant c,

Eg(X − E(X) + c) ≥ g(c).

Exercise 2.4. ⋆⋆⋆ Lyapunov (Liapounov): Show that the function log E(|X|^p) is a convex function of p on [0, ∞). Or, equivalently, for any 0 < s < m < l, show

E(|X|^m) ≤ [E(|X|^s)]^r [E(|X|^l)]^{1−r},

where r = (l − m)/(l − s). (Hint: use the Hölder inequality on

E(|X|^{λp_1 + (1−λ)p_2}) ≤ [E(|X|^{p_1})]^λ [E(|X|^{p_2})]^{1−λ}

for positive p_1, p_2 and 0 < λ < 1.)


Chapter 3. Convergence

Unlike the convergence of a sequence of numbers, the convergence of a sequence of r.v.s has at least four commonly used modes: almost sure convergence, convergence in probability, L_p convergence and convergence in distribution. The first is sometimes called convergence almost everywhere or almost certain convergence, and the last convergence in law.

(i). Definitions

In what follows, we give definitions. Suppose X1, X2, ... are a sequence of r.v.s.

X_n → X almost surely (a.s.) if P({ω : X_n(ω) → X(ω)}) = P(X_n → X) = 1. Namely, a.s. convergence is point-wise convergence "everywhere" except on a null set.

Xn → X in probability, if P (|Xn −X | > ǫ)→ 0 for any ǫ > 0.

Xn → X in Lp, if E(|Xn −X |p)→ 0.

Xn → X in distribution. There are four equivalent definitions:

1). For every continuity point t of F , Fn(t)→ F (t), where Fn and F are c.d.f of Xn and X .

2). For every closed set B, lim supn P (Xn ∈ B) ≤ P (X ∈ B).

3). For every open set B, lim infn P (Xn ∈ B) ≥ P (X ∈ B).

4). For every continuous bounded function g(·), E(g(Xn))→ E(g(X)).

Remark. L_p convergence precludes the limit X taking the values ±∞ with positive chance. In some textbooks, a sequence of numbers going to infinity is said to converge to infinity rather than diverge to infinity. If this convention is used, the limit X can be ∞ or −∞ for a.s. convergence and, by slightly modifying the definition, for convergence in probability. For example, X_n → ∞ in probability is naturally defined as: for any M > 0, P(X_n > M) → 1. Convergence in distribution only has to do with distributions.

(ii). Convergence theorems.

The following convergence theorems and lemma, parallel to their analogues in real analysis, play an important role in the technical development of probability theory.

(1). Monotone convergence theorem. If Xn ≥ 0, and Xn ↑ X , then E(Xn) ↑ E(X).

Proof. E(X_n) ≤ E(X). For any a < E(X), there exist N and m such that ∑_{i=0}^N (i/2^m) P(i/2^m < X ≤ (i+1)/2^m) > a. But P(i/2^m < X_n ≤ (i+1)/2^m) → P(i/2^m < X ≤ (i+1)/2^m) (why?). Therefore lim E(X_n) ≥ a. Hence E(X_n) → E(X).

(2). Fatou’s lemma. If Xn ≥ 0, a.s., then

E(lim inf Xn) ≤ lim inf E(Xn)

Proof. Let X*_n = inf{X_k : k ≥ n}; then X*_n ↑ lim inf X_n, so by the monotone convergence theorem, E(X*_n) ↑ E(lim inf X_n). On the other hand, X*_n ≤ X_n, so E(X*_n) ≤ E(X_n). As a result, E(lim inf X_n) ≤ lim inf E(X_n).

(3). Dominated convergence theorem. If |X_n| ≤ Y, E(Y) < ∞, and X_n → X a.s., then E(X_n) → E(X).

Proof. Observe that Y − X_n ≥ 0 and Y + X_n ≥ 0. By Fatou's lemma, E(Y − lim X_n) ≤ lim inf E(Y − X_n), leading to E(X) ≥ lim sup E(X_n). Likewise, E(Y + lim X_n) ≤ lim inf E(Y + X_n), leading to E(X) ≤ lim inf E(X_n). Consequently, E(X_n) → E(X).


The essence of the above convergence theorems is to use a bound, upper or lower, to ensure the desired convergence of expectations. These bounds (the lower bound 0 in the monotone convergence theorem and Fatou's lemma, and both lower and upper bounds in the dominated convergence theorem) can actually be relaxed; see the DIY exercises. The most general extension is through the concept of uniformly integrable r.v.s, which shall be introduced later if necessary.

(iii). Relations between convergence modes.

The relations between the modes are, in summary: a.s. convergence =⇒ convergence in probability =⇒ convergence in distribution, and L_p convergence =⇒ convergence in probability. Conversely, convergence in probability implies (†) the existence of a subsequence that converges a.s., and implies (‡) L_p convergence if |X_n| ≤ Y where Y ∈ L_p.

(iv) Some examples.

We use the following examples to clarify the above relations.

a). in prob. conv. but not a.e. conv.

Let ξ ∼ Unif[0, 1]. Set X_{2^j+k} = 1 if ξ ∈ [k/2^j, (k+1)/2^j] and 0 otherwise, for all 0 ≤ k ≤ 2^j − 1 and j = 0, 1, 2, .... Then X_n → 0 in probability as n → ∞, but X_n does not converge to 0 a.e.; in fact, P(X_n → 0) = 0.

Alternatively, let ξ_n be i.i.d. ∼ Unif[0, 1] and let X_n = 1 if ξ_n ≤ 1/n and 0 otherwise. Then X_n → 0 in probability, but X_n does not converge to 0 a.e., by the Borel-Cantelli lemma.

b). in distribution conv. but not in probability conv..

This is in fact quite trivial. Any sequence of (non-constant) i.i.d. random variables converges in distribution, but not in probability. Observe that convergence in distribution only concerns the distributions; the variables do not even have to be defined on the same probability space.

c). a.s. but not Lp conv.

Let ξ ∼ Unif[0, 1]. Let X_n = e^n if ξ ≤ 1/n and 0 otherwise. Then X_n → 0 a.s. but E(|X_n|^p) = e^{np}/n → ∞.
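Example a), the "typewriter" sequence, can be probed numerically: across many draws of ξ the frequency of X_n = 1 shrinks (convergence in probability), yet for a single fixed ξ the value 1 keeps recurring in every dyadic block (no a.s. convergence). A minimal sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2 ** 12                      # indices n = 2^j + k
xi = rng.uniform(size=5000)      # many independent draws of xi ~ Unif[0, 1]

def X(n, xi):
    j = int(np.floor(np.log2(n)))
    k = n - 2 ** j
    return ((k / 2 ** j <= xi) & (xi <= (k + 1) / 2 ** j)).astype(int)

print(X(N, xi).mean())   # small (about 2^{-12}): P(X_n = 1) -> 0, i.e. convergence in probability
# For one fixed xi, X_n equals 1 for some n in every dyadic block, so X_n(omega) has no limit:
print(max(X(n, xi[:1])[0] for n in range(2 ** 11, N)))   # prints 1
```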

(v). Technical proofs.

1 . a.s. convergence =⇒ in probability convergence.

Proof. Let A_n = {|X_n − X| > ε}. a.s. convergence implies P(A_n, i.o.) = 0. But {A_n, i.o.} = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k. So 0 = P(A_n, i.o.) = lim_n P(∪_{k=n}^∞ A_k) ≥ lim sup_n P(A_n).

2 . Lp convergence =⇒ in prob convergence.

Proof. 0 ← E(|X_n − X|^p) ≥ E(|X_n − X|^p 1_{|X_n−X| > ε}) ≥ ε^p P(|X_n − X| > ε).

3 . in prob convergence =⇒ in distribution convergence.

Proof. For any t and any ε > 0, lim sup P(X_n ≤ t) ≤ lim sup [P(X_n ≤ t, X ≤ X_n + ε) + P(|X_n − X| > ε)] ≤ P(X ≤ t + ε). Letting ε ↓ 0, we have lim sup P(X_n ≤ t) ≤ P(X ≤ t). Likewise, lim sup P(−X_n ≤ −t) ≤ P(−X ≤ −t) (why?), so lim inf P(X_n < t) ≥ P(X < t). Suppose now that t is a continuity point of the distribution of X. Then P(X < t) = P(X ≤ t). As a result, lim_n P(X_n ≤ t) = P(X ≤ t).


4 . in prob convergence =⇒ existence of a subsequence that converges a.s.

Proof. Let ε_k ↓ 0. Since P(|X_n − X| > ε_k) → 0 as n → ∞, there exists n_k such that P(|X_{n_k} − X| > ε_k) < 2^{-k}. Therefore ∑_{k=1}^∞ P(|X_{n_k} − X| > ε_k) < ∞, which implies, by the Borel-Cantelli lemma introduced in the next chapter, that P(|X_{n_k} − X| > ε_k, i.o.) = 0. This means that, with probability 1, |X_{n_k} − X| ≤ ε_k for all large k, which is tantamount to X_{n_k} → X a.s..

5 . Lp convergence =⇒ Lq convergence for p > q > 0.

Proof. Let Y_n = |X_n − X|. For any ε ∈ (0, 1), E(Y_n^q) ≤ ε^q + E(Y_n^q 1_{Y_n ≥ 1}) + P(ε ≤ Y_n ≤ 1) ≤ ε^q + E(Y_n^p 1_{Y_n ≥ 1}) + P(Y_n ≥ ε) → ε^q as n → ∞. Since ε > 0 is arbitrary, it follows that X_n → X in L_q.

6 . Suppose |Xn| ≤ c > 0 a.s., then, in probability convergence ⇐⇒ Lp convergence for all (any)p > 0.

Proof. ⇐= follows from 2 . And =⇒ follows from the dominated convergence theorem.

7 . The four equivalent definitions of in distribution convergence.

Proof. 2) ⇐⇒ 3). The complement of any closed set is open, and the complement of any open set is closed.

1) =⇒ 3). Continuity points of F are dense (why?). Consider an interval (−∞, t); there exist continuity points t_k ↑ t. Then,

lim inf_n P(X_n ∈ (−∞, t)) ≥ lim inf_n P(X_n ∈ (−∞, t_k]) = P(X ∈ (−∞, t_k]) → P(X ∈ (−∞, t)).

The result can be extended for general open sets. We omit the proof.

3) =⇒ 1). Suppose t is a continuity point. Then lim sup_n F_n(t) ≤ F(t) by 2) and the equivalence of 2) and 3). Also lim inf_n F_n(t) ≥ lim inf_n P(X_n < t) ≥ P(X < t) = F(t), as t is a continuity point. So 1) follows.

4) =⇒ 1). Let t be a continuity point of F. For any small ε > 0, choose a non-increasing continuous function f which is 1 for x < t and 0 for x > t + ε. Then P(X_n ≤ t) ≤ E(f(X_n)) → E(f(X)) ≤ P(X ≤ t + ε). Letting ε ↓ 0, lim sup P(X_n ≤ t) ≤ P(X ≤ t). Likewise (how?), one can show lim inf P(X_n ≤ t) ≥ P(X ≤ t). The desired convergence follows.

1) =⇒ 4). Continuity points of the c.d.f. of X are dense (why?). Suppose |f(t)| < c. Choose continuity points −∞ = t_0 < t_1 < ... < t_K < t_{K+1} = ∞ such that F(t_1) < ε, 1 − F(t_K) < ε, and |f(t) − f(s)| < ε for any t, s ∈ [t_j, t_{j+1}], j = 1, ..., K − 1. Then,

|E(f(X_n)) − E(f(X))| = |∫ f(t) dF_n(t) − ∫ f(t) dF(t)|

≤ ∑_{j=0}^{K} |∫_{t_j}^{t_{j+1}} f(t)[dF_n(t) − dF(t)]|

≤ 2cε + ∑_{j=1}^{K−1} |∫_{t_j}^{t_{j+1}} f(t)[dF_n(t) − dF(t)]|

≤ 2cε + ∑_{j=1}^{K−1} |∫_{t_j}^{t_{j+1}} f(t_j)[dF_n(t) − dF(t)]| + ∑_{j=1}^{K−1} |∫_{t_j}^{t_{j+1}} [f(t) − f(t_j)][dF_n(t) − dF(t)]|

≤ 2cε + ∑_{j=1}^{K−1} c |∫_{t_j}^{t_{j+1}} [dF_n(t) − dF(t)]| + ∑_{j=1}^{K−1} ε ∫_{t_j}^{t_{j+1}} [dF_n(t) + dF(t)]

→ 2cε + 2ε ∫_{t_1}^{t_K} dF(t) ≤ (2c + 2)ε as n → ∞,

which can be arbitrarily small.

DIY Exercises.

Exercise 3.1 ⋆⋆ Suppose Xn ≥ η, with E(η−) <∞. Show E(lim inf Xn) ≤ lim inf E(Xn).

Exercise 3.2 ⋆⋆ Show the dominated convergence theorem still holds if Xn → X in probability orin distribution.

Exercise 3.3 ⋆⋆⋆ Let S_n = ∑_{i=1}^n X_i. Give a counter-example showing that S_n/n does not converge to 0 in probability while X_n → 0 in probability.

Exercise 3.4 ⋆⋆⋆ Let S_n = ∑_{i=1}^n X_i. Show that S_n/n → 0 a.s. if X_n → 0 a.s., and S_n/n → 0 in L_p if X_n → 0 in L_p for p ≥ 1.


Chapter 4. Independence, conditional expectation, the Borel-Cantelli lemma and Kolmogorov 0-1 laws.

(i). Conditional probability and independence of events.

For any two events, say A and B, the conditional probability of A given B is defined as

P(A|B) = P(A ∩ B)/P(B), if P(B) ≠ 0.

This is the chance of A to happen, given B has happened.

Intuitively, independence between events A and B should mean: information about whether B happens or not does not change the chance of A happening, and vice versa. In other words, whether B (A) happens or not contains no information about whether A (B) happens. Therefore the definition of independence should be P(A|B) = P(A) or P(B|A) = P(B). To include the case P(A) = 0 or P(B) = 0, the mathematical definition of independence is P(A ∩ B) = P(A)P(B), which is equivalent to P(A^c ∩ B) = P(A^c)P(B), or P(A ∩ B^c) = P(A)P(B^c), or P(A^c ∩ B^c) = P(A^c)P(B^c). The definition is extended in the following to independence between n events.

Definition. Events A_1, ..., A_n are called independent if P(∩_{i=1}^n B_i) = ∏_{i=1}^n P(B_i), where each B_i is A_i or A_i^c. Events A_1, ..., A_n are called pairwise independent if any pair of the events is independent.

The above definition implies that if A_1, ..., A_n are independent (pairwise independent), then any sub-collection A_{i_1}, ..., A_{i_k} is independent (pairwise independent). (Please DIY.)

The σ-algebra generated by a single set A, denoted σ(A), is {∅, A, A^c, Ω}. Independence between A_1, ..., A_n can be interpreted as independence between the σ-algebras σ(A_i), i = 1, ..., n.

(ii). Borel-Cantelli Lemma.

The Borel-Cantelli Lemma is considered a sine qua non of probability theory and is instrumental in proving the law of large numbers. Please note, in the proof below, the technique of using indicator functions to handle probabilities of sets.

Theorem 4.1. (Borel-Cantelli Lemma) For events A_1, A_2, ...,

(1) ∑_{n=1}^∞ P(A_n) < ∞ =⇒ P(A_n, i.o.) = 0;

(2) if the A_n are independent, ∑_{n=1}^∞ P(A_n) = ∞ =⇒ P(A_n, i.o.) = 1.

Here {A_n, i.o.} means that A_n happens infinitely often, i.e., ∩_{n=1}^∞ ∪_{k=n}^∞ A_k.

Proof. (1): Let 1_{A_n} be the indicator function of A_n. Then {A_n, i.o.} is the same as {∑_{n=1}^∞ 1_{A_n} = ∞}. Hence,

E(∑_{n=1}^∞ 1_{A_n}) = ∑_{n=1}^∞ E(1_{A_n}) = ∑_{n=1}^∞ P(A_n) < ∞.

It implies ∑_{n=1}^∞ 1_{A_n} < ∞ with probability 1, which is equivalent to P(A_n, i.o.) = 0.

(2). ∑_{n=1}^∞ P(A_n) = ∞ implies ∏_{k=n}^∞ (1 − P(A_k)) = 0 for all n ≥ 1, since log(1 − x) ≤ −x for x ∈ [0, 1]. By the dominated convergence theorem and independence,

E(1_{lim inf A_n^c}) = E(lim_n ∏_{k=n}^∞ 1_{A_k^c}) = lim_n E(∏_{k=n}^∞ 1_{A_k^c}) = lim_n ∏_{k=n}^∞ (1 − P(A_k)) = 0.

Then P(lim inf_n A_n^c) = 0 and hence P(lim sup_n A_n) = 1.

As an immediate consequence,


Corollary (Borel's 0-1 law) If A_1, ..., A_n, ... are independent, then P(A_n, i.o.) = 1 or 0 according as ∑_n P(A_n) = ∞ or < ∞.

Even though the above 0-1 law appears simple, its impact and implications are profound. More generally, suppose A ∈ ∩_{n=1}^∞ σ(A_j, j ≥ n), the so-called tail σ-algebra; A is then called a tail event. Then the independence of A_1, ..., A_n, ... implies P(A) = 0 or 1. The key fact here is that A is independent of A_n for any n ≥ 1. Examples of tail events are {A_n, i.o.} and {∑_{i=1}^n 1_{A_i}/log(n) → ∞}. A more general result involving independent random variables is Kolmogorov's 0-1 law, to be introduced later.

The following example can be viewed as a strengthening of the Borel-Cantelli lemma.

Example 4.1 Suppose A_1, ..., A_n, ... are independent events with ∑_n p_n = ∞, where p_n = P(A_n). Then,

X_n ≡ (∑_{i=1}^n 1_{A_i})/(∑_{i=1}^n p_i) → 1 a.s..

Proof. Since

E(X_n − 1)^2 = (∑_{i=1}^n p_i(1 − p_i))/(∑_{i=1}^n p_i)^2 ≤ 1/∑_{i=1}^n p_i → 0,

it follows that X_n → 1 in L_2, and therefore also in probability by the Chebyshev inequality:

P(|X_n − 1| > ε) ≤ E(X_n − 1)^2/ε^2 ≤ 1/(ε^2 ∑_{i=1}^n p_i) → 0.

Consider n_k ↑ ∞ as k → ∞ such that

∑_{k=1}^∞ 1/(∑_{i=1}^{n_k} p_i) < ∞ and (∑_{i=1}^{n_{k+1}} p_i)/(∑_{i=1}^{n_k} p_i) → 1.

Then,

∑_{k=1}^∞ P(|X_{n_k} − 1| > ε) < ∞.

The Borel-Cantelli lemma implies X_{n_k} → 1 a.s.. Observe that, for n_k ≤ n ≤ n_{k+1},

1 ← (∑_{i=1}^{n_k} 1_{A_i})/(∑_{i=1}^{n_{k+1}} p_i) ≤ X_n = (∑_{i=1}^n 1_{A_i})/(∑_{i=1}^n p_i) ≤ (∑_{i=1}^{n_{k+1}} 1_{A_i})/(∑_{i=1}^{n_k} p_i) → 1, a.s..

The desired convergence holds.

Remark. The trick of bracketing X_n by the two quantities in the above inequality is also used in proving the uniform convergence of the empirical distribution to the population distribution:

sup_x |F_n(x) − F(x)| → 0, a.s.,

where F_n(x) = (1/n) ∑_{i=1}^n 1_{ξ_i ≤ x} and the ξ_i are iid with c.d.f. F. The idea is further elaborated in the context of empirical approximation in terms of bracketing/packing numbers.

Example 4.2. Repeatedly toss a coin which has probability p of heads and q = 1 − p of tails on each toss. Let X_n = H or T according as the n-th toss is a head or a tail. Let

l_n = max{m ≥ 0 : X_n = H, X_{n+1} = H, ..., X_{n+m−1} = H, X_{n+m} = T}

be the length of the run of heads starting from the n-th toss. Then,

lim sup_n l_n/log n = 1/log(1/p).

Proof. l_n follows a geometric distribution, i.e.,

P(l_n = k) = q p^k and P(l_n ≥ k) = P(X_n = H, ..., X_{n+k−1} = H) = p^k, k = 0, 1, 2, ...

Page 21: Advanced Probability and Statistics

20

For any ε > 0,

∑_{n=1}^∞ P(l_n > (1+ε) log n/log(1/p)) ≤ ∑_{n=1}^∞ p^{(1+ε) log n/log(1/p)} = ∑_{n=1}^∞ e^{−(1+ε) log n} = ∑_{n=1}^∞ n^{−(1+ε)} < ∞.

By the Borel-Cantelli lemma,

lim sup_n l_n/(log n/log(1/p)) ≤ 1.

We next try to find a subsequence along which the limit is as large as 1. Let d_n be the integer part of log n/log(1/p) and let r_n = ∑_{i=1}^n d_i. Then r_n ≈ n log n/log(1/p) and log(r_n) ≈ log(n). Set

A_n = {X_{r_n} = H, X_{r_n+1} = H, ..., X_{r_n+d_n−1} = H}.

Then A_n, n ≥ 1, are independent, and

P(A_n) = p^{d_n} = e^{d_n log p} ≈ 1/n.

Therefore ∑_n P(A_n) = ∞. It then follows from the Borel-Cantelli lemma that P(A_n, i.o.) = 1. Since A_n = {l_{r_n} ≥ d_n}, we have

lim sup_n l_n/(log n/log(1/p)) ≥ lim sup_n l_{r_n}/(log(r_n)/log(1/p)) ≈ lim sup_n l_{r_n}/d_n ≥ 1.

Remark. An analogous problem occurs in the setting of Poisson processes. Consider a Poisson process with intensity λ > 0. The sojourn times (times between two consecutive events) ξ_0, ξ_1, ... are iid ∼ the exponential distribution with mean 1/λ. Then lim sup_{x→∞} l_x/log x = 1/λ, where l_x is the time between x and the first event after x.

(iii). Independence between σ-algebras and between random variables.

Definitions. Let A_1, ..., A_n be σ-algebras. They are called independent if A_1, ..., A_n are independent for any events A_j ∈ A_j, j = 1, ..., n. Random variables X_1, ..., X_n are called independent if the σ-algebras generated by the X_j, 1 ≤ j ≤ n, are independent, i.e.,

P(∩_{j=1}^n X_j^{-1}(B_j)) = ∏_{j=1}^n P(X_j^{-1}(B_j)), or P(X_1 ∈ B_1, ..., X_n ∈ B_n) = ∏_{j=1}^n P(X_j ∈ B_j),

for any Borel sets B_1, ..., B_n in (−∞, ∞).

There are several equivalent definitions of the independence of random variables:

Two r.v.s X and Y are called independent if E(g(X)f(Y)) = E(g(X))E(f(Y)) for all bounded (measurable) functions g and f; or, equivalently, if

P(X ≤ t, Y ≤ s) = P(X ≤ t) P(Y ≤ s) for all t, s ∈ (−∞, ∞),

i.e., in terms of cumulative distribution functions,

F_{X,Y}(t, s) = F_X(t) F_Y(s) for all t, s.

If the joint density exists, this is the same as f_{X,Y}(x, y) = f_X(x) f_Y(y).

Roughly speaking, independence between two r.v.s X and Y means that the value taken by X "has nothing to do with" the value taken by Y, and vice versa.

(iv). Conditional expectation.


(1). Conditional distribution and conditional expectation with respect to a set A.

Suppose A is a set with P(A) > 0 and X is a random variable. Then the conditional expectation of X given A is

E(X |A) ≡ E(X1A)/P (A).

The conditional distribution of X given A is

P(X ≤ t | A) = P({X ≤ t} ∩ A)/P(A).

Then E(X|A) = ∫ t dP(X ≤ t | A), if it exists.

As a simple example, let X ∼ Unif[0, 1] and let A_i = {(i−1)/n < X ≤ i/n} for i = 1, ..., n. Then

E(X | A_i) ≡ E(X 1_{A_i})/P(A_i) = (i − 1/2)/n.

Similarly, E(X | A_i^c) ≡ E(X 1_{A_i^c})/P(A_i^c).

Interpretation: E(X |A) is the weighted “average” (expected value) of X over the set A.
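The simple example above can be checked by simulation (a small sketch of ours, with illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)
n, i = 5, 3
in_Ai = ((i - 1) / n < x) & (x <= i / n)
# E(X | A_i) = E(X 1_{A_i}) / P(A_i), which should be close to (i - 1/2)/n = 0.5.
print((x * in_Ai).mean() / in_Ai.mean(), (i - 0.5) / n)
```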

(2). Conditional expectation with respect to a r.v..

For two random variables X and Y, E(X|Y) is a function of Y, i.e., measurable with respect to σ(Y), such that, for any A ∈ σ(Y),

E(X 1_A) = E[E(X|Y) 1_A].

Interpretation: E(X|Y) is the weighted "average" (expected value) of X over the set {Y = y}, for each y. It is a function of Y and therefore a r.v. measurable with respect to σ(Y).

If the joint density f(x, y) exists, then the conditional density of X given Y = y is f_{X|Y}(x|y) ≡ f(x, y)/f_Y(y), and

E(X | Y = y) ≡ ∫ x f_{X|Y}(x|y) dx.

(3). Conditional expectation with respect to a σ-algebra A.

Conditional expectation w.r.t. a σ-algebra is among the most fundamental concepts in probability theory, especially in martingale theory, in which the very definition of a martingale depends on conditional expectation.

Recall that a random variable X is measurable with respect to a σ-algebra A if, for any interval (a, b), {ω : X(ω) ∈ (a, b)} ∈ A. In other words, σ(X) ⊆ A, which is interpreted as: all information about X (which is σ(X)) is contained in A.

If A = σ(A_1, ..., A_n) where the A_i are disjoint, then X measurable to A implies X must be constant over each A_i. If A is generated by a r.v. Y, then X measurable to A implies X must be a function of Y. A heuristic understanding is that if Y is known, then there is no uncertainty about X; or, if Y assumes one value, X cannot assume more than one value.

Definition For a random variable X and a completed σ-algebra A, E(X|A) is defined as an A-measurable random variable such that, for any A ∈ A,

E(X 1_A) = E(E(X|A) 1_A),

i.e., E(X | A) = E(E(X|A) | A) for every event A ∈ A with P(A) > 0.

If A = σ(A_1, ..., A_n) where the A_i are disjoint, then

E(X | A) = ∑_{i=1}^n E(X | A_i) 1_{A_i},

which is a r.v. that, on each A_i, takes the conditional average of X, i.e., E(X | A_i), as its value. Motivated by this simple case, we may obtain an important understanding of the conditional expectation of X w.r.t. a σ-algebra A: a new r.v. that is the "average" of the r.v. X on each "un-splittable" or "smallest" set of the σ-algebra A.

Conditional expectation with respect to a σ-algebra shares many properties with ordinary expectation.

Properties:

(1). E(aX + bY |A) = aE(X |A) + bE(Y |A)

(2). If X is A-measurable (X ∈ A), then E(X|A) = X.

(3). (Tower property) E(E(X|F)|A) = E(X|A) for two σ-algebras A ⊆ F.

Further properties, such as the dominated convergence theorem, Fatou's lemma and the monotone convergence theorem, also hold for conditional means w.r.t. a σ-algebra. (See DIY exercises.)

(v). Kolmogorov’s 0-1 law.

One of the most important theorems in probability theory is the martingale convergence theorem. In the following, we provide a simplified version, without a rigorous introduction of martingales and without giving a proof.

Theorem 4.2 (simplified version of the martingale convergence theorem) Suppose F_n ⊆ F_{n+1} for n ≥ 1. Let F = σ(∪_{n=1}^∞ F_n). For any random variable X with E(|X|) < ∞,

E(X | F_n) → E(X | F), a.s.

The martingale convergence theorem, even in this simplified version, has broad applications. For example, one of the most basic 0-1 laws, the Kolmogorov 0-1 law, can be established upon it.

Corollary (Kolmogorov 0-1 law) Suppose X_1, ..., X_n, ... is a sequence of independent r.v.s. Then all tail events have probability 0 or 1.

Proof. Suppose A is a tail event. Then A is independent of X_1, ..., X_n for any fixed n. Therefore E(1_A|F_n) = P(A), where F_n is the σ-algebra generated by X_1, ..., X_n. But, by Theorem 4.2, E(1_A|F_n) → 1_A a.s.. Hence 1_A = P(A) a.s., and P(A) can only be 0 or 1.

A heuristic interpretation of Kolmogorov's 0-1 law can be given from the perspective of information. When σ-algebras A_1, ..., A_n, ... are independent, the pieces of information carried by the A_i are independent, unrelated, non-overlapping. Then the information carried by A_n, A_{n+1}, ... must shrink to nothing as n → ∞, since otherwise A_n, A_{n+1}, ... would have something in common.

As straightforward applications of Kolmogorov’s 0-1 law:

Corollary Suppose X_1, ..., X_n, ... is a sequence of independent random variables. Then

lim inf_n X_n, lim sup_n X_n, lim sup_n S_n/a_n and lim inf_n S_n/a_n

must each be a.s. equal to a constant, possibly ∞ or −∞, where S_n = ∑_{i=1}^n X_i and a_n ↑ ∞.

Proof. Consider A = {ω : lim inf_n X_n(ω) > a}. Try to show A is a tail event. (DIY.)

Remark. Without invoking the martingale convergence theorem, Kolmogorov's 0-1 law can be shown through the π-λ theorem, which we do not plan to cover.

DIY Exercises.

Exercise 4.1 ⋆⋆ Suppose the X_n are iid random variables. Show that X_n/n^{1/p} → 0 a.s. if and only if E(|X_1|^p) < ∞, for p > 0. Hint: Borel-Cantelli lemma.

Exercise 4.2 ⋆⋆⋆ Let X_n be iid r.v.s with E(|X_n|) = ∞. Show that lim sup_n |S_n|/n = ∞ a.s., where S_n = X_1 + · · · + X_n.

Exercise 4.3 ⋆⋆⋆ Suppose X_n are iid nonnegative random variables such that ∑_{k=1}^∞ k P(X_1 > a_k) < ∞ for some a_k ↑ ∞. Show that lim sup_n max_{1≤i≤n} X_i/a_n ≤ 1 a.s.


Exercise 4.4 ⋆⋆⋆⋆ (Empirical Approximation) For every fixed t ∈ [0, 1], S_n(t) is a sequence of random variables such that, with probability 1, for some p > 0,

|S_n(t) − S_n(s)| ≤ n|t − s|^p for all n ≥ 1 and all t, s ∈ [0, 1].

Suppose that for every constant C > 0 there exists a c > 0 such that

P(|S_n(t)| > C(n log n)^{1/2}) ≤ e^{−cn} for all n ≥ 1 and t ∈ [0, 1].

Show that, for any p > 0,

max{|S_n(t)| : t ∈ [0, 1]}/(n log n)^{1/2} → 0 a.s..

Hint: Borel-Cantelli lemma.


Chapter 5. Weak law of large numbers.

For a sequence of independent r.v.s X_1, X_2, ..., the classical law of large numbers is typically about the convergence of the normalized partial sums

(S_n − E(S_n))/n = ∑_{i=1}^n [X_i − E(X_i)]/n,

where S_n = ∑_{i=1}^n X_i here and throughout this chapter. A more general form is the convergence of

(S_n − a_n)/b_n

for some constants a_n and b_n. The weak law refers to convergence in probability and the strong law to convergence a.s..

The following proposition may be called the L_2 weak law of large numbers; it implies the weak law of large numbers.

Proposition Suppose X_1, ..., X_n, ... are iid with mean µ and finite variance σ^2. Then

S_n/n → µ in probability and in L_2.

Proof. Write E(S_n/n − µ)^2 = (1/n)σ^2 → 0. Therefore L_2 convergence holds, and convergence in probability follows from the Chebyshev inequality.

The above proposition shows that the classical weak law of large numbers holds quite trivially in the standard setup of iid r.v.s with finite variance. In fact, in such a standard setup the strong law of large numbers also holds, as will be shown in Chapter 6. However, the fact that convergence in probability is implied by L_2 convergence plays a central role in establishing weak laws of large numbers. For example, a straightforward extension of the above proposition is:

For independent r.v.s X_1, X_2, ..., (S_n − E(S_n))/b_n → 0 in probability if (1/b_n^2) ∑_{i=1}^n var(X_i) → 0 for some b_n ↑ ∞.

The following theorem on the general weak law of large numbers combines the above extension with the technique of truncation.

Theorem 5.1. Weak Law of Large Numbers Suppose X_1, X_2, ... are independent. Assume

(1). ∑_{i=1}^n P(|X_i| > b_n) → 0,

(2). b_n^{-2} ∑_{i=1}^n E(X_i^2 1_{|X_i| ≤ b_n}) → 0,

where 0 < b_n ↑ ∞. Then (S_n − a_n)/b_n → 0 in probability, where a_n = ∑_{j=1}^n E(X_j 1_{|X_j| ≤ b_n}).

Proof. Let Y_j = X_j 1_{|X_j| ≤ b_n}. Consider

(∑_{j=1}^n Y_j − a_n)/b_n = ∑_{j=1}^n [Y_j − E(Y_j)]/b_n,

which has mean 0 and converges to 0 in L_2 by (2); therefore it also converges to 0 in probability. Notice that

P((S_n − a_n)/b_n = (∑_{j=1}^n Y_j − a_n)/b_n) = P(S_n = ∑_{j=1}^n Y_j)

≥ P(X_j = Y_j for all 1 ≤ j ≤ n) = ∏_{j=1}^n P(X_j = Y_j)   (by independence)

= ∏_{j=1}^n P(|X_j| ≤ b_n) = ∏_{j=1}^n [1 − P(|X_j| > b_n)] = e^{∑_{j=1}^n log[1 − P(|X_j| > b_n)]}

≈ e^{−∑_{j=1}^n P(|X_j| > b_n)} → 1, by (1).

Hence (Sn − an)/bn → 0 in probability.

Theorem 5.2. Suppose X, X_1, X_2, ... are iid. Then S_n/n − µ_n → 0 in probability for some µ_n if and only if

x P(|X_1| > x) → 0 as x → ∞,

in which case µ_n = E(X 1_{|X| ≤ n}) + o(1).

Proof. "⇐=" Let a_n = nµ_n and b_n = n in Theorem 5.1. Condition (1) follows. To check Condition (2), write, as n → ∞,

b_n^{-2} ∑_{i=1}^n E(X_i^2 1_{|X_i| ≤ b_n}) = (1/n) E(X^2 1_{|X| ≤ n}) ≤ (1/n) E(min(|X|, n)^2)

= (1/n) ∫_0^∞ 2x P(min(|X|, n) > x) dx = (1/n) ∫_0^n 2x P(|X| > x) dx

= (1/n) ∫_M^n 2x P(|X| > x) dx + o(1)   for any fixed M > 0

≤ 2 sup_{x ≥ M} x P(|X| > x) + o(1),

as n → ∞. Since M is arbitrary, Condition (2) holds, and the WLLN follows from Theorem 5.1.

"=⇒" Let X*, X*_1, ... be iid with the same distribution as X and independent of X, X_1, .... Set ξ_i = X_i − X*_i (symmetrization) and S̃_n = ∑_{i=1}^n ξ_i. Then S̃_n/n → 0 in probability. The Lévy inequality in Exercise 5.1 implies max{|S̃_j| : 1 ≤ j ≤ n}/n → 0 in probability, which further ensures max{|ξ_j| : 1 ≤ j ≤ n}/n → 0 in probability. For any ε > 0,

n P(|X| ≥ nε) P(|X*| ≤ 0.5nε) = n P(|X| ≥ nε, |X*| ≤ 0.5nε) ≤ n P(|X − X*| ≥ 0.5nε)
≈ 1 − [1 − P(|X − X*| ≥ 0.5nε)]^n = P(max_{1≤j≤n} |ξ_j| > 0.5nε) → 0.

As a result, for any ε > 0,

n P(|X| ≥ nε) ≈ n P(|X| ≥ nε)[1 − P(|X*| ≥ 0.5nε)] → 0,

which is equivalent to x P(|X_1| > x) → 0 as x → ∞.

Example 5.1. Suppose X_1, X_2, ... are i.i.d. with common density f symmetric about 0 and c.d.f. F such that 1 − F(t) = 1/(t log t) for t > 3. Then S_n/n → 0 in probability, but S_n/n does not converge to 0 a.s..

The convergence in probability is a consequence of Theorem 5.2, with µ_n = 0 by symmetry, after checking the condition x P(|X| > x) → 0 as x → ∞. The a.s. convergence fails because X_n/n does not converge to 0 a.s., by the Borel-Cantelli lemma (∑_n P(|X_n| > n) = ∞).

Corollary. Suppose X_1, ..., X_n, ... are i.i.d. with E(|X_i|) < ∞. Then S_n/n → E(X_1) in probability.


Proof. Since E(|X_i|) = ∫_0^∞ P(|X_i| > t) dt < ∞, we have, as x → ∞,

x P(|X_i| > x) ≤ 2 ∫_{x/2}^x P(|X_i| > t) dt → 0,

the right-hand side being the tail of a convergent integral. The WLLN then follows from Theorem 5.2, with µ_n = E(X_1 1_{|X_1| ≤ n}) → E(X_1) by dominated convergence.

Example 5.2. The St. Petersburg Paradox. Let X, X_1, ..., X_n, ... be iid with P(X = 2^k) = 2^{-k}, k = 1, 2, .... Then E(X) = ∞ and

S_n/(n log n) → 1/log 2 in probability.

Proof. Notice that P(X ≥ 2^k) = 2^{-k+1}. Let k_n ≈ log log n/log 2 and m_n = log n/log 2 + k_n, with m_n an integer, and set b_n = 2^{m_n} = 2^{k_n} n ≈ n log n. Then,

n P(X ≥ b_n) = n 2^{-m_n+1} = 2n/(2^{k_n} n) = 2 · 2^{-k_n} → 0.

And

E(X^2 1_{|X| ≤ b_n}) = ∑_{k=1}^{m_n} 2^{2k} 2^{-k} = ∑_{k=1}^{m_n} 2^k ≤ 2 × 2^{m_n} = 2b_n.

Then,

n E(X^2 1_{|X| ≤ b_n})/b_n^2 ≤ 2n b_n/b_n^2 = 2n/b_n = 2n/2^{m_n} = 2n/(n 2^{k_n}) = 2 · 2^{-k_n} → 0.

Let a_n = n E(X 1_{|X| ≤ b_n}). Then

a_n = n ∑_{k=1}^{m_n} 2^k 2^{-k} = n m_n = n log n/log 2 + n k_n ≈ b_n/log 2.

The desired convergence then follows from Theorem 5.1, since a_n/b_n → 1/log 2.
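A simulation sketch (ours) of the S_n/(n log n) → 1/log 2 behaviour; convergence is slow and in probability only, so for moderate n the ratio is only roughly 1/log 2 ≈ 1.44:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 10 ** 6
# X = 2^K where K is the number of tosses to the first head: P(K = k) = 2^{-k}.
k = rng.geometric(p=0.5, size=n)
s_n = np.sum(2.0 ** k)
print(s_n / (n * math.log(n)), 1 / math.log(2))   # roughly comparable for large n
```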

Example 5.3. "Unfair fair game". You pay one dollar to buy a lottery ticket. The lottery has infinitely many numbered balls. If number k occurs, you are paid 2^k dollars. The number k ball occurs with probability

p_k ≡ 1/(2^k k(k + 1)).

Is this a fair game?

In a sense, it is fair. Let X be the gain/loss of one play. Then P(X = 2^k − 1) = p_k, k = 1, 2, ..., and P(X = −1) = 1 − ∑_k p_k. Then E(X) = 0.

Suppose one buys the lottery daily, one ticket every day. Let X_n be the gain/loss of day n and S_n the cumulative gain/loss up to day n. Then,

S_n/(n/log n) → −log 2 in probability,

meaning that in the long run he/she is nearly certainly in the red.

Example 5.4. Compute the limit of

∫_0^1 ··· ∫_0^1 (x_1^2 + ··· + x_n^2)/(x_1 + ··· + x_n) dx_1 ··· dx_n.

Solution. The above integral is the same as

E((X_1^2 + ··· + X_n^2)/(X_1 + ··· + X_n)),


where X_1, ..., X_n, ... are iid ∼ Unif[0, 1]. Since, by the WLLN,

(1/n) ∑_{i=1}^n X_i^2 → E(X_1^2) = ∫_0^1 x^2 dx = 1/3 and (1/n) ∑_{i=1}^n X_i → E(X_1) = 1/2,

with the convergence being convergence in probability, we have

(X_1^2 + ··· + X_n^2)/(X_1 + ··· + X_n) → 2/3 in probability.

The r.v. on the left-hand side is bounded by 1 (since x_i^2 ≤ x_i on [0, 1]). By dominated convergence (for convergence in probability; see Exercise 3.2), its mean also converges to 2/3. Hence the limit of the integral is 2/3.
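A Monte Carlo check of this limit (a sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 2000
x = rng.uniform(size=(reps, n))
ratios = (x ** 2).sum(axis=1) / x.sum(axis=1)
print(ratios.mean())   # approximately 2/3 for large n
```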

Remark. The following WLLN for arrays of r.v.s is a slight generalization of Theorem 5.1.

Suppose X_{n,1}, ..., X_{n,n} are independent r.v.s for each n. If

∑_{i=1}^n P(|X_{n,i}| > b_n) → 0 and (1/b_n^2) ∑_{i=1}^n E(X_{n,i}^2 1_{|X_{n,i}| ≤ b_n}) → 0,

then

(∑_{i=1}^n X_{n,i} − a_n)/b_n → 0 in probability,

where a_n = ∑_{i=1}^n E(X_{n,i} 1_{|X_{n,i}| ≤ b_n}).

DIY Exercises.

Exercise 5.1 (Lévy's Inequality) Suppose X_1, X_2, ... are independent and symmetric about 0. Then,

P(max_{1≤j≤n} |S_j| ≥ ε) ≤ 2P(|S_n| ≥ ε).

Exercise 5.2 Show that S_n/(n/log n) → −log 2 in probability in Example 5.3. Hint: Choose b_n = 2^{m_n} with m_n = min{k : 2^{-k} k^{-3/2} ≤ 1/n} and proceed as in Example 5.2.

Exercise 5.3 For Example 5.3, prove that S_n/b_n → 0 in probability if b_n/(n/log n) ↑ ∞.

Exercise 5.4 (Marcinkiewicz-Zygmund weak law of large numbers) Suppose X, X_1, X_2, ... are iid and x^p P(|X| > x) → 0 as x → ∞ for some 0 < p < 2. Prove that

(S_n − nE(X 1_{|X| ≤ n^{1/p}}))/n^{1/p} → 0 in probability.


Chapter 6. Strong law of large numbers.

For r.v.s X_1, X_2, ..., convergence of the series means the convergence of its partial sums S_n = ∑_{i=1}^n X_i as n → ∞. We shall denote the a.s. convergence of S_n simply as ∑_{n=1}^∞ X_n < ∞ a.s.. The following Kolmogorov inequality is the key to establishing a.s. convergence of series of independent r.v.s.

(i). Kolmogorov inequality.

Theorem 6.1. Kolmogorov inequality Suppose X_1, X_2, ..., X_n are independent with E(X_i) = 0 and var(X_i) < ∞. Let S_j = X_1 + ... + X_j. Then,

P(max_{1≤j≤n} |S_j| ≥ ε) ≤ var(S_n)/ε^2.

Proof. Let T = min{j ≤ n : |S_j| ≥ ε}, with the minimum of the empty set being ∞, i.e., T = ∞ if |S_j| < ε for all 1 ≤ j ≤ n. Then {T ≤ j} and {T = j} depend only on X_1, ..., X_j. As a result,

{T ≥ j} = {T ≤ j − 1}^c = {|S_i| < ε, 1 ≤ i ≤ j − 1}

depends only on X_1, ..., X_{j−1} and therefore is independent of X_j, X_{j+1}, .... Write

P(max_{1≤j≤n} |S_j| ≥ ε) = P(T ≤ n) ≤ ε^{-2} E(|S_T|^2 1_{T≤n}) ≤ ε^{-2} E(|S_{T∧n}|^2)

= ε^{-2} E(|∑_{j=1}^{T∧n} X_j|^2) = ε^{-2} E(|∑_{j=1}^{n} X_j 1_{T≥j}|^2)

= ε^{-2} [ E(∑_{j=1}^{n} X_j^2 1_{T≥j}) + 2 ∑_{1≤i<j≤n} E(X_i X_j 1_{T≥i} 1_{T≥j}) ]

= ε^{-2} [ ∑_{j=1}^{n} E(X_j^2) P(T ≥ j) + 2 ∑_{1≤i<j≤n} E(X_j) E(X_i 1_{T≥i} 1_{T≥j}) ]

= ε^{-2} ∑_{j=1}^{n} E(X_j^2) P(T ≥ j) + 0

≤ var(S_n)/ε^2.

Example 6.1. (Extension to continuous time processes.) Suppose {S_t : t ∈ [0,∞)} is a process with increments that are independent, of zero mean and finite variance. If the paths of S_t are right continuous, then

P(sup_{t∈[0,τ]} |S_t| > ε) ≤ var(S_τ)/ε^2.

Examples of such processes include the compensated Poisson process and Brownian motion.

Kolmogorov's inequality will later be seen as a special case of a martingale inequality. In the proof of Kolmogorov's inequality, we have used a stopping time T, which is a r.v. associated with a process S_n (or, more generally, a filtration) such that {T = k} only depends on past and current values of the process: S_1, ..., S_k. Stopping times are among the most important concepts and tools in martingale theory and stochastic processes.

(ii). Khintchine-Kolmogorov convergence theorem.

Theorem 6.2. (Khintchine-Kolmogorov Convergence Theorem) Suppose X_1, X_2, ... are independent with mean 0 such that Σ_n var(X_n) < ∞. Then, Σ_n X_n < ∞ a.s., i.e., S_n converges a.s. as well as in L2 to Σ_{n=1}^∞ X_n.


Proof. Define A_{m,ε} = {max_{j>m} |S_j − S_m| ≤ ε}. Then, {Σ_{n=1}^∞ X_n < ∞} = ∩_{ε>0} ∪_m A_{m,ε}. By Kolmogorov's inequality,

P(max_{m<j≤n} |S_j − S_m| > ε) ≤ var(S_n − S_m)/ε^2 = (1/ε^2) Σ_{i=m+1}^n var(X_i) ≤ (1/ε^2) Σ_{i=m+1}^∞ var(X_i).

By letting n → ∞ first and then m → ∞, we have

lim_{m→∞} P(max_{j>m} |S_j − S_m| > ε) = 0.

Then lim_m P(A_{m,ε}) = 1. So P(∪_{m≥1} A_{m,ε}) = 1 for every ε > 0. Hence,

P(Σ_n X_n < ∞) = P(∩_{ε>0} ∪_m A_{m,ε}) = 1,

and the a.s. convergence of S_n holds. Denote the a.s. limit by S_∞.

To show convergence of S_n in L2, write

E[(S_n − S_∞)^2] = E[(S_n − lim_k S_k)^2] = E[lim_k (S_n − S_k)^2]

≤ lim inf_k E[(S_n − S_k)^2]   (by Fatou's lemma)

= lim inf_k Σ_{j=n+1}^k var(X_j) = Σ_{j=n+1}^∞ var(X_j),

which tends to 0 as n → ∞. Therefore the convergence in L2 holds.

Example 6.2. Suppose X_1, X_2, ... are iid with zero mean and finite variance. Then Σ_n a_n X_n < ∞ a.s. if and only if Σ_n a_n^2 < ∞.

"⇐=" is a direct consequence of Theorem 6.2. "=⇒" follows from the central limit theorem to be shown in Chapter 8.

(iii). Kolmogorov three series theorem

For independent random variables, the Kolmogorov three series theorem is the ultimate result, providing sufficient and necessary conditions for the a.s. convergence of the series.

Theorem 6.3. (Kolmogorov Three Series Theorem) Suppose X_1, X_2, ... are independent. Let Y_n = X_n 1_{|X_n|≤1}. Then, Σ_n X_n < ∞ a.s. if and only if (1) Σ_n P(|X_n| > 1) < ∞; (2) Σ_n E(Y_n) < ∞; and (3) Σ_n var(Y_n) < ∞.

Proof. "⇐=": The convergence of Σ_n (Y_n − E(Y_n)) is implied by (3) and Theorem 6.2. Together with (2), it ensures Σ_n Y_n < ∞ a.s.. On the other hand, condition (1) and the Borel-Cantelli lemma imply P(X_n ≠ Y_n, i.o.) = 0. Consequently, Σ_n X_n converges.

"=⇒" (An unconventional proof). It is straightforward that condition (1) holds: Σ_n X_n < ∞ a.s. implies X_n → 0 a.s., so P(|X_n| > 1, i.o.) = 0, and the second Borel-Cantelli lemma (using independence) gives Σ_n P(|X_n| > 1) < ∞. Then Σ_n Y_n < ∞ a.s., since P(X_n ≠ Y_n, i.o.) = 0. If condition (3) does not hold, then by the central limit theorem to be shown in the next chapter,

(1/√(Σ_{i=1}^n var(Y_i))) Σ_{i=1}^n [Y_i − E(Y_i)] → N(0,1)

in distribution. Hence P(|Σ_{i=1}^n Y_i| ≤ M) → 0 as n → ∞ for any fixed M > 0, which contradicts Σ_n Y_n < ∞ a.s.. Hence condition (3) holds. Theorem 6.2 then ensures Σ_n (Y_n − E(Y_n)) < ∞ a.s.. As a result, Σ_n E(Y_n) < ∞ and condition (2) also holds.

Remark. If X_n is truncated at an arbitrary constant ε > 0 rather than at 1 in Theorem 6.3, the theorem still holds.
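As a concrete illustration of the three conditions, consider the (hypothetical) example X_n = ±n^{−0.6} with independent, equally likely signs. Since |X_n| ≤ 1, the truncated variables Y_n coincide with X_n, series (1) and (2) vanish, and series (3) is Σ n^{−1.2} < ∞, so Σ_n X_n converges a.s. by Theorem 6.3. A minimal sketch (Python with NumPy assumed) that evaluates the three series and shows a few partial-sum paths stabilizing:

    import numpy as np

    rng = np.random.default_rng(2)

    N = 100_000
    n = np.arange(1, N + 1)
    a = n**-0.6                                  # |X_n| = n^{-0.6} <= 1, sign chosen at random

    # The three series of Theorem 6.3 with truncation level 1 (so Y_n = X_n here):
    print("sum_n P(|X_n| > 1) =", int(np.sum(a > 1.0)))     # 0
    print("sum_n E(Y_n)       =", 0.0)                      # signs are symmetric
    print("sum_n var(Y_n)     =", float(np.sum(a**2)))      # ~ sum n^{-1.2} < infinity

    # A few sample paths of S_n = X_1 + ... + X_n; each converges (to a different limit).
    for path in range(3):
        S = np.cumsum(rng.choice([-1.0, 1.0], size=N) * a)
        print("path", path, "S_n at n = 10^3, 10^4, 10^5:", S[999], S[9999], S[99999])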


Corollary. Suppose X, X_1, X_2, ... are iid with E(|X|^p) < ∞ for some 0 < p < 2. Then, Σ_{n=1}^∞ [X_n − E(X)]/n^{1/p} < ∞ a.s. for 1 < p < 2; and Σ_{n=1}^∞ X_n/n^{1/p} < ∞ a.s. for 0 < p < 1.

We leave the proof as Exercise 6.2.

The strong law of large numbers (SLLN) is a central result in classical probability theory. The convergence of series established above paves the way towards proving the SLLN using the Kronecker lemma.

(iv). Kronecker lemma and Kolmogorov’s criterion of SLLN.

Kronecker Lemma. Suppose a_n > 0 and a_n ↑ ∞. Then Σ_n x_n/a_n < ∞ implies (1/a_n) Σ_{j=1}^n x_j → 0.

Proof. Set b_n = Σ_{i=1}^n x_i/a_i and a_0 = b_0 = 0. Then, b_n → b_∞ < ∞ and x_n = a_n(b_n − b_{n−1}). Write

(1/a_n) Σ_{j=1}^n x_j = (1/a_n) Σ_{j=1}^n a_j(b_j − b_{j−1}) = (1/a_n)[Σ_{j=1}^n a_j b_j − Σ_{j=1}^n a_j b_{j−1}]

= b_n + (1/a_n)[Σ_{j=1}^{n−1} a_j b_j − Σ_{j=1}^n a_j b_{j−1}]

= b_n + (1/a_n)[Σ_{j=1}^n a_{j−1} b_{j−1} − Σ_{j=1}^n a_j b_{j−1}]

= b_n − (1/a_n) Σ_{j=1}^n b_{j−1}(a_j − a_{j−1})

→ b_∞ − b_∞ = 0,

since the last sum is a weighted average of the b_{j−1}'s with nonnegative weights a_j − a_{j−1} summing to a_n, and b_{j−1} → b_∞.

The following proposition is an immediate application of the Kronecker lemma and the Khintchine-Kolmogorov convergence of series.

Proposition (Kolmogorov's criterion of SLLN). Suppose X_1, X_2, ... are independent such that E(X_n) = 0 and Σ_n var(X_n)/n^2 < ∞. Then, S_n/n → 0 a.s..

Proof. Consider the series Σ_{i=1}^n X_i/i, n ≥ 1. Theorem 6.2 implies Σ_n X_n/n < ∞ a.s., and the above Kronecker Lemma (with a_n = n) ensures S_n/n → 0 a.s..

Obviously, if X, X_1, X_2, ... are iid with finite variance, the above proposition implies the SLLN: S_n/n → E(X) a.s.. In fact, a stronger result than the above SLLN is also straightforward:

Corollary. If X_1, X_2, ... are iid with mean µ and finite variance, then

(S_n − nµ)/(√n (log n)^δ) → 0 a.s.

for any δ > 1.

We leave the proof as an exercise.

The corollary gives a rate of a.s. convergence of the sample mean S_n/n to the population mean µ of order n^{−1/2}(log n)^δ with δ > 1/2. This is, although not the sharpest rate, close to the sharpest rate of a.s. convergence, n^{−1/2}(log log n)^{1/2}, given in Kolmogorov's law of the iterated logarithm:

lim sup_n (S_n − nµ)/√(2σ^2 n log log n) = 1 a.s., and

lim inf_n (S_n − nµ)/√(2σ^2 n log log n) = −1 a.s.,

for iid r.v.s with mean µ and finite variance σ^2. We do not intend to cover the proof of Kolmogorov's law of the iterated logarithm.

(v) Kolmogorov’s strong law of large numbers.


The above SLLN requires finite second moments. The most standard classical SLLN for iid r.v.s, established by Kolmogorov, holds as long as the population mean exists. From a statistical point of view, the sample mean should always converge to the population mean as long as the population mean exists, without any further moment condition. In fact, the sample mean converges to a finite limit if and only if the population mean is finite, in which case the limit is the population mean.

Theorem 6.4. (Kolmogorov's strong law of large numbers) Suppose X, X_1, X_2, ... are iid and E(X) exists. Then,

S_n/n → E(X) a.s..

Conversely, if S_n/n → µ a.s. for some finite µ, then µ = E(X).

Proof. Suppose first E(X_1) = 0. We shall utilize the above proposition (Kolmogorov's criterion of SLLN). Consider

Y_n = X_n 1_{|X_n|≤n} − E(X_n 1_{|X_n|≤n}).

Write

Σ_{n=1}^∞ var(Y_n)/n^2 ≤ Σ_{n=1}^∞ (1/n^2) E(X^2 1_{|X|≤n}) = E(X^2 Σ_{n=1}^∞ (1/n^2) 1_{|X|≤n})

≤ E(X^2 Σ_{n≥|X|∨1} 2/(n(n+1))) ≤ E(2X^2/(|X| ∨ 1)) ≤ 2E(|X| + 1) < ∞.

It then follows from Kolmogorov's criterion of SLLN that

(1/n) Σ_{i=1}^n Y_i → 0 a.s..

Next, since E(X_n 1_{|X_n|≤n}) → E(X) = 0, we have Σ_{i=1}^n E(X_i 1_{|X_i|≤i})/n → 0. Hence,

(1/n) Σ_{i=1}^n X_i 1_{|X_i|≤i} → 0 a.s..

Observe that E(X) = 0 implies E|X| < ∞, and

E|X| < ∞ ⟺ Σ_n P(|X| > n) < ∞ ⟺ P(|X_n| > n, i.o.) = 0 ⟺ X_n/n → 0 a.s..

Therefore, Σ_{i=1}^n X_i 1_{|X_i|>i}/n → 0 a.s.. As a result, the SLLN holds.

Suppose E(X) is finite. Then the SLLN holds by considering X_i − E(X), which has mean 0.

Suppose E(X) = ∞. Then, (1/n) Σ_{i=1}^n X_i ∧ C → E(X ∧ C) a.s., which ↑ ∞ as C ↑ ∞. Since S_n ≥ Σ_{i=1}^n X_i ∧ C, the SLLN holds. Likewise for the case E(X) = −∞.

Conversely, if S_n/n → µ a.s. where µ is finite, then X_n/n → 0 a.s.. Hence, E|X| < ∞ and µ = E(X) by the SLLN just proved.

Remark. Kolmogorov's SLLN also holds for r.v.s that are pairwise independent and follow the same distribution, which is slightly more general. We have chosen to follow the historic development of the classical probability theory.
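The SLLN is easy to visualize numerically. The following minimal sketch (our own illustration; Python with NumPy assumed) tracks sample-mean paths of iid Exponential(1) variables, whose population mean is 1; every path settles near 1.

    import numpy as np

    rng = np.random.default_rng(3)

    n = 1_000_000
    checkpoints = [10**k for k in range(2, 7)]   # n = 100, ..., 1,000,000

    for path in range(3):
        X = rng.exponential(scale=1.0, size=n)   # iid with E(X) = 1
        means = np.cumsum(X) / np.arange(1, n + 1)
        print("path", path, [round(float(means[m - 1]), 4) for m in checkpoints])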

(vi). Strong law of large numbers when E(X) does not exist.

Kolmogorov's SLLN in Theorem 6.4 already shows that the classical SLLN does not hold if E(X) does not exist, i.e., if E(X^+) = E(X^−) = ∞. The SLLN then becomes quite complicated. We introduce the following theorem proved by W. Feller:


Proposition. Suppose X, X_1, ... are iid with E|X| = ∞. Suppose a_n > 0 and a_n/n is nondecreasing. Then,

lim sup |S_n|/a_n = 0 a.s. if Σ_n P(|X| ≥ a_n) < ∞;

lim sup |S_n|/a_n = ∞ a.s. if Σ_n P(|X| ≥ a_n) = ∞.

The proof is somewhat technical but still along the same lines as that of Kolmogorov's SLLN. Interested students may refer to the textbook by Durrett (page 67). We omit the details.

Example 6.2. (The St. Petersburg Paradox) See Example 1.5, in which we have shown

S_n/(n log n) → 1/log 2 in probability.

Analogous to the calculation therein,

Σ_{n=2}^∞ P(X ≥ n log n) = Σ_{n=2}^∞ P(X ≥ 2^{log(n log n)/log 2}) ≥ Σ_{n=2}^∞ 2^{−log(n log n)/log 2} = Σ_{n=2}^∞ 1/(n log n) = ∞.

By the above proposition,

lim sup S_n/(n log n) = ∞ a.s..

On the other hand, one can also show by the same calculation that, for δ > 1,

lim sup S_n/(n (log n)^δ) = 0 a.s..

The following Marcinkiewicz-Zygmund SLLN is useful in connecting the rate of convergence with the moments of the iid r.v.s.

Theorem 6.5. (Marcinkiewicz-Zygmund strong law of large numbers) Suppose X, X_1, X_2, ... are iid and E(|X|^p) < ∞ for some 0 < p < 2. Then,

(S_n − nE(X))/n^{1/p} → 0 a.s. for 1 ≤ p < 2;

S_n/n^{1/p} → 0 a.s. for 0 < p < 1.

Proof. The case p = 1 is Kolmogorov's SLLN. The cases 0 < p < 1 and 1 < p < 2 are consequences of the Corollary in part (iii) above and the Kronecker lemma.

Example 6.3. Suppose X, X_1, X_2, ... are iid and X is symmetric with P(X > t) = t^{−α} for some α > 0 and all large t.

(1). α > 2: Then E(X^2) < ∞, S_n/n → 0 a.s. and, moreover, Kolmogorov's law of the iterated logarithm gives the sharp rate of the a.s. convergence.

(2). 1 < α ≤ 2: For any 0 < p < α,

S_n/n^{1/p} → 0 a.s..

It implies that S_n/n converges to 0 a.s. at a rate faster than n^{−1+1/p}, but not at the rate n^{−1+1/α}. In particular, if α = 2, S_n/n converges to E(X) a.s. at a rate faster than n^{−β} for any 0 < β < 1/2, but not at the rate n^{−1/2}.

(3). 0 < α ≤ 1: E(X) does not exist. For any 0 < p < α,

S_n/n^{1/p} → 0 a.s..


Moreover, the above proposition implies

lim sup |S_n|/n^{1/α} = ∞ a.s. and S_n/(n^{1/α}(log n)^{δ/α}) → 0 a.s.

for any δ > 1.

Remark. In the above example, for 0 < α < 2, S_n/n^{1/α} converges in distribution to a nondegenerate distribution called a stable law. In particular, if α = 1, S_n/n converges in distribution to a Cauchy distribution. For α = 2, S_n/(n log n)^{1/2} converges to a normal distribution, and for α > 2, S_n/n^{1/2} converges to a normal distribution.
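For contrast with the SLLN sketch above, the case α = 1 can be illustrated with standard Cauchy samples (their tails satisfy P(|X| > x) ~ (2/π)/x and the mean does not exist). A minimal sketch (our own illustration; Python with NumPy assumed): the running averages S_n/n never settle, in line with the remark that S_n/n is Cauchy-distributed for every n.

    import numpy as np

    rng = np.random.default_rng(4)

    n = 1_000_000
    checkpoints = [10**k for k in range(2, 7)]

    for path in range(3):
        X = rng.standard_cauchy(size=n)                    # symmetric, E(X) does not exist
        running_mean = np.cumsum(X) / np.arange(1, n + 1)
        print("path", path, [round(float(running_mean[m - 1]), 3) for m in checkpoints])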

DIY Exercises.

Exercise 6.1. ⋆⋆⋆ Suppose S_0 ≡ 0, S_1, S_2, ... form a square integrable martingale, i.e., for k = 0, 1, ..., n, E(S_k^2) < ∞ and E(S_{k+1}|F_k) = S_k, where F_k is the σ-algebra generated by S_1, ..., S_k. Show that Kolmogorov's inequality still holds.

Exercise 6.2. ⋆ ⋆ ⋆ Prove the Corollary following Theorem 6.2.

Exercise 6.3. ⋆⋆⋆⋆ For positive independent r.v.s X_1, X_2, ..., show that the following three statements are equivalent: (a) Σ_n X_n < ∞ a.s.; (b) Σ_n E(X_n ∧ 1) < ∞; (c) Σ_n E(X_n/(1 + X_n)) < ∞.

Exercise 6.4. ⋆⋆⋆⋆ Give a counterexample to show that there exist X_1, X_2, ... iid with E(X) = 0 for which Σ_n X_n/n does not converge a.s..

Exercise 6.5 ⋆⋆⋆⋆ If X_1, ... are iid with mean µ and finite variance, then

(S_n − nµ)/(√n (log n)^δ) → 0 a.s.

for any δ > 1.

Exercise 6.6 ⋆⋆⋆ Suppose X, X_1, ... are iid. Then, (S_n − C_n)/n → 0 a.s. for some constants C_n if and only if E(|X|) < ∞.

Exercise 6.7 ⋆⋆⋆⋆ Suppose X, X_1, ... are iid with E(|X|^p) = ∞ for some 0 < p < ∞. Then, lim sup |S_n|/n^{1/p} = ∞ a.s..

Exercise 6.8 ⋆⋆⋆⋆ Suppose X_n, n ≥ 1, are independent with means µ_n and variances σ_n^2 such that µ_n → 0 and Σ_{j=1}^n σ_j^{−2} → ∞. Show that

(Σ_{j=1}^n X_j/σ_j^2) / (Σ_{j=1}^n σ_j^{−2}) → 0 a.s..

Hint: Consider the series Σ_{j=1}^n (X_j − µ_j)/(σ_j^2 Σ_{k=1}^j σ_k^{−2}).


Chapter 7. Convergence in distribution and characteristic functions.

Convergence in distribution, which can be generalized slightly to weak convergence of measures, has been introduced in Chapter 3. This chapter provides a more detailed description.

(i). Definition, basic properties and examples.

Recall that in Section 1.3 we have already defined convergence in distribution for a sequence of random variables. Here we present the same definition in terms of weak convergence of their distributions. We first note that a function F is a cdf if and only if it is right continuous, nondecreasing, and F(t) → 1 and 0 as t → ∞ and −∞, respectively.

Definition. A sequence of distribution functions F_n is said to converge to a distribution function F_∞ weakly if

(1) F_n(t) → F_∞(t) at every continuity point of F_∞; or

(2) lim inf_n F_n(B) ≥ F_∞(B) for every open set B in (−∞,∞); or

(3) lim sup_n F_n(C) ≤ F_∞(C) for every closed set C in (−∞,∞); or

(4) ∫ g(x) dF_n(x) → ∫ g(x) dF_∞(x) for every bounded continuous function g.

Here F_n(A) is defined as ∫_A dF_n(x) = ∫ 1_{x∈A} dF_n(x) for any Borel set A. The above four claims are equivalent to each other, as proved in Chapter 3.

Remark. If F_∞ is continuous, the inequalities in (2) and (3) are actually equalities. On the other hand, if the X_n all take integer values, then X_n → X in distribution is equivalent to P(X_n = k) → P(X = k) for every integer k.

Remark. (Scheffé's Theorem) Suppose X_n has density function f_n(·), f_n(t) → f(t) for every finite t, and f is a density function. Then X_n → X in distribution, where X has density f. This can be shown quite straightforwardly as follows. Since f_n + f − |f_n − f| = 2(f_n ∧ f) → 2f pointwise, Fatou's lemma gives

2 = ∫ lim inf_n (f_n(x) + f(x) − |f_n(x) − f(x)|) dx ≤ lim inf_n ∫ (f_n(x) + f(x) − |f_n(x) − f(x)|) dx

= lim inf_n (2 − ∫ |f_n(x) − f(x)| dx) = 2 − lim sup_n ∫ |f_n(x) − f(x)| dx,

so ∫ |f_n(x) − f(x)| dx → 0. Certainly, for any Borel set B,

P(X_n ∈ B) − P(X ∈ B) = ∫_B (f_n(x) − f(x)) dx ≤ ∫ |f_n(x) − f(x)| dx → 0.

In the above proof, we have used Fatou's lemma with Lebesgue measure. In fact, the monotone convergence theorem, Fatou's lemma and the dominated convergence theorem that we have established with probability measures all hold with σ-finite measures, including Lebesgue measure.

Remark. (Slutsky's Theorem) Suppose X_n → X_∞ in distribution and Y_n → c in probability, where c is a constant. Then, X_n Y_n → cX_∞ in distribution and X_n + Y_n → X_∞ + c in distribution.

We leave the proof as an exercise.

In the following, we provide some classical examples about convergence in distribution, only to show that there are a variety of important limiting distributions besides the normal distribution appearing as the limit in the CLT.

Example 7.1. (Convergence of maxima and extreme value distributions) Let M_n = max_{1≤i≤n} X_i, where the X_i are iid r.v.s with c.d.f. F(·). Then,

P(M_n ≤ t) = P(X_1 ≤ t)^n = F(t)^n.


As n → ∞, the limiting distribution of the properly scaled M_n, should it converge, should only be related to the right tail of the distribution F(·), i.e., to F(x) for large x. The following are some examples.

(a). F(x) = 1 − x^{−α} for some α > 0 and all large x. Then, for any t > 0,

P(M_n/n^{1/α} ≤ t) = (1 − n^{−1}t^{−α})^n → e^{−t^{−α}}.

(b). F(x) = 1 − |x|^β for x ∈ [−1, 0] and some β > 0. Then, for any t < 0,

P(n^{1/β} M_n ≤ t) = (1 − n^{−1}|t|^β)^n → e^{−|t|^β}.

(c). F(x) = 1 − e^{−x} for x > 0, i.e., X_i follows the exponential distribution. Then, for all t,

P(M_n − log n ≤ t) → e^{−e^{−t}}.

These limiting distributions are called extreme value distributions.
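Case (c) is easy to check by simulation. The sketch below (our own illustration; Python with NumPy assumed) compares the empirical distribution of M_n − log n for exponential samples with the Gumbel limit e^{−e^{−t}}.

    import numpy as np

    rng = np.random.default_rng(5)

    n, reps = 10_000, 20_000
    M = np.empty(reps)
    for r in range(reps):                        # one replicate at a time to keep memory modest
        M[r] = rng.exponential(size=n).max()
    M -= np.log(n)                               # centre: M_n - log n

    for t in (-1.0, 0.0, 1.0, 2.0):
        empirical = float(np.mean(M <= t))
        limit = float(np.exp(-np.exp(-t)))
        print(t, round(empirical, 4), round(limit, 4))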

Example 7.2. (Birthday problem) Suppose X_1, X_2, ... are iid with uniform distribution on the integers {1, 2, ..., N}. Let

T_N = min{k : there exists a j < k such that X_j = X_k}.

Then, for k ≤ N,

P(T_N > k) = P(X_1, ..., X_k all take different values)

= Π_{j=2}^k (1 − P(X_j takes one of the values of X_1, ..., X_{j−1}))

= Π_{j=2}^k (1 − (j−1)/N) = exp{Σ_{j=1}^{k−1} log(1 − j/N)}.

Then, for any fixed x > 0, as N → ∞,

P(T_N/N^{1/2} > x) = P(T_N > N^{1/2}x) ≈ exp{Σ_{1≤j<N^{1/2}x} log(1 − j/N)}

≈ exp{−Σ_{1≤j<N^{1/2}x} j/N} ≈ exp{−(1/N)·N^{1/2}x(N^{1/2}x + 1)/2} ≈ exp{−x^2/2}.

In other words, T_N/N^{1/2} converges in distribution to the distribution F(t) = 1 − exp(−t^2/2), t ≥ 0.

Suppose now N = 365. By this approximation, P(T_365 > 22) ≈ 0.5153 and P(T_365 > 50) ≈ 0.0326, meaning that with 22 (respectively 50) people there is about a 50% (respectively 3%) chance that all of them have different birthdays.
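The approximation can be compared with the exact product formula P(T_N > k) = Π_{j=1}^{k−1}(1 − j/N). A small sketch in plain Python (our own illustration):

    import math

    def p_all_different(N, k):
        # Exact P(T_N > k): k people all have different birthdays among N days
        p = 1.0
        for j in range(1, k):
            p *= 1.0 - j / N
        return p

    N = 365
    for k in (22, 50):
        exact = p_all_different(N, k)
        approx = math.exp(-k * k / (2 * N))      # the exp(-x^2/2) approximation with x = k/sqrt(N)
        print(k, round(exact, 4), round(approx, 4))

The exact values, about 0.524 and 0.030, are close to the approximations 0.5153 and 0.0326 quoted above; the small gap comes from replacing k(k−1)/2 by roughly k^2/2 and dropping higher-order terms of log(1 − j/N).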

Example 7.3. (Law of rare events) Suppose there are in total n flights worldwide each year, and each flight has chance p_n of having an accident, independently of the other flights. There are on average λ accidents a year worldwide. The distribution of the number of accidents is Bin(n, p_n) with np_n close to λ. This distribution approximates the Poisson distribution with mean λ; namely,

Bin(n, p_n) → P(λ) if n → ∞ and np_n → λ > 0.

Proof. For any fixed k ≥ 0 and n ≥ k,

P(Bin(n, p_n) = k) = [n!/(k!(n−k)!)] p_n^k (1 − p_n)^{n−k} = [n!/(k!(n−k)!)] · [(np_n)^k/n^k] · [(1 − p_n)^n/(1 − p_n)^k]

= (1/k!) · [n(n−1)···(n−k+1)/n^k] · (np_n)^k · e^{n log(1−p_n)}/(1 − p_n)^k

→ λ^k e^{−λ}/k!, as n → ∞.
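A quick numerical comparison of the two probability functions (our own illustration, in plain Python with the standard library; λ = 2 and the values of n are arbitrary choices):

    import math

    def binom_pmf(n, p, k):
        return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

    def poisson_pmf(lam, k):
        return lam**k * math.exp(-lam) / math.factorial(k)

    lam = 2.0
    for n in (10, 100, 1000):
        p = lam / n
        print(n, [round(binom_pmf(n, p, k), 4) for k in range(6)])
    print("Poisson", [round(poisson_pmf(lam, k), 4) for k in range(6)])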


Example 7.4. (The secretary/marriage problem) Suppose there are n secretaries to be interviewed one by one and, right after each interview, you must make an immediate decision to "hire or fire" the interviewee. You observe only the relative ranks of the interviewed candidates. What is the optimal strategy to maximize the chance of hiring the best of the n candidates? (Assume no ties in performance.)

One type of strategy is to give up the first m candidates, whatever their performance in the interview. Afterwards, the first one that outperforms all previous candidates is hired. In other words, starting from the (m+1)-th interview, the first candidate that outperforms the first m candidates is hired; otherwise you settle with the last candidate. The chance that the k-th best among all n candidates is hired is

P_k = Σ_{j=m+1}^n P(the k-th best is the j-th interviewee and is hired)

= Σ_{j=m+1}^n (1/n) P(the best among the first j−1 appears in the first m, and the k−1 candidates better than the k-th best all appear after the j-th)

≈ Σ_{j=m+1}^n [m/(j−1)] × (1/n) × ((n−j)/n)^{k−1}.

Let n → ∞ and m ≈ nc, where c is the fraction of interviews to be given up. Then the probability of hiring the k-th best is

P_k ≈ c Σ_{j=m}^n (1/j)(1 − j/n)^{k−1} ≈ c ∫_c^1 [(1−x)^{k−1}/x] dx = cA_k, say.

Since A_{k+1} = A_k − (1−c)^k/k for k ≥ 1, and A_1 = −log c, it follows that

P_k → c(−log c − Σ_{j=1}^{k−1} (1−c)^j/j), as n → ∞.

In particular, P_1 → −c log c. The function −c log c is maximized at c = 1/e ≈ 0.368. The best strategy is therefore to give up the first 36.8% of the interviews and then hire the first candidate who is the best so far. The chance of hiring the best overall is also about 36.8%. The chance of settling with the last person is also c. This phenomenon is also called the 1/e law.

You may wish to formulate this problem in terms of a sequence of random variables.
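The 1/e law can be checked by simulation. In the sketch below (our own illustration; Python with NumPy assumed), the candidates' qualities are a random permutation; the strategy skips the first m = ⌊cn⌋ candidates and then hires the first one better than everyone seen so far, settling for the last candidate otherwise.

    import numpy as np

    rng = np.random.default_rng(6)

    def prob_hire_best(n, c, reps=20_000):
        m = int(c * n)
        wins = 0
        for _ in range(reps):
            perf = rng.permutation(n)            # performance scores; n - 1 is the overall best
            threshold = perf[:m].max() if m > 0 else -1
            hired = perf[-1]                     # default: settle with the last candidate
            for x in perf[m:]:
                if x > threshold:                # first candidate beating all previous ones
                    hired = x
                    break
            wins += int(hired == n - 1)
        return wins / reps

    n = 200
    for c in (0.2, 0.3, 1 / np.e, 0.5):
        print(round(float(c), 3), prob_hire_best(n, c))   # peaks near c = 1/e ~ 0.368 at about 0.37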

(ii). Some theoretical results about convergence in distribution.

(a). Fatou's Lemma. Suppose X_n ≥ 0 and X_n → X_∞ in distribution. Then E(X_∞) ≤ lim inf_n E(X_n).

Proof. Write

E(X_∞) = ∫_0^∞ P(X_∞ ≥ t) dt ≤ ∫_0^∞ lim inf_n P(X_n ≥ t) dt ≤ lim inf_n ∫_0^∞ P(X_n ≥ t) dt = lim inf_n E(X_n),

where the first inequality holds for (Lebesgue) almost every t and the second is Fatou's lemma for Lebesgue measure.

The dominated convergence theorem also holds with convergence in distribution, which is left as an exercise.

(b). Continuous mapping theorem: Suppose X_n → X_∞ in distribution and g(·) is a continuous function. Then, g(X_n) → g(X_∞) in distribution.


Proof. For any bounded continuous function f, f(g(·)) is still a bounded continuous function. Hence E(f(g(X_n))) → E(f(g(X_∞))), proving that g(X_n) → g(X_∞) in distribution.

(c). Tightness and convergent subsequences.

In studying the convergence of a sequence of numbers, it is very useful that boundedness of the sequence guarantees a convergent subsequence. The same is true for uniformly bounded monotone functions, such as, for example, distribution functions. This is the following Helly's Selection Theorem, which is useful in studying weak convergence of distributions.

Helly's Selection Theorem. A sequence of cumulative distribution functions F_n always contains a subsequence, say F_{n_k}, that converges, at every continuity point of F_∞, to a function F_∞ which is nondecreasing and right continuous. If F_∞(−∞) = 0 and F_∞(∞) = 1, then F_∞ is a distribution function and F_{n_k} converges to F_∞ weakly.

Proof. Let t_1, t_2, ... be an enumeration of the rational numbers. In the sequence F_n(t_1), n ≥ 1, there is always a convergent subsequence; denote it by n_k^{(1)}, k = 1, 2, .... Among this subsequence there is again a further subsequence, denoted n_k^{(2)}, k = 1, 2, ..., with n_1^{(2)} > n_1^{(1)}, such that F_{n_k^{(2)}}(t_2) is convergent. Repeat this process of selection indefinitely. Let n_k = n_1^{(k)} be the first element of the k-th sub-subsequence. Then, for any fixed m, {n_k : k ≥ m} is a subsequence of {n_k^{(l)} : k ≥ 1} for every l ≤ m. Hence F_{n_k} is convergent at every rational number. Denote the limit at each rational t_l by F*(t_l). The monotonicity of the F_{n_k} implies the monotonicity of F* on the rationals. Define, for all t, F_∞(t) = inf{F*(t_l) : t_l > t, t_l rational}. Then F_∞ is right continuous and nondecreasing, and it follows that, if s is a continuity point of F_∞, then F_{n_k}(s) → F_∞(s).

Not every sequence of distribution functions F_n converges weakly to a distribution function. The easiest example is F_n with F_n(n) − F_n(n−) = 1, i.e., P(X_n = n) = 1. Then F_n(t) → 0 for all t ∈ (−∞,∞). If the F_n all have little probability mass near ∞ or −∞, then convergence to a function which is not a distribution function can be avoided. A sequence of distribution functions F_n is called tight if, for any ε > 0, there exists an M > 0 such that lim sup_{n→∞} (1 − F_n(M) + F_n(−M)) < ε; or, in other words,

sup_n (1 − F_n(x) + F_n(−x)) → 0 as x → ∞.

Proposition. Every tight sequence of distribution functions contains a subsequence that converges weakly to a distribution function.

Proof. Repeat the proof of Helly's Selection Theorem. Tightness ensures that the limit is a distribution function.

(iii). Characteristic functions.

The characteristic function is one of the most useful tools in developing the theory of convergence in distribution. The technical details of characteristic functions involve some knowledge of complex analysis. We shall view them only as a tool and try not to elaborate on the technicalities.

1. Definition and examples.

For a r.v. X with distribution F, its characteristic function is

ψ(t) = E(e^{itX}) = E(cos(tX) + i sin(tX)) = ∫ e^{itx} dF(x), t ∈ (−∞,∞),

where i = √−1.

Some basic properties are:

ψ(0) = 1; |ψ(·)| ≤ 1; ψ(·) is continuous on (−∞,∞).

If ψ is the characteristic function of X, then e^{itb}ψ(at) is the characteristic function of aX + b.


A product of characteristic functions is still a characteristic function; and for independent r.v.s X_1, ..., X_n, the characteristic function of X_1 + ... + X_n is the product of those of X_1, ..., X_n.

The following table lists the characteristic functions of some commonly used distributions:

Distribution          Density/Probability function                                   Characteristic function (of t)
Degenerate            P(X = a) = 1                                                   e^{iat}
Binomial Bin(n, p)    P(X = k) = (n!/(k!(n−k)!)) p^k (1−p)^{n−k}, k = 0, 1, ..., n   (pe^{it} + 1 − p)^n
Poisson P(λ)          P(X = k) = λ^k e^{−λ}/k!, k = 0, 1, ...                        exp(λ(e^{it} − 1))
Normal N(µ, σ^2)      f(x) = e^{−(x−µ)^2/(2σ^2)}/√(2πσ^2), x ∈ (−∞,∞)                e^{iµt − σ^2 t^2/2}
Uniform Unif[0,1]     f(x) = 1, x ∈ [0,1]                                            (e^{it} − 1)/(it)
Gamma(α, λ)           f(x) = λ^α x^{α−1} e^{−λx}/Γ(α), x > 0                         (1 − it/λ)^{−α}
Cauchy                f(x) = 1/[π(1 + x^2)], x ∈ (−∞,∞)                              e^{−|t|}
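The table can be sanity-checked by comparing the empirical characteristic function (1/m) Σ_{j≤m} e^{itX_j} of a large sample with the closed form. A sketch for the normal and uniform rows (our own illustration; Python with NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(7)
    m = 500_000
    t = np.array([0.5, 1.0, 2.0])

    def empirical_cf(sample, t):
        # (1/m) sum_j exp(i t X_j), evaluated at each entry of t
        return np.exp(1j * np.outer(t, sample)).mean(axis=1)

    Z = rng.standard_normal(m)
    U = rng.random(m)

    print("N(0,1)    empirical:", np.round(empirical_cf(Z, t), 3))
    print("N(0,1)    formula  :", np.round(np.exp(-t**2 / 2), 3))
    print("Unif[0,1] empirical:", np.round(empirical_cf(U, t), 3))
    print("Unif[0,1] formula  :", np.round((np.exp(1j * t) - 1) / (1j * t), 3))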

2. Levy’s inversion formula.

Proposition. Suppose X is a r.v. with characteristic function ψ(·). Then, for all a < b,

lim_{T→∞} (1/(2π)) ∫_{−T}^T [(e^{−ita} − e^{−itb})/(it)] ψ(t) dt = P(a < X < b) + (1/2)(P(X = a) + P(X = b)).

Proof. The proof uses Fubini's theorem to interchange the expectation with the integration, together with the fact that ∫_0^∞ sin(x)/x dx = π/2. We omit the details.

The above proposition clearly implies that two different distributions cannot have the same characteristic function, as formally presented in the following corollary.

Corollary. There is a one-to-one correspondence between distribution functions and characteristic functions.

3. Levy’s continuity theorem.

Theorem 7.1 (Levy's continuity theorem). Let F_n, F_∞ be cdfs with characteristic functions ψ_n, ψ_∞. Then,

(a). If F_n → F_∞ weakly, then ψ_n(t) → ψ_∞(t) for every t.

(b). If ψ_n(t) → ψ(t) for every t, and ψ(·) is continuous at 0, then F_n → F weakly, where F is a cdf with characteristic function ψ.

Proof. Part (a) directly follows from the definition of convergence in distribution, since e^{itx} is a bounded continuous function of x for every t. The proof of part (b) uses the Levy inversion formula. We omit the details.

Remark. Levy's continuity theorem enables us to show convergence in distribution through pointwise convergence of characteristic functions. This shall be our approach to establishing the central limit theorem.

DIY Exercises:

Exercise 7.1. ⋆ ⋆ ⋆ Prove Slutsky’s Theorem.

Exercise 7.2. ⋆⋆⋆ (Dominated convergence theorem) Suppose X_n → X_∞ in distribution and |X_n| ≤ Y with E(Y) < ∞. Show that E(X_n) → E(X_∞).

Exercise 7.3. ⋆⋆ Suppose X_n is independent of Y_n, and X is independent of Y. Use characteristic functions to show that, if X_n converges to X in distribution and Y_n converges to Y in distribution, then X_n + Y_n converges in distribution to X + Y.


Chapter 8. Central limit theorem.

The most ideal case of the CLT is that of iid random variables with finite variance. Although it is a special case of the more general Lindeberg-Feller CLT, it is the most standard one and its proof contains the essential ingredients for establishing more general CLTs. Throughout the chapter, Φ(·) is the cdf of the standard normal distribution N(0,1).

(i). Central limit theorem (CLT) for iid r.v.s.

The following lemma plays a key role in the proof of CLT.

Lemma 8.1. For any real x and n ≥ 1,

|e^{ix} − Σ_{j=0}^n (ix)^j/j!| ≤ min(|x|^{n+1}/(n+1)!, 2|x|^n/n!).

Consequently, for any r.v. X with characteristic function ψ and finite second moment,

|ψ(t) − [1 + itE(X) − (t^2/2)E(X^2)]| ≤ (t^2/6) E(min(|t||X|^3, 6|X|^2)).    (8.1)

Proof. The proof relies on the identity

e^{ix} − Σ_{j=0}^n (ix)^j/j! = (i^{n+1}/n!) ∫_0^x (x − s)^n e^{is} ds = (i^n/(n−1)!) ∫_0^x (x − s)^{n−1}(e^{is} − 1) ds,

which can be shown by induction and by taking derivatives. The middle term is bounded by |x|^{n+1}/(n+1)!, and the last by 2|x|^n/n!.

Theorem 8.2. Suppose X, X_1, ..., X_n, ... are iid with mean µ and finite variance σ^2 > 0. Then,

(S_n − nµ)/√(nσ^2) → N(0,1) in distribution.

Proof. Without loss of generality, let µ = 0. Let ψ be the common characteristic function of the X_i. Observe that, by dominated convergence,

E(min(|t_n||X|^3, 6|X|^2)) → 0 as |t_n| → 0.

The characteristic function of S_n/√(nσ^2) is, by applying the above lemma with t_n = t/√(nσ^2),

E(e^{itS_n/√(nσ^2)}) = Π_{j=1}^n E(e^{itX_j/√(nσ^2)}) = ψ^n(t/√(nσ^2))

= [1 + (it/√(nσ^2))E(X) − (t^2/(2nσ^2))E(X^2) + o(1/n)]^n = [1 − t^2/(2n) + o(1/n)]^n

→ e^{−t^2/2},

which is the characteristic function of N(0,1). Then Levy's continuity theorem implies the above CLT.
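A quick empirical check of Theorem 8.2 (our own illustration; Python with NumPy assumed): standardized sums of iid Unif[0,1] variables compared with Φ at a few points, with Φ evaluated via the error function.

    import math
    import numpy as np

    rng = np.random.default_rng(8)

    def Phi(x):
        # standard normal cdf via the error function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    n, reps = 200, 50_000
    mu, sigma2 = 0.5, 1.0 / 12.0                      # mean and variance of Unif[0,1]
    S = rng.random((reps, n)).sum(axis=1)             # reps independent copies of S_n
    Z = (S - n * mu) / math.sqrt(n * sigma2)          # (S_n - n mu)/sqrt(n sigma^2)

    for x in (-1.5, 0.0, 1.0, 2.0):
        print(x, round(float(np.mean(Z <= x)), 4), round(Phi(x), 4))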

In the case where the common variance is not finite, the partial sum, after proper normalization, may or may not converge to a normal distribution. The following theorem provides a sufficient and necessary condition. The key point here is whether there exists an appropriate truncation, a trick that we have used many times before.


Theorem 8.3. Suppose X, X_1, X_2, ... are iid and nondegenerate. Then, (S_n − a_n)/b_n converges to a normal distribution for some constants a_n and 0 < b_n → ∞, if and only if

x^2 P(|X| > x)/E(X^2 1_{|X|≤x}) → 0, as x → ∞.    (8.2)

The proof is omitted. We note that (8.2) holds if X has finite variance σ^2 > 0, in which case the CLT of Theorem 8.2 holds with a_n = nE(X) and b_n = √n σ. Theorem 8.3 is of interest when E(X^2) = ∞. In this case, one can choose to truncate the X_i at

c_n = sup{c : nE(|X|^2 1_{|X|≤c})/c^2 ≥ 1}.

With some calculation, condition (8.2) ensures

nP(|X| > c_n) → 0 and nE(|X|^2 1_{|X|≤c_n})/c_n^2 → 1.

Separate S_n into two parts, one with the X_i beyond ±c_n and the other with the X_i bounded by ±c_n. The former takes value 0 with probability tending to 1. The latter, when standardized by

a_n = nE(X 1_{|X|≤c_n}) and b_n = √(nE(X^2 1_{|X|≤c_n})) ≈ c_n,

converges to N(0,1), which can be shown by repeating the proof of Theorem 8.2 or by citing the Lindeberg-Feller CLT. We note that b_n^2 ≈ n var(X 1_{|X|≤c_n}) by (8.2).

Example 8.1. Recall Example 6.3, in which X, X_1, X_2, ... are iid and symmetric with P(|X| > x) = x^{−α} for some α > 0 and all large x. Then, Theorem 8.3 implies that (S_n − a_n)/b_n → N(0,1) if and only if α ≥ 2. Indeed, when α > 2, the common variance is finite and the CLT applies. When α = 2,

S_n/(n log n)^{1/2} → N(0, σ^2)

for some σ^2. When α < 2, the condition in Theorem 8.3 cannot hold. In fact, S_n, when properly normalized, converges to a non-normal (stable) distribution.

(ii). The Lindeberg-Feller CLT.

Theorem 8.4 (Lindeberg-Feller CLT). Suppose X_1, ..., X_n, ... are independent r.v.s with mean 0 and variances σ_n^2. Let s_n^2 = Σ_{j=1}^n σ_j^2 denote the variance of the partial sum S_n = X_1 + ··· + X_n. If, for every ε > 0,

(1/s_n^2) Σ_{j=1}^n E(X_j^2 1_{|X_j|>εs_n}) → 0,    (8.3)

then S_n/s_n → N(0,1). Conversely, if max_{j≤n} σ_j^2/s_n^2 → 0 and S_n/s_n → N(0,1), then (8.3) holds.

Proof. Sufficiency ((8.3) implies the CLT): The Lindeberg condition (8.3) implies

max_{1≤j≤n} (σ_j^2/s_n^2) ≤ ε^2 + (1/s_n^2) max_{1≤j≤n} E(X_j^2 1_{|X_j|>εs_n}) → 0,    (8.4)

by letting n → ∞ and then ε ↓ 0. Observe that for every real x > 0, |e^{−x} − 1 + x| ≤ x^2/2. Moreover, for complex z_j and w_j with |z_j| ≤ 1 and |w_j| ≤ 1,

|Π_{j=1}^n z_j − Π_{j=1}^n w_j| ≤ Σ_{j=1}^n |z_j − w_j|,    (8.5)

which can be proved by induction. With Lemma 8.1, and using E(X_j) = 0 and E(X_j^2) = σ_j^2, it follows that, for any ε > 0,

|E(e^{itX_j/s_n}) − e^{−t^2σ_j^2/(2s_n^2)}|

≤ |E(e^{itX_j/s_n}) − E(1 + itX_j/s_n − (tX_j)^2/(2s_n^2))| + |(1 − t^2σ_j^2/(2s_n^2)) − e^{−t^2σ_j^2/(2s_n^2)}|

≤ E[min(t^2X_j^2/s_n^2, |tX_j|^3/(6s_n^3))] + t^4σ_j^4/(8s_n^4)

≤ E((t^2X_j^2/s_n^2) 1_{|X_j|>εs_n}) + E((|tX_j|^3/(6s_n^3)) 1_{|X_j|≤εs_n}) + t^4σ_j^4/(8s_n^4)

≤ (t^2/s_n^2) E(X_j^2 1_{|X_j|>εs_n}) + (ε|t|^3/s_n^2) E(X_j^2) + (t^4σ_j^2/s_n^2) max_{1≤k≤n} (σ_k^2/s_n^2).

Then, for any fixed t,

|E(e^{itS_n/s_n}) − e^{−t^2/2}| = |Π_{j=1}^n E(e^{itX_j/s_n}) − Π_{j=1}^n e^{−t^2σ_j^2/(2s_n^2)}|

≤ Σ_{j=1}^n |E(e^{itX_j/s_n}) − e^{−t^2σ_j^2/(2s_n^2)}|    by (8.5)

≤ Σ_{j=1}^n [(t^2/s_n^2) E(X_j^2 1_{|X_j|>εs_n}) + (ε|t|^3/s_n^2) E(X_j^2) + (t^4σ_j^2/s_n^2) max_{1≤k≤n} (σ_k^2/s_n^2)]

= (t^2/s_n^2) Σ_{j=1}^n E(X_j^2 1_{|X_j|>εs_n}) + ε|t|^3 + t^4 max_{1≤k≤n} (σ_k^2/s_n^2)

→ ε|t|^3, as n → ∞, by (8.3) and (8.4).

Since ε > 0 is arbitrary, it follows that E(e^{itS_n/s_n}) → e^{−t^2/2} for all t. Levy's continuity theorem implies S_n/s_n → N(0,1).

Necessity: Let ψ_j be the characteristic function of X_j. The asymptotic normality is equivalent to Π_{j=1}^n ψ_j(t/s_n) → e^{−t^2/2}. Notice that (8.1) implies

|ψ_j(t/s_n) − 1| ≤ 2t^2σ_j^2/s_n^2.    (8.6)

Write, as n → ∞,

|Σ_{j=1}^n [ψ_j(t/s_n) − 1] + t^2/2|

= |Σ_{j=1}^n [ψ_j(t/s_n) − 1 − log ψ_j(t/s_n)] + Σ_{j=1}^n log ψ_j(t/s_n) + t^2/2|

≤ Σ_{j=1}^n |ψ_j(t/s_n) − 1 − log ψ_j(t/s_n)| + o(1)

≤ Σ_{j=1}^n |ψ_j(t/s_n) − 1|^2 + o(1)

≤ max_{1≤k≤n} |ψ_k(t/s_n) − 1| × Σ_{j=1}^n |ψ_j(t/s_n) − 1| + o(1)

≤ 4 max_{1≤k≤n} (t^2σ_k^2/s_n^2) × Σ_{j=1}^n (t^2σ_j^2/s_n^2) + o(1)    by (8.6)

= o(1), by the assumption max_{j≤n} σ_j^2/s_n^2 → 0.


On the other hand, by the definition of the characteristic function, the above expression gives, as n → ∞,

o(1) = Σ_{j=1}^n [ψ_j(t/s_n) − 1] + t^2/2 = Σ_{j=1}^n E(e^{itX_j/s_n} − 1) + t^2/2

= Σ_{j=1}^n E(cos(tX_j/s_n) − 1) + t^2/2 + i Σ_{j=1}^n E(sin(tX_j/s_n))

= Σ_{j=1}^n E[(cos(tX_j/s_n) − 1)1_{|X_j|>εs_n}] + Σ_{j=1}^n E[(cos(tX_j/s_n) − 1)1_{|X_j|≤εs_n}] + t^2/2

  + imaginary part (immaterial).

Since cos(x) − 1 ≥ −x^2/2 for all real x,

(1/s_n^2) Σ_{j=1}^n E(X_j^2 1_{|X_j|>εs_n}) = 1 − (2/t^2) Σ_{j=1}^n E((t^2X_j^2/(2s_n^2)) 1_{|X_j|≤εs_n})

≤ (2/t^2) (t^2/2 + Σ_{j=1}^n E[(cos(tX_j/s_n) − 1)1_{|X_j|≤εs_n}])

≤ (2/t^2) (|Σ_{j=1}^n E[(cos(tX_j/s_n) − 1)1_{|X_j|>εs_n}]| + o(1))

≤ (2/t^2) (Σ_{j=1}^n 2P(|X_j| > εs_n) + o(1))

≤ (4/t^2) Σ_{j=1}^n σ_j^2/(εs_n)^2 + o(1)    by the Chebyshev inequality

≤ 4/(t^2ε^2) + o(1).

Since t can be chosen arbitrarily large, the Lindeberg condition holds.

Remark. Sufficiency was proved by Lindeberg in 1922 and necessity by Feller in 1935. The Lindeberg-Feller CLT is one of the most far-reaching results in probability theory. Nearly all generalizations of various types of central limit theorems spin off from the Lindeberg-Feller CLT, such as, for example, CLTs for martingales, for renewal processes, or for weakly dependent processes. The insight of the Lindeberg condition (8.3) is that the "wild" values of the random variables, compared with s_n, the standard deviation of S_n used as the normalizing constant, are insignificant and can be truncated off without affecting the general behavior of the partial sum S_n.

Example 8.2. Suppose the X_n are independent and

P(X_n = n) = P(X_n = −n) = n^{−α}/4 and P(X_n = 0) = 1 − n^{−α}/2,

with 0 < α < 3. Then, σ_n^2 = E(X_n^2) = n^{2−α}/2 and s_n^2 = Σ_{j=1}^n j^{2−α}/2, which increases to ∞ at the order of n^{3−α}. Note that the Lindeberg condition (8.3) is equivalent to n^2/n^{3−α} → 0, i.e., 0 < α < 1. On the other hand, max_{1≤j≤n} σ_j^2/s_n^2 → 0. Therefore, it follows from Theorem 8.4 that S_n/s_n → N(0,1) if and only if 0 < α < 1.

Example 8.3. Suppose the X_n are independent and P(X_n = 1) = 1/n = 1 − P(X_n = 0). Then,

[S_n − log n]/√(log n) → N(0,1) in distribution.

It is clear that E(X_n) = 1/n and var(X_n) = (1 − 1/n)/n. So, E(S_n) = Σ_{i=1}^n 1/i, and var(S_n) = Σ_{i=1}^n (1 − 1/i)/i ≈ log n. As the X_n are all bounded by 1 and var(S_n) ↑ ∞, the Lindeberg


condition is satisfied. Therefore, by the CLT,

(S_n − Σ_{i=1}^n 1/i)/[Σ_{i=1}^n (1 − 1/i)/i]^{1/2} → N(0,1) in distribution.

Then, [S_n − log n]/√(log n) → N(0,1) in distribution, since |log n − Σ_{i=1}^n 1/i| ≤ 1 and var(S_n)/log n → 1.
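Example 8.3 is simple to check numerically. The sketch below (our own illustration; Python with NumPy assumed) simulates the independent indicators, standardizes S_n by its exact mean Σ 1/i and variance, and compares the result with Φ; centering by log n instead changes things only by about 0.577/√(log n) at finite n.

    import math
    import numpy as np

    rng = np.random.default_rng(9)

    def Phi(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    n, reps, chunk = 5_000, 20_000, 500
    p = 1.0 / np.arange(1, n + 1)                     # P(X_k = 1) = 1/k
    S = np.concatenate([
        (rng.random((chunk, n)) < p).sum(axis=1)      # S_n for `chunk` replicates at a time
        for _ in range(reps // chunk)
    ])

    H = p.sum()                                       # E(S_n) = sum 1/i ~ log n + 0.577
    V = (p * (1.0 - p)).sum()                         # var(S_n) ~ log n
    Z = (S - H) / math.sqrt(V)

    for x in (-1.0, 0.0, 1.0):
        print(x, round(float(np.mean(Z <= x)), 4), round(Phi(x), 4))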

Theorem 8.2 as well as the following Lyapunov CLT are both special cases of the Lindeberg-Feller CLT. Nevertheless, they are convenient for applications.

Corollary (Lyapunov CLT). Suppose the X_n are independent with mean 0 and Σ_{j=1}^n E(|X_j|^δ)/s_n^δ → 0 for some δ > 2. Then S_n/s_n → N(0,1).

Proof. For any ε > 0, as n → ∞,

(1/s_n^2) Σ_{j=1}^n E(X_j^2 1_{|X_j|>εs_n}) = Σ_{j=1}^n E((X_j^2/s_n^2) 1_{|X_j|/s_n>ε}) ≤ (1/ε^{δ−2}) Σ_{j=1}^n E(|X_j|^δ/s_n^δ) → 0.

The Lindeberg condition (8.3) holds and hence the CLT holds.

In Example 8.2, for any δ > 2, Σ_{j=1}^n E|X_j|^δ = Σ_{j=1}^n j^δ j^{−α}/2, which increases at the order n^{δ−α+1}, while s_n^δ increases at the order of n^{(3−α)δ/2}. A simple calculation shows that, when 0 < α < 1, the Lyapunov CLT holds.

(iii). CLT for arrays of random variables.

Very often the Lindeberg-Feller CLT is presented in the form of arrays of random variables, as given in the textbook.

Theorem 8.5 (CLT for arrays of r.v.s). Let X_{n,1}, ..., X_{n,n} be n independent random variables with mean 0 such that, as n → ∞,

Σ_{j=1}^n var(X_{n,j}) → 1 and Σ_{j=1}^n E(X_{n,j}^2 1_{|X_{n,j}|>ε}) → 0 for any ε > 0.

Then, S_n ≡ X_{n,1} + ··· + X_{n,n} → N(0,1).

This theorem is slightly more general than the Lindeberg-Feller CLT, although the proof is identical to that of the first part of Theorem 8.4. Theorem 8.4 is a special case of Theorem 8.5 obtained by letting X_{n,i} = X_i/s_n. Thus the X_{n,k} are understood as the usual r.v.s normalized by the standard deviation of the partial sums, and S_n in this theorem is already standardized.

DIY Exercises

Exercise 8.1 ⋆⋆⋆ Suppose the X_n are independent with

P(X_n = n^α) = P(X_n = −n^α) = 1/(2n^β) and P(X_n = 0) = 1 − 1/n^β,

with 2α > β − 1. Show that the Lindeberg condition holds if and only if 0 ≤ β < 1.

Exercise 8.2 ⋆⋆⋆ Suppose the X_n are iid with mean 0 and variance 1. Let a_n > 0 be such that s_n^2 = Σ_{j=1}^n a_j^2 → ∞ and a_n/s_n → 0. Show that Σ_{i=1}^n a_i X_i/s_n → N(0,1).

Exercise 8.3 ⋆⋆⋆ Suppose X_1, X_2, ... are independent and X_n = Y_n + Z_n, where Y_n takes values 1 and −1 with chance 1/2 each, and P(Z_n = ±n) = 1/(2n^2) = (1 − P(Z_n = 0))/2. Show that the Lindeberg condition does not hold, yet S_n/√n → N(0,1).

Exercise 8.4 ⋆⋆⋆ Suppose X_1, X_2, ... are iid nonnegative r.v.s with mean 1 and finite variance σ^2 > 0. Show that 2(√S_n − √n) → N(0, σ^2).


Review of Probability Theory

1. Probability Calculation.

Calculation of probabilities of events for discrete outcomes (e.g., coin tossing, roll of dice, etc.)

Calculation of the probability of certain events for given density functions of 1 or 2 dimensions.

2. Probability space.

(1). Set operations: ∪, ∩, complement.

(2). σ-algebra. Definition and implications (the collection of sets which is closed on set operations).

(3). Kolmogorov’s trio of probability space.

(4). Independence of events and conditional probabilities of events.

(5). Borel-Cantelli lemma.

(6). Sets and set-indicator functions. 1_{A∩B} = 1_A 1_B and 1_{A^c} = 1 − 1_A. 1_{{A_n, i.o.}} = lim sup_n 1_{A_n}, etc.

{A_n, i.o.} = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k = lim_{n→∞} ∪_{k=n}^∞ A_k = lim sup_n A_n (corresponding to lim sup_n 1_{A_n}(ω)).

ω ∈ {A_n, i.o.} means ω ∈ A_n for infinitely many A_n. (Mathematically precisely, there exists a subsequence n_k → ∞ such that ω ∈ A_{n_k} for all n_k.)

∪_{n=1}^∞ ∩_{k=n}^∞ A_k = lim_{n→∞} ∩_{k=n}^∞ A_k = lim inf_n A_n (corresponding to lim inf_n 1_{A_n}(ω)).

ω ∈ lim inf_n A_n means ω ∈ A_n for all large n. (Mathematically precisely, there exists an N such that ω ∈ A_n for all n ≥ N.)

(7). ⋆ completion of probability space.

3. Random variables.

(1). Definitions.

(2). c.d.f., density or probability functions.

(3). Expectation: definition, interpretation as weighted (by chance) average.

(4). Properties:

(i). Dominated convergence. (Proof and Application).

(ii). ⋆ Fatou’s lemma and monotone convergence.

(iii). Jensen’s inequalities

(iv) Chebyshev inequalities.

(5). Independence of r.v.s.

(6). ⋆ Conditional distribution and expectation given a σ-algebra (Definition and simple properties.)

(7). Commonly used distributions and r.v.s.

4. Convergence.

(1). Definition of a.e., in prob, Lp and in distribution convergence.

(2). ⋆ Equivalence of four definitions of in distribution convergence.

(3). The diagram about the relation among convergence modes.

(4). The technique of truncating off r.v.s

5. LLN.

(1). WLLN. Application of the theorem. (⋆ the proof)

(2). Kolmogorov’s inequality (⋆ the proof).

(3). Corollary (⋆ the proof)


(4). ⋆ Kolmogorov’s 3-series theorem

(5). Kolmogorov’s criterion for SLLN. (⋆ the proof)

(6) Kronecker lemma. (⋆ the proof)

(7). SLLN for iid r.v.s (⋆ the proof)

(8). Application.

6. CLT.

(1) Characteristic function. (Definition and simple properties.)

(i) ψ continuous. |ψ| ≤ 1 and ψ(0) = 1.

(ii) F_X = F_Y ⟺ ψ_X = ψ_Y. (⋆ proof of ⇐=.)

(iii) F_n → F ⟺ ψ_n → ψ. (⋆ proof of ⇐=.)

(iv) X and Y are independent =⇒ ψ_{X+Y} = ψ_X ψ_Y.

(2). CLT for iid r.v.s

Theorem, application and the proof for the case of bounded r.v.s.

(3). Lindeberg condition.

Application and Heuristic interpretation. (⋆ proof.)

(4). Application.

Remark. ⋆ means not required in the midterm exam.