chapter 2 some basic large sample theorychapter 2 some basic large sample theory 1. modes of...
TRANSCRIPT
1
Chapter 2Some Basic Large Sample Theory
1. Modes of ConvergenceConvergence in distribution, →d
Convergence in probability, →p
Convergence almost surely, →a.s.
Convergence in r−th mean, →r
Metrics on probability distributions
2. Classical Limit TheoremsWeak and strong laws of large numbersClassical (Lindeberg) CLTLiapounov CLTLindeberg-Feller CLTCramer-Wold device; Mann-Wald theorem; Slutsky’s theoremDelta-method
3. ExamplesOne-sample t−testOne-sample normal theory test of varianceTwo-sample tests for meansLinear regression with non-normal errorsThe correlation coefficientPearson’s chi-square tests, simple null hypothesis
4. Replacing →d by →a.s.
5. Empirical Measures and Empirical ProcessesThe empirical distribution function; the uniform empirical processThe Brownian bridge processFinite-dimensional convergenceA hint of Donsker’s Mann-Wald type theoremGeneral empirical measures and processes
6. The Partial Sum Process and Brownian motionBrownian motionExistence of Brownian motion and Brownian bridge as continuous processes
7. Sample Quantiles
2
Chapter 2
Some Basic Large Sample Theory
1 Modes of Convergence
Consider a probability space (Ω,A, P ). For our first three definitions we suppose that X, Xn, n ≥ 1are all random variables defined on this one probability space.
Definition 1.1 We say that Xn converges a.s. to X, denoted by Xn →a.s. X, if
Xn(ω)→ X(ω) for all ω ∈ A where P (Ac) = 0 ,(1)
or, equivalently, if, for every ε > 0
P ( supm≥n|Xm −X| > ε)→ 0 as n→∞ .(2)
Definition 1.2 We say that Xn converges in probability to X and write Xn →p X if for everyε > 0
P (|Xn −X| > ε)→ 0 as n→∞ .(3)
Definition 1.3 Let 0 < r < ∞. We say that Xn converges in r−th mean to X, denoted byXn →r X, if
E|Xn −X|r → 0 as n→∞ for functions Xn, X ∈ Lr(P ) .(4)
Definition 1.4 We say that Xn converges in distribution to X, denoted by Xn →d X, or Fn → F ,or L(Xn)→ L(X) with L referring to the the “law” or “distribution”, if the distribution functionsFn and F of Xn and X satisfy
Fn(x)→ F (x) as n→∞ for each continuity point x of F .(5)
Note that Fn ≡ 1[1/n,∞) →d 1[0,∞) ≡ F even though Fn(0) = 0 does not converge to 1 = F (0).The statement →d will carry with it the implication that F corresponds to a (proper) probabilitymeasure P .
Definition 1.5 A sequence of random variables Xn is uniformly integrable if
limλ→∞
lim supn→∞
E|Xn|1[|Xn|≥λ]
= 0 .(6)
3
4 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
Theorem 1.1 (Convergence implications).A. If Xn →a.s. X, then Xn →p X.B. If Xn →p X, then Xn′ →a.s. X for some subsequence n′.C. If Xn →r X, then Xn →p X.D. If Xn →p X and |Xn|r is uniformly integrable, then Xn →r X.
If Xn →p X and lim supn→E|Xn|r ≤ E|X|r, then Xn →r X.E. If Xn →r X then Xn →r′ X for all 0 < r′ ≤ r.F. If Xn →p X, then Xn →d X.G. Xn →p X if and only if every subsequence n′ contains a further subsequence n′′ for which
Xn′′ →a.s. X.
Theorem 1.2 (Vitali’s theorem). Suppose that Xn ∈ Lr(P ) where 0 < r < ∞ and Xn →p X.Then the following are equivalent:A. |Xn|r are uniformly integrable.B. Xn →r X.C. E|Xn|r → E|X|r.D. lim supnE|Xn|r ≤ E|X|r.
Before proving the theorems we need a short review of some facts about convex functions andsome inequalities. We first briefly review convexity. A real valued function f is convex if
f(αx+ (1− α)y) ≤ αf(x) + (1− α)f(y)(7)
for all x, y and all 0 ≤ α ≤ 1. This holds if and only if
f
(1
2(x+ y)
)≤ 1
2[f(x) + f(y)](8)
for all x, y provided f is continuous and bounded. Note that (8) holds if and only if
f(s) ≤ 1
2[f(s− r) + f(s+ r)] for all r, s .(9)
Also
f ′′(x) ≥ 0 for all x implies f is convex .(10)
We call f strictly convex if strict inequality holds in any of the above. If f is convex, then thereexists a linear function l such that f(x) ≥ l(x) with equality at a prespecified x0 in the interior ofthe domain of f ; this is called the supporting hyperplane theorem.
Definition 1.6 Assuming the following expectations (integrals) exist,
µ ≡ E(X) = the mean of X .(11)
σ2 ≡ V ar[X] ≡ E(X − µ)2 = the variance of X .(12)
E(Xk) = k-th moment of X for k ≥ 1 an integer .(13)
E|X|r = r-th absolute moment of X for r ≥ 0 .(14)
E(X − µ)k = k- th central moment of X .(15)
Cov[X,Y ] ≡ E[(X − µX)(Y − µY )] = the covariance of X and Y .(16)
1. MODES OF CONVERGENCE 5
Proposition 1.1 If E|X|r < ∞, then E|X|r′ and E(Xk) are finite for all r′ ≤ r and integersk ≤ r.
Proof. Now |x|r′ ≤ 1 + |x|r; and integrability is equivalent to absolute integrability. 2
Proposition 1.2 V ar(X) ≡ σ2 <∞ if and only if E(X2) <∞. In this case σ2 = E(X2)− µ2.
Proof. Suppose that σ2 <∞. Then σ2 + µ2 = E(X − µ)2 + E(2µX − µ2) = E(X2). Supposethat E(X2) <∞. Then E(X2)− µ2 = E(X2)− E(2µX − µ2) = E(X − µ)2 = V ar[X]. 2
Proposition 1.3 (cr−inequality). E|X + Y |r ≤ crE|X|r + crE|Y |r where cr = 1 for 0 < r ≤ 1and cr = 2r−1 for r ≥ 1.
Proof. Case 1: r ≥ 1. Then |x|r is a convex function of x; take second derivatives. Thus|(x+ y)/2|r ≤ [|x|r + |y|r]/2; and now take expectations.Case 2: 0 < r ≤ 1: Now |x|r is concave and ↑ for x ≥ 0; examine derivatives. Thus
|x+ y|r − |x|r =
∫ x+y
xrtr−1dt =
∫ y
0r(x+ s)r−1ds
≤∫ y
0rsr−1ds = |y|r ,
and now take expectations. 2
Proposition 1.4 (Holder inequality). E|XY | ≤ E1/r|X|rE1/s|Y |s ≡ ‖X‖r‖Y ‖s for r > 1 where1/r + 1/s = 1 defines s. When the expectations are finite we have equality if and only if thereexists A and B not both 0 such that A|X|r = B|Y |s a.e.
Proof. The result is trivial if E|X|r = 0 or ∞. Likewise for E|Y |s. So suppose that E|X|r > 0.Now
|ab| ≤ |a|r
r+|b|s
s, as in the figure .
Now let a = |X|/‖X‖r and b = |Y |/‖Y ‖s; and take expectations. Equality holds if and only if|Y |/‖Y ‖s = (|X|/‖X‖r)1/(1−s) a.e.; if and only if
|Y |s
E|Y |s=
(|X|‖X‖r
) ss−1
=|X|r
E|X|ra.e.
if and only if there exist A,B 6= 0 such that A|X|r = B|Y |s. This also gives the next inequality asan immediate consequence. 2
Proposition 1.5 (Cauchy-Schwarz inequality). (E|XY |)2 ≤ E(X2)E(Y 2) with equality if andonly if there exists A, B not both 0 such that A|X| = B|Y | a.e.
6 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
Remark 1.1 Thus for non-degenerate random variables (i.e. non-zero variance) with finite vari-ance we have
−1 ≤ ρ ≤ 1(17)
where
ρ ≡ Corr[X,Y ] ≡ Cov[X,Y ]√V ar[X]V ar[Y ]
(18)
is called the correlation of X and Y . Note that ρ = 1 if and only if X − µX = A(Y − µY ) for someA > 0 and ρ = −1 if and only if X − µX = A(Y − µY ) for some A < 0. Thus ρ measures lineardependence, not dependence.
Proposition 1.6 logE|X|r is convex in r for r ≥ 0. It is linear if and only if |X| = c a.e. for somec.
Proof. Let 0 ≤ r ≤ s. Apply the Cauchy-Schwarz inequality to |X|(s−r)/2 and |X|(s+r)/2 andtake logs to get
logE|X|s ≤ 1
2
logE|X|s−r + logE|X|s+r
.
2
Proposition 1.7 (Liapunov inequality). Let X be a random variable. Then E1/r|X|r is ↑ in r forr ≥ 0.
Proof. The slope of the chord of y = logE|X|r is ↑ in r by proposition 1.6. That is,(1/r) logE|X|r is ↑ in r. We used P (Ω) = 1 < ∞ to show that E|X|r′ < ∞ if E|X|r < ∞ forr′ ≤ r in proposition 1.1. 2
Exercise 1.1 Let µr ≡ E|X|r. For r ≥ s ≥ t ≥ 0 we have µs−tr µr−st ≥ µr−ts .
Proposition 1.8 (Minkowski’s inequality). For r ≥ 1 we have E1/r|X+Y |r ≤ E1/r|X|r+E1/r|Y |r.
Proof. It is trivial for r = 1. Suppose that r > 1. Then for any measure
E|X + Y |r ≤ E|X||X + Y |r−1 + E|Y ||X + Y |r−1(a)
≤ ‖X‖r + ‖Y ‖r ‖|X + Y |r−1‖s by Holder’s inequality
= ‖X‖r + ‖Y ‖rE1/s|X + Y |(r−1)s
= ‖X‖r + ‖Y ‖rE1/s|X + Y |r .
If E|X + Y |r = 0, it is trivial. If not, we divide to get the result. 2
Proposition 1.9 (Basic inequality). Let g ≥ 0 be an even function which is ↑ on [0,∞). Then forall random variables X and for all ε > 0
P (|X| ≥ ε) ≤ Eg(X)
g(ε).(19)
1. MODES OF CONVERGENCE 7
Proof. Now
Eg(X) = Eg(X)1[|X|≥ε]+ Eg(X)1[|X|<ε](a)
≥ Eg(X)1[|X|≥ε] ≥ g(ε)E1[|X|≥ε]= g(ε)P (|X| ≥ ε)
as claimed. 2
The next two inequalities are immediate corollaries of the basic inequality.
Proposition 1.10 (Markov’s inequality).
P (|X| ≥ ε) ≤ E|X|r
εrfor all ε > 0 .(20)
Proposition 1.11 (Chebychev’s inequality).
P (|X − µ| ≥ ε) ≤ V ar[X]
ε2for all ε > 0 .(21)
Proposition 1.12 (Jensen’s inequality). If g is convex on (a, b) where −∞ ≤ a < b ≤ ∞ and ifP (X ∈ (a, b)) = 1 and E(X) is finite (and hence a < E(X) < b), then
g(EX) ≤ Eg(X) .(22)
If g is strictly convex, then equality holds in (22) if and only if X = E(X) with probability 1.
Proof. Let l be a supporting hyperplane to g at EX. Then
Eg(X) ≥ El(X)
= l(EX) since l is linear and P (Ω) = 1
= g(EX) .
Now g(X)− l(X) ≥ 0. Thus Eg(X) = El(X) if and only if g(X) = l(X) almost surely, if and onlyif X = EX almost surely. 2
Exercise 1.2 For any function h ∈ L2(0, 1), define a new function Th on (0, 1) by Th(u) =u−1
∫ u0 h(s)ds for 0 < u ≤ 1. Note that T is an averaging operator. use the Cauchy-Schwarz
inequality to show that ∫ 1
0Th(u)2du ≤ 4
∫ 1
0h2(u)du .
Thus T : L2(0, 1) → L2(0, 1) is a bounded linear operator with ‖T‖ ≤ 2. [Hint: write Th(u) =u−1
∫ u0 h(s)sαs−αds for some α.]
Exercise 1.3 Suppose that X ∼ Binomial(n, p). Use the basic inequality proposition 1.9 withg(x) = exp(rx), r > 0, to show that for ε ≥ 1
P
(X/n
p≥ ε)≤ exp(−nph(ε))(23)
8 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
where h(ε) = ε(log(ε)− 1) + 1. From this, show that for λ > 0 we have
P
(√n
(X
n− p)≥ λ
)≤ exp
(−λ
2
2pψ
(λ
p√n
)),(24)
where ψ(x) ≡ 2h(1 + x)/x2 is monotone decreasing on [0,∞) with ψ(0) = 0 and ψ(x) ∼ 2 log(x)/xas x→∞.
Proof. (Proof of theorem 1.1). A follows easily since, for any fixed ε > 0
P (|Xn −X| ≥ ε) ≤ P (∪m≥n[|Xm −X| ≥ ε])→ 0 .
To prove B, first note that Xn →p X implies that for every k ≥ 1 there exists an interger nk suchthat P (|Xnk
−X| > 1/2k) < 2−k; we can assume nk ↑ in k; if not, take n′k ≡ nk + k. Let Am ≡∪k≥m[|Xnk
−X| > 2−k] so that P (Am) ≤∑∞
k=m 2−k = 2−m+1. On Acm = ∩k≥m[|Xnk−X| ≤ 2−k|],
|Xnk−X| ≤ 2−k for all k ≥ m; i.e. on Acm, Xnk
(ω) → X(ω). Thus Xnk→ X on A ≡ ∪∞m=1A
cm,
and P (Ac) = P (∩∞m=1Am) = limm P (Am) ≤ limm 2−m+1 = 0.Markov’s inequality gives C via P (|Xn − X| ≥ ε) ≤ E|Xn − X|r/εr → 0. Holder’s inequality
with 1/(r/r′) + 1/q = 1 gives E via
E|Xn −X|r′ ≤ E|Xn −X|r
′(r/r′)r′/rE1q1/q
= E|Xn −X|rr′/r → 0 ;(a)
or, alternatively, use Liapunov’s inequality.Vitali’s theorem 1.2 gives D.Consider F. Let Xn ∼ Fn and X ∼ F . Now
Fn(t) = P (Xn ≤ t) ≤ P (X ≤ t+ ε) + P (|Xn −X| ≥ ε)≤ F (t+ ε) + ε for all n ≥ some Nε .
Also
Fn(t) = P (Xn ≤ t) ≥ P (X ≤ t− ε and |Xn −X| ≤ ε) ≡ P (AB)
≥ P (A)− P (Bc) = F (t− ε)− P (|Xn −X| > ε)
≥ F (t− ε)− ε for n ≥ some N ′ε .
Thus
F (t− ε)− ε ≤ lim inf Fn(t) ≤ lim supFn(t) ≤ F (t+ ε) + ε .(b)
If t is a continuity point of F , then letting ε→ 0 in (b) gives Fn(t)→ F (t). Thus Xn →d X.Half of G follows from B since any Xn′ →p X. We turn to the other half. Assume that Xn →p X
fails. Then there exists ε0 > 0 for which δ0 ≡ lim supP (|Xn − X| ≥ ε0) > 0. Thus there existsa subsequence n′ for which P (|Xn′ − X| ≥ ε0) → δ0 > 0. Thus neither Xn′ nor any furthersubsequence Xn′′ can →a.s. X. This is a contradiction. Thus Xn →p X. 2
Proof of Vitali’s theorem, theorem 1.2: Suppose that A holds: thus |Xn|rn≥1 is uniformly inte-grable. Now Xn′ →a.s. X for some subsequence by Theorem 1.1. Thus E|X|r = E(lim inf |Xn′ |r) ≤lim inf E|Xn′ |r < ∞ by Fatou’s lemma and uniform integrability. Thus X ∈ Lr(P ). Now the
1. MODES OF CONVERGENCE 9
Cr−inequality gives |Xn − X|r ≤ Cr|Xn|r + |X|r, so that the random variables |Xn − X|r areuniformly integrable. Thus we can write, for each fixed ε > 0,
E|Xn −X|r ≤ E|Xn −X|r1[|Xn−X|r≥ε] + E|Xn −X|r1[|Xn−X|r≤ε]
≤ E|Xn −X|r1[|Xn−X|r≥ε] + εr
≤ ε+ εr
by the equivalence theorem for uniform integrability by choosing ε so small that lim supP (|Xn −X| ≥ ε) ≤ δε. Thus B holds. B implies C by Theorem 1.1 part E. C implies D trivially. Supposethat D holds. Define fλ to be the bounded continuous function on [0,∞) given by
fλ(x) =
xr, |x|r ≤ λ0, |x|r ≥ λ+ 1λr − λr(x− λ), λ < |x|r < λ+ 1.
Then
lim supn
E|Xn|r1[|Xn|r>λ+1] = lim supn
E|Xn|r − E|Xn|r1[|Xn|r≤λ+1]
≤ E|X|r − lim inf E|Xn|r1[|Xn|r≤λ+1]≤ E|X|r − lim inf Efλ(Xn)
= E|X|r − Efλ(X) by the Helly-Bray theorem 2.3.5
≤ E|X|r1[|X|r≥λ]→ 0 as λ→∞ since X ∈ Lr(P ),
thus A holds. 2
Some Metrics on Probability Distributions
Suppose that P and Q are two probability measures on some measurable space (or samplespace) (X,A). Let P denote the collection of all probability distributions on (X,A).
Definition 1.7 The total variation metric dTV on P is defined by
dTV (P,Q) = supA∈A|P (A)−Q(A)| .(25)
Definition 1.8 The Hellinger metric H on P is defined by
H2(P,Q) =1
2
∫|√p−√q|2dµ(26)
where p, q are densities with respect to any common dominating measure µ of P and Q (the choiceµ = P +Q always works).
Proposition 1.13 For P,Q ∈ P, let p and q denote densities with respect to any common domi-nating measure µ (µ = P +Q always works). Then
supA∈A|P (A)−Q(A)| = 1
2
∫|p− q|dµ .(27)
In other words,
dTV (P,Q) =1
2
∫|p− q|dµ .(28)
10 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
Proof. Let r ≡ p− q. Note that 0 =∫rdµ =
∫r+dµ−
∫r−dµ, so that
∫r+dµ =
∫r−dµ, and∫
|p− q|dµ =
∫r+dµ+
∫r−dµ = 2
∫r+dµ .
Let B ≡ [p− q ≥ 0] = [r ≥ 0]. Then for any measurable set A,
P (A)−Q(A) =
∫Apdµ−
∫Aqdµ =
∫A
(p− q)dµ
=
∫A∩B
(p− q)dµ+
∫A∩Bc
(p− q)dµ
≤∫Ar+dµ ≤
∫r+dµ =
1
2
∫|p− q|dµ .(a)
By a similar argument,
Q(A)− P (A) ≤∫Ar−dµ ≤
∫r−dµ =
1
2
∫|p− q|dµ .(b)
Combining (a) and (b) yields
|P (A)−Q(A)| ≤ 1
2
∫|p− q|dµ .(c)
On the other hand
|P (B)−Q(B)| = |∫B
(p− q)dµ| =∫r+dµ =
1
2
∫|p− q|dµ .(d)
The claimed equality follows immediately from (c) and (d). 2
Proposition 1.14 (Scheffe’s theorem). Suppose that Pnn≥1, and P are probability distribu-tions on a measurable space (X,A) with corresponding densities pnn≥1, and p with respect to adominating measure µ, and suppose that pn → p almost everywhere with respect to µ. Then
dTV (Pn, P )→ 0 .(29)
Proof. From the proof of proposition 1.13 it follows that
dTV (Pn, P ) = dTV (P, Pn) =
∫r+n dµ(a)
where r+n = (p−pn)+ satisfies r+n →a.e. 0 and r+n ≤ p for all n with∫pdµ = 1 <∞. The conclusion
follows from (a) and the dominated convergence theorem. 2
Example 1.1 Suppose that Xn,1, . . . , Xn,n are i.i.d. Bernoulli(pn) random variables with pn = λ/nfor some λ > 0. Then Sn ≡
∑ni=1Xn,j ∼ Binomial(n, pn) and it is well-known that Sn →d Y ∼
1. MODES OF CONVERGENCE 11
Poisson(λ). But more is true: for fixed 0 ≤ k < n we have
pn(k) ≡ P (Sn = k) =
(n
k
)pkn(1− pn)n−k
=n!
(n− k)!
1
k!
(λ
n
)k (1− λ
n
)n−k=
n(n− 1) · · · (n− k + 1)
n · n · · ·n· λ
k
k!
(1− λ
n
)n−k→ 1 · λ
k
k!e−λ = Pλ(Y = k).
Thus by Scheffe’s theorem, with µ = counting measure on 0, 1, 2, . . ., Sn ∼ Pn, Y ∼ Pλ,
dTV (Pn, Pλ) =1
2
∞∑k=0
|P (Sn = k)− Pλ(Y = k)| → 0.
How fast does this convergence happen? An inequality of Le Cam and Hodges (1960) handlesan even more general situation: if Xn,1, . . . , Xn,n are independent with Xn,j ∼ Bernoulli(pn,j),j = 1, . . . , n. then with λn =
∑nj=1 pn,j ,
dTV (Pn, Pλn) ≤n∑j=1
p2n,j .
When pn,j = λ/n ≡ pn for all 1 ≤ j ≤ n, the right side in the last display becomes λ2/n. Forfurther inequalities for Poisson approximation via the methods of Stein and Chen, see Barbour,Holst, and Janson (1992).
Example 1.2 Suppose that Xr ∼ tr; thus Xr = Z/√Vr/r where Z ∼ N(0, 1) and Vr ∼ χ2
r are
independent. Since Vr/rd= r−1
∑rj=1 Z
2j where Z1, . . . , Zr are i.i.d. N(0, 1), we have Vr/r →p 1,
and hence Xr →p Z ∼ N(0, 1). Again, more is true. The density of Xr is given by
fr(x) =Γ((r + 1)/2)√πrΓ(r/2)
· 1(1 + x2
r
)(r+1)/2→ 1√
2πe−x
2/2
by using Stirling’s formula and (1 + b/r)r → eb as r → ∞ for fixed b ∈ R. By Scheffe’s theoremwith µ being Lebesgue measure on R it follows that with Xr ∼ Pr
dTV (Pr, N(0, 1))→ 0.
How fast does this happen? I claim that dTV (Pr, N(0, 1)) ≤ .7/r for all r ≥ 1. Reference?
Example 1.3 (Poincare’s lemma). Let Z1, Z2, . . . be independent N(0, 1) random variables. Thenwith Zn ≡ (Z1, . . . , Zn)T , the vector Zn/‖Zn‖ has a Uniform distribution on the (surface of) theunit sphere Sn−1 in Rn, and
√nZn/‖Zn‖ has a uniform distribution on
√nSn−1, the (surface
of) the unit sphere in Rn with radius√n. But then
√nZk/‖Zn‖ has the same distribution as√
n(Y1, . . . , Yk) where (Y1, . . . , Yn) has a uniform distribution on Sn−1. Since√nZk‖Zn‖
=Zk(
n−1∑n
i=1 Z2i
)1/2 →p Zk,
12 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
by the weak law of large numbers, it follows that if (Y1, . . . , Yn) ∼ Uniform(Sn−1), then
√nY n,k =
√n(Y1, . . . , Yk)→d Zk ∼ Nk(0, I).
Furthermore, by computing the density of√nY n,k it is easily shown that f√nY n,k
(y) → φk(y) for
each fixed y ∈ Rk where φk is the density of Zk. Thus by Scheffe’s theorem, with√nY n,k ∼ Qn,k
and Zk ∼ Pk,dTV (Qn,k, Pk)→ 0 as n→∞
for fixed k (and even for k = o(n)). Diaconis and Freedman (1987) show that
dTV (Qn,k, Pk) ≤3(k + 3)
n− k − 3for 1 ≤ k ≤ n− 4.
Exercise 1.4 Show that the Hellinger distance H(P,Q) does not depend on the choice of a dom-inating measure µ.
Exercise 1.5 Show that
H2(P,Q) = 1−∫√pq dµ ≡ 1− ρ(P,Q)(30)
where the Hellinger affinity ρ(P,Q) satisfies ρ(P,Q) ≤ 1 with equality if and only if P = Q.
Exercise 1.6 Show that
dTV (P,Q) = 1−∫p ∧ qdµ ≡ 1− η(P,Q)(31)
where the total variation affinity η(P,Q) satisfies η(P,Q) ≤ 1 with equality if and only if P = Q.
The Hellinger and total variation metrics are different, but they metrize the same topoplgy onP, as follows from the inequalities in the following proposition.
Proposition 1.15 (Inequalities relating Hellinger and total variation metrics).
H2(P,Q) ≤ dTV (P,Q) ≤ H(P,Q)1 + ρ(P,Q)1/2 ≤√
2H(P,Q) .(32)
Exercise 1.7 Show that (32) holds.
2. CLASSICAL LIMIT THEOREMS 13
2 Classical Limit Theorems
We now state some of the classical limit theorems of probability theory which are of frequent usein statistics.
Proposition 2.1 (WLLN). If X,X1, . . . , Xn, . . . are i.i.d. with mean µ (so E|X| < ∞ and µ =E(X), then Xn →p µ.
Proposition 2.2 (SLLN). If X,X1, . . . , Xn, . . . are i.i.d with mean µ (so E|X| < ∞ and µ =E(X)), then Xn →a.s. µ.
Proposition 2.3 (CLT). If X,X1, . . . , Xn are i.i.d. with mean µ and variance σ2 (so E|X|2 <∞),then
√n(Xn − µ)→d N(0, σ2).
Proposition 2.4 (Multivariate CLT). If X,X1, . . . , Xn are i.i.d. random vectors in Rd with meanµ = E(X) and covariance matrix Σ = E(X − µ)(X − µ)′ (so E(X ′X) = E‖X‖2 < ∞), then√n(X − µ)→d Nd(0,Σ).
Proposition 2.5 (Liapunov CLT). Let Xn1, . . . , Xnn be row independent random variables withµni = E(Xni), σ
2ni ≡ V ar(Xni), and γni ≡ E|Xni − µni|3 < ∞. Let µn ≡
∑n1 µni, σ
2n =
∑ni=1 σ
2ni,
γn ≡∑n
1 γni. If γn/σ3n → 0, then
∑ni=1(Xni − µni)/σn →d N(0, 1).
Proposition 2.6 (Lindeberg-Feller CLT). Let Xni be row independent with 0 means and finitevariances σ2ni ≡ V ar(Xni). Let Sn ≡
∑ni=1Xni and σ2n =
∑ni=1 σ
2ni. Then both Sn/σn →d N(0, 1)
and maxσ2ni/σ2n : 1 ≤ i ≤ n → 0 if and only if the Lindeberg condition
1
σ2n
n∑i=1
E|Xni|21[|Xni|≥εσn] → 0 for all ε > 0(1)
holds.
Proposition 2.7 (The Cramer-Wold device). Random vectors Xn in Rd satisfy Xn →d X if andonly if a′Xn →d a
′X in R for all a ∈ Rd.
Proposition 2.8 (Continuous mapping or Mann - Wald theorem). Suppose that g : Rd → R iscontinuous a.s. PX . Then:A. If Xn →a.s. X then g(Xn)→a.s. g(X).B. If Xn →p X then g(Xn)→p g(X).C. If Xn →d X then g(Xn)→d g(X).
Proposition 2.9 (Slutsky’s theorem). Suppose that An →p a, Bn →p b, where a, b are constants,and Xn →d X. Then AnXn +Bn →d aX + b.
Proposition 2.10 (g′-theorem or the delta-method). Suppose that Zn ≡ an(Xn− b)→d Z in Rm
where an →∞, and suppose that g : Rm → Rk has a derivative g′ at b; here g′ is a k ×m matrix.Then
an(g(Xn)− g(b))→d g′(b)Z .(2)
14 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
Definition 2.1 A sequence of random variables Xn is said to be bounded in probability, andwe write Xn = Op(1), if
limλ→∞
lim supn→∞
P (|Xn| ≥ λ) = 0 .(3)
If Yn →p 0, then we write Yn = op(1). For any sequence of non-negative real numbers an we writeXn = Op(an) if Xn/an = Op(1), and we write Yn = op(an) if Yn/an = op(1).
Proposition 2.11 If Xn →d X, then Xn = Op(1).
Exercise 2.1 Prove Proposition 2.11.
Exercise 2.2 (a) Show that if Xn = Op(1) and Yn = op(1), then XnYn = op(1).(b) Show that if Xn = Op(an) and Yn = Op(bn) then Xn + Yn = Op(cn) where cn = maxan, bn.(c) Show that if Xn = Op(an) and Yn = Op(bn), then XnYn = Op(anbn).
Proposition 2.12 (Polya - Cantelli lemma). If Fn →d F and F is continuous, then ‖Fn−F‖∞ ≡sup−∞<x<∞ |Fn(x)− F (x)| → 0.
Exercise 2.3 Suppose that ξ1, . . . , ξn are i.i.d. Uniform(0, 1).(a) Show that nξn:1 = nξ(1) →d Exponential(1).(b) What is the joint limiting distribution of (nξn:1, nξn:2)?(c) Can you extend the result of (b) to (ξn:1, . . . , ξn:k) for a fixed k ≥ 1?(d) How would you extend (c) to the situation with kn →∞ as n→∞?
Exercise 2.4 Suppose that X1, . . . , Xn are independent Exponential(λ) random variables withdistribution function Fλ(x) = 1− exp(−λx) for x ≥ 0.(a) We expect Xn:n to be on the order of bn ≡ F−1λ (1− 1/n). Compute this explicitly.(b) Find a sequence of constants an so that an(Xn:n− bn)→d “something” and find “something”.
3. EXAMPLES. 15
3 Examples.
Example 3.1 (One-sample t−test) Suppose that X1, . . . , Xn are i.i.d. with E(X1) = µ andV ar(X1) = σ2. Consider testing H : µ ≤ µ0 versus K : µ > µ0. The normal theory test is“reject H if Tn ≥ tn−1,α” where
Tn ≡√n(Xn − µ0)
Sn
with S2n ≡ (n − 1)−1
∑n1 (Xi − Xn)2 and where P (tn−1 ≥ tn−1,α) = α. We are interested in the
behavior of this test when the Xi’s are not normally distributed.(a) What if µ = µ0 is true? Note that by the Lindeberg central limit theorem
√n(Xn − µ0) →d
N(0, σ2) when µ = µ0 is true, and by the WLLN and Slutsky’s theorem
S2n =
n
n− 1
1
n
n∑i=1
(Xi − µ0)2 − (X − µ0)2→p 1
σ2 − 0
= σ2.
Thus by Slutsky’s theorem again, Tn →d Z ∼ N(0, 1) when µ = µ0 is true, and we have Pµ0(Tn ≥tn−1,α)→ P (Z ≥ zα) = α.(b) What if µ > µ0 is true, with µ fixed? In this case
Tn =
√n(Xn − µ)
Sn+
√n(µ− µ0)Sn
→p Z +∞/σ =∞,
so Pµ(Tn > tn−1,α)→ P (Z +∞ > zα) = 1.(c) What if µ = µn > µ0 with
√n(µn − µ0)→ c > 0? Then it will usually hold that
Tn =
√n(Xn − µn)
Sn+
√n(µn − µ0)
Sn→d Z + c/σ
where we may need to apply a Lindeberg-Feller or Liapunov CLT to justify the convergence tonormality in the first term. If this holds, then
Pµn(Tn > tn−1,α)→ P (Z + c/σ > zα) = P (Z > zα − c/σ) = 1− Φ(zα − c/σ)
gives the limiting power of the test under the local alternatives µn. Note that 1−Φ(zα− c/σ) > αfor c > 0.
Example 3.2 (One sample normal - theory test of variance) Now suppose that X1, . . . , Xn arei.i.d. with E(X1) = µ, V ar(X1) = σ2, and µ4 ≡ E(X1 − µ)4 <∞.(a) Now with Yi ≡ (Xi − µ)2 ∼ (σ2, µ4 − σ4),
√n(n−1
∑n1 (Xi − µ)2 − σ2)√
2σ2=
√n(Y n − σ2)√
2σ2
→dN(0, µ4 − σ4)√
2σ2
= N
(0,µ4 − σ4
2σ4
)= N
(0,
2σ4 + µ4 − 3σ4
2σ4
)= N(0, 1 + 2−1γ2) with γ2 ≡
µ4σ4− 3.
16 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
(b) Since
n−1n∑1
(Xi −X)2 = n−1n∑1
(Xi − µ)2 − (X − µ)2
we have√n(n−1
∑n1 (Xi −Xn)2 − σ2)√
2σ2=
√n(Y n − σ2)√
2σ2−√n(Xn − µ)(Xn − µ)√
2σ2
→d N(0, 1 + 2−1γ2)−N(0, 1) · 0 = N(0, 1 + 2−1γ2)
by Slutsky’s theorem. Thus with S2n ≡ (n− 1)−1
∑n1 (Xi −Xn)2,
√n(S2
n − σ2)√2σ2
→d N(0, 1 + 2−1γ2).
Now consider testing H : σ = σ0 versus K : σ > σ0. If Xi ∼ N(µ, σ20), then (n − 1)S2n/σ
20 ∼ χ2
n−1under H, so the usual normal theory test is “reject H if (n − 1)S2
n/σ20 > χ2
n−1,α”. Then since
γ2(N(µ, σ2)) = 0 we have
α = Pσ0,Norm
((n− 1)S2
n
σ20> χ2
n−1,α
)= Pσ0,Norm
(√n
2
(S2n
σ20− 1
)>
√n
2
(χ2n−1,αn− 1
− 1
))→ P (Z ≥ zα) = α,
which forces√n
2
(χ2n−1,αn− 1
− 1
)→ zα.
Now suppose we carried out the normal theory test, but the Xi’s are not normal. Then, under H,
Pσ0
((n− 1)S2
n
σ20> χ2
n−1,α
)= Pσ0
(√n
2
(S2n
σ20− 1
)>
√n
2
(χ2n−1,αn− 1
− 1
))→ P (N(0, 1 + 2−1γ2) ≥ zα) 6= α
when γ2 6= 0. In general the asymptotic size is smaller than α if γ2 < 0, but the asymptotic size isgreater than α if γ2 > 0.
Example 3.3 (Two-sample tests for means) Suppose that X1, . . . , Xm are i.i.d. with mean µ andvariance σ2, and that Y1, . . . , Yn are i.i.d. with mean ν and variance τ2, independent of the Xj ’s.If we suppose that λN ≡ m/N ≡ m/(m+ n)→ λ ∈ [0, 1], then√
mn
N
(Xm − Y n − (µ− ν)
)=
√n
N
√m(Xm − µ)−
√m
N
√n(Y n − ν)
→d
√1− λZ1 −
√λZ2 (Z1, Z2) ∼ N2(0,diag(σ2, τ2))
∼ N(0, (1− λ)σ2 + λτ2).
3. EXAMPLES. 17
Thus we see that
Xm − Y n − (µ− ν)√S2Xm +
S2Yn
=
√mnN
(Xm − Y n − (µ− ν)
)√nN S
2X + m
N S2Y
→d N(0, 1)
by Slutsky’s theorem. On the other hand, again by Slutsky’s theorem,
Tm,n(µ, ν) ≡√
mnN
(Xm − Y n − (µ− ν)
)√(m−1)S2
X+(n−1)S2Y
N−2
→dN(0, (1− λ)σ2 + λτ2)√
λσ2 + (1− λ)τ2
= N
(0,
(1− λ)σ2 + λτ2
λσ2 + (1− λ)τ2
)6= N(0, 1)
unless λ = 1/2 or σ2 = τ2.
Since the two-sample t−test of H : µ ≤ ν versus K : µ > ν rejects H if Tm,n(0, 0) > tN−2,α, itfollows that the test is not size (or level) robust against violations of the assumption σ2 = τ2 whenλ 6= 1/2.
Example 3.4 (Simple linear regression with non-normal errors.) Suppose that Yi = α + β(xi −x) + εi for i = 1, . . . , n where x ≡ n−1
∑n1 xi and ε1, . . . , εn are i.i.d. with mean zero and finite
variance, V ar(ε1) = σ2; the εi’s are not assumed to be normally distributed. In matrix form
Y = Xβ + ε ≡
1 x1 − x...
...1 xn − x
( αβ
)+ ε.
The least squares estimators α and β of α and β are given by
α = Y , β =
∑n1 (xi − x)Yi∑n1 (xi − x)2
.
Claim: if max1≤i≤n(xi − x)2/∑n
1 (xi − x)2 → 0, then
(XTX)1/2(β − β) =
( √n(α− α)√∑n
1 (xi − x)2(β − β)
)→d N2(0, σ
2I2).(1)
Here is a partial proof. Now√n(α− α) =
√n(Y − α) =
√nε→d N(0, σ2)
by the Lindeberg CLT. Thus the first coordinate in (1) converges to the claimed limit marginally.We now use the Lindeberg-Feller CLT to show that the same is true for the second coordinate, andhence the two claimed marginal convergences hold. Note that√√√√ n∑
1
(xi − x)2(β − β) = σ
∑n1 (xi − x)εi
σ√∑n
1 (xi − x)2≡ σSn
σn
18 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
in the context of the Lindeberg-Feller CLT where Xn,i ≡ (xi − x)εi for i = 1, . . . , n, and henceµn,i = EXn,i = 0, σ2n,i = V ar(Xn,i) = (xi − x)2σ2, and σ2n = σ2
∑n1 (xi − x)2. Thus we need to
verify the condition LFn(δ)→ 0 for every δ > 0 where
LFn(δ) ≡ 1
σ2n
n∑1
E|Xn,i|21[|Xn,i|≥δσn].
But in the present case,
LFn(δ) =1
σ2∑n
1 (xi − x)2
n∑i=1
(xi − x)2E
ε2i 1|xi − x||εi| ≥ δσ√√√√ n∑
1
(xi − x)2
=
1
σ2∑n
1 (xi − x)2
n∑i=1
(xi − x)2E
ε211|ε1| ≥ δσ√|xi−x|2∑n1 (xi−x)2
≤ 1
σ2∑n
1 (xi − x)2
n∑i=1
(xi − x)2E
ε211|ε1| ≥ δσ√
max1≤i≤n|xi−x|2∑n1 (xi−x)2
=1
σ2E
ε211|ε1| ≥ δσ/√√√√max
1≤i≤n|xi − x|2/
n∑1
(xi − x)2
→ 0
for every δ > 0 by the DCT since the integrand converges a.s. to zero by the hypothesis and sinceE(ε21) <∞, so ε21 gives an integrable dominating function. Thus the second coordinate satisfies theclaimed marginal convergence. All that remains to be shown is the claimed joint convergence.
Conclusion: the normal theory tests and confidence intervals for α and β have the right asymp-totic size and coverage probabilities as long as σ2 <∞ and max1≤i≤n |xi − x|2/
∑n1 (xi − x)2 → 0.
Example 3.5 (Multiple linear regression with non-normal errors). Insert?
Example 3.6 (The correlation coefficient). Suppose that (X1, Y1)T , . . . , (Xn, Yn)T are i.i.d. with
means E(X1, Y1) = (µX , µY ), covariance matrix(σ2X ρσXσY
ρσXσY σ2Y
),
and E|X1|4 <∞, E|Y1|4 <∞. Let
SXY ≡ n−1n∑1
(Xi −X)(Yi − Y ), SXX ≡ n−1n∑1
(Xi −X)2, SY Y ≡ n−1n∑1
(Yi − Y )2,
and consider the sample correlation r ≡ rn defined by
rn ≡SXY√SXXSY Y
.
3. EXAMPLES. 19
since rn is invariant with respect to linear transformations of each axis, we may assume without lossof generality that µX = µY = 0 and σ2X = σ2Y = 1; if not replace (Xi, Yi) by ((Xi − µX)/σX , (Yi −µY )/σY ), i = 1, . . . , n. Note that
√n
SXY − ρSXX − 1SY Y − 1
=√n
XY − ρX2 − 1
Y 2 − 1
+
op(1)op(1)op(1)
→d Z ∼ N3(0,Σ)
by the multivariate CLT where
Σ =
E(X2Y 2)− ρ2 E(X3Y )− ρ E(XY 3)− ρE(X3Y )− ρ E(X4)− 1 E(X2Y 2)− 1E(XY 3)− ρ E(X2Y 2)− 1 E(Y 4)− 1
.
Since g(u, v, w) ≡ u/√vw has ∇g(ρ, 1, 1) = (1,−ρ/2,−ρ/2), it follows by the g′ theorem (or delta
method) that
√n(rn − ρ)→d ∇g(ρ, 1, 1)Z = Z1 −
ρ
2(Z2 + Z3).
Note that if X1 and Y1 are independent, then ρ = 0 and the covariance matrix Σ becomes
Σ =
1 0 00 µ4,X 00 0 µ4,Y
,
and hence√n(rn − 0) =
√nrn →d N(0, 1). Thus under independence the test of H : ρ = 0 versus
K : ρ > 0 that rejects if√nrn > zα has asymptotic level α even if the true distribution is not
Gaussian (but we have E|X1|4 + E|Y1|4 < ∞). If the true distribution of the (Xi, Yi) pairs isNormal, then Σ becomes
Σ =
1 + ρ2 2ρ 2ρ2ρ 2 2ρ2
2ρ 2ρ2 2
,
and hence
√n(rn − ρ)→d N(0, (1,−ρ/2,−ρ/2)Σ(1,−ρ/2− ρ/2)T ) = N(0, (1− ρ2)2).
Finally, it is often useful to transform the distribution of rn (which always has support in [−1, 1])to the whole line R: if g(x) ≡ 2−1 log((1 + x)/(1− x)), then g′(x) = 1/(1− x2), and hence, undernormality of the (Xi, Yi)’s,
√n(g(rn)− g(ρ))→d N(0, 1).
If we let Zn ≡ g(rn) and ξ ≡ g(ρ) (this is sometimes known as Fisher’s Z-transform), then
√n− 3
(Zn − ξ −
ρ
2(n− 1)
)≈ N(0, 1)
is an excellent approximation.
20 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
Example 3.7 (Chi-square test of a simple null hypothesis). Suppose that ∆1, . . . ,∆n, . . . are i.i.d.Multinomialk(1, p) random vectors so that ∆i vector contains only zeros and 1’s, but only one 1.Thus
Nn ≡n∑i=1
∆i ∼ Multk(n, p),
and pn≡ n−1Nn →p,a.s. p.
Consider testing H : p = p0
versus K : p 6= p0. One simple test statistic is the (Pearson)
chi-square statistic Qn defined by
Qn ≡k∑j=1
(Nj − np0,j)2
np0,j.
To carry out the test based on Qn we need to know the distribution of Qn under the null hypothesiseither exactly (which is complicated: its depends on k, n, and p
0), or at least asymptotically which
is relative easy. We claim that: when p = p0,
Qn →d χ2k−1.
Here is the proof. Set
Zn =
(N1 − np0,1√
np0,1, . . . ,
Nk − np0,k√np0,k
)T≡ 1√
n
n∑i=1
Y i
where The Y i’s are i.i.d. with E(Y i) = 0, and covariance matrix Σ = I − √p0
√p0T where
√p0
= (√p0,1, . . . ,
√p0,k)
T . Thus
Zn →d Z ∼ Nk(0,Σ)
by the multivariate CLT. Now Qn = ZTnZn is a continuous function of Zn, so Qn = ZTnZn →d
ZTZ ≡ Q by the Mann-Wald theorem. It remains to show that the distribution of Q is χ2k−1.
To see this, note that for any orthogonal matrix Γ we have
Q = ZTZ = (ΓZ)T (ΓZ) ≡ V TV
where V ∼ N(0,ΓΣΓT ). Here is a convenient choice of Γ: choose Γ to be the orthogonal matrixwith first row
√p0T , and filled out with orthogonal rows. Then
ΓΣΓT = ΓIΓT − Γ√p0
√p0TΓT = I − (1, 0, . . . , 0)T (1, 0, . . . , 0)
=
(0 0T
0 I(k−1)×(k−1)
).
Thus V1 = 0 with probability 1 and V2, . . . , Vk are i.i.d. N(0, 1). It follows that Q =∑k
j=2 V2j ∼
χ2k−1. Thus we have
Pp0(Qn ≥ χ2
k−1,α)→ P (χ2k−1 ≥ χ2
k−1,α) = α.
What if p 6= p0? In this case, since p
n→p p, we can write
n−1Qn =k∑j=1
(pj − p0,j)2
p0,j→p
k∑j=1
(pj − p0,j)2
p0,j≡ q > 0.
3. EXAMPLES. 21
Thus Qn →p ∞, and it follows that
Pp(Qn ≥ χ2k−1,α)→ 1
as n→∞; i.e. the test is (power) consistent.What if p = p
nsatisfies p
n= p
0+ cn−1/2 where 1T c = 0 and hence 1T p
n= 1 for all n? In this
case, by using the Cramer-Wold device together with either the Liapunov CLT or the Lindeberg-Feller CLT,
Zn =
(N1 − np0,1√
np0,1, . . . ,
Nk − np0,k√np0,k
)T=
(N1 − npn,1√
np0,1, . . . ,
Nk − npn,k√np0,k
)T+ diag(1/
√p0)√n(p
n− p
0)
=1√n
n∑i=1
Y n,i + diag(1/√p0)c
→d Z + diag(1/√p0)c
where Z ∼ Nk(0,Σ) with Σ = I − √p0
√p0T . Thus it follows by the Mann-Wald theorem that
under the local alternatives pn
we have
Qn = ZTnZn →d (Z + diag(1/√p0)c)T (Z + diag(1/
√p0)c)
=(
Γ(Z + diag(1/√p0)c))T (
Γ(Z + diag(1/√p0)c))
≡ V TV
where
V ∼ Nk
(Γdiag(1/
√p0)c,
(0 0T
0 I(k−1)×(k−1)
))≡ Nk
(µ,
(0 0T
0 I(k−1)×(k−1)
))Noting that µTµ = cTdiag(1/p
0)c =
∑kj=1 c
2j/p0,j and
µ1 =√p0Tdiag(1/
√p0)c = 1T c = 0,
it follows that V TV ∼ χ2k−1(δ) with δ =
∑k1 c
2j/p0,j . We conclude that
Ppn(Qn ≥ χ2
k−1,α)→ P (χ2k−1(δ) > χ2
k−1,α).
This leads to approximating the power of the chi-square test based on Qn by
Pp(Qn ≥ χ2k−1,α) ≈ P (χ2
k−1(δn) > χ2k−1,α)
with δn ≡∑k
j=1 c2n,j/p0,j where cn ≡
√n(p− p
0).
Another approach to approximating the power of the chi-square test is based on fixed alterna-tives as follows. Suppose that p 6= p
0. Then with q =
∑kj=1(pj − p0,j)2/p0,j ,
√n(n−1Qn − q
)=
k∑j=1
√n
(pj − p0,j)2
p0,j− (pj − p0,j)2
p0,j
22 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
=
k∑j=1
√n(pj − p0,j)− (pj − p0,j)(pj − p0,j) + (pj − p0,j)
p0,j
=
k∑j=1
√n(pj − pj)√
pj
2(pj − p0,j) + op(1)√pjp0,j
≡k∑j=1
Bn,jZn,j
→d bTZ ∼ N(0, bT (I −√p√pT )b) = N(0, V arp(D))
where, with bj ≡ 2(pj/p0,j − 1)√pj for j = 1, . . . , k,
P (D = bj/√pj) = pj , j = 1, . . . , k.
Here the claimed convergence holds since
Zn →d Z ∼ Nk(0, I −√p√pT ),
Bn,j =
2
(pjp0,j− 1
)+ op(1)
√pj →p
2
(pjp0,j− 1
)√pj = bj ,
for j = 1, . . . , k. This suggests the following approximation to the power of the chi-square test:
Pp(Qn ≥ χ2k−1,α) = Pp
(√n
(Qnn− q)≥√n
(χ2k−1,αn
− q
))
≈ P
(N(0, V arp(D)) ≥
√n
(χ2k−1,αn
− q
))
= P
(N(0, 1) ≥
√n
(χ2k−1,αn
− q
)/√V arp(D)
)
= 1− Φ
(√n
(χ2k−1,αn
− q
)/√V arp(D)
).
4. SKOROKHOD’S THEOREM: REPLACING →D BY →A.S. 23
4 Skorokhod’s Theorem: Replacing →d by →a.s.
Our goal in this section is to show how we can convert convergence in distribution into the strongermode of almost sure convergence. This often simplifies proofs and makes them more intuitive.
Definition 4.1 For any distribution function F define F−1 by F−1(t) ≡ infx : F (x) ≥ t for0 < t < 1.
Proposition 4.1 F−1 is left - continuous.
Proof. To show that F−1 is left - continuous, let 0 < t < 1, and set z ≡ F−1(t). Then F (z) ≥ tby the right continuity of F . If F is discontinuous at z, F−1(t − ε) = z for all small ε > 0, andhence left continuity holds. If F is continuous at z, then assume F−1 is discontinuous from the leftat t. Then for all ε > 0, F−1(t− ε) < z− δ for some δ > 0, and hence F (z− δ) ≥ t− ε for all ε > 0.hence F (z − δ) ≥ t, which implies F−1(t) ≤ z − δ, a contradiction. 2
Proposition 4.2 If X has continuous distribution function F , then F (X) ∼ Uniform(0, 1). Forany distribution function F and any t ∈ (0, 1),
P (F (X) ≤ t) ≤ t
with equality if and only if t is in the range of F . Equivalently, F (F−1(t)) ≡ F F−1(t) ≥ tfor all 0 < t < 1 with equality if and only if t is in the range of F . Also, F−1 F (x) ≤ x forall −∞ < x < ∞ with strict inequality if and only if F (x − ε) = F (x) for some ε > 0. ThusP (F−1 F (X) 6= X) = 0 where X ∼ F .
Exercise 4.1 Prove proposition 4.2.
Theorem 4.1 (The inverse transformation). Let ξ ∼ Uniform(0, 1) and let X = F−1(ξ). Then forall real x,
[X ≤ x] = [ξ ≤ F (x)] .(1)
Thus X has distribution function F .
Proof. Now ξ ≤ F (x) implies X = F−1(ξ) ≤ x by the definition 4.1 of F−1. If X = F−1(ξ) ≤ x,then F (x + ε) ≥ ξ for all ε > 0, so that right continuity of F implies F (x) ≥ ξ. Thus the claimedevent identity holds. 2
Proposition 4.3 (Elementary Skorokhod theorem). Suppose that Xn →d X0. Then there existrandom variables X∗n, n ≥ 0, all defined on the common probability space ([0, 1],B[0, 1], λ) for which
X∗nd= Xn for every n ≥ 0 and X∗n →a.s. X
∗0 .
Proof. Let Fn denote the distribution function of Xn and let
X∗n ≡ F−1n (ξ) for all n ≥ 0(a)
24 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
where ξ ∼ Uniform(0, 1). Then X∗nd= Xn for all n ≥ 0 by theorem 4.1. It remains only to show
that X∗n →a.s. X∗0 .
Let t ∈ (0, 1) be such that there is at most one value z having F (z) = t. (Thus t corresponds toa continuity point of F−1.) Let z = F−1(t). Then F (x) < t for x < z. Thus Fn(x) < t for n ≥ Nx
provided x < z is a continuity point of F . Thus F−1n (t) ≥ x provided x < z is a continuity pointof F . Thus lim inf F−1n (t) ≥ x provided x < z is a continuity point of F . Thus lim inf F−1n (t) ≥ zsince there are continuity points x that ↑ z.
We also have F (x) > t for z < x. Thus Fn(x) > t, and hence F−1n (t) ≤ x for n ≥ some Nx
provided x > z is a continuity point of F . Thus lim supn F−1n (t) ≤ x provided x > z is a continuity
point of F . Thus lim supn F−1n (t) ≤ z since there are continuity points x that ↓ z.
Thus F−1n (t)→ F−1(t) for all but a countable number of t’s. Since any such set has Lebesguemeasure zero, it follows that X∗n = F−1n (ξ)→a.s. F
−1(ξ) = X∗0 . 2
Proposition 4.4 (Continuous mapping or Mann-Wald theorem). Suppose that g : R → R iscontinuous a.s. PX . Then:A. If Xn →a.s. X0, then g(Xn)→a.s. g(X0).B. If Xn →p X0, then g(Xn)→p g(X0).C. If Xn →d X0, then g(Xn)→d g(X0).
Proof. A. Let N1 be the null set such that Xn(ω) → X0(ω) for all ω ∈ N c1 , and let N2 be
the null set such that g is continuous at X0(ω) for all ω ∈ N c2 . Then for ω ∈ N c
1 ∩ N c2 we have
g(Xn(ω)) → g(X0(ω)). But P (N1 ∪ N2) ≤ P (N1) + P (N2) = 0 + 0 = 0, and the convergenceasserted in A holds.
B. By theorem 1.1 part G,Xn →p X0 if and only if for every subsequence Xn′ there is a furthersubsequence Xn′′ ⊂ Xn′ such that Xn′′ →a.s. X0. We will apply this to Yn = g(Xn). Let Yn′ =g(Xn′) be an arbitrary subsequence of Yn. By part G of theorem 1.1 there exists a subsequenceXn′′ of Xn′ such that Xn′′ →a.s. X0. By A we conclude that Yn′′ = g(Xn′′) →a.s. g(X0) = Y0.But by part G of theorem 1.1 (in the converse direction) it follows that Yn = g(Xn)→p g(X0) = Y0.
C. Replace Xn, X0 by X∗n, X∗0 of the Skorokhod theorem, proposition 4.3. Thus
g(Xn)d= g(X∗n)→a.s. g(X∗0 )
d= g(X0) .(a)
Since →a.s. implies →p which in turn implies →d, (a) implies that g(Xn)→d g(X0). 2
Remark 4.1 Proposition 4.4 remains true for random vectors in Rk and, still more generally,for convergence in law (weak convergence) of random elements in a separable metric space. SeeBillingsley (1986), Probability and Measure, page 399, for the first, and Billingsley (1971), WeakConvergence of Measures: Applications in Probability, theorem 3.3, page 7, for the second. Theoriginal paper is Skorokhod (1956) where the separable metric space case was treated immediately.
Proposition 4.5 (Helly Bray theorem). If Xn →d X0 and g is bounded and continuous (a.s. PX),then Eg(Xn)→ Eg(X0).
Proof. For the random variables X∗n of proposition 4.3, it follows from A of proposition 4.4that g(X∗n)→a.s. g(X∗0 ). Thus by equality in distribution guaranteed by the construction of propo-sition 4.3 and the dominated convergence theorem,
Eg(Xn) = Eg(X∗n)→ Eg(X∗0 ) = Eg(X0) .
4. SKOROKHOD’S THEOREM: REPLACING →D BY →A.S. 25
2
Remark 4.2 If Eg(Xn) → Eg(X) for all bounded continuous functions g, then Xn →d X0.(Proof: box in the indicator function 1(−∞,x] by the bounded continuous functions g+, g− definedby connecting (x, 1) to (x+ ε, 0) linearly and (x− ε, 1) to (x, 0) linearly, respectively.) This gives away of defining →d more generally:
Definition 4.2 Suppose that Xn, n ≥ 0 take values in the complete separable metric space (M,d).Then we say that Xn converges in law or distribution to X0, and we write Xn →d X0 or Xn ⇒ X0,if
Eg(Xn)→ Eg(X0) for all g ∈ Cb(M);
here Cb(B) denotes the collection of all bounded continuous functions from M to R.
Proposition 4.6 If Xn →d X0, then E|X0| ≤ lim infn→∞E|Xn|.
Proof. For the random variables X∗n of proposition 4.3, X∗n →a.s. X∗0 . It follows from the
equality in distribution of proposition 4.3 and Fatou’s lemma
E|X0| = E|X∗0 | = E(lim infn|X∗n|) ≤ lim inf
nE|X∗n| = lim inf
nE|Xn| .
2
Corollary 1 If Xn →d X0, then V ar(X0) ≤ lim infn V ar(Xn).
Exercise 4.2 Prove corollary 1. Hint: Note that with X ′nd= Xn for all n ≥ 0 with X ′n independent
of Xn, we have V ar(Xn) = (1/2)E(Xn −X ′n)2.
Proposition 4.7 (Slutsky’s theorem). If An →p a, Bn →p b, and Zn →d Z, then AnZn + Bn →d
aZ + b.
Proof. Now An →p a, Bn →p b, and Zn →d Z where a, b are constants, implies that(Zn, An, Bn) →d (Z, a, b) in R3. Hence by the R3 version of Skorokhod’s theorem, there exists a
sequence (Z∗n, A∗n, B
∗n)
d= (Zn, An, Bn) such that (Z∗n, A
∗n, B
∗n)→ (Z∗, a, b)
d= (Z, a, b). Hence
AnZn +Bnd= A∗nZ
∗n +B∗n →a.s. aZ
∗ + bd= aZ + b .(a)
Since →a.s. implies →p which in turn implies →d, (a) yields the desired conclusion. 2
26 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
5 Empirical Measures and Empirical Processes
We first introduce the empirical distribution function Gn and empirical process Un of i.i.d. Uniform(0, 1)random variables. Suppose that ξ1, . . . , ξn, . . . are i.i.d. Uniform(0, 1). Their empirical distributionfunction is
Gn(t) =1
n
n∑i=1
1[0,t](ξi) for 0 ≤ t ≤ 1(1)
=#ξi ≤ t, i = 1, . . . , n
n
=k
nfor ξn:k ≤ t < ξn:k+1, k = 0, . . . , n
where 0 ≡ ξn:0 ≤ ξn:1 ≤ · · · ≤ ξn:n ≤ ξn:n+1 ≡ 1 are the order statistics. The uniform empiricalprocess is defined by
Un(t) ≡√n(Gn(t)− t) for 0 ≤ t ≤ 1 .(2)
The inverse function G−1n of Gn is the uniform quantile function. Thus
G−1n (t) = ξn:i for (i− 1)/n < t ≤ i/n, i = 1, . . . , n .(3)
The uniform quantile process Vn is defined by
Vn(t) ≡√n(G−1n (t)− t) for 0 ≤ t ≤ 1 .(4)
Note that
nGn(t) ∼ Binomial(n, t) for 0 ≤ t ≤ 1 ,(5)
so that
Un(t) has mean 0 and variance t(1− t) for 0 ≤ t ≤ 1 .(6)
In fact
Cov[1[0,s](ξi), 1[0,t](ξi)] = s ∧ t− st for 0 ≤ s, t ≤ 1 .(7)
Moreover, applying the multivariate CLT to (1[0,s](ξi), 1[0,t](ξi)), it is clear that
(Un(s),Un(t))→d N2
((00
),
(s(1− s) s ∧ t− sts ∧ t− st t(1− t)
))as n→∞ ,(8)
for 0 ≤ s, t ≤ 1.We define U(t) : 0 ≤ t ≤ 1 to be a Brownian bridge process if it is a Gaussian process indexed
by t ∈ [0, 1] having
EU(t) = 0 and Cov[U(s),U(t)] = s ∧ t− st(9)
for all 0 ≤ s, t ≤ 1. The process U exists and has sample functions U(·, ω) which are continuousfor a.e. ω as we will show below. Of course the bivariate result (8) immediately extends to all thefinite-dimensional marginal distributions of Un: for any k ≥ 1 and any t1, . . . , tk ∈ [0, 1],
(Un(t1), . . . ,Un(tk))→d (U(t1), . . . ,U(tk)) ∼ Nk(0, (tj ∧ tj′ − tjtj′)) .(10)
5. EMPIRICAL MEASURES AND EMPIRICAL PROCESSES 27
Thus we have convergence of all the finite-dimensional distributions of Un to those of a Brownianbridge process U, and we write
Un →f.d. U as n→∞ .(11)
We would like to be able to conclude from (11) that g(Un)→d g(U) as n→∞ for various continuousfunctionals such as g(x) = sup0≤t≤1 |x(t)| for x ∈ D[0, 1], the space of all right-continuous functionson [0, 1] with left-limits. The conclusion (11) is not strong enough to imply this, but (10) can bestrengthened to a result that does. The Mann-Wald theorem suggests g(Un) →d g(U) should betrue for “continuous” functionals g, and this raises the question of what metric should be used todefine continuous.
The Empirical Process on R
Let X1, . . . , Xn, . . . be i.i.d. F with order statistics Xn:1 ≤ · · · ≤ Xn:n. Their empirical distri-bution function Fn is defined by
Fn(x) =1
n
n∑i=1
1(−∞,x](Xi) for −∞ < x <∞ .(12)
The empirical process is defined to be√n(Fn − F ). It will be very useful to suppose that random
variables X∗i , i = 1, . . . , n, are defined by
X∗i = F−1(ξi) i = 1, . . . , n for the ξi’s of (1) .(13)
Theorem 4.1 shows that these X∗i ’s are indeed i.i.d. F . Recall from Theorem 4.1 that also
1[X∗i ≤x] = 1[ξi≤F (x)] on (−∞,∞)(14)
for these particular X∗i ’s. Thus for these X∗i ’s we have
F∗n = Gn(F ) on (−∞,∞),(15)
and
√n(F∗n − F ) = Un(F ) on (−∞,∞) .(16)
Note from (10) and (16) that
√n(Fn − F )→f.d. U(F ) as n→∞ .(17)
Theorem 5.1 (Glivenko - Cantelli). Let I denote the identity function on [0, 1], I(t) = t, for0 ≤ t ≤ 1. Then
‖Gn − I‖∞ ≡ sup0≤t≤1
|Gn(t)− t| →a.s. 0(18)
and
‖Fn − F‖∞ ≡ sup−∞<x<∞
|Fn(x)− F (x)| →a.s. 0(19)
as n→∞.
28 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
Proof. Since
‖Fn − F‖∞d= ‖F∗n − F‖∞ = ‖Gn(F )− F‖∞ ≤ ‖Gn − I‖∞ ,(a)
where the equality in distribution holds jointly in n and with equality if F is continuous, it sufficesto prove the first part.
Fix a large integer M . Then
‖Gn − I‖∞ = max1≤j≤M
sup(j−1)/M≤t≤j/M
|Gn(t)− t|
= max1≤j≤M
sup(j−1)/M≤t≤j/M
(Gn(t)− t) ∨ sup(j−1)/M≤t≤j/M
(t−Gn(t))
≤ max1≤j≤M
(Gn(j/M)− (j − 1)/M) ∨ (j/M −Gn((j − 1)/M))
≤ max1≤j≤M
(Gn(j/M)− j/M) ∨ ((j − 1)/M −Gn((j − 1)/M))+ 1/M
→a.s. 0 + 1/M
since Gn(j/M)→a.s. j/M , j = 1, . . . ,M . But M was arbitrary; hence ‖Gn − I‖∞ →a.s. 0. 2
The next natural step is to show that
Un ⇒ U as n→∞ in (D[0, 1], ‖ · ‖∞)(20)
and√n(Fn − F )
d=√n(F∗n − F ) = Un(F )⇒ U(F ) as n→∞ in (D(−∞,∞), ‖ · ‖∞) .(21)
This is essentially what was proved by Donsker (1952). However, it turned out later thatthere are measurability difficulties here: (D[0, 1], ‖ · ‖∞) is an inseparable Banach space, and eventhough U takes values in the separable Banach space (C[0, 1], ‖ · ‖∞), in this case the unfortunateconsequence is that Un is not a measurable element of (D[0, 1], ‖·‖∞); see Billingsley (1968), Chapter18. Roughly, the Borel sigma-field is too big. Hence an attractive alternative formulation is onethat works around this difficulty essentially by carrying out an explicit Skorokhod construction
of uniform empirical processes U∗nd= Un defined on a common probability space with a Brownian
bridge process U∗ and satisfying
‖U∗n − U∗‖∞ = sup0≤t≤1
|U∗n(t)− U∗(t)| →a.s. 0 .(22)
This is the content of the following theorem:
Theorem 5.2 There exists a (sequence of) uniform empirical processes U∗n corresponding to atriangular array of row independent Uniform(0, 1) random variables ξn1, . . . , ξnn, n ≥ 1, and aBrownian bridge process U∗ all defined on a common probability space (Ω,A, P ), such (22) holds.Thus it follows that
‖√n(F∗n − F )− U∗(F )‖∞ →a.s. 0 as n→∞.(23)
The convergence in (22) was strengthened in the papers of Komlos, Major, and Tusnady (1975),(1978) as follows: the construction can be carried out so that the convergence in (22) holds withthe rate n−1/2 log n: there is a construction of U∗n and U∗ so that
‖U∗n − U∗‖∞ ≤ Clog n√n
a.s. ,(24)
5. EMPIRICAL MEASURES AND EMPIRICAL PROCESSES 29
for some absolute constant C. Moreover, there is a construction of the sequence(s) U∗nn≥1 andU∗ = U∗n on a common probability space so that the joint in n distributions are correct and
‖U∗n − U∗‖∞ ≤ C(log n)2√
na.s. .(25)
In any case, these results have the following corollary:
Corollary 1 (Donsker’s theorem). If g : D[0, 1]→ R is ‖ · ‖∞−continuous, then g(Un)→d g(U).
Here are some examples of this:
Example 5.1 (Kolmogorov’s (two-sided) statistic). If F is continuous, then
‖√n(Fn − F )‖∞
d= ‖Un(F )‖∞ = ‖Un‖∞ →d ‖U‖∞ .(26)
It is known, via reflection methods (see Shorack and Wellner (1986), pages 33-42) that
P (‖U‖∞ > λ) = 2∞∑k=1
(−1)k+1 exp(−2k2λ2) for λ > 0 .(27)
Dvoretzky, Kiefer, and Wolfowitz (1956) showed that
P (‖Un‖∞ ≥ λ) ≤ Ce−2λ2
for all λ > 0 and n ≥ 1 for some constant C. Massart (1990) showed that the inequality in the lastdisplay holds with C = 2 as conjectured by Birnbaum and McCarty (1958).
Example 5.2 (Kolmogorov’s one-sided statistic). If F is continuous, then
‖√n(Fn − F )+‖∞ ≡ sup
−∞<x<∞
√n(Fn(x)− F (x))
d= ‖U+
n (F )‖∞ = ‖U+n ‖∞ →d ‖U+‖∞ .(28)
It is known (see Shorack and Wellner (1986), pages 37 and 142) that
P (‖U+‖∞ > λ) = exp(−2λ2) for λ > 0 .(29)
Example 5.3 (Birnbaum’s statistic). If F is continuous,∫ ∞−∞
√n(Fn(x)− F (x))dF (x)
d=
∫ ∞−∞
Un(F )dF =
∫ 1
0Un(t)dt→d
∫ 1
0U(t)dt .(30)
Now∫ 10 U(t)dt is a linear combination of normal random variables, and hence it has a normal distri-
bution. It has expectation 0 by Fubini’s theorem since E(U(t)) = 0 for each fixed t. Furthermore,again by Fubini’s theorem,
E
(∫ 1
0U(t)dt
)2
= E
(∫ 1
0U(s)ds
∫ 1
0U(t)dt
)=
∫ 1
0
∫ 1
0EU(s)U(t)dsdt
=
∫ 1
0
∫ 1
0(s ∧ t− st) dsdt =
1
12.
Hence∫ 10 U(t)dt ∼ N(0, 1/12).
30 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
Example 5.4 (Cramer - von Mises statistic). If F is continuous,∫ ∞−∞√n(Fn(x)− F (x))2dF (x)
d=
∫ ∞−∞Un(F )2dF =
∫ 1
0Un(t)2dt→d
∫ 1
0U(t)2dt .
In this case it is known that∫ 1
0U(t)2dt d
=∞∑j=1
1
j2π2Z2j ,(31)
where the Zj ’s are i.i.d. N(0, 1), and this distribution has been tabled; see Shorack and Wellner(1986), page 148.
Example 5.5 (Anderson - Darling statistic). If F is continuous,∫ ∞−∞
√n(Fn − F )2
F (1− F )dF
d=
∫ ∞∞
Un(F )2
F (1− F )dF =
∫ 1
0
Un(t)2
t(1− t)dt→d
∫ 1
0
U(t)2
t(1− t)dt .
It is known in this case that∫ 1
0
U(t)2
t(1− t)dt
d=∞∑j=1
1
j(j + 1)Z2j(32)
where the Zj ’s are i.i.d. N(0, 1). This distribution has also been tabled; see Shorack and Wellner(1986), page 148.
General Empirical Measures and Processes
Now suppose that X1, X2, . . . , Xn, . . . are i.i.d. P on the measurable space (S,S). We let Pndenote the empirical measure of the first n of the Xi’s:
Pn ≡1
n
n∑i=1
δXi ;(33)
here δx denotes the measure with mass 1 at x ∈ S: δx(B) = 1B(x) for B ∈ S. Thus for a set B ∈ S,
Pn(B) =1
n
n∑i=1
1B(Xi) =1
n#i ≤ n : Xi ∈ B .(34)
Note that when S = R so that the Xi’s are real-valued, and B = (−∞, x] for x ∈ R, then
Pn(B) = Pn((−∞, x]) = Fn(x) ,(35)
the empirical distribution function of the Xi’s at x.The empirical process Gn is defined by
Gn ≡√n(Pn − P ) .(36)
The question is how to “index” Pn and Gn as stochastic processes.Some history: For the case S = Rd, the empirical distribution function Fn(x), x ∈ Rd, is the
case obtained by choosing the class of sets to be the lower-left orthants
C = Od ≡ (−∞, x] : x ∈ Rd ,
5. EMPIRICAL MEASURES AND EMPIRICAL PROCESSES 31
and this direction was pursued in some detail through the 1950’s and early 1960’s. However, with alittle thought it becomes clear that many other classes of sets will be of interest in Rd. For examplewhy not consider the empirical process indexed by the class of all rectangles
Rd ≡ A = [a1, b1]× · · · × [ad, bd] : aj , bj ∈ R, j = 1, . . . , d
or the class of all closed balls
Bd ≡ B(x, r) : x ∈ Rd, r > 0
where B(x, r) = y ∈ Rd : |y − x| ≤ r; or the class of all half-spaces
Hd ≡ H(u, t) : u ∈ Sd−1, t ∈ R
where H(u, t) ≡ y ∈ Rd : 〈y, u〉 ≤ t and Sd−1 ≡ u ∈ Rd : |u| = 1 denotes the unit sphere inRd; or the class of all convex sets in Rd
Cd ≡ C ⊂ Rd : C is convex ?
All of these cases correspond to the empirical process indexed by some class of indicator functions
1C : C ∈ C
for the appropriate choice of C. Thus we can consider the empirical measure and the empiricalprocess as functions on a class of sets C which map sets C ∈ C to the real-valued random variables
Pn(C) and Gn(C) =√n(Pn(C)− P (C)) .
Note that for any class of sets C we have
supC∈C
Pn(C) ≤ 1 <∞ and supC∈C|√n(Pn(C)− P (C))| ≤
√n <∞ ,
so we can regard both Pn and Gn as elements of the space l∞(C) ≡ x : C → R| supC∈C |x(C)| <∞.More generally still, we can think of indexing the empirical process by a class F of functions
f : S → R. For example, when S = Rd a natural class which might easily arise in applications isthe class of functions
F = ft(x) : t ∈ Rd
where ft(x) = |x− t|. This is already an interesting class of functions when d = 1.For any fixed measureable function f : S → we will use the notation
P (f) =
∫fdP, Pn(f) =
∫fdPn =
1
n
n∑i=1
f(Xi) .
From the strong law of large numbers it follows that for any fixed function f with E|f(X)| <∞
Pn(f)→a.s. P (f) = Ef(X1) .(37)
By the central limit theorem (CLT) it follows that for any fixed function f with E|f(X)|2 <∞
Gn(f) =√n(Pn(f)− P (f))→d G(f) ;∼ N(0, V ar(f(X1))(38)
32 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
here G denotes a P−Brownian bridge process: i.e. a mean zero Gaussian process with covariancefunction
Cov[G(f),G(g)] = P (fg)− P (f)P (g) .(39)
The question of interest is: for what classes C of subsets of S, C ⊂ A or classes of functions F , canwe make these convergences hold uniformly in C ∈ C, or uniformly in f ∈ F? These are the kindsof questions with which modern empirical process theory is concerned and can answer.
To state some typical results from this theory, we first need several definitions. If d is a metricon a set F , then we define the covering numbers of F with respect to d as follows:
N(ε,F , d) ≡ infk : there exist f1, . . . , fk ∈ F such that F ⊂ ∪kj=1B(fj , ε) ;(40)
here B(f, ε) ≡ g ∈ F : d(g, f) ≤ ε. Another useful notion is that of a bracket: if l ≤ u are tworeal-valued functions defined on S, then the bracket [u, l] is defined by
[u, l] ≡ f : l(s) ≤ f(s) ≤ u(s) for all s ∈ S .(41)
We say that a bracket [u, l] is an ε−bracket for the metric d if d(u, l) ≤ ε. Then the bracketingcovering number N[](ε,F , d) for a set of functions F is
N[](ε,F , d) ≡ infk : there exist ε−brackets [l1, u1], . . . , [lk, uk] such that F ⊂ ∪kj=1[lj , uj ] .(42)
One more bit of notation is needed before stating our theorems: an envelope function F for aclass of functions F is any function satisfying
|f(x)| ≤ F (x) for all x ∈ S, and all f ∈ F .(43)
Usually we will take F to be the minimal measurable majorant
F (x) ≡
(supf∈F|f(x)|
)∗,(44)
where here the ∗ stands for “smallest measurable function above” the quantity in parentheses(which need not be measurable since it is, in general, a supremum over an uncountable collection).[Note that this F is not a distribution function!]
Now we can state two generalizations of the Glivenko-Cantelli theorems.
Theorem 5.3 Suppose that F is a class of functions with finite L1(P )−bracketing numbers:N[](ε,F , L1(P )) <∞ for every ε > 0. Then
‖Pn − P‖F ≡ supf∈F|Pn(f)− P (f)| →a.s. 0 .(45)
Theorem 5.4 Suppose that F is a class of functions with:A. An integrable envelope function F : P (F ) <∞.B. The truncated classes FM ≡ f1[F≤M ] : f ∈ F satisfy
n−1 logN(ε,FM , L1(Pn))→a.s. 0(46)
for every ε > 0 and 0 < M <∞. Then, if F is also “suitably measurable”,
‖Pn − P‖F ≡ supf∈F|Pn(f)− P (f)| →a.s. 0 .(47)
5. EMPIRICAL MEASURES AND EMPIRICAL PROCESSES 33
Note that the key hypothesis (46) of Theorem 5.4 is clearly satisfied if
supQN(ε,FM , L1(Q)) <∞(48)
for every ε > 0 and M > 0; here the supremum is over finitely discrete measures Q.Finally, here are two generalizations of the Donsker theorem.
Theorem 5.5 (Ossiander’s uniform CLT). Suppose that F is a class of functions with L2(P )bracketing numbers N[](ε,F , L2(P )) satisfying∫ ∞
0
√logN[](ε,F , L2(P )) dε <∞ .(49)
Then
Gn =√n(Pn − P )⇒ G in l∞(F) as n→∞ .(50)
Theorem 5.6 (Pollard’s uniform CLT). Suppose that F is a class of functions satisfying:A. The envelope function F of F is square integrable: P (F 2) <∞.B. The uniform covering numbers supQ logN(ε‖F‖Q,2,F , L2(Q)) satisfy∫ ∞
0
√supQ
logN(ε‖F‖Q,2,F , L2(Q)) dε <∞ .(51)
Then
Gn =√n(Pn − P )⇒ G in l∞(F) as n→∞ .(52)
For proofs of Theorems 45 - 5.6, see van der Vaart and Wellner (1996). Treatments of empiricalprocess theory are also given by Dudley (1999, 2014), Van de Geer (1999), and Gine and Nickl(2016).
34 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
6 The Partial Sum Process and Brownian Motion
We define S(t) : 0 ≤ t ≤ 1 to be Brownian motion if S is a Gaussian process indexed by t ∈ [0, 1]having
E(S(t)) = 0 and Cov[S(s),S(t)] = s ∧ t(1)
for all 0 ≤ s, t ≤ 1. These finite-dimensional distributions are “consistent”, and hence a theoremof Kolmogorov shows that the process S exists. Note that (1) and normality imply that
S has stationary independent increments .(2)
Exercise 6.1 Suppose that U is a Brownian bridge and Z ∼ N(0, 1) is is independent of U. Let
S(t) ≡ U(t) + tZ for 0 ≤ t ≤ 1(3)
is a Brownian motion.
Exercise 6.2 Suppose that S is a Brownian motion. Show that
U(t) ≡ S(t)− tS(1) for 0 ≤ t ≤ 1(4)
is a Brownian bridge.
Now suppose that X1, . . . , Xn are i.i.d. random variables with mean 0 and and variance 1, andset X0 ≡ 0. We define the partial sum process Sn by
Sn(t) ≡ Sn(k/n) =1√n
k∑i=1
Xi fork
n≤ t < k + 1
n,(5)
for 0 ≤ k <∞. Note that
Cov[Sn(j/n), Sn(k/n)] =1
n
j∑i=1
k∑i′=1
Cov[Xi, Xi′ ]
=1
n
j∧k∑i=1
V ar[Xi] =j ∧ kn
→ s ∧ t if j/n→ s and k/n→ t .
Also,
Sn(t) =1√n
[nt]∑i=1
Xi =
√[nt]
n
1√[nt]
[nt]∑i=1
Xi →d
√tN(0, 1) ∼ N(0, t) .(6)
for 0 ≤ t ≤ 1 by the CLT and Slutsky’s theorem. This suggests that
Sn →f.d. S as n→∞ .(7)
This will be verified in exercise 6.3. Much more is true: Sn ⇒ S as processes in D[0, 1], and henceg(Sn)→d g(S) for continuous functionals g.
6. THE PARTIAL SUM PROCESS AND BROWNIAN MOTION 35
Exercise 6.3 Show that (7) holds: i.e. for any fixed t1, . . . , tk ∈ [0, 1]k,
(Sn(t1), . . . ,Sn(tk))→d (S(t1), . . . ,S(tk)) .
Existence of Brownian motion and Brownian bridge as continuous processes on C[0, 1]
The aim of this subsection to convince you that both Brownian motion and Brownian bridgeexist as continuous Gaussian processes on [0, 1], and that we can then extend the definition ofBrownian motion to [0,∞).
Definition 6.1 Brownian motion (or standard Brownian motion, or a Wiener process) S is aGaussian process with continuous sample functions and:(i) S(0) = 0;(ii) E(S(t)) = 0, 0 ≤ t ≤ 1;(iii) ES(s)S(t) = s ∧ t, 0 ≤ s, t ≤ 1.
Definition 6.2 A Brownian bridge process U is a Gaussian process with continuous sample func-tions and:(i) U(0) = U(1) = 0;(ii) E(U(t)) = 0, 0 ≤ t ≤ 1;(iii) EU(s)U(t) = s ∧ t− st, 0 ≤ s, t ≤ 1.
Theorem 6.1 Brownian motion S and Brownian bridge U exist.
Proof. We first construct a Brownian bridge process U. Let
h00(t) ≡ h(t) ≡
t 0 ≤ t ≤ 1/2 ,1− t 1/2 ≤ t ≤ 1 ,0 elsewhere .
(a)
For n ≥ 1 let
hnj(t) ≡ 2−n/2h(2nt− j), j = 0, . . . , 2n − 1 .(b)
For example, h10(t) = 2−1/2h(2t), h11(t) = 2−1/2h(2t− 1), while
h20(t) = 2−1h(4t), h21(t) = 2−1h(4t− 1) ,
h22(t) = 2−1h(4t− 2), h23(t) = 2−1h(4t− 3) .
Note that |hnj(t)| ≤ 2−n/22−1.The functions hnj : j = 0, . . . , 2n − 1, n ≥ 0 are called the Schauder functions; they
are integrals of the orthonormal (with respect to Lebesgue measure on [0, 1]) family of functionsgnj : j = 0, . . . , 2n − 1, n ≥ 0 called the Haar functions defined by
g00(t) ≡ g(t) ≡ 21[0,1/2](t)− 1,
gnj(t) ≡ 2n/2g00(2nt− j) , j = 0, . . . , 2n − 1, n ≥ 1 .
Thus ∫ 1
0g2nj(t)dt = 1,
∫ 1
0gnj(t)gn′j′(t)dt = 0 if n 6= n′, or j 6= j′ ,(c)
36 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
and
hnj(t) =
∫ t
0gnj(s)ds, 0 ≤ t ≤ 1 .(d)
Furthermore, the family gnj2n−1j=0, n≥0 ∪ g(·/2) is complete: any f ∈ L2(0, 1) has an expansion in
terms of the g’s. In fact the Haar basis is the simplest wavelet basis of L2(0, 1), and is the startingpoint for further developments in the area of wavelets.
Now let Znj2n−1j=0, n≥0 be independent identically distributed N(0, 1) random variables; if we
wanted, we could construct all these random variables on the probability space ([0, 1],B[0,1], λ).Define
Vn(t, ω) =2n−1∑j=0
Znj(ω)hnj(t) ,
Um(t, ω) =m∑n=0
Vn(t, ω) .
For m > k
|Um(t, ω)− Uk(t, ω)| = |m∑
n=k+1
Vn(t, ω)| ≤m∑
n=k+1
|Vn(t, ω)|(e)
where
|Vn(t, ω)| ≤2n−1∑j=0
|Znj(ω)||hnj(t)| ≤ 2−(n/2+1) max0≤j≤2n−1
|Znj(ω)|(f)
since the hnj , j = 0, . . . , 2n − 1 are 6= 0 on disjoint t intervals.Now P (Znj > z) = 1− Φ(z) ≤ z−1φ(z) for z > 0 (by “Mill’s ratio”) so that
P (|Znj | ≥ 2√n) = 2P ((Znj ≥ 2
√n) ≤ 2√
2π(2√n)−1e−2n .(g)
Hence
P
(max
0≤j≤2n−1|Znj | ≥ 2
√n
)≤ 2nP (|Z00| ≥ 2
√n) ≤ 2n√
2πn−1/2e−2n ;(h)
since this is a term of a convergent series, by the Borel-Cantelli lemma max0≤j≤2n−1 |Znj | ≥ 2√n
occurs infinitely often with probability zero; i.e. except on a null set, for all ω there is an N = N(ω)such that max0≤j≤2n−1 |Xnj(ω)| < 2
√n for all n > N(ω). Hence
sup0≤t≤1
|Um(t)− Uk(t)| ≤m∑
n=k+1
2−n/2n1/2 ↓ 0(i)
for all k,m ≥ N ′ ≥ N(ω). Thus Um(t, ω) converges uniformly as m → ∞ with probability one tothe (necessarily continuous) function
U(t, ω) ≡∞∑n=0
Vn(t, ω) .(j)
6. THE PARTIAL SUM PROCESS AND BROWNIAN MOTION 37
Define U ≡ 0 on the exceptional set. Then U is continuous for all ω.Now U(t) : 0 ≤ t ≤ 1 is clearly a Gaussian process since it is the sum of Gaussian processes.
We now show that U is in fact a Brownian bridge: by formal calculation (it remains only to justifythe interchange of summation and expectation),
EU(s)U(t) = E
∞∑n=0
Vn(s)∞∑m=0
Vm(t)
=∞∑n=0
EVn(s)Vn(t)
=
∞∑n=0
E
2n−1∑j=0
Znj
∫ s
0gnjdλ
2n−1∑k=0
Znk
∫ t
0gnkdλ
=
∞∑n=0
2n−1∑j=0
∫ s
0gnjdλ
∫ t
0gnjdλ
=∞∑n=0
2n−1∑j=0
∫ 1
01[0,s]gnjdλ
∫ 1
01[0,t]gnjdλ+ st− st
=
∫ 1
01[0,s](u)1[0,t](u)du− st
= s ∧ t− st
where the next to last equality follows from Parseval’s identity. Thus U is Brownian bridge.Now let Z be one additional N(0, 1) random variable independent of all the others used in the
construction, and define
S(t) ≡ U(t) + tZ =
∞∑n=0
Vn(t) + tZ .(k)
Then S is also Gaussian with 0 mean and
Cov[S(s),S(t)] = Cov[U(s) + sZ,U(t) + tZ]
= Cov[U(s),U(t)] + stV ar(Z)
= s ∧ t− st+ st = s ∧ t .
Thus S is Brownian motion. Since U has continuous sample paths, so does S. 2
Exercise 6.4 Graph the first few gnj ’s and hnj ’s.
Exercise 6.5 Justify the interchange of expectation and summation used in the proof. [Hint: usethe Tonelli part of Fubini’s theorem.]
Exercise 6.6 Let U be a Brownian bridge process. For 0 ≤ t <∞ define a process B by
B(t) ≡ (1 + t)U(
t
1 + t
).(8)
Show that B is a Brownian motion process on [0,∞).
38 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
7 Quantiles and Quantile Processes
Let X1, . . . , Xn be i.i.d. real-valued random variables with distribution function F , and let X(1) ≤X(2) ≤ . . . ≤ X(n) denote the order statistics. For t ∈ (0, 1), let
F−1n (t) ≡ infx : Fn(x) ≥ t(1)
so that
F−1n (t) = X(i) fori− 1
n< t ≤ i
n, i = 1, . . . , n .(2)
Let ξ1, . . . , ξn be i.i.d. Uniform(0, 1) random variables, and let 0 ≤ ξ(1) ≤ . . . ≤ ξ(n) ≤ 1 denotetheir order statistics. Thus, with G−1n (t) ≡ infx : Gn(x) ≥ t,
G−1n (t) = ξ(i) fori− 1
n< t ≤ i
n, i = 1, . . . , n .(3)
Now
(X∗1 , . . . , X∗n) ≡ (F−1(ξ1), . . . , F
−1(ξn))d= (X1, . . . , Xn),(4)
so
(X∗(1), . . . , X∗(n)) ≡ (F−1(ξ(1)), . . . , F
−1(ξ(n)))d= (X(1), . . . , X(n)).(5)
Hence it follows that
F−1n ( · ) d= F−1(G−1n ( · )) ,(6)
and to study F−1n it suffices to study G−1n .
Proposition 7.1 The sequence of uniform quantile functions G−1n satisfy
‖G−1n − I‖∞ ≡ sup0≤t≤1
|G−1n (t)− t| = ‖Gn − I‖∞ →a.s. 0 ,(7)
and hence, if F−1 is continuous on [a, b] ⊂ [0, 1], then
‖F−1n − F−1‖ba ≡ supa≤t≤b
|F−1n (t)− F−1(t)| →a.s. 0 .(8)
Proof. Note that ‖G−1n − I‖∞ = ‖Gn − I‖∞ by inspection of the graphs. Thus
‖F−1n − F−1‖bad= ‖F−1(G−1n )− F−1(I)‖ba →a.s. 0(a)
since F−1 is uniformly continuous on [a, b] and ‖G−1n − I‖∞ →a.s. 0. 2
Definition 7.1 The uniform quantile process Vn is defined by
Vn ≡√n(G−1n − I) .(9)
The general quantile process is defined by
√n(F−1n − F−1)
d=√n(F−1(G−1n )− F−1) .(10)
7. QUANTILES AND QUANTILE PROCESSES 39
Theorem 7.1 The uniform quantile process can be written as
Vn = −Un(G−1n ) +√n(Gn G−1n − I) ,(11)
and hence for the specially constructed Un of theorem 5.2 it follows that, with V ≡ −U d= U,
‖Vn − V‖∞ →a.s. 0 as n→∞ .(12)
Proof. First we prove the identity (11):
Vn =√n(G−1n − I)
=√n(G−1n −Gn(G−1n )) +
√n(Gn(G−1n )− I)
= −Un(G−1n ) +√n(Gn G−1n − I) .
Now ‖G−1n − I‖∞ →a.s. 0 by proposition 7.1, and
‖Gn G−1n − I‖∞ = sup0≤t≤1
|Gn(G−1n (t))− t| = 1
n.(a)
Hence
‖Vn − V‖∞ ≤ ‖Un(G−1n )− U‖∞ + ‖√n(Gn(G−1n )− I)‖∞
≤ ‖Un(G−1n )− U(G−1n )‖∞ + ‖U(G−1n )− U‖∞ +1√n
≤ ‖Un − U‖∞ + ‖U(G−1n )− U‖∞ +1√n
→a.s. 0 + 0 + 0 = 0 .
since U is a continuous (and hence uniformly continuous) function on [0, 1]. 2
Theorem 7.2 Let Q = F−1, and suppose that Q is differentiable at 0 < t1 < · · · < tk < 1. Then√n(F−1n (t1)− F−1(tk))
···√
n(F−1n (tk)− F−1(tk))
→d
Q′(t1)V(t1)
···
Q′(tk)V(tk)
∼ Nk(0,Σ)(13)
where
Σ ≡ (σij) =(Q′(ti)Q
′(tj)(ti ∧ tj − titj)).(14)
Moreover, if Q′ is nonzero and continuous on [a, b] ⊂ [0, 1], then for any [c, d] ⊂ [a, b]
‖√n(F−1(G−1n )− F−1)−Q′V‖dc →a.s. 0 as n→∞ .(15)
Note that Q′(t) = 1/f(F−1(t)).
40 CHAPTER 2. SOME BASIC LARGE SAMPLE THEORY
Proof. Suppose that k = 1 and let t1 = t. Then√n(F−1n (t)− F−1(t)) =
√n(Q(G−1n (t))−Q(t))
=Q(G−1n (t))−Q(t)
G−1n (t)− t√n(G−1n (t)− t)
d=
Q(G−1n (t))−Q(t)
G−1n (t)− tVn(t) for the special Vn process
→a.s. Q′(t)V(t) ∼ N(0, (Q′(t))2t(1− t)) .Similarly
√n(F−1n (t1)− F−1(tk))
···√
n(F−1n (tk)− F−1(tk))
d=√n
Q(G−1n (t1))−Q(t1)
···
Q(G−1n (t1))−Q(t1)
→a.s.
Q′(t1)V(t1)
···
Q′(tk)V(tk)
.(a)
2
Theorem 7.3 (Bahadur representation of quantile processes). The uniform quantile process canbe written as
Vn = −Un + op(1)(16)
where the op(1) term is uniform in 0 ≤ t ≤ 1; i.e.
‖Vn + Un‖∞ →p 0 .(17)
In fact
lim supn→∞
n1/4‖Vn + Un‖∞√bn(log n)
=1√2
a.s.(18)
where bn ≡√
2 log log n. Moreover, if Q′(t) exists, then√n(F−1n (t)− F−1(t)) = −Q′(t)
√n(Fn(F−1(t))− t) + op(1) .(19)
Corollary 1 (Asymptotic normality of the t−th quantile). Suppose that Q = F−1 is differentiableat t ∈ (0, 1). Then
√n(F−1n (t)− F−1(t))→d Q
′(t)N(0, t(1− t)) = N
(0,
t(1− t)f2(F−1(t))
).(20)
Corollary 2 (Asymptotic normality of a linear combination of order statistics). Suppose that Jis bounded and continuous a.e. F−1, and suppose that E(X2) <∞. Let
Tn ≡1
n
n∑i=1
J
(i
n+ 1
)X(i) , µ =
∫ 1
0J(u)F−1(u)du .(21)
Then
√n(Tn − µ)→d
∫ 1
0JVdF−1 ∼ N(0, σ2(J, F ))(22)
where
σ2(J, F ) =
∫ 1
0
∫ 1
0J(s)J(t)(s ∧ t− st)dF−1(s)dF−1(t) .(23)